Beancounters

Resource Management: Beancounters Pavel Emelianov Denis Lunev Kirill Korotaev [email protected] [email protected] [email protected] July 31, 2007 Abstract The paper outlines various means of resource management available in the Linux kernel, such as per-process limits (the setrlimit(2) interface), shows their shortcomings, and illustrares the need for another resource control mechanism: beancounters. Beancounters are a set of per-process group parameters (proposed and implemented by Alan Cox and Andrey Savochkin and further developed for OpenVZ) which can be used with or without containers. Beancounters history, architecture, goals, efficiency, and some in-depth implementation details are given. 1 Current state of resource Again, most of these resource limits apply to a management in the Linux single process, which means, for example, that all the memory may be consumed by a single user run- kernel ning the appropriate number of processes. Setting the limits in such a way as to have the value of Currently the Linux kernel has only one way to multiplying the per-process limit by the number of control resource consumption of running processes processes staying below the available values is im- – it is UNIX-like resource limits (rlimits). practical. Rlimits set upper bounds for some resource usage parameters. Most of these limits apply to a single process, except for the limit for the number 2 Beancounters of processes, which applies to a user. For some time now Linus has been accepting The main goal of these limits is to protect pro- patches adding the so-called namespaces into the cesses from an accidental misbehavior (like infi- kernel. Namespaces allow tasks to observe their nite memory consumption) of other processes. A own set of particular kernel objects such as IPC better way to organize such a protection would be objects, PIDs, network devices and interfaces, etc. to set up a minimal amount of resources that are Since the kernel provides such a grouping for tasks, always available to the process. But the old way it is necessary to control the resource consumption (specifying upper limits to catch some cases of ex- of these groups. cessive consumption) may work, too. The current mainline kernel lacks the ability of It is clear that the reason for introducing limits tracking the kernel resource usage in terms of arbi- was the protection against an accidental misbehav- trary task groups and this ability is required rather ior of processes. For example, there are separate badly. limits for the data and stack size, but the limit The list of the resources that the kernel provides on the total memory consumed by the stack, data, is huge but the main resources are: BSS, and memory mapped regions does not exist. Also, it is worth to note that the RLIMIT CORE and Memory, RLIMIT RSS limits are mostly ignored by the Linux kernel. CPU, 1 IO bandwidth, 2.2 The beancounters basics Disk space, The core object is the beancounter itself. The beancounter represents a task group which is a resource Networking bandwidth. consumer. Basically a beancounter consists of an ID to make This article describes the architecture the it possible to address the beancounter from the OpenVZ team proposes for controlling the first re- userspace for changing its parameters and the set source (memory) called “beancounters”. Other re- of usage-limit pairs to control the usage of several source control mechanisms are out of the scope of kernel resources. this article. More precisely each beancounter contains not just usage-limit pairs but a more complicated set. It includes the usage, limit, barrier, fail counter and 2.1 Beancounters history maximal usage values. The notion of “barrier” has been introduced to All the deficiencies of the per-process resource ac- make it possible to start rejecting the allocation of counting were noticed by Alan Cox and Andrey a resources before the limit has been hit. For ex- Savochkin long ago. Then Alan introduced an idea ample when a beancounter hits the mappings bar- that crucial kernel resources must be handled in rier, the subsequent sbrk and mmap calls start to terms of groups and set the “beancounter” name fail, though execve, which maps some areas, still for this group. He also stated that once a task works. This allows the administrator to “warn” the moves to a separate group it never comes back and tasks that a beancounter is short of resources be- neither do the resources allocated on its requests. fore the corresponding group hits the limit and the These ideas were further developed by Andrey tasks start dying due to unrecoverable rejections of Savochkin, who proposed the first implementation resource allocation. of beancounters [UB patch]. It included the track- Allocating a new resource is preceded with its ing of process page tables, the numbers of tasks “charging” to the beancounter the current task be- within a beancounter, and the total length of map- longs to. Essentially the charging consists of an pings. Further versions included such resources as atomic checking that the amount of resource con- file locks, pseudo terminals, open files, etc. sumed doesn’t exceed the limit, and adding the re- Nowadays the beancounters used by OpenVZ source size to the current usage. control the following resources: Here is a list of memory resources controlled by the beancounters subsystem: kernel memory, total size of allocated kernel space objects, user-space memory, size of network buffers, number of tasks, lengths of mappings, number of open files and sockets, number of physical pages. number of PTYs, file locks, pending signals Below are some details on the implementation. and iptable rules, total size of network buffers, 3 Memory management with beancounters active dentry cache size, i.e. the size of the dentries that cannot be shrunk, This section describes the way beancounters are used to track memory usage. The kernel memory is dirty page cache that is used to track the IO accounted separately from the userspace one. First, activity. the userspace memory management is described. 2 3.1 Userspace memory management the sum of unused and used pages within unreclaimable VMAs and used pages in reclaimable The kernel defines two kinds of requests related to VMAs. memory allocation. 1. The request to allocate a memory region for physical pages. This is done via the mmap(2) Process system call. The process may only work within Unreclaimable memory a set of mmap-ed regions, which are called VMA VMAs (virtual memory areas). Such a request does not lead to allocation of any physical memory to the task. On the other hand, Unused pages it allows for a graceful reject from the kernel P space, i.e. an error returned by the system r Touched call makes it possible for the task to take some i pages P actions rather than die suddenly and silently; v h v 2. The request to allocate a physical page within y Reclaimable m one of the VMAs created before. This request s VMA p may also create some kernel space objects, e.g. p a page tables entries. The only way to reject a g this request is to send a signal – SEGV, BUS, g Touched e or KILL – depending on the failure severity. e pages s This is not a good way of rejecting as not ev- s ery application expects critical signals to come during normal operation. The beancounters introduce the “unused page” term to indicate a page in the VMA that has not yet Unused been touched, i.e. a page whose VMA has already pages been allocated with mmap, but the page itself is not yet there. The “used” pages are physical pages. Kernel VMAs may be classified as: reclaimable VMAs, which are backed by some Figure 1: User space memory management file on the disk. I.e. when the kernel needs some more memory, it can safely push the The first resource is account-only (i.e. not lim- pages from this VMA to the disk with almost ited) and is called “physpages”. The second one is no limitation; limited in respect of only unused pages allocation and is called “privvmpages”. This is illustrated on unreclaimable VMAs, which are not backed by Fig. 1. any file and the only way to free the pages The only parameter that remains unlimited – the within such VMAs is push the page to the swap size of pages touched from a disk file – does not space. This way has certain limitations in that create a security hole since the size of files is limited the system may have no swap at all or the by the OpenVZ disk quotas. swap size may be too small. The reclamation of pages from this kind of VMAs is more likely to fail in comparison with the previous kind. 3.2 Physpages management In the terms defined above, the OpenVZ bean- While privvmpages accounting works with whole counters account for the following resources: pages, physpages accounts for pages fractions in the case some pages are shared among beancoun- the number of used pages within all the VMAs, ters [RSS]. This may happen if a task changes its 3 beancounter or if tasks from different groups map beancounters list is moved to the new beancounter. and use the same file. Thus when the page is sequentially mapped to 4 This is how it works. There is a many-to-many different beancounters, its fractions would look like dependence between the mapped pages and the bc1 bc2 bc3 bc4 beancounters in the system. This dependence is 1 1 1 1 tracked by a ring of page beancounters associated 2 2 2 1 1 1 with each mapped page (see Fig.

Beancounters

Flexible Lustre Management

Reducing Power Consumption in Mobile Devices by Using a Kernel

I.MX Linux® Reference Manual

Networking with Wicked in SUSE® Linux Enterprise 12

O'reilly Linux Kernel in a Nutshell.Pdf

Communicating Between the Kernel and User-Space in Linux Using Netlink Sockets

Postmodern Strace Dmitry Levin

Compression and Replication of Device Files Using Deduplication Technique

Daemon Management Under Systemd ZBIGNIEWSYSADMIN JĘDRZEJEWSKI-SZMEK and JÓHANN B

Interaction Between the User and Kernel Space in Linux

Instant OS Updates Via Userspace Checkpoint-And

An Overview of the Crypto Subsystem