Bolt: Faster Reconfiguration in Operating Systems



Sankaralingam Panneerselvam1, Michael M. Swift1, and Nam Sung Kim2

1Department of Computer Sciences, 2Department of Electrical and Computer Engineering
University of Wisconsin-Madison
{sankarp,swift}@cs.wisc.edu, [email protected]

Abstract

Dynamic resource scaling enables provisioning extra resources during peak loads and saving energy by reclaiming those resources during off-peak times. Scaling the number of CPU cores is particularly valuable as it allows power savings during low-usage periods. Current systems perform scaling with a slow hotplug mechanism, which was primarily designed to remove or replace faulty cores. The high cost of scaling is reflected in power management policies that perform scaling at coarser time scales to amortize the high reconfiguration latency. We describe Bolt, a new mechanism built on the existing hotplug infrastructure to reduce scaling latency. Bolt also supports a new bulk interface to add or remove multiple cores at once. We implemented Bolt for the x86 and ARM architectures. Our evaluation shows that Bolt can achieve over 20x speedup for entering the offline state. While turning on CPUs, Bolt achieves speedups of up to 10x and 21x for x86 and ARM, respectively.

1 Introduction

Most operating system policies focus on improving performance given a fixed set of resources. For example, schedulers assume a fixed set of processors to which they assign threads, and memory managers assume a fixed pool of memory. However, this assumption is increasingly violated in the current computing landscape, where resources can be added or removed dynamically at runtime for reasons of energy reduction, cost savings, virtual-machine scaling or hardware heterogeneity [12]. We refer to changing the set of resources as resource scaling, and our work focuses on scaling the set of processors available to the operating system. This scaling may be helpful in virtualized settings such as cloud computing [8] and disaggregated servers [15, 2], where it is possible to add or remove CPUs at any time within a system. Scaling can improve performance during peak loads and minimize energy during off-peak loads [9]. Dynamically reconfigurable processors (e.g., [17]), which allow processing resources to be reconfigured at runtime to meet application demands, can also benefit from scaling the set of available CPUs [22].

Current operating systems assume that the processor cores available to them are essentially static and almost never change during the runtime of the system. These systems support scaling through a hotplug mechanism that was primarily designed to remove faulty cores from the system [11]. This mechanism is slow, bulky and halts the entire machine for a few milliseconds while the OS reconfigures. The current Linux hotplug mechanism takes tens of milliseconds to reconfigure the OS [22]. In contrast, processor vendors are aiming for transitions to and from sleep states on the order of microseconds [26, 24], making the OS the bottleneck in scaling. In spite of these drawbacks, hotplug is widely used by mobile vendors as a means to scale the number of processor cores to save energy [23], due to the lack of better alternatives.

We propose a new mechanism, Bolt, that builds on the Linux hotplug infrastructure under the assumption that scaling events are frequent, with the goal of a low-latency scaling mechanism. Bolt classifies every operation carried out during hotplug as critical or non-critical. Critical operations are performed immediately for proper functioning of the system, whereas non-critical operations are done lazily. Bolt also supports a new bulk interface to turn on/off multiple cores simultaneously, which is not supported by the hotplug mechanism.

We implemented Bolt for both x86 and ARM processors, and find that Bolt can achieve up to 20x better latency than the native hotplug offline mechanism. For getting a CPU to the online state, Bolt offers a speedup of 10x for modern x86 processors and 21x for the Exynos (ARM) processor. The bulk interface supported in Bolt achieves speedups of 18x-67x when adding or removing 3 cores on x86 processors compared to doing the same operations with the native hotplug mechanism.

2 Motivation

2.1 Processor Scaling

Many current and future systems will support a dynamically changing set of cores for three major reasons.

Energy Proportionality. This property dictates that energy consumption should be proportional to the amount of work done, and it is a major focus in all forms of computing, from cloud to mobile devices. Support for low-power states (P-states) and sleep states (C-states) in processors helps achieve energy proportionality by reducing energy consumption when the processor is not fully utilized. However, in many processors further power savings can be achieved by moving cores into a deeper sleep state, turning off an entire socket, or allowing the processor to enter a package-level sleep state. These deeper states require OS intervention, as the core is logically turned off and not available for scheduling threads or processing interrupts. For example, most mobile systems turn off cores during low system utilization to conserve battery capacity by reducing static power consumption. Some processors (e.g., Exynos [19]) provide a package-level deep sleep state that is enabled when all but one core is switched off. Operating systems may use core parking [13] to consolidate tasks onto a single socket and switch off other sockets completely. Quick scaling support by the OS can allow rapid transitions into and out of deep sleep states.

Heterogeneity. Many processors support heterogeneity either statically [1] via different core designs or dynamically [17] through reconfiguration. On dark silicon systems [6], not all processors can be used at full performance together, and hence the OS must decide which processors to enable based on the application characteristics. Processors like Exynos [4] and Tegra [21] employ ARM's big.LITTLE architecture, with a mix of high-performance and high-efficiency cores. In systems with these processors, the OS must choose the type of processor core based on the performance need. The OS must change the processor set when switching between different CPU types.

VM Scaling. Virtual machines are widely used in cloud environments, and many hypervisors provide support for scaling the number of virtual CPUs in a virtual machine (VM) [20]. IBM supports VM scaling through DLPAR (dynamic logical partitioning [18]) and VMware supports it through hot add/remove interfaces. Some applications like databases benefit from scaling up virtual machines by provisioning more resources, rather than scaling out where more virtual machines are spawned. These techniques can be used at a finer scale if the guest OS provides quick processor scaling.

2.2 OS Support

There are several mechanisms an OS can use to scale the number of CPUs in use.

Virtualization. An extra layer of indirection through virtualization decouples the physical execution layer from the rest of the operating system and exposes only virtual CPUs to the OS. To scale down, multiple virtual CPUs (VCPUs) can be multiplexed on a single physical CPU (PCPU). In terms of latency, virtualization could provide ideal support: it could switch VCPUs from a PCPU that is being switched off to a different physical CPU instantly by saving and restoring the context of those virtual CPUs. However, the drawbacks of using virtualization-based techniques are two-fold. First, virtualization adds overhead in the common case, particularly for memory access [7]. Second, multiplexing VCPUs on a physical CPU can hurt performance due to context switches during critical sections [27].

Power Management. OS support for idle sleep states (C-states) can be used to move unused cores to a sleep state and wake them up when needed. The latency of entering and exiting such sleep states is very low compared to the hotplug mechanism. However, OS power management support requires that cores can still respond to interrupts, which is not the case for all deep sleep states or for non-power uses of scaling. Furthermore, the OS may accidentally wake up a sleeping core unnecessarily to involve it in regular activities like scheduling or TLB shootdowns [25].

Processor Proxies. Chameleon [22] proposed a new alternative to hotplug called processor proxies that is several times faster than hotplug. A proxy represents an offline CPU and runs on an active CPU, making the system believe that the offline CPU is still active. However, proxies can only be used for a short period because they handle interrupts and Read-Copy-Update (RCU) operations only, and do not reschedule threads from a CPU in the offline state.

Scalability-Aware Kernel. Ideally, an OS kernel could natively support changing the set of CPUs at low latency and with low overhead. Rather than assuming that scaling events are rare, a scalability-aware kernel would spend little time freeing resources during a scale-down event when they are likely to be re-allocated during an upcoming scale-up event.

2.3 Hotplug

Hotplug is a widely used mechanism available in Linux to support processor set scaling. It offers interfaces for any kernel subsystem to subscribe to notifications of processor set changes.
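For concreteness, the sketch below shows the shape of this subscription interface as it existed in the Linux 3.x kernels studied here: a subsystem registers a notifier callback and reacts to actions such as CPU_DOWN_PREPARE, CPU_DEAD and CPU_ONLINE. This is a minimal, illustrative example; the module and handler names are ours and do not come from Bolt or from any particular kernel subsystem.

/*
 * Minimal sketch of how a kernel subsystem subscribes to CPU hotplug
 * notifications in Linux 3.x-era kernels (the interface discussed in this
 * paper; it was later replaced by the cpuhp state machine). The handler
 * below only logs the events; names are illustrative, not taken from Bolt.
 */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/cpu.h>
#include <linux/notifier.h>

static int example_cpu_callback(struct notifier_block *nb,
				unsigned long action, void *hcpu)
{
	unsigned int cpu = (unsigned long)hcpu;

	switch (action & ~CPU_TASKS_FROZEN) {
	case CPU_DOWN_PREPARE:	/* CPU is about to go offline */
		pr_info("example: cpu%u going down\n", cpu);
		break;
	case CPU_DEAD:		/* CPU is offline; release per-CPU state */
		pr_info("example: cpu%u is dead\n", cpu);
		break;
	case CPU_ONLINE:	/* CPU came online; (re)allocate per-CPU state */
		pr_info("example: cpu%u is online\n", cpu);
		break;
	}
	return NOTIFY_OK;
}

static struct notifier_block example_cpu_notifier = {
	.notifier_call = example_cpu_callback,
};

static int __init example_init(void)
{
	return register_cpu_notifier(&example_cpu_notifier);
}

static void __exit example_exit(void)
{
	unregister_cpu_notifier(&example_cpu_notifier);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");

It is these per-subsystem callbacks, invoked synchronously on every scaling event, that make hotplug expensive when the CPU set changes frequently.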

However, the handling of notifications by every subsystem follows the assumption that hotplug events are rare. The shortcomings of the mechanism are discussed below.

Repeat Execution. A direct implication of the above assumption is that most kernel subsystems free or reinitialize their software structures during a hotplug event. Out of the 50 subscriptions to hotplug from various subsystems, 8 of them remove or initialize sysfs structures and 14 of them free and create software structures needed for the subsystem. However, all these operations become redundant if the CPU set changes frequently.

Synchronous. All operations performed in response to the notifications are synchronous. For example, subsystems like the slab allocator free the slab memory from their per-CPU queues when the CPU is moved to the offline state, per-CPU statistics values are aggregated into a global structure, and the hotplug operation is blocked until system threads move into the sleep state. However, these operations need not be synchronous for the correct execution of the system.

Hotplug Prevention. Hotplug events can be prevented by disabling preemption or interrupts on any CPU, similar to grabbing a lock. So, the hotplug mechanism ensures that preemption and interrupts are enabled on all CPUs by scheduling a special kernel thread on every CPU in the system. This special form of locking (through preemption) avoids the overhead of acquiring and releasing a lock during normal execution.

As a result of these properties, Linux's current hotplug mechanism is too slow for rapid scaling. As we show in Section 4, it takes many milliseconds to reconfigure, while current hardware can transition from sleep states on the order of microseconds [26].

3 Bolt

Bolt is a reconfiguration mechanism that can be used as a replacement for hotplug. The functionality of Bolt is similar to that of hotplug in getting the system from one stable state to another after processor scaling. However, Bolt aims to offer stability at very low latency and is built by refactoring the existing hotplug infrastructure.

Bolt achieves low latency by separating hotplug notifications into critical and non-critical operations. The former need to be handled synchronously to ensure correctness of the system, whereas the latter can be removed from the critical path and performed after the CPU goes online/offline.

3.1 Critical Operations

Every action taken by hotplug, including the handling of notifications by kernel subsystems, is classified based on its criticality. Bolt defines critical operations as those that need to be executed immediately for correct running of the system.

State Migration. In the event of CPU removal, important software state associated with that CPU has to be migrated to another active CPU. Such software state includes softirqs, or bottom halves, and threads in the CPU's runqueue. The softirqs are queued in a per-CPU structure and are moved to a different active CPU to be processed. Similarly, threads from the runqueue are migrated synchronously to avoid performance degradation for running programs.

Hardware Management Functions. Certain hardware-dependent features need to be disabled or enabled during scaling (hotplug) events. For example, machine check has to be disabled before the CPU is put into the offline state, since any fault in the offline CPU should not affect the remaining system. Similarly, when a CPU is started, its microcode needs to be updated and its MTRR registers need to be initialized for proper functioning of the system.

Bitmask Updates. Linux maintains a few important global bitmasks of CPU state. These include cpu_online_mask and cpu_active_mask, which are accessed frequently across the system. These masks should be updated immediately during scaling events for correct functioning of the system.

We consider subscriptions from a few subsystems like workqueue and perf as critical since we are still in the process of adapting those subsystems to Bolt.

3.2 Non-Critical Operations

Bolt defines non-critical operations as those that can either be performed lazily or not performed at all. Bolt makes a best effort to push many of the non-critical operations out of the critical path and thus reduce latency.

Interrupt Migration. Handling of interrupts is a time-critical event, and it might be surprising to see interrupt handling as a non-critical operation. Interrupts affinitized to a CPU are moved to a different CPU when the CPU is removed. However, from our observation on a Nexus mobile device, all I/O interrupts are always delivered to the base CPU (CPU 0). This was not the case on a desktop machine, where the network interrupts were distributed across multiple CPUs.

Bolt currently affinitizes all interrupts to the base CPU to avoid interrupt migration during a processor scaling event. However, this will result in performance degradation for servers receiving a high number of interrupts. On such systems, the high-traffic interrupts (e.g., network or SSD) can be distributed across multiple CPUs through the irq migration daemon, while lightly loaded interrupts can be handled by the base CPU to avoid migration during scaling. But Bolt does not support this optimization, and implementing it is part of future work.
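The effect of this policy can be approximated from user space through the standard /proc/irq/<n>/smp_affinity interface; the sketch below is our own illustration of that effect, not Bolt's in-kernel implementation. It walks all IRQs and points their affinity mask at CPU 0; IRQs whose affinity cannot be changed (such as the per-CPU timer) simply reject the write.

/*
 * User-space approximation of the interrupt policy described above: steer
 * every IRQ to the base CPU (CPU 0) by writing a one-CPU mask to
 * /proc/irq/<n>/smp_affinity. Bolt applies the equivalent policy inside the
 * kernel; this stand-alone sketch (run as root) only illustrates the effect.
 * IRQs whose affinity cannot be changed reject the write, which we ignore.
 */
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>

int main(void)
{
	DIR *d = opendir("/proc/irq");
	struct dirent *ent;
	char path[64];

	if (!d) {
		perror("/proc/irq");
		return 1;
	}
	while ((ent = readdir(d)) != NULL) {
		if (!isdigit((unsigned char)ent->d_name[0]))
			continue;		/* skip non-IRQ entries */
		snprintf(path, sizeof(path), "/proc/irq/%s/smp_affinity",
			 ent->d_name);
		FILE *f = fopen(path, "w");
		if (!f)
			continue;		/* some IRQs expose no affinity file */
		fputs("1", f);			/* hex mask "1" selects only CPU 0 */
		fclose(f);			/* write failures (e.g., IRQ 0) are ignored */
	}
	closedir(d);
	return 0;
}

Because Bolt performs the equivalent step inside the kernel, no interrupts need to be migrated when a CPU is later taken offline.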

Memory. Many kernel subsystems, upon receiving notifications of scaling events, free or allocate memory structures such as per-CPU structures or buffer queues. Bolt elides these operations and instead saves the memory to be re-used if and when the CPU comes back online. Most kernel subsystems support a master thread that is invoked during memory pressure to release all per-CPU structures. Bolt uses this master thread to avoid any memory leak if a CPU stays offline for an extended time period.

Sysfs. Volatile filesystems like sysfs and procfs expose kernel settings and metrics through the virtual file system interface in Linux. Processor-core-based file or directory nodes are removed or created during processor set scaling. However, Bolt does not perform these operations; instead it prevents access to CPU-dependent files by verifying that the CPU is online during file open.

Thread Operations. Earlier versions of Linux destroyed and spawned system threads during hotplug. More recent versions of Linux use thread parking [10], which suspends threads indefinitely. System threads like watchdog threads are moved to a parked state during processor scaling. Parking a thread involves waking the thread, even if it is in the sleep state, invoking a registered function pointer to perform cleanup, and then moving the thread to the sleep state synchronously. Bolt instead employs an asynchronous approach and does not wait for the thread to enter the sleep state after sending the parking message. The benefits are that this improves latency, and parking can be avoided entirely if the CPU returns online before the thread wakes up and sees the parking message.

3.3 Bulk Interface

Bolt adds a new bulk interface to allow scaling multiple CPUs simultaneously; existing APIs only support adding or removing a single CPU. The API takes a cpumask argument, indicating which CPUs to add or remove, rather than the index of an individual CPU. To accomplish a bulk offline with native hotplug, each CPU is moved to the offline state sequentially and with no overlap; the online case is similar. Bolt leverages the fact that certain operations can be done once even if multiple CPUs change state. Bolt makes two optimizations. First, updates to global structures are made as a single operation. For example, Bolt reorganizes the scheduling-domain-related structures once for all CPUs. Second, Bolt performs some operations in parallel. For example, during CPU online it clears caches and register sets concurrently on all CPUs.

4 Evaluation

Our evaluation focuses on the performance of Bolt, but we speculate on the potential energy benefits as well.

Experimental Platform. We performed our experiments on two different processor architectures. First, an x86-based machine with an Intel i5-2500K (Sandy Bridge) processor running Linux kernel 3.17.1. Second, an Odroid development board [14] with an Exynos 5410 processor running Android 4.4 with Linux kernel 3.4. The Exynos is a big.LITTLE architecture provisioned with A15 and A7 4-core clusters. We disable Turbo Boost on the x86 machine to avoid any performance variability. All experiments were performed on an idle system without any active workloads. For all the experiments, we ran the processors at the highest frequency: x86 at 3.3 GHz and A15 at 1.6 GHz.

Figure 1: Native hotplug vs. Bolt during offline. (Stacked bars, in milliseconds: x86 13.3, x86-Bolt 0.53, Exynos 7.5, Exynos-Bolt 0.57; components: down prepare, take_cpu_down, RCU, park, dead, post dead.)

End-to-End Latency. The CPU state (online/offline) is accessed through the sysfs file /sys/devices/system/cpu/cpu*/online: writing '0' initiates the offline process and '1' brings the CPU to the online state. We measure latency as the time taken from the write to when it returns, at which point the CPU becomes invisible to the OS or is actively available to the OS.
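The sketch below shows one way to reproduce this measurement from user space: it opens the online file for a given CPU, writes '0' and then '1', and reports how long each write takes to return. It is our own illustration of the methodology, not part of Bolt; running it requires root, and CPU 0 typically cannot be taken offline. Invoking it for several CPUs back-to-back corresponds to the sequential baseline used below for the bulk-interface comparison.

/*
 * Sketch of the end-to-end latency measurement described above: write '0'
 * (offline) then '1' (online) to /sys/devices/system/cpu/cpu<N>/online and
 * time how long each write takes to return. Illustrative only; not part of
 * Bolt. Run as root; pass the CPU number as the first argument.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static double toggle_ms(int cpu, const char *state)
{
	char path[64];
	struct timespec t0, t1;
	int fd;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/online", cpu);
	fd = open(path, O_WRONLY);
	if (fd < 0) {
		perror(path);
		exit(1);
	}
	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (write(fd, state, 1) != 1) {	/* returns once the OS has reconfigured */
		perror("write");
		exit(1);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	close(fd);
	return (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

int main(int argc, char **argv)
{
	int cpu = (argc > 1) ? atoi(argv[1]) : 1;	/* cpu0 usually cannot go offline */

	printf("cpu%d offline: %.2f ms\n", cpu, toggle_ms(cpu, "0"));
	printf("cpu%d online:  %.2f ms\n", cpu, toggle_ms(cpu, "1"));
	return 0;
}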

Figure 1 shows the end-to-end latency for an offline operation. The legend represents the individual components of the hotplug operation. (a) Down prepare, dead and post dead are different notifications sent to the kernel subsystems. Down prepare is costly due to a synchronous thread creation by the workqueue subsystem on the critical path. Bolt avoids this behavior by reusing threads. (b) Park refers to the thread parking operation, which Bolt classifies as non-critical and handles appropriately. (c) The reduction in the latency of take_cpu_down is achieved by avoiding interrupt migration, saving 0.7ms. The remaining overhead is caused by the stop_machine interface, which is not optimized by Bolt. (d) RCU denotes the protection of the cpu_active_mask bitmask through RCU synchronization; this is costly since read-copy-update (RCU) has to wait until all CPUs undergo a context switch. Bolt replaces the RCU-based synchronization with regular locks. The impact of this change is limited to very few interfaces, as can be seen in this commit log [28].

Figure 2 shows the latency breakdown for an online operation. Interestingly, the major source of latency in Exynos is software, and in x86, hardware. In the Exynos system, thread creation causes overhead similar to the offline case, and Bolt avoids this by parking threads and re-using them during the online operation. However, x86 incurs a substantial delay when the init IPI (to start the core) is sent. Intel documentation [16] specifies that the OS should wait for a period of 10ms after the init message is sent to perform hardware initialization. The speedup of Bolt over the native system is thus limited to 1.3x. However, the initialization delay is not required for modern multi-core x86 processors [3].

Figure 2: Native hotplug vs. Bolt during online. (Stacked bars, in milliseconds: x86 14, x86-Bolt 10.38, Exynos 12.6, Exynos-Bolt 0.58; components: up prepare, cpu_up, online.)

Software Entry/Exit Latency. The entry latency refers to the time taken for the CPU core to be switched off during an offline operation. This gives an idea of how soon the energy savings begin. The native system takes around 12.5ms (x86) and 6.7ms (Exynos), whereas Bolt takes 0.45ms (x86), a speedup of 27.8x, and 0.38ms (Exynos), a speedup over native of 17.6x. The exit latency is the time taken for the core to schedule the first thread after it is woken up. This gives an idea of the interactivity of the system. The native system takes around 13.8ms (x86) and 12ms (Exynos), whereas Bolt takes 10.27ms (x86) and 0.22ms (Exynos).

Energy Savings. Bolt does not offer a new power management policy; it relies on the processor sleep states for any energy savings. In the Exynos 5410, the offline state and the deep sleep state (C2) consume almost the same power: 97mW when all cores are in the sleep state. However, if the processor cluster is allowed to enter C3 (the package-level deep sleep state), it consumes 55mW, a 43% reduction in power compared to C2. The constraint is that all cores but one must be in the hotplug offline state to enable the C3 state. The default power management policy used for Exynos employs a conservative approach that ensures a minimum of two CPU cores is online for better interactivity. We believe that Bolt could help in implementing a more aggressive policy for more energy savings while preserving interactivity, by retaining only one online core and entering C3 with the remainder. On the other hand, hotplug offline on an Intel i5-2500K puts the CPU core into a deep sleep state (C6). In this case, hotplug does not result in additional power savings over Linux's cpuidle subsystem [5], but hotplug can still be beneficial by avoiding interrupt handling and scheduling of threads on idle cores, extending their idle period.

Bulk Interface. The current bulk interface implementation is available only for the x86 architecture and not for Exynos. We specify an input cpumask marked with three CPUs. On the native system, we simulate bulk operations by performing scaling operations sequentially. The native system took 39.2ms and 43.1ms for offline and online operations. Bolt using the same simulation technique took 1.6ms and 31.2ms, while the bulk interface in Bolt took 0.58ms and 10.84ms. The new interface executes take_cpu_down concurrently and avoids re-organizing the scheduling-domain structures multiple times during offline. Overlapping hardware initialization with the processing of notification messages while waiting for that initialization speeds up the online operation.

Modern Hardware. The high latency of processor initialization (the 10ms delay) is not required in modern x86 multi-core processors, as noted in this patchset [3]. So, the processor initialization cost becomes zero. To analyze the impact of Bolt on these processors, where software overhead dominates hotplug operations, we ran experiments with the delay commented out in the kernel. The change is valid since our x86 platform does not require the delay [3]. Bolt finishes the online operation in 0.38ms compared to 4ms for native, providing a speedup of 10.5x, and the online bulk interface takes 0.71ms compared to 13.1ms, providing a speedup of 18x. The native latency values (4ms and 13.1ms) are obtained by removing the hardware initialization delay of 10ms from the original latency values. These numbers show that hotplug latency will be dominated by software overhead in future processors, and Bolt significantly reduces that software overhead.

5 Conclusion

Processor set scaling is important for energy efficiency, throughput improvement, cost savings and VM scaling. However, the current hotplug mechanism is slow and cannot support frequent changes. We propose Bolt, which classifies hotplug operations as critical and non-critical. This separation helps Bolt achieve low latency by removing non-critical operations from the critical path. Bolt also supports a bulk interface that allows scaling at the granularity of multiple cores. The low-latency mechanism and the new interface support in Bolt enable future systems to scale at much finer time-scales.

Acknowledgements

This work is supported in part by National Science Foundation (NSF) grant CNS-1302260. We would like to thank our shepherd, Dan Tsafrir, and the anonymous reviewers for their invaluable feedback. We would also like to thank Venkat, Sanketh and Thanu for their comments on the earlier draft of the paper. Swift has a significant financial interest in Microsoft, and Nam Sung Kim has financial interests in Samsung Electronics and AMD.

References

[1] ARM big.LITTLE Processing. http://www.arm.com/products/processors/technologies/biglittleprocessing.php.

[2] AMD - SeaMicro, Inc. SeaMicro SM15000 fabric compute systems. http://www.seamicro.com/sites/default/files/SM_DS06_v2.1.pdf, 2012.

[3] L. Brown. x86/smp/boot: Remove 10ms delay from cpu_up() on modern processors. https://lkml.org/lkml/2015/5/12/102.

[4] H. Chung, M. Kang, and H.-D. Cho. Heterogeneous multi-processing solution of Exynos 5 Octa with ARM big.LITTLE technology. http://www.arm.com/files/pdf/Heterogeneous_Multi_Processing_Solution_of_Exynos_5_Octa_with_ARM_bigLITTLE_Technology.pdf.

[5] J. Corbet. The cpuidle subsystem. https://lwn.net/Articles/384146/, April 2010.

[6] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger. Dark silicon and the end of multicore scaling. In Proc. ISCA, June 2011.

[7] W. Felter, A. Ferreira, R. Rajamony, and J. Rubio. An updated performance comparison of virtual machines and Linux containers. technology, 28:32, 2014.

[8] Gartner. Case studies in cloud computing. https://www.gartner.com/doc/1761616/case-studies-cloud-computing.

[9] H. R. Ghasemi and N. S. Kim. RCS: Runtime resource and core scaling for power-constrained multi-core processors. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT '14, pages 251–262, 2014.

[10] T. Gleixner. Kthread: Implement park/unpark facility. https://lwn.net/Articles/500338/, 2012.

[11] T. Gleixner, P. E. McKenney, and V. Guittot. Cleaning up Linux's CPU hotplug for real time and energy management. SIGBED Rev., pages 49–52, Nov. 2012.

[12] A. Gupta, E. Ababneh, R. Han, and E. Keller. Towards elastic operating systems. In Proceedings of the 14th USENIX Conference on Hot Topics in Operating Systems, HotOS '13, 2013.

[13] C. Hameed. Windows 7 and Windows Server 2008 R2: Core parking, intelligent timer tick and timer coalescing. http://blogs.technet.com/b/askperf/archive/2009/10/03/windows-7-windows-server-2008-r2-core-parking-intelligent-timer-tick-timer-coalescing.aspx, 2009.

[14] HardKernel Co., Ltd. Odroid-XU+E. http://www.hardkernel.com/main/products/prdt_info.php?g_code=G137463363079, 2013.

[15] Hewlett-Packard Development Company. HP Moonshot system. http://www8.hp.com/us/en/products/servers/moonshot/index.html.

[16] Intel Corporation. Intel multiprocessor specification, v1.4. http://www.intel.com/design/pentium/datashts/24201606.pdf, 1997.

[17] E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez. Core fusion: Accommodating software diversity in chip multiprocessors. In Proc. of the 34th ISCA, June 2007.

[18] J. Jann. Dynamic logical partitioning for power systems. In D. Padua, editor, Encyclopedia of Parallel Computing, pages 587–592. Springer US, 2011.

[19] B. Klug. Samsung announces Exynos 5 Octa SoC. http://www.anandtech.com/show/6602/samsung-announces-exynos-5-octa-soc-4-cortex-a7s-4-cortex-a15s, 2013.

[20] M. Nishikiori. Server virtualization with VMware vSphere 4. Fujitsu Scientific and Technical Journal, pages 356–361, 2011.

[21] NVIDIA. NVIDIA Tegra X1: NVIDIA's new mobile superchip. http://international.download.nvidia.com/pdf/tegra/Tegra-X1-whitepaper-v1.0.pdf, 2015.

[22] S. Panneerselvam and M. M. Swift. Chameleon: Operating system support for dynamic processors. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '12, pages 99–110, 2012.

[23] P. De Schrijver and A. P. Miettinen. Cpuquiet: A framework to manage CPUs. http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/08/cpuquiet.pdf.

[24] A. L. Shimpi. Intel's Haswell architecture analyzed: Building a new PC and a new Intel. http://www.anandtech.com/show/6355/intels-haswell-architecture/3, 2012.

[25] V. Srinivasan, G. R. Shenoy, S. Vaddagiri, D. Sarma, and V. Pallipadi. Energy-aware task and interrupt management in Linux. In Ottawa Linux Symposium, 2008.

[26] A. V. D. Ven. Absolute power. https://software.intel.com/sites/default/files/absolute_power.pdf.

[27] P. M. Wells, K. Chakraborty, and G. S. Sohi. Dynamic heterogeneity and the need for multicore virtualization. ACM SIGOPS Operating Systems Review, 43(2), Apr. 2009.

[28] P. Zijlstra. sched: Remove get_online_cpus() usage. https://github.com/torvalds/linux/commit/6acce3ef84520537f8a09a12c9ddbe814a584dd2, 2013.
