Cpuidle from User Space Madhu Palmur, Zhichao Li, and Erez Zadok FSL Technical Report FSL-13-05

Cpuidle from user space Madhu Palmur, Zhichao Li, and Erez Zadok FSL Technical Report FSL-13-05

Abstract when it is up, consumes 40W in C1 state, 33W in C3 state and, only 12W in C6 state. Also, On an x86 ma- In this paper we present a user space cpuidle gover- chine, C1, C3, and C6’s exit latencies are 3, 20 and 200 nor. In addition to providing a user space interface to microseconds, respectively [1]. High enter-exit latencies pick idle states for individual cores in a multicore sys- can have negative effects on the performance of the system, the governor also ensures that each core stays in tem. Hence, it is essential for the cpuidle module to pick the specified idle state forever. In other words, the cores the most appropriate idle state at any given time. do not wake up from a specified idle state unless specified by the user. This gives a user complete control over 2 Background a core’s idle states without worrying about any kind of wake-ups. As mentioned before, the Linux kernel has the cpuidle A user space governor can be very useful in scenar- module which implements the entire infrastructure of ios where every workload is run with a customized cpui- picking an appropriate idle state, transitioning cores into dle power saving algorithm. Coding different algorithms those idle states and waking them up when required. The and dynamically switching between them is fairly sim- cpuidle module has the following components: pler in user space when compared to kernel space. From our evaluation results, we have concluded that this tech- • The sysfs interface nique does not hurt power consumption savings or per- • The cpuidle governors formance benefits in any way. 1 Introduction • The core cpuidle logic Cpuidle is a module in the Linux kernel which is respon- • The platform specific driver functions sible for running some power saving routines on a core when the core does not have any task in its run queue [3]. The sysfs interface exposes several read only and read The power saving routines try to put the core into a low write interfaces. Some of the read-only interfaces in- power state or an idle state. Processors in the market clude current driver, current governor ro today offer various idle states with different power con- for the governor algorithm and driver type that has been sumption savings and wakeup latency overheads. selected. For every core read only information like All machines with i386 and x86 64 architecture fol- power consumed while in a particular idle state, la- low an open standard for power management called tency to exit out of a particular idle state is also ex- ACPI which stands for Advanced Configuration and posed. The sysfs has some writable interfaces like Power Interface. In ACPI terminology, the idle states current driver and current governor so that are called C-states. Linux code for platform spe- current driver and governor selections can be changed cific ACPI functionalities can be found in the directory dynamically. drivers/acpi [4]. Among many other things, this driver is Cpuidle governors implement the main algorithm mainly responsible for implementing hardware specific which core has to enter which idle state at any given instructions to make the actual transition of a core into time. Two of the traditional governors in the Linux ker- and out of a specified idle state. nel are the menu and the ladder governors. The ladder The idle states come with a trade-off. Every idle state governor initially starts off with the lowest possible idle has an enter and exit latency associated with it. These state for a core and increments the idle state if the core latencies are the time required for a core to enter the can afford the latency overheads of the new idle state. corresponding idle state and time required for a core to The menu governor improvises on this algorithm by not exit from the corresponding idle state respectively. The starting off with the lowest possible idle state but by trade-off is, as and when the idle states get deeper, the picking the best idle state a core can afford to transi- power consumption savings increase but the enter-exit tion into at any given point of time. The menu gover- latencies increase as well. For example, An Intel Xeon nor considers metrics like energy break even point, av- 5,600 series processor which consumes 80W of power erage load on the core into consideration, performance requirements, latency requirements, etc. while picking process is time consuming. Also, writing kernel code the best possible idle state for a core. is any day harder than writing programs in user space. The core cpuidle logic is called from the idle thread Building a tool which exports all the idle state decision for every core. It is responsible for enabling the idle in- making infrastructure to user space can hence turn out to frastructure for all online cores. Once enabled, it makes be useful. a call to the selected governor algorithm to get the next The main goals of this project are: desired idle state and then makes a call to the selected driver function to actually transition the core into the 1. Build a kernel tool that exports the whole idle state specified idle state. decision making infrastructure to the user space. The platform specific idle functions are hooked into 2. The tool should make sure that users can specify the main cpuidle logic. They implement architecture de- core specific idle state value. It should also make pendent functionalities like transitioning the cores into sure, that a core actually stays in the specified idle a selected idle state, waking up the core when certain state until the user changes it. conditions are met, etc. 3. Run a workload and prove that even though the idle 3 Related work state decisions are made in the user space, the per- The Ladder governor was the first governor algorithm formance benefits and power consumption savings written for the cpuidle module. The algorithm always are at least as compared to the existing kernel space starts off with the lowest possible idle state. When the governor algorithms. processor is idle again for the next time, it checks for the QOS requirements of the system and jumps into the 5 Design next higher idle state. When an idle state no longer meet The goal is to build a system which gives the user com- the QOS requirements of the system, the ladder gover- plete control over the idle states of the cores. The tra- nor jumps to the immediate lower idle state. This has ditional cpuidle governors have their own algorithm for proven to work well with tick-based kernels. However, picking the next c-state a core should transition into. Ini- this step-wise approach does not work well with tickless tially, we wrote a custom governor which does nothing kernels because the kernel can be idle for a really long but look at the value which indicates what c-state the time without the periodic tick and not get a chance to user is requesting for and set the next c-state value for step into a deeper idle state whenever it goes idle. that core to this value. We called that custom governor The Menu governor was written as an improvement the “Noop” governor. over the ladder governor algorithm. The Menu gover- The current design of the cpuidle module is such that nor takes into account three factors before deciding a even though the core gets transitioned into a c-state, it is C-state: energy break even point, performance impact, woken up as soon as there are tasks or IRQs that have and latency tolerance. The Menu governor implements to be scheduled on the core. To give a clearer picture, a na¨ıve prediction mechanism to speculate how long a the current flow of the idle thread is something as fol- core is going to be in a particular idle state so that C lows: [2] state entry and exit energy cost breaks even with the energy that is saved when the core actually resides in a par- 1. The idle thread gets scheduled when there are no ticular idle state. The Menu governor follows a heuristic other tasks present in the run queue of the core. of picking a C-state that has the least impact if the sys- 2. The idle thread runs an infinite loop. It calls the tem is busy. However, the Menu governor is said to not cpuidle module inside this loop. perform well with short lived tasks. 3. The cpuidle module in turn calls its governor to se- 4 Motivation lect the next c-state the core should transition into. The cpuidle governor algorithms currently run from the kernel space. The present governor algorithms are 4. The cpuidle module then transitions the core into generic, they do not offer any kind of customizations de- that c-state. pending on the workload. Not just that, implementing 5. The idle thread yields itself when there are other a new customized algorithm for every type of workload tasks present in the run queue of the core. in the kernel space is hard. Every time a new gover- nor algorithm is written, a new patch has to be written Given this, it is obvious that even though we move a and applied to the kernel, the kernel config file has to core into user requested c-state using our custom gover- be changed to run the new governor algorithm, and the nor, it is woken up time and again to run tasks and IRQs kernel has to be recompiled and installed again. This assigned to it. This means that there can be a situation

2 wherein even though the user requested a deeper c-state c state for every core, we have added a new integer field for a particular core, it can move back to C0 temporar- called c state in the structure cpuidle device. ily to empty its run queue. This is well against our goal We use the cpu hot-plug API cpu down() to bring of giving the user complete control over the idle states a core down to any c-state. In the store function of of cores. Hence, just using the noop governor will not the sysfs c state file described in the previous section, meet our desired goals. we make a call to this cpu down() function when the It is not so simple to halt a core completely without c state requested is anything greater than C0. Internally, waking it up for a long time. This is because the Linux this cpu down() API makes a call to an architecture- scheduler can pick any online core to run a particular specific function which calls monitor, MWAIT instruc- task or an IRQ. Also, each core has been destined to tions pair to move the core into a particular idle state. service some specific set of IRQs. The core should also In this function, we add an extra piece of code which keep emptying its run queue as and when the ready tasks checks c state field of the cpuidle device structure arrive. Not waking up the core can make the system and passes appropriate argument to the mwait() call. completely unresponsive. This argument determines which idle state the core will We utilized the CPU hot-plugging infrastructure [5] to enter into. Also, we use the CPU hot-plug API, cpu up() achieve our goal of keeping a core in the user requested to bring the core back up to C0 state in case the user re- c-state and never waking it up. The CPU hot-plug infras- quests a C0 state for a core which was not already in C0 tructure can be used to make a core completely invisible state. A point to note is that to move a core from one c- to the OS so that the scheduler can never try to sched- state to another c-state, it should first be brought online ule any task on that core. The CPU hot-plugging in- by using cpu up() API. Figure 1 explains this. frastructure essentially migrates off all tasks, interrupts, The third part of the implementation is to keep a cus- timers, and tasklets from the victim core to the other on- tom governor which will not interfere with any of the line cores. The CPU hot-plugging code makes use of core’s idle states. Note that the cores for which the idle the monitor, MWAIT instruction pair to halt the cores. state selected is something other than C0 will not even By picking the appropriate argument to be passed to the reach this part of the code. This is because we halt MWAIT instruction, the core can be moved to any desire that core by running cpu down() immediately after the idle state. user requests for a c-state change. So, we do not give We invoke this infrastructure whenever a user requests a chance for the governor to even run on that core after a c-state greater than C0 for a core. We still keep a cus- the c-state change. The cores for which either the user tom governor running for the rest of the cores which are has not requested for any particular c state or the user in C0 state. These cores are in C0 state either because has specifically requested for C0 state, we need to keep the user has selected C0 state for them or user never re- them in C0 state. The custom governor always selects quested for a different c-state for these cores. The gov- C0 state for such cores. ernor just selects C0 state for these cores for which the user has selected no other state other than C0. 7 Evaluation 6 Implementation We measure the following two major metrics. The user space governor should not perform worse than the kernel The implementation can be broken down to three parts. space governor w.r.t., the following metrics at any cost. The three parts are: 1. Performance 1. Change the sysfs code to accept core specific user input 2. Power consumption savings 2. Hook the cpu hot-plug infrastructure to the sysfs We have evaluated our tool with a completely CPU store operation bound workload. The workload forks some specified 3. Write a custom kernel which ensures that the core number of processes which continuously perform ran- in C0 state always stay in C0 state until the user dom mathematical floating point computations. We requests a different one wrote a simple bash script that writes the value of the desired C-state into the c-state file of the sysfs interface. We have added a new file under the sysfs di- The bash script runs the workload by specifying a spec- rectory to take the c-state value as input from the ified number of processes to be forked and keeps those user: /sys/devices/system/cpu/cpuX/cpuidle/ many number of cores active. It puts the rest of the cores c_state. The new file c state is present under the to the deepest possible C-state. We optimistically picked cpuidle directory of every core. To remember the current the deepest possible C-state to save more energy with the

3 /sys/devices/system/cpu/cpuX/cpuidle/c_state

Previous c_state == 0 c_state > 0 cpu_up( ) value

> 0

Previous Follow the regular path of the idle thread. cpu_up( ) > 0 c_state The cpuidle module uses a custom governor value to keepto the c_state value to be C0 always.

== 0

cpu_down( ) with the appropriate argument to the mwait call

Figure 1: Implementation

200 100 menu governor menu governor noop governor noop governor

80 150

100

40 Latency (seconds) 50

Power consumption (Watts) Power consumption 20

0 0 1 2 3 4 5 6 7 1 2 3 4 5 6 7 No. of cores in C2 state No. of cores in C2 state

Figure 2: Comparing Power Consumption savings of Noop Figure 3: Performance comparison of Noop governor with governor with Menu governor for CPU bound workload Menu governor for CPU bound workload intention of lowering the C-state if performance took a Figure 3 shows the latencies of our user space gover- hit. nor when compared to menu governor. The term latency All experiments are carried out on a server containing here refers to the total time taken by the system while Intel Xeon 5,650 processor. There are 6 cores in the sys- it waits for the specified number of forked processes to tem in total (with hyper threading disabled). This system complete their mathematical computations and exit. As supports three C states: C0, C1 and C2. In all our exper- it can be seen from the figure, the latency results of the iments we have used the C2 state whenever we want to noop governor is comparable with that of menu gover- move a core into an idle state. nor’s. Since the number of online cores is always equal to the number of processes forked, the result is pretty Figure 2 shows the power consumption savings of our straight forward. user space governor when compared with the menu governor. We called our user space governor the noop gover- We also ran our user space governor with a workload nor because the governor does not implement any power which has intermittent I/Os. The workload just does saving algorithm on its own but relies on the user to do Linux kernel compilation by spawning a specified num- it. As it can be seen, the power consumption savings ber of threads. The command make -jX spawns X of the noop governor are almost similar to menu gover- number of threads to finish the kernel compilation. We nor. The result is pretty intuitive because the workload is used the same bash script which starts the workload with completely CPU bound and keeps the CPU almost 100% a specified number of threads and keeps exactly those busy all the time. Since the cores that are not put in any many number of cores active and the rest are put into the idle state in our case all remain active even when the deepest possible idle state. What we observed was the menu governor is run and the cores put in idle state re- latency graph of menu and noop governor were almost main in the idle state in both the cases, menu governor similar but the power consumption graph was drastically does not save any extra energy. different.

4 predicting as to when I/O actually happen so that the menu governor noop governor window can be used to save power. When run from user 200 space, we believe this should perform at least as good as

150 the menu governor or even better because the menu governor presently does not use such a sophisticated predic-

100 tion mechanism. 9 Conclusion

Power consumption (Watts) Power consumption 50 In conclusion we have built a very useful tool that ex- 0 ports the whole idle state decision making infrastructure make -j1 make -j2 make -j3 make -j4 make -j5 make -j6 No. of cores in C2 state to the user space. This should give a lot of ﬂexibility to the users to use customized algorithms from the user Figure 4: Comparing Power Consumption savings of Noop space for every kind of workload as opposed to using a governor with Menu governor for a workload running “make generalized algorithm from the kernel space. Needless -jX” to say, writing simple programs from the user space is much easier when compared to writing governor algorithms inside the kernel. Dynamic switching between Figure 4 shows the power consumption savings of our algorithms and addition of new algorithms can be very user space governor when compared with the menu gov- easily done from the user space as compared to the ker- ernor. As it can be seen from the graph, we perform nel space. We have proved by running a completely CPU worse than the menu governor in all cases. The reason bound workload that running governor algorithms from for this is, the workload has intermittent I/Os. Consider user space can be at least as good as running it from the an example where we have spawned just one thread and kernel space. We also point out the future work of writ- we have kept just one core active to run this thread and ing sophisticated algorithms for I/O bound workloads put the rest of them into idle states. In this cases, when by using a very good prediction mechanism to predict the thread performs its intermittent I/Os, the core has I/O patterns which has the potential of performing better no work to do and runs its idle thread. This is again a than the existing kernel space algorithms. great window of opportunity to save power. We unfortu- nately keep our core active all the time without putting References it into any idle state when compared to the menu governor. Also, the change in the power consumption sav- [1] Intel Inc. Intel xeon processor 5600 series. ings is quite a lot when compared to menu governor be- http://www.intel.com/content/www/us/ cause of the following reason. When the one core that en/processors/xeon/xeon-5600-vol-2- is active goes to an idle state due to an I/O, effectively datasheet.html, 2011. all cores are in an idle state. We observed that maxi- [2] Linux 3.2.9. http://www.kernel.org. mum power consumption savings are incurred in a sys- [3] S. Li and A. Belay. cpuidle — Do nothing, efﬁ- tem when all the shared resources like L3 cache of the ciently... In Proceedings of the Linux Symposium, cores are switched off too. Shared resources can only be volume 2, 2007. switched off when all the cores are in idle states. This limitation can be overcome by writing a more intelligent [4] Acpi in linux. http://acpi.sourceforge.net. user space program that learns from the workload. The [5] Zwane Mwaikambo, Ashok Raj, Rusty Russell, and program should be able to study the workload initially, Joel Schopp. Linux Kernel Hotplug CPU Support. and start predicting as to when I/Os happen and take ap- Proceedings of the Linux Symposium, 2, July 2004. propriate action on the cores.

8 Future Work As mentioned in the previous section, even though we have the tool ready to run governor algorithms from user space and we did show that the user space governor can perform at least as good as the kernel space governor, we still do not have an algorithm ready for every type of I/O bound workloads. A simple HMM (Hidden Markov Model) type of model can be used to ﬁrst learn about the I/O patterns from the given workload and then start