
University of Calgary
PRISM: University of Calgary's Digital Repository

Graduate Studies
The Vault: Electronic Theses and Dissertations

2016

Experimental Evaluation of Speed Scaling Systems

Skrenes, Arsham Bryan

Skrenes, A. B. (2016). Experimental Evaluation of Speed Scaling Systems (Unpublished master's thesis). University of Calgary, Calgary, AB. doi:10.11575/PRISM/26057, http://hdl.handle.net/11023/3315

University of Calgary graduate students retain copyright ownership and moral rights for their thesis. You may use this material in any way that is permitted by the Copyright Act or through licensing that has been assigned to the document. For uses that are not allowable under copyright legislation or licensing, you are required to seek permission. Downloaded from PRISM: https://prism.ucalgary.ca

UNIVERSITY OF CALGARY

Experimental Evaluation of Speed Scaling Systems

by

Arsham Bryan Skrenes

A THESIS

SUBMITTED TO THE FACULTY OF GRADUATE STUDIES

IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE

DEGREE OF MASTER OF SCIENCE

GRADUATE PROGRAM IN COMPUTER SCIENCE

CALGARY, ALBERTA

SEPTEMBER, 2016

© Arsham Bryan Skrenes 2016

Abstract

Speed scaling policies are a critical component in modern operating systems, impacting both energy efficiency and performance. Energy efficiency is important from a sustainability standpoint, especially since datacenters account for roughly 2% of the global energy consumption, growing by 6% per year.

Understanding the features of modern processors facilitates the development of more effective policies. As a first contribution, this thesis provides such information, along with the details necessary to properly interpret experimental measurement results. The second contribution is a profiler that makes it easy to perform controlled workloads made up of precise units of work at defined speeds, and produces high-resolution timing and energy measurement data broken down by process and workload.

The profiler is used to collect empirical data about several theoretical speed scaling policies using a modern processor, with detailed analysis and comparisons to the most common policy on contemporary operating systems.

Acknowledgements

First and foremost, I would like to thank Carey Williamson. From the very first email to arrange our initial meeting, and throughout my entire program, he has been unfailingly kind and encouraging, exemplifying outstanding leadership and inspiring not only me, but also everyone who has had the privilege of knowing him. I cannot imagine a better supervisor, nor can words adequately express my gratitude.

I am also immensely thankful to my wife, Allison Skrenes, who was never doubtful and always encouraging, even when I felt demotivated. Her advice and strategies, forged from her own academic successes, were instrumental. Our children, Tyrian and Nyala, have had to endure this long haul, but when I had to “work on the thesis”, they had the most amazing mother at their side.

I must also thank my parents and in-laws for their encouragement. My mother was the person who initially inspired me to pursue graduate studies. She also gave me valuable advice that helped me finish the thesis.

Finally, I am grateful to my defense committee members, Dr. Mea Wang, Dr. Diwaker Krishnamurthy, and Dr. Carey Williamson, who took the time to read my thesis and provide their feedback.

Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures and Illustrations
List of Abbreviations

CHAPTER 1: INTRODUCTION
1.1 Objectives
1.2 Thesis Roadmap

CHAPTER 2: BACKGROUND AND RELATED WORK
2.1 The Importance of Energy Efficiency
2.2 Theory and Practice
2.3 Speed Scaling Process Schedulers
2.4 Summary

CHAPTER 3: PROBLEM FORMULATION
3.1 Profiling Schedulers on Real Systems
3.2 High-Resolution Hardware Timers
3.3 Running Average Power Limit (RAPL)
3.3.1 Using the RAPL MSRs within Profilo
3.4 Summary

CHAPTER 4: DESIGN AND IMPLEMENTATION
4.1 Design Choices
4.1.1 User Mode
4.1.2 Kernel Mode
4.1.3 Kernel/User Mode Hybrid
4.1.4 Workload
4.1.5 Units of Work
4.2 Implementation
4.2.1 Kernel Module
4.2.2 Lockup Detectors
4.2.3 Clocks
4.2.4 User Mode Application
4.2.4.1 Processing the Arguments
4.2.4.2 CPU Compatibility
4.2.4.3 Environment Setup
4.2.4.4 Preparations for Profiling
4.2.4.5 Profiling
4.2.4.6 Concluding Profiling
4.2.5 Idler Utility
4.3 Summary

CHAPTER 5: MICRO-BENCHMARKING
5.1 Platform
5.2 ACPI Specification
5.3 Total System Power Consumption
5.3.1 Global Sleep States
5.3.2 Normalizing Measurements
5.3.3 Idle Power Consumption
5.3.3.1 Linux cpuidle
5.3.3.2 CPU Sleep States
5.3.4 Active Power Consumption
5.4 Profilo Workload Benchmarking
5.5 Mode and Context Switches
5.6 Switch Test
5.7 Summary

CHAPTER 6: PROFILING COUPLED AND DECOUPLED SPEED SCALERS
6.1 Workloads
6.2 Building the Trace Files
6.2.1 The PS Generator
6.2.2 The FSP-PS Generator
6.2.3 The YDS Generator
6.3 Profiling Results
6.4 Summary

CHAPTER 7: CONCLUSIONS AND FUTURE WORK
7.1 Thesis Summary
7.1.1 The Importance of Speed Scaling Scheduling Policies
7.1.2 Building a Profiler
7.1.3 Examining Modern Architectures
7.1.4 Experimental Evaluation of Speed Scaling Schedulers
7.2 Conclusions
7.3 Relevance and Contributions
7.4 Future Work

APPENDIX A

APPENDIX B

APPENDIX C

REFERENCES

List of Tables

Table 5.1 Power Consumption of Components
Table 5.2 Ivy Bridge C1 Power Savings
Table 5.3 Ivy Bridge C-State Power Measurements
Table 5.4 Ivy Bridge and Broadwell C-State Durations
Table 5.5 Single-Core Busy Wait
Table 5.6 Single-Core Trial Division
Table 5.7 Benchmark Results (150 Primes)
Table 5.8 Mode and Context Switch Test
Table 5.9 Switch Test Results
Table 5.10 Mode Switch Durations and Energy Estimates
Table 6.1 Profiling Results

List of Figures and Illustrations

Figure 3.1 Structural Illustration of RAPL MSRs
Figure 4.1 Trial Division Primality Test Workload
Figure 4.2 Conceptual Overview of Profilo
Figure 6.1 Profiling Graph (Workloads 1-5)
Figure 6.2 Profiling Graph (Workloads 4-5)
Figure A.1 Inline Assembly to Read MSRs
Figure B.1 Workload C Programming Language Code
Figure C.1 PS Trace File
Figure C.2 Verbose Profilo Summary for PS Example
Figure C.3 FSP-PS Trace File
Figure C.4 Profilo Summary for FSP-PS Example

List of Abbreviations

AC Alternating Current

ACPI Advanced Configuration and Power Interface

AES Advanced Encryption Standard

AES-NI Advanced Encryption Standard New Instructions

AKS Agrawal–Kayal–Saxena Primality Test

ALU Arithmetic Logic Unit

API Application Programming Interface

AVX Advanced Vector Extensions

BLE Bluetooth Low Energy

CPU Central Processing Unit

CSV Comma-Separated Values

DC Direct Current

DDR Double Data Rate

DMA Direct Memory Access

DMI Direct Memory Interface

DRAM Dynamic Random-Access Memory

DVFS Dynamic Voltage and Frequency Scaling

EDF Earliest Deadline First

EPA Environmental Protection Agency

FIVR Fully-Integrated Voltage Regulator

FPU Floating-Point Unit

FSP Fair Sojourn Protocol

GB Gigabyte

GDDR Graphics Double Data Rate Synchronous Dynamic Random-Access Memory

GHz Gigahertz

GIMPS Great Internet Mersenne Prime Search

GPR General Purpose Register

GPU Graphics Processing Unit

HLT Halt (x86 Instruction)

HPET High Precision Event Timer

ICT Information and Communications Technology

IO Input/Output

IRQ Interrupt Request

ISA Instruction Set Architecture

KB Kilobyte

LAN Local Area Network

LCD Liquid-Crystal Display

LTS Long Term Support

MB Megabyte

MHz Megahertz

MMX Multimedia Extension

MRT Mean Response Time

MSR Machine Specific Register

NEMA National Electrical Manufacturers Association

NMI Non-Maskable Interrupt

OEM Original Equipment Manufacturer

OpenGL Open Graphics Library

OS Operating System

OSX Mac Operating System 10

PCH Platform Controller Hub

PCI Peripheral Component Interconnect

PCU Power Control Unit

PIT Programmable Interval Timer

PKG Package Power Plane (Entire CPU)

PLL Phase-Locked Loop

PM Power Management

POS Power On Suspend

PP0 Power Plane 0 (Cores)

PP1 Power Plane 1 (Uncore Devices / GPU)

PS Processor Sharing

RAM Random Access Memory

RAPL Running Average Power Limit

RMS Root Mean Square

RPM Revolutions Per Minute

RTC Real-Time Clock

SI International System of Units (Système International d'unités)

SIMD Single Instruction, Multiple Data

SMI System Management Interrupt

SMM System Management Mode

SMP Symmetric Multiprocessing

SRAM Static Random-Access Memory

SRPT Shortest Remaining Processing Time

SSE Streaming SIMD Extensions

TDP Thermal Design Power

TSC Time Stamp Counter

TSS Task State Segment

VCC IC Power-Supply Pin (Positive)

WLAN Wireless Local Area Network

YDS Yao-Demers-Shenker Speed Scaling Scheduling Algorithm


Chapter 1: Introduction

Energy efficiency has always been a consideration in processor design, but in recent years it has topped the list of priorities. When Intel abandoned its NetBurst architecture in 2004, forfeiting a couple of years of costly research and development in favour of a considerably different design that promised greater energy efficiency, it made its priorities very clear. Since then, Intel has reduced thermal design power (TDP), increased the number of idle and sleep states, and added features that manage the behaviour of the processor to maximize energy efficiency and minimize cooling requirements.

This change in priorities has also been reflected in the design of process schedulers – the software component that manages how jobs are executed on the processor. The concept of changing the service rates at which jobs are executed is called speed scaling. Incorporating speed scaling into scheduling policies has led to additional complexity that has attracted analysis in the literature. However, for the sake of analysis and comparison, assumptions are made and characteristics of hardware are not modeled, which separates theory from practice.

Connecting theory and practice offers the possibility of improvements to the systems community, while understanding the features of modern processors can give rise to improved algorithms that are first described and analysed in theory. Bridging this gap can serve as a catalyst for progress. Unfortunately, modern processor features tend to only be described in datasheets, Web site reviews, and blogs, which often lack a quantitative assessment. On the theory side, algorithms can make simplifying assumptions and omit important behaviours that make experimental evaluation challenging.


This thesis tries to address these concerns by bridging the two perspectives. To accomplish this, the features and characteristics of Intel’s newest microarchitecture need to be described and quantified through careful micro-benchmarks. This includes an examination of all sleep, idle, and active states, describing and quantifying well-known properties like the power rating of the maximum discrete speed, as well as properties like the exit latency of the deepest idle state. This information helps answer what processor states and features are available to improve performance and energy efficiency. It also addresses what behaviours are necessary for a modern processor to accomplish work while saving power.

The second thing needed to address the gap between theory and systems is a profiling tool that allows different schedulers and speed scalers to be evaluated on real hardware. While there may have been attempts at doing this in the past, to the best of my knowledge, there has never been a generalized profiler that produces high-resolution timing and multi-domain energy consumption data. Furthermore, the profiler described in this thesis can provide this information without ancillary (and costly) equipment, requiring only an Intel processor from 2011 or later.

Such a profiler can help answer these two important and pragmatic questions:

1. How different are the speed scaling scheduling policies that make different trade-offs to attain certain optimizations?

2. How do the best of these policies compare to the implementations found on modern operating system kernels?


1.1 Objectives

The objectives of this research are as follows:

• Determine the performance and energy saving features of modern processors, as well as their analytical properties, and then quantify this information through micro-benchmarks.

• Develop a generalized profiler that makes it easy to execute controlled workloads made up of precise units of work at defined speeds, and receive high-resolution timing and energy measurements, broken down by process and workload.

• Profile decoupled speed scaling and other interesting speed scaling and scheduling policies on real hardware to see what gains can be made compared to the policies found in modern operating systems.

1.2 Thesis Roadmap

This thesis is organized as follows. Chapter 2 provides further background and motivation for this research, and describes the prior research work from both the theoretical and systems communities. Chapter 3 delves into the problem formulation and explains, in detail, the technical resources that are used to address these challenges. Chapter 4 presents the details of the design and implementation of Profilo, the profiler that fulfills one of the thesis objectives. Chapter 5 presents the architectural features and micro-benchmarking results of the test platform. It also sets the context for the subsequent chapter. Chapter 6 examines the profiling results of the speed scaling scheduling policies described in Chapter 2. It compares two theoretical policies to the most common implementation in modern operating systems, breaking the results down through several metrics, thereby satisfying the final thesis objective.


Chapter 2: Background and Related Work

The goal of this chapter is to provide some of the background information necessary to understand the motivation for Profilo. It illustrates the importance of process schedulers and speed scaling policies from both a theoretical and systems perspective.

2.1 The Importance of Energy Efficiency

In recent years, the power efficiency of computer processors has become increasingly important. A 2007 report estimated that the total footprint of the ICT sector accounted for 2% of the world's carbon emissions from human activity [59], and projected a growth rate of 6% per year, at least until 2020. A more recent 2013 report suggests that the entire ICT ecosystem may be responsible for as much as 10% of the world's electricity consumption [40]. In response to a 2007 Congressional request, the United States Environmental Protection Agency (EPA) estimated that data centers located in the United States consumed 61 billion kilowatt-hours, or roughly 1.5% of the nation's electricity use [74]. At the other end of the spectrum, mobile devices also benefit greatly from improvements in energy efficiency, due to their limited battery capacity.

The most energy-intensive component of almost any computer is its processor(s). While some datacenters may use graphics processing units (GPUs) as compute resources, all datacenters utilize central processing units (CPUs). The CPU is therefore the focus of Profilo and this thesis.


There have been many ideas to improve processor power efficiency. One solution has been Dynamic Voltage and Frequency Scaling (DVFS); in modern processors this is more commonly referred to as CPU speed scaling. There are typically several discrete voltage/frequency states that can be used. Because power consumption grows in proportion to the frequency multiplied by the square of the voltage (which must be increased to maintain stability at higher frequencies), it is substantially more energy efficient to run a processor at half the frequency for a given job, taking twice as long, than it is to run the same processor at its full frequency. To properly accomplish DVFS, a software component is required. More explicitly, the kernel scheduler in conjunction with its speed scaling policy controls the processor frequency (and voltage).
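As a rough illustration (a textbook approximation rather than a measurement from this thesis, and one that assumes voltage scales roughly linearly with frequency while ignoring static leakage power): dynamic power can be modelled as $P_{dyn} \approx C f V^2$ with $V \propto f$, so $P_{dyn} \propto f^3$. A job of $W$ cycles takes $T = W/f$ seconds, giving dynamic energy $E = P_{dyn} \cdot T \propto W f^2$. Halving the frequency therefore doubles the execution time but reduces the dynamic energy to roughly one quarter.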

2.2 Theory and Practice

In dynamic CPU speed scaling systems, the speed at which the processor executes is adjusted over time based on the ambient workload demand. If no jobs are present, the processor can enter a rest state (e.g., “sleep”, “idle”, or “gated off”) to drastically reduce power consumption. In the presence of one job, the processor can run at a modest baseline speed. If the number of active jobs increases, the processor speed can be increased, perhaps to some maximum rate, to dissipate the backlog quickly. Speed scaling strategies produce interesting trade-offs between response time, fairness, and energy consumption. These issues have fostered research efforts in both the systems community and the theory community.

In the published literature, there is a dichotomy between the speed scaling results for the systems and theory research communities. The theoretical work tends to provide elegant results on the optimality and efficiency of speed scaling algorithms [3], [5], [9], [11], [12], [62], albeit under many assumptions (e.g., weighted cost functions for delay and energy consumption; known job sizes; energy-proportional operation; job-count-based speed scaling; continuous and unbounded speeds; zero cost for context switches, speed changes, or return from sleep states). Simulation is sometimes used to augment the evaluation of speed scaling systems, but the simulators often have similar assumptions as the analytical work.

In the systems community, research tends to focus on DVFS. In this context, practical issues such as processor utilization, heat dissipation, and job size variability are primary considerations [39], [47], [48], [56], [61], while optimality is not. Other concerns arise regarding granularity of control, the set of discrete speeds available, non-linear power consumption effects, and unknown job characteristics [54]–[56], [60], [61]. Practical energy saving techniques include threshold-based control (Section 4.2.4.3), race-to-idle [4], and power-gating [23], [31].

In the theory community, speed scaling typically assumes a continuous and unbounded range of available speeds, with the choice of speed determined either by job deadlines [65] or system occupancy [6], [9]. Albers et al. have done extensive work on energy-efficient algorithms [3], [5]. Some of this work optimizes the trade-off between energy consumption and mean response time. Several studies on this metric suggest that energy-proportional speed scaling is near optimal [6], [9]. An alternative approach has focused on minimizing the response time in systems, given a fixed energy budget [11], [12].


Andrew, Lin, and Wierman formally consider the trade-offs between response time, fairness, and energy consumption [6]. Their paper identifies algorithms that can optimize up to two of these metrics, but not all three. For example, SRPT (Shortest Remaining Processing Time) is optimal for response time [50], but can be unfair, while PS (Processor Sharing) is always fair, but suboptimal for both response time and energy [6], [21]. Decoupled speed scaling divorces speed selection from system occupancy [21]. This violates the definition of "natural" speed scaling in [6], since it can require speed changes at arbitrary points in job execution, even if occupancy remains the same. While decoupled speed scaling provides an elegant theoretical model, it has not been implemented nor evaluated in a practical system [53]. The primary motivation for creating Profilo was to provide this empirical data on a real system, the results of which are explored in Chapter 6.

2.3 Speed Scaling Process Schedulers

Most operating system kernels today, including the Linux kernel, employ variations on the round-robin scheduling algorithm [14]. This scheduler originated in the early days of operating systems as a vast improvement over non-preemptive multiprogramming. In the theoretical community, this algorithm is idealized as processor sharing (PS) [34]. It is often cited in the literature because it sets the benchmark for fairness [8], [28] and is free of problems like process starvation. Even in speed scaling systems, as long as the chosen speed is a function of the number of jobs in the run queue, PS remains the criterion for fairness [6]. That is the reason this early algorithm persists in both the theoretical and systems communities.


One of the design goals with speed scaling process schedulers is to run the system just fast enough to complete all of the work in a timely fashion, but no faster. By doing so, the energy consumption for completing the workload is as low as possible. One of the earliest systems papers on speed scaling was by Weiser et al. [61]. In their work, they considered a diverse mix of processes in a Unix system, and attempted to determine the energy savings if the jobs were executed using different system speeds. A subset of the same authors later contributed to one of the seminal theoretical papers on speed scaling [65]. Their paper proposed an optimal offline algorithm for speed scaling (now known as YDS, based on the names of the authors), with the objective of minimizing power consumption.

The YDS algorithm works by calculating the minimum speed required to finish each job by a certain deadline (the job departure times under PS can be used), based on each job getting uninterrupted execution to completion in earliest deadline first (EDF) order. Execution speeds are calculated by considering the set of jobs with the most stringent deadlines, which defines the work intensity. By remaining at such a speed, all jobs will finish at or before the deadline. The algorithm then recursively calculates the minimum speed required for the remaining jobs. Yao et al. also proposed a heuristic online algorithm for the same problem. Both algorithms are deadline-based, and require knowledge of job sizes and deadlines.
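Stated as a formula (this is the standard formulation of YDS from the literature, included here only to make the recursive step concrete; $r_j$, $d_j$, and $w_j$ denote the release time, deadline, and work of job $j$): the algorithm finds the critical interval $I^* = [t_1, t_2]$ that maximizes the intensity

\[ g(I) = \frac{\sum_{j:\ [r_j, d_j] \subseteq I} w_j}{t_2 - t_1}, \]

runs at speed $g(I^*)$ over $I^*$ while executing those jobs in EDF order, then removes that interval and its jobs and recurses on the remaining instance.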

SRPT also requires knowledge of job size, but uses it to optimize mean response time. Under certain job size distributions, SRPT can improve every job over PS [10], while consuming less energy [21]. However, SRPT can be unfair to the large jobs [63]. There are several scheduling policies that try to maintain this response time advantage while considering fairness. The Fair Sojourn Protocol (FSP) [25] scheduling policy incorporates qualities of PS while honouring parts of SRPT. It conceptually runs a "virtual PS" queue to determine the order that jobs would complete under PS scheduling, and then devotes full processing resources to each of the jobs in that order. FSP dominates PS in the sense that no job finishes later with FSP than in PS, while mean response time is improved. As a result, the property of fairness is maintained.

The trouble with FSP is that it is ill-defined for job-count-based speed scaling, unless the speed scaling policy is decoupled, as described by Elahi et al. [21]. This concept of decoupled speed scaling involves separating the scheduling policy that determines what the processor works on from the speed scaling policy that determines the speed at which the processor does work. Therefore, a scheduling policy that optimizes mean response time can be used in conjunction with a speed scaling policy that optimizes power efficiency. Furthermore, under this decoupled regime, a policy like FSP is well defined, and retains its property of dominating PS. This mixing and matching of policies allows for numerous interesting combinations and may indeed make it possible to simultaneously attain fairness, robustness, and near optimality [21].

2.4 Summary

This chapter began with the motivation and importance of energy efficient practices, particularly given the global impact of the ICT sector. Energy consumption is one of several metrics used to evaluate speed scaling scheduling policies. A number of policies were then described. A few of notable interest are: the YDS policy, which minimizes power consumption; the PS policy, which epitomises fairness; and the FSP-PS decoupled system (i.e., FSP as the scheduler, with virtual speed changes under PS used as the speed scaling policy), which is provably efficient and has simulation results that suggest a noteworthy performance advantage over PS [21]. These three policies are evaluated on a real system (described in Section 5.1) with results in Chapter 6.


Chapter 3: Problem Formulation

From the review of background research in the previous chapter, it is clear that the speed scaling models used by the theoretical community are oversimplified, with several limiting assumptions.

The best way to determine the effectiveness of scheduling and/or speed scaling algorithms is to test them on a real system. Building a custom kernel is not practical every time a new algorithm is to be evaluated. Furthermore, some algorithms require known job sizes or other information that is not available on real systems. However, knowing how an algorithm performs on a real system is helpful for determining potential savings, and perhaps could lead to a practical implementation on a real operating system.

To bridge the gap between theory and practice, a profiling tool named Profilo was created that allows different scheduling and speed scaling algorithms to be run on real systems. This chapter describes the technical requirements for such a profiler, and the features of modern processors that facilitate this endeavour.

3.1 Profiling Schedulers on Real Systems

As outlined in Section 2.1, the energy impact of the ICT ecosystem is significant and global. It is therefore fitting for energy consumption/efficiency to be an important metric in the evaluation of speed scaling scheduling policies. Traditionally, this would have been difficult to accomplish because it would require an expensive, complicated, pre-calibrated power meter. Such a meter would need to export data, which in turn needs processing and analysis to isolate the scheduling policy activity from the rest of the system's energy consumption. However, most CPUs and GPUs manufactured in the last few years support accurate energy readings that isolate their specific domain. Of particular interest in this thesis are the Intel Running Average Power Limit (RAPL) Model Specific Registers (MSRs) supported on Sandy Bridge and newer Intel x86 microarchitectures [69, Vol. 3B Ch. 14.9.1]. This is currently the only energy interface supported by Profilo, although there is no technical reason preventing other vendors from being supported.

All consumer and server grade operating systems have standard timing libraries with a resolution that is similar to a time quantum (the maximum amount of time a process can occupy the processor without preemption). It is therefore important to have access to a much higher resolution timing system to properly evaluate speed scaling schedulers. Fortunately, most modern processors contain high-resolution hardware timers.

On the Linux operating system, high-resolution timers are supported through a secondary timing subsystem that is described in Section 3.2. With regards to the RAPL interface (described in Section 3.3), Linux does not directly support any type of energy profiling hardware; however, most distributions ship with an MSR kernel module that gives any process with root privileges a relatively straightforward mechanism to access any MSR. It is for this reason, as well as those discussed in Section 4.1.2, that Linux is the chosen operating system for building Profilo.

Incidentally, despite the flexibility, ingenuity, and carefully designed features such as file-based user/kernel mode boundaries, the Linux kernel implements a variation of PS as its scheduler and a relatively simple threshold-based speed scaler (described in Section 4.2.4.3). This is not unique to Linux; most contemporary operating system kernels are similarly implemented. Therefore, a final criterion for a good profiler is its ability to control the CPU directly, by circumventing its hosting kernel. This design is described in detail throughout Chapter 4.

3.2 High-Resolution Hardware Timers

All operating systems require kernel timers for preemption, drivers, timeouts, and user mode services. The Linux kernel has two separate timer subsystems. The standard timer framework is a low resolution solution based on units of jiffies, which on most platforms is 10 milliseconds [68].

Its primary structure is a timer wheel, which is made up of 5 buckets that represent logarithmically sized blocks of time in the future. Timers are moved into smaller blocks until they are expired. Each bucket contains the timers in a linked list. Therefore, timer insertions are O(1) complexity, but they cascade to expiration in O(n) time, where n is the number of timers [32]. Since the initial and predominant usage of this framework is driver time-outs that then generate errors, the cascades are often cut short because the timers are removed before they expire.

The second timer framework is the high-resolution timers subsystem, called hrtimers. It is better suited to precise measurement than to countdown timers, though the latter are supported through an API similar to the standard timer framework, or through other kernel functions (e.g., usleep_range). Rather than a timer wheel, hrtimers uses a time-ordered red-black tree [13], [27] (a type of self-balancing binary search tree), a data structure that is implemented as a kernel library. The timers are ordered at insertion to minimize processing at activation, since these timers tend not to be cancelled, unlike the general case for the standard timers.

As the name suggests, hrtimers is based on a much higher resolution unit of one nanosecond, with a typically high level of accuracy and no aliasing, since the common hardware clock sources on modern systems often have resolutions 3 or 4 times finer than a nanosecond (i.e., in the low hundreds of picoseconds).

The hrtimers subsystem is used by several kernel features, such as the lockup detectors, which make use of the hrtimers interrupt and callback functions (discussed in Section 4.2.2). The subsystem is also exposed to user mode applications through the clock_gettime function that Profilo uses to time processes as accurately as possible (see Section 4.2.4.4).
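For illustration, the sketch below shows the general pattern for timing a region of code with clock_gettime from user mode; the loop being timed is a placeholder, and this is not the actual Profilo source (which is described in Section 4.2.4.4):

    #include <stdio.h>
    #include <time.h>

    /* Elapsed time between two timespec readings, in nanoseconds. */
    static long long elapsed_ns(struct timespec start, struct timespec end)
    {
        return (long long)(end.tv_sec - start.tv_sec) * 1000000000LL
               + (end.tv_nsec - start.tv_nsec);
    }

    int main(void)
    {
        struct timespec start, end;
        volatile unsigned long sink = 0;
        unsigned long i;

        /* CLOCK_MONOTONIC is unaffected by wall-clock adjustments. */
        clock_gettime(CLOCK_MONOTONIC, &start);
        for (i = 0; i < 1000000UL; i++)   /* placeholder workload */
            sink += i;
        clock_gettime(CLOCK_MONOTONIC, &end);

        printf("elapsed: %lld ns (sink=%lu)\n", elapsed_ns(start, end), sink);
        return 0;
    }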

Modern systems have dozens of clocks and even several high-resolution clocks with various levels of precision and reading costs. The differences span more than an order of magnitude. This is why this subsystem has a clock manager, which selects the most suitable clock, and supports user intervention to force a clock source. A thorough discussion of system clocks is featured in Section 4.2.3.

3.3 Running Average Power Limit (RAPL)

When Intel introduced the 80386 processor in 1986, it included two test registers that were not guaranteed to be included in future versions of the x86 instruction set architecture (ISA). These registers were called model-specific registers (MSRs) or machine-specific registers. The subsequent 80486 processor, introduced in 1989, included a generalized rdmsr instruction to read an MSR and a wrmsr instruction to write to an MSR. It also included a CPUID instruction (discussed in Section 4.2.4.2) to help determine the features present on a particular processor.

These instructions have persisted to present-day microarchitectures. Some of the MSRs have been inherited from previous generations without modification, and have become architectural MSRs. Modern processors have hundreds of MSRs, with an entire chapter of the Intel 64 and IA-32 Architectures Software Developer's Manual [69, Vol. 3C, Ch. 35] dedicated to their addressing, scope, and brief description.

The Intel Running Average Power Limit (RAPL) interface is available on Sandy Bridge (2011) and newer Intel microarchitectures. It comprises several non-architectural MSRs that allow policies to be set for several domains, which in turn manage the thermal output and energy consumption of the processor and memory [20], via throttling, to keep them within defined envelope(s). For this approach to work, accurate power meters are required. This is done through a digital event-based pre-calibrated power model that uses performance counters and I/O to increment energy counters [69, Sec. 14.9.1]. Fortunately, Intel exposes the raw values of the power meters along with conversion values to transform the counters to SI-based units. The values produced by the power meter have independently been found to match actual power measurements [29], [49].

There are four domains in the RAPL interface; however, only three of these domains are available on any given CPU [69, Vol. 3B Ch. 14.9.3]. The CPUs targeting the server market have Package (PKG), Power Plane 0 (PP0), and DRAM domains. The CPUs targeting the client/consumer market have PKG, PP0, and PP1 domains. For compatibility with both product categories, Profilo only makes use of the PP0 (the processor cores) and PKG (the entire CPU) domains.

The PP0 domain energy counter is available through an MSR called MSR_PP0_ENERGY_STATUS, while the PKG domain energy counter is available through an MSR called MSR_PKG_ENERGY_STATUS. The values to convert the RAPL counters to SI-based units are contained in the MSR called MSR_RAPL_POWER_UNIT. Figure 3.1 shows the structure of these MSRs, with MSR_PP0_ENERGY_STATUS at the top (sourced from [69, Vol. 3B, Fig. 14-34]) and MSR_RAPL_POWER_UNIT at the bottom (sourced from [69, Vol. 3B, Fig. 14-32]). The format of the MSR_PKG_ENERGY_STATUS register is the same as MSR_PP0_ENERGY_STATUS.


Figure 3.1 Structural Illustration of RAPL MSRs

Profilo only makes use of the "energy status units" portion of MSR_RAPL_POWER_UNIT, which is an unsigned integer stored in bits 8 to 12. This value is used with the formula

\[ E_{unit} = \frac{1}{2^{\mathrm{energy\ status\ units}}} \ \mathrm{joules} \]

to create a unit in joules. The current default in binary is 10000 (i.e., 16), which means the energy status unit is $\frac{1}{2^{16}}$ J, or increments of approximately 15.3 microjoules. It is important to note the reserved bits in these MSRs. In particular, half of the MSR_PP0_ENERGY_STATUS and MSR_PKG_ENERGY_STATUS MSRs are reserved for future use, leaving only 32 bits for the PP0/PKG energy counters. Taking into account the value of the current energy status unit, this means that the energy counters wrap around every $(2^{32} - 1) \cdot \frac{1}{2^{16}} \approx 65536$ joules. On the testing platform, with a measured peak power consumption of 72 watts for the package (see Section 5.3.4), the energy counter would wrap approximately every 15 minutes. Intel recommends tallying the results, taking into account integer overflow, at least every 60 seconds [69, Vol. 3B Ch. 14.9.3].
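The arithmetic above can be illustrated with a short calculation (the raw register value here is hypothetical; in practice it would come from reading MSR_RAPL_POWER_UNIT through the MSR module):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t rapl_power_unit = 0xA1003;  /* hypothetical raw MSR_RAPL_POWER_UNIT value */

        /* The "energy status units" field occupies bits 8 to 12. */
        unsigned esu = (unsigned)((rapl_power_unit >> 8) & 0x1F);
        double joules_per_tick = 1.0 / (double)(1ULL << esu);   /* default esu=16 -> ~15.3 uJ */

        /* The 32-bit counters wrap after (2^32 - 1) ticks. */
        double wrap_joules  = (double)UINT32_MAX * joules_per_tick;
        double wrap_minutes = wrap_joules / 72.0 / 60.0;   /* assuming a 72 W package draw */

        printf("energy unit: %.9f J, wrap: %.0f J (~%.1f min at 72 W)\n",
               joules_per_tick, wrap_joules, wrap_minutes);
        return 0;
    }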

Another important detail regarding the energy counters is their update frequency. Updates occur approximately every millisecond, most of the time. The jitter is fairly high at 20 milliseconds [29], which is why Profilo runs should be sufficiently long. There is also a system management interrupt (SMI) that puts these processors into System Management Mode (SMM). This mode is used to handle power management functions, system hardware control, and run OEM-designed system firmware. As a result, SMM is outside of the control of general-purpose systems software, including operating system kernels, even when interrupts are disabled. According to analysis performed by [29] on their Sandy Bridge based machine, the switch to SMM is periodic and occurs every 16 milliseconds. This can cause update delays to the RAPL counters in excess of 120 milliseconds. In their work, they mitigate these issues by delaying their benchmark until the first RAPL counter updates immediately after the processor exits SMM. When the benchmark is complete, they note the time, and then measure the delay until the next RAPL counter update, making estimates to subtract the energy associated with this delay.

In the interest of simplifying the implementation, Profilo does not take into consideration the SMM. If profile runs are much longer than 16 milliseconds, this should not impose too much inaccuracy. The Profilo kernel module exposes two sysfs files, called sleep_busy and sleep_deep (see Section 4.2.1), which wait for both PKG and PP0 energy counters to update before storing the energy and high-resolution timer values. Before terminating, they once again wait for both of the energy counters to change before calculating the final duration and energy consumption. This is why both sysfs files have a minimum duration of roughly 1 millisecond. The user mode portion of Profilo is even simpler and only waits for the energy counters to change before profiling. To avoid introducing more variance to the duration and overestimating the energy usage, when profiling is complete, it calculates the energy consumption from the last reading of the MSRs, without waiting for them to change.

It is important to emphasize that when waiting for the energy counters to change, both of them need to be checked. The reason for this is that they update asynchronously (roughly 0.4 microseconds apart on the test platform described in Section 5.1), with no guarantees on which counter updates first.

3.3.1 Using the RAPL MSRs within Profilo

The RAPL implementation details are handled in user mode by a function call to startRAPLmetric. This function has one argument: a pointer to the custom cpuData structure described in Section 4.2.4.5, which essentially stores the user mode application's global context for the energy counters, and an open file descriptor to the MSR kernel module sysfs file (see Section 4.2.4.2). The function initializes two temporary integers that are equal to the bottom 32 bits (using a mask implemented with the bitwise and operator) of the MSR_PP0_ENERGY_STATUS and MSR_PKG_ENERGY_STATUS MSRs. It then runs a loop function that continuously reinitializes the cpuData copies of the MSRs until they both differ from the temporary variables. Each loop iteration takes roughly 10 nanoseconds, and when both RAPL counters have changed, the function returns.
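A minimal sketch of this wait loop follows, assuming a hypothetical read_msr_lo32 helper built on the Linux msr module (the real startRAPLmetric works on the cpuData structure and differs in detail; the MSR addresses are the standard RAPL addresses):

    #include <stdint.h>
    #include <unistd.h>

    #define MSR_PKG_ENERGY_STATUS  0x611
    #define MSR_PP0_ENERGY_STATUS  0x639

    /* Hypothetical helper: the msr module exposes a per-CPU file whose read
       offset is the MSR address; only the lower 32 bits hold the counter. */
    static uint32_t read_msr_lo32(int msr_fd, uint32_t msr)
    {
        uint64_t value = 0;
        if (pread(msr_fd, &value, sizeof(value), msr) != (ssize_t)sizeof(value))
            return 0;   /* read failure; a real implementation would report it */
        return (uint32_t)(value & 0xFFFFFFFFu);
    }

    /* Spin until BOTH energy counters advance, so profiling starts on a fresh
       RAPL update (the two counters update asynchronously). */
    static void wait_for_rapl_update(int msr_fd, uint32_t *pp0, uint32_t *pkg)
    {
        uint32_t pp0_old = read_msr_lo32(msr_fd, MSR_PP0_ENERGY_STATUS);
        uint32_t pkg_old = read_msr_lo32(msr_fd, MSR_PKG_ENERGY_STATUS);

        do {
            *pp0 = read_msr_lo32(msr_fd, MSR_PP0_ENERGY_STATUS);
            *pkg = read_msr_lo32(msr_fd, MSR_PKG_ENERGY_STATUS);
        } while (*pp0 == pp0_old || *pkg == pkg_old);
    }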

The cpuData structure also contains the "energy status units" portion of the power unit MSR (MSR_RAPL_POWER_UNIT), from which a calculation is used to convert the energy counters to joules. It also has two unsigned 64-bit integers called pp0LastRead and pkgLastRead that are copies of the MSR_PP0_ENERGY_STATUS and MSR_PKG_ENERGY_STATUS MSRs, respectively. Finally, it contains two double-precision floating-point variables called pp0Consumed and pkgConsumed.

To guard against integer overflow, there is a function called tallyRAPLmetric that takes a pointer to cpuData. It first reads from the core and package MSRs, using the open file descriptor to the MSR module. It then calculates the difference (taking into account integer overflow) since the last read of the MSRs, using 64-bit integers. Finally, it multiplies this difference by the pre-calculated energy unit and adds it to the double-precision floating-point variables, which contain the number of joules consumed since the startRAPLmetric function returned. It concludes by updating pp0LastRead and pkgLastRead. The implementation of Profilo follows Intel's conservative RAPL overflow advice, and tries to run this tally function every minute through a mechanism described in Section 4.2.4.5. The function is always run at the end of profiling to calculate the total number of joules consumed.
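The overflow-safe accumulation can be sketched as follows; the structure and function names are hypothetical stand-ins for the cpuData members and tallyRAPLmetric logic described above:

    #include <stdint.h>

    struct cpu_data {                 /* simplified stand-in for cpuData */
        double   energy_unit;         /* joules per counter tick (Section 3.3) */
        uint64_t pp0_last, pkg_last;  /* last raw 32-bit counter values */
        double   pp0_joules, pkg_joules;
    };

    /* Difference between two 32-bit counter samples, allowing for one wrap. */
    static uint64_t counter_delta(uint64_t last, uint64_t now)
    {
        return (now >= last) ? (now - last) : (now + (1ULL << 32) - last);
    }

    /* Fold the energy consumed since the previous read into the running totals. */
    static void tally_rapl(struct cpu_data *d, uint64_t pp0_now, uint64_t pkg_now)
    {
        d->pp0_joules += counter_delta(d->pp0_last, pp0_now) * d->energy_unit;
        d->pkg_joules += counter_delta(d->pkg_last, pkg_now) * d->energy_unit;
        d->pp0_last = pp0_now;
        d->pkg_last = pkg_now;
    }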

The startRAPLmetric and tallyRAPLmetric functions are part of the user mode application and rely on the MSR kernel module. As mentioned in the previous section, the Profilo kernel module exposes two sysfs files, called sleep_busy and sleep_deep (details in Section 4.2.1), which both rely on hrtimers and the RAPL MSRs, just like the user mode application. The hrtimers subsystem is easily accessible through types like ktime_t and functions like usleep_range, both of which are used by the Profilo kernel module. However, there is no formal way of accessing MSRs, much less the RAPL MSRs. Therefore, to accomplish this in kernel mode, inline assembly language code had to be written (see Appendix A).

The idea behind the assembly code is fairly straightforward. First, the address of the desired MSR is loaded into the ECX general purpose register (GPR). Second, the rdmsr instruction is executed, which dumps the lower 32 bits of the MSR into the general purpose EAX register, and the upper 32 bits of the MSR into the general purpose EDX register. The upper 32 bits of the 64-bit GPRs (called RAX/R0 and RDX/R2) are cleared by the rdmsr instruction. This may seem like an odd approach, since the GPRs are 64-bit as well, but it ensures that rdmsr remains compatible with older 32-bit processors with 32-bit GPRs and some 64-bit MSRs. The code and its line-by-line explanation are available in Appendix A.
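The actual listing is in Appendix A; the fragment below is only a generic illustration of the rdmsr pattern just described, written as GCC extended inline assembly for kernel mode (rdmsr is a privileged instruction):

    #include <linux/types.h>

    /* Read a 64-bit MSR: the MSR address goes in ECX, rdmsr returns the low
       32 bits in EAX and the high 32 bits in EDX (clearing the upper halves
       of RAX/RDX), and the two halves are recombined in C. */
    static inline u64 profilo_read_msr(u32 msr)
    {
        u32 lo, hi;

        asm volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
        return ((u64)hi << 32) | lo;
    }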

3.4 Summary

The objective of this chapter was to explain the motivation for profiling speed scaling schedulers on real systems, along with describing the technical requirements for achieving this goal. The high-resolution timing subsystem and Intel RAPL interface were explained conceptually and with some implementation points. The following chapter presents the design and implementation details that were required to create Profilo.


Chapter 4: Design and Implementation

This chapter describes the design and implementation of Profilo, which is the application that is used to control the micro-benchmarking experiments and collect energy profiles. The chapter begins with the design choices needed to accurately measure timing and power consumption, and then describes the implementation details of Profilo and its accompanying utilities.

4.1 Design Choices

Because Profilo was designed to model certain datacenter workloads, it could not simply be a single process application executing synthetic workloads, even with multiple threads. Hardware context switches are a necessary component of the measurement, both for execution time and for energy consumption. Therefore Profilo had to create a multi-process environment.

4.1.1 User Mode

The initial version of Profilo was written as a multi-process user mode application to measure hardware context switches. The relative simplicity of the initial version allowed the application to be created quickly, and the different speed scaling algorithms and processor schedulers to be tested to see if there were repeatable and statistically significant differences in timing and energy consumption.

One of the challenges with a fully user mode application, however, is that the kernel can pre-empt processes at any time due to higher priority processes and/or interrupts. Furthermore, regular processes are confined to occupying the CPU for at most a maximum duration, commonly called a time slice, quantum, or jiffy, which on most platforms is 10 milliseconds [68]. This is problematic since the Fair Sojourn Protocol (FSP) scheduling policy requires uninterrupted execution for the entire lifetime of the process [25].

4.1.2 Kernel Mode

To address the aforementioned issues, kernel mode code is necessary. An open source operating system is therefore the best starting point. Naturally, Linux (Ubuntu 14.04 LTS) was chosen because it has the largest community and hardware support, including an x86 Model Specific Register (MSR) module described in Section 4.2.4.2.

Kernel mode in Linux can be accessed in two ways:

1. Modifying the kernel: This approach requires hi-jacking the kernel to perform the profiling operations on synthetic workloads. The modified kernel would still need to be able to interact with the user to take instructions and follow the desired scheduler/workload as well as to log the timing and energy consumption information. This would be complex and have the added disadvantage of requiring a rebuild and system reboot every time the kernel is modified. Portability would also become a serious hurdle and community patches would no longer be compatible.

2. Kernel module: In most cases, this is the preferred approach since it is portable across different distributions and kernel versions. Modules are also loadable and unloadable, without the need to recompile the entire kernel or reboot the system, which makes development significantly faster and safer. Finally, kernel mode code can be exposed via an API to user mode by using the sysfs virtual file system. However, contemporary Linux kernels provide multicore, pre-emptive scheduling, so "interrupt disable" code is required to effectively hi-jack the CPU and avoid unwanted context switches.

4.1.3 Kernel/User Mode Hybrid

The final version of Profilo was implemented as a hybrid: a kernel module takes care of performing uninterrupted work, busy waiting, and sleeping, while performing high-resolution timing and energy profiling. It exposes these features through sysfs files, which a user mode application interacts with to perform all the other functions, such as reading the tasks, setting CPU affinity, collecting data, processing, and output. Retaining as much as possible in user mode makes programming and debugging substantially easier, due in part to the myriad of user mode libraries and the virtualization of resources.

4.1.4 Workload

The power profiles of a sleeping, a busy-waiting, and a working processor are quantifiably different, as shown in Chapter 5. Therefore, to model compute-intensive datacenter jobs, a simple, controllable workload has to be associated with each process. Profilo uses the trial division primality test algorithm for this purpose. The rationale for this choice is as follows:


1. It is CPU-bound and fully contained within the processor package (e.g., core, cache, etc.). This feature means that the Running Average Power Limit (RAPL) counters can be used to profile the energy consumption [69, Vol. 3B Ch. 14.9.2].

2. It is easily implemented in kernel mode without the need for complicated mathematical operations and/or floating point units.

3. It utilizes the superscalar and pipelined integer architecture, while reasonably disrupting branch prediction [58].

4. It is easily parameterized to generate jobs that range from microseconds to hours.

The trial division primality test is the simplest of primality algorithms, with an exponential running time when measured in terms of the size of the input in bits. A future improvement to Profilo is to use an optimized trial division algorithm, which constrains the divisor (in the modulo operation) from exceeding the square root of the candidate, similar to the sieve of Eratosthenes, bearing in mind that the latter still has a better theoretical complexity [41]. In addition to improving the runtime (which is not really a goal), the optimized trial division algorithm would better utilize the arithmetic logic unit (ALU). The code and its explanation are available in Appendix B.

Prime number sieves, such as the ancient sieve of Eratosthenes, the more recent sieve of Sundaram [2], the faster sieve of Atkin [7], and the wheel sieves [44], [45], do not make good benchmark algorithms because they tend to be memory-bound. Memory-bound processes tend to under-utilize the CPU, and are not easily measured with the RAPL energy counters.


There are faster, general, and deterministic primality test algorithms, such as the AKS algorithm [1], which was the first to have a provable (unconditional) polynomial runtime. A further improvement is primality testing with Gaussian periods [35]; however, these implementations are much more complex, especially in kernel mode.

Unfortunately, this trade-off for simplicity means that many components of these new microarchitectures are never used, such as the floating point unit (FPU), the single instruction, multiple data (SIMD) [24] extensions, Advanced Encryption Standard New Instructions (AES-NI), and virtualization extensions. Nevertheless, this is arguably satisfactory to model simple compute-intensive workloads, such as unencrypted Web server based datacenters.

Having said that, the workload code can be replaced with another algorithm without too much effort. The trace file that Profilo takes as input (details in Section 4.2.4.1) has a column that specifies the amount of work the process needs to do. That work is expressed as an integral multiple of a basic "work unit". Having 2 or more units of work simply means reiterating the workload associated with a "work unit" that many times. This linearizes work even if the workload algorithm itself is non-linear, such as the class of primality test algorithms. In other words, a process that has 50 units of work (fifty iterations of the workload associated with the "work unit") has exactly twice the work of a process with 25 units of work, which itself has five times the work of a process with 5 units of work, and so on.

With the trial division workload, the "work unit" is the number of consecutive primes to find, starting from two. The "work unit" integer is another argument to Profilo. Figure 4.1 shows the execution time for various definitions of "work unit". Different traces should only be compared if they share the same workload and "work unit".

Figure 4.1 Trial Division Primality Test Workload: duration in seconds versus the number of prime calculations per "work unit" (0 to 10,000).


4.1.5 Units of Work

A subtle but significant detail about Profilo is that processes are modeled on units of work, not units of time. For example, a line in the trace file may state that process "Small" instructs the CPU to find the first 200 primes (a "work unit" for this run of Profilo), 80 times, at speed 1 (which may translate to 1.2 GHz). This is in contrast to telling the CPU to calculate primes for 70 milliseconds at speed 1. Units of work deliver more consistent, repeatable results because the kernel module can keep control of the processor, performing the work required without invoking context switches to set, check, and be interrupted by a timer, or checking a clock and calculating durations. To a lesser but measurable degree, clocks, timers, and counters also experience inconsistencies based on their state (i.e., between ticks) and the state of the processor and its management cycles [15].

Profilo requires a “work unit” argument to define what a single work unit is in the trace. By definition, a “work unit” is the minimum unit of work, which roughly corresponds to a unit of time at a given frequency. Naturally, with a fixed “work unit” value, the unit of time for each discrete speed on a given processor will be inversely proportional to the frequency. Therefore if one wishes to approximate a time slice for each discrete speed, for instance for a PS trace, a certain number of loops, unique to each speed and proportional to the frequency, needs to be determined. Section 5.4 explores the empirical values for loop count at each discrete speed with relation to various values of “work unit”.

Another argument to Profilo related to work is "primes per second". Since the trial division primality algorithm has an exponential running time, this argument is based upon the "work unit". More precisely, "primes per second" is the reciprocal of the duration (in seconds) to perform a "work unit" at the slowest discrete speed, multiplied by "work unit". Internally, this value is used to estimate one minute of work at the lowest frequency to conservatively tally the RAPL MSRs and prevent data loss due to integer overflow, as discussed in Chapter 3. The reason this is provided as an argument, and not calculated at runtime, is to maintain consistency across multiple runs of Profilo and allow different schedulers to be compared.
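As a hypothetical worked example (the numbers are illustrative, not measurements; Section 5.4 reports the benchmarked values): if a "work unit" of 200 primes takes 2.5 milliseconds at the slowest discrete speed, then "primes per second" = (1 / 0.0025 s) × 200 = 80,000. One minute of work at the lowest frequency then corresponds to roughly 80,000 × 60 / 200 = 24,000 work units, which is the granularity at which Profilo would schedule its conservative RAPL tally.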

Amongst the alternative runtime modes for Profilo is a benchmark mode that allows one to determine appropriate "work unit" and "primes per second" arguments for the host processor, along with the respective loop values for each discrete speed at whatever time slice is desired.

4.2 Implementation

Profilo is composed of two parts: the kernel module, and the user mode application, which itself has a few runtime modes. There is also an accompanying idler utility that makes use of components in the Profilo kernel module that the user mode application does not use. Future work for Profilo could include integrating the idler utility into the user mode Profilo application.

Figure 4.2 provides a conceptual overview of Profilo and its components.


[Figure 4.2 shows the trace file and the Idler utility feeding the user mode application, whose stages are 1. Process Arguments, 2. Setup Environment, 3. Profiling, and 4. Summarizing, and which communicates with the kernel module through its sysfs files: work_unit, do_work, sleep_busy, and sleep_deep.]

Figure 4.2 Conceptual Overview of Profilo

4.2.1 Kernel Module

The kernel module makes use of the sysfs virtual file system provided by the Linux kernel [14, p. 527]. This virtual file system is intended to expose kernel subsystems, device drivers, and two-way communication between kernel functionality and user mode. Although introduced to the Linux kernel over a decade ago, there remain kernel modules that incorrectly use procfs, which is now intended only for providing information and configuration for processes and related system information [38, p. 360]. This dichotomy is exacerbated by the lack of documentation on sysfs, often requiring a developer to peruse kernel source code.

Nonetheless, there are a number of clever macros and well-designed functions that inspire elegant sysfs code. It is common, and sometimes required, to use a number of macros to properly build a kernel module. A few examples are the license, author, description, and supported devices macros that describe a module when probed by a user using modinfo. There are also init and exit macros used to tell the kernel which functions to call when loading and unloading the module, respectively.


In addition to those functions, every sysfs virtual file can have a show and/or store function. A show function is invoked whenever a virtual file is read; a store function is invoked whenever a virtual file is written. The Profilo kernel module has four sysfs files:

• work_unit: as the name suggests, this file defines what a unit of work is -- specifically the number of consecutive primes to find, using the trial division primality algorithm described in Section 4.1.4. Writing an integer value to this file sets the number of primes, while reading it returns its current value.

• do_work: this file runs the actual workload. To use it, simply write the number of units/loops of work_unit to the file. Reading from the file only displays the kernel module name and version.

• sleep_busy: writing an integer value to this file busy-loops the processor without context switches until that duration, in microseconds, has elapsed. Subsequently reading the file reveals the formerly written value as well as the actual amount of time (in microseconds) that the processor busy-waited (including the time it took for the RAPL MSRs to change), and the difference in the PP0/PKG RAPL MSRs.

• sleep_deep: writing an integer value, interpreted in microseconds, to this file sleeps the processor for that duration. Similar to sleep_busy, subsequently reading the file reveals the formerly written value as well as the actual amount of time (in microseconds) that the processor slept, plus any busy-wait period waiting for the RAPL counters to increment. Concatenated to this output is also the difference in the PP0/PKG RAPL MSRs.

The change in the PP0/PKG RAPL MSR values coming from the Profilo kernel module must be multiplied by the processor-defined energy unit, which is calculated from the MSR_RAPL_POWER_UNIT MSR (done in user mode, where floating point operations are easy), to establish the energy consumed by those specific domains. Section 3.3 elaborates on the RAPL MSRs, with some kernel code details in Section 3.3.1.

To simplify the code for handling the high-resolution timers, the Profilo kernel module enforces a directive making it only compatible with the 64-bit Linux kernel. Both sleep_busy and sleep_deep make use of high-resolution ktime_t, which is a 64-bit integer where each unit is a nanosecond. In addition, sleep_deep makes use of the kernel function usleep_range, which is also built on top of hrtimers, the subsystem for high-resolution kernel timers (see Section 3.2). The accompanying show functions for those sysfs files output their respective time values (in microseconds) with three decimal places to retain the nanosecond precision.
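A busy wait of this kind could be sketched as follows (an illustrative approximation of sleep_busy’s timing logic, not the actual Profilo source):

    #include <linux/ktime.h>
    #include <asm/processor.h>      /* cpu_relax() */

    /* Busy-loop for roughly usec microseconds and return the measured
     * elapsed time in nanoseconds (ktime_t is a 64-bit nanosecond count). */
    static s64 busy_wait_us(unsigned long usec)
    {
            ktime_t start = ktime_get();
            ktime_t deadline = ktime_add_us(start, usec);

            while (ktime_before(ktime_get(), deadline))
                    cpu_relax();    /* spin without yielding the CPU */

            return ktime_to_ns(ktime_sub(ktime_get(), start));
    }

The corresponding show function can then print this value as microseconds with three decimal places (the nanosecond count divided by 1000, with the remainder as the fractional part).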

The state of the kernel module can be stored in either a kernel object structure or in static global variables. For simple kernel modules like this one, static global variables make the code base smaller and easier to read/understand. However, larger modules, or modules with independent sysfs files, would benefit from creating custom structures for each sysfs file with their respective variables using kernel macros (e.g., container_of) for ad hoc polymorphism.

When the kernel module is loaded, the init function allocates, initializes, and adds a kernel object structure, which contains an array of attribute structures and a sysfs operation structure. Each attribute structure contains the name of the file and its permissions, which are all set to 0666, corresponding to the Unix permission for read/write access for user/group/other. The sysfs operation structure allows the definition of a single show function and a single store function.

At first glance, this may seem problematic, since each file’s read and/or write behaviour has to be different. To resolve this issue, the signature for a show/store function has several arguments: a pointer to a kernel object structure, a pointer to an attribute structure, and a character pointer (string). The store function signature differs only in that the character pointer is a constant and there is an additional string length argument.

By using the pointer to the attribute structure, the virtual file that is being read from or written to can be determined. The clever container_of kernel macro takes a pointer, a type, and a member as arguments, and subtracts the memory offset of that member within the type from the pointer, returning a pointer to the parent structure. By encapsulating every attribute in a parent structure that includes a function pointer to a custom show/store function for that particular virtual file, one can use the container_of macro to get the parent structure and then call the respective show/store function. This resembles the Delegation Design Pattern [37] and reduces the default show/store function to an elegant two lines of code without branches.
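A sketch of this delegation pattern, with illustrative structure names, is shown below:

    #include <linux/kobject.h>
    #include <linux/sysfs.h>

    /* Each virtual file's attribute is wrapped in a parent structure that
     * carries its own show callback (a store callback would be analogous). */
    struct profilo_attr {
            struct attribute attr;
            ssize_t (*show)(char *buf);
    };

    /* The single show function registered in the sysfs_ops structure:
     * recover the parent structure from the attribute pointer, then
     * delegate to that file's own callback. */
    static ssize_t default_show(struct kobject *kobj, struct attribute *attr,
                                char *buf)
    {
            struct profilo_attr *pa = container_of(attr, struct profilo_attr, attr);

            return pa->show(buf);
    }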

The last thing that the init function does is initialize a global spinlock. Since the Linux kernel is pre-emptive, even single core systems behave like a symmetric multiprocessing (SMP) system with regards to concurrency [38, p. 201]. The spinlock is used to enter a critical section where the interrupt request (IRQ) state is saved and interrupts are then disabled on that core [38, p. 130]. The process’s work is done without interruption on that core, then the interrupt state is restored and the spinlock is unlocked.
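The critical section might be structured roughly like this (an illustrative sketch; spin_lock_irqsave saves the local interrupt state and disables interrupts in one step):

    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(profilo_lock);

    /* Run the workload on this core without being interrupted: take the
     * spinlock while saving and disabling local interrupts, perform the
     * work, then restore the interrupt state and release the lock. */
    static void run_uninterrupted(void (*work)(void))
    {
            unsigned long flags;

            spin_lock_irqsave(&profilo_lock, flags);
            work();
            spin_unlock_irqrestore(&profilo_lock, flags);
    }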

4.2.2 Lockup Detectors

There is one more precondition for uninterrupted process execution: the hard lockup detector, which is implemented with a non-maskable interrupt (NMI), needs to be disabled. The Linux kernel has both soft and hard lockup detection. A soft lockup is a situation where the scheduler is unable to give processes a chance to run within a certain period of time. A hard lockup is similar, except that interrupts are also unable to run. Both base their timeouts on the configurable value /proc/sys/kernel/watchdog_thresh, which is typically 10 (seconds).

The soft lockup detector consists of a watchdog task, implemented as a high-priority kernel thread that updates a timestamp every time it runs, and an hrtimers interrupt with a period of 2*watchdog_thresh/5. If that timestamp is older than 2*watchdog_thresh seconds, the soft lockup detector inside the hrtimers callback function dumps debug information (registers, call trace, etc.) to the system log as a kernel warning. On many Linux distributions, the console log level (exposed at /proc/sys/kernel/printk) is set to print kernel warnings and higher priority messages. This results in I/O time that adds 10-15 milliseconds.

The hard lockup detector is implemented inside the callback function for an NMI perf event that has a period of watchdog_thresh. If the soft lockup detector’s hrtimers interrupt callback function has not been run since the last NMI perf event, bearing in mind that it should have run 2 or 3 times in this duration, it dumps debug information (registers, call trace, etc.) to the system log as a kernel notice. The kernel documentation [75] states that this should be a kernel warning, but on Ubuntu 14.04 it is a kernel notice, which has a lower priority than a kernel warning and therefore is not output to the console, causing no I/O. As a result, dumping to the kernel message buffer only takes about 0.2 milliseconds.

The watchdog_thresh, soft lockup detector timeout value (rcu_cpu_stall_timeout), hrtimer, and NMI perf event periods are all configurable. There are also configuration settings to make the kernel panic when there is a soft and/or hard lockup detected. Finally, the soft lockup detection and hard lockup detection can be respectively suppressed and disabled, which is done by the user mode implementation of Profilo, described in section 4.2.4.3.

4.2.3 Clocks

The Linux kernel has a simple 64-bit counter called jiffies that is incremented based on the inverse of the compile-time value of HZ, which is 100 on the x86 instruction set architecture (ISA) [68]. It is implemented across all architectures using interrupts and is used in various kernel functions and kernel modules [19, Ch. 7.1], especially legacy code.

Contemporary x86 systems, such as Intel's Sandy Bridge and later microarchitectures, have many clocks, including, but not limited to, the High Precision Event Timer (HPET), the Advanced Configuration and Power Interface Power Management Timer (ACPI PM), the Programmable Interval Timer (PIT), and the Real-Time Clock (RTC). A comprehensive list of clock sources can be found in Chapter 6 (Timing Measurements) of Understanding the Linux Kernel [14].

Each processor core also has a 64-bit register called the Time Stamp Counter (TSC), which was first introduced in the Pentium processor [69, Vol. 3B Ch. 17.13]. The TSC counts the number of clock cycles, starting at zero when the core goes online. With SpeedStep, Intel’s implementation of dynamic voltage and frequency scaling (DVFS), the TSC could no longer be used to accurately time events until recent microarchitectures, when Intel changed the counter to an invariant version that ticks at the processor’s highest rate, regardless of ACPI P/C/T state [67]. This once again makes the TSC a good clock source, not only because of its high precision, but also because of the very low overhead (i.e., not having to access main memory or a platform resource).

The Linux kernel evaluates most system clock sources based on their overhead, precision, and accuracy. This sophisticated clock source manager goes through detection, calibration, and verification phases, at boot time and when restoring to an ACPI G0/S0 state (i.e., waking up from sleep, hibernation, etc.). This culminates in a prioritized list of each clock source that can be seen in /sys/devices/system/clocksource/clocksource0/available_clocksource. To satisfy the hrtimers subsystem’s need for high-resolution timers, this list is usually limited to the TSC, HPET, and ACPI PM, in that order, on Intel Sandy Bridge and newer x86 microarchitectures.

The differences between even these clocks can be very large. For instance, in the Realtime Referencing Guide from Red Hat, a benchmark evaluating the cost of 10 million reads from the respective clock sources resulted in the TSC taking 0.6 seconds, the HPET taking 12.3 seconds, and the ACPI PM taking 24.5 seconds [16].

Unfortunately, when interrupts are disabled long enough to cause a large drift between the jiffies counter (which would no longer be advancing) and the TSC, the kernel wrongly assumes the TSC is unstable and switches the clock source to the less precise and heavier HPET. It therefore becomes important for the Profilo user mode application to check and retain the current clock source before running, and then reinstate that clock source, if necessary, before exiting, as described in section 4.2.4.3.

4.2.4 User Mode Application

The Profilo user mode application is a command line tool that begins by reading, translating, and sanity-checking the shell arguments and trace file. It then modifies the operating system environment and profiles the CPU using its accompanying kernel module and an efficient representation of the trace file (see Appendix C for an example). At the end, it returns the modified operating system parameters back to their former settings and then tallies and summarizes the results. This section provides the details of how this is done.

4.2.4.1 Processing the Arguments

The application supports both short option (single dash followed by a letter) and long option (double-dash followed by a word) argument formats. In addition to the self-explanatory “version” and “help” arguments, there is a “processor” argument that displays information about the processor, including whether it is compatible with Profilo, and if so, the resolution and units of the RAPL counters.

If an input (trace) filename of jobs to execute is given, then the user is expected to either add a “check” argument or provide the “work unit” and “primes per second” arguments discussed in section 4.1.5. As the name implies, the “check” argument instructs Profilo to verify that the trace file is suitably formatted and semantically correct, without performing any additional work. The trace file is a comma-separated-values (CSV) file with a header row and three columns:

1. Process Name: This is a case-sensitive and whitespace-sensitive name that becomes the unique identifier for a process.

2. Work: This is a positive integer with the amount of contiguous “work units” that the process is to perform at this instant in the trace.

3. Speed: This is a positive integer that is mapped to a processor frequency to perform the work at this instant in the trace.

Even when the “check” argument is not present, Profilo silently performs the checks. If there are problems, it outputs the error(s) to the console and terminates. If there are no issues and the “check” argument is present, a summary is displayed with the number of unique processes, total units of work, the number of speeds in the trace (including the minimum and maximum value), the number of discrete frequencies mapped from the speeds (including the minimum and maximum value), and the number of rows (sometimes called tasks), which corresponds to the number of preemptions.

If the “check” argument is not present, Profilo creates a single instance of a custom kernelVariables object (C struct), which contains a number of dynamically allocated arrays, structures, and variables, and thus has an associated destructor function. Functions pass this object around, which circumvents the problems with global variables [64], and allows the user mode application to be composed of reusable components for the alternative runtimes (i.e., benchmark, processor information, etc.) and the utilities.

When Profilo is building the kernelVariables object, it starts by building an array called schedule, made up of scheduleTask structures containing three unsigned integers: procID, work, and speed. The schedule array is an optimized copy of the trace file and therefore has the same number of elements as the trace file has lines, less the header. The procID is a monotonically increasing unique integer identifier, starting at zero, for each process in the trace file. The work variable is the same value in the trace file. Finally, the speed variable is the index value to the speedLookup array, elaborated on below.

Similar to speed, the procID also functions as the index to an array called processes, which is made up of process structures. These process structures have the name corresponding to the name of the process in the trace file, the workLeft (to determine when the process leaves the system), and two struct timespec variables that store the startTime and endTime for when the process first gets CPU time and when it finishes all of its work.

The speeds from the trace file are essentially mapped to a frequency by the following formula: minFreq + (maxFreq − minFreq) × (speed − minSpeed) / (maxSpeed − minSpeed), where minFreq is the processor’s slowest frequency, maxFreq is the fastest frequency, minSpeed is the smallest speed value in the entire trace file, and maxSpeed is the largest speed value in the entire trace file. This value may not be an integer, so it is stored as a floating point number that is then rounded to the closest available discrete frequency for the host processor. If the calculated frequency falls exactly between two available discrete frequencies, it is rounded up. The list of available discrete frequencies for the first logical processor (cpu0) is available in the following sysfs file: /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies.
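In C, this mapping and rounding step might look roughly as follows (a sketch with illustrative names; it assumes maxSpeed > minSpeed and that the available frequencies are supplied in ascending order):

    #include <math.h>
    #include <stddef.h>

    /* Linearly map a trace-file speed onto [minFreq, maxFreq] (in kHz), then
     * round to the nearest available discrete frequency, rounding up on ties. */
    static unsigned int map_speed_to_freq(unsigned int speed,
                                          unsigned int minSpeed, unsigned int maxSpeed,
                                          unsigned int minFreq, unsigned int maxFreq,
                                          const unsigned int *avail, size_t count)
    {
            double target = minFreq + (double)(maxFreq - minFreq) *
                            (speed - minSpeed) / (maxSpeed - minSpeed);
            unsigned int best = avail[0];
            double bestDist = fabs(target - avail[0]);
            size_t i;

            for (i = 1; i < count; i++) {
                    double dist = fabs(target - avail[i]);

                    /* "<=" prefers the higher frequency when equidistant */
                    if (dist <= bestDist) {
                            best = avail[i];
                            bestDist = dist;
                    }
            }
            return best;
    }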

Setting the frequency of the first logical processor is also done through a sysfs file (/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed), which means the frequency must be printed as a string to be handled by its store function. Converting an integer to a string on every context switch is an unnecessary overhead, and storing a string value for each process creates a very large memory footprint that is made worse by cache misses. Therefore, to prevent both of these situations, a temporary linked list is used.

The temporary linked list is made up of node structures called speedStruct. In this structure are two unsigned integers called uiSpeed and uiMapped. The uiSpeed variable is the same value from the trace file and uiMapped is initially assigned to zero. The speedStruct list is treated as a mathematical set, so before any insertions occur, the list must be checked to ensure that uiSpeed is unique. The list is insertion sorted. When the trace file has been completely parsed, the next step is to map from uiSpeed to uiMapped using the mapping formula given previously. This mapping function is many-to-one; so when it is complete, the number of unique uiMapped values should be known, and be less than or equal to the number of discrete frequencies on the processor.

If the “check” argument is present, the number of unique speeds (the size of the speedStruct linked list), along with the minimum and maximum values, is printed in the summary. The summary section about the number of discrete frequencies mapped from the speeds, with the minimum and maximum values, comes from the unique uiMapped values.

If the “check” argument is not present, the number of unique uiMapped values is used to create the speedLookup array of strings, which are the string versions of all the uiMapped values. The uiMapped values are then changed to be the index of speedLookup, where the string versions of the discrete processor frequencies are found. Finally, the speed variable in each of the scheduleTask structures is changed from the uiSpeed value to the associated uiMapped value.

The result is that during profiling, if the new speed is different from the previous speed, a single line of code, which writes the speed element of the cached speedLookup array to the scaling_setspeed file, changes the processor frequency.

The processes array is similarly created, while the trace file is being read, by first using a temporary linked list made up of node structures called processList. These nodes contain the same procID, name, and workLeft variables in the process structures (the elements of the processes array). As the scheduleTask structures of the schedule array are being built up, the process names that are being read from the trace file are crosschecked against the processList linked list to determine the value of procID. If a process name does not exist in the linked list, it is added to the head of the list, for maximum insertion efficiency (constant time). Typically, this also makes searching for the process quicker, since the most recently added processes are at the start of the list. When the trace file is complete, the processes array is allocated based on the number of elements in the linked list. The process elements are then initialized using the variables in the linked list, from the end of the array to the start, so that the procID values in the linked list correspond to the index of the process elements in the processes array.

In summary, the speedStruct and processList linked lists are temporarily created, while reading the trace file, to either generate the printed summary (when the “check” argument is present) or to create the respective speedLookup and processes arrays. The linked lists are then properly deallocated. With the arguments processed and the contents of the trace file loaded and ready for profiling, the next step is to check the CPU for compatibility.

4.2.4.2 CPU Compatibility

As discussed in Section 3.3, Profilo uses some of the Intel RAPL MSRs to capture the energy profiles. These are present in the Intel Sandy Bridge and newer microarchitectures. An easy and universally compatible way of detecting the CPU vendor and microarchitecture on Linux is to read /proc/cpuinfo. This procfs file uses the CPUID instruction on contemporary x86 microarchitectures to get its identity and features [69, Vol. 1 Ch. 17.1]. Unfortunately, there is no RAPL feature flag on Sandy Bridge or any of the subsequent microarchitectures (Ivy Bridge, Haswell, and Broadwell); however, Advanced Vector Extensions (AVX), which were introduced with Sandy Bridge, do have a CPUID feature flag [69, Vol. 1 Ch. 5.13].

Consequently, Profilo’s CPU compatibility function starts by first ensuring that cpuinfo has a vendor_id of GenuineIntel, and then looking for the presence of avx in the flags section. If Profilo is run with the “processor” argument, this function also prints out the CPU model and frequency range on the console. In addition, it uses the low-level open function to create a direct, unbuffered I/O file descriptor to the MSR file, supplied by the MSR module.

The MSR module is a kernel module that provides an interface to x86 processors through the virtual file /dev/cpu/0/msr. The file offset corresponds to an MSR address. Reading/writing to the file is interpreted in 8-byte (64-bit) chunks. Larger chunks correspond to multiple read/writes to the same MSR, unless, of course, one opens the MSR file with a file abstraction layer that automatically advances the file offset. Reading/writing to the file typically requires elevated (root) privileges. When Profilo opens the file, it is implicitly checking for root privileges, which is also required for setting some of the environment variables in section 4.2.4.3. Profilo uses the low-level pread function with the MSR file because, in addition to working with the low-level open function’s file descriptor, it takes the file offset and the number of bytes to read as arguments.
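For example, a RAPL MSR read through this interface could be sketched as follows (the helper name is illustrative; 0x611 is the documented address of MSR_PKG_ENERGY_STATUS):

    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Read one 64-bit MSR via the MSR module: the MSR address is used as the
     * file offset, and exactly 8 bytes are read. Returns 0 on success. */
    static int read_msr(int msr_fd, uint32_t msr_addr, uint64_t *value)
    {
            return pread(msr_fd, value, sizeof(*value), msr_addr) == sizeof(*value)
                   ? 0 : -1;
    }

    /* Usage sketch:
     *     int fd = open("/dev/cpu/0/msr", O_RDONLY);   // requires root
     *     uint64_t pkg_energy;
     *     read_msr(fd, 0x611, &pkg_energy);            // MSR_PKG_ENERGY_STATUS
     */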

At this point, if the CPU compatibility function has been instructed to report (when Profilo has been run with the “processor” argument), the MSR containing the RAPL energy, power, and time units will be read, interpreted, and then printed in a summary about the CPU, which includes the model and frequency range of the processor. If the CPU compatibility function has not been instructed to report, the file descriptor associated with the MSR module is closed and the function ends, indicating that the processor is compatible. Just like with argument checking, every step is checked for errors; if any are present, helpful messages are printed indicating where and why a test did not pass, and Profilo gracefully exits.

4.2.4.3 Environment Setup

With the RAPL energy unit read, interpreted, and stored within the kernelVariables object (see section 4.2.4.1), Profilo moves on to setting up the environment for profiling. Ultimately, it is the user’s responsibility to set up a testing environment with as few kernel modules/drivers, services, and other interfering tasks as possible. However, some measures can be taken by the application to minimize interference.

Linux, like most operating systems, supports process priorities. These priorities are numbers, ranging from -20 to +19, called “nice levels”, “nice values”, or “niceness”. This nomenclature is inherited from Unix. Its etymology comes from the idea that a process with a higher value of “niceness” is nicer to other processes. Therefore, from a scheduling perspective, a nice value of -20 is the highest priority, while +19 is the lowest priority. The default priority for processes is a nice value of zero. On most distributions, setting a nice value of less than zero requires elevated privileges. Profilo programmatically invokes the setpriority system call to set its nice value to -20. Recall that at this point in its execution, Profilo will have already implicitly verified that it has elevated privileges, by collecting the energy unit for the RAPL MSRs.
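The priority change itself is a one-line system call; a hedged sketch:

    #include <stdio.h>
    #include <sys/resource.h>

    /* Give the calling process the highest scheduling priority (nice -20).
     * This normally requires elevated (root) privileges. */
    static int set_highest_priority(void)
    {
            if (setpriority(PRIO_PROCESS, 0, -20) != 0) {
                    perror("setpriority");
                    return -1;
            }
            return 0;
    }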

In addition to managing priorities, the kernel also needs to continuously load balance processes in SMP environments. Each logical processor has its own scheduler, complete with its own runqueue. The load balancer ensures that no runqueue has more than 25% more processes than any other runqueue. When this occurs, the load balancer tries to move non-running, least cache-hot processes using spin locks and special migration threads. While this balances computing resources and maximizes parallelism, it can also result in longer context switches and cache misses.

The load balancer tries to impose some process affinity, based on the topology of the logical cores and their association to one another. For instance, Intel Hyper-Threading Technology (their proprietary implementation of simultaneous multi-threading) is treated as two highly coupled logical processors, while cores on the same package have a greater affinity to one another than cores from different packages/processors. Sometimes it is advantageous to lock processes to particular logical processors for performance reasons, which can sometimes be quite substantial [57], or due to licensing restrictions. Linux supports this with the cpus_allowed bit mask in each process’s task_struct structure, which by default is all ones (can run on all available logical processors). Both the load balancer and scheduler honour this bit mask.

The initial version of Profilo made use of processor affinities, setting the bit mask for all its processes to run on the first logical core. This changed in the final version of Profilo. As discussed in section 4.2.1, the Profilo kernel module disables interrupts on the logical processor it is running on, before performing its workload. With the contemporary SMP Linux kernel, these system calls no longer apply to any of the other logical processors. This is particularly problematic since even the finest granularity for the RAPL counters is the domain of all of the cores (PP0). So to capture accurate power profiles, Profilo needs to be the only active task on all of the cores.

To address these issues, Profilo instead disables all but the first logical processor. The kernel takes care of all the migrations (processes, interrupts, etc.) and then puts the deactivated cores into a C6/C7 state (see section 5.3.3.2) by executing the MWAIT(C7) instruction; this causes the cores to save their state, flush their level 1 and 2 (L1/L2) caches into the level 3 (L3) cache, which is shared amongst all the cores, and then reduce the voltage of the core to zero volts. On the only active core, the load balancer is implicitly disabled, since it is no longer necessary when the system is relegated to being a (logically) uniprocessor system. Disabling the logical processors is easily performed by writing zero to the sysfs file /sys/devices/system/cpu/cpun/online, where n is greater than zero. Before this is done, however, Profilo first reads and retains the online status of each processor, so that it can restore the environment back to its previous state. On the architectures compatible with Profilo, there will typically be 8 logical processors, due to Hyper-Threading on a quad-core processor. However, this number can range from 4 up to 288 logical processors on an 8-socket motherboard, with each socket featuring an 18-core Haswell-EX processor.
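Taking a logical processor offline (and later restoring it) amounts to writing "0" or "1" to the corresponding sysfs file; a simplified sketch using buffered stdio (Profilo itself uses low-level open/write, but the effect is the same):

    #include <stdio.h>

    /* Write 0 to take logical CPU n (n > 0) offline, or 1 to bring it back
     * online. The previous state should be read and saved beforehand so the
     * environment can be restored on exit. Returns 0 on success. */
    static int set_cpu_online(unsigned int n, int online)
    {
            char path[64];
            FILE *fp;

            snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%u/online", n);
            fp = fopen(path, "w");
            if (!fp)
                    return -1;
            fprintf(fp, "%d", online);
            fclose(fp);
            return 0;
    }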

Next, Profilo changes the current governor to assert control of the processor’s frequency. The Linux kernel implements processor frequency scaling through infrastructure called cpufreq. This infrastructure contains a generic governor interface that allows software-defined speed scaling policies to be implemented when the processor is busy. The Linux kernel uses a separate cpuidle subsystem when there is no work left for the processor (see section 5.3.3). Most x86 based Linux distributions have five governors that rely on sysfs files located in /sys/devices/system/cpu/cpun/cpufreq/ (where n is an integer associated with a logical processor):

• performance: This governor sets the processor to the frequency defined by the sysfs file scaling_max_freq, which is usually the highest frequency, unless a user with elevated privileges changes the value. This can be beneficial if the agenda is a dichotomy of prolonged CPU-bound work, followed by similarly protracted idle durations, keeping the processor either fully utilized or allowing it to remain in its halted state (ACPI C1 or greater). Another usage for this governor is if the processor is being performance benchmarked.

• powersave: This governor sets the processor to the frequency defined by the sysfs file scaling_min_freq, which is usually the lowest frequency, unless a user with elevated privileges changes the value. This can be helpful if there is a power or heat limitation on the system infrastructure. However, if the processor is tasked with a heavy workload, this governor may consume more energy than even the performance governor, because the latter can finish faster and then go into an energy saving ACPI C1 or greater state. This can further result in cascading power savings across other components, thereby reducing the idle waste power. Section 5.3.3 elaborates on this perhaps less obvious theme.

• ondemand: Unlike the previous two static governors, this governor dynamically changes the processor’s frequency based on system load. When the system load is above the up_threshold (typically 80 percent), it sets the processor frequency to scaling_max_freq. When the system load drops below down_threshold (typically 20 percent), it decreases the processor frequency to the next highest frequency in scaling_available_frequencies. Because there is a cost to evaluating and switching the frequency, there is a sysfs configurable sample rate (sampling_rate, in microseconds) that is loosely based on the CPU’s transition latency (cpuinfo_transition_latency, in nanoseconds). This sampling rate for determining system load and adjusting the frequency is typically 10 ms on contemporary x86 ISAs.

• conservative: This governor is very similar to the ondemand governor, but instead of jumping to scaling_max_freq when up_threshold is exceeded, it moves up the discrete frequencies in scaling_available_frequencies at each sampling_rate. So while ondemand is biased towards high performance, conservative treats frequency increases the same as frequency decreases.

• userspace: This governor changes the frequency of the processor based on input from scaling_setspeed. This gives full control to user mode processes/users (with elevated privileges) to set the frequency.

The list of governors in a particular system is available in scaling_available_governors. The default governor is ondemand. To change the governor, the name of a valid governor must be written to scaling_governor. Profilo retains the name of the current governor in scaling_governor, and then writes userspace to the sysfs file. At this point, the kernel stops all frequency scaling, giving Profilo full control to set the frequency, which it does during profiling, when it writes the speed element of the cached speedLookup array to the scaling_setspeed file, as described in section 4.2.4.1.
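A sketch of this retain-and-switch step (illustrative error handling; the saved name is written back to scaling_governor during cleanup):

    #include <stdio.h>
    #include <string.h>

    #define GOVERNOR_FILE "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"

    /* Save the current governor into prev (a caller-supplied buffer of length
     * len), then switch to the userspace governor. Returns 0 on success. */
    static int switch_to_userspace(char *prev, size_t len)
    {
            FILE *fp = fopen(GOVERNOR_FILE, "r");

            if (!fp)
                    return -1;
            if (!fgets(prev, (int)len, fp)) {
                    fclose(fp);
                    return -1;
            }
            fclose(fp);
            prev[strcspn(prev, "\n")] = '\0';   /* strip the trailing newline */

            fp = fopen(GOVERNOR_FILE, "w");
            if (!fp)
                    return -1;
            fputs("userspace", fp);
            fclose(fp);
            return 0;
    }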

Similarly, the status of the lockup detectors (section 4.2.2) and selected clock source (section 4.2.3) need to be retained. Specifically, the value of the sysfs file located at /sys/module/rcupdate/parameters/rcu_cpu_stall_suppress is stored in a variable. Afterwards, writing zero to the file, if that is not already its value, suppresses the soft lockup detection. The exact same thing is done for retaining and disabling the hard lockup detection using the file /proc/sys/kernel/nmi_watchdog. Finally, the current clock source is retained by reading the contents of the sysfs file /sys/devices/system/clocksource/clocksource0/current_clocksource into a string buffer.

4.2.4.4 Preparations for Profiling

With the operating system configured for Profilo, there are only a few things left to do before profiling takes place. First, the Profilo kernel module sysfs file, work_unit, described in section 4.2.1, needs to be set by writing the respective value that was provided as an argument to Profilo. This is actually handled by the function in section 4.2.4.1 that processes the arguments. Once this is done, the Profilo kernel module sysfs file, do_work, can be used. Just like with the MSR module, a write-enabled low-level open system call is invoked, which creates a direct, unbuffered I/O file descriptor to the virtual file. In the case of the MSR module, the file descriptor is stored in a variable called MSRfd in the cpuData structure (described in section 4.2.4.5), which is inside the custom kernelVariables object. The file descriptor for do_work is simply a local variable.

With the governor set to userspace (section 4.2.4.3), the same open system call is used to create a file descriptor to the sysfs file /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed, which is used to set the first logical processor’s frequency. This interface is used in conjunction with the speedLookup array from section 4.2.4.1.

Because profiling can sometimes take a long time, there needs to be a way for the user to prematurely abort the process. A common and efficient way of accomplishing this goal is through signals, another communications channel between kernel mode and user mode. A user can interrupt a process with an interrupt character (usually Ctrl+C), which the kernel sends to listening processes in the form of a SIGINT signal. A process can listen for a signal by giving the kernel a handler function, using the signal system call. Most kernels, by default, kill processes that do not register a handler for the SIGINT signal. This is particularly undesirable in Profilo’s case, since it shuts down all but the first logical processor and then takes control of that processor’s frequency scaling by changing the CPU governor, as well as changing the functionality of the lockup detectors, and possibly the clock source. So to provide some summary information on what has been profiled up to the interruption, and to return the operating system back to its former state, Profilo registers a handler for the SIGINT signal. This handler simply sets the ops variable within the custom kernelVariables object to zero, signalling a section within the RAPL overflow prevention code to terminate Profilo gracefully. The details are in the next section (4.2.4.5).

Since the Profilo kernel module’s do_work file disables interrupts, if an interrupt character is generated during that time, the signal handler will not be invoked until the sysfs file has completed its work.
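The handler itself can be very small; a sketch (here ops stands in for the field inside the kernelVariables object):

    #include <signal.h>

    static volatile sig_atomic_t ops;   /* stand-in for kernelVariables->ops */

    /* SIGINT handler: only clear the flag; all cleanup happens later, at a
     * safe point inside the RAPL overflow prevention code. */
    static void handle_sigint(int signum)
    {
            (void)signum;
            ops = 0;
    }

    /* Registered during setup with:  signal(SIGINT, handle_sigint); */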

Next, Profilo creates an array called isRunning that contains the same number of elements as the processes array, which is the number of unique processes from the trace file. Recall, from section 4.2.4.1, that each unique process has its own process ID, which corresponds to the index of the processes array. Therefore, the index locations in isRunning can be treated the same way. During allocation, each element in isRunning is initialized to zero (using the calloc function) to indicate that no processes are running. A value of one indicates that the process is running.

At this juncture, profiling can begin, so the last thing to do is to read the energy and timing counters. The startRAPLmetric function, from Section 3.3.1, is invoked. Recall that this reads the core (MSR_PP0_ENERGY_STATUS) and package (MSR_PKG_ENERGY_STATUS) energy MSRs into a temporary variable, and then continuously rereads the MSRs into a custom cpuData structure (supplied as an argument to the function), until both values have changed. This is an unpredictable amount of time that typically takes up to a millisecond, although it can sometimes take much longer, as discussed in that section. This is why the energy MSR is read before the clock. When it returns, the clock_gettime function is invoked, using the raw high-resolution clock (CLOCK_MONOTONIC_RAW), which on architectures compatible with Profilo is the TSC clocksource from section 4.2.3, with nanosecond resolution. The result is stored in the first of two locally scoped struct timespec variables. The second variable stores the end time for the profiling.

4.2.4.5 Profiling

The entire profiling section of the application is contained within a for-loop that systematically traverses the schedule array of scheduleTask structures, described in section 4.2.4.1. Therefore, the number of loop iterations is equal to the number of lines in the trace file (minus the header). Each iteration begins by checking the isRunning array to see if the current process is already running. If the process is not running (equal to zero), the startTime structure for the current process is initialized with the same clock_gettime function from section 4.2.4.4, and then the isRunning element is set to one.

Next, the value of the speed element for the current scheduleTask is used as an index value to the speedLookup array, which contains the string value for the desired processor frequency. This string is written to the already open file descriptor for the scaling_setspeed file. With the processor at the intended frequency, the work value (also from scheduleTask) is written to the also previously opened file descriptor for the do_work file. At this point, a context switch occurs and the work is done in kernel mode without interruptions. When the write operation returns, the work has been done. The workLeft value, located at the current process’ index value in the processes array, is decremented by the work value. If workLeft is zero, the endTime structure for the current process is initialized, exactly the same way it was for startTime.

Before the loop draws to a conclusion, a swift assessment is made to guard against RAPL overflow (see Section 3.3.1). As discussed in Section 4.1.5, one of the arguments to Profilo is “primes per second”. This argument is multiplied by 60 (seconds) and then stored within the custom kernelVariables object as a variable called ops. To guard against RAPL overflow, ops is compared to a variable called calculations, which stores the sum of work (from scheduleTask) sent to do_work. If calculations is larger than ops, then the tallyRAPLmetric function is invoked, and calculations is reset to zero.

The tallyRAPLmetric function takes the cpuData structure as the argument. As described in section 4.2.4.4, the cpuData structure contains an open file descriptor to the MSR module, called MSRfd. In addition, it contains the interpreted energy units (in joules) for the processor (described in Section 3.3). It also has two unsigned 64-bit integers called pp0LastRead and pkgLastRead that are copies (with the reserved portions masked to zero) of the MSR_PP0_ENERGY_STATUS and MSR_PKG_ENERGY_STATUS MSRs, respectively. Finally, it contains two double precision floating point variables called pp0Consumed and pkgConsumed.

The job of tallyRAPLmetric is to first read from the core and package MSRs, using the open file descriptor to the MSR module. It then calculates the difference (taking into account integer overflow) since the last read of the MSRs, using 64-bit integers. Finally, it multiplies this difference by the interpreted energy unit and adds it to the double-precision floating point variables, which contain the number of joules consumed since the startRAPLmetric function returned. It concludes by updating pp0LastRead and pkgLastRead.
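The wraparound-aware accumulation at the heart of tallyRAPLmetric might be sketched like this for a single domain (an illustrative helper, not the actual Profilo source; RAPL energy status counters occupy the lower 32 bits of the 64-bit MSR):

    #include <stdint.h>

    /* Add the energy consumed since the previous sample of one RAPL energy
     * status counter. energyUnit is the joules-per-count value derived from
     * MSR_RAPL_POWER_UNIT; lastRead and consumed persist between calls. */
    static void tally_domain(uint64_t rawMsr, uint64_t *lastRead,
                             double energyUnit, double *consumed)
    {
            uint64_t now = rawMsr & 0xFFFFFFFFULL;  /* mask reserved upper bits */
            uint64_t delta;

            if (now >= *lastRead)
                    delta = now - *lastRead;
            else                                    /* the 32-bit counter wrapped */
                    delta = (0x100000000ULL - *lastRead) + now;

            *consumed += (double)delta * energyUnit;
            *lastRead = now;
    }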

If Profilo is given properly matched “work unit” and “primes per second” arguments for the processor at its slowest speed, the tallyRAPLmetric function should be invoked at intervals of at most 60 seconds. However, recall from section 4.2.4.4 that the SIGINT signal handler sets the ops variable to zero. Reusing ops this way is a trivial optimization that avoids an extra if-statement comparison on every profiling loop iteration. Within the RAPL overflow prevention code, Profilo checks if ops is zero (ordinarily an invalid value) and only calls tallyRAPLmetric if this is not the case. If ops is zero, Profilo terminates gracefully, which includes reversing all of the potential environment changes from section 4.2.4.3.

4.2.4.6 Concluding Profiling

As soon as the profiling for-loop is concluded, the clock_gettime function is once again called using the raw high-resolution clock (CLOCK_MONOTONIC_RAW) and the second locally scoped struct timespec variable. Immediately afterwards, tallyRAPLmetric is invoked. Recall that the RAPL MSRs could contain energy readings that are as old as a millisecond. Unlike startRAPLmetric, the tallyRAPLmetric function does not wait for the RAPL MSRs to change, as this would over-estimate the energy usage. A future improvement could be to wait for the MSRs to change and then subtract the estimated power usage for the portion of the wait period; however, this task is not straightforward (see Section 7.4) and only marginally improves accuracy. It is because of the reduced accuracy and precision of the RAPL energy counters (in comparison to the high-resolution clock), and because of the additional calculations in tallyRAPLmetric, that the clock_gettime function is called first.

Profilo then undoes the possible changes to the environment, described in section 4.2.4.3. It restores the settings for the soft and hard lockup detection. If the kernel changed the clock source, it is restored. The online status settings for the logical processors are returned to their prior states. Lastly, the former cpufreq governor is reinstated.

Next, the data from profiling is summarized, which includes calculating and displaying the total duration for the profiling. The processes array is traversed so that the durations can be further analyzed for each process, including their relative start and end times. The summary also includes the overall energy and power consumption for both the cores and the package (see Appendix C for an example). Finally, all file descriptors are closed and dynamically allocated structures are freed. Profilo then exits.

4.2.5 Idler Utility

While the Profilo application takes processor schedules in the form of a trace file and then measures the execution time and energy profile, the Idler utility performs an energy benchmark on the processor. More specifically, Idler measures the power profile of the processor in its halted states and in a busy loop, at all of the CPU’s frequencies. To accomplish this, it uses the sleep_deep and sleep_busy features of the Profilo kernel module, described in section 4.2.1.

Naturally, the implementation of Idler shares many similarities with Profilo. Idler only takes one mandatory argument and one optional argument, so argument processing is far more straightforward. As a result, there are no complex data structures or optimizations required either. The mandatory argument is the number of microseconds to pass to sleep_deep and sleep_busy. The optional argument is a verbosity flag that, when present, prints human-friendly messages to the console. When the flag is absent, the output is comma-separated-values (CSV) format, which can easily be imported into a spreadsheet application.

Idler makes the same environment changes that Profilo makes in section 4.2.4.3, likewise retaining the values so that the changes can be reversed. Low-level open system calls are made to create direct, unbuffered I/O file descriptors to the kernel module sleep files. Since both sleep_busy and sleep_deep provide the high-resolution breakdown of execution time and the change in the energy readings, a file descriptor to the MSR file is not needed. The MSR module is only briefly used to gather the RAPL energy unit.

Idler then goes into its primary for-loop, which has as many iterations as there are frequencies for the first logical processor. This number comes from reading scaling_available_frequencies, first described in section 4.2.4.1. Within each loop, the frequency is set and then sleep_busy is written with the duration argument to Idler. The measured duration (using the high-resolution timer) and the energy consumed are read back from sleep_busy, and then the power is calculated (energy in joules divided by duration in seconds). The duration, energy, and power are stored in double-precision floating point variables that are subsequently divided by a thousand and then printed to the console in respective units of milliseconds, millijoules, and milliwatts, to be consistent with the units of the duration argument. The same thing is done with sleep_deep, except with the addition that reading from sleep_deep breaks down the measured duration into the sleep and busy components. The busy component is time spent waiting for the RAPL counter to change. This duration breakdown is conveyed to the console. At this stage, Idler returns the environment back to the former values and then terminates.

4.3 Summary

This chapter began by presenting an overview of the design choices made to create Profilo. The implementation details for both the kernel module and user mode components were explained. At times this necessitated details about the processor architecture as well as kernel functionality and implementation. The discussion then transitioned to the implementation details of the separate user mode application, Idler. The next chapter delves into the experimental evaluation of Idler and Profilo, with the subsequent chapter using Profilo, exclusively, to assess and compare established and theoretical schedulers and speed scaling algorithms.

Chapter 5: Micro-Benchmarking

This chapter uses the tools described in Chapter 4 to analyze the performance and energy characteristics of a quad-core, simultaneous multithreading (Intel Hyper-Threading Technology) Ivy Bridge CPU under several conditions and across numerous tests. Through this analysis, the strengths and weaknesses of both the platform and tools are explored. This chapter therefore provides the background and context by which to investigate the schedulers in Chapter 6.

5.1 Platform

Before describing the testing methodology and results, it is important to contextualize the platform in terms of the system specifications and its total system power consumption. All of the results from Chapters 5 and 6 were gathered from a mid 2012 Apple MacBook Pro Retina equipped with a 2.3 GHz (base) quad-core Intel Core i7-3615QM Ivy Bridge processor with 32 KB of level-1 (L1) instruction cache per core, 32 KB of level-1 (L1) data cache per core, 256 KB of level-2 (L2) cache per core, 6 MB of on-chip level-3 (L3) shared cache, and 8 GB of 1600 MHz DDR3 random access memory (RAM).

This processor has 12 controllable discrete frequencies that range from 1200 MHz to 2300 MHz in 100 MHz increments. There is also one additional discrete frequency with a misleading value of 2301 MHz in scaling_available_frequencies, the sysfs file first described in Section 4.2.4.1. This discrete speed is a special value used to enable Intel’s Turbo Boost Technology. This feature allows the processor to clock one or more of its cores above its top-rated frequency, if there is some thermal design power (TDP) headroom and the power, current, and thermal limits are within a specific range. In the case of this processor, the maximum turbo frequency is 3.3 GHz for a single active core, 3.2 GHz for two simultaneously active cores, and 3.1 GHz for three or four simultaneously active cores. While all of the micro-benchmarking results in this chapter include the “2301 MHz” turbo mode, that speed setting is avoided in Chapter 6 because it is not directly controllable, even in Ring 0 of the hierarchical protection domains (protection rings).

As described in Section 3.3, there are a total of four Running Average Power Limit (RAPL) domains; however, there are only three of these domains available on any given CPU [69, Vol. 3B Ch. 14.9.3]. The CPUs targeting the server market have Package (PKG), Power Plane 0 (PP0), and DRAM domains. The CPUs targeting the client/consumer market have PKG, PP0, and PP1 domains. The Intel Core i7-3615QM CPU falls under the latter category. For compatibility with both product categories, Profilo only makes use of PP0 (the processor cores) and the PKG (the entire CPU) domains. While this captures most of the dynamics in CPU-bound activities, the platform is composed of a lot more than just the CPU. Therefore, to understand the CPU in the context of the entire system, a crude but effective consumer-grade power monitor was used between the wall receptacle and the switched power supply for the entire system.

The specific power monitor used was the Kill A Watt EZ (Model 4460.01) by P3 International [73]. It has a NEMA 5-15 plug and receptacle that accepts line-in AC voltages of 85-125 volts RMS at up to 15 amps RMS. It is then capable of reading the outgoing voltage and current draw, and therefore the active power in watts. In the range of 90-125 volts and 0.2-15 amps, the typical active power accuracy is within 0.5%, with a 2% maximum when outside that range [71].

One of the limitations of the Kill A Watt is its slow (1 Hertz) refresh rate, and lack of automatic data logging [33]. Some have taken to modifying the power monitor and have tapped into the two sensor outputs of the quad op-amp LM2902N chip to add automatic data logging [22][43]. Although this is reasonably straightforward to do, it would have necessitated substantial additional software development. Instead, measurement was kept considerably simpler: three types of tests were devised to keep the CPU at a consistent performance level for several seconds. The tests were scripted, and while they were running, both the laptop display and the power monitor were filmed. The video was then played back, so that each of the power monitor’s updates could be added to a row in a spreadsheet. The median values for each configuration in every type of test were retained. These values had a high correlation to the RAPL PKG readings, suggesting that the power meter was sufficiently accurate under these conditions.

5.2 ACPI Specification

The first of the three types of tests with the power monitor involved measuring the total system power while the processor was in an idle state. On a modern Advanced Configuration and Power Interface (ACPI) system, “idle” is actually a multitude of states. The specification defines four global states (G0 to G3) and six sleep states (S0 to S5) [67]. Larger integer values correspond to greater power savings, usually at the cost of wake-up latency. The nil global state (G0) corresponds to the nil sleep state (S0), which represents normal running operation of the system. The first power saving global state (G1) encapsulates the first few power-saving sleep states (S1-S4). The next global state (G2) corresponds to the last sleep state (S5), and is essentially a “soft off”, whereby the system can be brought back to the G0/S0 state with a button, keyboard, mouse, etc. The final global state (G3) is simply mechanical off, such as a system that is unplugged and without a battery.

The specification defines numerous other state types, including device states (D0-D3), CPU states (C0-C3), and performance states (P0-P16). The zero integer values imply full performance while larger integer values provide better power savings, generally at the cost of performance or wakeup latency. Unfortunately, beyond the global states, the specification is poorly written and lacks consistency [18]. The intention is likely to leave the details to hardware vendors; however, the specification actually contradicts itself, resulting in unavoidable misinterpretations. For instance, in revision 5.1 of the ACPI specification, the S1 state is described as a low wake-latency sleeping state where “no system context is lost (CPU or chip set) and hardware maintains all system context” [67, Sec. 2.4]. It goes on to read that the S2 state is “similar to the S1 sleeping state except that the CPU and system cache context is lost”. However, in section 16.1.1, the S1 state is defined as being a state where “all system context is preserved with the exception of CPU caches”. The example in the subsequent section of the specification then describes asserting a standby signal to the phase-locked loop (PLL) to stop all system clocks, except for the real-time clock (RTC), essentially disabling the entire CPU. Hardware manufacturers implement this example by power-gating the processor [31]. This is a benign example of the inconsistency in the specification. It is unsurprising that Linus Torvalds, the initial creator of the Linux kernel, describes ACPI as “a complete design disaster in every way” [51]. Perhaps this is the source of unreliable sleep implementations that plagued computers 15 years ago and continue to be a problem today, albeit to a lesser degree [30]. Therefore, my discussion with respect to the ACPI standard is restricted to the testing platform in its typical configuration, with supplementary information on the latest Intel microarchitecture (Broadwell), when applicable.

5.3 Total System Power Consumption

This section quantifies the supported ACPI states and extensions from the perspective of the total system power consumption, using the power monitor and RAPL MSRs. It begins with measurements of the global sleep states. Those tests are used to normalize the measurements in the subsequent subsections, which include an examination of the global working state with the CPU in its various C-states, in a busy wait, and under load.

5.3.1 Global Sleep States

The Profilo-compatible platforms do not implement all of the ACPI states. This is a result of the ambiguous specifications and/or overlap with other state types. However, these gaps are inconsequential, since there are numerous performance and power saving states. The idle test type is comprised of four different power meter measurement tests. Each measures a different level of idle: one in G2/S5, one in G1/S3, one in G1/S1, and one in G0/S0.

Recall that the G3 state has the most power savings. By definition, it has a power consumption of zero, because it has no power source. The next state is G2/S5. This is a “soft off” state consuming 0.3 watts to power circuitry that allows the system to be turned on with a button or RTC alarm. Restoring from this power state requires cold booting the operating system, which can take seconds to minutes, depending largely on the performance of the non-volatile storage. If Wi-Fi is enabled, power consumption doubles (0.6 watts), but the system can be turned on through Wake on Wireless LAN (WoWLAN).

Incidentally, Wi-Fi (Broadcom BCM4331KML1G 3x3 802.11 a/b/g/n) consumes 0.3 watts, while idle, in all of the power states. The Bluetooth 4.0 device (Broadcom BCM20702), which is Bluetooth low energy (BLE) compliant, consumes an imperceptible amount of power while idle. Finally, the keyboard backlight can consume up to 0.5 watts at full brightness. To reiterate, all of these features are disabled in the profiling Linux environment.

The next power state is G1/S3 or more commonly, “Suspend to RAM”. In this state, the processor caches are flushed to RAM and then the CPU package is powered off, along with almost every ancillary component, except for the DRAM modules. The DRAM modules perform a minimal self-refresh to maintain state and minimize power consumption. This is called S3-Cold. Intel processors do not support S3-Hot [72, Sec. 4.1.1], likely because the wake-up latency would be no different, but power consumption would be higher. The result is the system consumes only 0.7 watts but can return to the previously functional state within a few seconds, without rebooting the operating system.

There is actually a G1/S4 state, commonly called “Hibernation” or “Suspend to Disk”, which flushes the RAM to non-volatile storage and then goes into a power configuration identical to G2/S5 (0.3 watts). The only difference is that when the system comes out of “soft off”, it accesses a ring 0 defined hibernation file location and reads its contents back into RAM prior to continuing the same bootstrap as G1/S3. Restoring from the G1/S4 state can take seconds to minutes depending on the amount of system RAM and the sequential transfer rate of the non-volatile storage. As a side note, when Mac OS is told to “Suspend” or the lid is closed, it creates the hibernation file, and then goes into the G1/S3 state. If the remaining battery capacity becomes low, an interrupt is invoked (similar to how the RTC alarm works), which wakes the processor long enough to put it into a G1/S4 state (this happens very quickly since the hibernation file already exists).

The least-effective power-saving global state is G1/S1, often called Power on Suspend (POS) or just “suspend”. As already mentioned, this state lacks a good definition and may soon be deprecated, like the G1/S2 state, since power-gating makes implementing these states even more ambiguous. In fact, Intel makes no reference to this state anywhere in the processor datasheet [66], [72]. The datasheet explicitly lists support for only the following system states: G0/S0, G1/S3-Cold (“S3-Hot is not supported by the processor”), G1/S4, G2/S5, and G3. Nevertheless, the system is able to enter the G1/S1 state under Linux, although the graphics context becomes corrupt after resuming. In this state, the system consumes 0.8 watts. One can only speculate where the extra 0.1 watt is consumed: it could be the CPU package, or the Nvidia GeForce GT 650M graphics processing unit (GPU) in a “D3 Hot” state (described in the next subsection) with insufficient power to the GDDR5 memory, resulting in the corruption, or the Platform Controller Hub (PCH) in an unsupported state.

5.3.2 Normalizing Measurements

One of the features of this MacBook Pro Retina is that it has two GPUs: the capable Intel HD Graphics 4000 on the CPU package (that uses system DDR3 RAM) and the higher performance Nvidia GeForce GT 650M with dedicated 1 GB GDDR5 memory. Under most circumstances, when high-performance GPU-acceleration is not required, the significantly more energy efficient HD4000 GPU is used. When an application requests high-performance GPU-acceleration, an OpenGL context is created, or a user connects a secondary display, Mac OS seamlessly switches to the 650M. Once the 650M is no longer required, and the HD4000 is adequate, Mac OS seamlessly switches back to the HD4000. When the display is turned off (login screen timeout, screensaver, or user invoked), the active GPU (ACPI D0 state) is put into a low power state (ACPI D3). For the HD4000, this state is the “D3 Cold” state, which cuts power to the power plane associated with the GPU (PP1). For the 650M, this state is the “D3 Hot” state, which provides auxiliary power to keep the GDDR5 memory powered. Therefore, even in the D3 state, the HD4000 is more energy efficient.

Unfortunately, this level of power management sophistication is unavailable in Linux. Furthermore, at this time the only (reliably) available GPU in Linux is the 650M. This means that the RAPL PP1 domain can be ignored since it stays zero. As a result, the PKG domain does not need to be reduced by the value in the PP1 domain to isolate CPU activity. This simplifies things, especially considering that had the HD4000 been active with this version of Linux, the energy consumption reported by the PKG domain may have been offset by more than just the power to the PP1 domain. For instance, the power control unit (PCU), which is accounted for in the PKG domain, but not in either of the PP0 or PP1 domains, may consume more energy if it manages the dynamic frequency adjustments of the HD4000. Most properly configured headless servers would have their GPU disabled to save power. The exception would be if the GPU were an active compute resource.

To better understand the impact of using the 650M under Linux, the power monitor was used to measure the system in the G0/S0/C7 state using the HD4000 and then the 650M in Mac OS. All tests were performed with Wi-Fi, Bluetooth, and the keyboard backlight disabled. When the HD4000 is in the D3 (cold) state, the power monitor reads 5.7 watts. The 650M in its D3 (hot) state consumes an additional 2.1 watts. When the HD4000 is put into the D0 state (with the display turned off), the power consumption rises to 8.1 watts, suggesting that the HD4000 GPU consumes 2.4 watts at idle. With the 650M in the same situation, the total power consumption is 10.5 watts, which is 2.4 watts more than the HD4000 in the D0 state. This suggests that the 650M (and its GDDR5 memory) consumes 4.8 watts at idle, double the HD4000. With the display turned on and at full brightness, the power consumption rises by 8.4 watts. One would expect to see the same 18.9 watt power consumption from Linux with the processor, 650M, and display in the exact same power state; however, the power consumption is 25.4 watts. The extra 6.5 watts could possibly be a result of the PCI Express links and/or Direct Media Interface (DMI) being in the active transfer state (L0) instead of a lower power state (i.e., L1-L3), and/or devices (Wi-Fi, Bluetooth, etc.) being in higher power states (i.e., D0-D2) rather than being off (D3), despite most device drivers being disabled. The extra power consumption also raises the temperature of the processor, necessitating active thermal dissipation from the two DC brushless fans that can consume up to 2 W (rated) each.


Rather than attempting to correct the relatively poor power optimization of the system under Linux, which could possibly require driver fixes, the results gathered from Profilo were normalized against the Mac OS optimized system configuration in the G0/S0/C7/D3 state from the first measurement (5.7 watts). Therefore, to capture the total system power consumption, including the switched power supply, the median values for each scenario in every type of test, except for the idle (S1 or deeper) measurement types, were captured and then reduced by 19.7 watts. This normalized power consumption is more in line with properly set up, power efficient headless servers found in well-managed datacenters.
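
Concretely, the 19.7 watt offset is consistent with the gap between the Linux and Mac OS idle figures summarized in Table 5.1:

\[ P_{\mathrm{normalized}} \;=\; P_{\mathrm{wall}} - (25.4\ \mathrm{W} - 5.7\ \mathrm{W}) \;=\; P_{\mathrm{wall}} - 19.7\ \mathrm{W} \]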

Table 5.1 summarizes the power consumption of these isolated components. An important consideration with this normalized approach is that the RAPL PKG values do not include energy associated with using the HD4000 (if there is any in the D3 state). Energy associated with the GPU (if any) is included in the “Other” values, which represent the power consumption outside of the PKG domain.


Table 5.1 Power Consumption of Components

Component Power Consumption

Wi-Fi & Bluetooth 0.3 W

Keyboard Backlight 0.5 W

650M (D3 Hot) 2.1 W

HD4000 (D0) 2.4 W

650M (D0) 4.8 W

LCD Display 8.4 W

DC Brushless Fans 0 W to 4.3 W

Linux Idle (C7) 25.4 W

OSX Idle (C7; Headless) 5.7 W

5.3.3 Idle Power Consumption

The global working state (G0/S0) is where all of the CPU’s power saving C-states function. The characteristics and trade-offs of the C-states are arguably as important as the performance and energy characteristics of the P-states, which are the processor’s dynamic voltage and frequency scaling (DVFS) states. Until recently, discussion around processor idle states has generally been restricted to “race-to-idle”, which is running the processor at its maximum speed and then putting it into an idle state [4]. However, this scheme has been shown to be suboptimal in a variety of situations [26]. The topic has become further complicated by the fact that there are now many idle states that make compromises between power savings and exit latency. Multiple cores also allow the physical processor to be in a myriad of hybrid states. From the perspective of energy efficiency, choosing how a processor rests is as important as choosing how it works.

The ACPI specification loosely defines C0 through C3; however, Intel extends the number of states to further reduce power consumption for situations where the wake-up latency and cache warm-up time is acceptable. Over time, some power saving techniques have been deprecated because of newer methods that increase savings with equal or improved wake-up latencies. As a result, contemporary processors have gaps in the numbering of the C-states. On the Ivy Bridge microarchitecture, there are six C-states: C0, C1, C1E, C3, C6, and C7. The Broadwell microarchitecture adds to this with three more C-states: C8, C9, and C10.

Just like the other zero-numbered states (G0, S0, and D0), the C0 state is the active state of the CPU. The power consumption in this state depends on the voltage, frequency, and activity on the CPU, since inactive execution units are power-gated [31]. Nevertheless, even nominal power consumption on the Ivy Bridge processor in the C0 state (at the lowest frequency) is 59% higher than in the first power saving state (C1) and 4.1 times higher than in the last power saving state (C7). The rest of Section 5.3.3 discusses the details and power measurements of the remaining C-states. Section 5.3.4 focuses almost exclusively on power consumption in the C0 state.

5.3.3.1 Linux cpuidle

Just as the Linux kernel has the cpufreq infrastructure for managing the processor frequency (P-states), there is a subsystem for managing the CPU’s idle states, called cpuidle. Since every processor has different idle characteristics, prerequisites, side effects, states, and actions to enter/leave those states, this complexity is abstracted to a driver layer. A driver defines a cpuidle_state structure for every processor state. In the case of the x86 instruction set architecture (ISA), these are the C-states. The driver then defines a cpuidle_device structure for each logical processor, which is usually the same for every logical processor on the same CPU, although technically, they could be different. The driver then registers the device(s) with cpuidle.
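
To make the shape of such a driver concrete, the fragment below is a minimal, illustrative sketch loosely modeled on the mainline cpuidle API of that era; the driver name, the two states, their latency/residency values, and the callback body are placeholders rather than anything taken from the Intel driver, and exact field names and signatures vary between kernel versions.

#include <linux/cpuidle.h>
#include <linux/module.h>

/* Placeholder "enter" callback: a real driver would execute HLT or MWAIT
 * here (see Section 5.3.3.2) and return the index of the state entered. */
static int demo_enter_state(struct cpuidle_device *dev,
			    struct cpuidle_driver *drv, int index)
{
	return index;
}

static struct cpuidle_driver demo_idle_driver = {
	.name  = "demo_idle",
	.owner = THIS_MODULE,
	.states = {
		{
			.name             = "C1",
			.desc             = "Halt, instruction clock stopped",
			.exit_latency     = 1,     /* microseconds */
			.target_residency = 1,     /* microseconds */
			.enter            = demo_enter_state,
		},
		{
			.name             = "C6",
			.desc             = "Core power gated, caches flushed",
			.exit_latency     = 80,
			.target_residency = 300,
			.enter            = demo_enter_state,
		},
	},
	.state_count = 2,
};

static int __init demo_idle_init(void)
{
	/* Registers the driver and a cpuidle_device for every online CPU. */
	return cpuidle_register(&demo_idle_driver, NULL);
}
module_init(demo_idle_init);
MODULE_LICENSE("GPL");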

Similar to cpufreq, the cpuidle subsystem has a governor interface that supports the implementation of different idle policies. A governor makes decisions based on information it gets from the device. Each of the states defined in a device has a number of predefined parameters, such as its name, desc (description), flags, exit_latency, power_usage, and target_residency.

The name and desc are for users, since the governor and driver/devices are made available through sysfs. The flags indicate predefined characteristics that are important for the governor to make decisions. For instance, if a state has the CPUIDLE_FLAG_CHECK_BM flag, this indicates that the state is not compatible with bus-mastering direct memory access (DMA) activity. Entering such a state (i.e., a deep sleep that is unresponsive to snooping but keeps the last level cache active) while bus-mastering DMA activity is occurring could result in the processor caches failing to update in response to DMA, leading to data corruption.

The exit_latency is the number of microseconds it takes for the processor to return to the operating state. If the state turns off caches, there could be additional delays due to cache misses. The power_usage is the number of milliwatts consumed by the processor in this power state. It is important to remember that transitioning to an energy saving state and then back again costs both time and energy. Therefore, the target_residency is the minimum number of microseconds that the processor should stay in the state to save any energy and make the transition worthwhile.

When a logical processor has nothing left to do, the select function of its cpuidle governor is invoked. This is where the governor applies its heuristics and returns the integer index of a target state in the states array defined by the device for the logical processor. Selecting an appropriate sleep state involves several critical considerations, such as upcoming timer events, ongoing DMA activity, and power management quality of service latency requirements; the latter are supported by the pm_qos subsystem, which provides kernel (e.g., driver) and user mode interfaces for registering performance expectations. Currently, there is only one governor implemented, called menu, which takes all of these considerations into account and then immediately picks the deepest possible idle state. An Intel conference paper suggests the possibility of another governor called ladder, which takes a progressively deeper step-wise approach to selecting an idle state [42], similar to cpufreq governors, in conjunction with a tick-based kernel.
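
The essence of this selection logic can be sketched as a freestanding C function. This is an illustration of the idea, not the actual menu governor code; the structure, its field names, and the simple linear scan are assumptions made for the sketch.

struct idle_state {
	unsigned int exit_latency_us;     /* time to return to C0 */
	unsigned int target_residency_us; /* minimum stay for a net energy win */
	int disabled;                     /* e.g., incompatible with ongoing DMA */
};

/* Return the index of the deepest admissible state, assuming states[] is
 * ordered from shallowest (index 0) to deepest.  predicted_idle_us would come
 * from upcoming timer events; latency_limit_us from the pm_qos subsystem. */
static int select_idle_state(const struct idle_state *states, int count,
			     unsigned int predicted_idle_us,
			     unsigned int latency_limit_us)
{
	int i, chosen = 0;

	for (i = 0; i < count; i++) {
		if (states[i].disabled)
			continue;
		if (states[i].exit_latency_us > latency_limit_us)
			break;          /* deeper states only wake more slowly */
		if (states[i].target_residency_us > predicted_idle_us)
			break;          /* would not stay long enough to pay off */
		chosen = i;
	}
	return chosen;
}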

Intel actively develops and supports a cpuidle driver for Sandy Bridge and newer microarchitectures [17]. From the source code, it appears that power_usage is an optional parameter that is not used by the governor interface to make decisions; it is not defined anywhere in the Intel driver. The driver first detects the microarchitecture of the processor by invoking the CPUID instruction and then gathering the vendor, family, and model, just like cpuinfo in Section 4.2.4.2. Based on this information, the states and device(s) are statically created and registered.

5.3.3.2 CPU Sleep States

The first C-state on these microarchitectures is C1. It can be entered with the HLT (halt) instruction or the MWAIT(C1) instruction. This state offers power savings by stopping the instruction clock, but allows the core to return to C0 almost instantaneously. The bus/interconnect frequency is left unchanged, so unlike the deeper C-states, the C1 state power savings is dependent on the clock frequency. The C-state of the entire CPU package is governed by the core with the lowest C-state value (least power savings). For example, if one core goes into the C1 state and the remaining cores are in C7 states, the package transitions into the C1 state. The package C1 state is purely a semantic state that is no different than C0. On Ivy Bridge (the testing platform), Intel defines the target residency as 1 microsecond and the exit latency as 1 microsecond. On Broadwell, these values are both 2 microseconds. When the cpuidle governor decides the system needs to idle for less than the C1 target residency or have a shorter response time than the exit latency, it invokes a busy wait instead. This means that in the 1 to 2 microsecond range, the older Ivy Bridge microarchitecture is actually more energy efficient.

Table 5.2 shows that transitioning all cores from C0 to C1 results in the processor dropping power consumption by 37% to 52%, depending on whether the cores are clocked at the lowest or highest frequency. This table includes the power consumption breakdown of the PP0 domain (cores), the entire package (processor), the estimated fan power consumption, and components outside of the processor package (other), as well as the normalized grand total power consumption from the wall, as discussed in Section 5.3.2. It is important to emphasize that PP0 is a subset of PKG and that both are a subset of total (the last column).

Table 5.2 Ivy Bridge C1 Power Savings

Core 0 Core 1-7 Frequency (MHz) PP0 (W) PKG (W) Fan (W) Other (W) Total (W)

C0 C0 1200 8.8 12.7 1.5 2.9 17.1

C0 C7 1200 3.0 6.8 1.5 2.9 11.2

C1 C1 1200 4.2 8.0 1.5 2.9 12.4

C0 C0 1300 9.3 13.3 1.6 2.9 17.8

C0 C7 1300 3.2 7.0 1.5 2.9 11.4

C1 C1 1300 4.4 8.2 1.5 2.9 12.6

C0 C0 1400 9.9 13.8 1.6 2.9 18.3

C0 C7 1400 3.4 7.2 1.5 2.9 11.6

C1 C1 1400 4.6 8.4 1.5 2.9 12.8

C0 C0 1500 10.5 14.4 1.7 3.0 19.1

C0 C7 1500 3.6 7.4 1.5 3.0 11.9

C1 C1 1500 4.8 8.6 1.5 3.0 13.1

C0 C0 1600 11.0 15.0 1.7 3.0 19.7

C0 C7 1600 3.7 7.5 1.5 3.0 12.0

C1 C1 1600 5.0 8.8 1.5 3.0 13.3

C0 C0 1700 11.6 15.5 1.8 3.0 20.3

C0 C7 1700 3.9 7.7 1.5 3.0 12.2


C1 C1 1700 5.2 9.0 1.5 3.0 13.5

C0 C0 1800 12.0 16.0 1.8 3.0 20.8

C0 C7 1800 4.1 7.9 1.5 3.0 12.4

C1 C1 1800 5.4 9.2 1.5 3.0 13.7

C0 C0 1900 12.6 16.6 1.9 3.0 21.5

C0 C7 1900 4.3 8.1 1.5 3.0 12.6

C1 C1 1900 5.6 9.4 1.5 3.0 13.9

C0 C0 2000 13.2 17.2 1.9 3.0 22.1

C0 C7 2000 4.5 8.3 1.5 3.0 12.8

C1 C1 2000 5.8 9.6 1.5 3.0 14.1

C0 C0 2100 13.7 17.7 2.0 3.1 22.8

C0 C7 2100 4.7 8.5 1.5 3.1 13.1

C1 C1 2100 6.0 9.8 1.5 3.1 14.4

C0 C0 2200 14.2 18.2 2.0 3.1 23.3

C0 C7 2200 4.8 8.7 1.5 3.1 13.3

C1 C1 2200 6.2 10.1 1.5 3.1 14.7

C0 C0 2300 15.1 19.1 2.1 3.1 24.3

C0 C7 2300 5.3 9.1 1.5 3.1 13.7

C1 C1 2300 6.7 10.5 1.5 3.1 15.1

C0 C0 2301 27.5 31.4 4.3 7.3 43.0

C0 C7 2301 11.3 15.3 1.5 4.0 20.8


C1 C1 2301 11.3 15.0 1.5 4.0 20.5

The table also includes a scenario where the first logical processor is left active (C0) and the rest are put into the deepest sleep (C7), resulting in an additional 13% to 15% power savings over all cores being in the C1 state, with the exception of the turbo frequency (described in Section 5.1). This presents an interesting class of speed scaling algorithms, where most cores can be put into states with high wake-up latencies (i.e., C7), while one or more cores are left in states with low or no wake-up latencies (i.e., C0).

The Intel driver disables a feature called C1E (state) auto-promotion, which can be turned on and off through the IA32_MISC_ENABLE model specific register (MSR). When enabled, this causes the core to automatically transition from the C1 state into the C1E (C1-Enhanced or enhanced halt) state, which further reduces power consumption by transitioning to the lowest supported clock frequency and reducing voltage to the core. The state can alternatively be entered by executing the MWAIT(C1E) instruction. If the other cores are in a C1E or deeper state, the package also enters the C1E state, which reduces bus/interconnect frequencies and voltages. Naturally, the target residency increases to 20 microseconds, while the exit latency climbs to 10 microseconds, on both Ivy Bridge and Broadwell. The power consumption for the Ivy Bridge processor drops to 7.6 watts, which is 5% to 49% (depending on frequency) less than when the processor is in the C1 state.


The next consecutive state is C3, which saves power by flushing the core’s L1 caches (32 KB instruction and 32 KB data) and L2 cache (256 KB) to the shared L3 cache (6 MB on the testing platform) and then stopping all core clocks. In addition to power-gating [23], the package does not need to wake the core when a snoop is detected or an active core accesses cacheable memory. A core enters this state with the MWAIT(C3) instruction. The target residency for this state jumps to 156 microseconds on Ivy Bridge and a more efficient 100 microseconds on Broadwell. The exit latency on Ivy Bridge is defined as 59 microseconds, while Broadwell is faster at 40 microseconds. Since the caches are cold, there are additional penalties for misses; however, some or all of these accesses may hit in the L3 cache, which is roughly an order of magnitude faster than RAM. The Ivy Bridge processor in the C3 state consumes 4.7 watts, resulting in a 38% power savings over C1E.

With the exception of the package C1 and C7 states, the Intel datasheets are strikingly vague on the package power savings [66, Sec. 4.2.5], [72, Sec. 4.2.5.1]. On Broadwell, there is an additional package C2 state that is internal and cannot be explicitly requested by software. The package can fall into this state when cores and graphics are in C3 or greater states and there is a “memory access request received” (perhaps a delayed response from a microcontroller or compute module, such as Intel’s Many Integrated Core Architecture). One possible reason for the elusive details is that they are irrelevant to firmware and kernel developers and therefore kept as a trade secret.

The deepest power saving state with respect to the cores is C6. When a core is instructed to enter C6, through the MWAIT(C6) instruction, it flushes the L1/L2 caches and saves its architectural state to dedicated SRAM. The entire core and its phase-locked loop (PLL) are then powered off. The target residency jumps to 300 microseconds on Ivy Bridge and a less efficient 400 microseconds on Broadwell. Likewise, the exit latency climbs to 80 microseconds on Ivy Bridge and a slower 133 microseconds on Broadwell. Just like the C3 state, the exit latency does not include the extra delays associated with L1/L2 cache misses. The power consumption on the Ivy Bridge processor in the C6 state is 3.3 watts, which is a 30% drop from C3.

If all of the cores execute the MWAIT(C7) instruction, they each perform the exact same routines as in C6, with the exception that the last transitioning core is responsible for flushing the contents of the L3 cache before powering down. With all of the cores in the C7 state, the processor can transition into the package C7 state. This involves disabling the L3 cache and notifying the platform of the state transition so that it knows that the CPU possesses no snoopable information, thus not waking the CPU unnecessarily. The package also powers down other components in the uncore (components in close proximity to the core and essential for core performance) and then enters a low power state. On Broadwell, the package may enter an even lower voltage state called Package C7 Plus. These savings come with a target residency cost of 300 microseconds for Ivy Bridge and a much longer 500 microseconds for Broadwell. The exit latency is 87 microseconds on Ivy Bridge and almost twice that for Broadwell at 166 microseconds. Furthermore, on exit the L3 cache is not immediately re-enabled until the processor has stayed out of C6 or deeper for a preset amount of time, and even then, undisclosed internal heuristics govern how the L3 cache is gradually expanded. This saves energy by power-gating the L3 cache and preventing unnecessary and expensive repopulation/flushing cycles. The Ivy Bridge processor in the C7 state, which is its deepest C-state, measures 3.1 watts, which is only a 6% power savings over the C6 measurement; however, the power consumption for the C6 and greater states will measure higher if the L3 cache is saturated, resulting in greater relative power savings in the C7 state.

A breakdown of the power consumption in the various C-states of the Ivy Bridge testing platform is contained in Table 5.3. This table has the same format as Table 5.2 and summarizes its results by including the lowest, highest, and turbo frequency for the C0 and C1 states, as well as the hybrid C0/C7 state, touched on at the beginning of this section. The corresponding durations for the C-states can be found in Table 5.4.


Table 5.3 Ivy Bridge C-State Power Measurements

Core 0 Core 1-7 Frequency (MHz) PP0 (W) PKG (W) Fan (W) Other (W) Total (W)

C7 C7 N/A 0.3 3.1 0.7 2.6 6.4

C6 C6 N/A 0.5 3.3 0.7 2.7 6.7

C3 C3 N/A 1.8 4.7 0.7 2.8 8.2

C1E C1E N/A 3.9 7.6 1.2 2.9 11.7

C1 C1 1200 4.2 8.0 1.5 2.9 12.4

C1 C1 2300 6.7 10.5 1.5 3.1 15.1

C1 C1 2301 11.3 15.0 1.5 4.0 20.5

C0 C0 1200 8.8 12.7 1.5 2.9 17.1

C0 C0 2300 15.1 19.1 2.1 3.1 24.3

C0 C0 2301 27.5 31.4 4.3 7.3 43.0

C0 C7 1200 3.0 6.8 1.5 2.9 11.2

C0 C7 2300 5.3 9.1 1.5 3.1 13.7

C0 C7 2301 11.3 15.3 1.5 4.0 20.8

As a result of having a fully-integrated voltage regulator (FIVR), first introduced in Haswell, the Broadwell microarchitecture also has package C8, C9, and C10 states, which can be entered by executing MWAIT(C8), MWAIT(C9), and MWAIT(C10), respectively. The cores initiate the transition exactly the same as for C7. The package C8 state builds on the package C7 state by turning off all internally generated voltage rails and reducing the input VCC from 1.3 volts to 1.15 volts. This pushes the target residency to 900 microseconds and the exit latency to 300 microseconds. The package C9 state reduces the input VCC to zero volts but at a cost of doubling both the target residency (1800 microseconds) and exit latency (600 microseconds). Finally, the C10 package state puts the single-phase core controller (VR12.6) into a quiescent mode (PS4) that consumes only 0.5 milliwatts. Despite the datasheet stating that the exit latency for the controller is 48 microseconds [70], the target residency jumps to a considerably larger 7.7 milliseconds with an exit latency of 2.6 milliseconds.

Table 5.4 Ivy Bridge and Broadwell C-State Durations

Target Residency (µs) Exit Latency (µs)

State Ivy Bridge Broadwell Ivy Bridge Broadwell

C1 1 1 2 2

C1E 20 20 10 10

C3 156 100 59 40

C6 300 400 80 133

C7 300 500 87 166

C8 N/A 900 N/A 300

C9 N/A 1800 N/A 600

C10 N/A 7700 N/A 2600

Core and package states can be statically restricted at the hardware level (through configuration MSRs) or dynamically regulated through a feature called auto-demotion. In essence, this feature allows the processor to use the per-core immediate residency history and interrupt rate to demote C6 and deeper requests to the C3 state, or C3 and deeper states to C1, based on the configuration options. None of these features are enabled on the testing platform.

This concludes all of the idle states, from the system perspective as well as from the CPU, on both the Ivy Bridge testing platform and the latest Intel platform at the time of writing (Broadwell). While Table 5.3 showcases the power consumption of each of the Ivy Bridge C-states when all of the logical processors are in the same state, this table could be significantly expanded for all of the permutations of C-states that each of the cores can be in. As shown with the C0/C7 hybrid state, there exist interesting cases where the CPU can remain active and consume less power than when all cores are in a non-functional intermediate sleep state. Other non-synchronous C-state options could be exercised to increase power savings, while retaining better exit latencies. Further discussion on this topic is reserved for future work.

5.3.4 Active Power Consumption

As detailed in Section 5.1, the testing platform has 12 controllable discrete frequencies and an Intel Turbo Boost Technology state, yielding 13 ACPI performance states or P-states. The ACPI specification currently has a defined limit of up to 16 P-states [67]. While the relative power savings are large between the C-states, the absolute power consumption, in watts, is most significant between the P-states. This is because the power consumption of processors is proportional to the square of the voltage, multiplied by the frequency [46]. Since the voltage must be increased with the frequency, the power requirements of processors grow in accordance with this polynomial function. The valid range of the voltage identification value for the Intel Core i7-3615QM Ivy Bridge processor is 0.65 to 1.35 volts [72, Sec. 7.9.1]. Using the Idler utility, described in Section 4.2.5, each of the discrete frequencies was measured from the perspective of the core power plane (PP0), processor package (PKG), and total system power from the wall with the first core busy-waiting, while the remaining cores were “offline” (C7 state). The results are summarized in Table 5.5. Once again, the “Total” column, as with the previous tables, is the normalized (described in Section 5.3.2) total power consumption from the wall, with PKG being a subset of this total, and PP0 being a subset of PKG. The last two columns are the relative change from the prior frequency.
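
The relationship referred to here is the standard approximation for dynamic (switching) power in CMOS logic; the symbols below are the conventional ones and are not taken from [46]:

\[ P_{\mathrm{dyn}} \;\approx\; \alpha\, C\, V^{2} f \]

where \(\alpha\) is the activity factor, \(C\) the switched capacitance, \(V\) the supply voltage, and \(f\) the clock frequency. Because \(V\) must be raised along with \(f\), power grows faster than linearly in frequency.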


Table 5.5 Single-Core Busy Wait

Frequency (MHz) PP0 (W) PKG (W) Total (W) ΔFrequency ΔPP0

1200 2.8 6.5 11.8

1300 3.0 6.7 11.9 8.3% 0.15 (+5.5%)

1400 3.1 6.9 12.2 7.7% 0.16 (+5.4%)

1500 3.3 7.0 12.3 7.1% 0.15 (+4.9%)

1600 3.4 7.2 12.6 6.7% 0.17 (+5.1%)

1700 3.6 7.3 12.7 6.3% 0.16 (+4.6%)

1800 3.8 7.5 12.9 5.9% 0.17 (+4.6%)

1900 3.9 7.7 13.1 5.6% 0.19 (+5.0%)

2000 4.1 7.8 13.1 5.3% 0.15 (+3.8%)

2100 4.3 8.0 13.3 5.0% 0.17 (+4.1%)

2200 4.4 8.2 13.5 4.8% 0.15 (+3.5%)

2300 4.7 8.5 13.8 4.5% 0.32 (+7.1%)

2301 (3300) 9.9 13.7 20.2 43.5% 5.17 (+109.4%)

Curiously, the PP0 power consumption increases linearly, and with an almost perfect correlation to the frequency, up until 2200 MHz. This means that Intel maintains the same voltage for all user-selectable frequencies, except the last (2300 MHz). This is supported by further observations from Table 5.6. This voltage scheme is clearly suboptimal, particularly for the lowest frequencies. The last user-selectable frequency (2300 MHz) seems to experience a voltage increase. With only one core active, the maximum turbo frequency is 3.3 GHz, including a dramatic spike in power consumption, signifying that the voltage is further increased.

Despite this substantially larger power consumption, power-gating still saves a meaningful amount of energy. Power-gating is a technique to save power on inactive execution units [31]. The trial division primality test algorithm is a heavy, integer-based workload, for which the power consumption increases over a simple busy-wait loop, as seen in Table 5.6. The “Total” column is the same as in the previous tables. The last column in this table is the relative increase in the PP0 power consumption over the same frequency with a busy-wait loop on the first core (remaining cores in C7, just like in Table 5.5). This is the power consumption of an active virtual process (calculating the first 200 prime numbers, 10000 times) in Profilo, repeated at all of the discrete speeds available on the Ivy Bridge testing platform.


Table 5.6 Single-Core Trial Division Primality Test

Frequency (MHz) Duration (s) PP0 (W) PKG (W) Total (W) ΔPP0

1200 8.68 3.1 6.9 12.0 12.3%

1300 8.02 3.3 7.1 12.4 12.6%

1400 7.44 3.5 7.3 12.6 12.1%

1500 6.95 3.7 7.5 12.7 12.5%

1600 6.51 3.9 7.6 12.9 12.4%

1700 6.13 4.1 7.9 13.1 13.3%

1800 5.79 4.3 8.0 13.4 13.2%

1900 5.49 4.5 8.3 13.5 12.9%

2000 5.21 4.6 8.4 13.7 12.9%

2100 4.96 4.8 8.6 13.9 13.3%

2200 4.74 5.0 8.9 14.0 14.2%

2300 4.53 5.4 9.2 14.4 14.3%

2301 (3300) 3.16 11.5 15.3 21.8 15.9%

At every single speed, the duration to complete the work is almost exactly inversely proportional to the frequency, including the turbo mode, which runs at precisely the full 3300 MHz. This means that the performance scales perfectly with frequency and that none of the frequencies suffer from timing anomalies. Furthermore, this supports the PP0 observations of Table 5.5 that the voltage is constant from 1.2 GHz to 2.2 GHz, that it increases at 2.3 GHz, and then again for the turbo speed (3.3 GHz). This claim is supported because the only way more energy could be consumed is either by a voltage increase or by wasteful idle cycles (due to timing issues), and the duration measurements eliminate the latter supposition.

With regard to power consumption, the PP0 values show that even a single active core with this workload, on an optimally configured system (see Section 5.3.2), can consume over half of the total system power consumption from the wall. On the other hand, the portion of the package (PKG) power consumption outside of PP0 spans only a narrow 3.7 to 3.9 watts (C0 state), depending on frequency and workload.

An interesting observation is that under this workload, the total system power consumption outside of the package is fairly consistent at 5.3 watts, except when the core is at its turbo speed, where it jumps up to 6.5 watts. The components outside of the package, such as the Platform Controller Hub (PCH) and its various I/O, seem to be independent of the processor’s load and clock frequency until the turbo mode is engaged, at which point the frequency and/or voltage of one or more external components seems to increase. A final thought to consider is how well the consumer-grade power meter (Kill-A-Watt) correlates with the PP0/PKG values (0.999/0.998) for all user controllable frequencies. This correlation is reassuring, suggesting that the power meter offers precise (±0.1 watt) and likely accurate readings.

As mentioned in Section 4.1.4, the trial division primality test algorithm is an integer-based workload that was chosen for numerous reasons, including its simplicity to code in kernel mode. As a result, many execution units, such as the floating point unit (FPU), the single instruction, multiple data (SIMD) [24] extensions, Advanced Encryption Standard New Instructions (AES-NI), and virtualization extensions, are never used. The Great Internet Mersenne Prime Search (GIMPS) foundation has a tool called Prime95, which is heavily optimized to take advantage of some of these extra components to accelerate its search for Mersenne prime numbers. When executing on all of the logical processors, the power consumption at the wall is a substantial 72 watts, with each core consuming roughly 15.4 watts. These additional execution units increase the power consumption by 30% to 40% over the trial division primality test algorithm, approaching the peak power consumption of the processor. Under this workload, the processor consumes 80% (when hot) to 90% (when cool) of the total power consumption from the wall, depending on the thermal envelope, which causes throttling in the turbo mode, and active heat dissipation (i.e., fans), which spins both 2 W fans up to 6000 revolutions per minute (RPM) to cool the processor.

5.4 Profilo Workload Benchmarking

Recall from Section 4.1.4 that the trial division primality test algorithm has an exponential running time, when measured in terms of the size of the input in bits. This is visualized in Figure 4.1. Section 4.1.5 covers the implications of this algorithm and how Profilo uses a “work unit” argument to essentially linearize any workload, including the trial division primality test algorithm. The caveat is that with a fixed “work unit” value, the duration to complete a “work unit” will be architecture dependent and inversely proportional to the frequency. Therefore, if one wishes to approximate a time slice at each discrete speed, for instance for a PS trace, a “work unit”, along with a loop count unique to each speed, needs to be determined.


Included as an alternative runtime for Profilo is a benchmark mode that allows one to determine appropriate “work unit” and “primes per second” arguments for the host processor, along with the respective loop values for each discrete speed to achieve a particular time slice. For Chapter 6, the desired time slice is 10 ms, since this is the maximum duration that regular processes are allowed to occupy the CPU [68].

There are many different combinations for the “work unit” and loop values for the discrete speeds. The smallest “work unit” is 1, but then all of the benefits of the workload from Section 4.1.4, such as branch prediction disruption, would be bypassed since the processor would not be doing very much work, besides decrementing a counter each time it returns a constant. Technically, the largest “work unit” is determined by the slowest discrete speed that is able to calculate the defined number of prime numbers within the desired period (i.e., 10 ms). However, the loop value for the next discrete speed would either have to be the same, which is incorrect, or incremented by one, doubling the number of calculations and therefore grossly missing the target duration.

For example, on the Ivy Bridge platform at its lowest speed (1200 MHz), the processor can calculate the first 850 prime numbers in 9.8 milliseconds. At the next discrete speed (1300 MHz), it performs the exact same work in 8.3 milliseconds. For the lowest speed, the loop value would be 1, underestimating the target time slice by 0.2 milliseconds. With the second speed, a loop value of 1 would be logically incorrect, because the same amount of work would be completed in a time slice as at the lowest speed, despite it calculating significantly faster. Furthermore, the time slice would be underestimated by 1.7 milliseconds. Unfortunately, with a loop value of two, the work would double, taking 16.6 milliseconds, overestimating the time slice. The overestimation, with unique loop values, would get considerably worse at higher speeds (e.g., at 1600 MHz, the fifth discrete speed, the time slice would be 24.19 milliseconds).

Therefore, the set of satisfactory combinations has to be one where the time slice divided by the duration to perform the work unit results in a unique loop integer for each discrete speed. Subsequently choosing a combination from that set is a trade-off between having the largest “work unit” (because of the benefits of the workload) and better approximating the time slice.
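
As an illustration of this selection rule, the sketch below derives a loop value for each speed by rounding, reports the resulting deviation, and rejects a “work unit” whose loop values are not unique. The rounding rule is an assumption (it is consistent with the loop values in Table 5.7), and the three durations used are the 1200, 1300, and 1400 MHz figures from that table.

#include <math.h>
#include <stdio.h>

#define NUM_SPEEDS 3

int main(void)
{
	/* Microseconds to complete one "work unit" of 150 primes (Table 5.7). */
	const double duration_us[NUM_SPEEDS] = { 468.0, 432.0, 401.0 };
	const double time_slice_us = 10000.0;   /* 10 ms target */
	int loops[NUM_SPEEDS];
	int i, j, unique = 1;

	for (i = 0; i < NUM_SPEEDS; i++) {
		loops[i] = (int)lround(time_slice_us / duration_us[i]);
		double deviation_us = loops[i] * duration_us[i] - time_slice_us;
		printf("speed %d: loops = %d, deviation = %+.0f us\n",
		       i, loops[i], deviation_us);
	}

	/* The "work unit" is only satisfactory if every speed gets its own loop count. */
	for (i = 0; i < NUM_SPEEDS && unique; i++)
		for (j = i + 1; j < NUM_SPEEDS; j++)
			if (loops[i] == loops[j])
				unique = 0;

	printf("work unit %s\n", unique ? "satisfactory" : "rejected");
	return 0;
}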

In the interest of time, the benchmark calculates “work unit” values 1 through 10, then in steps of 10 up to 100, then in steps of 50 up to 1000, and finally in steps of 250 up to 10000. Each discrete speed is timed with a high-resolution timer (the same as described at the end of Section 4.2.4.4) and then the results are displayed, rounded to the nearest microsecond. A future improvement to Profilo is to enter the desired time slice (in milliseconds) and maximum error (in percent), and have it output the set of satisfactory combinations, pruned of combinations that are fully dominated by another combination (i.e., equal or lower absolute over/underestimation for each discrete speed, with a higher “work unit” value).

It is important to reiterate from Section 5.1 that while the results include every available speed, the turbo mode is avoided in Chapter 6 because it is not directly controllable, even in Ring 0 of the hierarchical protection domains (protection rings). Therefore, the following results do not include the turbo mode.


With the aforementioned discrete values for “work unit”, the test platform has a range of choices from 50 to 200. A “work unit” of 200 results in a maximum underestimation of 397 microseconds (4.0%), a maximum overestimation of 10 microseconds (0.1%), and an average case of underestimating the 10-millisecond time slice by 152 microseconds (1.5%). The next tested “work unit” is 150, which results in a maximum underestimation of 172 microseconds (1.7%), a maximum overestimation of 116 microseconds (1.2%), and an average case of underestimating the 10-millisecond time slice by 36 microseconds (0.4%). Because of the somewhat arbitrary desire to have an average case error less than 1% (i.e., ±100 microseconds for a 10-millisecond time slice), the “work unit” of 150 was chosen for use throughout Chapter 6.

A summary of the benchmark results for this “work unit” is featured in Table 5.7. The duration column is the execution time to find 150 prime numbers at the corresponding discrete speed. The loop value is the number of rounds for a 10-millisecond time slice. The last column is the deviation from 10 milliseconds.


Table 5.7 Benchmark Results (150 Primes)

Frequency Duration Loop Value Deviation

1200 MHz 468 µs 21 -172 µs

1300 MHz 432 µs 23 -64 µs

1400 MHz 401 µs 25 +25 µs

1500 MHz 374 µs 27 +98 µs

1600 MHz 351 µs 28 -172 µs

1700 MHz 330 µs 30 -100 µs

1800 MHz 312 µs 32 -16 µs

1900 MHz 295 µs 34 +30 µs

2000 MHz 281 µs 36 +116 µs

2100 MHz 267 µs 37 -121 µs

2200 MHz 255 µs 39 -55 µs

2300 MHz 244 µs 41 +4 µs

2301 MHz (3200 MHz) 170 µs 59 +30 µs

5.5 Mode and Context Switches

Since Profilo is a hybrid application, implemented in both kernel mode code and user mode code, it is important to describe and quantify this domain crossing as well as contrast it with process switching. Crossing between kernel mode and user mode requires a mode switch, which involves changing to/from a supervisor/privileged mode, supported by hardware as hierarchical protection domains, commonly referred to as protection rings. The number of modifications to registers and segments is minimal. A context switch, which occurs in kernel mode, requires that all general non-floating-point registers be saved/swapped (although optimizations can be made with software-implemented context switches), in addition to appropriately updating control registers.

Linux does not use hardware context switches because full software context switches are about the same speed; however, the latter allows the kernel to optimize and speed up the switches, maintain counters and metrics, and verify registers for security. Nevertheless, the kernel is forced to set up each online logical processor’s Task State Segment (TSS) to store hardware contexts.

There are two main reasons for this:

1. When x86 processors switch from user mode to kernel mode, they fetch the address of the kernel mode stack from the TSS [14, Ch. 4, 10].

2. When a user mode process attempts to access I/O, the processor may verify its port privileges by referencing the I/O Permission Bitmap stored in the TSS.

To optimize context switches, floating-point registers such as those used by the Floating Point Unit (FPU) and single instruction, multiple data (SIMD) instruction sets (MMX/SSE/AVX) are only changed if a process tries to use them. This is supported in hardware with the Task Switched (TS) flag, which is set in the CR0 control register on every context switch. If the process executes the ESCAPE instruction or an MMX/SIMD/AVX instruction, the processor raises a “Device not available” exception, allowing a handler to load the floating-point registers with the values previously saved for that process.


In the case of Profilo, floating-point instructions are only used in the user mode application to tally and summarize the metrics at the end of profiling. The workload also makes use of very little memory, so cache misses are not a consideration either. Therefore, a synthetic benchmark to quantify the duration of a context switch is a good model. This is achieved with the futex (fast userspace mutex) system call, which is used between two processes that take turns contending for the mutex. Quantifying a mode switch is accomplished by using the gettid (get thread ID) system call, which may be the cheapest Linux system call.
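
The mode switch measurement can be illustrated with a small user mode sketch; this is an illustrative reimplementation of the idea, not the test code sourced from [52], and the iteration count is arbitrary.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 1000000L

int main(void)
{
	struct timespec start, end;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < ITERATIONS; i++)
		syscall(SYS_gettid);   /* one user-to-kernel-to-user round trip */
	clock_gettime(CLOCK_MONOTONIC, &end);

	double ns = (end.tv_sec - start.tv_sec) * 1e9 +
		    (end.tv_nsec - start.tv_nsec);
	/* The loop overhead is not subtracted, so this slightly overestimates
	 * the cost of the two ring transitions per call. */
	printf("%.1f ns per gettid call\n", ns / ITERATIONS);
	return 0;
}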

The mode switch and context switch tests are further broken down with a script that disables all but the first logical processor, and then sets the governor to userspace, similar to Section 4.2.4.3. The frequency is then set to the base frequency, and then a mode switch test is performed, followed by a context switch test. The C code for the tests was sourced from [52]. This is repeated for every frequency, including the turbo frequency (described in Section 5.1). Afterwards, all of the logical processors are re-enabled with their governors set to userspace, and the testing is repeated at each discrete frequency, including the turbo frequency. The system is then returned to its prior state.

The results for the testing platform (described in Section 5.1) show that mode switches consistently take 45 ns to 123 ns, depending on the frequency of the processor. The context switches take a less consistent 1.1 µs to 3.2 µs, depending on the frequency. As already discussed, this is an optimized software-based context switch with no floating-point instructions within the processes, nor cache misses. When cache misses occur, context switches take two orders of magnitude longer [36]. This means that generally context switches can take between 1 and 4 orders of magnitude longer than mode switches, depending on the processor frequency, instruction set usage, and cache pollution. The results are summarized in Table 5.8. The last two columns of Table 5.8 report the mean of the samples and the standard deviation, in the format mean ± standard deviation.

Table 5.8 Mode and Context Switch Test

Processors Frequency (MHz) Mode Switch (ns) Context Switch (ns)

One 1200 123.1 ± 0.2 3166.5 ± 8.9

One 1300 113.4 ± 0.0 2885.9 ± 24.3

One 1400 105.3 ± 0.0 2708.8 ± 13.0

One 1500 98.6 ± 0.5 2526.3 ± 9.4

One 1600 92.1 ± 0.0 2369.4 ± 23.8

One 1700 86.7 ± 0.0 2212.5 ± 7.9

One 1800 81.9 ± 0.0 2118.1 ± 25.2

One 1900 78.3 ± 1.2 1998.5 ± 25.5

One 2000 73.7 ± 0.0 1897.5 ± 20.8

One 2100 70.2 ± 0.0 1807.9 ± 31.5

One 2200 67.0 ± 0.0 1708.3 ± 2.8

One 2300 64.2 ± 0.2 1634.1 ± 8.9

One 2301 (3300) 44.8 ± 0.2 1139.5 ± 11.6

All 1200 122.9 ± 0.1 3098.8 ± 64.6


All 1300 113.4 ± 0.0 2861.7 ± 19.2

All 1400 107.0 ± 2.9 2672.4 ± 66.8

All 1500 99.9 ± 2.7 2503.9 ± 46.6

All 1600 92.1 ± 0.0 2316.0 ± 28.7

All 1700 86.7 ± 0.0 2192.1 ± 18.2

All 1800 81.9 ± 0.0 2033.4 ± 24.6

All 1900 77.8 ± 0.3 1966.4 ± 53.1

All 2000 73.7 ± 0.0 1809.2 ± 15.0

All 2100 70.7 ± 0.5 1771.1 ± 20.7

All 2200 67.3 ± 0.5 1703.1 ± 19.8

All 2300 64.1 ± 0.0 1629.4 ± 18.7

All 2301 (3300) 45.8 ± 1.2 1198.9 ± 45.9

5.6 Switch Test

One of the benefits of Profilo is that in addition to being able to compare different speed scaling algorithms and processor schedulers, a few interesting micro-benchmarks can be performed to analyze the characteristics of the processor(s). The switch test is a set of traces that micro-benchmark the testing platform and roughly measure the cost of frequency changes in terms of duration and power consumption. It also supplements the ground-truth timing data presented in Section 5.5 by cross-calibrating the data presented in this section to both estimate Profilo’s overhead and estimate the energy costs of mode and context switches. The switch test is composed of four subtests:

1. Locking the processor at a discrete frequency and comparing one continuous task (i.e., no context switches) to one that is broken up into 100,000 discrete tasks. This is done for every single discrete frequency and sets a baseline for Profilo process switches only, including a power consumption metric, which is not included in Section 5.5.

2. Locking the processor at its slowest speed for half the task and then switching to the second lowest speed (e.g., 1300 MHz on the testing platform) and finishing the task, compared to the same workload broken up into 100,000 tasks that alternate between the lowest and the second lowest speed for each task. This aims to measure the additional cost of frequency switches, in terms of time and energy, over the first test.

3. The same as test 2, but with the second speed being the highest non-turbo frequency (e.g., 2300 MHz on the testing platform). This aims to determine if there is a difference between the “closest” speed and the “furthest” speed.

4. The same as test 3, but with the second speed being the turbo frequency (e.g., labeled “2301 MHz”, which turns out to be 3300 MHz). This aims to determine if there is a difference between the slowest speed and the highest possible speed for the processor, in comparison with other frequency changes.

Recall from Section 4.2.1 that the Profilo kernel module provides a sysfs file called do_work, which ensures that a task can be performed without context switches. However, when the work is complete, the kernel module releases control back to the operating system kernel, which consequently allows the user mode component of Profilo (Section 4.2.4) to run. Because of this lack of control in user mode, it is possible for a competing user mode process to run, despite the user mode component of Profilo having the highest priority (Section 4.2.4.3). Therefore, each task in Profilo involves at least one mode switch and the possibility of context switches. Furthermore, when a task requires a frequency change, this is done through an invocation to scaling_setspeed (see Section 4.2.4.1) from user mode, which requires a switch to kernel mode (it is a sysfs file) before returning back to user mode. A future improvement to Profilo is to change the frequency from within the kernel module (in the do_work store function) and avoid these extra mode switches and potential context switches.
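
From user mode, such a frequency change amounts to writing the target frequency, in kHz, to the scaling_setspeed sysfs file while the userspace governor is active. The sketch below shows the idea with CPU 0 hard-coded and error handling trimmed; it is an illustration, not Profilo’s code.

#include <stdio.h>

/* Requires the "userspace" cpufreq governor; cpufreq expresses frequencies in kHz. */
static int set_cpu0_speed(unsigned int khz)
{
	FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed", "w");

	if (!f)
		return -1;
	fprintf(f, "%u\n", khz);   /* e.g., 1200000 for 1200 MHz */
	return fclose(f);          /* the buffered write, and its mode switch, happen here */
}

int main(void)
{
	return set_cpu0_speed(1200000) ? 1 : 0;
}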

In summary, the continuous run of test 1 makes no switches; the switching run of test 1 requires a mode switch for every task and possibly context switches; and each task in tests 2 through 4 requires at least two mode switches (and a frequency switch) with the possibility of context switches. With a properly configured testing environment, free of unnecessary modules, drivers, services, and processes, the switches should be few in number (i.e., next to no context switches).

The goals of these tests are to measure the overhead of Profilo relative to Section 5.5, and then use that data to micro-benchmark the testing platform and estimate the cost of process and frequency switches. For these tests, a “work unit” (see Section 4.1.5) of 1 (i.e., no work) is appropriate. The tests were run with an ops value (see Section 4.2.4.5) of 71,555,000. The large ops value is to minimize the interference from the RAPL overflow prevention code. Bearing in mind that it is interpreted as the approximate number of loops of the “work unit” that the processor can perform in a second, it is multiplied by 60 and then cast to a 32-bit integer. To prevent integer overflow of the 32-bit variable, the largest ops value is 71,582,788. The 71,555,000 value is close to this upper bound and easy to read in the script. Considering that even in the switching version of each test, and at the lowest discrete speed, the testing platform can perform nearly half a billion process switches, this ops value still leads to the RAPL overflow prevention code running 6 to 19 times per second, instead of once per minute. A future improvement to Profilo is therefore to store the ops value and calculations variable as 64-bit integers.
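
The 71,582,788 bound follows directly from the 32-bit limit:

\[ \left\lfloor \frac{2^{32}}{60} \right\rfloor \;=\; \left\lfloor \frac{4{,}294{,}967{,}296}{60} \right\rfloor \;=\; 71{,}582{,}788 \]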

Because the first trace of every test continuously runs the respective processes without a context switch, the workload is essentially reduced to incrementing a loop counter 100,000 times. This occurs very quickly, without much overhead from Profilo, including RAPL overflow prevention (because the calculations value is checked against the ops value between processes). Tests 2 through 4 split this into two processes with a speed switch in between. However, this occurs much faster than the finest resolution of 1 millisecond in the summary report. The energy consumption of these tests is also below the 1 millijoule resolution of the summary report. As a consequence, the data in Table 5.9 contains only the results of the second (switching) part of each test. Furthermore, the results in the table represent the per-process cost. This value is calculated by dividing the Profilo summary results by 100,002 (the first two tasks scale the minimum and maximum speed according to the formula in Section 4.2.4.1) and then multiplying the values by one million to yield duration in microseconds and energy consumption in microjoules. Every test was run 10 times, with the last three columns reporting the mean of the samples and the standard deviation in the format mean ± standard deviation.


Table 5.9 Switch Test Results

Test F1 (MHz) F2 (MHz) Duration (µs) PP0 (µJ) PKG (µJ)

1 (Switching) 1200 N/A 2.09 ± 0.03 7.8 ± 0.3 16.1 ± 0.5

1 (Switching) 1300 N/A 1.93 ± 0.02 7.6 ± 0.2 15.2 ± 0.3

1 (Switching) 1400 N/A 1.81 ± 0.03 7.5 ± 0.2 14.6 ± 0.4

1 (Switching) 1500 N/A 1.67 ± 0.03 7.2 ± 0.1 13.8 ± 0.2

1 (Switching) 1600 N/A 1.56 ± 0.02 7.0 ± 0.1 13.2 ± 0.1

1 (Switching) 1700 N/A 1.47 ± 0.02 6.9 ± 0.1 12.7 ± 0.2

1 (Switching) 1800 N/A 1.38 ± 0.02 6.8 ± 0.1 12.2 ± 0.2

1 (Switching) 1900 N/A 1.32 ± 0.03 6.7 ± 0.2 12.0 ± 0.3

1 (Switching) 2000 N/A 1.26 ± 0.02 6.6 ± 0.1 11.6 ± 0.2

1 (Switching) 2100 N/A 1.20 ± 0.02 6.5 ± 0.1 11.3 ± 0.2

1 (Switching) 2200 N/A 1.14 ± 0.02 6.4 ± 0.1 11.0 ± 0.2

1 (Switching) 2300 N/A 1.09 ± 0.02 6.6 ± 0.1 10.9 ± 0.2

1 (Switching) 2301 (3300) N/A 0.76 ± 0.01 9.6 ± 0.2 12.6 ± 0.2

2 (Switching) 1200 1300 4.60 ± 0.02 28.9 ± 9.6 47.2 ± 9.5

3 (Switching) 1200 2300 6.44 ± 0.03 27.2 ± 0.2 52.8 ± 0.4

4 (Switching) 1200 2301 (3300) 4.62 ± 0.03 28.7 ± 9.7 47.1 ± 9.8

The standard deviation for the duration across all tests is below 2%. The standard deviation for the energy consumption across the PP0/PKG domains in the first test is below 4%, with most results being below 2%. The energy consumption for test 2 and test 4 varies by a substantial amount, despite the fairly consistent duration. Without further tests isolating the frequency switches from the kernel function calls, it is unclear why this occurs. It is particularly puzzling since the third switch test does not have a large standard deviation. However, taking into account the means and standard deviations, the values are statistically comparable. As discovered in Section 5.3.4, test 2 requires no voltage change, while tests 3 and 4 both require voltage changes.

Since the first switch test takes 0.76 to 2.09 µs per process, depending on the frequency, it is clear that this is done without a context switch, but rather a mode switch. In fact, when compared to Table 5.8 at the specific frequencies, the mode switch represents a very consistent 5.9% of the overhead. Put another way, Profilo’s overhead (without the mode switch) at each discrete speed is equivalent to 16 times the duration of a mode switch, which seems somewhat high. This may have to do with the high number of RAPL overflow prevention function calls generating overhead that otherwise would not be present when Profilo has a workload. With a “work unit” of 150, used in Chapter 6, based on the discussion from Section 5.4, the relative overhead would likely be at least an order of magnitude smaller.

Because the mode switch represents such a consistent fraction of the per-process overhead, Table 5.10 estimates the energy required to perform a mode switch at each discrete frequency. Without actually performing such a micro-benchmark directly, there are no guarantees on accuracy or precision. The last three columns of Table 5.9 and Table 5.10 use the same mean ± standard deviation format.


Table 5.10 Mode Switch Durations and Energy Estimates

Frequency (MHz) Mode Switch (ns) PP0 (nJ) PKG (nJ)

1200 123.1 ± 0.2 462 ± 18 948 ± 28

1300 113.4 ± 0.0 446 ± 12 893 ± 18

1400 105.3 ± 0.0 436 ± 14 852 ± 21

1500 98.6 ± 0.5 425 ± 8 814 ± 13

1600 92.1 ± 0.0 417 ± 5 781 ± 8

1700 86.7 ± 0.0 409 ± 6 753 ± 11

1800 81.9 ± 0.0 401 ± 6 725 ± 10

1900 78.3 ± 1.2 398 ± 9 708 ± 16

2000 73.7 ± 0.0 389 ± 8 680 ± 13

2100 70.2 ± 0.0 382 ± 5 660 ± 10

2200 67.0 ± 0.0 377 ± 6 644 ± 10

2300 64.2 ± 0.2 386 ± 6 640 ± 10

2301 (3300) 44.8 ± 0.2 566 ± 11 744 ± 14

Despite Profilo’s large overhead in the switch test, it represents only 62.5% of a context switch. So Profilo is still actually over-optimized compared to context switches between real processes managed by a contemporary Linux kernel, especially when considering that there are additional delays created by kernel management code. However, if a Linux environment is not contending between different processes, a process may get a consecutive time slice, experiencing only a mode switch in between the time slices. Therefore, with traces that result in a pseudo runqueue (see Section 4.2.4.3) that has a consistent depth greater than 1, the high number of context switches would result in Profilo underestimating the real-world duration/energy. Conversely, in situations where there is only one process in the runqueue, Profilo would overestimate the real-world time/energy. In summary though, it is reasonable to suggest that Profilo’s overhead can be ignored with mixed workloads.

A future improvement to Profilo is to have a run time option that generates a mode/context switch calibration file for each frequency that can subsequently be taken as input where Profilo adds and subtracts overhead duration/energy, depending on the situation, yielding a sanitized and extremely accurate summary that is also precisely repeatable on subsequent runs.

It can be observed from the first test in Table 5.9 that as the frequency increases, the per-process duration for switching goes down. Conversely, the power goes up, but the energy used goes down due to less idle-overhead, which is as expected and explained in detail in Section 5.3.

When frequency switching is added (test 2), the duration increases by a factor of 2.2, despite running 8.3% faster for half the time. The results are even more interesting with test 3, where the duration is 40% longer than test 2, despite the upper frequency running 77% faster! It would seem that it takes longer to switch to frequencies that are further away, so much so that the speed boost from even the highest non-turbo frequency cannot compensate for the switch time. Even test 4 has a similar duration to test 2, despite its second (turbo) frequency being clocked over 2.5 times faster.


The situation gets worse in terms of energy consumption. For the second test, consumption increases 3.7-fold and 2.9-fold, on average, across the respective PP0 and PKG domains. Strangely, the energy consumption varies wildly for the second and fourth tests but not for the third test. The reason is not apparent, but taking into account the standard deviation, the values are all comparable.

It would seem that changing the frequency in tests 2 through 4 requires a context switch; however, this may not necessarily be true. If we take the results for the first test at the lowest speed (1200 MHz) and the second lowest speed (1300 MHz), average them, and then subtract the average mode switch time for both frequencies, we get an average overhead for both frequencies of 1.89 µs per process. This would leave only 2.71 µs per process for mode/context and frequency switching. This is notably less than the combined average ground truth context switch time for both frequencies of 3.03 µs ± 0.15 µs. It therefore seems more plausible that frequency changes are performed without a context switch. To be certain, investigation of the kernel code used by the sysfs file (/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed) would be necessary.

Carrying this analysis further, with this conjecture, we calculate the average mode switch time for both frequencies, which is 0.12 µs. There are two of them: one to/from the Profilo kernel module and user mode application, and then another to/from the sysfs file referenced above that enables user mode applications to change the frequency. The second test takes 4.60 µs per process, so taking into consideration the average overhead (1.89 µs) and combined mode switches (0.24 µs), this means it takes 2.47 µs just to switch from the lowest to the second lowest speed (1200 MHz to 1300 MHz). Similar calculations for the third test yield a substantially longer 4.76 µs to switch from the lowest speed to the highest non-turbo frequency (1200 MHz to 2300 MHz). Oddly though, in the fourth test, where the second speed is the turbo frequency (3300 MHz), the calculated switch duration is 3.11 µs, which is still longer than in the second test, but less than in the third. The reduced time when compared to the highest non-turbo frequency may have to do with it running 43% faster. The voltage change from 2200 MHz to 2300 MHz (observed in Section 5.3.4) could be responsible for this increased duration. Expanding the test to analyze each discrete frequency would be a relatively straightforward task with Profilo, but is unfortunately out of scope for this subsection.
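
For the second test, the arithmetic behind these figures works out as follows, using the per-process durations from Table 5.9 and the mode switch times from Table 5.8:

\[ \text{overhead} \;=\; \frac{2.09 + 1.93}{2}\ \mu\text{s} \;-\; \frac{0.123 + 0.113}{2}\ \mu\text{s} \;\approx\; 1.89\ \mu\text{s} \]

\[ \text{frequency switch} \;\approx\; 4.60\ \mu\text{s} \;-\; 1.89\ \mu\text{s} \;-\; 2 \times 0.12\ \mu\text{s} \;\approx\; 2.47\ \mu\text{s} \]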

If the fraction of the total duration represented by the isolated frequency switch time is multiplied by the respective energy consumption of each test, the estimated energy cost for the second test is 15.5 ± 5.1 µJ per switch for PP0 and 25.4 ± 5.1 µJ per switch for PKG. For the third test, the estimated results are 20.1 ± 0.1 µJ per switch for PP0 and 39.0 ± 0.3 µJ per switch for PKG. For the fourth test, the estimated results are 19.3 ± 6.6 µJ per switch for PP0 and 31.7 ± 6.6 µJ per switch for PKG.

When the energy values are instead calculated by taking the total energy and subtracting the average energy for the respective discrete speeds in the first test, as well as the estimated energy consumption of a mode switch for those speeds, the values are in close agreement. The PP0/PKG values for the second, third, and fourth tests are 20.7/30.6, 19.6/38.5, and 19.5/31.9 microjoules per switch, respectively.


Since these two estimation approaches are within the general vicinity of one another, it is safe to say that their estimates are not far from the actual values. To properly measure and assess the isolated frequency switching costs, in terms of duration and energy, the Profilo kernel module would need to be extended to perform frequency switching directly (via assembly, similar to how the RAPL MSRs are read). This would not only give Profilo more precise control during regular run time, it would also support a frequency switching benchmark feature that generates a matrix of duration and energy requirements from and to all of the frequencies available on the processor(s) – an idea introduced, in concept, at the beginning of this section.

In summary, it can be observed that the cost of switching the discrete frequency at which a processor does work depends on the destination frequency and possibly the source frequency. Furthermore, the amount of time to perform the switch is non-trivial, at around one to two times that of a context switch. The amount of energy required to make a switch appears to be even more significant, costing roughly two to three times as much energy as a context switch.

5.7 Summary

Before this chapter could delve into micro-benchmarking, the testing platform was described in detail, including its hardware specifications and limitations. The platform was also often compared to Intel’s latest microarchitecture to get a sense of the evolution of their implementation of the x86 instruction set architecture (ISA). The first subsection was dedicated to the testing platform description and the tools and methodology for capturing the total system power consumption, outside of the scope of the RAPL MSRs. The ACPI specification was also introduced with its goals, shortcomings, nomenclature, and its relationship with modern platforms.

Following this background information, the subsequent sections focused on micro-benchmarks, beginning with the total system power consumption. The energy values for most of the devices in their numerous states were evaluated. The most energy efficient configuration was used as a yardstick to normalize the testing operating system.

The measurement values for the system in its various global states (G-states) and system states (S-states) were studied. This led into the last, but important, evaluation of power saving idle states: the processor idle states (C-states). The characteristics of these states were explored and measured, along with how Linux makes use of them. In examining these states, the justification was made for why idle states are as important for energy efficiency as the active performance states (P-states). As an example, a hybrid state, available only on multi-core/processor platforms, in which work is conducted on one or more cores while the remaining cores stay in their deepest sleep state (with a long wake-up latency), was found to save a substantial amount of power compared to keeping all cores in an idle state that has a shorter wake-up latency to meet response-time requirements but in which no useful work can be conducted by any of the cores.

Finally, the active DVFS states (P-states) were investigated and quantified. Voltage switches were discernible from the power characteristics, giving rise to the observation that Intel has suboptimal voltage mapping, particularly for the lowest frequencies (at least on the Ivy Bridge microarchitecture). It was also observed through Profilo that power-gating can offer sizable power savings.

To complete the analysis of power states, the workload was examined and the rationale for choosing the particular “work unit” in the next chapter was explained with observed benchmark data. The “work unit” value of 150, combined with unique loop values at each discrete speed, was chosen for the testing platform to create 10 millisecond time slices. The focus then moved to mode switches and context switches, which occur between time slices. The time and energy costs of mode, context, and frequency switches were quantified. Improvements to Profilo were noted for future work to make it an even more accurate model of real world systems.

In conclusion, this chapter provides the background information and up-to-date analysis of Intel’s recent instruction set architecture. It also qualifies Profilo, describing its usefulness and limitations. Armed with this knowledge, almost any kind of scheduler and speed scaler can be synthesized, tested, and analyzed. Some interesting examples are explored in the next chapter.


Chapter 6: Profiling Coupled and Decoupled Speed Scalers

Having described the implementation details of Profilo in Chapter 4 and the hardware characteristics of the testing platform in Chapter 5, this chapter builds on that foundation by profiling both established and theoretical speed scaling schedulers from Chapter 2 using five different job batches. Throughout this chapter, a “work unit” of 150 is implied, based on the discussion from Section 5.4. It must also be stressed that only the non-turbo discrete speeds of the test platform (described in Section 5.1) are used.

6.1 Workloads

Before any scheduler or speed scaling policy can be evaluated, several workloads need to be defined that ideally represent the gamut of valid workloads that a real system may encounter. To utilize every available (non-turbo) discrete speed, using job-count based speed scaling, on the test platform (described in Section 5.1), each workload needs to contain at least 12 jobs. For simplicity, none of the workloads contain more than 12 jobs.

Workload 1 is a batch of homogeneous jobs that each take about 1-2 seconds to complete on the test platform, depending on the speed of execution. This workload is somewhat contrived but provides a simple baseline for comparison with the other heterogeneous workloads. Workload 2 is a batch of jobs whose sizes differ additively in a simple arithmetic progression. The smallest job takes less than a second while the largest job takes 5-12 seconds, depending on the speed of execution. Workload 3 is a batch of jobs whose sizes differ by successive factors of 2. The smallest job takes tens of milliseconds while the largest job takes 50-100 seconds to complete, depending on the execution speed, and alone accounts for half of the total system work.

The final two workloads are composed of heterogeneous jobs with less variability than the second and third workloads. Workload 4 is calculated by the formula b³ + 1399, where b is a base value that starts from 1 and arithmetically increases by 1.5, with the last job having a base value of 17.5. The smallest job takes roughly half a second, while the largest job takes 2-3 seconds depending on the execution speed. Workload 5 is calculated by the formula b⁴ + 4999, where b is a base value that starts from 1 and arithmetically increases by 1, with the last job having a base value of 12. Depending on the execution speed, the smallest job takes 1-2 seconds, while the largest job takes 6-12 seconds.
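To make these job-size formulas concrete, the short C sketch below enumerates the Workload 4 and Workload 5 job sizes in loops of the “work unit”. The cubic and quartic exponents are reconstructed from the quoted job durations and the 35,214-loop minimum workload total given in Section 6.2, so they should be read as an inference rather than a quotation, and the program itself is illustrative rather than part of Profilo.

#include <stdio.h>
#include <math.h>

/* Sketch of the job-size formulas for Workloads 4 and 5, in "work unit"
 * loops. The exponents (3 and 4) are inferred from the quoted job
 * durations and the 35,214-loop total, not taken verbatim from the text. */
int main(void)
{
    double total4 = 0.0, total5 = 0.0;

    for (int j = 0; j < 12; j++) {
        double b4 = 1.0 + 1.5 * j;          /* base 1, 2.5, ..., 17.5 */
        double b5 = 1.0 + 1.0 * j;          /* base 1, 2, ..., 12     */
        double size4 = pow(b4, 3) + 1399.0; /* Workload 4 job size    */
        double size5 = pow(b5, 4) + 4999.0; /* Workload 5 job size    */
        total4 += size4;
        total5 += size5;
        printf("job %2d: W4=%7.0f  W5=%7.0f loops\n", j + 1, size4, size5);
    }
    printf("totals: W4=%.0f  W5=%.0f loops\n", total4, total5);
    return 0;
}

Under these assumed exponents, the Workload 4 total printed by the sketch is 35,214 loops, which matches the lower end of the workload range quoted in the next section.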

6.2 Building the Trace Files

As described in Section 2.3, there are three speed scaling schedulers that are of interest to evaluate with Profilo: the YDS policy, which minimizes power consumption; the PS policy, which epitomises fairness; and the FSP-PS decoupled system, which is provably efficient and has simulation results that suggest a noteworthy performance advantage over PS [21]. These policies are fundamentally different and illustrate the generality and flexibility of Profilo.

The first step to profiling these schedulers is to create the trace file using the workloads described in the previous section. The workloads vary from 35,214 to 409,500 loops of the “work unit” (150). PS time slices can execute 21 to 41 loops depending on the discrete speed (see Table 5.7), which means that the PS trace files will have 1,087 to 16,839 lines. Since the other policies also require knowledge of PS execution, manually generating the trace files would be tedious and error-prone (see Appendix C for a simple example). Instead, small utility applications were written for each of the speed scaling schedulers, taking into account the PS time slice loop values for each discrete speed, that accept the job numbers and their respective sizes as input, and generate the comma-separated-values (CSV) trace file as output.

6.2.1 The PS Generator

The PS generator contains a speed array equal in size to the number of (non-turbo) discrete speeds, where each element is the loop value corresponding to a PS time slice in Table 5.7. It also has a tasks array that holds each of the jobs and their respective size (loop values) in increasing order to simplify the code in the loop described below. After generating the trace header, the trace file is scaled with a single-unit task at speed 10 and another at speed 121. Then each job can run at a speed equal to ten times the number of jobs left (i.e., the number of non-zero elements in the tasks array) in the system. This results in the correct job-count-based discrete speed scaling policy according to the formula in Section 4.2.4.1, while preventing any task from running at speed 121, which is mapped to the turbo frequency (described with its caveats in Section 5.1).

Next the utility enters a loop where each iteration writes a line to the trace file, going through the elements of the tasks array. The amount of work performed for each task is found in the speed array, where the index is the number of jobs left, unless the value of this element is larger than the amount of work left, in which case this latter value is used. When there are no more jobs left in the system, the utility terminates, and the trace file is complete.
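A minimal sketch of this generator is shown below. The speeds[] entries, the placeholder job sizes, the output file name, and the exact labels written on the two scaling lines are assumptions made for illustration; the actual utility uses the Table 5.7 loop values and the workload sizes from Section 6.1.

#include <stdio.h>

#define NSPEEDS 12
#define NJOBS   12

int main(void)
{
    /* Loop value of one PS time slice at each discrete speed (placeholders
     * standing in for the Table 5.7 values, which range from 21 to 41). */
    int speeds[NSPEEDS] = { 21, 23, 25, 27, 28, 30, 32, 34, 36, 37, 39, 41 };
    /* Job sizes in "work unit" loops, smallest first (placeholder values). */
    long tasks[NJOBS] = { 1000, 2000, 3000, 4000, 5000, 6000,
                          7000, 8000, 9000, 10000, 11000, 12000 };

    FILE *out = fopen("ps_trace.csv", "w");
    if (!out)
        return 1;

    /* Header, then two single-unit lines that pin the speed scale so that
     * speed 10 maps to the lowest frequency and 121 to the (unused) turbo. */
    fprintf(out, "Process,Work,Speed\n");
    fprintf(out, "P0,1,10\n");
    fprintf(out, "P0,1,121\n");

    int left = NJOBS;                        /* jobs with work remaining   */
    while (left > 0) {
        for (int i = 0; i < NJOBS; i++) {    /* one round-robin PS pass    */
            if (tasks[i] == 0)
                continue;
            long slice = speeds[left - 1];   /* loops in one PS time slice */
            if (slice > tasks[i])
                slice = tasks[i];            /* final, shorter slice       */
            fprintf(out, "P%d,%ld,%d\n", i + 1, slice, 10 * left);
            tasks[i] -= slice;
            if (tasks[i] == 0)
                left--;                      /* departure: speed drops     */
        }
    }

    fclose(out);
    return 0;
}

Decrementing the job count as soon as a job finishes means that later lines in the same round already carry the reduced job-count speed, which is the behaviour the job-count-based policy above calls for.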

6.2.2 The FSP-PS Generator

The FSP-PS generator has mostly the same code prior to the loop as the PS generator. The major exception is a variable called fspWork and an fspTasks array that is an exact copy of the tasks array. The loop runs through a “virtual PS” routine (described in Section 2.3), which is similar to the PS loop above, except instead of writing the values, it adds the work to be done to the fspWork variable.

When a virtual process is complete, a speed change is imminent, so an inner loop begins. The inner loop goes through the fspTasks array, writing a line to the trace file whose amount of work is the smaller of the work that a process has left and the fspWork variable, and decrementing both by the value used. If the fspWork variable is larger, the inner loop reiterates with the next process in fspTasks. When fspWork reaches zero, the inner loop terminates and the outer loop moves onto a new iteration.

When the processes are exhausted, all of the elements in both the tasks and fspTasks arrays will be zero. The trace file will then be complete, having devoted continuous execution to each process (in SRPT order, since this is a batch workload), breaking up the tasks across multiple lines wherever a PS speed change would have occurred.
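The following is a minimal sketch of that generator, under the same placeholder speeds[] and job sizes as the PS sketch above. The speed written on each line is taken to be the virtual-PS speed in effect just before the virtual departure, which is an assumption about the generator's convention rather than a detail stated in the text.

#include <stdio.h>

#define NSPEEDS 12
#define NJOBS   12

int main(void)
{
    int  speeds[NSPEEDS] = { 21, 23, 25, 27, 28, 30, 32, 34, 36, 37, 39, 41 };
    long tasks[NJOBS]    = { 1000, 2000, 3000, 4000, 5000, 6000,
                             7000, 8000, 9000, 10000, 11000, 12000 };
    long fspTasks[NJOBS];
    for (int i = 0; i < NJOBS; i++)
        fspTasks[i] = tasks[i];              /* exact copy of the job sizes */

    FILE *out = fopen("fsp_ps_trace.csv", "w");
    if (!out)
        return 1;
    fprintf(out, "Process,Work,Speed\nP0,1,10\nP0,1,121\n");

    long fspWork = 0;
    int  next = 0, left = NJOBS;
    while (left > 0) {
        for (int i = 0; i < NJOBS; i++) {    /* one "virtual PS" round      */
            if (tasks[i] == 0)
                continue;
            long slice = speeds[left - 1];
            if (slice > tasks[i])
                slice = tasks[i];
            tasks[i] -= slice;
            fspWork  += slice;               /* accumulate instead of emit  */
            if (tasks[i] != 0)
                continue;
            int speed = 10 * left;           /* speed before this departure */
            left--;
            while (fspWork > 0) {            /* charge work in SRPT order   */
                long amt = fspTasks[next] < fspWork ? fspTasks[next] : fspWork;
                fprintf(out, "P%d,%ld,%d\n", next + 1, amt, speed);
                fspTasks[next] -= amt;
                fspWork        -= amt;
                if (fspTasks[next] == 0)
                    next++;                  /* move to the next real job   */
            }
        }
    }
    fclose(out);
    return 0;
}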


6.2.3 The YDS Generator

Recall from Section 2.3 that the offline YDS algorithm is recursive, requiring knowledge of the highest intensity jobs as it successively steps down the speed (if any steps are necessary).

However, the EDF job execution order and the rounding up of the chosen execution speed (to meet the deadlines) allow the job size variability to be enormous before the highest intensity job becomes anything other than the last job of a single batch workload (i.e., where the jobs in a system all have the same arrival time). Despite the large variability of the five chosen workloads, none of them have sufficiently large variability to overcome this characteristic of batch workloads. In other words, for these five workloads, the YDS algorithm arrives at a speed based on the deadline of the largest job, which is the same as for the entire workload.

As a result, for these specific workloads, a simplified, non-recursive implementation suffices to produce the correct YDS traces. The implementation begins the same as the FSP-PS generator, swapping the fspWork variable for ops and switches variables. A loop similar to the “virtual PS” routine runs through the round-robin PS algorithm, accumulating the number of loop operations and context/mode switches performed. When there are no more virtual processes left (i.e., all of the elements of the tasks array are empty), the speed array is traversed, looking for the first element that is greater than or equal to the ceiling of ops/switches. The index value of this element (plus one) corresponds to the execution speed for all of the jobs in that specific workload. The trace file is then built giving each process full execution to completion in EDF order, at the calculated minimum speed.
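A minimal sketch of this simplified generator is shown below, again with placeholder speeds[] and job sizes. Counting one switch per virtual time slice, and mapping the chosen speed index onto the trace's speed column as 10*(idx+1), are assumptions made for illustration about the conventions of Section 4.2.4.1.

#include <stdio.h>

#define NSPEEDS 12
#define NJOBS   12

int main(void)
{
    int  speeds[NSPEEDS] = { 21, 23, 25, 27, 28, 30, 32, 34, 36, 37, 39, 41 };
    long tasks[NJOBS]    = { 1000, 2000, 3000, 4000, 5000, 6000,
                             7000, 8000, 9000, 10000, 11000, 12000 };
    long sizes[NJOBS];
    for (int i = 0; i < NJOBS; i++)
        sizes[i] = tasks[i];                  /* keep the original job sizes */

    /* Virtual PS pass: count loop operations and context/mode switches. */
    long ops = 0, switches = 0;
    int  left = NJOBS;
    while (left > 0) {
        for (int i = 0; i < NJOBS; i++) {
            if (tasks[i] == 0)
                continue;
            long slice = speeds[left - 1];
            if (slice > tasks[i])
                slice = tasks[i];
            tasks[i] -= slice;
            ops += slice;
            switches++;                       /* one switch per time slice   */
            if (tasks[i] == 0)
                left--;
        }
    }

    /* Pick the slowest discrete speed that still meets the PS deadline. */
    long target = (ops + switches - 1) / switches;   /* ceil(ops/switches)  */
    int  idx = NSPEEDS - 1;
    for (int s = 0; s < NSPEEDS; s++) {
        if (speeds[s] >= target) { idx = s; break; }
    }

    FILE *out = fopen("yds_trace.csv", "w");
    if (!out)
        return 1;
    fprintf(out, "Process,Work,Speed\nP0,1,10\nP0,1,121\n");
    for (int i = 0; i < NJOBS; i++)           /* run to completion, EDF order */
        fprintf(out, "P%d,%ld,%d\n", i + 1, sizes[i], 10 * (idx + 1));
    fclose(out);
    return 0;
}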


6.3 Profiling Results

Profilo was run with each of the traces using a “work unit” of 150 (see Section 5.1) and a “primes per second” argument (see Section 4.1.5) of 320,000. The latter argument is based on the slowest discrete speed performing a single run of the “work unit” in 468 µs (see Table 5.7), which means that it would perform 150 / (468 × 10⁻⁶ s) ≅ 320,513 primes per second. The slowest speed is used for the calculation to establish a more conservative “primes per second” value that guards against RAPL overflow (see Section 3.3).

The results of all the traces are shown in Table 6.1. The columns list the workload batch, the evaluated policy, the total elapsed wall-clock time to complete the workload under the given policy, the mean response time (MRT) for the 12 jobs, and the amount of energy (in joules) consumed in the PP0 (cores) and PKG (entire CPU package) domains. To clarify, the mean response time is the average duration, from arrival to departure, of each job in the respective workload. A graphical summary of the results is in Figure 6.1.
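Since every job in these batch workloads arrives at time zero, the mean response time reported below reduces to the average completion time. Written out, with C_i and A_i denoting the completion and arrival times of job i (notation introduced here only for clarity):

\[ \mathrm{MRT} = \frac{1}{12} \sum_{i=1}^{12} \left( C_i - A_i \right) \]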


Table 6.1 Profiling Results

Workload  Scheduler  Duration (s)  MRT (s)  PP0 (J)  PKG (J)
1         PS              14.57     14.55     76.8    131.5
1         FSP-PS          14.57      7.89     76.8    131.6
1         YDS             14.55      7.88     76.5    130.9
2         PS              46.23     30.16    200.0    373.0
2         FSP-PS          46.21     16.33    199.4    372.4
2         YDS             45.80     17.81    198.8    369.9
3         PS             166.15     38.10    562.5   1184.4
3         FSP-PS         166.08     25.43    560.3   1180.8
3         YDS            163.12     27.15    560.9   1170.0
4         PS              10.71      6.57     46.5     86.6
4         FSP-PS          10.70      3.61     46.5     86.7
4         YDS             10.34      4.01     45.3     84.1
5         PS              37.29     21.95    160.4    300.6
5         FSP-PS          37.27     12.12    160.7    300.9
5         YDS             35.43     13.42    155.8    289.3

As observed, the total duration of each workload is similar across all of the schedulers. This makes sense since the job deadlines are all set by the PS policy, with job departures that are in the same order as under SRPT with these simple (single batch) workloads [25]. Theoretically, the overall duration of these algorithms should be identical; however, the YDS traces for workloads 4 and 5 are 3.5% and 5% shorter because the execution speed is rounded up to the discrete frequency necessary to meet the job deadlines, which favours these particular workloads a little more than the others. The total execution time of a decoupled speed scaling system, in theory, ought to be set by the speed scaler, regardless of the decisions the scheduler makes; however, on real systems, there are costs for mode/context switches (quantified in Sections 5.5 and 5.6), which are the domain of the scheduler. The benefit of having fewer process switches with the FSP-PS scheduler can be observed in the fact that, when a difference in duration does appear, it tends to favour FSP-PS. A similar trend exists with the energy consumption, although it is not as obvious due to the lower resolution and higher jitter of the RAPL counters, the cause of which is explained in Section 3.3.

With respect to the individual jobs, the mean response time improvements of FSP-PS and YDS over PS are dramatic. In Workload 1, where the jobs are homogeneous, the mean response time of PS is nearly twice as long. That is because PS keeps all the jobs in the system until the end, when the jobs start leaving the system one at a time. As the jobs depart, the execution speed decreases and the amount of work per time slice is reduced. Since the difference between the slowest and fastest (non-turbo) discrete speeds on the test platform is less than a factor of two, it is possible that the last jobs to leave the system require an extra time slice, on top of waiting for the jobs before them and having their service rate reduced. If the workload is sufficiently large though, this exit behaviour will have a relatively small impact. In the case of Workload 1, the impact is less than 1% of the total duration of the workload since each job exits without requiring another time slice.

The improvements in mean response time for the heterogeneous workloads under FSP-PS and YDS relative to PS still range from a 29% to 46% reduction. While the FSP-PS and YDS strategy of exclusive service to one job at a time results in more timely departures, the advantage becomes less pronounced when the final job accounts for a greater portion of the total execution time.

With Workload 1, the mean response time of the YDS policy is almost the same as FSP-PS. For the remaining workloads, FSP-PS reduces the mean response time by 6% to 10% in comparison to YDS. This improvement is due to the higher service rate that FSP-PS invokes early in the workload compared to the “blended average” service rate of YDS. In principle, YDS optimizes for energy efficiency, losing on the mean response time compared to FSP-PS.

The energy optimization of YDS is barely discernible with a homogeneous workload, as is the case for Workload 1. This should come as no surprise since the service rates of these policies are almost the same in this case, except at the very end, when the jobs under PS (and “virtual PS” for FSP-PS) sequentially terminate. This exit behaviour, described above as contributing less than 1% of the total workload runtime, affords YDS a slight reduction in energy consumption. With the other (heterogeneous) workloads, the energy consumption is reduced by 1% to 4%, which is notable considering that YDS also completes the entire workloads in 1% to 5% less time.

Taking into consideration all of these metrics, it can be observed that FSP-PS dominates PS in the sense that it performs at least as well as, or better than, PS. Its greatest strength is its dramatic improvement in mean response time, reducing it by 33% to 46% on these workloads, with no disadvantages compared to PS.

When comparing FSP-PS with YDS, it can be observed that Workload 1, with no job variability (i.e., homogeneous), results in comparable metrics due to the mostly similar execution speed and power profiles. When the workload is heterogeneous, FSP-PS always has a distinct mean response time advantage over YDS of 6% to 10%. To a lesser degree, YDS instead improves the overall workload execution time and power efficiency. The results suggest that YDS only appreciably improves on power efficiency with workloads that have medium job size variability.

Workload 2, with low job size variability, results in simple, relatively high service rates for all of the policies, similar to the homogeneous case. Workload 3, with its high job size variability, creates a situation in which the last job consumes a substantial amount of the overall execution time, resulting in a relatively low service rate for all of the policies.


[Figure: bar chart comparing the PS, FSP-PS, and YDS policies on normalized mean response time (MRT), duration, and energy, on a 0-100% scale.]

Figure 6.1 Profiling Graph (Workloads 1-5)

However, Workloads 4 and 5, with their intermediate job size variability, result in the greatest differences between FSP-PS and YDS. This can be visually observed in Figure 6.2. In relation to one another, FSP-PS reduces the mean response time by 10%, and YDS reduces the overall execution time by 5%, while also reducing the overall energy consumption by 3-4%. The mix of high and low service rates (under a PS speed scaling policy) drives the differences between these two policies. While FSP-PS reduces the response time of the smallest jobs by running at the higher frequencies, YDS chooses a single blended service rate that optimizes for energy efficiency, makes no further frequency switches, and results in the longer jobs having a shorter response time than with FSP-PS. This, combined with fewer context switches (equal in number to the processes), results in the entire workload completing sooner.

[Figure: bar chart comparing the PS, FSP-PS, and YDS policies on normalized mean response time (MRT), duration, and energy for Workloads 4 and 5, on a 0-100% scale.]

Figure 6.2 Profiling Graph (Workloads 4-5)

6.4 Summary

This chapter showcased Profilo using five batch workloads under the PS, decoupled FSP-PS, and YDS speed scaling scheduling policies described in Chapter 2. The implementation details of the trace generators were described along with the arguments used with Profilo. The results were broken down in terms of overall duration, mean response time of jobs, and energy consumption across two power domains. FSP-PS was found to match or improve upon PS across every workload and metric. FSP-PS was also shown to have a clear mean response time advantage over YDS in all heterogeneous workloads. The strengths of YDS were shown to be somewhat minor compared to FSP-PS, with the differences most evident on the workloads with medium job size variability. Under those workloads, YDS finished the overall batch sooner and with less energy than FSP-PS – and with a dramatic improvement in mean response time relative to PS.


Chapter 7: Conclusions and Future Work

This chapter summarizes the information presented in this thesis, the conclusions that can be made from its work, and the contributions it adds to the field. Future research and improvements are suggested before the chapter draws to a close.

7.1 Thesis Summary

This thesis can be broken down into four major parts:

1. The motivation for effective speed scaling policies, and the difficulties of bridging between theoretical work and systems work.

2. Designing and implementing a profiler using the features of modern processors to gather accurate, high-resolution measurement data.

3. Describing and quantifying the energy usage and overheads of modern microarchitectures.

4. The experimental evaluation of theoretical and practical speed scaling schedulers.

7.1.1 The Importance of Speed Scaling Scheduling Policies

The thesis begins by describing the need for energy efficient computing from the perspectives of both theoretical and systems communities. The gaps that exist between these communities and the challenges of bridging them are also described in general and specific terms. Fortunately, modern processors now have features, aimed at governing energy consumption, that integrate efficient and precise energy meters. These features are described with the explicit intent of using them to test theoretical speed scaling solutions in a simple but useful way.

7.1.2 Building a Profiler

Chapter 4 of this thesis presents the design and implementation of Profilo. The design decisions that led to the current implementation are justified. The implementation is also described in detail. Finally, the strengths and weaknesses of Profilo are discussed through qualitative and quantitative evaluations.

7.1.3 Examining Modern Architectures

As explained in the early chapters, examining modern architectural features and then quantifying them is not only important for the design of speed scaling schedulers, it is also a prerequisite for interpreting profiling results. Chapter 5 of this thesis is devoted to this micro-benchmarking objective.

7.1.4 Experimental Evaluation of Speed Scaling Schedulers

The final objective of the thesis is to determine the impact of speed scaling scheduling policies from a performance and energy perspective. This part of the thesis makes use of Profilo to examine a few policies and make comparisons with the most commonly implemented policy on current operating systems.


7.2 Conclusions

The experimental systems work in this thesis leads to the following conclusions:

• Processor idle states can offer greater relative power savings than DVFS, while DVFS can offer greater absolute power savings. Both are equally significant, and therefore important to incorporate into speed scaling schedulers.

• On multicore systems, idling cores in higher sleep states (e.g., to save power before another batch of jobs or to meet a latency service level) can consume more power than keeping one core active (and able to do work) and the rest in their deepest idle state. Other win-win hybrid states exist.

• The delay and energy consumption of mode, context, and frequency switches are not insignificant; their costs are evident even in small workloads.

• The experimental evaluation of PS shows that decoupled speed scaling (specifically FSP-PS) drastically improves the mean response time of jobs with a small but measurable improvement on power savings and batch execution time.

• The experimental evaluation of YDS shows that FSP-PS has a better mean response time, but under certain workloads, YDS is better for energy efficiency and batch execution time.


7.3 Relevance and Contributions

Speed scaling scheduling policies are numerous and impactful in terms of performance and energy efficiency. Understanding the characteristics and features of modern processors fosters the development of more effective policies. Being able to test even purely theoretical policies on real hardware with performance and energy metrics is something that, to my knowledge, was not possible until the implementation of the profiler described by this thesis.

The following contributions are made by this thesis:

• Micro-benchmarks that describe and quantify important behaviours and states of modern processors, leading to improved and better suited scheduling and speed scaling policies. This information also provides the knowledge necessary to properly analyze the results that are generated from the profiler.

• A profiler that makes it easy to perform controlled workloads made up of precise units of work at defined speeds, and produces high-resolution timing and energy data broken down by process and workload.

• Empirical data from theoretical speed scaling scheduling policies using a modern processor with detailed analysis that makes direct comparisons to the most common policy on contemporary operating systems.

7.4 Future Work

There are several possible future directions to build upon the work in this thesis:

• Improve Profilo:
  o Increase the accuracy and reduce the jitter of the RAPL energy counters by applying all of the techniques described in Section 3.3.
  o Implement support for a profiling argument that points to a calibration file containing mode/context switch duration/energy costs for every discrete speed. The summary results at the end of profiling can then be corrected to represent a true production operating system result that has neither profiling overhead nor over-optimizations.
  o Expand the benchmark runtime mode:
    § Accept an optional argument indicating the consecutive prime numbers to find, and then output the amount of time and energy it took to do so at each discrete speed.
    § Accept optional “time slice” and “loop tolerance” arguments that can be used to determine the ideal “work unit” for a desired “time slice” given the “loop tolerance”.
    § Mode and context switch benchmark that calculates the duration and energy cost per switch.
    § Frequency switch benchmark, with a source and target frequency, which calculates the duration and energy cost per switch, performed directly without a mode switch.
    § Idle benchmark that calculates the energy consumption, exit latency, and target residency of every available C-state, as well as the busy-wait power rating for every available speed.
  o Assimilate the idler utility into Profilo as an additional runtime mode.

• Support multiple logical processors:
  o This would allow the hybrid C-states from Section 5.3.3.2 to be studied.
  o This would enable multiprocessor schedulers to be profiled. Modern operating systems continuously load balance logical processors, where a scheduler is assigned to each logical processor and makes decisions without any knowledge of the other cores. This is made worse by the load balancer when it moves processes between logical processors with no regard to topology: it may move a process to an adjacent virtual core (e.g., simultaneous multithreading) that shares L1/L2, an adjacent physical core that only shares L3, or a different package (multiple CPUs) that only shares RAM. This is suboptimal since it causes unnecessary cross talk, and code and data cache misses that hurt performance and energy efficiency.

• Support idle states and arrival times in the trace file.
  o This would allow full versatility of workloads, allowing dynamic/online scheduling policies to be profiled.

• Profile and analyze more speed scaling policies.


Appendix A

Since there is no formal way of accessing MSRs in kernel mode, retrieving the RAPL energy counters requires the inline assembly shown in Figure A.1.

asm volatile(
    "rdmsr;"         // assembly code
    : "=a" (*eax)    // output: read from eax
    : "c" (ecx)      // input: load into ecx
    : "%edx");       // clobbered register

Figure A.1 Inline Assembly to Read MSRs

The inline assembly in Figure A.1 begins by using the keyword volatile, which tells the compiler that there are side effects and not to optimize the assembly away. The second line is the custom assembly code, which only runs the rdmsr instruction. This does not occur before the input variable (ecx) on line 4 is loaded into the ECX GPR by compiler-generated code. Line 3 instructs the compiler to write assembly, after the custom assembly code, that moves the contents of the EAX register into the output variable (eax). The final line tells the compiler which register it cannot use because it gets clobbered by the custom assembly; in this case, it is the EDX GPR, which gets assigned the reserved portion of the RAPL MSRs for the PP0/PKG energy counters.
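As a usage illustration only (not code taken from Profilo), the pattern in Figure A.1 could be wrapped in a small kernel-mode helper that reads one of the RAPL energy-status MSRs. The MSR addresses shown (0x611 for PKG, 0x639 for PP0) are the ones documented in the Intel Software Developer's Manual and are assumptions about the target platform; the returned value is the raw counter, before scaling by the RAPL energy unit.

#include <linux/types.h>

#define MSR_PKG_ENERGY_STATUS 0x611  /* per the Intel SDM; assumed to apply */
#define MSR_PP0_ENERGY_STATUS 0x639

/* Read the low 32 bits of an MSR using the rdmsr pattern of Figure A.1.
 * Only EAX is captured; EDX is clobbered because the RAPL energy counters
 * occupy the low 32 bits and the upper half is reserved. */
static inline u32 read_msr_low(u32 msr)
{
    u32 eax;
    asm volatile(
        "rdmsr;"         /* assembly code                   */
        : "=a" (eax)     /* output: read from eax           */
        : "c" (msr)      /* input: MSR address loaded in ecx */
        : "%edx");       /* clobbered register              */
    return eax;
}

/* Example: raw (unscaled) energy counter values; scaling by the RAPL
 * energy unit (Section 3.3) is applied elsewhere. */
static inline u32 read_pkg_energy_raw(void) { return read_msr_low(MSR_PKG_ENERGY_STATUS); }
static inline u32 read_pp0_energy_raw(void) { return read_msr_low(MSR_PP0_ENERGY_STATUS); }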


Appendix B

Profilo uses the trial division primality test algorithm for its workload. This is the simplest of primality algorithms, with an exponential running time, when measured in terms of the size of the input in bits. The (C programming language) code of the workload is written in Figure B.1.

for (j = 0; j < loop_value; j++) {
    count = 2;
    for (i = 3; count <= work_unit; i++) {
        for (check = 2; check <= i - 1; check++) {
            if (i % check == 0)
                break;
        }
        if (check == i)
            count++;
    }
}

Figure B.1 Workload C Programming Language Code

The code is composed of three progressively nested for-loops. The outermost for-loop implements the loop value described in Section 4.2.1 and analyzed in Section 5.4. Starting from 2, the consecutive prime number defined by work_unit is found a total of loop_value times; the relationship between the “work unit” and the loop value is explained in Section 4.1.5.

The trial division primality algorithm begins with the middle for-loop. This for-loop increments the candidate prime number (defined as i) until the consecutive number of prime numbers is found, which occurs when count exceeds work_unit. The reason the less-than-or-equal operator is used is to ensure that a work_unit of 1 is not the same as a work_unit of 2. The nested code of this middle for-loop contains the innermost for-loop, which increments a check variable, starting at 2 and terminating when it is equal to the candidate prime number. Each iteration performs a modulo operation to see if check is a factor of the candidate prime number. If the remainder is zero, the loop is prematurely broken. The subsequent if-statement determines if the number is prime by checking whether the previous loop was prematurely broken. If it was, the number is not prime and another iteration of the middle for-loop begins. If the check value is equal to the candidate prime number, then the candidate is indeed prime, so count is incremented.

A future improvement to this workload is to use an optimized trial division algorithm, which prevents the divisor (in the modulo operation) from exceeding the square root of the candidate prime number, similar to the sieve of Eratosthenes, bearing in mind that the latter still has a better theoretical complexity [41]. In addition to improving the runtime (which is not really a goal), the optimized trial division algorithm would better utilize the arithmetic logic unit (ALU).
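A minimal sketch of such an optimized check is shown below, assuming the same work_unit/loop_value conventions as Figure B.1. The counting convention is simplified slightly relative to Figure B.1, and the function names are illustrative only.

#include <stdbool.h>

/* Trial division bounded by the square root of the candidate: stop once
 * check * check exceeds i, since any larger factor would have to pair
 * with a smaller factor that was already tested. */
bool is_prime(unsigned long i)
{
    if (i < 2)
        return false;
    for (unsigned long check = 2; check * check <= i; check++) {
        if (i % check == 0)
            return false;
    }
    return true;
}

/* Count consecutive primes until work_unit of them have been found,
 * repeated loop_value times, mirroring the structure of Figure B.1. */
void optimized_workload(unsigned long loop_value, unsigned long work_unit)
{
    for (unsigned long j = 0; j < loop_value; j++) {
        unsigned long count = 0;
        for (unsigned long i = 2; count < work_unit; i++) {
            if (is_prime(i))
                count++;
        }
    }
}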


Appendix C

As a simple example, imagine a workload with three jobs named P1, P2, and P3 that perform 1000, 2000, and 3000 units of work, respectively, at the lowest discrete speed, with a scheduler that executes 1000 units of work before preempting. The PS trace file may look like Figure C.1.

Process,Work,Speed
P0,1,2
P1,1000,1
P2,1000,1
P3,1000,1
P2,1000,1
P3,1000,1
P3,1000,1

Figure C.1 PS Trace File

The first line is the header, which is ignored by Profilo. The second line, with job P0, scales the speed using the formula in Section 4.2.4.1 so that speed 2 is mapped to the highest available frequency, and speed 1 is mapped to the lowest available frequency. Running this trace through a debug/verbose version of Profilo would produce an output like in Figure C.2.


Starting Profilo virtual kernel.
P0: Started...
P0: Doing 1 unit(s) of work at 2301 MHz.
P0: Terminating...
P1: Started...
P1: Doing 1000 unit(s) of work at 1200 MHz.
P1: Terminating...
P2: Started...
P2: Doing 1000 unit(s) of work at 1200 MHz.
P3: Started...
P3: Doing 1000 unit(s) of work at 1200 MHz.
P2: Doing 1000 unit(s) of work at 1200 MHz.
P2: Terminating...
P3: Doing 1000 unit(s) of work at 1200 MHz.
P3: Doing 1000 unit(s) of work at 1200 MHz.
P3: Terminating...
The profiling took 6.00 seconds.
Process P0 took 0.000 seconds. It started at 0.000s and ended at 0.000s.
Process P1 took 1.000 seconds. It started at 0.000s and ended at 1.000s.
Process P2 took 3.000 seconds. It started at 1.000s and ended at 4.000s.
Process P3 took 4.000 seconds. It started at 2.000s and ended at 6.000s.
The processor cores (PP0) consumed 18.000 joules (3.000 watts).
The processor package (PKG) consumed 39.000 joules (6.500 watts).

Figure C.2 Verbose Profilo Summary for PS Example

If we consider the trace file as a batch workload in which all jobs arrive at the same time, we would calculate the mean response time using only the times at which the jobs ended. Because we do not count job P0, the value would be (1 + 4 + 6)/3 ≅ 3.7 seconds. Using the same example, an FSP-PS trace file may look like Figure C.3.

Process,Work,Speed
P0,1,2
P1,1000,1
P2,2000,1
P3,3000,1

Figure C.3 FSP-PS Trace File


The jobs leave the system in the same order as PS, but instead get continuous execution until complete, without preemption. The corresponding release/non-verbose version of Profilo would produce an output similar to Figure C.4.

Starting Profilo virtual kernel.
The profiling took 6.00 seconds.
Process P0 took 0.000 seconds. It started at 0.000s and ended at 0.000s.
Process P1 took 1.000 seconds. It started at 0.000s and ended at 1.000s.
Process P2 took 2.000 seconds. It started at 1.000s and ended at 3.000s.
Process P3 took 3.000 seconds. It started at 3.000s and ended at 6.000s.
The processor cores (PP0) consumed 18.000 joules (3.000 watts).
The processor package (PKG) consumed 39.000 joules (6.500 watts).

Figure C.4 Profilo Summary for FSP-PS Example

Calculating the mean response time in the same way as for PS would result in (1 + 3 + 6)/3 ≅ 3.3 seconds. This is a 9% reduction in the mean response time. For the sake of simplicity, the speed scaling policy in this example is static, and PS finishes the workload in only 6 time slices. As a result, most of the popular schedulers would have the same overall profiling time and energy consumption. However, in job-count based speed scaling workloads, such as the suite of workloads described in Section 6.1, there would certainly be differences in mean response time, overall profiling time, and energy consumption. This would be driven by the chosen execution frequencies and the number of mode/context switches.


References

[1] M. Agrawal, N. Kayal, and N. Saxena, “PRIMES is in P,” Ann. Math., pp. 781–793, 2004.

[2] V. Aiyar, “Sundaram’s Sieve for Prime Numbers,” Math. Stud., vol. 2, no. 2, p. 73, 1934.

[3] S. Albers, “Energy-Efficient Algorithms,” Commun. ACM, vol. 53, no. 5, pp. 86–96, 2010.

[4] S. Albers and A. Antoniadis, “Race to Idle: New Algorithms for Speed Scaling with a Sleep State,” ACM Trans. Algorithms, vol. 10, no. 2, p. 9, 2014.

[5] S. Albers, F. Müller, and S. Schmelzer, “Speed Scaling on Parallel Processors,” Algorithmica, vol. 68, no. 2, pp. 404–425, 2014.

[6] L. Andrew, M. Lin, and A. Wierman, “Optimality, Fairness, and Robustness in Speed Scaling Designs,” in Proceedings of ACM SIGMETRICS, 2010, pp. 37–48.

[7] A. Atkin and D. Bernstein, “Prime Sieves using Binary Quadratic Forms,” Math. Comput., vol. 73, no. 246, pp. 1023–1030, 2004.

[8] B. Avi-Itzhak and H. Levy, “On Measuring Fairness in Queues,” Adv. Appl. Probab., pp. 919–936, 2004.

[9] N. Bansal, H.-L. Chan, and K. Pruhs, “Speed Scaling with an Arbitrary Power Function,” in Proceedings of ACM-SIAM, 2009, pp. 693–701.

[10] N. Bansal and M. Harchol-Balter, “Analysis of SRPT Scheduling: Investigating Unfairness,” in Proceedings of ACM SIGMETRICS, 2001, vol. 29.

[11] N. Bansal, T. Kimbrel, and K. Pruhs, “Speed Scaling to Manage Energy and Temperature,” J. ACM, vol. 54, no. 1, p. 3, 2007.

[12] N. Bansal, K. Pruhs, and C. Stein, “Speed Scaling for Weighted Flow Time,” SIAM J. Comput., vol. 39, no. 4, pp. 1294–1308, 2009.

[13] R. Bayer, “Symmetric Binary B-trees: Data Structure and Maintenance Algorithms,” Acta Inform., vol. 1, no. 4, pp. 290–306, 1972.

[14] D. Bovet and M. Cesati, Understanding the Linux Kernel. O’Reilly Media, Inc., 2005.

[15] B. Brandenburg, H. Leontyev, and J. Anderson, “An Overview of Interrupt Accounting Techniques for Multiprocessor Real-Time Systems,” J. Syst. Archit., vol. 57, no. 6, pp. 638–654, 2011.

[16] L. Brindley, A. Young, and C. Tan, “Example 15.1. Comparing the Cost of Reading Hardware Clock Sources,” Red Hat Enterprise MRG 2 Realtime Reference Guide. [Online]. Available: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/2/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Timestamping.html#example-Hardware_Clock_Cost_Comparison. [Accessed: 17-Feb-2015].

[17] A. Brown, “Linux/drivers/idle/intel_idle.c,” Linux Cross Reference - Free Electrons. [Online]. Available: http://lxr.free-electrons.com/source/drivers/idle/intel_idle.c. [Accessed: 11-Apr-2015].

[18] A. Brown, “The State of ACPI in the Linux Kernel,” in Linux Symposium, 2004, p. 128.

[19] J. Corbet, A. Rubini, and G. Kroah-Hartman, Linux Device Drivers. O’Reilly Media, Inc., 2005.

[20] H. David, E. Gorbatov, U. Hanebutte, R. Khanna, and C. Le, “RAPL: Memory Power Estimation and Capping,” in 2010 ACM/IEEE International Symposium on Low-Power Electronics and Design (ISLPED), 2010, pp. 189–194.

[21] M. Elahi, C. Williamson, and P. Woelfel, “Decoupled Speed Scaling: Analysis and Evaluation,” Perform. Eval., vol. 73, pp. 3–17, Mar. 2014.

[22] J. Enos, C. Steffen, J. Fullop, M. Showerman, G. Shi, K. Esler, V. Kindratenko, J. Stone, and J. Phillips, “Quantifying the Impact of GPUs on Performance and Energy Efficiency in HPC Clusters,” in Green Computing Conference, 2010 International, 2010, pp. 317–324.

[23] K. Flautner, N. Kim, S. Martin, D. Blaauw, and T. Mudge, “Drowsy Caches: Simple Techniques for Reducing Leakage Power,” in Proceedings of ISCA, 2002, pp. 148–157.

[24] M. Flynn, “Some Computer Organizations and their Effectiveness,” IEEE Trans. Comput., vol. 100, no. 9, pp. 948–960, 1972.

[25] E. Friedman and S. Henderson, “Fairness and Efficiency in Web Server Protocols,” in Proceedings of ACM SIGMETRICS, 2003, pp. 229–237.

[26] A. Gandhi, M. Harchol-Balter, R. Das, and C. Lefurgy, “Optimal Power Allocation in Server Farms,” in Proceedings of ACM SIGMETRICS, 2009, vol. 37, pp. 157–168.

[27] L. Guibas and R. Sedgewick, “Dichromatic Framework for Balanced Trees,” 1978.

[28] E. Hahne, “Round-Robin Scheduling for Max-Min Fairness in Data Networks,” IEEE J. Sel. Areas Commun., vol. 9, no. 7, pp. 1024–1039, 1991.

[29] M. Hähnel, B. Döbel, M. Völp, and H. Härtig, “Measuring Energy Consumption for Short Code Paths Using RAPL,” ACM SIGMETRICS Perform. Eval. Rev., vol. 40, no. 3, pp. 13–17, Jan. 2012.

[30] F. Hu and J. Evans, “Linux Kernel Improvement: Toward Dynamic Power Management of Beowulf Clusters,” in Proceedings of the 8th LCI International Conference on High-Performance Clustered Computing (CDROM), 2007.

[31] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, “Microarchitectural Techniques for Power Gating of Execution Units,” in Proceedings of the 2004 International Symposium on Low Power Electronics and Design, 2004, pp. 32–37.

[32] T. Jones, “Kernel API’s, Part 3: Timers and Lists in the 2.6 Kernel,” 30-Mar-2010. [Online]. Available: https://www.ibm.com/developerworks/library/l-timers-list/. [Accessed: 23-Jun-2016].

[33] K. Kasichayanula, D. Terpstra, P. Luszczek, S. Tomov, S. Moore, and G. Peterson, “Power Aware Computing on GPUs,” IEEE, 2012.

[34] L. Kleinrock, “Time-Shared Systems: A Theoretical Treatment,” J. ACM, vol. 14, no. 2, pp. 242–261, 1967.

[35] H. Lenstra Jr and C. Pomerance, “Primality Testing with Gaussian Periods,” Lect. Notes Comput. Sci., 2011.

[36] C. Li, C. Ding, and K. Shen, “Quantifying the Cost of Context Switch,” in Proceedings of the 2007 Workshop on Experimental Computer Science, 2007.

[37] H. Lieberman, “Using Prototypical Objects to Implement Shared Behavior in Object-Oriented Systems,” ACM Sigplan Not., vol. 21, no. 11, pp. 214–223, 1986.

[38] R. Love, Linux Kernel Development. Pearson Education, 2010.

[39] D. Lu, H. Sheng, and P. Dinda, “Size-Based Scheduling Policies with Inaccurate Scheduling Information,” in Proceedings of IEEE MASCOTS, 2004, pp. 31–38.

[40] M. Mills, “The Cloud Begins with Coal,” Digital Power Group, Aug. 2013.

[41] M. O’Neill, “The Genuine Sieve of Eratosthenes,” J. Funct. Program., vol. 19, no. 01, pp. 95–106, Jan. 2009.

[42] V. Pallipadi, S. Li, and A. Belay, “cpuidle: Do Nothing, Efficiently,” in Proceedings of the Linux Symposium, 2007, vol. 2, pp. 119–126.

[43] D. Petersen, J. Steele, and J. Wilkerson, “WattBot: A Residential Electricity Monitoring and Feedback System,” in CHI’09 Extended Abstracts on Human Factors in Computing Systems, 2009, pp. 2847–2852.

[44] P. Pritchard, “Explaining the Wheel Sieve,” Acta Inform., vol. 17, no. 4, pp. 477–485, 1982.

[45] P. Pritchard, “Improved Incremental Prime Number Sieves,” in Cornell University, 1994, pp. 280–288.

[46] J. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits, vol. 2. Prentice Hall Englewood Cliffs, 2002.

[47] I. Rai, G. Urvoy-Keller, and E. Biersack, “Analysis of LAS Scheduling for Job Size Distributions with High Variance,” in Proceedings of ACM SIGMETRICS, 2003, vol. 31, pp. 218–228.

[48] T. Rauber and G. Rünger, “Energy-Aware Execution of Fork-Join-Based Task Parallelism,” in Proceedings of IEEE MASCOTS, 2012, pp. 231–240.

[49] E. Rotem, A. Naveh, A. Ananthakrishnan, E. Weissmann, and D. Rajwan, “Power-Management Architecture of the Intel Microarchitecture Code-Named Sandy Bridge,” IEEE Micro, vol. 32, no. 2, pp. 20–27, Mar. 2012.

[50] L. Schrage, “A Proof of the Optimality of the Shortest Remaining Processing Time Discipline,” Oper. Res., vol. 16, no. 3, pp. 687–690, 1968.

[51] D. Searls, “Linus & the Lunatics, Part II,” Linux Journal, 23-Nov-2003. [Online]. Available: http://www.linuxjournal.com/article/7279. [Accessed: 31-Mar-2015].

[52] B. Sigoure, “Tsuna’s Blog: How Long Does It Take To Make A Context Switch?,” 14-Nov-2010. [Online]. Available: http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html. [Accessed: 15-Mar-2016].

[53] A. Skrenes and C. Williamson, “Experimental Calibration and Validation of a Speed Scaling Simulator,” in Proceedings of IEEE MASCOTS, 2016.

[54] D. Snowdon, E. Le Sueur, S. Petters, and G. Heiser, “Koala: A Platform for OS-Level Power Management,” in Proceedings of the 4th ACM European Conference on Computer Systems, 2009, pp. 289–302.

[55] D. Snowdon, S. Petters, and G. Heiser, “Accurate On-line Prediction of Processor and Memory Energy Usage Under Voltage Scaling,” in Proceedings of the 7th ACM & IEEE International Conference on Embedded Software, 2007, pp. 84–93.

[56] V. Spiliopoulos, A. Sembrant, and S. Kaxiras, “Power-Sleuth: A Tool for Investigating Your Program’s Power Behavior,” in Proceedings of IEEE MASCOTS, 2012, pp. 241–250.

[57] M. Squillante and E. Lazowska, “Using Processor-Cache Affinity Information in Shared-Memory Multiprocessor Scheduling,” IEEE Trans. Parallel Distrib. Syst., vol. 4, no. 2, pp. 131–143, Feb. 1993.

[58] C. Stolte, R. Bosche, P. Hanrahan, and M. Rosenblum, “Visualizing Application Behavior on Superscalar Processors,” in 1999 IEEE Symposium on Information Visualization (InfoVis ’99) Proceedings, 1999, pp. 10–17, 141.

[59] The Climate Group on behalf of the Global eSustainability Initiative (GeSI), “Smart 2020 Report: Global ICT Solution Case Studies,” 2008. [Online]. Available: http://www.smart2020.org/publications/. [Accessed: 10-Jun-2016].

[60] V. Weaver, M. Johnson, K. Kasichayanula, J. Ralph, P. Luszczek, D. Terpstra, and S. Moore, “Measuring Energy and Power with PAPI,” in 2012 41st International Conference on Parallel Processing Workshops (ICPPW), 2012, pp. 262–268.

[61] M. Weiser, B. Welch, A. Demers, and S. Shenker, “Scheduling for Reduced CPU Energy,” in Mobile Computing, Springer, 1994, pp. 449–471.

[62] A. Wierman, L. Andrew, and A. Tang, “Power-Aware Speed Scaling in Processor Sharing Systems,” in Proceedings of IEEE INFOCOM, 2009, pp. 2007–2015.

[63] A. Wierman and M. Harchol-Balter, “Classifying Scheduling Policies with Respect to Unfairness in an M/GI/1,” in Proceedings of ACM SIGMETRICS, 2003, vol. 31, pp. 238–249.

[64] W. Wulf and M. Shaw, “Global Variable Considered Harmful,” ACM Sigplan Not., vol. 8, no. 2, pp. 28–34, 1973.

[65] F. Yao, A. Demers, and S. Shenker, “A Scheduling Model for Reduced CPU Energy,” in Proceedings of Annual Symposium on Foundations of Computer Science, 1995, pp. 374–382.

[66] “5th Generation Intel® Core™ Processor Family Datasheet, Vol. 1.”

[67] “ACPI Specification Version 5.1,” Unified Extensible Firmware Interface Forum. [Online]. Available: http://www.uefi.org/sites/default/files/resources/ACPI_5_1release.pdf. [Accessed: 18-Feb-2015].

[68] “include/uapi/asm-generic/param.h,” Linux Cross Reference - Free Electrons. [Online]. Available: http://lxr.free-electrons.com/source/include/uapi/asm-generic/param.h#L5. [Accessed: 17-Feb-2015].

[69] “Intel® 64 and IA-32 Architectures Software Developer Manuals,” Intel. [Online]. Available: http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html. [Accessed: 14-Feb-2015].

[70] “Intersil ISL95813 Single Phase Core Controller for VR12.6 Datasheet.” Intersil, 15-May-2013.

[71] “Kill A Watt EZ (P4460) Manual.” P3 International Corporation.

[72] “Mobile 3rd Generation Intel® Core™ Processor Family: Datasheet, Vol. 1.”

[73] “P3 Kill A Watt EZ Power Monitor,” P3 International Corporation. [Online]. Available: http://www.p3international.com/products/p4460.html. [Accessed: 29-Mar-2015].

[74] “Report to Congress on Server and Data Center Energy Efficiency,” Aug-2007. [Online]. Available: https://www.energystar.gov/buildings/tools-and-resources/report-congress-server-and-data-center-energy-efficiency-opportunities. [Accessed: 10-Jun-2016].

[75] “Softlockup Detector and Hardlockup Detector (aka nmi_watchdog),” The Linux Kernel Archives. [Online]. Available: https://www.kernel.org/doc/Documentation/lockup-watchdogs.txt. [Accessed: 13-Feb-2015].