Multidimensional Load Balancing and Finer Grained Resource Allocation Employing Online Performance Monitoring Capabilities

A thesis presented to the faculty of the Russ College of Engineering and Technology of Ohio University

In partial fulfillment of the requirements for the degree Master of Science

Jacob A. Cooper August 2015

© 2015 Jacob A. Cooper. All Rights Reserved.

This thesis titled Multidimensional Load Balancing and Finer Grained Resource Allocation Employing Online Performance Monitoring Capabilities

by JACOB A. COOPER

has been approved for the School of Electrical Engineering and Computer Science and the Russ College of Engineering and Technology by

Frank Drews
Assistant Professor of Electrical Engineering and Computer Science

Dennis Irwin
Dean, Russ College of Engineering and Technology

Abstract

COOPER, JACOB A., M.S., August 2015, Computer Science

Multidimensional Load Balancing and Finer Grained Resource Allocation Employing Online Performance Monitoring Capabilities (99 pp.)

Director of Thesis: Frank Drews

The development of increasingly accurate Proportional Share Schedulers (PSS) over recent years has allowed for reduced jitter and improved Quality of Service (QoS) for clients competing for computing system resources. Having originated in packet-switching networks, PSS constructs have been extended to the operating system's time sharing of system CPUs between competing system tasks. A great deal of research has concentrated on fairness with respect to tasks local to the same run queue, and has been successful in providing bounds limiting the discrepancy between ideal and actualized CPU utilization. Providing fairness in multiprocessor systems introduces added complexity and inherently weakens the fairness guarantees available on uniprocessor counterparts. System developers and hardware manufacturers alike have long strived to provide Symmetric Multiprocessing (SMP) computing with equivalent performance across Processing Elements (PE), though they are mathematically constrained by the problem construct. In order to provide equivalent performance, each PE must receive equal work, even though computing work is not infinitesimally divisible. The problem therefore relies on optimizing the potentially infeasible Partition Problem, a variant of the NP-Complete subset sum problem. Task processing requirements extend beyond CPU utilization, and include use of execution units, caches, and buses within the processor. These resources are generally allocated indirectly, their use following from scheduled CPU time. This work focuses on addressing this additional complexity in the fair allocation of multiprocessor systems by describing a method which provides online profiling of some of these resources, and a multidimensional load balancing technique aimed at increasing fairness by reducing contention on one type of finer grained resource.

Table of Contents


Abstract

List of Tables

List of Figures

List of Acronyms

1 Introduction
  1.1 Motivating Practical Example
  1.2 Contributions

2 Background and Related Work
  2.1 Generalized Processor Sharing
  2.2 Load Balancing
    2.2.1 Subset Sum Problem
    2.2.2 Partition Problem
    2.2.3 Subset Sum and Partition Problem Literature
    2.2.4 Infeasible Task Weights
    2.2.5 Load Balancing in the 3.2 Linux Kernel
  2.3 Task Scheduling
    2.3.1 Task Classes
      2.3.1.1 Batch Tasks
      2.3.1.2 Interactive Tasks
      2.3.1.3 Real-Time Tasks
    2.3.2 Dynamic System Events
  2.4 Proportional and Fair Share Scheduling
    2.4.1 Earliest Eligible Virtual Deadline First
    2.4.2 Completely Fair Scheduler
    2.4.3 Red Black Binary Search Tree
    2.4.4 Distributed Weighted Round-Robin
    2.4.5 Additional Proportional Share Algorithms
  2.5 Performance Monitoring Counters

3 Methodology
  3.1 Motivation
  3.2 Synopsis
    3.2.1 Partition Optimization Problem
    3.2.2 Partition Load Range
    3.2.3 Mean Deviation
    3.2.4 Variance
  3.3 Finer Grained Resources
    3.3.1 Multidimensional Load Balancing and Partition Problem
    3.3.2 Interdependent Multidimensional Load Balancing

4 Testing Framework and Experiments
  4.1 Experimental Environment
  4.2 Initial Considerations and Experiments
  4.3 Performance Monitoring Overview
    4.3.1 Performance Monitoring Profiling Overhead
    4.3.2 Performance Monitoring in Linux
  4.4 Linux Scheduling Classes
  4.5 Performance Scheduler Class
  4.6 Performance Monitoring Kernel Modules
    4.6.1 Controlling Performance Event Counters
    4.6.2 Reading Performance Event Data
    4.6.3 Performance Scheduler Debug Module
  4.7 Performance Scheduling Class Tests
    4.7.1 Performance Monitoring Accuracy
    4.7.2 Performance Monitoring Overhead
  4.8 Simulation Tests
    4.8.1 Simulator Design
      4.8.1.1 Experimental Resource Dependent Balanced Load Partitioning Algorithm
      4.8.1.2 Scalability of Finely Grained Resource Partitioning
      4.8.1.3 Dynamics of Finely Grained Resource Partitioning
      4.8.1.4 Simulator Experimentation Evaluation
    4.8.2 Simulator Experimental Results - Dynamic Load
    4.8.3 Multidimensional Load Balancing Effects on Fairness

5 Conclusions and Future Work
  5.1 Conclusions
  5.2 Future Work

References

Appendix A: Algorithms

Appendix B: Additional Test Results and Figures

List of Tables


4.1 Baseline Contrived Load
4.2 Imbalanced Contrived Load
4.3 Balanced Contrived Load
4.4 Interbench Result: Audio Load
4.5 Work Factor Table
4.6 Simulator Test Results - Multiple Test Instances
4.7 Simulator Test Results - Ratio Range Test
4.8 Simulator Test Results - Repeated Consistency Test

B.1 Video Load Interbench: Vanilla Scheduler vs Performance Monitoring Delta
B.2 X Load Interbench: Vanilla Scheduler vs Performance Monitoring Delta
B.3 Gaming Load Interbench: Vanilla Scheduler vs Performance Monitoring Delta

List of Figures


2.1 Red-Black Binary Search Tree

4.1 PERFEVTSELx Bitfields
4.2 PERF_GLOBAL_CTRL Bitfields
4.3 Performance Monitoring Counters (PMC) Selection Pseudocode
4.4 PMC Collection Pseudocode
4.5 Linux Scheduling Class Priority List - Modified Priority Marked by Dashed Arrows
4.6 PMC Accuracy Test Pseudocode
4.7 Baseline Balancing Dynamic Load Generation - Seed 0 Number of Migrations and Tasks
4.8 Simulator Test Results Repeated Consistency Normal Distribution Probability Density
4.9 Standard Deviation Percent Imbalance - 4 Queues Balancing through Steady State Static Load Generation - Seed 893756

A.1 Distributed Partitioning Dynamic Programming Algorithm
A.2 Distributed Partitioning Algorithm - Priority Dimension - Part 1
A.3 Distributed Partitioning Algorithm - Priority Dimension - Part 2
A.4 Distributed Partitioning Algorithm - Second Dimension - Part 1
A.5 Distributed Partitioning Algorithm - Second Dimension - Part 2
A.6 Baseline Balancing Algorithm

B.1 cpuid Command Results
B.2 Baseline Balancing to Steady State Static Load Generation Efficiency Over Time
B.3 Baseline Balancing to Steady State Static Load Generation - Seed 893756 Number of Migrations and Tasks
B.4 Multidimensional Balancing to Steady State Static Load Generation Efficiency Over Time
B.5 Multidimensional Balancing to Steady State Static Load Generation - Seed 893756 Number of Migrations and Tasks
B.6 Standard Deviation Percent Imbalance - 8 Queues Balancing through Steady State Static Load Generation - Seed 893756
B.7 Standard Deviation Percent Imbalance - 16 Queues Balancing through Steady State Static Load Generation - Seed 893756
B.8 Standard Deviation Percent Imbalance - 32 Queues Balancing through Steady State Static Load Generation - Seed 893756
B.9 Dynamic Efficiency Time Plot Seed 0
B.10 Dynamic Efficiency Time Plot Seed 478192

B.11 Dynamic Efficiency Time Plot Seed 584371
B.12 Dynamic Efficiency Time Plot Seed 26553280
B.13 Dynamic Efficiency Time Plot Seed 2715362
B.14 Dynamic Efficiency Time Plot Seed 2910919
B.15 Dynamic Efficiency Time Plot Multidimensional Balancing - Seed 0
B.16 Dynamic Efficiency Time Plot Multidimensional Balancing - Seed 478192
B.17 Dynamic Efficiency Time Plot Multidimensional Balancing - Seed 584371
B.18 Dynamic Efficiency Time Plot Multidimensional Balancing - Seed 26553280
B.19 Dynamic Efficiency Time Plot Multidimensional Balancing - Seed 2715362
B.20 Dynamic Efficiency Time Plot Multidimensional Balancing - Seed 2910919

List of Acronyms

SMP Symmetric Multiprocessing
PE Processing Element
PSS Proportional Share Scheduler
QoS Quality of Service
PMC Performance Monitoring Counters [12, 14]
EEVDF Earliest Eligible Virtual Deadline First [29]
CFS Completely Fair Scheduler [22]
DWRR Distributed Weighted Round-Robin [21]
LLC Last Level Cache
ALU Arithmetic Logic Unit
FPU Floating Point Unit
GPS Generalized Processor Sharing [26]
SSP Subset Sum Problem
SSP-D Subset Sum Decision Problem
PP Partition Problem
PP-D Partition Decision Problem
SSP-OPT Subset Sum Optimization Problem
PTAS Polynomial-Time Approximation Scheme
PP-OPT Partition Optimization Problem
BKL Big Kernel Lock [4, 5, 22]
BST Binary Search Tree
SFQ Start-time Fair Queuing
BBRR Bit-by-bit Round-Robin
WFQ Weighted Fair Queuing
MSR Model Specific Registers [12, 14]
ISA Instruction Set Architecture [12]
SFS Surplus Fair Scheduling
NUMA Non-Uniform Memory Access
MPP-D Multidimensional Partition Decision Problem
MPP Multidimensional Partition Problem

MMU Memory Management Unit
API Application Programming Interface

1 Introduction

Modern computing systems operate by allocating finite resources to a competing set of tasks. These resources include computational, storage, and communicative circuits. The allocation of computing resources to competing workloads is accomplished in modern systems by first partitioning workloads among redundant processing cores, and then time multiplexing, or sharing, their execution capabilities. These resource allocation schemes are referred to as load balancing and task scheduling respectively, both of which are commonplace in multitasking Operating Systems (OS) capable of executing dynamic task sets on multiprocessor systems. The end goal of computational resource allocation within this scope is to dynamically allocate the system's computational capacity to all competing tasks with respect to their computational requirements. Systems must be devised in such a way as to quantify the computational resources each task will receive. With the exception of systems dedicated to executing static task sets, the OS is without the ability to predict future resource requirements. Without execution deadlines, tasks described as non-real-time must be assigned a relative importance to distinguish their resource requirements. This quantity is synonymously referred to as a priority or weight, and is equivalent in UNIX-like systems to the nice value. The OS may also predict resource requirements by profiling tasks during runtime. One such example is assigning a resource premium to tasks identified as interactive in nature, with respect to execution dependencies on user input and output, in order to provide lower latency. Profiling duties are most naturally performed during task scheduling, for the context switch presents the unique opportunity to attribute recent system events, as is done with the accounting of tasks' resource shares. The task scheduler has the responsibility of time sharing individual Processing Elements in an OS which supports multitasking. Scheduling decisions may make use of task profiling in the attempt to optimize performance in terms of latency, jitter, or throughput. Task scheduling has been well researched by both academics and operating system developers. Much of the recent research discussed in the following chapter deals with providing guaranteed fair shares to tasks with respect to their relative priority. This provides the motivation for Proportional Share Schedulers. PSS have been shown to provide bounds limiting resource allocation error while providing flexible scheduling notwithstanding a multitude of dynamic system events. Assuming the system has the ability to assign the relative resource dependency of each task in the form of weights, modern multiprocessing systems rely on load balancing schemes to proportionally partition the workload, quantified by the summation of task weights, among resources according to their computational requirements and abilities respectively. Unfortunately the task partitioning decision problem is NP-Complete, giving no guarantee that an optimal solution exists which will result in equilibrium. Many load balancing schemes assume the system operates according to the SMP specification, in which each processor behaves equivalently with respect to computational and communicative ability. Unfortunately, modern systems are designed with increasingly complex technologies which behave asymmetrically.
Such technologies which result in asymmetric processing include multiprocessors with independently dynamic frequency scaling, clusters without point-to-point interconnects, and architectures designed to integrate multiple processing elements with different computational performance. Resource allocation within such asymmetric multiprocessing systems was the focus of previous work by Dunn [10], which attempts to proportionally partition task loads between processors dependent upon their relative performance. As tasks are generally partitioned along a single dimension, specifically their relative computational requirements, the resulting partitioning is likely to ignore any overhead incurred due to interdependencies between concurrently executing software threads running on different hardware threads that share the same physical computational circuitry. Such a hardware layout is commonly referred to as multithreading, or hyperthreading as termed by Intel. Since at least version 3.2, the Linux kernel has made a modest attempt to distinguish between physical and logical processing elements when balancing system load. By ordering the CPU identifiers with respect to physical and logical hardware threads, and migrating tasks during active load balancing accordingly, smaller loads are balanced while distinguishing between physical cores and hardware threads. An example of the ordering is discussed in subsequent chapters and provided as Figure B.1. Due to interdependencies in resources at a finer grain than the CPU, which may introduce variable processing rates, current models and algorithms lack the ability to provide fair allocations of resources while solely considering time shares. To provide for the graceful degradation of task performance, it is imperative to consider resource interdependencies and contention. One favorable method to measure resource allocation fairness is that of lag. Lag is defined as the difference between some ideal resource allocation time and its actualized utilization time. The argument could be made that, due to interdependencies which cause varying processing throughput, measurements based on time alone are not truly fair unless processing rates are considered. By reducing the possible variation in processing rates through consideration of contended resources, utilization time becomes more meaningful and results in fairer systems.

1.1 Motivating Practical Example

In recent years, a momentous shift arrived in computing which saw the end of ever-increasing clock rates in favor of multiple processing cores being manufactured on a single silicon chip. Processing throughput is now dependent upon the operation of multiple redundant execution units; it is therefore crucial for software developers to identify strengths and weaknesses in hardware and to take into consideration the interdependent relationships of their programs. In addition, users have steadily demanded systems capable of an ever-increasing set of functionality. It is not uncommon for a user to be requesting their hardware resources through dozens of processes, frequently without their knowledge. A single web page may well embed a variety of elements with various resource requirements. On the other end, a web server must serve that same variety of elements to the user. These elements may include computationally expensive visual or video elements, requiring database accesses, decryption, or decoding, as well as the handling of service requests in the form of network traffic. Both systems are similar in that the tasks required of them are vast, frequently consisting of small jobs with narrow resource requirements. These jobs, or tasks, are broken down by software application developers for a number of reasons, including simplified development and improved performance. As these tasks have various resource requirements, if we are able to distribute their system load in such a way that tasks with like dependencies execute on different hardware or at different times, contention for resources will be reduced and a higher execution throughput will likely follow. This work studies this idea and attempts to make a case for profiling and reducing contention along a single resource, namely the Floating Point Unit (FPU). Many tasks either require heavy use of floating point instructions, as in the case of multimedia and gaming, or do not, as in the case of compression and compilation. Therefore there exists a sound motivation to investigate multidimensional load balancing algorithms and their implications.

1.2 Contributions

This work focuses on the allocation of resources denoted as fine grained due to their lack of one-to-one correspondence with respect to hardware threads. Hardware threads are either physical or logical, and are capable of executing a single software thread as well as being individually scheduled by the OS. Tasks heavily dependent upon shared fine grained resources are, under various degrees of competition, very likely to vary in performance, most notably throughput. Unnecessary competition results from poor balancing of such tasks among these resources, negatively affecting performance. Finer grained resources are dependent upon system architecture, and therefore their consideration in resource allocation must be system dependent. Examples of finer grained resources include those shared among hardware threads, including the Last Level Cache (LLC), and execution units such as the Arithmetic Logic Unit (ALU) and FPU. Presented is a method to utilize a feature currently available in commercial hardware to perform online task profiling by the OS. Performance Monitoring Counters (PMC) allow real time analysis of finer grained resource utilization with trivial overhead, which is ideal for use in kernel space. Their ease of use was the deciding factor in their inclusion in this work, though any low overhead method of profiling may be substituted and is left for future work. As the scope of this work considers the fair allocation of system resources, Proportional Share Schedulers (PSS) are discussed. Academic research on the Earliest Eligible Virtual Deadline First (EEVDF) and Distributed Weighted Round-Robin (DWRR) algorithms is discussed along with the Completely Fair Scheduler (CFS) algorithm used within the Linux kernel since version 2.6.23. The accounting of fine grained resource utilization was implemented and tested by creating a scheduling class within the Linux kernel version 3.2 under Ubuntu 12.04 LTS. Additional modules and system calls were implemented to allow for debugging and data collection. The modifications to the kernel intend to demonstrate the effectiveness and viability of utilizing PMC registers while incurring limited overhead.

Load balancing analysis took place utilizing a resource allocation simulator, which was designed to incorporate various load patterns to demonstrate the effectiveness of load balancing while considering a finer grained resource. As a baseline, task sets were also evaluated while load balancing ignored the finer grained resource. Synthetic loads were generated to perform testing under both static and dynamic conditions, which allows the examination of a system having the ability to reach a steady state or being forced to perpetually modify its load partitioning. The overall goal of this work is to profile the utilization of a finer grained resource under contention, and to distribute system load accordingly to reduce that contention. By reducing contention, the stabilization of processing rates allows for fairer allocation under models which account for resource time utilization. In addition, reduced contention typically correlates with greater throughput. Details are provided in following sections for the major contributions, which include:

• Show that tasks with a high degree of utilization of some finely grained resource, such as those with intensive ALU or FPU dependencies, perform best without competition from tasks with similar characteristics.

• Demonstrate that Performance Monitoring Counters may be used to obtain finely grained resource metrics with nominal overhead.

• Introduce a multidimensional load balancing algorithm which attempts to improve throughput while only modestly affecting fairness.

• Demonstrate, under simulated loads, that multidimensional load balancing is feasible and may provide increased throughput over standard load balancing techniques, using metrics obtained by actual, though contrived, tests.

• Implement a Linux scheduling class to monitor finely grained resource utilization and schedule tasks specified by the system administrator's invocation of supporting implemented modules, based on Ubuntu 12.04 LTS and the 3.2 Linux kernel. Load balancing has as yet been left for future work.

The remainder of the thesis is laid out as follows. Chapter 2 discusses previous and related work, including load balancing and its impact on Proportional Share Schedulers based on the Generalized Processor Sharing model, and introduces Performance Monitoring Counters. Chapter 3 details the motivation for the work along with the formulation of a methodology which expands previous work on the fair allocation of resources by introducing a method to consider the profiling of tasks by finer grained resources. Chapter 4 describes the framework developed to perform a number of experiments and their results. Chapter 5 contains conclusions and proposes future work.

2 Background and Related Work

Multitasking on multiprocessing systems has become the norm in meeting modern demands of computational throughput. This is true from the largest supercomputing clusters to handheld devices such as smartphones and tablets. Computing resources are allocated within these systems by time sharing each resource among a subset of competing tasks, or discrete units of work.

2.1 Generalized Processor Sharing

Fairness with respect to resource allocation algorithms considers the deviation from an ideal allocation. Generalized Processor Sharing (GPS) [26] is a scheme which models the ideal allocation of multiple resources to competing clients. In keeping with the scope of this work, we will be considering multiprocessors and tasks, although GPS was originally introduced with respect to networking [26]. We begin our discussion by presenting a unifying notation for clarity. Given a system wide set of tasks $T = \{0, 1, \ldots, n-1\}$, each task $i$ has an associated weight denoted $w_i$. As the scope of this work considers multitasking resource allocation within a multiprocessor system, we denote by $K$ the number of system hardware threads. We then define the load $L(S)$ of some general task set $S \subseteq T$ as

$$L(S) = \sum_{i \in S} w_i. \tag{2.1}$$

A GPS system is characterized in general for a system task set $T$ as:

$$\frac{S_i(a, b)}{S_j(a, b)} = \frac{w_i}{w_j} \quad \forall i, j \in T \tag{2.2}$$

$$S_i(a, b) = \frac{w_i}{L(T)} \cdot (b - a) \cdot K \quad \forall i \in T,$$

where $S_i$ and $S_j$ represent the ideal resource allocation for tasks $i$ and $j$ between time points $a$ and $b$, and $K$ denotes the number of processors. Equation (2.2) defines proportional shares with respect to given task weights and provides an ideal against which resource allocation algorithms may be measured. It should be noted that (2.2) assumes the task set $T$ is static over the interval $(a, b)$. A task set is said to be static if no new tasks are created, no task exits the system, and no task has a change in weight. Unfortunately resource allocation must anticipate these dynamic system events, therefore modified notation will be introduced in later sections that discuss resource allocation under dynamic events.
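As a brief worked illustration of Equation (2.2) (the numbers here are chosen purely for exposition and appear nowhere else in this work), consider $K = 2$ hardware threads and three tasks with weights $w_0 = 1$, $w_1 = 2$, and $w_2 = 3$, so that $L(T) = 6$. Over the interval $(a, b) = (0, 6)$ the ideal GPS shares are

$$S_i(0, 6) = \frac{w_i}{6} \cdot 6 \cdot 2 = 2 w_i,$$

yielding shares of $2$, $4$, and $6$ time units respectively, which together exhaust the $K \cdot (b - a) = 12$ units of capacity supplied by the two processors over the interval.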

2.2 Load Balancing

Load balancing refers to determining and assigning disjoint subsets of competing clients to resources. In this context, disjoint task subsets form partitions, each of which is assigned to a single hardware thread. Partitions may be well defined in separate data structures or stipulated in an impromptu manner. The partitioning of tasks may also be determined centrally or by a distributed algorithm. Examples of each have been well documented in the Linux kernel. Multitasking support on multiprocessors was originally provided by Linux with an impromptu approach which centrally distributed workload across hardware threads [5]. A single non-concurrent scheduling thread was capable of running on each hardware thread. Each hardware thread accessed a single data structure, called a run queue, which contained all tasks within the system. This approach was eventually deemed inefficient, as a system wide lock was required to maintain non-concurrent operation and protect the run queue from race conditions. A single run queue is also inefficient in terms of caching. As tasks execute, their code and data are cached for speedier retrieval. Thus if a task migrates between hardware threads that do not share caches, cached data is invalidated and further execution may not take advantage of faster caches, but must resort to retrieving data from slower memory locations. Therefore, tasks said to be cache hot, having a large amount of their data cached, were identified with additional heuristics in an attempt to prevent their migration. To eliminate the problems of the impromptu central load distribution under early Linux implementations, a distributed algorithm was developed to explicitly partition tasks across hardware threads. Capable of concurrent operation, hardware threads individually maintain run queues from which tasks are scheduled [5, 22]. Run queues are balanced by migrating tasks between pairs of run queues, from the overly burdened to the under burdened. Balancing run queues is of particular interest as the problem is an extension of the subset sum problem.

2.2.1 Subset Sum Problem

The Subset Sum Problem (SSP) is a classical example of the NP-Complete set of problems in terms of computational complexity. Featuring a wide number of practical uses, SSP has been widely researched and is considered a variant of the popular knapsack problem. SSP attempts to create a subset of integers the sum of which equals some target capacity $c$. Utilizing the presented notation in the context of task weights and loads, the decision form, denoted SSP-D, is defined formally for target capacity $c$ as:

$$\mathrm{SSP\text{-}D}(T, c) = \begin{cases} \text{true} & \text{if } \exists S \subseteq T : L(S) = c \\ \text{false} & \text{otherwise.} \end{cases} \tag{2.3}$$

Equation (2.3) considers a set of tasks $T$ and a target capacity $c$, and decides whether there exists some subset $S \subseteq T$ whose load equals the target capacity, i.e. $L(S) = c$. The Subset Sum Decision Problem (SSP-D) is useful in determining the computational complexity of load balancing, particularly in the case $K = 2$. Practical solutions, though, must consider the optimization version in order to provide for the case where $\mathrm{SSP\text{-}D}(T, c) = \text{false}$. We define the optimization variant of the subset sum problem for target capacity $c$ as

$$\mathrm{SSP\text{-}OPT}(T, c) = S \subseteq T \text{ such that } L(S) = \max\left(\{L(V) : V \subseteq T \wedge L(V) \le c\}\right). \tag{2.4}$$

Equation (2.4) considers a set of tasks $T$ and a target capacity $c$, and optimally identifies the subset $S \subseteq T$ with the maximal load $L(S)$ bounded by $c$. A consequence of SSP-D being within the class of NP-Complete decision problems is that its optimization variant, the Subset Sum Optimization Problem (SSP-OPT), falls within the NP-Hard class of problems.

2.2.2 Partition Problem

The Partition Problem is closely related to SSP and considers a $K$-partitioning, or set of disjoint subsets, which is defined formally [20] for a partitioning $P$ as:

$$P = \{P_0, P_1, \ldots, P_{K-1}\} \text{ such that } \bigcup_{P_i \in P} P_i = T \text{ and } P_i \cap P_j = \emptyset \quad \forall P_i, P_j \in P,\ i \neq j. \tag{2.5}$$

Equation (2.5) defines a $K$-partitioning as a set of disjoint subsets of the original task set $T$ whose union equals $T$. The Partition Problem then considers an equilateral partitioning $P$ of the system task set $T$. Formally, the decision variant PP-D is given as:

$$\mathrm{PP\text{-}D}(T, K) = \begin{cases} \text{true} & \text{if } \exists P : \forall P_i \in P,\ L(P_i) = \frac{L(T)}{K} \\ \text{false} & \text{otherwise.} \end{cases} \tag{2.6}$$

Equation (2.6) considers a task set $T$ and determines if there exists some partitioning $P$, as defined by Equation (2.5), containing partitions with equivalent load.

Extending the decision version SSP-D to the Partition Decision Problem (PP-D) is trivial for $K = 2$ by defining the capacity as $c = \frac{L(T)}{2}$. Therefore we can state that PP-D is at least as hard as SSP-D. As was said concerning $\mathrm{SSP\text{-}OPT} \in$ NP-Hard, the same can be said of any optimization variant of the Partition Problem (PP), denoted the Partition Optimization Problem (PP-OPT). Therefore $\mathrm{PP\text{-}D} \in$ NP-Complete implies $\mathrm{PP\text{-}OPT} \in$ NP-Hard. Defining a practical optimization of PP is less trivial than SSP-OPT and requires consideration of fairness in the case $\mathrm{PP\text{-}D}(T, K) = \text{false}$; it will thus be considered in following chapters.

2.2.3 Subset Sum and Partition Problem Literature

Dynamic programming algorithms exist which solve the SSP in pseudo-polynomial time with respect to set cardinality and element range [16, 25]. The pseudocode of such an example, used in the early stages of the resource allocation simulator, is presented in Figure A.1 in Appendix A; a compact sketch of the classical dynamic program is also given below. In addition, there exist Polynomial-Time Approximation Scheme (PTAS) algorithms to solve both SSP-OPT and PP-OPT [6]. Although these algorithms are not generally considered computationally complex for a problem in the class NP, they exceed acceptable limits within kernel context by exceeding sub-linear computational complexity. A great deal of research in this area is focused on global balancing, which incurs an unacceptable task migration overhead with respect to invalidating cached program data and updating run queue data structures. Distributed set partitioning optimization algorithms allow for the elimination of overheads exacerbated by centralized set partitioning while benefiting from parallelism.
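The following is a minimal C sketch of the classical pseudo-polynomial dynamic program for SSP-OPT, in the spirit of the algorithm shown in Figure A.1; the function and variable names are illustrative and are not those used by the simulator.

#include <stdbool.h>
#include <stdlib.h>

/*
 * Pseudo-polynomial dynamic program for SSP-OPT: given task weights
 * w[0..n-1] and a target capacity c, return the maximal achievable load
 * L(S) <= c.  reachable[v] records whether some subset of the weights
 * sums exactly to v.  Runs in O(n * c) time and O(c) space.
 */
static long ssp_opt(const long *w, int n, long c)
{
    bool *reachable = calloc(c + 1, sizeof(bool));
    long best = 0;

    if (!reachable)
        return -1;                  /* allocation failure */

    reachable[0] = true;            /* the empty subset has load 0 */
    for (int i = 0; i < n; i++) {
        /* iterate downward so each weight is used at most once */
        for (long v = c; v >= w[i]; v--) {
            if (reachable[v - w[i]])
                reachable[v] = true;
        }
    }
    for (long v = c; v >= 0; v--) {
        if (reachable[v]) {         /* largest reachable sum <= c */
            best = v;
            break;
        }
    }
    free(reachable);
    return best;
}

The downward inner loop is what distinguishes the 0/1 (each task used at most once) variant from the unbounded one, and the O(n · c) bound is exactly the pseudo-polynomial complexity that makes this approach unattractive in kernel context.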

2.2.4 Infeasible Task Weights

Within the subset of task sets which contain no ideal partitioning, infeasibly weighted task sets exist in which every possible partitioning leaves the system unable to provide a proportional share of computational capability to a subset of system tasks. A task $i \in T$ with associated weight $w_i$ is considered infeasible in SMP systems if the following condition holds:

$$\exists i \in T \text{ such that } \frac{w_i}{L(T)} > \frac{1}{K}. \tag{2.7}$$

An infeasible weight assignment symbolizes an idealized resource share which is impossible to supply given the number of hardware threads. Chandra et al. [7] presented an algorithm that reweighs tasks, up to $(K - 1)$ tasks in total, to feasible weights. This process is costly as it requires knowledge of the $(K - 1)$ highest weighted tasks within the entire system, and must be reconsidered upon any dynamic event which alters the system workload $L(T)$ [21]. The algorithm presented by Chandra et al. [7] has a single major drawback for any practical resource allocation implementation. The requirement for the online reweighing of the $(K - 1)$ possibly infeasible tasks, which must be considered upon any change to task weights, task creation, and exiting, is extremely costly. The system must first obtain the number of infeasible tasks and their weights, calculate the remaining system load, and reweigh each such task executing on separate processors. Clearly on Non-Uniform Memory Access (NUMA) systems, the cost of this synchronization outweighs the benefits of obtaining information which solely aids in accounting a fair share under the GPS model. A novel alternative would be to simply modify any resource accounting method so that an infeasibly weighted task is recorded as having received its proper share, equivalent to the full proportion of a single one of the system's hardware threads. This method is appropriate for many if not all PSS, and certainly any included in this chapter. In such a system, the remaining feasibly weighted tasks may be allocated accordingly with knowledge of simply the number, and the original weights, of the infeasible tasks. This information may be transmitted upon load balancing triggers, which would occur naturally in the case of an arriving or departing infeasibly weighted task. Such an event would require task migrations to or from run queues composed of feasibly weighted tasks. Additionally, any effective load balancing implementation would isolate infeasibly weighted tasks, individually mapped to a run queue and hardware thread; thus a task scheduler would readily have knowledge of the task weight infeasibility. Although the implementation of task reweighing is impractical, its use for verifying fairness bounds for PSS is surely useful in the consideration of boundary task set cases.
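As a small illustration of the feasibility test in Equation (2.7), the following C sketch flags infeasible tasks and caps their effective weight at a full hardware thread's proportion of the total load, in the spirit of the accounting alternative described above. The struct layout and names are hypothetical and do not come from this work's implementation.

/*
 * Mark tasks whose ideal share w_i / L(T) exceeds 1/K as infeasible and
 * account for them as if granted exactly one hardware thread, i.e. an
 * effective weight of L(T)/K.  Hypothetical sketch, not kernel code.
 */
struct sim_task {
    long weight;       /* assigned weight w_i           */
    long eff_weight;   /* weight used for accounting    */
    int  infeasible;   /* set when Equation (2.7) holds */
};

static void cap_infeasible(struct sim_task *tasks, int n, int nr_threads)
{
    long total = 0;

    for (int i = 0; i < n; i++)
        total += tasks[i].weight;          /* L(T) */

    for (int i = 0; i < n; i++) {
        /* w_i / L(T) > 1/K  <=>  w_i * K > L(T), avoiding division */
        tasks[i].infeasible = (tasks[i].weight * nr_threads > total);
        tasks[i].eff_weight = tasks[i].infeasible
                              ? total / nr_threads
                              : tasks[i].weight;
    }
}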

2.2.5 Load Balancing in the 3.2 Linux Kernel

The Linux kernel has relied on a distributed load balancing implementation since developers determined the need to eliminate the Big Kernel Lock (BKL). The BKL was originally implemented in early kernels to protect kernel data against concurrent access to critical sections of code [22, p. 223]. As of Linux version 2.6.39 the BKL has been removed in favor of finer grained locking mechanisms [8], which allow for greater multiprocessing performance. The load balancing algorithm in the Linux kernel is essentially a distributed approximation algorithm. Run queues are balanced upon executing code which migrates task subsets between pairs of run queues. In this fashion, the PP is approximated by potentially concurrent executions of an SSP-OPT implementation, under the condition that tasks may only migrate in a single direction. The approximation based on local run queues eliminates the additional calculation incurred by finding an optimal global solution and reduces overhead by limiting task migrations. The implementation essentially propagates load balancing duties by distributing the imbalance recursively. The load balancing implementation has multiple triggers, though it is largely implemented in the run_rebalance_domains function invoked periodically by the task scheduler upon software interrupt [4, p. 287]. In both cases, the load_balance function attempts to find the busiest run queue with the find_busiest_group function. If such a run queue is non-local, both run queues' data structures are protected by spin locks. If the locks are obtained successfully, a suitable subset of tasks is migrated from the overloaded run queue by the pull_task function. Similarly, during scheduling, any idle processor will attempt to pull tasks to schedule from non-local run queues in an effort to balance load and maintain work conserving scheduling. This may require active load balancing, which is triggered from one kernel thread to be performed on another CPU, preempting its current operation in an effort to migrate tasks to avoid physical and logical imbalances.¹ Upon the creation of a new task with either the fork or execve system calls, the operating system has a unique and optimal opportunity to balance the newly created system load. Similarly, tasks waking with the try_to_wake_up function, after waiting for I/O for example, must be placed on a run queue. In both cases, these tasks are not yet assigned to a run queue, thus the opportunity to select the optimal run queue is seized by invoking the select_task_rq function prior to placing the task on the run queue with the enqueue_task function. A simplified sketch of this pull-based flow appears below.

¹ Ignoring the features of scheduling domains, groups, and CPU hotplug was intentional to avoid unnecessary complexity in this discussion, as they fall outside the scope of this research.
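The following C sketch condenses the pull-based flow just described into a self-contained toy. It is a simplification for exposition, not kernel source: the structures are minimal stand-ins, and the locking, scheduling-domain logic, find_busiest_group heuristics, and cache-hot checks of the real kernel are all omitted.

#include <stddef.h>

/* Minimal stand-ins for illustration; the real kernel structures differ. */
struct task { long weight; struct task *next; };
struct runqueue { long load; struct task *tasks; };

/* Pull roughly half of the pairwise imbalance from the busiest queue,
 * migrating tasks in a single direction only. */
static void pull_imbalance(struct runqueue *local, struct runqueue *busiest)
{
    long imbalance = (busiest->load - local->load) / 2;

    while (imbalance > 0 && busiest->tasks) {
        struct task *t = busiest->tasks;  /* head task; the kernel applies
                                             cache-hot heuristics here    */
        busiest->tasks = t->next;
        t->next = local->tasks;
        local->tasks = t;                 /* one-way migration            */
        busiest->load -= t->weight;
        local->load   += t->weight;
        imbalance     -= t->weight;
    }
}

Stopping at roughly half the pairwise difference is what lets the imbalance propagate recursively: the next balancing pass, possibly on another CPU, distributes the remainder.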

2.3 Task Scheduling

Task schedulers attempt to arrange hardware threads' computing capabilities into time shares, during which a single competing task is granted the exclusive right to execute. Tasks in this context are the finest grained sequence of instructions within a computer process which contain a single software thread. Run queues refer to data structures which contain references to tasks in competition for resources. Scheduling must provide for adequate latency and resource share while allowing for optimal throughput, and therefore must determine a proper execution duration and frequency for each task.

2.3.1 Task Classes

Multitasking operating systems must support a wide range of applications, requiring knowledge of tasks' resource requirements in order to provide a stable and secure system. Incorrectly assuming task requirements will likely result in system instability, thus failing a core OS responsibility. The responsiveness to user commands and the ability to complete operations within a reasonable period are useful in describing the stability of a system [28]. Although it is impossible to ascertain user performance expectations in all cases, in specific cases tasks present clues to the OS allowing for task classification.

2.3.1.1 Batch Tasks

From the perspective of a system developer, batch tasks present the easiest scenario. Batch tasks' user interaction is limited, usually to initialization parameters, and thus they are able to complete work at a rate dependent solely on provided resources. Interrupting and later resuming batch tasks in order to serve more immediate requests or requirements provides the system scheduler flexibility [5, 22, 28]. In the absence of conflicting tasks with immediate requirements, the set of these tasks may monopolize resources. Therefore, it is adequate to guarantee these operations complete in a reasonable period if the system scheduling routine is work conserving. The identification of these tasks by the OS is trivial, as batch tasks lack certain system calls corresponding to user input.

2.3.1.2 Interactive Tasks

Interactive tasks contain the same system calls and callback mechanisms batch tasks lack, prompting the user for various inputs. The priority requirement for an interactive task is low latency, to provide responsiveness to user inputs [5, 22, 28]. Increasing the frequency at which the task scheduler grants resources to interactive tasks allows for higher responsiveness. Simultaneously, interactive tasks voluntarily surrender granted resources until the user provides the requested input. Thus, the OS may allow these tasks to aggregate the surrendered allocations of resources for later use, to provide for reasonable completion times of user prompted requests. This aggregation must be limited, however, to prevent indefinite accumulation of these resource allocations in the case that the task must wait for user interaction for an extended period. Such a situation would likely result in unacceptable resource monopolization.

2.3.1.3 Real-Time Tasks

Tasks with defined throughput requirements are real-time tasks, and include all those with multimedia playback or streaming duties. Expectations exist for such tasks to perform a set amount of work over a duration defined by the application. In the case of multimedia applications, the media source provides a baseline amount of work to carry out over a period, usually measured in units of bit rate. Media applications may attempt to modify these requirements in order to optimize efficiency and stability, or even to provide for graceful degradation if resources are not sufficiently available. Video playback applications may perform optimizations by reducing video resolution during rendering to match the window or display size. Likewise, a video streaming server and or client may, however annoyingly, decrease video quality due to network or server load in order to provide graceful degradation of service. The ability to reconcile insufficient access to the resources required for ideal execution of real-time tasks implies the system implements soft real-time tasks. Hard real-time tasks are those with stiff performance requirements in which the failure to meet deadlines results in failure of the system as a whole. Commonly found in embedded system controllers, such as antilock braking systems and Engine Control Modules (ECM) in automobiles, their workload must be bound by system developers to ensure system resources are not overloaded. The ECM is an ideal example, as its workload may be dynamic, such as with variable engine controls. As a hard real-time system, a failure of the ECM to meet performance requirements results directly in system failure. Such a failure would likely result in the engine stalling or failing catastrophically. However, the engine inputs and components controlled are finite, bounding total workload.

2.3.2 Dynamic System Events

Regardless of the method responsible for properly allocating resource time shares to tasks, accounting for dynamic system events is required to ensure fairness under GPS, and must therefore first consider the amount of system resources tasks require. Dynamic events such as tasks entering or leaving the system, as well as modifications to a task's priority or classification, require a recalculation of requested resources [3, 10, 15, 29]. Further complicating matters, the OS considers tasks inactive while they wait for sluggish resources, such as hard disks. These tasks are not considered to be competing for computational resources, and therefore are seen to have left the system until the resource has responded; they are said to be blocked by I/O.

2.4 Proportional and Fair Share Scheduling

Proportional Share Schedulers are those which attempt to allocate resources according to the relative importance of each task. Tasks are assigned a value, or weight, indicating their relative importance, and the scheduler accounts for resource utilization in comparison to their ideal proportional share. The earliest attempts to proportionally share resources among competing weighted clients simply scheduled resource access in a round robin fashion; this was in fact the primary principle of the 2.5 Linux kernel scheduling algorithm. That scheduler attempts to grant proportional resource access by simply allowing each task to execute for a period representative of its priority. Fairness is ensured for static task sets, though dynamic system events complicate the system's ability to ensure fairness. In order to correct errors resulting from dynamic events with respect to proportional resource access, true proportional share schedulers follow the ideal fluid flow model. Ideal fluid flow refers to the physics of fluids flowing through a constrained cross sectional area with respect to velocity and pressure; specifically, fluid velocity is inversely related to its pressure. The physical model is analogous in that the speed at which tasks may progress has the equivalent property in any system with finite resources and thus limited execution throughput. This concept is directly realized with the definition of virtual time $V$ as

$$V(t) = \int_0^t \frac{1}{L(T(t))}\, dt \tag{2.8}$$

where at any moment $t$, relative to system start time $0$, virtual time $V$ is dependent upon the load $L$ of the dynamic task set $T(t)$. As mentioned previously, resource allocation in real systems must consider dynamic events such as the arrival of new tasks, task completion, and task weight changes; therefore, to denote the task set at some time $t$ we use the notation $T(t)$. Equation (2.8) clearly illustrates that the rate of progress of virtual time is inversely related to the system load. The ideal resource share $S_i$ of some task $i$ between time points $a$ and $b$ with respect to its weight $w_i$ is derived directly from the definition of virtual time $V$ as²

$$S_i(a, b) = w_i \cdot (V(b) - V(a)) \tag{2.9}$$

since share is proportional with respect to task weight in relation to total load at any given time.

² Note that Equation (2.9) is equivalent to (2.2) for a static task set and a uni-processor system.

Dynamic system events, such as the exiting and creation of tasks and alterations to their weights, may be performed while maintaining fair shares. Scheduling may occur with respect to task utilization deviations from the ideal; thus we define $\mathrm{lag}_i$ for some task $i$ as the difference between the ideal allocation $S_i$ and the actualized utilization $s_i$:

$$\mathrm{lag}_i(a, b) = S_i(a, b) - s_i(a, b). \tag{2.10}$$

Note that over any period in which a single task monopolizes resource access, all other tasks' lag values increase while the serviced task's lag value decreases. Additionally, note that the lags of all tasks sum to zero, $\sum_{i \in T} \mathrm{lag}_i(a, b) = 0$, which is a trivial proof exercise: over any interval a work-conserving system delivers exactly as much service as the ideal model allocates, so the ideal shares $S_i$ and the actualized utilizations $s_i$ have equal sums.

2.4.1 Earliest Eligible Virtual Deadline First

Stoica and Abdel-Wahab [29] introduced the Earliest Eligible Virtual Deadline First (EEVDF) algorithm, which utilizes virtual time and the fluid flow model to account for resource allocation. The algorithm makes use of two additional values, virtual eligibility and virtual deadline, to make scheduling decisions and limit lag, as well as to define the period in which a single resource request $r_i$ may be made by any single task. A task $i$ with eligibility time $e_i$ and service request time $r_i$ is related to its deadline $d_i$ by

$$V(d_i) = V(e_i) + \frac{r_i}{w_i}. \tag{2.11}$$

To see why the math works, assume the simplest case in which each task makes a resource request for a period equal to its weight, $r_i = w_i$, in some unit of actual time. The virtual deadline calculation then results in a one virtual time unit difference from the task's virtual eligibility. Over any single virtual time unit, the system would be capable of satisfying a proportional share, assuming it may divide time infinitesimally. Each task would be assigned a virtual deadline one virtual time unit away, which is equal to the quantity of system load in actual time units.

The degree to which tasks request resources with respect to their weight determines their deadline, as would be expected. Note that a task's deadline is in terms of virtual time, since a wall time equivalent is unknown, being dependent upon future resource competition. As a practical note, resources may not support infinitesimal time shares; therefore tasks will not relinquish resource access at the moment their request has been met. To account for these resource deficits and excesses, a task may not be eligible for another time slice until its actualized resource access equals its ideal. Therefore some task $i$ is said to have the eligibility time $e_i$, in corresponding virtual time as defined in Equation (2.8), by the relation

$$V(e_i) = V(t_i^0) + \frac{s_i(t_i^0, t)}{w_i} \tag{2.12}$$

where $s_i$ denotes the actualized service utilization from the time task $i$ began execution, denoted by $t_i^0$, to the current system time $t$. The remainder of the algorithm is straightforward from its name: tasks are scheduled and granted resource access in order of their virtual deadline times, given that the system's virtual time has surpassed the task's virtual eligibility. EEVDF has been demonstrated to be optimal in terms of providing bounds for task lag among proportional share algorithms, and thus is appropriate in systems with real time computation requirements. Lag is bounded by the inequality relation [29]

$$-r_i < \mathrm{lag}_i(t_i^0, t) < \max_{\forall k \in T}(r_k, q) \quad \forall i \in T \tag{2.13}$$

where $q$ denotes the scheduling quantum, or how often scheduling decisions are made.
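As a compact illustration of Equations (2.11) and (2.12), the following C sketch computes a task's virtual eligibility and deadline from its accounting state. This is a simplification under the stated model: the field and function names are illustrative, and virtual times are kept as doubles for clarity rather than the fixed-point arithmetic a kernel implementation would use.

/*
 * Illustrative EEVDF bookkeeping per Equations (2.11) and (2.12).
 * v_start - virtual time V(t_i^0) when the task began execution
 * service - actualized service s_i(t_i^0, t) received so far
 * request - outstanding request length r_i
 * weight  - task weight w_i
 */
struct eevdf_task {
    double v_start;
    double service;
    double request;
    double weight;
};

/* Equation (2.12): eligibility advances as service is consumed. */
static double virtual_eligibility(const struct eevdf_task *t)
{
    return t->v_start + t->service / t->weight;
}

/* Equation (2.11): deadline follows eligibility by r_i / w_i. */
static double virtual_deadline(const struct eevdf_task *t)
{
    return virtual_eligibility(t) + t->request / t->weight;
}

/*
 * Scheduling rule: among tasks whose eligibility does not exceed the
 * current virtual time v_now, run the one with the earliest deadline.
 */
static int runnable_before(const struct eevdf_task *a,
                           const struct eevdf_task *b, double v_now)
{
    if (virtual_eligibility(a) > v_now)
        return 0;
    if (virtual_eligibility(b) > v_now)
        return 1;
    return virtual_deadline(a) < virtual_deadline(b);
}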

2.4.2 Completely Fair Scheduler

Along with the 2.6.23 Linux kernel, the Completely Fair Scheduler (CFS) was introduced, aimed at being fair with respect to bounding the difference between an ideal proportional share and actualized resource utilization. As an alternative to task weights, UNIX-like systems such as Linux have historically preferred the nice terminology. The term refers to how nice a task is to the system, normalized by neutral tasks with a nice value of zero; nicer tasks have a greater positive value, while negatively nice tasks require additional system resources. Thus tasks with lower priority have a higher nice value and vice versa. The weight of some task $i$ with nice value $nice_i$ is determined by a static look-up table, having the approximate relation

$$w_i \approx \frac{1024}{1.25^{\,nice_i}}. \tag{2.14}$$

Instead of maintaining a virtual time value directly, CFS accounts for the runtime of each task, or actualized resource utilization, in virtual time. In this way the system is modeled similarly to GPS, or the ideal fluid flow model, in that each task progresses at the same rate proportional to its priority. The virtual runtime variable vruntime is incremented for task $i$ by

$$vruntime_i \mathrel{+}= \Delta t \cdot \frac{1024}{w_i} \tag{2.15}$$

where $\Delta t$ refers to the actualized system resource runtime since the previous instance of accounting, and the weight is a function of the nice value. Notice that the value 1024 reflects the weight of a task with a nice value of zero. Since runtime is accounted for in virtual terms rather than system time, accounting may be accomplished for each task during scheduling with no knowledge other than the duration the task was scheduled and its nice value. This allows CFS to provide fairness with low computational overhead.
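A minimal C sketch of the accounting in Equations (2.14) and (2.15) follows. The names are illustrative and the arithmetic is simplified: the kernel itself uses a precomputed nice-to-weight look-up table and fixed-point scaling rather than floating point.

#include <math.h>

#define NICE_0_WEIGHT 1024.0   /* weight of a nice-0 task */

/* Equation (2.14): approximate nice-to-weight mapping. */
static double nice_to_weight(int nice)
{
    return NICE_0_WEIGHT / pow(1.25, nice);
}

/*
 * Equation (2.15): charge a task for delta_ns nanoseconds of CPU time,
 * scaled inversely by its weight so that higher-priority (heavier)
 * tasks accumulate vruntime more slowly and are picked again sooner.
 */
static void update_vruntime(double *vruntime, int nice, double delta_ns)
{
    *vruntime += delta_ns * (NICE_0_WEIGHT / nice_to_weight(nice));
}

For a nice-0 task the scaling factor is exactly 1, so its vruntime advances at wall-clock rate; this is the normalization the constant 1024 provides.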

2.4.3 Red Black Binary Search Tree

The CFS algorithm selects tasks to run based on the lowest virtual runtime by maintaining a sorted data structure. The data structure of choice is a self-balancing red-black Binary Search Tree (BST), used widely in the kernel and thus implemented in the library lib/rbtree.c, whose source is accompanied by documentation in doc/rbtree.txt [19]. Red-black trees are similar to other self-balancing BSTs such as AVL trees, though they trade off slightly slower look-up times due to larger bounds on root-to-leaf path lengths. Relaxing the bounds slightly allows for fewer rotations during balancing upon node insertion and removal. The use of a balanced BST ensures CFS is computationally efficient, having complexity O(lg n) with respect to the number of tasks when modifying the run queue with insertion and removal of tasks, as well as for searching operations. Red-black trees operate on the following invariant conditions:

• Nodes are colored either red or black.
• Root and leaf nodes are colored black.
• Each red node must contain two child nodes colored black.
• Each simple path from the root to a leaf contains an equal number of black nodes.
• Null or empty leaf nodes may be used or implied to comply with the preceding rules.


Figure 2.1: Red-Black Binary Search Tree

The use of the Linux rbtree within the kernel is fairly straightforward. Rather than inserting objects, the library inserts each object's member rb_node structure by managing its pointers. It is required of the developer to determine the sorted key value of the tree's objects and to ensure a stable comparison, to avoid collisions and ensure proper searching if required. Key values which sort objects are defined by the developer while creating the insertion function, which determines the correct placement and calls the functions rb_link_node and rb_insert_color to insert the object's rb_node member. Searching may be accomplished without use of the library with the usual BST algorithm and the rb_entry³ macro, while considering any stable comparison operation. Likewise, insertion of nodes into the tree requires the location to be determined with respect to a pointer to the parent node and a pointer to the parent node's associated link, either rb_right or rb_left. These parameters, along with the tree root rb_root, are passed to rb_link_node. The coloring, and balancing, of the tree occurs upon a call to the rb_insert_color function. The library provides for the easy removal of nodes by simply passing an object's rb_node pointer and the tree's rb_root to the rb_erase function; an insertion sketch in this style is given at the end of this section. Due to the ease of implementation of CFS, and in order to provide a reasonable baseline, CFS was selected as the scheduling algorithm implemented in the simulation further detailed in subsequent chapters. Additionally, a user space rbtree implementation provided under the GNU GPL, itself based on the kernel implementation, was sourced [11] and modified for use with the simulator.

³ The rb_entry macro operates equivalently to the container_of macro, and is used to obtain the object which contains some referenced member.
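The following sketch illustrates the insertion pattern just described using the kernel's rbtree API, keyed here by a vruntime field as CFS does. The surrounding struct and function are illustrative, not kernel source.

#include <linux/rbtree.h>

/* Illustrative queued entity sorted by virtual runtime, as in CFS. */
struct sched_entity_demo {
    struct rb_node node;              /* embedded rbtree linkage */
    unsigned long long vruntime;      /* sort key                */
};

/* Walk to the correct leaf position, then link and rebalance. */
static void demo_enqueue(struct rb_root *root, struct sched_entity_demo *se)
{
    struct rb_node **link = &root->rb_node;
    struct rb_node *parent = NULL;

    while (*link) {
        struct sched_entity_demo *entry;

        parent = *link;
        entry = rb_entry(parent, struct sched_entity_demo, node);
        /* Stable comparison on the key; ties fall to the right. */
        if (se->vruntime < entry->vruntime)
            link = &parent->rb_left;
        else
            link = &parent->rb_right;
    }
    rb_link_node(&se->node, parent, link);   /* place the node      */
    rb_insert_color(&se->node, root);        /* recolor and balance */
}

/* Removal is a single call: rb_erase(&se->node, root); */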

2.4.4 Distributed Weighted Round-Robin

A number of round robin algorithms have been utilized in the task scheduling literature and perform well in terms of computational complexity, though generally less so in terms of fairness. As an example, the Linux 2.5 kernel implementation maintained two run queues, active and inactive, implemented as sets of lists, each associated with a given priority. Scheduling occurs by simply scanning the active queue in decreasing order of priority for the next available active task, in constant time. Tasks concluding their allotted time slice are placed upon the inactive queue, until the queues switch places when no active tasks remain [27, p. 179] [4, p. 268]. Heuristics allowed the system to shift task priorities to provide for task interactivity, though they also opened up vulnerabilities which allowed a task privy to the workings of the scheduler to overuse its share of processor time [30]. Fairness was achieved in static systems, though waned upon dynamic system events. DWRR [21] is based on instituting local and global fairness separately. Implemented with the 2.5 scheduler as described above, DWRR provides local fairness in terms of round slicing. Rounds are defined as the total time required to service each client in a distributed run queue with a proportional share. Formally, a round concludes once all tasks have executed and been relocated from the active to the inactive run queue. Global fairness is achieved by performing round balancing, essentially pulling tasks from lagging run queues to those having completed the current round more quickly. Global fairness is bounded by ensuring that no two run queues may be separated by more than a single completed round. DWRR maintains O(1) complexity, provides low migration overhead, and allows tasks to run continuously with bounded lag. Round balancing may also be implemented independently of round slicing, as was presented, and demonstrated good performance, with CFS [21]. A minimal sketch of the round-balancing rule follows.
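The sketch below captures only the round-balancing invariant described above; the data structures and names are hypothetical, and the published DWRR design differs in detail.

/*
 * Hypothetical DWRR-style round balancing: a CPU whose queue has
 * finished its round pulls work from any peer still in an older round,
 * keeping all queues within one completed round of each other.
 */
struct dwrr_queue {
    unsigned long round;     /* rounds completed so far      */
    int           nr_tasks;  /* runnable tasks in this queue */
};

static int dwrr_should_pull(const struct dwrr_queue *self,
                            const struct dwrr_queue *peer)
{
    /* Pull only from peers lagging behind in round number, and only
     * if they still have tasks to give up. */
    return peer->nr_tasks > 0 && peer->round < self->round;
}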

2.4.5 Additional Proportional Share Algorithms

The concept of virtual time is similar to that of the traffic control algorithms designed for packet switching networks by Zhang [32]. Data transmission flows are compared to their requested average transmission rate by advancing a Virtual Clock value by the reciprocal of the rate. The virtual clock value is then compared to the real time and, if greater than some threshold, indicates the requirement to trigger control mechanisms. Packets are time stamped with respect to the virtual clock of their associated flow, and transmitted in increasing order of virtual clock time stamps. Similar to the work of Zhang [32], Weighted Fair Queuing (WFQ) was introduced alongside a GPS-equivalent model referred to as Bit-by-bit Round-Robin (BBRR) [9]. As GPS models time multiplexing of resources with respect to infinitesimal time units, BBRR models packet transmission with the infeasible granularity of a single bit [9]. The model does provide the ability to calculate transmission completion times, which are used in WFQ to order packet transmission. As an alternative to scheduling according to the GPS-modeled completion time as in WFQ, Start-time Fair Queuing (SFQ) is a CPU scheduling algorithm which grants resource access to clients according to, and in increasing order of, start times. Start times are defined to be either the completion time of the previously requested transmission or the current time, whichever is latest. Surplus Fair Scheduling (SFS) generalizes the SFQ algorithm to the multiprocessor case by reassigning weight values as described in the previous Section 2.2.4, symbolized by $\phi$ [7]. Tasks are then sorted based on the surplus service time $\alpha$, as calculated for some task $i$ with respect to its execution start time $t_i^0$ by

 0  0  αi = φ · ti − min {t j : j ∈ T} (2.16)

The SFS implementation was presented for use in the now vastly outdated 2.2.14 Linux kernel. Due to its sorting complexity of O(n log n) is not a practical consideration and is included solely for historical context of an early PSS. An additional class of network fair queuing algorithms named Stochastic Fair Queuing was presented by McKenney[23]. A variant is provided in the Linux kernel and documented in the manual pages with the tc command. Stochastic Fair Queuing uses a periodically perturbing hash function on source-destination based conversations to discourage collisions and provide a statistically fair servicing of communications. The 38 resulting hash calculation is used to assign packets to a number of queues, which are serviced in a round robin fashion in constant time.

2.5 Performance Monitoring Counters

Performance Monitoring Counters (PMC) are special processor registers which accumulate system events⁴. For the x86 architecture, these registers fall within the set of Model Specific Registers (MSR). MSR are special registers whose use falls outside of any Instruction Set Architecture (ISA); rather, their use is dependent upon the processor model. Their capabilities must be identified by probing the processor for Central Processing Unit Identification [12] (CPUID) data and cross referencing supported features documented within the manufacturer's manuals. PMC registers are used to count system events. These events may include those related to fine-grained resources as described previously, including FPU and ALU operations, LLC misses, etc. PMC registers have been widely used by software developers to profile code and identify resource utilization patterns and requirements. To date, the absence of previous work utilizing PMC registers to profile tasks at run time for use in resource allocation, and specifically in load balancing, demonstrates a research gap. The use of PMC registers for Intel x86 processors is well documented in Intel publications [12, 14]. Events are counted by writing values to a special MSR called the performance event selector, whose bit fields indicate counted events and options. Each selector register is associated with a counter register, from which both reading and writing are accomplished by a single instruction. Thus the overhead incurred during context switching is modest and fixed. Specifics on their use in this work are provided in the following chapters.
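To illustrate the access convention, the following is a minimal C sketch of wrappers around the RDMSR and WRMSR instructions; the function names are illustrative, and such code must execute at privilege level 0 (i.e., within the kernel).

#include <stdint.h>

/* ECX selects the MSR; EDX:EAX carries the 64-bit value. */
static inline uint64_t rdmsr(uint32_t msr)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
    return ((uint64_t)hi << 32) | lo;
}

static inline void wrmsr(uint32_t msr, uint64_t val)
{
    __asm__ __volatile__("wrmsr" : :
                         "c"(msr), "a"((uint32_t)val), "d"((uint32_t)(val >> 32)));
}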

4 Though PMC registers exist across a wide range of architectures, this work focuses on the Intel x86 architecture, the implementation of which is architecture dependent. [12]

3 Methodology

3.1 Motivation

The literature on fair resource allocation has been directed towards proportional scheduling within both theoretical and applied realms. These include the presentation of EEVDF and the inclusion of CFS within the Linux kernel, as discussed in the previous chapters. Proportional fairness is achieved by considering the relative flow of execution throughput, defined as the reciprocal of system load with respect to time [29]. This concept, commonly referred to as virtual time, originated in packet switching network research presented by Zhang [32] for "rate-based traffic control algorithms". Unfortunately, unlike network packets, which may be processed in constant time due to network protocol standardizations such as Ethernet frames, task processing rates are infrequently constant. Previous work by Dunn [10] considers resource allocation within systems with inconsistent processing rates, specifically load balancing in asymmetric multiprocessor systems where inconsistent processing rates are the result of dynamic frequency scaling. Loads are balanced by an algorithm utilizing a minimal objective function to calculate the relative error between allowable task set migrations between pairs of processors. While the work by Dunn considers the bogoMIPS⁵ value as a representative metric of processor performance, he notes that identifying an ideal metric is outside the scope of his work. This work extends the work of Dunn [10] in that an attempt is made to identify PMC registers as a useful metric in determining processor performance, and specifically in the profiling of tasks. A number of other technologies result in inconsistent processing rates. Examples include systems designed with NUMA, due to variance in memory access latency.

5 Linux determines the bogoMIPS value at boot time by executing a busy-loop and recording the millions of instructions per second the processor may achieve.

Hardware optimizations, such as out-of-order execution and branch prediction, which may or may not succeed, may also vary throughput. Hardware multithreading incurs inconsistent processing rates due to super-scalar and super-pipelining technologies which attempt to fully utilize execution units. Different sets of concurrently executing software threads, which are multiplexed on the same physical circuit under multithreading, may have different contention patterns for execution units than others. Multiprocessing technologies may introduce inconsistent processing rates due to resource interdependencies such as cache and bus usage. Consequently, as hardware complexity improves overall throughput, it also complicates the system's attempt to manage resources and ensure fair resource utilization. These considerations led to the initial test, which aimed to verify the assumption that certain resource loads may incur additional overhead due to contention, overhead which may be alleviated with proper task profiling and load balancing. This test is described in detail in section 4.2.

3.2 Synopsis

This chapter presents the approaches considered for balancing system load with respect to the added dimension of a finer grained resource, eliminating contention in order to improve computational throughput.

3.2.1 Partition Optimization Problem

Revisiting the partitioning problem from subsection 2.2.2, the partitioning valuation functions must be considered to complete the formulation of PP into an optimization problem. The optimization version of PP is combinatorial in nature, and considers the valuation of each instance of a valid partitioning given a set of tasks. Fairness must drive the definition of such a valuation function, and so summary statistics are used that allow us to quantify the partitioning imbalance.

3.2.2 Partition Load Range

Given a task set T, we may decide to evaluate a partitioning P with respect to the range of its partitions' loads, as calculated by

Range(P, T) = max_{Pi ∈ P} L(Pi) − min_{Pi ∈ P} L(Pi)    (3.1)

for a K-partitioning P of task set T as defined by Equation (2.5). The range valuation of a partitioning provides developers the simplest load balancing scheme to calculate and implement, given its online algorithmic nature. Local changes to a run queue require verifying or altering only the minimally and maximally loaded run queues. Load balancing may be accomplished, as is done in Linux, by pulling tasks from the maximally loaded run queue to the minimally loaded. Valuation results are bound in the worst case by the maximum task weight, and allow a simple greedy implementation that is further optimized by sorting tasks with respect to weight.

3.2.3 Mean Deviation

A more accurate partitioning evaluation would consider the mean deviation of the partitions' loads with respect to the optimal load, which can be calculated by

MeanDeviation(P, T, K) = ( Σ_{Pi ∈ P} | L(Pi) − L(T)/K | ) / K    (3.2)

for a K-partitioning P of a task set T as defined by Equation (2.5). The benefit mean deviation has over range valuation is increased fairness for tasks across the entire system, rather than only for those incurring the greatest deviation from a fair allocation. From a developer's standpoint, partitioning evaluation based on mean deviation may still be considered locally with respect to the run queues whose loads have changed, allowing for a distributed implementation similar to that of Linux. This is due to the fact that if any subset of partitions reduces its mean deviation from an ideal partitioning, the global accumulation of mean deviation is reduced accordingly.

3.2.4 Variance

Mean deviation as a valuation function may be improved upon by squaring the deviation of each partition load, resulting in the variance. The variance valuation equation is defined by

Variance(P, T) = ( Σ_{Pi ∈ P} ( L(Pi) − L(T)/|P| )² ) / |P|    (3.3)

for a K-partitioning P of a task set T as defined by Equation (2.5). This incurs a slight increase in computational overhead, but results in additional fairness by imposing a penalty upon partitionings with outliers with respect to partition load.
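The three valuations lend themselves to direct implementation. The following is a minimal C sketch, assuming load[i] holds L(Pi) for each of the K partitions and total holds L(T); all names are illustrative rather than taken from any implementation in this work.

#include <math.h>

/* Equation (3.1): spread between the most and least loaded partitions. */
double range_valuation(const double *load, int K)
{
    double min = load[0], max = load[0];
    for (int i = 1; i < K; ++i) {
        if (load[i] < min) min = load[i];
        if (load[i] > max) max = load[i];
    }
    return max - min;
}

/* Equation (3.2): mean absolute deviation from the ideal load L(T)/K. */
double mean_deviation(const double *load, int K, double total)
{
    double ideal = total / K, sum = 0.0;
    for (int i = 0; i < K; ++i)
        sum += fabs(load[i] - ideal);
    return sum / K;
}

/* Equation (3.3): squared deviations penalize outlier partitions. */
double variance(const double *load, int K, double total)
{
    double ideal = total / K, sum = 0.0;
    for (int i = 0; i < K; ++i)
        sum += (load[i] - ideal) * (load[i] - ideal);
    return sum / K;
}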

In each of the preceding subsections, summary statistics are used to evaluate partitionings in terms of error. Clearly, with these valuation functions we may formulate the partitioning problem optimization function PP-OPT. The optimal partitioning P for some task set T executing on a K processor system is given by

PP-OPT(T, K) = P ∈ U : valuation(P, T) = min { valuation(Pi, T) : Pi ∈ U }    (3.4)

for the universal set U = { Pi : Pi is a valid partitioning of T }, where valuation(P, T) is some metric of partitioning fairness, for example any of the Equations (3.1), (3.2), or (3.3). In terms of fairness, the partitioning valuations presented only ensure tasks a limited rate of unfairness accumulation. Care must be given to ensure no run queue is overloaded for an extended duration, regardless of how a different partitioning may affect short term fairness. Without such consideration, graceful degradation and QoS may not be guaranteed. By imposing an additional trigger to load balancing routines to implement a round balancing algorithm, as done in DWRR [21], task lag may be bound. Round balancing may be imposed within PSS algorithms by defining a maximum deviation for run queues' virtual time, in the case of EEVDF and others, or for tasks' virtual runtime, as in the case of CFS. As the main topic of this work considers multidimensional load balancing, the implementation of round balancing has been left for future work.

3.3 Finer Grained Resources

As defined previously, finer grained resources refer to system resources which are not associated with a single hardware thread in a one-to-one correspondence. The term finer grained refers to resources which are utilized indirectly as a result of scheduling CPU time shares. Examples consist of execution units within the hardware thread processing pipeline and storage and communication circuits such as caches and buses. As mentioned previously, and demonstrated in the following chapter, imbalances in finer grained resource utilization reduce throughput by imposing additional competition. To reduce competition and optimize throughput, the system must first profile tasks to quantify resource requirements. Tasks may be profiled across multiple resources, though each additional resource incurs a higher order of complexity, resulting from the higher dimensionality of the load balancing problem. For this reason, this work considers a single finer grained resource in addition to weight based load balancing, resulting in the consideration of two dimensions.

3.3.1 Multidimensional Load Balancing and Partition Problem

To consider the allocation of finer grained resources, it is useful to formulate a multidimensional partitioning problem which is dependent upon a set of resources.

Therefore, given a set of redundant resource types R = {R0, R1, ...}, across which a set of competing tasks must be allocated, resource load may be redefined across a single resource type Rj ∈ R as L(S, Rj). Further, we say that each resource type Ri ∈ R may have multiple units; a multiprocessor system is one such example, having multiple processing units. To denote the number of units of a resource, we use the notation |Ri|. PP-D is then extended to the Multidimensional Partition Decision Problem (MPP-D) across multiple dimensions, or resource set R, by

MPP-D(T, R) = true, if ∃P : ∀Pi ∈ P, ∀Rj ∈ R, L(Pi, Rj) = L(T, Rj)/|Rj|; false, otherwise    (3.5)

where L(T, Rj) denotes the utilization of resource Rj by task set T. We can use Equation (3.5) to decide whether there exists a partitioning such that, for each resource Rj ∈ R in the set, an equally balanced partitioning exists.

Unfortunately (3.5) is insufficiently general for the case of most resource sets within modern computer systems, as there are likely hierarchical constraints. For example, if the set of resources was defined as R = {RDISK, RRAM, RCPU} for a system's disks, memory, and processors respectively, it is very likely that |RDISK| < |RRAM| < |RCPU|. Many systems are designed hierarchically, where resource access is dependent upon a subset of other resources. Therefore, partitionings must be created which take such hierarchical system designs into consideration.

3.3.2 Interdependent Multidimensional Load Balancing

Certain resources have interdependencies between processing elements which further complicate the Multidimensional Partition Problem (MPP). Hardware multithreading, a technology which allows for the multiplexing of the processing pipeline between multiple software threads, may demonstrate a wide discrepancy in execution throughput for various task sets. Generally, symmetric multithreading behaves optimally with task sets having non-conflicting resource requirements. For example, a mixture of tasks with high dependencies on either memory access, the FPU, the ALU, or caches allows super-pipelined, super-scalar multithreaded processors to more fully utilize the processor's pipeline and improve throughput. These resources are the focus of this work, and are further referred to as finer grained resources.

Unfortunately, it is insufficient to simply balance task sets along hardware threads with independent task scheduling while considering finer-grained resources⁶, since many finer-grained resources are shared across multiple hardware threads. Such an implementation would balance heavily fine grained resource dependent tasks upon multiple run queues, each mapped to a hardware thread, which may have to share the resource. For example, consider a system with hardware multithreading support which executes a CPU bound task set, equally dependent upon the ALU and FPU, though only a portion uses the FPU. Modern operating systems would refer to each hardware thread as an independent logical processor, between which tasks would be assigned in a balanced fashion. The tasks would likely be partitioned independently of FPU dependency, thus resulting in a portion of FPU dependent tasks occupying each run queue. Since scheduling is an independent operation, multiple tasks may compete for a single finer-grained resource during hardware multithreading's multiplexing of the processor pipeline. If these tasks were instead required to occupy a single run queue, no additional competition would occur due to hardware multithreading. Corresponding examples may be constructed for each finer-grained resource if the system's architecture is examined.

6 We ignore run-queue interdependent task scheduling due to concurrency concerns and overheads.

4 Testing Framework and Experiments

4.1 Experimental Environment

Experiments are performed on a system utilizing an Intel Core i7-920 hyperthreaded quadcore processor with 12 GB of DDR3 main memory across 3 channels. Ubuntu 12.04 LTS was chosen as the test operating system, as the release carries long-term support under which software packages are maintained [2]. The 12.04 Ubuntu LTS version also includes the Linux 3.2 kernel, which is considered a stable kernel with long-term support [1].

4.2 Initial Considerations and Experiments

As discussed in the previous chapter, increasingly complex computational system designs introduce varying computation rates depending on the resources required. With this consideration, the initial test to verify the assumption that additional overheads may be incurred in task loads considers a very simple CPU bound task set. The contrived task set consisted of two identical loops which executed arithmetic operations extensively on either floating point or integer data types. The number of instances of each was equivalent, and each was made to execute for a set number of iterations over an equivalent duration, such that if each were executed independently without contention, their completion times would fall within a tenth of a second of one another; they would thus be executing concurrently for at least 98% of their duration. The contrived task set is then created and executed, and displays the computational throughput with respect to the time required to complete execution under various resource dependency patterns. As a baseline, both the ALU dependent and FPU dependent tasks are executed independently to verify their time requirements under no resource contention. In addition, each CPU hardware thread is made to concurrently execute an instance of the ALU dependent task; this is followed by each concurrently executing an instance of the FPU dependent task. This provides an upper bound for resource contention and verifies conditions under improperly balanced task sets.

Table 4.1: Baseline Contrived Load

pid    comm      cpu   time    %diff
8966   do_iops   6     5.419    6.78
8964   do_iops   4     5.451    7.39
8965   do_fops   5     5.507   15.16
8967   do_fops   7     5.566   16.41
8963   do_fops   3     5.918   23.77
8961   do_fops   1     5.939   24.21
8960   do_iops   0     6.651   31.05
8962   do_iops   2     6.700   32.01
max: 6.700    std dev: 9.76

No Set Affinity; Balanced by Unmodified Linux Kernel

Task sets and resource dependency patterns were contrived for the two remaining tests to verify our assumption that multithreaded processing may introduce overhead which may be alleviated by properly balancing loads according to finer grained resource requirements. In order to contrive a resource dependency pattern, tasks must first have the ability to be mapped to a single hardware thread. Processor core and hardware thread topology are identified within the Linux operating system with the cpuid command, which assigns each hardware thread a processor identifier. Those hardware threads sharing a physical processing core are identified by equivalent core id values. With this information we may assign task loads to cores such that an equivalent amount of computation is performed, where pairings of hardware threads sharing a core id contend with tasks requiring equivalent or differing resources. The output of executing the cpuid command on the i7-920 in the machine used in testing is provided in Figure B.1 in Appendix B. With this information, tests were performed which set affinities for either FPU or ALU dependent tasks, named do_fops and do_iops respectively. Tasks were made to run exclusively on a specific hardware thread by invoking the sched_setaffinity Linux system call. A representative sample of the results is included below in Tables 4.2 and 4.3. For each, the CPU identifier to which the task's affinity was set is listed, as well as the process identifier and the duration required to execute each task. The percent column refers to the overhead incurred over the same task executing independently of any resource contention.
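A minimal sketch of the affinity assignment follows, assuming the calling task pins itself to a hardware thread whose processor identifier was obtained from the cpuid output; the helper name is illustrative.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling task to one hardware thread; cpu is the processor
   identifier reported by cpuid for the desired hardware thread. */
int pin_to_cpu(int cpu)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {  /* 0 = calling task */
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}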

Table 4.2: Imbalanced Contrived Load

pid    comm      cpu   time    %diff
8975   do_fops   7     5.497   14.96
8971   do_fops   3     5.505   15.12
8973   do_fops   5     5.505   15.12
8969   do_fops   1     5.537   15.80
8972   do_iops   4     7.050   38.91
8968   do_iops   0     7.073   39.36
8974   do_iops   6     7.083   39.56
8970   do_iops   2     7.146   40.79
max: 7.146    std dev: 13.06

Finer Grained Resource Imbalance; Physical Cores Assigned Either ALU or FPU Dependent Tasks

As Tables 4.2 and 4.3 display when cross referenced with Figure B.1, both the do_iops and do_fops tasks benefit from a contrived affinity in which pairs of ALU and FPU heavily dependent tasks execute on hardware threads within the same physical processing core. When one studies the advantages of hardware multithreading utilizing superpipelining and superscalar technologies, this is the expected result.

Table 4.3: Balanced Contrived Load

pid    comm      cpu   time    %diff
8978   do_iops   2     5.369    5.78
8979   do_iops   3     5.408    6.56
8976   do_iops   0     5.433    7.04
8977   do_iops   1     5.729   12.87
8982   do_fops   6     5.842   15.11
8980   do_fops   4     5.868   15.61
8983   do_fops   7     5.897   16.19
8981   do_fops   5     6.001   18.24
max: 6.002    std dev: 4.97

Finer Grained Resource Forced Balance; Each Physical Core Assigned Both ALU and FPU Dependent Tasks

A number of other load patterns were tested, though for brevity only the boundary cases are presented, as well as an example of the same task set executing without CPU affinity in Table 4.1⁷. The code executed to generate both FPU and ALU dependent tasks⁸ was designed to generate a heavy load upon the respective resource while eliminating the overhead of cache misses by residing entirely in registers during execution.
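As a rough illustration of what such loads look like, the following C sketch mirrors the single-loop, single-arithmetic-instruction structure described above; the constants and iteration counts are illustrative, and in practice the results must be consumed so an optimizing compiler does not eliminate the loops.

/* FPU dependent load: repeated floating point multiplies on a register. */
double do_fops(long iterations)
{
    double x = 1.000001;
    for (long i = 0; i < iterations; ++i)
        x *= 1.000001;
    return x;   /* returned so the loop is not optimized away */
}

/* ALU dependent load: repeated integer multiplies on a register. */
long do_iops(long iterations)
{
    long x = 3;
    for (long i = 0; i < iterations; ++i)
        x *= 3;
    return x;
}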

4.3 Performance Monitoring Overview

PMC are Model Specific Registers introduced to allow developers to measure low level hardware performance events. Although PMC are model specific, Intel has provided for consistent performance monitoring across certain architectures, referred to as

7 This example was included to demonstrate a single possible load partitioning determined by the Linux load balancing algorithm. It is possible, as was experienced upon running this test, that a balanced partitioning may be decided upon, but as shown this is not guaranteed.

8 The pseudocode has been omitted due to its simplicity, consisting of a single for-loop and a single arithmetic instruction.

Architectural Performance Monitoring. Performance monitoring capabilities are enumerated by executing the CPUID instruction, as described in full detail in the reference manuals provided by Intel [12, 13]⁹. Once a developer ascertains the supported PMC features, their use is relatively straightforward. Chapter 18 of the Intel Architecture Software Developer Manual [12] lists general performance events supported across individual architectures, such as instructions retired, branch prediction, and caching efficiency. Supported features based on CPU model are more specific and are listed separately [12, ch 19]; they include counting events occurring within execution units such as the FPU, ALU, Memory Management Unit (MMU), and various cache levels. Supported events may be counted by programming the performance event select registers, denoted PERFEVTSELx, where x refers to the associated PMCx register. The number of PMC registers is limited based on architecture; for example, the test system's Intel i7-920 includes four PMC registers per core. Once an event is selected, the PMC register must be cleared to expunge any old or undefined data. The PERFEVTSELx registers also include bit fields to enable additional features, such as counting events occurring within OS and or user modes as specified by the processor privilege level, as shown in Figure 4.1. The counter must then be enabled by setting a bit in both the desired PERFEVTSELx register and a global performance control register, denoted PERF_GLOBAL_CTRL in Figure 4.2. Writing and reading these registers are protected instructions requiring privilege level 0, accessible only by the OS. The RDMSR and WRMSR instructions are used to read and write respectively; they use the ECX register to refer to a specific register, and the EDX:EAX registers to pass data to and from the registers¹⁰.

9 Not all x86 processors support all functions of the CPUID instruction nor PMC functionality, and thus their use must comply with documentation provided by Intel.

10 Care must be given when distributing software attempting to utilize PMC registers, as their support is relatively new and not guaranteed to exist in subsequent hardware.

10 Figures 4.1 and 4.2 display only the lower 32 bits; the registers are actually 64 bits, with the higher ordered bits reserved and inactive.

Figure 4.1: PERFEVTSELx Bit Fields (event select, unit mask (UMASK), counter mask (CMASK), and the USR, OS, E, PC, INT, ANY, EN, and INV flags)

Figure 4.2: PERF_GLOBAL_CTRL Bit Fields (PMC0 through PMC3 enable bits)

4.3.1 Performance Monitoring Profiling Overhead

Performance monitoring counters provide developers with a near perfect cost to benefit ratio, incurring as overhead as few as three instructions to either set up counters or record counter values. Values may be associated with tasks by simply storing values from the PMC into a new member of the task structure upon context switch. Since we are concerned with resource utilization over time, we store the rate at which events occur as the task_struct member perf_val_rate. Additionally, an exponential moving average is maintained to account for short term variations and emphasize long term tendencies. The EXP_COEFF value allows the exponential average to be tuned to favor more recent events with a larger value, but must be bound by EXP_BOUND.

Included below is the pseudocode for performing PMC operations, including setting up the counters by writing a value PERFEVTSEL_VAL, which is performed at run queue construction, and collecting PMC values during context switch.¹¹

wrmsr(PERFEVTSEL, PERFEVTSEL_VAL);                /* select the event to count */
wrmsr(PMC, 0);                                    /* clear any stale counter data */
wrmsr(PERF_GLOBAL_CTRL, SET_PERF_ENABLE_FIELDS);  /* globally enable the counter */

Figure 4.3: PMC Selection Pseudocode

pmc_value = rdmsr(PMC);                          /* read the accumulated event count */
wrmsr(PMC, 0);                                   /* reset the counter for the next task */
task->perf_val_rate = pmc_value / delta_time;    /* event rate over the elapsed slice */
task->perf_exp_ave = (EXP_COEFF * task->perf_val_rate +
                      (EXP_BOUND - EXP_COEFF) * task->perf_exp_ave) / EXP_BOUND;

Figure 4.4: PMC Collection Pseudocode

4.3.2 Performance Monitoring in Linux

PMC registers have been in use in the Linux kernel for some time, within drivers and in an optional package called linux-tools. This package provides user space control of PMC registers with the perf command, which is useful for developers profiling their code. The perf framework is extensive, and includes a great deal of functionality across system architectures. As the perf framework includes a large amount of code to create a cross-platform Application Programming Interface (API), its use was omitted in this work to eliminate all possible overheads within task scheduling code. Additionally, a slight modification to the perf-tools framework was required to ensure no contention existed

between it and our perf_sched_class with respect to setting values in the various PMC registers. This was accomplished by simply reserving the PMC3 register, by modifying how many PMC registers the perf tool sees when scanning the system upon start-up.

11 The wrmsr and rdmsr functions referenced are simple x86 in-line assembler function wrappers, which include a write to the EDX:EAX registers and the wrmsr and rdmsr instructions as documented [12].

4.4 Linux Scheduling Classes

Linux introduced a modular scheduler in kernel version 2.6.23 [18], which allows for multiple scheduling algorithms to operate concurrently. Scheduler modules are maintained within their own source code files, and each must implement a set of functions. The sched_class struct is defined in the /include/linux/sched.h file of the Linux source code; it includes function pointer prototypes which are populated by each scheduling class and referenced by the larger scheduler algorithm. Each scheduling class must statically define a sched_class structure, and place itself within a linked list in order of priority, as displayed in Figure 4.5.

stop_sched_class → rt_sched_class → fair_sched_class → idle_sched_class, with perf_sched_class inserted

Figure 4.5: Linux Scheduling Class Priority List
Modified Priority Marked by Dashed Arrows

With the Linux modular scheduler framework, implementing a new scheduling class is fairly straightforward. Since each scheduling class resides within its own source file, the new perf_sched_class scheduler was created and statically linked within the list from Figure 4.5. A set of callback functions is defined and statically assigned to the perf_sched_class members. Of these callback functions, those with the greatest importance are

• enqueue_task: passes a pointer to a task_struct object which has become able to run and must be placed in the run queue.

• dequeue_task: passes a pointer to a task_struct object which is no longer able to run and must be removed from the run queue.

• pick_next_task: called by the schedule() function; either returns a pointer to the next task_struct object dictated by the scheduling class algorithm to run next, or returns NULL if no such task exists. In order for a scheduler to be work conserving, it must return a task whenever there exist tasks which are able to run.

• put_prev_task: called by the schedule() function to preempt the currently running task and return it to the run queue (if pick_next_task removed it).

• task_tick: must determine if the currently running task has expended its allotted running time and must be preempted.

In addition to these callback functions, there exists a set of callbacks which are optional, depending on the scheduling algorithm implementation and desired features such as group scheduling and priority preemption.
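To make the registration concrete, the following is a minimal sketch of how such a class might be declared; the field names follow the Linux 3.2 struct sched_class convention, but the callback implementations and list position are illustrative rather than the exact code of this work.

static const struct sched_class perf_sched_class = {
    .next           = &idle_sched_class,    /* next lower class in the priority list */
    .enqueue_task   = enqueue_task_perf,    /* task becomes runnable */
    .dequeue_task   = dequeue_task_perf,    /* task ceases to be runnable */
    .pick_next_task = pick_next_task_perf,  /* choose the next task to run */
    .put_prev_task  = put_prev_task_perf,   /* replace the preempted task */
    .task_tick      = task_tick_perf,       /* periodic time slice accounting */
};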

4.5 Performance Scheduler Class

The implementation of a custom scheduling class allows for tests to determine the overhead of performing performance monitoring counter operations in kernel space and to determine their accuracy. Alternative performance monitoring tools exist, such as perf and the Intel Performance Counter Monitor tools [31], though these are targeted more for use by developers creating user space software or monitoring servers. A utility by I. Molnar [24] was released for kernel developers to use these performance monitoring capabilities to profile the scheduler itself, but not the tasks which are scheduled. In comparison with these tools, which provide for the stochastic sampling of the PMC registers, by placing PMC sampling instructions within the scheduler upon context switches we may obtain more accurate metrics. The only events which would introduce errors in the events counted and attributed to a task are those performed by the operating system. For this reason we focus on a single event known not to occur within the kernel, specifically floating point operations.

The perf_sched_class provides for task profiling in the put_prev_task_perf function, as well as the usual accounting of CPU time utilization. Tasks must therefore migrate to this class in order to be profiled, by utilizing the sched_setscheduler system call. The ability to obtain and control performance monitoring data has been provided by implementing loadable kernel modules.

To demonstrate the low overhead of placing PMC sampling instructions within context switches, the scheduling algorithm used by the perf_sched_class scheduling class is a bare bones implementation of CFS, without the additional features of group or domain scheduling. The choice of CFS as the task scheduler was due to the simplicity and efficiency of the algorithm rather than the fairness bounds provided for tasks' resource utilization. A more detailed discussion of the Performance Scheduling Class implementation follows.

As mentioned earlier, newly created or arriving tasks are placed within the run queue by the enqueue_task function. The enqueue_task_perf function places a task_struct object upon its class run queue, which is maintained by the Linux rbtree library discussed in section 2.4.3. In order to maintain accurate runtime accounting, if the currently running task is within the perf_sched_class, its CPU utilization must be accounted for at this time, since a newly arrived task changes the local run queue load and thus the rate at which vruntime increments. The inverse operation occurs when a task ceases to be able to run and dequeue_task_perf is called. The run queue is also referenced when the next process to run must be selected in the pick_next_task_perf function. The rbtree provided macro rb_first provides access to the task having been serviced least with respect to virtual runtime as determined by its proportional share.

The accounting of CPU utilization must occur within a number of the scheduling class API callback functions, including when tasks fork, the CPU changes context, or when tasks change weight, arrive, or leave a run queue. During the accounting of CPU utilization, as described by Equation 2.15 in Section 2.4.2, the PMC values are recorded and calculated as described in Figures 4.3 and 4.4. The performance monitored scheduling class performs a basic amount of load balancing when tasks are created. The select_task_rq_perf callback function determines the run queue new tasks are placed upon.
This is accomplished by simply scanning all run queues operating across the system and selecting the associated CPU index with the least load.
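A minimal sketch of that placement policy follows; the signature mirrors the Linux 3.2 select_task_rq callback, while perf_rq_load() is a hypothetical accessor for a run queue's load.

static int select_task_rq_perf(struct task_struct *p, int sd_flag, int flags)
{
    int cpu, best_cpu = 0;
    unsigned long best_load = ULONG_MAX;

    /* Scan every online CPU's run queue and pick the least loaded one. */
    for_each_online_cpu(cpu) {
        unsigned long load = perf_rq_load(cpu);  /* hypothetical accessor */
        if (load < best_load) {
            best_load = load;
            best_cpu = cpu;
        }
    }
    return best_cpu;
}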

4.6 Performance Monitoring Kernel Modules

Within loadable modules, in keeping with the separation of mechanism and policy, access was given to users with administrative access to monitor and control certain policies with respect to the Performance Scheduling Class. These modules are described in the following subsections.

4.6.1 Controlling Performance Event Counters

The x86 instructions rdmsr and wrmsr allow values to be read from and written to MSR registers such as the PERFEVTSELx. Both instructions are protected, and must be invoked while the CPU is in privileged mode executing kernel code. This limitation requires an interface to modify the performance events selected to be counted. Ideally, in production software such functionality would be provided by a system call; the downside of this approach is that it would require a lengthy recompilation of the kernel upon modification. For the purpose of this work, a loadable kernel module is sufficient to provide control of the performance event selector. This provides for much faster development and testing, and far fewer modified source files. The perf_sched_class provides for the PERFEVTSELx to be modified upon each context switch. As each PMC value is recorded, the scheduler verifies that the active selector corresponds to the msr_event kernel variable; a difference indicates the need to update the performance selector. The kernel module sched_perf_msr provides the direct ability to read and write the msr_event kernel variable by utilizing the kernel macro DEFINE_SIMPLE_ATTRIBUTE. This macro links a proc file operation to callback functions for formatted read and write access. The ability to modify the msr_event performance selector value was also made available to user space by implementing the system call setmsrevent; getmsrevent likewise allows for reading the variable. The implementation allowed the value to be modified either system wide or for a particular process identification number, or pid.
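A sketch of how such an attribute file can be wired up is shown below, assuming illustrative getter and setter names; DEFINE_SIMPLE_ATTRIBUTE generates the file operations from this pair.

static u64 msr_event;   /* performance event selector value used by the scheduler */

static int msr_event_get(void *data, u64 *val)
{
    *val = msr_event;
    return 0;
}

static int msr_event_set(void *data, u64 val)
{
    msr_event = val;    /* picked up at the next context switch */
    return 0;
}

/* Generates msr_event_fops with formatted (hexadecimal) read and write access. */
DEFINE_SIMPLE_ATTRIBUTE(msr_event_fops, msr_event_get, msr_event_set, "0x%llx\n");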

4.6.2 Reading Performance Event Data

Performance event data is stored within the Linux task_struct structure. Similar to the sched_perf_msr module, access to the task_struct must be handled by the kernel, and thus is implemented in a kernel module for convenience. To list per task performance counter data, another proc file is created, named sched_perf_ps to mirror the Linux ps command. Unlike the sched_perf_msr file, sched_perf_ps is read only. The data returned from invoking the proc file system callback function provides details specific to tasks within the perf_sched_class, as well as system wide accumulated statistics, formatted to resemble the Linux ps command. Real time examination of tasks' PMC values as they are updated is enabled by utilizing the Linux watch command, which essentially allows the user to display updated per task data analogous to such commands as top and htop.

4.6.3 Performance Scheduler Debug Module

A final kernel module, named sched_perf_mod, was implemented to provide debugging information and stack tracing. If loaded, certain scheduling functions within perf_sched_class store data concerning scheduling decisions. The module creates a like named read-only proc file that allows user space to read this data.

4.7 Performance Scheduling Class Tests

4.7.1 Performance Monitoring Accuracy

To test the accuracy of the performance monitoring event counters under the perf_sched_class, a number of user space applications performed a set number of counted events. Again, floating point events were used, as their use may be considered exclusively attributable to user space processes. The pseudocode for the test is contained within Figure 4.6.

float PI = 3.14159265359, E = 2.718281828459;
float x[size] = { ... };   // random data
float y[size] = { ... };   // random data
float z[size];

write(PERFEVTSEL_VAL, "/proc/sched_perf_msr");   // select the FP event
sched_setscheduler(getpid(), PERF_SCHED);        // migrate to perf_sched_class

for (int i = 0; i < size; ++i)
    z[i] = PI * x[i] + E * y[i];                 // do work

printf("Size = %d", size);
sched_yield();   // allows perf_sched_class to obtain the final PMC value

print("/proc/sched_perf_ps");   // prints per-task PMC data

Figure 4.6: PMC Accuracy Test Pseudocode

Clearly, the number of executed floating point instructions is dependent upon the variable size. The test confirmed the accuracy of the events counted, within a reasonable degree, across a number of runs carried out upon a system in various states and executing various loads. The PERFEVTSEL_VAL selected for this test is described as FP_COMP_OPS_EXE.X87 in the Intel Software Developers Manual [12, p. 19-39]. Instructions which prompt a count include floating point addition, subtraction, multiplication, division, and square root calculations.

4.7.2 Performance Monitoring Overhead

The Interbench [17] benchmark was used to demonstrate the overhead of accumulating performance monitoring events within kernel context switches. Interbench tests simulated interactive tasks within a system under some simulated background load. For the experimental tests, the source code was modified to utilize the performance scheduling class as described in section 4.5, while the control ran the benchmark unmodified, utilizing the vanilla CFS scheduling class. Table 4.4 below displays a portion of the results obtained from the Interbench Linux interactive benchmark. In this example, the benchmark simulates a number of background loads, during which the interactivity of a simulated audio load is measured in terms of latency and the ability of the scheduler to provide sufficient CPU utilization within real time deadlines. Table 4.4 displays the results of running the benchmark on the vanilla scheduler, as compared to the difference when run utilizing the performance monitoring scheduling class, as denoted by the deltas. Negative delta values in latency columns indicate improved performance, positive values indicate decreased performance, and dash marks indicate no change. The simulated audio load has real time requirements, as noted by the final column, though it is simulated as having very low resource requirements, as one would expect.

Table 4.4: Interbench Result: Audio Load

Simulated   Latency       Latency       Max            % Desired   % Deadlines
Load        Ave (ms)      Std Dev       Latency (ms)   CPU         Met
None        0.1  —        0.1  —        0.2  —         100  —      100  —
Video       0.1  ∆-0.1    0.1  ∆-0.1    0.2  ∆-0.1     100  —      100  —
X           0.1  —        0.1  —        0.2  —         100  —      100  —
Burn        0.0  —        0.0  —        0.2  ∆-0.2     100  —      100  —
Write       0.1  ∆0.2     0.1  ∆1.6     0.2  ∆17       100  —      100  —
Read        0.1  —        0.1  ∆0.2     0.3  ∆7.6      100  —      100  —
Compile     0.0  —        0.2  ∆-0.2    2.3  ∆-2.2     100  —      100  —
Memload     0.1  ∆-0.1    0.1  ∆-0.1    0.2  ∆-0.1     100  —      100  —

Unmodified Linux 3.2 Scheduler vs Performance Scheduling Class Deltas

The values indicate the ability of the scheduler to service the simulated audio load. The simulated audio load performs uncached reads from RAM at 50 millisecond intervals and simulates decoding with a 5% CPU load. The video decoding simulation is similar to the audio load simulation, though it requires additional resources. Read and Write simulate loads which perform disk operations, whereas the Burn load is a CPU bound task. As a compilation task requires both disk reads and writes as well as CPU bound work, the Compile load simulation performs each of these operations. The X simulation requests between zero and complete utilization of the CPU, to simulate a user performing operations within a GUI environment. Memload simulates a task which overwhelms the memory capacity, requiring memory pages to be swapped to disk.

As Table 4.4 depicts, an increase in scheduling latency for the audio simulation was present during simulated background loads performing disk read and write operations. Nevertheless, real-time deadlines were met, and in the cases of the video decoding, compilation, and memory swapping loads, latency was improved somewhat. A full description of the Interbench benchmark is provided by its documentation [17]. A number of other loads were simulated with the same background loads, including Video, X, and Gaming simulations. For brevity, the remaining Interbench result tables are provided in Section B.1 of Appendix B.¹²

4.8 Simulation Tests

4.8.1 Simulator Design

To evaluate the effectiveness and fairness of a proposed load balancing algorithm which considers finer grained resource utilization in allocation decisions, a user space simulator was implemented. Simulation of multiprocessor run queues¹³, scheduling, task utilization, and load balancing was implemented. The simulator was fully implemented in C++, utilizing POSIX threads for each run queue to fully simulate resource requirements under load balancing, including concurrency locking mechanisms. The goal of the simulator is to measure the increase in realized throughput obtained by balancing tasks along a finer grained resource in addition to system load. Similar to the kernel modifications, the simulator allows for tasks to be identified as heavily dependent upon a resource, and balances loads accordingly. Tasks were scheduled within each run queue utilizing the CFS algorithm as described in Section 2.4.2. The choice of CFS over other algorithms was made due to the simplicity of the algorithm, at the expense of the fairness bounds of task time shares. Likewise, the task balancing algorithm used as a baseline was modeled after the Linux implementation which, after ignoring scheduling domains, groups, and additional complexities, is effectively a distributed optimization algorithm utilizing Equation 3.1 that attempts to eliminate load range.

12 It is believed that the results from disk read and write operations may be caused by a delay in scheduling tasks after waking upon completion of a disk operation. Improving performance within the performance scheduling class for waking tasks is beyond the scope of this work.

13 The number of run queues simulated was fixed at run time, though the framework allows for the simulation of any number of run queues.

The pseudocode for the simplified Baseline Algorithm is presented in Appendix A as Figure A.6. The algorithm essentially finds the queue which is most overloaded, ensures data protection by locking data structures, and attempts to pull tasks to reduce the difference in run queue loads. Workloads were generated using an algorithm which used pseudo-random number generation to determine both the number of tasks as well as their weights. The number of tasks to execute was determined utilizing the common random number generator provided by the C standard library. Each test simulated a random sequence of scheduling and balancing of the generated task sets. The pseudo-random number generator determined a sequence of numbers (a0, a1, ..., an) such that

a_i = rand() % MOD_VAL,             if i = 0
a_i = rand() % a_{i−1},             if i mod 2 = 1 and a_{i−1} ≠ 0
a_i = rand() % MOD_VAL + a_{i−1},   otherwise    (4.1)

Likewise, the task weights, and thus system loads, were pseudo-randomly generated. Unlike the sequence of generated task counts, the weight values were generated according to a normal distribution produced by the newer C++11 standard random number generator. According to reference documentation, random numbers are generated according to the probability density function

p(x | µ, σ) = (1 / (σ√(2π))) e^{ −(x−µ)² / (2σ²) }

for the mean value µ with a standard deviation of σ. According to the CFS algorithm, tasks are weighted according to how nice they are to the system, with a value of zero being the default. Accordingly, µ was set to zero, and upon brief experimentation a standard deviation of σ = 5 was chosen, as the resulting values appeared within the Linux niceness limits of [−20, 19] as defined in the source file sched.h. To ensure that randomness did not play a role in the results of the tests, equivalent seed values were passed as a program argument in both the baseline and experimental tests for comparison.

To simulate load balancing under both dynamic and steady states, two load generation methods were evaluated. The first load generation method created a number of tasks according to Equation 4.1. Upon inspection, it is clear that the sequence of numbers generated by Equation 4.1 oscillates. Guided by these oscillating numbers, the load generator either creates new tasks to match the next element in the sequence, or waits until sufficient tasks complete and exit the system so that the number of tasks remaining in the system equates to the next sequence element. The purpose was to stress the load balancing algorithm and require migration. An example of the resulting migration and task creation is provided below in Figure 4.7.

Figure 4.7: Baseline and Multidimensional Balancing, Dynamic Load Generation - Seed 0
Number of Migrations and Tasks versus Processor Time (·10⁶)

The second load generation method utilizes the same sequence of numbers from Equation 4.1, though during each phase it creates a number of new tasks equal to the next sequence number, in addition to the tasks already existing in the system. Each load generation phase is halted until a steady state exists with no remaining load balancing operations possible. The number of migrations with respect to tasks is intended to display the amount of work required by the load balancing algorithm. Rather than providing steady state times, as the simulator operates in user space without timing equivalent to that of an operating system, the number of migrations is the best metric for displaying the operations required to reach steady state operation. Because these tests perform all load balancing under a static state, they are referred to as Static Load Generation in text and figures, as in Figure B.3.
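For illustration, the oscillating sequence from Equation (4.1) can be generated as in the following C sketch; MOD_VAL is whatever bound the simulator used, and the value here is an assumed placeholder.

#include <stdlib.h>

#define MOD_VAL 50   /* assumed bound on generated task counts */

/* Returns a_i given the index i and the previous element prev. */
int next_task_count(int i, int prev)
{
    if (i == 0)
        return rand() % MOD_VAL;
    if (i % 2 == 1 && prev != 0)
        return rand() % prev;            /* odd index: drop below prev */
    return rand() % MOD_VAL + prev;      /* even index: rise above prev */
}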

4.8.1.1 Experimental Resource Dependent Balanced Load Partitioning Algorithm

As mentioned previously, the Linux load balancing algorithm is based on eliminating differences between the loads assigned to pairs of run queues. This algorithm was implemented to migrate a best first fit of tasks to reduce load ranges, and was selected to constitute a baseline measurement for algorithms which ignore finer grained resource dependency, as is the norm. The algorithm pseudocode is presented as Figure A.6 in Appendix A. In designing a resource allocation scheme to reduce contention on finer grained resources, the decision must be made whether to attempt to reduce contention in the task scheduler, the load balancing algorithm, or both. For the purpose of reducing interdependent scheduling decisions between hardware threads sharing a finer grained resource, the decision was made to allocate finer grained resources solely within the load balancing algorithm. For the case of an FPU in a multithreaded architecture, the consequence of this decision requires tasks to fall into one of two classes: either they highly utilize the finer grained resource under consideration, or they do not. This ensures that the independent scheduling of tasks on each hardware thread incurs a reduced amount of contention with respect to the FPU.

A number of load balancing algorithms were developed, though many incurred far too much complexity in terms of either computational requirements or design. Rudimentary experiments included comparison against an optimal dynamic programming algorithm given in Figure A.1. After testing these rudimentary balancing designs, all but one were eliminated from consideration. The remaining algorithm builds upon the distributed framework of the Linux load balancing algorithm. Like the Linux scheduler and the baseline algorithm, tasks are migrated periodically between pairs of run queues, selected by an algorithm executed periodically on behalf of each run queue, which serves as the migration destination and selects a run queue from which it may pull tasks. Unlike the baseline, however, the algorithm may not perform a simplistic equalizing of run queue loads, due to the added dimensionality of considering finer grained resource utilization. The implication of such a decision is reduced parallelization of balancing duties, as load imbalances no longer propagate to nearly the same degree. Rather than equalizing run queue loads, an optimal load partitioning is determined with respect to which tasks, and along which dimension, should be acquired. For brevity, tasks requiring the finer grained resource will be denoted as priority tasks for the remainder of the section. To simplify the implementation, a single hardware thread per physical core was designated to execute the lion's share of tasks with finer grained resource dependency, accordingly denoted as a priority run queue. In general, run queues are

balanced with respect to an ideal partitioning, as defined by the equation I = L(T)/K, and migrate tasks only if such an action reduces the difference between the actual load and its calculated

ideal. To determine the ideal load incurred by priority tasks, we first denote T^p as the set of all priority tasks. Then we define the ideal load of priority tasks, I^p, for run queues associated with hardware threads designated to execute them as

I^p = min( I, 2L(T^p)/K )    (4.2)

Equation 4.2 defines the ideal priority load a priority run queue should receive. The ideal priority load I^p should not exceed the ideal partitioning load I, though the priority run queues may not receive more than their proportion of the system wide priority load. As this work considers the FPU and dependencies between pairs of multithreaded processors, the number of priority queues is K/2. Thus the ideal priority load is bound by the system wide priority load L(T^p) factored by the proportion of priority queues, 2/K.

For either type of run queue, task loads are determined to be sourced if migrating them will reduce the local difference between the ideal load and that which is realized. For example, as a destination priority queue has as its priority reducing the absolute difference |I^p − L(P_destination ∩ T^p)|, the initial goal is to source priority tasks from run queues which will not incur additional imbalance. Accordingly, the search originates in non priority run queues for priority tasks. Such tasks are pulled even if doing so increases the difference |I − L(P_source)|, as non priority queues are not intended to retain priority tasks. If such tasks are not available to reduce the priority ideal difference, the following step searches along neighboring priority run queues which are overloaded with respect to their priority ideal. Lastly, a search is made to acquire non priority tasks to reduce overall load imbalances, if and only if doing so improves both the source's and the destination's overall ideal difference |I − L(P_destination)|.

Similar to priority run queues, the remaining run queues attempt to pull tasks when they are underutilized in terms of overall load. The search for tasks to reduce |I − L(P_destination)| for non priority destination run queues begins with priority queues, followed by neighboring non priority queues. This essentially mirrors the priority run queue's algorithm. Lastly, if the load of priority tasks exceeds that which would balance along the priority queues, an attempt to source these tasks is made last, and only if doing so will not result in the situation I^p < L(P_source ∩ T^p). This added condition requires that the majority of priority tasks reside on run queues which would be their ideal destinations under more favorable conditions. This reduces migrations in the event that the system normalizes to a more ideal state.

To simplify the search for tasks in the run queues in terms of whether they are denoted with a priority flag or not, a total of four data structures are utilized to store tasks. Each data structure is implemented using a user space implementation of the Linux kernel Red-Black binary search tree as described in section 2.4.3. The first rbtree is sorted by a key value pertaining to the virtual runtime as defined by the CFS algorithm in Equation 2.15. This rbtree is denoted as the queue, and is a proper subset of the second rbtree, whose key value is primarily task weight, sorted secondly by process identification number to arbitrate collisions. The single task contained within the weight sorted rbtree but not contained within the queue rbtree is the task denoted as current, which has been scheduled to accumulate CPU resource utilization. This implementation detail allows the current task to migrate among run queues once its scheduling duration has expired. The weight queue is further partitioned into two disjoint sets: tasks denoted as priority tasks, and those which are not. Again, this implementation feature allows for a reduction in the search for tasks to migrate. Rather than iterating in sorted order, the algorithm may search for an appropriate task weight using the binary nature of the sorted rbtree. Additionally, priority queues attempting to pull priority tasks from non priority queues are likely to search a dramatically reduced number of tasks, as are non priority queues attempting to pull non priority tasks from priority queues. The underlying algorithm is presented in Section A.2.
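One plausible arrangement of those containers is sketched below in C, mirroring the kernel's rb_root type; the actual simulator is C++, and the field names here are illustrative.

/* Per run queue task containers as described above. */
struct sim_rq {
    struct rb_root queue;         /* runnable tasks keyed by virtual runtime  */
    struct rb_root by_weight;     /* all tasks keyed by (weight, pid)         */
    struct rb_root weight_prio;   /* weight sorted subset: priority tasks     */
    struct rb_root weight_norm;   /* weight sorted subset: remaining tasks    */
    struct sim_task *current;     /* running task, absent from queue          */
};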

4.8.1.2 Scalability of Finely Grained Resource Partitioning

As the design of the balanced load partitioning algorithm presented in the previous subsection 4.8.1.1 borrowed greatly from the Linux load balancing algorithm in terms of the distribution of task migration decisions, we can analyze the scalability of the presented algorithm in relative terms. In the worst case under the Linux implementation, a single run queue would suddenly become idle, leaving an imbalance between a single run queue and each of its neighbors. If, prior to such an event, the system was completely balanced, a task migration would occur between a pair of run queues, effectively splitting 1/(K−1)th of the load such that both would remain underutilized. This would effectively distribute the imbalance between a pair of run queues. This recursive ripple effect would propagate exponentially, requiring the migration of tasks to or from each run queue and resulting in O(log K) load balancing triggers. Alternatively, the presented algorithm performs task migrations only if such an action reduces the difference from an ideal partition. Rather than propagating, a single run queue would be required to visit each neighboring run queue and attempt to pull tasks equivalent to L(T)(K − 1)/K².

While complexity alone may indicate that attempting to distribute task migration is more effective, this assumes that each run queue consists of infinitesimally divisible load, which of course is not the case. Additionally, the number of migrations required under optimal conditions is equivalent, being K − 1. The main issue with the propagating nature of the Linux implementation is that tasks essentially travel among multiple run queues, since the task sets are not able to be evenly partitioned in any efficient manner. Once the algorithm is triggered, the propagation also incurs the overhead of synchronization requirements, effectively locking down each pair of run queues. Even in the optimal condition where K − 1 migrations take place, O(K log K) lock and unlock sequences are required, which can often become contended upon propagation. On the other hand, by attempting to pull tasks to a single run queue, no contention is required, and the number of run queue locks is linear, as each queue is visited once. In terms of scalability with respect to workload measured in number of tasks, it can be said that, in the intended case with respect to the scope of this work, the proposed algorithm performs better due to the reduction of the search space. The scope of this work intends to properly distribute task sets which include a fairly even distribution of tasks which are or are not heavily dependent upon a finer grained resource. In the worst case, either set of tasks is no greater than their union, and therefore a search for tasks to migrate is bound similarly to the Linux implementation.

4.8.1.3 Dynamics of Finely Grained Resource Partitioning

In considering the dynamic effects of the proposed algorithm and methodology, it is important to note that under a steady state, every load balancing algorithm performs best when minimal migrations occur. While fairness may be compromised to an extent, excessive migrations incur system overheads well in excess of the calculations required to perform them. Memory cache invalidation and potential memory migration in large scale NUMA systems are costly and must be considered. For this reason, the decision was made to avoid migration propagation in favor of reducing pair-wise load differences, thereby reducing the number of task migrations. Additionally, fairness is only affected in any meaningful way when the number of system tasks is low and a balanced partition is not possible, which is not the norm in many systems in use today.

4.8.1.4 Simulator Experiment Evaluation

In order to evaluate the effectiveness of partitioning a load along finer grained resources, specifically the FPU, the simulation was made to account for task performance with regard to contention with competing tasks sharing a single simulated physical processor. During each scheduling tick, the simulator records the task's efficiency by considering the nature of the competing task. The accounting of work completed for each task was simulated using the average results from the tests previously detailed in section 4.2. Individual tasks were said to complete work with respect to a factor detailed in Table 4.5.

Table 4.5: Work Factor Table

This Task       Opposing Task   Efficiency Factor
non priority    non priority     69.6%
priority        priority         89.9%
priority        non priority     84.8%
non priority    priority         93.2%
(any)           none            100.0%

Each task was assigned a pseudo random amount of work to complete. The control and experiment load generation sequences were controlled by the random number seed. A task was said to complete once its completed work met or exceeded the set amount of work; no preemption was simulated to force the completion of a task prior to the next scheduling tick. Each task's efficiency value was accumulated at scheduling ticks by evaluating the ratio of work done to service time, and both values were also accumulated for each run queue. Once the simulation concluded, the system wide efficiency values were obtained.
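As an illustration of this per-tick accounting, the following minimal C sketch applies the Table 4.5 factors each tick; the names and structure are assumptions for exposition, not the simulator's actual code.

```c
/* Work factors from Table 4.5, indexed [this task][opposing task],
 * where 0 = non priority and 1 = priority (FPU-dependent). */
static const double work_factor[2][2] = {
    { 0.696, 0.932 },  /* this non priority vs. {non priority, priority} */
    { 0.848, 0.899 },  /* this priority     vs. {non priority, priority} */
};

struct sim_task {
    int    priority_flag;  /* 1 if modeled as FPU-dependent      */
    double service_time;   /* CPU ticks received                 */
    double work_done;      /* contention-adjusted work completed */
    double work_target;    /* pseudo random amount of work to do */
};

/* Advance one scheduling tick for task t; 'other' is the task sharing the
 * simulated physical processor, or NULL when the sibling thread is idle.
 * Returns nonzero once the task's assigned work is complete. */
static int tick(struct sim_task *t, const struct sim_task *other)
{
    double f = other ? work_factor[t->priority_flag][other->priority_flag]
                     : 1.0;  /* "none" row of Table 4.5 */

    t->service_time += 1.0;
    t->work_done    += f;   /* efficiency = work_done / service_time */
    return t->work_done >= t->work_target;
}
```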

4.8.2 Simulator Experimental Results - Dynamic Load

Table 4.6 displays the results from a number of instances of the simulator under dynamic load generation. Each instance was run simulating a system with two multithreaded cores, for a total of four symmetric multithreaded processors. The results clearly show an improved efficiency, indicating that the Floating Point Unit would be under lower contention and thus allow for greater throughput given a task set executing similarly to those described in section 4.2. For these tests, tasks were generated under the condition that every other task would be highly dependent upon the FPU.

Table 4.6: Simulator Test Results - Multiple Test Instances

Run       Seed      Baseline Load   Migrations   2D Load        Migrations   Efficiency     Num Migrations
                    Balancing       Per Task     Balancing      Per Task     % Difference   % Difference
                    Efficiency                   Efficiency
1         0         85.57 %         2.55         88.98 %        1.77         3.91 %         -36.11 %
2         6291625   85.61 %         2.01         89.68 %        1.63         4.64 %         -20.88 %
3         478192    85.87 %         2.06         89.09 %        1.61         3.68 %         -24.52 %
4         2910919   85.67 %         2.4          89.76 %        1.96         4.66 %         -20.18 %
5         2715362   85.76 %         2.27         89.32 %        1.53         4.07 %         -38.95 %
6         7822987   85.82 %         2.33         88.93 %        1.58         3.56 %         -38.36 %
7         8202285   85.93 %         2.4          90.28 %        1.71         4.94 %         -33.58 %
8         6853288   87.37 %         2.05         89.99 %        1.68         2.95 %         -19.84 %
9         7948478   86.6 %          2.8          88.25 %        1.57         1.89 %         -56.29 %
10        3502578   85.61 %         2.11         89.64 %        1.49         4.60 %         -34.44 %
11        3228095   85.34 %         2.23         90.09 %        1.66         5.42 %         -29.31 %
12        4118480   85.83 %         2.41         89.94 %        1.72         4.68 %         -33.41 %
13        2653280   85.17 %         2.06         89.31 %        1.39         4.75 %         -38.84 %
14        584371    86.65 %         2.65         89.68 %        1.56         3.44 %         -51.78 %
15        5513999   84.78 %         2.35         89.67 %        1.59         5.61 %         -38.58 %
16        3014600   85.88 %         2.4          89.79 %        1.57         4.45 %         -41.81 %
17        6887348   86.09 %         2.31         89.4 %         1.88         3.77 %         -20.53 %
18        8606400   85.47 %         2.35         87.58 %        1.72         2.44 %         -30.96 %
19        8467957   85.34 %         2.12         88.95 %        1.43         4.14 %         -38.87 %
Average             85.81 %         2.31         89.39 %        1.63         4.08 %         -34.07 %
Std Dev             0.58            0.21         0.66           0.14         0.96           10.16

In addition to higher efficiency, the two-dimensional load balancing algorithm performed with a lower number of migrations per task, averaging roughly 35% fewer migrations, for a percent difference of 40.50%. For the measurements in this test, both the average number of migrations and the load efficiencies, as well as the percent difference of efficiencies, fall well within a single standard deviation, indicating that under similar conditions similar results would likely follow. For the number of migrations, less consistent results were obtained, though each was seen to be much lower than the baseline, indicating that additional rules might be created to further reduce any error in load imbalance.

As mentioned previously, the results displayed in Table 4.6 considered an even ratio of tasks modeled as highly dependent upon the FPU to those not. To establish the scope of this work and to evaluate the range of ratios for which an improvement in throughput may be seen in actual systems, a number of additional tests were performed under the same conditions as Table 4.6 with two exceptions. First, the seed of each run was held constant at an arbitrary 893756. Second, the ratio of tasks modeled to be highly dependent upon the FPU was varied. Table 4.7 contains the results for the ranges of ratios for which this work demonstrates promise. As would be expected considering the constraints of the simulation, as fewer FPU dependent tasks exist in the system, performing load balancing as if those tasks were relevant in securing higher throughput is less effective. The same is true for higher rates of FPU dependent tasks.

To give a clearer indication of the ranges in which task dependency ratios impact the effectiveness of multidimensional load balancing, the next test verifies the degree to which varying results are likely to occur. Table 4.8 demonstrates the variability of the load balancing simulation against the results predicted from Table 4.6; it repeats the 1:1 ratio test performed in Table 4.7 over 19 iterations. Tables 4.8 and 4.7 indicate that higher throughput is likely realized utilizing the presented 2D Balancing Algorithm for FPU:ALU task ratios of 1:1 to 2:1, and quite possibly for ratios 1:4 to 2:1.

Table 4.7: Simulator Test Results - Ratio Range Test

FPU:ALU   Baseline      Baseline          2D Balancing   2D Balancing      %Efficiency
Ratio     %Efficiency   Migrations/Task   %Efficiency    Migrations/Task   Difference
1:8       75.58 %       2.41              76.26 %        1.08              0.68 %
1:7       75.75 %       2.49              76.72 %        1.01              0.97 %
1:6       77.27 %       2.50              77.76 %        1.03              0.49 %
1:5       77.50 %       2.20              78.26 %        1.17              0.76 %
1:4       78.08 %       2.22              81.05 %        1.16              2.97 %
1:3       80.10 %       2.50              81.68 %        1.35              1.58 %
1:2       81.82 %       2.20              83.20 %        1.29              1.38 %
1:1       84.89 %       1.93              88.59 %        1.52              3.70 %
2:1       88.30 %       2.49              90.10 %        1.40              1.80 %
3:1       89.40 %       2.14              90.28 %        1.51              0.88 %
4:1       89.92 %       2.35              90.40 %        1.42              0.48 %
5:1       90.14 %       2.59              90.23 %        1.49              0.09 %
6:1       90.00 %       2.30              90.33 %        1.37              0.33 %
7:1       90.30 %       2.72              90.60 %        1.53              0.30 %

Additionally, the multidimensional load balancing algorithm cannot be said to perform consistently worse in terms of efficiency for any of the ratio ranges presented. To visualize the probability of efficiency improvement, Figure 4.8 displays the normal distribution probability density functions f for the Baseline and Multidimensional Balancing according to

\[
f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{4.3}
\]

where the standard deviation is denoted by σ and the mean by µ.

To demonstrate the effectiveness of the proposed load balancing algorithm under optimal conditions, consider the average efficiency values provided in Table 4.8 and the potential work factors from Table 4.5. It is logically assumed that load balancing which does not consider finer grained resource contention would incur task efficiencies randomly distributed between the efficiency values in Table 4.5.

Table 4.8: Simulator Test Results - Repeated Consistency Test

Run       Baseline      Baseline          2D Balancing   2D Balancing
          %Efficiency   Migrations/Task   %Efficiency    Migrations/Task
1         85.83 %       2.29              89.58 %        1.53
2         85.72 %       2.47              89.26 %        1.53
3         85.58 %       2.29              89.72 %        1.55
4         85.86 %       2.49              89.12 %        1.47
5         84.98 %       2.43              89.34 %        1.46
6         85.58 %       2.36              89.11 %        1.44
7         85.76 %       2.02              89.20 %        1.46
8         86.35 %       2.26              89.01 %        1.67
9         85.55 %       2.53              88.85 %        1.40
10        85.22 %       2.37              88.64 %        1.48
11        86.43 %       2.26              89.14 %        1.40
12        85.69 %       2.12              89.82 %        1.55
13        85.82 %       2.53              89.11 %        1.54
14        85.06 %       2.63              88.93 %        1.63
15        84.96 %       2.57              89.62 %        1.63
16        85.43 %       2.41              89.46 %        1.46
17        85.45 %       2.19              89.17 %        1.57
18        85.84 %       2.15              89.39 %        1.64
19        86.10 %       2.29              89.08 %        1.52
Average   85.64 %       2.35              89.24 %        1.52
Std Dev   0.41          0.17              0.30           0.08
Range     1.47 %        0.61              1.18 %         0.27

Therefore, comparing the average of the possible efficiency factors, 84.375%, to the realized simulation value of 85.64%, we can see that the baseline load balancing algorithm does somewhat better than expected. On the other hand, averaging the balanced finer grained resource factors from Table 4.5 yields 89.0%; compared to the actualized simulation result of 89.24%, the proposed algorithm

Figure 4.8: Simulator Test Results Repeated Consistency Normal Distribution Probability Density (Baseline Efficiency Probability Density: µ = 85.64, σ = 0.41; 2D Efficiency Probability Density: µ = 89.24, σ = 0.3; x-axis: Efficiency %, 84 to 91)

also performs above an expected upper limit. Both outcomes are likely due to the dynamic load generation resulting in lower contention as tasks complete. Regardless, it is sufficient to say that the performance of the proposed algorithm is near an upper bound of the possible throughput improvement.
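The expected bounds quoted above follow directly from Table 4.5: balancing that ignores the FPU dimension is assumed to make all four contended pairings equally likely, while ideal multidimensional balancing leaves only the two mixed pairings:

\[
\frac{69.6 + 89.9 + 84.8 + 93.2}{4} = 84.375\%,
\qquad
\frac{84.8 + 93.2}{2} = 89.0\%.
\]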

4.8.3 Multidimensional Load Balancing Effects on Fairness

A limitation of the Multidimensional Load Balancing Algorithm not yet discussed concerns fairness as defined by GPS and other PSS research. In such research, fairness is a measure of the system's ability to dedicate a proportional share of a resource to a task. Clearly, in multicore systems, fairness is dependent upon load partitioning, if only in the short term. As previous work elaborates, over the long term tasks may be migrated in an attempt to bound deviations from fairness [21]. As tasks migrate, care must also be given to normalize scheduling parameters to ensure that doing so will not cause deviations in lag as defined by equation 2.10. As this work primarily considers load balancing, discussion of how to normalize scheduling parameters falls outside its scope. Instead we evaluate the degree to which fairness is affected by load imbalances.

Figure 4.9: Standard Deviation Percent Imbalance - 4 Queues Balancing through Steady State Static Load Generation - Seed 893756 (Baseline Balancing vs. Multidimensional Balancing; y-axis: Imbalance % Standard Deviation; x-axis: Processor Time ·10⁷)

Figure 4.9 displays a measurement of fairness over time under the Static Load Generation described previously; the steady states are easily seen as horizontal lines. Fairness was measured as a percentage relative to the ideal load on each run queue. Formally, the standard deviation in the presented notation is

\[
\sigma = \sqrt{\frac{1}{K} \sum_{\forall P_i \in P} \left( 100\,\frac{L(P_i)}{I} - 100 \right)^{2}} . \tag{4.4}
\]

As seen in Figure 4.9 and in the additional plots in Section B.5 of the Appendix, there is room for improvement in terms of fairness for the Multidimensional Load Balancing Algorithm over the Baseline considered. Along with fairness, these tests also demonstrate scalability with respect to the number of run queues.14 Similar results were seen in each of these tests, with one notable exception. As the number of cores increases, infeasible task sets occur at a higher rate for a constant number of tasks. Thus, when the number of tasks is low, as at the beginning of these tests, imbalances are much higher. In the tests concerning 16 and 32 run queues, for the first couple of steady states there are fewer than two tasks per run queue on average, resulting in widely variable loads among run queues.15
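As a concrete rendering, the following minimal C sketch computes equation 4.4 from an array of run queue loads; the parameter names are illustrative.

```c
#include <math.h>

/* Equation 4.4: standard deviation of per-run-queue load, expressed in
 * percent relative to the ideal per-queue load I = L(T)/K.
 * load[k] holds L(P_k) for each of the K run queues. */
static double imbalance_pct_stddev(const double *load, int K, double total)
{
    double I = total / K;  /* ideal load per run queue */
    double sum_sq = 0.0;

    for (int k = 0; k < K; k++) {
        double dev = 100.0 * load[k] / I - 100.0;  /* % deviation */
        sum_sq += dev * dev;
    }
    return sqrt(sum_sq / K);
}
```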

14 In terms of Percent Efficiency and Number of Tasks Migrated, the values in each of these tests were in accordance with the values described previously in this chapter and have thus been omitted.
15 With consideration of fairness and of infeasible task sets with widely variable task weights, the random task weight generator was revised to narrow the range around the mean, with a standard deviation of 4 as opposed to 5 in the previous tests.

5 Conclusions and Future Work

5.1 Conclusions

This work introduces the online profiling of tasks using Performance Monitoring Counters and demonstrates their accuracy and usefulness in resource allocation decisions. Experimental tests clearly demonstrate the need to consider dependency on finer grained resources, as the contention caused by ignoring such behavior introduces varying processing rates. The advances of proportional share schedulers were discussed, as well as their success in providing guarantees of quality and graceful degradation of service. These guarantees, however, assume consistent processing rates, as guarantees of service are defined by resource allocation time shares; with variable processing rates, these bounds are relaxed in real system environments. By reducing unneeded contention, thereby increasing overall throughput and stabilizing processing rates, it is argued that increased fairness may be achieved.

The ability to partition task sets across multiple dimensions without incurring additional computational complexity was demonstrated by the introduction of the distributed multidimensional load balancing algorithm. The simulation results from both static and dynamic workloads were evaluated and indicate that such an implementation is feasible. By partitioning tasks by finer grained resource dependency, smaller task sets are considered for migration, and the total number of migrations required to reach a steady state is reduced. As fewer migrations are required, there is room in future work to increase fairness by migrating additional tasks.

5.2 Future Work

It was assumed in this work that task resource requirements are static. Unfortunately, real world applications execute with a variety of resource requirement patterns, making their behavior highly dynamic. Although this behavior was not modeled with the simulator, such dynamic behavior may be modeled under the presented work as if a task exits with its old behavior and reenters competition for resources with its new behavior. Only once the proposed profiling scheduling class is extended to support load balancing may such situations be fully evaluated and considered; due to the complexity of modern preemptive Linux kernels, this feature was left unimplemented in this work.

One may also reconsider the conditions which qualify tasks as highly dependent upon a resource, and whether such considerations may utilize continuously variable metrics as opposed to the binary classification assumed in this work. Further methods of task profiling in place of Performance Monitoring Counters, as well as alternative resources to profile, may also be considered. As task fairness is one metric which suffers from considering multiple dimensions in load balancing decisions, one may investigate the positive effects that round balancing may contribute by shifting load imbalances among run queues. It is likely that a bound may be derived with respect to the scheduling and load balancing quanta, in addition to the range of task weights.

References

[1] “The linux kernel archives.” [Online]. Available: https://www.kernel.org/

[2] “Lts ubuntu wiki.” [Online]. Available: https://wiki.ubuntu.com/LTS/

[3] R. R. Al-Ouran, “Linux implementation of a new model for handling task dynamics in proportional share based scheduling systems,” Master’s thesis, Ohio University, 2010.

[4] D. P. Bovet and M. Cesati, Understanding the Linux Kernel.

[5] ——, Understanding the Linux Kernel, 2nd ed., A. Oram, Ed. O’Reilly Media, 2002.

[6] A. Caprara, H. Kellerer, and U. Pferschy, “The multiple subset sum problem,” SIAM Journal on Optimization, vol. 11, no. 2, pp. 308–319, 2000.

[7] A. Chandra, M. Adler, P. Goyal, and P. Shenoy, “Surplus fair scheduling: A proportional-share cpu scheduling algorithm for symmetric multiprocessors,” in Proceedings of the 4th conference on Symposium on Operating System Design & Implementation-Volume 4. USENIX Association, 2000, pp. 45–58.

[8] J. Corbet. (2011, January) The real bkl end game. [Online]. Available: https://lwn.net/Articles/424657/

[9] A. Demers, S. Keshav, and S. Shenker, “Analysis and simulation of a fair queueing algorithm,” in ACM SIGCOMM Computer Communication Review, vol. 19, no. 4. ACM, 1989, pp. 1–12.

[10] M. S. Dunn, “Asymmetric non-uniform proportional share scheduling,” Master’s thesis, Ohio University, 2010.

[11] F. Haiping, A. Arcangeli, and D. Woodhouse, “rbtree,” https://github.com/forhappy/rbtree/, 2012.

[12] Intel® 64 and IA-32 Architectures Software Developer's Manual, Intel, 2013, Volume 3B: System Programming Guide (Part 2). [Online]. Available: http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html

[13] Intel Processor Identification and the CPUID Instruction, Intel, 2009, Application Note 485.

[14] P. Irelan and S. Kuo, Performance Monitoring Unit Sharing Guide, Intel. [Online]. Available: https://software.intel.com/sites/default/files/m/0/f/6/5/e/20476-EPS05_PMU_Sharing_Guide_v2_5_final.pdf

[15] D. Jovanovska, “Scheduling time-sensitive tasks using a combination of proportional-share and priority scheduling algorithms,” Master’s thesis, Ohio University, 2011.

[16] H. Kellerer, U. Pferschy, and D. Pisinger, Knapsack problems. Springer Science & Business Media, 2004.

[17] C. Kolivas. (2006, March) The homepage of interbench, the linux interactivity benchmark. [Online]. Available: http://users.on.net/~ckolivas/interbench/

[18] A. Kumar, “Multiprocessing with the completely fair scheduler,” IBM developerWorks, 2008.

[19] R. Landley. (2007, January) Red-black trees (rbtree) in linux. [Online]. Available: https://www.kernel.org/doc/Documentation/rbtree.txt

[20] D. Levine, “A parallel genetic algorithm for the set partitioning problem,” Argonne National Laboratory, 1994.

[21] T. Li, D. Baumberger, and S. Hahn, “Efficient and scalable multiprocessor fair scheduling using distributed weighted round-robin,” in Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’09. New York, NY, USA: ACM, 2009, pp. 65–74. [Online]. Available: http://doi.acm.org/10.1145/1504176.1504188

[22] R. Love, Linux kernel development, 3rd ed. Pearson Education, 2010.

[23] P. E. McKenney, “Stochastic fairness queuing,” in INFOCOM’90, Ninth Annual Joint Conference of the IEEE Computer and Communication Societies. The Multiple Facets of Integration. Proceedings, IEEE. IEEE, 1990, pp. 733–740.

[24] I. Molnar. (2009, September) ’perf sched’: Utility to capture, measure and analyze scheduler latencies and behavior. [Online]. Available: http://lwn.net/Articles/353295/

[25] C. H. Papadimitriou, “On the complexity of integer programming,” J. ACM, vol. 28, no. 4, pp. 765–768, October 1981. [Online]. Available: http://doi.acm.org/10.1145/322276.322287

[26] A. K. Parekh and R. G. Gallager, “A generalized processor sharing approach to flow control in integrated services networks: the single-node case,” IEEE/ACM Transactions on Networking (ToN), vol. 1, no. 3, pp. 344–357, 1993.

[27] A. Silberschatz, P. B. Galvin, and G. Gagne, Operating System Concepts, 7th ed. John Wiley & Sons, 2004.

[28] ——, Operating System Concepts, 8th ed. Wiley Publishing, 2008.

[29] I. Stoica and H. Abdel-Wahab, “Earliest eligible virtual deadline first: A flexible and accurate mechanism for proportional share resource allocation,” Old Dominion University, Norfolk, VA, 1995.

[30] D. Tsafrir, Y. Etsion, and D. G. Feitelson, “Secretly monopolizing the cpu without superuser privileges,” in 16th USENIX Security Symposium, vol. 7. USENIX Association, 2007, pp. 1–18.

[31] T. Willhalm, R. Dementiev, and P. Fay. (2014, December) Intel performance counter monitor - a better way to measure cpu utilization. Intel. [Online]. Available: https://software.intel.com/en-us/articles/intel-performance-counter-monitor

[32] L. Zhang, “Virtual clock: A new traffic control algorithm for packet switching networks,” in ACM SIGCOMM Computer Communication Review, vol. 20, no. 4. ACM, 1990, pp. 19–29.

Appendix A: Algorithms

A.1 Distributed Partitioning Dynamic Programming Algorithm

Require: P_i, P, L(T), K
 1: if L(P_i) < L(T)/K then
 2:     let P_o be such that L(P_o) = max over ∀P_k ∈ P of L(P_k)   // Find overloaded partition
 3:     if P_i ≠ P_o then
 4:         optimal ← L(T)/K                                        // Calculate optimal partition load
 5:         for i := 0 to optimal do tasks[i] := NULL
 6:         for i := 0 to optimal do back[i] := 0
 7:         for ∀t_i ∈ P_o ∪ P_i do
 8:             for j := optimal to 0 do
 9:                 if (j = 0 OR tasks[j] ≠ NULL) AND j + w_i ≤ optimal then
10:                     tasks[w_i + j] ← t_i
11:                     back[w_i + j] ← j
12:                 end if
13:             end for
14:         end for
15:         while tasks[optimal] = NULL do
16:             optimal ← optimal − 1
17:         end while
18:         index := optimal − L(P_i)
19:         P_x := ∅                                                // Build optimal partition
20:         while index ≠ 0 do
21:             P_x ← P_x ∪ {tasks[index]}
22:             index ← back[index]
23:         end while
24:         P_y := (P_i ∪ P_o) \ P_x                                // Build second partition
25:         P_i ← P_x
26:         P_o ← P_y
27:         return
28:     end if
29: end if
30: return

Figure A.1: Distributed Partitioning Dynamic Programming Algorithm

A number of short circuit evaluation optimizations may be made to the algorithm above, notably checking whether the optimal load partition has been found after line 11. Additionally, rather than iterating over the union of the overloaded and underloaded run queues, one may first iterate over the overloaded run queue searching for a subset with load equal to the

difference L(T)/K − L(P_i), and simply migrate that subset from the overloaded run queue to the underloaded run queue. Additionally, the use of a more efficient data structure to contain the memoization data, such as an iterably accessible map, would reduce the need

for the tasks[j] ≠ NULL test at line 9.

In addition to the algorithm above, which is based on the Bellman recursion, there exists an algorithm which performs in O(n·w_max) [16, p. 83] under the condition

w_max < L(T)/K. Clearly, if such a condition cannot be met, the partitioning is trivial, as an infeasible weight situation exists as described by Chandra et al. [7]. The algorithm, named Balsub, is based on adding and removing items from a split solution calculated as

$\hat{x} = \sum_{i=0}^{s} w_i$ for the split item, defined as $s = \min\{\, j : \sum_{i=0}^{j} w_i > L(T)/K \,\}$.16
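For reference, a compact C sketch of the Bellman-style table fill underlying Figure A.1 is given below; the dense arrays and names are assumptions for illustration, and in practice the short circuit and data structure optimizations noted above would apply.

```c
/* Fill the subset-sum reachability table over task weights w[0..n-1]
 * for target = floor(L(T)/K).  After the call, last_task[s] >= 0 iff
 * some subset of tasks has total load s, in which case last_task[s]
 * is the final task of one such subset and back[s] is the load reached
 * before it was added.  O(n * target) time and space. */
static void fill_partition_table(const int *w, int n, int target,
                                 int *last_task, int *back)
{
    for (int s = 0; s <= target; s++) {
        last_task[s] = -1;  /* unreachable so far; load 0 corresponds */
        back[s] = 0;        /* to the trivially reachable empty set   */
    }

    for (int i = 0; i < n; i++)
        /* iterate downward so each task is used at most once */
        for (int s = target - w[i]; s >= 0; s--)
            if ((s == 0 || last_task[s] >= 0) && last_task[s + w[i]] < 0) {
                last_task[s + w[i]] = i;  /* task i completes load s+w[i] */
                back[s + w[i]] = s;       /* backtracking pointer         */
            }
}
```

To recover a migration set, one scans down from target to the largest reachable load and walks back[] collecting last_task[] entries, mirroring the scan and backtrack steps of Figure A.1.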

A.2 Multidimensional Load Balancing Algorithm

The algorithm works by identifying the dimension along which the load differs from the ideal. From there, the selection of which partition to pull tasks from is optimized by prioritizing the run queues from which to source the difference from the ideal load.

16 Here the ordering of elements is arbitrary.

Require: P_cpu, P, L(T), L(T^p), K
 1: let I = L(T)/K
 2: let I^p = min(I, 2·L(T^p)/K)                                              // 17
 3: // Ensure destination rq is priority dimension
 4: if is_priority_rq(cpu) then
 5:     if L(P_cpu ∩ T^p) < I^p then
 6:         // attempt to pull priority dimension tasks from other dimension rqs
 7:         for ∀P_source : !is_priority_rq(source) do
 8:             while |P_source ∩ T^p| > 0 and L(P_cpu ∩ T^p) < I^p do
 9:                 let t_pull ∈ P_source ∩ T^p : min |I^p − L((P_cpu ∩ T^p) ∪ {t_pull})|
10:                 if |I^p − L((P_cpu ∩ T^p) ∪ {t_pull})| < I^p − L(P_cpu ∩ T^p) then
11:                     P_cpu ← P_cpu ∪ {t_pull}
12:                     P_source ← P_source \ {t_pull}
13:                 else break
14:                 end if
15:             end while
16:         end for
17:         // attempt to balance along same dimension
18:         for ∀P_source : is_priority_rq(source) do
19:             while L(P_source ∩ T^p) > I^p and L(P_cpu ∩ T^p) < I^p do
20:                 let t_i ∈ P_source ∩ T^p : min |I^p − L((P_cpu ∩ T^p) ∪ {t_i})|
21:                 let t_j ∈ P_source ∩ T^p : min |I^p − L((P_source ∩ T^p) \ {t_j})|
22:                 if w_i < w_j then let t_pull := t_i
23:                 else let t_pull := t_j
24:                 end if
25:                 if |I^p − L((P_cpu ∩ T^p) ∪ {t_pull})| < I^p − L(P_cpu ∩ T^p) then
26:                     if |I^p − L((P_source ∩ T^p) \ {t_pull})| < L(P_source ∩ T^p) − I^p then
27:                         P_cpu ← P_cpu ∪ {t_pull}
28:                         P_source ← P_source \ {t_pull}
29:                     else break
30:                     end if
31:                 else break
32:                 end if
33:             end while
34:         end for
35:     end if

Figure A.2: Distributed Partitioning Algorithm - Priority Dimension - Part 1

17 Where we assume 2 = |{P_i : is_priority_rq(i), ∀P_i ∈ P}|.

36:     if L(P_cpu) < I then
37:         // attempt to pull other dimension tasks from non priority rqs    // 18
38:         for ∀P_source : !is_priority_rq(source) do
39:             while L(P_source) > I do
40:                 let t_i ∈ P_source : min |I − L(P_cpu ∪ {t_i})|
41:                 let t_j ∈ P_source : min |I − L(P_source \ {t_j})|
42:                                      and I < L(P_source \ {t_j})
43:                 if w_i < w_j then
44:                     let t_pull := t_i
45:                 else
46:                     let t_pull := t_j
47:                 end if
48:                 if I < L(P_source \ {t_pull}) then
49:                     if |I − L(P_cpu ∪ {t_pull})| < I − L(P_cpu) then
50:                         if |I − L(P_source \ {t_pull})| < L(P_source) − I then
51:                             P_cpu ← P_cpu ∪ {t_pull}
52:                             P_source ← P_source \ {t_pull}
53:                         else break
54:                         end if
55:                     else break
56:                     end if
57:                 else break
58:                 end if
59:             end while
60:         end for
61:     end if
62: end if

Figure A.3: Distributed Partitioning Algorithm - Priority Dimension - Part 2

The algorithm for the first and second dimensions is largely mirrored, though an added priority to pull tasks toward the first dimension was arbitrarily assigned according to the metrics used in Table 4.5.

18 The notation ignores other priority tasks, as those have already been searched for; the presentation is abbreviated for brevity.

Require: P_cpu, P, L(T), L(T^p), K
 1: let I = L(T)/K
 2: let T_p = T \ T^p
 3: // Ensure destination is not priority dimension
 4: if !is_priority_rq(cpu) then
 5:     if L(P_cpu ∩ T_p) < I then
 6:         // attempt to pull non priority dimension tasks from other dimension rqs
 7:         for ∀P_source : is_priority_rq(source) do
 8:             while |P_source ∩ T_p| > 0 and L(P_source) > I do
 9:                 let t_pull ∈ P_source ∩ T_p : min |I − L((P_cpu ∩ T_p) ∪ {t_pull})|
10:                 if |I − L((P_cpu ∩ T_p) ∪ {t_pull})| < I − L(P_cpu) then
11:                     P_cpu ← P_cpu ∪ {t_pull}
12:                     P_source ← P_source \ {t_pull}
13:                 else break
14:                 end if
15:             end while
16:         end for
17:         // attempt to balance along same dimension
18:         for ∀P_source : !is_priority_rq(source) and L(P_source) > I do
19:             while |P_source| > 1 and L(P_source) > I do
20:                 let t_i ∈ P_source : min |I − L(P_cpu ∪ {t_i})|
21:                 let t_j ∈ P_source : min |I − L(P_source \ {t_j})|
22:                 if w_i < w_j then
23:                     let t_pull := t_i
24:                 else
25:                     let t_pull := t_j
26:                 end if
27:                 if |I − L(P_cpu ∪ {t_pull})| < I − L(P_cpu) then
28:                     if |I − L(P_source \ {t_pull})| < L(P_source) − I then
29:                         P_cpu ← P_cpu ∪ {t_pull}
30:                         P_source ← P_source \ {t_pull}
31:                     else break
32:                     end if
33:                 else break
34:                 end if
35:             end while
36:         end for

Figure A.4: Distributed Partitioning Algorithm - Second Dimension - Part 1

37:         // attempt to pull other dimension tasks from priority rqs
38:         for ∀P_source : is_priority_rq(source) and L(P_source ∩ T^p) > I^p do
39:             while |P_source ∩ T^p| > 1 and L(P_source ∩ T^p) > I^p do
40:                 let t_i ∈ P_source ∩ T^p : min |I − L(P_cpu ∪ {t_i})|
41:                 let t_j ∈ P_source ∩ T^p : min |I − L((P_source ∩ T^p) \ {t_j})|
42:                                            and I^p < L((P_source ∩ T^p) \ {t_j})
43:                 if w_i < w_j then
44:                     let t_pull := t_i
45:                 else
46:                     let t_pull := t_j
47:                 end if
48:                 if I^p < L((P_source ∩ T^p) \ {t_pull}) then
49:                     if |I − L(P_cpu ∪ {t_pull})| < I − L(P_cpu) then
50:                         if |I^p − L((P_source ∩ T^p) \ {t_pull})| < L(P_source ∩ T^p) − I^p then
51:                             P_cpu ← P_cpu ∪ {t_pull}
52:                             P_source ← P_source \ {t_pull}
53:                         else break
54:                         end if
55:                     else break
56:                     end if
57:                 else break
58:                 end if
59:             end while
60:         end for
61:     end if
62: end if

Figure A.5: Distributed Partitioning Algorithm - Second Dimension - Part 2

 1: let P_source = P_i : L(P_i) = max over ∀P_j ∈ P of L(P_j)
 2: if P_source ≠ P_cpu then
 3:     while L(P_source) > L(P_cpu) do
 4:         let t_pull ∈ P_source : min over ∀t ∈ P_source of |L(P_source \ {t_pull}) − L(P_cpu ∪ {t_pull})|
 5:         if L(P_source) − L(P_cpu) > L(P_source \ {t_pull}) − L(P_cpu ∪ {t_pull}) then
 6:             P_cpu ← P_cpu ∪ {t_pull}
 7:             P_source ← P_source \ {t_pull}
 8:         else break
 9:         end if
10:     end while
11: end if

Figure A.6: Baseline Balancing Algorithm

Appendix B: Additional Test Results and Figures

B.1 Additional Interbench Results

Table B.1: Video Load Interbench: Vanilla Scheduler vs Performance Monitoring Delta

Simulated   Latency     Latency      Max             % Desired    % Deadlines
Load        Ave (ms)    Std Dev      Latency (ms)    CPU          Met
None        0.1 —       0.1 ∆0.2     0.2 ∆7.0        100 —        100 —
X           0.1 ∆0.1    0.5 ∆2.9     18.9 ∆93.2      100 ∆-0.3    99.9 ∆-0.2
Burn        0.0 —       0.0 —        0.1 ∆1.5        100 —        100 —
Write       0.1 ∆0.4    0.1 ∆3.2     0.2 ∆36.5       100 ∆-0.4    100 ∆-0.9
Read        0.1 —       0.1 —        0.3 —           100 —        100 —
Compile     0.0 ∆0.1    0.4 ∆0.4     16.7 ∆18.4      100 ∆-0.1    99.9 —
Memload     0.1 ∆-0.1   0.1 ∆-0.1    0.2 ∆-0.1       100 —        100 —

Table B.2: X Load Interbench: Vanilla Scheduler vs Performance Monitoring Delta

Simulated   Latency     Latency      Max             % Desired    % Deadlines
Load        Ave (ms)    Std Dev      Latency (ms)    CPU          Met
None        0.0 —       0.0 —        0.1 —           100 —        100 —
Video       0.0 —       0.0 ∆0.2     0.1 ∆3.8        100 —        100 —
Burn        1.3 ∆-1.2   3.9 ∆-2.5    16 ∆8.0         80.3 ∆19.7   74.2 ∆25.8
Write       0.0 —       0.0 ∆0.4     0.1 ∆6.2        100 —        100 —
Read        0.0 —       0.0 —        0.2 ∆-0.1       100 —        100 —
Compile     0.8 ∆4.3    3.1 ∆8.7     15 ∆23          88.7 ∆-29    85.1 ∆-34.8
Memload     0.0 —       0.0 —        0.1 ∆-0.1       100 —        100 —

Table B.3: Gaming Load Interbench: Vanilla Scheduler vs Performance Monitoring Delta

Simulated   Latency     Latency       Max             % Desired
Load        Ave (ms)    Std Dev       Latency (ms)    CPU
None        0.0 —       0.0 —         0.0 —           100 —
Video       0.0 —       0.0 —         0.0 —           100 —
X           0.0 —       0.0 —         0.0 —           100 —
Burn        7.5 ∆-7.5   10.5 ∆-10.5   17.8 ∆-17.8     93 ∆7
Write       0.0 —       0.0 —         0.0 —           100 —
Read        0.0 —       0.0 —         0.0 —           100 —
Compile     4.7 ∆-4.7   9.3 ∆-9.3     37.8 ∆-37.5     95.5 ∆4.5
Memload     0.0 —       0.0 —         0.0 —           100 —

B.2 CPU Identifier Configuration in Linux Kernel

processor : 0    core id : 0
processor : 1    core id : 1
processor : 2    core id : 2
processor : 3    core id : 3
processor : 4    core id : 0
processor : 5    core id : 1
processor : 6    core id : 2
processor : 7    core id : 3

Figure B.1: cpuid Command Results
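The sibling pairing visible above (logical CPUs 0/4, 1/5, 2/6, and 3/7 sharing core ids 0 through 3) can be recovered programmatically; the following small C sketch scans a /proc/cpuinfo in the format shown, and is an illustrative assumption rather than the tooling used in this work.

```c
#include <stdio.h>

/* Print which physical core each logical processor maps to, by scanning
 * /proc/cpuinfo for "processor" and "core id" lines as in Figure B.1.
 * Logical CPUs printing the same core number are SMT siblings. */
int main(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    char line[256];
    int processor = -1, value;

    if (!f)
        return 1;
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "processor : %d", &value) == 1)
            processor = value;           /* remember current logical CPU */
        else if (sscanf(line, "core id : %d", &value) == 1)
            printf("logical cpu %d -> core %d\n", processor, value);
    }
    fclose(f);
    return 0;
}
```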

Figure B.1: cpuid Command Results 92

B.3 Simulated Baseline Load Balancing Results

Figure B.2: Baseline Balancing to Steady State Static Load Generation Efficiency Over Time (y-axis: Efficiency %, 80 to 85; x-axis: Processor Time ·10⁸)

Figure B.3: Baseline Balancing to Steady State Static Load Generation - Seed 893756 Number of Migrations and Tasks (bars: # Migrations; line: # Tasks; x-axis: Processor Time ·10⁸)

B.4 Simulated Multidimensional Load Balancing Results

Figure B.4: Multidimensional Balancing to Steady State Static Load Generation Efficiency Over Time (y-axis: Efficiency %, 86 to 89; x-axis: Processor Time ·10⁸)

Figure B.5: Multidimensional Balancing to Steady State Static Load Generation - Seed 893756 Number of Migrations and Tasks (bars: # Migrations; line: # Tasks; x-axis: Processor Time ·10⁸)

B.5 Standard Deviation Percent Imbalance and Scalability

Figure B.6: Standard Deviation Percent Imbalance - 8 Queues Balancing through Steady State Static Load Generation - Seed 893756 (Baseline Balancing vs. Multidimensional Balancing; y-axis: Imbalance % Standard Deviation; x-axis: Processor Time ·10⁷)

Figure B.7: Standard Deviation Percent Imbalance - 16 Queues Balancing through Steady State Static Load Generation - Seed 893756 (Baseline Balancing vs. Multidimensional Balancing)

Figure B.8: Standard Deviation Percent Imbalance - 32 Queues Balancing through Steady State Static Load Generation - Seed 893756 (Baseline Balancing vs. Multidimensional Balancing)

B.6 Simulated Baseline Load Balancing Efficiency Results

Figure B.9: Dynamic Efficiency Time Plot Seed 0
Figure B.10: Dynamic Efficiency Time Plot Seed 478192
Figure B.11: Dynamic Efficiency Time Plot Seed 584371
Figure B.12: Dynamic Efficiency Time Plot Seed 26553280
Figure B.13: Dynamic Efficiency Time Plot Seed 2715362
Figure B.14: Dynamic Efficiency Time Plot Seed 2910919
(Each plot shows Efficiency %, 80 to 85, versus Processor Time ·10⁶.)

B.7 Simulated Multidimensional Load Balancing Efficiency Results

Figure B.15: Dynamic Efficiency Time Plot Multidimensional Balancing - Seed 0
Figure B.16: Dynamic Efficiency Time Plot Multidimensional Balancing - Seed 478192
Figure B.17: Dynamic Efficiency Time Plot Multidimensional Balancing - Seed 584371
Figure B.18: Dynamic Efficiency Time Plot Multidimensional Balancing - Seed 26553280
Figure B.19: Dynamic Efficiency Time Plot Multidimensional Balancing - Seed 2715362
Figure B.20: Dynamic Efficiency Time Plot Multidimensional Balancing - Seed 2910919
(Each plot shows Efficiency % versus Processor Time ·10⁶.)
