HARDWARE ARCHITECTURES FOR SECURE, RELIABLE, AND ENERGY-EFFICIENT REAL-TIME SYSTEMS

A Dissertation Presented to the Faculty of the Graduate School of Cornell University

in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

by

Daniel Lo
August 2015

© 2015 Daniel Lo
ALL RIGHTS RESERVED

HARDWARE ARCHITECTURES FOR SECURE, RELIABLE, AND ENERGY-EFFICIENT REAL-TIME SYSTEMS

Daniel Lo, Ph.D.

Cornell University 2015

The last decade has seen an increased ubiquity of computers with the widespread adoption of smartphones and tablets and the continued spread of embedded cyber-physical systems. With this integration into our environments, it has become important to consider the real-world interactions of these computers rather than simply treating them as abstract computing machines. For example, cyber-physical systems typically have real-time constraints in order to ensure safe and correct physical interactions. Similarly, even mobile and desktop systems, which are not traditionally considered real-time systems, are inherently time-sensitive because of their interaction with users. Traditional techniques proposed for improving hardware architectures are not designed with these real-time constraints in mind. In this thesis, we explore some of the challenges and opportunities in hardware design for real-time systems. Specifically, we study recent techniques for improving security, reliability, and energy-efficiency of computer systems.

Run-time monitoring has been shown to be a promising technique for improving system security and reliability. Applying run-time monitoring to real-time systems introduces challenges due to the performance impact of monitoring. To address this, we first developed a technique for estimating the worst-case execution time impact of run-time monitoring, enabling it to be applied within existing real-time system design methodologies. Next, we developed hardware architectures that selectively enable and disable monitoring in order to meet real-time deadlines while still allowing a portion of the monitoring to be performed. We show architectures for hard and soft real-time systems, with different design decisions and trade-offs for each.

Dynamic voltage and frequency scaling (DVFS) is commonly found in modern processors as a way to dynamically trade off power and performance. In the presence of real-time deadlines, this presents an opportunity for improved energy-efficiency by slowing down jobs to reduce energy while still meeting deadline requirements. We developed a method for creating DVFS controllers that are able to predict the appropriate frequency level for tasks before they execute and show improved energy savings and reduced deadline misses compared to current state-of-the-art DVFS controllers.

BIOGRAPHICAL SKETCH

Daniel Lo received his Bachelor of Science (with Honors) degree in Electrical

Engineering from the California Institute of Technology in 2009. At that time, he began doctoral studies at Cornell University in the Department of Electrical and Computer Engineering. He has interned with the Jet Propulsion Laboratory, Applied Minds, and Intel. His research lies at the intersection of computer architecture, computer security, and real-time systems.

ACKNOWLEDGEMENTS

First and foremost, I would like to thank my advisor, Ed Suh. Without Ed's guidance, collaboration, and mentorship, this thesis would not have been possible. I would also like to thank my other committee members, Rajit Manohar and Mark Campbell, for their guidance and advice throughout my Ph.D. career.

The work in this thesis was not a solo effort. Special thanks to Tao Chen and Mohamed Ismail for their help on our work on secure and reliable real-time systems. Similarly, special thanks to Taejoon Song for his help on our work on energy-efficient real-time systems. I want to thank Rahul Mangharam for our early discussions that led us to pursue this research direction into real-time systems. I would also like to thank François Guimbretière for his feedback and insight on user interactions. The Suh Research Group and CSL have been a great academic, as well as social, community to be a part of. They have been a wonderful group of people to bounce ideas off, share knowledge, and get lunch with. Thank you to Ray Huang for helping me get set up and started when I first arrived at Cornell. Thank you to KK Yu for always being there whenever I needed a favor.

Thanks to Ji Kim for introducing me to Dungeons and Dragons and a myriad of new board games. Thanks to Andrew Ferraiuolo for always being free when I needed to take a break and thanks to Shreesha Srinath for showing me some great places to eat in Ithaca. My time at Cornell would not have been as enjoyable without all the friends that I have made here. Thank you to Nata Saslafsy, Koyuki Nakamura, Harrison Ko, and Scott Johnson for some great skate sessions and for teaching me to snowboard. Thanks to Crystal Lu for showing me that there's more to life than work. Thanks to Carissa Kang, Elizabeth Wijaya, Lynette Saw, Bryan Chong,

and Mairin Ng for teaching me Singlish over Mahjong. I also want to thank the Cornell Longboarding Club, Libe Slope FC, Mystery Machine, the Cayuga Windsurfing Club, the Cornell Cube Clube, the Glen Region SCCA, and the Enchanted Badger. These communities have helped me make Ithaca a place to call home for these six years. Finally, I would like to thank my parents for teaching me the value of hard work and discipline. I would not have been able to complete this journey without their continued support.

TABLE OF CONTENTS

Biographical Sketch
Acknowledgements
Table of Contents
List of Tables
List of Figures

1 Introduction
   1.1 Secure and Reliable Real-Time Systems
   1.2 Energy-Efficient Real-Time Systems
   1.3 Organization

2 Background
   2.1 Real-Time Systems
   2.2 Run-Time Monitoring
      2.2.1 Overview
      2.2.2 Hardware Architecture
      2.2.3 Example: Array Bounds Check

3 WCET Analysis of Run-Time Monitoring
   3.1 Introduction
   3.2 WCET Analysis
      3.2.1 Analysis Assumptions
      3.2.2 Implicit Path Enumeration
      3.2.3 Sequential Monitoring Bound
      3.2.4 FIFO Model
      3.2.5 MILP Formulation
   3.3 Example MILP Formulation
      3.3.1 Example Setup
      3.3.2 Creating the MFG
      3.3.3 Calculating the Monitoring Load
      3.3.4 MILP Optimization
   3.4 Evaluation
      3.4.1 Experimental Setup
      3.4.2 WCET Results
      3.4.3 Time to Solve Linear Programming Problem

4 Run-Time Monitoring for Hard Real-Time Systems
   4.1 Introduction
   4.2 Opportunistic Monitoring
      4.2.1 Measuring Slack
      4.2.2 Dropping Tasks
   4.3 Hardware-Based Dropping Architecture
      4.3.1 Slack Tracking
      4.3.2 Metadata Invalidation Module
      4.3.3 Filtering Invalid Metadata
      4.3.4 Full Architecture
   4.4 Monitoring Extensions
      4.4.1 Uninitialized Memory Check
      4.4.2 Return Address Check
      4.4.3 Dynamic Information Tracking
   4.5 Evaluation
      4.5.1 Methodology
      4.5.2 Amount of Monitoring Performed
      4.5.3 FPGA-based Monitor
      4.5.4 Area and Power Overheads

5 Run-Time Monitoring for Soft Real-Time Systems
   5.1 Introduction
   5.2 Partial Run-Time Monitoring
      5.2.1 Overhead of Run-Time Monitoring
      5.2.2 Partial Monitoring for Adjustable Overhead
      5.2.3 Applications and Metrics
      5.2.4 Design Challenges
   5.3 Architecture for Partial Monitoring
      5.3.1 Effects of Dropping Monitoring
      5.3.2 Invalidation for Preventing False Positives
      5.3.3 Dataflow Engine for Preventing False Positives
      5.3.4 Filtering Null Metadata
   5.4 Dropping Policies
      5.4.1 Deciding When to Drop
      5.4.2 Deciding Which Events to Drop
   5.5 Evaluation
      5.5.1 Experimental Setup
      5.5.2 Baseline Monitoring Overhead
      5.5.3 Coverage with Adjustable Partial Monitoring
      5.5.4 FPGA-based Monitor
      5.5.5 Comparing Dropping Policies
      5.5.6 Multiple-Run Coverage
      5.5.7 Area and Power Overheads

6 DVFS Control for Soft Real-Time Systems
   6.1 Introduction
   6.2 Execution Time Variation in Interactive Applications
      6.2.1 Variations in Execution Time
      6.2.2 Existing DVFS Controllers
      6.2.3 Prediction-Based Control
   6.3 Execution Time Prediction for Energy-Efficiency
      6.3.1 Overview
      6.3.2 Program Features
      6.3.3 Execution Time Prediction Model
      6.3.4 DVFS Model
      6.3.5 Alternate Prediction Models
   6.4 System-Level Framework
      6.4.1 Annotation
      6.4.2 Off-line Analysis
      6.4.3 Run-time Prediction
   6.5 Evaluation
      6.5.1 Experimental Setup
      6.5.2 Energy Savings and Deadline Misses
      6.5.3 Analysis of Overheads and Error
      6.5.4 Under-prediction Trade-off
      6.5.5 Race to Idle
      6.5.6 Heterogeneous Cores

7 Related Work
   7.1 Real-Time Systems
      7.1.1 Hardware Architectures for Real-Time Systems
      7.1.2 Worst-Case Execution Time Analysis
   7.2 Security and Reliability
      7.2.1 Run-Time Monitoring
      7.2.2 Partial Monitoring
   7.3 Performance-Energy Trade-off
      7.3.1 Dynamic Voltage and Frequency Scaling
      7.3.2 Execution Time Prediction

8 Conclusion
   8.1 Summary
   8.2 Future Directions
      8.2.1 Secure and Reliable Real-Time Systems
      8.2.2 Energy-Efficient Real-Time Systems
      8.2.3 Hardware Design for Real-Time Systems

Bibliography

LIST OF TABLES

3.1 Estimated and observed WCET (clock cycles) with and without monitoring.
3.2 Running time of lp_solve in seconds to determine worst-case stalls (stall), sequential bound (sequential), and worst-case execution times (wcet).

4.1 Monitoring tasks for UMC, RAC, and DIFT.
4.2 Operation of the MIM for UMC, RAC, and DIFT.
4.3 Operation of the MFM for UMC, RAC, and DIFT.
4.4 Average power overheads for dropping hardware at zero headstart slack. Percentages in parentheses are normalized to the main core's power usage.

5.1 Trade-off between performance overhead and flexibility/complexity of run-time monitoring systems.
5.2 Average power overhead for dropping hardware. Percentages are normalized to the main core power.
5.3 Average runtime power of the monitoring core.

6.1 Variable and notation descriptions.
6.2 Differences in features selected and predicted execution times for different platforms compared to an ARM-LITTLE platform.
6.3 Benchmark descriptions and task of interest.
6.4 Execution time statistics when running at maximum frequency.
6.5 Parameters used in DVFS and heterogeneous core model.

7.1 Previous work on partial run-time monitoring.

LIST OF FIGURES

1.1 Overview of techniques presented in this thesis.

2.1 Example of tasks, jobs, and deadlines.
2.2 Timings related to a job.
2.3 Parallel monitoring architecture model.
2.4 The main core is modified to forward information on certain instruction types.
2.5 Pipeline diagram of monitoring. The main core stalls due to the slower throughput of the monitor.
2.6 Example of array bounds check monitor.

3.1 Example control flow graph and its corresponding monitoring flow graph. Blue (dark) nodes include a forwarded instruction.
3.2 Monitoring load is modeled for each node.
3.3 Code example with its corresponding control flow graph and monitoring flow graph. Blue (dark) nodes include a forwarded (ldr) instruction.
3.4 Toolflow for WCET estimation of parallel monitoring.
3.5 Ratio of estimated WCET to observed WCET.
3.6 Ratio of WCET using our proposed method compared to assuming sequential monitoring.
3.7 Ratio of WCET with monitoring to WCET without monitoring.

4.1 Dynamic slack and total slack.
4.2 Dynamic slack increases when a sub-task finishes early. Slack is consumed as monitoring causes the task to stall.
4.3 WCET with original monitoring task, with software dropping task, and with no monitoring for UMC. Results are normalized to the WCET with no monitoring.
4.4 Hardware modules for keeping track of slack and making a drop decision.
4.5 Metadata invalidation module (MIM).
4.6 Metadata filtering module (MFM).
4.7 Block diagram of full hardware architecture for opportunistic monitoring on hard real-time systems.
4.8 Percentage of monitored, dropped, and filtered monitoring events with zero headstart slack.
4.9 Percentage of checks that are not dropped or filtered for zero headstart slack.
4.10 Percentage of checks performed as headstart slack is varied for UMC. Headstart slack is shown normalized to the main task's WCET.
4.11 Percentage of checks performed as headstart slack is varied for RAC implemented on a processor core. Headstart slack is shown normalized to the main task's WCET.
4.12 Percentage of checks performed as headstart slack is varied for DIFT. Headstart slack is shown normalized to the main task's WCET.
4.13 Percentage of monitored, dropped, and filtered monitoring events for an FPGA-based monitor with zero headstart slack.
4.14 Percentage of checks that are not dropped or filtered on an FPGA-based monitor for zero headstart slack.
4.15 Percentage of checks performed as headstart slack is varied for UMC on an FPGA-based monitor. Headstart slack is shown normalized to the main task's WCET.
4.16 Percentage of checks performed as headstart slack is varied for RAC on an FPGA-based monitor. Headstart slack is shown normalized to the main task's WCET.
4.17 Percentage of checks performed as headstart slack is varied for DIFT on an FPGA-based monitor. Headstart slack is shown normalized to the main task's WCET.

5.1 Example of array bounds check monitor.
5.2 Example of using invalidation information to prevent false positives.
5.3 Hardware support for dropping.
5.4 Hardware architecture of the dataflow engine.
5.5 Example of using information about null metadata to filter monitoring events.
5.6 Slack and its effect on monitoring over time.
5.7 Slack tracking and drop decision hardware.
5.8 Comparison of dropping policies using metadata dependence graphs. Square nodes represent events where checks are performed. Blue (dark) nodes indicate which nodes can be dropped.
5.9 Monitoring overhead with null metadata filtering.
5.10 Coverage vs. varying overhead budget for BC.
5.11 Coverage vs. varying overhead budget for UMC.
5.12 Coverage vs. varying overhead budget for LS.
5.13 Average error vs. varying overhead budget for instruction-mix profiling. Whisker bars show min and max errors.
5.14 Coverage versus varying overhead budget for UMC running on an FPGA-based monitor.
5.15 Comparing dropping policies for UMC.
5.16 Comparing dropping policies for BC.
5.17 Coverage over multiple runs for BC with a 10% probability to not drop events.

6.1 Execution time of jobs (frames) for ldecode (video decoder).
6.2 Execution time of jobs (blue, solid) and execution time expected by a PID controller (red, dashed) for ldecode (video decoder).
6.3 Overview of prediction-based control.
6.4 Example control flow graph. Each node is annotated with its number of instructions.
6.5 Steps to predict execution time from job input and program state.
6.6 Example of feature counters inserted for conditionals, loops, and function calls.
6.7 Example of program slicing for control flow features.
6.8 Average execution time of jobs (frames) for ldecode (video decoding) as frequency level varies.
6.9 The effective budget decreases due to slice and DVFS execution time.
6.10 95th-percentile switching times for DVFS.
6.11 Example of programmer annotation to mark task boundaries and time budgets.
6.12 Overall flow for prediction-based DVFS control.
6.13 Options for how to run predictor.
6.14 Normalized energy usage and deadline misses.
6.15 Normalized energy usage and deadline misses as time budget is varied.
6.16 Average time to run prediction slice and switch DVFS levels.
6.17 Normalized energy usage and deadline misses with overheads removed and oracle prediction.
6.18 Box-and-whisker plots of prediction error. Positive numbers correspond to over-prediction and negative numbers correspond to under-prediction.
6.19 Energy vs. deadline misses for various under-predict penalty weights (α) for ldecode.
6.20 Idle power for the ODROID-XU3's Cortex-A7 core for each frequency setting.
6.21 Normalized energy usage and deadline misses when the frequency is dropped to minimum (200 MHz) when jobs finish.
6.22 Relationship between frequency, execution time, and frequency based on analytical model.
6.23 Energy and deadline misses for running with only a little core (littleonly) and with both big and little cores (biglittle).
6.24 Number of core switches and jobs run on the big core as a percentage of total jobs for varying normalized budgets when comparing biglittle to littleonly.
6.25 Energy and deadline misses for running with only a big core (bigonly) and with both big and little cores (biglittle) for a normalized budget of 1.
6.26 Number of core switches and jobs run on the big core as a percentage of total jobs for a normalized budget of 1 when comparing biglittle to bigonly.

CHAPTER 1
INTRODUCTION

The last decade has seen an increased ubiquity of computers with the widespread adoption of smartphones and tablets and the continued spread of embedded cyber-physical systems. With this deep integration into our environments, it has become important to consider the real-world interactions of these computers rather than simply treating them as abstract computing machines.

For example, cyber-physical systems typically have real-time constraints in order to ensure safe and correct physical interactions. Similarly, even mobile and desktop systems, which are not traditionally considered real-time systems, are inherently time-sensitive because of their interaction with users.

Real-time systems introduce new constraints and trade-offs to system design. This is due to the presence of real-time deadlines for tasks. Correct operation involves both correct computation of outputs as well as completing tasks before their deadlines. Depending on the application, deadlines can be either considered hard deadlines, where a missed deadline indicates a fatal error, or soft deadlines, where a missed deadline results in poor performance but is not fatal. Depending on the type of deadline, the level of timing guarantee differs, which can affect the design decisions made.

Typically, these systems are designed conservatively so that, even in the worst-case, tasks will not exceed their deadlines. This has been traditionally handled at the software layer and there exists extensive work in developing real-time scheduling algorithms [20], real-time operating systems (RTOS) [3, 38], and worst-case execution time (WCET) analysis techniques [92]. On the other hand,

Figure 1.1: Overview of techniques presented in this thesis.

modern hardware features are typically evaluated and optimized for average-case performance in order to apply to general compute workloads; they do not take into account the deadlines of the real-time tasks that run on them. In this thesis, we explore the hardware/software co-design of real-time systems in order to address the challenges and opportunities that arise when the underlying hardware is aware of real-time requirements.

Specifically, we explore two modern hardware techniques. First, we study run-time monitoring techniques for improving system security and reliability in the context of real-time systems. Second, we look at using real-time deadlines to inform dynamic voltage and frequency scaling (DVFS) of hardware for improved energy efficiency. We apply these techniques to hard and soft real-time systems. Figure 1.1 shows an overview of the work developed in this thesis and the areas they target.

1.1 Secure and Reliable Real-Time Systems

Run-time monitoring techniques have been shown to be useful for improving the reliability, security, and debugging capabilities of computer systems. For example, Hardbound is a hardware-assisted technique to detect out-of-bound memory accesses, which can cause undesired behavior or create a security vulnerability if uncaught [25]. Intel has recently announced plans to support buffer overflow protection similar to Hardbound in future architectures [44]. Similarly, run-time monitoring can enable many other new security, reliability, and debugging capabilities such as fine-grained memory protection [93], information flow tracking [82, 33], control flow integrity checking [4], hardware error detection [56], data-race detection [26, 68], etc. However, today's run-time monitoring techniques cannot be easily applied to critical real-time systems due to their lack of timing guarantees. Thus, we have developed three techniques in order to enable run-time monitoring on real-time systems.

Traditional real-time systems design requires estimation of the worst-case execution time (WCET) of tasks in order to guarantee schedulability. In order to enable run-time monitoring to be applied to traditional real-time design flows, an estimate of its impact on worst-case execution time is needed. Thus, we first developed a static analysis method to estimate the WCET impact of monitoring (Chapter 3). This allows run-time monitoring to be enabled if there is enough static slack in a real-time system's schedule. Our analysis extends the traditional integer linear programming (ILP) formulation used for estimating WCET by adding new constraints in order to model the impact of run-time monitoring.

Next, we look at the challenge of applying run-time monitoring to an existing real-time system design. In this case, the real-time schedule may not have enough extra time (i.e., slack in the schedule) to support the full overhead of performing run-time monitoring. However, it may still be possible to perform a portion of the monitoring. We designed a hardware architecture that enables and disables run-time monitoring in order to perform as much monitoring as possible without violating hard real-time deadlines (Chapter 4).

Finally, soft real-time and interactive systems do not require strict timing guarantees but still look to achieve timely execution of programs. For these systems, we designed an architecture that enables adjustable overheads by trading off monitoring coverage (Chapter 5). By relaxing some of the constraints of our architecture for hard real-time systems, we are able to achieve higher monitoring coverage while still remaining close to deadlines.

1.2 Energy-Efficient Real-Time Systems

For modern mobile and embedded systems, energy usage is a major concern due to limited battery life. Techniques, such as dynamic voltage and frequency scaling (DVFS) and core migration, have been developed to enable a trade-off between power and performance. These resource allocation techniques can be used to reduce energy usage at the cost of increased execution time.

Traditional managers for these resource allocation techniques, such as DVFS, work in a best-effort manner. They operate under the assumption that better performance is desired and only attempt to decrease resources when the performance impact is minimal or when power or thermal limits are reached. However, many applications or jobs have response-time deadlines. Finishing faster than this response-time requirement has no benefit. For example, responding to a user input faster than human reaction time does not improve user experience. Similarly, decoding video frames faster than the video frame rate has no added benefit.

For these interactive applications, we can save energy by using techniques such as DVFS to run at a lower power point, without impacting user experience. In this thesis, we present a method to predict the appropriate DVFS operating point for tasks in order to meet deadlines with minimal energy (Chapter 6). This prediction needs to take into account the effect of inputs and program state in order to account for dynamic variations in task execution time. We show how this is possible through analysis of task source code in order to generate task-specific predictors for DVFS control.
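
As a rough sketch of the kind of decision this enables (not the predictor developed in Chapter 6), the C fragment below picks the lowest frequency whose predicted execution time still fits a job's time budget. The frequency table and the predicted cycle count are illustrative assumptions.

    /* Sketch: choose the lowest DVFS level expected to meet the deadline.
     * The frequency table and the predicted cycle count are illustrative
     * assumptions, not the interface used in Chapter 6. */
    #include <stdint.h>

    static const double freq_mhz[] = { 200, 400, 600, 800, 1000, 1200, 1400 };
    #define NUM_LEVELS (sizeof(freq_mhz) / sizeof(freq_mhz[0]))

    /* Returns the index of the lowest frequency whose predicted execution
     * time fits within the time budget; falls back to the highest level. */
    int choose_dvfs_level(uint64_t predicted_cycles, double budget_us)
    {
        for (int i = 0; i < (int)NUM_LEVELS; i++) {
            double exec_us = (double)predicted_cycles / freq_mhz[i]; /* MHz = cycles per us */
            if (exec_us <= budget_us)
                return i;   /* lowest (most energy-efficient) level that meets the budget */
        }
        return (int)NUM_LEVELS - 1; /* cannot meet the budget; run as fast as possible */
    }

In Chapter 6, the predicted execution time itself is produced before the job runs, from features of the job's input and program state.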

1.3 Organization

The rest of the thesis is organized as follows. Chapter 2 discusses some background knowledge and terminology that will be used throughout the thesis. Chapters 3, 4, and 5 discuss techniques on improving system security and reliability through run-time monitoring on real-time systems. Chapter 6 presents a method for improving the energy efficiency of real-time systems through prediction-based DVFS control. Finally, Chapter 7 discusses related work and Chapter 8 concludes the thesis and discusses future directions.

CHAPTER 2
BACKGROUND

This chapter provides some background knowledge that will be useful in understanding the work presented in this thesis. Section 2.1 discusses some terminology related to real-time systems that will be used throughout the thesis. Section 2.2 describes run-time monitoring techniques for security and reliability which will be used in Chapters 3, 4, and 5.

2.1 Real-Time Systems

A real-time application is typically specified as a set of tasks which have timing requirements. We refer to the time period in which a task must complete as its time budget. For example, games are typically written with a main task that handles reading user input, updating game state, and displaying the output.

In order to maintain a certain frame rate (e.g., 30 frames per second), this task must finish within the frame period budget (e.g., 33 milliseconds for 30 frames per second operation).

We define a job as a dynamic instance of a task. Figure 2.1 shows how a

Figure 2.1: Example of tasks, jobs, and deadlines.

Figure 2.2: Timings related to a job.

task maps to multiple jobs. Each job has a deadline, which is the time by which it must finish execution. For example, for a game running at 30 frames per second, 30 jobs for the game loop task are run each second. Each of these jobs has a deadline which is 33 milliseconds after the job's start time. These jobs all correspond to the same set of static task code, but their dynamic behavior differs due to different inputs and program state. For example, one job may see that the user has pressed a button and thus execute code to process this button press. Other jobs may see no new user input and skip the need to process user input. As a result, job execution times can vary depending on input and program state.

There are several time periods involved with the execution of a job in a real-time system (see Figure 2.2). The time that it takes a job to finish is its actual execution time. The worst-case execution time of the task across all possible inputs and initial states is its actual worst-case execution time (WCET). Tasks are statically analyzed to produce an estimated worst-case execution time (WCET) for timing analysis and scheduling. This estimated WCET is guaranteed to be conservative (i.e., it is greater than or equal to the actual WCET). The time between the estimated WCET and the deadline is the static slack that exists. That is, the schedule could be adjusted to reduce the deadline and still guarantee that tasks finish on time. On the other hand, we refer to the difference in actual execution time and estimated WCET as the dynamic slack. This is the slack time that is only available at run-time.
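
As a minimal numeric illustration of these definitions, the C sketch below computes static and dynamic slack from a deadline, an estimated WCET, and a measured execution time; the values are made up for the example.

    /* Sketch: static and dynamic slack as defined above.
     * All times are in the same unit (microseconds here); values are illustrative. */
    #include <stdio.h>

    int main(void)
    {
        double deadline       = 33000.0;  /* e.g., one 30 fps frame period */
        double estimated_wcet = 25000.0;  /* conservative static estimate  */
        double actual_time    = 18000.0;  /* measured time of one job      */

        double static_slack  = deadline - estimated_wcet;    /* known before run time  */
        double dynamic_slack = estimated_wcet - actual_time; /* only known at run time */

        printf("static slack:  %.0f us\n", static_slack);
        printf("dynamic slack: %.0f us\n", dynamic_slack);
        return 0;
    }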

2.2 Run-Time Monitoring

2.2.1 Overview

Run-time monitoring techniques have been shown to be useful for improving the security and reliability of computer systems. The basic idea behind run-time monitoring techniques is to observe the run-time operation of programs and check that certain invariants are met in order to identify anomalous behavior. A wide variety of checks have been proposed including fine-grained memory protection [93], information flow tracking [82, 33], etc. Section 2.2.2 discusses the hardware architecture for implementing run-time monitoring and Section 2.2.3 walks through how an example array bounds check operates.

2.2.2 Hardware Architecture

There are several options on how to implement run-time monitoring. Implementing run-time monitoring in software using binary instrumentation or other similar methods introduces especially high overheads. For example, dynamic information flow tracking (DIFT) implemented in software suffers a 3.6x slowdown on average [69]. Implementing monitors in hardware greatly decreases

Figure 2.3: Parallel monitoring architecture model.

Figure 2.4: The main core pipeline is modified to forward information on certain instruction types.

the performance impact by performing monitoring in parallel to a program's execution. A dedicated hardware implementation of DIFT reduces average performance overheads to just 1.1% [82]. However, fixed hardware loses the programmability of software. A fixed hardware implementation cannot be updated and cannot change the type of run-time monitoring performed. Thus, recent studies have proposed using programmable parallel hardware, such as extra cores in a multi-core system or FPGA coprocessors [15, 61, 16, 23, 24, 30, 27], for monitoring. Our work in this thesis is targeted at these programmable parallel hardware monitors.

Figure 2.3 shows a model of the parallel run-time monitoring architecture that is assumed in this thesis. This model includes the main components of parallel run-time monitoring designs. The architecture consists of two parallel processing elements, main and monitoring cores, which are loosely coupled with a FIFO buffer. The main core runs a computation task, called the main task, which performs the original function of the real-time system. The monitoring core receives a trace of certain main task instructions through a FIFO, and performs a monitoring task in parallel in order to check whether the system is operating properly. We refer to the main task instructions that need to be sent to the monitoring core as forwarded instructions or monitored instructions. The forwarded instructions are determined based on a particular monitoring technique, and are sent by the main core transparently without explicit instructions added to the main task. Figure 2.4 shows how an example main core pipeline can be modified to automatically forward information when instructions of certain types are committed. When an instruction commits, a comparator checks whether the opcode of the committed instruction is one that should be forwarded. If the instruction should be forwarded, then an enqueue signal is sent to the FIFO. Information about the instruction that is needed for monitoring, such as register values, memory locations accessed, etc., is stored in the FIFO. In this way, the detection and handling of monitoring events is handled in hardware, with no need to modify the main task.

The forwarded instruction triggers the monitoring core to execute a series of monitoring instructions. A different monitoring task is executed depending on the type of forwarded instruction. We refer to the collection of monitoring tasks as a monitor or a monitoring scheme. Different monitoring schemes seek to ensure different run-time properties. Monitors typically maintain metadata corresponding to each memory location (mem_metadata[addr]) and register (rf_metadata[reg_id]) of the main core. If the monitoring scheme detects an

Figure 2.5: Pipeline diagram of monitoring. The main core stalls due to the slower throughput of the monitor.

inconsistent or undesired behavior in the sequence of monitoring events, then an error is detected. We do not focus here on how an error is handled; depending on the monitoring scheme, the options include raising an exception, notifying the user, or switching to a simpler, more trusted main task [77].

In order to decouple the execution of the main task and the monitoring task, a dedicated hardware FIFO is used to buffer monitoring events between the main core and the monitoring core. In the ideal case, the monitoring task operates in parallel with the main task. However, since every forwarded instruction of the main task takes several instructions to monitor, the monitoring core is not always able to keep up with the throughput of monitoring events generated by the main core. Thus, when the FIFO is full, the main core is forced to stall on monitoring events. Figure 2.5 shows an example execution where a series of instructions are monitored. In this simple example, the FIFO only has 2 entries and it takes the monitoring task 2 cycles to process a monitoring event. For simplicity, only the commit stage of the main core pipeline is shown. After the 4th instruction (ADD) is committed by the main core, the FIFO is full and the monitoring core is busy. Thus, in order to not miss information about a monitoring event, the main core must stall before continuing execution and committing the 5th instruction (STR). These stalls due to the mismatch in throughput are a major source of performance overhead. We refer to these stalls of the main core due to monitoring as monitoring stalls and the number of cycles stalled as monitoring stall cycles. In the example code sequence shown, three monitoring stall cycles occur, two for instruction 4 and one for instruction 5.
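
The stall mechanism can be summarized in a few lines of C. The sketch below is a simplified model, assuming a 2-entry FIFO and a 2-cycle monitoring task as in Figure 2.5; because it ignores the forwarding and dequeue latencies of a real pipeline, its stall count can differ from the cycle-exact diagram by a cycle or so.

    /* Sketch: why a full FIFO stalls the main core.  One instruction tries to
     * commit per cycle; forwarded instructions enqueue a monitoring event; the
     * monitor needs MONITOR_CYCLES cycles per event and frees the FIFO entry
     * when it finishes.  Timing conventions are simplified. */
    #include <stdio.h>
    #include <stdbool.h>

    #define FIFO_ENTRIES   2
    #define MONITOR_CYCLES 2

    int main(void)
    {
        /* Which committed instructions are forwarded (e.g., loads and stores). */
        bool forwarded[] = { true, true, true, true, true };
        int n = sizeof(forwarded) / sizeof(forwarded[0]);

        int fifo = 0;        /* occupied entries, freed when an event is fully processed */
        int head_left = 0;   /* cycles left for the event at the head of the FIFO        */
        int stalls = 0;

        for (int i = 0; i < n; /* i advances only when an instruction commits */) {
            /* Main core: try to commit instruction i this cycle. */
            if (forwarded[i] && fifo == FIFO_ENTRIES) {
                stalls++;                       /* FIFO full: a monitoring stall cycle */
            } else {
                if (forwarded[i]) {
                    fifo++;                     /* enqueue the monitoring event */
                    if (fifo == 1)
                        head_left = MONITOR_CYCLES;
                }
                i++;                            /* instruction commits */
            }

            /* Monitoring core: work on the event at the head of the FIFO. */
            if (fifo > 0 && --head_left == 0) {
                fifo--;                         /* event processed: free its entry */
                if (fifo > 0)
                    head_left = MONITOR_CYCLES;
            }
        }
        printf("monitoring stall cycles: %d\n", stalls);
        return 0;
    }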

2.2.3 Example: Array Bounds Check

There are many possible monitoring schemes that can be implemented on this type of fine-grained parallel monitoring architecture such as memory protection [93], information flow tracking [82, 33], control flow integrity checking [4], soft error detection [56], data-race detection [68, 73, 55, 11], etc. For example, an array bounds check (BC) [25] can be implemented in order to detect when software attempts to read or write to a memory location outside of an array's bounds. This can be done by associating metadata with array pointers that indicate the arrays' base (start) and bound (end) addresses. On loads or stores with the array pointer, the monitor checks that the memory address accessed is within the base and bound addresses. In addition, this base and bound metadata is propagated on ALU and memory instructions to track the corresponding array pointers.

Figure 2.6 shows an example pseudo-code segment, its assembly level instructions, and the corresponding monitoring operations for an array bounds check monitor. First, an array x is allocated using malloc (line 1). As malloc returns the array's address in a register, the monitor associates base and bounds

Main Core (High-level) / Main Core (Assembly) / Monitoring Core:

  Line 1: int *x = malloc(4*sizeof(int));
    Assembly:  call malloc                ; returns pointer 0x12340000 in r1
    Monitor:   rf_metadata[1] = {r1, r1+0xf}   // {base, bounds} of array
  Line 2: int *y = x + 2;
    Assembly:  add r2, r1, #8
    Monitor:   rf_metadata[2] = rf_metadata[1]
  Line 3: x[3] = 1;
    Assembly:  3a: mov r3, #1
               3b: str r3, [r1, #12]      ; store to 0x1234000c
    Monitor:   3a: rf_metadata[3] = NULL
               3b: if (r1+12 < base(rf_metadata[1]) || r1+12 > bound(rf_metadata[1])) { // raise error }
                   mem_metadata[r1+12] = rf_metadata[3];
  Line 4: y[3] = 1;
    Assembly:  4: str r3, [r2, #12]       ; store to 0x12340014
    Monitor:   4: if (r2+12 < base(rf_metadata[2]) || r2+12 > bound(rf_metadata[2])) { // raise error }
                  mem_metadata[r2+12] = rf_metadata[3];

Figure 2.6: Example of array bounds check monitor.

metadata with the corresponding register. Next, pointer y is set to point to the middle of array x (line 2). At the assembly level, a register r2 is set to array x's address plus an offset. The monitor propagates the metadata of the original pointer in r1 to the metadata of r2. This is to ensure that pointer y is not used to exceed the array's bounds. Line 3a shows setting register r3 to a constant value of 1. When this happens, r3's metadata is reset to NULL in case it previously stored a pointer. Finally, the value of r3 is written to memory using both pointers x and y. In both cases, the monitor checks whether the store address is within the bounds of the register metadata. In the first case (line 3b), x+12 is within the original array's bounds. No error is raised and the metadata of r3 is propagated to the metadata of the store address. If r3 were a pointer, then this would allow a future instruction to load the pointer and use it to access its corresponding array or data structure. In the second case, y+12 (line 4) corresponds to x+20, which is not within the array's bounds, and the monitor will raise an error.

Note that this monitoring is performed automatically and transparently with almost no modification needed to the main program. The only modification to the main program that is needed is to initially set metadata, such as setting the base and bounds addresses on a malloc call for array bounds checking. The propagation and checks of the metadata occur automatically as instructions are forwarded to the monitor. Here, we have shown the dynamic sequence of operations executed by the monitor, but the actual static code consists of a set of instructions to be run for each possible instruction type (load, store, etc.). Only instruction types relevant for the particular monitoring scheme are forwarded.
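
To make the monitor's side of this example concrete, the C sketch below shows one possible shape for the per-event handlers of an array bounds check; the metadata layout, table sizes, and function names are illustrative assumptions rather than the interface of any monitor evaluated in this thesis.

    /* Sketch of array-bounds-check handlers for forwarded events.
     * Metadata layout and event format are simplified for illustration. */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>

    typedef struct { uintptr_t base, bound; bool valid; } meta_t;

    #define NUM_REGS 16
    static meta_t rf_metadata[NUM_REGS];        /* metadata per main-core register */

    /* A real monitor tags every memory word; a small direct-mapped table stands in here. */
    #define MEM_TABLE_ENTRIES 4096
    static meta_t mem_metadata[MEM_TABLE_ENTRIES];
    #define MEM_META(addr) mem_metadata[((addr) >> 2) % MEM_TABLE_ENTRIES]

    /* malloc returned a pointer in register rd for an allocation of size bytes (line 1). */
    void on_malloc(int rd, uintptr_t ptr, size_t size) {
        rf_metadata[rd] = (meta_t){ .base = ptr, .bound = ptr + size - 1, .valid = true };
    }

    /* ALU op rd = rs op imm: propagate the source pointer's bounds (line 2). */
    void on_alu(int rd, int rs) {
        rf_metadata[rd] = rf_metadata[rs];
    }

    /* rd was loaded with a constant: clear any stale pointer metadata (line 3a). */
    void on_mov_imm(int rd) {
        rf_metadata[rd].valid = false;
    }

    /* Store of register rs to address addr computed from base register rb (lines 3b and 4).
     * Returns false if the access is out of bounds. */
    bool on_store(int rs, int rb, uintptr_t addr) {
        meta_t m = rf_metadata[rb];
        if (m.valid && (addr < m.base || addr > m.bound))
            return false;                        /* out of bounds: raise an error */
        MEM_META(addr) = rf_metadata[rs];        /* propagate metadata of the stored value */
        return true;
    }

Loads would mirror on_store, checking the address against the base register's bounds and copying mem_metadata back into rf_metadata.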

CHAPTER 3
WCET ANALYSIS OF RUN-TIME MONITORING

3.1 Introduction

In order to apply parallel run-time monitoring to traditional real-time systems, we must be able to analyze the worst-case execution time (WCET) of running a program with run-time monitoring. The only existing way to do this is to assume that monitoring is done sequentially. However, this bound is overly conservative when used to analyze parallel run-time monitoring. In this chapter, we present a method for estimating the increase in WCET of programs running on a system with parallel monitoring. We first investigate how to mathematically model the loosely-coupled relationship between the main processing core and parallel monitoring hardware, which are connected through a FIFO buffer with a fixed number of entries. The resulting model is non-linear but can be transformed into a mixed integer linear programming (MILP) formulation. The MILP formulation produces the maximum number of cycles for each basic block that the main core may be stalled due to monitoring. These monitoring stalls can be incorporated into popular WCET analysis methods based on implicit path enumeration techniques (IPET) [48] in order to estimate the WCET of tasks with run-time monitoring.

The rest of this chapter is organized as follows. Section 3.2 explains our MILP-based formulation for estimating WCET and Section 3.3 provides an example MILP formulation. Finally, Section 3.4 shows our evaluation results.

3.2 WCET Analysis

3.2.1 Analysis Assumptions

Our WCET analysis focuses on the interaction between the main core and the monitoring core through a FIFO with an assumption that each core has its own memory. Therefore, there is no interference between the two cores on memory accesses. This configuration applies for typical multi-core embedded microprocessors or a small monitor with a dedicated memory that is attached to a large core. The main core is assumed to not exhibit timing anomalies. This is required so that the worst-case monitoring stall cycles can be assumed to produce the WCET on the main core. We do not make any other assumptions on the microarchitecture of each processing core. However, we assume that the WCET of a main task and a monitoring task on the given processing cores can be estimated individually using traditional WCET techniques.

3.2.2 Implicit Path Enumeration

Most of the WCET analysis techniques today rely on an integer linear programming (ILP) formulation that is obtained from implicit path enumeration techniques (IPET) [48]. In this method, a program is converted to a control flow graph (CFG). From the control flow graph, an ILP problem is formulated that seeks to maximize

$$ t = \sum_{B \in B_{CFG}} N_B \cdot c_{B,max} $$

where $B_{CFG}$ is the set of basic blocks in the control flow graph. $N_B$ is the number of times block $B$ is executed and $c_{B,max}$ is the maximum number of cycles to execute block $B$. The maximum value of $t$ is the WCET of the task. To account for the fact that only certain paths in the graph will be executed, a set of constraints is placed on $N_B$. For example, on a branch, only one of the branches will be taken on each execution of the block. A variable can be assigned to each edge corresponding to the number of times that edge is taken. The number of times edges out of the block are taken must equal the number of times the block is executed. Similarly, the number of times edges into the block are taken must equal the number of times the block is executed. Various methods have been developed to create additional constraints to convey other program behavior [48, 92].
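
As a concrete illustration (not taken from any particular tool), consider a simple if-else region in which block $B_0$ branches to $B_1$ or $B_2$ and both paths rejoin at $B_3$, the same shape as the CFG of Figure 3.3(b) later in this chapter. With $d_{i \to j}$ denoting how many times edge $i \to j$ is taken, the structural flow constraints are:

$$ N_0 = d_{0 \to 1} + d_{0 \to 2}, \qquad N_1 = d_{0 \to 1} = d_{1 \to 3}, \qquad N_2 = d_{0 \to 2} = d_{2 \to 3}, \qquad N_3 = d_{1 \to 3} + d_{2 \to 3} $$

Maximizing the objective above subject to these constraints, plus loop bounds and any other known flow facts, implicitly selects the worst-case path through the resulting $N_B$ values.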

Integer linear programming is an attractive optimization technique for this problem because the solution found is a global optimum. In addition, many aspects of program and architecture behavior can be described by adding constraints to the ILP problem. Several open source and commercial ILP solvers exist which can solve the formulated ILP problem [9, 43]. Thus, in developing a method for estimating the WCET of parallel run-time monitoring, we look to build upon this ILP framework.

The IPET-based ILP formulation can be extended in a straightforward fashion to incorporate run-time monitoring overheads if we have the maximum (worst-case) monitoring stall cycles for each basic block by maximizing

$$ t = \sum_{B \in B_{CFG}} N_B \cdot (c_{B,max} + s_{B,max}) $$

Here, $s_{B,max}$ represents the maximum number of cycles that block $B$ is stalled due to monitoring. In this sense, the challenge in WCET analysis with monitoring lies in determining $s_{B,max}$. The rest of this section addresses this problem.

3.2.3 Sequential Monitoring Bound

One way to determine a conservative bound on the worst-case monitoring stall cycles is to consider sequential monitoring. In sequential monitoring, the monitoring task is run in-line with the main task on the same core rather than in parallel. That is, after each instruction that would be forwarded, the monitoring task is run on the main core before the main task resumes execution. In this case, the WCET estimate can be obtained from a traditional method by analyzing one program that contains both main and monitoring tasks. The resulting WCET can be considered as a simple bound for parallel monitoring because it models the case where every forwarded instruction causes the main core to stall.

However, this bound is extremely conservative as it does not account for the FIFO buffering or the parallel execution of the monitoring core. These features are critical to utilizing run-time monitoring techniques while maintaining low performance overheads.

3.2.4 FIFO Model

To obtain tighter WCET bounds, we need to model the FIFO. The main task can continue its execution as long as a FIFO entry is available, but needs to stall on a forwarded instruction if the FIFO is full. The WCET model needs to capture the worst-case (maximum) number of entries in the FIFO at each forwarded instruction and determine how many cycles the main task may be stalled due

18 0.0

0.1

0 1.0 2.0

1 2 1.1

3 3.0

(a) Control flow graph. (b) Monitoring flow graph.

Figure 3.1: Example control flow graph and its corresponding monitoring flow graph. Blue (dark) nodes include a forwarded instruction. to the FIFO being full. Here, we propose a mathematical model to express the load in the FIFO and estimate the worst-case stalls.

In this approach, the original control flow graph must be first transformed so that each node contains at most one forwarded instruction which is located at the end of the code sequence represented by the node. This transformed graph is called a monitoring flow graph (MFG) (see Figure 3.1). Intuitively, the analysis needs to consider one forwarded instruction at a time in order to model the

FIFO state on each forwarded instruction and capture all potential stalls from monitoring.

To model how full the FIFO is, we define the concept of monitoring load. The monitoring load is the number of cycles required for the monitoring core to

Figure 3.2: Monitoring load is modeled for each node.

process all outstanding entries in the FIFO at a given point in time. We model the input ($li_M$), output ($lo_M$), and change ($\Delta l_M$) in monitoring load for each node $M$. These are used to calculate the worst-case stall cycles ($s_M$) for each node. Figure 3.2 shows a block diagram of the model for monitoring load, which this section will describe in detail.

Monitoring load increases when a new instruction is forwarded by the main task, and decreases as the monitoring core processes forwarded instructions. For simplicity, the increase in monitoring load for any forwarded instruction is conservatively assumed to be the worst-case (maximum) monitoring task execution time among all possible forwarded instructions. This maximum, $t_{M,max}$,

20 can be obtained from the WCET analysis of the monitoring tasks. We make this simplification because it is difficult to model the FIFO mathematically at an entry-by-entry level. With this simplification, each FIFO entry is identical and so the monitoring load fully represents the state of the FIFO. The monitoring load cannot be negative and is upper-bounded by the maximum monitoring load the

FIFO can handle, $l_{max}$. The maximum monitoring load is the number of FIFO entries, $n_F$, multiplied by the increase in monitoring load for one forwarded instruction, $t_{M,max}$.

In order to find the worst-case monitoring stalls, we need to determine the worst-case (maximum) monitoring load at the node boundaries in the MFG.

For a given node $M$ in the MFG, we define $li_M$ as the monitoring load coming into the node and $lo_M$ as the monitoring load exiting the node. The change in monitoring load for the node is denoted by $\Delta l_M$. The maximum $\Delta l_M$ can be calculated as the difference between the WCET of a monitoring task that corresponds to $M$ and the minimum execution cycles of the node, $c_{M,min}$:

$$ \Delta l_M = \begin{cases} t_{M,max} - c_{M,min}, & \text{forwarded inst.} \in M \\ -c_{M,min}, & \text{no forwarded inst.} \in M \end{cases} $$

In order to ensure that the analysis is conservative in estimating the worst-case

(maximum) stalls, we use the best-case (minimum) execution time for the main task here.

Because the monitoring load is bounded by zero and the maximum load that the FIFO can handle, $l_{max}$, the monitoring load coming out of a node is

$$ lo_M = \begin{cases} 0, & li_M + \Delta l_M < 0 \\ li_M + \Delta l_M, & 0 \le li_M + \Delta l_M \le l_{max} \\ l_{max}, & li_M + \Delta l_M > l_{max} \end{cases} $$

where $l_{max} = n_F \cdot t_{M,max}$.

The worst-case monitoring load entering node $M$, $li_M$, is the largest of the output monitoring loads among nodes with edges pointing to node $M$. Let $\mathbf{M}_{prev}$ represent the set of nodes with edges pointing to node $M$. Then,

$$ li_M = \max_{M_{prev} \in \mathbf{M}_{prev}} lo_{M_{prev}} $$

The above equations describe the worst-case monitoring load at each node boundary. A monitoring stall occurs when a forwarded instruction is executed but there is no empty entry in the FIFO buffer. In terms of monitoring load, if a node would add monitoring load that would cause the resulting total load to exceed $l_{max}$, then a monitoring stall occurs. The number of cycles stalled, $s_M$, is the number of cycles by which this total exceeds $l_{max}$. That is,

$$ s_M = \begin{cases} 0, & li_M + \Delta l_M < l_{max} \\ (li_M + \Delta l_M) - l_{max}, & li_M + \Delta l_M \ge l_{max} \end{cases} $$

These sets of equations describe the monitoring stalls for each node in the MFG. The worst-case monitoring stall cycles can then be determined by maximizing the value of each $s_M$. This is equivalent to a single objective of maximizing the sum of the $s_M$,

$$ \max \sum_{M \in \mathbf{M}_{MFG}} s_M $$

where $\mathbf{M}_{MFG}$ is the set of nodes in the MFG. Once the worst-case stalls for each MFG node are found, the worst-case stalls for a CFG node, $s_{B,max}$, can be computed by simply summing the stalls from the corresponding MFG nodes. We note that since the monitoring load is always conservative in representing the FIFO state, no timing anomalies are exhibited by this analysis. That is, determining the individual worst-case stalls results in the global worst-case stalls.
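
For intuition, the C sketch below evaluates these equations over a small acyclic MFG in topological order, using the graph shape, minimum cycle counts, and parameters of the example in Section 3.3 and assuming the FIFO starts empty. It is only an illustration of the load model; the MILP formulation in Section 3.2.5 is what lets the same model be solved jointly with the IPET problem for general programs, and the hand-worked example in Section 3.3.3 assumes different incoming loads, so its numbers differ.

    /* Sketch: evaluate the monitoring-load model on a small acyclic MFG.
     * Nodes must be listed in topological order.  The graph and parameters
     * follow the example of Section 3.3; the FIFO is assumed empty at entry. */
    #include <stdio.h>

    #define MAX_PRED 4

    typedef struct {
        int num_pred;
        int pred[MAX_PRED];  /* indices of predecessor MFG nodes */
        int c_min;           /* minimum execution cycles of the node */
        int forwarded;       /* 1 if the node ends with a forwarded instruction */
    } mfg_node_t;

    int main(void)
    {
        const int t_m_max = 7;                /* worst-case monitoring task time (cycles) */
        const int n_fifo  = 1;                /* FIFO entries */
        const int l_max   = n_fifo * t_m_max; /* maximum monitoring load */

        /* MFG of Figure 3.3(c): 0.0 -> 0.1 -> {1.0, 2.0}; 2.0 -> 2.1; 1.0, 2.1 -> 3.0 */
        mfg_node_t node[] = {
            { 0, {0},    2, 1 },   /* 0.0 (ldr) */
            { 1, {0},    2, 0 },   /* 0.1       */
            { 1, {1},    5, 0 },   /* 1.0       */
            { 1, {1},    3, 1 },   /* 2.0 (ldr) */
            { 1, {3},    1, 0 },   /* 2.1       */
            { 2, {2, 4}, 4, 1 },   /* 3.0 (ldr) */
        };
        int n = sizeof(node) / sizeof(node[0]);
        int lo[8] = {0}, stall[8] = {0};

        for (int m = 0; m < n; m++) {
            int li = 0;                                      /* li = max over predecessors */
            for (int p = 0; p < node[m].num_pred; p++)
                if (lo[node[m].pred[p]] > li)
                    li = lo[node[m].pred[p]];

            int dl = (node[m].forwarded ? t_m_max : 0) - node[m].c_min;
            int total = li + dl;
            stall[m] = (total > l_max) ? total - l_max : 0;  /* cycles exceeding l_max */
            lo[m]    = (total < 0) ? 0 : (total > l_max ? l_max : total);

            printf("node %d: li=%d, dl=%d, stall=%d, lo=%d\n", m, li, dl, stall[m], lo[m]);
        }
        return 0;
    }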

3.2.5 MILP Formulation

The proposed FIFO model requires solving an optimization problem to obtain the worst-case stalls, where the input and output monitoring loads, $li_M$ and $lo_M$, and the monitoring stalls, $s_M$, need to be determined for each node. Here, we show how the problem can be formulated using MILP. Although the equations for $lo_M$ and $s_M$ are non-linear, they are piecewise linear. Previous work has shown that linear constraints for piecewise linear functions can be formulated using MILP [79]. In the following constraints, all variables are assumed to be lower bounded by zero unless otherwise specified, as is typically assumed for MILP.

First, a set of variables, $lo'$ and $s'$, are created to represent the unbounded versions of $lo$ and $s$. For readability, the per-block subscript $M$ has been omitted.

$$ s' = li + \Delta l - l_{max}, \quad s' \in (-\infty, \infty) $$
$$ lo' = li + \Delta l, \quad lo' \in (-\infty, \infty) $$

The following piecewise linear function calculates $s$ from $s'$:

$$ s = f(s') = \begin{cases} 0, & s' < 0 \\ s', & s' \ge 0 \end{cases} $$

This function can be described in MILP using the following set of constraints.

$$ a_s \lambda_0 + b_s \lambda_2 = s' $$
$$ \lambda_0 + \lambda_1 + \lambda_2 = 1 $$
$$ \delta_1 + \lambda_2 \le 1 $$
$$ \delta_2 + \lambda_0 \le 1 $$
$$ \delta_1 + \delta_2 = 1 $$
$$ b_s \lambda_2 = s $$

where $a_s$ is chosen to be less than the minimum possible value of $s'$ and $b_s$ is chosen to be greater than the maximum possible value of $s'$. The choice of $a_s$ and $b_s$ is arbitrary as long as it meets these requirements. $\lambda_i$ are continuous variables and $\delta_i$ are binary variables. In this set of constraints, $s'$ is expressed as a sum of the endpoints of a segment of the piecewise function. The $\delta_i$ variables ensure that only the segment corresponding to $s'$ is considered. $\delta_1 = 1$ corresponds to the $s' < 0$ segment of $f(s')$ and $\delta_2 = 1$ corresponds to the $s' \ge 0$ segment of $f(s')$. The $\lambda_i$ variables represent exactly where $s'$ falls on the domain of that segment. $s$ can be calculated using this information and the values of the function at the segment endpoints.

Similarly, $lo$ can be bounded between 0 and $l_{max}$ by using the following set of constraints.

$$ a_l \lambda_3 + l_{max} \lambda_5 + b_l \lambda_6 = lo' $$
$$ \lambda_3 + \lambda_4 + \lambda_5 + \lambda_6 = 1 $$
$$ 2\delta_3 + \lambda_5 + \lambda_6 \le 2 $$
$$ 2\delta_4 + \lambda_3 + \lambda_6 \le 2 $$
$$ 2\delta_5 + \lambda_3 + \lambda_4 \le 2 $$
$$ \delta_3 + \delta_4 + \delta_5 = 1 $$
$$ l_{max} \lambda_5 + l_{max} \lambda_6 = lo $$

As before, $a_l$ and $b_l$ are chosen such that $lo' \in (a_l, b_l)$. Again, $\lambda_i$ are continuous variables and $\delta_i$ are binary variables.

Finally, for each node, the input monitoring load $li_M$ must be determined. $li_M$ depends on the previous nodes, $\mathbf{M}_{prev}$. If there is only one edge into the node, then $li_M$ is simply

$$ li_M = lo_{M_{prev}} $$

When there is more than one edge into node $M$, one set of constraints is used to lower bound $li_M$ by all $lo_{M_{prev}}$:

$$ li_M \ge lo_{M_{prev}}, \quad \forall M_{prev} \in \mathbf{M}_{prev} $$

Then, another set of constraints upper bounds $li_M$ by the maximum $lo_{M_{prev}}$:

$$ li_M - b \cdot \delta_{M_{prev}} \le lo_{M_{prev}}, \quad \forall M_{prev} \in \mathbf{M}_{prev} $$
$$ \sum_{M_{prev} \in \mathbf{M}_{prev}} \delta_{M_{prev}} = |\mathbf{M}_{prev}| - 1 $$

where $b$ is chosen to be greater than $\max(lo_{M_{prev}}) - \min(lo_{M_{prev}})$ and $|\mathbf{M}_{prev}|$ is the number of nodes with edges pointing to $M$. $\delta_i$ are binary variables. The use

of the binary variables $\delta_i$ and the second constraint ensure that $li_M$ is only upper bounded by one of the $lo_{M_{prev}}$. In order for all constraints to hold, this must be the maximum $lo_{M_{prev}}$. Together with the lower bound constraints, these constraints result in $li_M = \max(lo_{M_{prev}})$.

3.3 Example MILP Formulation

In this section we show a detailed example of applying our MILP-based method for estimating the WCET of a task running on a system with parallel run-time monitoring.

3.3.1 Example Setup

An example code sequence is shown in Figure 3.3(a). Its associated control flow graph is shown in Figure 3.3(b). We assume that the worst-case execution time for each node has already been calculated using previous methods. These execution times are labeled as $c_{B,max}$ in the figure.

In this example, let us assume that the monitoring technique requires loads and stores to be forwarded, as in the case of an uninitialized memory check (UMC). The monitoring task requires 5 cycles to handle a load and 7 cycles to handle a store. Thus, the maximum execution time of the monitoring task, $t_{M,max}$, is 7 cycles.

Because of the simplicity of the example, we assume that the FIFO only holds one entry ($n_F = 1$). Thus, $l_{max} = n_F \cdot t_{M,max} = 7$.

(a) Code example:

    L0: mov r1, addr        ; Basic Block 0
        ldr r2, [r1, #0]
        cmp r2, r3
        beq L2
    L1: add r2, r2, r2      ; Basic Block 1
        sub r2, r2, r3
        add r2, r2, #1
        add r3, r3, #1
        jmp L3
    L2: add r2, r2, r2      ; Basic Block 2
        add r2, r2, #1
        ldr r2, [r2, #0]
        mov r3, r2
    L3: add r1, addr        ; Basic Block 3
        add r1, r1, r2
        add r1, r1, r3
        ldr r3, [r1, #0]

(b) Control flow graph: node 0 (c0_max = 8) branches to nodes 1 (c1_max = 5) and 2 (c2_max = 8), which both lead to node 3 (c3_max = 8).

(c) Monitoring flow graph: 0.0 (c0.0_min = 2) -> 0.1 (c0.1_min = 2); 0.1 -> 1.0 (c1.0_min = 5) and 2.0 (c2.0_min = 3); 2.0 -> 2.1 (c2.1_min = 1); 1.0 and 2.1 -> 3.0 (c3.0_min = 4).

Figure 3.3: Code example with its corresponding control flow graph and monitoring flow graph. Blue (dark) nodes include a forwarded (ldr) instruction.

3.3.2 Creating the MFG

The first step is to create the monitoring flow graph. For each node in the CFG, the code represented by that node is analyzed. After any forwarded instruction, in this case any load or store instruction, an edge is created, dividing a node into two nodes in the monitoring flow graph. For example, the second instruction of node 0 in the CFG is a load instruction. Thus, node 0 is split into two nodes in the MFG. The first node represents the first two instructions and the second node represents the last two instructions.

The resulting MFG is shown in Figure 3.3(c). Nodes that are blue (dark) include a forwarded instruction, which is located at the end of the node. Nodes that are yellow (light) do not include a forwarded instruction. The nodes in the graph are labeled with minimum ($c_{B,min}$) rather than maximum execution times. It can be seen that node 2 from the CFG corresponds to nodes 2.0 and 2.1 in the MFG. In this example, nodes in the CFG were only transformed into at most two nodes in the MFG. In general, a CFG node is transformed into one MFG node per forwarded instruction, plus one additional node if the block does not end with a forwarded instruction.
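
A minimal sketch of this splitting rule is shown below, assuming each CFG node carries its instruction list, that loads and stores are the forwarded opcodes as for UMC in Section 3.3, and with data types and opcode strings invented for illustration.

    /* Sketch: split one CFG node into MFG nodes, cutting after each forwarded
     * instruction.  Types and the is_forwarded() predicate are illustrative. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_NODES 8

    typedef struct { char op[8]; } instr_t;
    typedef struct { int start, len, ends_with_forwarded; } mfg_piece_t;

    static int is_forwarded(const instr_t *i) {
        return strcmp(i->op, "ldr") == 0 || strcmp(i->op, "str") == 0;
    }

    /* Returns the number of MFG nodes produced from one CFG node. */
    int split_cfg_node(const instr_t *instrs, int n, mfg_piece_t *out)
    {
        int pieces = 0, start = 0;
        for (int i = 0; i < n; i++) {
            if (is_forwarded(&instrs[i])) {          /* cut right after this instruction */
                out[pieces++] = (mfg_piece_t){ start, i - start + 1, 1 };
                start = i + 1;
            }
        }
        if (start < n)   /* trailing piece that does not end with a forwarded instruction */
            out[pieces++] = (mfg_piece_t){ start, n - start, 0 };
        return pieces;
    }

    int main(void)
    {
        /* Basic block 0 of Figure 3.3(a): mov, ldr, cmp, beq -> two MFG nodes. */
        instr_t bb0[] = { {"mov"}, {"ldr"}, {"cmp"}, {"beq"} };
        mfg_piece_t out[MAX_NODES];
        int n = split_cfg_node(bb0, 4, out);
        for (int i = 0; i < n; i++)
            printf("MFG node %d: instructions %d..%d, ends_with_forwarded=%d\n",
                   i, out[i].start, out[i].start + out[i].len - 1, out[i].ends_with_forwarded);
        return 0;
    }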

3.3.3 Calculating the Monitoring Load

Once the MFG is constructed, a set of MILP constraints is generated for each node. This process can be automated, but for this example we will construct the constraints for one node by hand. Specifically, we will consider node 3.0 in the MFG. We will also calculate, by hand, the MILP solution for the node using some assumed values for variables associated with other nodes. Note that all variables are assumed to be non-negative unless otherwise specified.

Calculating input monitoring load: First, we will determine the worst-case input monitoring load for node 3.0, li3.0. One set of constraints lower bounds the monitoring load by all possible incoming monitoring loads.

li3.0 ≥ lo1.0
li3.0 ≥ lo2.1

Then, a set of constraints upper bounds this input monitoring load.

li3.0 − 1000δ1.0 ≤ lo1.0
li3.0 − 1000δ2.1 ≤ lo2.1
δ1.0 + δ2.1 = 1

Here, the value 1000 is chosen arbitrarily but is known to be greater than |lo1.0 − lo2.1|. A different value could have been chosen as long as this condition was true. δ1.0 and δ2.1 are binary variables which can only assume values of 0 or 1.

To see how these constraints work, suppose that lo1.0 = 4 and lo2.1 = 7. The constraints are then evaluated as

li3.0 ≥ 4
li3.0 ≥ 7
li3.0 − 1000δ1.0 ≤ 4
li3.0 − 1000δ2.1 ≤ 7
δ1.0 + δ2.1 = 1

The first pair of constraints ensures that li3.0 ≥ 7. This means that for the third constraint to hold, δ1.0 = 1. If δ1.0 = 1, then by the last constraint, δ2.1 = 0.

Plugging this value into the fourth constraint gives li3.0 ≤ 7. Thus, the only possible solution is li3.0 = 7 which is the maximum of lo1.0 and lo2.1.
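These constraints can also be handed to an off-the-shelf MILP library to check the arithmetic. The sketch below uses the third-party PuLP package (an assumption; the thesis toolflow uses lp_solve) to encode the five constraints for node 3.0 and confirm that maximizing li3.0 yields 7.

```python
# Sketch using the PuLP MILP package (an assumption; the thesis uses lp_solve).
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpStatus

B = 1000                                 # big constant, > |lo_1.0 - lo_2.1|
lo = {"1_0": 4, "2_1": 7}                # assumed predecessor output loads

prob = LpProblem("input_load_node_3_0", LpMaximize)
li = LpVariable("li_3_0", lowBound=0)
delta = {p: LpVariable(f"delta_{p}", cat="Binary") for p in lo}

for p, v in lo.items():
    prob += li >= v                      # lower bounds: li >= lo_p
    prob += li - B * delta[p] <= v       # upper bounds, relaxed when delta_p = 1
prob += lpSum(delta.values()) == 1       # exactly one upper bound may be relaxed

prob += li                               # the WCET objective pushes li upward
prob.solve()
print(LpStatus[prob.status], li.varValue)    # expected: Optimal 7.0
```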

Calculating output monitoring load: In order to determine the output monitoring load for node 3.0, we must first calculate the change in monitoring load, ∆l3.0. Since there is a forwarded instruction in node 3.0,

∆l3.0 = tM,max − c3.0,min = 7 − 4 = 3

We first create a variable, lo′3.0, to represent the unbounded output monitoring load.

lo′3.0 = li3.0 + ∆l3.0,   lo′3.0 ∈ (−∞, ∞)

Using the example input monitoring load previously calculated of li3.0 = 7, this unbounded output monitoring load is lo′3.0 = 7 + 3 = 10. Then, the following set of constraints determines the bounded output monitoring load, lo3.0.

−1000λ3 + 7λ5 + 1000λ6 = 10   (3.1a)
λ3 + λ4 + λ5 + λ6 = 1         (3.1b)
2δ3 + λ5 + λ6 ≤ 2             (3.1c)
2δ4 + λ3 + λ6 ≤ 2             (3.1d)
2δ5 + λ3 + λ4 ≤ 2             (3.1e)
δ3 + δ4 + δ5 = 1              (3.1f)
7λ5 + 7λ6 = lo3.0             (3.1g)

The −1000 and 1000 values were chosen arbitrarily and only require that lo′3.0 fall between them. δ3, δ4, and δ5 are binary variables. By Constraint 3.1b, it can be seen that all λi are less than or equal to 1. Thus, in order for Constraint 3.1a to hold, λ6 > 0. Since λ6 > 0, Constraints 3.1c and 3.1d force δ3 and δ4 to both be zero. From this, by Constraint 3.1f, δ5 = 1. Then, by Constraint 3.1e, λ3 and λ4 are both forced to be zero. If we now go back to the first two constraints, they are reduced to

7λ5 + 1000λ6 = 10
λ5 + λ6 = 1

Solving this system of equations gives the solution (λ5, λ6) = (0.997, 0.003). Plugging these values into Constraint 3.1g,

lo3.0 = 7λ5 + 7λ6 = 7 · 0.997 + 7 · 0.003 = 7

Thus, the output monitoring load is indeed bounded by the maximum monitoring load of 7. Although this may seem to be a complicated series of calculations to determine this obvious result, this set of constraints is required in order for the piecewise linear, and thus non-linear, bounding function to be expressed in an MILP problem.
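The same λ/δ pattern generalizes to any piecewise linear bounding function. The sketch below, again using the third-party PuLP package as an assumption rather than the lp_solve front end used in this chapter, builds the lambda-weighted encoding for lo3.0 = min(max(lo′3.0, 0), lmax). With breakpoints (−b, 0, b) and values (0, 0, b), the same helper produces the max(·, 0) function needed for the stall cycles calculated next.

```python
# Sketch of the lambda/delta encoding of a piecewise linear function in PuLP.
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpStatus

def add_piecewise(prob, x, xs, ys, tag):
    """Constrain y = f(x), where f linearly interpolates the points (xs[i], ys[i]).

    lam[i] are interpolation weights over the breakpoints; seg[k] are binary
    segment selectors, mirroring the structure of Constraints 3.1a-3.1g."""
    n = len(xs)
    lam = [LpVariable(f"lam_{tag}_{i}", lowBound=0) for i in range(n)]
    seg = [LpVariable(f"seg_{tag}_{k}", cat="Binary") for k in range(n - 1)]
    y = LpVariable(f"y_{tag}")
    prob += lpSum(xs[i] * lam[i] for i in range(n)) == x      # like (3.1a)
    prob += lpSum(lam) == 1                                   # like (3.1b)
    for k in range(n - 1):                                    # like (3.1c)-(3.1e)
        others = [lam[i] for i in range(n) if i not in (k, k + 1)]
        prob += 2 * seg[k] + lpSum(others) <= 2
    prob += lpSum(seg) == 1                                   # like (3.1f)
    prob += lpSum(ys[i] * lam[i] for i in range(n)) == y      # like (3.1g)
    return y

B, l_max, lo_unbounded = 1000, 7, 10
prob = LpProblem("bound_output_load", LpMaximize)
# lo = min(max(lo', 0), l_max): breakpoints (-B, 0), (0, 0), (l_max, l_max), (B, l_max)
lo = add_piecewise(prob, lo_unbounded, [-B, 0, l_max, B], [0, 0, l_max, l_max], "lo")
prob += lo
prob.solve()
print(LpStatus[prob.status], lo.varValue)   # expected: Optimal 7.0
```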

Calculating the monitoring stall cycles: The one remaining value that needs to be determined for node 3.0 is the monitoring stall cycles. Based on our previous calculations, the worst-case input monitoring load (li3.0) is 7, the change in monitoring load (∆l3.0) is 3, and the maximum monitoring load (lmax) is 7. Thus, we expect the worst-case monitoring stall cycles to be (7 + 3) − 7 = 3. To handle this as an MILP problem, first the unbounded monitoring stall cycles, s′3.0, is calculated.

s′3.0 = li3.0 + ∆l3.0 − lmax = 7 + 3 − 7 = 3,   s′3.0 ∈ (−∞, ∞)

In this case, since s′3.0 is positive, we expect s3.0 = s′3.0. The MILP problem determines s3.0 using the following set of constraints.

−1000λ0 + 1000λ2 = 3   (3.2a)
λ0 + λ1 + λ2 = 1       (3.2b)
δ1 + λ2 ≤ 1            (3.2c)
δ2 + λ0 ≤ 1            (3.2d)
δ1 + δ2 = 1            (3.2e)
1000λ2 = s3.0          (3.2f)

The −1000 and 1000 values are chosen arbitrarily, only requiring that s′3.0 be between them. From Constraint 3.2a, λ2 must be positive. Since the δi are binary variables, Constraint 3.2c then implies that δ1 = 0. Constraints 3.2d and 3.2e then force δ2 = 1 and λ0 = 0. The first two constraints then reduce to

1000λ2 = 3
λ1 + λ2 = 1

Solving this system of equations leads to (λ1, λ2) = (0.997, 0.003). s3.0 is then calculated using Constraint 3.2f:

s3.0 = 1000λ2 = 1000 · 0.003 = 3

This is the value for s3.0 that we expected. If s′3.0 had instead been negative, then δ1 would be forced to 1 and λ2 would be forced to 0. From the last constraint, it can be seen that if λ2 is 0, then s3.0 is also 0.

3.3.4 MILP Optimization

In the previous section, the monitoring loads for one node were calculated in detail. However, note that the output monitoring load for each node with an edge pointing to node 3.0 was assumed to be a certain value. In an actual MILP problem, these would be variables that are also being solved for. Solving for these inter-related variables and determining the global maximum number of cycles stalled due to monitoring is impractical to do by hand. While the amount of calculations may seem excessive for these simple examples, the ability to formulate the problem in MILP is essential for solving larger problems. Given an MILP formulation, existing solvers [9, 43] can be used to obtain the worst-case stall cycles.

3.4 Evaluation

3.4.1 Experimental Setup

Our toolflow for the proposed WCET method is shown in Figure 3.4. We first use Chronos [47], an open-source WCET tool, to estimate the WCET of the main task and the monitoring task. We also modified Chronos to produce an MFG of the main task. This MFG and the monitoring task WCET are used to produce an MILP formulation as in Section 3.2. This MILP problem is solved using lp_solve [9], which produces the worst-case monitoring stall cycles for each forwarded instruction. These monitoring stalls are combined into the ILP formulation that is originally generated for the main task to estimate the overall WCET with parallel run-time monitoring.


Figure 3.4: Toolflow for WCET estimation of parallel monitoring.

Although we use Chronos and lp_solve for our implementation, these components can be replaced with any WCET estimation tool and LP solver, respectively.

To evaluate the effectiveness of our WCET scheme, we compared its estimate with a simple WCET bound from sequential monitoring (Section 3.2.3) as well as with simulation results from the SimpleScalar architectural simulator [8]. In addition to the WCET estimates with monitoring, we also compared the results with the WCET of the main task without monitoring, using both Chronos and simulations.

For the experiments, we configured Chronos and SimpleScalar to model simple processing cores that execute one instruction per cycle for both the main and monitoring cores, and used an 8-entry FIFO. This configuration represents typical embedded microcontrollers and is designed to focus on the impact of parallel run-time monitoring by removing complex features such as branch prediction and caches.

Monitoring   Experiment        cnt   expint   fdct   fibcall   insertsort   matmult      ns
None         wcet-none       64531     3483   1805       245          598    133668    5951
             sim-none        62931     2293   1805       245          598    133668    5951
UMC          sequential-umc 103052     3591   4382       257         2489    357453   10338
             wcet-umc        64550     3498   3035       245         2083    256120    5953
             sim-umc         62931     2297   2564       245         1864    235120    5951
CFP          sequential-cfp 151732    11669   1976       794         1174    231507   18623
             wcet-cfp        93544     8984   1805       547          677    133668   13614
             sim-cfp         72540     5247   1805       382          598    133668    9824

Table 3.1: Estimated and observed WCET (clock cycles) with and without monitoring.

In the evaluation, we used seven benchmarks from the Mälardalen WCET benchmark suite [36] and two monitoring techniques: uninitialized memory checks (UMC) and control flow protection (CFP). UMC detects software bugs in which memory is read before being written. CFP protects a program's control flow by checking the target address on each control transfer [7]. In this technique, a compiler determines a set of valid targets for each branch and jump in the main task. This information is stored on the monitoring core. On a branch or jump, the monitoring core ensures that the target is contained in the list of valid targets. The WCET for UMC tasks was 7 cycles (i.e., tM,max) and the WCET for CFP tasks was 9 cycles. Thus, the maximum monitoring loads were 56 and 72 cycles for UMC and CFP, respectively. These are relatively aggressive numbers that assume the monitoring core is set up to automatically jump to the appropriate task based on the forwarded instruction type.

Figure 3.5: Ratio of estimated WCET to observed WCET.


Figure 3.6: Ratio of WCET using our proposed method compared to assuming sequential monitoring.

3.4.2 WCET Results

Table 3.1 shows the experimental results for each benchmark under different configurations. The first set of rows shows the WCET estimate from Chronos (wcet-none) and the actual run times from simulations (sim-none) without monitoring. The remaining rows show the WCET for the UMC and CFP monitoring extensions. The results are shown for three different approaches: a bound from sequential monitoring (sequential), our approach (wcet), and simulations (sim). The numbers indicate the number of clock cycles.

Figure 3.7: Ratio of WCET with monitoring to WCET without monitoring.

Figure 3.5 shows the ratio of the WCET estimates from ILP/MILP formulations to the worst-case simulation cycles for each monitoring setup. The results show that the analytical WCET estimates from our proposed scheme are larger than the observed WCET by 0% to 52% for UMC and 0% to 71% for CFP, depending on the main task. This difference is comparable to the case without parallel run-time monitoring, where the analytical WCET from Chronos is larger than simulation results by 0% to 52%. In fact, for expint, the majority of the difference is from the WCET estimate of the main task rather than the effects of monitoring. This result suggests that our WCET approach is not significantly more conservative than the baseline WCET tool for the main task.

Figure 3.6 shows the WCET from our proposed method normalized to the WCET calculated assuming sequential monitoring. For UMC, our approach shows up to a 43% reduction in WCET estimates over the simple bound. Similarly, for CFP, our method shows up to a 42% improvement. These results demonstrate that modeling the FIFO decoupling between the main and monitoring tasks is important for obtaining tight WCET estimates of parallel monitoring.

Solver Target      cnt   expint     fdct   fibcall   insertsort   matmult      ns
stall-umc       17.789    6.256   21.733     0.043        0.390   161.796   3.655
stall-cfp        3.691   97.93     0.038     0.024        0.025    14.209   1.474
sequential-umc   0.006    0.004    0.004     0.005        0.002     0.004   0.006
sequential-cfp   0.007    0.001    0.003     0.002        0.003     0.006   0.003
wcet-none        0.003    0.003    0.004     0.002        0.002     0.002   0.001
wcet-umc         0.004    0.004    0.003     0.001        0.004     0.005   0.002
wcet-cfp         0.002    0.007    0.005     0.004        0.003     0.005   0.004

Table 3.2: Running time of lp_solve in seconds to determine worst-case stalls (stall), sequential bound (sequential), and worst-case execution times (wcet).

Finally, Figure 3.7 compares the WCET estimates with and without run-time monitoring. The results show that the increase in WCET varies significantly depending on benchmark and monitoring technique. Benchmarks with infrequent monitoring events (forwarded instructions) show minimal overheads while ones with frequent monitoring can see significant impacts. Also, the benchmarks with large WCET increases differ between UMC and CFP. Therefore, when applying parallel run-time monitoring techniques to real-time systems, a careful WCET analysis for the given tasks and monitoring techniques needs to be performed.

3.4.3 Time to Solve Linear Programming Problem

In this section, we evaluate the running time of our analysis method. The most time-intensive portion of this analysis is solving the linear programming (LP) problem. For our experiments, we used lp_solve 5.5.2.0 [9] as our LP solver. These experiments were run on a 2.67 GHz Xeon E5430 quad-core processor with 4 GB of RAM. The running times for lp_solve are shown in Table 3.2.

The first set of rows shows the running time for determining the worst-case stalls from the monitoring flow graph (stall). The second set of rows shows the lp_solve running time for finding the sequential bounds. The final set of rows shows the running time for determining the overall WCET (wcet). For wcet-umc and wcet-cfp, this is for the ILP problem given that the worst-case stalls have already been found.

The running times for the sequential cases and the wcet cases are very similar. This is because these cases are all solving essentially the same problem with different numbers. That is, for a given benchmark, these different cases are all solving a linear programming problem for the same control flow graph (CFG). As a result, the number of variables and the set of constraints is the same, though the WCET for each basic block changes depending on the extension and the estimation method.

The stall cases, where we calculate the worst-case stalls using our MILP-based method, have a longer running time. This is due to the fact that an MFG has more nodes than its corresponding CFG. The increased number of nodes also implies more variables and more constraints. An interesting direction for future work would be to explore graph transformations to reduce the MFG size and thus reduce the analysis time.

CHAPTER 4
RUN-TIME MONITORING FOR HARD REAL-TIME SYSTEMS

4.1 Introduction

In the last chapter, we showed how the WCET of run-time monitoring could be estimated. By accounting for this WCET, run-time monitoring can be implemented in traditional real-time frameworks while giving strong guarantees that deadlines will be met. However, the increase in WCET introduced by run-time monitoring can be quite high, requiring a large amount of slack in a system's schedule in order to support it. For example, Section 4.5 shows that even an aggressive implementation of an uninitialized memory check (UMC) increases the WCET by 3.5x compared to no monitoring. A more realistic implementation that includes deciding which tasks to run in software can show WCET increases of up to 12x compared to the baseline. Thus, applying monitoring to real-time systems requires a large increase in the time allocated to tasks. Currently, if a real-time system cannot support this extra utilization, then monitoring cannot be applied to the system.

In this chapter, we describe a system that greatly decreases the amount of time that must be statically allocated to tasks in the worst case (i.e., WCET) in order to enable monitoring. Specifically, this system exploits dynamic slack in order to opportunistically perform as much monitoring as possible within the time constraints of the system. This is based on three key observations:

1. Tasks often run faster than their worst-case time.

2. The performance impact of monitoring is rarely equal to the worst-case impact.

3. Performing partial monitoring can still provide some degree of protection.

Since tasks typically run faster than their conservatively estimated WCET, there exists dynamic slack at run time that can be used for monitoring. Similarly, the estimation of the WCET of monitoring is conservative. In actuality, the overheads of monitoring are much lower, as shown by the average-case performance impact. Finally, although performing all monitoring operations in full is preferred, performing only a portion of the monitoring still gives increased protection over a system with no monitoring applied. By shifting the decision of whether to enable or disable monitoring to run time, we are able to trade off monitoring coverage in order to reduce the WCET impact of monitoring. The system we present decides whether or not to perform monitoring based on whether there is enough dynamic slack to account for the worst-case performance impact of the monitoring.

A main challenge in skipping monitoring operations is ensuring that the monitor can still run in a useful manner. Although we trade off the coverage provided by monitoring in order to meet timing requirements, we must also guarantee that no false positives occur in order to prevent false alarms. We prevent false positives through the use of a dropping task that invalidates metadata when monitoring is skipped. With hardware optimizations, this invalidation can be handled with no impact on the task's WCET. In addition, the hardware architecture we present skips monitoring on invalidated metadata, saving dynamic slack to be used for useful monitoring on valid metadata.

The rest of this chapter is organized as follows. In Section 4.2, we discuss how to decide when to skip monitoring and how to handle this in a safe manner. Hardware optimizations for handling this dropping operation are presented in Section 4.3. Section 4.4 details how our architecture applies to three different monitoring techniques. We present our evaluation methodology and experimental results in Section 4.5.

4.2 Opportunistic Monitoring

In this section, we present a framework for opportunistically performing monitoring in such a way as to ensure that the main task's execution time does not exceed its WCET bound. The basic idea is that on a monitoring event, the system checks whether there is enough slack to account for the worst-case impact of monitoring on the main task's execution time. If there is enough slack, then the monitoring task proceeds. If there is not enough slack, then the event is dropped instead.

There are two main challenges in this scheme. The first is how to measure slack at run time and decide when it is necessary to drop monitoring tasks, which we discuss in Section 4.2.1. Once it has been decided to drop a monitoring event, the second issue is how to drop the monitoring task in such a way as to maintain the correctness of the monitoring scheme, which we discuss in Section 4.2.2.

4.2.1 Measuring Slack

In order to decide when it is possible to perform monitoring, we must be able to measure the dynamic slack available. Dynamic slack is defined as the difference between a task's expected worst-case execution time (WCET) and its actual execution time [2].

Figure 4.1: Dynamic slack and total slack.


Figure 4.2: Dynamic slack increases when a sub-task finishes early. Slack is consumed as monitoring causes the task to stall.

This dynamic slack is only a portion of the total slack, which is the difference between a task's finish time and its deadline (see Figure 4.1). Although dynamic slack accounts for only a portion of the total slack, we focus on it because it is the portion of slack that is specific to a task. Additional slack in the schedule could be assigned to a specific task by the system designer or scheduler to be used for monitoring. For brevity, we will use the term slack to refer to dynamic slack.

In order to perform monitoring as the task runs, we need to be able to measure dynamic slack at run time. We can track dynamic slack by setting a number of checkpoints throughout the task. These checkpoints effectively divide the task into a number of sub-tasks. For each of these sub-tasks, the sub-task's WCET is determined. At run time, when a sub-task finishes, the difference between its actual run time and its WCET is the slack generated by the sub-task.

Specifically, we insert code to mark each sub-task boundary. At the beginning of a sub-task, the WCET of a sub-task is loaded into a timer. Each cycle, the timer decreases. At the end of a sub-task, the remaining value in the timer is the slack generated by the sub-task. This value is added to the current slack. An example of this process is shown in Figure 4.2. In our experiments, the division of a task into sub-tasks was done by hand but this process could be automated to divide a task using function boundaries, code length, or some other criteria. In addition to the slack accumulated while running, a portion of headstart slack can be initially assigned by the designer or scheduler to the task. For example, if the designer knows that static slack exists in the schedule, this slack can be added to the initial dynamic slack of a task to be used for monitoring.

By accumulating this slack as the task runs, we can determine whether monitoring can be performed while still meeting the real-time constraints. If the worst-case impact of a monitoring task on the main task is less than the accrued slack, then the monitoring task can execute. If running the monitoring task causes the main task to stall, then slack is consumed. Slack was initially generated because the main task was running ahead of its WCET, so stalling for up to the slack time will not cause the main task to exceed its WCET (see Figure 4.2). In the best case, the monitoring task executes entirely in parallel, does not affect the main task, and thus consumes no slack. On the other hand, if the worst-case impact of the monitoring task on the main task's execution time is greater than the slack, then, conservatively, the monitoring task cannot be run. Instead, it must be dropped in order to guarantee that the main task finishes within its WCET.
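The following sketch models this slack accounting and drop decision in software; the class and its cycle-level interface are illustrative of the mechanism, not the hardware design presented in Section 4.3.

```python
# Behavioral sketch of the slack accounting and drop decision described above.
class SlackTracker:
    def __init__(self, headstart_slack=0):
        self.slack = headstart_slack      # dynamic slack, in cycles
        self.timer = 0                    # counts down the sub-task's WCET

    def start_subtask(self, subtask_wcet):
        self.timer = subtask_wcet

    def tick(self, main_stalled):
        """Advance one cycle of the main task's execution."""
        if self.timer > 0:
            self.timer -= 1
        if main_stalled and self.slack > 0:
            self.slack -= 1               # stalls due to monitoring consume slack

    def end_subtask(self):
        self.slack += self.timer          # cycles left in the timer = slack generated

    def can_monitor(self, worst_case_monitoring_impact):
        """Run the monitoring task only if its worst-case impact fits in the slack."""
        return self.slack >= worst_case_monitoring_impact

# On each monitoring event:
#   if tracker.can_monitor(wcet_impact): run the monitoring task
#   else:                                run the dropping task instead
```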

4.2.2 Dropping Tasks

Dropping a monitoring task implies that some functionality of the monitor has been lost. This may cause either false negatives, where an error that occurs in the main task’s execution is not detected, or false positives, where the monitor incorrectly believes an error has occurred. For example, a false positive can occur for an uninitialized memory check (UMC) if a store monitoring event is dropped. This causes the memory location of the store to not be marked as initialized. A subsequent load for the memory location will incorrectly cause an error to be raised. We accept false negatives as the loss in coverage that we trade off in order to ensure the WCET of the main task is met. However, we must safely drop monitoring events in such a way as to avoid false positives so that the system does not incorrectly raise an error.

In order to ensure that no false positives occur, we need to run a dropping task when a monitoring task is dropped. The specifics of how this dropping task operates may vary for different monitoring schemes. However, in analyzing various monitoring schemes, we found that most monitoring tasks perform operations of primarily two types: checks and metadata updates. Monitoring tasks check certain properties to ensure correct main task execution, and they update metadata for bookkeeping. Skipping a check operation can only cause false negatives and will never cause a false positive.

Figure 4.3: WCET with original monitoring task, with software dropping task, and with no monitoring for UMC. Results are normalized to the WCET with no monitoring.

Therefore, the dropping task may simply skip a check operation. Skipping an update operation can cause false negatives but may also cause false positives. Essentially, when an update operation is skipped, we can no longer trust the corresponding metadata. In some cases, the dropping task can handle this by setting the metadata to a neutral value that will not cause false positives (e.g., cleared or null). A more general solution is for the dropping task to mark the corresponding metadata as invalid, to prevent future false positives.
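A minimal software dropping task following this recipe might look like the sketch below, where the event kinds and the metadata-address helper are hypothetical names used only for illustration.

```python
# Sketch of a dropping task for one monitoring event (UMC-flavored).
# invalid_flags and metadata_addr() are hypothetical helpers.
def drop_event(event, invalid_flags, metadata_addr):
    if event.kind == "check":
        return                  # skipping a check can only cause false negatives
    # Skipping an update: the metadata can no longer be trusted, so mark it
    # invalid rather than leaving a stale value that could raise a false alarm.
    invalid_flags[metadata_addr(event)] = True
```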

In the worst case, no monitoring can be done and the system must ensure that there is enough time to run a dropping task for every monitoring event in order to avoid false positives. Thus, the main task's WCET estimation must be modified to take into account the worst-case impact of the dropping task. By minimizing the dropping task's execution time, the impact on the main task's WCET can be made much lower than the impact of the monitoring task. Figure 4.3 compares the original WCETs with and without monitoring to the WCET with dropping tasks implemented in software for UMC. The WCETs are normalized to the WCET without monitoring. The WCET with dropping is reduced by 43% on average from the WCET without dropping.

4.3 Hardware-Based Dropping Architecture

In Section 4.2, we presented a general framework for designing a system that allows monitoring tasks to be dropped so that the main task's execution time does not exceed its WCET. However, implementing this framework in software still has significant impacts on the WCET, as shown in Figure 4.3. In this section, we present a novel hardware architecture that eliminates these impacts on the main task's WCET.

There are two main sources that affect the main task's WCET: the additional code for keeping track of slack and the impact of the dropping task. In Section 4.3.1, we describe hardware to keep track of slack and to make the decision on whether to run the monitoring or dropping task. In Section 4.3.2, we present a hardware-based metadata invalidation scheme that allows dropping to be performed in a single cycle. By handling the dropping in a single cycle, the throughput is able to match the throughput of the main core. Section 4.3.3 builds upon the hardware-based invalidation scheme to filter out monitoring tasks for invalid metadata in order to reserve slack for more useful cases of running the monitoring task. Finally, in Section 4.3.4 we show the full architecture.

Figure 4.4: Hardware modules for keeping track of slack and making a drop decision.

4.3.1 Slack Tracking

Figure 4.4 shows a hardware slack tracking module (STM). In order to keep track of slack, a countdown timer is loaded with a sub-task's WCET at the start of the sub-task and counts down each cycle. At the end of a sub-task, the remaining time in this countdown timer is added to the current dynamic slack, which is stored in a register. At the start of a task, the dynamic slack register is cleared and set to the headstart slack if one is specified. In addition, whenever the main task is stalled due to the monitoring core, this dynamic slack is decremented.

The currently measured dynamic slack is used to determine whether the monitoring task can run in full. When the monitor is initialized, it loads the worst-case slack needed to perform the monitoring task into a register. When there is a monitoring event in the FIFO to be processed, a hardware comparator checks whether the dynamic slack is greater than or equal to the slack necessary for full monitoring. If enough slack exists, the monitoring core is signaled to perform the monitoring task. Otherwise, the monitoring event is dropped.

Figure 4.5: Metadata invalidation module (MIM).

4.3.2 Metadata Invalidation Module

In the worst case, all monitoring events must be dropped. Thus, it is important that whatever dropping task needs to be run has a minimal impact on the main task. If this dropping task can match the maximum throughput of the main core, then the original WCET of the main task is not affected, removing the need to redo the WCET analysis. In this section, we present a hardware architecture that can handle dropping monitoring events in a single cycle, matching the throughput of the main core.

As mentioned in Section 4.2.2, the dropping task must invalidate metadata on a dropped monitoring event. Thus, we have designed a hardware module to perform this invalidation. Figure 4.5 shows a block diagram of this module, which we call the metadata invalidation module (MIM). When the slack tracking module indicates that a monitoring task must be dropped, the metadata invalidation module sets a bit in the metadata invalidation table (MIT) corresponding to the metadata to be invalidated. The metadata to be invalidated depends on the monitoring scheme and the monitoring event. For example, for the uninitialized memory check monitoring scheme, on a store event, metadata corresponding to the memory access address is set to indicate initialized memory. Thus, on a dropped event, the MIM must be able to calculate the address of this metadata in order to set a corresponding invalidation flag. The MIM includes a small ALU to perform these simple address calculations. Since this metadata address mapping varies for different monitoring schemes, the inputs to the ALU can either be data from the monitoring event, the previous metadata address, or a constant. The input selection and the ALU operation to perform are looked up from a configuration table depending on the type of monitoring event. The monitor sets up this configuration table during initialization.

In order for this dropping operation to match the throughput of the main core (i.e., up to one monitoring event per cycle), the metadata invalidation information is stored on-chip. The MIT is implemented as a small on-chip memory, similar to a cache. This memory stores invalidation flags and is indexed using part of the metadata address. It stores the remaining portion of the address as a tag. Unlike a cache, if an access misses in the MIT, there is no lower-level memory structure to go to. This is done in order to ensure that the MIM can handle a monitoring event every cycle. Instead of backing the MIT with lower-level memory, if writing to the MIT would force an eviction, we instead mark the corresponding cache set as "aliased". From this point on, we are forced to conservatively consider any metadata that would be mapped to this cache set as invalid, regardless of the tag corresponding to its metadata invalidation address. This reduces the amount of useful monitoring that can be done, but guarantees that the dropping hardware can match the main core's throughput. These aliased sets can be reset by re-initializing (e.g., resetting to a null value) all metadata that could map to the aliased set. By using dynamic slack or a dedicated periodic task, the system can occasionally reset aliased sets. In either case, a sufficiently sized MIT should ensure that aliasing is rare.
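A functional model of this behavior, including the aliasing fallback, might look like the following sketch. The organization, address split, and interface are our own simplifications for illustration, not the RTL design.

```python
# Functional sketch of the metadata invalidation table (MIT): a small
# set-indexed, tagged flag store with an "aliased" escape hatch instead of
# a backing memory.
class MIT:
    def __init__(self, num_sets=128, ways=2):
        self.num_sets, self.ways = num_sets, ways
        self.sets = [dict() for _ in range(num_sets)]   # tag -> invalidation flag
        self.aliased = [False] * num_sets

    def _index_tag(self, meta_addr):
        return meta_addr % self.num_sets, meta_addr // self.num_sets

    def invalidate(self, meta_addr):
        idx, tag = self._index_tag(meta_addr)
        entries = self.sets[idx]
        if tag not in entries and len(entries) >= self.ways:
            # A write would force an eviction; with no lower-level backing
            # store, conservatively treat the whole set as invalid.
            self.aliased[idx] = True
            return
        entries[tag] = True

    def is_invalid(self, meta_addr):
        idx, tag = self._index_tag(meta_addr)
        return self.aliased[idx] or self.sets[idx].get(tag, False)

    def revalidate(self, meta_addr):
        """Called by the monitor when it writes fresh metadata values."""
        idx, tag = self._index_tag(meta_addr)
        if not self.aliased[idx]:
            self.sets[idx].pop(tag, None)
```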

In some cases, we would like to guarantee that aliasing does not occur. For example, we found that some monitoring schemes save metadata information about registers. Since this metadata is used often, it is important to manage it in such a way that aliasing cannot occur. Thus, the MIM also includes a metadata invalidation scratchpad (MISP). The MISP is explicitly addressed, and it is up to the system designer to utilize it in such a way that aliasing will not occur.

Both the MIT and MISP are accessible by the monitor. This allows the monitor to be aware of what has been invalidated. The monitor can also re-validate metadata when possible, such as when the monitoring task writes metadata values.

We note that the MIM was designed with the idea of marking certain metadata as valid or invalid with a 1-bit flag. However, in general the MIM simply sets or clears a bit based on a calculated address. We expect that for certain monitoring schemes, designers may be able to use the MIM in new ways, such as using the MIT and MISP to directly express certain metadata. Section 4.4 shows how the MISP can be used to directly express the register file taint metadata used for a dynamic information flow tracking (DIFT) scheme.

Figure 4.6: Metadata filtering module (MFM).

4.3.3 Filtering Invalid Metadata

Given that certain metadata becomes invalidated by the metadata invalidation module, performing check and update monitoring operations based on the invalid metadata is not useful. Thus, we can drop these monitoring tasks. By skipping these invalid monitoring tasks, slack can be reserved for monitoring tasks which operate on valid metadata. Determining when a monitoring event can be filtered out in this manner is done by the metadata filtering module (MFM), which is shown in Figure 4.6.

We examined multiple monitoring schemes and found that most monitoring schemes read up to two metadata values, corresponding to the two input operands of an instruction, in order to perform an update. Thus, the MFM was designed with a pair of configurable metadata address generation units, similar to the ones used in the MIM. These address generation units calculate a pair of addresses which correspond to a pair of metadata invalidation flags that are read from the MIT and/or the MISP.

Figure 4.7: Block diagram of full hardware architecture for opportunistic monitoring on hard real-time systems.

The two flag bits are then used to look up an entry in a lookup table that specifies whether to filter the event or not. Typically, if either of the source metadata operands is marked as invalid, then the monitoring task can be filtered. We must ensure that no false positives occur due to filtering out these monitoring events. Thus, the entry in the lookup table can also inform the MIM of metadata that must be invalidated. Similar to the MIM, the monitor configures the MFM address calculations and the lookup table during initialization.
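The core of the filter decision can be captured in a few lines. The table shown below, which filters whenever either source operand's metadata is invalid and asks the MIM to invalidate the destination, is one possible configuration rather than the only one; the encoding is ours, for illustration.

```python
# Sketch of the MFM's filter decision: two metadata invalidation flags form a
# 2-bit index into a 4-entry lookup table configured by the monitor.
def mfm_decide(flag0, flag1, lookup_table):
    """Return (filter_event, invalidate_dest) for one monitoring event."""
    return lookup_table[(flag0 << 1) | flag1]

# Example configuration: filter whenever either source operand is invalid,
# and tell the MIM to invalidate the destination metadata in that case.
filter_if_any_invalid = {
    0b00: (False, False),   # both sources valid: run the monitoring task
    0b01: (True,  True),    # one source invalid: filter, invalidate destination
    0b10: (True,  True),
    0b11: (True,  True),    # both invalid: filter, invalidate destination
}
```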

4.3.4 Full Architecture

Figure 4.7 shows a block diagram of the complete architecture. Monitoring events from the main core are first enqueued in a FIFO. The events in the FIFO are dequeued and processed by the metadata filtering module. The MFM checks the MIT and/or MISP to decide whether the event should be filtered because of invalid metadata. If the event is not filtered, then the metadata invalidation module decides whether to drop the event based on the slack tracking module. If the event is dropped or filtered, then the MIM marks invalidation flags using the MIT/MISP. If the event is neither dropped nor filtered, then it is forwarded to the monitoring core to perform the monitoring task.

Table 4.1: Monitoring tasks for UMC, RAC, and DIFT.

Monitor   Event           Task                                                  Operation
UMC       Store           Set metadata tag (data initialized)                   Update
          Load            Check that metadata tag is set                        Check
RAC       Call            Push return address                                   Update
          Return          Pop return address and check                          Update & Check
DIFT      ALU             Set the output tag as the OR of the input tags        Update
          Load            Set the register tag as the tag of the loaded data    Update
          Store           Set the memory tag as the register tag                Update
          Indirect jump   Check that tag is not set (data is untainted)         Check

4.4 Monitoring Extensions

In this section, we detail how our architecture works for the three different monitoring schemes we evaluated: uninitialized memory check (UMC), return address check (RAC), and dynamic information flow tracking (DIFT). Table 4.1 shows a summary of the monitoring tasks for these monitoring schemes. Table 4.2 shows how the MIM operates and Table 4.3 shows how the MFM operates for each monitoring scheme.

Table 4.2: Operation of the MIM for UMC, RAC, and DIFT.

Monitor   Event           Invalidation
UMC       Store           Set invalidation flag
          Load            Do nothing
RAC       Call            Invalidate metadata, increment metadata address
          Return          Decrement metadata address
DIFT      ALU             Clear taint in MISP
          Load            Clear taint in MISP
          Store           Invalidate metadata
          Indirect jump   Do nothing

Table 4.3: Operation of the MFM for UMC, RAC, and DIFT.

Monitor   Event           Filter Condition    Filter Operation
UMC       Store           Never               -
          Load            Invalid metadata    Do nothing
RAC       Call            Never               -
          Return          Invalid metadata    Decrement stack pointer
DIFT      ALU             Always              Perform propagation
          Load            Invalid metadata    Clear taint in MISP
          Store           Never               -
          Indirect jump   Invalid metadata    Do nothing

4.4.1 Uninitialized Memory Check

Uninitialized memory check (UMC) is a monitoring scheme that checks that memory is written to before it is read from. Table 4.1 summarizes the monitoring tasks for UMC.

Tables 4.2 and 4.3 show the operations of the MIM and MFM. The monitoring task performed on a load instruction is a check operation. Thus, it can simply be skipped without causing any false positives. However, the monitoring task on a store updates metadata. On a dropped store event, the MIM calculates the metadata address and sets an invalidation flag corresponding to this address in the MIT to indicate that the metadata for this word is invalid.

In terms of using the metadata filtering module, on a load, if the metadata is invalid, then performing the check is useless. Thus, if the filtering module detects that the metadata invalidation flag corresponding to the memory access address is set, then the monitoring event is dropped and no operation is performed. For a store event, the source operand of the operation is a register value, which is always valid. Thus, the MFM never filters a store event. Instead, if the store event is not dropped, the monitoring task will set the metadata bit and can clear the corresponding invalidation flag.
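As a concrete illustration, the UMC rows of Tables 4.2 and 4.3 could be expressed as configuration tables along the following lines. The field names, the lambda-based address calculation, and the 4-byte word granularity are our assumptions, not the actual configuration-table encoding.

```python
# Illustrative encoding of the UMC rows of Tables 4.2 and 4.3.
UMC_MIM_CONFIG = {
    # event: (metadata address calculation, invalidation action on a drop)
    "store": (lambda ev: ev.mem_addr >> 2, "set_invalid_flag"),
    "load":  (None,                        "do_nothing"),
}
UMC_MFM_CONFIG = {
    # event: (filter condition, filter operation)
    "store": ("never",            None),
    "load":  ("metadata_invalid", "do_nothing"),
}
```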

4.4.2 Return Address Check

Several security attacks, such as return-oriented programming (ROP) [74], attempt to manipulate the return address of a function so that the control flow of a program is diverted. One method to prevent these types of attacks is to use monitoring to save the return address on a call to a function and to check on the return that the correct return address is used. We refer to this type of monitoring scheme as a return address check (RAC). On a call instruction, RAC pushes the correct return address for the function onto a stack data structure. On a return instruction, RAC pops an address off the stack and checks that this saved address matches the address the main task is returning to (Table 4.1).

In the case of UMC, the metadata that we cared about was a function of information from the monitoring event. For RAC, the metadata entry that we care about is based on the current stack pointer for the return address stack that the monitor maintains. The MIM maintains the previous metadata address used in a register and we can use this register to maintain the stack pointer between the monitoring task and the MIM. The monitoring task updates this register as necessary to ensure that the MIM has the correct metadata stack pointer. On a call instruction that needs to be dropped, the MIM marks the metadata at the current stack pointer (i.e., the previous metadata address) as invalid and increments the stack pointer. On a return, the MIM decrements the stack pointer and simply skips the check operation (Table 4.2).

On a call instruction, the source of our monitoring operation is the return address from the monitoring event. This is always valid, so call instructions are never processed by the MFM. On a return instruction, if the stored return address is marked as invalid, then the check can be skipped. Thus, if the MFM detects that the stored return address is invalid, then the stack pointer will be decremented but no check will occur, similar to the invalidation case.
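A behavioral sketch of RAC under dropping, with the return-address stack holding either saved addresses or invalid markers, might look as follows; the data structure and method names are illustrative, not the monitor's implementation.

```python
# Sketch of RAC with dropping: the return-address stack stores either a saved
# address or an INVALID marker.
INVALID = object()

class RacMonitor:
    def __init__(self):
        self.stack = []

    def on_call(self, ret_addr, dropped=False):
        # A dropped call pushes an INVALID marker instead of the real address.
        self.stack.append(INVALID if dropped else ret_addr)

    def on_return(self, actual_target, dropped=False):
        saved = self.stack.pop() if self.stack else INVALID
        if dropped or saved is INVALID:
            return True                   # check skipped: may miss an attack, never a false alarm
        return saved == actual_target     # real check
```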

4.4.3 Dynamic Information Flow Tracking

Dynamic information flow tracking (DIFT) [82] is a security monitoring scheme that seeks to detect when information from untrusted sources is used to affect the control flow. In its simplest form, DIFT keeps a 1-bit metadata tag for each memory location and each register, indicating tainted or untainted. Multi-bit versions of DIFT have been proposed to track more detailed semantic information about data. When data is read from an untrusted source, it is marked as tainted. As this data is used for other operations, the results of these operations are also marked as tainted. On an indirect jump, the taint bit of the register used is checked. If the register is found to be tainted, then an error is detected. These monitoring task operations are summarized in Table 4.1.

In the case of DIFT, register metadata is used often, so we will use the MISP to store the register metadata invalidation information in order to avoid aliasing. Furthermore, in this discussion, we are considering a 1-bit taint, so the MISP can actually be used to store the exact taint value, rather than an invalidation flag. This shows some of the flexibility in how these hardware structures can be used for specific monitoring schemes. For more complicated multi-bit DIFT schemes, the MISP would be used to store invalidation flags as before. For ALU and load instructions, which have a register as a destination, the MIM will clear the register's taint in the MISP on a dropped monitoring event. This is safe because an error is only raised for a tainted data value. On a store instruction, which writes to memory as a destination, the MIM invalidates the associated memory address in the MIT, similar to how UMC operates. Finally, on an indirect jump, only a check operation is performed. Thus, this can be dropped without any extra invalidation.
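The following sketch illustrates, at a functional level, how dropped DIFT events could be handled according to Table 4.2. The event fields, the dictionary-based MISP/MIT, and the 4-byte word granularity are assumptions for illustration only.

```python
# Behavioral sketch of drop handling for DIFT (Table 4.2): 1-bit register
# taints live in the MISP, memory invalidation flags in the MIT.
def dift_drop(event, misp_taint, mit_invalid):
    if event.kind in ("alu", "load"):
        # Destination is a register: clearing its taint is safe because an
        # error is only ever raised for tainted values.
        misp_taint[event.dest_reg] = 0
    elif event.kind == "store":
        # Destination is memory: mark the word's metadata as invalid.
        mit_invalid[event.mem_addr >> 2] = True
    elif event.kind == "indirect_jump":
        pass            # pure check: simply skipped, no invalidation needed
```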

The MFM operates similarly to the UMC case for some of these monitoring events (Table 4.3). For an indirect jump, if the metadata is found to be invalid, then the event is filtered and no check is performed. For a load instruction, if the loaded data has its metadata marked as invalid, then the MFM will clear the taint bit for the target register, as in the invalidation case. For a store instruction, the source is from a register, which always has a valid taint bit in the MISP, so these are never filtered out. The most interesting case is for ALU instructions. The MFM is able to read the taint bits from the MISP and then signal to the MIM to write to the MISP, so we can actually perform the entire monitoring task in hardware. The MFM reads the taint bits from the MISP for both source registers. These are fed into the filter lookup table (see Figure 4.6), which is configured to signal the MIM to set or clear the taint bit in the MISP for the destination register depending on the read taint bits.

4.5 Evaluation

4.5.1 Methodology

We implemented our monitoring architecture for real-time systems by modifying the ARM version of the gem5 simulator [10] to support parallel hardware monitoring and our hardware optimizations for opportunistic monitoring. In order to explore the generality of the architecture for different monitors, we implemented the three different monitors described in Section 4.4: uninitialized memory check (UMC), return address check (RAC), and dynamic information flow tracking (DIFT). We tested our system using several benchmarks from the Mälardalen WCET benchmark suite [36].

We model the main and monitoring cores as 500 MHz in-order cores, each with 16 KB of L1 I/D-caches. The latency to main memory is 15 ns. This setup is similar to Freescale’s i.MX353 processor which targets embedded, automotive, and industrial applications. For our experiments, we used a FIFO of 16 entries connecting the main and monitoring cores. The MISP was 16 entries matching the 16 registers found in the ARM architecture. The MIT was configured with 2 ways and 256 entries.

No static WCET analysis tools exist for the gem5 simulator. In order to estimate the WCET of tasks, we ran tasks several times on the gem5 simulator and took the worst-case observed execution time. Our WCET estimate is expected to be lower than the estimates that a WCET analysis tool would generate, since WCET analysis tools guarantee a conservative estimate. As a result, in our experiments the gap between actual execution times and our estimated WCET is lower than what we would expect with a WCET analysis tool. Therefore, the results we present for the amount of monitoring that can be done are less than what is expected when a strictly conservative WCET is used.

4.5.2 Amount of Monitoring Performed

Figure 4.8 shows the number of monitoring events that are monitored, dropped, and filtered for UMC, RAC, and DIFT as a percentage of the total number of monitoring events. In these experiments, no headstart slack was added. Thus, we see that a portion of monitoring can still be done without exceeding the main task's original WCET. This is due to the dynamic slack that is gained during run time.

[Figure 4.8 panels: (a) UMC, (b) RAC, (c) DIFT — monitored, dropped, and filtered events per benchmark.]

Figure 4.8: Percentage of monitored, dropped, and filtered monitoring events with zero headstart slack.

Figure 4.9: Percentage of checks that are not dropped or filtered for zero headstart slack.

On average, UMC (Figure 4.8(a)) can perform 17% of its monitoring tasks, RAC (Figure 4.8(b)) can perform 66% of its monitoring, and DIFT (Figure 4.8(c)) can perform 31% of its monitoring. No results are shown for insertsort and nsichneu for RAC because these benchmarks do not make any function calls. For DIFT, we store the actual DIFT register file metadata in the MISP instead of invalidation flags, as discussed in Section 4.4.3. This optimization allows us to use the MIM and MFM to perform certain monitoring tasks. These correct metadata updates are counted as "Monitored" in the statistics, while events are counted as dropped or filtered only if they were handled by the dropping hardware due to insufficient slack.

As expected, false positives never occurred in our experiments. However, false negatives can occur due to dropping monitoring tasks. Specifically, dropped or filtered check-type monitoring operations can result in false negatives. Figure 4.9 shows the number of checks that are monitored as a percentage of the total checks. This acts as a measure of the coverage achieved by the monitors. The coverage for UMC is 15% on average and the coverage for RAC is 66% on average. The average coverage for DIFT is 59%, which is much higher than the percentage of monitoring tasks that are not dropped or filtered.

Figure 4.10: Percentage of checks performed as headstart slack is varied for UMC. Headstart slack is shown normalized to the main task’s WCET.


Figure 4.11: Percentage of checks performed as headstart slack is varied for RAC implemented on a processor core. Headstart slack is shown normalized to the main task’s WCET.

In fact, for some of the benchmarks, DIFT is able to achieve 100% coverage. This implies that only a portion of the monitoring operations performed by DIFT actually affect the checks. For example, DIFT propagates metadata on every ALU, load, and store instruction. However, only instructions that eventually propagate metadata to an indirect jump instruction affect the coverage.

For an underutilized system, if some headstart slack is given to a task initially, then the amount of monitoring that can be performed can be increased.

Figure 4.12: Percentage of checks performed as headstart slack is varied for DIFT. Headstart slack is shown normalized to the main task’s WCET.

As an example, Figure 4.10 shows how the coverage increases as we increase the headstart slack given to the main task for UMC. Figure 4.11 shows how the coverage varies with headstart slack for RAC and Figure 4.12 shows the coverage variation for DIFT. The headstart slack is displayed as a percentage of the main task's WCET for each benchmark and is varied from 0% to 600%. With enough headstart slack, 100% of the monitoring is able to be performed. For DIFT, only benchmarks which were not able to reach 100% coverage with zero headstart slack are shown.

4.5.3 FPGA-based Monitor

Performing monitoring in software, although parallelized, can still incur high overheads since multiple instructions are needed to handle each monitoring event. One possible solution to improve the performance of monitoring while maintaining programmability is to use an FPGA-based monitor [23]. We model the FPGA-based monitor as being able to run at 250 MHz and handle up to one monitoring event each cycle. Note that this means that the FPGA-based monitor can process a monitoring event every two cycles of the main core, which runs at 500 MHz.

[Figure 4.13 panels: (a) UMC, (b) RAC, (c) DIFT — monitored, dropped, and filtered events per benchmark on the FPGA-based monitor.]

Figure 4.13: Percentage of monitored, dropped, and filtered monitoring events for an FPGA-based monitor with zero headstart slack.

Figure 4.14: Percentage of checks that are not dropped or filtered on an FPGA-based monitor for zero headstart slack.

Figure 4.13 shows the number of monitored, dropped, and filtered events with no headstart slack for this FPGA-based monitor. UMC (Figure 4.13(a)) and RAC (Figure 4.13(b)) are able to run 67% of their monitoring tasks on average, while DIFT (Figure 4.13(c)) is able to run 72% of its monitoring tasks on average. Figure 4.14 shows the coverage achieved by these monitoring schemes on an FPGA-based monitor. RAC shows similar numbers to processor-based monitoring because the number of calls and returns in these benchmarks is relatively small. However, for UMC and DIFT, we see that an FPGA-based monitor allows much more monitoring to be done without increasing the headstart slack. The coverage for UMC increases from 15% to 62% while the coverage for DIFT increases from 59% to 86%.

Figure 4.15 shows how the coverage for UMC varies as we increase the headstart slack from 0% to 10% for an FPGA-based monitor. Figure 4.16 shows this data for RAC and Figure 4.17 shows this data for DIFT. Benchmark and monitoring scheme combinations which showed 100% coverage with zero headstart slack are omitted. We can see that for a high-performance FPGA-based monitor, with a small amount of slack, we are able to achieve 100% monitoring for almost all benchmarks while guaranteeing that the main task's execution time does not exceed its WCET.

Figure 4.15: Percentage of checks performed as headstart slack is varied for UMC on an FPGA-based monitor. Headstart slack is shown normalized to the main task's WCET.

Figure 4.16: Percentage of checks performed as headstart slack is varied for RAC on an FPGA-based monitor. Headstart slack is shown normalized to the main task's WCET.

Figure 4.17: Percentage of checks performed as headstart slack is varied for DIFT on an FPGA-based monitor. Headstart slack is shown normalized to the main task's WCET.

Table 4.4: Average power overheads for dropping hardware at zero headstart slack. Percentages in parentheses are normalized to the main core's power usage.

            Monitor   Peak Power [mW]   Runtime Power [mW]
Processor   UMC       4.9 (3.2%)        4.7 (6.6%)
            RAC       4.7 (3.1%)        4.7 (6.6%)
            DIFT      5.0 (3.3%)        4.8 (6.7%)
FPGA        UMC       16.7 (10.9%)      8.5 (11.9%)
            RAC       4.7 (3.1%)        4.7 (6.6%)
            DIFT      13.0 (8.5%)       8.8 (12.3%)

4.5.4 Area and Power Overheads

Adding the dropping hardware to enable adjustable overhead has its own area and power costs. We use McPAT [46] to get a first-order estimate of these area and power overheads in a 40 nm technology node. McPAT estimates the main core area as 1.96 mm² and the peak power usage as 152.9 mW averaged across all benchmarks. The average runtime power usage was 71.6 mW. These area and power numbers consist of the core and L1 cache, but do not include memory controllers and other peripherals. The power numbers include dynamic as well as static (leakage) power. For the dropping hardware, the ALUs, MISP, MIT, and configuration tables are modeled using the corresponding objects in McPAT. We note that this is only a rough area and power result, since components such as the wires connecting these modules have not been modeled. However, this gives a sense of the order-of-magnitude overheads involved with implementing our approach.

An additional 0.132 mm² of silicon area is needed, an increase of 7% over the main core area. Table 4.4 shows the peak and runtime power overheads with zero headstart slack for both a processor-based monitor and an FPGA-based monitor. The peak power is 5-17 mW, which is 3-11% of the main core's peak power usage. The average runtime power is 5-9 mW, corresponding to 7-12% of the main core's runtime power. For UMC and DIFT, the dropping hardware has lower power usage with the processor-based monitor due to more monitoring events being filtered out that do not require invalidation. This reduces the activity of the metadata invalidation module, which reduces the power usage.

CHAPTER 5
RUN-TIME MONITORING FOR SOFT REAL-TIME SYSTEMS

5.1 Introduction

In the last chapter, we described a hardware architecture that dynamically limits the amount of monitoring performed in order to meet hard real-time deadlines. This allows run-time monitoring to be implemented even when the worst-case impact of performing full monitoring cannot be tolerated. Although soft real-time systems do not require strict guarantees, the performance of the system must still be controlled in order to meet soft real-time deadlines. In this chapter, we explore the use of similar hardware techniques in order to enable run-time monitoring with adjustable overhead. This allows us to use partial monitoring to create a trade-off between overhead and monitoring coverage or accuracy.

We enable partial monitoring with adjustable overhead by dynamically dropping monitoring operations when the overhead exceeds a specified overhead budget. Similar to the architecture presented in Chapter 4, we must again be careful to prevent false positives due to out-of-date metadata information. In order to prevent these false positives, we use an invalidation-based approach as before. However, here we back up the invalidation flags to memory and do not have aliasing. The result is that the hardware effectively acts as a dataflow tracking engine. We show how a simple extension to this dataflow engine can enable null metadata filtering. In addition, we investigate different policies for deciding which events to drop for partial monitoring. These different policies show a trade-off between closely matching the overhead budget and increasing the monitoring coverage.

This chapter is organized as follows. Section 5.2 introduces the notion of adjustable overhead through partial monitoring and its possible uses. Section 5.3 discusses the hardware architecture that enables partial monitoring using our dataflow tracking engine. Section 5.4 investigates the design space for dropping policies that determine when and which monitoring operations to drop. Finally, Section 5.5 presents our evaluation methodology and results.

5.2 Partial Run-Time Monitoring

5.2.1 Overhead of Run-Time Monitoring

There have been a number of proposals for run-time monitoring systems exploring various design points. Table 5.1 summarizes some of the representative designs and their reported performance overheads. The previous studies show that there exist trade-offs between efficiency, flexibility, and hardware costs. For example, a run-time monitoring scheme can often be realized with fairly low performance overhead (less than 20%) if implemented with custom hardware that is designed only for one monitor or a narrow set of monitors. However, these custom hardware monitors cannot be modified or updated, and require dedicated silicon area. On the other hand, flexible systems that support a wide range of monitors lead to noticeable performance overhead, often too high for wide deployment in practice. Software-only implementations [61, 70, 39, 62] or multi-core monitors with minimal hardware changes [15] are reported to have severalfold slowdowns. On-chip FPGA monitors [23] and cores with monitoring accelerators [16, 30] can reduce overhead significantly, but still show slowdowns of several tens of percent in some cases. In today's monitoring systems, the overhead is also unpredictable because it depends heavily on the characteristics of applications and monitoring operations. In summary, users currently need to either pay noticeable overhead or the cost of custom hardware in order to use fine-grained run-time monitoring in deployed systems.

Name | Type, monitoring scheme, and flexibility | Slowdown (average / worst)
DIFT [82] | Custom HW; DIFT only | 1.01x / 1.23x
FlexiTaint [89] | Custom HW; DIFT w/ programmable policies | 1.01x-1.04x / 1.09x
Hardbound [25] | Custom HW; array bounds checks only | 1.05x-1.09x / 1.22x
Harmoni [24] | Custom HW; tag-based monitors | 1.01x-1.10x / 1.20x
FlexCore [23] | Dedicated FPGA; instruction-trace monitoring | 1.05x-1.44x / 1.84x
FADE [30] | Core + custom HW; instruction-trace monitoring (effective when HW filters work) | 1.2x-1.8x / 3.3x
LBA-accelerated [16] | Multi-core + custom HW; instruction-trace monitoring (effective when accelerators work) | 1.02x-3.27x / 5x
LBA [15] | Multi-core + custom HW; instruction-trace monitoring | 3.23x-7.80x / 11x
Multi-core DIFT [61] | SW (multithreaded); DIFT (compiled for each application) | 1.48x / 2.2x
LIFT [70] | SW (DBI); DIFT (fully flexible) | 3.6x / 7.9x
Purify [39] | SW (DBI); memory leak checks (fully flexible) | 2.3x / 5.5x
TaintCheck [62] | SW (DBI); DIFT (fully flexible) | 10x / 27x

Table 5.1: Trade-off between performance overhead and flexibility/complexity of run-time monitoring systems.

5.2.2 Partial Monitoring for Adjustable Overhead

Our goal is to develop a general framework that enables monitoring overhead to be dynamically adjusted by dropping a portion of monitoring operations if necessary. In essence, this framework adds a new dimension to the monitor design space by allowing coverage or accuracy to be traded off for lower overhead.

For example, the capability to adjust overhead allows users to use monitoring with partial coverage in order to reduce performance or energy overheads. Alternatively, partial monitoring allows designers to use less expensive hardware for a given performance overhead budget.

In this framework, a user specifies how much monitoring should be done in the form of a target overhead budget, a target coverage, a percentage of monitoring operations to be performed, etc. Then, the framework dynamically drops a portion of monitoring operations to match the target. In particular, we focus on matching a performance overhead target while maximizing the coverage. Given that the overhead of custom hardware monitors can already be quite low, the focus is on enabling the trade-off in general-purpose monitoring systems that support a wide range of monitors. We also consider the target overhead as a soft constraint and do not aim to provide a strict worst-case guarantee.

5.2.3 Applications and Metrics

While it is ideal if run-time monitoring can be performed in full, we believe that the capability to trade off coverage/accuracy for lower performance/hardware overhead will be useful in many application scenarios where full monitoring is not a viable option. Here, we briefly discuss example applications of partial monitoring and the metrics that are important in each case.

Cooperative testing, debugging, and protection: Recent studies have shown that software testing, debugging, or attack detection may be done in a cooperative fashion across a large number of systems [49, 17, 32, 33]. In this case, each system is only willing to pay very low overhead (e.g., a few percent) and only performs a small subset of checks. High coverage is achieved by having different systems check different parts of a program. The partial monitoring framework allows high-overhead monitoring to be used in a cooperative fashion. The main metric that represents the effectiveness of monitoring in this case is the coverage (the percentage of checks that were performed) over multiple runs.

Monitoring of soft real-time systems: Soft real-time systems or interactive systems need to meet deadlines or response-time requirements. Unfortunately, the overhead of run-time monitoring is often unpredictable and varies significantly depending on the application and monitor characteristics. The partial monitoring framework allows monitoring to be performed while providing a level of guarantee on its impact on the execution time. In this case, it is important that the system can closely match the desired overhead target while maximizing the effectiveness of monitoring.

Partial protection for low overhead: Even without real-time constraints, systems may have tight budgets for the performance, energy, or hardware overheads that they can tolerate for run-time monitoring. In such cases, full monitoring cannot be enabled unless its overhead is low enough. Adjustable monitoring allows partial protection on such systems. For example, array bounds may be checked for a subset of memory accesses. For dynamic information flow tracking (DIFT), a subset of information flows may be tracked for partial attack detection. In this scenario, the effectiveness of partial monitoring can be measured as the percentage of run-time checks that are performed on each program run, which reflects how likely it is for a bug or an attack to be detected on a given system.

Profiling: The run-time monitoring system can be used to implement various profiling tools to collect statistics on program behavior for performance optimizations as well as security protection. For example, a recent study showed that an instruction mix can be used to identify malware from normal programs [22, 84]. In such profiling tools, the partial monitoring framework can be used to obtain statistical samples rather than complete counts of all program events, essentially trading off accuracy for lower overhead.

5.2.4 Design Challenges

While conceptually simple, designing a general framework to dynamically adjust monitoring overhead introduces new challenges that need to be addressed. The following summarizes the main design goals and associated design challenges.

1. General: Since we mainly target flexible run-time monitoring systems, which often have high overhead, the framework also needs to be general enough to be applicable to a wide range of monitoring schemes.

2. No false positive: The framework needs to ensure that dropping a portion of monitoring operations does not lead to a false positive. We found that a dataflow engine that tracks invalid metadata can serve as a general solution to this problem (Section 5.3).

3. Intelligent dropping: The framework needs to match the overhead budget while maximizing the effectiveness of monitoring. To this end, partial monitoring needs to carefully choose which operations to drop and when. We address this problem by studying different dropping policies (Section 5.4) and their trade-offs.

Figure 5.1: Example of array bounds check monitor. (Columns: Main Core (High-level), Main Core (Assembly), Monitoring Core.)

5.3 Architecture for Partial Monitoring

In this section, we present our hardware architecture that enables partial monitoring for adjustable overhead. We will revisit the array bounds check monitor and example code sequence that was shown in Figure 2.6 from Section 2.2.3. This example is repeated in Figure 5.1 for easy reference.

5.3.1 Effects of Dropping Monitoring

Our goal is to drop some monitoring operations in order to reduce the overhead of run-time monitoring. This dropping can affect the functionality of the monitoring scheme. There are three possible outcomes for dropping a monitoring operation. The first is that there is no difference in operation from the original execution. For example, if we drop line 3b from our array bounds check example (Figure 5.1), then the check on accessing x+12 is skipped. However, this is a valid access and so skipping the check does not change anything.

On the other hand, if the monitoring for accessing memory location y+12 on line 4 is skipped, then a false negative occurs. Originally, the monitor would catch this access as an out-of-bounds access and raise an error. However, if the monitoring operation for this is dropped, then the error is not detected. This reflects the trade-off that we make in order to reduce overhead. Instead of either 100% coverage with all the associated overhead or no coverage and no overhead, we enable middle points of partial coverage with some fraction of the full overhead.

The final possible outcome of dropping a monitoring operation is a false positive. For example, suppose the monitoring for line 1 is dropped, causing the bound information for pointer x to never be set. The result is that when the access using x is checked on line 3b, an error will be raised. This creates a false positive where an error is incorrectly raised. Although false negatives are part of the trade-off we make to reduce overhead, we need to prevent false positives.

Figure 5.2: Example of using invalidation information to prevent false positives. (Columns: Main Core (Assembly), Invalidation, Monitoring Core.)

5.3.2 Invalidation for Preventing False Positives

The key cause of false positives is dropping monitoring operations that update metadata. Dropping monitoring operations that check for an error (such as the check for line 4 in Figure 5.1) can only cause false negatives and will never cause false positives. On the other hand, skipping monitoring operations that update metadata can lead to false positives and false negatives. Essentially, when an update operation is skipped, we can no longer trust the corresponding metadata. Thus, our general approach for preventing false positives is to associate a 1-bit invalidation flag with each metadata in order to mark these metadata as valid or invalid, as was done for the hard real-time monitoring architecture presented in Chapter 4. Furthermore, this invalidation information is propagated in the same way that metadata is. Figure 5.2 shows an example of associating invalidation flags with metadata. Suppose that the monitoring for line 1 is dropped in order to meet the overhead target. When a monitoring event is dropped, metadata that would have been updated is marked as invalid. In this case, instead of the normal operation of marking rf_invalid[1] as false, it is instead marked as true. Thus, when line 3 is reached, the monitoring event is dropped since rf_invalid[1] is marked as true. Note that in this case, line 2 also propagates this invalidation flag to rf_invalid[2] and causes the check performed on line 4 to also be dropped. This is necessary because an error would have been raised even if the access on line 4 was within bounds.

Figure 5.3: Hardware support for dropping. (The dataflow engine sits between the main core and the monitoring core, connected by a FIFO, and tracks dataflow flags alongside the program state and metadata.)
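To make the invalidation rule concrete, the following is a minimal software sketch of the behavior described above; the names (rf_invalid, drop_update, etc.) are illustrative only and do not correspond to the actual hardware implementation.

    #include <stdbool.h>

    #define NUM_REGS 16

    /* 1-bit invalidation flags: true means the corresponding metadata
     * can no longer be trusted because an update to it was dropped. */
    static bool rf_invalid[NUM_REGS];

    /* Called when a metadata-updating monitoring event is dropped:
     * the destination's metadata becomes untrusted. */
    static void drop_update(int dest_reg) {
        rf_invalid[dest_reg] = true;
    }

    /* Called when a metadata-propagating event (e.g., add r2, r1, #8)
     * is handled: the invalidation flag follows the metadata flow. */
    static void propagate(int dest_reg, int src_reg) {
        rf_invalid[dest_reg] = rf_invalid[src_reg];
    }

    /* Called before a checking event (e.g., the bounds check on a store):
     * if any source metadata is invalid, the check must be dropped so
     * that stale metadata cannot raise a false positive. */
    static bool must_drop_check(int src_reg_a, int src_reg_b) {
        return rf_invalid[src_reg_a] || rf_invalid[src_reg_b];
    }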

5.3.3 Dataflow Engine for Preventing False Positives

Although the functionality of dropping and invalidation could be implemented on the monitoring core, this is unlikely to be much faster than performing the full monitoring operations. Instead, in order to efficiently support dropping monitoring events and to prevent false positives, we propose to insert a hardware module between the main core and the monitoring core (see Figure 5.3). This module handles the invalidation operations shown in the middle column of Figure 5.2. There are two operations that are done for handling invalidation information:

1. Propagate invalidation flags, following the dataflow of metadata.

2. Filter out monitoring operations based on invalidation flags.

Figure 5.4: Hardware architecture of the dataflow engine.

Thus, the hardware acts effectively as a dataflow tracking engine in order to track a 1-bit invalidation flag per metadata. Figure 5.4 shows a detailed block diagram of this hardware module. The dataflow engine uses two address generation units to read in up to two invalidation flags. These source invalidation flags are used to decide whether a monitoring event should be filtered. A third address generation unit is used to optionally specify a target to propagate the invalidation information. The module includes a register file to store invalidation flags corresponding to register file metadata. In addition, it uses a small memory-backed cache to handle invalidation flags corresponding to memory metadata. The fact that the dataflow flags are backed to memory differs from the design presented in Chapter 4 for hard real-time systems. This allows the architecture to have perfect invalidation information without aliasing. In the worst case, misses in the dataflow flag cache can cause high overheads for dropping. However, we expect misses to be rare.

Figure 5.5: Example of using information about null metadata to filter monitoring events. (Columns: Main Core (Assembly), Null Filtering.)

Since different monitoring operations are performed based on instruction type, the dataflow engine is also configured based on instruction type. The source operands and operation for the address generation units are set based on instruction type. Note that the address generation units also take information from the monitored event as inputs. Thus, in the same way that the monitoring core selects metadata based on register indices or the memory address of the specific monitoring event, the dataflow engine also reads the appropriate flags. In addition, the filter decision table is configured based on instruction type to decide what combination of input flags will lead to a filtered event and whether propagation is required.

5.3.4 Filtering Null Metadata

One way to reduce the number of monitoring events that must be handled by the monitoring core is to filter out monitoring events that correspond to operating on null metadata. Null metadata correspond to events that are not relevant to the monitoring core. For example, Hardbound [25] filters out operations on non-pointer (i.e., no base and bounds metadata) instructions since they are not relevant to array bounds checking. More recently, FADE [30] has been proposed as a general hardware module to perform this null metadata filtering for a variety of monitoring schemes. Our architecture is able to support this null metadata filtering with a small modification.

Figure 5.5 shows an example of how this null filtering operates. Here, the main core’s code has been slightly modified from Figure 5.1 and on line 2, r2 is no longer set as an array pointer. Without null metadata filtering, the instruction for line 4 would be forwarded to the monitoring core since the system does not know whether r2 contains a pointer address or not. However, if we use a 1-bit flag to mark r2 as null when it is loaded with a constant, then we can propagate this null information and filter out monitoring for line 4.

The operations performed by null filtering are almost identical to the oper- ations needed for invalidation shown in Figure 5.2 except for a small change in the propagation policy. Instead of taking a logical OR of the source invalidation

flags to determine whether monitoring can be skipped, null metadata filtering takes a logical AND of the source null flags. Thus, we can easily enable this null metadata filtering on our architecture by extending the dataflow flags to be two bits wide. One bit is used to keep track of invalidation information while the second bit is used to keep track of null information. All flags are initialized to null and the filter decision table is extended with the propagation and filtering decision rules for null metadata filtering. The result is a single hardware design that enables both partial monitoring and null metadata filtering.
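As a rough illustration of this two-bit extension (again a software sketch with illustrative names, not the hardware itself), propagation keeps both bits, dropping a check requires any source to be invalid, and filtering requires all sources to be null:

    #include <stdbool.h>

    /* Two 1-bit flags per metadata entry: "invalid" (set when an update
     * was dropped) and "null" (set when the data is irrelevant to the
     * monitor, e.g., a non-pointer for bounds checking). */
    typedef struct {
        bool invalid;
        bool null_md;
    } dataflow_flags_t;

    /* Propagation follows the metadata dataflow for both bits. */
    static dataflow_flags_t propagate(dataflow_flags_t a, dataflow_flags_t b) {
        dataflow_flags_t out;
        out.invalid = a.invalid || b.invalid;  /* any invalid source taints the result */
        out.null_md = a.null_md && b.null_md;  /* result is null only if all sources are null */
        return out;
    }

    /* An event is not forwarded to the monitoring core if it must be
     * dropped (possible false positive) or can be filtered (null). */
    static bool skip_event(dataflow_flags_t a, dataflow_flags_t b) {
        bool drop   = a.invalid || b.invalid;   /* OR rule for invalidation */
        bool filter = a.null_md && b.null_md;   /* AND rule for null filtering */
        return drop || filter;
    }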

Figure 5.6: Slack and its effect on monitoring over time. (Slack accumulates from an initial value while the main program runs, is consumed by monitoring overheads, and monitoring is dropped when no slack remains.)

5.4 Dropping Policies

In order to use partial monitoring to enable adjustable overhead, we must also specify a policy for when and which monitoring events are dropped. In this section, we discuss some of the options and trade-offs for dropping policies. We split this decision into two components:

1. When do we need to drop events in order to enable reduced overhead? (Section 5.4.1)

2. Which events should be dropped? (Section 5.4.2)

5.4.1 Deciding When to Drop

In this section, we discuss two possible ways to determine when events should be dropped. The first possibility is to probabilistically drop events. By setting the probability of dropping events appropriately, overhead can be reduced. This works well for enabling partial monitoring for cooperative testing and debugging since the randomness allows different users and runs to monitor different portions of the program. However, using a probabilistic dropping policy can make it difficult to meet a target overhead without prior profiling.
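A probabilistic decision of this kind could be sketched as follows; the keep probability is a configurable parameter, and the names are illustrative rather than part of the actual design.

    #include <stdbool.h>
    #include <stdlib.h>

    /* Probabilistic dropping: each monitoring event is kept with a fixed
     * probability, independent of the current overhead.  keep_permille is
     * the probability of keeping an event, in units of 0.1%. */
    static bool drop_event_probabilistic(unsigned keep_permille) {
        return ((unsigned)rand() % 1000u) >= keep_permille;
    }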

Alternatively, we can specify a target overhead and estimate, at run-time, the overhead of monitoring in order to decide whether dropping is needed. The overhead budget is specified as a percentage of the main program's execution cycles without monitoring. In order to estimate the overhead at run-time, we define slack as the number of cycles of monitoring overhead that can be incurred while staying within the target budget. Slack is essentially the difference between the actual overhead seen and the budget specified. Note that this definition differs slightly from the one used in Chapter 4 where dynamic slack was the difference between actual execution time and the worst-case execution time.

Slack is generated as the main program runs and consumed as monitoring overheads occur. For example, if no monitoring overheads occur during 1000 cycles of the main program's execution and the designer sets a 20% overhead target, then the slack that is built up during this period is 200 cycles. If the main core is then stalled for 50 cycles due to monitoring, then the remaining slack is 150 cycles. In addition to this accumulated slack, a small amount of initial slack can be given in order for monitoring to be performed at the start of a program. Figure 5.6 shows an example of how slack can change over time. In this slack-based policy, if the slack falls below zero (i.e., the overhead budget is exceeded), then monitoring events are dropped.

Figure 5.7 shows how slack can be easily measured in hardware by using a counter that increments on every k-th cycle of the main core (e.g., every 5th cycle for a 20% target budget). The value of this counter is the accumulated slack. Whenever the main core is stalled due to monitoring, the measured slack is decremented. It is difficult to precisely determine the entire impact of monitoring on the main core due to the difficulty in measuring certain overheads such as contention for shared memory. However, we have found that using only the stalls due to FIFO back pressure works well in practice.

Figure 5.7: Slack tracking and drop decision hardware. (A counter driven by the main core clock and the overhead percentage accumulates slack; monitoring stalls decrement it, and the drop decision compares the slack against the threshold.)
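To make the slack-based decision concrete, here is a minimal software sketch of what the counter in Figure 5.7 computes. In the actual design this is a hardware counter updated every cycle; the structure and names below are only for illustration.

    #include <stdbool.h>

    /* Slack accumulates at a rate set by the overhead budget (e.g., one
     * cycle of slack every k = 5 main-core cycles for a 20% budget) and
     * is consumed whenever the main core stalls because of monitoring. */
    typedef struct {
        long slack;       /* in cycles; may start with a small initial slack */
        unsigned k;       /* accumulate 1 cycle of slack every k-th cycle    */
        unsigned phase;   /* counts main-core cycles modulo k                */
    } slack_tracker_t;

    static void tick(slack_tracker_t *t, bool stalled_by_monitoring) {
        if (++t->phase == t->k) {   /* every k-th main-core cycle */
            t->phase = 0;
            t->slack++;
        }
        if (stalled_by_monitoring)  /* FIFO back-pressure stall */
            t->slack--;
    }

    /* Drop monitoring events whenever the budget is exceeded. */
    static bool should_drop(const slack_tracker_t *t) {
        return t->slack <= 0;
    }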

5.4.2 Deciding Which Events to Drop

In addition to deciding when dropping is required, trade-offs also exist in deciding which events should be dropped. The simplest policy is to drop monitoring events when slack is less than or equal to zero. However, this can result in wasted work. For example, consider the metadata dependence graph shown in Figure 5.8(a). Here, an edge from node A to node B represents that if event A is dropped, then due to its invalidated metadata, it will cause event B to be dropped. Square nodes indicate events where monitoring checks are performed. In the example, suppose that event E is meant to perform a check operation but is dropped. In this situation, the monitoring operations that were done for events C and D were wasted since their results were not used for any monitoring checks. That is, by the time we decide to drop event E, we have already updated metadata for events C and D even though they are no longer needed.

Figure 5.8: Comparison of dropping policies using metadata dependence graphs. Square nodes represent events where checks are performed. Blue (dark) nodes indicate which nodes can be dropped. (a) Unrestricted Dropping, (b) Source-Only Dropping, (c) Sub-Flow Dropping.

An alternative dropping policy which eliminates this wasted work is to only make dropping decisions at the root or source of these metadata flows (see Figure 5.8(b)). That is, we will decide to either monitor or not monitor an entire metadata flow. An example of these source nodes is the monitoring done to initialize base and bounds information on malloc for an array bounds check. These source nodes are easily identifiable by the dropping hardware because they typically correspond to the special instructions that are used to set up metadata information. Thus, it is not necessary to generate and analyze the monitoring dependence graph to identify source nodes. We refer to this dropping decision policy as source-only dropping and we refer to the previous policy of dropping any event as unrestricted dropping.

More complex dropping policies can be implemented given more detailed information about the monitoring dependence graph. This information can be found through static analysis or profiling, though we do not explore how to find it in detail here. Given this information, we describe one possible policy that uses it, which we call sub-flow dropping. Sub-flow dropping makes dropping decisions at a granularity in between unrestricted dropping and source-only dropping (see Figure 5.8(c)). The basic idea of sub-flow dropping is to drop portions of the monitoring dependence graph at the smallest granularity such that no work is wasted. Sub-flow dropping allows source nodes to be dropped and nodes after branch points in the monitoring dependence graph to be dropped. From Figure 5.8(c), this corresponds to nodes A, C, and F. For example, dropping node C causes nodes D and E to be skipped. However, performing monitoring on the remaining nodes allows the check at node G to be performed with no wasted work. Similarly, dropping F allows node E to be checked with no wasted work. Note that sub-flow dropping can still result in wasted work if there are merge points in the monitoring dependence graph and only part of the incoming flows are dropped, but it should result in less wasted work than an unrestricted dropping policy.
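One way to summarize the three policies is as an eligibility test that is applied once the when-to-drop decision has been made. The sketch below is illustrative only; in particular, identifying nodes after branch points is assumed to come from the offline profiling or static analysis described above, and events whose sources were already dropped are skipped automatically through the invalidation flags.

    #include <stdbool.h>

    typedef enum { UNRESTRICTED, SOURCE_ONLY, SUB_FLOW } drop_policy_t;

    /* Per-event properties.  is_source is recognizable at run time (the
     * special metadata set-up instructions, e.g., on malloc); is_after_branch
     * must be supplied by offline analysis of the dependence graph. */
    typedef struct {
        bool is_source;
        bool is_after_branch;
    } event_info_t;

    /* Returns true if this event may be dropped under the given policy. */
    static bool may_drop(drop_policy_t policy, event_info_t e) {
        switch (policy) {
        case UNRESTRICTED: return true;                        /* any event */
        case SOURCE_ONLY:  return e.is_source;                 /* whole flows only */
        case SUB_FLOW:     return e.is_source || e.is_after_branch;
        default:           return false;
        }
    }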

Source-only dropping will result in no wasted work and thus provide better coverage than unrestricted dropping at a given overhead. However, because of the coarser-grained decision, it may be more difficult to closely match overhead targets. Sub-flow dropping enables a design point in between unrestricted dropping and source-only dropping in terms of matching overhead and coverage achieved. If the monitoring dependence graph is highly connected, then source-only dropping will perform poorly. On the other hand, we expect source-only dropping to work well when there are a large number of independent metadata flows. Monitoring dependence graphs with a large number of branches will favor a sub-flow dropping policy.

The choice of dropping policy can also depend on whether probabilistic dropping is performed instead of slack-based dropping. Probabilistic dropping works poorly with an unrestricted dropping policy. Since every event in a dependent chain (e.g., events A through E) needs to be monitored in order for the monitoring check to be useful, randomly dropping any event is likely to cause the final check to be invalid by dropping at least one event in the chain. Instead, source-only dropping works well when performing probabilistic dropping.

Depending on the target application of partial monitoring, different policies are more applicable. For applications where closely matching an overhead target is important, a slack-based, unrestricted dropping policy is appropriate. However, if matching the overhead target is not as important, then a slack-based, sub-flow or source-only dropping policy could provide better coverage. Finally, if the goal is to use partial monitoring to enable cooperative debugging and testing with very low overhead, then a probabilistic, source-only or sub-flow dropping policy can be used to provide good total coverage over multiple runs.

5.5 Evaluation

5.5.1 Experimental Setup

We implemented our dataflow-guided monitoring architecture by modifying the ARM version of the gem5 simulator [10] to support parallel run-time monitoring. We model the main and monitoring cores as running at 2.5 GHz with 4-way set-associative 32 kB private L1 I/D caches and a shared 8-way 2 MB L2 cache. This setup is similar to the Snapdragon 801 processor commonly found in mobile systems. A 16-entry FIFO was used to connect the main core to the monitoring core. The dataflow engine uses a 1 kB cache for invalidation and null flags.

In order to explore the generality of the architecture for different monitors, we implemented five different monitors: uninitialized memory check (UMC), array bounds check (BC), dynamic information flow tracking (DIFT), instruction-mix profiling (IMP), and LockSet-based race detection (LS). Since we implement run-time monitoring using a processor core and our dataflow engine was designed to be generally applicable, the same hardware platform supports all of these monitoring techniques. Uninitialized memory check seeks to detect loading from memory locations that are not initialized first. Array bounds check is a monitoring scheme that aims to detect buffer overflows where memory accesses go beyond the boundaries of an array. We modify the implementation of malloc to set base and bound metadata information. Dynamic information flow tracking is a security monitoring scheme which detects when information from untrusted sources is used to affect the program control flow (i.e., indirect control instructions). For the benchmarks we consider, we mark data read from files as untrusted. We implemented a multi-bit DIFT scheme which marks untrusted data with a 32-bit metadata identifier so that if an error is detected, it is possible to have information about where the data originated from. Instruction-mix profiling counts the number of ALU, load, store, and control instructions that are executed. This profiling information can be useful for performing optimizations or to detect malicious software [84]. Finally, LockSet [73] attempts to detect possible race conditions in multi-threaded programs by tracking metadata about which locks are protecting shared memory locations. If a shared memory location is accessed while unprotected then a race condition may exist.

We tested our system using benchmarks from SPECint CPU2006 [80]. Since our implementation of BC depends on the modification of malloc to set array bounds information, we focus on the C SPECint benchmarks. Although we do not show results for the C++ benchmarks, we note that the results for UMC, DIFT, and IMP for these benchmarks are similar to the other results shown. We fast-forwarded each benchmark for 1 billion instructions and then simulated for 2 billion instructions.

Since LockSet race detection requires multi-threaded programs, we tested it using the applications from the SPLASH-2 benchmark set [94]. Each benchmark application was run using 2 main cores. fmm and raytrace were run without fast-forwarding because they ran to completion in under 2 billion instructions. Each main core was connected to a dataflow engine and a monitoring core.

Figure 5.9: Monitoring overhead with null metadata filtering. (Normalized execution time for UMC, BC, DIFT, and IMP on each SPECint benchmark.)

5.5.2 Baseline Monitoring Overhead

Figure 5.9 shows the execution times of performing monitoring with null filtering enabled, normalized to the execution time of the benchmarks without monitoring. UMC sees normalized execution times of 2.6-9.8x with an average of 5.7x. For BC, normalized execution times are 2.7x on average and range from 1.02x to 13.6x. DIFT sees an average of 1.1x slowdown with a maximum slowdown of 1.3x with null metadata filtering. This low overhead is due to the fact that for our implementation of DIFT on SPEC benchmarks, we only mark data read from files as tainted. Instead, if we targeted network or streaming applications, which have larger amounts of untrusted input data, we would observe higher overhead. IMP shows normalized execution times of 1.6-9.5x with an average of 5.8x. Overheads for LS (not pictured) applied to the SPLASH-2 benchmarks are 1.04-1.29x after null metadata filtering and the average overhead observed is 1.13x. Our baseline system and overheads are similar to previous multi-core monitoring systems such as LBA-accelerated [16] and FADE [30] (see Table 5.1). In Section 5.5.4, we also evaluate a higher performance, FPGA-based monitor that shows low tens of percent of overhead.

Figure 5.10: Coverage vs. varying overhead budget for BC. (Coverage [%] per SPECint benchmark for overhead budgets of 1.1x, 1.5x, 2.0x, 4.0x, 8.0x, and 14.0x.)

Figure 5.11: Coverage vs. varying overhead budget for UMC. (Coverage [%] per SPECint benchmark for overhead budgets of 1.5x, 2.0x, 4.0x, 6.0x, 8.0x, and 12.0x.)

5.5.3 Coverage with Adjustable Partial Monitoring

In this section, we evaluate the effectiveness of using partial monitoring to trade off coverage for reduced overhead. For these results, we use a slack-based, unrestricted dropping policy. Figure 5.10 shows the monitoring coverage achieved by array bounds checking as we vary the overhead budget. We define monitoring coverage as the percentage of dynamic checks that are performed (indirect jumps in DIFT, loads in UMC, and memory accesses in BC and LS). The metric is chosen to understand how likely an error/attack instance is to be detected on an individual system, assuming that errors/attacks are uniformly distributed across checks. This may not necessarily be true for actual errors/attacks and so the reported coverage may not be the same as the probability of detecting actual errors/attacks. However, we believe this metric provides a good initial estimate of the detection capabilities of the system. In the figure, the bottom portion of each bar shows the coverage for a 1.1x overhead target and the additional coverage for increased overhead targets is stacked above this.

Figure 5.12: Coverage vs. varying overhead budget for LS. (Coverage [%] per SPLASH-2 benchmark for overhead budgets of 1.01x, 1.05x, 1.1x, 1.2x, 1.3x, and 1.4x.)

Figure 5.13: Average error vs. varying overhead budget for instruction-mix profiling. Whisker bars show min and max errors. (Error [%] per SPECint benchmark for overhead budgets of 1.1x, 1.5x, 2.0x, 4.0x, 6.0x, and 8.0x.)
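Stated as a formula (a restatement of the definition above, with the notation introduced only here), the coverage plotted in these figures is

\[
\text{coverage} \;=\; \frac{\#\,\text{dynamic checks performed}}{\#\,\text{dynamic checks under full monitoring}} \times 100\%.
\]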

We see that by varying the overhead budget, the coverage achieved also varies. With only a 1.1x overhead target, array bounds check still achieves over 80% monitoring coverage on average. The high coverage achieved with such low overhead is due to two main effects. The first is that monitoring can be done in parallel, providing a certain level of monitoring coverage without introducing overhead. The second effect is that there may still exist a large number of monitoring events that do not lead to bounds checks. As a result, dropping these events can reduce overhead without a large impact on monitoring coverage.

Figure 5.11 shows the analogous graph for UMC. Again, we see that varying overhead budgets enables varying levels of partial monitoring. With a 2x overhead target, UMC achieves 22% monitoring coverage on average and with a 4x overhead target, this increases to 49%. Even higher coverage can be achieved by allowing higher overhead budgets. Similarly, Figure 5.12 shows the coverage achieved by LS. With only a 1.01x overhead, LS achieves 30% coverage on average. This increases to 71% coverage with a 1.1x overhead budget. For DIFT (not pictured), an overhead target of 1.05x is enough for all benchmarks tested to reach 100% coverage. Note that the overheads shown are the target overhead. In some cases, a high overhead target is needed in order to achieve 100% coverage. For example, for mcf with UMC monitoring, a 12x overhead target is needed to achieve 100% coverage while the overhead of performing monitoring in full was 2.6x (see Figure 5.9). This is due to the fact that we accumulate slack gradually and so bursty monitoring events may require a higher overhead target in order for all monitoring to be performed. However, the actual overheads seen are in line with the overhead of performing monitoring without dropping (e.g., mcf with UMC monitoring at a 12x overhead target shows a 2.6x overhead).

Figure 5.14: Coverage versus varying overhead budget for UMC running on an FPGA-based monitor. (Coverage [%] per SPECint benchmark for overhead budgets of 1.01x, 1.05x, 1.10x, 1.50x, and 2.00x.)

The instruction-mix profiling monitor does not perform check operations. Thus, the concept of coverage is not applicable here. Instead, for each instruction type profiled, we calculate its count as a percentage of the total instructions monitored. We take the difference of these percentages compared to the case when full monitoring is performed to calculate the error. Figure 5.13 shows the min, max, and average error across instruction types for each benchmark and for varying overhead. We see that with a 1.1x overhead, a max error of 9% is observed across the benchmarks and the average error across instruction types and benchmarks is 3%.
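Concretely, if $N_t$ denotes the number of monitored instructions of type $t$ (notation introduced here for illustration only), the per-type error plotted in Figure 5.13 can be written as

\[
\text{error}_t \;=\; \left|\; \frac{N_t^{\text{partial}}}{\sum_{t'} N_{t'}^{\text{partial}}} \;-\; \frac{N_t^{\text{full}}}{\sum_{t'} N_{t'}^{\text{full}}} \;\right| \times 100\%.
\]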

5.5.4 FPGA-based Monitor

In addition to evaluating our system for a core-based monitor, we also evaluated partial monitoring on a higher performance FPGA-based monitor. This setup is based on the FlexCore [23] setup and uses a fully-pipelined monitor running on an FPGA fabric at half the frequency of the main core. In this case, the full overheads of monitoring range from 1.1-1.9x with an average overhead of 1.4x. Figure 5.14 shows the coverage achieved as we sweep the overhead target from 1.01-2.0x. We see that partial monitoring can allow the overhead of such a system to be pushed to 1.1x while still providing 45% coverage.

Figure 5.15: Comparing dropping policies for UMC. (a) Overhead budget error. (b) Coverage. (Overhead difference [%] and coverage [%] per SPECint benchmark for unrestricted dropping and source-only dropping.)

5.5.5 Comparing Dropping Policies

In this section, we evaluate the trade-offs between different dropping policies. Figure 5.15(a) shows the difference between the overhead budget and the run-time monitoring overhead for UMC when the overhead target is set to 2.0x. A positive value means that the overhead target was overshot while a negative value indicates that the overhead budget was met. For most benchmarks, we see similar results for unrestricted dropping and source-only dropping due to the fact that UMC consists of a large number of short, independent monitoring dependence chains. Source-only dropping causes overshoot of the overhead target for hmmer and h264ref. Figure 5.15(b) shows the coverage of UMC for unrestricted dropping and source-only dropping. We see that, in several cases, source-only dropping achieves higher coverage than unrestricted dropping while still closely matching the overhead target. For example, perlbench shows a 10% increase in coverage and libquantum shows an 18% increase in coverage by using source-only dropping. This is due to the fact that some of the overhead of monitoring for unrestricted dropping is being spent on wasted work as discussed in Section 5.4.2.

Next, we evaluate these trade-offs between source-only dropping and unrestricted dropping for BC. Figure 5.16(a) shows the overhead differences for BC and Figure 5.16(b) shows the coverage for BC. Here, the overhead target is 1.5x. From Figure 5.16(a), we see that for several benchmarks, source-only dropping fails to meet the specified overhead target. The overshoot of the overhead target is quite high with overhead differences over 100% for bzip2, hmmer, and h264ref. Since only infrequently occurring array allocations are considered as source events for BC, it can be difficult for source-only dropping to match overhead targets. Although Figure 5.16(b) shows higher coverage for source-only dropping, this is largely due to the fact that it is running with higher overhead.

Figure 5.16: Comparing dropping policies for BC. (a) Overhead budget error. (b) Coverage. (Overhead difference [%] and coverage [%] per SPECint benchmark for unrestricted dropping and source-only dropping.)

From these results, we see that depending on the monitoring scheme and the program behavior, source-only dropping can provide better coverage than unrestricted dropping with the same overhead. However, unrestricted dropping is better at meeting an overhead target. Thus, for applications where staying within an overhead target is especially important, such as soft real-time systems, a slack-based, unrestricted dropping policy is more appropriate. On the other hand, if maximum coverage is desired and occasional slowdowns are acceptable, then source-only dropping can provide better coverage on average.

Figure 5.17: Coverage over multiple runs for BC with a 10% probability to not drop events. (Coverage [%] after 1 to 8 runs for unrestricted, sub-flow, and source-only dropping.)

The results for the sub-flow dropping policy are not included in these graphs because our profiling infrastructure to identify sub-flow nodes currently does not support fast-forwarding. Instead, we compared sub-flow dropping with the other policies by running simulations for 1 billion instructions without fast-forwarding. The results (not shown) showed that sub-flow dropping produced similar coverage and overhead target matching to unrestricted dropping for the benchmarks and monitoring schemes that we tested.

5.5.6 Multiple-Run Coverage

One application of partial monitoring with low overhead is to enable cooperative debugging. The idea with cooperative debugging is to use partial monitoring with very low overhead across a large number of users or runs. By varying the pattern of partial monitoring done on each run, the goal is to achieve high coverage across multiple runs. Varying the monitoring that is done for different runs can be achieved by using a probabilistic dropping policy. Figure 5.17 shows the total coverage achieved over multiple runs for array bounds check using unrestricted, sub-flow, and source-only dropping policies with probabilistic dropping. These numbers are averaged across all benchmarks. Here, we use a 10% probability that events are not dropped. Each run was simulated for 500 million instructions. Since the effectiveness of cooperative debugging is often measured by code coverage, coverage here is measured as the percentage of static instructions which are monitored in at least one of the runs. We see that for the unrestricted dropping policy there is almost no increase in coverage. This is due to the fact that it is likely that at least one monitoring event in a metadata dependence chain will be dropped with the unrestricted dropping policy. Instead, sub-flow dropping and source-only dropping are much better suited for achieving high coverage over multiple runs. Source-only dropping shows a 6% increase in coverage with only eight runs while sub-flow dropping shows a 1% increase in coverage over eight runs. While sub-flow dropping shows a slower increase in coverage than source-only dropping, sub-flow dropping is better able to meet overhead targets. With enough runs, both sub-flow dropping and source-only dropping should be able to reach 100% code coverage.

5.5.7 Area and Power Overheads

Adding the dataflow engine in order to enable filtering and partial monitoring adds overheads in terms of area and power consumption. We use McPAT [46] to get a first-order estimate of these area and power overheads in a 40 nm technology node. McPAT estimates the main core area as 2.71 mm² and the peak power usage as 965 mW averaged across all benchmarks. The average runtime power usage was 385 mW. These area and power numbers consist of the core and L1 cache, but do not include the L2 cache, memory controller, and other peripherals. The power numbers include dynamic as well as static (leakage) power. For the dataflow engine, we modeled the ALUs used for address calculation, the dataflow flag register file and cache, the configuration tables, and the filter decision table. These were modeled using the corresponding memory and ALU objects in McPAT. We note that this is only a rough area and power estimate since components such as the wires connecting these modules have not been modeled. However, this gives a sense of the order-of-magnitude overheads involved with implementing our approach.

Monitor | Peak Power [mW] | Runtime Power [mW]
UMC | 48 (4.9%) | 29 (7.7%)
BC | 69 (7.1%) | 41 (10.7%)
DIFT | 72 (7.4%) | 41 (10.6%)
IMP | 41 (4.3%) | 20 (5.1%)
LS | 31 (3.2%) | 13 (3.9%)

Table 5.2: Average power overhead for dropping hardware. Percentages are normalized to the main core power.

Our results show that an additional 0.197 mm² of silicon area is needed, an increase of 7% of the main core area. Table 5.2 shows the peak and runtime power overheads averaged across all benchmarks running with a 1.5x monitoring overhead target. The peak power is 31-72 mW, which is less than 8% of the main core's peak power usage. Similarly, the average runtime power is 13-43 mW, corresponding to 4-11% of the main core's runtime power.

Table 5.3 shows the runtime power usage of the monitoring core averaged across all benchmarks. These results are shown for an overhead target of 1.5x as well as when full monitoring is performed.

Monitor | 1.5x Overhead | Full Monitoring
UMC | 338 mW | 363 mW
BC | 320 mW | 343 mW
DIFT | 304 mW | 304 mW
IMP | 345 mW | 367 mW
LS | 327 mW | 327 mW

Table 5.3: Average runtime power of the monitoring core.

CHAPTER 6
DVFS CONTROL FOR SOFT REAL-TIME SYSTEMS

6.1 Introduction

In the previous chapters, we have explored improving the security and reliability of real-time systems. We have looked at the challenges and design choices involved with applying run-time monitoring to real-time systems. However, information about real-time requirements can also provide an opportunity with which to better optimize designs. In this chapter, we discuss how soft real-time constraints can be used to inform the control of dynamic voltage and frequency scaling (DVFS) for improved energy-efficiency.

Many modern applications on mobile and desktop systems are interactive. That is, users provide inputs and expect a response in a timely manner. For example, games must read in user input, update game state, and display the result within a small time window in order for the game to feel responsive. Previous studies on human-computer interaction have shown that latencies of at most 100 milliseconds are required in order to maintain a good experience [29, 13, 57] while variations in response time faster than 50 milliseconds are imperceptible [51, 98]. These tasks are effectively soft real-time tasks which have a user response-time deadline. The task must finish by the deadline for good user experience, but finishing faster does not necessarily improve the user experience due to the limits of human perception.

As finishing these tasks faster is not beneficial, we can use power-performance trade-off techniques, such as dynamic voltage and frequency scaling (DVFS), in order to reduce energy usage while maintaining the same utility to the user. By reducing the voltage and frequency, we can run the task slower and with less energy usage. As long as the task still completes by the deadline, then the experience for the user remains the same.

Existing Linux power governors [12] do not take into account these response-time requirements when adjusting DVFS levels. More recent work has looked at using DVFS to minimize energy in the presence of deadlines. However, these approaches are typically reactive, using past histories of task execution time as an estimate of future execution times [35, 19, 53, 60]. Yet the execution time of a task can vary greatly depending on its input. Reactive controllers cannot respond fast enough to input-dependent execution time variations. Instead, proactive or predictive control is needed in order to adjust the operating point of a task before it runs, depending on the task inputs. This has been explored for specific applications [34, 99, 98, 40], but these approaches were developed using detailed analysis of the applications of interest. As a result, they are not easily generalizable to other domains.

In this chapter, we present an automated and general framework for creating prediction-based DVFS controllers. Given an application, we show how to create a controller to predict the appropriate DVFS level for each task execution, depending on its input and program state values, in order to satisfy response-time requirements. Our approach is based on recognizing that the majority of execution time variation can be explained by differences in control flow. Given a task, we use program slicing to create a minimal code fragment that will calculate control-flow features based on input values and current program state. We train a model to predict the execution time for a task given these features. At run-time, we are able to quickly run this code fragment and predict the execution time of a task. With this information, we can appropriately select a DVFS level in order to minimize energy while satisfying response-time requirements.

Figure 6.1: Execution time of jobs (frames) for ldecode (video decoder). (Execution time [ms] over a sequence of frames.)

The rest of the chapter is organized as follows. Section 6.2 discusses general characteristics of interactive applications and the challenges of designing a DVFS controller to take advantage of execution time variation. Section 6.3 discusses our method of predicting execution time and using this to inform DVFS control. Section 6.4 discusses the overall framework and system-level issues. Finally, Section 6.5 shows our evaluation results.

6.2 Execution Time Variation in Interactive Applications

6.2.1 Variations in Execution Time

The execution time for a task can vary significantly from job to job. Figure 6.1 shows the execution time per job (frame) for a video decoder application (ldecode) running on an ODROID-XU3 development board (see Section 6.5 for full experimental setup). We can see that there are large variations in execution time from job to job due to differences in input and program state when each job executes. Because of this large variation, setting the appropriate DVFS level is a difficult problem.

Using a single DVFS level based on the average execution time (28.6 milliseconds) will lead to a number of jobs missing their deadline. On the other hand, looking at the worst-case execution time (32.3 milliseconds) implies that the application must be run at its maximum frequency in order to run at 30 frames per second, resulting in no energy savings from using DVFS. Instead, the large variations from job to job imply that we need a fine-grained, per-job decision of the DVFS level to use in order to minimize energy usage while avoiding deadline misses.

6.2.2 Existing DVFS Controllers

There are a large number of existing and proposed DVFS controllers. Most controllers, such as the built-in Linux governors [12], adjust DVFS based on CPU utilization. When utilization is high, voltage and frequency are increased, while when utilization is low, voltage and frequency are decreased. This does not explicitly take into account deadlines and can result in high energy usage or deadline misses. For example, high CPU utilization can cause a high voltage and frequency level to be used. However, the time budget for the task may actually be very long and a lower voltage and frequency would be sufficient, resulting in lower energy usage. Similarly, CPU utilization for a job could be low due to memory stalls, causing the controller to lower voltage and frequency levels. However, if the task has a tight time budget, then this can result in a deadline miss, whereas running at higher frequencies may have been able to meet the deadline.

Figure 6.2: Execution time of jobs (blue, solid) and execution time expected by a PID controller (red, dashed) for ldecode (video decoder).

DVFS has been explored in hard real-time systems in order to save energy while guaranteeing that deadlines are met [72]. In order to ensure that deadlines are never missed, the analysis must be strictly conservative, which limits the amount of energy that can be saved. That is, a task will always be run at a frequency such that even the slowest jobs will meet the deadline.

Other work has explored using run-time information to inform DVFS control in the presence of deadlines. These approaches are largely reactive and use information about past job execution times to predict future job times [35, 19, 53, 60]. This can capture coarse-grained phase changes in execution time, but cannot capture the fine-grained job-to-job variations in execution times as the adjustment in DVFS level happens too late. For example, Figure 6.2 shows the expected execution times that a PID-based controller uses for setting the DVFS level and the real execution times of the jobs. As we can see, the PID controller's decision lags the actual execution times of the jobs.

Figure 6.3: Overview of prediction-based control. (Each job is preceded by a small prediction step that sets the DVFS level so that the job fits within its time budget.)

More recently, people have investigated predicting job execution time and setting DVFS in order to meet deadlines for specific applications (e.g., game rendering [34], web browsing [99, 98], web server/Memcached [40]). These approaches involved careful analysis of the application of interest, requiring extensive programmer effort, in order to design the controller. As a result, the resulting controllers cannot be applied to other applications.

6.2.3 Prediction-Based Control

Our goal in this work is to develop a general and automated framework in order to create prediction-based DVFS controllers that can minimize energy usage without missing deadlines. Figure 6.3 shows an overview of the operation of our proposed prediction-based controller. The basic idea is to pre-pend tasks with a small segment of code. This segment of code will predict the appropriate DVFS level to use for each of the task's jobs depending on the job's input and current program state.

Figure 6.4: Example control flow graph. Each node is annotated with its number of instructions. (Node instruction counts: 1, 10, 1, and 5.)

The main source of execution time variation between jobs is due to different inputs and program state. Thus, the main challenge in developing a prediction-based DVFS controller is determining how to map job input and program state values to the appropriate DVFS frequency level. In general, finding a direct mapping from input values to frequency levels is challenging because the mapping can be irregular and complicated. In addition, this mapping varies from application to application. For example, for one application, pressing the “down” key may correspond to a large increase in execution time while for other applications it may have no effect on execution time. Our solution is to take advantage of the program source to give us hints about how input values and program state will affect execution time. We use the program source to automatically generate a prediction-based DVFS controller.

Figure 6.5: Steps to predict execution time from job input and program state. (Job input and program state → generate program features → predict job execution time → predict frequency.)

6.3 Execution Time Prediction for Energy-Efficiency

6.3.1 Overview

The basic intuition behind our prediction methodology is that, to first-order, execution time correlates with the number of instructions run. Variations in the number of instructions run are described by the control flow taken by a specific job. For example, consider the control flow graph for a task shown in Figure 6.4. Each node is marked with its number of instructions. Taking the left branch instead of the right branch corresponds to nine more instructions being executed. Similarly, each additional loop iteration of the last basic block adds five instructions to the number of instructions executed. By knowing which branch is taken and the number of loop iterations, we can know the number of instructions executed and estimate the execution time. With an estimate of the execution time, we can then estimate the performance-scaling impact of DVFS and choose an appropriate frequency and voltage level to run at in order to just meet the deadline.

Figure 6.5 shows the main steps in our prediction method. We first instru-

Conditionals:
    if(condition){ ... }          →   if(condition){ feature[0]++; ... }
Loops:
    for(i=0;i<N;i++){ ... }       →   for(i=0;i<N;i++){ feature[1]++; ... }
    while(n=n->next){ ... }       →   while(n=n->next){ feature[2]++; ... }
Function pointer calls:
    (*func)();                    →   feature[3]=func; (*func)();

Figure 6.6: Example of feature counters inserted for conditionals, loops, and function calls.

ment the task source code and use program slicing to create a code fragment that will calculate control flow features for a job. The code fragment is run before a job executes in order to generate the control flow features (Section 6.3.2). Next, we use a linear model, which we train off-line, to map control flow features to an execution time estimate for the job (Section 6.3.3). Finally, we use classical lin- ear models [96, 95] that describe the frequency-performance trade-off of DVFS to select an appropriate frequency (Section 6.3.4).

6.3.2 Program Features

The first step needed for our prediction is to generate control flow features. That is, we want to know the control flow of a task when executing with a specific input and program state. For this purpose, we instrument the task source to count these control flow features. Specifically, we instrument the task to count the following features:

Original Code:       if(condition){ compute(); }
Instrumented Code:   if(condition){ feature[0]++; compute(); }
Program Slice:       if(condition){ feature[0]++; }

Figure 6.7: Example of program slicing for control flow features.

• Number of iterations for each loop

• Number of times each conditional branch is taken

• Address of each function pointer call

Figure 6.6 shows examples of how these features are instrumented. We focus on control flow features because these explain most of the execution time vari- ation, especially over a large number of instructions. However, other features, such as variable values or memory accesses, could be included to improve the prediction accuracy.

Generating these features using an instrumented version of the task code is not suitable for prediction because the instrumented task will take at least as long as the original task to run. Instead, we need to quickly generate these features before the task execution. In order to minimize the prediction execu- tion time, we use program slicing [86] to produce the minimal code needed to calculate these features. Figure 6.7 shows a simple example of this flow. By re- moving the actual computation and only running through the control flow, the execution time can be greatly reduced. Note that since the information from this slice will ultimately be used to make a heuristic decision on DVFS control,

Variable   Type     Description
ȳ          Scalar   Predicted execution time
x          Vector   Feature values
β          Vector   Model coefficients
y          Vector   Profiled execution times
X          Matrix   Profiled feature values
Xβ − y     Vector   Prediction errors
α          Scalar   Under-predict penalty weight
γ          Scalar   Number of terms penalty weight
‖·‖        Scalar   L2-norm (sum of squares)
‖·‖₁       Scalar   L1-norm (sum of absolute values)

Table 6.1: Variable and notation descriptions.

we do not need an exact slice that perfectly calculates the program features. Is- sues such as pointer aliasing can limit the amount of code removed for an exact slice. Instead, using an approximate slice can reduce the slice’s size and execu- tion time. As long as inaccuracies in the generated program features are low, an approximate slice is adequate for our prediction needs. We refer to the resulting program slice that computes the control flow features as the prediction slice or simply as the slice.

One problem that arises with running this prediction slice before a task is the issue of side-effects. That is, the slice could write to global variables and break the correctness of the program. In order to prevent this, the slice creates local copies of any global variables that are used. Values for these local copies are updated at the start of the slice and writes are only applied to the local copy. A similar process is applied to any arguments that are passed by reference.
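The following sketch illustrates both ideas on a hypothetical task (none of the names below come from our implementation): the prediction slice reproduces only the control flow that determines the loop trip count, and the global cursor it reads is copied into a local variable so that running the slice cannot perturb program state.

    /* Hypothetical task: processes up to `count` items starting from a
     * global cursor that the task advances as it runs. */
    #define N_ITEMS 1024
    int g_pos;                        /* global program state */
    int feature[4];                   /* control flow feature counters */

    void task(int count) {
        for (int i = 0; i < count && g_pos < N_ITEMS; i++, g_pos++) {
            /* expensive per-item computation elided */
        }
    }

    /* Prediction slice: keeps only the control flow that determines the
     * loop trip count.  The global cursor is copied into a local so the
     * slice's updates never leak back into program state. */
    void task_slice(int count) {
        int local_pos = g_pos;        /* local copy of the global */
        for (int i = 0; i < count && local_pos < N_ITEMS; i++, local_pos++) {
            feature[1]++;             /* iteration count for this loop */
        }
    }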

6.3.3 Execution Time Prediction Model

Next, we need to predict the execution time from the control flow features. This section describes our model that maps features to execution time. Table 6.1 sum- marizes the variables and notation that are used in this section. We use a linear model to map features to execution time as this captures the basic correlation.

Higher-order or non-polynomial models may provide better accuracy. How- ever, a linear model has the advantage of being both simple to train and fast to evaluate at run-time. In addition, it is always convex which allows us to use convex optimization-based methods to fit the model. Our linear model can be expressed as

$$\bar{y} = x\beta$$

where $\bar{y}$ is the predicted execution time, $x$ is a vector of feature values, and $\beta$ are the coefficients that map feature values to execution time. These $\beta$ coefficients are fit using profiling data. Specifically, we profile the program to produce a set of training data consisting of execution times $y$ and feature vectors $X$ (i.e., each row of $X$ is a vector of features, $x_i$, for one job). Note that in order to achieve the expected linear correlation between features and execution time, addresses recorded for function calls are converted to a one-hot encoding indicating whether particular function addresses were called or not.

The most common way to fit a linear model is to use linear least squares regression. Linear least squares regression finds the coefficients β that minimize the mean square error:

$$\operatorname*{argmin}_{\beta} \; \|X\beta - y\|^2$$

Essentially, this aims to minimize the overall magnitude of the prediction errors. That is, it weights negative and positive errors equally. However, these two errors lead to different behaviors on our system. Negative errors (under-prediction) lead to deadline misses since we predict the job to run faster than its actual execution time. On the other hand, positive errors (over-prediction) result in an overly conservative frequency setting which does not save as much energy as possible. In order to maintain a good user experience, we would prefer to avoid deadline misses, possibly at the cost of energy usage. In other words, we should place greater weight on avoiding under-prediction as opposed to over-prediction.

We can place greater weight on under-prediction by modifying our opti- mization objective:

$$\operatorname*{argmin}_{\beta} \; \|\mathrm{pos}(X\beta - y)\|^2 + \alpha\|\mathrm{neg}(X\beta - y)\|^2$$

where $\mathrm{pos}(x) = \max\{x, 0\}$ and $\mathrm{neg}(x) = \max\{-x, 0\}$ and these functions are applied element-wise to vectors. Thus, $\|\mathrm{pos}(X\beta - y)\|^2$ represents the over-prediction error and $\|\mathrm{neg}(X\beta - y)\|^2$ represents the under-prediction error. $\alpha$ is a weighting factor that allows us to place a greater penalty on under-predictions by setting $\alpha > 1$. Since this objective is convex, we can use existing convex optimization solvers to solve for $\beta$.
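For illustration only (the text above does not prescribe a particular solver formulation), one standard way to pass this objective to an off-the-shelf convex solver is to introduce auxiliary vectors $u$ and $v$ for the positive and negative parts of the error:

$$\begin{aligned}
\min_{\beta,\,u,\,v}\quad & \|u\|^2 + \alpha\|v\|^2 \\
\text{subject to}\quad & u \ge X\beta - y,\quad v \ge y - X\beta,\quad u \ge 0,\quad v \ge 0,
\end{aligned}$$

with all inequalities element-wise. At the optimum, $u = \mathrm{pos}(X\beta - y)$ and $v = \mathrm{neg}(X\beta - y)$, so this quadratic program has the same minimizing $\beta$ as the objective above.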

Coefficients which are zero imply that the corresponding control flow features do not need to be calculated by the prediction slice. We can use this information to further reduce the size and execution time of the prediction slice. We can extend our optimization objective to favor using fewer features by using the

[Plot: average execution time [ms] versus 1/frequency [ns] for ldecode.]

Figure 6.8: Average execution time of jobs (frames) for ldecode (video decoding) as frequency level varies.

Lasso method [85]:

$$\operatorname*{argmin}_{\beta} \; \|\mathrm{pos}(X\beta - y)\|^2 + \alpha\|\mathrm{neg}(X\beta - y)\|^2 + \gamma\|\beta\|_1$$

where $\|\cdot\|_1$ is the L1-norm and $\gamma$ is a weighting factor that allows us to trade off prediction accuracy with the number of features needed.

6.3.4 DVFS Model

Given a predicted execution time, we need to estimate how the execution time will change with varying frequency. For this, we use the classical linear model found in literature [96, 95]:

$$t = T_{mem} + N_{dependent}/f$$

where t is the execution time, Tmem is the memory-dependent execution time that does not scale with frequency, Ndependent is the number of CPU cycles that do not overlap with memory and scale with frequency, and f is the frequency. In order to verify this linearity assumption, we ran our benchmark applications at each available frequency level. We then tested the linear correlation between

average job time and frequency for each application. Figure 6.8 shows the average job execution time versus $1/f$ for ldecode (video decoder application). We can see that $t$ and $1/f$ do show a linear relationship. All applications we tested showed a linear relationship between execution time and $1/f$ with squared correlation coefficient $r^2 > 0.99$.

For this linear model, by predicting the execution time at two points, we can determine $T_{mem}$ and $N_{dependent}$ for a job and calculate the minimum frequency $f$ to satisfy a given time budget $t_{budget}$. More specifically, we predict the execution time $\bar{t}_{f_{min}}$ at minimum frequency $f_{min}$ and the execution time $\bar{t}_{f_{max}}$ at maximum frequency $f_{max}$. Using these two points, we can calculate $T_{mem}$ and $N_{dependent}$ as

$$N_{dependent} = \frac{f_{min} f_{max} (\bar{t}_{f_{min}} - \bar{t}_{f_{max}})}{f_{max} - f_{min}}, \qquad
T_{mem} = \frac{f_{max}\,\bar{t}_{f_{max}} - f_{min}\,\bar{t}_{f_{min}}}{f_{max} - f_{min}}$$

For a given budget $t_{budget}$, we want the minimum frequency $f_{budget}$ that will meet this time. This can be calculated as

$$f_{budget} = \frac{N_{dependent}}{t_{budget} - T_{mem}}$$

Since execution time can vary even with the same job inputs and program state, we add a margin to the predicted execution times used ($\bar{t}_{f_{min}}$ and $\bar{t}_{f_{max}}$). In our experiments we used a margin of 10%. A higher margin can decrease deadline misses while a lower margin can improve the energy savings. The resulting predicted frequency is the exact frequency that we expect will just satisfy the time budget. However, DVFS is only supported for a set of discrete frequency levels. Thus, the actual frequency we select is the smallest frequency allowed that is greater than $f_{budget}$.
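A minimal sketch of this frequency-selection step is shown below. It assumes the 10% margin used in our experiments; the function name, the example frequency table, and the unit conventions are illustrative rather than taken from our implementation. In practice, the budget handed to this step is the effective budget discussed below, i.e., the job's budget less the predictor and DVFS switching overheads.

    #include <stddef.h>

    /* Available DVFS levels in MHz, lowest to highest (example values
     * matching the ODROID-XU3 LITTLE core's 200-1400 MHz range). */
    static const double freq_levels_mhz[] =
        { 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400 };
    static const size_t n_levels =
        sizeof(freq_levels_mhz) / sizeof(freq_levels_mhz[0]);

    /* Pick the lowest supported frequency expected to finish within the
     * budget.  t_fmin and t_fmax are the predicted execution times (ms)
     * at the minimum and maximum frequencies f_min and f_max (MHz). */
    double select_frequency(double t_fmin, double t_fmax,
                            double f_min, double f_max, double t_budget_ms)
    {
        const double margin = 1.10;             /* 10% safety margin */
        t_fmin *= margin;
        t_fmax *= margin;

        /* Fit t = T_mem + N_dependent / f through the two predicted points. */
        double n_dep = f_min * f_max * (t_fmin - t_fmax) / (f_max - f_min);
        double t_mem = (f_max * t_fmax - f_min * t_fmin) / (f_max - f_min);

        /* Ideal (continuous) frequency that just meets the budget. */
        double slack = t_budget_ms - t_mem;
        if (slack <= 0.0 || n_dep / slack >= f_max)
            return f_max;                       /* budget too tight: run flat out */
        double f_budget = n_dep / slack;

        /* Round up to the smallest supported level >= f_budget. */
        for (size_t i = 0; i < n_levels; i++)
            if (freq_levels_mhz[i] >= f_budget)
                return freq_levels_mhz[i];
        return f_max;
    }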

[Timeline: Predictor → DVFS switch → Job; the effective budget is the original budget minus the predictor and DVFS switching time.]

Figure 6.9: The effective budget decreases due to slice and DVFS execution time.

[Heat map: DVFS switching time [µs] for each start frequency [MHz] and end frequency [MHz] pair.]

Figure 6.10: 95th-percentile switching times for DVFS.

The execution of the prediction slice and DVFS switch reduces the amount of time available for a job to execute and still satisfy its budget. Thus, the effective budget when choosing a frequency to run at needs to consider these overheads (see Figure 6.9). Although the execution time of the prediction slice can be mea- sured, the DVFS switching time must be estimated, as the switch has not been performed yet. This is done by microbenchmarking the DVFS switching time. Figure 6.10 shows the 95th-percentile DVFS switching times for our test plat- form for each possible start and ending frequency. We use the 95th-percentile switching times in order to be conservative in our estimate of DVFS switching time while omitting rare outliers.
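The effective budget can then be computed as sketched here (the names are illustrative; the switching estimate would come from a table filled in by the microbenchmarking just described, indexed by the current and target frequencies as in Figure 6.10):

    /* Reduce the job's budget by the measured slice time and the
     * estimated (95th-percentile) DVFS switching time. */
    double effective_budget_ms(double budget_ms, double slice_time_ms,
                               double dvfs_switch_p95_ms)
    {
        double eff = budget_ms - slice_time_ms - dvfs_switch_p95_ms;
        return eff > 0.0 ? eff : 0.0;
    }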

6.3.5 Alternate Prediction Models

In this section, we have described our specific prediction strategy for each step in our overall prediction flow shown in Figure 6.5. However, we note that each step in this prediction flow can be substituted with alternate models as long as it produces the needed prediction for the next step. The most obvious change would be to use more complex prediction models for each step (e.g., more fea- tures generated and higher-order, non-linear models) in order to improve the prediction accuracy. Here, we discuss some other possible directions for ex- tending our prediction flow.

For feature generation, we have focused on automated generation in order for the approach to be general and limit the need for domain-specific expertise. However, this does not preclude the programmer from manually adding “hints” that they expect would correlate well with a job's execution time. For example, the programmer may be able to extract metadata from input files and manually encode them as features.
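As a hypothetical example of such a hint (the structure, array index, and function below are purely illustrative), input metadata can simply be written into an extra feature slot alongside the automatically generated counters:

    #define N_AUTO_FEATURES 8         /* illustrative count of auto-generated features */
    extern int feature[];             /* feature vector filled by the prediction slice */

    struct frame_header { int width, height; };   /* metadata parsed from the input file */

    /* Manual hint: expose the frame size as an additional feature. */
    void add_manual_features(const struct frame_header *h)
    {
        feature[N_AUTO_FEATURES] = h->width * h->height;
    }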

One interesting extension to execution time prediction involves its use in feature selection. Additional constraints could be added to the execution time prediction in order to limit the use of features which require high overhead to generate. Features over some overhead threshold could be explicitly disallowed or the overhead for each feature could be introduced as penalties in the opti- mization objective.

The last step in our flow focuses on selecting an appropriate frequency level for DVFS control. However, this last step could be substituted to support other performance-energy trade-off mechanisms, such as heterogeneous cores. By us-

Original Code:
    while(true){
      task();
    }

Programmer Annotation:
    while(true){
      #pragma start_task 50ms
      task();
      #pragma end_task
    }

Figure 6.11: Example of programmer annotation to mark task boundaries and time budgets.

ing alternate models for how the execution time scales with the performance- energy trade-off mechanism, an appropriate operating point can be selected for the mechanism of interest.

6.4 System-Level Framework

This section describes the overall framework and operation of our prediction- based controller.

6.4.1 Programmer Annotation

In order to identify tasks and their time budgets, programmer annotation is required. The programmer must annotate the start and the end of a task and the desired response-time requirement. Figure 6.11 shows an example of this annotation. For ease of analysis and to ensure that tasks that start always end, we require the start and end of a task to be within one function. Arbitrary code paths can be modified to fit this model by using a wrapper function or re-writing the code. Multiple non-overlapping tasks can be supported, though we only considered one task in the applications we tested.
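For code paths that do not already sit in a single function, a thin wrapper is enough to satisfy this requirement. The sketch below uses the pragma syntax from Figure 6.11; handle_event is a hypothetical example, not one of our benchmarks.

    void handle_event(void *e);       /* original, un-annotated code path */

    /* Wrapper so that the task starts and ends within one function. */
    void handle_event_task(void *e) {
        #pragma start_task 50ms
        handle_event(e);
        #pragma end_task
    }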

[Off-line: Source Code → Instrument Features → Instrumented Code → Profile with Sample Inputs → Feature Values, Execution Times → Train Execution Time Model → (feature selection) → Generate Slice → Prediction Slice. Run-time (DVFS Predictor): Job Inputs → Generate Features → Feature Values → Predict Execution Time → Predict Frequency → Run Prediction, Set DVFS, Execute Job.]

Figure 6.12: Overall flow for prediction-based DVFS control.

6.4.2 Off-line Analysis

Figure 6.12 shows the overall flow of our framework for creating prediction- based DVFS controllers. Given programmer annotation to identify tasks, we can automatically instrument these tasks to record control flow features. Off- line, we profile these tasks in order to collect traces of feature values and job execution times. This is used to train our execution time prediction model, as described in Section 6.3.3. Since execution time depends on the specific hard- ware and platform that an application is run on, profiling and model training needs to be done for the platform that the application will be run on. For com- mon platforms, the program developer can perform this profiling and distribute the trained model coefficients with the program. Alternatively, profiling can be done by the user during the application’s installation.

The trained execution time model only requires a subset of all the features to

Platform   Benchmark      Feature Diff   Prediction Diff (Average)   Prediction Diff (Max)
ARM-big    2048           –              0%                          0%
           curseofwar     –              0%                          0%
           ldecode        –              0%                          0%
           pocketsphinx   –              0%                          0%
           rijndael       –              0%                          0%
           sha            –              0%                          0%
           uzbl           –              0%                          0%
           xpilot         -2, +1         0.1%                        8.7%
x86        2048           +67            0.4%                        2.9%
           curseofwar     –              0%                          0%
           ldecode        –              0%                          0%
           pocketsphinx   +2             0%                          0%
           rijndael       –              0%                          0%
           sha            –              0%                          0%
           uzbl           -1             0%                          0%
           xpilot         +6, -8         0.7%                        2.9%

Table 6.2: Differences in features selected and predicted execution times for different platforms compared to an ARM-LITTLE platform.

perform prediction. Thus, this information is used to eliminate code for calculating unneeded features. Program slicing is then used to create a minimal code fragment to calculate the needed control flow features. Note that since the features needed depend on the training of the execution time prediction model, which is platform-dependent, the features needed could vary across platforms. However, we expect the features that are needed are primarily a function of the task semantics (i.e., execution time variations across control paths) rather than the platform it is run on. In fact, we compared the predictions made for an x86-based (Intel Core i7) platform when using the features selected for an ARM LITTLE-based ODROID-XU3 platform and for the features selected for the x86 platform itself. We also compared the ARM big core against the LITTLE core on the ODROID-XU3 platform.

[Timelines of predictor and job execution for three operating modes: (a) Sequential, (b) Pipelined, (c) Parallel.]

Figure 6.13: Options for how to run predictor.

in predicted execution times using the prediction slice generated for the plat- form and using the prediction slice from the ARM LITTLE core are shown. For the ARM big core, only one benchmark shows differences in features selected with the resulting difference in prediction times being only 0.1% on average. For the x86-based platform, four benchmarks show differences in features se- lected. In two of these cases, the predicted execution times are not affected. In the remaining two cases, average differences in predicted execution times are less than 1%. Although we had to re-train the execution time model coefficients for these new platforms, the same prediction slice was applicable across both platforms.

6.4.3 Run-time Prediction

The prediction slice, execution time predictor, and frequency predictor are combined to form the DVFS predictor, or simply the predictor. There are several options for how to run the predictor in relation to jobs. Figure 6.13 shows some of these options. The simplest approach is to run the predictor just before the execution of a job (see Figure 6.13(a)). This uses up part of the time budget to execute the predictor, as mentioned in Section 6.3.4. However, if this time is low, then the impact is minimal.
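A sketch of this sequential mode is shown below. Every function and constant here is an illustrative stand-in (for example, on Linux the chosen frequency could be applied through the cpufreq userspace governor); select_frequency and effective_budget_ms refer to the sketches shown earlier in this chapter.

    /* Illustrative stand-ins; none of these names come from the actual
     * implementation. */
    extern int    feature[];
    extern double now_ms(void);
    extern void   run_prediction_slice(void *job_input);
    extern double predict_time_ms(const int *features, double freq_mhz);
    extern double dvfs_switch_estimate_ms(void);
    extern double effective_budget_ms(double budget, double slice, double sw);
    extern double select_frequency(double t_fmin, double t_fmax,
                                   double f_min, double f_max, double budget);
    extern void   set_cpu_frequency_mhz(double f);
    extern void   run_job(void *job_input);

    #define FREQ_MIN_MHZ  200.0
    #define FREQ_MAX_MHZ 1400.0

    /* Sequential operation (Figure 6.13(a)): predict, set DVFS, run the job. */
    void run_job_with_prediction(void *job_input, double budget_ms)
    {
        double t_start = now_ms();
        run_prediction_slice(job_input);                  /* fills feature[] */
        double t_fmin = predict_time_ms(feature, FREQ_MIN_MHZ);
        double t_fmax = predict_time_ms(feature, FREQ_MAX_MHZ);

        double slice_ms = now_ms() - t_start;
        double eff = effective_budget_ms(budget_ms, slice_ms,
                                         dvfs_switch_estimate_ms());
        double f = select_frequency(t_fmin, t_fmax,
                                    FREQ_MIN_MHZ, FREQ_MAX_MHZ, eff);

        set_cpu_frequency_mhz(f);     /* e.g., via the cpufreq userspace governor */
        run_job(job_input);
    }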

An alternative option would be to run the predictors and jobs in a parallel, pipelined manner such that during job i, the predictor for job i + 1 is run (see

Figure 6.13(b)). This ensures that the DVFS decision is ready at the start of a job with no impact on time budget from the predictor. Due to the longer time budget, running in a pipelined manner can save more energy than running in a sequential manner. However, this assumes that information needed by the prediction slice, specifically the job inputs and program state, is ready one job in advance. This is not possible for interactive tasks which depend on real-time user inputs or tasks which are not periodic.

The predictor could also be run in parallel with the task (see Figure 6.13(c)).

This avoids the issue of needing input values early. In terms of time budget, this mode of operation still reduces the effective budget by the predictor execution time. However, part of the task also executes during the prediction time, so the remaining work to be done is also reduced. Thus, this could still lead to higher energy savings than running in a sequential manner. Running in parallel also avoids the issue of side-effects caused by the prediction slice that was discussed in Section 6.3.2. However, running in parallel either requires forking off the

Benchmark           Description - Task
2048 [88]           Puzzle game - Update and render one turn
curseofwar [63]     Real-time strategy game - Update and render one game loop iteration
ldecode [83]        H.264 decoder - Decode one frame
pocketsphinx [42]   Speech recognition - Process one speech sample
rijndael [37]       Advanced Encryption Standard (AES) - Encrypt one piece of data
sha [37]            Secure Hash Algorithm (SHA) - Hash one piece of data
uzbl [67]           Web browser - Execute one command (e.g., refresh page)
xpilot [81]         2D space game - Update and render one game loop iteration

Table 6.3: Benchmark descriptions and task of interest.

predictor for each job or sharing data with a dedicated predictor thread, both of which can introduce overhead.

For our target applications, we found that the execution time of the predictor was low. Thus, we decided to run the predictor and task in a sequential manner

(Figure 6.13(a)). For applications which require more complicated predictors, these alternative operation modes may be beneficial.

6.5 Evaluation

6.5.1 Experimental Setup

We applied our framework for prediction-based DVFS control to a set of eight benchmark applications including three games, a web browser, speech recog- nition, a video decoder and two applications from the MiBench suite [37]. Ta- ble 6.3 lists and describes these benchmarks and the task of interest for each.

Table 6.4 shows the minimum, average, and maximum job execution times for these benchmarks when run at maximum frequency.

Benchmark      Min [ms]   Avg. [ms]   Max [ms]
2048           0.52       1.2         2.1
curseofwar     0.02       6.2         37.2
ldecode        6.2        20.4        32.5
pocketsphinx   718        1661        2951
rijndael       14.2       28.5        43.6
sha            4.7        25.3        46.0
uzbl           0.04       2.2         35.5
xpilot         0.2        1.3         3.1

Table 6.4: Execution time statistics when running at maximum frequency.

We ran our experiments on an ODROID-XU3 [64] development board running Ubuntu 14.04. The ODROID-XU3 includes a Samsung Exynos5422 SoC with ARM Cortex-A15 and Cortex-A7 cores. We show results here for running on the more power-efficient A7 core but we saw similar trends when running on the A15 core. In order to isolate measurements for the application of interest, we pinned the benchmark to run on the A7 core while using the A15 to run the OS and background jobs. We measured power using the on-board power sensors with a sampling rate of 213 samples per second and integrated over time to calculate energy usage.

We compare our proposed prediction-based DVFS controller with existing controllers and previously proposed control schemes. Specifically, we measure results for the following DVFS schemes:

1. performance: The Linux performance governor [12] always runs at the maximum frequency. We normalize our energy results to this case.

2. interactive: The Linux interactive governor [12] was designed for interactive mobile applications. It samples CPU utilization every 80 milliseconds and changes to maximum frequency if CPU utilization is above 85%.

3. pid: The PID-based controller uses previous prediction errors with a PID control algorithm in order to predict the execution time of the next job [35]. The PID parameters are trained offline and are optimized to reduce deadline misses.

4. prediction: This is our prediction-based controller as described in this chapter.

[Bar charts: normalized energy [%] and deadline misses [%] per benchmark for the performance, interactive, pid, and prediction controllers.]

Figure 6.14: Normalized energy usage and deadline misses.

6.5.2 Energy Savings and Deadline Misses

Figure 6.14 compares energy consumption and deadline misses for the different DVFS controllers across our benchmark set. These experiments are run with a time budget of 50 milliseconds per job as running faster than this is not noticeable to a user [51, 98]. pocketsphinx takes at least hundreds of milliseconds to run (see Table 6.4) so we use a 4 second deadline. This corresponds to the time limit that a user is willing to wait for a response [57]. Energy numbers are normalized to the energy usage of the performance governor. Deadline misses are reported as the percentage of jobs that miss their deadline.

We see that, on average, our prediction-based controller saves 56% energy compared to running at max frequency. This is 27% more savings than the interactive governor and 1% more savings than the PID-based controller. For ldecode, pocketsphinx, and rijndael, prediction-based control shows higher energy consumption than PID-based control. However, if we look at deadline misses, we see that PID-based control shows a large number of misses for these benchmarks. On average, the interactive governor shows 2% deadline misses and the PID-based controller shows 13% misses. In contrast, our prediction-based controller shows 0.1% deadline misses for curseofwar and no deadline misses for the other benchmarks tested. Overall, we see that the interactive governor has a low number of deadline misses, but achieves this with high energy consumption. On the other hand, PID-based control shows lower energy usage than the prediction-based controller in some cases, but this comes at the cost of a large number of deadline misses. Instead, on average, our prediction-based control is able to achieve both lower energy consumption and fewer deadline misses than the interactive governor and PID-based control.

Since our prediction-based controller takes the time budget into account, it is able to save more energy with longer time budgets. Similarly, with shorter time budgets, it will spend more energy to attempt to meet the tighter deadlines. In order to study this trade-off, we swept the time budget around the point where we expect to start seeing deadline misses. Specifically, we set a normalized budget of 1 to correspond to the maximum execution time seen for the task when running at maximum frequency (see Table 6.4). This corresponds to the tightest budget such that all jobs are able to meet their deadline. Figure 6.15 shows the energy usage and deadline misses for the various benchmarks as the normalized budget is swept below and above 1. We see that our prediction-based

[Per-benchmark plots of energy [%] and deadline misses [%] versus normalized budget for the four controllers: (a) 2048, (b) curseofwar, (c) ldecode, (d) pocketsphinx, (e) rijndael, (f) sha, (g) uzbl, (h) xpilot.]

Figure 6.15: Normalized energy usage and deadline misses as time budget is varied.

[Bar chart: average time [ms] per benchmark for the predictor, the DVFS switch, and predictor+dvfs combined.]

Figure 6.16: Average time to run prediction slice and switch DVFS levels.

controller is able to save more energy with longer time budgets and continues to outperform the interactive governor and the PID-based controller for vary- ing time budgets. For normalized budgets less than 1, our prediction-based controller shows deadline misses. However, the number of misses is typically close to the number seen with the performance governor. This implies that the only deadline misses are the ones that are impossible to meet at the specified time budget, even with running at the maximum frequency.

6.5.3 Analysis of Overheads and Error

Figure 6.16 shows the average times for executing the predictor and for switch- ing DVFS levels. On average, the predictor takes 3.2 ms to execute and DVFS switching takes 0.8 ms. Excluding pocketsphinx, the average total overhead is less than 1 ms which is 2% of a 50 ms time budget. pocketsphinx shows a longer execution time of 24 ms for the predictor. However, this time is negligi- ble compared to the execution time of pocketsphinx jobs which are on the order of seconds.

[Bar chart: normalized energy [%] per benchmark for prediction, w/o dvfs, w/o predictor+dvfs, and oracle.]

Figure 6.17: Normalized energy usage and deadline misses with overheads removed and oracle prediction.

The overheads of executing the predictor and DVFS switching decrease the energy savings achievable. This is due to the energy consumed to perform these operations as well as the decrease in effective budget. Better program slicing and/or feature selection could reduce the predictor execution time. Similarly, faster DVFS switching circuits [58, 66, 31] have shown switching times on the or- der of tens of nanoseconds. In order to explore the limits of what is achievable, we evaluated our prediction-based control when the overheads of the predictor and DVFS switching are ignored. Figure 6.17 shows the energy and deadline misses when these overheads are removed. These results are shown for a time budget of 4 seconds for pocketsphinx and 50 milliseconds for all other bench- marks. On average, we see a 3% decrease in energy consumption when re- moving the overheads of DVFS switching. Removing the overhead of running the predictor shows negligible improvement past removing the DVFS switch- ing overhead. Thus, improvements to the program slicing and feature selection execution time are expected to have little effect on energy savings.

In addition to these overheads, the accuracy of our prediction limits the effectiveness of the prediction-based controller. Figure 6.18 shows box-and-

[Box-and-whisker plots: prediction error [ms] per benchmark.]

Figure 6.18: Box-and-whisker plots of prediction error. Positive numbers correspond to over-prediction and negative numbers correspond to under-prediction.

whisker plots of the prediction error. The box indicates the first and third quartiles and the line in the box marks the median value. Outliers (values over 1.5 times the interquartile range past the closest box end) are marked with an “x” and the non-outlier range is marked by the whiskers. Positive values represent over-prediction and negative values represent under-prediction. We can see that the prediction skews toward over-prediction with average errors greater than 0. Most benchmarks show prediction errors of less than 5 ms, which is 10% of a 50 ms time budget. ldecode and rijndael show higher prediction errors, which limits the energy savings possible. pocketsphinx (not shown) has errors ranging from 60 ms under-prediction to 2 seconds over-prediction with an average of 880 ms over-prediction. Although these errors are larger in absolute terms than the other benchmarks, they are on the same order of magnitude as seen for the other benchmarks when compared to the execution time of pocketsphinx jobs. From these results, we can see that there are small but significant inaccuracies in our prediction of execution time. However, based on our energy and deadline miss results (see Figure 6.14), the predicted times are adequate

as heuristics for making DVFS decisions without violating deadlines. That is, the prediction is accurate enough that we are able to differentiate between short jobs that can be slowed down and long jobs that must be run at high frequencies.

In addition, by skewing the prediction towards over-predicting, we are able to tolerate these inaccuracies without causing deadline misses.

In order to explore the possible improvements with perfect prediction, we implemented an “oracle” controller that uses recorded job times from a previous run with the same inputs to predict the execution time of jobs. Figure 6.17 also includes these oracle results. We see that an additional 11% energy savings are achievable with the oracle controller on top of removing the predictor and DVFS switching overheads. From these results and the prediction errors seen in Figure 6.18, the most room for improvement for our prediction-based DVFS controller lies in improving the prediction accuracy. Note that we do not have oracle results for uzbl and xpilot as non-deterministic variations in the ordering of jobs across runs made it difficult to implement a good oracle controller.

6.5.4 Under-prediction Trade-off

In Section 6.3.3, we discussed how we can vary the penalty weight for under-prediction, α, when we fit our execution time prediction model. Placing greater penalty on under-prediction increases the energy usage but reduces the likelihood of deadline misses. Figure 6.19 shows the energy and deadline misses for varying under-predict penalty weights for ldecode. We see that as the weight is decreased, energy consumption decreases but deadline misses increase. Reducing the weight from 1000 to 100 keeps misses at 0, but reducing the weight to 10

[Plot: deadline misses [%] versus energy [%] for α = 1, 10, 100, 1000.]

Figure 6.19: Energy vs. deadline misses for various under-predict penalty weights (α) for ldecode.

[Plot: idle power [mW] versus frequency [MHz].]

Figure 6.20: Idle power for the ODROID-XU3's Cortex-A7 core for each frequency setting.

introduces a small number of deadline misses (0.03%). Other benchmarks show similar trends and we found that across the benchmarks we tested, an under-predict penalty weight of 100 provided good energy savings without incurring deadline misses. The results in this chapter have been shown with a weight of 100.

6.5.5 Race to Idle

It is possible, in some situations, to optimize energy usage by running quickly and then dropping to a lower power mode. This is the idea behind “race to idle” techniques.

[Bar chart: normalized energy [%] per benchmark for the four controllers.]

Figure 6.21: Normalized energy usage and deadline misses when the frequency is dropped to minimum (200 MHz) when jobs finish.

If a job finishes quickly enough and the idle power is low enough, then this can save more energy than running for longer at a lower frequency level. Figure 6.20 shows the idle power at each frequency setting for the ODROID-XU3's Cortex-A7 core. There is a large difference in idle power at maximum and minimum frequency, implying that significant energy savings are possible by reducing the frequency after a job finishes.

In an attempt to estimate the possible energy savings from “race to idle” techniques, we calculated the energy usage for each controller assuming that the core only consumes the minimum idle power between jobs. Figure 6.21 shows these energy numbers normalized to the energy usage when always run- ning at maximum frequency. These results are shown for a deadline of 4 seconds for pocketsphinx and 50 milliseconds for all other benchmarks and is an analo- gous setup to the results shown in Figure 6.14. The performance governor here represents “race to idle” where the max frequency is used during job execu- tion and then the frequency is dropped to the minimum frequency. We see that this significantly reduces the energy usage of the performance governor. This technique can also be applied to the other controllers in order to reduce energy usage. Although the prediction-based controller seeks to run as slow as possi-

ble in order to finish just before the deadline, because the prediction is skewed to overestimate, there are cases where jobs finish early and reducing frequency can be used to further reduce energy usage. We see that for the benchmarks we tested, our prediction-based controller is still able to save more energy than the performance and interactive governors. In some cases, the prediction-based controller uses more energy than the PID-based controller. However, as this technique does not affect the number of deadline misses, the PID-based controller is still missing many of its deadlines.

6.5.6 Heterogeneous Cores

Recent architectures have introduced the use of heterogeneous CPU cores in or- der to allow a wider power-performance trade-off range. For example, ARM’s big.LITTLE architecture couples a set of fast out-of-order cores (e.g., Cortex- A15) with a set of energy-efficient in-order cores (e.g., Cortex-A7). These ad- ditional operating points can be taken advantage of with our prediction-based scheme to provide a wider range of operating points, providing greater energy savings by using slower operating points and/or less deadline misses by using faster operating points. In this section, we explore the possible effects of such an architecture using first-order analytical models. We will focus on an architec- ture with two different cores (big and little) and assume that they share an ISA (i.e., programs can run on either without modification).

The majority of our prediction-based methodology can be applied, un- changed, to scheduling resources with heterogeneous cores. The first two steps of our methodology (see Figure 6.5), generating program features and predict-

ing execution time, remain unchanged. We modify the last step, choosing the target frequency, to instead select both core type and frequency. This is done by extending our simple linear model of frequency-based performance scaling of a single core to be a pair of linear models. For our experiments, we model the big core as a constant factor faster than the little core.

We created an analytical model to study the first-order effects of having het- erogeneous cores. For each core, execution time scales inversely with frequency, following the model from Section 6.3.4:

$$t = T_{mem} + N_{dependent}/f$$

As we found that $T_{mem}$ was small for the benchmarks we tested, we set $T_{mem} = 0$ for our modeling here.

Active power for each core is modeled as

$$P = \frac{1}{2} C V^2 A f$$

where $C$ is the capacitance and $A$ is the activity factor. With voltage scaling linearly with respect to frequency (i.e., $V = \delta f$), we have that power scales cubically with frequency:

$$P = \frac{1}{2} C A \delta^2 f^3$$

Finally, energy is power multiplied by time:

$$E = Pt = \frac{1}{2} C A \delta^2 f^3 \cdot \frac{N_{dependent}}{f} = \frac{1}{2} C A \delta^2 N_{dependent} f^2$$

Parameter                                        Value
λ (Little power / Big power)                     0.332
µ (Little execution time / Big execution time)   2.06
Little frequency levels                          {200, 300, ..., 1400} MHz
Big frequency levels                             {200, 300, ..., 2000} MHz

Table 6.5: Parameters used in DVFS and heterogeneous core model.

We will normalize our energy results to running at maximum frequency:

$$E_{normalized} = \frac{E}{E_{f_{max}}} = \frac{f^2}{f_{max}^2}$$

Thus, the values used for the parameters C, A, δ, and Ndependent are irrelevant.

In order to account for the two heterogeneous cores, we introduce scaling factors for power and execution time.

$$P_{little}(f) = \lambda P_{big}(f), \qquad t_{little}(f) = \mu\, t_{big}(f)$$

That is, λ is the ratio of power between running on the little core and running on the big core at the same frequency. Similarly, µ is the ratio of execution time between running on the little core and on the big core at the same frequency. The values for these parameters were found through microbenchmarks on the ODROID-XU3 and are shown in Table 6.5. In addition, Table 6.5 shows the

DVFS levels used for each core. Switching between cores introduces overheads. For ARM’s big.LITTLE architecture, this switching time has been reported to be as low as 20,000 cycles ([18]) which corresponds to 20 microseconds at 1 GHz.

As this switching time is over an order of magnitude lower than the time budgets we are considering, our first-order model does not account for switching overheads.
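Under this first-order model, extending the frequency-selection step to also pick a core reduces to evaluating each (core, frequency) candidate. The sketch below is illustrative only; it uses the λ and µ values from Table 6.5, expresses energy in the normalized units of the model, and assumes T_mem = 0 as above.

    /* First-order core/frequency selection under the analytical model:
     * t_big(f) = N_dep / f,  t_little(f) = MU * t_big(f),
     * E_big(f) ∝ f^2,        E_little(f) ∝ LAMBDA * MU * f^2. */
    #include <math.h>

    #define LAMBDA 0.332   /* little power / big power at equal frequency */
    #define MU     2.06    /* little time / big time at equal frequency  */

    struct operating_point { int use_big; double freq_mhz; };

    struct operating_point
    select_core_and_freq(double n_dep, double t_budget_ms,
                         const double *big_mhz, int n_big,
                         const double *little_mhz, int n_little)
    {
        struct operating_point best = { 1, big_mhz[n_big - 1] };
        double best_energy = HUGE_VAL;

        for (int i = 0; i < n_big; i++) {           /* big-core candidates */
            double t = n_dep / big_mhz[i];
            double e = big_mhz[i] * big_mhz[i];     /* relative energy */
            if (t <= t_budget_ms && e < best_energy) {
                best_energy = e;
                best.use_big = 1;
                best.freq_mhz = big_mhz[i];
            }
        }
        for (int i = 0; i < n_little; i++) {        /* little-core candidates */
            double t = MU * n_dep / little_mhz[i];
            double e = LAMBDA * MU * little_mhz[i] * little_mhz[i];
            if (t <= t_budget_ms && e < best_energy) {
                best_energy = e;
                best.use_big = 0;
                best.freq_mhz = little_mhz[i];
            }
        }
        return best;    /* falls back to the fastest big-core level if no
                           candidate meets the budget */
    }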

Based on this model, we can look at the relative energy and execution time

[Plots for the little and big cores: (a) Execution time versus frequency, (b) Energy versus frequency, (c) Energy versus execution time.]

Figure 6.22: Relationship between frequency, execution time, and energy based on the analytical model.

scaling for operating at different frequencies on each core. Figure 6.22(a) shows the normalized execution time for each frequency setting. We can see that at the same frequency, the big core achieves lower execution time than the little core. Figure 6.22(b) shows the energy for each frequency setting. At the same frequency, the little core uses less energy than the big core. Finally, Figure 6.22(c) directly shows the trade-off between energy and execution time for each core at the various frequency levels. Only the big core is able to achieve the lowest execution times. On the other hand, when longer execution times are adequate, the little core is able to scale out to lower energies. For execution times that are achievable by both cores, we see that for the model parameters chosen, the big core is actually more energy-efficient. This trade-off will be different depending on the model parameters λ and µ. Note that we do not model idle power here.

In order to evaluate the effects of heterogeneous cores on actual workloads, we collected traces of job features and execution times while running at max- imum frequency on the big and little cores of the ODROID-XU3. Using these traces, we predict the core and frequency to use for each job. Our analytical model is used to calculate the energy usage of the resulting allocations.

We first compare the case of running only on a little core versus having both big and little cores. Figure 6.23 shows the energy and deadline misses for varying time budgets which are normalized to the maximum execution time for each task running on the little core at max frequency (see Table 6.4). Energy is normalized to running all jobs at maximum frequency on the little core and can exceed 100% due to using the big core. We see that introducing the big core can actually reduce energy. This is because running on the big core reduces execution time and leads to reduced energy usage in some cases, as was shown

[Bar charts of energy [%] and deadline misses [%] per benchmark for littleonly and biglittle: (a) Normalized budget of 1.0, (b) Normalized budget of 0.8, (c) Normalized budget of 0.6.]

Figure 6.23: Energy and deadline misses for running with only a little core (littleonly) and with both big and little cores (biglittle).

[Bar charts: core switches [%] and big-core jobs [%] per benchmark for normalized budgets of 0.6, 0.8, and 1.0.]

Figure 6.24: Number of core switches and jobs run on the big core as a percentage of total jobs for varying normalized budgets when comparing biglittle to littleonly.

in Figure 6.22(c). For a normalized budget of 1 (Figure 6.23(a)), we see an av- erage reduction in normalized energy of 31% by introducing the big core. For tighter budgets (Figures 6.23(b) and 6.23(c)), the little core alone is not able to meet all deadlines, showing deadline misses of 20% and 35% for normalized budgets of 0.8 and 0.6 respectively. Introducing the big core eliminates these deadline misses because of the faster operating points available. This can lead to a higher energy usage than when running on the little core only, as seen for certain benchmarks with a normalized budget of 0.6. Figure 6.24 shows the number of times that the core type is switched between jobs (e.g., moving from big core to little core or vice versa) as a percentage of number of jobs. In addi- tion, it also shows the percentage of jobs run on the big core. For a normalized budget of 1, 6.6% of jobs switch cores on average and 78% of jobs on average run on the big core.

We also compared the case of running only on a big core versus having big

[Bar charts: energy [%] and deadline misses [%] per benchmark for bigonly and biglittle.]

Figure 6.25: Energy and deadline misses for running with only a big core (bigonly) and with both big and little cores (biglittle) for a normalized budget of 1.

[Bar charts: core switches [%] and big-core jobs [%] per benchmark.]

Figure 6.26: Number of core switches and jobs run on the big core as a percentage of total jobs for a normalized budget of 1 when comparing biglittle to bigonly.

and little cores. Figure 6.25 shows energy and deadline misses for this setup. As the introduction of the little core is for energy savings as opposed to reducing deadline misses, we only show results for a normalized budget of 1 which corresponds to the maximum task execution time for running on the big core at max frequency. Energy results are normalized to running all jobs on the big core at max frequency. For some benchmarks, introducing the little core is able to increase energy savings, with reductions in normalized energy of up to 13%. Introducing the little core does not affect deadline misses, though we do see a small number of misses for xpilot and ldecode due to mispredictions. Figure 6.26 shows the number of core switches and the number of jobs run on the big core as a percentage of total jobs. On average, 23% of jobs switch cores and 60% are run on the big core.

In this section, we have shown how our prediction-based methodology can be extended to architectures with heterogeneous cores. Our first-order results based on analytical modeling show that with heterogeneous cores, our prediction-based controller is able to further reduce energy usage and/or deadline misses. However, there are several additional points to consider for implementing this on a real system. First, a detailed characterization of the energy-performance points provided by the target platform is needed. Here, we have assumed a constant factor of power and performance difference between the big and little cores, but this may not hold for a particular system. In addition, the difference in idle power of the different cores should also be accounted for. Second, the overhead of switching cores needs to be measured and, if significant, accounted for in the prediction algorithm. We have mentioned that the switching time between cores has been reported to be low, but this does not take into account other possible overheads such as cold cache misses, pipeline filling, etc.

that come with core migration. Finally, in order to use our prediction-based methodology, a software interface for core migration will need to be exposed to the controller. Depending on its implementation, this interface may also introduce switching overheads that need to be accounted for.

CHAPTER 7
RELATED WORK

This chapter describes some of the previous work that is related to this the- sis. Section 7.1 discusses related work in hardware design and WCET analysis of real-time systems. Section 7.2 talks about related work on security and reli- ability, specifically previous work on run-time monitoring and partial monitor- ing. Finally, Section 7.3 discusses work related to our techniques for prediction- based DVFS control including previous DVFS control techniques and methods for execution time prediction.

7.1 Real-Time Systems

7.1.1 Hardware Architectures for Real-Time Systems

Previous work has looked at designing processor architectures for real-time sys- tems. These projects focus on improving performance in predictable ways in order to improve the worst-case execution time. For example, several projects [21, 90, 75, 76] have created architectures with simple pipelines that allow for static scheduling of instructions. These architectures are easily analyzable so that the resulting WCET estimates are tight and lower than more unpredictable architectures. Other projects have explored using fine-grained simultaneous multi-threading (SMT) in order to improve performance by isolating hard real- time threads from each other [28, 50, 52] or from soft or non-real-time threads [87, 59, 100]. Whitham et al. explored reducing the unpredictability of control flow by co-designing software and hardware to execute traces [91]. Finally, the

Virtual Simple Architecture (VISA) [1, 2] uses dynamic slack to run in a faster but difficult to analyze mode. It switches to a simpler mode, for which timing analysis and WCET are calculated, when there is not enough slack. This is similar to our use of slack to opportunistically perform monitoring but in the context of improving performance. These architectures all look to improve on the analyzable worst-case performance. Instead, our work has looked at designing hardware architectures to improve on security, reliability, and energy-efficiency in the context of real-time deadlines.

7.1.2 Worst-Case Execution Time Analysis

Estimating the worst-case execution time of a sequential program on a single- core system is a well studied problem. A survey paper by Wilhelm et al. [92] provides an overview of existing methods and tools in this context. However, to the best of our knowledge, we were the first to study the WCET of parallel monitoring. Researchers have recently started studying the WCET problem for multi-core systems. For example, Paolieri et al. proposed a multi-core hardware architecture for hard real-time systems and analyzed its WCET behavior [65]. McAiT is a tool that has been developed for WCET analysis of multi-core real- time software [54]. Chattopadhyay et al. developed an ILP-based approach for multi-core WCET analysis including shared caches and buses [14]. These stud- ies focused on the contention between parallel programs for shared resources such as memory. However, the loosely coupled link between the main core and parallel monitoring hardware represents a producer-consumer relationship rather than shared resources. Thus, we found that previously developed tech- niques were not directly applicable or easily adaptable to provide a tight WCET

bound for a system with parallel monitoring.

7.2 Security and Reliability

7.2.1 Run-Time Monitoring

In this thesis, we have explored applying run-time monitoring to real-time sys- tems. Our work has been designed to be applicable to a wide range of paral- lel hardware run-time monitoring techniques. We briefly mention some recent parallel monitoring platforms as examples. For example, INDRA [78] uses a checker core to monitor coarse-grained events on a computation core such as function call/return, code origin inspection, and control flow inspection. Na- garajan et al. studied implementing DIFT on a multi-core system [61]. Chen et al. proposed hardware acceleration techniques for multi-core systems and showed that a set of parallel monitoring techniques for security and software debugging can be realized with low performance overheads (tens of percents) [16]. The FlexCore [23] and Harmoni [24] architectures showed that parallel monitoring can be made even more efficient by implementing monitors on re- configurable hardware. SecureCore [97] is a monitoring scheme targeted specif- ically at real-time systems which attempts to detect code injection attacks by de- tecting anomalous timing behavior. However, the architecture assumes enough buffering so that the timing behavior of the main task is not affected. Over- all, these previous studies demonstrate that parallel monitoring can be used to significantly improve system security and reliability with minimal overheads.

Name                          General   Target Overhead   Prevent False Pos.
Scalable Bug Isolation [49]   -         -                 -
Mem Leak Detection [17]       -         -                 -
LiteRace [55]                 -         -                 X
PACER [11]                    -         -                 X
Testudo [33]                  -         -                 X
Scalable Dataflow [32]        -         X                 X
Arnold & Ryder [5]            X         -                 -
Huang et al. [41]             X         X                 -
QVM [6]                       X*        X                 X
*Limited generalizability

Table 7.1: Previous work on partial run-time monitoring.

7.2.2 Partial Monitoring

There exist a number of previous projects that have looked into performing partial monitoring in order to reduce the performance overhead. These platforms differ from the architectures we have presented in Chapters 4 and 5 in a number of ways, including how monitoring is implemented and the monitoring techniques targeted. However, we note three main properties that differentiate our work:

1. Generality: Our architectures apply to a variety of monitoring techniques.

2. Target Overhead: Our architectures allow an overhead or deadline to be targeted. The architecture presented in Chapter 4 provides hard guarantees on the overhead while the architecture from Chapter 5 relaxes the strict guarantee. Previous work performs sampling to reduce overhead but does not try to bound overhead, which we have shown can vary greatly.

3. Prevent False Positives: We have presented an invalidation-based mechanism to prevent false positives. Previous work either has false positives or targets monitoring techniques which degrade gracefully with sampling rather than exhibit false positives.

Our work is the first that we know of to present a hardware platform for partial monitoring that is general, allows a deadline or overhead target to be specified, and explicitly prevents false positives. In contrast, previous work on partial monitoring achieves only some, but not all, of these properties. Table 7.1 sum- marizes these differences.

For example, there exists previous work for using statistical sampling to re- duce the performance overhead of various debugging techniques. These in- clude sampling for bug isolation [49], memory leak detection [17], race detec- tion [55, 11], and information flow tracking [33, 32]. These techniques modify the monitoring to support statistical sampling and so are not generalizable. For those that prevent false positives, this is also done with monitor-specific modi- fications. Finally, with the exception of the work by Greathouse et al. [32], they do not allow an overhead target to be specified.

There also exists several projects that have looked into more general partial monitoring platforms. Arnold and Ryder [5] presented a general platform for sampling of instrumented code, but do not allow an overhead target to be spec- ified and do not prevent false positives. Huang et al. [41] proposed a general framework for controlling the overhead of software-based monitoring. How- ever, they also do not explicitly address false positives and only target mon- itors which degrade gracefully when performed partially. Finally, QVM [6] proposes a modification to the Java Virtual Machine (JVM) to support moni- toring with adjustable overhead. QVM is limited to monitoring for code run on a JVM which limits its generalizability while our framework works for any

binary. QVM prevents false positives by enabling or disabling monitoring on a per-object basis. This limits the monitoring schemes that can be implemented. Also, this method of preventing false positives is similar to the idea of performing source-only dropping (see Section 5.4.2) which our results show can lead to overshooting the overhead target depending on the monitoring technique.

7.3 Performance-Energy Trade-off

7.3.1 Dynamic Voltage and Frequency Scaling

There have been many designs for DVFS controllers. Most controllers look to decrease frequency when the performance impact is minimal. For example, the built-in Linux governors [12] adjust DVFS based on CPU utilization. However, these controllers do not take into account performance requirements or dead- lines.

DVFS has been studied in the context of hard real-time systems [72]. For these systems, deadlines are strict requirements that cannot be violated. Thus, lowering frequency must be done in a conservative manner. By relaxing this strict requirement, our prediction-based controller is able to achieve higher energy savings.
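To make the cost of this conservatism concrete, consider a simple illustration (ours, not a derivation from [72]): if a job's worst-case execution time at the maximum frequency $f_{\max}$ is $C_{wc}$ and its deadline is $D$, then, assuming execution time scales inversely with frequency, a hard guarantee requires running at

    f \ge f_{\max} \cdot \frac{C_{wc}}{D}.

Because typical jobs finish well below $C_{wc}$, this conservative frequency is usually much higher than the job actually needs, and the remaining slack cannot be converted into energy savings.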

A number of reactive DVFS controllers have been proposed that use the past history of job execution times to predict the execution time of future jobs. Choi et al. [19] used moving averages of job execution time history to predict execution times for an MPEG decoder. Similarly, Pegasus [53] used instantaneous and average job execution times to make DVFS decisions. Nachiappan et al. [60] used a moving average to set DVFS for multiple IP cores. Gu and Chakraborty [35] used a PID-based controller to predict execution times of frames in a game.
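A minimal sketch of such a history-based controller is shown below (our illustration, not any of the cited designs; the frequency levels and deadline are assumptions): the next job's execution time is predicted as a moving average of recent jobs, and the lowest frequency expected to meet the deadline is selected, assuming execution time scales inversely with frequency.

    from collections import deque

    # Sketch of a reactive, history-based DVFS controller (an illustration, not
    # any of the cited designs). The next job's time is predicted from a moving
    # average of recent jobs at f_max; the lowest adequate frequency is chosen.

    FREQS_MHZ = [600, 800, 1000, 1200, 1400]
    F_MAX = FREQS_MHZ[-1]

    class MovingAverageController:
        def __init__(self, window=4, deadline_ms=33.0):
            self.history = deque(maxlen=window)   # recent times at f_max, in ms
            self.deadline = deadline_ms

        def choose_frequency(self):
            if not self.history:
                return F_MAX                      # no history yet: be safe
            predicted = sum(self.history) / len(self.history)
            for f in FREQS_MHZ:                   # lowest frequency meeting the
                if predicted * (F_MAX / f) <= self.deadline:   # deadline, time ~ 1/f
                    return f
            return F_MAX

        def record(self, time_at_fmax_ms):
            self.history.append(time_at_fmax_ms)

    # A run of short jobs pulls the average down:
    ctrl = MovingAverageController()
    for t in [8.0, 9.0, 8.5, 9.5]:
        ctrl.record(t)
    print(ctrl.choose_frequency())   # -> 600, even if the next job turns out to be long

Because the prediction lags behind job-to-job changes in execution time, a sudden long job is under-predicted and can miss its deadline, while a sudden short job wastes energy at an unnecessarily high frequency.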

These history-based, reactive controllers are not able to adapt fast enough to job-to-job variations in execution time, resulting in either high energy usage or deadline misses. In fact, our results show that prediction-based control outperforms PID-based control.

Prediction-based approaches have been designed for specific applications.

Gu and Chakraborty [34] predicted the rendering time for a game frame based on the number of objects in the scene. Zhu et al. used prediction-based control to select core and DVFS levels for a web browser based on HTML and CSS features [99] and event types [98]. Adrenaline [40] looked to reduce the tail latency of datacenter applications, including web search and Memcached, by classifying jobs by query type. These predictive approaches required careful analysis of the applications of interest in order to identify features and create predictive models. In contrast, we present an automated approach to creating these DVFS predictors for a wide range of applications. For example, for the web browser we tested, our approach automatically identifies event types as a feature because of changes in control flow depending on event type.
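To illustrate the flavor of these predictors without committing to any particular application, the sketch below (ours; the feature names and coefficients are hypothetical) evaluates a linear model over job features, such as an event type or a loop bound observed before the job runs, to estimate its execution time.

    # Sketch of a per-job, feature-based execution time predictor (ours; the
    # feature names and coefficients are hypothetical). Features are extracted
    # before the job runs, e.g. from early control flow, and fed to a linear model.

    COEFFS = {"bias": 2.0, "loop_iterations": 0.004, "event_is_scroll": 5.0}

    def predict_time_ms(features):
        """Linear model: predicted time = bias + sum(coeff * feature)."""
        t = COEFFS["bias"]
        for name, value in features.items():
            t += COEFFS.get(name, 0.0) * value
        return t

    # A scroll event with a large loop bound is predicted to be slow, so the
    # controller would raise the frequency for this job only.
    print(predict_time_ms({"loop_iterations": 3000, "event_is_scroll": 1}))  # 19.0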

7.3.2 Execution Time Prediction

Worst-case execution time analysis is a well-studied problem in hard real-time systems [92]. This can be used as a conservative bound for setting DVFS in order to meet deadlines, but it does not predict the changes in job execution time based on specific inputs or program state. ATLAS [71] looked at predicting execution time in the context of soft real-time scheduling. Their approach uses programmer-marked features in a linear model in order to predict execution time. Instead, our approach is able to automatically identify features without programmer assistance. Mantis [45] presents an automated approach for predicting execution time, similar to the approach we have presented. However, neither Mantis nor ATLAS looks at execution time prediction in the context of DVFS control. To apply execution time prediction to DVFS control, we needed to create a prediction method that placed greater penalty on under-prediction and to extend the predictor to select the appropriate frequency level.
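The following sketch (an illustration, not the exact formulation used in this thesis; the penalty weights are assumptions) shows why the loss must be asymmetric: under-predicting a job's execution time risks a deadline miss, so it is penalized more heavily than over-prediction, which only costs extra energy.

    # Illustration (not the thesis's exact formulation) of an asymmetric loss
    # for fitting an execution time predictor: under-prediction risks a deadline
    # miss, so it is penalized more than over-prediction, which only costs energy.

    UNDER_PENALTY = 10.0   # hypothetical weight on under-prediction
    OVER_PENALTY = 1.0

    def asymmetric_loss(predicted_ms, actual_ms):
        error = predicted_ms - actual_ms
        if error < 0:                       # predicted too fast -> potential miss
            return UNDER_PENALTY * (-error)
        return OVER_PENALTY * error         # predicted too slow -> wasted energy

    # Fitting with this loss biases the model toward slight over-prediction:
    print(asymmetric_loss(10.0, 12.0))   # under-prediction of 2 ms -> loss 20.0
    print(asymmetric_loss(14.0, 12.0))   # over-prediction of 2 ms  -> loss 2.0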

CHAPTER 8
CONCLUSION

8.1 Summary

In this thesis, we have addressed some of the challenges and explored some of the opportunities for the hardware design of real-time systems. We have looked at how to design secure, reliable, and energy-efficient real-time systems.

We have explored how to apply run-time monitoring techniques for improving security and reliability to real-time systems. We have developed a method to estimate the worst-case execution time (WCET) of run-time monitoring, allowing it to be accounted for in real-time scheduling and analysis frameworks (Chapter 3). Our estimated bounds for run-time monitoring are within 71% of observed worst-case performance, similar to the baseline tools which show up to 52% differences between simulations and WCET estimates. We have also developed architectures for applying run-time monitoring to existing systems with hard real-time deadlines (Chapter 4) or soft performance constraints (Chapter 5). For hard real-time systems, we show that an average of 15-66% of monitoring checks, depending on the monitoring scheme, can be performed with no impact on WCET. Similarly, for performance-constrained systems, we show average monitoring coverage of 14-85% with significantly reduced performance impact compared to performing full monitoring.

We have also explored some of the opportunities for reducing the energy usage of real-time systems. We have shown how soft real-time requirements can be used to inform the dynamic voltage and frequency scaling of hardware.

By developing a prediction-based DVFS controller, we showed how the appropriate frequency could be selected for each job in order to minimize energy while meeting its deadline requirement. Our experiments showed 56% average energy savings over running at maximum frequency by using our prediction-based controller. In addition, our prediction-based controller outperforms both the Linux interactive governor and a PID-based controller in energy savings and deadline misses.

8.2 Future Directions

8.2.1 Secure and Reliable Real-Time Systems

Run-time monitoring has not yet been implemented in commercial processors. However, current trends could cause hardware run-time monitoring to be realized sooner rather than later. Although Moore's Law has continued to increase transistor counts on chips, it is becoming more difficult to find ways to use these transistors to improve performance. Additionally, the last few years have seen numerous high-profile and high-cost security attacks. The continued increase in transistor count and the rising importance of computer security could lead to the introduction of hardware security features, such as run-time monitoring, in commercial processors in the near future. Applying these architectures to real-time systems will be especially beneficial due to the often safety-critical cyber-physical interactions of real-time systems. In addition, these systems have not been immune to attack, as seen in recent reports of attacks on automobiles and airplanes.

One interesting future research direction is to explore possible gains from combining static analysis of applications with run-time monitoring. Run-time monitoring can catch certain errors that would not be caught by static analysis due to incomplete information (e.g., input-dependent data values). However, there are a number of properties that can be checked statically. By verifying some properties statically and passing this information to the run-time system, the amount of run-time monitoring performed can be reduced. This idea is not specific to real-time systems. However, real-time system design flows already include static analysis for timing guarantees, so additional analysis for security may fit well in the flow. In addition, reduction in run-time monitoring can lead to big gains for real-time systems in terms of reducing WCET and increasing coverage.
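One possible shape for this interface is sketched below (our own assumption; the data structure and function names are hypothetical): the static analyzer emits the set of instruction addresses whose checks it has already proven safe, and the run-time monitor consults this set before performing any work.

    # Sketch of one possible static/dynamic split (our assumption; the interface
    # is hypothetical): the static analyzer proves some checks safe ahead of
    # time, and the run-time monitor consults that set before doing any work.

    STATICALLY_VERIFIED = {0x4006b0, 0x4006d4}   # addresses proven safe offline

    def monitor_event(pc, do_full_check):
        """Run the full run-time check only when static analysis could not prove safety."""
        if pc in STATICALLY_VERIFIED:
            return True               # check discharged at compile time: no run-time cost
        return do_full_check(pc)      # fall back to the normal monitor

    # Example: only the unproven address pays the monitoring overhead.
    print(monitor_event(0x4006b0, lambda pc: False))            # True, check skipped
    print(monitor_event(0x400700, lambda pc: pc != 0x400700))   # False, full check ran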

8.2.2 Energy-Efficient Real-Time Systems

In this thesis, we have demonstrated energy savings using prediction-based DVFS control for applications running on a real system. The main challenge in extending this DVFS control to commercial applications and systems lies in the robustness of the tools. We have written a set of tools to instrument code for features, perform program slicing, and generate the resulting predictor. The tools written for our experiments may run into issues with more complex code. However, these tool requirements are not particularly new. For example, commercial tools exist for performing program slicing, which is the most complex portion of the toolflow. Thus, the challenges in creating a more robust toolset are not insurmountable.

Our work presented in this thesis is a first step in developing general and automated prediction-based resource control. Generalizing this to other applications and platforms is a promising direction for future research. Similar approaches could be applied to DVFS and resource control in other domains such as GPUs, FPGAs, hardware accelerators, heterogeneous datacenters, etc. These will require expanding on the prediction methods presented in this thesis. We have shown one example of this by extending our controller to select the appropriate core for a heterogeneous system with big and little cores.
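As a concrete illustration of this extension (the scaling factors and energy numbers below are invented, not measured), the per-job decision generalizes from choosing a frequency to choosing a (core type, frequency) pair: among the configurations predicted to meet the deadline, the controller picks the one with the lowest predicted energy.

    # Sketch of per-job configuration selection for a big/little system (our
    # illustration; the speedups and energy rates are invented). Among the
    # configurations predicted to meet the deadline, pick the lowest-energy one.

    # (core, MHz) -> (speedup vs. little@600, relative energy per ms of work)
    CONFIGS = {
        ("little",  600): (1.0, 1.0),
        ("little", 1000): (1.6, 1.8),
        ("big",    1000): (2.6, 3.5),
        ("big",    1800): (4.2, 7.0),
    }

    def choose_config(predicted_ms_little_600, deadline_ms):
        best, best_energy = None, float("inf")
        for cfg, (speedup, energy_rate) in CONFIGS.items():
            time = predicted_ms_little_600 / speedup
            energy = energy_rate * time
            if time <= deadline_ms and energy < best_energy:
                best, best_energy = cfg, energy
        return best or max(CONFIGS, key=lambda c: CONFIGS[c][0])  # fastest if none fit

    print(choose_config(40.0, 33.0))   # ('little', 1000): meets 33 ms at lower energy than big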

On the application side, the workloads we tested worked well with control flow features. However, it is likely that there exist irregular workloads whose execution time does not correlate well with control flow features. This will require further study into what additional features may be useful for prediction, such as information about memory access patterns. In addition, although we have avoided needing detailed programmer knowledge, hints from the programmer have the potential to greatly improve prediction accuracy, especially for these irregular workloads.

8.2.3 Hardware Design for Real-Time Systems

We have shown in this thesis how to handle some of the challenges as well as how to capture some of the opportunities created by this co-design of hardware and real-time requirements. The increasing integration of computers into our daily lives, especially in the form of cyber-physical systems, is creating more and more contexts where real-time constraints matter for computing. In addition, as hardware becomes increasingly complex, it is harder to manage the timing of tasks purely from the software layer. For example, hardware-based run-time monitoring is designed to be largely transparent to the programmer but can have large impacts on program execution time. As a result, the idea of pushing real-time requirements past the software layer and down into the hardware will be important for future computer design and research.

BIBLIOGRAPHY

[1] Aravindh Anantaraman, Kiran Seth, Kaustubh Patil, Eric Rotenberg, and Frank Mueller. Virtual Simple Architecture (VISA): Exceeding the Com- plexity Limit in Safe Real-Time Systems. In Proceedings of the 30th Interna- tional Symposium on Computer Architecture, 2003.

[2] Aravindh Anantaraman, Kiran Seth, Eric Rotenberg, and Frank Mueller. Enforcing Safety of Real-Time Schedules on Contemporary Processors us- ing a Virtual Simple Architecture (VISA). In Proceedings of the 25th Inter- national Real-Time Systems Symposium, 2004.

[3] Tran Nguyen Bao Anh and Su-Lim Tan. Survey and Performance Evalu- ation of Real-Time Operating Systems (RTOS) for Small Microcontrollers. IEEE Micro, 2009.

[4] Orlando Arias, Lucas Davi, Matthias Hanreich, Yier Jin, Patrick Koe- berl, Debayan Paul, Ahmad-Reza Sadeghi, and Dean Sullivan. HAFIX: Hardware-Assisted Flow Integrity Extension. In Proceedings of the 52nd Design Automation Conference, 2015.

[5] Matthew Arnold and Barbara G. Ryder. A Framework for Reducing the Cost of Instrumented Code. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2001.

[6] Matthew Arnold, Martin T. Vechev, and Eran Yahav. QVM: An Efficient Runtime for Detecting Defects in Deployed Systems. In Proceedings of the 23rd Conference on Object-Oriented Programming Systems Languages and Ap- plications, 2008.

[7] Divya Arora, Srivaths Ravi, Anand Raghunathan, and Niraj K. Jha. Secure Embedded Processing Through Hardware-Assisted Run-Time Monitoring. In Proceedings of the Conference on Design, Automation and Test in Europe, 2005.

[8] Todd Austin, Eric Larson, and Dan Ernst. SimpleScalar: An Infrastructure for Computer System Modeling. IEEE Computer, 2002.

[9] Michel Berkelaar, Kjell Eikland, and Peter Notebaert. lp solve Version 5.5. http://lpsolve.sourceforge.net/5.5/.

[10] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. The gem5 Simulator. SIGARCH Computer Architecture News, 2011.

[11] Michael D. Bond, Katherine E. Coons, and Kathryn S. McKinley. PACER: Proportional Detection of Data Races. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2010.

[12] Dominik Brodowski. CPU Frequency and Voltage Scaling Code in the Linux™ Kernel. https://android.googlesource.com/kernel/common/+/a7827a2a60218b25f222b54f77ed38f57aebe08b/Documentation/cpu-freq/governors.txt.

[13] Stuart K. Card, George G. Robertson, and Jock D. Mackinlay. The Infor- mation Visualizer, an Information Workspace. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 1991.

[14] Sudipta Chattopadhyay, Chong Lee Kee, Abhik Roychoudhury, Timon Kelter, Peter Marwedel, and Heiko Falk. A Unified WCET Analysis Framework for Multi-core Platforms. In Proceedings of the 18th Real Time and Embedded Technology and Applications Symposium, 2012.

[15] Shimin Chen, Babak Falsafi, Phillip B. Gibbons, Michael Kozuch, Todd C. Mowry, Radu Teodorescu, Anastassia Ailamaki, Limor Fix, Gregory R. Ganger, Bin Lin, and Steven W. Schlosser. Log-based Architectures for General-purpose Monitoring of Deployed Code. In Proceedings of the 1st Workshop on Architectural and System Support for Improving Software Depend- ability, 2006.

[16] Shimin Chen, Michael Kozuch, Theodoros Strigkos, Babak Falsafi, Phillip B. Gibbons, Todd C. Mowry, Vijaya Ramachandran, Olatunji Ruwase, Michael Ryan, and Evangelos Vlachos. Flexible Hardware Ac- celeration for Instruction-Grain Program Monitoring. In Proceedings of the 35th International Symposium on Computer Architecture, 2008.

[17] Trishul M. Chilimbi and Matthias Hauswirth. Low-overhead Memory Leak Detection Using Adaptive Statistical Profiling. In Proceedings of the 11th International Conference on Architectural Support for Programming Lan- guages and Operating Systems, 2004.

[18] Hyun-Duk Cho, Kisuk Chung, and Taehoon Kim. Benefits of the big.LITTLE Architecture. Technical report, Samsung Electronics Co., Ltd., 2012.

[19] Kihwan Choi, K. Dantu, Wei-Chung Cheng, and M. Pedram. Frame-Based Dynamic Voltage and Frequency Scaling for a MPEG Decoder. In Proceed- ings of the IEEE/ACM International Conference on Computer Aided Design, 2002.

[20] Robert I. Davis and Alan Burns. A Survey of Hard Real-time Scheduling for Multiprocessor Systems. ACM Computing Surveys, 2011.

[21] M. Delvai, W. Huber, P. Puschner, and A. Steininger. Processor Support for Temporal Predictability - The SPEAR Design Example. In Proceedings of the 15th Euromicro Conference on Real-Time Systems, 2003.

[22] John Demme, Matthew Maycock, Jared Schmitz, Adrian Tang, Adam Waksman, Simha Sethumadhavan, and Salvatore Stolfo. On the Feasi- bility of Online Malware Detection with Performance Counters. In Pro- ceedings of the 40th International Symposium on Computer Architecture, 2013.

[23] Daniel Y. Deng, Daniel Lo, Greg Malysa, Skyler Schneider, and G. Edward Suh. Flexible and Efficient Instruction-Grained Run-Time Monitoring Us- ing On-Chip Reconfigurable Fabric. In Proceedings of the 43rd International Symposium on Microarchitecture, 2010.

[24] Daniel Y. Deng and G. Edward Suh. High-performance Parallel Acceler- ator for Flexible and Efficient Run-time Monitoring. In Proceedings of the 42nd International Conference on Dependable Systems and Networks, 2012.

[25] Joe Devietti, Colin Blundell, Milo M. K. Martin, and Steve Zdancewic. Hardbound: Architectural Support for Spatial Safety of the C Program- ming Language. In Proceedings of the 13th International Conference on Archi- tectural Support for Programming Languages and Operating Systems, 2008.

[26] Joseph Devietti, Benjamin P. Wood, Karin Strauss, Luis Ceze, Dan Gross- man, and Shaz Qadeer. RADISH: Always-on Sound and Complete Race Detection in Software and Hardware. In Proceedings of the 39th Interna- tional Symposium on Computer Architecture, 2012.

[27] Udit Dhawan, Catalin Hritcu, Raphael Rubin, Nikos Vasilakis, Silviu Chiricescu, Jonathan M. Smith, Thomas F. Knight, Jr., Benjamin C. Pierce, and Andre DeHon. Architectural Support for Software-Defined Metadata Processing. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems, 2015.

[28] Stephen A. Edwards and Edward A. Lee. The Case for the Precision Timed (PRET) Machine. In Proceedings of the 44th Design Automation Con- ference, 2007.

[29] Yasuhiro Endo, Zheng Wang, J. Bradley Chen, and Margo Seltzer. Using Latency to Evaluate Interactive System Performance. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation, 1996.

[30] Sotiria Fytraki, Evangelos Vlachos, Onur Kocberber, Babak Falsafi, and Boris Grot. FADE: A Programmable Filtering Accelerator for Instruction- Grain Monitoring. In Proceedings of the 20th International Symposium on High Performance Computer Architecture, 2014.

[31] Waclaw Godycki, Christopher Torng, Ivan Bukreyev, Alyssa Apsel, and Christopher Batten. Enabling Realistic Fine-Grain Voltage Scaling with Reconfigurable Power Distribution Networks. In Proceedings of the 47th International Symposium on Microarchitecture, 2014.

[32] Joseph L. Greathouse, Chelsea LeBlanc, Todd Austin, and Valeria Bertacco. Highly Scalable Distributed Dataflow Analysis. In Proceedings of the 9th International Symposium on Code Generation and Optimization, 2011.

[33] Joseph L. Greathouse, Ilya Wagner, David A. Ramos, Gautam Bhatnagar, Todd Austin, Valeria Bertacco, and Seth Pettie. Testudo: Heavyweight Security Analysis via Statistical Sampling. In Proceedings of the 41st Inter- national Symposium on Microarchitecture, 2008.

[34] Yan Gu and Samarjit Chakraborty. A Hybrid DVS Scheme for Interactive 3D Games. In Proceedings of the 14th Real-Time and Embedded Technology and Applications Symposium, 2008.

[35] Yan Gu and Samarjit Chakraborty. Control Theory-based DVS for Inter- active 3D Games. In Proceedings of the 45th Design Automation Conference, 2008.

[36] Jan Gustafsson, Adam Betts, Andreas Ermedahl, and Björn Lisper. The Mälardalen WCET Benchmarks – Past, Present and Future. In Proceedings of the 10th International Workshop on Worst-Case Execution Time Analysis, 2010.

[37] Matthew R. Guthaus, Jeffrey S. Ringenberg, Dan Ernst, Todd M. Austin, Trevor Mudge, and Richard B. Brown. MiBench: A Free, Commercially Representative Embedded Benchmark Suite. In Proceedings of the 4th International Workshop on Workload Characterization, 2001.

[38] Prasanna Hambarde, Rachit Varma, and Shivani Jha. The Survey of Real Time Operating System: RTOS. In Proceedings of the International Conference on Electronic Systems, Signal Processing and Computing Technologies, 2014.

[39] Reed Hastings and Bob Joyce. Purify: Fast Detection of Memory Leaks and Access Errors. In Proceedings of the Winter 1992 USENIX Conference, 1991.

[40] Chang-Hong Hsu, Yunqi Zhang, Michael A. Laurenzano, David Meisner, Thomas Wenisch, Jason Mars, Lingjia Tang, and Ronald G. Dreslinski. Adrenaline: Pinpointing and Reining in Tail Queries with Quick Voltage Boosting. In Proceedings of the 21st Symposium on High Performance Com- puter Architecture, 2015.

[41] Xiaowan Huang, Justin Seyster, Sean Callanan, Ketan Dixit, Radu Grosu, Scott A. Smolka, Scott D. Stoller, and Erez Zadok. Software Monitoring with Controllable Overhead. International Journal on Software Tools for Tech- nology Transfer, 2012.

[42] David Huggins-Daines, Mohit Kumar, Arthur Chan, Alan W. Black, Mo- sur Ravishankar, and Alex I. Rudnicky. Pocketsphinx: A Free, Real-Time Continuous Speech Recognition System for Hand-Held Devices. In Pro- ceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2006.

[43] IBM ILOG CPLEX Optimizer. http://www-01.ibm.com/software/integration/optimization/cplex-optimizer/.

[44] Introduction to Intel Memory Protection Extensions. http://software.intel.com/en-us/articles/introduction-to-intel-memory-protection-extensions.

[45] Yongin Kwon, Sangmin Lee, Hayoon Yi, Donghyun Kwon, Seungjun Yang, Byung-Gon Chun, Ling Huang, Petros Maniatis, Mayur Naik, and Yunheung Paek. Mantis: Automatic Performance Prediction for Smart- phone Applications. In Proceedings of the 2013 USENIX Conference on An- nual Technical Conference, 2013.

[46] Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, and Dean M. Tullsen. McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures. In Proceedings of the 42nd International Symposium on Microarchitecture, 2009.

[47] Xianfeng Li, Yun Liang, Tulika Mitra, and Abhik Roychoudury. Chronos: A Timing Analyzer for Embedded Software. Science of Computer Program- ming, 2007.

[48] Yau-Tsun Steven Li and Sharad Malik. Performance Analysis of Embed- ded Software Using Implicit Path Enumeration. In Proceedings of the 32nd Conference on Design Automation, 1995.

[49] Ben Liblit, Mayur Naik, Alice X. Zheng, Alex Aiken, and Michael I. Jor- dan. Scalable Statistical Bug Isolation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2005.

[50] Ben Lickly, Isaac Liu, Sungjun Kim, Hiren D. Patel, Stephen A. Edwards, and Edward A. Lee. Predictable Programming on a Precision Timed Ar- chitecture. In Proceedings of the International Conference on Compilers, Archi- tectures and Synthesis for Embedded Systems, 2008.

[51] Gitte Lindgaard, Gary Fernandes, Cathy Dudek, and J. Brown. Attention web designers: You have 50 milliseconds to make a good first impression! Behaviour & Information Technology, 25(2), 2006.

[52] Isaac Liu, Jan Reineke, David Broman, Michael Zimmer, and Edward A. Lee. A PRET Microarchitecture Implementation with Repeatable Timing and Competitive Performance. In Proceedings of the 30th International Con- ference on Computer Design, 2012.

[53] David Lo, Liqun Cheng, Rama Govindaraju, Luiz André Barroso, and Christos Kozyrakis. Towards Energy Proportionality for Large-scale Latency-critical Workloads. In Proceedings of the 41st International Symposium on Computer Architecture, 2014.

[54] Mingsong Lv, Wang Yi, Nan Guan, and Ge Yu. Combining Abstract In- terpretation with Model Checking for Timing Analysis of Multicore Soft- ware. In Proceedings of the 31st Real-Time Systems Symposium, 2010.

[55] Daniel Marino, Madanlal Musuvathi, and Satish Narayanasamy. LiteRace: Effective Sampling for Lightweight Data-race Detection. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2009.

[56] Albert Meixner, Michael E. Bauer, and Daniel Sorin. Argus: Low-Cost, Comprehensive Error Detection in Simple Cores. In Proceedings of the 40th International Symposium on Microarchitecture, 2007.

[57] Robert B. Miller. Response Time in Man-computer Conversational Trans- actions. In Proceedings of the 1968 Fall Joint Computer Conference, Part I, 1968.

[58] Timothy N. Miller, Xiang Pan, Renji Thomas, Naser Sedaghati, and Radu Teodorescu. Booster: Reactive Core Acceleration for Mitigating the Effects of Process Variation and Application Imbalance in Low-Voltage Chips. In Proceedings of the 18th International Symposium on High Performance Com- puter Architecture, 2012.

[59] Jörg Mische, Irakli Guliashvili, Sascha Uhrig, and Theo Ungerer. How to Enhance a Superscalar Processor to Provide Hard Real-time Capable In-order SMT. In Proceedings of the 23rd International Conference on Architecture of Computing Systems, 2010.

[60] Nachiappan Chidambaram Nachiappan, Praveen Yedlapalli, Niranjan Soundararajan, Anand Sivasubramaniam, Mahmut T. Kandemir, Ravi Iyer, and Chita R. Das. Domain Knowledge Based Energy Management in Handhelds. In Proceedings of the 21st Symposium on High Performance Computer Architecture, 2015.

[61] Vijay Nagarajan, Ho-Seop Kim, Youfeng Wu, and Rajiv Gupta. Dynamic Information Flow Tracking on Multicores. In Proceedings of the Workshop on Interaction Between Compilers and Computer Architectures, 2008.

[62] James Newsome and Dawn Song. Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software. In Proceedings of the 12th Network and Distributed Systems Security Symposium, 2005.

[63] Alexey Nikolaev. Curse of War – Real Time Strategy Game For Linux. https://github.com/a-nikolaev/curseofwar.

[64] ODROID-XU3. http://www.hardkernel.com/main/products/prdt_info.php?g_code=G140448267127.

[65] Marco Paolieri, Eduardo Quiñones, Francisco J. Cazorla, Guillem Bernat, and Mateo Valero. Hardware Support for WCET Analysis of Hard Real-time Multicore Systems. In Proceedings of the 36th International Symposium on Computer Architecture, 2009.

[66] Nathaniel Pinckney, Matthew Fojtik, Bharan Giridhar, Dennis Sylvester, and David Blaauw. Shortstop: An On-Chip Fast Supply Boosting Tech- nique. In Proceedings of the 2013 Symposium on VLSI Circuits, 2013.

[67] Dieter Plaetinck. Uzbl – Web Interface Tools Which Adhere to the Unix Philosophy. http://www.uzbl.org.

[68] Milos Prvulovic. CORD: Cost effective (and nearly overhead-free) Order Recording and Data race detection. In High-Performance Computer Archi- tecture, 2006.

[69] Feng Qin, Cheng Wang, Zhenmin Li, Ho-Seop Kim, Yuanyuan Zhou, and Youfeng Wu. LIFT: A Low-Overhead Practical Information Flow Tracking System for Detecting Security Attacks. In Proceedings of the 39th Interna- tional Symposium on Microarchitecture, 2006.

[70] Feng Qin, Cheng Wang, Zhenmin Li, Ho-Seop Kim, Yuanyuan Zhou, and Youfeng Wu. LIFT: A Low-Overhead Practical Information Flow Tracking System for Detecting Security Attacks. In Proceedings of the 39th Interna- tional Symposium on Microarchitecture, 2006.

[71] Michael Roitzsch, Stefan Wächtler, and Hermann Härtig. ATLAS: Look-Ahead Scheduling Using Workload Metrics. In Proceedings of the 19th Real-Time and Embedded Technology and Applications Symposium, 2013.

[72] Sonal Saha and Binoy Ravindran. An Experimental Evaluation of Real- Time DVFS Scheduling Algorithms. In Proceedings of the 5th International Systems and Storage Conference, 2012.

[73] Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro, and Thomas Anderson. Eraser: A Dynamic Data Race Detector for Multi-Threaded Programs. ACM Transactions on Computer Systems, 1997.

[74] Hovav Shacham. The Geometry of Innocent Flesh on the Bone: Return-into-libc without Function Calls (on the x86). In Proceedings of the 14th Conference on Computer and Communications Security, 2007.

[75] Martin Schoeberl. A Java Processor Architecture for Embedded Real-Time Systems. Journal of Systems Architecture, 2007.

[76] Martin Schoeberl, Pascal Schleuniger, Wolfgang Puffitsch, Florian Brand- ner, and Christian W. Probst. Towards a Time-predictable Dual-Issue Mi- croprocessor: The Patmos Approach. In Bringing Theory to Practice: Pre- dictability and Performance in Embedded Systems, 2011.

[77] Lui Sha. Using Simplicity to Control Complexity. IEEE Software, 2001.

[78] Weidong Shi, Hsien-Hsin S. Lee, Laura Falk, and Mrinmoy Ghosh. IN- DRA: An Integrated Framework for Dependable and Revivable Architec- tures Using Multicore Processors. In Proceedings of the 33rd International Symposium on Computer Architecture, 2006.

[79] Gerard Sierksma. Linear and Integer Programming, pages 237–239. Marcel Dekker, Inc., 2002.

[80] SPEC CPU™ 2006. http://www.spec.org/cpu2006/.

[81] Bjørn Stabell, Ken Ronny Schouten, Bert Gÿsbers, and Dick Balaska. XPilot. http://www.xpilot.org/.

[82] G. Edward Suh, Jae W. Lee, David Zhang, and Srinivas Devadas. Secure Program Execution via Dynamic Information Flow Tracking. In Proceed- ings of the 11th International Conference on Architectural Support for Program- ming Languages and Operating Systems, 2004.

[83] Karsten Sühring. H.264/AVC Software Coordination. http://iphome.hhi.de/suehring/tml/.

[84] Adrian Tang, Simha Sethumadhavan, and Salvatore J. Stolfo. Unsuper- vised Anomaly-based Malware Detection using Hardware Features. In Proceedings of the 17th International Symposium on Research in Attacks, Intru- sions, and Defenses, 2014.

[85] Robert Tibshirani. Regression Shrinkage and Selection via the Lasso. Jour- nal of the Royal Statistical Society. Series B (Methodological), 1996.

[86] Frank Tip. A Survey of Program Slicing Techniques. Journal of Program- ming Languages, 3(3), 1995.

[87] Theo Ungerer, Francisco J. Cazorla, Pascal Sainrat, Guillem Bernat, Zlatko Petrov, Hugues Cassé, Christine Rochange, Eduardo Quiñones, Sascha Uhrig, Mike Gerdes, Irakli Guliashvili, Michael Houston, Florian Kluge, Stefan Metzlaff, Jörg Mische, Marco Paolieri, and Julian Wolf. Merasa: Multicore Execution of Hard Real-Time Applications Supporting Analyzability. IEEE Micro, 2010.

[88] Maurits van der Schee. 2048.c. https://github.com/mevdschee/2048.c.

[89] Guru Venkataramani, Ioannis Doudalis, Yan Solihin, and Milos Prvulovic. FlexiTaint: A Programmable Accelerator for Dynamic Taint Propagation. In Proceedings of the 14th International Symposium on High Performance Com- puter Architecture, 2008.

[90] Jack Whitham and Neil Audsley. MCGREP – A Predictable Architecture for Embedded Real-Time Systems. In Proceedings of the 27th International Real-Time Systems Symposium, 2006.

[91] Jack Whitham and Neil Audsley. Time-Predictable Out-of-Order Execu- tion for Hard Real-Time Systems. IEEE Transactions on Computers, 2010.

[92] Reinhard Wilhelm, Jakob Engblom, Andreas Ermedahl, Niklas Holsti, Stephan Thesing, David Whalley, Guillem Bernat, Christian Ferdinand, Reinhold Heckmann, Tulika Mitra, Frank Mueller, Isabelle Puaut, Peter Puschner, Jan Staschulat, and Per Stenström. The Worst-Case Execution-Time Problem – Overview of Methods and Survey of Tools. ACM Transactions on Embedded Computing Systems, 2008.

[93] Emmett Witchel, Josh Cates, and Krste Asanovic. Mondrian Memory Pro- tection. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.

[94] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of the 22nd Interna- tional Symposium on Computer Architecture, 1995.

[95] Qiang Wu, V. J. Reddi, Youfeng Wu, Jin Lee, Dan Connors, David Brooks, Margaret Martonosi, and Douglas W. Clark. A Dynamic Compilation Framework for Controlling Microprocessor Energy and Performance. In Proceedings of the 38th International Symposium on Microarchitecture, 2005.

[96] Fen Xie, Margaret Martonosi, and Sharad Malik. Compile-time Dynamic Voltage Scaling Settings: Opportunities and Limits. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2003.

[97] Man-Ki Yoon, Sibin Mohan, Jaesik Choi, Jung-Eun Kim, and Lui Sha. Se- cureCore: A Multicore-based Intrusion Detection Architecture for Real- Time Embedded Systems. In Proceedings of the 19th Real-Time and Embedded Technology and Applications Symposium, 2013.

[98] Yuhao Zhu, Matthew Halpern, and Vijay Janapa Reddi. Event-Based Scheduling for Energy-Efficient QoS (eQoS) in Mobile Web Applications. In Proceedings of the 21st Symposium on High Performance Computer Architecture, 2015.

[99] Yuhao Zhu and Vijay Janapa Reddi. High-performance and Energy- efficient Mobile Web Browsing on Big/Little Systems. In Proceedings of the 19th International Symposium on High-Performance Computer Architec- ture, 2013.

[100] Michael Zimmer, David Broman, Chris Shaver, and Edward A. Lee. Flex- PRET: A Processor Platform for Mixed-Criticality Systems. In Proceedings of the 20th Real-Time and Embedded Technology and Application Symposium, 2014.
