HARDWARE ARCHITECTURES FOR SECURE, RELIABLE, AND ENERGY-EFFICIENT REAL-TIME SYSTEMS
A Dissertation Presented to the Faculty of the Graduate School of Cornell University
in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
by
Daniel Lo

August 2015

© 2015 Daniel Lo
ALL RIGHTS RESERVED

HARDWARE ARCHITECTURES FOR SECURE, RELIABLE, AND ENERGY-EFFICIENT REAL-TIME SYSTEMS

Daniel Lo, Ph.D.
Cornell University 2015
The last decade has seen an increased ubiquity of computers with the widespread adoption of smartphones and tablets and the continued spread of embedded cyber-physical systems. With this integration into our environments, it has become important to consider the real-world interactions of these computers rather than simply treating them as abstract computing machines. For example, cyber-physical systems typically have real-time constraints in order to ensure safe and correct physical interactions. Similarly, even mobile and desktop systems, which are not traditionally considered real-time systems, are inherently time-sensitive because of their interaction with users. Traditional techniques proposed for improving hardware architectures are not designed with these real-time constraints in mind. In this thesis, we explore some of the challenges and opportunities in hardware design for real-time systems. Specifically, we study recent techniques for improving security, reliability, and energy-efficiency of computer systems.

Run-time monitoring has been shown to be a promising technique for improving system security and reliability. Applying run-time monitoring to real-time systems introduces challenges due to the performance impact of monitoring. To address this, we first developed a technique for estimating the worst-case execution time impact of run-time monitoring, enabling it to be applied within existing real-time system design methodologies. Next, we developed hardware architectures that selectively enable and disable monitoring in order to meet real-time deadlines while still allowing a portion of the monitoring to be performed. We show architectures for hard and soft real-time systems, with different design decisions and trade-offs for each.

Dynamic voltage and frequency scaling (DVFS) is commonly found in modern processors as a way to dynamically trade off power and performance. In the presence of real-time deadlines, this presents an opportunity for improved energy-efficiency by slowing down jobs to reduce energy while still meeting deadline requirements. We developed a method for creating DVFS controllers that are able to predict the appropriate frequency level for tasks before they execute and show improved energy savings and reduced deadline misses compared to current state-of-the-art DVFS controllers.

BIOGRAPHICAL SKETCH
Daniel Lo received his Bachelor of Science (with Honors) degree in Electrical Engineering from the California Institute of Technology in 2009. At that time, he began doctoral studies at Cornell University in the department of Electrical and Computer Engineering. He has interned with the Jet Propulsion Laboratory, Applied Minds, and Intel. His research lies at the intersection of computer architecture, computer security, and real-time systems.
ACKNOWLEDGEMENTS
First and foremost, I would like to thank my advisor, Ed Suh. Without Ed's guidance, collaboration, and mentorship, this thesis would not have been possible. I would also like to thank my other committee members, Rajit Manohar and Mark Campbell, for their guidance and advice throughout my Ph.D. career.
The work in this thesis was not a solo effort. Special thanks to Tao Chen and Mohamed Ismail for their help on our work on secure and reliable real-time systems. Similarly, special thanks to Taejoon Song for his help on our work on energy-efficient real-time systems. I want to thank Rahul Mangharam for our early discussions that led us to pursue this research direction into real-time systems. I would also like to thank François Guimbretière for his feedback and insight on user interactions. The Suh Research Group and CSL have been a great academic, as well as social, community to be a part of. They have been a wonderful group of people to bounce ideas off, share knowledge, and to get lunch with. Thank you to Ray Huang for helping me get set up and started when I first arrived at Cornell. Thank you to KK Yu for always being there whenever I needed a favor. Thanks to Ji Kim for introducing me to Dungeons and Dragons and a myriad of new board games. Thanks to Andrew Ferraiuolo for always being free when I needed to take a break and thanks to Shreesha Srinath for showing me some great places to eat in Ithaca.

My time at Cornell would not have been as enjoyable without all the friends that I have made here. Thank you to Nata Saslafsy, Koyuki Nakamura, Harrison Ko, and Scott Johnson for some great skate sessions and for teaching me to snowboard. Thanks to Crystal Lu for showing me that there's more to life than work. Thanks to Carissa Kang, Elizabeth Wijaya, Lynette Saw, Bryan Chong, and Mairin Ng for teaching me Singlish over Mahjong. I also want to thank the Cornell Longboarding Club, Libe Slope FC, Mystery Machine, the Cayuga Windsurfing Club, the Cornell Cube Club, the Glen Region SCCA, and the Enchanted Badger. These communities have helped me make Ithaca a place to call home for these six years.

Finally, I would like to thank my parents for teaching me the value of hard work and discipline. I would not have been able to complete this journey without their continued support.
TABLE OF CONTENTS

Biographical Sketch
Acknowledgements
Table of Contents
List of Tables
List of Figures

1 Introduction
  1.1 Secure and Reliable Real-Time Systems
  1.2 Energy-Efficient Real-Time Systems
  1.3 Organization

2 Background
  2.1 Real-Time Systems
  2.2 Run-Time Monitoring
    2.2.1 Overview
    2.2.2 Hardware Architecture
    2.2.3 Example: Array Bounds Check

3 WCET Analysis of Run-Time Monitoring
  3.1 Introduction
  3.2 WCET Analysis
    3.2.1 Analysis Assumptions
    3.2.2 Implicit Path Enumeration
    3.2.3 Sequential Monitoring Bound
    3.2.4 FIFO Model
    3.2.5 MILP Formulation
  3.3 Example MILP Formulation
    3.3.1 Example Setup
    3.3.2 Creating the MFG
    3.3.3 Calculating the Monitoring Load
    3.3.4 MILP Optimization
  3.4 Evaluation
    3.4.1 Experimental Setup
    3.4.2 WCET Results
    3.4.3 Time to Solve Linear Programming Problem

4 Run-Time Monitoring for Hard Real-Time Systems
  4.1 Introduction
  4.2 Opportunistic Monitoring
    4.2.1 Measuring Slack
    4.2.2 Dropping Tasks
  4.3 Hardware-Based Dropping Architecture
    4.3.1 Slack Tracking
    4.3.2 Metadata Invalidation Module
    4.3.3 Filtering Invalid Metadata
    4.3.4 Full Architecture
  4.4 Monitoring Extensions
    4.4.1 Uninitialized Memory Check
    4.4.2 Return Address Check
    4.4.3 Dynamic Information Flow Tracking
  4.5 Evaluation
    4.5.1 Methodology
    4.5.2 Amount of Monitoring Performed
    4.5.3 FPGA-based Monitor
    4.5.4 Area and Power Overheads

5 Run-Time Monitoring for Soft Real-Time Systems
  5.1 Introduction
  5.2 Partial Run-Time Monitoring
    5.2.1 Overhead of Run-Time Monitoring
    5.2.2 Partial Monitoring for Adjustable Overhead
    5.2.3 Applications and Metrics
    5.2.4 Design Challenges
  5.3 Architecture for Partial Monitoring
    5.3.1 Effects of Dropping Monitoring
    5.3.2 Invalidation for Preventing False Positives
    5.3.3 Dataflow Engine for Preventing False Positives
    5.3.4 Filtering Null Metadata
  5.4 Dropping Policies
    5.4.1 Deciding When to Drop
    5.4.2 Deciding Which Events to Drop
  5.5 Evaluation
    5.5.1 Experimental Setup
    5.5.2 Baseline Monitoring Overhead
    5.5.3 Coverage with Adjustable Partial Monitoring
    5.5.4 FPGA-based Monitor
    5.5.5 Comparing Dropping Policies
    5.5.6 Multiple-Run Coverage
    5.5.7 Area and Power Overheads

6 DVFS Control for Soft Real-Time Systems
  6.1 Introduction
  6.2 Execution Time Variation in Interactive Applications
    6.2.1 Variations in Execution Time
    6.2.2 Existing DVFS Controllers
    6.2.3 Prediction-Based Control
  6.3 Execution Time Prediction for Energy-Efficiency
    6.3.1 Overview
    6.3.2 Program Features
    6.3.3 Execution Time Prediction Model
    6.3.4 DVFS Model
    6.3.5 Alternate Prediction Models
  6.4 System-Level Framework
    6.4.1 Programmer Annotation
    6.4.2 Off-line Analysis
    6.4.3 Run-time Prediction
  6.5 Evaluation
    6.5.1 Experimental Setup
    6.5.2 Energy Savings and Deadline Misses
    6.5.3 Analysis of Overheads and Error
    6.5.4 Under-prediction Trade-off
    6.5.5 Race to Idle
    6.5.6 Heterogeneous Cores

7 Related Work
  7.1 Real-Time Systems
    7.1.1 Hardware Architectures for Real-Time Systems
    7.1.2 Worst-Case Execution Time Analysis
  7.2 Security and Reliability
    7.2.1 Run-Time Monitoring
    7.2.2 Partial Monitoring
  7.3 Performance-Energy Trade-off
    7.3.1 Dynamic Voltage and Frequency Scaling
    7.3.2 Execution Time Prediction

8 Conclusion
  8.1 Summary
  8.2 Future Directions
    8.2.1 Secure and Reliable Real-Time Systems
    8.2.2 Energy-Efficient Real-Time Systems
    8.2.3 Hardware Design for Real-Time Systems

Bibliography
LIST OF TABLES

3.1 Estimated and observed WCET (clock cycles) with and without monitoring.
3.2 Running time of lp_solve in seconds to determine worst-case stalls (stall), sequential bound (sequential), and worst-case execution times (wcet).

4.1 Monitoring tasks for UMC, RAC, and DIFT.
4.2 Operation of the MIM for UMC, RAC, and DIFT.
4.3 Operation of the MFM for UMC, RAC, and DIFT.
4.4 Average power overheads for dropping hardware at zero headstart slack. Percentages in parentheses are normalized to the main core's power usage.

5.1 Trade-off between performance overhead and flexibility/complexity of run-time monitoring systems.
5.2 Average power overhead for dropping hardware. Percentages are normalized to the main core power.
5.3 Average runtime power of the monitoring core.

6.1 Variable and notation descriptions.
6.2 Differences in features selected and predicted execution times for different platforms compared to an ARM-LITTLE platform.
6.3 Benchmark descriptions and task of interest.
6.4 Execution time statistics when running at maximum frequency.
6.5 Parameters used in DVFS and heterogeneous core model.

7.1 Previous work on partial run-time monitoring.
LIST OF FIGURES

1.1 Overview of techniques presented in this thesis.

2.1 Example of tasks, jobs, and deadlines.
2.2 Timings related to a job.
2.3 Parallel monitoring architecture model.
2.4 The main core pipeline is modified to forward information on certain instruction types.
2.5 Pipeline diagram of monitoring. The main core stalls due to the slower throughput of the monitor.
2.6 Example of array bounds check monitor.

3.1 Example control flow graph and its corresponding monitoring flow graph. Blue (dark) nodes include a forwarded instruction.
3.2 Monitoring load is modeled for each node.
3.3 Code example with its corresponding control flow graph and monitoring flow graph. Blue (dark) nodes include a forwarded (ldr) instruction.
3.4 Toolflow for WCET estimation of parallel monitoring.
3.5 Ratio of estimated WCET to observed WCET.
3.6 Ratio of WCET using our proposed method compared to assuming sequential monitoring.
3.7 Ratio of WCET with monitoring to WCET without monitoring.

4.1 Dynamic slack and total slack.
4.2 Dynamic slack increases when a sub-task finishes early. Slack is consumed as monitoring causes the task to stall.
4.3 WCET with original monitoring task, with software dropping task, and with no monitoring for UMC. Results are normalized to the WCET with no monitoring.
4.4 Hardware modules for keeping track of slack and making a drop decision.
4.5 Metadata invalidation module (MIM).
4.6 Metadata filtering module (MFM).
4.7 Block diagram of full hardware architecture for opportunistic monitoring on hard real-time systems.
4.8 Percentage of monitored, dropped, and filtered monitoring events with zero headstart slack.
4.9 Percentage of checks that are not dropped or filtered for zero headstart slack.
4.10 Percentage of checks performed as headstart slack is varied for UMC. Headstart slack is shown normalized to the main task's WCET.
4.11 Percentage of checks performed as headstart slack is varied for RAC implemented on a processor core. Headstart slack is shown normalized to the main task's WCET.
4.12 Percentage of checks performed as headstart slack is varied for DIFT. Headstart slack is shown normalized to the main task's WCET.
4.13 Percentage of monitored, dropped, and filtered monitoring events for an FPGA-based monitor with zero headstart slack.
4.14 Percentage of checks that are not dropped or filtered on an FPGA-based monitor for zero headstart slack.
4.15 Percentage of checks performed as headstart slack is varied for UMC on an FPGA-based monitor. Headstart slack is shown normalized to the main task's WCET.
4.16 Percentage of checks performed as headstart slack is varied for RAC on an FPGA-based monitor. Headstart slack is shown normalized to the main task's WCET.
4.17 Percentage of checks performed as headstart slack is varied for DIFT on an FPGA-based monitor. Headstart slack is shown normalized to the main task's WCET.

5.1 Example of array bounds check monitor.
5.2 Example of using invalidation information to prevent false positives.
5.3 Hardware support for dropping.
5.4 Hardware architecture of the dataflow engine.
5.5 Example of using information about null metadata to filter monitoring events.
5.6 Slack and its effect on monitoring over time.
5.7 Slack tracking and drop decision hardware.
5.8 Comparison of dropping policies using metadata dependence graphs. Square nodes represent events where checks are performed. Blue (dark) nodes indicate which nodes can be dropped.
5.9 Monitoring overhead with null metadata filtering.
5.10 Coverage vs. varying overhead budget for BC.
5.11 Coverage vs. varying overhead budget for UMC.
5.12 Coverage vs. varying overhead budget for LS.
5.13 Average error vs. varying overhead budget for instruction-mix profiling. Whisker bars show min and max errors.
5.14 Coverage versus varying overhead budget for UMC running on an FPGA-based monitor.
5.15 Comparing dropping policies for UMC.
5.16 Comparing dropping policies for BC.
5.17 Coverage over multiple runs for BC with a 10% probability to not drop events.

6.1 Execution time of jobs (frames) for ldecode (video decoder).
6.2 Execution time of jobs (blue, solid) and execution time expected by a PID controller (red, dashed) for ldecode (video decoder).
6.3 Overview of prediction-based control.
6.4 Example control flow graph. Each node is annotated with its number of instructions.
6.5 Steps to predict execution time from job input and program state.
6.6 Example of feature counters inserted for conditionals, loops, and function calls.
6.7 Example of program slicing for control flow features.
6.8 Average execution time of jobs (frames) for ldecode (video decoding) as frequency level varies.
6.9 The effective budget decreases due to slice and DVFS execution time.
6.10 95th-percentile switching times for DVFS.
6.11 Example of programmer annotation to mark task boundaries and time budgets.
6.12 Overall flow for prediction-based DVFS control.
6.13 Options for how to run predictor.
6.14 Normalized energy usage and deadline misses.
6.15 Normalized energy usage and deadline misses as time budget is varied.
6.16 Average time to run prediction slice and switch DVFS levels.
6.17 Normalized energy usage and deadline misses with overheads removed and oracle prediction.
6.18 Box-and-whisker plots of prediction error. Positive numbers correspond to over-prediction and negative numbers correspond to under-prediction.
6.19 Energy vs. deadline misses for various under-predict penalty weights (α) for ldecode.
6.20 Idle power for the ODROID-XU3's Cortex-A7 core for each frequency setting.
6.21 Normalized energy usage and deadline misses when the frequency is dropped to minimum (200 MHz) when jobs finish.
6.22 Relationship between frequency, execution time, and frequency based on analytical model.
6.23 Energy and deadline misses for running with only a little core (littleonly) and with both big and little cores (biglittle).
6.24 Number of core switches and jobs run on the big core as a percentage of total jobs for varying normalized budgets when comparing biglittle to littleonly.
6.25 Energy and deadline misses for running with only a big core (bigonly) and with both big and little cores (biglittle) for a normalized budget of 1.
6.26 Number of core switches and jobs run on the big core as a percentage of total jobs for a normalized budget of 1 when comparing biglittle to bigonly.
CHAPTER 1
INTRODUCTION
The last decade has seen an increased ubiquity of computers with the widespread adoption of smartphones and tablets and the continued spread of embedded cyber-physical systems. With this deep integration into our environments, it has become important to consider the real-world interactions of these computers rather than simply treating them as abstract computing machines.
For example, cyber-physical systems typically have real-time constraints in order to ensure safe and correct physical interactions. Similarly, even mobile and desktop systems, which are not traditionally considered real-time systems, are inherently time-sensitive because of their interaction with users.

Real-time systems introduce new constraints and trade-offs to system design. This is due to the presence of real-time deadlines for tasks. Correct operation involves both correct computation of outputs as well as completing tasks before their deadlines. Depending on the application, deadlines can be considered either hard deadlines, where a missed deadline indicates a fatal error, or soft deadlines, where a missed deadline results in poor performance but is not fatal. Depending on the type of deadline, the level of timing guarantee differs, which can affect the design decisions made.
Typically, these systems are designed conservatively so that, even in the worst case, tasks will not exceed their deadlines. This has traditionally been handled at the software layer and there exists extensive work in developing real-time scheduling algorithms [20], real-time operating systems (RTOS) [3, 38], and worst-case execution time (WCET) analysis techniques [92]. On the other hand, modern hardware features are typically evaluated and optimized for average-case performance in order to apply to general compute workloads; they do not take into account the deadlines of the real-time tasks that run on them. In this thesis, we explore the hardware/software co-design of real-time systems in order to address the challenges and opportunities that arise when the underlying hardware is aware of real-time requirements.

Figure 1.1: Overview of techniques presented in this thesis.
Specifically, we explore two modern hardware techniques. First, we explore run-time monitoring techniques for improving system security and reliability in the context of real-time systems. Second, we look at using real-time deadlines to inform dynamic voltage and frequency scaling of hardware for improved energy efficiency. We apply these techniques to hard and soft real-time systems. Figure 1.1 shows an overview of the work developed in this thesis and the areas they target.
1.1 Secure and Reliable Real-Time Systems
Run-time monitoring techniques have been shown to be useful for improving the reliability, security, and debugging capabilities of computer systems. For example, Hardbound is a hardware-assisted technique to detect out-of-bound memory accesses, which can cause undesired behavior or create a security vulnerability if uncaught [25]. Intel has recently announced plans to support buffer overflow protection similar to Hardbound in future architectures [44]. Similarly, run-time monitoring can enable many other new security, reliability, and debugging capabilities such as fine-grained memory protection [93], information flow tracking [82, 33], control flow integrity checking [4], hardware error detection [56], data-race detection [26, 68], etc. However, today's run-time monitoring techniques cannot be easily applied to critical real-time systems due to their lack of timing guarantees. Thus, we have developed three techniques in order to enable run-time monitoring on real-time systems.
Traditional real-time systems design requires estimation of the worst-case execution time (WCET) of tasks in order to guarantee schedulability. In order to enable run-time monitoring to be applied to traditional real-time design flows, an estimate of its impact on worst-case execution time is needed. Thus, we first developed a static analysis method to estimate the WCET impact of monitoring (Chapter 3). This allows run-time monitoring to be enabled if there is enough static slack in a real-time system’s schedule. Our analysis extends the traditional integer linear programming (ILP) formulation used for estimating WCET by adding new constraints in order to model the impact of run-time monitoring.
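The implicit path enumeration idea underlying such ILP formulations can be illustrated with a toy example. The sketch below is not the formulation developed in Chapter 3; it uses a hypothetical three-block control flow graph with made-up block costs and a loop bound, and brute-forces the one free execution count in plain Python where a real analysis would hand the constraints to an ILP solver:

```python
# Toy IPET-style WCET bound for a hypothetical CFG: A -> B (loop) -> C.
# A real formulation passes structural and loop-bound constraints to an
# ILP solver; here the only free variable is the loop body's execution
# count, so we simply enumerate it. Costs (cycles) are made up.
cost = {"A": 5, "B": 20, "C": 3}
LOOP_BOUND = 10  # assumed maximum iterations of B

wcet = 0
for x_b in range(LOOP_BOUND + 1):           # execution count of loop body B
    counts = {"A": 1, "B": x_b, "C": 1}     # structural constraint: A, C run once
    wcet = max(wcet, sum(cost[n] * counts[n] for n in counts))

print(wcet)  # 5 + 10*20 + 3 = 208
```

The maximum is reached at the loop bound, which is exactly the behavior the ILP objective (maximize total cost subject to flow and bound constraints) captures without enumerating paths.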
Next, we look at the challenge of applying run-time monitoring to an existing real-time system design. In this case, the real-time schedule may not have enough extra time (i.e., slack in the schedule) in order to support the full overhead of performing run-time monitoring. However, it may still be possible to perform a portion of the monitoring. We designed a hardware architecture that enables and disables run-time monitoring in order to perform as much monitoring as possible without violating hard real-time deadlines (Chapter 4).
Finally, soft real-time and interactive systems do not require strict timing guarantees but still look to achieve timely execution of programs. For these systems, we designed an architecture that enables adjustable overheads by trading off monitoring coverage (Chapter 5). By relaxing some of the constraints of our architecture for hard real-time systems, we are able to achieve higher monitoring coverage while still remaining close to deadlines.
1.2 Energy-Efficient Real-Time Systems
For modern mobile and embedded systems, energy usage is a major concern due to limited battery life. Techniques, such as dynamic voltage and frequency scaling (DVFS) and core migration, have been developed to enable a trade-off between power and performance. These resource allocation techniques can be used to reduce energy usage at the cost of increased execution time.
Traditional managers for these resource allocation techniques, such as DVFS, work in a best-effort manner. They operate under the assumption that better performance is desired and only attempt to decrease resources when the performance impact is minimal or when power or thermal limits are reached. However, many applications or jobs have response-time deadlines. Finishing faster than this response-time requirement has no benefit. For example, responding to a user input faster than human reaction time does not improve user experience.
Similarly, decoding video frames faster than the video frame rate has no added benefit.
For these interactive applications, we can save energy by using techniques such as DVFS to run at a lower power point, without impacting user experience. In this thesis, we present a method to predict the appropriate DVFS operating point for tasks in order to meet deadlines with minimal energy (Chapter 6). This prediction needs to take into account the effect of inputs and program state in order to account for dynamic variations in task execution time. We show how this is possible through analysis of task source code in order to generate task-specific predictors for DVFS control.
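As a first-order sketch of the idea (not the predictor developed in Chapter 6), suppose a task's execution time at the maximum frequency can be predicted. Then the lowest frequency whose scaled time still meets the budget can be selected, under the simplifying assumption that execution time scales linearly with 1/f (real workloads also have memory-bound components). The frequency levels below are hypothetical:

```python
# Hypothetical discrete DVFS levels (MHz); real platforms expose a similar set.
FREQS_MHZ = [200, 400, 600, 800, 1000]

def choose_frequency(t_pred_ms, budget_ms, freqs=FREQS_MHZ):
    """Pick the lowest frequency whose scaled predicted time meets the budget.

    t_pred_ms is the predicted execution time at the maximum frequency.
    Assumes time scales as f_max / f, which ignores memory-bound behavior.
    """
    f_max = max(freqs)
    for f in sorted(freqs):
        if t_pred_ms * f_max / f <= budget_ms:
            return f
    return f_max  # budget cannot be met at any level; run at full speed

print(choose_frequency(10, 33))  # 400: 10 ms * 1000/400 = 25 ms <= 33 ms
```

The interesting (and hard) part, addressed in Chapter 6, is producing t_pred_ms per job from program features rather than assuming it is known.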
1.3 Organization
The rest of the thesis is organized as follows. Chapter 2 discusses some background knowledge and terminology that will be used throughout the thesis.
Chapters 3, 4, and 5 discuss techniques for improving system security and reliability through run-time monitoring on real-time systems. Chapter 6 presents a method for improving the energy efficiency of real-time systems through prediction-based DVFS control. Finally, Chapter 7 discusses related work and Chapter 8 concludes the thesis and discusses future directions.
CHAPTER 2
BACKGROUND
This chapter provides some background knowledge that will be useful in understanding the work presented in this thesis. Section 2.1 discusses some terminology related to real-time systems that will be used throughout the thesis. Section 2.2 describes run-time monitoring techniques for security and reliability which will be used in Chapters 3, 4, and 5.
2.1 Real-Time Systems
A real-time application is typically specified as a set of tasks which have timing requirements. We refer to the time period in which a task must complete as its time budget. For example, games are typically written with a main task that handles reading user input, updating game state, and displaying the output.
In order to maintain a certain frame rate (e.g., 30 frames per second), this task must finish within the frame period budget (e.g., 33 milliseconds for 30 frames per second operation).
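The per-job time budget in this example is simply the reciprocal of the frame rate:

```python
def frame_budget_ms(fps):
    """Time budget per job (ms) for a fixed frame rate."""
    return 1000.0 / fps

print(round(frame_budget_ms(30), 1))  # 33.3 ms per frame at 30 FPS
```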
We define a job as a dynamic instance of a task. Figure 2.1 shows how a
Figure 2.1: Example of tasks, jobs, and deadlines.
Figure 2.2: Timings related to a job.
task maps to multiple jobs. Each job has a deadline which is the time by which it must finish execution. For example, for a game running at 30 frames per second, 30 jobs for the game loop task are run each second. Each of these jobs has a deadline which is 33 milliseconds after the job's start time. These jobs all correspond to the same set of static task code, but their dynamic behavior differs due to different inputs and program state. For example, one job may see that the user has pressed a button and thus execute code to process this button press. Other jobs may see no new user input and skip the need to process user input.
As a result, job execution times can vary depending on input and program state.
There are several time periods involved with the execution of a job in a real-time system (see Figure 2.2). The time that it takes a job to finish is its actual execution time. The worst-case execution time of the task across all possible inputs and initial states is its actual worst-case execution time (WCET). Tasks are statically analyzed to produce an estimated WCET for timing analysis and scheduling. This estimated WCET is guaranteed to be conservative (i.e., it is greater than or equal to the actual WCET). The time between the estimated WCET and the deadline is the static slack that exists. That is, the schedule could be adjusted to reduce the deadline and still guarantee that tasks finish on time. On the other hand, we refer to the difference between the actual execution time and the estimated WCET as the dynamic slack. This is the slack time that is only available at run-time.
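These timing quantities can be summarized with a small numeric sketch (all values hypothetical, in milliseconds):

```python
# Hypothetical timings for one job (ms), following the quantities in Figure 2.2.
deadline = 33.0
estimated_wcet = 25.0  # conservative static bound (>= actual WCET)
actual_time = 12.0     # this job's measured execution time

static_slack = deadline - estimated_wcet     # known at design time
dynamic_slack = estimated_wcet - actual_time # known only at run time

assert static_slack == 8.0
assert dynamic_slack == 13.0
```

Static slack can be exploited by static analysis (as in Chapter 3), while dynamic slack is only observable as jobs finish early, which motivates the run-time mechanisms of Chapters 4 and 5.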
2.2 Run-Time Monitoring
2.2.1 Overview
Run-time monitoring techniques have been shown to be useful for improving the security and reliability of computer systems. The basic idea behind run-time monitoring techniques is to observe the run-time operation of programs and check that certain invariants are met in order to identify anomalous behavior. A wide variety of checks have been proposed including fine-grained memory protection [93], information flow tracking [82, 33], etc. Section 2.2.2 discusses the hardware architecture for implementing run-time monitoring and Section 2.2.3 walks through how an example array bounds check operates.
2.2.2 Hardware Architecture
There are several options on how to implement run-time monitoring. Implementing run-time monitoring in software using binary instrumentation or other similar methods introduces especially high overheads. For example, dynamic information flow tracking (DIFT) implemented in software suffers a 3.6x slowdown on average [69]. Implementing monitors in hardware greatly decreases
Figure 2.3: Parallel monitoring architecture model.
Figure 2.4: The main core pipeline is modified to forward information on certain instruction types.
the performance impact by performing monitoring in parallel to a program's execution. A dedicated hardware implementation of DIFT reduces average performance overheads to just 1.1% [82]. However, fixed hardware loses the programmability of software. A fixed hardware implementation cannot be updated and cannot change the type of run-time monitoring performed. Thus, recent studies have proposed using programmable parallel hardware, such as extra cores in a multi-core system or FPGA coprocessors [15, 61, 16, 23, 24, 30, 27], for monitoring. Our work in this thesis is targeted at these programmable parallel hardware monitors.
Figure 2.3 shows a model of the parallel run-time monitoring architecture that is assumed in this thesis. This model includes the main components of parallel run-time monitoring designs. The architecture consists of two parallel processing elements, main and monitoring cores, which are loosely coupled with a FIFO buffer. The main core runs a computation task, called the main task, which performs the original function of the real-time system. The monitoring core receives a trace of certain main task instructions through a FIFO, and performs a monitoring task in parallel in order to check whether the system is operating properly. We refer to the main task instructions that need to be sent to the monitoring core as forwarded instructions or monitored instructions. The forwarded instructions are determined based on a particular monitoring technique, and are sent by the main core transparently without explicit instructions added to the main task. Figure 2.4 shows how an example main core pipeline can be modified to automatically forward information when instructions of certain types are committed. When an instruction commits, a comparator checks whether the opcode of the committed instruction is one that should be forwarded. If the instruction should be forwarded, then an enqueue signal is sent to the FIFO. Information about the instruction that is needed for monitoring, such as register values, memory locations accessed, etc., is stored in the FIFO. In this way, the detection and handling of monitoring events is handled in hardware, with no need to modify the main task.
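The commit-stage filter and FIFO handoff can be mimicked in software as follows. This is only an illustrative model: the opcode set and FIFO depth are hypothetical, and in hardware this logic is a comparator and an enqueue signal rather than code running on the main core:

```python
from collections import deque

MONITORED_OPCODES = {"LDR", "STR"}  # assumed set for a memory-access monitor
FIFO_DEPTH = 2                      # matches the small example in Figure 2.5

fifo = deque()

def on_commit(opcode, info):
    """Model the commit stage: returns True if the main core must stall."""
    if opcode not in MONITORED_OPCODES:
        return False               # not a forwarded instruction; nothing to do
    if len(fifo) >= FIFO_DEPTH:
        return True                # FIFO full: stall until the monitor drains it
    fifo.append((opcode, info))    # enqueue register values, address, etc.
    return False

assert on_commit("ADD", None) is False            # not forwarded
assert on_commit("LDR", ("r1", 0x1000)) is False  # enqueued
assert on_commit("STR", ("r0", 0x2000)) is False  # enqueued, FIFO now full
assert on_commit("LDR", ("r2", 0x1004)) is True   # would overflow: stall
```

A monitor model would pop entries from the other end of the FIFO, which is exactly the decoupling described next.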
The forwarded instruction triggers the monitoring core to execute a series of monitoring instructions. A different monitoring task is executed depending on the type of forwarded instruction. We refer to the collection of monitoring tasks as a monitor or a monitoring scheme. Different monitoring schemes seek to ensure different run-time properties. Monitors typically maintain metadata corresponding to each memory location (mem_metadata[addr]) and register (rf_metadata[reg_id]) of the main core. If the monitoring scheme detects an
[Figure 2.5 shows a pipeline diagram in which five instructions (LDR r1, [r3]; LDR r2, [r3, #4]; ADD r0, r1, r2; ADD r0, r0, #15; STR r0, [r4]) pass through the Main, FIFO, and Monitor stages over cycles 1 through 10, with the number of free FIFO entries dropping from 2 to 0 and the main core stalling on the last two instructions.]
Figure 2.5: Pipeline diagram of monitoring. The main core stalls due to the slower throughput of the monitor.
inconsistent or undesired behavior in the sequence of monitoring events, then an error is detected. We do not focus here on how errors are handled; depending on the monitoring scheme, the options include raising an exception, notifying the user, or switching to a simpler, more trusted main task [77].
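On the receiving side, the per-event dispatch of monitoring tasks can be pictured as a simple event loop. This is an illustrative software model, not the monitoring core's implementation; the handler interface (one function per forwarded opcode, returning False on a violation) is an assumption made for the sketch.

```python
# Illustrative model of the monitoring core's event loop: each forwarded
# instruction type dispatches to a different monitoring task.
def run_monitor(fifo, handlers):
    """Drain the FIFO, running the monitoring task registered for each event.

    `handlers` maps an opcode to a function that returns False when the
    event exhibits inconsistent or undesired behavior. Returns the first
    offending event, or None if the whole trace checks out.
    """
    while fifo:
        event = fifo.pop(0)              # dequeue the oldest monitoring event
        task = handlers.get(event["opcode"])
        if task is not None and not task(event):
            return event                 # an error is detected
    return None
```

How the returned error is then acted upon (exception, notification, fallback task) is left to the surrounding system, as discussed above.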
In order to decouple the execution of the main task and the monitoring task, a dedicated hardware FIFO is used to buffer monitoring events between the main core and the monitoring core. In the ideal case, the monitoring task operates in parallel with the main task. However, since every forwarded instruction of the main task takes several instructions to monitor, the monitoring core is not always able to keep up with the throughput of monitoring events generated by the main core. Thus, when the FIFO is full, the main core is forced to stall on monitoring events. Figure 2.5 shows an example execution where a series of instructions are monitored. In this simple example, the FIFO only has 2 entries and it takes the monitoring task 2 cycles to process a monitoring event. For simplicity, only the commit stage of the main core pipeline is shown. After the 4th instruction (ADD) is committed by the main core, the FIFO is full and the monitoring core is busy. Thus, in order to not miss information about a monitoring
event, the main core must stall before continuing execution and committing the 5th instruction (STR). These stalls due to the mismatch in throughput are a major source of performance overhead. We refer to these stalls of the main core due to monitoring as monitoring stalls and the number of cycles stalled as monitoring stall cycles. In the example code sequence shown, three monitoring stall cycles occur, two for instruction 4 and one for instruction 5.
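The throughput-mismatch effect can be illustrated with a small discrete-time model of the main core, FIFO, and monitoring core. The model below is deliberately simplified and not cycle-accurate to Figure 2.5 (it omits the one-cycle FIFO stage, so with the figure's parameters it counts two stall cycles rather than three), but it exhibits the same mechanism: stalls appear exactly when monitoring throughput falls behind commit throughput.

```python
# Simplified (not cycle-accurate) model of monitoring stalls: every
# instruction is assumed to be forwarded, and pipeline details are abstracted.
def simulate(num_events, fifo_depth, cycles_per_event):
    """Count monitoring stall cycles for a stream of forwarded instructions.

    Each cycle the monitor works on the head FIFO event (holding its slot
    until processing completes), then the main core commits the next
    instruction if a slot is free, or records a stall cycle otherwise.
    Returns (stall_cycles, total_cycles_until_monitoring_finishes).
    """
    occupancy = 0    # FIFO entries in use
    remaining = 0    # cycles left on the event being processed
    committed = 0
    stalls = 0
    cycle = 0
    while committed < num_events or occupancy > 0:
        cycle += 1
        # Monitoring core: start the head event if idle, then do one cycle of work.
        if remaining == 0 and occupancy > 0:
            remaining = cycles_per_event
        if remaining > 0:
            remaining -= 1
            if remaining == 0:
                occupancy -= 1          # slot freed when the event finishes
        # Main core: commit and forward, or stall on a full FIFO.
        if committed < num_events:
            if occupancy < fifo_depth:
                occupancy += 1
                committed += 1
            else:
                stalls += 1             # a monitoring stall cycle
    return stalls, cycle
```

Under this model, a deeper FIFO or a faster monitoring task (fewer cycles per event) eliminates the stalls, which is the trade-off the architectures in this thesis exploit.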
2.2.3 Example: Array Bounds Check
There are many possible monitoring schemes that can be implemented on this type of fine-grained parallel monitoring architecture, such as memory protection [93], information flow tracking [82, 33], control flow integrity checking [4], soft error detection [56], data-race detection [68, 73, 55, 11], etc. For example, an array bounds check (BC) [25] can be implemented in order to detect when software attempts to read or write to a memory location outside of an array’s bounds. This can be done by associating, with each array pointer, metadata that indicates the array’s base (start) and bound (end) addresses. On loads or stores through the array pointer, the monitor checks that the memory address accessed is within the base and bound addresses. In addition, this base and bound metadata is propagated on ALU and memory instructions in order to track the corresponding array pointers.
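A minimal software sketch of these BC metadata rules is given below. The event interface is an assumption made for illustration: register values are taken to arrive with each forwarded instruction, and only the register-file metadata table is modeled (a full monitor would also maintain mem_metadata for pointers stored to memory).

```python
# Sketch of the BC metadata rules; the event interface (register values
# delivered with each forwarded instruction) is assumed for illustration.
rf_metadata = {}    # register id -> (base, bound) of the array, or None

def on_malloc(dest_reg, ptr, size):
    # malloc returned `ptr`: tag the destination register with base/bound.
    rf_metadata[dest_reg] = (ptr, ptr + size - 1)

def on_alu(dest_reg, src_reg):
    # Pointer arithmetic (e.g. add): propagate the source register's metadata.
    rf_metadata[dest_reg] = rf_metadata.get(src_reg)

def check_access(addr_reg, addr_value, offset=0):
    # Load/store through addr_reg: check the effective address is in bounds.
    meta = rf_metadata.get(addr_reg)
    if meta is None:
        return True                     # not tracked as an array pointer
    base, bound = meta
    return base <= addr_value + offset <= bound
```

Replaying the operations of Figure 2.6 under these rules, the store str r3, [r1, #12] passes the check, while an access at offset 16 (one byte past a 16-byte array) would be flagged as a violation.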
Figure 2.6 shows an example pseudo-code segment, its assembly level instructions, and the corresponding monitoring operations for an array bounds check monitor. First, an array x is allocated using malloc (line 1). As malloc returns the array’s address in a register, the monitor associates base and bounds
Main Core (High-level)               Main Core (Assembly)                   Monitoring Core
1: int *x = malloc(4*sizeof(int));   1: call malloc                         1: rf_metadata[1] = {r1, r1 + 0xf}
                                        ; return pointer 0x12340000 in r1      // {base, bounds} of array
2: int *y = x + 2;                   2: add r2, r1, #8                      2: rf_metadata[2] = rf_metadata[1]
3: x[3] = 1;                         3a: mov r3, #1                         3a: rf_metadata[3] = NULL
                                     3b: str r3, [r1, #12]                  3b: if (r1 + 12 <