An Agile and Rapidly Reconfigurable Test Bed for Hardware-Based Security Features

by Daniel Smith Beard

Master of Science in Computer Information Systems, Florida Institute of Technology, 2009

Bachelor of Science in Engineering, Electrical Option, University of South Florida, 1980

A dissertation submitted to the College of Engineering and Computer Science at Florida Institute of Technology in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Computer Science

Melbourne, Florida
December, 2019

© Copyright 2019 Daniel Smith Beard
All Rights Reserved

The author grants permission to make single copies.

We, the undersigned committee, hereby approve the attached dissertation

An Agile and Rapidly Reconfigurable Test Bed for Hardware-Based Security Features by Daniel Smith Beard

Marco Carvalho, Ph.D.
Professor and Dean
College of Engineering and Science
Committee Chair

Stephen K. Cusick, J.D.
Associate Professor
College of Aeronautics
Outside Committee Member

William H. Allen, Ph.D.
Associate Professor
Computer Engineering and Sciences
Committee Member

Heather Crawford, Ph.D.
Assistant Professor
Computer Engineering and Sciences
Committee Member

Philip J. Bernhard, Ph.D.
Associate Professor and Department Head
Computer Engineering and Sciences

ABSTRACT

Title: An Agile and Rapidly Reconfigurable Test Bed for Hardware-Based Security Features
Author: Daniel Smith Beard
Major Advisor: Marco Carvalho, Ph.D.

Current general-purpose computing hardware and the software that runs on it have evolved over more than a half century from large mainframe systems in corporate, military, and research use to interconnected commodity devices more common than wrist watches. Computational power, storage capacity, and communication capabilities have increased in wonderful and staggering ways; however, when we read about the latest vulnerability or data breach it seems that cybersecurity is stuck somewhere between 1983, when Matthew Broderick first heard a synthesized voice ask “Shall we play a game?” [93], and 1988, when the Morris worm hit the Internet [116]. Multics [82] and Scomp [54] had a shot at establishing secure computing, but functionality, cost, and ease of use have largely trumped security so far. For the present, as Jaeger said, “. . . security features fail to protect the system in a myriad of ways.” [77] This study and research effort briefly surveys the roots of secure computing and the present vulnerabilities that contribute to insecurity, and presents technological changes that could help stem this tide. We have gleaned a collection of demonstrated security features that could be hardware-based and therefore hardware-enforced, but would require no adaptation of existing legacy applications beyond recompiling already-existing high-level source code. In this effort we demonstrate a prototype CPU with hardware-based security features that is amenable to FPGA or ASIC implementation, and provide a hardware testbed based on DARPA's Cyber Grand Challenge cybersecurity “experimentation ecosystem” [39]. This will answer the question of whether hardware-based security features can produce a significant security improvement in unadapted legacy C/C++ code, and provide a testbed for further evaluation and testing of hardware-based features.

Table of Contents

Abstract
List of Figures
List of Tables
Acknowledgments
Dedication

1 Introduction

2 Foundations
   2.1 Foci in Security
   2.2 Security Defined
   2.3 Problem Statement
      2.3.1 Narrowing the Focus – Security in Hardware
   2.4 Legacy Secure (Trusted) Systems
      2.4.1 Multics
      2.4.2 Honeywell Scomp
      2.4.3 Drawbacks
   2.5 A Modern Trusted Computer System Effort – CHERI
      2.5.1 Object-Capability Security Overview
      2.5.2 Object-Capability Hardware Enhancements
      2.5.3 Memory Protection in the Object-Capability Model
      2.5.4 CHERI Object-Capability Example
      2.5.5 Hardware-Software Integration
      2.5.6 Relevance to the Secure Processor
   2.6 Common Vulnerability Patterns for Modern Computers
      2.6.1 Definition of Terms
      2.6.2 From Attack to Intrusion
   2.7 Stack Based Buffer Overflows
      2.7.1 Stack Basics
         2.7.1.1 Stack Operations – Physical View
         2.7.1.2 Stack Operations – Computer Memory Representation
         2.7.1.3 Stack Width and Growth Direction
         2.7.1.4 Stack Operation in Procedure Calls – CALL, RET vs. PUSH, POP
         2.7.1.5 Stack Use for Parameters and Variables
      2.7.2 Stack Overflow Details
      2.7.3 Co-mingled Control and Data on a Common Stack
      2.7.4 ‘Reverse’ Stack Growth
      2.7.5 Stack Based Protection Techniques
         2.7.5.1 Stack Execution Prevention
         2.7.5.2 Stack Canaries
         2.7.5.3 Return Address Protection or Repair
         2.7.5.4 Reverse Stack
   2.8 Non-Stack Buffer Overflows
   2.9 Return- and Jump-Oriented Programming
      2.9.1 Gadgets
      2.9.2 Return-Oriented Programming Details
      2.9.3 Jump-Oriented Programming Details
      2.9.4 Control Flow Protection
   2.10 Code Injection
   2.11 Memory Protection
   2.12 Address Space Layout Randomization
   2.13
   2.14 Instruction Set Architecture
   2.15 Instruction Set Randomization
   2.16 Hardware-enhanced Authentication
      2.16.1 Random Number Sources
         2.16.1.1 Physical Uncloneable Functions
   2.17 Current State of the Art Summary

3 Secure Host CPU
   3.1 Introduction
   3.2 Secure Host CPU Design Features
      3.2.1 High Level Architecture
      3.2.2 Memory Architecture
      3.2.3 Register Architecture
      3.2.4 Stack Architecture
         3.2.4.1 Reverse Stack Growth
         3.2.4.2 Dual Stack
      3.2.5 Instruction Set Architecture
         3.2.5.1 LAND Group
      3.2.6 Instruction Set Randomization
   3.3 Field Programmable Gate Arrays
      3.3.1 Example Logic Functions in FPGAs
      3.3.2 FPGA Manufacture and Function Implementation
      3.3.3 Hardware Description Language
      3.3.4 Possible Alternatives to FPGAs
   3.4 Exception Handling
   3.5 Application Summary

4 Secure Host CPU Implementation
   4.1 Early FPGA Prototype
      4.1.1 C99 Emulator
   4.2 Review and Introduction
   4.3 CPU High Level Architecture
      4.3.1 Secure Host CPU Emulator
      4.3.2 Data Types in the C99 Emulator
   4.4 Memory Architecture
   4.5 Register Architecture
      4.5.1 Register Implementation
         4.5.1.1 Register Identifiers
      4.5.2 Flags Register (eflags)
   4.6 Stack Architecture
      4.6.1 Reverse Stacks
      4.6.2 Dual Stacks
   4.7 Instruction Pointer Management
   4.8 Instruction Set Architecture
      4.8.1 Instruction Word Overview
      4.8.2 Instruction Word Architecture
      4.8.3 Opclass
      4.8.4 Transfer Width
      4.8.5 Arguments, Operands, and Operand Types
      4.8.6 Operands 1 and 2 Differences
      4.8.7 Instruction Transfer and Argument Sizes
      4.8.8 Relocation Flag
      4.8.9 Instruction Word Binary Implementation
   4.9 Other Security Features of the ISA
      4.9.1 Jump/Land Flow Control Instructions
      4.9.2 Other Flow Control Instructions
      4.9.3 Instruction Set Density
      4.9.4 Instruction Set Randomization
         4.9.4.1 ISR Keys

5 CPU Testbed and Evaluation
   5.1 Linux Host
   5.2 Secure Host Tool Chain
      5.2.1 Secure Host CPU Compiler
         5.2.1.1 IR to Assembly Register Allocation
      5.2.2 Assembler
         5.2.2.1 Assembler Dictionaries
         5.2.2.2 Assembler Field Codes
      5.2.3 Relocating Loader
      5.2.4 Console Monitor/Debugger
   5.3 OS Support for the Secure Host CPU
   5.4 Performance Tuning of the Emulator
   5.5 Proof of Concept Demonstration
   5.6 Demonstration Results
      5.6.1 Invalid Instructions
      5.6.2 ROP and JOP Gadget Reduction
      5.6.3 Control Flow Protection
   5.7 Proof of Concept Demonstration Summary

6 DARPA CGC and DECREE
   6.1 DARPA Cyber Grand Challenge
   6.2 DARPA DECREE
      6.2.1 DECREE OS Syscalls
      6.2.2 DECREE Syscall Interface

7 Future Work and Concluding Remarks
   7.1 Architecture Retrospectives
      7.1.1 Patterning
      7.1.2 Concurrent Registers
      7.1.3 Additional Registers
      7.1.4 Instruction Set Architecture Changes
   7.2 Testbed Enhancements
      7.2.1 Toolchain
      7.2.2 Replacement OS or Microkernel
   7.3 Secure Host CPU in Real Life

References

A Ethics in Cybersecurity Research

B Use of IP-Core Devices from Untrusted Channels

C Secure Host CPU Instruction Set

List of Figures

2.1 Linux Security Modules Interface (from Jaeger [77] Figure 9.1)
2.2 CHERI Capability (from Woodruff [174] Fig. 2)
2.3 CHERI Capability Coprocessor Register Definitions (from Watson [170], Table 3.1)
2.4 Conceptual Stacks
2.5 Conceptual Stack Growth
2.6 Stack Use, Procedure Call Example
2.7 x86 Stack Orientation and Example Stack Frame
2.8 Program Control and Program Data on a Common Stack (Frame and Segment)
2.9 Stack Growth Alternatives
2.10 Buffer Overflow Exploit with Shell Code
2.11 Compiler-Generated Assembly With and Without Stack Canary
2.12 Conventional Stacks With and Without Canary
2.13 JOP Dispatcher Gadget (Bletsch [15] Figure 3)
2.14 Harvard Architecture (from Francillon [55] Fig. 1)
2.15 ISR Implementation (from Kc [83], Figure 1)
2.16 RBG Functional Model (from NIST SP 800-90A Rev 1 Fig. 1) [11]

3.1 Secure Host CPU Concurrent General Purpose Registers
3.2 Single Stack Example (Repeated from 2.8)
3.3 Fixed Key and Keystream Randomization
3.4 Look-Up Table-Based Randomization
3.5 ISR Block Diagram
3.6 Realization of Combinatorial Logic (from Swan [151] Fig. 1.(b))
3.7 NAND-NAND and NOR-NOR Latches, Stroud [148]
3.8 HDL Modeling Capabilities (Smith [144], Fig. 1)

4.1 Early FPGA Testbed Block Diagram
4.2 Early FPGA Prototype CPU
4.3 Secure Host CPU Register Architecture
4.4 Secure Host CPU eflags Register
4.5 Secure Host CPU Dual ‘Reverse’ Stacks
4.6 Secure Host CPU iword C99 struct
4.7 Secure Host CPU iword (C99 Format)

5.1 Secure Host Testbed Block Diagram
5.2 Secure Host Console -h (Help) Output
5.3 Secure Host Console: Waiting for Connection Request
5.4 Secure Host Console: Ready, In Halt State
5.5 Secure Host Console: User Completion & Shutdown
5.6 Secure Host CPU Assembler Output
5.7 Secure Host CPU iword (Assembler Format)
5.8 Secure Host Console: Monitor/Debugger Menu
5.9 Secure Host Console: Program Step Mode
5.10 Secure Host Console: Memory Display
5.11 Secure Host Console: Custom Code Disassembly
5.12 Secure Host Console: Stacks Display
5.13 Secure Host Console: Invalid Instruction Trap
5.14 Secure Host Console: Invalid eip Value
5.15 Secure Host Console: Missing landj Trap
5.16 Secure Host Console: Missing landj (Conditional) Trap
5.17 Secure Host Console: Missing land Trap With Invalid Instruction
5.18 Secure Host Console: Missing landc Trap
5.19 Secure Host Console: Missing landr Trap

List of Tables

3.1 Secure Host CPU Instruction Word Description

4.1 Resulting Secure Host CPU Register IDs
4.2 Secure Host CPU Register Type Code Suffixes
4.3 Secure Host CPU Instruction Word Description
4.4 Secure Host CPU Instruction Operand Flags
4.5 Secure Host CPU Instruction Width Flags

6.1 DECREE OS Syscall Prototypes
6.2 DECREE OS Syscall Format

C.1 Instruction Set Summary

Acknowledgments

I began graduate studies at a non-traditional age, when learning is still well worth the effort but our learning styles change [31]. Having experienced all four quadrants of the student/teacher vs. age matrix, I appreciate even more the knowledge and the extraordinary dedication and professionalism of the many, many teachers and professors under whom I was privileged to study. There are several I wish to thank by name:

Dr. Marco Carvalho, who graciously consented to be my advisor when my previous advisor returned to industry. He devoted time away from other important duties to work with an older student, and I am grateful for his active and energetic participation and especially his critical guidance and assistance in crystallizing the plan and focusing on the goal.

Dr. Richard Ford, who originally encouraged me to tackle this capstone project, patiently tutored and mentored an old horse, and taught me how to be a better student and a better teacher. I am grateful for his instruction and friendship, many valuable insights gained in the classroom and the labs, and many enjoyable hours in the sky.

Dr. William Allen, Dr. Heather Crawford, and Dr. Stephen Cusick, who took time away from their primary duties to actively participate in and contribute to the dissertation committee.

I am grateful for their willingness and effort, and for the advice and support they each gave throughout the process.

In addition to the above there are many other bright spots in the navigation constellation. Reflecting on the journey, I was divinely blessed by the influence of each of the following teachers, mentors, and advocates: Harriet Fether, Peter Knoke, Lloyd Price, Boyd Stephens, Royal “Bud” Weeder, and Kenneth Westbook.

Dedication

Justice Oliver Wendell Holmes, Jr. was a believer in the value of being functional and productive. In a radio address on his 90th birthday in 1931, Justice Holmes said, “The riders in a race do not stop short when they reach the goal. There is a little finishing canter before coming to a standstill” [125]. On occasions when asked why I was interested in a Ph.D. at this stage in life, it would have been an honest response to reply “please pardon me if I canter for a while”. Without supporters there would not be a race, but the true fans are still there after the cantering. I dedicate this to the person who has been both of these the longest, has been the most committed of all, and still understands the canter . . .

To Vickie, the joy of my life, who has given me encouragement and unfailing support even in challenging times, and showed patience and humor as I pursued the dreams: I could not ask for better.

Chapter 1

Introduction

We hear the term ‘ubiquitous computing’, yet it is hard now to find an arena where computing is not ubiquitous. The proliferation of “smart” devices that automate our lives exposes us at the same time to new levels of digital risk via the very attributes that make them so useful. Simply by possessing even minimal computing power and network connectivity, the devices that comprise the Internet of Things (IoT) [39] are increasingly targets for attack; add more sophisticated sensor capabilities and increased storage of personal and lifestyle data, and the ante rises even further. Referring to the IoT, DARPA said:

“. . . we are building this connected society on top of a computing infrastructure we haven't learned to secure. There's evidence to show that while digital insecurity is growing, it is also making its way into devices we can't afford to doubt.” [39]

It is true that significant efforts are being made to address malicious software (malware) that leverages weaknesses in computing systems; however, attackers enjoy the benefits of experimentation and learning, so malware is not a static threat. The proliferation of IoT-connected commercial off-the-shelf (COTS) devices that are designed for manufacturability and ease of use rather than security expands the attack surface greatly; every connected device is a potential target of attack.

The emphasis of this research project is the development and implementation of a more secure processing platform (a Secure Host) based on a device we will refer to as the Secure Host CPU. The Secure Host is not a substitute for software practices that enforce good standards and sound defensive techniques; these will always be required and must be advanced to keep pace with developing threats. The Secure Host is meant to be a more secure and reliable platform for the execution of existing and new applications, providing a hardware foundation on which more secure processing services can be built.

Such a processor would require at least some very low-level architecture changes and could be implemented using synthesized logic hosted on a field programmable gate array (FPGA), with many common functions such as memory and I/O controllers implemented using existing IP cores1. Consideration of the vulnerabilities in current and legacy software and the availability of synthesis tools for FPGA-based devices led to the following research questions: Does current general purpose processing hardware contribute to insecurity in ways that could be remedied or eliminated by a more secure host processor using currently available technology? Could a secure host implementation provide a substantial improvement in computer security in the near term without modification of existing code other than recompilation?

1 ‘IP core’ refers to intellectual property modules (e.g., memory or Ethernet controllers) provided by FPGA or EDA vendors for integration with customer designs. See Section 3.3.2 for additional information.

Chapter 2

Foundations

This section discusses the state of the art with respect to computer vulnerabilities that relate in some way to ‘features’ in the underlying hardware other than hardware Trojans [99, 162]. We exclude hardware Trojans as beyond scope but take note of the caution concerning IP-core devices from untrusted channels [69] (see Section 3.3 and Appendix B). The term ‘feature’ is intended to simply denote an artifact or attribute resulting from a design or implementation choice. In context it should become clear that our focus is on ‘features’ that are part of the design or implementation (not ‘bugs’ or production errors), but that do have a negative impact on security. For emphasis we will restate one of our introductory comments from Chapter 1: the Secure Host is not a substitute for software practices that enforce good standards and sound defensive techniques. Nor can the Secure Host defend against “social engineering” such as password stealing [22], or against supply-chain attacks [100] such as “form-jacking” [5], where malicious code is inserted in intermediary hosts for the purpose of stealing or diverting information.

Within this chapter, we will examine background issues and specific weaknesses or vulnerabilities of current general purpose computing software and hardware, and follow that with a review of selected previously published mitigation techniques in Section 2.6.

2.1 Foci in Security

Jargon is universal in any specialty field and computer security is no exception. We frequently encounter families of terms of the form ‘abcSec’ and ‘xyz security’; some have formal, widely recognized names (e.g., ComSec, OpSec, InfoSec, TSec) and some may not be widely recognized outside of specialized interest areas (e.g., CovComm, an element of TSec addressing ‘fact of’ vs. content). Some aspects of security such as ‘physical security’ are self-descriptive but so broad that they require an entire separate volume in a security plan. Cybersecurity and computer security are such terms. “Everybody knows what they mean” but an explanation is hardly possible without a definition of scope. Such is Compusec [132] and ‘computer security’, and so we begin.

2.2 Security Defined

Security in the context of computer processing services is generally regarded as having the attributes of confidentiality, integrity, and availability. For information or services to possess these attributes, the systems from which these are received must have certain essential hardware and software qualities supported by a broad range of physical and procedural controls. The Oxford British & World English dictionary defines cybersecurity as:

5 “The state of being protected against the criminal or unauthorized use of electronic data, or the measures taken to achieve this” [126]

Thus, cybersecurity (the state) is the goal. In trying to reach it, the security practitioner identifies external threats and internal flaws, and deploys available resources to eliminate or mitigate them in priority order. It is unlikely that cybersecurity will ever be achieved so much as approached asymptotically for any system containing data of any value, if there is any interconnection with the outside world. Clearly, security is ‘not working’ in the COTS systems we use and depend on [88]. Computers are reliable in terms of service continuity, but they are insecure and vulnerable to attackers; moreover, increasingly sophisticated attackers are attracted by the value of information in computer systems and emboldened by society's dependence on them. Any list will quickly become dated, but between late 2013 and early 2015 data breaches occurred in the U.S. Government Office of Personnel Management (OPM) [37], Target Department Stores [53], Home Depot [156], Sony Pictures [124], and the Houston Astros [42], to name but a few examples out of many thousands. Granted, some of these may have been at least partially enabled by poor security practices such as weak passwords [42], but serious exploits often rely on a combination of weaknesses including hardware- and software-based vulnerabilities.

Our goal for this project is expressed in the title; our interest is in identifying weaknesses of current systems and advancing hardware-based features that provide an inherently more secure processing platform. Since exploits typically result from a combination of weaknesses, small individual improvements may have a large impact in improving security. And finally, compliance with Kerckhoffs' Law1 [165] is key for hardware-based security features if they are to be of value in COTS devices and systems.

2.3 Problem Statement

It is unlikely that hardware features alone could defeat a malicious actor who possesses a physical or privileged vector to the machine; even an unwitting vector is sufficient, as demonstrated by the gatekeepers who allowed the original Trojan Horse through the city gates. Such threats are important, but they lean toward procedural controls and are beyond the scope of this effort, so we bookmark the concern while considering other threats.

2.3.1 Narrowing the Focus – Security in Hardware

Functionality arises from connectivity, as does the potential for abuse, so we focus on threats from remote actors. In Linux vernacular, we assume anything achieved through standard input/output (stdin/stdout/stderr) to be ‘remote’, without regard to whether the link is a console terminal or a network socket exposed to the Internet. These I/O channels are apertures for attack, especially when actors have the advantage of multiple engagements over a long enough period to conduct statistical or experimental attacks.

1Specifically, Kerckhoffs' Second Principle — “The system must not require secrecy and can be stolen by the enemy without causing trouble.” [165]

Our focus is hardening a host computer against remote attackers while enabling the maximum use of existing operational legacy [90] software systems. Our intended approach in achieving these goals is reflected in the title phrase “hardware-based security features”, in which we will use hardware features in the CPU (central processing unit) to prevent the remote execution of unintended system functions. We believe this hardened CPU could be effectively employed in general purpose processors intended for deployment either as a stand-alone system hardened against remote threats or as a front-end processor for more specialized computers. Utility and acceptance would be improved if the hardened CPU can perform these functions without the need for firewall-type rules, and especially if legacy software can be reused with a minimum of accommodation. In the next section we examine examples of secure processing systems, their applicability to the problem, and the causes of their lack of current widespread use.

2.4 Legacy Secure (Trusted) Systems

2.4.1 Multics

Any discussion of security in hardware and secure systems should probably begin with Multics (Multiplexed Information and Computing Service) [106]. Multics began as a research project and saw its first operational site at MIT in 1965 hosted on GE-635 hardware [57]; the last publicly-known site (the Canadian Department of National Defense) was shut down in October of 2000. Multics was originally intended to be a large multi-user time-sharing system. Security was a primary goal in part based on operational experience from a time sharing system at MIT (CTSS) [82] where users with competing or conflicting interests required protection from each other.

Multics was a general-purpose computing system but was of interest in secure and multi-level computing well into its life cycle. A 1974 Electronic Systems Division (AFSC) report [82] deemed Multics “not certifiably secure” but described it as a base from which a secure multi-level system could be developed. A 1976 MITRE report [14] related to the development of secure computer systems refers to production of a secure Multics based on a specific mathematical model, and stated that specific model rules “have been adapted to the evolving Multics security kernel design.”

Multics security features were grouped into hardware, software, and operating procedures. The Multics hardware provided master and slave modes and used segmented virtual memory. Segment descriptors contained absolute addresses and access rights information; these were only accessible to the hard-core supervisor in master mode. User processes ran in slave mode and communicated with the supervisor only through a tightly controlled object model. Constraints were enforced by supervisor code in the original Multics version, and parameter checking was moved to hardware in the Honeywell 6180 CPU [106] and following Multics versions (the 6180 being the second-generation CPU first fielded in January 1972 [168]). The supervisor/user interface and all other security features of Multics were software- or procedure-based.

Since our effort is a testbed for hardware-based security features, the hardware elements of Multics are of the most interest: specifically, data execution prevention via hardware bits, segmented virtual addressing, which prevents address overflow off the end of a segment, and a stack orientation that placed buffer overflow into unused stack frames when overflows did occur [82]. Deeper treatment of these will be given later in this section.

By the time Multics ended its 35-year life span, it had seen service in a wide range of installations including academic, commercial, and military. Except for military applications, security by design was not a strongly desired feature, particularly when the capabilities of the hardware could only be realized through strict application of supporting software and procedural controls.

2.4.2 Honeywell Scomp

Similar in security attributes to Multics, the Honeywell Secure Communications Processor (Scomp)2 originated in a joint effort called Project Guardian intended to further enhance Multics security for specialized communications processing [54, 77]. Scomp [118] was eventually evaluated A1 under the Department of Defense Trusted Computer System Evaluation Criteria in 1983 [118]. Class A1 was the highest numerical rating available and denoted a Verified Design with security assurance derived from formal specification and verification of the system security mechanisms.

2.4.3 Drawbacks

Secure operating systems of the period provided much better security than commercial systems, but they suffered performance challenges due to hardware limitations, especially in anything other than limited applications. Systems manufacturers were faced with providing more secure but poorly performing systems, or relaxing security attributes to provide better performance in general purpose processing [77].

2The Honeywell Secure Communications Processor is alternately referred to in authoritative literature as Scomp (e.g., [54]) and SCOMP (e.g., [118]).

Neither Multics nor Scomp survived as commercially viable products, and general purpose computing largely continues to sacrifice security for performance and ease of manufacture and use.

2.5 A Modern Trusted Computer System Effort – CHERI

After Multics and Scomp faded from commercial use in favor of less secure general purpose computing platforms, security challenges continued to increase as the Internet expanded [39, 29]. While interest in secure systems remained in the research, military, and academic communities, contemporary society's increased dependence on information processing systems and the value of the information stored in them have increased the impact of insecurity in commercial systems. CHERI is a comprehensive effort to meet these contemporary challenges [174]. Watson [170] (December 2014) presents a long-running project for development of the Capability Hardware Enhanced RISC Instructions (CHERI) architecture.

The CHERI effort comprises a Trusted Computing Base (TCB) processor which extends current instruction set architectures (ISAs) with additional security primitives including fine-grained memory protection and object-capability security [170]. As a contemporary, advanced platform, CHERI warrants a closer look at its strengths, limitations, and relevance to our effort. The remainder of this section provides an examination of CHERI's major attributes based (unless otherwise noted) on the Watson [170] and Woodruff [174] papers.

2.5.1 Object-Capability Security Overview

The object-capability security model for CHERI places it in the general category of ‘capability systems’ or ‘capability-based systems’. These systems can be described in general terms using Kain's basic definitions [80]. In a capability-based system:

• Data is operated on by processes which are governed by reference monitors and policies.

• Processes operate on behalf of a user who is logged in with specific security attributes.

• A segment is a group of data that have identical security attributes.

• Security attributes define, at a minimum

– for data: security level and access permissions, and

– for processes: security level, domain [18, 80], and (access permissions or) identification of the user for whom the process is executing.

• A capability is an object describing a segment, its security attributes, and applicable access rights or control information.

• A reference monitor is a mechanism that checks each attempted access to an object by a process to verify that the access conforms to policy for each (process, object, access mode) triple, where

– mode is one of (read, write, modify/append, or execute).

• Policies define restrictions for access based on the security attributes of the process, the segment a process is attempting to access, and the mode of the access. (A minimal C sketch of these elements follows this list.)
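To make these definitions concrete, the sketch below models a segment, a capability, and a reference-monitor check as plain C data structures. It is only an illustration of the terms above, not part of CHERI or of the Secure Host design; the names (segment_t, capability_t, ref_monitor_allows, and the permission flags) and the single level-dominance rule shown are our own assumptions.

#include <stdbool.h>
#include <stdint.h>

/* Access modes from the (process, object, access mode) triple.           */
typedef enum { PERM_READ = 1, PERM_WRITE = 2, PERM_APPEND = 4, PERM_EXEC = 8 } perm_t;

/* A segment: a group of data with identical security attributes.         */
typedef struct {
    uint64_t base;        /* start of the segment in memory               */
    uint64_t length;      /* extent of the segment                        */
    uint8_t  level;       /* security level of the data                   */
} segment_t;

/* A capability: describes a segment plus the access rights it grants.    */
typedef struct {
    const segment_t *seg; /* segment this capability describes            */
    uint32_t rights;      /* bitwise OR of perm_t values                  */
} capability_t;

/* Security attributes of a process acting for a logged-in user.          */
typedef struct {
    uint8_t  level;       /* security level granted at logon              */
    uint32_t user_id;     /* identity of the user                         */
} process_attr_t;

/* Reference monitor: every attempted access is checked here.             */
static bool ref_monitor_allows(const process_attr_t *p,
                               const capability_t *cap,
                               perm_t mode)
{
    if ((cap->rights & (uint32_t)mode) == 0)
        return false;                         /* right not granted        */
    if (mode == PERM_READ && p->level < cap->seg->level)
        return false;                         /* no 'read up'             */
    return true;
}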

12 For the capability-based system, policies must be reduced to rules compatible with the automated reference monitor. For example, consider the following example ‘policies’ for a Linux file system:

• Files have Read, Write, and Execute permissions

• Each file permission can be assigned to User (Owner), Group, Others, and/or All

• Administrators have all rights to all files

• Users are the owner of files they create (or that an Admin has assigned to them)

• Users may modify their own file permissions and assign permissions for their files to their Group, Others not in their Group, and/or All

• Users are unique, are members of All, and can be assigned to specific Group(s)

Within the Linux Security Modules (LSM) framework is a reference monitor interface, such as shown in Figure 2.1, for which a variety of reference monitor implementations could be made [77]. Assuming ‘our’ version of Linux implements a reference monitor, every attempt to access a file passes through the reference monitor where, in kernel space, a decision is made to allow or deny the attempt based on the policy rules above and using the process security attributes (which also identify the user), the capabilities (the security attributes of the file), and the requested mode of access. For example, if the access attempt is a write, write access is only granted per the algorithm of Listing 2.1.

Figure 2.1: Linux Security Modules Interface (from Jaeger [77] Figure 9.1)

Listing 2.1: Example Access Policy Rule Algorithm (Pseudo Code)

IF (the process is executing on behalf of:
    (the User or file owner)
    OR (an Administrator)
    OR ((a member of Group-n)
        AND (write access is enabled for Group-n for the file))
    OR ((any user)
        AND (write access is enabled for All for the file)))
THEN (Grant Access)

3We are aware of Boebert's assertion [17] that an unmodified capability system cannot enforce the ss-property of the military security model or the *-property of the Bell-La Padula rules [14]; if it is germane to the discussion we believe this issue was adequately answered by Kain [80].

In a real-world system, security policy (and therefore reference monitor rules) and operation of the reference monitor would be much, much more complex. Consider an extension of the above example, adding a level of complexity to accommodate a DoD three-tiered collateral classification system (Top Secret, Secret, and Confidential)3. In addition to the previous rules, we would add, at a minimum, the following rules to allow appropriate sharing and never ‘read up’ or ‘write down’ with respect to security classification levels [14] (a minimal sketch of such a check follows the list):

• Capabilities are assigned to data segments at creation and include a security classification

• Users are assigned a security classification at logon based on the LOWEST of their (persistent personal credentials) OR (role granted at logon)

• Write means create new or modify (with Read access to segment contents)

• Append means add new data only (no Read access to any segment contents)

• Users can Read at or below their current level

• Users can Write at their current level

• Users can Append at or above their current level
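As a rough illustration of how those level rules could be reduced to reference-monitor code, the following sketch checks the three access modes against a subject level and an object level. It is our own simplified rendering of the ‘no read up, no write down, append at or above’ rules listed above, under the assumption that levels are simple ordered integers; the names are hypothetical and nothing here is taken from an actual system.

#include <stdbool.h>

/* Collateral classification levels, ordered low to high.                 */
typedef enum { LVL_CONFIDENTIAL = 1, LVL_SECRET = 2, LVL_TOP_SECRET = 3 } level_t;
typedef enum { ACC_READ, ACC_WRITE, ACC_APPEND } acc_mode_t;

static bool level_rules_allow(level_t subject, level_t object, acc_mode_t mode)
{
    switch (mode) {
    case ACC_READ:   return object <= subject;  /* read at or below: no 'read up'       */
    case ACC_WRITE:  return object == subject;  /* write only at own level: no leakage  */
    case ACC_APPEND: return object >= subject;  /* append at or above: write-only input */
    }
    return false;
}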

Our example simplified object-capability system would combine these 7 new rules with the 6 previous ‘stock’ Linux rules for a more sophisticated system that allows users to Read their own and lower classifications, Write only to their own classification and never lower (to prevent leakage or intentional exfiltration to lower-level users), and Append (only) to higher levels (in order to provide write-only input to higher level segments).

An example of the latter case would be section chiefs who provide data to a unit readiness report when individual section reports are Unclassified (For Official Use Only), but any aggregation of reports (and therefore the unit report) is designated Confidential.

The next element we add to our object-capability foundation is the definition of comprehensive policies for a real-world system involving many types of data in many combinations, many different system functions including operations on attached archive and output devices, network interfaces, and multiple levels and roles for users. To be brief, the system implementer has to provide a general enough capability framework so as not to limit the system security administrator, and after the policies are defined they must be reduced to explicit and unambiguous rules and optimized for performance. And finally, when we go back to the definition of a capability, we are reminded that a capability is also a data object with its own access rights. Since unauthorized capability modifications could compromise the entire system, capabilities must be differentiated from other data object types by some means such as unique tags or highly privileged locations or instructions.

This brings us to consideration of the support functions CHERI provides to enable a high performance capability-based system: object-capability hardware enhancements and memory protection in the object-capability model. Following the hardware discussions we will close the section with hardware-software integration and the relevance of CHERI to our effort.

2.5.2 Object-Capability Hardware Enhancements

CHERI is referred to as a hybrid capability model [174] because a conventional instruction set architecture (ISA) and memory management unit (MMU) are extended with the addition of a capability coprocessor which operates using a register file of 32 256-bit-wide capability registers that hold capabilities similar to Multics' segment descriptor words.

Figure 2.2: CHERI Capability Coprocessor (from Woodruff [174] Fig. 2)

The capability coprocessor interacts with CHERI's Bluespec Extensible RISC Implementation (BERI) [171] pipeline in 4 of the 6 pipeline stages using a capability forwarding register file contained in the coprocessor shown in Figure 2.2. Each capability register provides base and length fields of 64 bits each, reflecting the address space, and (as of 2014 [174]) a 31-bit permissions vector. Each permission is represented by a ‘1’ bit indicating the permission is allowed; currently-designated permissions include load/store/execute for data, and load/store for capabilities. Remaining permissions were in experimental use for items including sandboxing supported by CHERI primitives, protected domain crossing, and coordination of interactions between other devices such as graphics processing units (GPUs) and their userspace code.

Capability protection and capability manipulation are implemented through 21 CHERI instruction-set extensions. When capabilities are modified, unforgeability is maintained by only allowing instructions to disclaim a permission or reduce a privilege. By allowing this limited capability manipulation in user mode, the overhead associated with context switching is avoided, providing improved performance.
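For orientation, a capability register of the sort just described might be modeled in C roughly as below. This is only our reading of the fields named in the text (64-bit base and length, and a 31-bit permission vector with load/store/execute bits); the exact CHERI register layout is defined in the CHERI documentation [170, 174] and differs in detail, so the struct, field, and flag names here are illustrative assumptions.

#include <stdint.h>

/* Illustrative permission bits: a '1' means the permission is granted.   */
enum {
    CAP_PERM_LOAD_DATA  = 1u << 0,
    CAP_PERM_STORE_DATA = 1u << 1,
    CAP_PERM_EXECUTE    = 1u << 2,
    CAP_PERM_LOAD_CAP   = 1u << 3,
    CAP_PERM_STORE_CAP  = 1u << 4
    /* remaining bits: experimental uses (sandboxing, domain crossing, ...) */
};

/* Rough model of one 256-bit capability register.                        */
typedef struct {
    uint64_t base;      /* start of the described memory segment          */
    uint64_t length;    /* extent of the segment                          */
    uint32_t perms;     /* permission vector (low 31 bits used)           */
    /* sealing/type and other fields are not modeled in this sketch       */
} cheri_capability_sketch_t;

/* Monotonic manipulation: rights may only be given up, never added.      */
static inline cheri_capability_sketch_t
cap_drop_perms(cheri_capability_sketch_t c, uint32_t drop_mask)
{
    c.perms &= ~drop_mask;   /* disclaim permissions; they cannot be regained */
    return c;
}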

The CHERI hardware extensions [174] noted in this section are straightforward in how they enable the object-capabilities reviewed in the previous section, but to be complete: in the hardware capability register implementation, the permission vector bit positions are assigned to specific permissions by the system architect to create segment and process capabilities. Capabilities are tagged to data segments (see the next section for memory implications), and the capability coprocessor matches the process capabilities to the requested access segment location (base plus length) and mode to enable the memory access and/or writeback phases of the processor pipeline. Performance is enhanced by the coprocessor's capability forwarding register file, which enables concurrent parallel operation of the coprocessor with the CPU instruction pipeline (Figure 2.2).

The remaining hardware enhancements for the object-capability model are the memory architecture and memory protection scheme that complement the operation of the capability coprocessor.

2.5.3 Memory Protection in the Object-Capability Model

The CHERI hardware architecture complements conventional page-based memory protection with byte-level memory protection [174] by providing a general-purpose memory management unit (MMU) plus an unforgeable fat-pointer [89] representation of memory accesses via the hardware capability registers discussed in the previous section. Tagged physical memory stores segment capabilities for hardware matching of segment capabilities to process capabilities. The byte-level, fine-grained memory protection CHERI implements to support segment capabilities for the object-capability model can be separated from memory management of virtual memory spaces to simplify integration.

CHERI's tagged physical memory provides differentiation of pointers from other data elements, and safely allows combining data and capabilities within common data structures by providing protection of in-memory capabilities. During the memory access process, CHERI performs capability addressing on the physical memory space prior to virtual address translation; this allows each process to be self-contained in its own virtual capability system.

2.5.4 CHERI Object-Capability Example

The following example of CHERI's object-capability operation is substantially taken from Watson [170] §3.1 (Capability Registers). This specific example is for a protected procedure call. As a review, recall from section 2.5.1 that a capability is a descriptor for a segment of memory that contains security and control information and access rights for the segment of memory it describes. The memory segment may be data or a process. In addition to access rights, CHERI's capability contains base and length information to delimit the memory section described.

A protected procedure call is made using CHERI's CCall instruction. The format of the CCall instruction is CCall cs, cb [, selector], where cs is a capability for an object, and cb is a capability for the methods of the object's class. CCall cs, cb invokes a handler which compares the types of the sealed executable (cs) and non-executable (cb) arguments. If the types match, the previous PCC and IDC register values (Figure 2.3) of the capability coprocessor are saved, cs and cb are unsealed, and their capabilities are placed in PCC and IDC (for cs and cb respectively).

Figure 2.3: CHERI Capability Coprocessor Register Definitions (from Watson [170], Table 3.1)

The procedure executes much like in a conventional processor, except that for each instruction executed, the capability coprocessor (Figure 2.3) runs concurrently with the processor to validate operations and memory addresses at various stages of the processor instruction pipeline. If a capability disagreement occurs, the instruction is truncated and a hardware exception is raised. Figure 2.3 lists a number of capability registers dedicated for exception handling as well as kernel capability registers. At the completion of the protected procedure, a protected procedure return instruction (CReturn) invokes a handler which restores the PCC and IDC values that existed prior to the procedure call (CCall), and the protected procedure returns to the caller.
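Restated as pseudo-C, the protected call sequence just described might look roughly like the following. This is only a paraphrase of the CCall/CReturn behavior summarized from Watson [170]: the types and helper names are ours, a single save slot stands in for the trusted call stack, and the real mechanism is implemented in hardware and a privileged handler rather than in ordinary C.

/* Hypothetical register state for the sketch.                            */
typedef struct { int sealed; int type; } cap_t;   /* capability, greatly simplified     */
static cap_t PCC, IDC;                            /* program-counter and invoked-data capabilities */
static cap_t saved_pcc, saved_idc;                /* save area for the caller's context */

/* CCall cs, cb: enter a protected object invocation.                     */
static int ccall(cap_t cs, cap_t cb)
{
    if (!cs.sealed || !cb.sealed || cs.type != cb.type)
        return -1;              /* type mismatch: raise an exception instead */
    saved_pcc = PCC;            /* save the previous PCC and IDC values      */
    saved_idc = IDC;
    cs.sealed = 0;              /* unseal the pair                           */
    cb.sealed = 0;
    PCC = cs;                   /* executable capability governs the code    */
    IDC = cb;                   /* non-executable capability governs the data */
    return 0;                   /* procedure now runs under these bounds     */
}

/* CReturn: restore the caller's capabilities.                            */
static void creturn(void)
{
    PCC = saved_pcc;
    IDC = saved_idc;
}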

2.5.5 Hardware-Software Integration

The most significant challenge for adoption of the CHERI hardware with existing programs is that the protection features implemented for the object-capability model require non-trivial adaptation of the code rather than a simple recompile [174]; the issue revolves around capability-qualified pointers.

We learned that for pointers of this ilk there are no standards [149], so it is understandable that code adaptation for CHERI will not be trivial. Applying what we have learned about object-capabilities, we understand that capability-qualified pointers contain not only an address but, as data items, have associated capabilities that must specify, at a minimum, legal bounds. We see this in the CHERI capability register as base and length [174], consistent with ‘base’ and ‘extent’ of Suffield's canonical form for fat pointers [149]. In addition to the treatment in the CHERI documentation, we find references to similar structures as “fat pointers” [149] and “C++ pointers to member functions” [177]. In reverse order, and as an aside to illustrate the lack of standards, Microsoft C++ pointers to member functions can be 4, 8, 12, or 16 bytes long [177] and are subject to casting errors which in turn can lead to vulnerabilities. Since our resource CHERI documentation only mentions capability-qualified pointers in the context of C code (C++ references are only mentioned in passing), we will concentrate on C. Fat pointer implementations in C replace pointers with structures that include valid access ranges with the current pointer value [149]. Complementary code is added for each pointer read or write operation to assure that the current or new value is within legal bounds; an attempt to dereference outside of legal bounds generates an error. Suffield [149] reports there have been numerous variations in implementation with different meta-data as capabilities, and significantly, fat pointer systems have large memory requirements; for software implementations their performance penalties are “measurable”.
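As a concrete illustration of the fat-pointer idea described above, the sketch below wraps a raw pointer with its legal bounds and checks every dereference. This is a generic software rendering for illustration only, not the CHERI representation and not any particular published implementation; the names (fatptr_t, fat_read) are our own.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* A 'fat pointer': the current value plus the base and extent it may address. */
typedef struct {
    uint8_t *cur;     /* current pointer value                             */
    uint8_t *base;    /* lowest legal address                              */
    size_t   extent;  /* number of legal bytes starting at base            */
} fatptr_t;

/* Checked dereference: any access outside [base, base + extent) is an error. */
static uint8_t fat_read(fatptr_t p)
{
    if (p.cur < p.base || p.cur >= p.base + p.extent) {
        fprintf(stderr, "bounds violation\n");
        abort();                      /* analogous to a hardware exception */
    }
    return *p.cur;
}

int main(void)
{
    uint8_t buf[16] = {0};
    fatptr_t p = { buf, buf, sizeof buf };
    p.cur += 32;                      /* stray past the end of the buffer  */
    (void)fat_read(p);                /* trapped here instead of overreading */
    return 0;
}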

Specific to code adaptation for CHERI, Woodruff [174] observes that while the C standard permits most operations required of capability-qualified pointers, “practical C implementations tolerate undefined pointer behaviors that CHERI capabilities will not”. We are concerned that in addition to the effort required to retrofit any significant code base, there also is the danger that flaws in the code adaptation may create additional vulnerabilities. It is reported that unadapted code in the same address space with capability-adapted code leaves the adapted code vulnerable. Sandboxing allows safer use of unadapted code, but the process is complex and has performance impacts. When taken as a whole, we view the CHERI effort as a significant step forward in secure computing, but see the integration of existing and new software with CHERI hardware as a barrier to implementation, particularly for legacy code.

2.5.6 Relevance to the Secure Processor

CHERI points to a “. . . gradual deployment of CHERI features in existing software, offering a more gentle hardware-software adoption path” [170]. References to “gradual” and “gentle” give us pause, as we are certain that attackers are inclined to be neither. One of the barriers to adoption of CHERI is the hardware-software security model and the corresponding software changes required to take advantage of built-in security features [174]4.

4Software adaptation is listed as a Limitation for CHERI [174]; unadapted MIPS code on CHERI is still vulnerable.

While the CHERI approach shows great promise as a complementary solution for future high-end desktop and server applications, our approach diverges to address a secure host for more immediate deployment in targeted network-attached commodity systems ranging from industrial SCADA devices to consumer automation products. Our effort seeks to provide host hardware and a tool chain that can meet the following goals:

• A proof of concept hardware platform for demonstration and test of hardware-enhanced computer security,

• A rapidly reconfigurable test bed for continuing evolutionary laboratory development,

• A system that delivers reasonable processing performance in targeted applications requiring network connectivity, and

• A system that could be deployed in the near term in targeted applications with no more than recompilation of existing source code.

2.6 Common Vulnerability Patterns for Modern Computers

The following sections are intended to provide context for more detailed treatment of specific weaknesses to be addressed in this project. Section 2.6.1 defines additional terms important to the discussions which follow. Section 2.6.2 provides a very general overview of current typical computer security or cybersecurity issues using conceptual examples of attack patterns. This foundation is in preparation for detailed discussions to be presented in Sections 2.7 through 2.15.

2.6.1 Definition of Terms

We previously defined security and cybersecurity, and gave a very general outline of our intended research emphasis. Before beginning detailed discussions we should extend that groundwork by defining additional important terms and narrowing their scope; then we can describe common cybersecurity issues and the patterns they exhibit using these terms. We are sensitized to the importance of clear and unambiguous terminology by recent issues in legislation caused by poor terminology or ambiguous language [47, 109]. We will use the Glossary of Common Cybersecurity Terminology published by the US Department of Homeland Security [112], paraphrase the definitions, and add subjective notes to focus and limit the discussion. For completeness, we will examine the terms attack, exploit, hazard, intrusion, threat, and vulnerability.

Attack: ‘An attempt to gain unauthorized access to or compromise the integrity of a system’ [112]. An attempt will either succeed or fail, and results will be repeatable when the attack is attempted again in the same manner and under the same conditions. The word ‘attempt’ suggests an active process; therefore the participation of a bad actor (the attacker) is implied.

Exploit: ‘A technique to breach the security of a network system’ [112]. We amplify ‘technique’ to be a process, procedure, or sequence of steps. A software tool that implements such a technique is also called an exploit. Since an exploit is a ‘technique to breach’, the term exploit implies success against one, maybe some, but not necessarily all targets. Also, we note that under the amplification of ‘technique’, exploits may be recursive or contain other exploits.

So far we can observe that an attack employs an exploit against a system, and attacks are successful when they use the right exploit against a system.

Additionally, we can speculate that more than one exploit might produce a successful attack on a given system.

Hazard: ‘A natural or man-made source or cause of harm or difficulty’ [112]. Flood, fire, and civil insurrection are hazards. Since they do not suggest active misuse of computing resources we could disregard this general class of hazard; however, ‘source of harm’ could also be (e.g.) a password cracking program somewhere on a black hat server, so we will retain the term for now.

Intrusion: ‘An unauthorized act of bypassing the security mechanisms of a(n) . . . information system’ [112]. No matter what the symptoms or consequence of an intrusion, we regard an intrusion as the pointy end of a successful attack. An intrusion is the result of a successful attack, which is the result of the application of the ‘right’ exploit.

Threat: ‘A circumstance or event that has the potential to exploit vulnerabilities’ [112]. This brings us to consideration of the distinction between a hazard and a threat. If we propose a progression of hazard to threat as, for example, ‘a flood’ to ‘a hurricane will make landfall here within 12 hours’ (respectively), the vulnerability would be susceptibility to water damage, and the security breach would be loss of availability (reference the definition of security in Section 2.2). We can disregard this class of threats as physical security and out of scope, but the illustration suggests colloquially that ‘a threat is a hazard in action’. A more relevant example might combine ‘downloadable attack scripts’ as a hazard with ‘disgruntled employee’ as a threat. Our interpretation of this is that a hazard (H) is leveraged by a potential attack or attacker (A) to produce a threat (T): H ∧ A = T. If we have anything of value (or an attacker thinks so), we accept that there will be an attack at some point (A = True).

This produces a practical equivalence between hazard and threat (H ∧ A = T; A = True; H ∧ True = T; H = T), so within our limited scope we propose to drop hazard and replace it with threat.

A few words on the distinction between attack and threat are in order. Standard risk management prioritizes risk mitigation based on likelihood of occurrence and severity of consequences [50]; mitigation is applied for all risks above certain values of likelihood or severity, and above specified combined scores. For our purposes, since a threat “has . . . the potential to exploit . . . ” [112], it has some likelihood or non-zero probability of occurrence. If a system breach is unacceptable (i.e., severity of consequence is high), responsible risk management dictates that we apply mitigation to credible threats. We take this as a tacit assumption that any credible cyber threat will eventually become an attack; given such equivalence, we could drop the term threat, considering threats to be (future) attacks. This simplifies our internal working list so far to attack, exploit, and intrusion.

Vulnerability: ‘A characteristic or specific weakness that renders a system open to exploitation by a given threat or susceptible to a given hazard’ [112]. ‘Vulnerability’ represents a very broad range of physical or operational attributes that contribute to insecurity. Frequently vulnerabilities are not defects in manufacture or the result of poor design; e.g., an Ethernet port is installed as a feature, but when an attacker gains access from somewhere on the local or extended network, the connection is a vulnerability. Utility trumps design intent, and regardless of the purpose or intent of an attribute, characteristic, function, or capability, it is a vulnerability when it enables an intrusion.

Given the foregoing, we can say that attacks employ exploits that leverage vulnerabilities to produce intrusions; threats and hazards are useful terms but muddy our waters.

When an exploit can be paired with an available complementary vulnerability, an attack would be successful. We stipulate no such thing as an inconsequential intrusion, because even the most innocuous intrusion is by definition a subversion of security policy; moreover, a seemingly innocuous intrusion may represent a vulnerability for another exploit. Therefore, our insecurity space can be strictly defined or delimited by complementary exploit/vulnerability pairs. References to attacks or attackers become not much more than handles for examining scenarios. Descriptive references to intrusions may serve to indicate the gravity of a successful attack or denote interim states for a more complex exploit, but a given type of intrusion may result from multiple exploit/vulnerability pairings. No matter the characterization of the attack or intrusion, exploit/vulnerability pairs define security (S) or insecurity (¬S), and there is no middle (S ∨ ¬S) [61].

2.6.2 From Attack to Intrusion

Security can be compromised by a range of things having little to do with the computer system or the data processed or stored in it. For example, a user may respond improperly to a phishing attack, not realizing it is a ruse to gain confidential information; when a criminal uses this information to empty the victim's bank account it would no doubt be classified as a cyber crime. In another example, an employee downloads confidential information and provides it to a competitor; again, the characterization would likely be cyber crime. However, insecurity that arises from such things as human failings, poor policies, or lack of physical security is beyond the scope of this effort. Our interest is in processor hardware features (vulnerabilities) that attackers' techniques (exploits) leverage in order to perform unauthorized functions on a computer system (intrusions).

We should note that features which become vulnerabilities are not necessarily design or manufacturing defects; necessary and properly implemented features can be co-opted to enable or accomplish the intrusion. Explicit examples will be given later; for now we will provide two example attack flows in fairly general terms.

OpenSSL is an open source effort providing a cryptographic library and implementing the Internet Secure Sockets Layer (SSL) and Transport Layer Security (TLS) protocols [114, 115]. In OpenSSL version 1.0.1, code was added to support a Heartbeat function to allow streamlined checking of a connection, making heartbeat request packet handling a default configuration. To check the status of a connection, a client sends a heartbeat request packet including an arbitrary payload and the payload length; the server returns a copy of the payload from its own local memory, transmitting the payload copy byte by byte for the length of the payload specified in the client request. An oversight in the implementation of this feature produced the Heartbleed vulnerability5 described by Carvalho, et al. [26]. If a client sent a short payload but specified a false length of up to the maximum payload size of 64KB [62], the server would use the client payload data copy in server memory plus the contents of the adjacent contiguous server memory to return a payload of the size requested. This very simple “buffer overread” vulnerability was serious in that the contents of adjacent memory could include highly sensitive data such as server keys or passwords. To echo the terminology of Section 2.6.1, in this example the malicious client was the attacker, the defective heartbeat request packet was the exploit, the server's buffer overread was the vulnerability, and the unintended exfiltration of server memory contents was the intrusion.

5 Heartbleed was corrected in OpenSSL 1.0.1g [26, 109].
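To make the flaw concrete, the following simplified C sketch shows the vulnerable pattern; it is not the actual OpenSSL source, and the structure and field names are illustrative assumptions only.

    #include <string.h>
    #include <stdlib.h>

    /* Illustrative only; not the real OpenSSL data structures. */
    struct heartbeat_req {
        unsigned short payload_len;   /* length claimed by the client            */
        unsigned char  payload[];     /* payload bytes actually received follow  */
    };

    /* 'received' is the number of payload bytes actually read from the network. */
    unsigned char *build_response(const struct heartbeat_req *req, size_t received)
    {
        unsigned char *resp = malloc(req->payload_len);
        if (resp == NULL)
            return NULL;

        (void)received;   /* consulted only by the corrected check shown below */

        /* Vulnerable pattern: the copy length comes from the request itself.
         * If payload_len exceeds 'received', the copy reads past the request
         * buffer and leaks adjacent server memory (the buffer overread).      */
        memcpy(resp, req->payload, req->payload_len);

        /* Corrected pattern (the post-1.0.1g behavior, in spirit): discard
         * requests whose claimed length exceeds the bytes actually received:
         *     if (req->payload_len > received) { free(resp); return NULL; }   */
        return resp;
    }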

The Heartbleed vulnerability was not complex. Its exploit was straightforward if you neglect the effort of examining the payload returned by the server, and the root cause of the vulnerability (since corrected) was the lack of a verification of the length of the payload requested.

A more complex attack pattern example begins in a similar fashion in that the linear nature of buffers in memory is exploited by an attacker. Where Heartbleed overread a buffer, this next class of attacks leverages the ability to write past the bounds of a buffer and is referred to as a “buffer overflow” attack. The attack and some of its variants will be discussed in detail later in this section; for now we wish to highlight three aspects:

1. The general flow of the attack,

2. The dependence of the attack on multiple features or weaknesses, and

3. The fact that an attack can have multiple successful outcomes.

The attack requires a computer process which reserves a fixed-size buffer on the stack and accepts data from a user or client. The attacker engages the process, but instead of supplying a valid input, supplies an arbitrarily long byte string intended to be larger than the available buffer length. Since current hardware lacks context information for buffers, the process allows the buffer to fill and then overflow into adjacent memory. The general exploit tool, then, is simply to overflow a buffer. Specific exploits of this class depend partially on how much excess data is supplied, what it contains, and the arrangement of memory contents adjacent to the buffer. In the simplest case, the buffer overflow attack is able to overwrite and corrupt adjacent or nearby program flow control data on the stack. When the subroutine or function returns, the corrupt return address causes program control to pass

to a random location in memory and crash, or the process is terminated by a memory access fault. Such process crashes or abnormal terminations may cause loss of server responsiveness as part of a denial of service attack, or function as a precursor to other attacks. The exploit is buffer overflow, the vulnerability is lack of adequate protection of critical adjacent memory, and the intrusion is interference with the normal function of the process (and server).

Another possible outcome results from the attacker either having a priori knowledge of the buffer size and arrangement of data in adjacent memory, or being able to determine it experimentally. The attacker crafts a payload of the right length for his purposes, and content designed not to simply corrupt adjacent control flow information but to substitute it with different values that point to additional data in his payload. This additional data in the payload is then interpreted by the processor as instructions to execute arbitrary commands such as shell scripts that launch other programs available on the system or open a remote terminal session at the command prompt. This example exploit leverages a buffer overflow, injects payload data as code or shell commands, and subverts the process's designed control flow; this exploit can result in a range of intrusions including execution of arbitrary programs or commands, or unauthorized access to the system command prompt.

A final variant of the ‘hijacked control flow’ above begins similarly, but if the attacker possesses the address information for process binaries or library functions already in computer memory, the buffer overflow payload may contain little more than an address sequence that instructs the processor to traverse memory through sections containing carefully selected “gadgets”. These “gadgets” (discussed further in section 2.9.1) are short sequences of machine instructions that, by virtue

of the order they are entered, execute arbitrary functions that were never part of the original design of the program or library binaries. Exploit/vulnerability characterization for this attack is similar to the previous attack class, but in this case the intrusion is the ability to execute arbitrary functions not designed into existing binaries, and to do so without the injection of machine code.

So we see that attacks can have a variety of techniques or exploits that rely on a combination of weaknesses or ‘features’ contained in the processing system. A wealth of papers provide research on well-focused ‘point’ hardware- or software-based solutions to specific vulnerabilities, but very little research identifies efforts to combine multiple features for more comprehensive security. The following sections discuss relevant published vulnerabilities and previously published solutions or mitigations. Our goal is to develop hardware-based security features that, at least in combination, provide a demonstrable improvement in security.

2.7 Stack Based Buffer Overflows

This section covers a class of stack based buffer overflow vulnerabilities that lead to interference with the intended program control flow. The bounds of a buffer can be abused by over- or under-reading or over- or under-writing such as we described for Heartbleed [26] in the previous section; however, in this section our interest is focused on the stack as a unique structure within the computer and the variables, and in particular buffers, that are placed there. We begin with a foundation discussion of stack operations, how they are used by programs, and how stacks are implemented in hardware. Then we will review specific stack based buffer vulnerabilities documented in the literature and steps previous researchers

Figure 2.4: Conceptual Stacks

have proposed to mitigate them. Since the x86 architecture [72, 103] is probably the most well-known and widely used, we will concentrate on it in our examples.

2.7.1 Stack Basics

Operation of the stack in a modern computer is easily taken for granted and may be considered minutiae for high level programming; however, stack operation is a key part of many vulnerabilities. This section on Stack Basics may be skipped by experienced assembly programmers, but the information it contains is important to the understanding of stack-based vulnerabilities.

2.7.1.1 Stack Operations – Physical View

The stack is a Last-In-First-Out (LIFO) queue much like a spring-loaded push-down stack of plates in a cafeteria. Plates are pushed onto the top of the stack when they are added and popped from the top of the stack as they are used; the last plate pushed will be the first plate popped. The x86 machine instructions for pushing and popping items onto and off of the stack are PUSH and POP respectively. Conceptual diagrams of physical and in-memory computer stacks are shown in Figure 2.4.

2.7.1.2 Stack Operations – Computer Memory Representation

The computer stack is a common data structure where items can simply be PUSHed to the stack in a specific sequence and POPped in reverse order at any time later. The usefulness of PUSH and POP is to easily store values in linear order and retrieve them in reverse order later. Consider evaluation of the expression:

x = (a + b) ∗ (c + d)

Intermediate results for the first sum could be PUSHed onto the stack while the second sum is calculated, then POPped for the multiplication step. The computer takes care of indexing, so the programmer is generally not concerned with stack pointer management; he only needs to adhere to his own convention for order of retrieval.

The processor (or CPU) maintains a Stack Pointer (called esp in the x86 processor in 32-bit mode) as shown in Figure 2.4(b). PUSH and POP operations are straightforward; esp points to the last item PUSHed, so PUSH adjusts the value of esp to point to the next open slot, then writes the value to be stored to the location pointed to by esp. A POP reads the value pointed to by esp, then adjusts the value of esp so it points to the position of the next item down the stack. The POPped location still contains the last value written but is considered abandoned and will be overwritten by the next PUSH operation. While PUSH and POP operations are faster (require fewer clock cycles) than random memory access, computer programs commonly engage memory stack space in a random access fashion for temporary storage of local variables. Details are provided in section 2.7.1.5, but for now it is sufficient to recognize that random access to the contents of an indexed stack such as Figure 2.4(b) can be made if the offset from a stack reference pointer such as esp is known.
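The behavior just described can be modeled in a few lines of C. This is only a toy model for illustration (an array stands in for the stack segment and an index stands in for esp); it is not how the hardware is implemented, and the downward growth direction shown here is discussed further in section 2.7.1.3.

    /* Toy model of PUSH and POP on a downward-growing stack: PUSH
     * pre-decrements the stack pointer and then stores; POP reads and
     * then increments.  One array slot stands in for one stack-width
     * unit of memory; no overflow or underflow checks are shown. */
    #define STACK_SLOTS 64

    static unsigned int stack_mem[STACK_SLOTS];
    static unsigned int sp = STACK_SLOTS;   /* empty stack: sp just past the base */

    void push(unsigned int value)
    {
        sp = sp - 1;             /* move toward 'lower addresses'         */
        stack_mem[sp] = value;   /* then store the value at the new slot  */
    }

    unsigned int pop(void)
    {
        unsigned int value = stack_mem[sp];  /* read the last item PUSHed     */
        sp = sp + 1;                         /* abandon the slot; it will be  */
        return value;                        /* overwritten by the next PUSH  */
    }

For example, evaluating x = (a + b) ∗ (c + d) could push(a + b), compute c + d, and then multiply by the POPped intermediate result.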

Figure 2.5: Conceptual Stack Growth

2.7.1.3 Stack Width and Growth Direction

Two items are left to cover in order to complete basic stack operations before we look deeper into how the stack is used: stack width and stack growth direction. Initially, our conceptual stack width was implied to be ‘one unit’ wide. The x86 stack width is actually 16 or 32 bits wide as determined by the current operating mode. Increment and decrement operations on the stack pointer are done in stack-width increments by the CPU to maintain proper boundary alignment for items on the stack.

Alternate stack growth (i.e., PUSH) directions are possible for a CPU implementation: from higher toward lower memory addresses (such as the x86) or from lower toward higher memory addresses. x86 stack growth direction was fixed by the CPU architects and is not controllable by the programmer, but stack direction is germane to stack overflow vulnerabilities so we will cover it here. Figure 2.5 is a reference diagram for the discussion; note that the orientation of the items on our reference stack has been reversed from Figure 2.4. This is for consistency

with conventional stack frame diagrams (e.g., [72]) and the illustrations in the remainder of this document. Figure 2.5 depicts the two stack growth alternatives. In Figure 2.5a the stack base is at a relatively higher memory address and the stack grows toward lower addresses as additional items are PUSHed. In Figure 2.5b the orientation of stack growth to memory addresses is reversed.

In Figure 2.5, lower numbered items were all PUSHed before higher numbered items so lower numbered items appear ‘above’ the higher numbered items on the stack; however, since esp contains a memory address, Figure 2.5a would require that esp be decremented before write (or pre-decremented) by a PUSH operation while Figure 2.5b would require that esp be pre-incremented. This will become significant when we look at the relationship of ‘caller’ (main program) and ‘callee’ (function or procedure) data.

The choice of stack growth direction is made by the processor architect during CPU design, and stack pointer adjustments for PUSH and POP operations are handled internally by the CPU accordingly. The x86 stack growth is from higher toward lower memory addresses as in Figure 2.5a; while the choice was arbitrary, it allowed the stack and heap areas to grow toward each other from opposite ends of memory and had greater utility in the early days of computing when random access memory was limited. In high level language programming, the coder need not be concerned with growth direction. PUSH-POP operations are handled in hardware as determined by the CPU architecture and compilers are target architecture-aware and handle content addressing transparently.

2.7.1.4 Stack Operation in Procedure Calls – CALL, RET vs. PUSH, POP

Structured code uses subroutines and functions called procedures for repetitive tasks that can be written once as a named block of code and invoked, or called, by name when needed. Procedures can be called by a program's main procedure or from within another subroutine. When a procedure is called, the procedure's code is executed before the processor resumes the program at the next instruction after the procedure call (for now we will ignore parameters and returned values). Processors track their place in a program by reference to an instruction pointer; in the x86, the instruction pointer is called eip. When an instruction is fetched from memory for execution the processor updates eip to point to the next instruction awaiting execution. At procedure call the value in eip is incremented to point to the next instruction after the procedure call and saved on the stack, then the address of the procedure is placed in eip to transfer program flow to the called procedure (we will examine a graphic of this process in the next section, Figure 2.6). At the end of the procedure, the return address is retrieved from the stack and placed in eip to resume program flow in the procedure's caller. Procedure calls can be nested to an arbitrary number of levels, and the stack is always the source of the return address from a procedure back to its caller.

When the program's return address is stored on and retrieved from the stack for procedure calls, PUSH and POP are not used. These are data transfer instructions and cannot be used directly on eip [72]; instead, CALL and RET (RETurn) instructions are provided as program control flow instructions. CALL increments eip to the address of the next instruction after the CALL, stores the value in eip on the stack in the same manner as a PUSH, then sets eip to the address of the

beginning of the procedure to transfer program flow to the procedure. When a

RET instruction is executed at the end of a procedure, the RET retrieves the return address value from the stack in the same manner as a POP and places it in eip to transfer program flow back to the caller. As we said before, this sequence can be nested to an arbitrary depth. At each caller/callee juncture, the return address to the proper point in the caller is saved or retrieved, and the only difference between a 1st and nth level call is the depth of the current stack.

2.7.1.5 Stack Use for Parameters and Variables

There are a number of calling conventions for 32-bit C including cdecl, stdcall, Gnu, fastcall (Microsoft, Gnu), thiscall (Microsoft), and Watcom [51]. These calling conventions define register usage, whether arguments to a function are passed in registers or on the stack, order of arguments on the stack, stack alignment, whether a return pointer is passed on the stack or in esi, and whether the caller or the function performs stack cleanup. For this example we will assume the cdecl convention, which specifies (in part) that arguments are pushed onto the stack right to left, ebp (among others) is a callee-save register, and the caller performs stack cleanup. Figure 2.6 provides a procedure call example with code fragments in C, fragments of the resulting assembly code, and a graphic of the stack frame6 within the stack segment near the end of the called procedure.

In this example we assume main will call procedure Func with two integer

6 Stack frames are also called activation records.

Figure 2.6: Stack Use, Procedure Call Example

arguments; no return value is expected. At the C call Func(0, 255) line the compiler will insert instructions to push the arguments onto the stack in right-to-left order. The assembly call Func line will increment eip to the next line, store eip's value on the stack (after the esp decrement), and place the address of Func: in eip to transfer processing to the Func procedure. Upon entering the assembly at Func:, the procedure pushes ebp onto the stack to preserve the value and loads the value of esp into ebp for use as the frame pointer. The stack pointer, esp, is decremented by the size of the local variables to allocate room on the stack for them and reposition esp to allow Func to use the stack for push/pop operations or procedure calls of its own. Func now ‘knows’ the addresses of its local variables and input arguments as address offsets from ebp.

Func completes its processing with an epilogue that restores esp to its value at call by loading the value previously saved in ebp, and restores the ebp value at call by popping old ebp from the stack. The ret instruction pops the return address (ret addr) from the stack and places it in eip to resume program execution in the caller at the instruction following call Func. Since the calling convention is cdecl, the caller is responsible for removing the calling arguments from the stack. As an aside, the convention of saving ebp at call as a frame pointer effectively makes the stack a singly linked list, and the call stack can be traced back by following ‘old ebp’ values.
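Since Figure 2.6 is not reproduced in this text, the C side of the example it describes looks roughly like the sketch below; the body of Func is an assumption added only so the local-variable allocation has something to allocate.

    /* Sketch of the caller/callee pair described above.  Under cdecl the
     * caller pushes 255 first and then 0, and removes both after Func returns. */
    void Func(int a, int b)
    {
        int local1 = a + b;   /* locals live in Func's stack frame, addressed */
        int local2 = a - b;   /* as negative offsets from ebp                 */
        (void)local1;
        (void)local2;
    }

    int main(void)
    {
        Func(0, 255);         /* compiles (unoptimized, 32-bit) to roughly:   */
                              /* push 255; push 0; call Func; add esp, 8      */
        return 0;
    }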

2.7.2 Stack Overflow Details

Stack based buffer overflows arise from exploits that write beyond the end of a buffer contained in the local variables section of the stack frame. An example of a so-called ‘stack smashing’ exploit is described by Levy [113, 140]. The ‘stack smashing’ moniker reflects the initial phase of the exploit, and a ‘stack smashing’ exploit can end in different types of intrusions, from corruption of return addresses to delivery of payloads designed to launch shell scripts or applications, or enable return- or jump-oriented programming intrusions (section 2.9). To describe this exploit, we start with the structure of a simple stack frame as shown in Figure 2.7. This stack frame depicts a simple procedure, or function, with no parameters and one local variable, a 1024-byte buffer called buf. Local variables are stored on the stack. When SomeFunc is called, the return address is stored on the stack (return address) by a CALL SomeFunc instruction in the calling program. Upon entry, SomeFunc pushes the value in ebp onto the stack to save it (ebp), copies the value in esp to ebp as a frame pointer (FP), and allocates stack

Figure 2.7: x86 Stack Orientation and Example Stack Frame

space for buf by subtracting the size of buf from esp. The variable buf (buf[0]) can now be accessed by dereferencing esp. On exit (ret), SomeFunc sets esp to the value in FP (ebp) in preparation for the return to the calling program (ret), pops ebp to restore its value at call, and returns to the caller via the address in return address.

The value of understanding the stack frame is this: access to buf is under the control of the code in SomeFunc, and if buf is a line input buffer, it is under the control of the input device or function (e.g., keyboard, socket read buffer, etc.) used or called by SomeFunc. If the input function of this example writes more than 1024 bytes to buf, the next 1 to 4 bytes overwrite and corrupt the old ebp value on the stack (ebp), and the next 5 to 8 bytes after 1024 overwrite and corrupt return address. At best, when SomeFunc returns and the value in return address is placed in eip by ret, the process will fault or crash due to invalid random data in return address. At worst, if the attacker knows the length of buf, an input string or byte sequence can be constructed to overwrite the value in return address with a different value that is useful to the attacker such

as the address of injected malicious code or data such as shell code, the beginning of an address chain for a return-oriented programming exploit, or the address of a gadget dispatcher for a jump-oriented programming exploit. Code injection and return- and jump-oriented programming will be covered in more detail in sections 2.9 and 2.10. In a nutshell, this stack buffer overflow vulnerability can be summarized thusly:

Standard stack frame arrangement places a buffer space below the procedure's return address (at a lower memory address). As the buffer is filled, it grows toward the return address. When an attacker is allowed to overflow the buffer with a large enough input byte string, the attacker's input overwrites the return address.

We can now revisit a general statement from section 2.6.2 and add specificity: if an attacker has a priori knowledge of the size of buf and its distance from return address or is able to determine them experimentally, he is able to construct an input string that will place arbitrary values of his choosing on the stack to replace return address. In doing so he has already hijacked program flow. What he can do with this vulnerability will be covered in the sections on Return- and Jump-Oriented Programming (section 2.9) and Code Injection (section 2.10).
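In C source terms, the vulnerable shape described above is as small as the sketch below. SomeFunc and buf are the names used in this section; the use of scanf("%s", ...) as the unbounded input routine is an assumption made only for illustration, since %s with no field width places no limit on the length of the input.

    #include <stdio.h>

    /* A fixed-size stack buffer filled by an input routine that performs no
     * bounds checking.  Input longer than the buffer runs past buf, over the
     * saved ebp, and into the stored return address in SomeFunc's frame. */
    void SomeFunc(void)
    {
        char buf[1024];        /* local variable placed in SomeFunc's stack frame */

        scanf("%s", buf);      /* unbounded write into a 1024-byte buffer         */
    }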

2.7.3 Co-mingled Control and Data on a Common Stack

From the previous discussion of stack operation, we clearly see the use of data stacks for temporary storage of intermediate results and local variables, and for passing parameters and results between caller and callee processes. Control flow information, specifically procedure return addresses, is also stored on the same stack. Co-mingling program control information and procedure data within a common structure as is done in modern computers such as the x86 is inherently

Figure 2.8: Program Control and Program Data on a Common Stack (Frame and Segment)

risky, as we saw in the previous section. The term ‘common stack’ is used here strictly to denote co-mingled control and data within a stack frame such as shown in Figure 2.8. ‘Common stack’ also extends to a common stack segment as shown in the stack segment graphic of Figure 2.8, in which case corruption of multiple caller stack frames or information leakage from ‘retired’ activation records below the stack pointer are also possible. Whether processes create new or separate stack segments or contiguous stack frames are stored in a single linear section of memory, the problem still remains that within a single stack frame, return addresses are adjacent to memory used by a program for data elements or structures including input parameters, buffers, pointers, and variables. Some of the stack elements are control flow data operated on by the

CPU during calls and rets; the rest are under the control of procedure code, or

worse, user input buffers subject to erroneous or malicious input. Under- or over-writing a data structure results in change or corruption of the adjacent memory area. In the best case, when a return address is corrupted, the process is likely to crash; at worst, the ‘corruption’ is a carefully designed scheme to surreptitiously write a different address over a previous return address in order to hijack program control flow. Clearly, user-accessible machine instructions or procedures that have write access to memory areas in proximity to control data are vulnerabilities, and the seriousness of a vulnerability increases when the exact relationship between the stack-based data and control flow information is known or can be discovered by an attacker.

2.7.4 ‘Reverse’ Stack Growth

In section 2.7.1.3 we briefly mentioned alternate stack growth directions and in 2.7.2 illustrated how a local buffer in stack memory could overflow and overwrite the procedure's return address. In the worst case, the overwrite is malicious and results in subversion of program control flow, opening the door for further exploit and deeper intrusion. Figure 2.9a depicts buffer growth in the x86 architecture. The stack base is at a relatively higher address, and items stored on the stack fill the stack segment toward lower address values. Local variable buf[] has an intended maximum size and a specific space allocated for it. The origin of the buffer

(buf[0]) is at the lowest memory address allocated and the buffer fills in the same direction an array normally would, toward higher addresses. As buf[] fills, it writes toward the ebp and return address values stored on the stack, overwriting both during the buffer overflow. On the other hand, Figure 2.9b depicts a ‘reverse’ stack. The stack base is at a

Figure 2.9: Stack Growth Alternatives

relatively lower address and the stack grows toward higher address values just as an array would. In this case, however, the buffer's origin (buf[0]) is the next free address after the stored ebp, and as the buffer fills it grows away from the stored ebp and return address values. Even on a severe buffer overflow, the process's return address value is preserved so program flow will not be compromised. This, incidentally, was the direction of stack growth in Multics and was one of the three Multics hardware-based security features used [82]. We should point out that a reverse stack is not a panacea; if the stack frame of Figure 2.9b is not the only active frame in the current stack, the possibility of buffer overflow from a caller's frame exists. For example, if a called routine contains a strcpy() function where the destination buffer is in an active caller's stack frame, a sufficiently large strcpy() overwrite of the destination buffer will corrupt the current stack frame (thereby demonstrating the “law of conservation of misery”7). Nonetheless, a ‘reverse’ stack is an uncommon architecture in modern processors; its use would increase the diversity in the computing ecosystem and decrease the certainty that

7 Professor Horace Gordon, USF College of Engineering, circa 1977.

a given scripted attack would result in a successful intrusion. Before we depart the reverse stack discussion, it is worth noting that in the early days when RAM was more limited the x86 stack convention served a useful purpose. Resident code was loaded in low memory addresses followed by application code and heap; with the stack based in the higher memory addresses, the stack and heap could grow toward each other without conflict until the intervening memory was exhausted [9].

2.7.5 Stack Based Buffer Overflow Protection Techniques

Before we delve into stack protection techniques, we should point out that buffer exploits, stack or otherwise, leverage buffer vulnerabilities. To recap terminology from section 2.6.1, an attack can be an unsuccessful attempt; an exploit is a “technique to breach” that, we assume, achieves success under some given conditions, however limited they may be. Attacks or attackers use exploits, and when we use the terms ‘attack’ or ‘attacker’ in this and following sections it brings the assumption of specific relevant exploits, and the success or failure of the attack will be apparent in context. Further, we recognize that buffers also exist outside of the stack; we will address that as a separate issue in section 2.8. For now, we are still under the general heading of Stack Based Buffer Overflows.

Thus far we have been purposely vague about what happens after an attacker (or for that matter, a benign malformed input) overflows a stack-based buffer. We will cover more details in following sections, but for now, we should note that stack protection can have multiple dimensions. Unless an attacker wishes only to crash a vulnerable input process by overwriting the stack with random data, the attacker must do at least two things for most exploits: replace the legitimate

return address on the stack with a different (but useful) address of his own, and include a payload of additional data. As we will learn later, the payload can include address lists (section 2.9) and/or machine instructions (section 2.10). This brings us to examination of various methods system designers and researchers have implemented or proposed to reduce stack-based vulnerabilities and harden the stack against attack. From this point forward, protection mechanisms tend to be specialized for given payload type(s) and not necessarily limited only to the stack.

2.7.5.1 Stack Execution Prevention

We previously covered the mechanics of how a buffer overflow starts without following it all the way through to an intrusion. Stack-based buffer overflow exploits are often referred to as ‘stack smashing’. An early and well-known paper on this technique, “Smashing the Stack for Fun and Profit” [113], was written by Elias Levy under the name Aleph One [140]. This paper walks through the creation and delivery of an exploit that allows an attacker to break out of an input procedure to a shell command prompt. More ominously, this command prompt will operate at the privilege level of the vulnerable program, so if the vulnerable process is highly privileged, so is the attacker's access to the machine. Very briefly, Levy's exploit uses the technique explained in section 2.7.2 to populate a buffer with ‘shell code’ (a short instruction sequence that executes the Linux execve(‘/bin/sh’) function to produce the command prompt), overflow the buffer, and overwrite the return address on the current stack frame with the address of the shell code. When the input procedure ‘returns’, instead of returning to the caller the processor jumps to the shell code and the attacker receives his ill-gotten command prompt. The simplified diagram of Figure 2.10 illustrates the buffer before input, then shows

Figure 2.10: Buffer Overflow Exploit with Shell Code

the shell code, a ‘don't care’ field up to the stack frame's return address, and a hijacked return address that ‘returns’ control flow to the shell code at the beginning of the buffer. We refer you to Levy's paper [113] for details on determining lengths and addresses and eliminating nulls (‘\0’) from the input, but the diagram makes the point that the buffer input included the shell code and overflowed the buffer sufficiently to overwrite the procedure's return address on the stack.

One of the assumptions of this type of exploit is that the processor can execute the shell code, and this leads to the protection mechanism of non-executable memory (specifically, a non-executable stack). The attacker's replacement for the return address will be interpreted as an address by the process's return instruction because that is one of the stack's explicit purposes, but if the return address transfers to data memory that is designated as execution protected memory (in this case the stack), the CPU will fault instead of continuing execution. This is the purpose of NX (for No eXecute) protection; not to prevent the buffer overflow, but to prevent the buffer data (with its overflow) from being executed as code. NX, or the NX bit (because of its frequent one-bit representation), is implemented under other names such as Enhanced Virus Protection (EVP) in AMD

processors [2, 3], Execute Disable (XD) in Intel processors [73], and Windows refers to the combined OS and hardware implementation as Data Execution Prevention (DEP)/NX, or DEP/NX [7, 66]. While a no-execute stack would blunt the type of stack-smashing exploit described above, McGregor [101] points out that marking the stack segment non-executable still allows return address redirection into code already in executable program memory. We will examine this further in section 2.9. Kc presented a means of making the stack and the heap “effectively” non-executable [83] by software that monitors user code for calls to the system or libc. This approach has some functional and performance limitations and would not protect the system from all return-oriented programming (ROP) other than return-to-libc; further, it is not clear that it would protect against any jump-oriented programming (JOP) exploits. See section 2.9 for discussions on ROP and JOP exploit techniques. The issue of execution protection for the stack is complicated by legacy software. For example, since Linux allowed execution of instructions on the stack in the past, many legacy binaries and shared libraries either exhibit or assume this behavior.

The Linux man page for ld(1) (2019-05-08) includes the option “-z execstack” to mark an object as requiring an executable stack. Linking is done to the “lowest common denominator”, i.e., if a single file requires an executable stack it will be executable for the entire program [104, 154]. In this case, using Linux kernel boot parameters such as noexec or noexec32 [97] may result in these programs not executing, and therefore, to use legacy code a conventional processor would be unable to guarantee a non-executable stack.
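The effect of NX protection can be observed from user space with a short C sketch like the one below. This is a demonstration under stated assumptions only: an NX-capable processor, a POSIX system that applies the protection, and deliberately omitted error handling; casting a data pointer to a function pointer is itself undefined behavior and is done here solely to show the fault.

    #include <sys/mman.h>
    #include <string.h>

    int main(void)
    {
        size_t len = 4096;

        /* Map a page that is readable and writable but *not* executable. */
        unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        memset(buf, 0xC3, len);                  /* fill with x86 'ret' opcodes */

        void (*fn)(void) = (void (*)(void))buf;  /* treat the data as code      */
        fn();   /* on an NX-enforcing system this faults (e.g., SIGSEGV)
                   instead of executing the injected bytes                      */

        munmap(buf, len);
        return 0;
    }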

2.7.5.2 Stack Canaries

We recall that to overflow a stack-based buffer to the stack frame return address, the overflow must overwrite intervening data. In the stack diagrams we have used in this section, that would include the old frame pointer (saved ebp) and any other local variable(s) between the buffer and the return address. A Microsoft C compiler-supported technique for stack protection or at least detection of overflows

(the ‘/GS’ compile-time option [20]) is to place a sentinel value (canary8 or cookie) on the stack just below the stack frame return address and verify it for correctness before procedure return. The listings shown in Figure 2.11 show parallel assembly code fragments of the same routine that were generated in Visual Studio 2013 with optimization off; one listing is without canaries enabled and one is with canaries. Lines 3-6 in the “With Canary” listing show generation and storage of the canary to [ebp-4] including an xor operation with the frame pointer to assure the canary value is not predictable. The 3rd through 6th lines from the end of the listing read the canary value back from the stack and call the canary security check routine just before the usual epilogue. Figure 2.12 illustrates the stack frames with and without canaries resulting from the listings in Figure 2.11.
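Conceptually, the compiler-inserted prologue and epilogue behave like the hand-written C sketch below. This is an analogy only: a plain C program cannot control where its locals are placed, real implementations derive the cookie from a per-process random value at startup, and the failing path reports to a runtime check routine rather than calling abort() directly.

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical per-process secret; real implementations initialize this
     * from an entropy source when the program starts. */
    static unsigned long stack_cookie = 0x2f3a9c51UL;

    void protected_copy(const char *input)
    {
        unsigned long canary = stack_cookie;  /* 'prologue': canary intended to
                                                 sit between the return address
                                                 and the local buffer           */
        char buf[64];

        strcpy(buf, input);                   /* potential overflow of buf      */

        if (canary != stack_cookie)           /* 'epilogue': verify before the  */
            abort();                          /* return; a mismatch means the   */
    }                                         /* overflow reached the canary    */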

Stack canaries cannot repair a corrupted return address or ebp value, and stack canaries are vulnerable to attacks including string-oriented programming [122]. What they do for security is to protect the computer from stack-based buffer overflow attacks that overwrite the stack frame's return address. When the canary (or cookie) security check fails, a fault is raised and the process is terminated rather than attempting to return to a compromised address.

8 Referring to the canaries miners used to detect bad air in mine shafts [13, 23].

Figure 2.11: Compiler-Generated Assembly With and Without Stack Canary

Figure 2.12: Conventional Stacks With and Without Canary

The Linux GCC's StackGuard [166] is similar to Microsoft's stack canary or stack cookie compiler protection shown above. A similar option was StackShield [167], which provided three different configurable protection mechanisms. It did not use canaries and instead performed address cloning for comparison at return, return address range checking, and checks on indirect function calls [129]. We

should note that the latest version of StackShield we located was 0.7 beta (January 2000), and it was no longer available for download.

Shinagawa's SegmentShield approach [139] is similar to the StackShield cloning option in that it performs integrity checking between memory stack return addresses and return addresses stored in a protected segment. SegmentShield is a compiler extension using a modified procedure prologue and epilogue to conduct the return address storage and comparison with options on disagreement similar to McGregor [101], i.e., signal the disagreement as a buffer overflow and terminate or attempt to continue processing based on the secure return address. Shinagawa's approach leverages the x86's segmentation hardware to provide secure storage for the return address copies to prevent malicious manipulation of the copies.

While further research into these protection methods would be interesting, we will close the stack canary section with two observations: First, a brief search revealed a number of ways to defeat these and other protections [21, 32, 122, 129] with some interesting insight into how quickly the protect/defeat balance changes between defenders and attackers (or their white-hat counterparts) [21]. Second, these are compiler-based software protection mechanisms and, while intellectually interesting, are applicable to our hardware-based effort only to the extent that lessons can be learned or techniques may arise from their study that lend themselves to hardware implementation, provided they are justified by improved performance or security. Noting that canary mechanisms are more accurately described as detection of overflow conditions rather than ‘armoring’ the stack frame return address, we segue to return address protection and repair.

2.7.5.3 Return Address Protection or Repair

Continuing with the previous section's closing remark, we reinforce here that this section examines ways of detecting and correcting return address corruption without regard to what may have happened to the remainder of the stack. Armed with this view of stack vulnerability, we found a good deal of information in the literature on stack modifications for performance reasons. Typical examples of stack modifiers were Sun [150], who proposed dual stacks for performance reasons in branch prediction or return address repair in branch mispredictions, Skadron [141], who proposed a mechanism for repairing mispredictions on a single stack, and Paysan [123], who proposed four stacks to eliminate stalls and pipeline bubbles due to branch instructions. These were all performance enhancements only; none mentioned security.

In section 2.7.3 we pointed out the potential problems of having control flow information and program data on a single stack. While dual-stack processors are not unknown, they are scarce. Koopman's [87] work on stack machines is dated but with observations focused on functionality rather than security. He discusses multiple stacks used for concurrent data manipulation and subroutine calls, and separates stack functions into four categories: operand stacks for intermediate expression evaluation, e.g., x = (a + b) ∗ (c + d), return address stacks, local variable stacks to avoid statically defined local variables, and parameter stacks. In register machines, registers can take the place of the expression evaluation stack, reducing Koopman's categories to three. Koopman's description of a ‘combination stack’ for stack frames or activation records (a combination of at least the local variable and return address) is ‘the stack’ we have been “smashing” [113]. Other than the performance enhancements noted above, the only noticeable

continuing interest in dual stack machines is for the Forth language [134], which uses two stacks, one for expression evaluation and parameter passing, and one for return addresses [87]. Even the venerable Hennessy [64] book addresses distinctions between stack and register machines, but return address buffers are treated in the context of branch prediction rather than segregation from data. To look back into history, the Motorola 68000 was not specifically designed as a multiple-stack machine, but with multiple address and data registers, it could be used in configurations of up to 8 stacks since the address registers supported post-decrementing and pre-incrementing [87]. Nonetheless, convention was to use the A7 register as the stack pointer in a single-stack configuration.

Where return addresses cannot be otherwise guarded, the essence of return address protection and repair is similar to the StackShield [167] and SegmentShield [139] options of cloning return addresses: keeping the copy in a secure location and comparing the stack frame return address to the secure copy at return. If the copies disagree, and depending on the criticality of the application, the disagreement can be flagged to the OS for exception handling involving user notification and logging with immediate termination, attempted orderly shutdown, or attempted continued processing with the secure copy replacing the corrupted stack frame copy. In an application that always attempts continuing, the only reason for continuing to do the comparison is to at least detect conditions that result in corrupt pointers and to provide forensics. The critical difference between the StackShield and SegmentShield methods above and the proposed solutions below is in the hardware support each of the mechanisms below employs in processing return address pointers.

Ye [176] proposed Address Pair Tables stored on a hardware Reliable Return

Address Stack (RRAS) that matches call and return address pairs to detect corrupted return addresses. Effectively, each call/return address pair must be a closed loop; when it is not, tampering is indicated and an exception is raised.

Park [119, 120] proposed a hardware modification to the stack to push 3-tuples composed of the saved frame pointer, stack pointer, and return address to a separate hardware stack for later comparison to the return address. This approach maintains instruction set architecture (ISA) compatibility by modifying processor hardware to monitor use of a callee's push ebp after a call to trigger storage of the 3-tuple. The next subsequent pop ebp triggers comparison of ebp, esp, and return address in the memory stack with the 3-tuple stored in the hardware stack, and disagreements will terminate execution. Since the 3-tuple storage and comparison are triggered by the ebp push and ebp pop, procedure entry and exit use of the standard prologue and epilogue would have to be enforced by the compiler.

McGregor's approach is to implement a Secure Return Address Stack (SRAS) that stores a secure value of the return address at each call [101]. This is very similar to the StackShield (with cloning option) [129] and SegmentShield [139] approaches but with hardware enforcement and protection. When the processor encounters a RET instruction, the return addresses for the memory stack and SRAS are compared; upon disagreement, the process could be terminated with notification to the OS, or processing could continue execution by using the return address from the SRAS.

Xu [175] evaluated split control and data stacks to prevent data from overwriting program control data via a compiler-based split stack or a hardware split stack using a secure return address stack (SRAS) for return addresses. The

compiler-based split stack maintains a second copy of the data stack return address for comparison. Xu's hardware-based split stack appears to be the most promising for the secure host. It is noted that this approach will require modification of the C jmp_buf structure to provide support for setjmp and longjmp.

The protection and repair approaches above are all based on conventional, contemporary processor architectures with the addition of compiler extensions and/or hardware aimed at detection of return address manipulation with the potential to repair return addresses that were corrupted. CHERI (section 2.5) is a departure from that model due to the extensive changes in the architecture and software to support the object-capability model. This gives a broad range of additional possible protections including protection of the stack if stack pointers are cast to capabilities to provide bounds checking. One of the CHERI products is an experimental compiler that protects individual frames to prevent stack overflow [174].
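The clone-and-compare idea shared by StackShield, SegmentShield, and the SRAS proposals can be expressed in a few lines of C. In the hardware proposals this logic lives inside the CPU and the secure store is inaccessible to ordinary code; the sketch below only shows the flow, and the names and depth limit are illustrative assumptions.

    #include <stdlib.h>

    #define SHADOW_DEPTH 1024

    static void *shadow_stack[SHADOW_DEPTH];   /* the 'secure' copies */
    static size_t shadow_top = 0;

    void shadow_push(void *return_address)     /* performed at each CALL */
    {
        shadow_stack[shadow_top++] = return_address;
    }

    void shadow_check(void *return_address_on_stack)   /* performed at RET */
    {
        void *secure_copy = shadow_stack[--shadow_top];

        if (return_address_on_stack != secure_copy)
            abort();   /* disagreement: corrupted return address; a repair
                          variant would instead continue using secure_copy */
    }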

2.7.5.4 Reverse Stack

Historical perspectives on ‘reverse’ stacks (or what we consider ‘reverse’ to be) were covered in section 2.7.4, so we will not belabor details here other than to emphasize the advantage of a reverse stack as a security enhancement. A reverse stack was one of

Multics' three hardware security features [81]. As we noted previously, reversing the stack operation reduces but does not eliminate the potential for corruption of return addresses on buffer overflows. Salamat [133] used LLVM [127] to implement a reverse execution stack at the compiler phase to demonstrate reverse stack operation even on machines that naturally have a single growth direction. This opens two potential avenues for security improvements. One is that reverse stacks add software diversity, thereby improving diversity in the computing ecosystem since

a given stack-based buffer exploit would not work on two otherwise identical machines running the same software if the stack in one were reversed; moreover, since the reverse stack operates in software rather than being tied to the underlying hardware, options could be implemented for random orientation of stack growth each time a program is loaded. In this case the diversity would enjoy a temporal dimension. The second option, though having performance impacts, would provide near immunity to stack-based buffer overflow attacks: Salamat proposes that the same process could be run in a parallel configuration of synchronized but stack-opposite process pairs. The outputs of these “multivariate” [133] configurations would be compared for agreement, with disagreement signifying an attack. Since the same exploit would not work on both processes, complementary synchronized exploits would have to be applied to the multivariate processes in the correct polarity in order to be successful.

2.8 Non-Stack Buffer Overflows

We spent a fair amount of time on stack-based buffer overflows in the previous section, but should point out that any buffer, stack-based or not, can be overflowed, overread, or underread. While overflowing a buffer on the heap is difficult to convert to a control flow exploit of the type we reviewed in section 2.7.2, we do not discount the possibility altogether and will offer forward references to sections 3.2.5.1 and 4.9.1 for treatment of control flow issues and protections. The most important difference between stack-based and non-stack buffers is that stack-based buffers operate in the vicinity of control flow data (i.e., return addresses) on the x86 stack.

The most stringent protection for buffers (or data structures in general) we have encountered is found in CHERI's capabilities (sections 2.5.1 and 2.5.2), and a close second would be C “fat pointers” and C++ pointers to member functions [89, 140, 149, 177], except that pointers and pointers to member functions do not necessarily need a complementary data element that would comprise a true object-capability model. While these pointers are not always described in exactly the same way, the general concept of addresses (the basic pointer element) along with memory segment delimiters and descriptors that can be matched with access operators' authorizations and limitations certainly applies to secure processing. Depending on the extent of information attached to memory descriptors, keyed memory such as described by Cragon [34] could be considered to be the attributes of a data object capability. Combining properly constructed fat pointers with keyed memory would be a synergistic match approximating at least part of the object-capability model.

Heap-based buffer overflows do not automatically present a program control flow integrity issue, but the possibility cannot be discounted and other types of intrusions may result. For example, format string overflow vulnerabilities can produce indirect exploits of the global-offset table (GOT) [95], so the need for bounds and access protections extends beyond the stack.

Shaw [138] claims the two “most prominent root causes” of buffer overflows are the use of unsafe library functions and bad pointer operations. He used Safe Type Replacement (STR) transformations to replace character buffers with safe data structures that include buffer size information that is used for bounds checks during pointer operations, and Safe Library Replacement (SLR) transformations to replace unsafe C functions with safer alternatives. Shaw applied his automated

methods with impressive reported results. According to his 2014 IEEE conference paper:

“They are effective: they fixed all buffer overflows featured in 4,505

programs of NIST's SAMATE9 reference dataset, making the changes automatically on over 2.3 million lines of code (MLOC). They are also safe: we applied them to make hundreds of changes on four open source programs (1.7 MLOC) without breaking the programs.” [138]

While not as impressive as Shaw's published results, we find a number of other buffer protection mechanisms such as concurrent heap monitoring [178] and compiler-enabled protection for the GOT (and stack) [145]. These are all software methods to address problems that arise from imperfect software. This is a good place to reinforce that there is no substitute for good defensive coding techniques [85], and a properly implemented buffer and access mechanism would not need extraordinary protections; but humans are imperfect and coders are human. For instance, strcpy() is considered ‘unsafe’ and strncpy() is preferred, but string copy vulnerabilities still exist in legacy code and sneak into new or modified code. Finally, overflowing a buffer and corrupting adjacent data is not the only bounds violation that can present security problems. Section 2.6.2 covered the much-publicized Heartbleed vulnerability; while the Heartbleed vulnerability mechanism has been resolved, Strackx [147] also used “buffer overreads” to dispel the notion that attackers cannot read the contents of memory. This is motivation to provide

9 SAMATE - Software Assurance Metrics And Tool Evaluation (https://samate.nist.gov/)

hardware-based memory protection that enforces comprehensive access controls on critical memory.
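As a rough illustration of the fat-pointer idea discussed earlier in this section, the C sketch below carries bounds alongside the address so that every access can be checked. It is a sketch only: the field and function names are assumptions, and the fat-pointer schemes in the cited work also carry permissions and are enforced by the compiler or hardware rather than by explicit checks in application code.

    #include <stddef.h>

    /* A 'fat pointer': the address travels with the bounds of its object. */
    struct fat_ptr {
        char  *base;   /* start of the allocation          */
        size_t len;    /* number of valid bytes from base  */
    };

    /* Bounds-checked store; returns 0 on success, -1 on an out-of-bounds index. */
    int fp_store(struct fat_ptr p, size_t index, char value)
    {
        if (index >= p.len)
            return -1;          /* access outside the object is refused */
        p.base[index] = value;
        return 0;
    }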

2.9 Return- and Jump-Oriented Programming

Return-Oriented Programming (ROP) and Jump-Oriented Programming (JOP) are general terms for classes of exploits in which the attacker uses machine instructions already existing in computer memory to perform arbitrary operations that were not part of the program's original design. This introduction reviews the common elements of ROP and JOP and the major feature that sets them apart; other details that are more specific only to ROP or JOP will be discussed later in separate subsections. While String-Oriented Programming (SOP) [122] is not a ‘programming’ method in the sense of ROP and JOP, it can be used in conjunction with ROP and JOP to create exploits, so it is covered here. SOP uses a string format vulnerability to overflow a buffer on the stack (for a ROP exploit) or heap (for a JOP exploit), and SOP simply becomes the payload delivery mechanism to launch ROP or JOP attacks.

ROP and JOP share several common elements, but the primary one is the use of gadgets. Gadgets will be examined further before presenting ROP and JOP details, but for now, gadgets are typically short, disjoint sequences of machine instructions already in program memory; they perform operations useful to the attacker and allow the attacker to maintain control of the ROP or JOP ‘program’ flow. Therefore we can expand our commonalities list for ROP and JOP to include two requirements: to employ ROP or JOP exploits, the attacker must seize and

maintain control of the program flow (usually with at least some stack involvement), and the attacker must provide a payload of gadget addresses as control flow information for the malicious program.

The details that set ROP and JOP apart are in the mechanisms of control flow. Since the root words for the technique names are ‘return’ and ‘jump’, it should be no surprise that the major distinction between ROP and JOP is that in ROP, control flow arises from the inventive use of the stack and gadgets ending in a RETURN instruction; for JOP, the stack may be the initial launch point, but JOP gadgets end in any jump instruction that directs program flow from gadget to gadget (without use of the stack) until the attacker's mission is complete. When they can be used successfully, ROP and JOP provide the attacker with similar benefits: no code is executed on the stack so ROP and JOP are not defeated by non-executable stack protections (section 2.7.5.1), and arbitrary functions are performed without code injection so W⊕X memory (section 2.11) is not a deterrent (though it does limit the initial availability of gadgets). Combined with the information in this short introduction, the following sections should give the reader a working understanding of ROP and JOP and provide a technical foundation for understanding how a more secure processing platform helps address these exploitation techniques.

2.9.1 Gadgets

Gadgets are like the individual gears or sprockets of a complex machine; each one is a small part and requires proper interconnection to perform its function. In computer programming, gadgets are short, useful sequences of machine instructions found in program binaries. Use of the word gadget for computer programming

techniques appears to originate in graph theory (e.g., Trevison [159]), and graph-theoretic concepts are used in computer science. Szabó [152] traces the concept of gadgets to a 1954 Tutte paper on graph theory, using the phrase “Tutte's gadgets” to refer to subgraphs used to replace graph vertices; however, the word gadget was not used in Tutte's paper. In 1993, Leobl published a graph theory paper on gadget classification, and in tracing the evolution of buffer overflows Vallentin [163] describes arc-injection techniques to refer not to gadgets but to the control flow redirection (arcs) that puts them to use, so the association seems to hold. Whatever the origin of the term, ‘gadget’ has become ingrained in programming terminology.

In the context of ROP and JOP exploits, gadgets are sparse and disjoint portions of the total code base that have specific functionality useful to the attacker. In order to put the functions to use, a means for retaining control of program flow and directing it from gadget to gadget is required. ROP or JOP programs are created by identifying gadgets and marshaling them in the proper sequence.

As we previously said, ROP gadgets end with a RET (return) machine instruction. The ROP programmer devises a stack payload containing gadget addresses interspersed with required parameters and stores the payload on the stack with a buffer overflow; the current stack frame's return address is overwritten with the address of the first gadget to begin processing, and the ROP program ‘RETs’ (returns) from gadget to gadget in a sequence not part of the original program.
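The shape of such a payload, as it would sit on the stack after the overflow, can be sketched as an array of addresses and parameters. The addresses below are placeholders for illustration only, not real gadget locations, and a real payload would be delivered as raw bytes in the overflowing input rather than declared in C.

    /* Illustrative ROP chain layout (32-bit addresses assumed). */
    unsigned long rop_chain[] = {
        0x080484a3UL,   /* overwrites the saved return address: gadget 1      */
        0x0000002aUL,   /* parameter consumed (e.g., popped) by gadget 1      */
        0x080485b7UL,   /* gadget 1's ret transfers control here: gadget 2    */
        0x08048611UL,   /* gadget 2's ret transfers control here: gadget 3... */
    };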

JOP programming is very similar; however, the RET program flow mechanism is replaced by the use of gadgets ending in controlled jumps to subsequent gadgets. In a more complex JOP arrangement one or more selected gadgets act as a dispatch mechanism that substitutes for a conventional instruction pointer. Turing-

complete gadget sets have been demonstrated by Ouyang [117], who devised a ROP auto constructor called QExtd, and Chen [28], who also demonstrated a prototype tool for producing JOP shellcode. Gadgets may be part of a normal application program or shared library, but they are not part of the program coder's design intent. The last few machine instructions before an intended RET or any jump that can be controlled are obvious candidates for gadgets; however, when the target processor uses a complex multi-byte instruction set, gadgets are not even always part of the original program's assembly code. In a CISC processor such as the x86, unintended sequences of machine instructions can be found in existing code by jumping into an intermediate byte of a multi-byte instruction. In this case the processor will interpret the byte string as something entirely different from the original binary. As long as this

‘found’ gadget consists of otherwise valid machine instructions ending in a RET or JUMP opcode, it can be part of a ROP or JOP program. The x86 instruction set uses the single-byte 0C3H as the RET opcode; every ROP gadget ends in RET so every place a 0C3H is found is an opportunity to create a ROP gadget. This is very well illustrated in the following example from §1.2.1 of Shacham's paper on return-into-libc programming [136]:

“Two instructions in the entrypoint ecb_crypt are encoded as follows:

f7 c7 07 00 00 00   test   $0x00000007, %edi
0f 95 45 c3         setnzb -61(%ebp)

Starting one byte later, the attacker instead obtains:

c7 07 00 00 00 0f   movl   $0x0f000000, (%edi)
95                  xchg   %ebp, %eax
45                  inc    %ebp

c3                  ret” [136]

Within the x86 instruction set, instructions can be found that are between 1 and 15 bytes long [72], so it is not hard to speculate that many possible gadgets could be found in a byte-by-byte examination of any substantial binary codebase. In his footnote 7 on page 8, Shacham states:

“In fact, amongst the useful sequences we discover in libc there is a point where four valid instructions all end at the same point; and, examining libc as a whole, there is a point where seven valid instructions do.”10 [136]

Using his testbed libc (GNU libc-2.3.5), Shacham cataloged 5,483 0C3H bytes, or one in every 178 bytes. This is noticeably higher than the 1 in 256 bytes we would see in a uniform distribution and is indicative of the prevalence of legitimate RET instructions. In order to reduce the density of available gadgets for ROP, Shacham [136] proposes procedural reduction of RETs that includes, in part, using jumps to single exit points to reduce intentional RETs, reduction of spurious RETs through avoidance of specific register operations that produce 0C3H intermediate bytes, and contrived instruction placement to avoid 0C3H offsets in code. These are partial solutions and not cures (e.g., jumps to RETs are still exploitable in ROP); on the negative side they decrease code readability, add complexity, and decrease register use efficiency.

It is clear from this gadget review that a single-width instruction set would

10 Context for the quoted footnote is that four useful instructions end at the same point. Longer sequences are ‘valid’ but not useful, e.g., a sequence ending in jmp (address); ret;.

decrease gadget density. Therefore, we view a sparsely coded, fixed-width, aligned instruction set as a necessary but not sufficient condition for a secure host.
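Shacham's byte count is easy to reproduce in spirit. The short C sketch below (file name, default target, and error handling are illustrative) simply counts 0C3H bytes in any binary image; each such byte is a potential gadget terminator whether or not it was emitted as an intentional RET.

#include <stdio.h>

int main(int argc, char **argv) {
    /* scan any binary; "libc.so.6" is only a placeholder default */
    FILE *f = fopen(argc > 1 ? argv[1] : "libc.so.6", "rb");
    if (!f) return 1;
    long total = 0, rets = 0;
    int c;
    while ((c = fgetc(f)) != EOF) {
        total++;
        if (c == 0xC3)          /* single-byte x86 RET opcode */
            rets++;
    }
    fclose(f);
    if (rets)
        printf("%ld of %ld bytes are 0xC3 (about 1 in %ld)\n",
               rets, total, total / rets);
    return 0;
}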

2.9.2 Return-Oriented Programming Details

Return-oriented programming (ROP) is a general term applied to exploits that make use of existing code to perform arbitrary processes without code injection and as a means of bypassing non-executable stacks for stack-based buffer overflow attacks [108, 130, 136, 157]. The attacker constructs a stack-based address chain that sequentially targets the necessary functions already in memory to perform the desired process. When each function or procedure returns, the next function's address is waiting as the current return address [136].

Return-into-libc (RILC) is a special category of ROP that uses libc functions such as system() [44] to bypass non-executable stacks and elevate privileges by launching another program, or using mprotect() to disable W⊕X (Write XOR eXecute) memory protection [157]. RILC is considered a special case of ROP because RILC uses complete libc functions (with parameters) rather than gadgets. More generally, return-oriented programming uses gadgets that end in a return instruction (RET) or a sequence of instructions that emulate a return operation [130]. The attacker executes a process of his own design by gaining program control flow and chaining a sequence of gadgets together to perform complex operations not intended in the original code. ROP begins with a list of pointers to gadgets in the order necessary to perform an attacker's intended process. The list of pointers is typically inserted into writeable memory by exploiting a buffer overflow vulnerability, and the return address of the exploited function is overwritten with the address of the first gadget; program control flow then passes from gadget to gadget until the attacker's process is complete. Bania proposed several compiler- and binary-level mitigations for ROP attacks

[9]. These were procedural methods for CALL-RET pairing that test for a CALL before a RET, or stack or code encapsulation that did not involve modification of the host instruction set. These methods limited but did not completely solve ROP attacks, carried heavy performance penalties, and, in particular, the encapsulation methods were not OS-portable. One of the ROP mitigations proposed by Bania [9] was “obfuscating” instructions containing intermediate 0C3H bytes by replacing them with alternate instruction sequences (similar to Shacham [136]), and where a RET opcode is found in the first byte after an instruction, placing an unconditional short jump to the instruction several bytes before followed by a short series of ROP-useless instructions such as INT3. In this last case, the string of ROP-useless instructions preceded by an unconditional jump was referred to as a “jump land” (presumably as in a ‘region’), but its intent was to decrease the usefulness of the instruction followed by the RET as part of a longer gadget; a gadget starting prior to the instruction preceding the jump would still function in spite of the “jump land”. In the next section we address a malicious programming technique very similar to ROP's use of gadgets, but with an alternate control flow technique.

2.9.3 Jump-Oriented Programming Details

Jump-oriented programming (JOP) [15], or “return-oriented programming without returns” [27, 130], is similar to ROP in that program flow moves from gadget to gadget, but RET instructions are not used. A RET instruction is described as an

“update–load–branch” sequence by Checkoway [27] where the update is of some global resource available to the JOP sequence. For the x86 ISA an example is

pop eax; jmp *eax; as one such sequence, combined with the stack pointer (esp) as the global resource. A “sequence catalog” of gadgets interleaved with data is placed on the stack to define the attacker's function, and instead of using RETs to cycle through gadget addresses, esp is cycled through the function by gadget-based pops. Checkoway successfully demonstrated this technique on Linux x86 and Android ARM computers [27]. A feature of this JOP method is that it can be used to defeat ROP defenses that monitor instruction streams for excessive RETs, last-in/first-out relationships of return address stacks, or compilers that produce code that avoids

RET instructions. Another jump-oriented programming technique published by Bletsch [15] that avoids use of the stack altogether is illustrated in Figure 2.13. In this figure, the

“Dispatch table” is the equivalent of Checkoway's “sequence catalog”; edx is the global resource that functions as the JOP ‘program’ (Checkoway's esp-equivalent); and esi (direct) and [edi] (indirect) control return to the dispatcher after each gadget. This short JOP example loads a register, adds a value to it, and stores the result without use of the stack or any RET instruction.

Figure 2.13: JOP Dispatcher Gadget (Bletsch [15] Figure 3)

The ROP and JOP examples above are clear evidence of two things: a no-execute stack is not a protection against stack buffer overflow attacks with the capability to execute arbitrary attacker functions, and these functions can be done without injection of machine code into executable memory. So far in this section we have looked in general at techniques to avoid the necessity of injecting machine code by referencing existing functions in libc and using functional gadgets. The number of functions in libc is limited, but the availability of functional gadgets in a complex instruction set such as the x86 can be quite large, as demonstrated by the section on gadgets (2.9.1). Next we look at means of preventing subversion of program control flow.
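Before doing so, the dispatcher mechanism of Figure 2.13 can be paraphrased in a purely conceptual C sketch; ordinary function pointers stand in for gadgets and a loop index stands in for the dispatcher's edx, so this is an analogy rather than attack code.

#include <stdio.h>

static void load_reg (void) { puts("gadget: load register"); }
static void add_value(void) { puts("gadget: add value");     }
static void store_res(void) { puts("gadget: store result");  }

int main(void) {
    /* 'dispatch table' of gadget addresses (Checkoway's sequence catalog) */
    void (*dispatch_table[])(void) = { load_reg, add_value, store_res };

    /* the dispatcher gadget: advance the 'program counter' register and
     * jump indirectly to the next functional gadget */
    for (unsigned edx = 0; edx < 3; edx++)
        dispatch_table[edx]();
    return 0;
}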

2.9.4 Control Flow Protection

Mitigating ROP/JOP programming could be divided into a two-pronged effort with the possibility that the result could be stronger than the sum of its parts. The two prongs would be elimination of gadgets and enforcement of control flow integrity. We accept that RET (0C3H-byte) or RET-like sequence removal from code is impractical on a large scale, so gadgets will continue to exist. RET reduction such as proposed by Shacham [136] would decrease gadget density in binary code and so may be viewed as a control flow ‘hardening’ technique, but would not by itself ensure control flow integrity. The ROP/JOP attacker's task would be harder, but exploits would still exist. We therefore dismiss the notion of elimination of gadgets and relegate the remainder of this section to methods to achieve control flow integrity, i.e., CALLs, JUMPs, and RETurns that were not intended by the original

program designer should be prevented or intercepted. Abadi [1] lists “control-flow integrity (CFI)” as a basic safety property and claims CFI enforcement is simple and verifiable with guarantees that can be established formally; however, he proposes CFI using software adaptation methods that may be a barrier to application to existing code or have large performance impacts. In §3.1 [1], Abadi initially explains dynamic CFI checking using modified x86 code and separately with three ‘new’ machine instructions to illustrate a potential hardware implementation. The ‘new’ machine instructions are not compatible with any existing processor, but this concept is bookmarked as a forward reference to coverage of an Intel patent application later in this section. An important feature of Abadi's CFI mechanism is that jumps would be made with an ID value in a register and the ID value would be tested at the jump destination. Failure to match the correct ID at the jump destination would generate a fault or invoke an error-handling routine. Abadi also addresses inlined reference monitors (IRMs), which restrict control flow to the start of valid instructions for protected programs and to the beginning of protected sections of code. The limitation of this method is also in its performance impact. Zhang [179] proposes a software-based method for control flow protection (or control flow integrity) called Compact Control Flow Integrity and Randomization (CCFIR) using a Springboard code section of verified destination addresses “encoded” on aligned boundaries in a method that is similar to the IRMs mentioned by Abadi. Like the IRMs, CCFIR impacts performance; to reduce overhead and complexity, only “sensitive” functions and indirect jumps were included in CCFIR. This left a performance vs. security tradeoff that appears to leave large gaps in control flow protection in order to preserve reasonable performance levels. In addition, CCFIR is vulnerable to time-of-check to time-of-use race conditions when used in multi-threaded programs. Zhang applied a similar Springboard technique to the protection of pointers to functions [179] with an extra step for disassembly of binary images, but again, for performance reasons, did not protect hardcoded direct CALL/JUMP pairs, leaving exploitable gaps in control flow protection. Bania [9] evaluated a number of “promising” compiler-level and binary defenses against ROP and ROP-like attacks that fall into the CFI and CCFIR category,

“obfuscated” RETs, or encapsulating the stack, and stated two important points in his summary: performance impacts of these defenses were sufficiently large as to discourage their use; and these defenses “trammel and limit” the attacker but do not completely solve the ROP problem. Bania has also proposed a binary rewriting technique for Windows kernel modules that includes CFI techniques [10]. A hardware mitigation for ROP and JOP was found in a June 2014 Intel patent application with a 2017 award for a “Control Transfer Termination Instruction” [58, 137] in a modified instruction set architecture. This method provides CPU mechanisms to group all control transfer assembly instructions (CALL/JUMP and RET) and match them with control transfer terminating instructions (ENDBRANCH and ENDRET) in the compiler to designate programmer-intended control flow transfers. “Retirement” logic would be provided in the CPU to raise an exception if a

CALL/JUMP to ENDBRANCH or RET to ENDRET sequence is violated. As a provision for backward compatibility of CTT-enabled software, the CTT instructions would be defined as 4-byte opcodes that decode as NOPs in the current x86 ISA. This mitigation would appear to embody the verifiable protections of Abadi's CFI [1] without changes to high-level source code beyond recompilation, provide the performance required for full-binary coverage, and be backward compatible (with loss of control

flow protection) to earlier x86 processors. This will be a prominent point when we cover our proposed mitigations through security-based features in hardware in section 3.2.5.1.
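The flavor of Abadi's ID-checking can be shown with a small C sketch. The ID constant, structure, and check below are illustrative stand-ins for what CFI instrumentation would emit around an indirect call; they are not the paper's actual mechanism, which operates on rewritten machine code.

#include <stdio.h>
#include <stdlib.h>

#define CFI_ID 0x12345678u            /* label expected at a legitimate target */

struct cfi_target {
    unsigned id;                      /* embedded at the destination by the rewriter */
    void (*fn)(void);
};

static void sensitive(void) { puts("legitimate indirect call target"); }
static struct cfi_target target = { CFI_ID, sensitive };

static void checked_indirect_call(struct cfi_target *t) {
    if (t->id != CFI_ID) {            /* ID mismatch: control flow was subverted */
        fputs("CFI violation\n", stderr);
        abort();
    }
    t->fn();
}

int main(void) { checked_indirect_call(&target); return 0; }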

2.10 Code Injection

Code injection refers to the process whereby an attacker constructs a payload containing foreign code of the attacker's design for export into a target computer system and deposits the payload in program memory for execution as part of an exploit. The payload may be machine code in the native binary form of the target host, shell script to be interpreted by a command processor, application commands for an interpreter such as SQL (although a SQL payload would be referred to by the more specific term “SQL injection”), or cross-site scripting (XSS) to inject content such as JavaScript into web browsers [79]. Our major interest for the secure host is in command script or binary code. Delivery of the payload is usually through an input process to a stack- or heap-based buffer. When the host allows the payload to be written to memory without validating it properly and the attacker can redirect program flow to the injected code such as we saw in the Levy [113] exploit of section 2.7.2, the host executes the attacker's code at the privilege level of the vulnerable input process. One of the features of the Levy exploit was to rewrite the binary code to eliminate ‘\0’

(00H) bytes from the input since the input process would have treated ‘\0’ bytes as string-terminating nulls. Since code injection exploits require delivery of the payload and redirection of host program flow, a secure host should be able to neutralize such an attack by

separating the input buffer from executable memory or providing program flow integrity checks that detect the flow redirection and terminate the process. Stack protection techniques of section 2.7.5, memory protections of section 2.11, and control flow protection of section 2.9.4 are potential methods for applying these safeguards.

2.11 Memory Protection

In discussing memory protection we are primarily interested in protecting program memory from unauthorized modification, but consideration of memory protection should also extend to protection of data memory across user or process bounds and accommodations for users to establish limitations on access or modification even within their own program. The need for protections against unauthorized browsing of confidential data or unauthorized modification of any data is intuitive. Our paramount concern in secure computing is ensuring the integrity of critical programs or processes, and preventing execution of any binary that was not loaded from a secure source. If program code can be modified or executed in arbitrary order by an attacker, the possibility exists that he will be able to execute arbitrary functions to bypass any other system safeguards. Therefore we take as a starting point the protection of program memory.

Looking back at Levy's stack-based buffer overflow [113] in section 2.7.2, we see that the stack was in an executable area of memory, and Levy was able to write data of his own design to this memory and redirect program flow to it to produce a system prompt at the privilege level of the vulnerable input process. The requisite

conditions for the attack were the ability to write arbitrary data to memory and the ability to execute that data as program code. Memory protection mechanisms in general purpose computers include two generic techniques: memory marked in some manner as Write XOR eXecute (W⊕X), and memory designated as Non-eXecutable (NX) [66]. The NX attribute is often implemented as a single bit in a memory descriptor, so the capability is also referred to as the ‘NX bit’. Proprietary names for the NX bit hardware capability are Enhanced Virus Protection (EVP) for AMD [2, 3], Execute Disable (XD) for Intel® [73], and eXecute Never (XN) for ARM [6]. When properly utilized, the W⊕X and NX bit capabilities provide modern computers with Harvard-like memory protection, but without operating system support the usefulness of these features is limited. Windows XP Service Pack 2 implemented no-execute page protection using the NX bit in a feature called Data Execution Prevention (DEP) [33, 68], and Linux ExecShield first appeared in its early version in the first release of Fedora [164]. W⊕X prevents an attacker from executing code he has injected provided there is no mechanism to write data and then, in a different operation, either transfer the data to executable memory or redirect program flow to that memory after the no-execute restriction has been altered. Therefore, buffer overflow exploits are very difficult to solve in hardware alone. Even tagged memory can be misused or corrupted when it is tagged incorrectly, and to date, W⊕X methods have not been sufficient to prevent code injection and privilege escalation exploits, and DEP is vulnerable to string-oriented programming [122]. Unsafe pointer use is one of the issues that contributes to vulnerabilities. Intel addressed the unsafe pointer and associated buffer overflow issue in software with

a Parallel Studio XE 2013 feature called Pointer Checker, and followed it up with a hardware-assisted capability called Intel Memory Protection Extensions (Intel MPX), which adds hardware registers and new instructions to operate on them [74]. Supervisor Mode Execution Protection (SMEP) is another Intel hardware addition first introduced in their Ivy Bridge processors [48] and included in Windows

8 32-bit [33]. SMEP operates below segmentation in Intel's paging architecture by modifying the user/supervisor properties to prevent execution of untrusted (user) memory when operating at a more privileged level. Historically, there were overlapped user/supervisor page permissions in protection ring levels 0 through 2, and with SMEP, page permissions are marked as supervisor only or user only [48]. Memory protection is not a new subject. In the stack protection section, we mentioned that a reverse stack was one of Multics’ three hardware security features in 1965 [81]. The other two were execute permission bits in the memory Segment Descriptor Words (SDWs), and pointer layouts that did not allow a segment overflow to carry into the next segment. Unfortunately, Multics’ SDW-based no-execute capability, like the NX bit capability of the x86 segmentation architecture, was initially ignored by operating systems not designed specifically for security [81]. While the emphasis of this effort is hardware-based security features, a memory protection review would not be complete without recognizing software-based memory protection mechanisms to gauge applicability to the secure host. Kc achieved an “effectively non-executable” stack and heap through system call policing [83] without the overhead of sandboxing. These methods are operating system-

based capabilities and beyond the scope of hardware-based features except that the hardware must provide underlying support for the software techniques. A secure processor cannot enforce perfect security if insecurities are ‘designed in’ by the programmer.
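As a concrete illustration of W⊕X at the software level, the POSIX sketch below maps a page writable, copies a single byte of ‘code’ into it, and only then flips the page to read/execute so that it is never writable and executable at the same time. Error handling is trimmed and the byte written is purely illustrative.

#include <sys/mman.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    unsigned char code[] = { 0xC3 };            /* x86 RET, illustration only */
    memcpy(buf, code, sizeof code);

    /* W xor X: drop write permission before execution would be allowed */
    if (mprotect(buf, page, PROT_READ | PROT_EXEC) != 0)
        return 1;
    return 0;
}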

2.12 Address Space Layout Randomization

Address Space Layout Randomization (ASLR) [68] is a defense against a number of security issues involving attacker access to program memory structures. ASLR is an operating system capability for randomizing memory locations of stack, heap, and libraries at load time to avoid predictable memory addresses from session to session. With randomized memory locations, specific code or data cannot be reliably accessed by the attacker. In section 2.7.2 we covered stack overflow exploits where the stack frame return address was overwritten by the address of a dangerous or unauthorized process or function the attacker wanted to invoke. Since the replacement value for the stack frame return address must be a known location in memory, ASLR denies the attacker his required memory layout knowledge. ASLR does have limitations [7, 43, 166] in the number of bits of the memory field that can be randomized, referred to as entropy. Low entropy results in coarse granularity of memory locations, i.e., randomly placing binaries on only a limited number of starting addresses. This opens the opportunity for exploits such as string-oriented programming [122] and heap-spraying attacks, and ASLR is particularly ineffective against attacks in a small block of code [43]. ASLR was first used in Linux version 2.6.12, and has been in Windows since Vista Beta 2 [68]. To overcome the memory granularity limitations of ASLR, High

Entropy ASLR was introduced to take advantage of the 64-bit address space [7] in modern processors and increase the number of potential address assignments for 64-bit processes. Not all applications and Windows DLLs use ASLR, and with predictable memory locations, these programs continue to be vulnerable. To help resolve this problem, ForceASLR was introduced [33]. Memory protection is a necessary but not sufficient condition for a secure host, but ASLR (as implemented) is not a hardware feature. The emphasis of our effort is hardware-based security features; however, we recognize that ASLR is an important synergistic match for instruction set randomization (ISR) (section 2.15) and control flow protection (sections 3.2.5.1 and 4.9.1).
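The effect of ASLR is easy to observe. The small C program below prints the addresses of a stack variable, a heap allocation, and a string literal in the program image; on a system with ASLR (and a position-independent executable) the values change from run to run, which is exactly the layout knowledge denied to the attacker.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int local = 0;                    /* lives on the stack          */
    void *heap = malloc(16);          /* lives on the heap           */
    const char *text = "rodata";      /* lives in the program image  */

    printf("stack: %p  heap: %p  image: %p\n",
           (void *)&local, heap, (void *)text);
    free(heap);
    return 0;
}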

2.13 Harvard Architecture

A Harvard architecture machine uses separate memories for program code and data such as that illustrated in Figure 2.14. Harvard architecture CPUs are common in embedded applications [55] such as the Mica family of wireless sensors [36].

Figure 2.14: Harvard Architecture Microcontroller (from Francillon [55] Fig. 1)

In a Harvard architecture it is impossible to execute code from data memory because the program counter cannot point there, and for this reason it “has been a common belief that code injection is impossible in Harvard architectures” [55]; but the discussion quickly goes to what ‘true’ Harvard architecture really means. A true Harvard architecture would prevent remote modification of program memory, and modification requires physical access to the memory. Physical access to memory and even the device is often impractical. Most Harvard-based processors are considered ‘modified’ Harvard architectures with some means of initializing and updating the program code. In the case of the Atmel microprocessor [8] of Figure 2.14, the AVR assembly language provides primitives LPM and SPM to Load from Program Memory and Store to Program Memory (respectively). The microprocessor contains a bootloader to allow for updates or alternate operational modes on startup by copying at least configuration information from non-program memory to program memory. If the SPM instruction can be found in program memory, then it may be possible to construct a gadget to store selected contents of data memory in program memory. Francillon [55] proved that this modified Harvard architecture could be successfully attacked using incremental exploits to write attacker code to data memory followed by a stack-based buffer overflow to access and use an SPM gadget in program memory to transfer the exploit code to persistent flash memory (Figure 2.14).

Based on Francillon's demonstration, we see that this Harvard architecture microprocessor was not immune to a code injection exploit, and in this case the exploit code was written to persistent flash memory, so it was made resident and survived subsequent reboots. While this particular full intrusion required a series of incremental and progressive steps, it is typical of the effort a determined attacker would make against a high-value target. Our take-away is that memory protection alone is not sufficient to ensure security.

2.14 Instruction Set Architecture

The important elements of this section were previously covered but are summarized here to gather security-related instruction set architecture (ISA) weaknesses (or ISA-based ‘insecurity contributors’) into a single section. We saw in section 2.9.1 that the complex, variable-length x86 instruction set is problematic in that it creates gadgets in machine code where they would not have existed even in assembly, the lowest level of program abstraction. While the application of hardware mitigation such as the Intel CTT method [58, 137] presented in section 2.9.4 would apparently resolve ROP issues by intercepting

RETurns with unmatched CALLs, it is not clear that all short-sequence JOP tasks could be prevented by CTT when gadgets are taken from intermediate code bytes. While this assumption could be the subject of a line of research, we believe it more beneficial to make a tacit assumption that CTT-defeating JOP gadgets could be found, take the beneficial ROP-prevention lesson from CTT, and combine the two to eliminate the x86 ISA from consideration in favor of a customized instruction set of our own specification that would resolve the unintended gadget problem of

section 2.9.1 and also be compatible with the separation of control-flow and data stacks, stack growth direction, and possible return address protections that we addressed in sections 2.7.3, 2.7.4, and 2.7.5.3 (respectively).

2.15 Instruction Set Randomization

As we have seen, two techniques attackers use are injection of malicious machine instructions [113, 83, 133] (section 2.10) or misuse of existing code already in memory [15, 136] (section 2.9). Code injection requires a properly formed payload and misuse of existing code requires knowledge of the code's location in memory. Instruction set randomization (ISR) obscures memory contents and increases the difficulty of both of these techniques. We will briefly explain the ISR concept, examine the two attack techniques above as motivation for ISR, and end the section with a discussion of ISR implementation and limitations. Instruction set randomization (ISR) obfuscates the contents of program memory. ‘Instruction set encryption’ may be a better term because the processor only has encrypted copies of program binary images in memory, and instructions are decrypted after instruction fetch and before instruction decode. The word ‘randomization’ denotes ‘encrypted with a random key’ to induce variability. A notional diagram of ISR from Kc [83] is shown in Figure 2.15.

Figure 2.15: ISR Implementation (from Kc [83], Figure 1)

For code injection, a properly formed payload consists, in part, of valid machine instructions. If an attacker has no a priori knowledge of the particular processor type he is attacking, he could make a statistical or situational guess, create his payload around the instruction set of the assumed processor (and possibly operating system), and begin a trial and error process. Barring a correct first guess, the

attacker would be delayed, but through either luck or elimination, he could eventually determine the processor type; however, unsuccessful trial and error attempts increase an attacker's visibility and improve the defender's chances of detecting the attack11. If the defender employs some form of ISR, the number of ‘virtual processor’ types the attacker has to guess from is so large that the attacker will likely have to resort to attacks on the ISR itself (more later) in order to attack the host computer. Even if the attacker knows the type of processor, without defeating

ISR the attacker's probability of a successful injection payload would be similar to using a random number generator to create the payload [83]. Exploits that rely on code already in memory such as return- or jump-oriented techniques depend on knowing the address of code in memory; if the location of the code is not known, an attacker cannot reliably mount an attack. Even if we stipulate that introspection [24] is available, attempts to scan memory would be unproductive in the presence of properly-implemented ISR unless the encryption process and encoding key are also known to the attacker. It should be pointed out, however, that if the locations of necessary memory contents are known and

11 This is a good argument for hardware and software [12] diversity in the computing ecosystem.

the attacker can redirect program flow to arbitrary locations, ISR alone would not prevent an exploit of the code because the processor will continue to decrypt memory as it executes the code. An effective defense using ISR, therefore, would also require Address Space Layout Randomization (ASLR, section 2.12). A general ISR technique was demonstrated by Kc [83] on a modified Linux kernel using the bochs-x86 emulator [94]. In the simplified diagram of Figure 2.15, the encrypted binaries reside in memory and are decrypted continuously just before instruction decode by XORing the contents of constant-width memory fields with the encryption key. The variable-length nature of the x86 ISA was partially solved by Kc by enforcing process alignment during compile to 2-byte boundaries using a modified compiler. In the presence of a code injection attack, ISR essentially prevents the exploit by causing the vulnerable process to be terminated by the run-time environment when the first invalid opcode is encountered. While a momentary loss of host responsiveness during process termination is preferable to the intrusion, the system response to ISR ‘saves’ could become another form of denial of service [19, 83] if other security measures are not used concurrently to reduce vulnerability to code injection. Selection of the encryption technique and encryption key are important variables in the effectiveness and cost of the technique; ISR cannot be used on self-modifying or polymorphic code, and key-stealing attacks are not the only weaknesses. A single-word key system is subject to guessing [12], large key words can be attacked incrementally [146], and stream ciphers with pseudo-random keys are subject to plaintext attack [172]. Possible mitigations to these problems have been proposed [172] such as access rights to the protected memory, secure key locations,

and one-time-pad modes. Additional issues to be resolved are implementation of key generation for processes and how common libraries will be handled. Since this is a ‘state of the art’ survey section, we will defer proposed implementation details to a later section but assume ISR will be an integral feature of the secure host.
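A minimal sketch of Kc-style ISR with a single fixed key is given below: instruction words are XORed with the key when the loader writes them to memory and XORed again between fetch and decode, so only ciphertext is ever resident. The 32-bit ‘instruction’ values and the key are invented for the example.

#include <stdint.h>
#include <stdio.h>

#define ISR_KEY 0xA5C3F00Du            /* chosen at random for each process at load */

static uint32_t isr_encode(uint32_t insn) { return insn ^ ISR_KEY; }   /* loader     */
static uint32_t isr_decode(uint32_t word) { return word ^ ISR_KEY; }   /* fetch unit */

int main(void) {
    uint32_t program[] = { 0x00000001u, 0x00000002u, 0x00000003u };
    uint32_t memory[3];

    for (int i = 0; i < 3; i++)        /* program load: only ciphertext reaches memory  */
        memory[i] = isr_encode(program[i]);

    for (int i = 0; i < 3; i++)        /* instruction fetch: decrypt just before decode */
        printf("fetched %08X -> decoded %08X\n", memory[i], isr_decode(memory[i]));
    return 0;
}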

2.16 Hardware-enhanced Authentication

The technology survey conducted for this effort focused on CPU hardware-enforced security features for application protection. CPU upgrades would certainly drive CPU support chipset changes or new systems-on-a-chip (SoC). These changes provide leverage opportunities for hardware-enhanced authentication. Two such examples not chosen for incorporation into the proof-of-concept design are documented here as important wedges in future work.

2.16.1 Random Number Sources

No discussion of computer security would be complete without mention of relevant aspects of cryptography. The subject is appropriate for the Secure Host CPU based on the potential requirement for a random number source for ISR and a hardware true random number generator (TRNG) source similar to the Intel

RdRand instruction available in Intel platforms beginning with Ivy Bridge [67]. Random number generation has been said to be “the Achilles heel of cryptography” [60]. Sources for random numbers used in cryptographic functions can be one of two types: deterministic random bit generators (DRBGs), also called pseudorandom bit generators, and non-deterministic random bit generators (NRBGs),

also called true random bit12 generators [11]. Deterministic generators produce repeating output patterns that are made effectively random by long repeat intervals, uniformly distributed outputs, hiding the internal process of the generator from observation, and using a seed function to randomize the generator's internal state. A functional diagram of a DRBG is shown in Figure 2.16.

Figure 2.16: RBG Functional Model (from NIST SP 800-90A Rev 1 Fig. 1) [11]

In this diagram, the Entropy Input source would be of the NRBG type described below. Non-deterministic random bit generators depend on input from an ‘entropy source’ of unpredictable data such as thermal or shot noise, hard drive seek times, or keystroke intervals. The NRBG will include a digitizer if the source is not received as digital data. Since there is no guarantee that the input source distribution

12 The terms ‘bit’ and ‘number’ can be used interchangeably to the extent that numbers can be assembled from streams of bits.

is uniform, the NRBG will provide assessment of the source and conditioning to produce the distribution required. Additionally, the source should provide self-test functions and a status output to indicate valid or invalid output [60]. The output of such an NRBG can provide the entropy input required by a DRBG [11]. Random number sources are important to cryptographic functions and therefore to cybersecurity. Insecure Randomness is a prominent Security Features phylum in Tsipenyuk's taxonomy [160, 161] and a Weakness Class (CWE-330) in the Common Weakness Enumeration [105]. This drives the availability of true random number sources such as USB dongles [45], and more importantly Intel first provided a hardware-based RNG and the RdRand instruction in its Ivy Bridge processor released in 2012 [60, 102, 153]. A high quality hardware random number generator was implemented and demonstrated in an FPGA platform by Majzoobi [98]. This implementation uses FPGA flip-flop metastability as the entropy source and includes the requisite analysis, filtering, and post-processing to produce high-quality results on the NIST Statistical Test Suite [131]. The implementation was characterized as “low overhead”, but the paper provided conflicting data, reporting in one place that the TRNG consumes “128 LUTs that are packed into 16 Virtex 5 CLBs” and later stating “low overhead, using only 5 CLBs”. CLBs are configurable logic blocks, and the discrepancy is likely a typographical error but warrants additional investigation. Given the value of a true random number generator as part of the security architecture for a modern processor, we would expect to see more use of embedded TRNGs in commercial and special-purpose processors.
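For completeness, hardware randomness of this kind is already reachable from C on RdRand-capable parts through a compiler intrinsic; the sketch below assumes GCC or Clang with -mrdrnd on an Ivy Bridge or later processor, and real code would bound the retry loop as Intel recommends.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    unsigned int value;
    /* _rdrand32_step() returns 1 on success, 0 if the on-chip DRBG was
     * momentarily unable to supply a value. */
    if (_rdrand32_step(&value))
        printf("hardware random value: %08X\n", value);
    else
        puts("RdRand did not return a value; retry");
    return 0;
}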

2.16.1.1 Physical Uncloneable Functions

Physical Uncloneable Functions (PUFs) [65] are directly related to TRNGs but can add low-cost hardware and logic for low-overhead ‘function call’ capabilities for key generation and verification in challenge-response (C-R) authentication protocols. This hardware suite together with PUF-aware software provides complex hardware-based 2-way authentication effective in defending against iterated C-R protocol attacks [40]. Like TRNGs, PUFs would be an important added layer in ‘defense in depth’ of a Secure Host using a hardware-enhanced CPU as its foundation.

2.17 Current State of the Art Summary

In the preceding sections we have reviewed common vulnerability patterns and a number of ‘point’ solutions ranging from research to demonstrated and/or fielded software and hardware; however, these security features are neither universally available nor broadly used in commodity computing systems. Our intent is to integrate a number of these solutions into a highly customized CPU that achieves the following goals:

• A CPU which enables creation of a Secure Host processor that provides enhanced security by default as well as by design,

• A Secure Host processor that enables reuse of valuable legacy source code without adaptation beyond a simple recompile, and

• A demonstration of a number of security features simultaneously implemented in hardware that operate without co-interference, are transparent

to the user, and provide improved security over the existing state of the art.

Chapter 3

Secure Host CPU

3.1 Introduction

Chapter 2 focused on examples of secure processing systems, current vulnerabilities in commodity computer hardware, and concepts, designs, and demonstrations of point defenses against documented threats. Defense in depth is a philosophy that includes inherent system characteristics [49] and layered defensive strategies [78]; we have targeted a number of features amenable to implementation in hardware as inherent characteristics of a CPU that strengthen the layered defenses for a Secure Host capable of running legacy code without the need for software modification or accommodation other than recompile. This chapter presents a summary of our top-level design. It includes specific security-related hardware features as well as some architecture choices made for reasons other than security. For example, general purpose register architecture, register names, and instruction mnemonic codes do not relate directly to security but were chosen for utility and similarity to the Intel x86 processor family in order

to leverage familiarity with that architecture. This chapter includes items which may not be represented in the proof of concept prototype but which nonetheless are based on security considerations and intended for future implementation should the Secure Host CPU move from prototype to brassboard or first article.

3.2 Secure Host CPU Design Features

In the following sections we focus on high level architecture and design of the CPU. Implementation details for the CPU and results obtained in creation of the prototype are deferred to Chapter 4 to cleanly separate design and implementation.

3.2.1 High Level Architecture

At the highest level the Secure Host CPU should be described as an x86-like 32-bit

Central Processing Unit (CPU). Similarity of the Secure Host to Intel's ubiquitous x86 family was intentional and intended to provide immediate familiarity to practitioners in computing science. While the native address and data widths of the CPU are 32 bits, like the x86 our CPU provides instructions for transfer and processing of byte and word as well as double-word data, and the Arithmetic and Logic Unit (ALU) implements flags required for signed and unsigned integer arithmetic and logic operations in these widths.

3.2.2 Memory Architecture

In a notable departure from the x86 family, this version of the Secure Host CPU is based on a flat (non-segmented) memory model. Otherwise it is capable of addressing 2^32 bytes of memory with byte, word, or double-word addressability,

and also like the x86, 16- and 32-bit data are stored in memory in little-endian fashion. The Secure Host CPU is also a modified-Harvard CPU. Program instructions and user data are stored in separate regions of the flat memory field with procedural controls for separation, but address and data buses are common to the entire memory field.

Figure 3.1: Secure Host CPU Concurrent General Purpose Registers

3.2.3 Register Architecture

The chosen register architecture for the Secure Host CPU mimics the x86 family with the use of concurrent general purpose registers such as the example concurrent

‘a’ group (eax, ax, ah, and al) illustrated in Figure 3.1. In this figure eax is a 32-bit base register which shares its lower 16 bits with ax; likewise, ax shares upper and lower bytes with ah and al. Writing to the lower 16 bits of the base register modifies the contents of the concurrent registers, but this arrangement allows the general purpose base registers to support 8- or 16-bit processing in a compact form without the need to mask and shift wider objects.
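The aliasing of Figure 3.1 can be mimicked in C with a union, which is a convenient way to see the behavior the hardware provides; the anonymous struct used here assumes a C11 compiler and the little-endian layout of the Secure Host CPU.

#include <stdint.h>
#include <stdio.h>

typedef union {
    uint32_t eax;                     /* 32-bit base register      */
    uint16_t ax;                      /* low 16 bits of eax        */
    struct { uint8_t al, ah; };       /* low and high bytes of ax  */
} reg_a;

int main(void) {
    reg_a r = { .eax = 0x12345678u };
    printf("eax=%08X ax=%04X ah=%02X al=%02X\n", r.eax, r.ax, r.ah, r.al);

    r.al = 0xFF;                      /* a byte write is visible in the wider views */
    printf("after al write: eax=%08X\n", r.eax);
    return 0;
}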

Figure 3.2: Single Stack Example (Repeated from 2.8)

3.2.4 Stack Architecture

The Secure Host CPU stack architecture represents a significant departure from most current architectures, including the x86. As a very brief review of section 2.7.1, most CPUs follow the x86 architecture of a single stack for interleaved control flow and user data, and stacks commonly grow from higher memory addresses to lower addresses as stack frames are added. Figure 3.2 is repeated from Figure 2.8 in Chapter 2 as a convenience to the reader. We stress stack frames (or activation records) to point out that in this figure new callee stack frames are below the caller frame at a lower memory address; however, as data is written to or read from a buffer stored in a stack frame as a local variable, writing and reading follows conventional sequential memory access from lower memory addresses to higher addresses. With reference to this figure then it could be said in a summary phrase that “frames grow down while buffers grow up”.

It can be readily seen from this illustration that an overflow of chr_buf[1024] larger than the width of ebp in the region labeled ‘Current Stack Frame’ would overwrite the stored return address and (intentionally or unintentionally) improperly modify program flow as well as potentially modify or corrupt caller data. The Secure Host CPU stack architecture provides two defenses against this type of attack or error as discussed below.
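The pattern in Figure 3.2 corresponds to the classic unchecked copy below; the function and buffer name echo the figure and are illustrative only. On a conventional single, downward-growing stack, an overlong input runs past the saved frame pointer and return address that sit just above the buffer.

#include <string.h>

void vulnerable(const char *input) {
    char chr_buf[1024];               /* local buffer in the current stack frame */
    strcpy(chr_buf, input);           /* no bounds check: input longer than 1024
                                         bytes overwrites saved ebp and then the
                                         stored return address */
}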

3.2.4.1 Reverse Stack Growth

To add specificity to this section title we should restate that program sequential access to stack-based buffer or array variables follows the norm of increasing memory addresses, and add that multi-byte stack-based variables appear in stack memory in conventional little-endian forms. ‘Reverse’ stack growth refers to the fact that as callee frames are added to the stack the memory addresses of successive new frames increase. The result of this architecture feature is that sequential buffer reads or writes occur in the same direction as new frame addition; therefore a buffer overflow will not corrupt or overwrite caller data or the current stack frame's return address. It is recognized that the urgency of reverse stacks is reduced by dual or split stacks covered in the next section; however, we regard it as a no-cost architecture feature that provides the benefit of isolation of caller data from buffer overruns and adds diversity to the computing ecosystem1.

1 Williams et al. [173] noted that “... much of the fragility of our networked computing systems can be attributed to the lack of diversity or monoculture of our software systems.”

90 3.2.4.2 Dual Stack

The Secure Host CPU provides two separate hardware stacks: one exclusively for program control data and one for user data. In addition to separate physical stacks, logical separation is enforced by the CPU hardware. Implementation details are given in Chapter 4, but this is accomplished by providing no machine PUSH or POP instructions to the control stack. After a supervisory mode initialization of the control stack pointer (ecp) during program load, control stack data storage and retrieval is handled by the CPU's instruction fetch and branching logic. Further, the control stack is afforded the same protections from user read/write access as program memory. This prevents interactive user actions as well as rogue or malicious programs from modifying program control flow via manipulation of stack-based data.

3.2.5 Instruction Set Architecture

The Secure Host CPU Instruction Set Architecture (ISA) is defined using x86-like mnemonics and generally reflects the x86/Intel instruction operand1, operand2 form. This provides familiarity for experienced x86 programmers as an aid to understanding and programming the CPU. Notably, the ISA is implemented strictly in 64-bit single-width instruction words as illustrated in Table 3.1. While x86-like variable multi-byte encoding with expansion bytes where needed would result in a smaller memory footprint, variable width encoding blurs or eliminates instruction word alignment and greatly increases the opportunity for gadget construction (section 2.9.1). Given today's low cost and high density memory options, in-memory size of application code is

a low-priority concern, especially when increased security risk is the trade-off.

Table 3.1: Secure Host CPU Instruction Word Description

# Bits   Description
16       Instruction Operation (opclass)
 2       Transfer width (target = source except for MOVS and MOVZ)
 2       Operand1 Type (Register and Memory Flags)
 1       Reloc Flag
 3       Operand2 Type (Immediate, Register, and Memory Flags)
 2       Unused (reserved)
 6       Operand1 (Register ID Code)
32       Operand2 (Register ID Code or Immediate Value)

Detailed discussion of the instruction word internals will be deferred to Chapter 4, but a reasonable high-level question centers on the choice of 64 bits. From Table 3.1 it can be inferred that the absolute minimum size of the instruction word must accommodate at least one 32-bit address or immediate value to support the 32-bit architecture and one general-purpose register code for a second operand, plus opcode space and control bits. Table 3.1 indicates that register codes are 6 bits, so it might be surmised that a usable constant-width instruction word could be constructed in 48 or 56 bits if byte alignment is desired. Where binary hardware structures are employed and the cost of storage and program memory is not high, there is scant incentive to remain below the 2^n integer multiple of 64 bits. Given that constant-width 64-bit instruction words would be used, hardware elements of the CPU could be tailored to enforce 8-byte (quad-word) instruction word alignment. Even though the CPU is based on a 32-bit architecture, enforced quad-word instruction-fetch boundaries further suppress gadget opportunities as another level of defense against ROP and JOP attacks (section 2.9).
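For reference, Table 3.1 maps naturally onto a C bit-field; the sketch below lists the fields from most significant to least significant and is descriptive only, since the actual bit ordering and encoding are settled in Chapter 4.

#include <stdint.h>

typedef struct {
    uint64_t opclass  : 16;   /* instruction operation                        */
    uint64_t width    : 2;    /* transfer width                               */
    uint64_t op1_type : 2;    /* operand1 register/memory flags               */
    uint64_t reloc    : 1;    /* relocation flag                              */
    uint64_t op2_type : 3;    /* operand2 immediate/register/memory flags     */
    uint64_t reserved : 2;    /* unused                                       */
    uint64_t op1_reg  : 6;    /* operand1 register ID code                    */
    uint64_t op2      : 32;   /* operand2 register ID code or immediate value */
} secure_host_insn;           /* 16+2+2+1+3+2+6+32 = 64 bits                  */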

An instruction reference and summary table are provided in Appendix C. Given the average reader's familiarity with microprocessors it is sufficient to say that the Secure Host CPU ISA provides data movement, integer-arithmetic, logic, branching, and control instructions, including an interrupt instruction for coordination of system services, as would be expected in a utility CPU. The most significant addition to the ISA is the LAND group added for hardware-enforced program control flow mediation. It is emphasized that LAND group control flow mediation is a security enhancement rather than a change in functionality (i.e., this instruction group does not add new CPU functionality; its existence prevents the subversion of intended control flow). Except for the new Intel ENDBRANCH control transfer termination instruction discussed in section 2.9.4, we are not aware of the use of instructions of this type in any other general purpose CPU.

3.2.5.1 LAND Group

The LAND group comprises the LANDJ, LANDC, and LANDR instructions that provide landing pads for conditional or unconditional branch, call, or return instructions. Landing pads are inserted by the Secure Host compiler without notation or adaptation of existing source code and enforced by logic in the Secure Host CPU. Failure to sense a proper landing pad during program redirection indicates an attempt to subvert flow (see section 2.9.4) and results in an error trap. Additional implementation details and how they relate to source code compilation and assembly are provided in Chapter 4.
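A hedged sketch of the mediation rule follows; it models, in C, the check we expect the fetch logic to apply (Chapter 4 gives the real implementation): after a taken branch, call, or return, the instruction fetched at the destination must carry the matching LAND opclass or the CPU traps.

enum opclass { OP_JMP, OP_CALL, OP_RET, OP_LANDJ, OP_LANDC, OP_LANDR, OP_OTHER };

/* Returns nonzero if the newly fetched instruction is an acceptable landing
 * pad for the control transfer that preceded it. */
static int landing_ok(enum opclass prev, enum opclass fetched) {
    switch (prev) {
    case OP_JMP:  return fetched == OP_LANDJ;   /* branch must land on LANDJ      */
    case OP_CALL: return fetched == OP_LANDC;   /* call must land on LANDC        */
    case OP_RET:  return fetched == OP_LANDR;   /* return must land on LANDR      */
    default:      return 1;                     /* sequential flow: no pad needed */
    }
}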

93 3.2.6 Instruction Set Randomization

Instruction Set Randomization (ISR) was introduced in section 2.15 as a defense against code injection and ROP or JOP leveraging of program code already in memory. By ‘randomizing’ program instructions as they are loaded into memory and recovering the plain-text contents between instruction fetch and instruction decode we deny the attacker a plain-text memory field even if access should be gained. We hasten to add this is not intended as a cryptographic-grade measure but serves to obscure the contents of memory and greatly reduce the risk of code injection or ROP/JOP exploits even when an attacker has sufficient dwell time to employ iterative methods. We are also aware of the effectiveness of statistical attacks on a simple XOR scheme on plain text such as Kc's approach shown in Figures 2.15 and 3.3a; however, access to an adequate portion of the cipher text is a necessary requirement to prosecute such an attack and we hold that once introspection exists an intrusion has already occurred.

Nonetheless we considered enhancements to Kc's example that would provide a ‘greater degree of randomization’. Figure 3.3a restates and amplifies Kc's approach, showing symmetrical encoding/decoding with a fixed n-wide key; Figure 3.3b expands this to a notional (also symmetric) enhancement using a keystream from a block of random numbers or a pseudo-random number generator (PRNG). If the keystream has a period2 greater than the plaintext source this scheme is equivalent to one-time-pad (OTP) encryption [16]; however, stream decryption requires that the ciphertext and keystream be synchronized. This is not a problem

2 Period is the length of a block of random numbers or the length between sequence repeats for a PRNG.


for fixed-length messages that can be replayed linearly from the beginning, but it is an impediment to Instruction Set Randomization.

Figure 3.3: Fixed Key and Keystream Randomization

Machine code is written linearly to program memory from a binary code file, but readback is not linear or even always in the same segment sequence due to conditional branching. Should greater randomization than a single fixed key be desired, our CPU design includes an option for a weak but more random process emulating short-period OTP encryption. Figure 3.4 illustrates a Look-Up Table (LUT) based enhanced randomization implementation that can provide symmetry and self-synchronization with low hardware overhead. With an eye toward returning to an FPGA implementation, the design of Figure 3.4 allows a variation of the LUT space required versus key period desired. For example, using the low byte of the program memory address as input to the LUT with a 4-fan 32-wide output is compatible with the Secure Host CPU's 32-bit architecture and reduces LUT requirements to 2^8 × 8 = 2048 bits; a full 32-bit output would increase randomness with a LUT size increase of only 4×. Figure 3.5 shows the logical data flow for the design as it would be used in the Secure Host CPU. Everything except Disk

Figure 3.4: Look-Up Table-Based Randomization

Figure 3.5: ISR Block Diagram

Stores is contained in the CPU and the Encode/Decode blocks are realized in a single switched instance of the LUT-based encoder/decoder of Figure 3.4. While in supervisory mode the CPU initializes the LUT memory and loads the program code using the low m bits of the load address as the input to the LUT. Decoding of program memory uses the same LUT and low m bits of the instruction pointer

(eip) during instruction fetch to present the instruction decoder with a plaintext instruction word not otherwise visible within the CPU after program load. Using this scheme program memory is randomized or encrypted with a 256-word key period, and synchronization for decode is obtained from the memory address bus during instruction fetch. No particular conditioning of the keystream stored in

the LUT (such as an LCG3) is required and ISR can be effectively disabled for program debugging by loading the LUT with all 0's. Since the LUT or its single key counterpart are privileged machine registers not accessible after user program load, plain text machine instructions are not accessible to user space code even if an exploit should gain read access to program memory. If we assume some mechanism should allow an attacker to gain write access to program memory for the purpose of code injection, the injected code would have to be properly formed in its plain text version and encoded with a copy of the same Encoding Key(s) used at program load time. Even if there were no other security features in the Secure Host CPU to separate program code from user-space data on the stack, we would point back to Kc's estimation [83] that an attacker would have no better chance of success at code injection than if a random number generator was used to generate the payload.
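The LUT scheme of Figures 3.4 and 3.5 reduces to a few lines of C; the sketch below uses the low 8 bits of the load or fetch address to select the key word, giving the 256-word key period described above. Key initialization with rand() is only a placeholder for the supervisory-mode key fill.

#include <stdint.h>
#include <stdlib.h>

static uint32_t isr_lut[256];                 /* privileged key table, not user-visible */

static void isr_lut_init(void) {              /* supervisory mode, before program load  */
    for (int i = 0; i < 256; i++)
        isr_lut[i] = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
}

/* The same operation encodes a word at program load and decodes it at
 * instruction fetch; the low m = 8 address bits keep the two synchronized. */
static uint32_t isr_translate(uint32_t address, uint32_t word) {
    return word ^ isr_lut[address & 0xFFu];
}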

3.3 Field Programmable Gate Arrays

Up to this point in the chapter we have presented architectural and design features chosen to support and further the security objectives of the project, specifically, hardware-based security features. Chapter 2 explained much of the ‘why’ of our project and the preceding sections of this chapter have outlined a list of identified ‘whats’. Before we move to the next chapter to address specifics of our prototype Secure Host CPU implementation it is appropriate to at least preview a proposed ‘how’. While demonstration prototype(s) and final target vehicle(s) are integral

3 A linear congruential generator [52] governed by the equation x_i = (a·x_{i−1} + b) mod n can produce a pseudo-random distribution 0 ≤ x ≤ n − 1 with careful selection of a, b, and n.

to the implementation, selection of vehicles overlaps design and implementation, especially in a spiral or phased development effort. We now turn to a preliminary look at our ultimate design goal and the enabling technology: to design a more secure RISC-like microprocessor with our proposed security features and implement a test article in hardware using a Field Programmable Gate Array (FPGA) device. An FPGA presents an attractive implementation option in order to demonstrate CPU features in hardware without the time and expense of very large scale integrated (VLSI) circuit development. Many of the proposed security enhancement approaches surveyed in the literature review are dependent on changes to the host processor architecture for implementation. Some can be implemented without hardware changes via emulation or virtualization but would then be subject to performance impacts, residual attack risks, or introduction of new or different system vulnerabilities; these would also reasonably benefit from hardware architecture changes. We believe FPGAs can provide adequate secure host performance in targeted applications because studies reported by Woodruff characterize CHERI as “performance-competitive” [174], although we realize CHERI, as a long-running effort, is likely well optimized. Since an FPGA implementation of the hardware-based security features we selected would be the preferred candidate for a prototype demonstration vehicle, this section is provided to briefly introduce FPGAs, describe how logic functions are implemented by the designer, and present typical design tool sets used.

3.3.1 Example Logic Functions in FPGAs

For the reader not familiar with implementation of complex logic functions in universal logic elements, the following examples are provided. Collections of NAND

gates can be used to implement arbitrary logic functions such as the combinatorial example from Swan [151] shown in Figure 3.6.

Figure 3.6: Realization of Combinatorial Logic (from Swan [151] Fig. 1.(b))

Inverters can be implemented with single-input gates, and large fan-ins beyond the capability of a single gate can be accommodated by summing parallel gates. The same logic functions can also be realized with NOR gates. Using a single gate type simplifies the semiconductor wafer manufacturing process, and the selection of NAND versus NOR becomes an implementation choice. To implement latches and sequential circuits, NAND-NAND or NOR-NOR logic gates are used to create Set-Reset (SR) latches or flip-flops as basic elements as shown in the examples from Stroud [148] (Figure 3.7).

Figure 3.7: NAND-NAND and NOR-NOR Latches, Stroud [148]

With additional gates, the SR latches can be logic-enabled to create D latches, or clocked to create synchronous D flip-flops.
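For readers following Figure 3.6, the universality of the NAND gate can be spelled out in a few lines of C that treat ints as logic levels; each helper mirrors the gate-level construction used when only one gate type is available.

static int nand(int a, int b) { return !(a && b); }

static int not_gate(int a)        { return nand(a, a); }
static int and_gate(int a, int b) { return not_gate(nand(a, b)); }
static int or_gate (int a, int b) { return nand(not_gate(a), not_gate(b)); }
static int xor_gate(int a, int b) {               /* four-NAND XOR construction */
    int n = nand(a, b);
    return nand(nand(a, n), nand(b, n));
}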

3.3.2 FPGA Manufacture and Function Implementation

An FPGA contains dense fields of logic devices from basic universal gates to complex logic blocks such as multiplexers (MUXs), Adders, Lookup Tables (LUTs),

bulk memory, and input/output (I/O) circuits for external interfaces. The logic block fields are overlaid with interconnection fabrics or meshes of field programmable interconnects. The designer expresses the logic functions to be implemented in the chip using a specialized hardware description language (HDL) that is compiled into programming files; programming files represent bit masks for selective programming of the interconnection fabric among selected logic blocks or gates and I/O pads. When the programming file is written to the FPGA chip, the chip will provide the programmed logic function(s) between its input and output pins.

In addition to the programmer's expression of logic functions in HDL, a programmer may make use of very complex optimized logic functions such as integrated memory controllers that are provided by the FPGA vendors and licensed as intellectual property (IP); these modules are referred to as IP-core modules or IP cores and may be furnished as ‘hard IP cores’ or ‘soft IP cores’ [4]. Hard IP cores are included in certain FPGAs as embedded hardware. Their functions and layouts are highly optimized to provide better performance than might be achieved in a field-programmable instantiation. Examples of common hard IP cores are RAM/ROM blocks, DSP modules, PCI and JTAG controllers, and RISC CPUs. Soft IP cores are provided as library-like binaries for common complex functions such as Ethernet controllers, UARTs, DRAM and SSRAM controllers, and specialized DSP, µP, and µC functions. IP-core functions are accessed via defined

interfaces very much like software library interfaces. The programmer's logic functions are integrated with other IP cores to complete the final expression of the FPGA as a programming file. When the programming file is written to an appropriate FPGA to define the final logic element interconnections, the FPGA becomes the electronic logic device described in the hardware language. FPGA devices range from small, low-density devices for simple logic functions to high-density devices capable of complex operations up to complete system-on-a-chip (SOC) functions. Such will be the ultimate goal for implementation of a Secure Host CPU: a locally generated, hardware system on a chip that can be tested, modified, reprogrammed, and tested again to implement corrections, improvements, and/or alternate configurations for more advanced characterization and test.

3.3.3 Hardware Description Language

The two most common choices for hardware description languages (HDLs) are Verilog (Verilog® HDL) and VHDL (Very High Speed Integrated Circuit (VHSIC) Hardware Description Language). FPGA vendors provide electronic design automation (EDA) tools including integrated development environments (IDEs) suitable for Verilog and VHDL. EDA tool sets include editors, compilers, optimizers, circuit synthesizers, and logic simulators. Verilog and VHDL are similar in many respects with broadly overlapping capabilities; VHDL is considered to be stronger in system-level abstraction [144], while Verilog is more agile in low-level logic. A subjective assessment of their capabilities is shown in the chart of Figure 3.8. For the reader interested in an HDL overview, a comparison of Verilog and VHDL along with a sample algorithm implemented in

both languages and in C is given by Smith [144], who accurately notes that HDL choice is frequently a matter of personal preference of the designer and availability of EDA tools.

Figure 3.8: HDL Modeling Capabilities (Smith [144], Fig. 1)

3.3.4 Possible Alternatives to FPGAs

Alternatives to FPGAs for implementation of the secure host are emulation of the hardware in software, ASICs, and a ‘soft core’ processor such as the Transmeta™ Crusoe™ [158]. These alternatives were weighed and rejected for the reasons given below. Emulation of the hardware in software is not the preferred option for other than highly exploratory initial investigation. Emulation defeats the intent of ‘hardware-based’ security features and would suffer performance issues. Implementation of the secure host in an Application Specific Integrated Circuit (ASIC) is a very good possibility as a further investigation task should the secure host mature, as an ASIC would provide better performance than an FPGA implementation [56]. ASIC production follows the same general development arc

as FPGAs with respect to HDL programming, simulation, and synthesis. ASICs range from simple ‘hard mask’ production over an FPGA-like foundation chip to custom chips generated from standard logic cells [35], but after design, debug, and simulation, ASICs still require fabrication time after the netlist is available. The hard mask over standard cell approach would require fab time from metallization up; for a full-custom ASIC, fabrication time would be from silicon diffusion up. Strengths of the FPGA over ASIC include field programmability, lower design cost, reduced timeline to first availability, and lower overall cost for low rate production. Strengths of ASIC over FPGA include higher performance, lower power consumption, and lower cost for high rate production [46, 56]. The bottom line is that ASICs would be appropriate for a mature design in high production rates, but the scope of our current project demands FPGA over ASIC for the present.

The final candidate alternative was one of the Transmeta™ Crusoe™ Processor family. The Crusoe™ was available in several models [158], all providing a highly capable x86-compatible processor with integrated Northbridge DDR/SDR/PCI and

MMX instruction support. The Crusoe™ excelled in low-power, high-performance computing for embedded applications and was well represented in the early

2000s with more than two dozen laptops and notebooks designed around it. The Crusoe™ family of processors first came onto our radar due to its configurable nature. The processor is based on a native Very Long Instruction Word (VLIW) processor that is mapped to x86-compatible code through a Code Morphing Software (CMS) layer [41, 158]. The CMS combined code interpretation, dynamic translation, and run-time support that optimized code during run time as patterns were detected. The processor coped with self-modifying code using flags to indicate translated memory regions; when a modification to the x86 code was

detected, the flagged page was invalidated and retranslated, restarting the optimization process [86]. The processor hardware was simplified by retaining complex and infrequently used functions in the CMS. The Crusoe™ was ruled out as an option due to the CMS being heavily designed for straight x86 compatibility and not reconfigurable in hardware. It was noted that Transmeta™ activity in general tapered off rather quickly after an initial wave of interest, and we learned that the Transmeta™ patent portfolio was acquired in 2009 for use in improvement of proprietary designs and non-exclusive licensing to third parties [75].

3.4 Exception Handling

At this point we have covered the design features of the Secure Host CPU, covering basic functionality quickly and security features in detail, followed by a quick look at possible means of realizing a product for a proof of concept demonstration. An area not addressed in depth has been events following the CPU’s detection of an error by any of the hardware security features, or an ‘error trap’. Before leaving the design section we will state an over-arching assumption or criterion: an error or exception should trigger an assumption of intrusion or attempted intrusion, and the processor will preserve state data and halt rather than attempt to continue or resume operation. This approach may not be consistent with some definitions of ‘robust’ or ‘reliable’ but we assert that it is the only approach that can reasonably meet the definition of ‘secure’ in the presence of unknown or evolving threats. We accept that there are many, many approaches that could result in a Denial of Service (DoS) attack, but we refer back to the definitions of section 2.6.1 and restate that our Secure Host requirement is to prevent the execution

of unintended system functions. Continuing operations from a questionable or compromised state is inconsistent with this requirement.

3.5 Application Summary

By way of summarizing our chosen design features, it is worth stating we have no illusions of creating a generic x86 replacement CPU, nor do we expect to compete with processors that are optimized for specific high-performance applications. Our intent is the implementation of a Secure Host based on a CPU providing designed-in hardware-enforced security features. Ideally, the Secure Host would be easily deployed and operate in a sparse computing environment (i.e., minimal hardware and OS support), and provide a secure platform for common network applications or network front-end processors for specialized high performance machines. To this end we have focused on the user or client interface or network socket as the untrusted-to-trusted demarcation point.

Chapter 4

Secure Host CPU Implementation

The impetus for the Secure Host CPU began as a sub-task under an AF Rome Laboratory study called Secure Cognitive Network Manager (SCNM). While the study focused on a reliable and secure battlefield network manager, a sub-task investigated the feasibility of a custom CPU that would be inherently resistant to cyber attack. Before the close of SCNM the Secure CPU subtask demonstrated the feasibility of a custom CPU implemented in a Field Programmable Gate Array (FPGA) under the control of a custom debugger [25]. After SCNM concluded, this author set out to extend the Secure CPU prototype to add additional features, integrate it into a Secure Host and Testbed, and perform testing. We began the effort with a literature survey of current vulnerabilities and identification of CPU hardware security features to harden a host processor against attack (Chapter 2). Using those hardware-based security features we produced a high level design for a functional Secure Host CPU (Chapter 3) to be implemented and integrated into a host computer for a proof of concept demonstration. This chapter begins with a summary of early implementation efforts for the

Secure CPU including the early prototype CPU in FPGA. Section 4.3 and following provides design and implementation details for the emulated Secure Host CPU used in the proof of concept demonstration system. Support hardware and software including the software toolchain comprising the Testbed are covered in Chapter 5 along with results of the proof of concept demonstration.

4.1 Early FPGA Prototype

An FPGA implementation for the custom CPU was desired primarily for performance reasons as demonstrated by CHERI (section 2.5.6) and the fact that security features embedded in hardware are resistant to unauthorized changes. Early experimentation was conducted on a Spartan 6 FPGA module on a Trenz Electronic GigaBee carrier board as used for the SCNM demonstration. This configuration used a Windows host computer to communicate with the GigaBee test vehicle via a USB/JTAG serial interface to load FPGA configuration files, run and halt the FPGA, and read and write internal FPGA memory. The FPGA and debug host formed the foundation of the first iteration of the Secure Host Testbed and enabled early development and testing of CPU features. Network socket management and JTAG driver scripts were added to the debug host as illustrated in the block diagram of Figure 4.1, creating a network appliance to serve a remote client computer. The photo of Figure 4.2 shows the test configuration with the FPGA module flanked by a client computer on the right and the debug host on the left. Valuable experience was gained with the early FPGA prototype, but as features were added to the CPU, limited VHDL support and concerns over the JTAG interface bandwidth prompted a decision to shift to software emulation of the Secure Host CPU and defer further FPGA development until after the proof of concept demonstration.

Figure 4.1: Early FPGA Testbed Block Diagram

Figure 4.2: Early FPGA Prototype CPU

4.1.1 C99 Emulator

While shelving the FPGA prototype was not in the original plan, moving to an emulated CPU allowed use of more familiar development tools and greater ease in

integrating the ‘sandboxed’ CPU emulation process with the support host. C99 was deemed the language of choice for the Secure Host CPU emulator due to ease of bit manipulation and greater similarity to VHDL than other high level languages. This allowed some leveraging of VHDL concepts already in use in the FPGA prototype and improved the possibility of porting more of the C99 work back to VHDL in a future effort.

4.2 Review and Introduction

With the preceding historical background on the formative stage of the Secure Host CPU effort in mind we return to a high level review of the Secure Host CPU architecture before moving into implementation details. A brief scan of some of the following section headings may suggest that this chapter is a repeat of the design information of Chapter 3. Similarity in subjects and sequence of this chapter to the previous is intended for consistency, but the focus of the previous chapter was architecture, features, and goals; the remainder of this chapter is devoted to documentation of implementation choices and details for the custom CPU. Chapter 5 will follow with integration into the Secure Host, toolchain development, and the proof of concept demonstration.

4.3 CPU High Level Architecture

To restate Chapter 3, the Secure Host CPU is a 32-bit modified-Harvard processor maintaining program code and user data in separate memory regions. Native data widths of 8, 16, and 32 bits are reflected in the CPU instruction set, supporting general purpose registers, and the ALU. The ALU provides all required flags

to support signed and unsigned integer arithmetic, logic, and conditional branch operations. Memory regions are assigned at load time and logically separated; however, CPU access to the respective memory sections uses common internal buses. Run-time separation is enforced by monitoring of absolute addresses during program execution with error trapping should an application program attempt execution of instructions outside of the defined program memory region. While the CPU's flat memory field is byte addressable, strict enforcement of instruction word alignment boundaries during program load and operation prevents gadget construction from non-aligned machine code. Implementation details for the Secure Host CPU are presented below in architectural groups following the sequence of Chapter 3.

4.3.1 Secure Host CPU Emulator

The Secure Host CPU Emulator is a monitor and debug console supervising a collection of data structures and logic functions that describe a central processing unit in much the same manner as VHDL code in the early FPGA prototype. An important difference is the extent of integration between the Emulator and the CPU logic. In the FPGA version a bit file defining the CPU functions was loaded followed by user load files containing initialization data for CPU registers, Secure CPU machine code (.code or .text memory section), and initial user data (.data section). The FPGA was released to Run mode and periodically Halted for polling of a reserved direct memory access (DMA) input/output (I/O) block to determine processor state and the status of waiting I/O operations. If the processor was in

a normal state the processor was returned to Run mode after pending I/Os were serviced. If the processor encountered a fault or trap state it would write status to I/O memory and internally Halt to await the next supervisory poll. In contrast, the Secure Host CPU Emulator, including supervisor, monitor, and debugger processes, is loaded in paged memory as a user-privilege Linux process. The supervisor initializes a section of its paged heap as a flat memory file marking the beginning of physical memory as virtual address 0 for the emulated CPU. After the CPU is initialized the Emulator loads the user program file named in the Emulator command line parameters, reserves code memory for the Secure Host CPU user application program, and loads program and data memory. Upon completion of initialization and load tasks, the Emulator console presents a conventional debug menu allowing the operator to examine or modify memory and registers, single-step the user application program, or release the emulated CPU in continuous Run mode. Significant events are logged to the console to monitor operations and in the event of an error trap, processor state is preserved, a diagnostic error message is logged, and control is returned to the console. In operation, the Secure Host CPU is an interruptable and re-enterable routine within the single-threaded Emulator process. The Secure Host CPU executes its native functions as a physical CPU would with the exception of Operating System (OS) calls. The Secure Host CPU instruction set includes a software interrupt function (INT) which is called in the same manner as a Linux system call. OS support and syscall function discussions will be deferred to Chapter 5; it is sufficient at this point to simply state that user program syscalls invoke callback routines to the Emulator. This accommodation avoids the need to port an OS to Secure Host CPU machine code while allowing OS level support for Secure Host application

program demonstrations. The Emulator process can be debugged using any system debug or development tool, and the Emulator contains an integrated debugger and disassembler tailored to the emulated Secure Host CPU architecture and its customized instruction set.

4.3.2 Data Types in the C99 Emulator

Within this document we are presenting architecture and design details with the intent of leaving as much code detail as practical to source code. This is especially appropriate should a follow-on effort lead back to re-engagement with the FPGA for performance and hardware control reasons. We have abstained from discussions of programming except where programming methods are important to the understanding of a vulnerability or the architecture or function of the Secure Host CPU. Data type choices in our emulator code fit this exception. C is a weakly-typed language [107], but it is not untyped and type applies to data contained in structures more than the structures themselves. In contrast, VHDL is considered a strongly-typed language [92], but type applies to structures and RHS/LHS equivalence rather than contents. Due to fixed structures and dynamic contents we struggled somewhat with type safety for dynamically typed C99 variables as we transitioned from VHDL to C99. Inconsistencies were addressed as the project matured but can still be found in the emulator code.

4.4 Memory Architecture

With the additional background of section 4.3.1 we can return to CPU-level discussions. The Secure Host CPU's address lines A31 down to A0 are capable of addressing up to 2³² bytes of flat (non-segmented) byte-addressable memory or the limit of process memory reserved for the CPU. Data memory alignment for multi-byte data is not enforced by the Secure Host CPU; however, non-aligned operations would suffer the normal performance penalties arising from a 32-bit data bus. Once in Run mode, the Secure Host CPU operates on customized machine instruction words (section 4.8.9) strictly from program memory reserved at load time and limited to the size of the machine instruction code actually loaded. Local variables are stored in the user's data stack (sections 3.2.4.2 and 4.6.2) in the usual manner. Program-scope variables declared by user programs are stored in the .data section of the load file and are initialized in user data memory, and provisions are made as part of the Emulator function for user program allocation/deallocation of heap (data) memory via OS syscalls similar to Linux malloc() and free() functions. One memory detail of importance to a programmer/debugger is the endian-ness of in-memory multi-byte data. The FPGA prototype as well as the Secure Host CPU and its Emulator all follow the x86 model of little-endian in-memory storage. To close out the memory architecture section, we would add that non-segmented memory was an expedient option consistent with the scope of the proof of concept demonstration and the assumption that the Secure Host CPU would target a ‘utility’ CPU class. There are no perceived barriers to implementation of segment or address-extension registers; however, the 32-bit address field was deemed sufficient for the intended general-purpose CPU, and segment register implementation was

deferred for a future project.

4.5 Register Architecture

The Secure Host CPU register architecture provides 8-, 16-, and 32-bit concurrent and General Purpose registers following the x86 family model closely but with a slightly larger register set as illustrated in Figure 4.3. Notable departures from the x86 model are the appearance of the program control stack pointer ecp alongside the expected data stack pointer esp, and the addition of a Last Testament (elt) register. Since the control stack pointer is a hardware security feature (section 3.2.4.2) it is neither visible to nor modifiable by a user program, and is only included in Figure 4.3 as a programmer's reference. The Last Testament register is continuously updated to reflect the previous instruction pointer (eip) value for debug purposes and is also included as a programmer's reference. Like the x86, 32-bit registers can be used as memory address pointers to access

1-, 2-, or 4-byte memory values via x86-style byteptr, wordptr, or dwordptr qualifiers. During SCNM FPGA development it was demonstrated that concurrent x86-like GP registers could be handled procedurally in VHDL with fixed-width registers for compactness and hardware mask and shift routines. Read/modify/write cycles were required in the FPGA for writing to subordinate concurrent registers without modifying non-target bits (e.g., writing al without altering the value of ah). Concurrent register management was simplified in the Secure Host CPU Emulator because registers are mapped into a contiguous memory array and the concurrent registers overlaid on a 32-bit base register are individually addressable.

Figure 4.3: Secure Host CPU Register Architecture

Referring again to Figure 4.3 and regarding the illustration as a contiguous array of 32-bit registers, it can be seen that the starting address of al, ax, and eax would be the same while ah would be 1 byte greater. Register memory address details will be apparent when we cover register ID numbering (section 4.5.1.1), but given that the toolchain is aware of the starting memory location and width of each named register, it is clear from Figure 4.3 that (e.g.) al can be read or written in one C99 operation without altering the value of ah by casting its memory address as the appropriate pointer type from stdint.h (i.e., mem_val = *(uint8_t *)mem_addr).

4.5.1 Register Implementation

The SCNM FPGA prototype used named variables corresponding to the architectural names of base registers. For example, the SCNM assembler encoded the

32-bit register esi as a unique binary code which corresponded to the FPGA VHDL signal of type std_logic_vector(31 downto 0) named ESI. The remaining register-aware programming in VHDL was chiefly handling of general purpose concurrent registers. In view of Figure 4.3, register implementation in the Secure Host CPU Emulator was a straightforward choice of a named array of 32-bit variables of the form uint32_t reg_ar[]. While void * reg_ar[] would have been preferred in concept to reflect GP register dynamic content, an explicitly defined 32-bit width was used for portability. Given the choice of concurrent GP registers, a logical memory structure and coding scheme for the CPU Emulator was created to provide register concurrency without the performance impact of mask and shift operations for 8- and 16-bit registers. The next section presents details of the chosen register naming conventions and how they are used to steer register access.

4.5.1.1 Register Identifiers

Recall the previous discussion of Figure 4.3 and the fact that for concurrent registers, the 8- (low), 16-, and 32-bit registers all have the same beginning address. Given little-endian storage, the beginning address of the 32-bit register array would also be the beginning address of element zero of the register array (reg_ar[0]) and the

addresses of other registers in the array can be computed from offsets added to the starting address (&reg_ar[0]). Referring to Table 4.1 we see a row-column index scheme where any register can be addressed by reference to the Base column as a row index and the Type number (0 through 3) as a column index to obtain the offset. For any named

Table 4.1: Resulting Secure Host CPU Register IDs

        Type=0               Type=1               Type=2           Type=3
        32-bit base          16-bit lower half    8-bit byte1      8-bit byte0
        register
Base    Reg      ID          Reg      ID          Reg     ID       Reg     ID
0       eax      0           ax       1           ah      2        al      3
1       ebx      4           bx       5           bh      6        bl      7
2       ecx      8           cx       9           ch      10       cl      11
3       edx      12          dx       13          dh      14       dl      15
4       eex      16          ex       17          eh      18       el      19
5       efx      20          fx       21          fh      22       fl      23
6       egx      24          gx       25          gh      26       gl      27
7       ehx      28          hx       29          hh      30       hl      31
8       ebp      32
9       esi      36
10      edi      40
11      esp      44
12      ecp      48
13      eip      52
14      elt      56
15      eflags   63

register in the table, the adjoining Reg ID number is derived by multiplying the

Base column index by 4 (Base << 2) and adding the numerical value of the Type column, or as expressed in binary form, 0bBBBBTT where B and T are Base and Type binary values.

The key to interpreting register IDs (RegIDs) at access time is given in Table

4.2. The Secure Host CPU Emulator includes regread() and regwrite() register access functions which are called with a RegID as a parameter. The access routines mask the Base bits to determine the base register address offset, and the Type bits to determine (a) the access bit width, and (b) for 8-bit registers only, whether the desired value resides in byte0 or byte1 of the base register.

Table 4.2: Secure Host CPU Register Type Code Suffixes

Type Code   Bit Width   Base Reg Position   Description
0b00        32 bits     Bytes 3 - 0         32-bit GP registers
0b01        16 bits     Bytes 1 - 0         16-bit x registers
0b10        8 bits      Byte 1              8-bit h registers
0b11        8 bits      Byte 0              8-bit l registers
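A minimal C99 sketch of how this RegID decoding could be realized is shown below. It assumes a little-endian host (matching the Emulator's stated storage model); the array and function names follow the naming used in this chapter, but the bodies are illustrative rather than the Emulator's verbatim source.

    #include <stdint.h>

    /* Illustrative sketch only: decode RegID (0bBBBBTT) per Tables 4.1 and 4.2.
     * Assumes a little-endian host; byte pointers into the contiguous register
     * array avoid per-access mask and shift operations. */
    static uint32_t reg_ar[16];                /* contiguous 32-bit base registers */

    static uint32_t regread(unsigned regid)
    {
        unsigned base = regid >> 2;            /* BBBB: base register index        */
        unsigned type = regid & 0x3;           /* TT: width/position, Table 4.2    */
        uint8_t *p = (uint8_t *)&reg_ar[base];

        switch (type) {
        case 0:  return *(uint32_t *)p;        /* 32-bit base register             */
        case 1:  return *(uint16_t *)p;        /* 16-bit lower half                */
        case 2:  return *(p + 1);              /* 8-bit 'h' register (byte 1)      */
        default: return *p;                    /* 8-bit 'l' register (byte 0)      */
        }
    }

    static void regwrite(unsigned regid, uint32_t val)
    {
        unsigned base = regid >> 2;
        unsigned type = regid & 0x3;
        uint8_t *p = (uint8_t *)&reg_ar[base];

        switch (type) {
        case 0:  *(uint32_t *)p = val;           break;
        case 1:  *(uint16_t *)p = (uint16_t)val; break;
        case 2:  *(p + 1) = (uint8_t)val;        break;  /* 'l' byte untouched */
        default: *p = (uint8_t)val;              break;  /* 'h' byte untouched */
        }
    }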

Access to 32-bit base registers can be accomplished in a more direct manner avoiding regread() and regwrite() function call overhead by simply addressing the base register by its array name and index number. While run-time instruction decode favors the regread() and regwrite() functions for most register operand accesses, certain control registers such as instruction and stack pointer are accessed frequently outside of the core instruction decode sequence. For code readability and to reduce function overhead, these registers have defined aliases of the form

(e.g.) #define ra_ebp reg_ar[8] where the ‘ra_’ prefix differentiates the defined name from ‘ebp’ as a generic register name. In this manner frequently-used control registers can be accessed by single source-statement instructions referencing the register's alias.

Figure 4.4: Secure Host CPU eflags Register

4.5.2 Flags Register (eflags)

The final element in register implementation is the flags register (eflags). The current version of the Secure Host CPU implements Overflow (OF), Sign (SF),

Carry (CF), and Zero (ZF) flag bits contained in the low nybble of the eflags register as illustrated in Figure 4.4. This flag bit complement is sufficient to accommodate all required signed and unsigned integer arithmetic and conditional jump operations. While the eflags register is amenable to access via the register access methods described above, the requirement for masking single bits to be tested, set, or cleared drove the Emulator implementation to dedicated single-register functions accepting bit patterns as parameters defining single- or multiple-flag bit operations.
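A minimal sketch of such dedicated flag functions is shown below. The function names and the bit assignments within the low nybble are placeholders for illustration; Figure 4.4 defines the actual positions.

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative sketch only: single-register eflags helpers taking bit
     * patterns; flag positions and names are placeholders. */
    enum { FLAG_CF = 1u << 0, FLAG_ZF = 1u << 1, FLAG_SF = 1u << 2, FLAG_OF = 1u << 3 };

    static uint32_t ra_eflags;                 /* the eflags register */

    static void setflags(uint32_t mask)  { ra_eflags |=  mask; }
    static void clrflags(uint32_t mask)  { ra_eflags &= ~mask; }
    static bool testflags(uint32_t mask) { return (ra_eflags & mask) != 0; }

    /* Example use: update ZF and SF after a 32-bit ALU result. */
    static void update_zf_sf(uint32_t result)
    {
        clrflags(FLAG_ZF | FLAG_SF);
        if (result == 0)          setflags(FLAG_ZF);
        if (result & 0x80000000u) setflags(FLAG_SF);
    }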

4.6 Stack Architecture

Sections 3.2.4.2 and 3.2.4.1 discuss the two major features of the Secure Host CPU stack architecture: ‘reversed’ stack growth direction and dual or split stacks. Both features are illustrated in Figure 4.5.

Figure 4.5: Secure Host CPU Dual ‘Reverse’ Stacks

4.6.1 Reverse Stacks

A feature common to both of the dual stacks is that they operate in the opposite or reverse direction from the x86 family and most other contemporary CPUs. As shown in Figure 4.5, growth of the control and data stacks is from lower memory addresses toward higher memory addresses as new data is pushed onto the stack. Sections 2.7.1.3 and 3.2.4 stressed the concepts of frame growth and stack-based buffer growth, and the concepts exactly hold in the implementation of the data stack; however, the concept of ‘stack frames’ is less apparent on the control stack. In a conventional single stack, logical stack frames are sections of stack memory containing local process variables and/or parameters demarked by the return addresses of function callers. While dual stack data ‘frames’ are also blocks of process variables and/or parameters, storage blocks for each nested function are byte-adjacent (being absent return addresses), and the control stack, in the presence of nested functions, is simply a linear array of progressive return addresses. Another feature common to both of the dual stacks is that they are post-

incremented (after) push operations and pre-decremented (before) pop operations. The increment/push and decrement/pop associations arise from the chosen growth direction of the stacks. The post-/push and pre-/pop associations reflect an arbitrary design choice essentially limited to philosophical consequences. Referring again to Figure 4.5 it can be seen that a post-increment push results in the stack pointer initialization address becoming the first occupied stack slot. A pre-increment push would result in the initial stack pointer position never being used unless the stack pointer were initialized to a value 4 bytes lower than the stack boundary. In the current era of cheap and plentiful memory, one never-used stack slot is almost moot; on the other hand, we find the concept of intentionally setting an initial pointer value ‘outside’ or ‘below’ stack bounds to be objectionable. It could be argued that this means the last (highest address) slot of a defined stack area cannot be used or else the stack pointer, being incremented automatically in hardware, would exceed the stack upper bound and that would be equally objectionable1. We prefer the former because the architect knows where the stack begins (recall that the control stack is never under user program control)2; the user-programmer can estimate where the stack might end but would surely provide headroom above that point.

4.6.2 Dual Stacks

To recap the design features, dual stacks provide critical isolation by separating program control data, or more specifically, subroutine call return addresses from

1The voices made me write this. . .

2. . . but they are not wrong.

user data via separate physical stacks. User program code cannot read or modify the control stack pointer (ecp) or control stack contents. This prevents malicious code or improper user inputs from manipulating program flow via modification or corruption of function return addresses to prosecute ROP or JOP attacks (section 2.9). An important additional feature is that program machine instructions are never loaded from user memory or ‘heap’, and the user data stack exists solely in user memory. This prevents code injection attacks by denying the attacker a vector from stack overflows into executable program memory. In addition to being implemented in separate memory areas and having separate index pointers, stack usage patterns are unique and driven by the data stored. The data stack reflects the native 32-bit CPU and is implemented as a 4-byte dword-aligned structure for performance reasons. Data are pushed onto or popped from the stack in 4-byte dwords and the stack pointer (esp) is incremented or decremented by 4. Where pushed data is less than 32 bits in width, the unused high order bytes are zero-extended. When popped data is loaded into a target of less than 4 bytes the data retrieved from the stack is truncated to the appropriate number of least significant byte(s). The 32-bit data stack width does not prevent user programs from storing packed local variables or large data structures on the stack as described in section 2.7.1.5 by modifying the stack pointer to free larger blocks and addressing stack space as heap memory provided the stack pointer is restored during function or program cleanup. The control stack is also implemented as a 4-byte dword-aligned structure for performance reasons, and owing to its use strictly as a flow control structure, every data element on the control stack is a 32-bit program memory address pointing to quad-word aligned instruction words. Like esp, the control stack pointer (ecp)

is post-incremented or pre-decremented by 4 for each push or pop operation to maintain dword alignment of the stack.
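The following is a minimal sketch of the ‘reverse’, dword-aligned push and pop behavior described above; the memory arrays, pointer names, and the omission of bounds checking are simplifications for illustration only.

    #include <stdint.h>

    /* Illustrative sketch only: upward-growing stacks with post-increment push
     * and pre-decrement pop.  Names are assumptions; bounds checks omitted. */
    extern uint8_t  data_mem[];       /* emulated user data memory            */
    extern uint32_t ctrl_stack[];     /* control stack, not user addressable  */
    extern uint32_t ra_esp;           /* data stack pointer (byte offset)     */
    extern uint32_t ra_ecp;           /* control stack pointer (byte offset)  */

    static void push_data(uint32_t val)
    {
        *(uint32_t *)&data_mem[ra_esp] = val;   /* write the current slot...   */
        ra_esp += 4;                            /* ...then post-increment by 4 */
    }

    static uint32_t pop_data(void)
    {
        ra_esp -= 4;                            /* pre-decrement...            */
        return *(uint32_t *)&data_mem[ra_esp];  /* ...then read the slot       */
    }

    static void push_return_address(uint32_t retaddr)
    {
        ctrl_stack[ra_ecp / 4] = retaddr;       /* only quad-word aligned code */
        ra_ecp += 4;                            /* addresses are ever stored   */
    }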

4.7 Instruction Pointer Management

The Secure Host CPU's instruction pointer (eip) is implemented simply as an additional 32-bit element in the memory array comprising the CPU's emulated or virtual registers. As part of a real or virtual register architecture, it is logical to include eip management within this section. We will provide details for the Secure Host CPU instruction word (or iword) in the next section; for now it is sufficient to state that a single-width 64-bit iword is used and justify the choice later in order to dispense with the details of instruction pointer management before leaving the register architecture section. In section 3.2.5 we stipulated that in addition to being constant- or single-width 64-bit structures, iwords were strictly aligned to integer-multiple 8-byte (or quad-word) boundaries at load time. At the end of the instruction fetch cycle the value of eip is advanced by 8 bytes to represent ‘incrementing’ a variable of type uint64_t *. At each fetch, quad-word alignment is verified to prevent attempts to construct gadgets (section 2.9.1) from non-aligned iword fragments. In the CPU Emulator this is accomplished with a simple bitwise mask and test. Should it occur, failure of this modulo-8 test will cause the Secure Host CPU to trap and halt. With an eye toward a possible future version of the Secure Host CPU in FPGA, we should point out that a hardware implementation might evolve to more strict Harvard memory segregation without byte-addressable program memory. Program

memory access limited to double- or quad-word width and/or a narrower true hardware instruction pointer to control only the high n − 4 program memory address lines would be amenable to hardware implementation in FPGA. With the latter arrangement, instruction pointer management would be simplified to single-increment sequential fetches (with appropriate modifications to the fetch logic to bridge the gap between iword and bus widths). In this case the strict hardware enforcement of iword boundaries in program memory would eliminate the need for modulo-8 address verification.
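In the emulated CPU, the alignment and program-region checks performed before each fetch amount to only a few statements. The following sketch shows one way they might look; the names and the trap interface are assumptions, not the Emulator's exact code.

    #include <stdint.h>

    /* Illustrative sketch only: modulo-8 alignment and program-region checks
     * applied before each iword fetch. */
    extern uint32_t ra_eip;                    /* instruction pointer            */
    extern uint32_t code_base, code_limit;     /* program memory region bounds   */
    extern void cpu_trap(const char *why);     /* preserve state, log, and HALT  */

    static uint64_t fetch_iword(const uint8_t *prog_mem)
    {
        if (ra_eip & 0x7)                              /* bitwise mask and test  */
            cpu_trap("non-aligned instruction fetch");
        if (ra_eip < code_base || ra_eip >= code_limit)
            cpu_trap("fetch outside program memory region");
        return *(const uint64_t *)&prog_mem[ra_eip];
        /* eip is advanced by 8 (one iword) at the end of each decode cycle. */
    }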

4.8 Instruction Set Architecture

Given the previous discussions of the Instruction Set Architecture and the availability of the Instruction Set Reference in Appendix C, remaining ISA details are best presented in the context of instruction word (iword) implementation and instruction decoding presented in the following sections. In particular, the Jump/Land implementation details in section 4.9.1 cover the Secure Host CPU instruction decoding as part of the flow control mechanism.

4.8.1 Instruction Word Overview

Considering the typical reader's familiarity with modern microprocessor CPUs, the most enlightening details of the Secure Host CPU internals are encapsulated in the implementation of the Secure Host CPU instruction word (iword). Accordingly, we will spend some time building up details in a building block fashion beginning with key concepts and terminology.

4.8.2 Instruction Word Architecture

Multiple considerations drove the choice of the 64-bit iword used in the Secure Host CPU prototype. In a pure Harvard CPU a case could be made for use of program memory tailored to the width of the CPU's iword provided a full-width program memory bus is available. For the current prototype effort we began with the minimum functional iword width, increased it to the next 2ⁿ increment for performance in instruction fetch operations, and redistributed ‘excess’ for any remaining gains available in performance, programming expediency, and security. Performance and expediency were considered in assigning locations so discrete iword fields could be accessed without (or with a minimum number of) mask and shift operations, and available excess bits were assigned for increased security. For the purpose of presenting the ISA as well as the iword implementation we generally avoid the use of the term opcode to reduce ambiguity and carry the following terminology into code development. While Intel uses the word opcode to denote “the complete object code produced for each form of the instruction” [71, pg 246], we focus on ‘classes’ of operations and subdivide them according to the form each class takes. For example, we view the ISA member mov as representing a class of operations for data movement to a target (or destination) from a source represented by an assembly instruction of the form mov target, source. Targets are registers or data memory locations and sources are registers (reg), data memory locations (mem), or immediate values (imm). On the surface this suggests 6 generic forms of the move instruction: mov reg, reg; mov reg, mem; mov reg, imm; mov mem, reg; mov mem, mem; and mov mem, imm. At the next level down, memory locations can be addressed directly via an address or indirectly via an address value contained in a register or another memory location. This leads to a large

number of opcodes representing the data movement class of instructions. Rather than list a unique opcode for every legal instruction variant we structured the iword to contain separate fields for the operation class (opclass), two operand types (op1type and op2type), and two operand values (op1value and op2value). Since iwords are constant-width, every field sized to its largest legal value exists in every iword. While this approach results in some storage inefficiency, constant-width iwords are an intentional design choice in order to prevent gadget formation (section 2.9). With this introduction we present a description of the Secure Host instruction word in Table 4.3 and a definition of the iword structure in Figure 4.6. The table is listed in logical order while the figure is reversed per the little-endian memory structure the Secure Host CPU shares with the x86.

Table 4.3: Secure Host CPU Instruction Word Description

From   DownTo   # Bits   Description
63     48       16       Instruction Operation (opclass)
47     46       2        Transfer width (target = source except for MOVS and MOVZ)
45     44       2        Operand1 Type (Register and Memory Flags)
43     43       1        Reloc Flag
42     40       3        Operand2 Type (Immediate, Register, and Memory Flags)
39     38       2        Unused (reserved)
37     32       6        Operand1 (Register ID Code)
31     0        32       Operand2 (Register ID Code or Immediate Value)

Prior to discussing the members of the implemented iword structure of Figure 4.6 we should point out an important feature applicable to all members. The Secure

Host CPU iword is defined as a C99 union comprising a nested struct of unsigned

integers and other structs. This arrangement allows a globally or locally declared union iword to be fetched or moved efficiently as a uint64_t and referenced via its union member name.shadow_iword. Likewise, the anonymous nested structs allow any other named element to be retrieved via single-level name.member_name references without procedural shifting and masking. In particular, opclass is fetched as an aligned uint16_t as the first level of instruction decoding. Finally, the two innermost structs simply provide logical-byte grouping for smaller bit fields as a no-penalty concession to readability.

Figure 4.6: Secure Host CPU iword C99 struct
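Figure 4.6 itself is not reproduced in this text; the following is only a sketch of what such a union could look like, consistent with the field widths of Table 4.3. The member names follow the terminology of this chapter, but the bit-field ordering and packing shown are assumptions (bit-field layout is compiler and endian dependent, and anonymous members require C11 or a compiler extension).

    #include <stdint.h>

    /* Illustrative sketch only: an iword union consistent with Table 4.3. */
    union iword {
        uint64_t shadow_iword;          /* whole iword, moved as a single unit   */
        struct {
            uint32_t op2value;          /* bits 31-0: RegID or immediate value   */
            struct {
                uint8_t op1value : 6;   /* bits 37-32: operand 1 RegID           */
                uint8_t unused   : 2;   /* bits 39-38: reserved                  */
            };
            struct {
                uint8_t op2type  : 3;   /* bits 42-40: I, R, M flags             */
                uint8_t reloc    : 1;   /* bit 43: relocation flag               */
                uint8_t op1type  : 2;   /* bits 45-44: R, M flags                */
                uint8_t width    : 2;   /* bits 47-46: transfer width            */
            };
            uint16_t opclass;           /* bits 63-48: operation class           */
        };
    };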

4.8.3 Opclass

As shown in Figure 4.6 opclass is a named member in a nested anonymous structure and accessible by its member name. The current instruction word structure reserves 16 bits for the instruction operation (opclass) field, but less than 50 instruction classes are implemented in the current Secure Host CPU. While a large number of the opclass bits could be re-purposed for expansion, the current choice

provides two important features. Opclass is the first level of instruction decoding in the Secure Host CPU and keeping opclass on a word boundary improves performance by avoiding mask and shift operations for each instruction fetch/decode cycle. The second feature relates to security. A sparse instruction set is inherently more resistant to random code injection than a dense instruction set if invalid instructions are trapped. In the case of the Secure Host CPU, less than 0.1% of the available opclass space (< 2⁶/2¹⁶) forms a valid instruction, greatly reducing the risk of successful code injection without an intimate knowledge of the ‘local’3 ISA.

4.8.4 Transfer Width

The instruction transfer width field guides several aspects of instruction decoding and execution. It controls whether memory read and write operations are 8-, 16-, or 32-bit accesses and triggers data-width promotion (zero or sign extension) when data values are written to a larger container. When a machine instruction does not require transfer-width steering the value is ignored and could be set randomly in the assembly to promote diversity.

4.8.5 Arguments, Operands, and Operand Types

In the context of Secure Host CPU assembly language, operands are defined by arguments that follow an assembly mnemonic. The mnemonic denotes a general operation class4 such as ‘move’ or ‘jump’ and operands provide required data to

3Customized ISA plus Instruction Set Randomization (ISR) plus current ISR keyset.

4This is not to be confused with an Object Class as in Object Oriented Programming.

complete the operation (e.g., a ‘move’ requires a target, source pair of operands; a ‘jump’ requires a single destination address operand). The Secure Host CPU instruction word (iword) is structured to provide an operation class code (opclass), transfer width, and descriptions of zero to two operand types (optype1 and optype2) and identifiers or values (op1val and op2val). Bit flags for opclass, width, and optypes comprise the Secure Host CPU opcode; opcodes and operand value fields are contained in the fixed 64-bit iword to provide the CPU the instructions and data necessary to execute the instruction. Operand types are none, register, immediate (i.e., a numerical value), or memory (i.e., address of a memory location) where the memory address is generally conveyed by square brackets ([ ]) enclosing a numerical value (denoting address) or register ID (denoting the contents of the register). These definitions are sufficient to understand the majority of the instruction set architecture; however, a few exception cases for interpretation of memory can be found in the instruction reference of Appendix C.

4.8.6 Operands 1 and 2 Differences

While not all defined operand types are allowed in operand 1, the definitions are normalized throughout the Secure Host CPU code for consistency using the definitions of Table 4.4. Immediate (imm) operands would be illogical in operand 1 as target, and memory references ([imm]) are disallowed in operand 1 by design5 in order to maintain a reasonable iword width. Allowing memory32 to memory32

5 This approach is consistent with x86's lack of direct memory32 to memory32 move. Convention is to use an intermediate general purpose register or a PUSH, POP instruction pair.

moves would require 64 bits for source and destination addresses alone driving the constant-width iword to an unacceptable size; therefore, target operands are limited to registers (reg) and indirect memory references via pointer registers ([reg]).

Table 4.4: Secure Host CPU Instruction Operand Flags

Type Flag     Value₂ 0b(IRM)   Value₁₀   Description
op_none       0b000            0         None or operand n/a
op_reg        0b010            2         Register
op_reg_mem    0b011            3         Memory flag, [reg]
op_imm        0b100            4         Immediate
op_imm_mem    0b101            5         Memory flag, [imm]
op_imm_lbl    0b110            6         Assembler only; replace with imm in pass 2 (TBR)
op_mem_lbl    0b111            7         Assembler only; replace with [imm] in pass 2

The operand type flag and value relationship can be readily appreciated by correlating the bits to Immediate, Register, and Memory flags as shown in the header of the binary value (Value₂) column:

• I signifies that the corresponding operand is an immediate value. This is valid for operand 2 only.

• R signifies a register code and may appear in operand 1 or 2.

• M bit or memory flag modifies the intent of the associated immediate value or 32-bit register and signifies that the immediate value or the contents of the register will be used as a 32-bit memory address. The M flag may appear in operand 1 or 2 subject to operand restrictions on the I flag.

Operand flag values 0 and 2 through 5 are used in Secure Host CPU assembly

and binary code; values 6 (TBR)6 and 7 are assembly only and denote symbolic labels (i.e., placeholders for memory addresses). As an implementation detail, the appearance of a symbolic label in an operand during the first assembly pass is denoted by setting the otherwise mutually exclusive I and R flags with or without the M flag, and the label is cataloged as a named reference. In assembly pass 2 the reference is resolved to a memory address offset for encoding, and the operand type is converted to a concrete value. Since immediate operands are only allowed in operand 2, the operand 1 type field is truncated to the two least significant bits purely for iword compactness as shown in Figure 4.6.

4.8.7 Instruction Transfer and Argument Sizes

Assembly code argument sizes (argsizes) may be undefined (sizeundef), size8, size16, or size32, denoting their container and memory widths. These argsizes are determined from explicit assembly source code qualifiers including byte, word, dword for immediate data value or memory object references; byteptr, wordptr, and dwordptr for data references; or implicit size for named registers (e.g., eax, ax, and ah or al). Without qualifiers, immediate data have sizeundef sizes (but no smaller than the bit width of their magnitude) until an argsize is inferred from the co-argument (e.g., for mov ax, 1; the immediate value 1 is sizeundef until matched with ax's size16). Assembly labels are always memory references and therefore 32-bit pointers to argsize data as determined by a reference qualifier or the instruction co-operand (e.g., mov [edi], byteptr[.L005];, or mov ah,

6Use of assembly operand type code = 6 is queued for review and possible elimination

[.L005];). Except in special cases such as movsx and movzx, assembly instruction transfer or data manipulation width intent7 is represented unambiguously in the assembly source by a target or source argsize. Details for specific instructions are contained in the instruction set reference of Appendix C.

Table 4.5: Secure Host CPU Instruction Width Flags

Size Flag    Value₂   Value₁₀   Description
sizeundef    0b00     0         Undefined
size8        0b01     1         8-bit
size16       0b10     2         16-bit
size32       0b11     3         32-bit

4.8.8 Relocation Flag

During implementation of the CPU emulator it was decided to virtualize addresses to allow a CPU “address 0” to exist at any physical location in emulator host memory. Loader details are provided in section 5.2.3, but an appropriate summary statement for this section is that Secure Host CPU binaries are compiled and assembled to an absolute address 0; memory operands are set as offsets from zero and flagged by setting the iword relocation flag to 1. Presence of this flag in an iword cues the loader to test operand 2 optypes and adjust memory addresses to relocate the code as it is loaded.
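A loader-side sketch of this relocation step is shown below, operating on the raw 64-bit iword laid out per Table 4.3. The choice to rebase only [imm] (op_imm_mem) operands is an assumption for illustration; the actual operand-type tests are a loader detail covered in section 5.2.3.

    #include <stdint.h>

    /* Illustrative sketch only: rebase a flagged memory operand at load time. */
    static uint64_t relocate_iword(uint64_t iw, uint32_t load_base)
    {
        unsigned reloc   = (unsigned)((iw >> 43) & 0x1);  /* bit 43: reloc flag   */
        unsigned op2type = (unsigned)((iw >> 40) & 0x7);  /* bits 42-40: I, R, M  */

        if (reloc && op2type == 0x5) {                    /* op_imm_mem: [imm]    */
            uint32_t addr = (uint32_t)(iw & 0xFFFFFFFFu); /* bits 31-0: operand 2 */
            iw = (iw & ~0xFFFFFFFFull) | (uint64_t)(addr + load_base);
        }
        return iw;
    }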

7Like most assembly languages including x86, Secure Host assembly is untyped and the programmer's “intent” is determined from source code context for binary encoding.

Figure 4.7: Secure Host CPU iword (C99 Format)

4.8.9 Instruction Word Binary Implementation

Figure 4.7 illustrates the Secure Host CPU iword as a big-endian 64-bit stream from the Secure Host CPU memory. It is presented here as a final reference to the iword implementation structure but will appear again in section 5.2.2.2 during discussion of the Python-based Secure Host CPU assembler.

4.9 Other Security Features of the ISA

4.9.1 Jump/Land Flow Control Instructions

The Secure Host CPU Jump/Land group was discussed briefly in section 3.2.5 and is expanded here to cover implementation details. ‘Jump/Land’ is a generalization that captures program flow control security or flow control protection; it is a defense against Return-Oriented and Jump-Oriented Programming (ROP and JOP, section 2.9) and is also an additional layer of protection against code injection (section 2.10). A more precise description of this security feature would observe that ‘program flow control’ includes Jumps, Calls, Returns, and software interrupts (INTs), as ‘program redirection’, and HALT as a program flow terminus.

In the current Secure Host CPU implementation, Land (landj, landc, and landr) instructions are added to the instruction set as new instructions and the internal CPU operations for unconditional and conditional Jumps, Calls, and Returns are modified to require complementary landx targets. Appropriate landx target instructions are inserted into user program assembly source code at compile time without the need for modification of user C/C++ source code. The first level of instruction decode is determination of the class of operation before internal branching to appropriate handlers. Redirection instruction handlers perform secondary fetches from the target or return address to probe for landing pads of the correct type, then instruction decoding and processing proceeds as follows:

• For jumps, if a valid landing pad is found the jump address is loaded into eip and control is returned to the instruction decoder.

• For calls, if a valid landing pad is found, eip is incremented to the next instruction and pushed to the control stack as a return address, the call

address is loaded into eip, and control is returned to the instruction decoder.

• For call returns, the presumed return address is popped from the control stack and probed; if a valid landing pad is found the return address is stored

in eip and control is returned to the instruction decoder.

• In any case, if a valid landing pad is not found the redirect handler invokes trap handling. Program and CPU states are preserved at the point of error including the address of the offending redirect and attempted target, a diagnostic/error message is displayed, and control is returned to the CPU monitor console with the CPU in a HALTed state.

As a detail of instruction decoding, the Secure Host CPU Emulator's instruction decoder is called with the address of the first (or next) user machine instruction to be executed in the instruction pointer. The instruction decoder is a ‘forever’8 loop that fetches the current address, decodes the operation class (section 4.8.3), and either executes the instruction or calls an instruction handler for second-level decoding and/or complex processing. For example, consider the instructions nop and mov tgt, src. While nop is literally ‘do nothing’, mov is an instruction class comprising 24 possible operations (2 targets × 4 sources × 3 transfer widths).

The instruction decoder dispenses with a nop immediately but in the case of a mov a separate handler is called for additional decoding, operand verification, and execution. The last step in the instruction decoder loop is to increment eip to the next instruction word, then return to the top of the loop. Given the instruction decode detail above, it can be demonstrated that first-level instruction decoding will not fetch a redirect-target land due to the second-level fetch by the respective redirection handlers. Upon return to the first-level loop from a redirection handler, eip will already be pointing to the associated land instruction and the final step in the loop sequence will advance eip to the next instruction beyond. This does not mean, however, that the instruction decoder will never encounter ‘bare’ lands (i.e., lands not associated with a precursor redirection instruction). This artifact points to an implementation choice that is subject to later review.
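To make the secondary ‘probe’ concrete, the following sketch shows how a call handler might verify its landc landing pad before committing the redirection. The opclass value, helper functions, and trap interface are assumptions for illustration, not the Emulator's verbatim source.

    #include <stdint.h>

    /* Illustrative sketch only: call redirection with a landing-pad probe. */
    #define OPCLASS_LANDC 0x3A17u              /* hypothetical landc opclass      */

    extern uint32_t ra_eip;                    /* instruction pointer             */
    extern uint64_t read_iword(uint32_t addr); /* fetch one iword from prog. mem. */
    extern void push_return_address(uint32_t addr); /* control stack push         */
    extern void cpu_trap(const char *why);

    static void handle_call(uint32_t call_target)
    {
        /* Secondary fetch: the call target's first iword must be a landc pad. */
        uint16_t probe = (uint16_t)(read_iword(call_target) >> 48); /* bits 63-48 */
        if (probe != OPCLASS_LANDC)
            cpu_trap("call target is not a landc landing pad");

        push_return_address(ra_eip + 8);       /* return address: next iword      */
        ra_eip = call_target;                  /* decoder's final eip += 8 then   */
    }                                          /* skips the landc pad itself      */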

As a land application example, Listing 4.1 assumes a ‘stock’ C/C++ application (main) that processes data obtained from synchronous or dual-port memory via the function getdata().

8Run mode only. A single-step mode is available under debug. Run mode terminates upon reaching an error trap or programmed terminate request (HALT instruction or terminate() syscall).

Listing 4.1: C/C++ Example Skeleton

int getdata();        // input function prototype

int main() {
    ...               // setup tasks
    dta = getdata();  // wait for input data
    ... }             // test/process input and exit

Listing 4.2 shows relevant portions of Secure Host

CPU assembly source generated from compiler output. main's call to getdata generates a compiler-inserted landc at line 10 as the first instruction of subroutine getdata as well as a landr in line 4 as a return landing pad in main. landc and landj are different opcodes so upon encountering getdata's “not-ready” loop from line 16 back to the beginning of getdata the compiler expanded the function to add a new internal label .getdata01: at line 11 with a landj landing pad at line 12 to match the jump from line 16. From this we see that the CPU instruction decoder will encounter a bare landj at line 12 after main's call to getdata. Had the compiler chosen to add landj (with its associated label) before getdata:, the instruction decoder would encounter a bare landc during every loop in getdata. One purpose of this somewhat tedious review is in contemplation of the value of applying the requisite ingenuity9 to compiler algorithms to suppress bare lands altogether. For example, the case of Listing 4.2 could be resolved by adding a jmp .getdata01 immediately after the function entry landc; however, compiler

design is a one-time cost but Secure Host CPU clock cycles expended in bare land suppression are performance killers forever. Since this effort is for a proof of concept prototype we chose to allow bare lands to exist, treat them as nops during instruction decode, and defer further analysis of the security value of suppression to a follow-on effort.

9We say “ingenuity” because even simple bare lands are not trivial. Note that in Listing 4.2 the jz at line 16 required the insertion of a new label at line 11 to emplace a landj target without displacing the mandatory landc call target at line 10.

Listing 4.2: LAND Application Example

 1 main:
 2     ...                  # processing
 3     call getdata         # blocking function call
 4     landr                # ret landing pad
 5     ...                  # process the result in eax
 6     ...                  # set up application exit
 7     int80                # exit to OS
 8
 9 getdata:                 # polling function
10     landc                # landing pad for function call
11 .getdata01:
12     landj                # landing pad for loop
13     ...                  # processing
14     mov eax, result      # get status (result)
15     cmp eax, 0           # test status
16     jz .getdata01        # 0 = not ready; loop
17     jb error             # negative = error; handle it
18     ret                  # else return with result in eax
19
20 error:
21     landj                # landing pad for error jump
22     ...                  # process the error

4.9.2 Other Flow Control Instructions

In the introductory paragraph of section 4.9.1 we briefly mentioned software interrupts (INTs) as redirection instructions and HALTs as ‘flow control’ instructions.

137 We will close this section by dispensing with these last two candidates.

In reverse order, HALTs are used liberally for debug, but in production code there is scant practical use for them due to the external intervention required to override a halted state, and an attacker intent on a denial-of-service (only) attack could trigger any software trap to produce the same result as control-flow hijacking to a

HALT. Accordingly, the console monitor/debugger or an analogous OS halt state is the HALT landing pad; there is no landh complement. Software interrupts are another matter but are excluded from this version of the Secure Host CPU for cause. INTs redirect program flow to locations such as OS libraries residing outside of user program space and are therefore beyond the scope of the current effort. Research and development beyond the Secure Host CPU proof of concept demonstration would likely include porting an operating system or microkernel and libraries to replace the DECREE support in the Linux host

(section 6.2); LAND expansion to support return from software interrupts would be an element of that effort.

4.9.3 Instruction Set Density

A feature used in the Secure Host CPU to reduce the risk of successful code injection attacks through random guessing is the use of a sparse instruction set. We have deferred this discussion until now so that the reader is familiar with the drivers and implications of the 16-bit iword.opclass field.

The Secure Host instruction word provides a 16-bit field for the instruction's operation class identifiers (opclass, section 4.8.3). Less than 50 opclasses are used in the 2¹⁶ possible fields and valid opclass IDs are spread quasi-randomly to avoid clustering, giving a < 0.08% chance that a purely random guess will match a valid

opclass. Availability of a large enough code sample size to conduct statistical attacks would be denied to the attacker through physical security, and instruction set randomization (ISR, section 4.9.4) and an effective program of ISR key replacement discipline will provide defense against remote introspection and increase perishability of data obtained from remote introspection should it occur. On the surface this degree of opcode dispersion may seem excessive given the other security features employed; however, this is a no-penalty implementation choice given the larger architectural considerations used in choosing iword width (section 3.2.5) so little justification is required. This inclusion in the CPU implementation chapter is primarily intended to summarize the implementation and highlight the availability of additional iword expansion bits if or when they are needed.

4.9.4 Instruction Set Randomization

The subject of Instruction Set Randomization (ISR) or “memory randomization” [76] does not fall exclusively into either the Memory Architecture or the Instruction Set Architecture discussions. We are biased to the term ISR rather than memory randomization due to the earlier work of Kc [83], have retained the term accordingly, and associate ISR more naturally with ISA. Instruction Set Randomization (ISR) is a deterrent to long-term statistical attempts to deduce ISA details provided the host program is reloaded periodically with new keys. ISR was implemented and demonstrated in the early FPGA prototype, and our preliminary design for ISR in the Secure Host CPU was addressed in detail in section 3.2.6. At the present time ISR has not been implemented in the Secure Host CPU Emulator due to unsettled questions on key length and most

139 recently due to time restrictions. This is subject to revision prior to the final proof of concept demonstration.

4.9.4.1 ISR Keys

As a closing comment on Instruction Set Randomization, issues such as key rotation or key exchange for Secure Host CPU ISR are not operational issues (i.e., no key distribution, key exchange, or crypto custodial accounts are required). A completely new key would be internally generated and used locally for each application program load. ISR keys would never be transmitted outside of the Secure Host, subject to consideration for core dump files used in crash or trap analysis.
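To make the per-load key model concrete, the sketch below assumes an XOR-style transform in the spirit of Kc et al. [83]; the actual Secure Host transform and key length are the open design questions noted above, so the routine and the 8-byte key length are illustrative only.

import secrets

IWORD_BYTES = 8   # assumed key length here; the real key length is an open question

def new_isr_key(length=IWORD_BYTES):
    """Generate a fresh key inside the Secure Host at each application load."""
    return secrets.token_bytes(length)

def randomize_text(text_image: bytes, key: bytes) -> bytes:
    """XOR the .text image with the key (illustrative stand-in for the real transform)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(text_image))

# De-randomization at instruction fetch is the same XOR applied inside the CPU,
# so the key never needs to leave the Secure Host.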

Chapter 5

CPU Testbed and Evaluation

The previous chapter provided implementation details of the Secure Host CPU and its software emulator. This chapter expands the picture to a system view to include the support elements needed to demonstrate the CPU, an operating host computer, and the use of custom machine code to provide a network server function or application.

5.1 Linux Host

The Linux-based Secure Host CPU Emulator was written in C99 using open source and community edition software (predominantly Microsoft Visual Studio Code) and the GNU Compiler Collection (GCC) on hosts running Linux Mint. Chapters 3 and 4 provide design and implementation details for the Secure Host CPU and the CPU Emulator and will be referred to frequently in the discussion of the testbed.

Interfaces between the CPU Emulator and the Secure Host testbed are illustrated in Figure 5.1. This high level block diagram shows the system in its test configuration, much as it would appear in operation. Since the default configuration is as a network server, a remote client computer would be expected to connect via the ethernet network on the left side of the diagram. If desired, a self-contained configuration can be operated from a separate tty process running on the Linux Host.

Figure 5.1: Secure Host Testbed Block Diagram

The Secure Host CPU block in Figure 5.1 emulates the CPU hardware-based security features from Chapter 3 plus the remaining central processing unit (CPU) functions that are required to form an operational computing machine. These functions include program and data memory interfaces, storage registers and stacks, arithmetic and logic functions, input/output interfaces, and program control functions. The program control functions comprise an instruction pointer, branching logic, machine instruction decoding, and instruction execution logic.

Given the description above, the remaining element required to complete a functional processor or server is the calculus¹ or machine code defining the processes and/or calculations intended for the machine. If a series of coded machine instructions defines, for example, the calculus of a network server, the Secure Host CPU plus its machine code in the Secure Host CPU's program memory form a network server such as our testbed was designed to demonstrate and test.

¹ Merriam-Webster dictionary defines calculus as: “a method of computation or calculation in a special notation (as of logic or symbolic logic)”

We have intentionally simplified the testbed block diagram in order to separate the unit under test (the Secure Host CPU and its machine code) from the testbed support system (the Linux Host). By logically extending the stdin and stdout channels to a remotely connected client we can demonstrate the efficacy of the Secure Host CPU hardware features in securing the server against remote exploitation of vulnerabilities in the host's machine code.

Linux Host support to the testbed begins with loading of the program code and data structures that emulate the Secure Host CPU hardware elements and define the CPU functions. After the CPU emulator is initialized, custom application machine code is loaded into the Secure Host CPU's program memory along with supporting data to form a Secure Host network server process running on the testbed. Like other processes, operating system (OS) services are required to provide interfaces to the platform hardware. In the case of Secure Host applications, OS interfaces are limited by design to seven OS system calls (or syscalls). These seven syscalls, described further in section 6.2.1, allow the Secure Host to manage and conduct i/o operations, manage memory resources, and signal the OS when the process terminates. Other operational and overhead support functions provided by the Linux Host are:

• Initial setup of the network port,

• Processing network connection requests,

• Moving data between the application stdin/stdout interfaces and the network,

• Processing output from the application stderr interface, and

• Responding to application code requests to increase or decrease allocated Secure Host CPU memory.

The lower portion of the Linux Host element of the testbed lists other ancillary functions needed, including mass storage for load and log files, operator I/O, software for the control console, and ‘remote’ debug software for application code debugging.

Figure 5.2: Secure Host Console -h (Help) Output

The screen captures in Figures 5.2 through 5.5 demonstrate four significant phases of a testbed test sequence. Figure 5.2 shows a Linux command line session with a command line request for display of Help (-h) information. Depicted on this screen are the Secure Host Emulator's four optional command line parameters:

• -b filepath specifies a relative or absolute directory and filename for the binary load file containing the Secure Host binary user application file and data set. The default filepath is ./load.bin.

• -r specifies that the CPU should enter the Run mode immediately following successful load file initialization. The default Run/Step mode is single-step, but the control console can command Run mode from the monitor/debugger screen.

• -p number specifies the listen port number to be used by the network server function. Valid port numbers are 1 through 65535 (the range of nonzero unsigned 16-bit integers), but good practice is to avoid the well-known port numbers below 1024. 0 (zero) as the port number disables network support and bypasses the Testbed's “Waiting for connection request” step. The default value is port 2222, which is also the default listen port for the DECREE operating system (section 6.2).

• -l filepath specifies a relative or absolute directory and name to be used for the session logfile. The default filepath is ./log.var.

Figure 5.3 provides a screen capture from the initial phase of a network server session. Initialization information such as register stack locations is shown and the initial status of emulated CPU memory is given for debug purposes. In this test an echo application has been loaded and the Testbed is currently waiting for a connection request from a remote user.

Figure 5.3: Secure Host Console: Waiting for Connection Request

The screen shot of Figure 5.4 was taken after a remote user requested and received a connection. The CPU is in Step mode at the control prompt (sechost ->) and the first instruction in the application program is displayed in disassembled form. From this point the program can be single-stepped or run to completion (or until the first programmed halt or program error is encountered).

The last phase of a testbed application session is captured in Figure 5.5. The program has run to normal completion, signified by a “Syscall 1” message and a return status from the application. The terminate() syscall (section 6.2.1) is equivalent to Linux exit() and does not return to the user program. CPU control reverts to the console with a current register state display and a disassembly of the current Secure Host CPU machine instruction pointed to by the instruction pointer. In this case the syscall was the last instruction in the Secure Host CPU program memory, and the operator entered the quit command (q) to end the session.

Figure 5.4: Secure Host Console: Ready, In Halt State

Figure 5.5: Secure Host Console: User Process Completion & Shutdown

5.2 Secure Host Tool Chain

Creation of an application requires a toolchain of at least a compiler and/or assembler. Use of existing Linux tools was sufficient for the Secure Host CPU Emulator, monitor, and debugger/disassembler. Since the Secure Host CPU instruction set is customized, provisions had to be made to take user application source code from C/C++ to Secure Host CPU custom machine code.

5.2.1 Secure Host CPU Compiler

Considerable time was invested in modification of LLVM for the Secure Host CPU project. LLVM is “a collection of modular and reusable compiler and toolchain technologies” [127] leveraging the Clang C language front-end. Results of this effort are covered in section 7.2.1, but in the interest of time LLVM was adopted only for use as a C/C++ compiler to LLVM intermediate representation (IR). LLVM IR is an internal symbolic target-independent representation of the input source code that is passed through multiple analysis and optimization steps before being matched to target machine instructions for further optimization and conversion to target object or binary files. While LLVM IR is not assembly language, it can be generated in human-readable form and is close enough to assembly language or assembly language sequences to be useful for semi-automatic conversion to Secure Host CPU assembly source code. Creation of a Secure Host CPU back-end for LLVM for more seamless generation of optimized code would be highly desirable for future and more comprehensive test programs.

5.2.1.1 IR to Assembly Register Allocation

One specific item of note in the IR to assembly conversion is register allocation. Considerable effort is normally expended optimizing register allocations for production toolchain software due to performance advantages for register- over memory-based data access. Where virtual hard registers are emulated in memory there is no performance penalty in ignoring register allocation and spilling variables to memory; in fact, use of memory-mapped named registers for variable storage would negatively impact performance when emulated hard registers must be spilled during function calls.

For this reason a significant but justified shortcut was taken in the IR to assembly conversion by ignoring hard registers for most local variable storage. Named registers are only used in the Secure Host CPU emulator for convenience, parameter passing for DECREE system calls, and certain CPU native binary operations.

5.2.2 Assembler

To support the Secure Host CPU development and test effort a custom Python-based assembler was generated to convert Secure Host CPU assembly code to binary machine code. Since the assembler development was only an ancillary part of the effort we will not cover it in depth except for a few items related to the machine representation of code and of interest to a potential user for programming and debugging.

The assembler (shasm2.py) was constructed to accept the Secure Host CPU's modified instruction set given in Appendix C in Intel assembly format, and very closely follows the syntax of the GNU assembler (gas or as). shasm2 has no command line options other than the source file name. It generates console error, warning, and status messages during assembly and, if no source errors are encountered, produces a single binary load file using the input file base name concatenated with a .bin extension. The load file format is custom, beginning with a text “shmagic1.0” preamble zero-padded to 16 bytes plus text and data section offsets and lengths, for a total of 32 bytes. The .text and .data sections are written in little-endian format to form memory images as they would appear starting at address 0.

Figure 5.6: Secure Host CPU Assembler Output

Figure 5.6 shows a screen capture of a typical assembly process. Program title and version are shown, followed by an incrementing counter as source lines are loaded and parsed. If load and parse are successful the assembler prints the output file name, the offset and length of the .text and .data sections, and statistics including the number of source lines read, instruction words generated, and labels parsed. Except for the data dictionaries detailed below, design of the assembler is straightforward parsing of the pattern:

label: mnemonic [argument1 [, argument2]] # comment

The line parser strips comments, stores labels for pass-2 reconciliation, strips and matches the mnemonic against an opclass dictionary, and, if a match is found, passes the remainder of the line to an argument parser which returns type, size, and value attributes. If the number and type(s) of the arguments match the opclass dictionary entry, the line is stored and parsing moves to the next source line.
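The sketch below illustrates that parsing flow in Python under simplifying assumptions; the helper structure and names are hypothetical and do not reproduce the internals of shasm2.py.

def parse_source_line(line, labels, lineno):
    """Strip the comment, record any label for pass-2 reconciliation, and split
    the remainder into a mnemonic and up to two argument strings."""
    line = line.split('#', 1)[0].strip()          # strip comment
    if not line:
        return None
    if ':' in line:
        label, line = line.split(':', 1)          # store label for pass 2
        labels[label.strip()] = lineno
        line = line.strip()
        if not line:
            return None
    parts = line.split(None, 1)
    mnemonic = parts[0].lower()
    args = [a.strip() for a in parts[1].split(',')] if len(parts) > 1 else []
    return mnemonic, args

labels = {}
print(parse_source_line("loop:  mov eax, [esi]   # fetch next word", labels, 12))
# -> ('mov', ['eax', '[esi]'])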

5.2.2.1 Assembler Dictionaries

Data structures in the assembler which affect machine code must obviously be coordinated with the Secure Host CPU Emulator. Two data sets previously covered were the operand and transfer width flags (Tables 4.4 and 4.5). Two major structures in this category not yet explicitly mentioned are the register and opclass dictionaries. The register dictionary appears in the general form:

reg_d = {'regName': (size, regID), ...}

This allows a presumed named-register argument to be validated and encoded by determining that the argument matches a regName member in reg_d, then populating the size and register ID attributes from the (size, regID) Python tuple associated with the regName key. Likewise, the opclass dictionary appears in the general form:

opclass_d = {'mnemonic': (opclass, [operand1], [operand2]), ...}

Note the Python syntax where operand1 and operand2 appear inside square brackets ([ ]) denoting lists, and these appear in parentheses with opclass as the second and third elements of a tuple. The dictionary general form above is applied by the line parser to verify that mnemonic is in (or ‘a member of’) opclass_d, that argument1 is in mnemonic's [operand1] list, and that argument2 is in mnemonic's [operand2] list. This scheme allows symbolic representations of opclass and arguments to be listed in assembly code text (e.g., mov eax, [esi]) and parsed and validated efficiently from the opclass dictionary.

Table- and dictionary-defined data such as register and opclass information can be quickly modified and accurately exported to the C99 Secure Host CPU definition as modifications or upgrades are made to the CPU architecture. We view this as an important and necessary feature for an agile and reconfigurable emulator and testbed.
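A minimal sketch of the dictionary-driven validation described above follows; the register sizes, opclass codes, and operand-type tokens are placeholders rather than the actual reg_d and opclass_d contents.

reg_d = {'eax': (32, 0), 'esi': (32, 6)}                      # placeholder entries
opclass_d = {'mov': (0x3C85, ['reg', 'mem'], ['reg', 'mem', 'imm'])}

def classify(arg):
    """Return a (type, size, value) triple for one argument (simplified)."""
    if arg in reg_d:
        size, reg_id = reg_d[arg]
        return 'reg', size, reg_id
    if arg.startswith('[') and arg.endswith(']'):
        return 'mem', 32, arg[1:-1]
    return 'imm', 32, int(arg, 0)

def validate(mnemonic, args):
    """Check the mnemonic and each argument type against opclass_d."""
    opclass, ops1, ops2 = opclass_d[mnemonic]
    typed = [classify(a) for a in args]
    for attrs, allowed in zip(typed, (ops1, ops2)):
        if attrs[0] not in allowed:
            raise SyntaxError(f"{mnemonic}: operand type {attrs[0]} not permitted")
    return opclass, typed

print(validate('mov', ['eax', '[esi]']))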

5.2.2.2 Assembler Field Codes

Referring back to the general form of an assembly instruction as:

label: mnemonic [argument1[, argument2]] # comment

and the brief tutorial on data dictionaries above, we can close the loop in tracking symbolic data in an assembly instruction to an encoded instruction word (iword) as depicted in Figure 5.7.

Figure 5.7: Secure Host CPU iword (Assembler Format)

Recall from the previous assembler parsing section that instruction arguments or operands have type, size, and value attributes. The op[i][j] fields represent argument1 (i = 0) and argument2 (i = 1). During binary encoding the argument size attributes (j = 1) are reconciled and converted to an instruction transfer size, leaving the operand type (j = 0) and operand value (j = 2) fields to encode. Thus the 8-byte iword output from the assembler is the binary representation of a sparsely-coded instruction opclass, an instruction transfer size, and descriptive fields for operands 1 and 2, matching the Secure Host CPU iword presented in section 4.8.9. Two bits in the iword (bits 6 and 7 of byte 4) are unused, and the remaining bit (bit 3 of byte 5) is used by the relocating loader covered in the following section.

Final notes for the assembler are that a Python bytearray(8) is used to create a binary iword that matches the C union iword representation of Figure 4.7, and the bytearray type and .to_bytes() function are used to work around Python's dynamic data typing and fix the iword length at exactly 64 bits.
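The fixed-width packing can be pictured with the short sketch below; only the opclass and transfer size are placed, the operand descriptor fields are omitted, and the byte offsets are placeholders rather than the Figure 5.7 layout.

def encode_iword(opclass, xfer_size):
    """Pack the skeleton of one 64-bit iword; offsets here are illustrative."""
    iw = bytearray(8)                              # exactly 8 bytes / 64 bits
    iw[0:2] = opclass.to_bytes(2, 'little')        # sparse 16-bit opclass
    iw[2:4] = xfer_size.to_bytes(2, 'little')      # transfer size code
    return bytes(iw)

assert len(encode_iword(0x3C85, 4)) == 8           # fixed length regardless of content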

5.2.3 Relocating Loader

Since Linux uses ASLR (section 2.12), the load address for the Secure Host CPU emulator process is not predictable and virtual address 0 for the emulated CPU cannot be reliably known in advance. On-the-fly address translation in the emulator was rejected for performance reasons, and the time required for creation of an ELF-like linker/loader solution would have delayed the proof of concept demonstration. Instead, modifications were made to the assembler to flag iwords containing address data using an otherwise unused iword bit as a Relocation Flag. The Secure Host CPU emulator's loader function was updated to scan the load file's .text section for Relocation Flags and add the physical address of the emulated CPU's virtual address 0 to the iword's address offset data. A desirable future enhancement to the tool chain would be an upgrade to add object file and symbol table capabilities that are compatible with existing symbolic debuggers.
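A sketch of the relocation scan follows. The flag position (bit 3 of byte 5) is taken from section 5.2.2.2; the location and width of the address offset field within the iword are assumptions made for illustration.

RELOC_BYTE, RELOC_BIT = 5, 3        # relocation flag: bit 3 of byte 5
IWORD_BYTES = 8

def relocate_text(text_image: bytearray, base_addr: int) -> None:
    """Scan .text for flagged iwords and rebase their address offsets in place.
    The offset is assumed here to occupy the last four bytes of the iword."""
    for off in range(0, len(text_image), IWORD_BYTES):
        iw = text_image[off:off + IWORD_BYTES]
        if iw[RELOC_BYTE] & (1 << RELOC_BIT):
            target = int.from_bytes(iw[4:8], 'little') + base_addr
            text_image[off + 4:off + 8] = (target & 0xFFFFFFFF).to_bytes(4, 'little')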

5.2.4 Console Monitor/Debugger

The final item in the Secure Host CPU testbed toolchain is the system monitor or control console and integrated debugger. Basic functions of the console for loading and running a Secure Host application were covered in section 5.1. This section focuses on the integrated debugger capabilities provided to support Secure Host application development and debug.

Figure 5.8: Secure Host Console: Monitor/Debugger Menu

Figure 5.8 shows the commands and functions available while the Secure Host CPU is in a halted state (indicated by the console prompt sechost ->). The Help screen shown in this figure is available from the console prompt via the h command. Logical command groupings are:

• Application program control including run mode, single step, and quit;

• Register commands including (dr) display all registers and (rr) read or (wr) write a single register;

• Memory commands including binary (rm) read or (wm) write memory or list (disassemble) program memory; and

• Miscellaneous functions including reinitialize/reload the emulator and application load file, set a program break point, and show program control and user data stacks.

Figure 5.9: Secure Host Console: Program Step Mode

Figure 5.10: Secure Host Console: Memory Display

The screen capture of Figure 5.9 shows two features of the debugger, beginning with successive single-step commands in a user program. The full register display shown here is presented at any programmed halt, error, or trap event as well as via the dr debug command. Concurrent general purpose registers are shown in their overlayed form in the register display, but individual concurrent registers can be read or written in their intrinsic length using the rr or wr commands.

Figure 5.10 depicts the debugger binary read memory command display, which can be used to examine program or data memory by entering a beginning address and length. If no length argument is given, 80 bytes are displayed; if no address or length is given, a display of the same number of bytes previously displayed is shown, starting at the first byte not displayed by the previous rm command.

Figure 5.11: Secure Host Console: Custom Code Disassembly

A basic symbolic disassembler is integrated into the Secure Host console as shown in Figure 5.11. We use the “basic” qualifier because symbol tables are not provided for address labels or variable and function names. Only instruction mnemonics and register names are displayed, with all other data displayed in hexadecimal format.

Figure 5.12: Secure Host Console: Stacks Display

Figure 5.12 shows the debugger's parallel displays of the program control and user data stacks. The stacks are shown as stack-width dwords from low to high memory with the current stack pointer centered and current stack data above and below. Dashes indicate low-address stack bounds and the Secure Host CPU implements post-increment pushes, so Figure 5.12 indicates that one return address has been pushed and popped and three data items pushed to the data stack are still live.

5.3 OS Support for the Secure Host CPU

The remaining support element needed for the Secure Host CPU is an operating system, at least for initial testing. A future version of the Secure Host would include at least a simple RTOS compiled for its custom CPU. For the purpose of the proof of concept demonstration we chose an open-source environment published by DARPA called DECREE. More detailed treatment of DECREE is given in section 6.2, but we should preface that with the statement that DECREE was designed specifically for computer security research and experimentation and isolates the device via clean interfaces to a very small number of operating system calls.

5.4 Performance Tuning of the Emulator

Initial work on the emulator was focused on functionality without obsessing over performance tuning, except in specific areas of low to moderate cost and high benefit. The Secure Host CPU targets a 32-bit native architecture but necessarily has an oversize instruction word (iword). 64-bit iword transfers are minimized by using a global structure (a C union) that is loaded once, passed by reference, and incrementally read only in relevant elements as needed during instruction decode and execution. The choice of union over struct allows the entire iword structure to be fetched via its uint64_t shadow iword union member and overlayed on the iword structure (Figure 4.6). Once an instruction is initially fetched, elements are accessed as 8- to 32-bit stdint types in cascade as the instruction is incrementally decoded, and the iword union is not overwritten until the next instruction decode begins.

Since we are decoding up to 50 operation classes, the instruction decoder was constructed to allow half-splitting of the first-level decode to decrease search time. Another option, deferred for a follow-up effort, is analysis of operation class occurrence to arrange operation class codes from most-used to least-used instructions. Detecting most-used instructions early in the search sequence would improve performance in an emulated CPU; however, the parallelism of a well designed FPGA implementation may render the effort moot.
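The half-splitting idea can be pictured with the Python sketch below; the opclass values and handler names are hypothetical, and the production decoder is of course written in C99 inside the emulator.

import bisect

HANDLERS = {0x0A31: 'op_mov', 0x3C85: 'op_add', 0x9B12: 'op_jmp', 0xD7F0: 'op_landj'}
SORTED_OPCLASSES = sorted(HANDLERS)

def decode(opclass):
    """Half-split (binary) search of the sorted opclass list; an opclass that is
    not found raises the invalid-instruction trap described in section 5.6.1."""
    i = bisect.bisect_left(SORTED_OPCLASSES, opclass)
    if i == len(SORTED_OPCLASSES) or SORTED_OPCLASSES[i] != opclass:
        raise RuntimeError(f"invalid instruction trap: opclass {opclass:#06x}")
    return HANDLERS[SORTED_OPCLASSES[i]]

print(decode(0x3C85))    # -> 'op_add'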

5.5 Proof of Concept Demonstration

The proof of concept demonstration for the Secure Host CPU uses the testbed setup of Figure 5.1 and the C/C++ source of a sample Challenge Binary (CB) provided for the DARPA Cyber Grand Challenge (CGC). More information on this program is provided in section 6.1, but the CB is an application program performing the functions of a web server.

Prior to the CGC competition, challengers used the sample CB suite to develop and test automated tools (‘pollers’) to conduct iterated attacks against the network host services with the goal of finding one or more software vulnerabilities and exploiting them. Source code for the CBs included disclosure of known vulnerabilities representing common software weaknesses and proof of vulnerabilities (POVs) giving methods to demonstrate them. Since the proof of concept demonstration was not a competition we did not attempt to move past sample CBs, nor did we use pollers, but we have described them here as a bookmark for future work. CB source code was compiled to LLVM intermediate representation (IR) using clang/llvm, the IR was converted to Secure Host CPU assembly before being passed through the Secure Host CPU custom assembler for creation of a Secure Host CPU-based network server, and testing moved directly to demonstration of nominal server functions and performance against the sample CB's known vulnerability.

5.6 Demonstration Results

Sections 5.1 and 5.2 covered functions and capabilities of the emulator and testbed, including screen captures of nominal events. The text and screen capture exhibits of those sections are a portion of our demonstration results but will not be repeated here. Expected (positive) results were obtained in proof of concept demonstrations for hardware-based defense against software vulnerabilities previously described. We have used several scenarios relating to various vulnerability patterns and attack techniques to demonstrate hardware-based security features designed into the Secure Host CPU. In each scenario the software error under test was trapped by the emulated hardware security feature of the Secure Host CPU, as shown in the following sections.

5.6.1 Invalid Instructions

Invalid instructions may arise from imperfectly-constructed code injection or attempted flow control hijacking. One of the design safeguards for the Secure Host CPU is a sparse instruction set and strict instruction decoding allowing for no undocumented or unintended instructions. Invalid instructions result in a hardware exception (trap) as shown in the screen capture of Figure 5.13. Since an invalid instruction cannot be disassembled, the Secure Host CPU testbed's integrated debugger parses the invalid instruction word to aid in investigation.

Figure 5.13: Secure Host Console: Invalid Instruction Trap

Figure 5.14: Secure Host Console: Invalid eip Value

5.6.2 ROP and JOP Gadget Reduction

Many security features overlap multiple vulnerabilities or exploitation techniques, but gadgets are used solely in Return- and Jump-Oriented Programming (ROP and JOP) exploits so they are linked here in order to anchor the discussion.

Since the instruction pointer (eip) and program control stack are under hardware control they are unlikely to be compromised; however, Figure 5.14 illustrates one of the safeguards against attempted gadget construction² built into the Secure Host CPU design. The CPU maintains strict quad-word alignment of instruction memory and monitors eip for conformance at instruction fetches. This is a concession to the CPU emulator's byte-addressable memory and a wedge against a possible modified-Harvard implementation in FPGA in the future. As discussed in section 4.7, an alternate eip hardware implementation limiting eip to quad-word alignment could eliminate the risk of gadget construction from non-aligned program memory contents.

² Non-aligned program memory access attempts may be an indicator of attempted gadget construction, similar to the “entrypoint ecb_crypt” example from section 2.9.1.
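A minimal sketch of the fetch-time conformance check follows, assuming the 64-bit iword width so that quad-word alignment corresponds to eip values that are multiples of 8; the trap mechanism shown is a stand-in for the emulator's C99 exception path.

IWORD_BYTES = 8    # 64-bit iword, so aligned fetches fall on multiples of 8

def check_fetch_alignment(eip: int) -> None:
    """Raise the equivalent of the invalid-eip trap of Figure 5.14 on a
    non-aligned instruction fetch."""
    if eip % IWORD_BYTES != 0:
        raise RuntimeError(f"invalid eip trap: {eip:#010x} is not quad-word aligned")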

5.6.3 Control Flow Protection

One of the custom extensions of the Secure Host CPU instruction set architecture was the addition of a LAND group comprising landing pads for program flow redirection via conditional and unconditional jumps (landj), subroutine calls (landc), and returns from subroutines (landr). Any valid (i.e., originally programmed) redirection instruction will be to a landing pad of the proper complementary type, and failure to detect a valid landing pad generates a hardware exception or ‘trap’. Upon detection of an error the CPU halts the application program in process, preserves state information, and returns control to the console for display of relevant diagnostic information including register states and disassembled instruction words at the launch and target locations.

Figures 5.15 and 5.16 show screen captures of CPU traps and the resulting error messages and diagnostic information displayed in response to unauthorized program flow redirection attempts via insertion of a jump command or modification of the jump address. Traps are shown for invalid unconditional and conditional jumps.
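A minimal sketch of the complementary-pair test follows; the launch and landing-pad names match the LAND group described above, while the lookup structure and trap mechanism are illustrative.

# Required landing-pad type for each control-flow launch type.
REQUIRED_LAND = {'jmp': 'landj', 'jcc': 'landj', 'call': 'landc', 'ret': 'landr'}

def check_landing(launch_kind: str, target_opclass: str) -> None:
    """Trap unless the instruction at the redirection target is the landing pad
    that complements the launching instruction."""
    expected = REQUIRED_LAND[launch_kind]
    if target_opclass != expected:
        raise RuntimeError(
            f"control-flow trap: {launch_kind} reached "
            f"{target_opclass or 'an invalid instruction'} instead of {expected}")

check_landing('call', 'landc')        # valid call; no trap
# check_landing('ret', 'landj')       # would trap: missing landr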

In the event the jump target is not a proper land or a valid instruction, the trap routine disassembles as much of the target location as possible, as shown in the screen shot of Figure 5.17. The purpose of the disassembly attempt is to provide as much debug or forensic data as possible to aid in trap resolution. Figures 5.18 and 5.19 show screen captures of similar error and diagnostic information in response to traps resulting from unauthorized call and return locations.

Figure 5.15: Secure Host Console: Missing landj Trap

Figure 5.16: Secure Host Console: Missing landj (Conditional) Trap

5.7 Proof of Concept Demonstration Summary

Figure 5.17: Secure Host Console: Missing land Trap With Invalid Instruction

Figure 5.18: Secure Host Console: Missing landc Trap

Figure 5.19: Secure Host Console: Missing landr Trap

The proof of concept demonstration scenarios addressed above were selected to highlight a cross-section of the vulnerabilities and hardware-based security features presented in earlier chapters, and to demonstrate the capabilities of the Secure Host CPU testbed. Through these scenarios we have demonstrated the value of hardware-based defenses to harden legacy application code without the need for code accommodation or adaptation other than recompilation and reassembly. In addition, we have shown the capabilities of the testbed as a vehicle for future refinement of the techniques demonstrated and development of more advanced techniques.

While DARPA's DECREE OS was a significant element in the Secure Host CPU testbed, it is open source software that was used as a tool in the testbed rather than something developed or enhanced as part of the project. For this reason we have chosen to segregate discussion of CGC and DECREE details in the following chapter. Chapter 6 is included in this document in order to provide visibility for the DECREE experimental ecosystem concept and provide details on the OS system calls we adapted from DECREE.

Chapter 6

DARPA CGC and DECREE

This chapter summarizes portions of the DARPA Cyber Grand Challenge and the DECREE Operating System that are relevant to the Secure Host CPU proof of concept demonstration and to computer security research in general. The reader is referred to Appendix A (Ethics in Cybersecurity Research) for a background discussion, but here our focal interests are twofold:

• Effective hardware-based security, and

• Experimentation with offensive cyber techniques only where required and only in a safe and ethical manner.

This chapter briefly discusses the DARPA competition, its emphasis on safe and ethical computer security research, and in particular, the relevance of the DECREE Operating System to the Secure Host CPU testbed.

6.1 DARPA Cyber Grand Challenge

In 2016 the Defense Advanced Research Projects Agency (DARPA) held the final event of the Cyber Grand Challenge (CGC) [39], a DARPA-sponsored contest where computers were built for on-line Capture-the-Flag style cybersecurity competitions. During competition, contestants (through their automated tools) analyzed custom software programs built exclusively for the competition. The programs were Challenge Binaries (CBs) that implemented network services based on custom compiled C/C++ software. Each CB contained realistic designed-in (and possibly additional unknown) vulnerabilities similar to existing network systems. Contestants would engage the network services to probe for vulnerabilities and attempt to exploit the systems, returning proof of vulnerabilities (POVs) and, where possible, patches to the CBs to mitigate vulnerabilities identified.

The CGC environment's simplified operating system (OS) and its cybersecurity isolation features comprise an ideal demonstration vehicle for the Secure Host CPU against threats discussed in Chapter 2, and a superb foundation for test and evaluation in a follow-on effort.

6.2 DARPA DECREE

To support the CGC, DARPA constructed DECREE, the DARPA Experimental Cybersecurity Research Evaluation Environment [39] [38], and we have emulated the DECREE model and operating system calls in the Secure Host CPU Emulator and testbed. DECREE is an open source C/C++ OS extension built for cybersecurity research and experimentation with the following three major features:

• Simplicity: Rather than the typical hundreds of OS system calls, DECREE has just the seven necessary to provide the network interface and services.

• Incompatibility: DECREE is custom-built for computer security research. DECREE-hosted CBs have a unique binary format and system call paradigm; they share no code or protocols with the real world to ensure CGC automation research is incompatible with real-world operational software (i.e., don't pollute the computing ecosystem).

• High determinism: DECREE is designed with reproducibility properties built in, from kernel modifications up through the entire platform stack.

DECREE is Open Source [38] as an “experimentation ecosystem” [39] for the CGC competitions and other applied research activities. We leveraged this “ecosystem” to apply DECREE principles to the Secure Host CPU Testbed and used the C source code for one of the DECREE sample Challenge Binaries (CBs) for the proof of concept demonstration. A follow-on effort using additional unmodified CB source code and contest-generated POVs to evaluate the effectiveness of the Secure Host CPU in reducing vulnerabilities across the wide range of network services built on DARPA's existing C/C++ source code is strongly desired.

6.2.1 DECREE OS Syscalls

Again, we stress that DECREE is an open source DARPA product; except for some author commentary, DECREE details are published here as a convenience to the reader in better understanding operation of the testbed.

As detailed in the DECREE features section above, the Operating System Application Binary Interface [38] contains seven system calls to provide everything necessary to host a fully functional network appliance or network service on the Secure Host CPU Testbed. The syscalls are summarized below with a brief description of each function:

• Terminate – Gracefully terminate the application and return an integer status or result code to the OS. Equivalent to Linux exit().

• Transmit – Transmit or send data to the stdout or stderr file descriptor (fd). This is a blocking call similar to Linux send().

• Receive – Receive or read data from the stdin fd. This is a blocking call similar to Linux read().

• FDWait – Wait for specified fd(s) to be ready for i/o without blocking on a single fd. The fd set to wait on is specified separately, similar to Linux select().

• Allocate – Allocate heap memory for the current process. Equivalent to Linux malloc().

• Deallocate – Free heap memory. Similar to Linux free() except that the block freed is a multiple of page size.

• Random – Fill a specified block of memory with random bytes. There is no equivalent Linux OS system function; however, Linux provides memfrob(), which XORs a region of memory with the number 42¹ to “frobnicate (encrypt)” [96] the memory region.²

¹ “The answer to the ultimate question of life, the universe and everything is 42.” –The Hitchhiker's Guide to the Galaxy by Douglas Adams

² “Note that this function is not a proper encryption routine as the XOR constant is fixed, and is suitable only for hiding strings.” [96]

6.2.2 DECREE Syscall Interface

DECREE syscalls are accomplished in the same manner as Linux system calls, using the information shown in Tables 6.1 and 6.2. For C/C++ source the syscall numbers and function prototypes are supported by the Secure Host CPU Emulator compiler; for assembly, the syscall number is stored in the eax register, the remaining required parameters are stored in the registers listed in Table 6.2, and software interrupt 0x80 (int 0x80) is invoked. The status or result (if returned) is returned in register eax.

Table 6.1: DECREE OS Syscall Prototypes

Call  Function Prototype
1     void _terminate(int status)
2     int transmit(int fd, const void *buf, size_t count, size_t *tx_bytes)
3     int receive(int fd, void *buf, size_t count, size_t *rx_bytes)
4     int fdwait(int nfds, fd_set *readfds, fd_set *writefds, const struct timeval *timeout, int *readyfds)
5     int allocate(size_t length, int is_X, void **addr)
6     int deallocate(void *addr, size_t length)
7     int random(void *buf, size_t count, size_t *rnd_bytes)


Table 6.2: DECREE OS Syscall Format

Syscall Element   Location
Syscall Number    eax
Parameter 1       ebx
Parameter 2       ecx
Parameter 3       edx
Parameter 4       esi
Parameter 5       edi
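To tie Tables 6.1 and 6.2 together, the Python sketch below models the int 0x80 dispatch as the emulator might perform it; the register file and the stand-in handler are simplified and do not reflect the C99 emulator internals.

def handle_int80(regs, handlers):
    """Dispatch a DECREE syscall: number in eax, parameters in ebx, ecx, edx,
    esi, and edi, with the status returned in eax."""
    number = regs['eax']
    args = [regs[r] for r in ('ebx', 'ecx', 'edx', 'esi', 'edi')]
    regs['eax'] = handlers[number](*args)

# Stand-in for syscall 2 (transmit); the fifth register is unused by this call.
handlers = {2: lambda fd, buf, count, tx_bytes, _unused: 0}
regs = {'eax': 2, 'ebx': 1, 'ecx': 0x1000, 'edx': 16, 'esi': 0x2000, 'edi': 0}
handle_int80(regs, handlers)
print(regs['eax'])    # 0 indicates success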

Chapter 7

Future Work and Concluding Remarks

This chapter discusses conclusions and observations garnered during the Secure Host CPU project. The intent of this chapter is to encapsulate observations that may be of value in future work and provide some incentive for continuing Secure Host CPU development.

7.1 Architecture Retrospectives

7.1.1 x86 Patterning

Multiple factors drove the choice to pattern the original FPGA prototype after the x86. The “Intel-like model” [25] gave immediate familiarity with “the most popular and important in history”¹. Modern integrated development environments (IDEs) have made high level code debugging much easier, and the need for strong machine language skills is not as great for mature code bases, so prior experience with the CPU instruction set is not as important as it once was.

Another incentive for adopting an architecture similar to the x86 was the expectation of leveraging existing x86 development tools. In the end, this was more distraction than advantage. With a forward reference to section 7.2.1, we now consider the toolchain a non-factor in its real influence on an architecture choice if that choice were made today.

7.1.2 Concurrent Registers

A significant feature of the current architecture is concurrent (overlayed) general purpose registers. These 8-, 16-, and 32-bit registers had high value in backwards compatibility and utility at a time when silicon real estate was extremely valuable. For a clean-sheet design of the Secure Host CPU in today's technology, a simpler register architecture merits evaluation.

7.1.3 Additional Registers

In addition to the existing general purpose register suite, it is recognized that additional flag bits or a separate mode register will be required to support execution privilege levels separating kernel and system tasks from user processes before growing beyond a single-tasking system.

¹ Quoted from Bob Colwell, Intel Fellow, in his description of the book The Unabridged Pentium 4 IA32 Processor Genealogy (July 2004) by Bob Colwell and Tom Shanley

7.1.4 Instruction Set Architecture Changes

Recall that the conditional and unconditional jump instruction group and the call and ret instructions have respective landj, landc, and landr instructions as landing pads. The only test in the current Secure Host CPU is that launch and landing pads must be complementary pairs. No attempt was made to tokenize jump/land pairs to obtain more selectivity, but this position could be revisited if type matching is found to provide insufficient discrimination in future testing. To avoid linear strings of landing pads in complex code it is envisioned that tokenizing would be keyed by groups of launch pads rather than individual points.

7.2 Testbed Enhancements

7.2.1 Toolchain

During the course of the current effort an integrated toolchain for seamless conversion of C/C++ source to Secure Host CPU machine code was never realized. Semi-automatic code conversion is labor intensive and error prone. The siren song is that if creation of a new backend for GCC or LLVM is not an insurmountable task² [128, 59, 169], modifying an existing x86 backend should be straightforward. The reality we learned is that the x86 family of processors is highly integrated into toolchains and separated at compile time by target options, and the instruction selection and assembly process is conducted in multiple passes for optimization. The result is that there are no simple modifications without side effects. The proper approach would have been to treat the creation of a new back end as the research project it deserves to be (e.g., [30]³), learn the internals of the compiler, and create a new non-optimized backend from scratch.

² Krister Walfriddson said that writing a GCC backend for a new architecture “. . . is easy provided you have done it once before. But the first time is quite painful. . . ” [169].

This author's gross underestimation of the time and effort required to modify LLVM or GCC became far and away our single biggest regret. Therefore, “leveraging toolchains” is no longer perceived to be an important factor.

7.2.2 Replacement OS or Microkernel

An important future consideration is the possibility of porting an operating system or microkernel to the Secure Host and bootstrapping to 100% Secure Host CPU-compiled code to replace the Linux Host support functions. This is not required for further research and evaluation in the DECREE environment (section 6.2), but would be imperative for a fieldable Secure Host.

An initial survey was conducted to scope the effort of porting a replacement OS, and a microkernel was quickly identified as the preferred choice due to the very large size of the code base for an OS such as Linux. As an example, C/C++-based microkernels have published code sizes in the range of 10 thousand to 36 thousand source lines of code (SLOC) [63], compared to Linux's 15 to 25 million⁴.

To return briefly to the architecture discussion of section 7.1, an integrated look at a Phase II effort would surely include accommodation for a microkernel. Stock microkernels have targeted architectures including ARMv5, ARMv6, MIPS,

³ Release 3.9.1 (May 11, 2018) of this tutorial textbook is 605 pages in length.

⁴ Linux passed “15 million total lines of code” in 2011 based on a study by the Linux Foundation [121]. Current estimates are 25 million plus, but an important dimension not easily measured is the size impact of custom drivers, media additions, etc.

and x86. While customization of the microkernel would be required for porting to a Secure Host CPU, the effort may be simplified by either:

• Extending the existing Secure Host CPU architecture to provide higher fidelity to the x86 model (e.g., segment registers, additional control bits, . . . ), or

• Revising the Secure Host CPU architecture altogether to align more closely with one of the RISC architectures.

7.3 Secure Host CPU in Real Life

To anchor this final section of the dissertation we restate our vision of the Secure Host CPU in real life: a general purpose processor hardened against remote threats without the need of firewall-type rules, capable of general processing tasks in a high-exposure environment or on high value targets, or of serving as a secure front-end communication processor for more specialized high performance machines.

Two specific applications for hardware-enhanced security are supervisory control and data acquisition (SCADA) and Internet of Things (IoT) node devices:

• SCADA devices are widely applied in industrial controls and critical infrastructure operations such as electric power and gas transmission systems. We can headline this threat in one word: Stuxnet. This was a malicious software worm described as the “first cyber warfare weapon ever” [91] that seriously affected Iran’s uranium enrichment program when it infected SCADA devices controlling centrifuges. The criticality of many SCADA devices raises the importance of software updates to meet emerging cyber threats, but this criticality is juxtaposed to the small population size and high cost of software patches and upgrades due to lack of economy of scale.

• The “Internet of Things” refers in great part to the modern automated home. Wireless remote control of homes now extends from the video doorbell, remote control of door locks and security systems, and ‘smart’ TVs and refrigerators down to remotely-controlled light bulbs. A spoof of the Good Times virus published in 1996 stated in part that “It will recalibrate your refrigerator’s coolness setting so all your ice cream goes melty” [111]. This became more than a spoof in 2014 when it was confirmed that a ‘smart’ refrigerator was one of the devices hacked to send spam email as part of a widespread malware infection [110]. A very real difficulty with IoT devices is the sheer number of throw-away devices that, because of planned obsolescence, have no plan or provision for software upgrades. Sadly, for these devices security that is not built in at the factory will never be there at all.

We have no illusions of significantly changing the trajectory of the modern desktop or laptop computer, but hope to see some of the hardware-enforced security features covered in this document in future use in special-purpose controllers for such applications as the SCADA and IoT devices we have mentioned in closing.

Bibliography

[1] Martín Abadi, Mihai Budiu, Úlfar Erlingsson, and Jay Ligatti. Control-flow integrity principles, implementations, and applications. ACM Trans. Inf. Syst. Secur., 13(1):4:1–4:40, November 2009.

[2] AMD. Amd Athlon™ cpu competitive comparison. AMD Product Sheet.

[3] AMD. Amd Turion™ 64 x2 mobile technology dual-core processor product data sheet. AMD Product Sheet.

[4] V. Angelov. IP cores. VHDL-FPGA@PI 2013, 2013.

[5] Anonymous. Formjacking. Web User, pages 38–39, May 2019.

[6] ARM Limited. ARMv6-M Architecture Reference Manual. ARM Limited, 110 Fulbourn Road Cambridge, England CB1 9NJ, arm ddi 0419c edition, Sept 2010.

[7] Milad Aslander. Windows 8 security insights. Microsoft Virtual Academy, 2012.

[8] Atmel Corporation. 8-bit Atmel microcontroller with 128KBytes in-system programmable flash. Product Summary Sheet, Jun 2011.

[9] Murat Balaban. Buffer overflows demystified. Web Article, ukn. http://www.enderunix.org/documents/eng/bof-eng.txt.

[10] Piotr Bania. Securing the kernel via static binary rewriting and program shepherding. CoRR, abs/1105.1846, 2011.

[11] Elaine B Barker and John M Kelsey. Recommendation for Random Number Generation Using Deterministic Random Bit Generators. National Institute of Standards & Technology, Gaithersburg, MD, Jun 2015.

[12] Elena Gabriela Barrantes, David H. Ackley, Stephanie Forrest, and Darko Stefanović. Randomized instruction set emulation. ACM Trans. Inf. Syst. Secur., 8(1):3–40, Feb 2005.

[13] BBC. Coal mine canaries made redundant, Dec 30 1986.

[14] D. E. Bell and L. J. La Padula. Secure computer system: Unified exposition and Multics interpretation. Technical Report ESD-TR-75-306, Electronic Systems Division, AFSC, Hanscom Air Force Base, Bedford, Massachusetts, Mar 1976.

[15] Tyler Bletsch, Xuxian Jiang, Vince W Freeh, and Zhenkai Liang. Jump-oriented programming: a new class of code-reuse attack. In Proceedings of the 6th ACM Symposium on Information, Computer and Communications Security, pages 30–40. ACM, 2011.

[16] David G. Boak. A history of U.S. communications security (U). GovernmentAttic.org, July 1973.

[17] W. E. Boebert. On the inability of an unmodified capability machine to enforce the *-property. In 7th DOD/NBS Computer Security Conference, 1984.

[18] W.E. Boebert, R.Y. Kaln, W.D. Young, and S.A. Hansohn. Secure ada target: Issues, system design, and verification. In Security and Privacy, 1985 IEEE Symposium on, pages 176–176, April 1985.

[19] Stephen W. Boyd, Gaurav S. Kc, Michael E. Locasto, Angelos D. Keromytis, and Vassilis Prevelakis. On the general applicability of instruction-set randomization. IEEE Transactions on Dependable and Secure Computing, 7(3):255–270, 2010.

[20] Brandon Bray. Compiler security checks in depth. MSDN, February, 2002.

[21] Bulba and Kil3r. Bypassing stackguard and stackshield. Phrack, 0xa(0x38), 05 2000.

[22] BW Online Bureau. Cyber adversaries flock to apps where the users are and when users are online. Business World, 27 May 2019. Business Insights: Essentials, May 2019.

[23] George Arthur Burrell and Frank Meyers Seibert. Gases found in coal mines, volume Circular14. USGPO, 3rd edition, 1916.

[24] Jing Cao. Introspection in Dynamically Linked Applications. PhD thesis, University of Chicago, Chicago, IL, USA, 2007. AAI3272986.

[25] Dr. Marco Carvalho and Dr. Richard Ford. Study of secure prototype framework for cognitive network management.

[26] Marco Carvalho, Jared DeMott, Richard Ford, and David A. Wheeler. Heartbleed 101. IEEE Security & Privacy, 12(4):63–67, 2014.

[27] Stephen Checkoway, Lucas Davi, Alexandra Dmitrienko, Ahmad-Reza Sadeghi, Hovav Shacham, and Marcel Winandy. Return-oriented programming without returns. In Proceedings of the 17th ACM conference on Computer and communications security, pages 559–572. ACM, 2010.

[28] Ping Chen, Xiao Xing, Bing Mao, Li Xie, Xiaobin Shen, and Xinchun Yin. Automatic construction of jump-oriented programming shellcode (on the x86). In Proceedings of the 6th ACM Symposium on Information, Computer and Communications Security, pages 20–29. ACM, 2011.

[29] Michael Chertoff. The cybersecurity challenge. Regulation & Governance, 2(4):480–484, 2008.

[30] Chen Chung-Shu. Tutorial: Creating an LLVM Backend for the Cpu0 Archi- tecture. github.com, 2018.

[31] Marcia Conner. Encyclopedia of the Sciences of Learning. Springer, Jan 1 2012.

[32] Corelan Team. Exploit writing tutorial part 6 : Bypassing stack cookies, safeseh, sehop, hw dep and aslr. On-line Workshop, Sept 21 2009.

[33] Manuel Corregedor and Sebastiaan Von Solms. Windows 8 32 bit - improved security? In AFRICON, 2013, pages 1–5. IEEE, 2013.

[34] Harvey G Cragon. Memory systems and pipelined processors. Jones & Bartlett Learning, 1st edition, 1996.

[35] CSIT Laboratory. ASIC construction. http://www.csit-sun.pub.ro/resources/asic/CH15.pdf, Unk.

[36] David Culler, Jason Hill, Mike Horton, Kris Pister, Robert Szewczyk, and Alec Wood. Mica: The commercialization of microsensor motes. Sensor Technology and Design, April, 2002.

[37] John Curran. Three congressional committees schedule OPM data breach hearings. Cybersecurity Policy Report, page 1, Jun 22 2015. Copyright - Copyright Aspen Publishers, Inc. Jun 22, 2015; Last updated - 2015-07-02.

[38] DARPA. Cyber Grand Challenge repositories. http://github.com/cybergrandchallenge/, Jul 2015.

[39] DARPA. Cyber Grand Challenge web site. Archived at https://archive.darpa.mil/cybergrandchallenge/tech.html, Jul 2015.

[40] Robson de Oliveira Albuquerque, Luis Javier García Villalba, and Rafael Timóteo de Sousa. Enhancing an integer challenge-response protocol. In Osvaldo Gervasi, Beniamino Murgante, Antonio Laganà, David Taniar, Youngsong Mun, and Marina L. Gavrilova, editors, Computational Science and Its Applications – ICCSA 2008, pages 526–540, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.

[41] James C Dehnert. The transmeta crusoe: Vliw embedded in cisc. In Software and Compilers for Embedded Systems, pages 1–1. Springer, 2003.

[42] Barrett Devlin. FBI probing breach of houston astros database; authorities investigating whether st. louis cardinals employees hacked rival team’s network. Wall Street Journal (Online), Jun 16 2015.

180 [43] Yu Ding, Tao Wei, TieLei Wang, Zhenkai Liang, and Wei Zou. Heap taichi: exploiting memory allocation granularity in heap-spraying attacks. In Pro- ceedings of the 26th Annual Computer Security Applications Conference, pages 327–336. ACM, 2010.

[44] Wenliang Du. Return-to-libc attack lab. Syracuse University, Computer Lab Notes, 2010.

[45] M. R. Dugan. Analysis of existing implementations of true random number generators. George Mason University, Course Notes, May 12 2005.

[46] EE Herald. Custom hardware design: Why engineers going for fpga rather than asic? Electronic Engineering Herald, 2006.

[47] Electronic Frontier Foundation. Cyber security legislation. Web Blog, Sep

2015. https://www.eff.org/issues/cyber-security-legislation.

[48] Stephen Fischer. Supervisor mode execution protection. 2nd Annual NSA Trusted Computing Conference and Exposition, Feb 21 2011.

[49] Karl N Fleming. A risk informed defense-in-depth framework for existing and advanced reactors. Reliability engineering & system safety, 78:205–225, 2002.

[50] Flight Standards Service. Risk Management Handbook. FAA, U.S. Department of Transportation Federal Aviation Administration, faa-h-8083-2 edition, 2009.

[51] Agner Fog. Calling conventions for different c++ compilers and operating systems. Copenhagen University College of Engineering, 2009.

[52] Caroline Fontaine. Encyclopedia of Cryptography and Security. Springer, Jan 1 2011.

181 [53] Bree Fowler and Joe Mandak. Target data breach, Feb 08 2014. Name - Target Stores Inc; Copyright - Copyright Charleston Newspapers Feb 8, 2014; Last updated - 2014-02-10.

[54] LJ Fraim. Scomp: A solution to the multilevel security problem. Computer, 16(7):26–34, 1983.

[55] Aur´elien Francillon and Claude Castelluccia. Code injection attacks on harvard-architecture devices. In Proceedings of the 15th ACM conference on Computer and communications security, pages 15–26. ACM, 2008.

[56] Peter Gasperini. Lending perspective to asic vx. fpga debate. Electronic Engineering Times, (1156):73, Mar 05 2001.

[57] EL Glaser, JF Couleur, and GA Oliver. System design of a computer for time sharing applications. In Proceedings of the November 30–December 1, 1965, fall joint computer conference, part I, pages 197–202. ACM, 1965.

[58] Global IP News. Intel granted patent for control transfer termination instruc- tions of an instruction set architecture (isa). Information Technology Patent News, Jul 12 2017.

[59] Anthony Green. How to retarget the gnu toolchain in 21 patches.

[60] Michael Hamburg. Understanding intel’s ivy bridge random number genera- tor. Electronic Design (on line), Dec 11 2012.

[61] Stuart Hannabuss. A dictionary of philosophical logic. Reference Reviews, 24(1):19–20, 2010.

182 [62] Yongle Hao, Yizhen Jia, Baojiang Cui, Wei Xin, and Dehu Meng. Openssl heartbleed: Security management of implements of basic protocols. In P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), 2014 Ninth Inter- national Conference on, pages 520–524. IEEE, 2014.

[63] Gernot Heiser and Kevin Elphinstone. L4 microkernels: The lessons from 20 years of research and deployment. ACM Trans. Comput. Syst., 34(1):1:1–1:29, April 2016.

[64] John Hennessy, John L Hennessy, David Goldberg, and David A Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publish- ers, 2nd edition, 1990.

[65] C. Herder, M. Yu, F. Koushanfar, and S. Devadas. Physical unclonable func- tions and applications: A tutorial. Proceedings of the IEEE, 102(8):1126–1141, Aug 2014.

[66] Hewlett-Packard. Data Execution Prevention. Hewlett-Packard Development Company, L.P., v1.2 edition, May 2005.

[67] Gael Hofemeier. Find out about Intel’s new RdRand instruction. Intel Devel- oper Zone Blogs.

[68] Michael Howard. Address space layout randomization in windows vista. Mi- crosoft Corporation, 26, May 2006.

[69] Ted Huffmire, Cynthia Irvine, Thuy D Nguyen, Timothy Levin, Ryan Kastner, and Timothy Sherwood. Handbook of FPGA Design Security. Springer Science & Business Media, 2010.

183 [70] IEEE. IEEE code of ethics. Policies, Section 7.8, July 2015.

[71] Intel. Intel 80386 Programmer’s Reference Manual. Intel Corp., edited 2001- 02-01 edition, 1986.

[72] Intel. Intel® 64 and IA-32 Architectures Software Developers Manual. Intel Corporation, 253665-054us edition, Apr 2015.

[73] Intel. Intel® Core™2 duo processor e6600. Intel Product Sheet, Sep 2015.

[74] Intel (RB). Introduction to Intel® memory protection extensions. Web Arti- cle, Jul 16 2013.

[75] Intellectual Ventures. Intellectual ventures acquires transmeta patent portfo- lio. Press Release, Jan 28 2009.

[76] V. Iyer, A. Kanitkar, P. Dasgupta, and R. Srinivasan. Preventing overflow at- tacks by memory randomization. In 2010 IEEE 21st International Symposium on Software Reliability Engineering, pages 339–347, Nov 2010.

[77] Trent Jaeger. Operating system security, volume 1. Morgan & Claypool Pub- lishers, 2008.

[78] S. Jajodia, S. Noel, P. Kalapa, M. Albanese, and J. Williams. Cauldron mission-centric cyber situational awareness with defense in depth. In 2011 - MILCOM 2011 Military Communications Conference, pages 1339–1344, Nov 2011.

[79] JUCC. Code injection. A newsletter for IT Professionals, Joint Universities Computer Centre Limited, (4):12, Unk.

184 [80] Richard Y Kain and Carl E Landwehr. On access checking in capability-based systems. Software Engineering, IEEE Transactions on, (2):202–207, 1987.

[81] Paul Karger, Roger R Schell, et al. Thirty years later: Lessons from the multics security evaluation. In Computer Security Applications Conference, 2002. Proceedings. 18th Annual, pages 119–126. IEEE, 2002.

[82] Paul A Karger and Roger R Schell. Multics security evaluation volume ii: Vulnerability analysis. Technical Report ESD-TR-74-193, Vol. II, Electronic Systems Division, AFSC, Hanscom Air Force Base, Bedford, Massachusetts, Jun 1974.

[83] Gaurav S Kc, Angelos D Keromytis, and Vassilis Prevelakis. Countering code-injection attacks with instruction-set randomization. In Proceedings of the 10th ACM conference on Computer and communications security, pages 272–280. ACM, 2003.

[84] Erin Kenneally and Michael Bailey. Cyber-security research ethics dialogue & strategy workshop. ACM SIGCOMM Computer Communication Review (CCR), 4(2), Apr 2014.

[85] Christoph Kern, Anita Kesavan, and Neil Daswani. Foundations of security: what every programmer needs to know. Apress, 2007. Buffer Overflows, pp. 93–105.

[86] Alexander Klaiber et al. The technology behind Crusoe™ processors. Transmeta Technical Brief, Jan 2000.

[87] Philip Koopman. Stack computers: the new wave. Reprinted Mountain View Press, 1989.

[88] Jörg Krüger, Bertram Nickolay, and Sandro Gaycken. The secure information society: ethical, legal and political challenges. Springer Science & Business Media, 2012.

[89] Albert Kwon, Udit Dhawan, Jonathan M Smith, Thomas F Knight Jr, and Andre DeHon. Low-fat pointers: compact encoding and efficient gate-level implementation of fat pointers for spatial safety and capability-based security. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 721–732. ACM, 2013.

[90] Arthur M. Langer. Legacy Systems and Integration, pages 179–213. Springer London, London, 2016.

[91] R. Langner. Stuxnet: Dissecting a cyberwarfare weapon. IEEE Security Privacy, 9(3):49–51, May 2011.

[92] Stanley Mazor and Patricia Langstraat. A Guide to VHDL. Springer, 1992.

[93] Lawrence Lasker and Walter F. Parkes. War games. Movie, June 1983.

[94] Kevin Lawton. Bochs ia-32 emulator. Web site.

[95] Kyung-Suk Lhee and Steve J Chapin. Buffer overflow and format string overflow vulnerabilities. Software: Practice and Experience, 33(5):423–460, 2003.

[96] Linux contributors. memfrob(3). Linux Programmer’s Manual, Mar 28 2017.

[97] Linux Kernel Organization, Inc. Kernel parameters. Article, The Linux Kernel Archives, Sep 7 2015.

[98] Mehrdad Majzoobi, Farinaz Koushanfar, and Srinivas Devadas. Fpga-based true random number generation using circuit metastability with adaptive feedback control. In Cryptographic Hardware and Embedded Systems–CHES 2011, pages 17–32. Springer, 2011.

[99] Michail Maniatakos. Privilege escalation attack through address space identifier corruption in untrusted modern processors. In Design & Technology of Integrated Systems in Nanoscale Era (DTIS), 2013 8th International Conference on, pages 161–166. IEEE, 2013.

[100] Steve Mansfield-Devine. The state of operational technology security. Network security, 2019:9, 2019.

[101] John P McGregor, David K Karig, Zhijie Shi, and Ruby B Lee. A processor architecture defense against buffer overflow attacks. In Information Technology: Research and Education, 2003. Proceedings. ITRE2003. International Conference on, pages 243–250. IEEE, 2003.

[102] John Mechalas. The difference between rdrand and rdseed. Intel Developer Zone Article, Nov 17 2012.

[103] Microsoft. x86 architecture. Windows Hardware Dev Center, Jul 2015.

[104] Mike Frysinger et al. Hardened/GNU stack quickstart. Gentoo Foundation, Jun 24 2014.

[105] Mitre. CWE™ common weakness enumeration, a community-developed list of software weakness types. Web site.

[106] Multicians. Glossary. Multicians.org Web site, Jul 31 2015.

[107] Sebastian Nanz and Carlo A. Furia. A comparative study of programming languages in rosetta code. In Proceedings of the 37th International Conference on Software Engineering - Volume 1, ICSE ’15, pages 778–788, Piscataway, NJ, USA, 2015. IEEE Press.

[108] Nergal. The advanced return-into-lib(c) exploits: PaX case study. Phrack, 0x0b(0x3a):File 4 of 11, Dec 28 2001.

[109] RSA News. Lack of precise definitions plagues cybersecurity legislation. Infosecurity, 7(2):11, Mar-Apr 2010.

[110] UPI NewsTrack. ‘Smart’ refrigerator hacked to send out spam emails. Business Insights: Essentials, 1/17/2014.

[111] M. Newton and J.L. French. The Encyclopedia of High-tech Crime and Crime-fighting. Facts on File Crime Library. Facts On File, Incorporated, 2003.

[112] NICCS. A glossary of common cybersecurity terminology. US Department of Homeland Security, National Initiative for Cybersecurity Careers and Studies (NICCS) Web Site, Jul 2015.

[113] Aleph One. Smashing the stack for fun and profit. Phrack, 7(49):File 14 of 16, Nov 1996.

[114] OpenSSL. Openssl (1) ver. 1.0.1.o. Linux man page, Jun 12 2015.

[115] OpenSSL. Welcome to the openssl project. OpenSSL Web Site, Jul 2015.

[116] H. Orman. The morris worm: a fifteen-year perspective. Security Privacy, IEEE, 1(5):35–43, Sept 2003.

[117] Yongji Ouyang, Qingxian Wang, Jianshan Peng, and Jie Zeng. An advanced automatic construction method of rop. Wuhan University Journal of Natural Sciences, 20(2):119–128, 2015.

[118] Steven J Padilla and Terry Benzel. Final evaluation report of scomp secure communications processor stop release 2.1. Technical Report CSC-EPL-85/001, Department of Defense Computer Security Center, Ft. George G. Meade, MD, Sep 23 1985.

[119] Yongsu Park, Younho Lee, Heeyoul Kim, Gil-Joo Lee, and Il-Hee Kim. Hardware stack design: towards an effective defence against frame pointer overwrite attacks. In Advances in Information and Computer Security, pages 268–277. Springer, 2006.

[120] Yongsu Park, Yong Ho Song, and Eul Gyu Im. Design of a reliable hardware stack to defend against frame pointer overwrite attacks. In Proceedings of the 4th IEEE international conference on Intelligence and Security Informatics, pages 731–732. Springer-Verlag, 2006.

[121] Ryan Paul. Linux kernel in 2011: 15 million total lines of code and Microsoft is a top contributor. Article, Ars Technica Newsletter, Apr 4 2012.

[122] Mathias Payer and Thomas R Gross. String oriented programming: when aslr is not enough. In Proceedings of the 2nd ACM SIGPLAN Program Protection and Reverse Engineering Workshop, page 2. ACM, 2013.

[123] Bernd Paysan. A four stack processor. Article, Core Security Web Site, Apr 25 2000.

[124] Andrea Peterson. Lawsuits against sony pictures could test employer responsibility for data breaches. Washington Post – Blogs, Dec 19 2014.

[125] Suzy Platt. Respectfully Quoted: A Dictionary of Quotations. Number 1115. Barnes & Noble, Inc., 1993.

[126] Oxford University Press. Oxford dictionaries definitions (on line). Web Site, Jul 2015.

[127] LLVM Project. LLVM. http://llvm.org/.

[128] LLVM Project. Writing an llvm backend.

[129] Gerardo Richarte et al. Four different tricks to bypass stackshield and stackguard protection. Article, Core Security web site, Apr 2002.

[130] Ryan Roemer, Erik Buchanan, Hovav Shacham, and Stefan Savage. Return-oriented programming: Systems, languages, and applications. ACM Transactions on Information and System Security (TISSEC), 15(1):2, 2012.

[131] Andrew Rukhin et al. A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications. National Institute of Standards & Technology, Gaithersburg, MD, Apr 2010.

[132] S. Hinde. The complete security circle. Computers & security, 21(689), Nov 15 2002.

[133] Babak Salamat, Andreas Gal, Alexander Yermolovich, Karthik Manivannan, and Michael Franz. Reverse stack execution. Technical Report TR 07-07, University of California Irvine, Aug 23 2007.

[134] Klaus Schleisiek. Microcore: an open-source, scalable, dual-stack, harvard processor synthesisable vhdl for fpgas. Article, TU WIEN Institute for Information Systems Engineering Web Site, Nov 21 2001.

[135] Sebastian Schrittwieser, Martin Mulazzani, and Edgar Weippl. Ethics in security research: which lines should not be crossed? In Security and Privacy Workshops (SPW), 2013 IEEE, pages 1–4. IEEE, 2013.

[136] Hovav Shacham. The geometry of innocent flesh on the bone: Return-into-libc without function calls (on the x86). In Proceedings of the 14th ACM Conference on Computer and Communications Security, CCS ’07, pages 552–561, New York, NY, USA, 2007. ACM.

[137] Vedyvas Shanbhogue, Jason W Brandt, Uday R Savagaonkar, and Ravi L Sahita. Control transfer termination instructions of an instruction set architecture (isa), Jun 5 2014. US Patent Application Publication No. US 2014/0156972 A1.

[138] Alex Shaw, Dusten Doggett, and Munawar Hafiz. Automatically fixing c buffer overflows using program transformations. In Dependable Systems and Networks (DSN), 2014 44th Annual IEEE/IFIP International Conference on, pages 124–135. IEEE, 2014.

[139] Takahiro Shinagawa. Segmentshield: Exploiting segmentation hardware for protecting against buffer overflow attacks. In Reliable Distributed Systems, 2006. SRDS’06. 25th IEEE Symposium on, pages 277–288. IEEE, Oct 2006.

[140] Greg Shipley. Top 10 people: Elias levy. Network Computing, suppl. 10th Anniversary Special Issue, 11(19):76–78, Oct 2 2000.

[141] Kevin Skadron, Pritpal S Ahuja, Margaret Martonosi, and Douglas W Clark. Improving prediction for procedure returns with return-address-stack repair mechanisms. In Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture, pages 259–271. IEEE Computer Society Press, 1998.

[142] Sergei Skorobogatov. Latest news on my hardware security research. Web Article, University of Cambridge Department of Computer Science and Technology, Jul 2 2013.

[143] Sergei Skorobogatov and Christopher Woods. Breakthrough silicon scanning discovers backdoor in military chip, volume Cryptographic Hardware and Embedded Systems - CHES 2012, Lecture Notes in Computer Science Volume 7428, pp 23–40. Springer Berlin Heidelberg, 2012.

[144] Douglas J Smith. Vhdl and verilog compared and contrasted - plus modeled example written in vhdl, verilog and c. In Design Automation Conference Proceedings 1996, 33rd, pages 771–776. IEEE, 1996.

[145] Jaydeep Solanki, Aenik Shah, and Manik Lal Das. Secure patrol: Patrolling against buffer overflow exploits. Information Security Journal: A Global Perspective, 23(3):107–117, 2014.

[146] Joseph Souren. Security by design: hardware-based security in windows 8. Computer Fraud & Security, 2013(5):18–20, 2013.

[147] Raoul Strackx, Yves Younan, Pieter Philippaerts, Frank Piessens, Sven Lachmund, and Thomas Walter. Breaking the memory secrecy assumption. In Proceedings of the Second European Workshop on System Security, pages 1–8. ACM, 2009.

[148] C. E. Stroud. Anatomy of a flip-flop - elec 4200. Class Notes, Dept. of ECE, Auburn Univ., Aug 2006.

[149] Andrew Suffield. Bounds checking for c and c++. Web Article, Imperial College London.

[150] Caixia Sun and Minxuan Zhang. Dual-stack return address predictor. In Embedded Software and Systems, pages 172–179. Springer, 2005.

[151] W Swan. Design of nand logic switching circuits. Radio and Electronic Engineer, 44(1):27–32, 1974.

[152] Jácint Szabó. Good characterizations for some degree constrained subgraphs. Journal of Combinatorial Theory, Series B, 99(2):436–446, 2009.

[153] Greg Taylor and George Cox. Behind new random-number generator. IEEE Spectrum, 24, 2011.

[154] Ubuntu Security Team. Executable stacks. Article, ubuntu wiki Web Site.

[155] Ken Thompson. Reflections on trusting trust. Communications of the ACM, 27(8):761–763, Aug 1984.

[156] Matt Townsend and Chris Strohm. Home depot confirms data breach. Na- tional Post, Postmedia Network Inc., Sep 09 2014.

193 [157] Minh Tran, Mark Etheridge, Tyler Bletsch, Xuxian Jiang, Vincent Freeh, and Peng Ning. On the expressiveness of return-into-libc attacks. In Recent Advances in Intrusion Detection, pages 121–141. Springer, 2011.

[158] Transmeta. Transmeta™ Crusoe™ tm5800 processor for embedded applications. Product Brochure, 2004.

[159] Luca Trevisan, Gregory B Sorkin, Madhu Sudan, and David P Williamson. Gadgets, approximation, and linear programming. SIAM Journal on Computing, 29(6):2074–2097, 2000.

[160] Katrina Tsipenyuk. Seven pernicious kingdoms: A taxonomy of software security errors. In NIST Workshop on Software Security Assurance Tools, Techniques, and Metrics, November, 2005, 2005.

[161] Katrina Tsipenyuk, Brian Chess, and Gary McGraw. Seven pernicious kingdoms: A taxonomy of software security errors. Security & Privacy, IEEE, 3(6):81–84, 2005.

[162] Nektarios Georgios Tsoutsos and Michail Maniatakos. Fabrication attacks: Zero-overhead malicious modifications enabling modern microprocessor privilege escalation. Emerging Topics in Computing, IEEE Transactions on, 2(1):81–93, Mar 2014.

[163] Matthias Vallentin. On the evolution of buffer overflows. Munich, May, 2007.

[164] Arjan van de Ven. New security enhancements in red hat enterprise linux. Article, redhat Web Site, 2004.

[165] Henk CA Van Tilborg and Sushil Jajodia. Encyclopedia of cryptography and security. Springer Science & Business Media, 2011.

[166] Colt VanWinkle and Andy Davis. Compile time randomization. Web Article, MIT CSAIL Computer Systems Security Group, 2013.

[167] Vendicator. Stackshield, 0.7 beta. Web Article, Angelfire web site.

[168] Tom Van Vleck. How the air force cracked multics security. Web Article, Multicians.org web site, Feb 15 1995.

[169] Krister Walfridsson. Writing a gcc back end. Web blog, August 4 2017.

[170] Robert NM Watson, Peter G Neumann, Jonathan Woodruff, Jonathan Anderson, David Chisnall, Brooks Davis, Ben Laurie, Simon W Moore, Steven J Murdoch, and Michael Roe. Capability hardware enhanced risc instructions: Cheri instruction-set architecture. University of Cambridge, Computer Lab., Tech. Rep. UCAM-CL-TR-864, 2014.

[171] Robert NM Watson, Jonathan Woodruff, David Chisnall, Brooks Davis, Wojciech Koszek, A Theodore Markettos, Simon W Moore, Steven J Murdoch, Peter G Neumann, Robert Norton, et al. Bluespec extensible risc implementation: Beri hardware reference. University of Cambridge, Computer Laboratory, Technical Report, (UCAM-CL-TR-852), 2014.

[172] Yoav Weiss and Elena Gabriela Barrantes. Known/chosen key attacks against software instruction set randomization. In Computer Security Applications Conference, 2006. ACSAC’06. 22nd Annual, pages 349–360. IEEE, Dec 2006.

[173] D. Williams, W. Hu, J. W. Davidson, J. D. Hiser, J. C. Knight, and A. Nguyen-Tuong. Security through diversity: Leveraging virtual machine technology. IEEE Security Privacy, 7(1):26–33, Jan 2009.

[174] Jonathan Woodruff, Robert NM Watson, David Chisnall, Simon W Moore, Jonathan Anderson, Brooks Davis, Ben Laurie, Peter G Neumann, Robert Norton, and Michael Roe. The cheri capability model: Revisiting risc in an age of risk. In Proceeding of the 41st annual international symposium on Computer architecture, pages 457–468. IEEE Press, 2014.

[175] Jun Xu, Zbigniew Kalbarczyk, Sanjay Patel, and Ravishankar K Iyer. Architecture support for defending against buffer overflow attacks. In Workshop on Evaluating and Architecting Systems for Dependability. Citeseer, 2002.

[176] Dong Ye and David Kaeli. A reliable return address stack: Microarchitectural features to defeat stack smashing. ACM SIGARCH Computer Architecture News, 33(1):73–80, Mar 2005.

[177] Adam Zabrocki. The story of ms13-002: How incorrectly casting fat pointers can make your code explode. Microsoft Security Research and Defense Blog, Aug 6 2013.

[178] Qiang Zeng, Dinghao Wu, and Peng Liu. Cruiser: concurrent heap buffer overflow monitoring using lock-free data structures. ACM SIGPLAN Notices, 46(6):367–377, 2011.

[179] Chao Zhang, Tao Wei, Zhaofeng Chen, Lei Duan, Stephen McCamant, and Laszlo Szekeres. Protecting function pointers in binary. In Proceedings of the 8th ACM SIGSAC symposium on Information, computer and communications security, pages 487–492. ACM, 2013.

Appendix A

Ethics in Cybersecurity Research

Formation of guidelines for ethical cybersecurity research is in its infancy, as evidenced by the fact that the Cyber-security Research Ethics Dialogue & Strategy (CREDS) Workshop, held in May 2013 in conjunction with the IEEE Security and Privacy Symposium, was described as “inaugural” [84]. In the absence of formal guidelines, we would be remiss if we did not at least conform to the most general professional conduct rules, such as the following from item 1 of the IEEE Code of Ethics: “. . . to accept responsibility in making decisions consistent with the safety, health, and welfare of the public . . . ” [70]. Arguably, we are also compelled to act by the admonition “Do not watch bad things happening” [135] if it is within our power to contribute to improved security.

This research program is a defensive effort rather than the development of malware or attack techniques, so on the surface there should be no ethical questions about the research; however, we could potentially be porting and using aggressive, automated attack techniques against the Secure Host CPU in future test and demonstration phases. It is therefore important to point out that the techniques and tools planned for the proof-of-concept demonstration, as well as the software environment subjected to evaluation, will be based on DECREE (as discussed in Chapter 6). Since DECREE was designed explicitly for cybersecurity research and is not based on any existing real-world systems or protocols, we believe the research for this dissertation, including any potential enhancement of CGC attack techniques, and the publication of our designs, methods, and results falls well within the boundaries of ethical research in cybersecurity.

Appendix B

Use of IP-Core Devices from Untrusted Channels

During his Turing Award lecture, entitled “Reflections on Trusting Trust” [155], Ken Thompson related a most engaging and utterly convincing scenario whereby a self-replicating Trojan horse could be added to a C compiler's source code. Once the revised source was compiled and the new binary installed as the system compiler, the Trojan horse source code could be removed from the compiler's source. After that, every time the compiler is recompiled, the Trojan horse is reinserted into the next binary by the existing binary, even though there is no trace of the Trojan horse anywhere in the source code. He stated:

“The moral is obvious. You can't trust code that you did not totally create yourself.” [155]

We now understand that “totally create” does not mean only that we wrote all the source code; “totally create” also extends to every binary used to process the source, and to every source and binary before that. This very interesting tale relates to the secure host and its hardware-based security features in the following way: sources such as [99] and [162] caution that low-overhead hardware Trojans can be introduced at various points in the fabrication of an IP-core device when untrusted channels are used¹. Coupling Dr. Thompson's lecture with the C/C++ and VHDL compilers and supporting IP cores that may be used for a future secure host, it is clear that our (or any other) secure host may never be provably secure within available resources. Adding to this the concept of outsourced secure host chips for commodity IoT-connected devices further increases the uncertainty, but it also adds incentive for vigilance and maintenance of multiple levels of security.

¹An early draft of a Cryptographic Hardware and Embedded Systems Workshop paper reported “the first real world detection of a backdoor in a military-grade FPGA” [143]. This paper was picked up quickly by bloggers across the Internet and blamed on the Chinese; however, the author later confirmed that the “backdoor” was actually a factory test interface [142].
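As a toy illustration of the mechanism Thompson describes (a deliberately simplified sketch, not his code and not part of the Secure Host toolchain; the pattern strings, function names, and program structure below are assumptions made purely for this example), a “compiler” that pattern-matches on its input can miscompile both the login program and itself, so the Trojan persists even after it is removed from every source file:

/* Toy illustration of a self-replicating compiler Trojan: once this logic
 * lives in the installed compiler binary, the Trojan survives even if it is
 * removed from the compiler's source, because case 2 re-inserts it on every
 * rebuild of the compiler. */
#include <stdio.h>
#include <string.h>

static void compile(const char *src)
{
    if (strstr(src, "check_password(")) {
        /* Case 1: compiling the login program: quietly add a backdoor. */
        printf("emit: %s  [+ hidden master-password backdoor]\n", src);
    } else if (strstr(src, "void compile(")) {
        /* Case 2: compiling the compiler itself: re-insert this whole
         * Trojan into the new binary, with no trace left in any source. */
        printf("emit: %s  [+ self-replicating trojan]\n", src);
    } else {
        printf("emit: %s\n", src);   /* everything else compiles honestly */
    }
}

int main(void)
{
    compile("int check_password(const char *pw) { ... }");   /* backdoored  */
    compile("static void compile(const char *src) { ... }");  /* propagates  */
    compile("int add(int a, int b) { return a + b; }");       /* untouched   */
    return 0;
}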

Appendix C

Secure Host CPU Instruction Set

Table C.1 lists the Secure Host CPU instruction set with application notes. It provides the hexadecimal opclass code and mnemonic for each instruction, together with the valid/required Operand 1 and Operand 2 types, indicated by:

• R = Register

• M = Memory where M = [R] (operand 1 or 2) or [I] (operand 2 only)

• I = Immediate

Operand 1 is restricted to named registers and register memory pointers due to instruction word field width restrictions, as discussed in Section 4.8.2. Positionally, Operand 1 and Operand 2 are normally the target and the source, respectively; however, when an instruction has a single operand, that operand should be expected to be found in the Operand 2 container without regard to its contrived identity as a target or source (i.e., PUSH operand and POP operand). This is in order to accommodate the widest required operand value (i.e., a 32-bit integer immediate or a 32-bit address). It is of little consequence to the programmer because the Secure Host toolchain is context-aware and handles operand storage and/or display appropriately; it is only pointed out here as a consideration when analyzing load files or program memory content.

Notes on many instructions are given in the Remarks for each entry, following the operands. Where given, these notes provide a plain-English statement of the instruction and information on usage. In this section, assignments (i.e., equations containing a single equal sign, ‘=’) are stated in C/C++ programming format, where the Left Hand Side (LHS) is assigned the value resulting from the operation on the Right Hand Side (RHS). Logical tests for equality and inequality are represented by symbol pairs (i.e., “==” and “!=”).

The Flags entry of each row in Table C.1 states which flags are reset or modified by the instruction. When an equation is present (e.g., “OC = 0”) the indicated flags are cleared. A flag or flag group outside of an assignment statement is set or cleared according to the bit-logical or arithmetic result of the instruction's operation on the given operands.
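As an illustration of how these flag annotations can be read, the following minimal C sketch models the ADD row of Table C.1 only; the helper function, field names, and bit positions below are assumptions made for the example and are not part of the Secure Host design or its VHDL:

#include <stdint.h>
#include <stdio.h>

/* O, S, C, Z as used in the OSCZ flag notation of Table C.1. */
struct flags { unsigned o, s, c, z; };

/* 32-bit ADD with the four flags computed from the result. */
static struct flags add32(uint32_t a, uint32_t b, uint32_t *sum)
{
    struct flags f;
    *sum = a + b;
    f.c = (*sum < a);                              /* unsigned carry out      */
    f.z = (*sum == 0);                             /* result is zero          */
    f.s = (*sum >> 31) & 1u;                       /* sign bit of the result  */
    f.o = ((~(a ^ b) & (a ^ *sum)) >> 31) & 1u;    /* signed overflow         */
    return f;
}

int main(void)
{
    uint32_t r;
    struct flags f = add32(0x7FFFFFFFu, 1u, &r);   /* overflows as a signed add */
    printf("r=%08X O=%u S=%u C=%u Z=%u\n", r, f.o, f.s, f.c, f.z);
    return 0;
}

An entry such as “OC=0, SZ” for AND or OR is then read the same way: O and C are unconditionally cleared, while S and Z follow the result.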

Table C.1: Instruction Set Summary

Each entry gives the opclass code, the mnemonic (synonyms in parentheses), the valid Operand 1 and Operand 2 types, the flags modified, and remarks on the indented lines that follow.

0A3D  ADC    Op1: R/M   Op2: R/M/I   Flags: OSCZ
      ADD with carry (C); sum = sum + 1 if CF == 1
0E67  ADD    Op1: R/M   Op2: R/M/I   Flags: OSCZ
1173  AND    Op1: R/M   Op2: R/M/I   Flags: OC=0, SZ
3381  CALL   Op2: R/M/I
      Function call with LANDC test
1776  CMP    Flags: OSCZ
1B48  CPUID
      Returns CPU ID in eax (TBD)
238F  DIV    Op1: AL8, AX16, EAX32   Op2: R/M/I   Flags: Undefined
      Unsigned integer division:
      (AL, Rem. AH) = (Op1/Op2), 8-bit form
      (AX, Rem. DX) = (Op1/Op2), 16-bit form
      (EAX, Rem. EDX) = (Op1/Op2), 32-bit form
28FB  HALT
      Halt execution; requires console restart
306A  IDIV   Op1: AL8, AX16, EAX32   Op2: R/M/I   Flags: Undefined
      Signed integer division:
      (AL, Rem. AH) = (Op1/Op2), 8-bit form
      (AX, Rem. DX) = (Op1/Op2), 16-bit form
      (EAX, Rem. EDX) = (Op1/Op2), 32-bit form
16C0  IMUL   Op1: AX8, EAX16, EDX:EAX32   Op2: R/M/I   Flags: OC
      Signed integer multiplication:
      AX = AL · R/M/I, 8-bit form
      EAX = AX · R/M/I, 16-bit form
      EDX:EAX = EAX · R/M/I, 32-bit form
3111  INT    Op2: I
      Software interrupt, I = level
38F5  JA (JNBE)    Op2: R/M/I
      Jump with LANDJ test if CF == 0 and ZF == 0
3FAA  JBE (JNA)    Op2: R/M/I
      Jump with LANDJ test if CF == 1 or ZF == 1
46CD  JC (JB, JNAE)    Op2: R/M/I
      Jump with LANDJ test if CF == 1
48D9  JE (JZ)    Op2: R/M/I
      Jump with LANDJ test if ZF == 1
5177  JG (JNLE)    Op2: R/M/I
      Jump with LANDJ test if ZF == 0 and OF == SF
565D  JGE (JNL)    Op2: R/M/I
      Jump with LANDJ test if OF == SF
599E  JL (JNGE)    Op2: R/M/I
      Jump with LANDJ test if OF != SF
607B  JLE (JNG)    Op2: R/M/I
      Jump with LANDJ test if ZF == 1 or OF != SF
66E9  JMP    Op2: R/M/I
      Unconditional jump with LANDJ test
6B0A  JNC (JNB, JAE)    Op2: R/M/I
      Jump with LANDJ test if CF == 0
6FD0  JNE (JNZ)    Op2: R/M/I
      Jump with LANDJ test if ZF == 0
7260  JNO    Op2: R/M/I
      Jump with LANDJ test if OF == 0
7AA4  JNS    Op2: R/M/I
      Jump with LANDJ test if SF == 0
7DE5  JO     Op2: R/M/I
      Jump with LANDJ test if OF == 1
8760  JS     Op2: R/M/I
      Jump with LANDJ test if SF == 1
896C  LANDC
      Mandatory destination of CALL instructions
900B  LANDJ
      Mandatory destination of jump instructions, conditional and unconditional
9853  LANDR
      Immediately follows CALL as mandatory destination for RET instructions
9D41  MOV    Op1: R/M   Op2: R/M/I
A0CE  MOVSX  Op1: R/M   Op2: R/M/I
      Move 8 or 16 bits with sign extend to 32 bits
A785  MOVZX  Op1: R/M   Op2: R/M/I
      Move 8 or 16 bits with left zero fill to 32 bits
ABB5  MUL    Op1: AL8, AX16, EAX32   Op2: R/M/I   Flags: OC
      Unsigned integer multiplication:
      AX = AL · R/M/I, 8-bit form
      EAX = AX · R/M/I, 16-bit form
      EDX:EAX = EAX · R/M/I, 32-bit form
B2BC  NOP
B54A  OR     Op1: R/M   Op2: R/M/I   Flags: OC=0, SZ
BAC1  POP    Op2: R/M
      Pre-decrement ESP 4 bytes
C05F  PUSH   Op2: R/M/I
      Post-increment ESP 4 bytes
C71C  RET
      Return with LANDR test
CD75  ROL    Op1: R/M   Op2: R/M/I   Flags: C
      Rotate left, count < target width
D10B  ROR    Op1: R/M   Op2: R/M/I   Flags: C
      Rotate right, count < target width
D6BB  SAL (SHL)    Op1: R/M   Op2: R/M/I   Flags: SCZ
      Shift left into carry flag, b0 = 0, count < target width
DEC6  SAR    Op1: R/M   Op2: R/M/I   Flags: SCZ
      Signed shift right, sign bit unchanged, C = b0, count < target width
E351  SBB    Op1: R/M   Op2: R/M/I   Flags: OSCZ
      Subtract with borrow, diff = diff - 1 if CF == 1
EEBF  SHL    Op1: R/M   Op2: R/M/I
      Synonym for SAL
F3B7  SHR    Op1: R/M   Op2: R/M/I   Flags: SCZ
      Unsigned shift right, bhigh = 0, C = b0, count < target width
FA73  SUB    Op1: R/M   Op2: R/M/I   Flags: OSCZ
      Subtract
FD7A  XOR    Op1: R/M   Op2: R/M/I   Flags: OC=0, SZ
      Exclusive OR

End of Table C.1
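To make the LANDC, LANDJ, and LANDR rows above concrete, the following minimal C sketch models the landing checks in software. It is illustrative only: in the Secure Host CPU these checks are enforced in hardware, and the program layout, trap messages, and helper code here are assumptions made for the example; only the opclass codes are taken from Table C.1.

#include <stdint.h>
#include <stdio.h>

/* Opclass codes from Table C.1. */
#define OP_CALL   0x3381u
#define OP_LANDC  0x896Cu
#define OP_LANDR  0x9853u
#define OP_RET    0xC71Cu
#define OP_HALT   0x28FBu

int main(void)
{
    /* A tiny program image (opclass codes only, operands omitted):
     * index 0: CALL to index 3, index 1: LANDR (return landing point),
     * index 2: HALT, index 3: LANDC (call landing point), index 4: RET. */
    const uint16_t prog[] = { OP_CALL, OP_LANDR, OP_HALT, OP_LANDC, OP_RET };
    const unsigned call_site = 0, call_target = 3;

    /* Rule 1: the destination of a CALL must be a LANDC instruction. */
    if (prog[call_target] != OP_LANDC) {
        puts("trap: CALL to an instruction other than LANDC");
        return 1;
    }
    /* Rule 2: the instruction after the CALL must be LANDR, so the later
     * RET can only land immediately behind the originating CALL. */
    if (prog[call_site + 1] != OP_LANDR) {
        puts("trap: RET would land on an instruction other than LANDR");
        return 1;
    }
    puts("call/return landing checks pass");
    return 0;
}

A jump instruction would be checked the same way against LANDJ, which is why every conditional and unconditional jump in Table C.1 is described as a "Jump with LANDJ test".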