Trustworthy Whole-System Provenance for the Linux Kernel
Adam Bates, Dave (Jing) Tian, Kevin R.B. Butler
University of Florida
{adammbates,daveti,butler}@ufl.edu

Thomas Moyer
MIT Lincoln Laboratory
[email protected]

Abstract

In a provenance-aware system, mechanisms gather and report metadata that describes the history of each object being processed on the system, allowing users to understand how data objects came to exist in their present state. However, while past work has demonstrated the usefulness of provenance, less attention has been given to securing provenance-aware systems. Provenance itself is a ripe attack vector, and its authenticity and integrity must be guaranteed before it can be put to use. We present Linux Provenance Modules (LPM), the first general framework for the development of provenance-aware systems. We demonstrate that LPM creates a trusted provenance-aware execution environment, collecting complete whole-system provenance while imposing as little as 2.7% performance overhead on normal system operation. LPM introduces new mechanisms for secure provenance layering and authenticated communication between provenance-aware hosts, and also interoperates with existing mechanisms to provide strong security assurances. To demonstrate the potential uses of LPM, we design a Provenance-Based Data Loss Prevention (PB-DLP) system. We implement PB-DLP as a file transfer application that blocks the transmission of files derived from sensitive ancestors while imposing just tens of milliseconds overhead. LPM is the first step towards widespread deployment of trustworthy provenance-aware applications.

1 Introduction

A provenance-aware system automatically gathers and reports metadata that describes the history of each object being processed on the system. This allows users to track, and understand, how a piece of data came to exist in its current state. The application of provenance is presently of enormous interest in a variety of disparate communities including scientific data processing, databases, software development, and storage [43, 53]. Provenance has also been demonstrated to be of great value to security by identifying malicious activity in data centers [5, 27, 56, 65, 66], improving Mandatory Access Control (MAC) labels [45, 46, 47], and assuring regulatory compliance [3].

Unfortunately, most provenance collection mechanisms in the literature exist as fully-trusted user space applications [28, 27, 41, 56]. Even kernel-based provenance mechanisms [43, 48] and sketches for trusted provenance architectures [40, 42] fall short of providing a provenance-aware system for malicious environments. The problem of whether or not to trust provenance is further exacerbated in distributed environments, or in layered provenance systems, due to the lack of a mechanism to verify the authenticity and integrity of provenance collected from different sources.

In this work, we present Linux Provenance Modules (LPM), the first generalized framework for secure provenance collection on the Linux operating system. Modules capture whole-system provenance, a detailed record of processes, IPC mechanisms, network activity, and even the kernel itself; this capture is invisible to the applications for which provenance is being collected. LPM introduces a gateway that permits the upgrading of low integrity workflow provenance from user space. LPM also facilitates secure distributed provenance through an authenticated, tamper-evident channel for the transmission of provenance metadata between hosts. LPM interoperates with existing security mechanisms to establish a hardware-based root of trust to protect system integrity.

Achieving the goal of trustworthy whole-system provenance, we demonstrate the power of our approach by presenting a scheme for Provenance-Based Data Loss Prevention (PB-DLP).
PB-DLP allows administrators to reason about the propagation of sensitive data and control its further dissemination through an expressive policy system, offering dramatically stronger assurances than existing enterprise solutions, while imposing just milliseconds of overhead on file transmission. To our knowledge, this work is the first to apply provenance to DLP.

Our contributions can thus be summarized as follows:

• Introduce Linux Provenance Modules (LPM). LPM facilitates secure provenance collection at the kernel layer, supports attested disclosure at the application layer, provides an authenticated channel for network transmission, and is compatible with the W3C Provenance (PROV) Model [59]. In evaluation, we demonstrate that provenance collection imposes as little as 2.7% performance overhead.

• Demonstrate secure deployment. Leveraging LPM and existing security mechanisms, we create a trusted provenance-aware execution environment for Linux. Through porting Hi-Fi [48] and providing support for SPADE [29], we demonstrate the relative ease with which LPM can be used to secure existing provenance collection mechanisms. We show that, in realistic malicious environments, ours is the first proposed system to offer secure provenance collection.

• Introduce Provenance-Based Data Loss Prevention (PB-DLP). We present a new paradigm for the prevention of data leakage that searches object provenance to identify and prevent the spread of sensitive data. PB-DLP is impervious to attempts to launder data through intermediary files and IPC. We implement PB-DLP as a file transfer application, and demonstrate its ability to query object ancestries in just tens of milliseconds.

The Lincoln Laboratory portion of this work was sponsored by the Assistant Secretary of Defense for Research & Engineering under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

[Figure 1 graph: the process node "Malicious Binary" WasControlledBy root; it Used /etc/rc.local:0, /bin/ps:0, /var/spool/cron/root:0, /etc/passwd:0, and /etc/shadow:0; the nodes /etc/rc.local:1, /bin/ps:1, /var/spool/cron/root:1, /etc/passwd:1, and /etc/shadow:1 each WasGeneratedBy it.]

Figure 1: A provenance graph showing the attack footprint of a malicious binary. Edges encode relationships that flow backwards into the history of system execution, and writing to an object creates a second node with an incremented version number. Here, we see that the binary has rewritten /etc/rc.local, likely in an attempt to gain persistence after a system reboot.

2 Background

Data provenance, sometimes called lineage, describes the actions taken on a data object from its creation up to the present. Provenance can be used to answer a variety of historical questions about the data it describes. Such questions include, but are not limited to, “What processes and datasets were used to generate this data?” and “In what environment was the data produced?” Conversely, provenance can also answer questions about the successors of a piece of data, such as “What objects on the system were derived from this object?” Although potential applications for such information are nearly limitless, past proposals have conceptualized provenance in different ways, indicating that a one-size-fits-all solution to provenance collection is unlikely to meet the needs of all of these audiences. We review these past proposals for provenance-aware systems in Section 8.

The commonly accepted representation for data provenance is a directed acyclic graph (DAG). In this work, we use the W3C PROV-DM specification [59] because it is pervasive and facilitates the exchange of provenance between deployments. An example PROV-DM graph of a malicious binary is shown in Figure 1. This graph describes an attack in which a binary running with root privilege reads several sensitive system files, then edits those files in an attempt to gain persistent access to the host. Edges encode relationships between nodes, pointing backwards into the history of system execution. Writing to an object triggers the creation of a second object node with an incremented version number. This particular provenance graph could serve as a valuable forensics tool, allowing system administrators to better understand the nature of a network intrusion.

2.1 Data Loss Prevention

Data Loss Prevention (DLP) is enterprise software that seeks to minimize the leakage of sensitive data by monitoring and controlling information flow in large, complex organizations [1].1 In addition to the desire to control intellectual property, another motivator for DLP systems is demonstrating regulatory compliance for personally-identifiable information (PII),2 as well as directives such as PCI,3 HIPAA,4 SOX,5 or E.U. Data Protection.6 As encryption can be used to protect data at rest from unauthorized access, the true DLP challenge involves preventing leakage at the hands of authorized users, both malicious and well-meaning agents. This latter group is a surprisingly big problem in the fight to control an organization’s intellectual property; a 2013 study conducted by the Ponemon Institute found that over half of companies’ employees admitted to emailing intellectual property to their personal email accounts, with 41 percent admitting to doing so on a weekly basis [2]. It is therefore important for a DLP system to be able to exhaustively explain which pieces of data are sensitive, where that data

1 Our overview of data loss prevention is based on review of publicly available product descriptions for software developed by Bit9, CDW, Cisco, McAfee, Symantec, and Titus.
2 See NIST SP 800-122

3.1 Defining Whole-System Provenance

In the design of LPM, we adopt a model for whole-system provenance that is broad enough to accommodate the needs of a variety of existing provenance projects. To arrive at a definition, we inspect four past proposals that collect broadly scoped provenance: SPADE [29], LineageFS [53], PASS [43], and Hi-Fi [48]. SPADE provenance is structured around primitive operations of system activities with data inputs and outputs. It instruments file and process system calls, and associates each call with a process ID (PID), user identifier, and network address.