Linux Provenance Modules: Trustworthy Whole-System Provenance for the Linux Kernel

Linux Provenance Modules: Trustworthy Whole-System Provenance for the Linux Kernel Adam Bates1, Dave Tian1, Kevin R.B. Butler1, and Thomas Moyer2 1University of Florida 2MIT Lincoln Laboratory Abstract ject being processed on the system. This allows users to In a provenance-aware system, mechanisms gather and track, and understand, how a piece of data came to ex- report metadata that describes the history of each ob- ist in its current state. The application of provenance ject being processed on the system. This allows users is of enormous interest in a variety of disparate com- to track, and understand, how a piece of data came to munities including scientific data processing, databases, exist in its current state. However, while past work has software development, and storage [56, 66]. Provenance demonstrated the usefulness of provenance for myriad has also been demonstrated to be of great value to se- applications, less attention has been given to securing curity by identifying malicious activity in data centers provenance-aware systems. Provenance itself is a ripe [11, 34, 70, 82, 83], improving Mandatory Access Con- attack vector, and the authenticity and integrity of prove- trol (MAC) labels [58, 59, 60], and assuring regulatory nance must be guaranteed before it can be put to use. compliance [9]. We present Linux Provenance Modules (LPM), Unfortunately, most provenance collection mecha- the first general framework for the development of nisms in the literature exist as fully-trusted user space provenance-aware systems. We demonstrate that LPM applications [35, 34, 53, 70]. Even kernel-based prove- creates a trusted provenance-aware execution environ- nance mechanisms [56, 61] and sketches for trusted ment, collecting complete whole-system provenance provenance architectures [52, 55] fall short of providing while imposing as little as 2.7% performance overhead a provenance-aware system for malicious environments. on normal system operation. LPM introduces new mech- The problem of whether or not to trust provenance is fur- anisms for secure provenance layering and authenticated ther exacerbated in distributed environments, or in lay- communication between provenance-aware hosts, and ered provenance systems, due to the lack of a mechanism also interoperates with existing mechanisms to provide to verify the authenticity and integrity of provenance col- strong security assurances. To demonstrate the poten- lected from different sources. tial uses of LPM, we design a Provenance-Based Data In this work, we present Linux Provenance Modules Loss Prevention (PB-DLP) system. We implement PB- (LPM), the first generalized framework for secure prove- DLP as a file transfer application that blocks the trans- nance collection on the Linux operating system. Mod- mission of files derived from sensitive ancestors while ules capture whole-system provenance, a detailed record imposing just tens of milliseconds overhead. LPM is the of processes, IPC mechanisms, network activity, and first step towards widespread deployment of trustworthy even the kernel itself; this capture is invisible to the ap- provenance-aware applications. plications for which provenance is being collected. LPM introduces a gateway that permits the upgrading of low integrity workflow provenance from user space. LPM 1 Introduction also facilitates secure distributed provenance through an authenticated, tamper-evident channel for the transmis- A provenance-aware system automatically gathers and sion of provenance metadata between hosts. LPM inter- reports metadata that describes the history of each ob- operates with existing security mechanisms to establish a The Lincoln Laboratory portion of this work was sponsored by the hardware-based root of trust to protect system integrity. Assistant Secretary of Defense for Research & Engineering under Air Achieving the goal of trustworthy whole-system Force Contract #FA8721-05-C-0002. Opinions, interpretations, con- clusions and recommendations are those of the author and are not nec- provenance, we demonstrate the power of our approach essarily endorsed by the United States Government. by presenting a scheme for Provenance-Based Data Loss 1 /etc/ld.so.cache:0 /lib/libc-2.12.so:0 /etc/rc.local:0 /bin/ps:0 /var/spool/cron/root:0 /etc/passwd:0 /etc/shadow:0 root Used Used Used Used Used Used Used WasControlledBy Malicious Binary WasGeneratedBy WasGeneratedByWasGeneratedBy WasGeneratedBy WasGeneratedBy /etc/rc.local:1 /bin/ps:1 /var/spool/cron/root:1 /etc/passwd:1 /etc/shadow:1 Figure 1: A provenance graph showing the attack footprint of a malicious binary. Edges encode relationships that flow backwards into the history of system execution, and writing to an object creates a second node with an incremented version number. Here, we see that the binary has rewritten /etc/rc.local, likely in an attempt to gain persistence after a system reboot. Prevention (PB-DLP). PB-DLP allows administrators to formation flow security. In Section 3 we present the de- reason about the propagation of sensitive data and control sign, implementation, and deployment of Linux Prove- its further dissemination through an expressive policy nance Modules. We analyze the security of such a de- system, offering dramatically stronger assurances than ployment in Section 4. LPM system performance is eval- existing enterprise solutions, while imposing just mil- uated in Section 6. Section 5 presents the design an im- liseconds of overhead on file transmission. To our knowl- plementation of an exemplar LPM application that em- edge, this work is the first to apply provenance to DLP. ploys a provenance-based approach to data loss preven- Our contributions can thus be summarized as follows: tion. Other aspects of LPM are discussed in Section 7, and in Section 9 we conclude. • Introduce Linux Provenance Modules (LPM). LPM facilitates secure provenance collection at the kernel layer, supports attested disclosure at the ap- 2 Background plication layer, provides an authenticated channel for network transmission, and is compatible with Data provenance, sometimes called lineage, describes the W3C Provenance (PROV) Model [74]. In eval- the actions taken on a data object from its creation up uation, we demonstrate that provenance collection to the present. Provenance can be used to answer a va- imposes as little as 2.7% performance overhead. riety of historical questions about the data it describes. Such questions include, but are not limited to, “What • Demonstrate secure deployment. Leveraging processes and datasets were used to generate this data?" LPM and existing security mechanisms, we cre- and “In what environment was the data produced?" Con- ate a trusted provenance-aware execution environ- versely, provenance can also answer questions about the ment for Linux. We port and extend the Hi-Fi sys- successors of a piece of data, such as “What objects on tem [61], provide a second module that interoper- the system were derived from this object?" Although po- ates with the SPADE system [36], and describe how tential applications for such information are nearly lim- LPM is being used to create provenance-informed itless, past proposals have conceptualized provenance in MAC policies [68]. We show that, in realistic ma- different ways, indicating that a one-size-fits-all solution licious environments, ours is the first proposed sys- to provenance collection is unlikely to meet the needs of tem to offer secure provenance collection. all of these audiences. • Introduce Provenance-Based Data Loss Preven- The commonly accepted representation for data prove- tion (PB-DLP). We present a new paradigm for nance is a directed acyclic graph (DAG). In this work, we the prevention of data leakage that searches object use the W3C PROV-DM specification [74] because it is provenance to identify and prevent the spread of pervasive and facilitates the exchange of provenance be- sensitive data. PB-DLP is impervious to attempts tween deployments. An example PROV-DM graph of a to launder data through intermediary files and IPC. malicious binary is shown in Figure 1. This graph de- We implement PB-DLP as a file transfer applica- scribes an attack in which a binary running with root tion, and demonstrate its ability to query object an- privilege reads several sensitive system files, then ed- cestries in just tens of milliseconds. its those files in an attempt to gain persistent access to the host. Edges encode relationships between nodes, The rest of this paper is structured as follows. In Sec- pointing backwards into the history of system execution. tion 2, we present background on provenance, and ex- Writing to an object triggers the creation of a second ob- plain how it compares to past efforts in the area of in- ject node with an incremented version number. This par- 2 ticular provenance graph could serve as a valuable foren- In addition to storing measurements, the TPM pro- sics tool, allowing system administrators to better under- vides a reporting mechanism that can be used to prove stand the nature of a network intrusion. the integrity of the system (at least the parts that have been measured). This is done via the quote opera- 2.1 Linux Security Modules tion, which can also be called an attestation. The TPM quote is a digitally signed “proof” of the set of measure- The Linux Security Modules (LSM) framework provides ments recorded in the Platform Configuration Registers. a set of standardized authorization hooks for implement- A client wishing to validate the set of measurements can ing flexible mandatory access control in the Linux kernel connect to the system, and request a quote. As part of the [76]. LSM provides complete mediation of operations request, the client provides the list of PCRs he is inter- on key kernel data types by ensuring that an authoriza- ested in validating, and a nonce. The TPM takes these, tion step occurs prior to permitting these operations to reads the values in the PCRs, and signs the nonce and execute. LSM is designed on the principle of generality; PCR values.

Load more