Provenance Modules: Trustworthy Whole-System Provenance for the

Adam Bates1, Dave Tian1, Kevin R.B. Butler1, and Thomas Moyer2 1University of Florida 2MIT Lincoln Laboratory

Abstract ject being processed on the system. This allows users to In a provenance-aware system, mechanisms gather and track, and understand, how a piece of data came to ex- report metadata that describes the history of each ob- ist in its current state. The application of provenance ject being processed on the system. This allows users is of enormous interest in a variety of disparate com- to track, and understand, how a piece of data came to munities including scientific data processing, databases, exist in its current state. However, while past work has software development, and storage [56, 66]. Provenance demonstrated the usefulness of provenance for myriad has also been demonstrated to be of great value to se- applications, less attention has been given to securing curity by identifying malicious activity in data centers provenance-aware systems. Provenance itself is a ripe [11, 34, 70, 82, 83], improving Mandatory Access Con- attack vector, and the authenticity and integrity of prove- trol (MAC) labels [58, 59, 60], and assuring regulatory nance must be guaranteed before it can be put to use. compliance [9]. We present Linux Provenance Modules (LPM), Unfortunately, most provenance collection mecha- the first general framework for the development of nisms in the literature exist as fully-trusted provenance-aware systems. We demonstrate that LPM applications [35, 34, 53, 70]. Even kernel-based prove- creates a trusted provenance-aware execution environ- nance mechanisms [56, 61] and sketches for trusted ment, collecting complete whole-system provenance provenance architectures [52, 55] fall short of providing while imposing as little as 2.7% performance overhead a provenance-aware system for malicious environments. on normal system operation. LPM introduces new mech- The problem of whether or not to trust provenance is fur- anisms for secure provenance layering and authenticated ther exacerbated in distributed environments, or in lay- communication between provenance-aware hosts, and ered provenance systems, due to the lack of a mechanism also interoperates with existing mechanisms to provide to verify the authenticity and integrity of provenance col- strong security assurances. To demonstrate the poten- lected from different sources. tial uses of LPM, we design a Provenance-Based Data In this work, we present Linux Provenance Modules Loss Prevention (PB-DLP) system. We implement PB- (LPM), the first generalized framework for secure prove- DLP as a file transfer application that blocks the trans- nance collection on the Linux operating system. Mod- mission of files derived from sensitive ancestors while ules capture whole-system provenance, a detailed record imposing just tens of milliseconds overhead. LPM is the of processes, IPC mechanisms, network activity, and first step towards widespread deployment of trustworthy even the kernel itself; this capture is invisible to the ap- provenance-aware applications. plications for which provenance is being collected. LPM introduces a gateway that permits the upgrading of low integrity workflow provenance from user space. LPM 1 Introduction also facilitates secure distributed provenance through an authenticated, tamper-evident channel for the transmis- A provenance-aware system automatically gathers and sion of provenance metadata between hosts. LPM inter- reports metadata that describes the history of each ob- operates with existing security mechanisms to establish a The Lincoln Laboratory portion of this work was sponsored by the hardware-based root of trust to protect system integrity. Assistant Secretary of Defense for Research & Engineering under Air Achieving the goal of trustworthy whole-system Force Contract #FA8721-05-C-0002. Opinions, interpretations, con- clusions and recommendations are those of the author and are not nec- provenance, we demonstrate the power of our approach essarily endorsed by the United States Government. by presenting a scheme for Provenance-Based Data Loss

1 /etc/ld.so.cache:0 /lib/libc-2.12.so:0 /etc/rc.local:0 /bin/ps:0 /var/spool/cron/root:0 /etc/passwd:0 /etc/shadow:0 root

Used Used Used Used Used Used Used WasControlledBy

Malicious Binary

WasGeneratedBy WasGeneratedByWasGeneratedBy WasGeneratedBy WasGeneratedBy

/etc/rc.local:1 /bin/ps:1 /var/spool/cron/root:1 /etc/passwd:1 /etc/shadow:1

Figure 1: A provenance graph showing the attack footprint of a malicious binary. Edges encode relationships that flow backwards into the history of system execution, and writing to an object creates a second node with an incremented version number. Here, we see that the binary has rewritten /etc/rc.local, likely in an attempt to gain persistence after a system reboot.

Prevention (PB-DLP). PB-DLP allows administrators to formation flow security. In Section 3 we present the de- reason about the propagation of sensitive data and control sign, implementation, and deployment of Linux Prove- its further dissemination through an expressive policy nance Modules. We analyze the security of such a de- system, offering dramatically stronger assurances than ployment in Section 4. LPM system performance is eval- existing enterprise solutions, while imposing just mil- uated in Section 6. Section 5 presents the design an im- liseconds of overhead on file transmission. To our knowl- plementation of an exemplar LPM application that em- edge, this work is the first to apply provenance to DLP. ploys a provenance-based approach to data loss preven- Our contributions can thus be summarized as follows: tion. Other aspects of LPM are discussed in Section 7, and in Section 9 we conclude. • Introduce Linux Provenance Modules (LPM). LPM facilitates secure provenance collection at the kernel layer, supports attested disclosure at the ap- 2 Background plication layer, provides an authenticated channel for network transmission, and is compatible with Data provenance, sometimes called lineage, describes the W3C Provenance (PROV) Model [74]. In eval- the actions taken on a data object from its creation up uation, we demonstrate that provenance collection to the present. Provenance can be used to answer a va- imposes as little as 2.7% performance overhead. riety of historical questions about the data it describes. Such questions include, but are not limited to, “What • Demonstrate secure deployment. Leveraging processes and datasets were used to generate this data?" LPM and existing security mechanisms, we cre- and “In what environment was the data produced?" Con- ate a trusted provenance-aware execution environ- versely, provenance can also answer questions about the ment for Linux. We port and extend the Hi-Fi sys- successors of a piece of data, such as “What objects on tem [61], provide a second module that interoper- the system were derived from this object?" Although po- ates with the SPADE system [36], and describe how tential applications for such information are nearly lim- LPM is being used to create provenance-informed itless, past proposals have conceptualized provenance in MAC policies [68]. We show that, in realistic ma- different ways, indicating that a one-size-fits-all solution licious environments, ours is the first proposed sys- to provenance collection is unlikely to meet the needs of tem to offer secure provenance collection. all of these audiences. • Introduce Provenance-Based Data Loss Preven- The commonly accepted representation for data prove- tion (PB-DLP). We present a new paradigm for nance is a directed acyclic graph (DAG). In this work, we the prevention of data leakage that searches object use the W3C PROV-DM specification [74] because it is provenance to identify and prevent the spread of pervasive and facilitates the exchange of provenance be- sensitive data. PB-DLP is impervious to attempts tween deployments. An example PROV-DM graph of a to launder data through intermediary files and IPC. malicious binary is shown in Figure 1. This graph de- We implement PB-DLP as a file transfer applica- scribes an attack in which a binary running with root tion, and demonstrate its ability to query object an- privilege reads several sensitive system files, then ed- cestries in just tens of milliseconds. its those files in an attempt to gain persistent access to the host. Edges encode relationships between nodes, The rest of this paper is structured as follows. In Sec- pointing backwards into the history of system execution. tion 2, we present background on provenance, and ex- Writing to an object triggers the creation of a second ob- plain how it compares to past efforts in the area of in- ject node with an incremented version number. This par-

2 ticular provenance graph could serve as a valuable foren- In addition to storing measurements, the TPM pro- sics tool, allowing system administrators to better under- vides a reporting mechanism that can be used to prove stand the nature of a network intrusion. the integrity of the system (at least the parts that have been measured). This is done via the quote opera- 2.1 Linux Security Modules tion, which can also be called an attestation. The TPM quote is a digitally signed “proof” of the set of measure- The Linux Security Modules (LSM) framework provides ments recorded in the Platform Configuration Registers. a set of standardized authorization hooks for implement- A client wishing to validate the set of measurements can ing flexible in the Linux kernel connect to the system, and request a quote. As part of the [76]. LSM provides complete mediation of operations request, the client provides the list of PCRs he is inter- on key kernel data types by ensuring that an authoriza- ested in validating, and a nonce. The TPM takes these, tion step occurs prior to permitting these operations to reads the values in the PCRs, and signs the nonce and execute. LSM is designed on the principle of generality; PCR values. This is then sent to the client for validation. rather than offering a one-size-fits-all solution to access One use of this quote mechanism is to validate the control, one of a number of different security modules set of software binaries that are loaded on the system, provide the logic for the authorization hooks. The cor- as is done with the Linux Integrity Measurement Archi- rectness of the authorization hook’s placement through- tecture [65]. Another use is to validate the integrity of out the kernel has been verified through both static and critical boot files, and provide a trusted, or secure, boot dynamic analysis [25, 43, 81]. In this work, we argue of the system. Several TPM-aware bootloaders exist, in- that the principled design of Linux Security Modules also cluding TrustedGRUB 4, OSLO [46], and Intel tboot 5. serves as a strong foundation for the collection of whole- All three support a trusted boot mechanism, where the system provenance. system measures the boot-time files (the kernel and ini- tial RAM disk, and if present the hypervisor), and stores 2.2 Trusted Computing the measurements in the TPM for later verification. Se- cure boot, as supported by Intel tboot, takes this one step The secure deployment of our proposed system relies further, and stores a policy in the TPM’s NVRAM. When on the Trusted Platform Module (TPM) 1, a small cryp- the system boots, tboot will compare the set of measure- tographic processor2 available in many commodity sys- ments to the stored policy, and only if the measurements tems today. The TPM provides a hardware-based root of match, will the system boot. The execution the boot- trust for storing keys and measurements, and generating loader relies on the DRTM secure execution environment attestations, or evidence of a set of collected measure- to protect the validation of measurements. ments, representing the current state of the system. An additional feature of the TPM is the ability to seal The measurements are SHA1 hashes that are stored in data to specific measurements. What this means is that Platform Configuration Registers (PCRs). These PCRs the data is encrypted using a key that can only be ac- have a limited interface, only supporting the ability to cessed if the PCR values are set to specific values, i.e. add a new measurement, via the extend operation, the measurements match some set of known values. One or resetting them to their default value. When a new use for this is to seal an encryption key for a hard drive to measurement is added to the PCR, the current value of the PCR values that represent the kernel and initial RAM the PCR and the new measurement are concatenated, disk image for the approved OS. This ensures that only and this value hashed. The result of the hash is stored the approved OS can access the sealed data on the drive, in the PCR, forming a hash chain of measurements. providing a secure-boot mechanism for systems that are This ensures that an adversary cannot modify an already TPM-aware but lacks support for Intel’s tboot. recorded measurement. This only considers the “nor- mal” PCRs, and not the PCRs that support the dynamic 2.3 Data Loss Prevention root of trust for measurement, where the PCRs start with a value of -1, and are set to 0 only after executing a set Data Loss Prevention (DLP) is enterprise software that of CPU instructions that establish a secure execution en- seeks to minimize the loss and exfiltration of sensitive vironment 3. For more information on the utility of this data and intellectual property by monitoring and con- function, see [54] trolling information flows in large, complex organiza- 1 See http://www.trustedcomputinggroup.org/developers/trusted_ tions [5]. As encryption can be used to protect data at platform_module rest from unauthorized access, the true DLP challenge 2 The TPM is not designed as a cryptographic co-processor that involves preventing leakage at the hands of authorized accelerates cryptographic operations. Instead, it provides a basis for establishing trust in a system. 4 See http://sf.net/projects/trustedgrub 3 See https://www.kernel.org/doc/Documentation/intel_txt.txt 5 See http://sf.net/projects/tboot

3 users, both malicious and well-meaning agents. This lat- tion flows. Thus, DLP systems have historically offered ter group is a surprisingly big problem in the fight to con- marginal utility at great price. trol an organization’s intellectual property; a 2013 study conducted by the Ponemon Institute found that over half 3 Linux Provenance Modules of companies’ employees admitted to emailing intellec- tual property to their personal email accounts, with 41 To serve as the foundation for secure provenance-aware percent admitting to doing so on a weekly basis [8]. Em- systems, we present Linux Provenance Modules (LPM). ployees’ reasons for doing are at times completely inno- LPM is a generic framework that creates a provenance cent, such as wishing to be able to work from home on a collection layer in the kernel. We present a working personal device. definition for the provenance our system will collect in Enterprise DLP software is therefore comprised of a §3.1. In §3.2 we consider the capabilities and aims of variety of security mechanisms, including system moni- a provenance-aware adversary, and identify security and toring tools and access control mechanisms [1, 2, 3, 5, 7]. design goals in §3.3. The LPM design is presented in System monitoring involves discovering where sen- §3.4, and in §3.5 we demonstrate its secure deployment. sitive data is stored and monitoring its legitimate flow through a system. Access control involves preventing those flows from beginning illegitimate flows, as spec- 3.1 Defining Whole-System Provenance ified by the data loss policy, by allowing the data to flow In the design of LPM, we adopted a model for whole- past controlled boundaries. system provenance8 that is broad enough to accom- DLP systems are proprietary and are marketed so as modate the needs of a variety of existing provenance to abstract away the complex operational details of their projects. To arrive at a definition, we inspect four operation, so we cannot offer a complete explanation of past proposals that collected broadly scoped provenance: their core features. However, some of the mechanisms in SPADE [36], LineageFS [66], PASS [56], and Hi-Fi [61]. such systems are known. Many DLP products use a regu- SPADE provenance is structured around primitive oper- lar expression-based approach to identify sensitive data, ations of system activities with data inputs and outputs. operating similarly to a general-purpose version of Cor- It instruments file and process system calls, and asso- 6 7 nell’s Spider . For example, in PCI compliance , DLP ciates each call to a process ID (PID), user identifier, and might attempt to identify credit card numbers in out- network address. LineageFS uses a similar definition, bound emails by searching for 16 digit numbers that pass associating process IDs with the file descriptors that the a Mod-10 validation check [51]. Of course, other 16 digit process reads and writes. PASS associates a process’s numbers that are not actually credit cards, perhaps a pur- output with references to all input files and the command chase order number, will occasionally pass the validation line and process environment of the process; it also ap- step, inevitably leading to false positives. This approach pends out-of-band knowledge such as OS and hardware is also unable to differentiate between legitimate sensi- descriptions, and random number generator seeds, if pro- tive data and training data, the widespread distribution of vided. In each of these systems, networking and IPC which is common in enterprises with large development activity is primarily reflected in the provenance record teams. through manipulation of the underlying file descriptors. Other DLP systems use a label-based approach to Hi-Fi takes an even broader approach to provenance, identify sensitive data, tagging document metadata with treating non-persistent objects such as memory, IPC, and security labels. The Titus system accomplishes this by network packets as principal objects. having company employees manually annotate the doc- We observe that, in all instances, provenance-aware uments that they create [6]; plug-ins for applications systems are exclusively concerned with operations on (e.g., Microsoft Office and Outlook) then prevent the controlled data types, which are identified by Zhang et document from being transmitted to or opened by other al. as files, inodes, superblocks, socket buffers, IPC employees that lack the necessary clearance. Because messages, IPC message queue, semaphores, and shared both the classification and enforcement mechanisms are memory [81]. Because controlled data types represent a application-specific, as opposed to system wide, sensi- super set of the objects tracked by system layer prove- tive information can quickly become unlabeled by pass- nance mechanisms, we define whole-system provenance ing through one of many unmonitored channels. These as a complete description of agents (users, groups) con- existing DLP approaches are therefore difficult to con- trolling activities (processes) interacting with controlled figure and prone to failure, primarily because they can- data types during system execution. not provide complete observation of a system’s informa- 8This term is coined in [61], but its definition is implicit the correct- 6 See http://www2.cit.cornell.edu/security/tools ness of the system is not proven. We explicitly define the requirements 7 See https://www.pcisecuritystandards.org of a collection mechanism for whole system provenance in this work.

4 3.2 Threat Model & Assumptions guarantees laid out by Anderson [10]: complete media- tion, tamperproofness, and verifiability. We define these We consider an adversary that has gained remote access guarantees as follows: to a provenance-aware host or network. Once inside the system, the attacker may attempt to remove provenance G1 Complete. Complete mediation for provenance has records, insert spurious information into those records, been discussed elsewhere in the literature in terms or find gaps in the provenance monitor’s ability to record of assuring completeness [39]: that the provenance information flows. A network attacker may also attempt record be gapless in its description of system activ- to forge or strip provenance from data in transit. Be- ity. To facilitate this, LPM must be able to observe cause captured provenance can be put to use in other ap- all information flows that pass through controlled plications, the adversary’s goal may even be to target the data types. provenance monitor itself. The implications and meth- ods of such an attack are domain-specific. For example: G2 Tamperproof. As many provenance use cases in- volve enhancing system security, LPM will be an • Scientific Computing: An adversary may wish to ma- adversarial target. The TCB must therefore be im- nipulate provenance in order to commit fraud, or to in- pervious to disabling or manipulation by processes ject uncertainty into records to trigger a “Climategate”- in user space. like controversy [63]. G3 Verifiable. The functionality of LPM must be • Access Control: When used to mediate access deci- verifiably correct. Additionally, local and remote sions [12, 58, 59, 60], an attacker could tamper with users should be able to attest whether the host with provenance in order to gain unauthorized access, or to which they are communicating is running the se- perform a denial-of-service attack on other users by ar- cured provenance-aware kernel. tificially escalating the security level of data objects. Through surveying past work in provenance-aware • Networks: Provenance metadata can also be associ- systems, we identify the following additional goals to ated with packets in order to better understand network support whole-system provenance: events in distributed systems [11, 82, 83]. Coordi- nating multiple compromised hosts, an attacker may G4 Authenticated Channel. In distributed environ- attempt to send unauthenticated messages to avoid ments, provenance-aware systems must provide a provenance generation and to perform data exfiltration. means of assuring authenticity and integrity of provenance as it is communicated over open net- We define a provenance trusted computing base (TCB) works [12, 55, 61, 82]. While we do not seek to to be the kernel mechanisms, provenance recorder, and provide a complete distributed provenance solution storage back-ends responsible for the collection and in LPM, we do wish to provide the required build- management of provenance. Provenance-aware appli- ing blocks within the host for such a system to ex- cations are not considered part of the TCB. We make the ist. LPM must therefore be able to monitor ev- following assumption with regards to the TCB. In Linux, ery network message that is sent or received by the kernel modules have unrestricted access to kernel mem- host, and reliably explain these messages to other ory, meaning that there is no mechanism for protecting provenance-aware hosts in the network. LPM from the rest of the kernel. The kernel code is there- fore trusted; we assume that the stock kernel will not G5 Attested Disclosure. Layered provenance, where seek to tamper with the TCB. However, we do consider additional metadata is disclosed from higher opera- the possibility that the kernel could be compromised af- tional layers, is a desirable feature in provenance- ter installation through its interactions with user space aware systems, as applications are able to report applications. To facilitate host attestation in distributed workflow semantics that are invisible to the oper- environments, we also assume access to a Public Key In- ating system [57]. LPM must provide a gateway for frastructure (PKI) for provenance-aware hosts to publish upgrading low integrity user space disclosures be- their public signing keys. fore logging them in the high integrity provenance record. This is consistent with the Clark-Wilson In- 3.3 System Goals tegrity model for upgrading or discarding low in- tegrity inputs [21]. We set out to provide the following security assurances in the design of of our system-layer provenance collec- In order to bootstrap trust in our system, we have im- tion mechanism. McDaniel et al. liken the needs of a plemented LPM as a parallel framework to Linux Secu- secure provenance monitor [55] to the reference monitor rity Modules (LSM) [75, 76]. Building on these results,

5 user space Text Editor user space Prov. Aware GZip Applications open System Call kernel space

SQL Provenance Look Up Inode Recorder Neo4j Error Checks

SNAP DAC Checks Relay Buffer LSM Module kernel space Examine context. "Authorized?" Does request pass policy? LSM Hook IMA Prov. Module Yes or No Grant or deny. LPM Module Examine context. TPM Prov. Hooks NF Hooks "Prov collected?" Collect provenance. LPM Hook Yes or No If successful, grant. System Provenance Workflow Provenance Access Inode Integrity Measurements

Figure 2: LPM places a set of provenance hooks around Figure 3: Hook Architecture for the open system call. the kernel. A provenance module registers to control Provenance is collected after DAC and LSM checks, en- these hooks, and also registers Netfilter hooks. The mod- suring that it accurately reflects system activity. LPM ule transmits via a relay to a recorder in user space, will only deny the operation if it fails to generate prove- which uses one of several storage back-ends. The nance for the event. recorder is also responsible for evaluating the integrity of workflow provenance prior to storing it. Hooks Count Purpose BPRM 05 Observe program execution operations. Cred 10 Manage credentials. Dentry 03 Observe dentry operations. we show in Section 4 that this approach allows LPM to File 20 Observe file operations. Inode 30 Observe inode operations. inherit the formal assurances that have been verified for IPC 10 Observe System V IPC Message Queues. the LSM architecture. 02 Observe Netlink Message. SB 19 Observe superblock operations. SEM 05 Observe System V Semaphores. SHM 05 Observe System V Shared Memory Segments. 3.4 Design & Implementation Socket 35 Observe socket and network operations. Task 24 Observe task operations. An overview of the LPM architecture is shown in Fig- Unix 02 Observe Unix Domain Networking. ure 2. The LPM patch places a set of provenance hooks VM 03 Observe virtual memory use. Kernel 05 Observe other misc. kernel access. around the kernel; a provenance module then registers to control these hooks, and also registers several Netfil- Table 1: Summary of Provenance Hooks. LPM places a ter hooks; the module then observes system events and total of 178 unique hooks around the kernel. transmits information via a relay buffer to a provenance recorder in user space that interfaces with a datastore. The recorder also accepts disclosed provenance from ap- 3.4.1 Provenance Hooks plications after verifying their correctness using the In- tegrity Measurements Architecture (IMA) [65]. The LPM patch introduces a set of hook functions in the Linux kernel. These hooks behave similarly to the LSM In designing LPM, we first considered using an exper- framework’s security hooks in that they facilitate mod- imental patch to the LSM framework that allows “stack- ularity, and default to taking no action unless a module ing” of LSM modules 9. However, at this time, no stan- is enabled. Each provenance hook is placed directly be- dard exists for handling when modules make conflict- neath a corresponding security hook. The return value of ing decisions, creating the potential unpredicted behav- the security hook is checked prior to calling the prove- ior. We also felt that dedicated provenance hooks were nance hook, thus assuring that the requested activity has necessary; by collecting provenance after LSM autho- been authorized prior to provenance capture; we consider rization routines, we ensure that the provenance history the implications of this design in Section 4. A workflow is an accurate description of authorized system events. If for the hook architecture is depicted in Figure 3. The provenance collection occurred during authorization, as LPM patch places over 170 provenance hooks, one for would be the case with stacked LSMs, it would not be each of the LSM authorization hooks. In addition to the possible to provide this property. hooks that correspond to existing security hooks, we also 9See https://lwn.net/Articles/518345/ support Pohly et al.’s Hi-Fi [61] hook that is necessary to

6 1 int vfs_readdir(struct file *file, filldir_t filler, void *buf){ Web Browser user space 2 struct inode *inode = file->f_path.dentry->d_inode; 3 int res = -ENOTDIR; TCP Send Packet kernel space 4 if (!file->f_op || !file->f_op->readdir) 5 goto out; 6 IP Send Packet 7 res = security_file_permission(file, MAY_READ); 8 if (res) Update IP Checksum 9 goto out; 10 11 res = provenance_file_permission(file, Netfilter Hook Iterate file->f_provenance, MAY_READ); IPTables 12 if (res) Examine routing table. "Ok with you?" Does request pass policy? IPTables Hook 13 goto out; Yes or No Grant or deny. 14 15 res = mutex_lock_killable(&inode->i_mutex); LPM Module Sign IP HDR, Payload. 16 if (res) "Prov embedded?" Embed in IP Options. LPM Hook 17 goto out; Yes or No Update IP Checksum. 18 19 res = -ENOENT; Network Card 20 if (!IS_DEADDIR(inode)) { 21 res = file->f_op->readdir(file, buf, filler); 22 file_accessed(file); 23 } Figure 5: Packet transmission in LPM’s message com- 24 mutex_unlock(&inode->i_mutex); mitment protocol. Netfilter hooks are used to sign over 25 out: 26 return res; the IP packet and then embed the signature into the IP 27 } Options field. 28 29 EXPORT_SYMBOL(vfs_readdir);

Figure 4: Example provenance hook from the signatures, which are space-efficient enough to embed in vfs_readdir function in fs/readdir.c. LPM inserts the IP Options field. We have implemented full DSA Lines 11-13. support in the Linux kernel by creating signing rou- tines to use with the existing DSA verification func- tion. DSA signing and verification occurs in the NetFil- preserve Lamport timestamps on network messages [50]. ter inet_local_out and inet_local_in hooks. A complete list of provenance hooks is included in In inet_local_out, LPM signs over the immutable Table 1. An example hook placement is shown in Fig- fields of the IP header, as well as the IP payload (Figure ure 4. The vfs_readdir functions attempts to read a 5). In inet_local_in, LPM checks for the presence file’s directory, placing it in the buf pointer. LPM in- of a signature, then verifies the signature against a config- troduces Line 11-13 of the function. Immediately after urable list of public keys. If the signature fails, the packet the security_file_permission affirms that the is dropped before it reaches the recipient application, subject has permission to take this action (Lines 7-9), thus ensuring that there are no breaks in the continuity of provenance_file_permission is called so that the provenance log. The key store for provenance-aware LPM can record the event. The return value of the prove- hosts is obtained by a PKI and transmitted to the ker- nance hook is checked before the operation is permitted. nel during the boot process by writing to securityfs. If the hook returns an error code such as ENOMEM, indi- LPM registers the Netfilter hooks with the highest prior- cating that the attempt to allocate memory for the prove- ity levels, such that signing occurs just before transmis- nance record failed, vfs_readdir terminates without sion (i.e., after all other IPTables operations), and sig- permitting the operation, thus ensuring that provenance nature verification occurs just after the packet enters the is gapless (Goal G1). interface (i.e., before all other IPTables operations).

3.4.2 Netfilter Hooks 3.4.3 Provenance Modules LPM uses Netfilter hooks to implement a cryptographic Here, we introduce two of our own provenance modules message commitment protocol. In Hi-Fi, provenance- (Provmon, SPADE), as well as briefly mention the work aware hosts communicated by embedding a provenance of our peers (UPTEMPO): sequence number in the IP options field [62] of each out- bound packet [61]. This approach allowed Hi-Fi to com- • Provmon. Provmon is an extended port of the Hi-Fi municate as normal with hosts that were not provenance- security module [61]. The original Hi-Fi code base aware, but unfortunately was not secure against a net- was 1,566 lines of code, requiring 723 lines to be mod- work adversary. In LPM, provenance sequence numbers ified in the transition. Our extensions introduced 728 are replaced with Digital Signature Algorithm (DSA) additional lines of code. The process of porting did

7 PROV-DM Graph Representation LPM Hook Comment Activity Process credfork Create a new Object node on each fork bprm_check_provenance Append process execution info to this fork’s Object node. Agent task_fix_setuid Get/Create one global Agent node representing User (UID) User If no task_fix_setuid Get the Agent node from Parent credential d_instantiate Object (Entity) Inode:Version Create a new Object node for Inode at version zero. , or file_permission (mask=W,A) Create a new Object node for Inode at incremented version. socket_ send recv msg UNIX IP Address:Port { / } (proto= ) Get/Create an Object node for each socket. , etc. socket_{send/recv}msg (proto=INET) Get/Create an Object node for each remote (IP,Port) tuple. shm_shmat Get/Create an Object node for each shared memory address. WasGeneratedBy WasGeneratedBy file_permission (mask=W,A) Relation between an Activity and an Output Object Object Activity socket_sendmsg Used Used file_permission (mask=R) Relation between an Activity and an Input Object Activity Object file_mmap socket_recvmsg WasControlledBy WasControlledBy credfork Relation between an Activity and an Agent. Activity Agent

Table 2: A mapping between PROV-DM’s core concepts (types and relationships), the LPM kernel events that trigger their creation, and the resulting graph representation.

not affect the module’s functionality, although we have execution in a sterile environment, aggregating LPM subsequently extended the Hi-Fi protocol to capture provenance in a centralized data store. It then recov- additional lineage information: ers the implicit information flow policy through min- ing the provenance store to generate a MAC policy for – File Versioning. The original Hi-Fi protocol did not the distributed system, decreasing both administrator track version information for files, leading to un- effort and the potential for misconfiguration. certainty as to the exact contents of a file at the time it was read. Accurately recovering this in- 3.4.4 Provenance Recorders formation in user space was not possible due to race conditions between kernel events. Because LPM provides modular support for different storage versioning is necessary to break cycles in prove- through provenance recorders. To prevent an infinite nance graphs [56], we have added a version field to provenance loop, recorders are flagged as provenance- the provenance context for inodes, which is incre- opaque [61] using the security.provenance ex- mented each time the inode is written. tended attribute, which is checked by LPM before creat- – Network Context. Hi-Fi omitted remote host ad- ing a new event. Each recorder was designed to be as ag- dress information for network events, reasoning nostic to the active LPM as possible, making them easy that source information could be forged by a to adapt to new modules. We currently provide 4 prove- dishonest agent in the network. These human- nance recorders: interpretable data points were replaced with a ran- Gzip incurs low storage overheads and fast insertion domly generated identifiers that were associated speeds. On our test bed, we observed this recorder pro- with each packet. We found, however, that these cessing up to 400,000 events per second from the Prov- identifiers could not be interpreted without remote mon provenance stream. However, because the prove- address information, and incorporated the record- nance is not stored in an easily queried form, this back- ing of remote IP addresses and ports into Provmon. end is best suited for environments where queries are an offline process. • SPADE. The SPADE system is an increasingly pop- PostgreSQL provides limited support for online query- ular option for provenance auditing, but collecting ing, but only moderate insertion and query speeds due to provenance in user space limits SPADE’s expressive- the known difficulties of representing provenance in re- ness and creates the potential for incomplete prove- lational databases [42]. Under similar conditions as the nance. To address this limitation, we have cre- Gzip recorder, a representative insertion speed for this ated a mechanism that reports LPM provenance into recorder was 2,200 events per second. SPADE’s Domain-Specific Language pipe [36]. This Neo4j provides full support for the W3C PROV-DM permits the collection of whole-system provenance model [74]. Although Neo4j is a popular graph database, while simultaneously leveraging SPADE’s existing we were unable to find a configuration that was effi- storage, remote query, and visualization utilities. cient enough to keep up with Provmon’s provenance stream. Under similar conditions, a representative inser- • Using Provenance to Expedite MAC Policies (UP- tion speed for this recorder was just 28 events per second. TEMPO). Using LPM as a collection mechanism, : To create graph storage that was efficient enough Moyer et al. investigate provenance analysis as a for LPM, we used the SNAP graphing library 10 to design means of administrating Mandatory Access Control (MAC) policies [68]. UPTEMPO first observes system 10See http://snap.stanford.edu

8 librt-2.12.so libc-2.12.so libpthread-2.12.so 500 a 1 [20a9] bprm_check_provenance 54558 "sort a > b" 2 [20a9] task_fix_setuid uid 500 gid 500 3 [20a9] file_permission R 7374:librt-2.12.so Used Used Used WasControlledBy Used 4 [20a9] file_permission R 33675:libc-2.12.so 5 [20a9] file_permission R 33374:libpthread-2.12.so sort a > b 6 [20a9] file_permission R 16520304:a 7 [20a9] file_permission W 16520302:b WasGeneratedBy

b (a) LPM Provenance Stream (b) PROV-DM Graph Representation

Figure 6: The LPM provenance stream of events can be transformed into standardized graph models. Here, a sort command resulting in kernel output 6a can be mapped to PROV-DM graph 6b. This is necessary to facilitate efficient querying, and also to permit the exchange of provenance between deployments. a recorder that maintains an in-memory graph database b.png a.png that is fully compliant with the W3C PROV-DM Model Used Used [74]. We have observed insertion speeds of over 150,000 events per second using the SNAP recorder, and highly WasDerivedFrom mogrify -format jpg *.png WasDerivedFrom efficient querying as well. This recorder is further evalu- WasGeneratedBy WasGeneratedBy ated in Section 6. b.jpg a.jpg 3.4.5 Graph Representation Figure 7: A provenance graph of image conversion. A full summary of how LPM creates a PROV-DM- Here, workflow provenance (WasDerivedFrom) encodes compliant graph is shown in Table 2. The whole-system a relationship that more accurately identifies the output provenance graph is made up of subgraphs where Ac- files’ dependencies compared to only using kernel layer tivities, Agents, and Object nodes interact. The finest observations (Used, WasGeneratedBy). granularity of these subgraphs is a system fork. Each new system fork is immediately associated with an Ac- tivity, Agent, and WasControlledBy relation, as described complish this, we extend the LPM Provenance Recorder above. The input (Used) and output (WasGeneratedBy) to use the Linux Integrity Measurement Architecture objects of an activity link together subgraphs. Using (IMA) [44, 65]. IMA computes a cryptographic hash of these relations, PROV-DM’s WasDerivedFrom, WasIn- each binary before execution, extends the measurement formedBy, WasAttributedTo, and WasAssociatedWith re- into a TPM Platform Control Register (PCR), and stores lations are implicitly represented [74]. ActedOnBehalfOf the measurement in kernel memory. This set of measure- can also be used to encode when the EUID is set. ments can be used by the Recorder to make a decision LPM generates provenance as a stream of kernel about the integrity of the a Provenance-Aware Applica- events which need to be processed into a graph repre- tion (PAA) prior to accepting the disclosed provenance. sentation before they can be efficiently queried. Figure When a PAA wishes to disclose provenance, it opens a 6 shows an example of this transformation for the com- new UNIX domain socket to send the provenance data mand sort a > b. Line 1 of the LPM stream creates to the Provenance Recorder. The Recorder uses its own the sort activity node, and creates a WasControlledBy UNIX domain socket to recover the process’s pid, then relation between the new activity node and the parent uses the /proc filesystem to find the full path of the bi- fork’s agent node; line 2 updates the WasControlledBy nary, then uses this information to look up the PAA in relation so that it directs to the 500 agent node; lines 3-6 the IMA measurement list. The disclosed provenance is get or create object nodes with Used relations; and line 7 recorded only if the signature of PAA matches a known- creates the b object and WasGeneratedBy relation. good cryptographic hash. As a demonstration of this functionality, we created 3.4.6 Workflow Provenance a provenance-aware version of the popular ImageMag- ick utility 11. ImageMagick contains a batch conversion To support layered provenance while preserving our se- tool for image reformatting, mogrify. Shown in Fig- curity goals, we require a means of evaluating the in- tegrity of user space provenance disclosures. To ac- 11See http://www.imagemagick.org

9 ure 7, mogrify reads and writes multiple files during 1 policy_module(uprovd, 1.0.0) execution, leading to an overtainting problem – at the 2 3 ######################################## kernel layer, LPM is forced to conservatively assume 4 # that all outputs were derived from all inputs, creating 5 # Declarations 6 # false dependencies in the provenance record. To address 7 this, we extended the Provmon protocol to support a 8 type uprovd_t; 9 type uprovd_exec_t; new message, provmsg_imagemagick_convert, 10 init_daemon_domain(uprovd_t, uprovd_exec_t) which links an input file directly to its output file. When 11 12 permissive uprovd_t; the recorder receives this message, it first checks the list 13 of IMA measurements to confirm that ImageMagick is 14 type uprovd_log_t; 15 logging_log_file(uprovd_log_t) in a good state. If successful, it then annotates the exist- 16 ing provenance graph, connecting the appropriate input 17 type uprovd_rw_t; 18 files_type(uprovd_rw_t) and output objects with WasDerivedFrom relationships. 19 Our instrumentation of ImageMagick demonstrates that 20 ######################################## 21 # LPM supports layered provenance at no additional cost 22 # uprovd local policy over other provenance-aware systems [36, 56], and does 23 # 24 allow uprovd_t self:process { signal }; so in a manner that provides assurance of the integrity of 25 the provenance log. 26 allow uprovd_t self:fifo_file rw_fifo_file_perms; 27 allow uprovd_t self:unix_stream_socket create_stream_socket_perms; 28 3.5 Deployment 29 manage_dirs_pattern(uprovd_t, uprovd_log_t, uprovd_log_t) We now demonstrate how we used LPM in the deploy- 30 manage_files_pattern(uprovd_t, uprovd_log_t, uprovd_log_t) ment of a secure provenance-aware system. 31 logging_log_filetrans(uprovd_t, uprovd_log_t, { dir file }) 32 3.5.1 Platform Integrity 33 manage_dirs_pattern(uprovd_t, uprovd_rw_t, uprovd_rw_t) 34 manage_files_pattern(uprovd_t, uprovd_rw_t, We configured LPM to run on a physical machine with uprovd_rw_t) a Trusted Platform Module (TPM). The TPM provides a 35 36 domain_use_interactive_fds(uprovd_t) root of trust that allows for a measured boot of the sys- 37 tem. The TPM also provides the basis for remote attes- 38 files_read_etc_files(uprovd_t) 39 tations to prove that LPM was in a known hardware and 40 miscfiles_read_localization(uprovd_t) software configuration. The BIOS’s core root of trust for measurement (CRTM) bootstraps a series of code mea- Figure 8: SELinux Policy Module for Provmon: Type surements prior to the execution of each platform com- Enforcement for Provmon’s user space daemon. ponent. Once booted, the kernel then measures the code for user space components (e.g., provenance recorder) before launching them, through the use of the Linux In- TPM of the host system. Additionally, we compiled sup- tegrity Measurement Architecture (IMA)[65]. The result port for IMA into the provenance-aware kernel, which is is then extended into TPM PCRs, which forms a verifi- necessary in order for the LPM Recorder to be able to able chain of trust that shows the integrity of the system measure the integrity of provenance-aware applications. via a digital signature over the measurements. A remote verifier can use this chain to determine the current state of the system using TPM attestation. 3.5.2 Runtime Integrity We configured the system with Intel’s Trusted Boot 5 After booting into the provenance-aware kernel, the run- (tboot), which provides a secure boot mechanism, pre- time integrity of the TCB (defined in §3.2) must also be venting system from booting into the environment where assured. To protect the runtime integrity of the kernel, critical components (e.g., the BIOS, boot loader and we deploy a Mandatory Access Control (MAC) policy, the kernel) are modified. Intel tboot relies on the Intel as implemented by Linux Security Modules. On our pro- TXT instructions that provide a secure execution envi- 3 totype deployments, we enabled SELinux’s MLS policy, ronment. For virtual environments, similar function- the security of which was formally modeled by Hicks et ality can be provided on Xen via TPM sealing and the 12 al. [41]. Refining the SELinux policy to prevent Ac- virtual TPM (vTPM) , which is bound to the physical cess Vector Cache (AVC) denials on LPM components 12See http://wiki.xenproject.org/wiki/Virtual_Trusted_Platform_ required minimal effort; the only denial we encountered Module_(vTPM) was when using the PostgreSQL recorder, which was

10 1 Tamperproof (G2). The runtime integrity of the LPM 2 /sys/kernel/debug/provenance0 -- trusted computing base is assured via the SELinux MLS gen_context(system_u:object_r:uprovd_rw_t,s0) 3 policy, and we have written a policy module that protects 4 the LPM user space components (§3.5.2). Therefore, the 5 /usr/bin/uprovd -- gen_context(system_u:object_r:uprovd_exec_t,s0) only way to disable LPM would be to reboot the sys- 6 tem into a different kernel; this action can be disallowed 7 /var/log(/.*)? 5 gen_context(system_u:object_r:uprovd_log_t,s0) through secure boot techniques, and is detectable by re- mote hosts via TPM attestation (§3.5.1). Figure 9: SELinux Policy Module for Provmon: File Verifiable (G3). While we have not conducted an in- Contexts for Provmon’s user space daemon. dependent formal verification of LPM, our argument for its correctness is as follows. A provenance hook follows each LSM authorization hook in the kernel. The correct- quickly remedied with the audit2allow tool. Pre- ness of LSM hook placement has been verified through serving the integrity of LPM’s user space components, both static and dynamic analysis techniques [25, 32, 43]. such as the provenance recorder, was as simple as creat- Because an authorization hook exists on the path of ev- ing a new policy module. We created a policy module ery sensitive operation to controlled data types, and LPM to protect the LPM recorder and storage back-end using introduces a provenance hook behind each authorization the sepolicy utility. Uncompiled, the policy module hook, LPM inherits LSM’s formal assurance of com- was only 135 lines. Excerpts from the policy are shown plete mediation over controlled data types. That is, LPM in Figures 8 and 9. can observe every sensitive operation on controlled data types in the kernel, satisfying our definition of whole- system provenance. 4 Security In the same way that LSM and SELinux have been in- dependently verified [41, 45, 72], the logic of individual In this section, we demonstrate that our system meets provenance modules must also be modeled and verified all of the required security goals for trustworthy whole- before a formal guarantee about the collected provenance system provenance. In this analysis, we consider an LPM can be made. A formal model for the Provmon system deployment on a physical machine that was enabled with is not provided in this work. However, because Prov- the Provmon module and has been configured to the con- mon is significantly less complex than the SELinux se- ditions described in Section 3.5. curity module, we believe that formal modeling is possi- Complete (G1). We defined whole-system provenance ble. Our Provmon implementation is 2,294 lines of code, as a complete description of agents (users, groups) con- while SELinux is 17,581 lines of code. trolling activities (processes) interacting with controlled Authenticated Channel (G4). Through use of Net- data types during system execution (§ 3.1). LPM at- filter hooks [71], LPM embeds a DSA signature in ev- tempts to track these system objects through the place- ery outbound network packet. Signing occurs immedi- ment of provenance hooks (§3.4.1), which directly fol- ately prior to transmission, and verification occurs im- low each LSM authorization hook. The LSM’s complete mediately after reception, making it impossible for an mediation property has been formally verified [25, 81]; adversary-controlled application running in user space in other words, there is an authorization hook prior to to interfere. For both transmission and reception, the every security-sensitive operation. Because every inter- signature is invisible to user space. Signatures are action with a controlled data type is considered security- removed from the packets before delivery, and LPM sensitive, we know that a provenance hook resides on feigns ignorance that the options field has been set if all control paths to the provenance-sensitive operations. get_options is called. Hence, LPM can enforce that LPM is therefore capable of collecting complete prove- all applications participate in the commitment protocol. nance on the host. Prior to implementing our own message commitment It is important to note that, as a consequence of plac- protocol in the kernel, we investigated a variety of ex- ing provenance hooks beneath authorization hooks, LPM isting secure protocols. The integrity and authenticity of is unable to record failed access attempts. However, in- provenance identifiers could also be protected via IPsec serting the provenance layer beneath the security layer [47], SSL tunneling [4], or other forms of encapsulation ensures accuracy of the provenance record. Moreover, [11, 82]. We elected to move forward with our approach failed authorizations are a different kind of metadata than because 1) it ensures the monitoring of all all processes provenance because they do not describe processed data; and network events, including non-IP packets, 2) it does this information is better handled at the security layer, not change the number of packets sent or received, en- e.g., by the SELinux Access Vector Cache (AVC) Log. suring that our provenance mechanism is minimally in-

11 vasive to the rest of the Linux network stack, and 3) Algorithm 1 Summarizes a’s propagation through the system. it preserves compatibility with non-LPM hosts. An al- Require: a is an entity ternative to DSA signing would be HMAC [13], which 1: procedure REPORT(a) 2: Locations = [ ] . Assigns an empty list. offers better performance but requires pairwise keying 3: for each s in a,FindSuccessors(a) do and sacrifices the non-repudiation policy; BLS, which 4: if s.type is File then 5: Locations.Add(< s.disk,s.directory >) approaches the theoretical maximum security parame- 6: else if s.type is Network Packet then ter per byte of signature [16]; or online/offline signature 7: Locations.Add(< s.remote_ip,s.port >) 8: end if schemes [19, 30, 33, 69]. 9: end for Authenticated Disclosures (G5). We make use 10: return Locations 11: end procedure of IMA to protect the channel between LPM and provenance-aware applications wishing to disclose provenance. IMA is able to prove to the provenance which pieces of data are sensitive, where that data has recorder that the application was unmodified at the time propagated to within the organization, and where it is it was loaded into memory, at which point the recorder (and is not) permitted to flow. can accept the provenance disclosure into the official A provenance-based approach is a novel and effective record. If the application is known to be correct (e.g., means of handling data loss prevention; to our knowl- through formal verification), this is sufficient to estab- edge, we are the first in the literature to do so. The ad- lish the runtime integrity of the application. However, if vantage of our approach when compared to existing sys- the application is compromised after execution, this ap- tems is that LPM-based provenance-aware systems al- proach is unable to protect against provenance forgery. ready perform system-wide capture of information flows A separate consideration for all of the above security between kernel objects. Data loss prevention in such properties are Denial of Service (DoS) attacks. DoS at- a system therefore becomes a matter of preventing all tacks on LPM do not break its security properties. If an derivations of a sensitive source entity, e.g., a Payment attacker launches a resource exhaustion attack in order Card Industry (PCI) database, from being written to a to prevent provenance from being collected, all kernel monitored destination entity (e.g., a network interface). operations will be disallowed and the host will cease to We begin by defining a policy format for PB-DLP. In- function. If a network attacker drops packets, or strips dividual rules take the form them of their provenance identifiers, the packet will not be delivered to the recipient application. Therefore, the < Srcs = [src1,src2,...,srcn],dst > provenance record remains an accurate reflection of sys- where Srcs is a list of entities representing persistent tem events, regardless of DoS attacks. data objects, and dst is a single entity representing either a persistent data object such as a file or interface or an ab- 5 LPM Application: Provenance-Based stract entity such as a remote host. The goal for PB-DLP Data Loss Prevention is as follows – an entity e1 with ancestors A is written to entity e2 if and only if A 6⊇ Srcs for all rules in the To further demonstrate the power of LPM, we now in- ruleset where e2 = dst. The reason that sources are ex- troduce a set of mechanisms for Provenance-Based Data pressed as sets is that, at times, the union of information Loss Prevention (PB-DLP) that offer dramatically sim- is more sensitive than its individual components. For ex- plified administration and improved enforcement over ample, sharing a person’s last name or birthdate may be existing DLP systems. Data Loss Prevention (DLP), permissible, while sharing the last name and birthdate is 13 which is also called Data Leakage Protection, is enter- restricted as PII. prise software that seeks to minimize the loss and ex- Below, we define the functions that realize this goal. filtration of sensitive data by monitoring and controlling First, we define two provenance-based functions as the information flows in large, complex organizations [5]. In basis for a DLP monitoring phase, which allows admin- addition to the desire to control intellectual property, an- istrators to learn more about the propagation of sensitive other motivator for DLP systems is demonstrating reg- data on their systems. Then, we define mechanisms for a ulatory compliance for personally-identifiable informa- DLP enforcement phase. tion (PII),13 as well as directives such as PCI,7 HIPAA,14 SOX.15 or E.U. Data Protection.16 It is therefore impor- 5.1 Monitoring Phase tant for a DLP system to be able to exhaustively explain The goal of monitoring is to allow administrators to rea- 13 See NIST SP 800-122 14 See http://www.hhs.gov/ocr/privacy son about how sensitive data is stored and put to use on 15 Short for the Sarbanes-Oxley Act, U.S. Public Law No. 107-20 their systems. The end product of the monitor phase is a 16 See EU Directive 95/46/EC set of rules (a policy) that restrict the permissible flows

12 Algorithm 2 Mediates request to write e to d given Rules. FindAncestors can be then used as the basis for a func- Require: e,d are entities tion that prevents the spread of sensitive data: Require: Rules is a PB-DLP policy 1: procedure PROVWRITE(e,d,Rules) 2: for each rule in Rules do 4. ProvWrite(Entity, Destination, Rules): Write the tar- 3: if d = rule.dst then 4: A = FindAncestors(e) get entity to the destination if and only if it is valid to 5: NumSrcs = length(rule.Srcs) the provided rule set, as defined in Algorithm 2. 6: for each src in rule.Srcs do 7: if src in A then 8: NumSrcs − − 9: end if 5.3 File Transfer Application 10: end for 11: if NumSrcs = 0 then . A ⊇ Srcs, deny. 12: return PB-DLP_DENY In many enterprise networks that are isolated from the 13: end if Internet via firewalls and proxies, it is desirable to share 14: end if 15: end for files with external users. File transfer services are one 16: return PB-DLP_PERMIT . A 6⊇ Srcs, permit. way to achieve this, and provide a single entry/exit point 17: end procedure to the enterprise network where files being transferred can be examined before being released.17 In the case for sensitive data sources. Monitoring is an ongoing pro- of incoming files, scans can check for known malware, cess in DLP, where administrators attempt to iteratively and in some cases, check for other types of malicious improve protection against data leakage. The first step is behavior from unknown malware. to identify the data that needs protection. Identifying the We implemented PB-DLP as a file transfer applica- source of such information is often quite simple; for ex- tion for provenance-aware systems using LPM’s Prov- ample, a database of PCI or PII data. However, reliably mon module. The application interfaced with LPM’s finding data objects that were derived from this source is SNAP recorder using a custom API. Before permitting extraordinarily complicated using existing solutions, but a file to be transmitted to a remote host, the application is simple now with LPM. To begin, we define a helper ran a query that traversed WasDerivedFrom edges to re- function for system monitoring: turn a list of the file’s ancestors, permitting the transfer only if the file was not derived from a restricted source. 1. FindSuccessors(Entity): This function performs a PB-DLP allows internal users to share data, while ensur- provenance graph traversal to obtain the list of data ing that sensitive data is not exfiltrated in the process. objects derived from Entity. Because provenance graphs continue to grow indefi- nitely over time, in practice the bottleneck of this appli- FindSuccessors can then be used as the basis for a cation is the speed of provenance querying. We evaluate function that summarizes the spread of sensitive data: the performance of PB-DLP queries in Section 6.3. 2. Report(Entity): List the locations that a target object and its successors have propagated. This function is defined in Algorithm 1. 5.4 PB-DLP Analysis

The information provided by Report is similar to the Below, we select two open source systems that approx- data found in the Symantec DLP Dashboard [5], and imate label based and regular expression (regex) based could be used as the backbone of a PB-DLP user inter- DLP solutions, and compare their benefits to PB-DLP. face. Administrators can use this information to write a PB-DLP policy or revise an existing one. 5.4.1 Label-Based DLP

5.2 Enforcement Phase The SELinux MLS policy [38] provides information flow security through a label-based approach, and could be Possessing a PB-DLP policy, the goal of the enforcement used to approximate a DLP solution without relying on phase is to prevent entities that were derived from sensi- commercial products. Proprietary label-based DLP sys- tive sources from being written to restricted locations. To tems rely on manual annotations provided by users, re- do so, we need to inspect the object’s provenance to dis- quiring them to provide correct labeling based on their cover the entities from which it was derived. We define knowledge of data content. Using SELinux as an exem- the following helper function: plar labeling system is therefore an extremely conserva- tive approach to analysis. 3. FindAncestors(Entity): This function performs a provenance graph traversal to obtain the list of data 17 Two examples of vendors that provide this capability are FireEye objects used in the creation of Entity. (http://www.fireeye.com) and Accellion (http://www.accellion.com/)

13 1 Training_Data:0

Birth_Dates:0 Used 3 4 WasGeneratedBy Used WasGeneratedBy 2 Used join Birth_Dates SSNs > PII_Data PII_Data:0 gzip PII_Data PII_Data.gz:0 SSNs:0

Figure 10: A provenance graph of PII data objects that are first fused and then transformed. The numbers mark DLP decision conditions. Objects marked by green circles should not be restricted, while red octagons should be restricted. Label-Based DLP correctly handles data resembling PII (1,2) and data transformations (4), but struggles with data fusions (3). Regex-Based DLP correctly identifies data fusions (3), but is prone to incorrect handling of data resembling PII (1) and fails to identify data transformations (4). PB-DLP correctly handles all conditions.

ment to ensure that each employee receives a paycheck. T Separately, the user may need to send a list of birthdays to another user in the department to coordinate birthday celebrations for each month. Either of these activities are A, {α,β} acceptable (Figure 10, Decision Condition 2). However, A, {α} A, {β} it is common practice for organizations to have stricter sharing policies for data that contains multiple forms of B, {α,β} PII, so while either of these identifiers could be transmit- ted in isolation, the two pieces of information combined B, {α} B, {β} could not be shared (Figure 10, Decision Condition 3).

T The MLS policy cannot easily handle this type of data fusion. In order to provide support for correctly label- ing fused data, an administrator would need to define the Figure 11: Example MLS lattice with two classification power set of all compartments within the MLS policy. levels and two compartments. The lattice is a pictorial In the example above, the administrator would define the representation of the access control policy that is en- following compartments: {}, {α}, {β}, {α,β}. In the forced. default configuration SELinux supports 256 unique cate- gories, meaning an SELinux DLP policy could only sup- port eight types of data. Furthermore, the MLS policy Within an MLS system, each subject, and object, is does not support defining multiple categories within a assigned a classification level, and categories, or com- single sensitivity level18. This implies that the MLS pol- partments. Consider an example system, with classi- icy cannot support having a security level for A,{α} and fication levels, {A,B} with A dominating B, and com- for A,{α,β}. Instead, the most restrictive labeling must partments {α,β}. We can model our policy as a lat- be defined to protect the data on the system. In contrast, tice, where each node in the lattice is a tuple of the PB-DLP can support an arbitrary number of data fusions. form {< level >,{compartments}}. Once the policy is defined, it is possible to enforce the simple and *- 5.4.2 Regex-Based DLP properties. If a user has access to data with classification level A, and compartment α, he cannot read anything in The majority of DLP software relies on pattern matching compartment {β} (no read-up). Furthermore, when data techniques to identify sensitive data. While enterprise so- is accessed in A,{α}, the user cannot write anything to lutions offer greater sophistication and customizability, B,{α} (no write-down). Figure 11 shows an example their fundamental approach resembles that of Cornell’s policy in lattice form. Spider 6, a forensics tools for identifying sensitive per- sonal data (e.g., credit card or social security numbers). In order to use SELinux’s MLS enforcement as a DLP Because it is open source, we make use of Spider as an solution, the administrator configures the policy to en- exemplar application for regex-based DLP. force the constraint that no data of specified types can Regex approaches are prone to false positives. be sent over the network. However, this is difficult in Spider is pre-configured with a set of regular practice. Consider an example system that processes PII. The users of the system may need to access information, 18See the definition of level statements at http:// such as last names, and send these to the payroll depart- selinuxproject.org/page/MLSStatements

14 Test Type Vanilla LPM Provmon Test Vanilla Provmon Overhead Process tests, times in µseconds (smaller is better) Kernel Compile 598 sec 614 sec 2.7% null call 0.14 0.14 (0%) 0.14 (0%) Postmark 25 sec 27 sec 7.5% null I/O 0.21 0.21 (0%) 0.32 (52%) Blast 376 sec 390 sec 4.8% stat 1.57 1.6 (2%) 2.8 (78%) open/close file 2.75 2.42 (-12%) 3.91 (42%) signal install 0.25 0.25 (0%) 0.25 (0%) Table 4: Benchmarking Results. Our provenance module signal handle 1.37 1.29 (-6%) 1.39 (1%) imposed just 2.7% overhead on kernel compilation. fork process 380 396 (4%) 401 (6%) exec process 873 879 (1%) 911 (4%) shell process 2990 3000 (0%) 3113 (4%) File and memory latencies in µseconds (smaller is better) file create (0k) 11.5 11.2 (-3%) 15.8 (37%) 6.1 Collection Performance file delete (0k) 8.51 8.12 (-5%) 11.8 (39%) file create (10k) 23.4 21.6 (-8%) 28.8 (23%) file delete (10k) 12.5 12 (-4%) 14.7 (18%) We used LMBench to microbenchmark LPM’s impact mmap latency 1062 1053 (-1%) 1120 (5%) protect fault 0.32 0.3 (-6%) 0.346 (8%) on system calls as well as file and memory latencies. page fault 0.016 0.016 (0%) 0.016 (0%) Table 3 shows the averaged results over 10 trials for 100 fd select 1.53 1.53 (0%) 1.53 (0%) each kernel, with a percent overhead calculation against Vanilla. For most measures, the performance differ- Table 3: LMBench measurements for LPM kernels. All ences between LPM and Vanilla are negligible. Com- times are in microseconds. Percent overhead for modi- paring Vanilla to Provmon, there are several measures fied configurations are shown in parenthesis. in which overhead is noteworthy: stat, open/close, file creation and deletion. Each of these benchmarks in- volve LMBench manipulating a temporary file that re- expressions for identifying potential PII, e.g., sides in /usr/tmp/lat_fs/. Because an absolute (\d{3}-\d{2}-\d{4}) identifies a social se- path is provided, before each system call occurs LM- curity number. However, it is common practice for Bench first traverses the path to the file, resulting in developers to generate and distribute training datasets to the creation of 3 different provenance events in Prov- aid in software testing (Figure 10, Decision Condition mon’s inode_permission hook, each of which is 1). Spider is oblivious to information flows, instead transmitted to user space via the kernel relay. While searching for content that bears structural similarity the overheads seem large for these operations, the log- to PII, and therefore would be unable to distinguish ging of these three events only imposes approximately between true PII and training data. PB-DLP tracks the 1.5 microseconds per traversal. Moreover, the over- propagation of data from its source onwards, and could head for opening and closing is significantly higher than trivially differentiate between true PII and training sets. the overhead than reads and writes (Null I/O); thus, the Regex approaches are also prone to false negatives. higher open/close costs are likely to be amortized over Even after the most trivial data transformations, PII and the course of regular system use. A provenance module PCI data is no longer identifiable to the Spider system could further reduce this overhead by maintaining state (Figure 10, Decision Condition 4), permitting its exfil- about past events within the kernel, then blocking the tration. To demonstrate, we generated a file full of ran- creation of redundant provenance records. dom valid social security numbers that Spider was able to To gain a more practical sense of the costs of LPM, we identify. We then ran gzip on the file and stored it in a also performed multiple benchmark tests that represented second file. Spider was unable to identify the second file, realistic system workloads. Each trial was repeated 5 but PB-DLP correctly identified both files as PII since the times to ensure consistency. The results are summarized gzip output was derived from a sensitive input. in Table 4. For the kernel compile test, we recompiled the kernel source (in a fixed configuration) while booted into each of the kernels. Each compilation occurred on 16 threads. The LPM scaffolding (without an enabled 6 Evaluation module) is not included, because in both tests it imposed less than 1% overhead. In spite of seemingly high over- We now evaluate the performance of LPM. Our bench- heads for file I/O, Provmon imposes just 2.7% overhead marks were run on a bare metal server machine with 12 on kernel compilation, or 16 seconds. The Postmark test GB memory and 2 Intel Xeon quad core CPUs. The Red simulates the operation of an email server. It was con- Hat 2.6.32 kernel was compiled and installed under 3 dif- figured to run 15,000 transactions with file sizes rang- ferent configurations: all provenance disabled (Vanilla), ing from 4 KB to 1 MB in 10 subdirectories, with up to LPM scaffolding installed but without an enabled mod- 1,500 simultaneous transactions. The Provmon module ule (LPM), and LPM installed with the Provmon module imposed just 7.5% overhead on this task. To estimate enabled (Provmon). LPM’s overhead for scientific applications, we ran the

15 5 1 5000 Raw Prov. Stream Vanilla 4 GZip Recorder 0.99 4000 LPM SNAP Recorder Provmon 3 0.98 Batch Sig 3000 2 0.97 2000 1 0.96 Cumulative Density

Storage Cost (GBytes) 1000 0 0.95 Throughput (Mbps) 0 2 4 6 8 10 0 5 10 15 20 25 0 Time (Minutes) Response Time (Milliseconds) Iperf Performance

Figure 12: Growth of provenance Figure 13: Performance of ances- Figure 14: LPM network overhead storage overheads during kernel try queries for objects created dur- can be reduced with batch signature compilation. ing kernel compilation. schemes.

BLAST benchmarks 19, which simulates typical biolog- 6.3 Query Performance (PB-DLP) ical sequence workloads obtained from analysis of hun- dreds of thousands of jobs from the National Institutes of We evaluated query performance using our exemplar PB- Health. DLP application and LPM’s SNAP recorder. The prove- nance graph that was populated using the routine from For kernel compile and postmark, Provmon outper- the kernel compile benchmark. This yielded a raw prove- forms the PASS system, which exacted 15.6% and 11.5% nance stream of 3.7 GB, which was translated by the overheads on kernel compilation and postmark, respec- recorder into a graph of 6,513,398 nodes and 6,754,059 tively [56]. Provmon introduces comparable kernel com- edges. We were then able to use the graph to issue an- pilation overhead to Hi-Fi (2.8%) [61]. It is difficult to cestry queries, in which the subgraphs were traversed to compare our Blast results to SPADE and PASS, as both find the data object ancestors of a given file. Because used a custom workload instead of a publicly available we did not want ephemeral objects with limited ances- benchmark. SPADE reports an 11.5% overhead on a tries to skew our results, we only considered the results large workload [36], while PASS reports just an 0.7% of objects with more than 50 ancestors. overhead. Taken as a whole, though, LPM collection ei- In the worst case, which was a node that had 17,696 ther meets or exceeds the performance of past systems, ancestors, the query returned in just 21 milliseconds. while providing additional security assurances. Effectively, we were able to query object ancestries at line speed for network activity. We are confident that this approach can scale to large databases through pre- 6.2 Storage Overhead computation of expensive operations at data ingest, mak- A major challenge to automated provenance collection ing it a promising strategy for provenance-aware dis- is the storage overhead incurred. We plotted the growth tributed systems; however, we note that these results are of provenance storage using different recorders during highly dependent on the size of the graph. Our test graph, the kernel compilation benchmark, shown in Figure 12. while large, would inevitably be dwarfed by the size of LPM generated 3.7 GB of raw provenance. This re- the provenance on long-lived systems. Fortunately, there quired only 450 MB of storage with the Gzip recorder, are also a variety of techniques for reducing these costs. but provenance cannot be efficiently queried in this for- Bates et al. show that the results from provenance graph mat. The SNAP recorder builds an in-memory prove- traversals can be extensively cached when using a fixed nance graph. We approximated the storage overhead security policy, which would allow querying to amor- through polling the virtual memory consumed by the tize to a small constant cost [12]. LPM could also be recorder process in the /proc filesystem. The SNAP extended to support provenance deduplication [78, 77] graph required 1.6 GB storage; the reduction from the and policy-based provenance pruning [17], both of which raw provenance stream is due to the fact that redundant would further improve performance by reducing the size events did not lead to the creation of new graph compo- of provenance graphs. nents. In contrast, the PASS system generates 1.3 GB of storage overhead during kernel compilation, but PASS 6.4 Message Commitment Protocol does not monitor of shared memory and memory map- ping activity. LPM’s storage overheads are thus compa- Under each kernel configuration, we performed iperf rable to other provenance-aware systems. benchmarking to discover LPM’s impact on TCP throughput. iperf was launched in both client and server 19See http://fiehnlab.ucdavis.edu/staff/kind/Collector/Benchmark/ mode over localhost. The client was launched Blast_Benchmark with 16 threads, two per each CPU. Our results can be

16 found in Figure 14. The Vanilla kernel reached 4490 most any other security solution [22]. Mbps. While the LPM framework imposed negligible LPM does not address the matter of provenance confi- cost compared to the vanilla kernel (4480 Mbps), Prov- dentiality; this is an important challenge that is explored mon’s DSA-based message commitment protocol re- elsewhere in the literature [18, 59]. LPM’s Recorders duced throughput by an order of magnitude (482 Mbps). provide interfaces that can be used to introduce an ac- Through use of printk instrumentation, we found that cess control layer onto the provenance store. the average overhead per packet signature was 1.2 ms. Although we have not presented a secure distributed This result is not surprising when compared to IPsec provenance-aware system in this work, LPM provides performance. IPsec’s Authentication Header (AH) mode the foundation for the creation of such a system. In the uses an HMAC-based approach to provide similar guar- presented modules, provenance is stored locally by the antees as our protocol. AH has been shown to reduce host and retrieved on an as-needed basis from other hosts. throughput by as much as half [20]. An HMAC approach This raises availability concerns as hosts inevitably begin is a viable alternative to establish integrity and data ori- to fail. Availability could be improved with minimal per- gin authenticity and would also fit into the options field, formance and storage overheads through Gehani et al.’s but would require the negotiation of IPsec security as- approach of duplicating provenance at k neighbors with sociations. Our message commitment protocol has the a limited graph depth d [34, 35]. benefit of being fully interoperable with other hosts, and does not require a negotiation phase before communi- 8 Related Work cation occurs. Another option for increasing through- put would be to employ CPU instruction extensions [37] While myriad provenance-aware systems have been pro- and security co-processor [24] to accelerate the speed of posed in the literature, the majority disclose provenance DSA. Yet another approach to reducing our impact on within an application [39, 53, 82] or workflow [31, 73]. It network performance would be to employ a batch signa- is difficult or impossible to obtain complete provenance ture scheme [15]. We tested this by transmitting a sig- in this manner. This is because systems events that occur nature over every 10 packets during TCP sessions, and outside of the application, but still effect its execution, found that throughput increased by 3.3 times to approx- will not appear in the provenance record. imately 1600 Mbps. Due to the fact that this overhead The alternative to disclosed systems are automatic may not be suitable for some environments, Provmon provenance-aware systems, which collect provenance can be configured to use Hi-Fi identifiers [61], which are transparently within the operating system. Gehani et vulnerable to network attack but impose negligible over- al.’s SPADE is a multi-platform system for eScience and head. LPM’s impact on network performance is specific grid computing audiences, with an emphasis on low la- to the particular module, and can be tailored to meet the tency and availability in distributed environments [36]. needs of the system. SPADE’s provenance reporters make use of familiar ap- plication layer utilities to generate provenance, such as ps lsof 7 Discussion polling for process information and for net- work information. This gives rise to the possibility of in- complete provenance due to race conditions. The PASS In the same way that LPM is not able to capture workflow project collects the provenance of system calls at the layer provenance without the assistance of instrumented virtual filesystem (VFS) layer. PASSv1 provides base applications, LPM is also vulnerable to workflow layer functions for provenance collection that observe pro- side channels that could be used to launder information cesses’ file I/O activity [56]. Because these basic func- from one process to another. The most obvious example tions are manually placed around the kernel, there is no of such a side channel is the copy/paste buffer in win- clear way to extend PASSv1 to support additional col- dowing applications like Xorg. This is a known problem lection hooks; we address this limitation in the modular for kernel layer security mechanisms, one that has been design of LPM. PASSv2 introduces a Disclosed Prove- addressed by the Trusted Solaris project [14], Trusted nance API for tighter integration between provenance X [28, 29], the SELinux-aware X window system [48], collected at different layers of abstraction, e.g., at the SecureView 20, and General Dynamics’ TVE 21. Simi- application layer [57]. PASSv2 assumes that disclos- lar approaches could be used to create provenance-aware ing processes are benign, while LPM provides a secure desktop environments. Similarly, LPM is unable to cap- disclosure mechanism for attesting the correctness of ture side channel flows, such as timing channels or L2 provenance-aware applications. Both SPADE and PASS cache measurements [64], a limitation that it shares with are designed for benign environments, and make no at- 20See http://www.ainfosec.com/secureview tempt to protect their collection mechanisms from an ad- 21See http://gdc4s.com/tve.html versary.

17 Security Whole-System Implemented? Model? Provenance? Adversary on Host? Adversary in Network? SPADE [36]    Vulnerable (Disable Monitor) Vulnerable (Prov. Forgery) PASSv2 [57]    Vulnerable (User Space Components) Vulnerable (Prov. Forgery) SProv [39]    Vulnerable (Disable Monitor) N/A SNooPy [82]    Vulnerable to all attacks, but with probabilistic detection. Lyle, Martin [52]    Secure. (But does not collect whole-system provenance.) Hi-Fi [61]    Vulnerable (User Space Components) Vulnerable (Prov. Forgery) LPM    Secure for whole-system provenance collection.

Table 5: A summary of security considerations in existing provenance-aware systems. The “Security Model?" field marks whether the system is designed to work in the presence of some adversary. “Whole-System Provenance?" denotes whether the system collects provenance for all events including kernel activities. The identified vulnerabilities are according to our security model.

Previous work has considered the security of prove- network, who can strip the provenance identifiers from nance under relaxed threat models relative to LPM’s. packets in transit, resulting in irrecoverable provenance. The resulting vulnerabilities are surveyed in Table 5. In Unlike LPM, Hi-Fi does not attempt to provide layered SProv, Hasan et al. introduce provenance chains, cryp- provenance services, and therefore does not consider tographic constructs that prevent the insertion or dele- the integrity and authenticity of provenance-aware tion of provenance inside of a series of events [39]. applications. SProv effectively demonstrates the authentication prop- erties of this primitive, but is not intended to serve as a Provenance vs. Information Flow secure provenance-aware system; attackers can still ap- pend false records to the end of the chain, delete the Provenance is a form of information flow monitor- whole chain, or disable the library altogether. Zhou et ing that is related but distinct from past areas of study. al. consider provenance corruption an inevitability, and Many projects have aimed to support Information Flow show that provenance can detect some malicious hosts in Control (IFC) through instrumenting applications, oper- distributed environments provided that a critical mass of ating systems, and programming languages. FlowwolF correct hosts still exist [82]. They later strengthen these is a web browser that provides labeling support for dis- assurances through use of provenance-aware software- tributed mandatory access control at the application layer defined networking [11]. These systems consider only [40]. Decentralized IFC systems like Asbestos [26], HiS- network events, and are unable to speak to the internal tar [80], and Flume [49] make use of a decentralized la- state of hosts. Lyle and Martin sketch the design for beling scheme that allows processes to compartmental- a secure provenance monitor based on trusted comput- ize data that is protected by the operating system. This ing [52]. However, they conceptualize provenance as facilitates the use of a lattice-based model for informa- a TPM-aided proof of code execution, overlooking in- tion flow [23]. DStar [79] extends this approach to sup- terprocess communication and other system activity that port distributed environments. In each case, the primary could inform execution results, and therefore offer infor- thrust of these works is to provide system-layer support mation that is too coarse-grained to meet the needs of to applications wishing to isolate user data, thus reducing some applications. Moreover, to the best of our knowl- the consequences of a compromise. IFC systems thus re- edge their system is unimplemented. quire a priori knowledge of desired flow properties, and The most promising model to date for secure prove- are unable to answer questions such as “How did data nance collection is Pohly et al.’s Hi-Fi system [61]. object x come to have label a?" Provenance provides Hi-Fi is a Linux Security Module (LSM) that collects a means of reasoning about flows and answering these whole-system provenance that details the actions of questions. processes, IPC mechanisms, and even the kernel itself Another area of information flow study is dynamic (which does not exclusively use system calls). Hi-Fi taint analysis, in which the goal is to track the propaga- attempts to provide a provenance reference monitor tion of select pieces of data across a system. Automated [55], but remains vulnerable to the provenance-aware instrumentation for taint tracking has been developed for adversary that we describe in Section 3.2. Enabling x86 binaries [67] and smartphones [27], and dynamic Hi-Fi blocks the installation of other LSM’s, such as taint analysis in sandbox environments has also been SELinux or Tomoyo, or requires a third party patch to used to secure off-the-shelf applications [84]. Prove- permit module stacking. This blocks the installation of nance can offer a more complete explanation as to how MAC policy on the host, preventing runtime integrity an object became tainted. It is also more flexible: taint assurances. Hi-Fi is also vulnerable to adversaries in the tracking relies on an immutable policy that requires that

18 data be tagged at runtime, while a provenance-based ap- [4] Kernel SSL Proxy (KSSL). http://docs.oracle.com/ proach can obtain a result after execution by “replaying” cd/E23823_01/html/816-5175/kssl-5.html. the provenance graph [82], permitting different taints to [5] Symantec Data Loss Prevention Customer Brochure. http:// be considered without re-executing. Provenance is thus www.symantec.com/data-loss-prevention. a distinct form of reasoning about information flow. [6] Titus. http://www.titus.com. [7] What Makes Bit9 + Carbon Black Unique? https://www. bit9.com/why-bit9/unique-from-competitors/. 9 Conclusion [8] What’s Yours is Mine: How Employees are Putting Your Intellectual Property at Risk. https://www4. The Linux Provenance Module Framework is an ex- symantec.com/mktginfo/whitepaper/WP_ WhatsYoursIsMine-HowEmployeesarePutting citing step forward for both provenance- and security- YourIntellectualPropertyatRisk_dai211501_ conscious communities. We have demonstrated that cta69167.pdf. LPM can be used to create a trusted provenance-aware [9] R. Aldeco-Pérez and L. Moreau. Provenance-based Auditing of execution environment. A key feature of LPM is its Private Data Use. In Proceedings of the 2008 International Con- ability to leverage Linux’s existing security features to ference on Visions of Computer Science, VoCS’08, Sept. 2008. provide strong provenance collection assurances. Our [10] J. P. Anderson. Computer Security Technology Planning Study. system imposes as little as 2.7% performance overhead Technical Report ESD-TR-73-51, Air Force Electronic Systems Division, 1972. on normal system operation, and can respond to queries [11] A. Bates, K. Butler, A. Haeberlen, M. Sherr, and W. Zhou. Let about data object ancestry in tens of milliseconds. We SDN Be Your Eyes: Secure Forensics in Data Center Networks. have used LPM as the foundation of a provenance-based In NDSS Workshop on Security of Emerging Network Technolo- data loss prevention (PB-DLP) system that can scan file gies, SENT, Feb. 2014. transmissions to detect the presence of sensitive ances- [12] A. Bates, B. Mood, M. Valafar, and K. Butler. Towards Secure tors in just tenths of a second. Provenance-based Access Control in Cloud Environments. In Proceedings of the 3rd ACM Conference on Data and Applica- tion Security and Privacy, CODASPY ’13, pages 277–284, New Acknowledgements York, NY, USA, 2013. ACM. [13] M. Bellare, R. Canetti, and H. Krawczyk. Keyed Hash Functions and Message Authentication. In Proceedings of Crypto’96, vol- We would like to thank Rob Cunningham, Alin Do- ume 1109 of LNCS, pages 1–15, 1996. bra, Will Enck, Jun Li, Al Malony, Patrick McDaniel, [14] M. Bellis, S. Lofthouse, H. Griffin, and D. Kucukreisoglu. Daniela Oliveira, Nabil Schear, Micah Sherr, and Patrick Trusted Solaris 8 4/01 Security Target. 2003. Traynor for their valuable comments and insight, as well [15] A. Bittau, D. Boneh, M. Hamburg, M. Handley, D. Mazieres, as Devin Pohly for his sustained assistance in working and Q. Slack. Cryptographic protection of TCP Streams with Hi-Fi, and Mugdha Kumar for her help developing (tcpcrypt). https://tools.ietf.org/html/ LPM SPADE support. This work was supported in part draft-bittau-tcp-crypt-01. by the US National Science Foundation under grant num- [16] D. Boneh, B. Lynn, and H. Shacham. Short Signatures from the Weil Pairing. In C. Boyd, editor, Advances in Cryptology – ASI- bers CNS-1118046, CNS-1254198, and CNS-1445983. ACRYPT 2001, volume 2248 of Lecture Notes in Computer Sci- ence, pages 514–532. Springer Berlin Heidelberg, 2001. [17] U. Braun, S. L. Garfinkel, D. A. Holland, K.-K. Muniswamy- Availability Reddy, and M. I. Seltzer. Issues in Automatic Provenance Col- lection. In International Provenance and Annotation Workshop, The LPM code base, including all user space utilities and pages 171–183, 2006. patches for both Red Hat and the mainline Linux kernels, [18] U. Braun and A. Shinnar. A Security Model for Provenance. is available at http://linuxprovenance.org. Technical Report TR-04-06, Harvard University Computer Sci- ence Group, 2006. [19] D. Catalano, M. Di Raimondo, D. Fiore, and R. Gennaro. Off- References line/On-line Signatures: Theoretical Aspects and Experimental Results. In PKC’08: Proceedings of the Practice and theory in [1] CDW Data Loss Prevention. http://www.cdw.com/ public key cryptography, 11th international conference on Pub- content/solutions/data-loss-prevention. lic key cryptography, pages 101–120, Berlin, Heidelberg, 2008. aspx. Springer-Verlag. [2] Data Leakage Protection: Assess Risk and Safeguard [20] S. Chaitanya, K. Butler, A. Sivasubramaniam, P. McDaniel, and Valuable Information. http://www.cisco.com/ M. Vilayannur. Design, Implementation and Evaluation of Secu- c/en/us/solutions/enterprise-networks/ rity in iSCSI-based Network Storage Systems. In Proceedings of data-loss-prevention/index.html. the Second ACM Workshop on Storage Security and Survivability, [3] Data Leakage Protection: Assess Risk and Safeguard Valuable StorageSS ’06, pages 17–28, New York, NY, USA, 2006. ACM. Information. http://www.mcafee.com/us/products/ [21] D. D. Clark and D. R. Wilson. A Comparison of Commercial and total-protection-for-data-loss-prevention. Military Computer Security Policies. In IEEE S&P, Oakland, aspx. CA, USA, Apr. 1987.

19 [22] D. Cock, Q. Ge, T. Murray, and G. Heiser. The Last Mile: An [39] R. Hasan, R. Sion, and M. Winslett. The Case of the Fake Pi- Empirical Study of Some Timing Channels on seL4. In ACM casso: Preventing History Forgery with Secure Provenance. In Conference on Computer and Communications Security, pages Proceedings of the 7th USENIX Conference on File and Storage 570–581, Scottsdale, AZ, USA, nov 2014. Technologies, FAST’09, San Francisco, CA, USA, Feb. 2009. [23] D. E. Denning. A Lattice Model of Secure Information Flow. [40] B. Hicks, S. Rueda, D. King, T. Moyer, J. Schiffman, Y. Sreeni- Commun. ACM, 19(5):236–243, May 1976. vasan, P. McDaniel, and T. Jaeger. An Architecture for Enforc- ing End-to-end Access Control over Web Applications. In Pro- [24] J. Dyer, M. Lindemann, R. Perez, R. Sailer, L. van Doorn, and ceedings of the 15th ACM Symposium on Access Control Models Computer S. Smith. Building the IBM 4758 Secure Coprocessor. , and Technologies, SACMAT ’10, pages 163–172, New York, NY, 34(10):57–66, Oct 2001. USA, 2010. ACM. [25] A. Edwards, T. Jaeger, and X. Zhang. Runtime Verification of [41] B. Hicks, S. Rueda, L. St.Clair, T. Jaeger, and P. McDaniel. Authorization Hook Placement for the Linux Security Modules A Logical Specification and Analysis for SELinux MLS Policy. Framework. In Proceedings of the 9th ACM Conference on Com- ACM Trans. Inf. Syst. Secur., 13(3):26:1–26:31, July 2010. puter and Communications Security, CCS’02, 2002. [42] D. A. Holland, U. Bruan, D. Maclean, K.-K. Muniswamy-Reddy, [26] P. Efstathopoulos, M. Krohn, S. VanDeBogart, C. Frey, and M. I. Seltzer. Choosing a Data Model and Query Language D. Ziegler, E. Kohler, D. Mazières, F. Kaashoek, and R. Morris. for Provenance. In 2nd International Provenance and Annotation Labels and Event Processes in the Asbestos Operating System. Workshop, IPAW’08, June 2008. SIGOPS Oper. Syst. Rev., 39(5):17–30, Oct. 2005. [43] T. Jaeger, A. Edwards, and X. Zhang. Consistency Analysis of [27] W. Enck, P. Gilbert, B.-G. Chun, L. P. Cox, J. Jung, P. McDaniel, Authorization Hook Placement in the Linux Security Modules and A. N. Sheth. TaintDroid: An Information-flow Tracking Sys- Framework. ACM Trans. Inf. Syst. Secur., 7(2):175–205, May tem for Realtime Privacy Monitoring on Smartphones. In Pro- 2004. ceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation, OSDI’10, Oct. 2010. [44] T. Jaeger, R. Sailer, and U. Shankar. PRIMA: Policy-reduced Integrity Measurement Architecture. In Proceedings of the 11th [28] J. Epstein and J. Picciotto. Trusting X: Issues in Building Trusted ACM Symposium on Access Control Models and Technologies, X Window Systems -or- WhatâA˘ Zs´ Not Trusted About X. In Pro- SACMAT ’06, pages 19–28, New York, NY, USA, 2006. ACM. ceedings of the 14th Annual National Computer Security Confer- ence, 1991. [45] T. Jaeger, R. Sailer, and X. Zhang. Analyzing Integrity Protection in the SELinux Example Policy. In Proceedings of the 12th con- [29] J. Epstein and M. Shugerman. A Trusted X Window System ference on USENIX Security Symposium - Volume 12, SSYM’03, Server for Trusted Mach. In USENIX MACH Symposium, pages pages 5–5, Berkeley, CA, USA, 2003. USENIX Association. 141–156, 1990. [46] B. Kauer. OSLO: Improving the Security of Trusted Comput- [30] S. Even, O. Goldreich, and S. Micali. On-line/off-line Digital ing. In Proceedings of 16th USENIX Security Symposium on Signatures. In Proceedings on Advances in cryptology, CRYPTO USENIX Security Symposium, SS’07, pages 16:1–16:9, Berkeley, ’89, pages 263–275, New York, NY, USA, 1989. Springer-Verlag CA, USA, 2007. USENIX Association. New York, Inc. [47] S. Kent and R. Atkinson. RFC 2406: IP Encapsulating Security [31] I. T. Foster, J.-S. Vöckler, M. Wilde, and Y. Zhao. Chimera: AVir- Payload (ESP). 1998. tual Data System for Representing, Querying, and Automating [48] D. Kilpatrick, W. Salamon, and C. Vance. Securing the X Win- Data Derivation. In Proceedings of the 14th Conference on Sci- dow System with SELinux. Technical report, Jan. 2003. entific and Statistical Database Management, SSDBM’02, July 2002. [49] M. Krohn, A. Yip, M. Brodsky, N. Cliffer, M. F. Kaashoek, E. Kohler, and R. Morris. Information Flow Control for Stan- [32] V. Ganapathy, T. Jaeger, and S. Jha. Automatic placement of dard OS Abstractions. SIGOPS Oper. Syst. Rev., 41(6):321–334, authorization hooks in the linux security modules framework. In Oct. 2007. Proceedings of the 12th ACM Conference on Computer and Com- munications Security, CCS ’05, pages 330–339, New York, NY, [50] L. Lamport. Time, Clocks, and the Ordering of Events in a Dis- USA, 2005. ACM. tributed System. Commun. ACM, 21(7):558–565, July 1978. [33] C.-z. Gao and Z.-a. Yao. A Further Improved Online/Offline Sig- [51] H. Luhn. Computer for Verifying Numbers, Aug. 23 1960. US nature Scheme. Fundam. Inf., 91:523–532, August 2009. Patent 2,950,048. [34] A. Gehani, B. Baig, S. Mahmood, D. Tariq, and F. Zaffar. Fine- [52] J. Lyle and A. Martin. Trusted Computing and Provenance: Bet- grained Tracking of Grid Infections. In Proceedings of the ter Together. In 2nd Workshop on the Theory and Practice of 11th IEEE/ACM International Conference on Grid Computing, Provenance, TaPP’10, Feb. 2010. GRID’10, Oct 2010. [53] P. Macko and M. Seltzer. A General-purpose Provenance Li- [35] A. Gehani and U. Lindqvist. Bonsai: Balanced Lineage Authen- brary. In 4th Workshop on the Theory and Practice of Prove- tication. In Proceedings of the 23rd Annual Computer Security nance, TaPP’12, June 2012. Applications Conference, ACSAC’07, Dec 2007. [54] J. M. McCune, B. J. Parno, A. Perrig, M. K. Reiter, and [36] A. Gehani and D. Tariq. SPADE: Support for Provenance Audit- H. Isozaki. Flicker: An Execution Infrastructure for TCB Mini- Proceedings of the 3rd ACM SIGOPS/EuroSys Euro- ing in Distributed Environments. In Proceedings of the 13th In- mization. In pean Conference on Computer Systems 2008 ternational Middleware Conference, Middleware ’12, Dec 2012. , Eurosys ’08, pages 315–328, New York, NY, USA, 2008. ACM. [37] S. Gueron and V. Krasnov. Speed Up Big-Number Multiplication [55] P. McDaniel, K. Butler, S. McLaughlin, R. Sion, E. Zadok, and Using Single Instruction Multiple Data (SIMD) Architectures, M. Winslett. Towards a Secure and Efficient System for End-to- June 7 2012. US Patent App. 13/491,141. End Provenance. In Proceedings of the 2nd conference on The- [38] C. Hanson. SELinux and MLS: Putting The Pieces Together. In ory and practice of provenance, San Jose, CA, USA, Feb. 2010. In Proceedings of the 2nd Annual SELinux Symposium, 2006. USENIX Association.

20 [56] K. Muniswamy-Reddy, D. A. Holland, U. Braun, and M. Seltzer. [73] J. Widom. Trio: A System for Integrated Management of Data, Provenance-Aware Storage Systems. In Proceedings of the 2006 Accuracy, and Lineage. Technical Report 2004-40, Stanford In- USENIX Annual Technical Conference, 2006. foLab, Aug. 2004. [57] K.-K. Muniswamy-Reddy, U. Braun, D. A. Holland, P. Macko, [74] World Wide Web Consortium. PROV-Overview: An Overview D. Maclean, D. Margo, M. Seltzer, and R. Smogor. Layering in of the PROV Family of Documents. http://www.w3.org/ Provenance Systems. In Proceedings of the 2009 Conference on TR/prov-overview/, 2013. USENIX Annual Technical Conference, ATC’09, June 2009. [75] C. Wright, C. Cowan, J. Morris, S. Smalley, and G. Kroah- [58] D. Nguyen, J. Park, and R. Sandhu. Dependency Path Patterns Hartman. Linux Security Module Framework. In Ottawa Linux As the Foundation of Access Control in Provenance-aware Sys- Symposium, page 604, 2002. tems. In Proceedings of the 4th USENIX Conference on Theory [76] C. Wright, C. Cowan, S. Smalley, J. Morris, and G. Kroah- and Practice of Provenance, TaPP’12, pages 4–4, Berkeley, CA, Hartman. Linux security modules: General security support for USA, 2012. USENIX Association. the linux kernel. In Proceedings of the 11th USENIX Security [59] Q. Ni, S. Xu, E. Bertino, R. Sandhu, and W. Han. An Access Symposium, pages 17–31, Berkeley, CA, USA, 2002. USENIX Control Language for a General Provenance Model. In Secure Association. Data Management, Aug. 2009. [77] Y. Xie, K.-K. Muniswamy-Reddy, D. Feng, Y. Li, and D. D. E. [60] J. Park, D. Nguyen, and R. Sandhu. A Provenance-Based Access Long. Evaluation of a Hybrid Approach for Efficient Provenance Control Model. In Proceedings of the 10th Annual International Storage. Trans. Storage, 9(4):14:1–14:29, Nov. 2013. Conference on Privacy, Security and Trust (PST), pages 137–144, [78] Y. Xie, K.-K. Muniswamy-Reddy, D. D. E. Long, A. Amer, 2012. D. Feng, and Z. Tan. Compressing Provenance Graphs, June [61] D. Pohly, S. McLaughlin, P. McDaniel, and K. Butler. Hi-Fi: Col- 2011. lecting High-Fidelity Whole-System Provenance. In Proceedings [79] N. Zeldovich, S. Boyd-Wickizer, and D. Mazières. Securing Dis- of the 2012 Annual Computer Security Applications Conference, tributed Systems with Information Flow Control. In Proceedings ACSAC ’12, Orlando, FL, USA, 2012. of the 5th USENIX Symposium on Networked Systems Design and [62] J. Postel. RFC 791: Internet protocol. 1981. Implementation, NSDI’08, pages 293–308, Berkeley, CA, USA, [63] A. C. Revkin. Hacked E-mail is New Fodder for Climate Dispute. 2008. USENIX Association. New York Times, 20, 2009. [80] N. B. Zeldovich, S. Boyd-Wickizer, E. Kohler, and D. Mazières. [64] T. Ristenpart, E. Tromer, H. Shacham, and S. Savage. Hey, You, Making Information Flow Explicit in HiStar. In Proceedings of Get Off of My Cloud: Exploring Information Leakage in Third- the 7th USENIX Symposium on Operating Systems Design and Party Compute Clouds. In Proceedings of the 16th ACM Con- Implementation, OSDI’06, Nov. 2006. ference on Computer and Communications Security (CCS’09), [81] X. Zhang, A. Edwards, and T. Jaeger. Using CQUAL for Static pages 199–212, Chicago, IL, USA, Oct. 2009. ACM. Analysis of Authorization Hook Placement. In Proceedings of [65] R. Sailer, X. Zhang, T. Jaeger, and L. van Doorn. Design and the 11th USENIX Security Symposium, 2002. Implementation of a TCG-based Integrity Measurement Archi- [82] W. Zhou, Q. Fei, A. Narayan, A. Haeberlen, B. T. Loo, and tecture. In SSYM’04: Proceedings of the 13th conference on M. Sherr. Secure Network Provenance. In Proceedings of USENIX Security Symposium, pages 16–16, Berkeley, CA, USA, the 23rd ACM Symposium on Operating Systems Principles, 2004. USENIX Association. SOSP’11, Oct. 2011. [66] C. Sar and P. Cao. Lineage File System. [83] W. Zhou, M. Sherr, T. Tao, X. Li, B. T. Loo, and Y. Mao. Efficient http://crypto.stanford.edu/~cao/lineage.html. Querying and Maintenance of Network Provenance at Internet- [67] P. Saxena, R. Sekar, and V. Puranik. Efficient Fine-grained Bi- Scale. In ACM SIGMOD International Conference on Manage- nary Instrumentationwith Applications to Taint-tracking. In Pro- ment of Data (SIGMOD), June 2010. ceedings of the 6th Annual IEEE/ACM International Symposium [84] D. Y. Zhu, J. Jung, D. Song, T. Kohno, and D. Wetherall. Tain- on Code Generation and Optimization, CGO ’08, pages 74–83, tEraser: Protecting Sensitive Data Leaks Using Application-level New York, NY, USA, 2008. ACM. Taint Tracking. SIGOPS Oper. Syst. Rev., 45(1):142–154, Feb. [68] J. Seibert, G. Baah, J. Diewald, and R. Cunningham. Using 2011. Provenance To Expedite MAC Policies (UPTEMPO) (Previously Known as IPDAM). Technical Report USTC-PM-015, MIT Lin- coln Laboratory, October 2014. [69] A. Shamir and Y. Tauman. Improved Online/Offline Signa- ture Schemes. In J. Kilian, editor, Advances in Cryptology — CRYPTO 2001, volume 2139 of Lecture Notes in Computer Sci- ence, pages 355–367. Springer Berlin / Heidelberg, 2001. [70] D. Tariq, B. Baig, A. Gehani, S. Mahmood, R. Tahir, A. Aqil, and F. Zaffar. Identifying the Provenance of Correlated Anomalies. In Proceedings of the 2011 ACM Symposium on Applied Computing, SAC ’11, Mar. 2011. [71] The Netfilter Core Team. The Netfilter Project: Packet Mangling for Linux 2.4. http://www.netfilter.org/, 1999. [72] H. Vijayakumar, G. Jakka, S. Rueda, J. Schiffman, and T. Jaeger. Integrity Walls: Finding Attack Surfaces from Mandatory Ac- cess Control Policies. In Proceedings of the 7th ACM Symposium on Information, Computer and Communications Security, ASI- ACCS ’12, pages 75–76, New York, NY, USA, 2012. ACM.

21