Curator: Provenance Management for Modern Distributed Systems
Warren Smith
The Weather Company, an IBM Business
Andover, MA, USA

Thomas Moyer
UNC Charlotte
Software and Information Systems
Charlotte, NC, USA

Charles Munson
MIT Lincoln Laboratory
Secure Resilient Systems and Technology
Lexington, MA, USA

Abstract
Data provenance is a valuable tool for protecting and troubleshooting distributed systems. Careful design of the provenance components reduces the impact on the design, implementation, and operation of the distributed system. In this paper, we present Curator, a provenance management toolkit that can be easily integrated with microservice-based systems and other modern distributed systems. This paper describes the design of Curator and discusses how we have used Curator to add provenance to distributed systems. We find that our approach results in no changes to the design of these distributed systems and minimal additional code and dependencies to manage. In addition, Curator uses the same scalable infrastructure as the distributed system and can therefore scale with the distributed system.

ACM Reference Format:
Warren Smith, Thomas Moyer, and Charles Munson. 2018. Curator: Provenance Management for Modern Distributed Systems. In Proceedings of USENIX Theory and Practice of Provenance (TaPP'18). ACM, New York, NY, USA, 4 pages.

DISTRIBUTION STATEMENT A. Approved for public release: distribution unlimited. This material is based upon work supported by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract No. FA8721-05-C-0002 and/or FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Assistant Secretary of Defense for Research and Engineering.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. TaPP 2018, July 11–12, 2018, London, UK. Copyright remains with the owner/author(s).

1 Introduction
Data provenance, the history of data as it moves through and between systems, provides distributed system operators with a potentially rich source of information for a wide range of uses. Operators can use provenance for troubleshooting [13], auditing [6], and forensic analysis [9]. Existing systems have focused on the collection and usage of provenance data, often with the analysis occurring on the same system that collected the provenance. While this works well for applications running on a single host, it quickly breaks down on distributed systems.

In systems that have considered provenance for distributed systems, the proposed architectures treat provenance as distinct from other sources of metadata, such as log and audit data. This requires users of provenance to build entire infrastructures to manage the provenance data, increasing the complexity of the application and system. This increased complexity can also increase the attack surface of the application, negating the security benefits of adding provenance to a system. What is needed is a provenance management system that works in concert with existing infrastructures and provides lightweight integration into applications.

In this paper, we present Curator, a provenance management toolkit that integrates with existing logging/auditing systems. Additionally, Curator provides a lightweight library for integrating provenance collection into existing applications. The library minimizes dependencies, reducing the integration complexity for application developers. Curator is able to integrate provenance from multiple levels of abstraction, including applications, infrastructure (databases, processing engines, etc.), and operating systems. The system ensures a consistent encoding of data between provenance sources, allowing consumers of provenance data to reason about system behavior across different levels of the system.

We focus our attention on the integration of Curator into microservice-based systems, where applications consist of small services that coordinate to achieve the goals of the application. Such architectures are popular in today's systems and present challenges when adding provenance.

2 Design
The goals for the design of the Curator toolkit emerged from our experiences adding data provenance to prototype data processing systems.
G1) Minimally invasive: Our first goal is to make it easier to create and emit provenance from application services and infrastructure. It was often difficult to add and de-conflict packages used by our previous provenance instrumentation with the package dependencies of the data processing systems we were instrumenting.
G2) Scalable: The second goal is to aggregate and store provenance information in a scalable way while re-using the infrastructure deployed by a data processing system.
Maintainers of data processing systems are often reluctant to add additional software infrastructure solely for the use of a data provenance subsystem, even when the system generates high volumes of provenance data.
G3) Visualization: Our third goal is to provide a set of tools and displays for visualizing and analyzing provenance information. We found that there are commonalities in the provenance visualization and analysis needs of the systems we added data provenance to and that we could create components for use across these systems.

[Figure 1 (diagram): microservice boxes labeled Receive, Transform, Fuse, Send, and Analyze, marked with L and C components; other labeled elements are Input, Database, Filesystem, Concentrate, Archive, Ingest, Provenance Database, Visualize, and Analyze. Key: L = Logging, C = Curator.]
Figure 1. The Curator toolkit integrated into a microservice architecture. Some connections have been removed for legibility.

Fig. 1 shows our approach to satisfying these goals. The key concept of this architecture is to view gathering, storing, and analyzing provenance information as a logging problem. The Curator design consists of helper libraries to create provenance information and add it to logs, use of a log aggregation and processing system, and a provenance subsystem for storing, visualizing, and analyzing the provenance.

2.1 Instrumentation and Logging
To emit provenance, the Curator toolkit includes a Java library to create, encode, and log provenance. Fig. 2 shows the creation of a provenance logger with a serializer, the creation of provenance objects, and the logging of these objects to the same log the microservice uses for its logging.

    // Wrap the microservice's existing log4j logger with a PROV-JSON serializer.
    ProvenanceLogger logger =
        new ProvenanceLogger(Logger.getLogger("App"),
                             new ProvJsonSerializer());

    // Create the provenance objects: an input entity, an activity,
    // and a "used" edge linking the activity to its input.
    Entity inputData = new Entity();
    inputData.setAttribute("filename", "IMG-0942.jpg");
    Activity transform = new Activity();
    Used used = new Used(transform, inputData);

    // Emit the provenance to the same log the service already writes to.
    logger.log(inputData);
    logger.log(transform);
    logger.log(used);

Figure 2. Adding provenance instrumentation to a microservice.

Curator adopts the World Wide Web Consortium's W3C-PROV [5] standards to represent data provenance information and defines a set of Java objects organized as graph vertices and edges to represent the PROV data model. The middle of Fig. 2 contains

currently receives log4j data via a socket or a log file. It then deserializes the provenance data using any of the available deserializers (e.g., PROV-JSON), and writes the provenance information to any supported provenance database.

However, rather than using the simple Curator ingest service, we recommend using a log management system for a more scalable and robust ingest implementation. Systems such as logstash² and fluentd³ are widely used in microservice architectures to manage log data. These services are modular and configurable and support the dual use of monitoring and analyzing the operation of a microservice-based system as well as managing provenance data. The Curator toolkit currently includes a logstash output plugin.

2.3 Storage and Query
We have found that it is not possible to select a single data storage solution for provenance data. First, the volume and velocity of provenance data vary from system to system, so no one solution is always the most appropriate. Second, a microservice system has likely adopted a database for its own needs, and the developers and operators would prefer to also use that database for provenance data.

The Curator approach is therefore to support a number of different databases behind common Store and Query interfaces. Curator currently supports popular SQL databases (MySQL/MariaDB, PostgreSQL, H2, Derby) and the Accumulo⁴ distributed key/value store. Curator represents provenance information as graphs of vertices and edges that have
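
To make the idea of common Store and Query interfaces over a provenance graph more concrete, the following is a minimal sketch and not Curator's actual API. Every name in it (ProvenanceStoreSketch, Vertex, Edge, Store, Query, InMemoryBackend, and all method names) is hypothetical and chosen only for illustration. The string attribute map on each vertex is likewise an assumption, and an in-memory backend stands in for the SQL and key/value databases named above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only: none of these types are taken from Curator. */
public class ProvenanceStoreSketch {

    /** A graph vertex (for example a PROV entity or activity); attributes are an assumption. */
    static final class Vertex {
        final String id;
        final String type;
        final Map<String, String> attributes = new LinkedHashMap<>();

        Vertex(String id, String type) {
            this.id = id;
            this.type = type;
        }
    }

    /** A directed, typed edge (for example PROV "used") between two vertex ids. */
    static final class Edge {
        final String type;
        final String sourceId;
        final String destinationId;

        Edge(String type, String sourceId, String destinationId) {
            this.type = type;
            this.sourceId = sourceId;
            this.destinationId = destinationId;
        }
    }

    /** Hypothetical write-side interface shared by all backends. */
    interface Store {
        void put(Vertex vertex);
        void put(Edge edge);
    }

    /** Hypothetical read-side interface shared by all backends. */
    interface Query {
        Vertex findVertex(String id);
        List<Edge> edgesFrom(String sourceId);
    }

    /** In-memory backend; a database-backed class would implement the same two interfaces. */
    static final class InMemoryBackend implements Store, Query {
        private final Map<String, Vertex> vertices = new LinkedHashMap<>();
        private final List<Edge> edges = new ArrayList<>();

        @Override
        public void put(Vertex vertex) {
            vertices.put(vertex.id, vertex);
        }

        @Override
        public void put(Edge edge) {
            edges.add(edge);
        }

        @Override
        public Vertex findVertex(String id) {
            return vertices.get(id); // null when the id is unknown
        }

        @Override
        public List<Edge> edgesFrom(String sourceId) {
            List<Edge> result = new ArrayList<>();
            for (Edge edge : edges) {
                if (edge.sourceId.equals(sourceId)) {
                    result.add(edge);
                }
            }
            return result;
        }
    }

    public static void main(String[] args) {
        InMemoryBackend backend = new InMemoryBackend();
        Store store = backend; // callers depend only on the interfaces,
        Query query = backend; // so the backend can be swapped

        Vertex inputData = new Vertex("e1", "entity");
        inputData.attributes.put("filename", "IMG-0942.jpg");
        store.put(inputData);
        store.put(new Vertex("a1", "activity"));
        store.put(new Edge("used", "a1", "e1"));

        // Follow the "used" edge from the activity back to its input entity.
        for (Edge edge : query.edgesFrom("a1")) {
            Vertex target = query.findVertex(edge.destinationId);
            System.out.println(edge.type + " -> " + target.attributes.get("filename"));
        }
    }
}
```

Running main prints "used -> IMG-0942.jpg". The design point is that calling code sees only Store and Query, so under these assumptions a system could keep provenance in whichever database it has already deployed by supplying another class that implements the same two interfaces.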