Curator: Provenance Management for Modern Distributed Systems
Total Page:16
File Type:pdf, Size:1020Kb
Curator: Provenance Management for Modern Distributed Systems∗ Warren Smith Thomas Moyer Charles Munson The Weather Company, an IBM MIT Lincoln Laboratory Business UNC Charlotte Secure Resilient Systems and Andover, MA, USA Software and Information Systems Technology Charlotte, NC, USA Lexington, MA, USA Abstract infrastructures to manage the provenance data, increasing Data provenance is a valuable tool for protecting and trou- the complexity of the application and system. This increased bleshooting distributed systems. Careful design of the prove- complexity can also increase the attack surface of the appli- nance components reduces the impact on the design, im- cation, negating the security benefits of adding provenance plementation, and operation of the distributed system. In to a system. What is needed is a provenance management this paper, we present Curator, a provenance management system that works in concert with existing infrastructures, toolkit that can be easily integrated with microservice-based and provides lightweight integration into applications. systems and other modern distributed systems. This paper In this paper, we present Curator, a provenance manage- describes the design of Curator and discusses how we have ment toolkit that integrates with existing logging/auditing used Curator to add provenance to distributed systems. We systems. Additionally, Curator provides a lightweight li- find that our approach results in no changes to the design brary to integrate provenance collection into existing appli- of these distributed systems and minimal additional code cations that minimizes dependencies, reducing the integra- and dependencies to manage. In addition, Curator uses the tion complexity for application developers. Curator is able same scalable infrastructure as the distributed system and to integrate provenance from multiple levels of abstraction, can therefore scale with the distributed system. including applications, infrastructure (databases, processing engines, etc.), and operating systems. The system ensures ACM Reference Format: Warren Smith, Thomas Moyer, and Charles Munson. 2018. Curator: a consistent encoding of data between provenance sources, Provenance Management for Modern Distributed Systems. In Pro- allowing consumers of provenance data to reason about sys- ceedings of TaPP 2018. ACM, New York, NY, USA, 4 pages. tem behavior across different levels of the system. We focus our attention on the integration of Curator 1 Introduction into microservice-based systems, where applications consist Data provenance, the history of data as it moves through of small services that coordinate to achieve the goals of and between systems, provides distributed system operators the application. Such architectures are popular in today’s with a potentially rich source of information for a wide-range systems and present challenges when adding provenance. of uses. Operators can use provenance for troubleshoot- ing [13], auditing [6], and forensic analysis [9]. Existing 2 Design systems have focused on the collection and usage of prove- The goals for the design of the Curator toolkit emerged nance data, often with the analysis occurring on the same from our experiences adding data provenance to prototype system that collected the provenance. While this works well data processing systems. for applications running on a single host, it quickly breaks G1) Minimally invasive: Our first goal is to make it eas- down on distributed systems. ier to create and emit provenance from application services In systems that have considered provenance for distributed and infrastructure. It was often difficult to add and de-conflict systems, the proposed architectures treat provenance as packages used by our previous provenance instrumentation unique from other sources of metadata, such as log and au- with the package dependencies of the data processing sys- dit data. This requires users of provenance to build entire tems we were instrumenting. G2) Scalable: The second goal is to aggregate and store ∗DISTRIBUTION STATEMENT A. Approved for public release: distribution provenance information in a scalable way while re-using unlimited. This material is based upon work supported by the Assistant Secretary of the infrastructure deployed by a data processing system. Defense for Research and Engineering under Air Force Contract No. FA8721- Maintainers of data processing systems are often reluctant to 05-C-0002 and/or FA8702-15-D-0001. Any opinions, findings, conclusions or add additional software infrastructure solely for the use of a recommendations expressed in this material are those of the author(s) and data provenance subsystem, even when the system generates do not necessarily reflect the views of the Assistant Secretary of Defense high volumes of provenance data. for Research and Engineering. G3) Visualization: Our third goal is to provide a set of Permission to make digital or hard copies of part or all of this work for tools and displays for visualizing and analyzing provenance personal or classroom use is granted without fee provided that copies are information. We found that there are commonalities in the not made or distributed for profit or commercial advantage and that copies provenance visualization and analysis needs of the systems bear this notice and the full citation on the first page. Copyrights for third- we added data provenance to and that we could create com- party components of this work must be honored. For all other uses, contact the owner/author(s). ponents for use across these systems. TaPP 2018, July 11 - 12, 2018, London, UK Fig. 1 shows our approach to satisfy these goals. The key © 2018 Copyright held by the owner/author(s). concept of this architecture is to view gathering, storing, and analyzing provenance information as a logging problem. TaPP 2018, July 11 - 12, 2018, London, UK Warren Smith, Thomas Moyer, and Charles Munson 2.2 Aggregation and Ingest Database Analyze Receive Transform L C Once the microservices log provenance information, the L C L C L C logging system gathers the logs and then filters the prove- Receive Transform Fuse Send L C L C L C L C L C nance information out of the logs. Curator deserializes and writes the provenance data into a provenance database. Curator Input Input Input includes a simple service to perform these tasks. This service Concentrate Archive Filesystem currently receives log4j data via a socket or a log file. It then deserializes the provenance data using any of the available Visualize deserializers (e.g PROV-JSON), and writes the provenance Ingest Provenance Database information to any supported provenance database. Analyze However, rather than using the simple Curator ingest Key L Logging service, we recommend using a log management system for C Curator a more scalable and robust ingest implementation. Systems such as logstash 2 and fluentd 3 are widely used in microser- Figure 1. The Curator toolkit integrated into a microser- vice architectures to manage log data. These services are vice architecture. Some connections have been removed for modular and configurable and support the dual use of moni- toring and analyzing the operation of a microservice-based legibility. system as well as managing provenance data. The Curator toolkit currently includes a logstash output plugin. ProvenanceLogger logger = 2.3 Storage and Query new ProvenanceLogger(Logger.getLogger("App"), We have found that it is not possible to select a single data new ProvJsonSerializer()); storage solution for provenance data. First, the volume and Entity input = new Entity(); velocity of provenance data varies from system to system so input.setAttribute("filename", "IMG-0942.jpg"); no one solution is always the most appropriate. Second, a Activity transform = new Activity(); microservice system has likely adopted a database for their Used used = new Used(transform, inputData); logger.log(inputData); needs and the developers and operators would prefer to also logger.log(transform); use that database for provenance data. logger.log(used); The Curator approach is to therefore support a number of different databases behind common Store and Query in- terfaces. Curator currently supports popular SQL databases Figure 2. Adding provenance instrumentation to a microser- (MySQL/MariaDB, PostgreSQL, H2, Derby) and the Accu- vice. mulo 4 distributed key/value store. Curator represents prove- nance information as graphs of vertices and edges that have attributes (key/value pairs). Curator stores these graphs in The Curator design consists of helper libraries to create SQL in a normalized form and in Accumulo in a denormal- provenance information and add it to logs, use of a log aggre- ized form, as is typical for such databases. The optimized gation and processing system, and a provenance subsystem schemas used in the databases enable fast retrieval of ver- for storing, visualizing, and analyzing the provenance. tices and edges by their ids and locating vertices and edges 2.1 Instrumentation and Logging that have specific attributes. It has been demonstrated that SQL databases support lower ingest rates and data volumes, To emit provenance, the Curator toolkit includes a Java 5 library to create, encode, and log provenance. Fig. 2 shows but fast queries . Accumulo supports high ingest rates and the creation of a provenance logger with a serializer, the volumes of data, but queries can be slower [11]. Curator Query creation of provenance objects, and the logging of these The interface supports a number