Curator: Provenance Management for Modern Distributed Systems

Curator: Provenance Management for Modern Distributed Systems∗† Warren Smith omas Moyer Charles Munson e Weather Company, an IBM Business MIT Lincoln Laboratory 400 Minuteman Rd UNC Charloe Secure Resilient Systems and Technology Andover, MA 01810, USA Soware and Information Systems 244 Wood St. 333G Woodward Hall Lexington, MA 02421, USA 9201 University City Blvd Charloe, NC 28223, USA Abstract provenance data, increasing the complexity of the application and Data provenance is a valuable tool for protecting and trou- system. is increased complexity can also increase the aack sur- bleshooting distributed systems. Careful design of the provenance face of the application, negating the security benets of adding components reduces the impact on the design, implementation, provenance to a system. What is needed is a provenance manage- and operation of the distributed system. In this paper, we present ment system that works in concert with existing infrastructures, Curator, a provenance management toolkit that can be easily in- and provides lightweight integration into applications. tegrated with microservice-based systems and other modern dis- In this paper, we present Curator, a provenance management tributed systems. is paper describes the design of Curator and toolkit that integrates with existing logging/auditing systems. Addi- discusses how we have used Curator to add provenance to dis- tionally, Curator provides a lightweight library to integrate prove- tributed systems. We nd that our approach results in no changes nance collection into existing applications that minimizes depen- to the design of these distributed systems and minimal additional dencies, reducing the integration complexity for application devel- code and dependencies to manage. In addition, Curator uses the opers. Curator is able to integrate provenance from multiple levels same scalable infrastructure as the distributed system and can there- of abstraction, including applications, infrastructure (databases, fore scale with the distributed system. processing engines, etc.), and operating systems. e system en- sures a consistent encoding of data between provenance sources, ACM Reference format: Warren Smith, omas Moyer, and Charles Munson. 2018. Curator: Prove- allowing consumers of provenance data to reason about system nance Management for Modern Distributed Systems. In Proceedings of behavior across dierent levels of the system. USENIX eory and Practice of Provenance, London, England, July 9–13, 2018 We focus our aention on the integration of Curator into microservice- (TaPP’18), 6 pages. based systems, where applications consist of small services that DOI: coordinate to achieve the goals of the application. Such architectures are popular in today’s systems and present challenges when 1 Introduction adding provenance. Data provenance, the history of data as it moves through and between systems, provides distributed system operators with a 2 Design potentially rich source of information for a wide-range of uses. e goals for the design of the Curator toolkit emerged from our Operators can use provenance for troubleshooting [13], auditing [6], experiences adding data provenance to prototype data processing and forensic analysis [9]. Existing systems have focused on the systems. collection and usage of provenance data, oen with the analysis G1) Minimally invasive: Our rst goal is to make it easier to occurring on the same system that collected the provenance. While create and emit provenance from application services and infras- this works well for applications running on a single host, it quickly tructure. It was oen dicult to add and de-conict packages used breaks down on distributed systems. by our previous provenance instrumentation with the package de- In systems that have considered provenance for distributed sys- pendencies of the data processing systems we were instrumenting. tems, the proposed architectures treat provenance as unique from G2) Scalable: e second goal is to aggregate and store prove- other sources of metadata, such as log and audit data. is requires nance information in a scalable way while re-using the infrastruc- users of provenance to build entire infrastructures to manage the ture deployed by a data processing system. Maintainers of data processing systems are oen reluctant to add additional soware ∗DISTRIBUTION STATEMENT A. Approved for public release: distribution unlimited. arXiv:1806.02227v1 [cs.DB] 6 Jun 2018 is material is based upon work supported by the Assistant Secretary of Defense for infrastructure solely for the use of a data provenance subsystem, Research and Engineering under Air Force Contract No. FA8721-05-C-0002 and/or even when the system generates high volumes of provenance data. FA8702-15-D-0001. Any opinions, ndings, conclusions or recommendations expressed G3) Visualization: Our third goal is to provide a set of tools and in this material are those of the author(s) and do not necessarily reect the views of displays for visualizing and analyzing provenance information. We the Assistant Secretary of Defense for Research and Engineering. †Permission to make digital or hard copies of all or part of this work for personal or found that there are commonalities in the provenance visualization classroom use is granted without fee provided that copies are not made or distributed and analysis needs of the systems we added data provenance to for prot or commercial advantage and that copies bear this notice and the full citation and that we could create components for use across these systems. on the rst page. Fig. 1 shows our approach to satisfy these goals. e key concept TaPP 2018, July 11–12, 2018, London, UK. of this architecture is to view gathering, storing, and analyzing Copyright remains with the owner/author(s). provenance information as a logging problem. e Curator design Permission to make digital or hard copies of all or part of this work for personal or consists of helper libraries to create provenance information and classroom use is granted without fee provided that copies are not made or distributed add it to logs, use of a log aggregation and processing system, and for prot or commercial advantage and that copies bear this notice and the full citation a provenance subsystem for storing, visualizing, and analyzing the on the rst page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permied. To copy otherwise, or republish, provenance. to post on servers or to redistribute to lists, requires prior specic permission and/or a fee. Request permissions from [email protected]. 2.1 Instrumentation and Logging TaPP’18, London, England © 2018 ACM. ...$15.00 To emit provenance, the Curator toolkit includes a Java library DOI: to create, encode, and log provenance. Fig. 2 shows the creation TaPP’18, July 9–13, 2018, London, England Warren Smith, Thomas Moyer, and Charles Munson microservice-based system as well as managing provenance data. Database Analyze Receive Transform L C e Curator toolkit currently includes a logstash output plugin. L C L C L C Receive Transform Fuse Send L C 2.3 Storage and ery L C L C L C L C We have found that it is not possible to select a single data Input Input Input storage solution for provenance data. First, the volume and velocity of provenance data varies from system to system so no one solution Concentrate Archive Filesystem is always the most appropriate. Second, a microservice system has likely adopted a database for their needs and the developers and operators would prefer to also use that database for provenance Visualize Ingest Provenance data. Database Analyze e Curator approach is to therefore support a number of dier- Key ent databases behind common Store and Query interfaces. Curator L Logging currently supports popular SQL databases (MySQL/MariaDB, Post- C Curator greSQL, H2, Derby) and the Accumulo 4 distributed key/value store. Curator represents provenance information as graphs of vertices Figure 1. e Curator toolkit integrated into a microservice ar- and edges that have aributes (key/value pairs). Curator stores chitecture. Some connections have been removed for legibility. these graphs in SQL in a normalized form and in Accumulo in a denormalized form, as is typical for such databases. e optimized ProvenanceLogger logger = schemas used in the databases enable fast retrieval of vertices and new ProvenanceLogger(Logger.getLogger("App"), edges by their ids and locating vertices and edges that have specic new ProvJsonSerializer()); aributes. It has been demonstrated that SQL databases support Entity input = new Entity(); 5 input.setAttribute("filename", "IMG-0942.jpg"); lower ingest rates and data volumes, but fast queries . Accumulo Activity transform = new Activity(); supports high ingest rates and volumes of data, but queries can be Used used = new Used(transform, inputData); slower [11]. logger.log(inputData); e Curator Query interface supports a number of operations logger.log(transform); to retrieve provenance data to drive analytics. As mentioned above, logger.log(used); this interface supports basic operations such as nding vertices and edges by identier or by aributes. e interface also supports Figure 2. Adding provenance instrumentation to a microservice. nding ancestors and descendants of a vertex (typically an entity) so that an analysis tool can determine what entities, activities, and agents inuenced or were inuenced by a vertex. For broader views, of a provenance logger with a serializer, the creation of prove- the Query interface supports nding the ancestors and descendants nance objects, and the logging of these objects to the same log the of a set of vertices and the connected subgraph that a set of vertices microservice uses for its logging. are part of. Curator adopts the World Wide Web Consortium’s W3C-PROV [5] standards to represent data provenance information and denes a 2.4 Analytics Query set of Java objects organized as graph vertices and edges to repre- We use the provenance information available from the to sent the PROV data model. e middle of Fig. 2 contains examples drive analytics.

Load more