TaPP'18, July 9–13, 2018, London, England. Warren Smith, Thomas Moyer, and Charles Munson

Curator: Provenance Management for Modern Distributed Systems∗†

arXiv:1806.02227v1 [cs.DB] 6 Jun 2018

Warren Smith
The Weather Company, an IBM Business
400 Minuteman Rd
Andover, MA 01810, USA

Thomas Moyer
UNC Charlotte
Software and Information Systems
333G Woodward Hall
9201 University City Blvd
Charlotte, NC 28223, USA

Charles Munson
MIT Lincoln Laboratory
Secure Resilient Systems and Technology
244 Wood St.
Lexington, MA 02421, USA

∗ DISTRIBUTION STATEMENT A. Approved for public release: distribution unlimited. This material is based upon work supported by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract No. FA8721-05-C-0002 and/or FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Assistant Secretary of Defense for Research and Engineering.

† Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. TaPP 2018, July 11–12, 2018, London, UK. Copyright remains with the owner/author(s).

Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. TaPP'18, London, England. © 2018 ACM. ...$15.00 DOI:

Abstract

Data provenance is a valuable tool for protecting and troubleshooting distributed systems. Careful design of the provenance components reduces the impact on the design, implementation, and operation of the distributed system. In this paper, we present Curator, a provenance management toolkit that can be easily integrated with microservice-based systems and other modern distributed systems. This paper describes the design of Curator and discusses how we have used Curator to add provenance to distributed systems. We find that our approach results in no changes to the design of these distributed systems and minimal additional code and dependencies to manage. In addition, Curator uses the same scalable infrastructure as the distributed system and can therefore scale with the distributed system.

ACM Reference format:
Warren Smith, Thomas Moyer, and Charles Munson. 2018. Curator: Provenance Management for Modern Distributed Systems. In Proceedings of USENIX Theory and Practice of Provenance, London, England, July 9–13, 2018 (TaPP'18), 6 pages. DOI:

1 Introduction

Data provenance, the history of data as it moves through and between systems, provides distributed system operators with a potentially rich source of information for a wide range of uses. Operators can use provenance for troubleshooting [13], auditing [6], and forensic analysis [9]. Existing systems have focused on the collection and usage of provenance data, often with the analysis occurring on the same system that collected the provenance. While this works well for applications running on a single host, it quickly breaks down on distributed systems.

In systems that have considered provenance for distributed systems, the proposed architectures treat provenance as distinct from other data sources, such as log and audit data. This requires users of provenance to build entire infrastructures to manage the provenance data, increasing the complexity of the application and system. This increased complexity can also increase the attack surface of the application, negating the security benefits of adding provenance to a system. What is needed is a provenance management system that works in concert with existing infrastructures and provides lightweight integration into applications.

In this paper, we present Curator, a provenance management toolkit that integrates with existing logging/auditing systems. Additionally, Curator provides a lightweight way to integrate provenance into existing applications that minimizes dependencies, reducing the integration complexity for application developers. Curator is able to integrate provenance from multiple levels of abstraction, including applications, infrastructure (databases, processing engines, etc.), and operating systems. The system ensures a consistent encoding of data between provenance sources, allowing consumers of provenance data to reason about system behavior across different levels of the system.

We focus our attention on the integration of Curator into microservice-based systems, where applications consist of small services that coordinate to achieve the goals of the application. Such architectures are popular in today's systems and present challenges when adding provenance.

2 Design

The goals for the design of the Curator toolkit emerged from our experiences adding data provenance to prototype data processing systems.

G1) Minimally invasive: Our first goal is to make it easier to create and emit provenance from application services and infrastructure. It was often difficult to add and de-conflict packages used by our previous provenance instrumentation with the package dependencies of the data processing systems we were instrumenting.

G2) Scalable: The second goal is to aggregate and store provenance information in a scalable way while re-using the infrastructure deployed by a data processing system. Maintainers of data processing systems are often reluctant to add additional software infrastructure solely for the use of a data provenance subsystem, even when the system generates high volumes of provenance data.

G3) Visualization: Our third goal is to provide a set of tools and displays for visualizing and analyzing provenance information. We found that there are commonalities in the provenance visualization and analysis needs of the systems we added data provenance to and that we could create components for use across these systems.

Fig. 1 shows our approach to satisfy these goals. The key concept of this architecture is to view gathering, storing, and analyzing provenance information as a logging problem. The Curator design consists of helper code to create provenance information and add it to logs, use of a log aggregation and processing system, and a provenance subsystem for storing, visualizing, and analyzing the provenance.

2.1 Instrumentation and Logging

To emit provenance, the Curator toolkit includes a Java library to create, encode, and log provenance. Fig. 2 shows the creation

of a provenance logger with a serializer, the creation of provenance objects, and the logging of these objects to the same log the microservice uses for its logging.

[Figure 1 diagram. Recoverable labels: Receive, Transform, Fuse, Send, Analyze, Database, Filesystem, Input, Concentrate, Ingest, Provenance Database, Visualize. Key: L = Logging, C = Curator.]

Figure 1. The Curator toolkit integrated into a microservice architecture. Some connections have been removed for legibility.

    ProvenanceLogger logger =
        new ProvenanceLogger(Logger.getLogger("App"),    // reuse the service's own logger
                             new ProvJsonSerializer());  // encode provenance as PROV-JSON
    Entity inputData = new Entity();                     // the data item being processed
    inputData.setAttribute("filename", "IMG-0942.jpg");
    Activity transform = new Activity();                 // the processing step
    Used used = new Used(transform, inputData);          // edge: transform used inputData
    logger.log(inputData);
    logger.log(transform);
    logger.log(used);

Figure 2. Adding provenance instrumentation to a microservice.

Curator adopts the World Wide Web Consortium's W3C-PROV [5] standards to represent data provenance information and defines a set of Java objects organized as graph vertices and edges to represent the PROV data model. The middle of Fig. 2 contains examples of these objects. To translate provenance objects to and from representations that are suitable for logs, the toolkit contains serializers and deserializers that support formats such as PROV-JSON [8]. The third line of Fig. 2 demonstrates selection of a serializer.

Curator makes use of the logging framework chosen by the authors of the microservice, rather than imposing a logging framework, and currently supports log4j (https://logging.apache.org/log4j) and log4j2. The toolkit also provides ProvenanceLogger wrapper classes to hold a serializer object and to implement log() methods for all of the provenance objects.

2.2 Aggregation and Ingest

Once the microservices log provenance information, the logging system gathers the logs and then filters the provenance information out of the logs. Curator deserializes and writes the provenance data into a provenance database. Curator includes a simple service to perform these tasks. This service currently receives log4j data via a socket or a log file. It then deserializes the provenance data using any of the available deserializers (e.g., PROV-JSON), and writes the provenance information to any supported provenance database.

However, rather than using the simple Curator ingest service, we recommend using a log management system for a more scalable and robust ingest implementation. Systems such as logstash (https://www.elastic.co/products/logstash) and fluentd (https://www.fluentd.org/) are widely used in microservice architectures to manage log data. These services are modular and configurable and support the dual use of monitoring and analyzing the operation of a microservice-based system as well as managing provenance data. The Curator toolkit currently includes a logstash output plugin.

2.3 Storage and Query

We have found that it is not possible to select a single data storage solution for provenance data. First, the volume and velocity of provenance data vary from system to system, so no one solution is always the most appropriate. Second, a microservice system has likely adopted a database for its needs, and the developers and operators would prefer to also use that database for provenance data.

The Curator approach is therefore to support a number of different databases behind common Store and Query interfaces. Curator currently supports popular SQL databases (MySQL/MariaDB, PostgreSQL, H2, Derby) and the Accumulo (https://accumulo.apache.org) distributed key/value store. Curator represents provenance information as graphs of vertices and edges that have attributes (key/value pairs). Curator stores these graphs in SQL in a normalized form and in Accumulo in a denormalized form, as is typical for such databases. The optimized schemas used in the databases enable fast retrieval of vertices and edges by their ids and fast location of vertices and edges that have specific attributes. It has been demonstrated that SQL databases support lower ingest rates and data volumes but fast queries, assuming that the tables have the appropriate indexes. Accumulo supports high ingest rates and volumes of data, but queries can be slower [11].

The Curator Query interface supports a number of operations to retrieve provenance data to drive analytics. As mentioned above, this interface supports basic operations such as finding vertices and edges by identifier or by attributes. The interface also supports finding ancestors and descendants of a vertex (typically an entity) so that an analysis tool can determine what entities, activities, and agents influenced or were influenced by a vertex. For broader views, the Query interface supports finding the ancestors and descendants of a set of vertices and the connected subgraph that a set of vertices are part of.

2.4 Analytics

We use the provenance information available from the Query interface to drive analytics. For example, we have written system-specific analysis code to analyze the structure and content of a provenance graph related to a data item passing through a data processing pipeline in order to ensure accurate and timely processing of data inputs. This code determines if the provenance graph has the structure and content that we expect and notifies the user if it does not. Such analysis can be used for anomaly detection and for debugging. However, in the case of anomaly detection, more work is needed to ensure the security of the provenance collected using Curator. Graph-grammar based techniques such as those used in Winnower can be integrated to allow for more general-purpose analysis of graph structure [7].

2.5 Visualization

Finally, provenance visualizations are useful for a number of purposes. An operator or developer can use visualizations to understand the operation of a system and the interactions of the components in the system. They can also use visualizations to analyze the causes or effects of a failure or an attack. Curator includes visualization components for integration into data processing systems. These components provide a general-purpose mechanism to explore provenance data that has been processed by Curator, allowing consumers of provenance data to explore the provenance data in a browser.

3 Use Cases

We used the Curator toolkit to add provenance to several distributed systems. The general purpose of all of these systems is to process data as it streams into the system, store data products,
perform analyses on stored data, and present results to users. The systems are architected as a set of microservices that communicate via a message bus.

We integrated Curator into these systems in order to collect provenance. First, we added provenance logging statements into the application services. As described above, these statements are not difficult to write and the application services have a great deal of contextual information that can be of use in the provenance record. However, it is challenging to locate all of the code locations that require logging. One promising technique for automatically instrumenting applications for provenance is described in [2]. Such systems can use Curator to add provenance to legacy applications.

Second, we added provenance logging to common application libraries, such as a library that wraps a messaging system to package application data for sending and receiving. This is again easy to accomplish using code similar to that shown in Fig. 2. This approach works well because provenance logging is added in a minimal number of locations and can gather a large amount of the provenance needed for a system. In addition, because this logging is part of the application, a large amount of application context is available to the provenance system.

Third, we added provenance logging to common infrastructure, such as Spring Integration. Spring Integration (https://projects.spring.io/spring-integration) is a Java framework for creating distributed applications by composing services and abstracting the message-oriented mechanisms that connect the services. This approach allows services to be written independently from each other, focusing only on their inputs, functionality, and outputs. To add provenance gathering to Spring, we created a Spring Interceptor that creates provenance log entries for each message it sees. We also made a small change to the Spring XML configuration to add this interceptor to every channel so that we could gather provenance information for every communication between services. More details on the Spring Interceptor can be found in the full technical report. Details of the technical report can be found at https://bitbucket.org/crestlab/Curator-public.git.

4 Related Work

Curator is not the first provenance management toolkit. SPADE [4] and PLUS [3] are two systems that provide similar capabilities. These management systems aim to capture, encode, store, and query provenance information from a wide range of systems. PLUS aims to support integration into applications by providing a library for reporting provenance, a SQL schema for storing provenance, and a query interface to access collected provenance. PLUS also provides a modified Mule enterprise service bus that captures provenance and a provenance reporting API.

SPADE is a similar system that aims to collect, encode, store, and query provenance data in distributed systems. The core of SPADE is a pluggable kernel with interfaces for collection, storage, and querying of provenance data. SPADE also considers how to handle provenance at multiple abstraction layers, with collection plugins to capture both operating system information from audit logs and information from applications using a named pipe. Similar to PLUS, SPADE provides an API for applications to report provenance. In both SPADE and PLUS, the transfer of provenance data is handled separately from other logging messages.

The primary difference between Curator and the tools described above is that Curator integrates with existing logging infrastructures for distributed systems, instead of building an entirely parallel infrastructure to support the management of the provenance data. This provides the additional benefit that system designers and developers can integrate Curator with their existing logging infrastructures. This also reduces the complexity of Curator, since it is not responsible for log management. Our initial version of Curator acted as a centralized log management system for provenance data, making the core of Curator large and complex.

ProvToolbox (http://lucmoreau.github.io/ProvToolbox) is a Java library for working with provenance in the W3C-PROV format. It defines Java classes to represent provenance documents. ProvToolbox also provides functionality to convert between Java provenance documents and the W3C PROV formats (PROV-XML, PROV-JSON, PROV-N). Finally, it can generate images of provenance graphs using Graphviz (http://www.graphviz.org). However, we found that ProvToolbox didn't fulfill all of our needs. First, ProvToolbox depends on a large number of packages, making it challenging to integrate with microservices that have their own, sometimes conflicting, dependencies. Second, we desired an in-memory graph representation of provenance information that we could easily search and traverse and that would match the representation that we store in databases, which ProvToolbox does not provide. Third, we wanted a simpler API based on the expectation that we will need to implement that API in a variety of programming languages. These reasons led us to develop the Curator representation for provenance graphs and support for serializing and deserializing those graphs to W3C-PROV formats.

The Provenance-Aware Storage System, PASS [12], supports an API for disclosing provenance called the Disclosed Provenance API, or DPAPI. The Core Provenance Library, CPL [10], also provides an API for reporting provenance from userspace applications. The PASS kernel and userspace applications utilize the DPAPI to integrate provenance from the OS and from the applications into a single provenance record for the system. Unlike ProvToolbox, the DPAPI and CPL do not require a particular specification. Instead, they are adaptable to different specifications. However, the DPAPI also expects the system to be running the PASS kernel, which may not be feasible for all applications.

5 Future Work

Several open challenges still exist that Curator does not currently address. The first is the security of collected provenance data. Several systems have looked at ways to provide secure provenance [1, 6], and it remains to be seen how Curator integrates with these. Second, linking high-level, semantically rich provenance to low-level system provenance continues to be a challenge for all provenance systems that do not provide explicit linking between high- and low-level events [12]. As we continue to leverage Curator, these enhanced capabilities will be explored further.

Our research and development roadmap also includes adding a number of analytics and visualization capabilities to Curator. We plan to add support for streaming analytics so that provenance data is analyzed as it arrives. In both of these areas, we will follow our existing approach and use existing systems to provide the primary functionality (e.g., Storm, https://storm.apache.org) with Curator providing provenance-specific components to use with those systems. Future work towards visualization of provenance data includes an interactive provenance graph, allowing users to navigate through provenance data visually, as well as to filter and search the provenance data to find nodes or portions of the graph that match given criteria.

6 Conclusion

This paper described Curator, a toolkit designed to make it easier to integrate provenance into distributed systems. Curator provides tools for logging provenance, retrieving provenance from logs, storing provenance in databases, and analyzing and visualizing provenance. Curator is a composable set of tools where a developer can select the tools, such as provenance loggers, that best match their system. This paper also provided examples of how we used Curator in distributed systems. We have found that the design of Curator increases the likelihood that developers will add provenance to a system because it lowers the impact on that system compared to previous approaches. We also found that Curator

reduces the effort it takes to add provenance to a system. Existing systems for generating and managing provenance information do not have these advantages.

Availability

The source code for Curator can be found at https://bitbucket.org/crestlab/Curator-public.git.

Acknowledgments

The authors would like to thank the anonymous reviewers for their helpful feedback. They would also like to acknowledge the hard work of the Secure Resilient Systems and Technology Group at MIT Lincoln Laboratory, many of whom had discussions with the authors about what would make Curator most useful for their work.

References

[1] Bates, A., Tian, D., Butler, K. R. B., and Moyer, T. Trustworthy whole-system provenance for the Linux kernel. In 24th USENIX Security Symposium (USENIX Security 15) (Washington, D.C., August 2015), USENIX Association.
[2] Capobianco, F., Skalka, C., and Jaeger, T. AccessProv: Tracking the provenance of access control decisions. In 9th International Workshop on Theory and Practice of Provenance (June 2017).
[3] Chapman, A., Blaustein, B., Seligman, L., and Allen, M. PLUS: A provenance manager for integrated information. In Information Reuse and Integration (IRI), 2011 IEEE International Conference on (Aug. 2011), pp. 269–275.
[4] Gehani, A., and Tariq, D. SPADE: Support for provenance auditing in distributed environments. In Middleware 2012, P. Narasimhan and P. Triantafillou, Eds., vol. 7662 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2012, pp. 101–120.
[5] Groth, P., and Moreau, L. PROV-Overview: An Overview of the PROV Family of Documents. https://www.w3.org/TR/prov-overview/, April 2013.
[6] Hasan, R., Sion, R., and Winslett, M. The case of the fake Picasso: Preventing history forgery with secure provenance. In FAST (2009), USENIX, pp. 1–14.
[7] Hassan, W. U., Bates, A., and Moyer, T. Towards Scalable Cluster Auditing through Grammatical Inference over Provenance Graphs. In Network and Distributed System Security Symposium, NDSS 2018 (2018).
[8] Huynh, T. D., Jewell, M. O., Keshavarz, A. S., Michaelides, D. T., Yang, H., and Moreau, L. The PROV-JSON Serialization: A JSON Representation for the PROV Data Model. https://www.w3.org/Submission/2013/SUBM-prov-json-20130424/, April 2013.
[9] Lee, K. H., Zhang, X., and Xu, D. High accuracy attack provenance via binary-based execution partition. In NDSS (2013), The Internet Society.
[10] Macko, P., and Seltzer, M. A general-purpose provenance library. In 4th USENIX conference on Theory and Practice of Provenance (Berkeley, CA, USA, 2012), TaPP'12, USENIX Association, pp. 6–6.
[11] Moyer, T., and Gadepally, V. High-throughput ingest of data provenance records into Accumulo. In High Performance Extreme Computing Conference (HPEC), 2016 IEEE (2016), IEEE, pp. 1–6.
[12] Muniswamy-Reddy, K.-K., Braun, U., Holland, D. A., Macko, P., Maclean, D., Margo, D., Seltzer, M., and Smogor, R. Layering in provenance systems. In 2009 Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2009), USENIX'09, USENIX Association, pp. 10–10.
[13] Zhou, W., Fei, Q., Narayan, A., Haeberlen, A., Loo, B. T., and Sherr, M. Secure network provenance. In 23rd ACM Symposium on Operating Systems Principles (SOSP) (Oct. 2011).

A Spring Integration Channel Interceptor

Fig. 3 shows a Spring Interceptor for provenance that allows us to create wasDerivedFrom relationships between messages as the messages flow through the system. In addition, we use thread-local storage to share information between the interceptor and the microservices. The interceptor saves the message identifier of the message being handled, and a microservice can then create used or wasDerivedFrom relationships to entities contained within the message. The microservice also records the entities accessed while handling a message and then coordinates with the interceptor to add these wasDerivedFrom relationships for the outgoing message to the provenance log.

    // Imports for the Spring and Curator classes are omitted; their packages
    // depend on the versions in use.
    public class ProvChanInter extends ChannelInterceptorAdapter {
      // Provenance logger, created as in Fig. 2 (initialization omitted).
      private ProvenanceLogger logger;

      @Override
      public Message<?> preSend(Message<?> message, MessageChannel channel) {
        // Describe the message as an entity, copying each header as an attribute.
        Entity msg = new Entity(message.getHeaders().getId());
        for (String name : message.getHeaders().keySet()) {
          msg.setAttribute(name, message.getHeaders().get(name).toString());
        }
        logger.log(msg);
        // Link this message to the message it was derived from, if there is one.
        if (message.getHeaders().containsKey("previousId")) {
          logger.log(new WasDerivedFrom(message.getHeaders().getId(),
                                        message.getHeaders().get("previousId")));
        }
        // Carry this message's id forward so the next message can link back to it.
        return MessageBuilder.fromMessage(message)
            .setHeader("previousId", message.getHeaders().getId().toString())
            .build();
      }
    }

Figure 3. Spring channel interceptor to log provenance.
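The thread-local hand-off between the interceptor and a microservice described in this appendix can be illustrated with a minimal sketch. It is not taken from the Curator source: the class ProvenanceContext, its method names, and the use of plain strings for identifiers and statements are all assumptions made for illustration, and the sketch only works when the interceptor and the microservice run on the same thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical per-thread hand-off between a channel interceptor and the
 * microservice that handles a message. Not part of the Curator API.
 */
final class ProvenanceContext {
    // Identifier of the message currently being handled on this thread.
    private static final ThreadLocal<String> CURRENT_MESSAGE = new ThreadLocal<>();
    // Entities the microservice has accessed while handling that message.
    private static final ThreadLocal<Set<String>> ACCESSED =
            ThreadLocal.withInitial(LinkedHashSet::new);

    private ProvenanceContext() { }

    /** Interceptor side: remember which message this thread is handling. */
    static void beginMessage(String messageId) {
        CURRENT_MESSAGE.set(messageId);
        ACCESSED.get().clear();
    }

    /** Microservice side: id of the message being handled, or null if none. */
    static String currentMessageId() {
        return CURRENT_MESSAGE.get();
    }

    /** Microservice side: note an entity read while handling the message. */
    static void recordAccess(String entityId) {
        ACCESSED.get().add(entityId);
    }

    /**
     * Interceptor side: build one "derived wasDerivedFrom source" statement per
     * recorded entity for an outgoing message, then reset this thread's state.
     */
    static List<String> derivationsFor(String outgoingMessageId) {
        List<String> statements = new ArrayList<>();
        for (String source : ACCESSED.get()) {
            statements.add(outgoingMessageId + " wasDerivedFrom " + source);
        }
        CURRENT_MESSAGE.remove();
        ACCESSED.remove();
        return statements;
    }
}

final class ProvenanceContextDemo {
    public static void main(String[] args) {
        ProvenanceContext.beginMessage("msg-1");      // interceptor, on receive
        ProvenanceContext.recordAccess("entity-a");   // microservice, while working
        ProvenanceContext.recordAccess("entity-b");
        // Interceptor, on send: these statements would go to the provenance log.
        for (String statement : ProvenanceContext.derivationsFor("msg-2")) {
            System.out.println(statement);
        }
    }
}
```

Resetting the thread-local state after each outgoing message matters in practice because worker threads are usually pooled, so stale entries would otherwise leak into the provenance of unrelated messages.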

B Graphical Interface for Visualization Tool

Figure 4 shows the web application that can display provenance graphs.

Figure 4. Example of the visualizer web application.
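To make the ancestor and descendant queries of Section 2.3 concrete, the following is a minimal in-memory sketch of the kind of traversal a graph explorer could issue. It is not the Curator Query interface: the class ProvGraphSketch, its methods, and the use of strings as vertex identifiers are hypothetical. The sketch assumes that edges point from the influenced vertex to its influencer, as PROV relations such as used and wasDerivedFrom do.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory provenance graph; not part of the Curator API. */
final class ProvGraphSketch {
    // influencers.get(v): vertices that v points to (what influenced v).
    private final Map<String, Set<String>> influencers = new HashMap<>();
    // influenced.get(v): vertices that point to v (what v influenced).
    private final Map<String, Set<String>> influenced = new HashMap<>();

    /** Adds an edge from the influenced vertex to its influencer. */
    void addEdge(String from, String to) {
        influencers.computeIfAbsent(from, k -> new LinkedHashSet<>()).add(to);
        influenced.computeIfAbsent(to, k -> new LinkedHashSet<>()).add(from);
    }

    /** Everything that directly or indirectly influenced the vertex. */
    Set<String> ancestors(String vertex) {
        return reach(vertex, influencers);
    }

    /** Everything that the vertex directly or indirectly influenced. */
    Set<String> descendants(String vertex) {
        return reach(vertex, influenced);
    }

    // Breadth-first search over the given adjacency map, excluding the start.
    private static Set<String> reach(String start, Map<String, Set<String>> adjacency) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            for (String next : adjacency.getOrDefault(current, Collections.emptySet())) {
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        seen.remove(start);
        return seen;
    }
}

final class ProvGraphSketchDemo {
    public static void main(String[] args) {
        ProvGraphSketch graph = new ProvGraphSketch();
        graph.addEdge("transform", "input");   // transform used input
        graph.addEdge("output", "transform");  // output wasGeneratedBy transform
        graph.addEdge("output", "input");      // output wasDerivedFrom input
        System.out.println("ancestors of output: " + graph.ancestors("output"));
        System.out.println("descendants of input: " + graph.descendants("input"));
    }
}
```

Keeping both a forward and a reverse adjacency map lets the same search serve both directions, at the cost of storing each edge twice. A database-backed implementation would more likely push the traversal into indexed lookups.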