
Koji: Automating pipelines with mixed-semantics data sources

Petar Maymounkov, Aljabr, Inc., [email protected]

ABSTRACT

We propose a new result-oriented semantic for defining data-processing workflows that manipulate data in different semantic forms (files or services) in a unified manner. This approach enables users to define workflows for a vast variety of reproducible data-processing tasks in a simple declarative manner which focuses on application-level results, while automating all control-plane considerations (like failure recovery without loss of progress and computation reuse) behind the scenes.

The uniform treatment of files and services as data enables easy integration with existing data sources (e.g. data acquisition APIs) and sinks of data (e.g. database services), whereas the focus on containers as transformations enables reuse of existing data-processing systems.

We describe a declarative configuration mechanism, which can be viewed as an intermediate representation (IR) of reproducible data-processing pipelines, in the same spirit as, for instance, TensorFlow [12] and ONNX [16] utilize IRs for defining tensor-processing pipelines.

1 INTRODUCTION

1.1 History

The introduction of MapReduce [4] by Google arguably marked the beginning of programmable large-scale data processing. MapReduce performs a transformation of one set of large files (the input) into another (the output). Since the transformation provided by a MapReduce is a primitive (a many-to-many shuffle, followed by an element-wise map), it became common practice to chain multiple MapReduce transformations in a pipeline.

The dataflow in such a pipeline is cleanly captured by a directed acyclic graph (DAG), whose vertices represent transformations and whose edges represent files.

In a twist, it became commonplace to query a service (usually a key-value lookup service) from inside the mapper function. For instance, this technique is used to join two tables by mapping over one of them and looking up into the other. More recently, Machine Learning systems have been serving trained models as lookup services, which are used by MapReduce mappers in a similar fashion.

With this twist, a MapReduce transformation no longer depends just on input files but also on lookup services (and their transitive dependencies, which are usually other files). The simple dataflow model mentioned previously no longer applies.

To the best of our knowledge, no dataflow model has been proposed to capture this scenario. Yet, to this day, this type of mixed-semantic (file and service) pipeline represents the most common type of off-line batch-processing workflow.

Due to the lack of a specialized formalism for describing such pipelines and a tool for executing them, they are currently codified in a variety of error-prone ways, which usually amount to the use of data-unaware task execution pipelines. We address this gap here.

1.2 Problem

We address a class of modern Machine Learning and data-processing pipelines.

Such pipelines transform a set of input files through chains of transformations, provided by open-source software (OSS) for large-scale computation (like TensorFlow [12], Apache Beam [6], Apache Spark [8], etc.) or by user-specific implementations (usually provided as executable containers). In short, these pipelines use mixtures of disparate OSS technologies tied into a single coherent data flow.

At present, such workflows are frequently built using task-driven pipeline technologies (like Apache Airflow [5]) which execute tasks in a given dependency order, but are unaware of the data passed from one task to the next. The lack of data-flow awareness in current solutions prevents large data-processing pipelines from benefiting from caching and reuse of computation, which could provide significant efficiency gains in these industry cases:

• Restarting long-running pipelines after failure and continuing from where previous executions left off.
• Re-running pipelines with incremental changes, during developer iterations.
• Running concurrent pipelines which share logic, i.e. compute identical data artifacts within their workflows.
Furthermore, task-centric technologies make it impractically hard to integrate data and computation optimizations like:

• in-memory storage of intermediate results which are not cacheable, or
• context-specific choices of job scheduling and placement algorithms.

1.3 Solution

We propose a new pipeline semantic (and describe its system architecture, which can be realized on top of Kubernetes) based on a few key design choices:

• Result-oriented specification: The goal of workflows is to build data artifacts. Workflows are represented as dependency graphs over artifacts. Input artifacts are provided by the caller. Intermediate artifacts are produced through a computation, using prior artifacts. Output artifacts are returned to the caller. This view is entirely analogous to the way software build systems (like Bazel [11] and UNIX make) define software build dependencies.

• Unified treatment of data and services: We view file artifacts and service artifacts in a unified way, as resources. This allows us to describe complex workflows which mix-and-match batch and streaming computations (the latter being a special case of services). Furthermore, this enables us to automate service garbage-collection and achieve optimal computation reuse (via caching) across the entire pipeline. The resource-level unified view of files and services purports to be the Goldilocks level of coarse data knowledge that is needed by a dataflow controller to automate all file caching and service control considerations.

• Type-safe declarative specification: We believe that workflow specification has to be declarative, i.e. representable via a typed schema (like e.g. Protocol Buffers). This provides full decoupling from implementations, and serves as a reproducible assembly language for defining pipelines.

• Decoupling of dataflow from transform implementation: We decouple the specification of application logic from the definition of how data transforms are performed by underlying backend technologies. Application logic comprises the dependency graph between artifacts, and the data transform at each node. Data transforms are viewed uniformly, akin to functions from a library of choices. The methods for invoking transformation-backing technologies (like MapReduce, TensorFlow, etc.) are implemented separately as driver functions, and surfaced as a library of declarative structures that can be used in application logic.

• Extensible transformations: New types of data transforms (other than container-execution based) can be added easily. This is done in two parts. First, a simple driver function implements the specifics of calling the underlying technology. Second, a new transformation structure is added to the application logic schema. This extension mechanism is reserved for transformations that cannot be containerized.

• Scheduler- and storage-independent design: Application logic governs the order in which data computations must occur. However, the choice of job scheduling (or placement) algorithms, as well as the choice of storage technologies (e.g. disk versus memory volumes), is entirely orthogonal to the application's dataflow definition. Our architecture enables flexible choice of the relevant technology on a per-node (scheduling) and per-edge (storage) basis. For instance, some intermediate files can be stored in memory volumes, instead of on disk, to increase efficiency.

We aim for this to be a pipeline technology which can perform data transformations based on any software available, as OSS-for-Linux or as-a-Service. This goal informs our choice of Kubernetes as the underlying infrastructure and cluster management technology for coordinating the orchestration of backend execution runtimes on single- or multi-tenant physical or (multi)cloud computing resources.

Kubernetes [13], which is becoming the industry-standard cluster OS, is ubiquitously available on most cloud providers, can be run on-premises, and is generally provider-agnostic from the users' standpoint. Kubernetes benefits from having mature integrations with the Docker (and other) container ecosystems, providing seamless out-of-the-box access to many data tools, thanks to the rich ecosystem of operator implementations [14].

1.4 How it works

The user defines a pipeline in a language-agnostic manner. A pipeline definition describes: (i) input data resources and their sources, (ii) intermediate resources and the data transformation that produces them from dependent resources, and (iii) output data resources and where they should be delivered.

• Resources are files (or services) and their format (or protocol) can be optionally specified to benefit from type-safety checks over the dataflow graph, thanks to declarations.
• Input resources can be provided by various standard methods (volume files, existing cluster services, Amazon S3 buckets, and so on). Data source types can be added seamlessly.
• Intermediate resources are described as files (or services), optionally with a specified format (or protocol). Their placement location is not provided by the user, in order to enable the pipeline controller to make optimal choices in this regard and to manage caching placement decisions.
• Output resources can specify the location where they should be delivered, with standard (and extensible) options similar to the case for input resources.

The transformations at the nodes of the dataflow graph consume a set of input resources to produce new resources. Transformations are exposed to the user with a clean application-level interface. A transformation:
• Consumes a set of named input resources (the arguments), which can be fulfilled by prior resources in the user's dataflow program.
• Produces a set of named output resources. The latter can be referenced (by name) by dependent transformations, downstream in the dataflow program.
• Accepts a transformation-specific set of parameters. For instance, a TensorFlow transformation may require a TensorFlow program (e.g. as a Python or Protocol Buffer file).

The user programs (in Python or Go) which build the pipeline dataflow are used to generate a Protocol Buffer (or YAML) file, which captures the entire pipeline and is effectively executable and reproducible.

Pipelines can be executed either from the command line or by sending them off as a background job to Kubernetes, using the operator pattern (via a CRD).

2 AN EXAMPLE

A typical modern Machine Learning pipeline produces interdependent files (e.g. data tables) and services (e.g. trained model servers) at multiple stages of its workflow.

The following example captures all semantic aspects of a modern ML/ETL pipeline:

[Figure 1 omitted: an example workflow which uses different technologies. Solid edges represent file resources; the dashed edge represents a service resource. The diagram shows input tables TRAIN and BUSINESS, a TensorFlow Train step producing MODEL, a TensorFlow Serve step providing the LOOKUP service, and an Apache Beam Combine step producing INSIGHT.]

• INPUTS: The pipeline expects two input resources from its caller: a training table, called TRAIN, and a table of business data, called BUSINESS.
• STEP 1: Table TRAIN is used as input to a Machine Learning procedure, e.g. TensorFlow training, to produce a new table, which we call MODEL. This is a batch job: it processes an input file into an output file.
• STEP 2: Table MODEL is then used as input to bring up a Machine Learning model server, e.g. TensorFlow Serve. The server loads the trained model in memory and starts a model-serving API service at a network location, which we call SERVICE.
• STEP 3: Table BUSINESS together with service SERVICE are used as input to an application-specific MapReduce job, which annotates every record in BUSINESS with some insight from SERVICE and outputs the result as table INSIGHT.
• OUTPUT: Table INSIGHT is the result of the pipeline.

A few things are notable here:

(N1) The pipeline's inputs, intermediate results and outputs are either files or services, which we collectively call resources.
(N2) The pipeline program describes a dependency graph between the resources: MODEL depends on TRAIN, SERVICE depends on MODEL, and INSIGHT depends on BUSINESS and SERVICE.
(N3) The outputs of pipeline steps (be it the content of files produced, or the behavior of services rendered) depend deterministically on their inputs.
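To make the wiring concrete, the following Go sketch builds this example pipeline as a dependency graph. It is a minimal, self-contained illustration: the Pipeline, Step and Input types are simplified stand-ins for the schema in Appendix A, and the transform identifiers are hypothetical names, not part of any published Koji API.

    package main

    import "fmt"

    // Minimal stand-ins for the pipeline schema (see Appendix A); simplified for illustration.
    type Input struct {
        Name         string // input name expected by the step's transform
        ProviderStep string // label of the step that produces the resource
    }

    type Step struct {
        Label     string  // unique step label
        Transform string  // hypothetical transform identifier (argument, return, container, ...)
        Inputs    []Input // which upstream outputs fulfill this step's inputs
    }

    type Pipeline struct{ Steps []Step }

    func main() {
        p := Pipeline{Steps: []Step{
            // Pipeline arguments: resources supplied by the caller.
            {Label: "TRAIN", Transform: "argument"},
            {Label: "BUSINESS", Transform: "argument"},
            // STEP 1: batch training job, file in, file out.
            {Label: "MODEL", Transform: "tensorflow-train",
                Inputs: []Input{{Name: "examples", ProviderStep: "TRAIN"}}},
            // STEP 2: model server, file in, service out.
            {Label: "SERVICE", Transform: "tensorflow-serve",
                Inputs: []Input{{Name: "model", ProviderStep: "MODEL"}}},
            // STEP 3: batch annotation job, file + service in, file out.
            {Label: "INSIGHT", Transform: "beam-combine",
                Inputs: []Input{
                    {Name: "records", ProviderStep: "BUSINESS"},
                    {Name: "lookup", ProviderStep: "SERVICE"},
                }},
            // Pipeline return value, delivered to the caller.
            {Label: "OUTPUT", Transform: "return",
                Inputs: []Input{{Name: "result", ProviderStep: "INSIGHT"}}},
        }}
        for _, s := range p.Steps {
            fmt.Println(s.Label, "<-", s.Inputs)
        }
    }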

To summarize, this view of a data-processing pipeline captures resource dependencies and resource semantics (files or services), while treating computations as black-box deterministic procedures (provided by containers, in practice).

This coarse container/resource-level view of a pipeline suffices to automate pipeline execution optimally and achieve significant compute and space efficiencies in common day-to-day operations. Let us illustrate this with two examples:

• Example 1: Suppose that, during execution, the pipeline completes steps 1 and 2, then fails during step 3 due to a hardware dysfunction. Due to the determinism (N3) of pipeline steps, it is possible to cache the file results of intermediate computations, in this case table MODEL, so they can be reused. When the pipeline is restarted after its failure, the caching mechanism enables it to skip step 1 (a costly training computation), proceed directly to restarting the service in step 2 (which takes negligible time), and then renew the interrupted computations in step 3.

• Example 2: In another example, suppose the pipeline is executed successfully with inputs BUSINESS1 and TRAIN. On the next day, the user executes the same pipeline with inputs BUSINESS2 and TRAIN, due to updates in the business table. The change in the BUSINESS table does not affect the computations in steps 1 and 2 of the pipeline. Therefore, just as in the previous example, an optimal pipeline would skip these steps and proceed to step 3.

3 RELATED WORK: PIPELINE TAXONOMY

Here we position the pipeline technology proposed in this paper against related technologies in the OSS ecosystem.

For the sake of our comparison, we identify two types of pipeline/workflow technologies: task-driven and data-driven. Additionally, data-driven pipelines are subdivided into coarse-grain and fine-grain types.

Task-driven pipeline technologies target the execution of a set of user tasks, each provided by an executable technology (e.g. binary or container), according to a dependency graph order. A task executes only after its dependencies have finished successfully. Task-driven pipelines provide simple (usually per-task) facilities for recovering from failure conditions, like restart rules. In general, task-driven pipelines are not aware of the flow of data (or services) provided by earlier tasks to later ones.

Data-driven pipeline technologies aim to define and perform reproducible transformations of a set of input data.
The input is usually consumed either from structured files (representing things like tables or graphs) located on a cluster file system, or from databases available as services. The outputs are produced in a similar fashion. Data transformations are specified in the form of a directed acyclic dataflow graph, comprising data transformations at the vertices.

This taxonomy, and where common OSS technologies fall within it, is summarized below:

                   Task-driven                Data-driven
    Fine-grain     -                          MapReduce, Spark, Beam, Storm, Heron, Dask
    Coarse-grain   Airflow, Argo, Brigade     Reflow, Dagster, this paper

The granularity at which the vertex transformations are programmed by the pipeline developer determines whether a data-driven pipeline is fine-grain or coarse-grain.

In fine-grain technologies, transformations manipulate data down to the level of arithmetic and data-structure manipulations of individual records and their entries. Such transformations are necessarily specified in a general programming language, where records are typically represented by language values (like structures, classes, arrays, etc.).

In coarse-grain technologies, transformations manipulate batches of data, which usually represent high-level concepts like tables, graphs, machine learning models, etc. Such transformations are specified by referring to pre-packaged applications which perform the transformations when executed. Coarse-grain transformations are typically programmed using declarative configuration languages, which aim at describing the dataflow graph and how to invoke the transformations at its vertices.

Data-driven technologies (fine- or coarse-grain) are aware of the types of data that flow along edges, as well as of the semantics of the transformations at the vertices (e.g. how they modify the data types). This allows them to implement efficient recovery from distributed failures without loss of progress, and more generally efficient caching everywhere.

The pipeline technology proposed here belongs to the bucket of data-driven, coarse-grain pipelines, which is largely unoccupied in the OSS space.
For comparison:

Apache Airflow [5] is task-driven with a configuration-as-code (Python) interface. Argo is task-driven with a declarative YAML configuration interface. Brigade is task-driven with an imperative, event-based programming interface.

MapReduce implementations, like Gleam, are fine-grain data-driven. Apache Spark [8] is a fine-grain data-driven pipeline with interfaces in Scala, Java and Python. Apache Beam [6] is a fine-grain data-driven pipeline with interfaces in Go, Java and Python. Apache Storm [9] is a fine-grain data-driven programmable pipeline for distributed real-time/streaming computations in Java. Apache Heron [7] is a real-time, distributed, fault-tolerant stream processing engine from Twitter; it is fine-grain data-driven and a direct successor to Apache Storm.

Another interesting take on fine-grain data-driven technologies has emerged in the Python community, where distributed pipeline programming has been disguised in the form of "array" object manipulations. The Dask [3] project is one such example, where operations over placeholder NumPy arrays or Pandas collections are distributed transparently. Another example is the low-level PyTorch [17] package for distributed algorithms [18], which allows for passing PyTorch tensors across workers.

Dagster [2] is a coarse-grain data-driven pipeline with a front-end in Python. It differs from Koji in that its architecture is not adequate for supporting service-based resources (and their automation, caching and garbage-collection). Aside from its front-end language choice, Dagster is very similar to Reflow, described next.

Reflow [19] is the only coarse-grain data-driven pipeline we found. Reflow was designed with a specific use in mind, which makes it fall short of being applicable in a more general industry setting. Reflow does not seem to manage data flow across service dependencies, as far as we can tell. It is AWS-specific, rather than platform-agnostic. It is based on a new DSL, making it hard to inter-operate with industry practices like configuration-as-code.

Outside of the OSS space, one can find a variety of emerging closed-source or domain-specific monolithic solutions to coarse-grain data-driven workflows. One such example is the bio-informatics product line of LifeBit [15].

4 SEMANTICS

In this section, we discuss the proposed semantics.

4.1 Representation

4.1.1 Dataflow topology. A data-processing pipeline is represented as a directed acyclic graph, whose vertices and edges are called steps and dependencies, respectively.

• Every pipeline vertex (i.e. step) has an associated set of named input slots and a set of named output slots.
• Every directed pipeline edge (i.e. dependency) is associated with (1) an output slot at its source vertex, and (2) an input slot at its sink vertex.

Output slots can have multiple outbound edges, reflecting that the output of a step can be used by multiple dependent steps. Input slots, on the other hand, must have a unique inbound edge, reflecting that a step input argument is fulfilled by a single upstream source.

4.1.2 Steps and transformations. In addition to their graph structure, steps and dependencies are associated with computational meaning.

Each dependency (i.e. directed graph edge) is associated with a resource, which is provided by the source step and consumed by the sink step (of the dependency edge). Resources are analogous to types in programming languages: they provide a "compile-time" description of the data processed by the pipeline at execution time. Pipeline resource descriptions capture both the data semantics (file or service) as well as the data syntax (file format or service protocol).

Each step (i.e. graph vertex) is associated with a (description of a) transform. A transform is a computational procedure which, at execution time, consumes a set of input resource instances and produces a set of output resource instances, whose names and resource types are as indicated by the inbound and outbound edges of the pipeline step.

There are two distinguished transform (i.e. vertex) types, called argument and return transforms, which are used to designate the inputs and outputs of the pipeline itself. Argument transforms have no input dependencies and a single output dependency. Return transforms have a single input dependency and no output dependencies.

Steps which are not based on argument or return transforms are called intermediate.
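A minimal Go sketch of these topology constraints (the types and field names are ours, for illustration only): each dependency edge names an output slot on its source step and an input slot on its sink step, and a well-formed graph admits fan-out on output slots but at most one inbound edge per input slot.

    package main

    import "fmt"

    // Dependency is one directed edge of the pipeline graph (illustrative types).
    type Dependency struct {
        SourceStep, OutputSlot string // provider side: output slots may fan out
        SinkStep, InputSlot    string // consumer side: input slots take exactly one edge
    }

    // checkTopology enforces the rule that every input slot has a unique inbound edge.
    func checkTopology(edges []Dependency) error {
        seen := map[string]bool{}
        for _, e := range edges {
            key := e.SinkStep + "/" + e.InputSlot
            if seen[key] {
                return fmt.Errorf("input slot %s has more than one inbound edge", key)
            }
            seen[key] = true
        }
        return nil
    }

    func main() {
        edges := []Dependency{
            {"MODEL", "model", "SERVICE", "model"},
            {"BUSINESS", "table", "INSIGHT", "records"},
            {"SERVICE", "api", "INSIGHT", "lookup"},
        }
        fmt.Println(checkTopology(edges)) // <nil>: the example graph is well-formed
    }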

4.2 Execution model

When a pipeline is executed by a caller (either a human operator or through programmatic control), a pipeline controller is allocated to dynamically manage the execution of the pipeline towards the goal of delivering the pipeline's return resources to the caller.

The key technical challenge in designing the pipeline control logic is to devise a generic algorithm which is robust against process failures, while also accommodating the semantic differences between file and service resources:

• File resources are considered available after the successful termination of the transformation process that produces them;
• Service resources are considered available during the execution of the transformation process that produces them.

4.2.1 Control algorithm. The pipeline execution algorithm, performed by the pipeline controller, associates two state variables with each dependency (edge) in the pipeline graph.

• A variable that assumes one of the states "available" or "not available" indicates whether the underlying resource (file or service) is currently available. This variable is written by the supervisor (see below) of the step producing the dependency, and read by the supervisor of the step consuming the dependency.
• A variable that assumes one of the states "needed" or "not needed" indicates whether the underlying resource (file or service) is currently needed. This variable is written by the supervisor of the step consuming the dependency, and read by the supervisor of the step producing the dependency.

On execution, the pipeline controller proceeds as follows:

(1) Mark the state of each input dependency to a return step as "needed". These dependencies will remain "needed" until the pipeline is terminated by the caller.
(2) For each intermediate step in the pipeline graph, create a step supervisor, running in a dedicated process (or coroutine).

Every step supervisor comprises two independent sub-processes: a driver loop and a (process) collector loop.

The driver loop is responsible for sensing when the outputs of the supervised step are needed dynamically (by dependent steps) and for arranging to make them available:

(1) Repeat:
  (a) If the step has no output dependencies which are "needed" and "not available", then goto (1).
  (b) Otherwise:
    (i) Mark all input dependencies of the step as "needed".
    (ii) Wait until all input dependencies become "available".
    (iii) Start the container process associated with this step.
    (iv) After starting the process, mark all service output dependencies of this step as "available".
    (v) Wait until the process terminates:
      (A) If the termination state is "succeeded" (implying that all output files have been produced), then: mark all file output dependencies of this step as "available" (since output files are cached, these dependencies will remain in state "available"); mark all input dependencies of this step as "not needed". Goto (1).
      (B) If the termination state is "failed", then: goto (1).
      (C) If the termination state is "killed" (by the collector loop), then: mark all service output dependencies of this step as "unavailable"; mark all input dependencies of this step as "not needed". Goto (1).

The (process) collector loop is responsible for sensing when the outputs of the supervised step (there is a collector for each step in the pipeline) are not needed any longer, and for arranging to garbage-collect its process:

(1) Repeat:
  (a) If the step process is running and all of the following conditions hold, then kill the process:
    (i) all file output dependencies of the step are either "needed" and "available", or "not needed"; and
    (ii) all service output dependencies of the step are "not needed".
  (b) Goto (1).
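The following Go sketch captures the two per-edge state variables and the conditions that drive the two loops above. It is a simplified illustration under our own naming, not the controller's actual implementation.

    package main

    // EdgeState holds the two state variables the controller keeps per dependency edge.
    type EdgeState struct {
        Available bool // written by the producing step's supervisor
        Needed    bool // written by the consuming step's supervisor
        IsService bool // file edges stay available once cached; service edges do not
    }

    // driverShouldRun reports whether the driver loop must (re)start the step's process:
    // some output is needed but not yet available.
    func driverShouldRun(outputs []EdgeState) bool {
        for _, o := range outputs {
            if o.Needed && !o.Available {
                return true
            }
        }
        return false
    }

    // collectorShouldKill reports whether the collector loop may garbage-collect the
    // running process: every file output is either satisfied or not needed, and no
    // service output is needed any longer.
    func collectorShouldKill(outputs []EdgeState) bool {
        for _, o := range outputs {
            if o.IsService {
                if o.Needed {
                    return false
                }
            } else if o.Needed && !o.Available {
                return false
            }
        }
        return true
    }

    func main() {
        outs := []EdgeState{{Needed: true, IsService: true}}
        _ = driverShouldRun(outs)     // true: a dependent step needs the service
        _ = collectorShouldKill(outs) // false: the service is still needed
    }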

4.3 Pure functions and causal hashing of content

Most data-processing pipelines in industry are required, by design, to have reproducible and deterministic outcomes. This includes workflows such as Machine Learning, banking and finance, canarying, software build systems, continuous delivery and integration, and so on.

In all reproducible pipelines, by definition, step transformations are pure: the outcomes (files output or services provided) obtained from executing pure transformations are entirely determined by the inputs provided to them and the identity (i.e. program description) of the transformation.

By contrast, non-reproducible pipelines are ones where transformation outcomes might additionally be affected by:

• runtime information (like the value of the wall clock or the temperature of the CPU), or
• interactions with an external stateful entity (like a disk, a persistent store service, or outside Internet services, for instance).

4.3.1 Caching. In the case of reproducible pipelines (comprising pure transformations), pipeline execution can benefit from dramatic efficiency gains (in computation, communication and space), using a simple technique we call causal caching.

The results of pipeline steps which are based on purely deterministic transformations can be cached to obtain significant efficiency gains in the following situations:

(1) avoiding duplicate computations when restarting partially-executed pipelines, for instance, after a hardware failure;
(2) multiple executions of the same pipeline, perhaps by different users, concurrently or at different times;
(3) executions of pipelines that have similar structure, for instance, as is the case with re-evaluating the results of multiple incremental changes of the same base pipeline during development iterations.

The caching algorithm assigns a number, called a causal hash, to every edge of the computation graph of a pipeline. These hash numbers are used as keys in a cluster-wide caching file system.

To serve their purpose as cache keys for the outputs of pipeline steps, causal hashes have to meet two criteria:

(C1) A causal hash has to have the properties of a content hash: if the causal hashes of two resources are identical, then the resources must be identical.
(C2) A causal hash has to be computable before the resource it describes has been computed by executing the step transform that produces it.

To meet these criteria, we define causal hashes in the following manner:

(1) The causal hashes of the resources passed as inputs to the pipeline are to be provided by the caller of the pipeline. Criterion (C2) does not apply to input resources, thus any choice of content hashing algorithm, like an MD5 message digest or a semantic hash, suffices.
(2) All other pipeline edges correspond to resources output by a transformation step. In this case, the causal hash of the resource is defined recursively, as the message digest (e.g. using SHA-1) of the following meta-information:
  (a) the pairs of name and causal hash for all inputs to the step transformation,
  (b) the identity (or program description) of the transformation, and
  (c) the name of the transformation output associated with the edge.

Note that while only file resources can be cached (on a cluster file system), service resources can also benefit from caching. For instance, a service resource in the middle of a large pipeline can be made available if the file resources it depends on have been cached from a prior execution.
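A minimal Go sketch of the recursive definition above, assuming the causal hashes of the pipeline's inputs are already known. The exact serialization of the digested meta-information is our own choice here; the definition only requires that it cover items (a), (b) and (c).

    package main

    import (
        "crypto/sha1"
        "fmt"
        "sort"
    )

    // causalHash computes the causal hash of one output edge of a step: a SHA-1 digest
    // over (a) the sorted (input name, input causal hash) pairs, (b) the transform's
    // identity (program description), and (c) the name of the output.
    func causalHash(inputHashes map[string]string, transformID, outputName string) string {
        h := sha1.New()
        names := make([]string, 0, len(inputHashes))
        for name := range inputHashes {
            names = append(names, name)
        }
        sort.Strings(names) // digest inputs in a canonical order
        for _, name := range names {
            fmt.Fprintf(h, "input:%s=%s\n", name, inputHashes[name])
        }
        fmt.Fprintf(h, "transform:%s\n", transformID)
        fmt.Fprintf(h, "output:%s\n", outputName)
        return fmt.Sprintf("%x", h.Sum(nil))
    }

    func main() {
        // MODEL's hash is computable before training runs, from TRAIN's hash alone.
        model := causalHash(map[string]string{"examples": "md5:..."}, "tensorflow-train:v1", "model")
        // INSIGHT's hash chains the hashes of its inputs, including the service edge.
        service := causalHash(map[string]string{"model": model}, "tensorflow-serve:v1", "api")
        insight := causalHash(map[string]string{"records": "md5:...", "lookup": service}, "beam-combine:v1", "result")
        fmt.Println(insight)
    }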
4.3.2 Locking and synchronization. Pipeline semantics make it possible to execute multiple racing pipelines in the same cluster, while ensuring they utilize computational resources optimally.

Two different pipeline graphs can entail similar transformations, in the sense of a common computational subgraph appearing in both pipelines. This situation occurs, for instance, as a developer iterates over pipeline designs incrementally, producing many similar designs.

A causal cache (as described earlier) shared between concurrent pipelines enables one pipeline to reuse the computed output of an identical step that was already computed by the other pipeline.

We accomplish cache sharing across any number of concurrently executing pipelines by means of per-causal-hash cluster-wide locking. In particular, the controller algorithm for executing a pipeline transformation step is augmented as follows:

(1) Compute the causal hashes, H, of the step outputs.
(2) Obtain a cluster-wide lock on H.
(3) Check whether the output resources (files) have already been cached in a designated caching file system:
  (a) If so, then release the lock on H and reuse the cached resources.
  (b) Otherwise, execute the step transformation, cache its outputs, release the lock on H and return the output resources.

4.3.3 Composability. The reader will note that a pipeline can itself be viewed as a transform: it has a set of named inputs (the arguments), a set of named outputs (the return values) and a description of an executable procedure (the graph). Consequently, one pipeline can be invoked as a step transformation in another.

This generic and modular flexibility enables developers to create pipeline templates for common workflows, like a canarying workflow or an ML topology, and reuse those templates as building blocks in multiple applications.

5 ARCHITECTURE

Our goal here is to describe an architecture for a data-processing pipeline system, and to outline an implementation strategy that works well with available OSS software.

We focus on an approach that uses Kubernetes as the underlying infrastructure, due to its ubiquitous deployments in commercial clouds.

The pipeline execution logic itself is implemented as a Go library, which can execute a pipeline given runtime access to a Kubernetes cluster and a user pipeline specification. Pipeline executions can be invoked through standard integration points: (a) using a command-line tool by passing a pipeline description file, (b) using a Kubernetes controller (via CRD), or (c) from any programming language, by sending pipeline configurations for execution to the controller interface in (b).

The approach in (c) is sometimes called configuration-as-code and is a common practice. For instance, TensorFlow and PyTorch are Python front-ends for expressing tensor pipelines. Using general imperative languages to express pipelines has proven suboptimal for various reasons. For one, pipelines (in the sense of DAG data flows) correspond to immutable functional semantics (not imperative mutable ones). Furthermore, configuration-as-code libraries have not been able to deliver type-safety at compile time. To solve both of these problems, we have designed a general functional language, called Ko [1], which allows for concise type-safe functional-style expression of pipelines.
5.1 Type checks before execution

The pipeline specification schema allows the user to optionally specify more detailed "type" information about the resources input to or output by each transformation in a dataflow program.

For file resources, this type information can describe the underlying file and its data at various levels of precision. It could specify a file format (e.g. CSV or JSON), an encoding (e.g. UTF-8) and a data schema (e.g. provided as a reference to a Protocol Buffer or XML schema definition).

For service resources, analogously, the user can optionally describe the service type to a varying level of detail: transport layer (e.g. HTTP over TCP), security layer (e.g. TLS), RPC semantics (e.g. GRPC), and protocol definition (e.g. a reference to a Protocol Buffer or an OpenAPI specification).

When such typing information is provided, the pipeline controller is able to check the user's dataflow programs for type-safety before it commits to what is often a lengthy execution.
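A small Go sketch of the kind of pre-execution check this enables, following the exact-match convention for opaque format and encoding identifiers described in Appendix A. The types are simplified stand-ins, and the treatment of unspecified fields is our assumption.

    package main

    import "fmt"

    // FileType is optional type information attached to a file resource (simplified).
    type FileType struct {
        Format   string // e.g. "csv", "json"; empty means unspecified
        Encoding string // e.g. "utf-8"; empty means unspecified
    }

    // fulfills reports whether a provider's declared output type satisfies a consumer's
    // declared input type: identifiers are opaque strings checked for exact match,
    // and a field left unspecified on either side is not checked.
    func fulfills(out, in FileType) bool {
        if in.Format != "" && out.Format != "" && in.Format != out.Format {
            return false
        }
        if in.Encoding != "" && out.Encoding != "" && in.Encoding != out.Encoding {
            return false
        }
        return true
    }

    func main() {
        produced := FileType{Format: "csv", Encoding: "utf-8"}
        expected := FileType{Format: "json"}
        // The controller can reject this wiring before any container is started.
        fmt.Println(fulfills(produced, expected)) // false
    }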
5.2 Resource management and plumbing

At the programming level, the user directly connects the outputs of one transformation to the inputs of another. At runtime, however, these intermediate resources, be they files or services, need to be managed.

5.2.1 Files. For intermediate file resources, generally, the pipeline controller will determine the placement of files on a cluster volume and will connect these volumes as necessary to containers requiring access to the files.

For instance, assume transformation A has an output that is connected to an input of transformation B. At runtime, the controller will choose a volume for placing the file produced by A and consumed by B. It will attach this volume to the container for A during its execution, and then it will attach the volume (now containing the produced file) to the container for B. Plumbing details such as passing execution flags to containers are handled as well.

Of course, this is a basic example. The file management logic can be extended with hooks to fulfill various optimization and policy needs, such as:

• plugging in third-party file placement algorithms that optimize physical placement locality, or
• placing non-cacheable resources on memory-backed volumes.

The file management layer also contains the causal-hash-based caching of files (described in the previous section).

5.2.2 Services. For intermediate service resources (provided from one transformation to the next), the pipeline controller handles plumbing details transparently as well. Generally, it takes care of creating DNS records for services, and of coordinating container server addresses and flag-passing details.

As with files, services between a server and a client transformation can be customized via hooks to address load-balancing, re-routing, authentication, and other such concerns.

5.3 Transformation backends

A transformation is, generally, any process execution within the cluster which accepts files or services as inputs, and produces files or provides services as output.

From a technology point of view, a transform can be:

(1) the execution of a container;
(2) the execution of a custom controller, known as a Kubernetes CRD: for instance, the kubeflow controller is used to start TensorFlow jobs against a running TensorFlow cluster (within Kubernetes); or
(3) more generally, the execution of any programming code that orchestrates the processing of input resources into output resources.

To accommodate such varying needs, Koji provides a simple mechanism for defining new types of transforms as needed.

From a system architecture perspective, a transform comprises two parts:

(1) a schema for the declarative configuration that the user provides to instantiate transforms of this type, and
(2) a backend implementation which performs the execution, given a configuration structure (and access to the pipeline and cluster APIs).

Optionally, such backends can install dependent technologies during an initialization phase. For instance, a backend for executing Apache Spark jobs might opt to include an installation procedure for Apache Spark, if it is not present on the cluster.

This paper uses container execution as the running example throughout the sections on semantics and specification. In practice, most legacy/existing OSS technologies will require a dedicated backend, due to the large variety of execution and installation semantics.

Fortunately, writing such backends is a short, one-time effort. One can envision amassing a collection of backends for common technologies like Apache Spark, Apache Beam, TensorFlow, R, and so on.

Each such technology will define a dedicated configuration structure, akin to Container (in the specification section), which captures the parameters needed to perform a transform execution. We believe that such a simple-to-use declarative library of transforms backed by OSS technologies provides standardized assembly-level blocks for expressing business flows in general.
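One plausible shape for this two-part extension mechanism, sketched in Go. The interface and its method names are our illustration of the architecture, not the actual library API.

    package main

    import "context"

    // ResourceRef locates a concrete resource instance at execution time
    // (a file path on an attached volume, or a service host:port). Illustrative only.
    type ResourceRef struct {
        Name    string
        Locator string
    }

    // Backend is the second half of a transform definition: given the transform's
    // declarative configuration and located inputs, it runs the underlying technology
    // and reports where the outputs ended up.
    type Backend interface {
        Run(ctx context.Context, config []byte, inputs []ResourceRef) ([]ResourceRef, error)
    }

    // sparkBackend would wrap the specifics of submitting an Apache Spark job
    // (and could install Spark on the cluster during an initialization phase).
    type sparkBackend struct{}

    func (sparkBackend) Run(ctx context.Context, config []byte, inputs []ResourceRef) ([]ResourceRef, error) {
        // Submit the job, wait for completion, and return output file locations.
        // Omitted: this is where the one-time, technology-specific driver code lives.
        return nil, nil
    }

    func main() {
        var _ Backend = sparkBackend{} // the controller holds a registry of such backends
    }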

5.4 Transform job scheduling

The pipeline controller orchestrates the execution of transforms in their dependency order: a transformation step is ready to execute only when the resources it depends on become available.

Beyond this semantic constraint on execution order, transform jobs can be scheduled to meet additional optimization criteria like throttling or locality of placement. Optimized job scheduling (and placement) is generally provided by various out-of-the-box products, like Medea [10], for instance. Since job scheduling considerations are orthogonal to the pipeline execution order, our architecture makes it easy to plug in any scheduling algorithm available out-of-the-box.

This is accomplished by implementing a short function which the pipeline controller uses to communicate with the scheduler when needed. Transform steps can be annotated (in the pipeline's declarative configuration) with tags that determine their scheduling affinities, which are communicated to the scheduler.

5.5 Recursion

Pipelines can be executed from multiple places in the software stack, e.g.:

(1) using a command-line tool, which consumes a pipeline configuration file (YAML or Protocol Buffers);
(2) using a Kubernetes CRD whose configuration schema understands the pipeline schema shown here; or
(3) from any program running in the cluster, using a client library, also by providing a pipeline configuration as an execution argument.

In particular, as implied by the last method, a program that runs as a transform in one pipeline can, as part of its internal logic, execute another pipeline.

In the presence of such recursive pipeline invocation, all data consistency and caching guarantees remain in effect, due to the powerful nature of causal hashes. This enables developers to build complex recursive pipelines, such as those required by Deep Learning and Reinforcement Learning methodologies.

Due to the ability to execute pipelines recursively and the modular declarative approach to defining pipelines as configurations, our pipeline system can directly be reused as the backend of a DSL for programming data-processing logic at the cluster level. We have made initial strides in this direction with the design and implementation of the Ko programming language. We defer this extension to a follow-up paper.

6 CONCLUSION

This paper has two main contributions. First, we make the observation that virtually all large-scale, reproducible data-processing pipelines follow a common pattern, when viewed at the right level of abstraction. In particular, at a semantic level, said pipelines can be viewed as dependency-based build tools for data, akin to code build tools for software. Within this context, however, pipelines differ from build tools for code in that the resources being depended on can be files as well as short-lived services.

Our second contribution is to cast these two types of resources, which have very different runtime semantics, into a unified framework, where either can be viewed merely as a simple "resource dependency" from the point of view of the user.

To make this possible, we introduce Causal Hashing, which is a method for generating content hashes for both files and services. Causal Hashing is thus a generalization of content hashing, which can assign unique content IDs to complex temporal objects (like services).

Causal Hashing unlocks the complete automation of a myriad of tasks, such as caching, conflict resolution, version tracking, incremental builds and much more.
A SPECIFICATION

A.1 Specification methodology

Programmable technologies, in general, expose user-programmable functionality in one of two ways: either by using a (general or domain-specific) programming language, or by using a typed declarative schema (captured by standard technologies like XML, YAML, Thrift or Protocol Buffers, for instance) for expressing program configurations.

For instance, Apache Spark and Apache Storm are programmable through Java. Gleam, a MapReduce implementation in Go, is programmable through Go. On the other hand, TensorFlow and Argo express their programs in the form of computational DAGs captured by Protocol Buffers and YAML, respectively.

The use of typed data structures, in the form of Protocol Buffers, for defining programmable software has been a widespread practice within Google for many years now.

This declarative/configuration approach has a few advantages. The configuration schema for any particular technology acts as an "assembly language" for that technology and provides a formal decoupling between programmable semantics and any particular implementation. Furthermore, declarative configurations, being data, are amenable to static analyses (for validity, security or policy, e.g.) prior to execution, which is not the case with DSL-based interfaces.

Configuration schemas are language-agnostic, as they can be generated from any general programming language: a practice widely used and known as configuration-as-code.

The declarative schema-based approach is gaining momentum in the OSS space as well, as witnessed for instance by projects like ONNX. ONNX defines a platform-independent programming schema for describing ML models in the form of an extensible Protocol Buffer. ONNX aims to be viewed as a standard, to be implemented by various backends.

In this spirit, we believe that the correct interface for defining a general-purpose data-processing pipeline is the typed declarative one. We use Protocol Buffers as they provide a time-tested extension mechanism for the definition of backward- and forward-compatible data schemas. But it should be understood that interoperability with other standards like OpenAPI and YAML is a given, using standard tooling.

A.2 Pipeline

A pipeline is a directed acyclic graph whose vertices, called pipeline steps, represent data-processing tasks and whose edges represent (file or service) dependencies between pairs of steps.

At the highest level, a pipeline is captured by message Pipeline, shown below.

message Pipeline {
  repeated Step steps = 1;
}

A pipeline is an executable application, which will (1) consume some inputs from its cluster environment (e.g. files and directories from a volume, or a stream of data from a micro-service API), (2) process these inputs through a chain of transformation steps, and (3) deliver some outputs (which could be data or services).

A.3 Steps

A pipeline step is the generic building block of a pipeline application. Steps are used to describe the inputs, intermediate transformations, and outputs of a pipeline application.

A step is captured by message Step below:

message Step {
  required string label = 1;
  repeated StepInput inputs = 2;
  required Transform transform = 3;
}

Each step is identified by a unique string label, which distinguishes it from other steps in the pipeline. This is captured by field label.

The step definition specifies the transformation being performed by the step, as well as the sources for the transformation's inputs relative to the pipeline.

The step's transformation is captured by field transform. Transformations are self-contained descriptions of data-processing logic (described in more detail later). Each transformation declares a list of named and typed inputs (which can be viewed akin to functional arguments), as well as a list of named and typed outputs (which can be viewed akin to functional return values).

Field inputs describes the source of each named input expected by the step's transformation. Each named input is matched with another step in the pipeline, called a provider, and a specific named output at the provider step. This matching between inputs and provider steps is captured in message StepInput below:

message StepInput {
  required string name = 1;
  required string provider_step_label = 2;
  required string provider_output_name = 3;
}

A.4 Transform

A transform is a self-contained, reusable description of a data-processing computation, based on containerized technology.

Akin to a function (in a programming language), a transform comprises: (1) a set of named and typed inputs, (2) a set of named and typed outputs, and (3) an implementation, which describes how to perform the transform using containers.
Transforms are described by message Transform below.

message Transform {
  repeated TransformInput inputs = 1;
  repeated TransformOutput outputs = 2;
  required TransformLogic logic = 3;
}

A.4.1 Transform inputs and outputs. Transform inputs and outputs are captured by messages TransformInput and TransformOutput below.

message TransformInput {
  string name = 1;
  Resource resource = 10;
}

message TransformOutput {
  string name = 1;
  Resource resource = 10;
}

The inputs (and outputs) of a transformation are identified by unique names. These names serve to decouple the pipeline wiring definitions (captured in Step and StepInput) from the implementation detail of how inputs are passed to the container technology backing a transform (captured within message TransformLogic).

Each input (and output) is associated with a resource type, which is captured in field resource.

A.5 Resources

A resource is something that a transform consumes as its input or produces as its output.

The type of a resource is defined using message Resource below. A Resource should have exactly one of its fields, file or service, set.

message Resource {
  optional FileResource file = 1;
  optional ServiceResource service = 2;
}

Resource types capture the "temporal" nature of a resource (e.g. file vs service), as well as its "spatial" nature (e.g. specific file format or specific service protocol).

Resource type information is used in two ways by the pipeline controller:

(1) To verify the correctness of the step-to-step pipeline stitching in an application-meaningful way. Specifically, the pipeline compilation process will verify that the resource output by one step fulfills the resource type expected as input by a downstream dependent step.
(2) To inform garbage-collection of steps that provide services as their output. Specifically, if a step provides a service resource as its output, the pipeline controller will garbage-collect the step (e.g. kill its underlying container process) as soon as all dependent steps have completed their tasks. In contrast, a step which provides file resources will be garbage-collected only after it terminates successfully on its own.

A.5.1 File resources. A file resource type is described using message FileResource:

message FileResource {
  required bool directory = 1;
  optional string encoding = 2;
  optional string format = 3;
}

The file type specifies whether the resource is a file or a directory, and associates with it an optional encoding and an optional format identifier.

Encoding and format identifiers are used during pipeline compilation to verify that the output resource type of a provider step fulfills the input resource type of a consumer step. In this context, if provided, the encoding and format identifiers are treated as opaque strings and are checked for exact match.

A.5.2 Service resources. A service resource type is described using message ServiceResource:

message ServiceResource {
  optional string protocol = 1;
}

The service type optionally specifies a protocol identifier. This identifier is used during pipeline compilation to ensure that the service provided by one step's output fulfills the service expectations of a dependent step's input. In this context, if the protocol identifier is given on both sides, it will be verified for an exact match.

Protocol identifiers should follow a meaningful convention which, at a minimum, determines the service technology (e.g. GRPC vs OpenAPI) and the service itself (e.g. using the Protocol Buffer fully-qualified name of the service definition). For example:

openapi://org.proto.path.to.Service

A.6 Transform logic

The logic of a transform is a description (akin to a function implementation) of what a transform does and how it does it. Transform logic is described by message TransformLogic shown below.

message TransformLogic {
  optional ArgumentLogic arg = 100;
  optional ReturnLogic return = 200;
  optional ContainerLogic container = 300;
  // additional logics go here, e.g.
  // optional TensorFlowLogic tensor_flow = 301;
  // optional ApacheKafkaLogic apache_kafka = 302;
  // etc.
}

Message TransformLogic consists of a collection of mutually-exclusive logic types captured by the message fields, of which exactly one must be set. Each logic type is implemented as a "plug-in" in the pipeline controller, and additional logics can be added, as described in the section on architecture.

A.7 Pipeline arguments

Pipelines, like regular functions, can have arguments whose values are supplied at execution time. Unlike function argument values (which are arithmetic numbers and data structures), pipeline argument values are resource (file or service) instances.

The dedicated transform logic, called argument, is used to declare pipeline arguments.

From a pipeline graph point of view, steps based on argument transforms are vertices that have no input edges and a single output edge, representing the resource supplied to the argument when the pipeline was executed.

Message ArgumentLogic, shown below, describes a pipeline argument.

message ArgumentLogic {
  required string name = 1;
  required FileResource resource = 2;
}

Field name specifies the name of the pipeline argument. Field resource specifies the type of file or service resource expected as the argument value.

Argument steps have one output in the pipeline graph, whose resource type is that provided by field resource.

A.8 Pipeline return values

Pipelines, like regular functions, can return values to the caller environment. In the case of pipelines, the returned values are resource (file or service) instances.

The dedicated transform logic, called return, is used to declare pipeline return values.

From a pipeline graph point of view, steps based on a return transform are vertices that have a single input edge, representing the resource to be returned to the pipeline caller, and no output edges.

Message ReturnLogic, shown below, describes a pipeline return value.

message ReturnLogic {
  required string name = 1;
  required Resource resource = 2;
}

Field name specifies the name of the return value. Field resource specifies the type of file or service resource returned.

Return steps have one input in the pipeline graph, whose resource type is that provided by field resource.

A.9 Container-backed transforms

A container logic describes a pipeline transform backed by a container. Container logics are captured by message ContainerLogic shown below.

message ContainerLogic {
  required string image = 10;
  repeated ContainerInput inputs = 20;
  repeated ContainerOutput outputs = 21;
  repeated ContainerFlag flags = 22;
  repeated ContainerEnv env = 23;
}

The container logic specification captures:

(1) The identity of the container image, e.g. a Docker image label. This is captured by field image of message ContainerLogic.
(2) For every named transform input and output (declared in fields inputs and outputs of message Transform), a method for passing the location of the corresponding resource (file or service) to the container on startup. Methods for passing resource locations include flags and environment variables, as well as different formatting semantics, and are described later. Input and output passing methods are captured by fields inputs and outputs of message ContainerLogic.
(3) Any additional startup parameters in the form of flags or environment variables. These are captured by fields flags and env of message ContainerLogic.

A.9.1 Container input and output. Messages ContainerInput and ContainerOutput associate every transform input and output, respectively, with a container flag and/or environment variable, where the resource locator is to be passed to the container on startup.

message ContainerInput {
  required string name = 1;
  optional string flag = 2;
  optional string env = 3;
  optional ResourceFormat format = 4;
}

message ContainerOutput {
  required string name = 1;
  optional string flag = 2;
  optional string env = 3;
  optional ResourceFormat format = 4;
}
Both messages have analogous semantics.

Field name matches the corresponding transform input (or output) name, as declared in message Transform within field inputs (or outputs).

Fields flag and env determine the flag name and the environment variable, respectively, where the resource locator is to be passed.

Field format determines the resource locator formatting convention. By default, a file resource location is passed to the container in the form of an absolute path in the container's local file system. By default, a service resource location is passed to the container using standard host and port notation. Alternatives are provided by message ResourceFormat.

A.9.2 Container parameters. Additional container parameters, specified in fields flags and env of message ContainerLogic, are defined by message ContainerFlag below.

message ContainerFlag {
  required string name = 1;
  optional string value = 2;
}

REFERENCES

[1] Aljabr, Inc. [n. d.]. Ko: A generic type-safe language for concurrent, stateful, deadlock-free systems and protocol manipulations. https://github.com/kocircuit/kocircuit.
[2] Dagster. [n. d.]. Dagster is an opinionated system and programming model for data pipelines. https://github.com/dagster-io/dagster.
[3] Dask. [n. d.]. Dask: Dask natively scales Python. https://dask.org/.
[4] Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6 (OSDI'04). USENIX Association, Berkeley, CA, USA. http://dl.acm.org/citation.cfm?id=1251254.1251264
[5] Apache Foundation. [n. d.]. Apache Airflow: Author, schedule and monitor workflows. https://airflow.apache.org/.
[6] Apache Foundation. [n. d.]. Apache Beam: An advanced unified programming model. https://beam.apache.org/.
[7] Apache Foundation. [n. d.]. Apache Heron: A realtime, distributed, fault-tolerant stream processing engine from Twitter. https://apache.github.io/incubator-heron/.
[8] Apache Foundation. [n. d.]. Apache Spark: Lightning-fast unified analytics engine. https://spark.apache.org/.
[9] Apache Foundation. [n. d.]. Apache Storm: Distributed realtime computation system. http://storm.apache.org/.
[10] Panagiotis Garefalakis, Konstantinos Karanasos, Peter Pietzuch, Arun Suresh, and Sriram Rao. 2018. Medea: Scheduling of Long Running Applications in Shared Production Clusters. In Proceedings of the Thirteenth EuroSys Conference (EuroSys '18). ACM, New York, NY, USA, Article 4, 13 pages. https://doi.org/10.1145/3190508.3190549
[11] Google, Inc. [n. d.]. Bazel: Build and test software of any size, quickly and reliably. https://bazel.build/.
[12] Google, Inc. [n. d.]. TensorFlow: An open source machine learning framework. https://www.tensorflow.org/.
[13] Kubernetes. [n. d.]. An open-source system for automating deployment, scaling, and management of containerized applications. https://kubernetes.io/.
[14] Kubernetes open-source operators. [n. d.]. Collection of Kubernetes operators. https://github.com/operator-framework/awesome-operators.
[15] LifeBit. [n. d.]. Intelligent Compute Management for Data Analysis. https://lifebit.ai/products.
[16] ONNX. [n. d.]. Open Neural Network Exchange Format. https://onnx.ai/.
[17] PyTorch. [n. d.]. PyTorch: An open source deep learning platform. https://pytorch.org/.
[18] PyTorch distributed. [n. d.]. Writing distributed applications with PyTorch. https://pytorch.org/tutorials/intermediate/dist_tuto.html.
[19] Reflow. [n. d.]. Reflow: A language and runtime for distributed, incremental data processing in the cloud. https://github.com/grailbio/reflow.