
Koji: Automating pipelines with mixed-semantics data sources

Petar Maymounkov, Aljabr, Inc., [email protected]

ABSTRACT

We propose a new result-oriented semantic for defining data-processing workflows that manipulate data in different semantic forms (files or services) in a unified manner. This approach enables users to define workflows for a vast variety of reproducible data-processing tasks in a simple declarative manner which focuses on application-level results, while automating all control-plane considerations (like failure recovery without loss of progress and computation reuse) behind the scenes.

The uniform treatment of files and services as data enables easy integration with existing data sources (e.g. data acquisition APIs) and sinks of data (e.g. database services), whereas the focus on containers as transformations enables reuse of existing data-processing systems.

We describe a declarative configuration mechanism, which can be viewed as an intermediate representation (IR) of reproducible data-processing pipelines, in the same spirit as, for instance, TensorFlow [12] and ONNX [16] utilize IRs for defining tensor-processing pipelines.

1 INTRODUCTION

1.1 History

The introduction of MapReduce [4] by Google arguably marked the beginning of programmable large-scale data processing. MapReduce performs a transformation of one set of large files (the input) into another (the output). Since the transformation provided by a MapReduce is a primitive (a many-to-many shuffle, followed by an element-wise map), it became common practice to chain multiple MapReduce transformations in a pipeline.

The dataflow in such a pipeline is cleanly captured by a directed acyclic graph (DAG), whose vertices represent transformations and whose edges represent files.

In a twist, it became commonplace to query a service (usually a key-value lookup service) from inside the mapper function. For instance, this technique is used to join two tables by mapping over one of them and looking up into the other. More recently, Machine Learning systems have been serving trained models as lookup services, which are used by MapReduce mappers in a similar fashion.

With this twist, a MapReduce transformation no longer depends just on input files but also on lookup services (and their transitive dependencies, which are usually other files). The simple dataflow model mentioned previously no longer applies.

To the best of our knowledge, no dataflow model has been proposed to capture this scenario. Yet, to this day, this type of mixed-semantic (file and service) pipeline represents the most common type of off-line batch-processing workflow.

Due to the lack of a specialized formalism for describing such pipelines and a tool for executing them, they are currently codified in a variety of error-prone ways, which usually amount to the use of data-unaware task execution pipelines. We address this gap here.

1.2 Problem

We address a class of modern Machine Learning and data-processing pipelines.

Such pipelines transform a set of input files through chains of transformations, provided by open-source software (OSS) for large-scale computation (like TensorFlow [12], Apache Beam [6], Apache Spark [8], etc.) or by user-specific implementations (usually provided as executable containers). In short, these pipelines use mixtures of disparate OSS technologies tied into a single coherent data flow.

At present, such workflows are frequently built using task-driven pipeline technologies (like Apache Airflow [5]) which execute tasks in a given dependency order, but are unaware of the data passed from one task to the next. The lack of data-flow awareness in current solutions prevents large data-processing pipelines from benefiting from caching and reuse of computation, which could provide significant efficiency gains in these industry cases:

• Restarting long-running pipelines after failure and continuing from where previous executions left off.
• Re-running pipelines with incremental changes, during developer iterations.
• Running concurrent pipelines which share logic, i.e. compute identical data artifacts within their workflows.
Furthermore, task-centric technologies make it impractically hard to integrate data and computation optimizations like:

• in-memory storage of intermediate results which are not cacheable, or
• context-specific choices of job scheduling and placement algorithms.

1.3 Solution

We propose a new pipeline semantic (and describe its system architecture, which can be realized on top of Kubernetes) based on a few key design choices:

• Result-oriented specification: The goal of workflows is to build data artifacts. Workflows are represented as dependency graphs over artifacts. Input artifacts are provided by the caller. Intermediate artifacts are produced through a computation, using prior artifacts. Output artifacts are returned to the caller. This view is entirely analogous to the way software build systems (like Bazel [11] and UNIX make) define software build dependencies.

• Unified treatment of data and services: We view file artifacts and service artifacts in a unified way, as resources. This allows us to describe complex workflows which mix-and-match batch and streaming computations (the latter being a special case of services). Furthermore, this enables us to automate service garbage-collection and achieve optimal computation reuse (via caching) across the entire pipeline. The resource-level unified view of files and services purports to be the Goldilocks level of coarse data knowledge that is needed by a dataflow controller to automate all file caching and service control considerations.

• Type-safe declarative specification: We believe that workflow specification has to be declarative, i.e. representable via a typed schema (like e.g. Protocol Buffers). This provides full decoupling from implementations, and serves as a reproducible assembly language for defining pipelines.

• Decoupling of dataflow from transform implementation: We decouple the specification of application logic from the definition of how data transforms are performed by underlying backend technologies. Application logic comprises the dependency graph between artifacts, and the data transform at each node. Data transforms are viewed uniformly, akin to functions from a library of choices. The methods for invoking transformation-backing technologies (like MapReduce, TensorFlow, etc.) are implemented separately as driver functions, and surfaced as a library of declarative structures that can be used in application logic.

• Extensible transformations: New types of data transforms (other than container-execution based) can be added easily. This is done in two parts. First, a simple driver function implements the specifics of calling the underlying technology. Second, a new transformation structure is added to the application logic schema. This extension mechanism is reserved for transformations that cannot be containerized.

• Scheduler- and storage-independent design: Application logic governs the order in which data computations must occur. However, the choice of job scheduling (or placement) algorithms, as well as the choice of storage technologies (e.g. disk versus memory volumes), is entirely orthogonal to the application's dataflow definition. Our architecture enables flexible choice of the relevant technology on a per-node (scheduling) and per-edge (storage) basis. For instance, some intermediate files can be stored in memory volumes, instead of on disk, to increase efficiency.

We aim for this to be a pipeline technology which can perform data transformations based on any software available, as OSS-for-Linux or as-a-Service. This goal informs our choice of Kubernetes as the underlying infrastructure and cluster management technology for coordinating the orchestration of backend execution runtimes on single- or multi-tenant physical or (multi)cloud computing resources.

Kubernetes [13], which is becoming the industry-standard cluster OS, is ubiquitously available on most cloud providers, can be run on-premises, and is generally provider-agnostic from the users' standpoint. Kubernetes benefits from having mature integrations with the Docker (and other) container ecosystems, providing seamless out-of-the-box access to many data tools, thanks to the rich ecosystem of operator implementations [14].

1.4 How it works

The user defines a pipeline in a language-agnostic manner. A pipeline definition describes: (i) input data resources and their sources, (ii) intermediate resources and the data transformation that produces them from dependent resources, and (iii) output data resources and where they should be delivered.

• Resources are files (or services) and their format (or protocol) can be optionally specified to benefit from type-safety checks over the dataflow graph, thanks to declarations.
• Input resources can be provided by various standard methods (volume files, existing cluster services, Amazon S3 buckets, and so on). Data source types can be added seamlessly.
• Intermediate resources are described as files (or services), optionally with a specified format (or protocol). Their placement location is not provided by the user, in order to enable the pipeline controller to make optimal choices in this regard and to manage caching placement decisions.
• Output resources can specify the location where they should be delivered, with standard (and extensible) options similar to the case for input resources.

The transformations at the nodes of the dataflow graph consume a set of input resources to produce new resources. Transformations are exposed to the user with a clean application-level interface. A transformation:
• Consumes a set of named input resources (the arguments), which can be fulfilled by prior resources in the user's dataflow program.
• Produces a set of named output resources. The latter can be referenced (by name) by dependent transformations, downstream in the dataflow program.
• Accepts a transformation-specific set of parameters. For instance, a TensorFlow transformation may require a TensorFlow program (e.g. as a Python or Protocol Buffer file).

The user programs (in Python or Go) which build the pipeline dataflow are used to generate a Protocol Buffer (or YAML) file, which captures the entire pipeline and is effectively executable and reproducible.

Pipelines can be executed either from the command line or by sending them off as a background job to Kubernetes, using the operator pattern (via a CRD).

2 AN EXAMPLE

A typical modern Machine Learning pipeline produces interdependent files (e.g. data tables) and services (e.g. trained model servers) at multiple stages of its workflow.

The following example captures all semantic aspects of a modern ML/ETL pipeline:

[Figure 1 omitted: an example workflow which uses different technologies. Solid edges represent file resources; the dashed edge represents a service resource. The diagram shows input tables TRAIN and BUSINESS, a TensorFlow Train step producing MODEL, a TensorFlow Serve step providing the LOOKUP service, and an Apache Beam Combine step producing INSIGHT.]

• INPUTS: The pipeline expects two input resources from its caller: a training table, called TRAIN, and a table of business data, called BUSINESS.
• STEP 1: Table TRAIN is used as input to a Machine Learning procedure, e.g. TensorFlow training, to produce a new table, which we call MODEL. This is a batch job: it processes an input file into an output file.
• STEP 2: Table MODEL is then used as input to bring up a Machine Learning model server, e.g. TensorFlow Serve. The server loads the trained model in memory and starts a model-serving API service at a network location, which we call SERVICE.
• STEP 3: Table BUSINESS together with service SERVICE are used as input to an application-specific MapReduce job, which annotates every record in BUSINESS with some insight from SERVICE and outputs the result as table INSIGHT.
• OUTPUT: Table INSIGHT is the result of the pipeline.

A few things are notable here:

(N1) The pipeline's inputs, intermediate results and outputs are either files or services, which we collectively call resources.
(N2) The pipeline program describes a dependency graph between the resources: MODEL depends on TRAIN, SERVICE depends on MODEL, and INSIGHT depends on BUSINESS and SERVICE.
(N3) The outputs of pipeline steps (be it the content of files produced, or the behavior of services rendered) depend deterministically on their inputs.
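To make the wiring concrete, the following Go sketch builds this example pipeline as a dependency graph. It is a minimal, self-contained illustration: the Pipeline, Step and Input types are simplified stand-ins for the schema in Appendix A, and the transform identifiers are hypothetical names, not part of any published Koji API.

    package main

    import "fmt"

    // Minimal stand-ins for the pipeline schema (see Appendix A); simplified for illustration.
    type Input struct {
        Name         string // input name expected by the step's transform
        ProviderStep string // label of the step that produces the resource
    }

    type Step struct {
        Label     string  // unique step label
        Transform string  // hypothetical transform identifier (argument, return, container, ...)
        Inputs    []Input // which upstream outputs fulfill this step's inputs
    }

    type Pipeline struct{ Steps []Step }

    func main() {
        p := Pipeline{Steps: []Step{
            // Pipeline arguments: resources supplied by the caller.
            {Label: "TRAIN", Transform: "argument"},
            {Label: "BUSINESS", Transform: "argument"},
            // STEP 1: batch training job, file in, file out.
            {Label: "MODEL", Transform: "tensorflow-train",
                Inputs: []Input{{Name: "examples", ProviderStep: "TRAIN"}}},
            // STEP 2: model server, file in, service out.
            {Label: "SERVICE", Transform: "tensorflow-serve",
                Inputs: []Input{{Name: "model", ProviderStep: "MODEL"}}},
            // STEP 3: batch annotation job, file + service in, file out.
            {Label: "INSIGHT", Transform: "beam-combine",
                Inputs: []Input{
                    {Name: "records", ProviderStep: "BUSINESS"},
                    {Name: "lookup", ProviderStep: "SERVICE"},
                }},
            // Pipeline return value, delivered to the caller.
            {Label: "OUTPUT", Transform: "return",
                Inputs: []Input{{Name: "result", ProviderStep: "INSIGHT"}}},
        }}
        for _, s := range p.Steps {
            fmt.Println(s.Label, "<-", s.Inputs)
        }
    }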

To summarize, this view of a data-processing pipeline captures resource dependencies and resource semantics (files or services), while treating computations as black-box deterministic procedures (provided by containers, in practice).

This coarse container/resource-level view of a pipeline suffices to automate pipeline execution optimally and achieve significant compute and space efficiencies in common day-to-day operations. Let us illustrate this with two examples:

• Example 1: Suppose that, during execution, the pipeline completes steps 1 and 2, then fails during step 3 due to a hardware dysfunction. Due to the determinism (N3) of pipeline steps, it is possible to cache the file results of intermediate computations, in this case table MODEL, so they can be reused. When the pipeline is restarted after its failure, the caching mechanism enables it to skip step 1 (a costly training computation), proceed directly to restarting the service in step 2 (which takes negligible time), and then renew the interrupted computations in step 3.

• Example 2: In another example, suppose the pipeline is executed successfully with inputs BUSINESS1 and TRAIN. On the next day, the user executes the same pipeline with inputs BUSINESS2 and TRAIN, due to updates in the business table. The change in the BUSINESS table does not affect the computations in steps 1 and 2 of the pipeline. Therefore, just as in the previous example, an optimal pipeline would skip these steps and proceed to step 3.

3 RELATED WORK: PIPELINE TAXONOMY

Here we position the pipeline technology proposed in this paper against related technologies in the OSS ecosystem.

For the sake of our comparison, we identify two types of pipeline/workflow technologies: task-driven and data-driven. Additionally, data-driven pipelines are subdivided into coarse-grain and fine-grain types.

Task-driven pipeline technologies target the execution of a set of user tasks, each provided by an executable technology (e.g. binary or container), according to a dependency graph order. A task executes only after its dependencies have finished successfully. Task-driven pipelines provide simple (usually per-task) facilities for recovering from failure conditions, like restart rules. In general, task-driven pipelines are not aware of the flow of data (or services) provided by earlier tasks to later ones.

Data-driven pipeline technologies aim to define and perform reproducible transformations of a set of input data.
The input is usually consumed either from structured files (representing things like tables or graphs) located on a cluster file system, or from databases available as services. The outputs are produced in a similar fashion. Data transformations are specified in the form of a directed acyclic dataflow graph, comprising data transformations at the vertices.

This taxonomy, and where common OSS technologies fall within it, is summarized below:

                   Task-driven                Data-driven
    Fine-grain     -                          MapReduce, Spark, Beam, Storm, Heron, Dask
    Coarse-grain   Airflow, Argo, Brigade     Reflow, Dagster, this paper

The granularity at which the vertex transformations are programmed by the pipeline developer determines whether a data-driven pipeline is fine-grain or coarse-grain.

In fine-grain technologies, transformations manipulate data down to the level of arithmetic and data-structure manipulations of individual records and their entries. Such transformations are necessarily specified in a general programming language, where records are typically represented by language values (like structures, classes, arrays, etc.).

In coarse-grain technologies, transformations manipulate batches of data, which usually represent high-level concepts like tables, graphs, machine learning models, etc. Such transformations are specified by referring to pre-packaged applications which perform the transformations when executed. Coarse-grain transformations are typically programmed using declarative configuration languages, which aim at describing the dataflow graph and how to invoke the transformations at its vertices.

Data-driven technologies (fine- or coarse-grain) are aware of the types of data that flow along edges, as well as of the semantics of the transformations at the vertices (e.g. how they modify the data types). This allows them to implement efficient recovery from distributed failures without loss of progress, and more generally efficient caching everywhere.

The pipeline technology proposed here belongs to the bucket of data-driven, coarse-grain pipelines, which is largely unoccupied in the OSS space.
For comparison:

Apache Airflow [5] is task-driven with a configuration-as-code (Python) interface. Argo is task-driven with a declarative YAML configuration interface. Brigade is task-driven with an imperative, event-based programming interface.

MapReduce implementations, like Gleam, are fine-grain data-driven. Apache Spark [8] is a fine-grain data-driven pipeline with interfaces in Scala, Java and Python. Apache Beam [6] is a fine-grain data-driven pipeline with interfaces in Go, Java and Python. Apache Storm [9] is a fine-grain data-driven programmable pipeline for distributed real-time/streaming computations in Java. Apache Heron [7] is a real-time, distributed, fault-tolerant stream processing engine from Twitter; it is fine-grain data-driven and a direct successor to Apache Storm.

Another interesting take on fine-grain data-driven technologies has emerged in the Python community, where distributed pipeline programming has been disguised in the form of "array" object manipulations. The Dask [3] project is one such example, where operations over placeholder NumPy arrays or Pandas collections are distributed transparently. Another example is the low-level PyTorch [17] package for distributed algorithms [18], which allows for passing PyTorch tensors across workers.

Dagster [2] is a coarse-grain data-driven pipeline with a front-end in Python. It differs from Koji in that its architecture is not adequate for supporting service-based resources (and their automation, caching and garbage-collection). Aside from its front-end language choice, Dagster is very similar to Reflow, described next.

Reflow [19] is the only coarse-grain data-driven pipeline we found. Reflow was designed with a specific use in mind, which makes it fall short of being applicable in a more general industry setting. Reflow does not seem to manage data flow across service dependencies, as far as we can tell. It is AWS-specific, rather than platform-agnostic. It is based on a new DSL, making it hard to inter-operate with industry practices like configuration-as-code.

Outside of the OSS space, one can find a variety of emerging closed-source or domain-specific monolithic solutions to coarse-grain data-driven workflows. One such example is the bio-informatics product line of LifeBit [15].

4 SEMANTICS

In this section, we discuss the proposed semantics.

4.1 Representation

4.1.1 Dataflow topology. A data-processing pipeline is represented as a directed acyclic graph, whose vertices and edges are called steps and dependencies, respectively.

• Every pipeline vertex (i.e. step) has an associated set of named input slots and a set of named output slots.
• Every directed pipeline edge (i.e. dependency) is associated with (1) an output slot at its source vertex, and (2) an input slot at its sink vertex.

Output slots can have multiple outbound edges, reflecting that the output of a step can be used by multiple dependent steps. Input slots, on the other hand, must have a unique inbound edge, reflecting that a step input argument is fulfilled by a single upstream source.

4.1.2 Steps and transformations. In addition to their graph structure, steps and dependencies are associated with computational meaning.

Each dependency (i.e. directed graph edge) is associated with a resource, which is provided by the source step and consumed by the sink step (of the dependency edge). Resources are analogous to types in programming languages: they provide a "compile-time" description of the data processed by the pipeline at execution time. Pipeline resource descriptions capture both the data semantics (file or service) as well as the data syntax (file format or service protocol).

Each step (i.e. graph vertex) is associated with a (description of a) transform. A transform is a computational procedure which, at execution time, consumes a set of input resource instances and produces a set of output resource instances, whose names and resource types are as indicated by the inbound and outbound edges of the pipeline step.

There are two distinguished transform (i.e. vertex) types, called argument and return transforms, which are used to designate the inputs and outputs of the pipeline itself. Argument transforms have no input dependencies and a single output dependency. Return transforms have a single input dependency and no output dependencies.

Steps which are not based on argument or return transforms are called intermediate.
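A minimal Go sketch of these topology constraints (the types and field names are ours, for illustration only): each dependency edge names an output slot on its source step and an input slot on its sink step, and a well-formed graph admits fan-out on output slots but at most one inbound edge per input slot.

    package main

    import "fmt"

    // Dependency is one directed edge of the pipeline graph (illustrative types).
    type Dependency struct {
        SourceStep, OutputSlot string // provider side: output slots may fan out
        SinkStep, InputSlot    string // consumer side: input slots take exactly one edge
    }

    // checkTopology enforces the rule that every input slot has a unique inbound edge.
    func checkTopology(edges []Dependency) error {
        seen := map[string]bool{}
        for _, e := range edges {
            key := e.SinkStep + "/" + e.InputSlot
            if seen[key] {
                return fmt.Errorf("input slot %s has more than one inbound edge", key)
            }
            seen[key] = true
        }
        return nil
    }

    func main() {
        edges := []Dependency{
            {"MODEL", "model", "SERVICE", "model"},
            {"BUSINESS", "table", "INSIGHT", "records"},
            {"SERVICE", "api", "INSIGHT", "lookup"},
        }
        fmt.Println(checkTopology(edges)) // <nil>: the example graph is well-formed
    }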

4.2 Execution model

When a pipeline is executed by a caller (either a human operator or through programmatic control), a pipeline controller is allocated to dynamically manage the execution of the pipeline towards the goal of delivering the pipeline's return resources to the caller.

The key technical challenge in designing the pipeline control logic is to devise a generic algorithm which is robust against process failures, while also accommodating the semantic differences between file and service resources:

• File resources are considered available after the successful termination of the transformation process that produces them;
• Service resources are considered available during the execution of the transformation process that produces them.

4.2.1 Control algorithm. The pipeline execution algorithm, performed by the pipeline controller, associates two state variables with each dependency (edge) in the pipeline graph.

• A variable that assumes one of the states "available" or "not available" indicates whether the underlying resource (file or service) is currently available. This variable is written by the supervisor (see below) of the step producing the dependency, and read by the supervisor of the step consuming the dependency.
• A variable that assumes one of the states "needed" or "not needed" indicates whether the underlying resource (file or service) is currently needed. This variable is written by the supervisor of the step consuming the dependency, and read by the supervisor of the step producing the dependency.

On execution, the pipeline controller proceeds as follows:

(1) Mark the state of each input dependency to a return step as "needed". These dependencies will remain "needed" until the pipeline is terminated by the caller.
(2) For each intermediate step in the pipeline graph, create a step supervisor, running in a dedicated process (or coroutine).

Every step supervisor comprises two independent sub-processes: a driver loop and a (process) collector loop.

The driver loop is responsible for sensing when the outputs of the supervised step are needed dynamically (by dependent steps) and for arranging to make them available:

(1) Repeat:
  (a) If the step has no output dependencies which are "needed" and "not available", then goto (1).
  (b) Otherwise:
    (i) Mark all input dependencies of the step as "needed".
    (ii) Wait until all input dependencies become "available".
    (iii) Start the container process associated with this step.
    (iv) After starting the process, mark all service output dependencies of this step as "available".
    (v) Wait until the process terminates:
      (A) If the termination state is "succeeded" (implying that all output files have been produced), then: mark all file output dependencies of this step as "available" (since output files are cached, these dependencies will remain in state "available"); mark all input dependencies of this step as "not needed". Goto (1).
      (B) If the termination state is "failed", then: goto (1).
      (C) If the termination state is "killed" (by the collector loop), then: mark all service output dependencies of this step as "unavailable"; mark all input dependencies of this step as "not needed". Goto (1).

The (process) collector loop is responsible for sensing when the outputs of the supervised step (there is a collector for each step in the pipeline) are not needed any longer, and for arranging to garbage-collect its process:

(1) Repeat:
  (a) If the step process is running and all of the following conditions hold, then kill the process:
    (i) all file output dependencies of the step are either "needed" and "available", or "not needed"; and
    (ii) all service output dependencies of the step are "not needed".
  (b) Goto (1).
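The following Go sketch captures the two per-edge state variables and the conditions that drive the two loops above. It is a simplified illustration under our own naming, not the controller's actual implementation.

    package main

    // EdgeState holds the two state variables the controller keeps per dependency edge.
    type EdgeState struct {
        Available bool // written by the producing step's supervisor
        Needed    bool // written by the consuming step's supervisor
        IsService bool // file edges stay available once cached; service edges do not
    }

    // driverShouldRun reports whether the driver loop must (re)start the step's process:
    // some output is needed but not yet available.
    func driverShouldRun(outputs []EdgeState) bool {
        for _, o := range outputs {
            if o.Needed && !o.Available {
                return true
            }
        }
        return false
    }

    // collectorShouldKill reports whether the collector loop may garbage-collect the
    // running process: every file output is either satisfied or not needed, and no
    // service output is needed any longer.
    func collectorShouldKill(outputs []EdgeState) bool {
        for _, o := range outputs {
            if o.IsService {
                if o.Needed {
                    return false
                }
            } else if o.Needed && !o.Available {
                return false
            }
        }
        return true
    }

    func main() {
        outs := []EdgeState{{Needed: true, IsService: true}}
        _ = driverShouldRun(outs)     // true: a dependent step needs the service
        _ = collectorShouldKill(outs) // false: the service is still needed
    }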

4.3 Pure functions and causal hashing of content

Most data-processing pipelines in industry are required, by design, to have reproducible and deterministic outcomes. This includes workflows such as Machine Learning, banking and finance, canarying, software build systems, continuous delivery and integration, and so on.

In all reproducible pipelines, by definition, step transformations are pure: the outcomes (files output or services provided) obtained from executing pure transformations are entirely determined by the inputs provided to them and the identity (i.e. program description) of the transformation.

By contrast, non-reproducible pipelines are ones where transformation outcomes might additionally be affected by:

• runtime information (like the value of the wall clock or the temperature of the CPU), or
• interactions with an external stateful entity (like a disk, a persistent store service, or outside Internet services, for instance).

4.3.1 Caching. In the case of reproducible pipelines (comprising pure transformations), pipeline execution can benefit from dramatic efficiency gains (in computation, communication and space), using a simple technique we call causal caching.

The results of pipeline steps which are based on purely deterministic transformations can be cached to obtain significant efficiency gains in the following situations:

(1) avoiding duplicate computations when restarting partially-executed pipelines, for instance, after a hardware failure;
(2) multiple executions of the same pipeline, perhaps by different users, concurrently or at different times;
(3) executions of pipelines that have similar structure, for instance, as is the case with re-evaluating the results of multiple incremental changes of the same base pipeline during development iterations.

The caching algorithm assigns a number, called a causal hash, to every edge of the computation graph of a pipeline. These hash numbers are used as keys in a cluster-wide caching file system.

To serve their purpose as cache keys for the outputs of pipeline steps, causal hashes have to meet two criteria:

(C1) A causal hash has to have the properties of a content hash: if the causal hashes of two resources are identical, then the resources must be identical.
(C2) A causal hash has to be computable before the resource it describes has been computed by executing the step transform that produces it.

To meet these criteria, we define causal hashes in the following manner:

(1) The causal hashes of the resources passed as inputs to the pipeline are to be provided by the caller of the pipeline. Criterion (C2) does not apply to input resources, thus any choice of content hashing algorithm, like an MD5 message digest or a semantic hash, suffices.
(2) All other pipeline edges correspond to resources output by a transformation step. In this case, the causal hash of the resource is defined recursively, as the message digest (e.g. using SHA-1) of the following meta-information:
  (a) the pairs of name and causal hash for all inputs to the step transformation,
  (b) the identity (or program description) of the transformation, and
  (c) the name of the transformation output associated with the edge.

Note that while only file resources can be cached (on a cluster file system), service resources can also benefit from caching. For instance, a service resource in the middle of a large pipeline can be made available if the file resources it depends on have been cached from a prior execution.
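A minimal Go sketch of the recursive definition above, assuming the causal hashes of the pipeline's inputs are already known. The exact serialization of the digested meta-information is our own choice here; the definition only requires that it cover items (a), (b) and (c).

    package main

    import (
        "crypto/sha1"
        "fmt"
        "sort"
    )

    // causalHash computes the causal hash of one output edge of a step: a SHA-1 digest
    // over (a) the sorted (input name, input causal hash) pairs, (b) the transform's
    // identity (program description), and (c) the name of the output.
    func causalHash(inputHashes map[string]string, transformID, outputName string) string {
        h := sha1.New()
        names := make([]string, 0, len(inputHashes))
        for name := range inputHashes {
            names = append(names, name)
        }
        sort.Strings(names) // digest inputs in a canonical order
        for _, name := range names {
            fmt.Fprintf(h, "input:%s=%s\n", name, inputHashes[name])
        }
        fmt.Fprintf(h, "transform:%s\n", transformID)
        fmt.Fprintf(h, "output:%s\n", outputName)
        return fmt.Sprintf("%x", h.Sum(nil))
    }

    func main() {
        // MODEL's hash is computable before training runs, from TRAIN's hash alone.
        model := causalHash(map[string]string{"examples": "md5:..."}, "tensorflow-train:v1", "model")
        // INSIGHT's hash chains the hashes of its inputs, including the service edge.
        service := causalHash(map[string]string{"model": model}, "tensorflow-serve:v1", "api")
        insight := causalHash(map[string]string{"records": "md5:...", "lookup": service}, "beam-combine:v1", "result")
        fmt.Println(insight)
    }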
4.3.2 Locking and synchronization. Pipeline semantics make it possible to execute multiple racing pipelines in the same cluster, while ensuring they utilize computational resources optimally.

Two different pipeline graphs can entail similar transformations, in the sense of a common computational subgraph appearing in both pipelines. This situation occurs, for instance, as a developer iterates over pipeline designs incrementally, producing many similar designs.

A causal cache (as described earlier) shared between concurrent pipelines enables one pipeline to reuse the computed output of an identical step that was already computed by the other pipeline.

We accomplish cache sharing across any number of concurrently executing pipelines by means of per-causal-hash cluster-wide locking. In particular, the controller algorithm for executing a pipeline transformation step is augmented as follows:

(1) Compute the causal hashes, H, of the step outputs.
(2) Obtain a cluster-wide lock on H.
(3) Check whether the output resources (files) have already been cached in a designated caching file system:
  (a) If so, then release the lock on H and reuse the cached resources.
  (b) Otherwise, execute the step transformation, cache its outputs, release the lock on H and return the output resources.

4.3.3 Composability. The reader will note that a pipeline can itself be viewed as a transform: it has a set of named inputs (the arguments), a set of named outputs (the return values) and a description of an executable procedure (the graph). Consequently, one pipeline can be invoked as a step transformation in another.

This generic and modular flexibility enables developers to create pipeline templates for common workflows, like a canarying workflow or an ML topology, and reuse those templates as building blocks in multiple applications.

5 ARCHITECTURE

Our goal here is to describe an architecture for a data-processing pipeline system, and to outline an implementation strategy that works well with available OSS software.

We focus on an approach that uses Kubernetes as the underlying infrastructure, due to its ubiquitous deployments in commercial clouds.

The pipeline execution logic itself is implemented as a Go library, which can execute a pipeline given runtime access to a Kubernetes cluster and a user pipeline specification. Pipeline executions can be invoked through standard integration points: (a) using a command-line tool by passing a pipeline description file, (b) using a Kubernetes controller (via CRD), or (c) from any programming language, by sending pipeline configurations for execution to the controller interface in (b).

The approach in (c) is sometimes called configuration-as-code and is a common practice. For instance, TensorFlow and PyTorch are Python front-ends for expressing tensor pipelines. Using general imperative languages to express pipelines has proven suboptimal for various reasons. For one, pipelines (in the sense of DAG data flows) correspond to immutable functional semantics (not imperative mutable ones). Furthermore, configuration-as-code libraries have not been able to deliver type-safety at compile time. To solve both of these problems, we have designed a general functional language, called Ko [1], which allows for concise type-safe functional-style expression of pipelines.
5.1 Type checks before execution

The pipeline specification schema allows the user to optionally specify more detailed "type" information about the resources input to or output by each transformation in a dataflow program.

For file resources, this type information can describe the underlying file and its data at various levels of precision. It could specify a file format (e.g. CSV or JSON), an encoding (e.g. UTF-8) and a data schema (e.g. provided as a reference to a Protocol Buffer or XML schema definition).

For service resources, analogously, the user can optionally describe the service type to a varying level of detail: transport layer (e.g. HTTP over TCP), security layer (e.g. TLS), RPC semantics (e.g. GRPC), and protocol definition (e.g. a reference to a Protocol Buffer or an OpenAPI specification).

When such typing information is provided, the pipeline controller is able to check the user's dataflow programs for type-safety before it commits to what is often a lengthy execution.
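A small Go sketch of the kind of pre-execution check this enables, following the exact-match convention for opaque format and encoding identifiers described in Appendix A. The types are simplified stand-ins, and the treatment of unspecified fields is our assumption.

    package main

    import "fmt"

    // FileType is optional type information attached to a file resource (simplified).
    type FileType struct {
        Format   string // e.g. "csv", "json"; empty means unspecified
        Encoding string // e.g. "utf-8"; empty means unspecified
    }

    // fulfills reports whether a provider's declared output type satisfies a consumer's
    // declared input type: identifiers are opaque strings checked for exact match,
    // and a field left unspecified on either side is not checked.
    func fulfills(out, in FileType) bool {
        if in.Format != "" && out.Format != "" && in.Format != out.Format {
            return false
        }
        if in.Encoding != "" && out.Encoding != "" && in.Encoding != out.Encoding {
            return false
        }
        return true
    }

    func main() {
        produced := FileType{Format: "csv", Encoding: "utf-8"}
        expected := FileType{Format: "json"}
        // The controller can reject this wiring before any container is started.
        fmt.Println(fulfills(produced, expected)) // false
    }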
5.2 Resource management and plumbing

At the programming level, the user directly connects the outputs of one transformation to the inputs of another. At runtime, however, these intermediate resources, be they files or services, need to be managed.

5.2.1 Files. For intermediate file resources, generally, the pipeline controller will determine the placement of files on a cluster volume and will connect these volumes as necessary to containers requiring access to the files.

For instance, assume transformation A has an output that is connected to an input of transformation B. At runtime, the controller will choose a volume for placing the file produced by A and consumed by B. It will attach this volume to the container for A during its execution, and then it will attach the volume (now containing the produced file) to the container for B. Plumbing details such as passing execution flags to containers are handled as well.

Of course, this is a basic example. The file management logic can be extended with hooks to fulfill various optimization and policy needs, such as:

• plugging in third-party file placement algorithms that optimize physical placement locality, or
• placing non-cacheable resources on memory-backed volumes.

The file management layer also contains the causal-hash-based caching of files (described in the previous section).

5.2.2 Services. For intermediate service resources (provided from one transformation to the next), the pipeline controller handles plumbing details transparently as well. Generally, it takes care of creating DNS records for services, and of coordinating container server addresses and flag-passing details.

As with files, services between a server and a client transformation can be customized via hooks to address load-balancing, re-routing, authentication, and other such concerns.

5.3 Transformation backends

A transformation is, generally, any process execution within the cluster which accepts files or services as inputs, and produces files or provides services as output.

From a technology point of view, a transform can be:

(1) the execution of a container;
(2) the execution of a custom controller, known as a Kubernetes CRD: for instance, the kubeflow controller is used to start TensorFlow jobs against a running TensorFlow cluster (within Kubernetes); or
(3) more generally, the execution of any programming code that orchestrates the processing of input resources into output resources.

To accommodate such varying needs, Koji provides a simple mechanism for defining new types of transforms as needed.

From a system architecture perspective, a transform comprises two parts:

(1) a schema for the declarative configuration that the user provides to instantiate transforms of this type, and
(2) a backend implementation which performs the execution, given a configuration structure (and access to the pipeline and cluster APIs).

Optionally, such backends can install dependent technologies during an initialization phase. For instance, a backend for executing Apache Spark jobs might opt to include an installation procedure for Apache Spark, if it is not present on the cluster.

This paper uses container execution as the running example throughout the sections on semantics and specification. In practice, most legacy/existing OSS technologies will require a dedicated backend, due to the large variety of execution and installation semantics.

Fortunately, writing such backends is a short, one-time effort. One can envision amassing a collection of backends for common technologies like Apache Spark, Apache Beam, TensorFlow, R, and so on.

Each such technology will define a dedicated configuration structure, akin to Container (in the specification section), which captures the parameters needed to perform a transform execution. We believe that such a simple-to-use declarative library of transforms backed by OSS technologies provides standardized assembly-level blocks for expressing business flows in general.
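One plausible shape for this two-part extension mechanism, sketched in Go. The interface and its method names are our illustration of the architecture, not the actual library API.

    package main

    import "context"

    // ResourceRef locates a concrete resource instance at execution time
    // (a file path on an attached volume, or a service host:port). Illustrative only.
    type ResourceRef struct {
        Name    string
        Locator string
    }

    // Backend is the second half of a transform definition: given the transform's
    // declarative configuration and located inputs, it runs the underlying technology
    // and reports where the outputs ended up.
    type Backend interface {
        Run(ctx context.Context, config []byte, inputs []ResourceRef) ([]ResourceRef, error)
    }

    // sparkBackend would wrap the specifics of submitting an Apache Spark job
    // (and could install Spark on the cluster during an initialization phase).
    type sparkBackend struct{}

    func (sparkBackend) Run(ctx context.Context, config []byte, inputs []ResourceRef) ([]ResourceRef, error) {
        // Submit the job, wait for completion, and return output file locations.
        // Omitted: this is where the one-time, technology-specific driver code lives.
        return nil, nil
    }

    func main() {
        var _ Backend = sparkBackend{} // the controller holds a registry of such backends
    }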

5.4 Transform job scheduling

The pipeline controller orchestrates the execution of transforms in their dependency order: a transformation step is ready to execute only when the resources it depends on become available.

Beyond this semantic constraint on execution order, transform jobs can be scheduled to meet additional optimization criteria like throttling or locality of placement. Optimized job scheduling (and placement) is generally provided by various out-of-the-box products, like Medea [10], for instance. Since job scheduling considerations are orthogonal to the pipeline execution order, our architecture makes it easy to plug in any scheduling algorithm available out-of-the-box.

This is accomplished by implementing a short function which the pipeline controller uses to communicate with the scheduler when needed. Transform steps can be annotated (in the pipeline's declarative configuration) with tags that determine their scheduling affinities, which are communicated to the scheduler.

5.5 Recursion

Pipelines can be executed from multiple places in the software stack, e.g.:

(1) using a command-line tool, which consumes a pipeline configuration file (YAML or Protocol Buffers);
(2) using a Kubernetes CRD whose configuration schema understands the pipeline schema shown here; or
(3) from any program running in the cluster, using a client library, also by providing a pipeline configuration as an execution argument.

In particular, as implied by the last method, a program that runs as a transform in one pipeline can, as part of its internal logic, execute another pipeline.

In the presence of such recursive pipeline invocation, all data consistency and caching guarantees remain in effect, due to the powerful nature of causal hashes. This enables developers to build complex recursive pipelines, such as those required by Deep Learning and Reinforcement Learning methodologies.

Due to the ability to execute pipelines recursively and the modular declarative approach to defining pipelines as configurations, our pipeline system can directly be reused as the backend of a DSL for programming data-processing logic at the cluster level. We have made initial strides in this direction with the design and implementation of the Ko programming language. We defer this extension to a follow-up paper.

6 CONCLUSION

This paper has two main contributions. First, we make the observation that virtually all large-scale, reproducible data-processing pipelines follow a common pattern, when viewed at the right level of abstraction. In particular, at a semantic level, said pipelines can be viewed as dependency-based build tools for data, akin to code build tools for software. Within this context, however, pipelines differ from build tools for code in that the resources being depended on can be files as well as short-lived services.

Our second contribution is to cast these two types of resources, which have very different runtime semantics, into a unified framework, where either can be viewed merely as a simple "resource dependency" from the point of view of the user.

To make this possible, we introduce Causal Hashing, which is a method for generating content hashes for both files and services. Causal Hashing is thus a generalization of content hashing, which can assign unique content IDs to complex temporal objects (like services).

Causal Hashing unlocks the complete automation of a myriad of tasks, such as caching, conflict resolution, version tracking, incremental builds and much more.
A SPECIFICATION

A.1 Specification methodology

Programmable technologies, in general, expose user-programmable functionality in one of two ways: either by using a (general or domain-specific) programming language, or by using a typed declarative schema (captured by standard technologies like XML, YAML, Thrift or Protocol Buffers, for instance) for expressing program configurations.

For instance, Apache Spark and Apache Storm are programmable through Java. Gleam, a MapReduce implementation in Go, is programmable through Go. On the other hand, TensorFlow and Argo express their programs in the form of computational DAGs captured by Protocol Buffers and YAML, respectively.

The use of typed data structures, in the form of Protocol Buffers, for defining programmable software has been a widespread practice within Google for many years now.

This declarative/configuration approach has a few advantages. The configuration schema for any particular technology acts as an "assembly language" for that technology and provides a formal decoupling between programmable semantics and any particular implementation. Furthermore, declarative configurations, being data, are amenable to static analyses (for validity, security or policy, e.g.) prior to execution, which is not the case with DSL-based interfaces.

Configuration schemas are language-agnostic, as they can be generated from any general programming language: a practice widely used and known as configuration-as-code.

The declarative schema-based approach is gaining momentum in the OSS space as well, as witnessed for instance by projects like ONNX. ONNX defines a platform-independent programming schema for describing ML models in the form of an extensible Protocol Buffer. ONNX aims to be viewed as a standard, to be implemented by various backends.

In this spirit, we believe that the correct interface for defining a general-purpose data-processing pipeline is the typed declarative one. We use Protocol Buffers as they provide a time-tested extension mechanism for the definition of backward- and forward-compatible data schemas. But it should be understood that interoperability with other standards like OpenAPI and YAML is a given, using standard tooling.

A.2 Pipeline

A pipeline is a directed acyclic graph whose vertices, called pipeline steps, represent data-processing tasks and whose edges represent (file or service) dependencies between pairs of steps.

At the highest level, a pipeline is captured by message Pipeline, shown below.

message Pipeline {
  repeated Step steps = 1;
}

A pipeline is an executable application, which will (1) consume some inputs from its cluster environment (e.g. files and directories from a volume, or a stream of data from a micro-service API), (2) process these inputs through a chain of transformation steps, and (3) deliver some outputs (which could be data or services).

A.3 Steps

A pipeline step is the generic building block of a pipeline application. Steps are used to describe the inputs, intermediate transformations, and outputs of a pipeline application.

A step is captured by message Step below:

message Step {
  required string label = 1;
  repeated StepInput inputs = 2;
  required Transform transform = 3;
}

Each step is identified by a unique string label, which distinguishes it from other steps in the pipeline. This is captured by field label.

The step definition specifies the transformation being performed by the step, as well as the sources for the transformation's inputs relative to the pipeline.

The step's transformation is captured by field transform. Transformations are self-contained descriptions of data-processing logic (described in more detail later). Each transformation declares a list of named and typed inputs (which can be viewed akin to functional arguments), as well as a list of named and typed outputs (which can be viewed akin to functional return values).

Field inputs describes the source of each named input expected by the step's transformation. Each named input is matched with another step in the pipeline, called a provider, and a specific named output at the provider step. This matching between inputs and provider steps is captured in message StepInput below:

message StepInput {
  required string name = 1;
  required string provider_step_label = 2;
  required string provider_output_name = 3;
}

A.4 Transform

A transform is a self-contained, reusable description of a data-processing computation, based on containerized technology.

Akin to a function (in a programming language), a transform comprises: (1) a set of named and typed inputs, (2) a set of named and typed outputs, and (3) an implementation, which describes how to perform the transform using containers.
Transforms are described by message Transform below.

message Transform {
  repeated TransformInput inputs = 1;
  repeated TransformOutput outputs = 2;
  required TransformLogic logic = 3;
}

A.4.1 Transform inputs and outputs. Transform inputs and outputs are captured by messages TransformInput and TransformOutput below.

message TransformInput {
  string name = 1;
  Resource resource = 10;
}

message TransformOutput {
  string name = 1;
  Resource resource = 10;
}

The inputs (and outputs) of a transformation are identified by unique names. These names serve to decouple the pipeline wiring definitions (captured in Step and StepInput) from the implementation detail of how inputs are passed to the container technology backing a transform (captured within message TransformLogic).

Each input (and output) is associated with a resource type, which is captured in field resource.

A.5 Resources

A resource is something that a transform consumes as its input or produces as its output.

The type of a resource is defined using message Resource below. A Resource should have exactly one of its fields, file or service, set.

message Resource {
  optional FileResource file = 1;
  optional ServiceResource service = 2;
}

Resource types capture the "temporal" nature of a resource (e.g. file vs service), as well as its "spatial" nature (e.g. specific file format or specific service protocol).

Resource type information is used in two ways by the pipeline controller:

(1) To verify the correctness of the step-to-step pipeline stitching in an application-meaningful way. Specifically, the pipeline compilation process will verify that the resource output by one step fulfills the resource type expected as input by a downstream dependent step.
(2) To inform garbage-collection of steps that provide services as their output. Specifically, if a step provides a service resource as its output, the pipeline controller will garbage-collect the step (e.g. kill its underlying container process) as soon as all dependent steps have completed their tasks. In contrast, a step which provides file resources will be garbage-collected only after it terminates successfully on its own.

A.5.1 File resources. A file resource type is described using message FileResource:

message FileResource {
  required bool directory = 1;
  optional string encoding = 2;
  optional string format = 3;
}

The file type specifies whether the resource is a file or a directory, and associates with it an optional encoding and an optional format identifier.

Encoding and format identifiers are used during pipeline compilation to verify that the output resource type of a provider step fulfills the input resource type of a consumer step. In this context, if provided, the encoding and format identifiers are treated as opaque strings and are checked for exact match.

A.5.2 Service resources. A service resource type is described using message ServiceResource:

message ServiceResource {
  optional string protocol = 1;
}

The service type optionally specifies a protocol identifier. This identifier is used during pipeline compilation to ensure that the service provided by one step's output fulfills the service expectations of a dependent step's input. In this context, if the protocol identifier is given on both sides, it will be verified for an exact match.

Protocol identifiers should follow a meaningful convention which, at a minimum, determines the service technology (e.g. GRPC vs OpenAPI) and the service itself (e.g. using the Protocol Buffer fully-qualified name of the service definition). For example:

openapi://org.proto.path.to.Service

A.6 Transform logic

The logic of a transform is a description (akin to a function implementation) of what a transform does and how it does it. Transform logic is described by message TransformLogic shown below.

message TransformLogic {
  optional ArgumentLogic arg = 100;
  optional ReturnLogic return = 200;
  optional ContainerLogic container = 300;
  // additional logics go here, e.g.
  // optional TensorFlowLogic tensor_flow = 301;
  // optional ApacheKafkaLogic apache_kafka = 302;
  // etc.
}

Message TransformLogic consists of a collection of mutually-exclusive logic types captured by the message fields, of which exactly one must be set. Each logic type is implemented as a "plug-in" in the pipeline controller, and additional logics can be added, as described in the section on architecture.

A.7 Pipeline arguments

Pipelines, like regular functions, can have arguments whose values are supplied at execution time. Unlike function argument values (which are arithmetic numbers and data structures), pipeline argument values are resource (file or service) instances.

The dedicated transform logic, called argument, is used to declare pipeline arguments.

From a pipeline graph point of view, steps based on argument transforms are vertices that have no input edges and a single output edge, representing the resource supplied to the argument when the pipeline was executed.

Message ArgumentLogic, shown below, describes a pipeline argument.

message ArgumentLogic {
  required string name = 1;
  required FileResource resource = 2;
}

Field name specifies the name of the pipeline argument. Field resource specifies the type of file or service resource expected as the argument value.

Argument steps have one output in the pipeline graph, whose resource type is that provided by field resource.

A.8 Pipeline return values

Pipelines, like regular functions, can return values to the caller environment. In the case of pipelines, the returned values are resource (file or service) instances.

The dedicated transform logic, called return, is used to declare pipeline return values.

From a pipeline graph point of view, steps based on a return transform are vertices that have a single input edge, representing the resource to be returned to the pipeline caller, and no output edges.

Message ReturnLogic, shown below, describes a pipeline return value.

message ReturnLogic {
  required string name = 1;
  required Resource resource = 2;
}

Field name specifies the name of the return value. Field resource specifies the type of file or service resource returned.

Return steps have one input in the pipeline graph, whose resource type is that provided by field resource.

A.9 Container-backed transforms

A container logic describes a pipeline transform backed by a container. Container logics are captured by message ContainerLogic shown below.

message ContainerLogic {
  required string image = 10;
  repeated ContainerInput inputs = 20;
  repeated ContainerOutput outputs = 21;
  repeated ContainerFlag flags = 22;
  repeated ContainerEnv env = 23;
}

The container logic specification captures:

(1) The identity of the container image, e.g. a Docker image label. This is captured by field image of message ContainerLogic.
(2) For every named transform input and output (declared in fields inputs and outputs of message Transform), a method for passing the location of the corresponding resource (file or service) to the container on startup. Methods for passing resource locations include flags and environment variables, as well as different formatting semantics, and are described later. Input and output passing methods are captured by fields inputs and outputs of message ContainerLogic.
(3) Any additional startup parameters in the form of flags or environment variables. These are captured by fields flags and env of message ContainerLogic.

A.9.1 Container input and output. Messages ContainerInput and ContainerOutput associate every transform input and output, respectively, with a container flag and/or environment variable, where the resource locator is to be passed to the container on startup.

message ContainerInput {
  required string name = 1;
  optional string flag = 2;
  optional string env = 3;
  optional ResourceFormat format = 4;
}

message ContainerOutput {
  required string name = 1;
  optional string flag = 2;
  optional string env = 3;
  optional ResourceFormat format = 4;
}
Both messages have analogous semantics.

Field name matches the corresponding transform input (or output) name, as declared in message Transform within field inputs (or outputs).

Fields flag and env determine the flag name and the environment variable, respectively, where the resource locator is to be passed.

Field format determines the resource locator formatting convention. By default, a file resource location is passed to the container in the form of an absolute path in the container's local file system. By default, a service resource location is passed to the container using standard host and port notation. Alternatives are provided by message ResourceFormat.

A.9.2 Container parameters. Additional container parameters, specified in fields flags and env of message ContainerLogic, are defined by message ContainerFlag below.

message ContainerFlag {
  required string name = 1;
  optional string value = 2;
}

REFERENCES

[1] Aljabr, Inc. [n. d.]. Ko: A generic type-safe language for concurrent, stateful, deadlock-free systems and protocol manipulations. https://github.com/kocircuit/kocircuit.
[2] Dagster. [n. d.]. Dagster is an opinionated system and programming model for data pipelines. https://github.com/dagster-io/dagster.
[3] Dask. [n. d.]. Dask: Dask natively scales Python. https://dask.org/.
[4] Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6 (OSDI'04). USENIX Association, Berkeley, CA, USA. http://dl.acm.org/citation.cfm?id=1251254.1251264
[5] Apache Foundation. [n. d.]. Apache Airflow: Author, schedule and monitor workflows. https://airflow.apache.org/.
[6] Apache Foundation. [n. d.]. Apache Beam: An advanced unified programming model. https://beam.apache.org/.
[7] Apache Foundation. [n. d.]. Apache Heron: A realtime, distributed, fault-tolerant stream processing engine from Twitter. https://apache.github.io/incubator-heron/.
[8] Apache Foundation. [n. d.]. Apache Spark: Lightning-fast unified analytics engine. https://spark.apache.org/.
[9] Apache Foundation. [n. d.]. Apache Storm: Distributed realtime computation system. http://storm.apache.org/.
[10] Panagiotis Garefalakis, Konstantinos Karanasos, Peter Pietzuch, Arun Suresh, and Sriram Rao. 2018. Medea: Scheduling of Long Running Applications in Shared Production Clusters. In Proceedings of the Thirteenth EuroSys Conference (EuroSys '18). ACM, New York, NY, USA, Article 4, 13 pages. https://doi.org/10.1145/3190508.3190549
[11] Google, Inc. [n. d.]. Bazel: Build and test software of any size, quickly and reliably. https://bazel.build/.
[12] Google, Inc. [n. d.]. TensorFlow: An open source machine learning framework. https://www.tensorflow.org/.
[13] Kubernetes. [n. d.]. An open-source system for automating deployment, scaling, and management of containerized applications. https://kubernetes.io/.
[14] Kubernetes open-source operators. [n. d.]. Collection of Kubernetes operators. https://github.com/operator-framework/awesome-operators.
[15] LifeBit. [n. d.]. Intelligent Compute Management for Data Analysis. https://lifebit.ai/products.
[16] ONNX. [n. d.]. Open Neural Network Exchange Format. https://onnx.ai/.
[17] PyTorch. [n. d.]. PyTorch: An open source deep learning platform. https://pytorch.org/.
[18] PyTorch distributed. [n. d.]. Writing distributed applications with PyTorch. https://pytorch.org/tutorials/intermediate/dist_tuto.html.
[19] Reflow. [n. d.]. Reflow: A language and runtime for distributed, incremental data processing in the cloud. https://github.com/grailbio/reflow.