A Machine Learning Data Processing Framework

tf.data: A Machine Learning Data Processing Framework Derek G. Murray Jiří Šimša Lacework* Google [email protected] [email protected] Ana Klimovic Ihor Indyk ETH Zurich* Google [email protected] [email protected] ABSTRACT Training machine learning models requires feeding input data for models to ingest. Input pipelines for machine learning jobs are often challenging to implement efficiently as they require reading large volumes of data, applying complex transformations, and transfer- ring data to hardware accelerators while overlapping computation and communication to achieve optimal performance. We present tf.data, a framework for building and executing efficient input pipelines for machine learning jobs. The tf.data API provides operators that can be parameterized with user-defined computation, composed, and reused across different machine learning domains. Figure 1: CDF showing the fraction of compute time that These abstractions enable users to focus on the application logic millions of ML training jobs executed in our fleet over one of data processing, while tf.data’s runtime ensures that pipelines month spend in the input pipeline. 20% of jobs spend more run efficiently. than a third of their compute time ingesting data. We demonstrate that input pipeline performance is critical to the end-to-end training time of state-of-the-art machine learning models. tf.data delivers the high performance required, while 1 INTRODUCTION avoiding the need for manual tuning of performance knobs. We Data is the lifeblood of machine learning (ML). Training ML models show that tf.data features, such as parallelism, caching, static op- requires steadily pumping examples for models to ingest and learn timizations, and optional non-deterministic execution are essential from. While prior work has focused on optimizing the accuracy and for high performance. Finally, we characterize machine learning speed of model training and serving, how we store and preprocess input pipelines for millions of jobs that ran in Google’s datacenter data for machine learning jobs has received significantly less atten- fleet, showing that input data processing is highly diverse andcon- tion. Across the millions of ML jobs we run in Google’s datacenters sumes a significant fraction of job resources. Our analysis motivates every month, we observe that the input data pipeline accounts future research directions, such as sharing computation across jobs for significant resource usage and can greatly impact end-to-end and pushing data projection to the storage layer. performance. Figure 1 shows how the fraction of compute time that jobs spend in the input pipeline varies, where we define compute time as the time spent on a hardware resource – such as a CPU or an PVLDB Reference Format: accelerator core – scaled by the compute capability of that resource. Derek G. Murray, Jiří Šimša, Ana Klimovic, and Ihor Indyk. tf.data: A The marked point shows that 20% of jobs spend more than a third of Machine Learning Data Processing Framework. PVLDB, 14(12): 2945-2958, their compute time in the input pipeline. When taking into account 2021. the total compute time from all jobs in our analysis (§ 5), we find doi:10.14778/3476311.3476374 that 30% of the total compute time is spent ingesting data. A com- PVLDB Artifact Availability: plementary study of ML model training with public datasets found The source code, data, and/or other artifacts have been made available at that preprocessing data accounts for up to 65% of epoch time [44]. https://github.com/tensorflow/tensorflow. This shows that input data pipelines consume a significant fraction of ML job resources and are important to optimize. Input pipelines of machine learning jobs are often challenging to *Work done while at Google. implement efficiently as they typically need to ingest large volumes of data, apply complex transformations, overlap communication This work is licensed under the Creative Commons BY-NC-ND 4.0 International and computation, and shuffle and batch data with various data License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by ordering guarantees. For example, some jobs require that each ex- emailing [email protected]. Copyright is held by the owner/author(s). Publication rights ample is visited exactly once before any example is visited a second licensed to the VLDB Endowment. time during training. Moreover, to achieve good performance and Proceedings of the VLDB Endowment, Vol. 14, No. 12 ISSN 2150-8097. doi:10.14778/3476311.3476374 avoid input pipeline stalls, the data preprocessing should leverage parallelism and pipelining to overlap preprocessing with model 2945 1.0 training computations. Determining the optimal degree of paral- Number of jobs lelism and amount of data to prefetch is often challenging as it Job compute time 0.8 depends on the nature of the workload and the hardware resources available. 0.6 Hardware accelerators used for ML training further increase the need for efficient input pipelines. Today’s accelerators, such as 0.4 GPUs and TPUs, are tailored towards executing the linear algebra Cumulative fraction operations that are common in ML computations, but have limited 0.2 support for common data preprocessing operations. Hence, input 0.0 data is commonly processed on the CPU and feeding an accelerator 103 106 109 1012 1015 1018 with data at a sufficient rate to saturate its compute capabilities is Bytes read from storage becoming increasingly challenging. The high cost of accelerators compared to their CPU hosts makes it particularly important to Figure 2: CDF of input data size across ML training jobs. 13% ensure that accelerators operate at high utilization [6, 20]. of jobs read more than 1 TB of data. These jobs consume over We present tf.data, an API and a runtime for building and 96% of total compute resources. executing efficient input data pipelines for machine learning jobs. The tf.data API provides generic operators that can be parameterized by user-defined functions, composed, and reused across which implies that preprocessing commonly decreases the volume ML domains. Inspired by the programming models of relational of data. Most notably, we observe that identical input pipelines databases [21, 24], declarative collection libraries [29, 42], and data- are frequently re-executed within and across jobs, suggesting that parallel big-data systems [66, 67], the tf.data API consists of state- caching materialized datasets is a promising future direction to less datasets, which are an abstraction for users to define their explore to improve the performance and efficiency of input data input pipeline, and stateful iterators, which produce a sequence processing for ML. Our findings motivate several other directions of elements and maintain the current position within a dataset. for future research, such as processing data closer to storage and These abstractions allow users to focus on the application logic of disaggregating input data processing from model training to avoid their input pipeline and leave the task of executing the pipeline host resource bottlenecks. efficiently to the tf.data runtime. In particular, tf.data internally represents an input pipeline dataset as a graph and applies static 2 INPUT PIPELINE REQUIREMENTS optimizations using graph rewrites. Furthermore, tf.data can au- Raw input data, such as images, audio, and text files, undergo both tomatically tune parameters such as the degree of parallelism and offline and online preprocessing before being ingested for model data prefetch buffer sizes, which are critical for performance yet training. Offline data preprocessing involves extracting features often challenging for an average ML user to tune by hand. from raw data, validating data [12], and converting data to binary Our evaluation demonstrates that 1) input pipeline performance formats, such as Avro [8], Parquet [9], or TFRecord [58], to enable is critical to end-to-end training time of state-of-the-art ML bench- higher throughput data ingestion. Batch computing frameworks marks, 2) tf.data is capable of improving input pipeline latency such as Apache Spark [67], Beam [1], and Flume [2] are well suited through a combination of software pipelining, parallelization, and to offline preprocessing. While some data transformations, such static optimizations, and 3) tf.data dynamic optimizations avoid as normalization, can be applied during offline preprocessing, ML the need to manually tune performance knobs. For example, we training also requires applying transformations online as examples show that introducing parallelism and software pipelining to the are fed to the model. For instance, image models commonly rely input pipeline of a Resnet50 model training on the ImageNet on data augmentation, e.g. randomly distorting images, to improve dataset results in a 10.4× decrease in time to convergence. Apply- accuracy [17, 54]. Data augmentation multiplies the size of the ing further optimizations with tf.data, such as caching and static original dataset, making it prohibitive to store outputs in interme- optimizations, improves training time by an additional 2×. We also diate files. Our work focuses on online data preprocessing, which demonstrate that tf.data’s auto-tuning matches the performance executes as part of the input pipeline of ML training jobs. of expert hand-tuned input pipelines. The input pipeline of ML training can be characterized as a three- The tf.data API and runtime are open source and integrated stage extract, transform, load (ETL) process. The first stage reads in TensorFlow [3]. At Google, tf.data has been in production use input data from a storage system. Machine learning jobs commonly since 2017 for various ML training jobs, such as supervised learning, train on large data sets. Figure 2 shows that 13% of jobs, out of federated learning, and reinforcement learning; with different data the millions of jobs we analyzed, read at least 1 TB of input data.

A Machine Learning Data Processing Framework

Is 'Distributed' Worth It? Benchmarking Apache Spark with Mesos

Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks

Comparison of Spark Resource Managers and Distributed File Systems

Alluxio: a Virtual Distributed File System by Haoyuan Li A

An Architecture for Fast and General Data Processing on Large Clusters

Thesis Enabling Autoscaling for In-Memory Storage In

Performance Tuning and Evaluation of Iterative Algorithms in Spark

From Cloud Computing to Sky Computing Ion Stoica and Scott Shenker UC Berkeley

©Copyright Haichen Shen

Learning Spark, Second Edition

A Survey on Spark Ecosystem for Big Data Processing

Mllib: Machine Learning in Apache Spark