tf.data: A Machine Learning Data Processing Framework

Derek G. Murray (Lacework*) [email protected] · Jiří Šimša (Google) [email protected] · Ana Klimovic (ETH Zurich*) [email protected] · Ihor Indyk (Google) [email protected]

*Work done while at Google.

ABSTRACT

Training machine learning models requires feeding input data for models to ingest. Input pipelines for machine learning jobs are often challenging to implement efficiently as they require reading large volumes of data, applying complex transformations, and transferring data to hardware accelerators while overlapping computation and communication to achieve optimal performance. We present tf.data, a framework for building and executing efficient input pipelines for machine learning jobs. The tf.data API provides operators that can be parameterized with user-defined computation, composed, and reused across different machine learning domains. These abstractions enable users to focus on the application logic of data processing, while tf.data's runtime ensures that pipelines run efficiently.

We demonstrate that input pipeline performance is critical to the end-to-end training time of state-of-the-art machine learning models. tf.data delivers the high performance required, while avoiding the need for manual tuning of performance knobs. We show that tf.data features, such as parallelism, caching, static optimizations, and optional non-deterministic execution, are essential for high performance. Finally, we characterize machine learning input pipelines for millions of jobs that ran in Google's datacenter fleet, showing that input data processing is highly diverse and consumes a significant fraction of job resources. Our analysis motivates future research directions, such as sharing computation across jobs and pushing data projection to the storage layer.

PVLDB Reference Format: Derek G. Murray, Jiří Šimša, Ana Klimovic, and Ihor Indyk. tf.data: A Machine Learning Data Processing Framework. PVLDB, 14(12): 2945-2958, 2021. doi:10.14778/3476311.3476374

PVLDB Artifact Availability: The source code, data, and/or other artifacts have been made available at https://github.com/tensorflow/tensorflow.

1 INTRODUCTION

Data is the lifeblood of machine learning (ML). Training ML models requires steadily pumping examples for models to ingest and learn from. While prior work has focused on optimizing the accuracy and speed of model training and serving, how we store and preprocess data for machine learning jobs has received significantly less attention. Across the millions of ML jobs we run in Google's datacenters every month, we observe that the input data pipeline accounts for significant resource usage and can greatly impact end-to-end performance. Figure 1 shows how the fraction of compute time that jobs spend in the input pipeline varies, where we define compute time as the time spent on a hardware resource – such as a CPU or an accelerator core – scaled by the compute capability of that resource. The marked point shows that 20% of jobs spend more than a third of their compute time in the input pipeline. When taking into account the total compute time from all jobs in our analysis (§ 5), we find that 30% of the total compute time is spent ingesting data. A complementary study of ML model training with public datasets found that preprocessing data accounts for up to 65% of epoch time [44].

Figure 1: CDF showing the fraction of compute time that millions of ML training jobs executed in our fleet over one month spend in the input pipeline. 20% of jobs spend more than a third of their compute time ingesting data.
This shows that input data pipelines consume a significant fraction of ML job resources and are important to optimize.

Input pipelines of machine learning jobs are often challenging to implement efficiently as they typically need to ingest large volumes of data, apply complex transformations, overlap communication and computation, and shuffle and batch data with various ordering guarantees. For example, some jobs require that each example is visited exactly once before any example is visited a second time during training. Moreover, to achieve good performance and avoid input pipeline stalls, the data preprocessing should leverage parallelism and pipelining to overlap preprocessing with model training computations.

Determining the optimal degree of parallelism and the amount of data to prefetch is often challenging, as it depends on the nature of the workload and the hardware resources available.

Hardware accelerators used for ML training further increase the need for efficient input pipelines. Today's accelerators, such as GPUs and TPUs, are tailored towards executing the linear algebra operations that are common in ML computations, but have limited support for common data preprocessing operations. Hence, input data is commonly processed on the CPU, and feeding an accelerator with data at a sufficient rate to saturate its compute capabilities is becoming increasingly challenging. The high cost of accelerators compared to their CPU hosts makes it particularly important to ensure that accelerators operate at high utilization [6, 20].

Figure 2: CDF of input data size across ML training jobs. 13% of jobs read more than 1 TB of data. These jobs consume over 96% of total compute resources.

We present tf.data, an API and a runtime for building and executing efficient input data pipelines for machine learning jobs. The tf.data API provides generic operators that can be parameterized by user-defined functions, composed, and reused across ML domains. Inspired by the programming models of relational databases [21, 24], declarative collection libraries [29, 42], and data-parallel big-data systems [66, 67], the tf.data API consists of stateless datasets, which are an abstraction for users to define their input pipeline, and stateful iterators, which produce a sequence of elements and maintain the current position within a dataset. These abstractions allow users to focus on the application logic of their input pipeline and leave the task of executing the pipeline efficiently to the tf.data runtime. In particular, tf.data internally represents an input pipeline dataset as a graph and applies static optimizations using graph rewrites. Furthermore, tf.data can automatically tune parameters such as the degree of parallelism and data prefetch buffer sizes, which are critical for performance yet often challenging for an average ML user to tune by hand.

Our evaluation demonstrates that 1) input pipeline performance is critical to end-to-end training time of state-of-the-art ML benchmarks, 2) tf.data is capable of improving input pipeline latency through a combination of software pipelining, parallelization, and static optimizations, and 3) tf.data dynamic optimizations avoid the need to manually tune performance knobs. For example, we show that introducing parallelism and software pipelining to the input pipeline of a Resnet50 model training on the ImageNet dataset results in a 10.4× decrease in time to convergence. Applying further optimizations with tf.data, such as caching and static optimizations, improves training time by an additional 2×. We also demonstrate that tf.data's auto-tuning matches the performance of expert hand-tuned input pipelines.

The tf.data API and runtime are open source and integrated in TensorFlow [3]. At Google, tf.data has been in production use since 2017 for various ML training jobs, such as supervised learning, federated learning, and reinforcement learning, with different data modalities, including text, image, and video data. The system is used daily by hundreds of thousands of ML jobs in our fleet.

We conduct a fleet-wide analysis of tf.data jobs to characterize the input pipelines of millions of real machine learning jobs and identify opportunities for future work in data preprocessing systems. We find that the set of transformations applied in input pipelines varies greatly across jobs. For 75% of jobs, the materialized dataset is smaller in size than the raw input data read from storage, which implies that preprocessing commonly decreases the volume of data. Most notably, we observe that identical input pipelines are frequently re-executed within and across jobs, suggesting that caching materialized datasets is a promising future direction to explore to improve the performance and efficiency of input data processing for ML. Our findings motivate several other directions for future research, such as processing data closer to storage and disaggregating input data processing from model training to avoid host resource bottlenecks.

2 INPUT PIPELINE REQUIREMENTS

Raw input data, such as images, audio, and text files, undergo both offline and online preprocessing before being ingested for model training. Offline data preprocessing involves extracting features from raw data, validating data [12], and converting data to binary formats, such as Avro [8], Parquet [9], or TFRecord [58], to enable higher-throughput data ingestion. Batch computing frameworks, such as Spark [67], Beam [1], and Flume [2], are well suited to offline preprocessing. While some data transformations, such as normalization, can be applied during offline preprocessing, ML training also requires applying transformations online as examples are fed to the model. For instance, image models commonly rely on data augmentation, e.g. randomly distorting images, to improve accuracy [17, 54]. Data augmentation multiplies the size of the original dataset, making it prohibitive to store outputs in intermediate files. Our work focuses on online data preprocessing, which executes as part of the input pipeline of ML training jobs.

The input pipeline of ML training can be characterized as a three-stage extract, transform, load (ETL) process. The first stage reads input data from a storage system. Machine learning jobs commonly train on large datasets. Figure 2 shows that 13% of jobs, out of the millions of jobs we analyzed, read at least 1 TB of input data. This means that for a non-trivial fraction of training jobs, the input data cannot fit in memory. Furthermore, over 96% of total compute resources across jobs are spent in jobs that read over 1 TB of data.

The second stage transforms the data to a format amenable to ML training computation. It applies transformations to the input data, such as sampling, permuting, and filtering data to extract the subset of most relevant features. When training image models, it is common practice to apply data augmentation such as clipping, resizing, flipping, and blurring images.

For text pipelines, training examples commonly need to be grouped and batched based on sequence length. Finally, the third stage loads the data onto the accelerator device that executes the training computation.

ML training imposes unique requirements on input data pipelines. We describe these requirements below and summarize why they are not adequately addressed by other systems.

Data ordering. Unlike many data-parallel data processing platforms [16, 66, 67], ML training is sensitive to the order in which records are delivered. The most common training algorithms are derived from stochastic gradient descent [50], which accesses the input examples pseudo-randomly. Empirically, convergence is more rapid when the algorithm makes multiple passes over input examples (called epochs), and uses a different random permutation of the input data on each pass (or equivalently, samples examples without replacement within each epoch) [10]. Furthermore, to improve system efficiency via vectorization and reduced communication, the input pipeline typically concatenates consecutive examples into a batch that is processed in a single training step.

The final parameters of a trained model can be sensitive to the exact order in which the input examples were consumed. To aid in debugging, especially when porting models between different hardware architectures, tf.data must be able to produce pseudo-random results in a deterministic order, according to a given seed. While such a feature is useful for debugging, it is in tension with high performance, since any variability in the element processing time could lead to head-of-line blocking. Therefore, while tf.data defaults to deterministic execution, a user can disable it to mitigate the effect stragglers have on end-to-end performance.

Finally, both the end-to-end training computation and the individual epochs can take a long time to complete. To provide ordering guarantees in the presence of preemptions – commonplace in our data centers – the data processing computation for ML training jobs must be checkpointable.

Performance. A single training step consumes a batch of input elements and updates the current weights of the model. Often, the step computation runs on an accelerator device – such as a GPU or TPU [30] – that can compute vector floating point operations efficiently, although the computation may also run on a multi-core CPU. Ideally, the data processing computation is pipelined with the training computation, minimizing the likelihood that the training computation is blocked waiting for the next batch of elements and hence maximizing the utilization of valuable accelerator resources.

The input pipeline is responsible for fetching the raw input data from storage and transforming it into input features for the model. For example, the raw input for an image classification model might be a protocol buffer [19] containing a JPEG-encoded image, and the input pipeline must convert the raw input into a dense three-dimensional array of floating point values corresponding to the RGB values of each pixel. Along the way, the input pipeline must extract and decode the JPEG and apply additional transformations such as affine transformations and colorspace changes to augment the training data [54]. These activities are CPU-intensive and must make efficient use of available CPU resources to maximize input pipeline throughput.

Ease of use. Machine learning workloads in a typical large organization span different domains, storage systems, data formats, and accelerator hardware. Therefore, it must be possible to combine pipeline stages in unanticipated ways, and to extend the system with new data sources and transformations. To emphasize the importance of flexibility: in our fleet-wide analysis of ML jobs, we classified transformations into categories – such as reading input data from storage, caching, batching, or shuffling – and recorded the combination of transformation categories used by each job. While the 10 most common combinations of transformations account for over 75% of jobs, there is a heavy tail with over 1000 combinations of transformations in total. In addition to supporting diverse input pipelines, we also require the input pipeline framework to address the tension between performance and ease of use. Optimizing an input pipeline can require expertise in how to structure operations and tune performance-related parameters, such as degrees of parallelism and pipeline buffer sizes. Hence, we require that tf.data can optimize an input pipeline automatically.

Before designing tf.data, we evaluated several existing input pipeline implementations and found that they did not meet our requirements in all of the above areas: 1) PyTorch's DataLoader API [14] is easy to use (it provides a simple Python interface), but its reliance on Python on the critical path – despite the use of multiprocessing to work around the interpreter lock bottleneck – and its assumption of uniform random access to all input data do not satisfy our performance requirement, especially for multi-terabyte datasets. 2) MXNet's DataIter API [47] uses a native C++ implementation for greater performance than PyTorch, but it requires users to add native extensions in order to handle new preprocessing schemes. It therefore does not help our users with diverse data processing needs, who tend to prefer writing Python, and who are often restricted to memory-safe programming languages for security reasons. 3) NVIDIA's Data Loading Library (DALI) API [22] enables some preprocessing, such as image decoding, to be offloaded to a GPU. This offloading partially fulfils our performance requirement, but it lacks the flexibility to support heterogeneous preprocessing workloads and different types of accelerators.

In the next section, we present the tf.data programming model, which is based on chaining higher-order functional transformations and is inspired by LINQ [42]. Several data processing systems offer a similar programming model, including DryadLINQ [66], Spark [67], and Naiad [46]. We discuss them in more detail in § 6. For pragmatic reasons, we did not consider using any of these systems, because the impedance mismatch with TensorFlow's C++ codebase would severely limit performance. Furthermore, these systems are designed to optimize data-parallel computations, with a large number of independent values in each batch. This makes it difficult or inefficient for them to produce values sequentially, to fulfill the sequential ordering requirement. While one could use a system like Spark Streaming [68] for online preprocessing and pass data to the ML framework through an in-memory buffer, the additional copies would have significant overhead due to the short step times in ML training workloads. In the training workloads we have analyzed, step times of less than 1 ms are not uncommon and most workloads have step times of less than 10 ms.

The extra copy overhead would be especially significant in the common case where memory bandwidth is the bottleneck. By integrating tf.data directly into TensorFlow, and sharing the same threadpools and memory allocators, we avoid this overhead.

3 DESIGN AND IMPLEMENTATION

In § 3.1, we present tf.data's API, which enables users to compose and parameterize operators. In § 3.2 and § 3.3 we discuss key aspects of tf.data's runtime.

3.1 Datasets and Iterators

The tf.data Dataset represents the stateless definition of an input pipeline as a (potentially infinite) sequence of elements. A dataset can either be a source dataset that is created from primitive values (e.g. a matrix of floating-point numbers representing input examples, or a vector of strings representing filenames), or a transformed dataset that transforms one or more input datasets into a new sequence of elements. The elements of a dataset are statically typed, and valid element types include tensors (with a specific element type and optional shape) and composite types (such as tuples, optionals, and nested datasets). Together, source and transformed datasets form an expression tree that represents the entire input pipeline. Table 1 shows the Dataset interface.

tf.data includes source datasets that support common file formats, and transformed datasets that implement functional transformations and may be parameterized by user-defined functions (UDFs). The UDFs can be written in Python, and tf.data uses TensorFlow's Autograph library to convert them into dataflow graphs [45]. Table 2 summarizes the most common tf.data transformations.

Table 2: Common tf.data source and transformed datasets.

Dataset       Description
batch         Concatenates multiple elements into a single element.
cache         Stores the input data in memory.
concatenate   Concatenates two datasets.
from_file     Reads elements from a file, e.g. TextFileDataset.
from_tensors  Creates a singleton dataset from data in memory.
filter        Returns elements matching a predicate.
flat_map      Maps elements to datasets and flattens the result.
interleave    Like flat_map, but mixes outputs from input elements.
map           Transforms individual elements.
prefetch      Adds a buffer to pipeline input production.
reduce        Reduces a dataset to a single element.
repeat        Produces the input dataset multiple times.
shard         Selects a subset of elements from the dataset.
shuffle       Randomizes the order of elements.
unbatch       Splits input elements on the 0th dimension.
zip           Combines elements of multiple datasets into tuples.

The tf.data Iterator represents the current state of traversing a Dataset. An iterator provides sequential access to the elements of a dataset via the get_next operation, which either returns a typed element, or an error status such as "out-of-range" (EOF). In tf.data, implementations of the Iterator interface are thread-safe, so multiple threads can call get_next concurrently to improve throughput, at the expense of determinism. The interface also includes save and restore methods to support checkpointing.

The iterator interface (Table 3) abstracts all details of how the elements are produced, including internal buffering and parallelism. Before applying optimizations, there is a one-to-one correspondence between dataset and iterator objects, but the optimizations in § 3.3 exploit the iterator abstraction to change the underlying dataset graph, and optimize how elements are produced, while presenting the same interface.

The example in Figure 3 illustrates a training loop that uses a tf.data input pipeline to read elements from files, apply user-defined processing logic to each element, and combine the processed elements into a mini-batch.

3.2 Parallel and Distributed Execution

To efficiently utilize available host resources, tf.data provides transformations that enable software pipelining and parallel execution of computation and I/O. The prefetch transformation decouples the producer and consumer of data using an internal buffer, making it possible to overlap their computation. Input pipelines can use this transformation to overlap host computation, host-to-device transfer, and device computation. The map transformation takes an optional argument that specifies the degree of parallelism to use for applying the user-defined computation to input elements concurrently. The interleave transformation provides a similar optional argument that specifies the degree of parallelism to use for fetching data from input elements concurrently. In particular, the interleave transformation can parallelize I/O by interleaving data read from multiple files. By default, tf.data transformations produce elements in a deterministic order. However, as deterministic ordering can lead to head-of-line blocking, the parallel map and interleave transformations provide an option for enabling non-deterministic ordering, which can result in better performance at the expense of reproducibility.
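As an illustration of this option, the following sketch (ours, not code from the paper) opts a parallel interleave and a parallel map out of deterministic ordering; it assumes a user-defined parse function as in Figure 3 and the deterministic keyword argument exposed by recent TensorFlow releases.

import tensorflow as tf

files = tf.data.Dataset.from_tensor_slices(["a.txt", "b.txt"])  # placeholder file list
# Allow parallel reads and parallel map calls to return elements out of
# order, trading reproducibility for resistance to stragglers.
ds = files.interleave(tf.data.TextLineDataset,
                      cycle_length=4,
                      num_parallel_calls=4,
                      deterministic=False)
ds = ds.map(parse, num_parallel_calls=8, deterministic=False)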

Table 1: Dataset interface

Method          Description
make_iterator   Creates a new iterator over the dataset.
serialize       Converts the dataset to a serialized expression.
element_spec    Returns the type signature of dataset elements.

Table 3: Iterator interface

Method     Description
get_next   Returns the next element, or raises EOF.
save       Writes the iterator state to a file.
restore    Reads the iterator state from a file.
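The save and restore entries in Table 3 correspond to iterator checkpointing. A minimal sketch of how this is typically reached from Python in recent TensorFlow releases (an assumption on our part; the checkpoint path is hypothetical):

import tensorflow as tf

ds = tf.data.Dataset.range(100)
it = iter(ds)
ckpt = tf.train.Checkpoint(iterator=it)       # iterator state is trackable
path = ckpt.save("/tmp/input_pipeline_ckpt")  # hypothetical path
next(it)                                      # advance the iterator ...
ckpt.restore(path)                            # ... and roll it back to the saved position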

ds = tf.data.TextFileDataset(["a.txt", ...])
ds = ds.map(parse).batch(batch_size=10)
for elem in ds:
  train_step(elem)

Figure 3: Example of a training loop using a tf.data input pipeline that reads from a list of text files. parse is a user-defined function for data preprocessing.

ds = tf.data.Dataset.from_tensors(["a.txt", ...])
ds = ds.interleave(
    tf.data.TextFileDataset, cycle_length=2,
    num_parallel_calls=2)
ds = ds.map(parse, num_parallel_calls=10)
ds = ds.batch(batch_size=10)
ds = ds.prefetch(buffer_size=1)
for elem in ds:
  train_step(elem)

Figure 4: Optimized version of Figure 3 with a tf.data input pipeline that employs parallelism (for I/O and parsing) and software pipelining.

To illustrate the benefits of the above transformations, we revisit the example presented in Figure 3. Let us assume that it takes 5 ms to read an element from the file, 2 ms to apply the user-defined logic to an element, and 1 ms to batch 10 elements. The accelerator would be idle for (5 + 2) × 10 + 1 = 71 ms at the start of each iteration before data for the training computation becomes available.

The tf.data input pipeline in Figure 4 is semantically equivalent to that of Figure 3. However, it uses 1) the optional num_parallel_calls argument of interleave and map to parallelize I/O and computation respectively, and 2) prefetch to overlap the input pipeline computation with the training computation. As a result, the input pipeline in Figure 4 will take max(10 × 5/2, 10 × 2/10, 1) = 25 ms to produce a batch (assuming a sufficiently slow consumer) and the input pipeline computation (of the next batch) will be overlapped with the training computation on the accelerator (for the current batch). If the training computation takes more than 25 ms, the data for each iteration of the training loop will be ready by the time the iteration starts. In § 3.3.2 we describe a mechanism for auto-tuning parallelism and buffer sizes so that users do not have to tune them manually.

While interleave is typically used to parallelize I/O, it can also be used for parallel execution of multiple copies of an arbitrary input pipeline (operating over different shards of the input data). We have found this mechanism useful to speed up input pipelines bottlenecked by inherently sequential transformations, such as filter or unbatch.

In addition to supporting efficient single-host execution, we also designed tf.data for distributed ML training use cases, such as data-parallel synchronous training across multiple hosts (and accelerators per host). In this setup, each host has a tf.data input pipeline providing data for the accelerators attached to the host. To provide for clean separation of epochs, the input data can be sharded across multiple files, and the shard transformation ensures that different hosts operate over different shards of the data. The sharded input pipelines do not communicate with each other.
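A minimal sketch of this per-host setup (ours, not code from the paper; num_hosts and host_index are placeholders that a training framework would supply):

import tensorflow as tf

# Each host processes its own shard of the input files; the per-host
# pipelines never communicate with each other.
files = tf.data.Dataset.list_files("/data/train-*.tfrecord", shuffle=False)
ds = files.shard(num_shards=num_hosts, index=host_index)
ds = ds.interleave(tf.data.TFRecordDataset, num_parallel_calls=4)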
3.3 Automatic Optimization

tf.data's functional programming model enables it to provide multiple different implementations for a single input pipeline. Automatic static (§ 3.3.1) and dynamic (§ 3.3.2) optimizations improve tf.data's performance and usability.

3.3.1 Static Optimizations. At run-time, tf.data can reflect on the expression tree of any dataset and replace it with a more efficient version. We implemented static optimizations as a virtual dataset transformation that converts the input dataset to an expression tree, applies a suite of rewriting rules, and then evaluates the rewritten expression tree to produce an output dataset. The current implementation uses TensorFlow's GraphDef protocol buffer as the representation and the Grappler optimization framework [56] to manipulate these expression trees. We are investigating the use of MLIR [37] as a richer representation that will enable us to reuse optimizations from other domains.

As we gained experience with tf.data, we created several custom transformations that fuse commonly adjacent transformations for performance reasons. These transformations are: map + batch fusion, shuffle + repeat fusion, map + map fusion, map + filter fusion, and filter + filter fusion. For example, the map + batch fusion transforms d.map(f).batch(b) into map_and_batch(f, b), which is functionally equivalent, but the implementation of the fused operator parallelizes and pipelines the copies of each element into the output batch with the processing of other batch elements. Many of the fusion optimizations in tf.data are inspired by deforestation in functional languages [60]. As the simplest example, the map + map fusion transforms the d.map(f).map(g) expression into d.map(g ∘ f). This eliminates the per-element overhead of an iterator – a virtual call to get_next and one of two function dispatches – and the composition g ∘ f may be optimized further by Grappler's standard optimization passes, such as arithmetic optimization and dead code elimination.

tf.data static optimizations are not limited to fusions. Map vectorization is an optimization that transforms d.map(f).batch(b) into the more efficient d.batch(b).map(pfor(f)). In the transformed expression, pfor(f) applies f to every slice of the batch in parallel [4]. This increases the efficiency of the resulting code by converting multiple invocations of a per-element operation (e.g. tf.matmul()) into a single invocation of a batched operation (e.g. tf.batch_matmul()) that itself has an efficient vectorized implementation. It also reduces the framework-induced overhead by replacing b function invocations with a single invocation.

3.3.2 Dynamic Optimizations. In many cases, the optimal configuration for a tf.data pipeline depends on properties of the input data (e.g. raw image sizes) and the available resources (e.g. number of CPU cores, RAM, and network bandwidth). Hence, tf.data provides configuration parameters such as the degree of parallelism for map transformations and the size of the buffer for the prefetch transformation.

To avoid the need for users to manually tune performance-related knobs, the tf.data runtime contains an auto-tuning mechanism that allocates CPU and RAM resources across various parts of the input pipeline in a way that minimizes the (expected) latency of the input pipeline producing an element.
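In the public TensorFlow 2.x API, for example, a user opts into this mechanism by passing an AUTOTUNE sentinel instead of hand-picked values. The sketch below is ours and assumes the tf.data.AUTOTUNE constant of recent releases together with the files/parse placeholders from Figure 3.

AUTOTUNE = tf.data.AUTOTUNE  # tf.data.experimental.AUTOTUNE in older releases

ds = tf.data.TFRecordDataset(files, num_parallel_reads=AUTOTUNE)
ds = ds.map(parse, num_parallel_calls=AUTOTUNE)  # parallelism chosen at runtime
ds = ds.batch(batch_size=32)
ds = ds.prefetch(buffer_size=AUTOTUNE)           # buffer size chosen at runtime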

In the rest of this section, we refer to the time it takes for an iterator to produce an element as its output latency; the output latency of an input pipeline is the output latency of the iterator for its final transformation.

To perform auto-tuning, tf.data executes the input pipeline in a light-weight harness that maintains a tree representation of the iterators currently executing as part of the input pipeline, and measures the processing time spent in each of the iterators. The root of the tree is the iterator producing data for the training computation, the leaves of the tree correspond to source dataset iterators, and edges are implied by the input-output relationship between transformed datasets' iterators and their inputs. The tree structure can change over time as transformations such as interleave or repeat create multiple iterators during their lifetime.

The auto-tuning implementation uses the processing time and the input pipeline structure to build an analytical model that is used to estimate how input pipeline parameters affect end-to-end latency. The estimating function is a composition of the output latencies of individual iterators as functions of the tunable parameters, the iterator's processing time, and the inputs' output latency. The outermost function of the composition is the one for the final iterator. For synchronous transformations (i.e. transformations that do not decouple producer and consumer), the output latency of an iterator is a linear function of the output latencies of its inputs and the processing time spent in the iterator. For asynchronous transformations, such as prefetch and the parallel map and interleave, the output latency of an iterator is no longer linear and additionally depends on the parallelism, buffer size, and the rate of the consumer. In particular, the expected output latency of the iterator is computed as the output latency of its input(s) multiplied by the probability that the buffer is empty, which we model using an M/M/1/k queue [53] and estimate as:

    p_empty = 1 / (n + 1)                          if x = y
    p_empty = (1 − x/y) / (1 − (x/y)^(n+1))        otherwise        (1)

where n is the buffer size, x is the producer rate (computed from the output latency of the input iterator(s)), and y is the consumer rate (computed from the frequency of get_next calls).

Figure 5: Output latency estimation: the downward traversal computes the consumer rate, starting with the root and adjusting it by the number of concurrent get_next calls from the consumer and the number of iterators. The upward traversal can compute the output latency of each iterator in the tree, since by the time the traversal returns to an iterator, the output latency of its inputs is known. Asynchronous transformations prefetch, parallel map, and interleave use (1) to estimate the output latency, whereas a synchronous batch produces an estimate with a linear function of its own processing time and the output latency of its input.
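For concreteness, the following sketch (ours; the actual implementation lives in the tf.data C++ runtime) evaluates equation (1) and the resulting expected output latency of an asynchronous iterator:

def p_empty(x, y, n):
    # Probability that the buffer of an asynchronous transformation is empty,
    # modeled as an M/M/1/k queue: x = producer rate, y = consumer rate,
    # n = buffer size (equation (1)).
    if x == y:
        return 1.0 / (n + 1)
    rho = x / y
    return (1.0 - rho) / (1.0 - rho ** (n + 1))

def async_output_latency(input_output_latency, x, y, n):
    # Expected output latency of prefetch / parallel map / parallel interleave:
    # the output latency of the input(s) scaled by the buffer-empty probability.
    return input_output_latency * p_empty(x, y, n)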
Note that the producer rate, x, in general depends on upstream computation, while the consumer rate, y, in general depends on downstream computation. We traverse the iterator tree depth first to estimate both x and y in a single traversal.

To illustrate how the estimation works, let us revisit the example from Figure 4, additionally assuming that 1) the num_parallel_calls and buffer_size arguments are set to the special AUTOTUNE value to enable auto-tuning, 2) the training computation requests data every 10 ms on average, and 3) the auto-tuning harness is estimating the following combination: interleave parallelism 1 and buffer size 1, map parallelism 5 and buffer size 5, and prefetch buffer size 2. Figure 5 gives an example of how tf.data computes the output latency for such a pipeline.

tf.data creates a background thread that periodically uses the estimation process above to evaluate different combinations of parallelism and buffer sizes for tunable transformations. Parameters are chosen to minimize the expected output latency of the input pipeline subject to CPU and RAM budget constraints. The optimization uses a gradient descent algorithm and is depicted in Figure 6. The optimization period ranges from milliseconds to seconds and is determined automatically based on changes to the input pipeline structure and execution time.

An important aspect of the optimization is its ability to minimize the output latency of the end-to-end input pipeline as opposed to minimizing the output latency of individual transformations. As different transformations share the same CPU and RAM resources, locally optimal decisions may lead to excessive parallelism and buffering, which in turn lead to inefficient thread scheduling and poor cache locality, negatively affecting end-to-end performance. The ability to perform the optimization analytically is essential; it allows tf.data to quickly find a good configuration without affecting the performance of the real input pipeline while evaluating sub-optimal configurations. Once the background thread identifies a configuration to use, it updates the parallelism and buffer sizes of the actual input pipeline accordingly. For most input pipelines the optimization takes less than a few milliseconds to complete.

while True:
  model = pipeline.get_analytical_model()
  params = model.get_tunable_parameters()
  best_latency = INFINITY
  latency = model.latency()
  while (best_latency - latency >= EPS and
         model.resource_usage() <= BUDGET):
    best_latency = latency
    params -= DELTA * model.latency_grad()
    latency = model.latency()
  pipeline.set_tunable_parameters(params)
  sleep(OPTIMIZATION_PERIOD)

Figure 6: Periodic optimization of tunable parameters.

Table 4: MLPerf input data overview.

Dataset    Domain                 Artifacts     Size
ImageNet   image classification   1.3M images   140 GB
COCO       object detection       330K images   19 GB
WMT16      translation            4M pairs      1.3 GB
WMT17      translation            4.5M pairs    720 MB

Figure 7: Speedup of input pipeline processing time with different configurations (expert-tuned parallelism; expert-tuned parallelism + optimizations; autotuned parallelism + optimizations), relative to a sequential input pipeline.

4 EVALUATION

To evaluate tf.data we seek to answer the following questions: 1) how do tf.data's performance-related features affect input pipeline throughput, 2) how do input pipeline optimizations impact the end-to-end time to reach a target accuracy when training state-of-the-art ML models, and 3) how does tf.data performance compare to other systems?

For our evaluation, we used the open-source MLPerf benchmark suite [41], which is the de facto standard for evaluating ML software and hardware systems by measuring how fast a system can train models to a target quality metric. We use tf.data to implement the input pipelines for each of the MLPerf benchmarks. Our evaluation considers the following combinations of model architectures and input data: 1) Resnet50 [26] with ImageNet [17], 2) SSD [40] with COCO [39], 3) Mask-RCNN [25] with COCO [39], 4) GNMT [64] with WMT16 [62], and 5) Transformer [59] with WMT17 [63]. Table 4 summarizes the attributes of the MLPerf datasets, which range from 135 MB to 140 GB in size. Though these public datasets fit in memory before decompression and/or data augmentation, in Section 5.1 we discuss our experience with production workloads, which commonly preprocess larger-than-memory datasets (Figure 2). When dealing with such datasets, tf.data's prefetching and software pipelining optimizations become even more critical for end-to-end performance.

Table 5 shows the performance-related features of tf.data used in the input pipelines of our MLPerf benchmark implementations. All input pipelines used the map, interleave, and prefetch transformations for parallel computation, parallel I/O, and software pipelining, respectively. All pipelines also used non-deterministic ordering to mitigate the effect of stragglers. With the exception of Transformer, the input pipelines used static tf.data optimizations to benefit from transformation fusion, and the cache transformation to materialize intermediate preprocessing artifacts in memory and avoid their recomputation across epochs. Note that intermediate artifacts cannot always be materialized, as they may be the result of a randomized transformation which produces a different result each epoch. Finally, the image-based input pipelines (Resnet50, SSD, and Mask-RCNN) disabled intra-op parallelism for tf.data computation. Intra-op parallelism makes it possible to parallelize execution of individual TensorFlow ops, such as tf.matmul, but this comes at the expense of increased CPU usage. For tf.data input pipelines, intra-op parallelism generally provides little benefit, as there is plenty of inter-op and inter-element parallelism. Superfluous fine-grained parallelism can actually hurt performance due to poorer cache locality and additional synchronization.

4.1 Input Pipeline Experiments

Methodology: To evaluate the effect of tf.data performance-related features on input pipeline throughput, we executed the input pipeline portion of our MLPerf benchmark implementations in a tight loop (with no model training computation) and measured the time it takes to process an epoch's worth of data. We used a single machine with 56 Intel Xeon 2.60 GHz CPU cores, 128 GB of RAM, and the input data stored on a 1 TB Samsung SM961 SSD. We limited the Resnet50 experiment to only use 60% of the ImageNet data to make sure that an epoch's worth of data can be cached in memory. For each of the input pipelines we ran the following experiments: 1) a baseline which does not use any tf.data performance features (i.e. sequential reading and processing), 2) a version that uses expert-tuned parallelism for I/O and compute, 3) a version that uses all tf.data performance features in Table 5 with expert-tuned parallelism, and 4) a version that uses all tf.data performance features with auto-tuned parallelism. Note that even though the baseline does not use input pipeline parallelism, TensorFlow may still parallelize the user-defined computation in map. (Expert-tuned parallelism sets map parallelism to the number of CPU cores available on the machine, interleave parallelism to a constant between 10 and 64 tuned based on available I/O bandwidth, and the prefetch buffer size to an empirically tuned multiple of batch size.)
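The following sketch (ours, not the benchmark harness used in the paper) shows the kind of tight loop used to time an input pipeline in isolation:

import time

def time_one_epoch(ds):
    # Drain one epoch of the input pipeline with no model computation and
    # return the elapsed wall-clock time in seconds.
    start = time.perf_counter()
    for _ in ds:
        pass
    return time.perf_counter() - start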
cache mizations to benefit from transformation fusion and the Results: Figure 7 shows the mean duration of a single epoch, transformation to materialize intermediate preprocessing artifacts normalized to the epoch duration of the baseline, which does not use in memory to avoid their recomputation across epochs. Note that any tf.data performance-related features. On the 56-core machine intermediate artifacts cannot always be materialized as they may used for the experiment, the speedups ranged from 2.7× (Mask- be a result of a randomized transformation which produces a dif- RCNN) to 63.1× (SSD). Since we are parallelizing both compute and ferent result each epoch. Finally, the image-based input pipelines I/O we could achieve speedup greater than 56×. (Resnet50, SSD, and Mask-RCNN) disabled intra-op parallelism for tf.data computation. Intra-op parallelism makes it possible to par- 1Expert-tuned parallelism sets map parallelism to the number of CPU cores available on the machine, interleave parallelism to a constant between 10 and 64 tuned based allelize execution of individual TensorFlow ops, such as tf.matmul, on available I/O bandwidth, and the prefetch buffer size to an empirically tuned but this comes at the expense of increased CPU usage. For tf.data multiple of batch size.

Table 5: tf.data features used by different MLPerf benchmarks.

              Parallel      Parallel   Software     Non-            Caching   Static         No intra-op
              computation   I/O        pipelining   deterministic             optimization   parallelism
Resnet50      ✓             ✓          ✓            ✓               ✓         ✓              ✓
SSD           ✓             ✓          ✓            ✓               ✓         ✓              ✓
Mask-RCNN     ✓             ✓          ✓            ✓               ✓         ✓              ✓
GNMT          ✓             ✓          ✓            ✓               ✓         ✓
Transformer   ✓             ✓          ✓            ✓

The performance of the Resnet50 and SSD input pipelines benefits significantly from tf.data's performance optimizations, and these input pipelines can fully utilize the available CPU. In particular, map + batch fusion yields the most significant speedup among static optimizations for these two benchmarks, as it enables computing multiple batches in parallel. In contrast, the performance of the Mask-RCNN, GNMT, and Transformer input pipelines sees less benefit from the tf.data optimizations. For Mask-RCNN, the reason for the limited speedup is two-fold: 1) the baseline already employs parallelism, as the user-defined computation applied to each element can be parallelized by TensorFlow, and 2) the input pipeline is bottlenecked by batching, which is performed sequentially because an intermediate step between map and batch in the pipeline prevents map + batch fusion. Similarly, the text pipelines (GNMT and Transformer) did not benefit from map + batch fusion, as elements need to be grouped based on size after the map operation before they are batched, but the tf.data runtime does not currently support map + groupby + batch fusion. Most benchmarks saw less than 4% improvement in training time with non-deterministic vs. deterministic data ordering; however, Resnet50 benefited more (approx. 40% higher throughput), as its dataset (ImageNet) has a wide distribution of image sizes and non-deterministic ordering avoids head-of-line blocking.

For all of the input pipelines, using auto-tuned parallelism results in performance comparable to the expert-tuned versions. This demonstrates that the algorithm described in § 3.3 is able to automatically configure performance knobs similarly to a human expert.

4.2 End-to-End Experiments

Methodology: To evaluate how input pipeline performance optimizations with tf.data translate to end-to-end performance benefits when training state-of-the-art ML models, we measured the time it takes to reach target accuracy with our tf.data-based implementation of the MLPerf benchmarks. We executed each benchmark using 8 hosts with 112 CPU cores, 100 GB of RAM, and 8 TPUv3 accelerators each [30]. For the Mask-RCNN benchmark, we used 400 GB of RAM per host to ensure that intermediate artifacts can be fully cached in memory. We ran the following experiments for each benchmark: 1) a baseline that trains the MLPerf model with sequential reading and processing of input data, 2) a version that uses expert-tuned parallelism for I/O and compute in the input pipeline, 3) a version that uses all tf.data performance features with expert-tuned parallelism, and 4) a version that uses all tf.data performance features with auto-tuning.

Figure 8: Speedup of the time to convergence for MLPerf workloads with tf.data optimizations, relative to execution with a sequential input pipeline.

Results: Figure 8 shows the end-to-end training time speedup (relative to the model training time with a sequential input pipeline) for each MLPerf benchmark. We draw several insights from these results. First and foremost, the performance of the input pipeline significantly affects the end-to-end training performance. Second, computation and I/O parallelism is necessary but not sufficient to match the rate at which accelerators perform training computation. Compared to using a sequential input pipeline as the baseline, adding software pipelining and parallelism in the input pipeline improved end-to-end training time by 7× on average across the five MLPerf benchmarks. For the image-based input pipelines (Resnet50, SSD, and Mask-RCNN), the end-to-end performance benefited further from the application of tf.data performance-oriented features, providing an additional 2×, 1.2×, and 1.4× speedup, respectively. For the text-based input pipelines (GNMT and Transformer), parallelism and software pipelining alone were sufficient to match the rate at which data was consumed by the training computation.

Figure 8 also compares the training time with the expert-tuned tf.data configuration to the training time with the auto-tuned configuration. Similarly to the input pipeline experiments, we find that using tf.data's dynamic optimizations to select parameters such as the degree of parallelism and prefetch buffer sizes leads to similar performance compared to the expert-tuned pipelines. The end-to-end time to convergence with dynamic tuning is within 1% of the time to convergence with expert-tuned input pipelines for Resnet50, SSD, Mask-RCNN, and Transformer, and within 4% for GNMT. This demonstrates that tf.data can effectively relieve users from the burden of hand-tuning input pipeline configurations.

Finally, we also verified that tf.data optimizations enable input pipelines to match the rate at which accelerators perform training computations for state-of-the-art models. For each MLPerf benchmark, we measured the time it takes to ingest a batch of data and perform the model computation when using 1) an optimized tf.data input pipeline versus 2) an artificial input pipeline that produces data as fast as possible (by caching a single batch and repeating it infinitely). The artificial pipeline does not perform any data processing and hence serves as an upper bound on input pipeline performance. Step times with optimized tf.data pipelines match the upper-bound performance; hence the MLPerf benchmarks are no longer input bound after tf.data optimizations.
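One way to build such an artificial upper-bound pipeline with standard tf.data transformations (a sketch; not necessarily the exact code used in the experiments, and real_ds is a placeholder for the real pipeline) is to cache a single batch and repeat it indefinitely:

# Produces the first batch of the real pipeline over and over, with no
# further I/O or preprocessing on the critical path.
upper_bound_ds = real_ds.take(1).cache().repeat()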

Table 6: ImageNet-Resnet50 input data processing time with tf.data vs. NVIDIA DALI and PyTorch DataLoader.

Input data framework   Hardware       Epoch duration (s)
PyTorch DataLoader     CPU-only       213
NVIDIA DALI            CPU-only       777
NVIDIA DALI            CPU + 1 GPU    172
NVIDIA DALI            CPU + 2 GPUs   107
tf.data                CPU-only       110

Table 7: Best end-to-end MLPerf v0.7 competition training times (in seconds), categorized by the input data framework used. Entries with tf.data, DataLoader, and DALI input pipelines use TensorFlow, PyTorch, and MXNet, respectively, for model training.

             Resnet50   SSD    Mask-RCNN   GNMT   Transformer   BERT
tf.data      28.8       27.6   487.8       77.4   15.6          23.4
DataLoader   -          -      627.6       42.6   37.2          48.6
DALI         49.8       49.2   -           -      -             -
For example, a creative approach throughput, however the use of GPUs adds to the cost of input to working around limited I/O bandwidth when training models data processing and consumes GPU cores and memory that could and was implemented using three standard tf.data transforma- otherwise be dedicated to model training. tions [13]. tf.data was also used to automatically generate a data In addition to comparing input pipeline throughput, it is useful to augmentation policy that achieved state-of-the-art results on image compare end-to-end model training time with different input data classification tasks [15]. frameworks across heterogeneous platforms. The MLPerf Training To understand the characteristics of machine learning input data competition provides the fairest comparison across ML systems as pipelines at scale, we studied millions of tf.data jobs in Google’s each submission is optimized by experts familiar with their per- fleet over a one month period in 2020. We show that input pipelines formance knobs. For each benchmark, a cluster ranging from 8 are highly diverse and frequently re-executed. We also identify accelerators to over 1000 accelerators was used to train the model several future research directions motivated by our findings, such to a target accuracy. Table 7 summarizes the top MLPerf v0.7 train- as the opportunity to re-use input pipeline computation across jobs. ing times achieved, categorized by the input pipeline framework used [43]. The end-to-end training times in Table 7 do not pro- 5.1 Fleet-wide Input Pipeline Analysis vide an apples-to-apples performance comparison of input data Methodology: We instrument tf.data to collect metrics such as frameworks, since the competition entries used different software the set of transformations applied in each job’s input pipeline. For frameworks (TensorFlow, PyTorch, MXNet) and hardware (TPUs, each job, we also record the bytes consumed and produced by each GPUs) to run model training computations. However, we can still transformation in its input pipeline. 71% of jobs define their in- draw two important takeaways from the end-to-end training times put pipeline as a single tf.data dataset, while the remaining jobs in Table 7. First, tf.data is the only input processing framework define their input processing logic across two or more tf.data that was used across all MLPerf benchmarks, including image and datasets. When an iterator is created for a tf.data dataset, we

2953 Map 16.9% 1.0 Batch 14.6% End-to-end Prefetch 14.1% 0.8 Map Repeat 12.3% Filter Zip 9.6% Read from local mem 9.2% 0.6 Read from storage 7.4% Interleave 7.3% Shuffle 4.4% 0.4 Filter 1.6% Read from remote mem 1.3%

Unbatch 0.9% Cumulative fraction 0.2 Cache 0.4% Other 0.1% 0.0 10−3 10−2 10−1 100 101 102 103 104 105 0.0% 5.0% 10.0% 15.0% Bytes produced / Bytes read % of total bytes produced by input pipeline operations

Figure 10: CDF showing how the ratio of bytes produced vs. Figure 9: Types of input data pipeline operations and their bytes read varies for end-to-end input pipelines, map, and prevalence, based on the bytes produced by each type of op. filter transformations. For 75% of jobs, preprocessing re- duces the volume of data end-to-end.

fingerprint the dataset by computing a hash of its dataflow graph. We include the list of input file names in the hash calculation and exclude random seed values. We track the number of iterator creations for each unique hash over time. We also measure the total compute time for jobs and the compute time that jobs spend in tf.data. The compute time is measured in normalized compute units and is the time spent on a hardware resource – such as a CPU or an accelerator core – scaled by the compute capability of that resource. Our compute time metric is analogous to AWS's Elastic Compute Units (ECUs) [5]. We collect the metrics described above with full coverage for all tf.data jobs, with one exception. Measuring the fraction of compute time spent in tf.data requires a configuration flag to be set when jobs are launched. Due to configuration differences across jobs, we measured the fraction of compute time spent in tf.data for 66% of jobs, accounting for 75% of total compute time across tf.data jobs. For the remaining jobs, we assume that each job spends 10% of its total compute time in tf.data, as this is the median time that jobs spend in the input pipeline (see Figure 1).

Our analysis focuses on three key questions: 1) how frequently are various transformations used in an input pipeline, 2) how does the "shape" of data change as data flows through an input pipeline, and 3) how much computation is shared across input pipeline executions in our fleet?

Which transformations are most common? Figure 9 plots the relative frequency of tf.data transformations across jobs, based on the number of bytes each transformation is applied on. The map, batch, prefetch, repeat, and zip transformations are the five most commonly applied types of transformations, followed by reading input data from local memory and storage. We also study how many input pipelines rely on various tf.data optimizations. On average, 77% of input pipelines rely on parallel I/O optimizations by using the interleave transformation, 87% of input pipelines rely on pipeline parallelism with the prefetch transformation, and 40% of pipelines rely on parallelizing compute with the map transformation (and its fusion with batch). Only 19% of jobs use the cache transformation to cache input data in memory, though we later show that many more jobs could benefit from caching since many input pipelines are re-executed.

How does preprocessing affect data volume? Although some transformations, such as filtering, decrease the size of input data, machine learning jobs also commonly apply transformations that increase the size of data, such as decompressing and augmenting images to train image understanding models. To understand how the volume of data flowing through ML input pipelines varies with different transformations, we measure each input pipeline's ratio of bytes produced versus bytes read from input sources. We compute this ratio for the end-to-end input pipeline of each job, as well as for each type of transformation applied in the job's input pipeline. When the bytes produced over bytes consumed ratio is less than one, the input pipeline or transformation decreases the data volume, whereas a ratio greater than one implies that the volume of data increases.

Figure 10 plots the CDF of the bytes produced over bytes consumed ratio across jobs for their end-to-end input pipeline, map transformations, and filter transformations. For approximately 75% of jobs, the volume of data produced by the input pipeline and fed to the model training stage is less than the volume of input data read. In other words, for most jobs, the materialized dataset used for training is smaller than the raw input data. For some jobs, decompressing and augmenting data results in high expansion of source data. Figure 10 shows that user-defined map transformations, while preserving dataset cardinality, can decrease or expand data by over an order of magnitude for 13% of jobs. filter transformations, which can modify dataset cardinality, discard more than 10% of input data volume for approximately 23% of jobs. For 8% of jobs, more than half of the bytes fed into filter transformations are discarded. filter is also used to sanitize data, hence in 70% of jobs the transformation reduces the data by less than 1%.

How often are input pipelines re-executed? We observe that input pipelines are commonly re-executed and there is a sizeable opportunity to reuse input pipeline computation both within and across jobs. Some jobs rely on different permutations of the same dataset across iterators to improve convergence. To conservatively estimate the opportunity for computation reuse across input pipeline executions, we have excluded datasets that use the shuffle transformation (57% of tf.data jobs) from this part of our analysis.

An input pipeline iteration begins by creating an iterator for a dataset definition. We record the number of iterator creations at the granularity of one-hour time intervals for each dataset fingerprint (computed by hashing its dataflow graph).
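To make the fingerprinting methodology concrete, the sketch below builds a small pipeline out of the transformations discussed above and derives a fingerprint by hashing a declarative description of the pipeline together with its input file names, while omitting random seeds. This is an illustrative approximation written against the public tf.data API, not the internal fleet instrumentation; the record schema and helper names are hypothetical.

    import hashlib
    import json
    import tensorflow as tf

    def parse_record(raw):
        # Hypothetical record schema, for illustration only.
        features = {"image": tf.io.FixedLenFeature([], tf.string),
                    "label": tf.io.FixedLenFeature([], tf.int64)}
        example = tf.io.parse_single_example(raw, features)
        return tf.io.decode_jpeg(example["image"]), example["label"]

    def build_pipeline(filenames, seed=None):
        """A representative pipeline using the most common transformations."""
        files = tf.data.Dataset.from_tensor_slices(filenames)
        ds = files.interleave(tf.data.TFRecordDataset,           # parallel I/O
                              num_parallel_calls=tf.data.AUTOTUNE)
        ds = ds.shuffle(10_000, seed=seed)                       # seed is not fingerprinted
        ds = ds.map(parse_record,                                # parallel user-defined compute
                    num_parallel_calls=tf.data.AUTOTUNE)
        ds = ds.batch(128)
        ds = ds.repeat()
        return ds.prefetch(tf.data.AUTOTUNE)                     # software pipelining

    def fingerprint(filenames):
        """Hash a description of the pipeline structure and its input files."""
        description = {
            "ops": ["interleave(TFRecordDataset)", "shuffle", "map(parse_record)",
                    "batch(128)", "repeat", "prefetch"],
            "files": sorted(filenames),   # file names are part of the hash
            # random seed values are deliberately excluded
        }
        payload = json.dumps(description, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

Counting iterator creations per fingerprint per hour then amounts to incrementing a counter keyed by fingerprint(filenames) each time the pipeline returned by build_pipeline(...) is iterated.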

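As a rough, user-level illustration of the bytes-produced versus bytes-read ratio described above (the fleet measurement is instrumented inside the tf.data runtime, not computed this way), the following sketch compares the serialized input size with the size of the elements a transformation emits. It assumes the transformed elements are numeric tensors and that the dataset fits in a single eager pass; transform is any user-supplied tf.data transformation, and parse_record and keep_fn in the usage comment are hypothetical.

    import tensorflow as tf

    def bytes_ratio(raw_records, transform):
        """Ratio of bytes produced by `transform` to bytes read from `raw_records`.

        raw_records: a list of serialized records (Python bytes).
        transform:   a callable mapping a tf.data.Dataset to a tf.data.Dataset.
        """
        bytes_read = sum(len(record) for record in raw_records)
        ds = transform(tf.data.Dataset.from_tensor_slices(raw_records))
        bytes_produced = 0
        for element in ds:  # eager iteration over the transformed pipeline
            for tensor in tf.nest.flatten(element):
                bytes_produced += int(tf.size(tensor)) * tensor.dtype.size
        return bytes_produced / bytes_read

    # Example: a decode step typically expands data, while a filter shrinks it.
    # ratio = bytes_ratio(records, lambda ds: ds.map(parse_record).filter(keep_fn))

A ratio above one indicates expansion (e.g., image decoding or augmentation), while a ratio below one indicates that preprocessing reduces the volume of data fed to training.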
[Figures 11 and 12 appear here. Figure 11: fraction of input pipelines (y-axis) executed more than 1, 2, 5, 10, 50, and 100 times per hour, plotted over time from 2020-04-17 to 2020-05-13 (x-axis). Figure 12: cumulative fraction (y-axis) of input pipeline executions and of total input pipeline compute time versus the normalized number of input pipelines, ordered from most to least frequently executed (x-axis, log scale).]

Figure 11: Fraction of input pipelines executed more than x times per hour, over time. Approx. 75% of input pipelines are executed more than once in the same hour.

Figure 12: CDF of input pipeline executions over a one month period. 10% of pipelines account for 77% of total input pipeline executions and 72% of compute resources.

Figure 11 plots the fraction of input pipelines that are executed more than x times in the same hour, over time. Approximately 75% of input pipelines are executed more than once within the same hour and 5% of input pipelines are executed more than 100 times within an hour. Re-execution of input pipelines can occur across epochs of a training job and also across jobs. For example, neural architecture search [69] and hyper-parameter tuning both require training multiple models using the same input pipeline.

Having found that many input pipelines are re-executed, we next quantify the opportunity for reusing input pipeline computation by caching materialized datasets. Figure 12 plots the cumulative distribution of input pipeline executions over the one month time span of our study, with input pipelines ordered from most to least frequently executed. We also show the CDF of the compute resources spent executing these pipelines. Figure 12 shows that by caching the top 10% of materialized datasets, we can capture 72% of the CPU resources used for computing tf.data datasets across all jobs that executed in the one month period. The steepness of the CDF curves indicates that some datasets are particularly frequently executed and consume significant resources. Only 10% of input pipelines are re-executed across multiple jobs; 1% of input pipelines are executed by more than 25 different jobs, and the largest cross-job sharing we observed was approximately 50,000 jobs executing the same input pipeline. However, our analysis conservatively estimates the opportunity for reuse since it only counts re-executions of pipelines with identical end-to-end transformation graphs. We anticipate further opportunities to reuse computation across jobs by considering input pipeline sub-graphs.

5.2 Implications for Future Research

We discuss the implications of our fleet-wide analysis and future research directions based on our experience with tf.data.

Datasets as a service. We showed that input pipelines are frequently re-executed, yet only 19% of jobs in our analysis used the cache transformation. It is often challenging for users to decide if and where to apply caching, as there are several factors to consider: the cost-benefit of caching the data – spending RAM to save CPU and possibly improve throughput – and the impact of caching on training quality – in general, results of randomized transformations (such as shuffle) should not be cached. Navigating the compute-storage trade-off and estimating the benefit of caching on end-to-end performance and downstream accelerator utilization for ML training jobs is a complex task for users [23]. Hence, automating cache insertion in input pipelines is important [65]. Furthermore, since input pipelines can be shared across jobs, designing a dataset caching service to re-use input pipeline computations across jobs is a promising future direction. Quiver [36] and CoorDL [44] already optimize source dataset caching for ML training. Several systems have shown that caching data across jobs greatly improves performance for big data analytics [7, 18, 23, 38, 49].

Processing data closer to storage. Figure 10 showed that data preprocessing reduces the volume of data for 75% of jobs. For 14% of jobs, the volume of data fed into the model for training is less than 10% of the bytes read from storage. As input data for ML jobs commonly resides in remote storage, such as a distributed file system or cloud object store, this means that more data than necessary is sent over the network during ML training. Designing ML data processing systems that apply projections closer to storage is a promising way to reduce data transfers. Using columnar data formats is another well-known approach to enable reading only the relevant fields in a record [9]. We are exploring this approach to improve data ingestion efficiency for ML jobs.

Addressing host bottlenecks. Some input pipelines require significant CPU and memory resources to produce their data. When a host machine isn't powerful enough to generate input data at the rate the attached accelerator(s) consume the data, the accelerator(s) idle and slow down model training. To solve this problem, we are currently exploring the disaggregation of data processing from model training, by enabling users to feed accelerators from input workers distributed across multiple hosts [57]. The number of input workers can scale up or down as needed to keep up with the accelerators, independent of the number of accelerators attached to one host. Another approach to address host resource bottlenecks for input data processing is to offload data preprocessing computations to accelerators [22].
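The disaggregation approach described above is exposed in the public API as the tf.data service [57]. Below is a minimal sketch, assuming a dispatcher is already running at dispatcher_address (e.g. "grpc://host:port") and that parse_and_augment is a hypothetical CPU-heavy preprocessing function.

    import tensorflow as tf

    def make_distributed_pipeline(files, dispatcher_address):
        ds = tf.data.TFRecordDataset(files)
        ds = ds.map(parse_and_augment, num_parallel_calls=tf.data.AUTOTUNE)
        ds = ds.batch(128)
        # Everything above runs on a pool of remote input workers, which can be
        # scaled independently of the accelerators attached to this host.
        ds = ds.apply(tf.data.experimental.service.distribute(
            processing_mode="parallel_epochs",
            service=dispatcher_address))
        return ds.prefetch(tf.data.AUTOTUNE)

The trainer host then only prefetches ready batches over the network, so an under-provisioned host CPU no longer caps accelerator utilization.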

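To make the cache-placement trade-off from the datasets-as-a-service discussion above concrete, here is a minimal sketch against the public tf.data API, assuming a deterministic parsing step and a randomized per-example augmentation (parse_record and augment are hypothetical helpers). Caching the deterministic prefix avoids recomputing it every epoch, while shuffling and augmentation stay after the cache so their randomized results are never reused.

    import tensorflow as tf

    def make_training_pipeline(files):
        ds = tf.data.TFRecordDataset(files)
        ds = ds.map(parse_record, num_parallel_calls=tf.data.AUTOTUNE)   # deterministic
        # Materialize the deterministic, already-reduced prefix once.
        # Pass a filename, e.g. ds.cache("/tmp/parsed"), to spill to local storage.
        ds = ds.cache()
        ds = ds.shuffle(10_000)                                          # reshuffled every epoch
        ds = ds.map(augment, num_parallel_calls=tf.data.AUTOTUNE)        # randomized, not cached
        ds = ds.batch(128)
        return ds.prefetch(tf.data.AUTOTUNE)

Whether the cache pays off still depends on whether the materialized prefix fits in memory (or local storage) and on how expensive the cached stages are, which is exactly the estimation burden that motivates automating cache insertion.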
Data processing for online inference. We designed tf.data to support ML training. However, efficient input processing is also critical for ML inference. Inference involves less model computation per input element since only a forward pass of the model is required, whereas training also requires backpropagation. Hence, although not all input transformations applied during training – such as data augmentations – are applied when serving a model, the input pipeline for inference still presents a significant fraction of total work. Inference jobs need a different input pipeline design compared to training jobs, as the primary performance objective is to optimize the latency of individual requests rather than overall throughput. This implies less buffering, no shuffling, and a different approach to batching to balance request latency with accelerator efficiency. Kang et al. propose jointly optimizing inference and data preprocessing for DNN-based visual analytics [33].

6 RELATED WORK

Kakaraparthy et al. argue for building a single, unified system for data loading that could be shared between multiple machine learning jobs, and potentially between different frameworks as well [31]. Their observation that much I/O and preprocessing work can be shared between jobs agrees with our findings in § 5.2. By contrast, our work on tf.data has focused on a more general programming model, enabling users to build different preprocessing schemes.

Our inspiration for tf.data's programming model drew from the successful application of LINQ [42] to parallel processing with PLINQ [55], big-data cluster processing with DryadLINQ [66], and stream processing with Naiad [46]. Many tf.data transformations have direct equivalents in LINQ, though we added order-sensitive transformations (e.g., batch, shuffle, and interleave) to support ML training algorithms. Optimus [35], which added dynamic graph rewriting support to DryadLINQ, is similar to the automatic optimization framework that we described in § 3.3. Optimus focused on reducing network I/O in distributed big-data queries, whereas the bottleneck in tf.data tends to be the host CPU, and our optimizations aim to reduce the wall-clock execution time of code within a single machine. Dandelion extended LINQ with the ability to run on GPU and FPGA accelerators [52], using the PTask abstraction to manage the accelerators [51]. Dandelion and PTask provide a simple programming model that hides data movement between the host and accelerator devices, similar to how tf.data uses prefetch to mask copies. Dandelion goes further than tf.data in using functional transformations to represent all computation – not just the input pipeline – while tf.data interoperates with existing ML frameworks such as TensorFlow [3], PyTorch [48], and JAX [11] by using their existing programming models for the training loop.

The design, implementation, and optimization of tf.data all bear similarities to how SQL is used in a relational database management system (RDBMS). A related strand of work has investigated how to push machine learning computations into SQL, and optimize across the boundary between relational data and linear algebra. The MADlib analytics library pushes various learning algorithms into an existing RDBMS [27]. MADlib uses existing SQL constructs for orchestration – i.e., defining both the input pipeline and the "driver program" (or training loop) – and provides a C++ abstraction layer for plugging in user-defined functions that call high-performance numerical libraries. By building tf.data into TensorFlow and using its Tensor type to represent values, we achieved efficient interoperability for free. More recently, Karanasos et al. introduced Raven, which integrates the ONNX Runtime for machine learning into Microsoft SQL Server [34]. Raven focuses on ML inference for SQL-based analytic pipelines, achieving better performance by pushing linear algebra operators into earlier stages of the query plan and using ONNX Runtime to offload computation to accelerators. The model-related optimizations in tf.data are more conservative than Raven's, because the model is mutable at training time, but the ideas in Raven would be useful for applications like knowledge distillation [28] that perform inference at training time.

Several related projects have investigated the problem of automatically tuning dataflow workloads. SEDA dynamically allocates threads to stages, using a simple scheme that adds threads to a stage when its queue length exceeds a threshold, and removes them when they idle for a period [61]. By contrast, tf.data tunes the performance of each stage based on the predicted end-to-end performance. The DS2 scaling controller for dataflow-based stream processing attempts to find the minimum parallelism for each stage in a dataflow graph that will enable it to consume data at the rates of all the sources [32]. Like DS2, tf.data uses lightweight instrumentation of "useful" processing time in each transformation to make scaling decisions, but we additionally model memory consumption as a possible bottleneck resource to avoid excessive buffering.

7 CONCLUSION

We presented tf.data, a framework for building and executing efficient input data processing pipelines for machine learning jobs at scale. tf.data's programming model enables users to build diverse input pipelines by composing and customizing operators. tf.data executes input pipelines as dataflow graphs and applies static optimizations that improve end-to-end training time for state-of-the-art models. For example, input pipeline parallelism and software pipelining improve ResNet50 training time by over 10× and other tf.data optimizations, such as operator fusion, provide an additional 2× improvement. We developed an analytical approach to automatically tune internal buffer sizes and the degree of parallelism in input pipelines. These dynamic optimizations achieve comparable performance to expert-tuned input pipelines while relieving users from the burden of manually tuning parameters.

The widespread deployment of tf.data across millions of ML training jobs at Google enabled us to quantify several aspects of ML data processing, namely its resource footprint, diversity, and the extent of redundant computation. On average, ML training jobs spend 30% of their total compute time in the input pipeline. Our findings motivate future work on sharing input pipeline computation across jobs and pushing data projection to the storage layer.

ACKNOWLEDGEMENTS

We thank Paul Barham, Chandu Thekkath, Vijay Vasudevan, Martín Abadi, Sudip Roy, Dehao Chen, and our anonymous reviewers for their helpful feedback on this work. We gratefully acknowledge Andrew Audibert, Brennan Saeta, Fei Hu, Piotr Padlewski, Rachel Lim, Rohan Jain, Saurabh Saxena, and Shivani Agrawal for their engineering contributions to tf.data.

REFERENCES
[1] 2020. Apache Beam: An advanced unified programming model. https://beam.apache.org/.
[2] 2020. Apache Flume. https://flume.apache.org/.
[3] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of OSDI. 265–283.
[4] Ashish Agarwal. 2019. Static Automatic Batching In TensorFlow. In Proceedings of ICML. 92–101.
[5] Amazon. 2020. Amazon EC2 FAQs. https://aws.amazon.com/ec2/faqs.
[6] Amazon. 2020. Amazon EC2 Pricing. https://aws.amazon.com/ec2/pricing/.
[7] Ganesh Ananthanarayanan, Ali Ghodsi, Andrew Wang, Dhruba Borthakur, Srikanth Kandula, Scott Shenker, and Ion Stoica. 2012. PACMan: Coordinated Memory Caching for Parallel Jobs. In Proceedings of NSDI. 20.
[8] Apache Software Foundation. 2012. Avro. https://avro.apache.org/docs/1.2.0.
[9] Apache Software Foundation. 2018. Parquet. https://parquet.apache.org/.
[10] Leon Bottou. 2009. Curiously Fast Convergence of some Stochastic Gradient Descent Algorithms. In Proceedings of the Symposium on Learning and Data Science.
[11] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. 2018. JAX: composable transformations of Python+NumPy programs. http://github.com/google/jax
[12] Eric Breck, Neoklis Polyzotis, Sudip Roy, Steven Whang, and Martin Zinkevich. 2019. Data Validation for Machine Learning. In Proceedings of Machine Learning Systems (MLSys) 2019.
[13] Dami Choi, Alexandre Passos, Christopher J. Shallue, and George E. Dahl. 2019. Faster Neural Network Training with Data Echoing. arXiv:1907.05550 [cs.LG]
[14] Torch Contributors. 2019. PyTorch Docs: torch.utils.data. https://pytorch.org/docs/stable/data.html.
[15] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. 2019. RandAugment: Practical automated data augmentation with a reduced search space. arXiv:1909.13719 [cs.CV]
[16] Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of OSDI. 137–150.
[17] Jia Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of CVPR.
[18] Francis Deslauriers, Peter McCormick, George Amvrosiadis, Ashvin Goel, and Angela Demke Brown. 2016. Quartet: Harmonizing Task Scheduling and Caching for Cluster Computing. In Proceedings of HotStorage.
[19] Google. [n.d.]. Protocol Buffers. https://developers.google.com/protocol-buffers.
[20] Google. 2020. Google Cloud: All Pricing. https://cloud.google.com/compute/all-pricing.
[21] Goetz Graefe. 1994. Volcano: An Extensible and Parallel Query Evaluation System. IEEE Trans. on Knowledge and Data Engineering 6, 1 (Feb 1994), 120–135.
[22] Joaquin Anton Guirao, Krzysztof Łęcki, Janusz Lisiecki, Serge Panev, Michał Szołucha, Albert Wolant, and Michał Zientkiewicz. 2019. Fast AI Data Preprocessing with NVIDIA DALI. https://devblogs.nvidia.com/fast-ai-data-preprocessing-with-nvidia-dali.
[23] Pradeep Kumar Gunda, Lenin Ravindranath, Chandu Thekkath, Yuan Yu, and Li Zhuang. 2010. Nectar: Automatic Management of Data and Computation in Datacenters. In Proceedings of OSDI.
[24] Donald J. Haderle and Robert D. Jackson. 1984. IBM Database 2 overview. IBM Systems Journal 23, 2 (1984), 112–125.
[25] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. 2017. Mask R-CNN. CoRR (2017). http://arxiv.org/abs/1703.06870
[26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of CVPR. IEEE Computer Society, 770–778.
[27] Joseph M. Hellerstein, Christoper Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, and Arun Kumar. 2012. The MADlib Analytics Library: Or MAD Skills, the SQL. Proc. VLDB Endow. 5, 12 (Aug. 2012), 1700–1711.
[28] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the Knowledge in a Neural Network. arXiv:1503.02531 [stat.ML]
[29] Java. 2020. Stream API. https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html.
[30] Norman P. Jouppi, Doe Hyun Yoon, George Kurian, Sheng Li, Nishant Patil, James Laudon, Cliff Young, and David Patterson. 2020. A Domain-Specific Supercomputer for Training Deep Neural Networks. Commun. ACM 63, 7 (June 2020), 67–78.
[31] Aarati Kakaraparthy, Abhay Venkatesh, Amar Phanishayee, and Shivaram Venkataraman. 2019. The Case for Unifying Data Loading in Machine Learning Clusters. In Proceedings of HotCloud. Renton, WA.
[32] Vasiliki Kalavri, John Liagouris, Moritz Hoffmann, Desislava Dimitrova, Matthew Forshaw, and Timothy Roscoe. 2018. Three Steps is All You Need: Fast, Accurate, Automatic Scaling Decisions for Distributed Streaming Dataflows. In Proceedings of OSDI. 783–798.
[33] Daniel Kang, Ankit Mathur, Teja Veeramacheneni, Peter Bailis, and Matei Zaharia. 2020. Jointly Optimizing Preprocessing and Inference for DNN-based Visual Analytics. arXiv:2007.13005 [cs.DB]
[34] Konstantinos Karanasos, Matteo Interlandi, Doris Xin, Fotis Psallidas, Rathijit Sen, Kwanghyun Park, Ivan Popivanov, Supun Nakandal, Subru Krishnan, Markus Weimer, Yuan Yu, Raghu Ramakrishnan, and Carlo Curino. 2020. Extending Relational Query Processing with ML Inference. In Proceedings of CIDR.
[35] Qifa Ke, Michael Isard, and Yuan Yu. 2013. Optimus: a dynamic rewriting framework for data-parallel execution plans. In Proceedings of EuroSys, Zdenek Hanzálek, Hermann Härtig, Miguel Castro, and M. Frans Kaashoek (Eds.). 15–28.
[36] Abhishek Vijaya Kumar and Muthian Sivathanu. 2020. Quiver: An Informed Storage Cache for Deep Learning. In Proceedings of FAST. 283–296.
[37] Chris Lattner, Jacques A. Pienaar, Mehdi Amini, Uday Bondhugula, River Riddle, Albert Cohen, Tatiana Shpeisman, Andy Davis, Nicolas Vasilache, and Oleksandr Zinenko. 2020. MLIR: A Compiler Infrastructure for the End of Moore's Law. CoRR (2020). https://arxiv.org/abs/2002.11054
[38] Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. 2014. Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks. In Proceedings of SoCC. 1–15.
[39] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In Proceedings of ECCV.
[40] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. 2016. SSD: Single shot multibox detector. In Proceedings of ECCV. Springer, 21–37.
[41] Peter Mattson, Christine Cheng, Cody Coleman, Greg Diamos, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf, et al. 2019. MLPerf training benchmark. arXiv preprint arXiv:1910.01500 (2019).
[42] Erik Meijer, Brian Beckman, and Gavin Bierman. 2006. LINQ: Reconciling Object, Relations and XML in the .NET Framework. In Proceedings of SIGMOD. 706.
[43] MLPerf Training v0.7 Results. 2020. https://mlperf.org/training-results-0-7/.
[44] Jayashree Mohan, Amar Phanishayee, Ashish Raniwala, and Vijay Chidambaram. 2021. Analyzing and Mitigating Data Stalls in DNN Training. Proc. VLDB Endow. 14, 5 (Jan. 2021), 771–784.
[45] Dan Moldovan, James Decker, Fei Wang, Andrew Johnson, Brian Lee, Zack Nado, D Sculley, Tiark Rompf, and Alexander B Wiltschko. 2019. AutoGraph: Imperative-style Coding with Graph-based Performance. In SysML.
[46] Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A Timely Dataflow System. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP). ACM.
[47] MXNET. 2018. Designing Efficient Data Loaders for Deep Learning. https://mxnet.apache.org/api/architecture/note_data_loading.
[48] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 8024–8035.
[49] K. V. Rashmi, Mosharaf Chowdhury, Jack Kosaian, Ion Stoica, and Kannan Ramchandran. 2016. EC-Cache: Load-Balanced, Low-Latency Cluster Caching with Online Erasure Coding. In Proceedings of OSDI. 401–417.
[50] Herbert Robbins and Sutton Monro. 1951. A Stochastic Approximation Method. Ann. Math. Statist. 22, 3 (09 1951), 400–407.
[51] Christopher J. Rossbach, Jon Currey, Mark Silberstein, Baishakhi Ray, and Emmett Witchel. 2011. PTask: Operating System Abstractions to Manage GPUs as Compute Devices. In Proceedings of SOSP. 233–248.
[52] Christopher J. Rossbach, Yuan Yu, Jon Currey, Jean-Philippe Martin, and Dennis Fetterly. 2013. Dandelion: A Compiler and Runtime for Heterogeneous Systems. In Proceedings of SOSP. 49–68.
[53] John E. Shore. 1980. The lazy repairman and other models: Performance collapse due to overhead in simple, single-server queuing systems. ACM SIGMETRICS Performance Evaluation Review 9, 2 (1980), 217–224.
[54] Patrice Y. Simard, Dave Steinkraus, and John C. Platt. 2003. Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis. In Proceedings of ICDAR. IEEE Computer Society, 958.
[55] Roy Patrick Tan, Pooja Nagpal, and Shaun Miller. 2009. Automated Black Box Testing Tool for a Parallel Programming Library. In Proceedings of ICST. IEEE Computer Society, 307–316.
[56] TensorFlow. 2019. TensorFlow Graph Optimizations. https://research.google/pubs/pub48051.pdf.

[57] TensorFlow. 2021. tf.data service. https://www.tensorflow.org/api_docs/python/tf/data/experimental/service.
[58] TensorFlow. 2021. TFRecord and tf.Example. https://www.tensorflow.org/tutorials/load_data/tfrecord.
[59] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30. 5998–6008.
[60] Philip Wadler. 1988. Deforestation: Transforming Programs to Eliminate Trees. In Proceedings of the Second European Symposium on Programming. North-Holland Publishing Co., NLD, 231–248.
[61] Matt Welsh, David Culler, and Eric Brewer. 2001. SEDA: an architecture for well-conditioned, scalable internet services. ACM SIGOPS Operating Systems Review 35, 5 (2001), 230–243.
[62] WMT. 2016. 1st Conference on Machine Translation. http://statmt.org/wmt16.
[63] WMT. 2017. 2nd Conference on Machine Translation. http://statmt.org/wmt17.
[64] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016).
[65] Doris Xin, Litian Ma, Jialin Liu, Stephen Macke, Shuchen Song, and Aditya Parameswaran. 2018. Helix: Accelerating Human-in-the-Loop Machine Learning. Proc. VLDB Endow. 11, 12 (Aug. 2018), 1958–1961.
[66] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. 2008. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. In Proceedings of OSDI. 1–14.
[67] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster Computing with Working Sets. In Proceedings of HotCloud.
[68] Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. 2013. Discretized Streams: Fault-Tolerant Streaming Computation at Scale. In Proceedings of SOSP. 423–438.
[69] Barret Zoph and Quoc V. Le. 2017. Neural Architecture Search with Reinforcement Learning. In Proceedings of ICLR.
