<<

Cerebro: A Data System for Optimized Deep Learning Model Selection

Supun Nakandala, Yuhao Zhang, and Arun Kumar University of California, San Diego {snakanda, yuz870, arunkk}@eng.ucsd.edu

ABSTRACT process is called model selection, and it is unavoidable be- Deep neural networks (deep nets) are revolutionizing many cause it is how one controls underfitting vs. overfitting [64]. machine learning (ML) applications. But there is a major Model selection is a major bottleneck for the adoption of bottleneck to wider adoption: the pain and resource inten- deep learning among enterprises and domain scientists due siveness of model selection. This empirical process involves to both the time spent and resource costs. Not all ML users exploring deep net architectures and hyper-parameters, of- can afford to throw hundreds of GPUs at their task and burn ten requiring hundreds of trials. Alas, most ML systems resources like the Googles and Facebooks of the world. focus on training one model at a time, reducing through- Case Study. We present a real-world model selection sce- put and raising overall resource costs; some also sacrifice re- nario. Our public health collaborators at UC San Diego producibility. We present Cerebro, a new data system to wanted to try deep nets for identifying different activities raise deep net model selection throughput at scale without (e.g., sitting, standing, stepping, etc.) of subjects from body- raising resource costs and without sacrificing reproducibility worn accelerometer data. The data was collected from a co- or accuracy. Cerebro uses a new parallel SGD execution hort of about 600 people and is labeled. Its size is 864 GB. strategy we call model hopper parallelism that hybridizes During model selection, we tried different deep net archi- task- and data-parallelism to mitigate the cons of these prior tectures such as convolution neural networks (CNNs), long paradigms and offer the best of both worlds. Experiments short-term memory models (LSTMs), and composite models on large ML benchmark datasets show that Cerebro offers such as CNN-LSTMs, which now offer state-of-the-art re- 3x to 10x runtime savings relative to data-parallel systems sults for multivariate time-series classification [35, 56]. Our like Horovod and Parameter Server and up to 8x memo- collaborators also wanted to try different prediction win- ry/storage savings or up to 100x network savings relative dow sizes (e.g., predictions generated every 5 seconds vs. 15 to task-parallel systems. Cerebro also supports heteroge- seconds) and alternative target semantics (e.g., sitting– neous resources and fault tolerance. standing–stepping or sitting vs. not sitting). The training process also involves tuning various hyper-parameters such PVLDB Reference Format: Supun Nakandala, Yuhao Zhang, and Arun Kumar. Cerebro: as learning rate and regularization coefficient. A Data System for Optimized Deep Learning Model Selection. In the above scenario it is clear that the model selection PVLDB, 13(11): 2159-2173, 2020. DOI: https://doi.org/10.14778/3407790.3407816 process generates dozens, if not hundreds, of different mod- els that need to be evaluated in order to pick the best one for the prediction task. Due to the scale of the data and the 1. INTRODUCTION complexity of the task, it is too tedious and time-consuming Deep learning is revolutionizing many ML applications. to manually steer this process by trying models one by one. Their success at large Web companies has created excite- Parallel execution on a cluster is critical for reasonable run- ment among practitioners in other settings, including do- times. 
Moreover, since our collaborators often changed the main sciences, enterprises, and small Web companies, to try time windows and output semantics for health-related anal- deep nets for their applications. But training deep nets is a yses, we had to rerun the whole model selection process painful empirical process, since accuracy is tied to the neu- over and over several times to get the best accuracy for ral architecture and hyper-parameter settings. A common their evolving task definitions. Finally, reproducible model practice to choose these settings is to empirically compare as training is also a key requirement in such scientific settings. many training configurations as possible for the user. This All this underscores the importance of automatically scaling deep net model selection on a cluster with high throughput.

This work is licensed under the Creative Commons Attribution- System Desiderata. We have the following key desiderata NonCommercial-NoDerivatives 4.0 International License. To view a copy for a deep net model selection system. of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing 1) Scalability. Deep learning often has large training [email protected]. Copyright is held by the owner/author(s). Publication rights datasets, larger than single-node memory and sometimes licensed to the VLDB Endowment. even disk. Deep net model selection is also highly compute- Proceedings of the VLDB Endowment, Vol. 13, No. 11 intensive. Thus, we desire out-of-the-box scalability to a ISSN 2150-8097. DOI: https://doi.org/10.14778/3407790.3407816

2159 (A) (B) Model Search/AutoML Procedures Task-Parallel Systems Data-Parallel Systems Task Parallelism Data Parallelism (C) Grid/Random PBT HyperBand … ASHA + high throughput + high data scalability Search - low data scalability - low throughput Dask, Celery, MOP/CEREBRO Async. Param. - memory/storage wastage - high communication cost Cerebro/MOP Vizier, Spark- Async. (This Work) Server HyperOpt

Model Hopper Parallelism (Cerebro) No Partitioning Spark or Sync. Param. + high throughput (Full replication) TF Model Server, Sync. Deep Learning Systems + high data scalability Averaging Horovod + low communication cost Distributed Data + no memory/storage wastage Bulk Fine-grained Partition 1 Partition 2 … Partition p (Partitions) (Mini-batches)

Figure 1: (A) Cerebro combines the advantages of both task- and data-parallelism. (B) System design phi- losophy and approach of Cerebro/MOP (introduced in [52]): “narrow waist” architecture in which multiple model selection procedures and multiple deep learning tools are supported–unmodified–for specifying/exe- cuting deep net computations. MOP is our novel resource-efficient distributed SGD execution approach. (C) Model Hopper Parallelism (MOP) as a approach of task- and data-parallelism. It is the first known form of bulk asynchronous parallelism, filling a major gap in the parallel data systems literature. cluster with large partitioned datasets (data scalability) and tion on partitioned datasets is important because parallel distributed execution (compute scalability). file systems (e.g., HDFS for Spark), parallel RDBMSs, and 2) High Throughput. Regardless of manual grid/ran- “data lakes” typically store large datasets in that manner. dom searches or AutoML searches, a key bottleneck for 2) Resource Inefficiencies. Due to the false dichotomy, model selection is throughput: how many training config- naively combining the above mentioned approaches could urations are evaluated per unit time. Higher throughput cause overheads and resource wastage (Section 2 explains enables ML users to iterate through more configurations in more). For instance, using task-parallelism on HDFS re- bulk, potentially reaching a better accuracy sooner. quires extra data movement and potential caching, substan- 3) Overall Resource Efficiency. Deep net train- tially wasting network and memory/storage resources. An ing uses variants of mini-batch stochastic gradient descent alternative is remote data storage (e.g., S3) and reading re- (SGD) [7, 10, 11]. To improve efficiency, the model selec- peatedly at every iteration of SGD. But this leads to or- tion system has to avoid wasting resources and maximize ders of magnitude higher network costs by flooding the net- resource utilization for executing SGD on a cluster. We work with lots of redundant data reads. On the other hand, have 4 key components of resource efficiency: (1) per-epoch data-parallel systems that train one model at a time (e.g., efficiency: time to complete an epoch of training; (2) con- Horovod [63] and Parameter Servers [44]) incur high com- vergence efficiency: time to reach a given accuracy met- munication costs, leading to high runtimes. ric; (3) memory/storage efficiency: amount of memory/stor- Overall, we see a major gap between task- and data- age used by the system; and (4) communication efficiency: parallel systems today, which leads to substantially lower amount of network bandwidth used by the system. In cloud overall resource efficiency, i.e., when compute, memory/s- settings, compute, memory/storage, and network all mat- torage, and network are considered holistically. ter for overall costs because resources are pay-as-you-go; on shared clusters, which are common in academia, wastefully Our Proposed System. We present Cerebro, a new sys- hogging any resource is unethical. tem for deep learning model selection that mitigates the 4) Reproducibility. Ad hoc model selection with dis- above issues with both task- and data-parallel execution. tributed training is a key reason for the “reproducibility cri- As Figure 1(A) shows, Cerebro combines the advantages sis” in deep learning [68]. 
While some Web giants may not of both task- and data-parallelism, while avoiding the limi- care about unreproducibility for some use cases, this is a tations of each. It raises model selection throughput without showstopper issue for many enterprises due to auditing, reg- raising resource costs. Our target setting is small clusters ulations, and/or other legal reasons. Most domain scientists (say, tens of nodes), which covers a vast majority (over 90%) also inherently value reproducibility. of parallel ML workloads in practice [60]. We focus on the common setting of partitioned data on such clusters. Fig- Limitations of Existing Landscape. We compared exist- ure 1(B) shows the system design philosophy of Cerebro: ing approaches to see how well they cover the above desider- a narrow-waist architecture inspired by [40] to support mul- ata. Unfortunately, each approach falls short on some major tiple AutoML procedures and deep net frameworks. desiderata, as we summarize next. Figure 3 and Section 2.2 Summary of Our Techniques. At the heart of present our analysis in depth. Cere- bro is a simple but novel hybrid of task- and data- 1) False Dichotomy of Task- and Data-Parallelism. parallelism we call model hopper parallelism (MOP) that Prior work on model selection systems, primarily from the fulfills all of our desiderata. MOP is based on our insight ML world, almost exclusively focus on the task-parallel set- about a formal optimization theoretic property of SGD: ro- ting [32, 42, 43]. This ignores a pervasive approach to scale bustness to the random ordering of the data. Figure 1(C) po- to large data on clusters: data partitioning (sharding). A sitions MOP against prior approaches: it is the first known disjoint line of work on data-parallel ML systems do con- form of “Bulk Asynchronous” parallelism, a hybridization of sider partitioned data but focus on training one model at a the Bulk Synchronous parallelism common in the database time, not model selection workloads [44, 63]. Model selec- world and task-parallelism common in the ML world. As

2160 inwrlaswt o ere fparallelism. of degrees low also We with workloads days. tion of model for 5.6). hybrid run net and a often deep weigh 5.5 can for which (Sections crucial workloads, box are selection toler- the our features fault of systems-level extend out partitions, Such then elasticity of We and replication ance, 5). support per- (Section to good aspects scheduler surprisingly that all has find on algorithm and formance randomized alter- simulations simple compare a with a We as algorithms (MILP). problem candidate program scheduling nate linear our integer formalize We mixed clusters. small scheduler choice resource-optimal the network–MOP be can and poor memory/storage, has holistically–compute, averaging”) Overall, for “model [20]. approach convergence behavior called BSP convergence ML the (also that better SGD shown much distributed has work offers Prior but memory/storage behavior. BSP and network of the has efficiency MOP PS) shows, 2 varying than Figure has lower size. or reads dataset on (higher remote based **Task- costs full cost. communication leading communication with communication of (PS), lower efficient Parallelism relatively Server a more Parameter to comparison than a parameter. per mechanism controllable axes line uses a cost Dashed key has *Horovod Conceptual approach communication two wastage. that means memory/storage on and efficiency: art epoch resource prior of 2: with MOP/Cerebro Figure vrl,ti ae ae h olwn contributions: following the makes paper this Overall, basis, its as MOP With • • • Communication Cost per Epoch n admzd sn iuain.W n hta setting. that our find in We well works approximate, scheduler simulations. randomized solver, using (MILP randomized) and alternatives 3 compare of problem as scheduling PyTorch. the well and formalize We TensorFlow as with procedures. types, it AutoML integrate and data We tools and learning nets deep multiple deep arbitrary port MOP. using system selection model build We resource-intensiveness. their pressing of the issue with combined due popularity nets growing deep their on to primarily focus We formal SGD. with a to trained applicable exploiting is MOP by SGD. earlier of property listed approach desiderata execution the all SGD parallel call new we a present We Once per Higher Once per partition mini-batch oecetyeeuede e oe eeto on selection model net deep execute efficiently to Server Parameter Horovod* No replica. BSP oe oprparallelism hopper model Cerebro MOP/Cerebro Cerebro (Controllable: Replication rate) Task-Parallel w/ full remote reads** remote full w/ Task-Parallel Memory/storageWastage eea n xesbede net deep extensible and general a , Cerebro ihHrvdfrmdlselec- model for Horovod with (Controllable: Caching rate) Caching (Controllable: nortre setting. target our in osdrn l resources all considering eie an devises MP htsatisfies that (MOP) Cerebro Task-Parallel w/ w/ replication full Task-Parallel any Cerebro Full replica. Full Higher Lmodels ML optimizing a sup- can and 2161 oua a orietruhu is throughput raise to way popular A training of latency Training Net Deep Distributed for Systems SGD. on details 2.2 technical interested more the for refer We 10] [7, accuracy. to of given reader number a for the needed raising epochs typically to lead behavior, inherently may convergence is execution SGD poor so, sequential update. updates. 
mini-batches; from model model deviating of many sequential; a millions as makes make to epoch 1000s and an have gradient datasets the Large estimate to is pass that Each a replacement–called without data. examples the process iterative over an an passes called is multiple SGD (e.g., variants performs [17]). its that or RMSprop SGD and mini-batch [36] by solved Adam is It SGD [25]. lem Mini-batch with Training Net Deep tradeoffs. approaches their 2.1 and existing training compare net then deep We parallel for nets. deep training TRADEOFFS AND BACKGROUND 2. n oe tatm npriinddt u ttefiner the at but data partitioned on time a at [65]. model training net one deep for Fine-grained. used Centralized never almost the Alas, mod- is non-convex average it highly epoch. so, for every master, els; poorly this converges the repeat approach on and this gradients), worker’s models each (or all weights on independently collect broad- models train partition, They model, efficiency. a work- memory/storage across cast dataset high the partition yielding They parallelize ers, time. [1] a averaging at model model one with TensorFlow and 2.. Spark Table (BSP). in Parallel Synchronous calculation Bulk cost realistic redundant a wastefully extra see magnitude reads e.g., of data, repeated orders with such network the data But read flood and epoch. S3) every (e.g., storage remotely remote use can one natively, datasets. also partitioned is each workers large sources to all to to copied scale datasets be Copying not must does dataset which whole dur- worker, the workers but across SGD, re- training, communication also sequential ing no is is logically This There efficiency. use convergence producible. different best can the run worker yields can which Each [48] task-parallel a Ray manner. in workers and different on [23], configurations training Vizier Celery, categories: Dask, Fig- Parallel. 4 as into Task desiderata, approaches Embarrassingly some these group on We short shows. fall 3 them ure of been All have approaches studied. execution parallel multi-node various otde erigtos(.. esrlw ou nthe on focus TensorFlow) (e.g., tools learning deep Most prob- optimization non-convex a is training net Deep for used method the SGD, mini-batch explain briefly We • • xmmr/trg an vrprl akprle sys- task-parallel tems. to purely up over and gains memory/storage systems 8x data-parallel purely over gains time datasets: ML ImageNet benchmark large two selec- with model workloads real tion on elasticity. experiments extensive and perform tolerance We fault support also and extend We ohmmr n trg,wihrie ot.Alter- costs. raises which storage, and memory both , epoch Cerebro na pc,i admysmlsabthof batch a samples randomly it epoch, an In . and Cerebro n oe tatime a at model one Criteo loehbt iersedpbehavior. speedup linear exhibits also oepotprildt replication data partial exploit to . Cerebro hs ytm loparallelize also systems These ol uha Python as such Tools ihywseu fre- of wasteful highly ffr xt 0 run- 10x to 3x offers o nthroughput. on not , S ytm uhas such systems BSP mini-batch parallelism aduses –and Thus, . Data Parallelism Embarrassing Model Hopper Desiderata Task Parallelism Bulk Synchronous Centralized Fine-grained Decentralized Fine-grained Parallelism (e.g., Dask, Celery, Vizier) (e.g., Spark, Greenplum) (e.g., Async Parameter Server) (e.g., Horovod) (Our Work)

Data Scalability No (Full Replication) Yes Yes Yes Yes Wasteful (Remote Reads)

Per-Epoch High High Lowest Low High Efficiency

SGD Convergence Highest Lowest Medium High Highest Efficiency

Memory/Storage Lowest High High High High Efficiency

Reproducibility Yes Yes No Yes Yes

Figure 3: Qualitative comparisons of existing systems on key desiderata for a model selection system.

Table 1: Notation used in Section 3 runs for k epochs–we relax this later1. Shuffle the dataset Symbol Description once and split into p partitions, with each partition located S Set of training configurations on one of p worker machines. Given these inputs, MOP works as follows. Pick p configs from S and assign one per p Number of data partitions/workers worker (Section 5 explains how we pick the subset). On each k Number of epochs for S to be trained worker, the assigned config is trained on the local partition for a single sub-epoch, which we also call a training unit. Model size (uniform for exposition sake) Completing a training unit puts that worker back to the b Mini-batch size idle state. An idle worker is then assigned a new config that D Training dataset (hDi : dataset size, |D| : has not already been trained and also not being currently number of examples) trained on another worker. Overall, a model “hops” from one worker to another after a sub-epoch. Repeat this process until all configs are trained on all partitions, completing one granularity of each mini-batch. The most prominent ex- epoch for each model. Repeat this every epoch until all ample is Parameter Server (PS) [44]. PS is a set of sys- configs in S are trained for k epochs. The invariants of tems for data-parallel ML. A typical PS consists of servers MOP can be summarized as follows: and workers; servers maintain the globally shared model weights, while workers compute SGD gradients on a locally • Completeness: In a single epoch, each training config stored data partition. Workers communicate with servers is trained on all workers exactly once. periodically to update and retrieve model weights. Based • Model training isolation: Two training units of the on the nature of these communications, PS has two variants: same config are not run simultaneously. synchronous and asynchronous. Asynchronous PS is highly • Worker/partition exclusive access: A worker executes scalable but unreproducible; it often has poorer convergence only one training unit at a time. than synchronous PS due to stale updates but synchronous • Non-preemptive execution: An individual training unit PS has higher overhead for synchronization. is run without preemption once started. All PS-style approaches have high communication due to their centralized all-to-one communications, which is pro- Insights Underpinning MOP. MOP exploits a formal portional to the number of mini-batches and orders of mag- property of SGD: any random ordering of examples suf- nitude higher than BSP, e.g., 1000x in Table 2. fices for convergence [7, 10]. Each of the p configs visits Decentralized Fine-grained. The best example is the data partitions in a different (pseudorandom) yet in se- Horovod [63]. It adopts HPC-style techniques to enable syn- quential order. Thus, MOP offers high accuracy for all mod- chronous all-reduce SGD. While this approach is bandwidth els, comparable to sequential SGD. While SGD’s robustness optimal, communication latency is still proportional to the has been exploited before in ML systems, e.g., in Parameter number of workers, and the synchronization barrier can be- Server [44], MOP exploits it at the partition level instead of come a bottleneck. The total communication overhead is at the mini-batch level to reduce communication costs. This also proportional to the number of mini-batches and orders is possible because we connect this property with model se- of magnitude higher than BSP, e.g., 500x in Table 2. 
lection workloads instead of training one model at a time. Positioning MOP. As Figure 1(C) shows, MOP is a new hybrid of task- and data-parallelism that is a form 3. MODEL HOPPER PARALLELISM of “bulk asynchronous” parallelism. Like task-parallelism, We first explain how MOP works and its properties. Ta- MOP trains many configs in parallel but like BSP, it runs ble 1 presents some notation. We also theoretically compare on partitions. So, MOP is more fine-grained than task par- the communication costs of MOP and prior approaches. allelism but more coarse-grained than BSP. MOP has no global synchronization barrier within an epoch. Later in 3.1 Basic Idea of MOP 1Section 4.2 (Supporting Multiple AutoML Procedures) ex- We are given a set S of training configurations (“configs” plains further how Cerebro can support different configs for short). For simplicity of exposition, assume for now each being trained for different numbers of epochs.

2162 Table 2: Communication cost analysis of MOP and other approaches. ?Full replication. †Remote reads. Interactions Extensible Components ‡ Parameters for the example: k = 20, |S| = 20, p = Invokes (1) Register workers and data

10, m = 1GB, hDi = 1TB, and |D|/b = 100K. Flow of data, results, (2) Launch model selection workload Comm. Cost Example‡ and information and get results

Model Hopper Parallelism kmp|S| + m|S| 4 TB Cerebro API Catalog ? Task Parallelism (FR ) phDi + m|S| 10 TB Hyperband Data Task Parallelism (RR†) k|S|hDi + m|S| 400 TB Catalog Grid Search Scheduler Bulk Synchronous Parallelism 2kmp|S| 8 TB Resource   PBT Catalog |D| Centralized Fine-grained 2kmp|S| bp 80 PB TensorFlow Task Executor   Handler Decentralized Fine-grained kmp|S| |D| 40 PB Task Model bp Launcher Hopper Resource PyTorch Handler Monitor

Section 5, we dive into how Cerebro uses MOP to sched- ule S efficiently and in a general way. Overall, while the core Cluster idea of MOP is simple–perhaps even obvious in hindsight–it has hitherto not been exploited in its full generality in ML Figure 4: System architecture of Cerebro. systems. then invokes model fn to instantiate the neural architec- Reproducibility. MOP does not restrict the visit ordering. ture and potentially restore the model state from a previ- So, reproducibility is trivial in MOP: log the worker visit ous checkpointed state. The train fn is invoked to perform order for each configuration per epoch and replay with this one sub-epoch of training. We assume validation data is order. Crucially, this logging incurs very negligible overhead also partitioned and use the same infrastructure for evalu- because a model hops only once per partition, not for every ation. During evaluation, Cerebro marks model param- mini-batch, at each epoch. eters as non-trainable before invoking train fn. We also support higher-level API methods for AutoML procedures 3.2 Communication Cost Analysis that resemble the popular APIs of Keras [57]. Note that We summarize the communication costs of MOP and model fn is highly general, i.e., Cerebro supports all neu- other approaches in Table 2. It also illustrates the com- ral computational graphs on all data types supported by munication costs in bytes for a realistic example based on the underlying deep learning tool, including CNNs, RNNs, our case study in Section 1. MOP reaches the theoretical transformers, etc. on structured data, text, images, video, minimum cost of kmp|S|. Crucially, note that this cost does etc. Due to space constraints, more details of our APIs, in- not depend on batch size, which underpins MOP’s higher cluding full method signatures and a fleshed out example of efficiency. BSP also has the same asymptotic cost but un- how to use Cerebro are provided in the appendix of our like MOP, BSP typically converges poorly for deep nets and technical report [53]. lacks sequential-equivalence. Fine-grained approaches like PS and Horovod have communication costs proportional to 4.2 System Architecture the number of mini-batches, which can be orders of mag- We adopt an extensible architecture, as Figure 4 shows. nitude higher. In our setting, p is under low 10s, but the This allows us to easily support multiple deep learning tools number of mini-batches can even be 1000s to millions based and AutoML procedures. There are 5 main components: (1) on the batch size. API, (2) Scheduler, (3) Task Executor, (4) Catalog, and (5) Resource Monitor. Scheduler is responsible for orchestrating 4. SYSTEM OVERVIEW the entire workload. It relies on worker and data availabil- ity information from the Catalog. Task Executor launches We present an overview of Cerebro, an ML system that uses MOP to execute deep net model selection workloads. training units on the cluster and also handles model hops. Resource Monitor is responsible for detecting worker failures 4.1 User-facing API and updating the Resource Catalog. Section 5 explains how the Scheduler works and how we achieve fault tolerance and Cerebro API allows users to do 2 things: (1) register elasticity. Next, we describe how Cerebro’s architecture workers and data; and (2) issue a deep net model selection enables high system generality. workload. Workers are registered by IP addresses. As for datasets, Cerebro expects a list of data partitions and their Supporting Multiple Deep Learning Tools. 
The func- availability on each worker. We assume shuffling and parti- tions input fn, model fn, and train fn are written by users tioning are already handled by other means, since these are in the deep learning tool’s APIs. We currently support Ten- well studied. This common data ETL step is also orthogonal sorFlow and PyTorch (it is simple to add support for more). to our focus and is not a major part of the total runtime for To support multiple such tools, we adopt a handler-based ar- iterative deep net training. chitecture to delineate tool-specific aspects: model training, Cerebro takes the reference to the dataset, set of ini- checkpointing and restoring. Note that checkpointing and tial training configs, the AutoML procedure, and 3 user- restoring is how Cerebro realizes model hops. Task Ex- defined functions: input fn, model fn, and train fn. It ecutor automatically injects the tool-specific aspects from first invokes input fn to read and pre-process the data. It the corresponding tool’s handler and runs these functions

2163 Model 1 A B E C D A A B C D E A B C D E Config 2 B D A C B E D B D C E A

Runtime 6 3 3 3 6 Worker 3 C E C D B E A C E A B D

1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 (A) Per-epoch runtimes (B) An optimal task-parallel schedule (C) A non-optimal MOP schedule (D) An optimal MOP schedule Figure 5: Gantt charts of task-parallel and MOP schedules for a sample model selection workload. 4.3 System Implementation Details Table 3: Additional notation used in the MOP MILP formulation We prototype Cerebro in Python using XML-RPC client-server package. Scheduler runs on the client. Each Symbol Description worker runs a single service. Scheduling follows a push-based |S|×p T ∈ IR Ti,j is the runtime of unit model–Scheduler assigns tasks and periodically checks the th th si,j (i configuration on j responses from the workers. We use a shared network file worker) system (NFS) as the central repository for models. Model hopping is realized implicitly by workers writing models to C Makespan of the workload and reading models from this shared file system. Tech- |S|×p X ∈ IR Xi,j is the start time of the ex- nically, this doubles the communication cost of MOP to ecution of ith configuration on 2kmp|S|, still a negligible overhead. Using NFS greatly re- jth partition/worker duces engineering complexity to implement model hops. |S|×p×p Y ∈ {0, 1} Y 0 = 1 ⇐⇒ X < X 0 i,j,j i,j i,j 5. CEREBRO SCHEDULER |S|×|S|×p Z ∈ {0, 1} Zi,i0,j = 1 ⇐⇒ Xi,j < Xi0,j Scheduling training units on workers properly is critical V Very large value (Default: sum because pathological orderings can under-utilize resources of training unit runtimes) substantially, especially when deep net architectures and/or workers are heterogeneous. Consider the model selection workload shown in Figure 5(A). Assume workers are homo- on the workers. Overall, Cerebro’s architecture is highly geneous and there is no data replication. For one epoch general and supports virtually all forms of data types, deep of training, Figure 5(B) shows an optimal task-parallel net architectures, loss functions, and SGD-based optimizers. schedule for this workload with a 9-unit makespan. Fig- ure 5(C) shows a non-optimal MOP schedulewith also 9 Supporting Multiple AutoML Procedures Meta- units makespan. But as Figure 5(D) shows, an optimal MOP heuristics called AutoML procedures are common for explor- schedule has a makespan of only 7 units. Overall, we see that ing training configs. We now make a key observation about MOP’s training unit-based scheduling offers more flexibility such procedures that underpins our Scheduler. Most Au- to raise resource utilization. Next, we formally define the toML procedures fit a common template: create an initial set MOP-based scheduling problem and explain how we design of configs (S) and evaluate them after each epoch (or every our Scheduler. few epochs). Based on the evaluations, terminate some con- figurations (e.g., as in Hyperband [42] and PBT [32]) or add 5.1 Formal Problem Statement as MILP new configurations (e.g., as in PBT). Grid/random search Suppose the runtimes of each training unit, aka unit times, is a one-shot instance of this template. Thus, we adopt this are given. These can be obtained with, say, a pilot run for template for our Scheduler. Given S, Cerebro trains all a few mini-batches and then extrapolating (this overhead models in S for one epoch and passes control back to the will be marginal). For starters, assume each of the p data corresponding AutoML procedure for convergence/termina- partitions is assigned to only one worker. The objective tion/addition evaluations. 
Cerebro then gets a potentially and constraints of the MOP-based scheduling problem is as 0 modified set S for the next epoch. This approach also lets follows. Table 3 lists the additional notation used here. Cerebro support data re-shuffling after each epoch. But the default (and common practice) is to shuffle only once Objective: min C (1) C,X,Y,Z upfront. Grid/random search (perhaps the most popular in Constraints: practice), Hyperband, and PBT (and more procedures) con- 0 0 form to this common template and are currently supported. ∀i, i ∈ [1,..., |S|] ∀j, j ∈ [1, . . . , p] ASHA [43] and Hyperopt [6] are two notable exceptions (a) Xi,j ≥ Xi,j0 + Ti,j0 − V · Yi,j,j0 to the above template, since they do not have a global (b) Xi,j0 ≥ Xi,j + Ti,j − V · (1 − Yi,j,j0 ) synchronized evaluation of training configs after an epoch (c) X ≥ X 0 + T 0 − V · Z 0 (2) and are somewhat tied to task-parallel execution. While i,j i ,j i ,j i,i ,j MOP/Cerebro cannot ensure logically same execution as (d) Xi0,j ≥ Xi,j + Ti,j − V · (1 − Zi,i0,j ) ASHA or HyperOpt on task-parallelism, it is still possi- (e) Xi,j ≥ 0 ble to emulate them on MOP/Cerebro without any mod- (f) C ≥ Xi,j + Ti,j ifications to our system. In fact, our experiments with ASHA show that ASHA on Cerebro has comparable–even We need to minimize makespan C, subject to the con- slightly better!–convergence behavior than ASHA on pure straints on C, unit start times X, model training isolation task-parallelism (Section 6.3). matrix Y , and worker/partition exclusive access matrix Z.

2164 The constraints enforce some of the invariants of MOP listed Algorithm 1 Randomized Scheduling in Section 3. Equations 2.a and 2.b ensure model training 1: Input: S isolation. Equations 2.c and 2.d ensure worker exclusive ac- 2: Q = {si,j : ∀i ∈ [1,..., |S|], ∀j ∈ [1, . . . , p]} cess. Equation 2.e ensures that training unit start times are 3: worker idle ← [true,..., true] non-negative and Equation 2.f ensures that C captures the 4: model idle ← [true,..., true] time taken to complete all training units. 5: while not empty(Q) do Given the above, a straightforward approach to schedul- 6: for j ∈ [1, . . . , p] do ing is to use an MILP solver like Gurobi [27]. The start 7: if worker idle[j] then times X then yield the actual schedule. But our problem 8: Q ← shuffle(Q) is essentially an instance of the classical open-shop schedul- 9: for si,j0 ∈ Q do ing problem, which is known to be NP-Hard [24]. Since |S| 10: if model idle[i] and j0 = j then can even be 100s, MILP solvers may be too slow (more in 11: Execute si,j0 on worker j Section 5.4); thus, we explore alternative approaches. 12: model idle[i] ← false 13: worker idle[j] ← false 5.2 Approximate Algorithm-based Scheduler 14: remove(Q, si,j0 ) For many special cases, there are algorithms with good 15: break approximation guarantees that can even be optimal un- 16: wait WAIT TIME der some conditions. One such algorithm is “vector rear- rangement” [21, 70]. It produces an optimal solution when |S|  p, which is possible in our setting. Algorithm 2 When s finishes on worker j The vector rearrangement based method depends on two i,j 1: model idle[i] ← true values: Lmax (see Equation 3), the maximum load on any 2: worker idle[j] ← true worker; and Tmax (see Equation 4), the maximum unit time of any training configuration in S. |S| X Lmax = max Ti,j (3) j∈[1,...,p] Gurobi. We set a maximum optimization time of 5min for i=1 tractability sake. We compare the scheduling methods on 3 Tmax = max Ti,j (4) i∈[1,...,|S|],j∈[1,...,p] dimensions: 1) number of training configs (two values: 16 and 256), 2) number of workers (two values: 8 and 16), 3) 2 If Lmax ≥ (p + p − 1) · Tmax, this algorithm’s output is homogeneity/heterogeneity of configs and workers. optimal. When there are lots of configs, the chance of the Sub-epoch training time (unit time) of a training con- above constraint being satisfied is high, yielding us an opti- fig is directly proportional to the compute cost of the con- mal schedule. But if the condition is not met, the schedule fig and inversely proportional to compute capacity of the ∗ produced yields a makespan C ≤ C + (p − 1) · Tmax, where worker. For the homogeneous setting, we initialize all train- C∗ is the optimal makespan value. This algorithm scales to ing config compute costs to be the same and also all worker large |S| and p because it runs in polynomial time in contrast compute capacities to be the same. For the heterogeneous to the MILP solver. For more details on this algorithm, we setting, training config compute costs are randomly sampled refer the interested reader to [21, 70]. (with replacement) from a set of popular deep CNNs (n=35) obtained from [3]. The costs vary from 360 MFLOPS to 5.3 Randomized Algorithm-based Scheduler 21000 MFLOPS with a mean of 5939 MFLOPS and stan- The approximate algorithm is complex to implement in dard deviation of 5671 MFLOPS. Due to space constraints some cases, and its optimality condition may be violated of- we provide these computational costs in the Appendix of ten. 
Thus, we now consider a much simpler scheduler based our technical report [53]. For worker compute capacities, on randomization. This approach is simple to implement we randomly sample (with replacement) compute capacities and offer much more flexibility (explained more later). Al- from 4 popular Nvidia GPUs: Titan Xp (12.1 TFLOPS/s), gorithm 1 presents our randomized scheduler. K80 (5.6 TFLOPS/s), GTX 1080 (11.3 TFLOPS/s), and Given S, create Q = {si,j : ∀i ∈ [1, ..., |S|], j ∈ [1, .., p]}, P100 (18.7 TFLOPS/s). For each setting, we report the the set of all training units. Note that si,j is the train- average of 5 runs with different random seeds set to the ing unit of configuration i on worker j. Initialize the state scheduling algorithms and also the min and max of all 5 of all models and workers to idle state. Then find an idle runs. All makespans reported are normalized by the ran- worker and schedule a random training unit from Q on it. domized scheduler’s makespan. This training unit must be such that its configuration is not The MILP scheduler sometimes performs poorer than the scheduled on another worker and it corresponds to the data other two because it has not converged to the optimal in partition placed on that worker (Line 10). Then remove the the given time budget. The approximate scheduler performs chosen training unit from Q. Continue this process until poorly when both the configs and workers are heterogeneous. no worker is idle and eventually, until Q is empty. After It is also slower than the randomized scheduler. a worker completes training unit si,j mark its model i and Overall, the randomized approach works surprisingly well worker j as idle again as per Algorithm 2. on all aspects: near-optimal makespans with minimal vari- ance across runs and very fast scheduling. We believe this 5.4 Comparing Different Scheduling Methods interesting superiority of the randomized algorithm against We use simulations to compare the efficiency and the approximation algorithm is due to some fundamental makespans yielded by the three alternative schedulers. The characteristics of deep net model selection workloads, e.g., MILP and approximate algorithm are implemented using large number of configurations and relatively low differences

2165 hc ewe h ceue n okr.Bcuethe Because train- workers. between checkpointed always and are states scheduler model trained the between future check 1. a Algorithm at in workers loop the the those execute of on avail-iteration successfully units also will training are scheduler worker corresponding the failed from elsewhere, the back on able moved training present be the partitions before will data fails it worker finishes, the if unit But worker. from assigned removed be will from it scheduled, is Q unit both training until a continues When 1 Algorithm in tures just of Instead tolerant. fault Elasticity and Tolerance Fault worker. that 5.6 on 10: available Line is partition in relevant is the 1 if Algorithm check to checking needed of schedul- change instead replica-aware randomized only for the shop The contrast, extended open In ing. easily the skip be too. also and to can that it is scheduler coupled skip use we tightly algorithm not so, is approximate do problem; it the we because So, Modifying sched- with non-trivial MILP slower. problem the details. got this extended only its We of it case one. but of special uler factor a still replication is is a problem it shop but open scheduling, configura- shop training open a MOP, avail- need replica-aware (an tion In workers on map). such ability partitions exploit of now information plans. availability We faster and flexibility sake. scheduling reliability replicate more often for for HDFS) replicas say, (e.g., files, systems file data some But worker. Scheduling Replica-Aware 5.5 default on the Based as approach in work. randomized Scheduler the future use to we algorithm results, these randomized anal- the theoretical thorough of a ysis leave We capacities. compute in hetero- and cluster configs. training Heterogeneous train- geneous (B) homogeneous configs. and Randomized. ing cluster of that Homogeneous to (A) respect with Makespans the normalized settings. are of different makespans in and produced runtimes schedules Scheduler 6: Figure Cerebro scheduler randomized our make we how explain now We input: additional an requires scheduler replica-aware The one only on available is partition a that assumed we far So

sbfr u o also now but before as Sched. Time (s) Makespan A Q Q 0 and Homo. clusterandconfigs hni ucsflycmltsistann nthe on training its completes successfully it when 16 Configs not Randomized Approximate MILP Q Cerebro eet alrsvateproi heart-beat periodic the via failures detects ii l okr.Ti xeso osbeyond goes extension This workers. all visit 0 . Cluster Size Q 0 j 256 Configs siiilzdt eepy h process The empty. be to initialized is 0 . = j osl h viaiiympto map availability the consult , added Q emiti w aastruc- data two maintain we , to B 16 Configs Q Hetero. clusterandconfigs Q 0 NP-Hard twl eremoved be will It . and Cluster Size Q 0 Q to 256 Configs 0 eas the because r empty. are Q fthe If . 2166 vdaP0 P.AlndsrnUut 60.W use We 16.04. Ubuntu run nodes 10 as extra All v1.12.0 an and 10- TensorFlow has GPU. HDD Xeon node P100 worker master 1TB Intel cluster Nvidia memory, two 1 GPU Each has 192GB and network. clusters CPUs, nodes Gbps both GHz worker in 2.20 8 node core has Each cluster Each node. [19]. Lab for SGD Setup. our Experimental as on results [36] ASHA present Adam and also HyperOpt use we dataset. for generality, We demonstrate each To hyper- details. for the method. for configs offers search training 5 Table grid 16 yielding and parameters, architectures neural different used is Only Workloads. dataset the statistics. data. of dataset the the sample densify en- summarizes random and one-hot 2.5% We features a categorical representation. sparse the under code shipped features. is categorical and re- It numeric and with version dataset 112 classification 2012 to images the the choose shape We dataset. classification geNet Datasets. We selection. Cerebro model net evaluate deep then of efficiency and throughput EVALUATION EXPERIMENTAL 6. [53]. hy- report this technical our explain in we across further constraints, MOP architecture sub-clusters space brid and virtual to sub-cluster into Due each cluster basic sub-clusters. within the The Horovod divide run reproducible. and simple: is hybrid is the idea so, SGD; with sequential like MOP Just hybridizing of Horovod. hybrid we doubly a implementing limitation, of by this data-parallelism possibility overcome the To explored cluster. the under-utilize hsmasa h ae tgs emyecutrasitua- a encounter convergence. may till we train stages, may later where configs the tion few at a means only This So, epochs. large Hybrid Horovod Extension: 5.7 and in resources can elasticity compute mechanism scale-out new safely seamless same of be availability support The detect can to model others used storage. checkpointed and be reclaiming recovery last failure for very the deleted the for Only needed units training is failed restarted. the be and recovered can be can they units, ing datasets. pre- the after of are sampling numbers All and processing details. Dataset 4: Table a opeei esnbeaon ftm.Ti decision This time. of amount reasonable does in complete can 2 emd hsdcso nys htalo u experiments our of all that so only decision this made We eeprclyvldt if validate empirically We oeAtM rcdrs(.. yebn)satwith start Hyperband) (e.g., procedures AutoML Some mgNt20G .MHF 1000 Binary TFRecords 100M HDF5 Class GB 1.2M 400 Format GB 250 Count Criteo size On-disk ImageNet Dataset Criteo not | S 1]and [18] | u hnkl oennpoiigcng fe some after configs non-promising some kill then but le h aewy rmorexperiments. our from takeaways the alter saiiyt upr utpeAtM procedures. 
AutoML multiple support to ability ’s n P-nbe for GPU-enabled and | S euetolrebnhakdatasets: benchmark large two use We | Cerebro osbelow goes o u rtedt-n et euetwo use we test, end-to-end first our For Criteo Cerebro Cerebro [14]. × ndph ial,w demonstrate we Finally, depth. in euetocutr:CPU-only clusters: two use We 1 pixels 112 p nsc cases, such In . ooo sas qiaetto equivalent also is Horovod , Cerebro Cerebro ImageNet sudryn eplearning deep underlying ’s ImageNet 2 . Criteo nScin6.3. Section in Cerebro a mrv overall improve can sapplrimage popular a is oho Cloud- on both , Cerebro Cerebro sa dclick ad an is 2 al 4. Table . . Ima- and can TF Model Averaging Cerebro Horovod ImageNet Criteo TF Parameter Server - Async. Celery GPU Storage CPU Storage System Runtime Runtime 100 Utili. Footprint Utili. Footprint (hrs) (hrs) (%) (GB ) (%) (GB ) 85 TF PS - 19.00 8.6 250 28.80 6.9 400 Async 70 Horovod 5.42 92.1 250 14.06 16.0 400 TF Model 1.97 72.1 250 3.84 52.2 400 55

Averaging (%) Error Validation Top-5

Celery 1.72 82.4 2000 3.95 53.6 3200 40 1 2 3 4 5 6 7 8 9 10 Cerebro 1.77 79.8 250 3.40 51.9 400 Epoch (A) Per-epoch makespans and CPU/GPU utilization. (B) Learning curves of the resp. best configs on ImageNet.

Figure 7: End-to-end results on ImageNet and Criteo. For Celery, we report the runtime corresponding to the lowest makespan schedule. Celery’s per-epoch runtime varies between 1.72-2.02 hours on ImageNet; on Criteo, 3.95-5.49 hours. Horovod uses GPU kernels for communication; hence its high GPU utilization.

Table 5: Workloads.?architectures similar to VGG16 and ResNet50, respectively.†serialized sizes. Dataset Model arch. Model size/MB† Batch size Learning rate Regularization Epochs ImageNet {VGG16?, ResNet50?} VGG16: 792, ResNet50: 293 {32, 256}{10−4, 10−6}{10−4, 10−6} 10 Criteo 3-layer NN, 1000+500 hidden units 179 {32, 64, 256, 512}{10−3, 10−4}{10−4, 10−5} 5

(A) Speedup (Strong Scaling) (B) Fault Tolerance tool. For GPU nodes, we use CUDA version 9.0 and cuDNN 9 40 W2 Fails W2 Recovers version 7.4.2. Both datasets are randomly shuffled and split 8 38 7 36 into 8 equi-sized partitions. 6 34 5 32 4 30 3 2 28 6.1 End-to-End Results 1 26 1 2 4 8 Time (minutes) Per-Epoch 1 2 3 4 5 Speedup Against 1 Worker Epoch We compare Cerebro with 5 systems: 4 data-parallel– Cluster Size W1 Fails W1 Recovers synchronous and asynchronous TensorFlow Parameter Server, Horovod, BSP-style TensorFlow model averaging– Figure 8: (A) Speedup plot (strong scaling). (B) and 1 task-parallel (Celery). For Celery, we replicate Fault-tolerance. datasets to each worker beforehand and stream them from disk, since they do not fit in memory. I/O time is trivial for report severe CPU under-utilization (< 7%). Cerebro is deep nets, where computation dominates; thus, they can be also 4x faster than Horovod. Cerebro’s runtime is compa- interleaved. We use TensorFlow features to achieve this. For rable to model averaging, with about 52% CPU utilization. all other systems, each worker node has one in-memory data Celery is somewhat slower than Cerebro due to a straggler partition. We do not include data copying in the end-to-end issue caused by the highly heterogeneous model configs for runtimes. For scheduling, Celery uses a FIFO queue and Criteo. Cerebro’s MOP approach offers higher flexibility Cerebro uses the randomized scheduler. All other systems to avoid such straggler issues. A more detailed explanation train models sequentially. is given in the appendix of our technical report [53]. All Figure 7 presents the results. Cerebro significantly im- methods have almost indistinguishable convergence behav- proves the efficiency and throughput of model selection. On ior on this dataset: all reached 99% accuracy quickly, since ImageNet, Cerebro is over 10x faster than asynchronous the class label is quite skewed. PS, which has a GPU utilization as low as 9%! Synchronous Overall, Cerebro is the most resource-efficient approach PS was even slower. Cerebro is 3x faster than Horovod. when compute, memory/storage, and network are consid- Horovod has high GPU utilization because it uses GPU for ered holistically. It also has the best accuracy behavior, on communication. Cerebro’s runtime is comparable to model par with task-parallelism. averaging, which is as expected. But note model averaging converges poorly. Celery’s runtime is dependent on the ex- 6.2 Drill-down Experiments ecution order and thus we report the runtime on the opti- Unless specified otherwise, we now show experiments on mal schedule. On ImageNet, Celery’s runtime is compara- the GPU cluster, ImageNet, and a model selection work- ble to . But note that Celery has a highly bloated Cerebro load of 8 configs (4 learning rates, 2 regularization values, 8x memory/storage footprint. Overall, Celery and Cere- and ResNet architectures) trained for 5 epochs. Each data have the best learning curves–this is also as expected bro partition is placed on only one worker. because MOP ensures sequential equivalence for SGD, just like task-parallelism. Horovod converges slower due to its Scalability. We study the speedups (strong scaling) of larger effective mini-batch size. Cerebro and Horovod as we vary the cluster sizes. Fig- On Criteo, Cerebro is 14x faster than synchronous PS ure 8(A) shows the speedups, defined as the workload com- and 8x faster than asynchronous PS. 
Both variants of PS pletion time on multiple workers vs a single worker. Cere-

2167 bro exhibits linear speedups due to MOP’s marginal com- (A) Runtime (B) Network Cost munication costs; in fact, it seems slightly super-linear here because the dataset fits entirely in cluster memory com- pared to the minor overhead of reading from disk on the single worker. In contrast, Horovod exhibits substantially sub-linear speedups due to its much higher communication Runtime (hours) costs with multiple workers. Data Scale by a worker (GB) Data read Data Scale Fault Tolerance. We repeat our drill-down workload with Figure 10: Reading data from remote storage. a replication factor of 3. We first inject two node failures (A) Runtime (B) Network Cost and bring the nodes back online later. Figure 8(B) shows the time taken for each epoch and the points where the work- ers failed and returned online. Overall, we see Cerebro’s replica-aware randomized scheduler can seamlessly execute the workload despite worker failures. Runtime (hours)

(A) Runtime (B) Validation Error Replication Factor by a worker (GB) Data read Replication Factor Figure 11: Reading data from distributed storage.

which is over 10x lower than Celery’s. After the initial read, epochs (%) Runtime (hours) Cerebro only communicates models weights during train- Validation error at 10 error Validation

Batch Size Batch Size ing. In situations where reads and networks are not free (e.g., cloud providers), Celery will incur higher monetary Figure 9: Effect of batch size on communication costs than Cerebro. These results show it is perhaps better overheads and convergence efficiency. (A) Runtime to partition the dataset on S3, cache partitions on workers against batch size. (B) The lowest validation error on the first read, and then run Cerebro instead of Celery after 10 epochs against batch size. with full dataset reads from S3 per epoch to avoid copying. Reading from distributed storage (e.g., HDFS). In this set- Effect of Batch Size. We now evaluate the effect of train- ting, the dataset is partitioned, replicated, and stored on 8 ing mini-batch size for Cerebro and Horovod. We evaluate workers. We then load all local data partitions into each 5 batch sizes and report makespans and the validation er- worker’s memory. Celery performs remote reads for non- ror of the best model for each batch size after 10 epochs. local partitions. We vary the replication factor to study its Figure 9 presents the results. With batch size 32, Horovod effect on the makespan and the number of remote reads. is 2x slower than Cerebro. However, as the batch size Figure 10 presents the results. For replication factors 1 (no increases, the difference narrows since the relative commu- replication), 2, and 4, Cerebro incurs 100x less network nication overhead per epoch decreases. Cerebro also runs usage and is slightly faster than Celery. But at a replication faster with larger batch size due to better hardware utiliza- factor of 8 (i.e., full replication), Cerebro is slightly slower tion. The models converge slower as batch size increases. due to the overhead of model hops. For the same reason, The best validation error is achieved by with a Cerebro Cerebro incurs marginal network usage, while Celery has batch size of 32. With the same setting, Horovod’s best almost no network usage other than control actions. Note validation error is higher than Cerebro; this is because its that the higher the replication factor for Celery, the more effective batch size is 256 (32×8). Horovod’s best validation memory/storage is wasted. Cerebro offers the best overall error is closer to Cerebro’s at a batch size of 256. Overall, resource efficiency–compute, memory/storage, and network Cerebro’s efficiency is more stable to the batch size, since put together–for deep net model selection. models hop per sub-epoch, not per mini-batch. Experiments with Horovod Hybrid. Our experiment Network and Storage Efficiency. We study the trade- with the Horovod Hybrid gave an anti-climactic result: the off between redundant remote reads (wastes network) vs intrinsic network overheads of Horovod meant the hybrid is redundant data copies across workers (wastes memory/s- often slower than regular Cerebro with some workers be- torage). Task parallelism forces users to either duplicate ing idle! We realized that mitigating this issue requires more the dataset to all workers or store it in a common reposito- careful data repartitioning. We deemed this complexity as ry/distributed filesystem and read remotely. Cerebro can perhaps not worth it. Instead, we propose a simpler res- avoid both forms of resource wastage. We assume the whole olution: if |S| falls below p but above p/2, use Cerebro; dataset cannot fit on single-node memory. We compare if |S| falls below p/2, just switch to Horovod. 
This switch Cerebro and Celery in the following 2 settings: incurs no extra overhead. Due to space constraints, we skip Reading from remote storage (e.g., S3). In this setting, Cel- the details here and explain this experiment further in our ery reads data from a remote storage repeatedly each epoch. technical report [53]. For Cerebro each worker remotely reads one data parti- tion and caches it. We change the data scale to evaluate 6.3 Experiments with AutoML Procedures effects on the makespan and the amount of remote reads. We experiment with two popular AutoML procedures: Figure 10 shows the results. Celery is slightly slower than HyperOpt [6] and ASHA [43]. For HyperOpt, we compare Cerebro due to remote read overheads. The most signifi- Cerebro and Spark as the execution backends. Spark is cant advantage of Cerebro is its network bandwidth cost, a backend supported natively by HyperOpt; it distributes

2168 (A) ASHA on Celery (B) ASHA on Cerebro Table 6: Parameter grid used to randomly sample configuration for Section 6.3. Values sampled from Model [ResNet18, ResNet34] Learning rate [10−5, . . . , 10−1] Weight decay coefficient [10−5, . . . , 10−1] Top-5 Validation Error (%) Error Validation Top-5 Top-5 Validation Error (%) Error Validation Top-5 Batch size [16, . . . , 256] Hours Hours Figure 13: ASHA learning curves by time.

HyperOpt. We run an experiment using HyperOpt with a max config budget of 32. We train each config for 10 epochs. With this configuration, HyperOpt on Cerebro (resp. Spark) took 31.8 (resp. 25.9) hours. Figure 12 shows all learning curves. We found that the slightly higher (23%) runtime of Cerebro is mainly due to the lower degree of parallelism (|S| = 8). However, this issue can be mitigated by increasing the number of simultaneously trained configs. Although individual configs are not comparable across the two systems, the best errors achieved are close (34.1% on Cerebro; 33.2% on Spark).

ASHA. We use ASHA with a max epoch budget (R) of 9, a selection fraction (η) of 3, and a time limit of 24 hours. With these settings, ASHA trains a config for a maximum of 13 epochs over 3 stages: 1, 3, and 9 epochs. Only the more promising configurations are trained for more epochs. In the given time limit, ASHA on Cerebro (resp. Celery) explored 83 (resp. 67) configs. Figure 13 shows all learning curves. Like HyperOpt, even though the configs are not directly comparable, the best errors achieved are close (31.9% on Cerebro; 33.2% on Celery). More details about this experiment and experiments with another AutoML procedure (HyperBand) are presented in the appendix of our technical report [53].
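To make the stage arithmetic explicit, the short sketch below (ours; a simplified synchronous view of ASHA's rung budgets, not its asynchronous scheduler) derives the per-stage epoch budgets from R = 9 and η = 3 and shows why a config that survives every stage trains for 1 + 3 + 9 = 13 epochs in total.

    # Simplified rung schedule for ASHA-style successive halving; illustration only.
    def stage_budgets(max_epochs_r=9, eta=3):
        """Per-stage epoch budgets R/eta^k, smallest first (e.g., [1, 3, 9])."""
        budgets = []
        b = max_epochs_r
        while b >= 1:
            budgets.append(int(b))
            b /= eta
        return list(reversed(budgets))

    budgets = stage_budgets()       # [1, 3, 9]
    print(budgets, sum(budgets))    # a fully promoted config trains for 13 epochs
    # After each stage, only roughly 1/eta of the configs are promoted further.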

7. DISCUSSION AND LIMITATIONS

Applications. Cerebro is in active use for time series analytics for our public health collaborators. In the case study from Section 1, Cerebro helped us pick 16 deep net configs to compare. To predict sitting vs. not-sitting, these configs had accuracies between 62% and 93%, underscoring the importance of rigorous model selection. The best configs gave a large lift of 10% over their prior RandomForest model based on hand-engineered time series features. We plan to use Cerebro for more domain science applications in the future on time series, video, graph, and text data.

Open Systems. Cerebro is open sourced and available for download [2]. MOP's generality also enabled us to emulate it on existing data-parallel systems. Pivotal/VMware collaborated with us to integrate MOP into Greenplum by extending the MADlib library [28] for running TensorFlow on Greenplum-resident data [47, 67]. Greenplum's customers are interested in this for enterprise ML use cases, including language processing, image recognition, and fraud detection. We have also integrated Cerebro into Apache Spark [16]. Cerebro-Spark can run MOP on existing resource managers such as YARN and Mesos. Alternatively, one can also deploy Cerebro as a standalone application by wrapping it as tasks accepted by the resource manager. We leave such an extension to future work.

Other ML Model Families. We focused primarily on deep nets due to their growing popularity, high sensitivity to model configurations, and resource intensiveness. However, note that MOP and Cerebro's ideas are directly usable for model selection of any ML models trainable with SGD. Examples include linear/logistic regression, some support vector machines, low-rank matrix factorization, and conditional random fields. In fact, since linear/logistic regression can be trivially expressed in the deep learning tools' APIs, Cerebro will work out of the box for them. Cerebro's high memory efficiency makes it easier for users to store entire large datasets in distributed memory, which can significantly reduce the runtimes of such I/O-bound ML models. We leave an empirical analysis of these less compute-intensive models to future work.
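As an example of the last point, logistic regression is just a one-layer network with a sigmoid output, so it can be written directly against a deep learning tool's API and then treated like any other config in an SGD-based model selection workflow. The Keras-style sketch below is a generic illustration of that claim (not code from Cerebro); the feature dimensionality is a placeholder.

    import tensorflow as tf

    NUM_FEATURES = 100  # placeholder; depends on the dataset

    # Logistic regression expressed as a single dense layer trained with SGD.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1, activation="sigmoid", input_shape=(NUM_FEATURES,)),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
                  loss="binary_crossentropy",
                  metrics=["accuracy"])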
Model Parallelism and Batching. Cerebro currently does not support model parallelism (for models larger than single-node memory) or model batching (running multiple models on a worker at a time). It is possible to remove these two limitations from Cerebro. For instance, model parallelism can be supported with the notion of virtual nodes composed of multiple physical nodes that together hold a very large model. Model batching can be supported with multiple virtual nodes mapped to a physical node. We leave these extensions to future work.
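One way to picture the virtual-node idea is as an extra level of indirection in the map of workers the scheduler assigns work to. The sketch below is purely hypothetical (these structures are not part of Cerebro today); it only illustrates how both extensions reduce to changing the virtual-to-physical mapping.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class VirtualNode:
        """Hypothetical unit of scheduling for the future extensions above."""
        name: str
        physical_nodes: List[str] = field(default_factory=list)

    # Model parallelism: one virtual node spans several physical nodes that
    # together hold one very large model.
    big_model_node = VirtualNode("vnode-0", ["worker-1", "worker-2", "worker-3"])

    # Model batching: several virtual nodes map onto the same physical node,
    # so multiple smaller models train on that worker at a time.
    batched_nodes = [VirtualNode(f"vnode-{i}", ["worker-4"]) for i in range(1, 4)]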

8. RELATED WORK

Systems for Model Selection. Google Vizier [23], Ray Tune [46], Dask-Hyperband [61], SparkDL [15], and Spark-Hyperopt [31] are systems for model selection. Vizier, Ray, and Dask-Hyperband are pure task-parallel systems that implement some AutoML procedures. SparkDL and Spark-Hyperopt use Spark for execution but distribute configs in a task-parallel manner with full data replication. Cerebro offers higher overall resource efficiency compared to pure task- or pure data-parallel approaches.

AutoML Procedures. AutoML procedures such as Hyperband [42] and PBT [32] are orthogonal to our work and exist at a higher abstraction level. They fit a common template of per-epoch scheduling in Cerebro. While ASHA [43] does not fit this template, Cerebro can still emulate it well and offer similar accuracy. Bayesian optimization is a class of AutoML procedures, some of which have a high degree of parallelism for searching configs (e.g., Hyperopt [6]); Cerebro supports such procedures. Some others run a sequential search, leading to a low degree of parallelism (e.g., [5, 37]); these may not be a fit for Cerebro.

Distributed SGD Systems. There is much prior work on data-parallel distributed SGD, including centralized fine-grained (e.g., [30, 34, 58, 73]) and decentralized fine-grained (e.g., [45, 58, 69]). These are all complementary to our work because they train one model at a time, while we focus on parallel model selection. As we showed, such approaches have higher communication complexity and thus higher runtimes than MOP in our setting. Also, since Cerebro performs logically sequential SGD, it ensures theoretically best convergence efficiency. CROSSBOW [38] proposes a new variant of model averaging for the single-server multi-GPU setting. But it is also complementary to our work, since it also trains one model at a time. Overall, our work breaks the dichotomy between data- and task-parallel approaches, thus offering better overall resource efficiency.

Hybrid Parallelism in ML Systems. MOP is inspired by the classical idea of process migration in OS multiprocessing [4]. We bring that notion to the data-partitioned cluster setting. This generic idea has been used before in limited contexts in ML systems [8, 39]. The closest to our work is [13], which proposes a scheme for training many homogeneous CNNs on a homogeneous GPU cluster. They propose a ring topology to migrate models, resembling a restricted form of MOP. But their strong homogeneity assumptions can cause stalls in general model selection workloads, e.g., due to heterogeneous neural architectures and/or machines. In contrast, we approach this problem from first principles and formalize it as an instance of open shop scheduling. This powerful abstraction lets Cerebro support arbitrary deep nets and data types, as well as heterogeneous neural architectures and machines. It also enables Cerebro to support replication, fault tolerance, elasticity, and arbitrary AutoML procedures, unlike prior work. SystemML also supports a hybrid of task- and data-parallelism for better plan generation for linear algebra-based classical ML on top of MapReduce [9]. Cerebro is complementary due to its focus on deep nets and SGD's data access pattern, not linear algebra-based classical ML. Finally, a recent benchmark study suggested that communication bottlenecks inherent in pure data-parallelism imply hybrid parallelism is crucial for scalable ML systems [66]. Our work validates that suggestion for deep learning workloads.

Multi-Query and Other System Optimizations. MOP is also inspired by multi-query optimization (MQO) [62]. A recent line of work in the database literature studies MQO for deep learning, including staging and sharing work in CNN transfer learning [49] and batched incremental view maintenance for CNN inference [50, 51, 59]. Cerebro furthers this research direction. All these MQO techniques are complementary and can be used together. Several works optimize the internals of deep net or SGD systems, including communication-computation pipelining [54], new compilation techniques [33], model batching [55], and execution on compressed data [41]. They are complementary to Cerebro, since they optimize lower-level issues. MOP's generality enables Cerebro to be hybridized with such ideas.

Scheduling. Gandiva [71], Tiresias [26], and SLAQ [72] are cluster scheduling frameworks for deep learning. They focus on lower-level primitives such as resource allocation and intra-server locality for reducing mean job completion times. Cerebro is complementary as it exists at a higher abstraction level and focuses on model selection throughput. How compute hardware is allocated is outside our scope. There is a long line of work on job scheduling in the operations research and systems literatures [12, 22, 29]. Our goal is not to create new scheduling algorithms but to apply known techniques to a new ML systems setting.

9. CONCLUSIONS AND FUTURE WORK

Simplicity that still achieves maximal functionality and efficiency is a paragon of virtue in real-world systems. We present a simple but novel and highly general form of parallel SGD execution, MOP, that raises the resource efficiency of deep net model selection without sacrificing accuracy or reproducibility. MOP is also simple to implement, which we demonstrate by building Cerebro, a fault-tolerant deep net model selection system that supports multiple popular deep learning tools and model selection procedures. Experiments with large ML benchmark datasets confirm the benefits of Cerebro. As for future work, we plan to hybridize MOP with model parallelism and batching and also support more complex model selection scenarios such as transfer learning.

Acknowledgments. This work was supported in part by a Hellman Fellowship, the NIDDK of the NIH under award number R01DK114945, an NSF CAREER Award under award number 1942724, and a gift from VMware. The content is solely the responsibility of the authors and does not necessarily represent the views of any of these organizations. We thank the members of UC San Diego's Database Lab and Center for Networked Systems, Natarajan and our public health collaborators at UC San Diego, Frank McQuillan and the Apache MADlib/Greenplum team at VMware, Carlo Curino, Matteo Interlandi, and Julian McAuley for their feedback on this work.

10. REFERENCES

[1] Script for Tensorflow Model Averaging, Accessed January 31, 2020. https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/avg_checkpoints.py.
[2] Cerebro Documentation, Accessed July 5, 2020. https://adalabucsd.github.io/cerebro-system/.
[3] S. Albanie. Memory Consumption and FLOP Count Estimates for Convnets, Accessed January 31, 2020. https://github.com/albanie/convnet-burden.
[4] R. H. Arpaci-Dusseau and A. C. Arpaci-Dusseau. Operating Systems: Three Easy Pieces. Arpaci-Dusseau Books LLC, 2018.
[5] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for Hyper-Parameter Optimization. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011, pages 2546–2554, 2011.
[6] J. Bergstra, D. Yamins, and D. D. Cox. Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, volume 28 of JMLR Workshop and Conference Proceedings, pages 115–123. JMLR.org, 2013.
[7] D. P. Bertsekas. A New Class of Incremental Gradient Methods for Least Squares Problems. Society for Industrial and Applied Mathematics Journal on Optimization, 7(4):913–926, Apr. 1997.
[8] M. Boehm, A. Kumar, and J. Yang. Data Management in Machine Learning Systems. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2019.
[9] M. Boehm, S. Tatikonda, B. Reinwald, P. Sen, Y. Tian, D. R. Burdick, and S. Vaithyanathan. Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML. PVLDB, 7(7):553–564, 2014.
[10] L. Bottou. Curiously Fast Convergence of some Stochastic Gradient Descent Algorithms. Unpublished open problem offered to the attendance of the Symposium on Learning and Data Science 2009 conference, 2009.
[11] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[12] P. Brucker. Scheduling Algorithms. Springer-Verlag, 3rd edition, 2001.
[13] C. R. Chen, G. G. C. Lee, Y. Xia, W. S. Lin, T. Suzumura, and C. Lin. Efficient Multi-training Framework of Image Deep Learning on GPU Cluster. In 2015 IEEE International Symposium on Multimedia, ISM 2015, pages 489–494. IEEE Computer Society, 2015.
[14] CriteoLabs. Kaggle Contest Dataset Is Now Available for Academic Use!, Accessed January 31, 2020. https://ailab.criteo.com/category/dataset.
[15] Databricks. Deep Learning Pipelines for Apache Spark, Accessed January 31, 2020. https://github.com/databricks/spark-deep-learning.
[16] Databricks. Resource-efficient Deep Learning Model Selection on Apache Spark, Accessed May 30, 2020. https://bit.ly/3esN3JT.
[17] Y. N. Dauphin, H. de Vries, J. Chung, and Y. Bengio. RMSProp and Equilibrated Adaptive Learning Rates for Non-convex Optimization. CoRR, abs/1502.04390, 2015.
[18] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. ImageNet: A Large-scale Hierarchical Image Database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), pages 248–255. IEEE Computer Society, 2009.
[19] D. Duplyakin, R. Ricci, A. Maricq, G. Wong, J. Duerig, E. Eide, L. Stoller, M. Hibler, D. Johnson, K. Webb, A. Akella, K. Wang, G. Ricart, L. Landweber, C. Elliott, M. Zink, E. Cecchet, S. Kar, and P. Mishra. The Design and Operation of CloudLab. In Proceedings of the USENIX Annual Technical Conference (ATC), pages 1–14, July 2019.
[20] X. Feng, A. Kumar, B. Recht, and C. Ré. Towards a Unified Architecture for In-RDBMS Analytics. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12, pages 325–336. Association for Computing Machinery, 2012.
[21] T. Fiala. An Algorithm for the Open-shop Problem. Mathematics of Operations Research, 8(1):100–109, 1983.
[22] J. V. Gautam, H. B. Prajapati, V. K. Dabhi, and S. Chaudhary. A Survey on Job Scheduling Algorithms in Big Data Processing. In 2015 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT), pages 1–11, 2015.
[23] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google Vizier: A Service for Black-box Optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '17, pages 1487–1495. Association for Computing Machinery, 2017.
[24] T. Gonzalez and S. Sahni. Open Shop Scheduling to Minimize Finish Time. J. ACM, 23(4):665–679, Oct. 1976.
[25] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[26] J. Gu, M. Chowdhury, K. G. Shin, Y. Zhu, M. Jeon, J. Qian, H. H. Liu, and C. Guo. Tiresias: A GPU Cluster Manager for Distributed Deep Learning. In 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019, pages 485–500. USENIX Association, 2019.
[27] Gurobi. Gurobi Optimization, Accessed January 31, 2020. https://www.gurobi.com.
[28] J. M. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The MADlib Analytics Library: Or MAD Skills, the SQL. PVLDB, 5(12):1700–1711, 2012.
[29] W. Herroelen, B. D. Reyck, and E. Demeulemeester. Resource-constrained Project Scheduling: A Survey of Recent Developments. Computers & Operations Research, 25(4):279–302, 1998.
[30] Y. Huang, T. Jin, Y. Wu, Z. Cai, X. Yan, F. Yang, J. Li, Y. Guo, and J. Cheng. FlexPS: Flexible Parallelism Control in Parameter Server Architecture. PVLDB, 11(5):566–579, 2018.

[31] HyperOpt. Scaling out Search with Apache Spark, Accessed January 31, 2020. http://hyperopt.github.io/hyperopt/scaleout/spark/.
[32] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, C. Fernando, and K. Kavukcuoglu. Population Based Training of Neural Networks. CoRR, abs/1711.09846, 2017.
[33] Z. Jia, M. Zaharia, and A. Aiken. Beyond Data and Model Parallelism for Deep Neural Networks. In Proceedings of Machine Learning and Systems 2019, MLSys 2019. mlsys.org, 2019.
[34] J. Jiang, B. Cui, C. Zhang, and L. Yu. Heterogeneity-Aware Distributed Parameter Servers. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD '17, pages 463–478. Association for Computing Machinery, 2017.
[35] T.-Y. Kim and S.-B. Cho. Predicting Residential Energy Consumption Using CNN-LSTM Neural Networks. Energy, 182:72–81, 2019.
[36] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. CoRR, abs/1412.6980, 2015.
[37] A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter. Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, volume 54 of Proceedings of Machine Learning Research, pages 528–536. PMLR, 2017.
[38] A. Koliousis, P. Watcharapichat, M. Weidlich, L. Mai, P. Costa, and P. Pietzuch. Crossbow: Scaling Deep Learning with Small Batch Sizes on Multi-GPU Servers. PVLDB, 12(11):1399–1412, 2019.
[39] A. Kumar, M. Boehm, and J. Yang. Data Management in Machine Learning: Challenges, Techniques, and Systems. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD '17, pages 1717–1722. Association for Computing Machinery, 2017.
[40] A. Kumar, R. McCann, J. Naughton, and J. M. Patel. Model Selection Management Systems: The Next Frontier of Advanced Analytics. SIGMOD Rec., 44(4):17–22, May 2016.
[41] F. Li, L. Chen, Y. Zeng, A. Kumar, X. Wu, J. F. Naughton, and J. M. Patel. Tuple-Oriented Compression for Large-Scale Mini-Batch Stochastic Gradient Descent. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD '19, pages 1517–1534. Association for Computing Machinery, 2019.
[42] L. Li, K. G. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: A Novel Bandit-based Approach to Hyperparameter Optimization. J. Mach. Learn. Res., 18:185:1–185:52, 2017.
[43] L. Li, K. G. Jamieson, A. Rostamizadeh, E. Gonina, J. Ben-tzur, M. Hardt, B. Recht, and A. Talwalkar. Massively Parallel Hyperparameter Tuning. In Proceedings of Machine Learning and Systems 2020, MLSys 2020. mlsys.org, 2020.
[44] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B. Su. Scaling Distributed Machine Learning with the Parameter Server. In 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI '14, pages 583–598. USENIX Association, 2014.
[45] X. Lian, W. Zhang, C. Zhang, and J. Liu. Asynchronous Decentralized Parallel Stochastic Gradient Descent. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, volume 80 of Proceedings of Machine Learning Research, pages 3049–3058. PMLR, 2018.
[46] R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, and I. Stoica. Tune: A Research Platform for Distributed Model Selection and Training. CoRR, abs/1807.05118, 2018.
[47] MADlib. User Documentation for Apache MADlib, Accessed May 30, 2020. https://bit.ly/3epbEyS.
[48] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan, and I. Stoica. Ray: A Distributed Framework for Emerging AI Applications. In 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2018, pages 561–577. USENIX Association, 2018.
[49] S. Nakandala and A. Kumar. Vista: Optimized System for Declarative Feature Transfer from Deep CNNs at Scale. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD '20, pages 1685–1700. Association for Computing Machinery, 2020.
[50] S. Nakandala, A. Kumar, and Y. Papakonstantinou. Incremental and Approximate Inference for Faster Occlusion-Based Deep CNN Explanations. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD '19, pages 1589–1606. Association for Computing Machinery, 2019.
[51] S. Nakandala, K. Nagrecha, A. Kumar, and Y. Papakonstantinou. Incremental and Approximate Computations for Accelerating Deep CNN Inference. ACM Trans. Database Syst., 0(ja), 2020.
[52] S. Nakandala, Y. Zhang, and A. Kumar. Cerebro: Efficient and Reproducible Model Selection on Deep Learning Systems. In Proceedings of the 3rd International Workshop on Data Management for End-to-End Machine Learning, pages 1–4, 2019.
[53] S. Nakandala, Y. Zhang, and A. Kumar. Cerebro: A Data System for Optimized Deep Learning Model Selection. https://adalabucsd.github.io/papers/TR_2020_Cerebro.pdf, 2020. [Tech report].
[54] D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Zaharia. PipeDream: Generalized Pipeline Parallelism for DNN Training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP '19, pages 1–15. Association for Computing Machinery, 2019.
[55] D. Narayanan, K. Santhanam, A. Phanishayee, and M. Zaharia. Accelerating Deep Learning Workloads through Efficient Multi-Model Execution. In NeurIPS Workshop on Systems for Machine Learning, December 2018.

[56] S. L. Oh, E. Y. Ng, R. S. Tan, and U. R. Acharya. Automated Diagnosis of Arrhythmia Using Combination of CNN and LSTM Techniques with Variable Length Heart Beats. Computers in Biology and Medicine, 102:278–287, 2018.
[57] T. O'Malley. Hyperparameter Tuning with Keras Tuner, Accessed January 31, 2020. https://blog.tensorflow.org/2020/01/hyperparameter-tuning-with-keras-tuner.html?linkId=81371017.
[58] B. C. Ooi, K.-L. Tan, S. Wang, W. Wang, Q. Cai, G. Chen, J. Gao, Z. Luo, A. K. Tung, Y. Wang, Z. Xie, M. Zhang, and K. Zheng. SINGA: A Distributed Deep Learning Platform. In Proceedings of the 23rd ACM International Conference on Multimedia, MM '15, pages 685–688. Association for Computing Machinery, 2015.
[59] A. Ordookhanians, X. Li, S. Nakandala, and A. Kumar. Demonstration of Krypton: Optimized CNN Inference for Occlusion-Based Deep CNN Explanations. PVLDB, 12(12):1894–1897, 2019.
[60] S. Pafka. Big RAM is Eating Big Data - Size of Datasets Used for Analytics, Accessed January 31, 2020. https://www.kdnuggets.com/2015/11/big-ram-big-data-size-datasets.html.
[61] S. Sievert, T. Augspurger, and M. Rocklin. Better and Faster Hyperparameter Optimization with Dask. In Proceedings of the 18th Python in Science Conference, pages 118–125, 2019.
[62] T. K. Sellis. Multiple-query Optimization. ACM Trans. Database Syst., 13(1):23–52, Mar. 1988.
[63] A. Sergeev and M. D. Balso. Horovod: Fast and Easy Distributed Deep Learning in TF. CoRR, abs/1802.05799, 2018.
[64] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[65] H. Su and H. Chen. Experiments on Parallel Training of Deep Neural Network using Model Averaging. CoRR, abs/1507.01239, 2015.
[66] A. Thomas and A. Kumar. A Comparative Evaluation of Systems for Scalable Linear Algebra-Based Analytics. PVLDB, 11(13):2168–2182, 2018.
[67] VMware. Model Selection for Deep Neural Networks on Greenplum Database, Accessed May 30, 2020. https://bit.ly/2AaQLc2.
[68] P. Warden. The Machine Learning Reproducibility Crisis, Accessed January 31, 2020. https://petewarden.com/2018/03/19/the-machine-learning-reproducibility-crisis.
[69] P. Watcharapichat, V. L. Morales, R. C. Fernandez, and P. Pietzuch. Ako: Decentralised Deep Learning with Partial Gradient Exchange. In Proceedings of the Seventh ACM Symposium on Cloud Computing, SoCC '16, pages 84–97. Association for Computing Machinery, 2016.
[70] G. J. Woeginger. The Open Shop Scheduling Problem. In 35th Symposium on Theoretical Aspects of Computer Science, STACS 2018, volume 96 of LIPIcs, pages 4:1–4:12. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018.
[71] W. Xiao, R. Bhardwaj, R. Ramjee, M. Sivathanu, N. Kwatra, Z. Han, P. Patel, X. Peng, H. Zhao, Q. Zhang, F. Yang, and L. Zhou. Gandiva: Introspective Cluster Scheduling for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2018, pages 595–610. USENIX Association, 2018.
[72] H. Zhang, L. Stafman, A. Or, and M. J. Freedman. SLAQ: Quality-Driven Scheduling for Distributed Machine Learning. In Proceedings of the 2017 Symposium on Cloud Computing, SoCC '17, pages 390–404. Association for Computing Machinery, 2017.
[73] Y. Zou, X. Jin, Y. Li, Z. Guo, E. Wang, and B. Xiao. Mariana: Tencent Deep Learning Platform and Its Applications. PVLDB, 7(13):1772–1777, 2014.
