<<

Cerebro: A Data System for Optimized Deep Learning Model Selection

Supun Nakandala, Yuhao Zhang, and Arun Kumar University of California, San Diego {snakanda, yuz870, arunkk}@eng.ucsd.edu

ABSTRACT process is called model selection, and it is unavoidable be- Deep neural networks (deep nets) are revolutionizing many cause it is how one controls underfitting vs. overfitting [64]. machine learning (ML) applications. But there is a major Model selection is a major bottleneck for the adoption of bottleneck to wider adoption: the pain and resource inten- deep learning among enterprises and domain scientists due siveness of model selection. This empirical process involves to both the time spent and resource costs. Not all ML users exploring deep net architectures and hyper-parameters, of- can afford to throw hundreds of GPUs at their task and burn ten requiring hundreds of trials. Alas, most ML systems resources like the Googles and Facebooks of the world. focus on training one model at a time, reducing through- Case Study. We present a real-world model selection sce- put and raising overall resource costs; some also sacrifice re- nario. Our public health collaborators at UC San Diego producibility. We present Cerebro, a new data system to wanted to try deep nets for identifying different activities raise deep net model selection throughput at scale without (e.g., sitting, standing, stepping, etc.) of subjects from body- raising resource costs and without sacrificing reproducibility worn accelerometer data. The data was collected from a co- or accuracy. Cerebro uses a new parallel SGD execution hort of about 600 people and is labeled. Its size is 864 GB. strategy we call model hopper parallelism that hybridizes During model selection, we tried different deep net archi- task- and data-parallelism to mitigate the cons of these prior tectures such as convolution neural networks (CNNs), long paradigms and offer the best of both worlds. Experiments short-term memory models (LSTMs), and composite models on large ML benchmark datasets show that Cerebro offers such as CNN-LSTMs, which now offer state-of-the-art re- 3x to 10x runtime savings relative to data-parallel systems sults for multivariate time-series classification [35, 56]. Our like Horovod and Parameter Server and up to 8x memo- collaborators also wanted to try different prediction win- ry/storage savings or up to 100x network savings relative dow sizes (e.g., predictions generated every 5 seconds vs. 15 to task-parallel systems. Cerebro also supports heteroge- seconds) and alternative target semantics (e.g., sitting– neous resources and fault tolerance. standing–stepping or sitting vs. not sitting). The training process also involves tuning various hyper-parameters such PVLDB Reference Format: Supun Nakandala, Yuhao Zhang, and Arun Kumar. Cerebro: as learning rate and regularization coefficient. A Data System for Optimized Deep Learning Model Selection. In the above scenario it is clear that the model selection PVLDB, 13(11): 2159-2173, 2020. DOI: https://doi.org/10.14778/3407790.3407816 process generates dozens, if not hundreds, of different mod- els that need to be evaluated in order to pick the best one for the prediction task. Due to the scale of the data and the 1. INTRODUCTION complexity of the task, it is too tedious and time-consuming Deep learning is revolutionizing many ML applications. to manually steer this process by trying models one by one. Their success at large Web companies has created excite- Parallel execution on a cluster is critical for reasonable run- ment among practitioners in other settings, including do- times. 
Moreover, since our collaborators often changed the main sciences, enterprises, and small Web companies, to try time windows and output semantics for health-related anal- deep nets for their applications. But training deep nets is a yses, we had to rerun the whole model selection process painful empirical process, since accuracy is tied to the neu- over and over several times to get the best accuracy for ral architecture and hyper-parameter settings. A common their evolving task definitions. Finally, reproducible model practice to choose these settings is to empirically compare as training is also a key requirement in such scientific settings. many training configurations as possible for the user. This All this underscores the importance of automatically scaling deep net model selection on a cluster with high throughput.

This work is licensed under the Creative Commons Attribution- System Desiderata. We have the following key desiderata NonCommercial-NoDerivatives 4.0 International License. To view a copy for a deep net model selection system. of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing 1) Scalability. Deep learning often has large training [email protected]. Copyright is held by the owner/author(s). Publication rights datasets, larger than single-node memory and sometimes licensed to the VLDB Endowment. even disk. Deep net model selection is also highly compute- Proceedings of the VLDB Endowment, Vol. 13, No. 11 intensive. Thus, we desire out-of-the-box scalability to a ISSN 2150-8097. DOI: https://doi.org/10.14778/3407790.3407816

2159 (A) (B) Model Search/AutoML Procedures Task-Parallel Systems Data-Parallel Systems Task Parallelism Data Parallelism (C) Grid/Random PBT HyperBand … ASHA + high throughput + high data scalability Search - low data scalability - low throughput Dask, Celery, MOP/CEREBRO Async. Param. - memory/storage wastage - high communication cost Cerebro/MOP Vizier, Spark- Async. (This Work) Server HyperOpt

Model Hopper Parallelism (Cerebro) No Partitioning Spark or Sync. Param. + high throughput (Full replication) TF Model Server, Sync. Deep Learning Systems + high data scalability Averaging Horovod + low communication cost Distributed Data + no memory/storage wastage Bulk Fine-grained Partition 1 Partition 2 … Partition p (Partitions) (Mini-batches)

Figure 1: (A) Cerebro combines the advantages of both task- and data-parallelism. (B) System design phi- losophy and approach of Cerebro/MOP (introduced in [52]): “narrow waist” architecture in which multiple model selection procedures and multiple deep learning tools are supported–unmodified–for specifying/exe- cuting deep net computations. MOP is our novel resource-efficient distributed SGD execution approach. (C) Model Hopper Parallelism (MOP) as a approach of task- and data-parallelism. It is the first known form of bulk asynchronous parallelism, filling a major gap in the parallel data systems literature. cluster with large partitioned datasets (data scalability) and tion on partitioned datasets is important because parallel distributed execution (compute scalability). file systems (e.g., HDFS for Spark), parallel RDBMSs, and 2) High Throughput. Regardless of manual grid/ran- “data lakes” typically store large datasets in that manner. dom searches or AutoML searches, a key bottleneck for 2) Resource Inefficiencies. Due to the false dichotomy, model selection is throughput: how many training config- naively combining the above mentioned approaches could urations are evaluated per unit time. Higher throughput cause overheads and resource wastage (Section 2 explains enables ML users to iterate through more configurations in more). For instance, using task-parallelism on HDFS re- bulk, potentially reaching a better accuracy sooner. quires extra data movement and potential caching, substan- 3) Overall Resource Efficiency. Deep net train- tially wasting network and memory/storage resources. An ing uses variants of mini-batch stochastic gradient descent alternative is remote data storage (e.g., S3) and reading re- (SGD) [7, 10, 11]. To improve efficiency, the model selec- peatedly at every iteration of SGD. But this leads to or- tion system has to avoid wasting resources and maximize ders of magnitude higher network costs by flooding the net- resource utilization for executing SGD on a cluster. We work with lots of redundant data reads. On the other hand, have 4 key components of resource efficiency: (1) per-epoch data-parallel systems that train one model at a time (e.g., efficiency: time to complete an epoch of training; (2) con- Horovod [63] and Parameter Servers [44]) incur high com- vergence efficiency: time to reach a given accuracy met- munication costs, leading to high runtimes. ric; (3) memory/storage efficiency: amount of memory/stor- Overall, we see a major gap between task- and data- age used by the system; and (4) communication efficiency: parallel systems today, which leads to substantially lower amount of network bandwidth used by the system. In cloud overall resource efficiency, i.e., when compute, memory/s- settings, compute, memory/storage, and network all mat- torage, and network are considered holistically. ter for overall costs because resources are pay-as-you-go; on shared clusters, which are common in academia, wastefully Our Proposed System. We present Cerebro, a new sys- hogging any resource is unethical. tem for deep learning model selection that mitigates the 4) Reproducibility. Ad hoc model selection with dis- above issues with both task- and data-parallel execution. tributed training is a key reason for the “reproducibility cri- As Figure 1(A) shows, Cerebro combines the advantages sis” in deep learning [68]. 
While some Web giants may not of both task- and data-parallelism, while avoiding the limi- care about unreproducibility for some use cases, this is a tations of each. It raises model selection throughput without showstopper issue for many enterprises due to auditing, reg- raising resource costs. Our target setting is small clusters ulations, and/or other legal reasons. Most domain scientists (say, tens of nodes), which covers a vast majority (over 90%) also inherently value reproducibility. of parallel ML workloads in practice [60]. We focus on the common setting of partitioned data on such clusters. Fig- Limitations of Existing Landscape. We compared exist- ure 1(B) shows the system design philosophy of Cerebro: ing approaches to see how well they cover the above desider- a narrow-waist architecture inspired by [40] to support mul- ata. Unfortunately, each approach falls short on some major tiple AutoML procedures and deep net frameworks. desiderata, as we summarize next. Figure 3 and Section 2.2 Summary of Our Techniques. At the heart of present our analysis in depth. Cere- bro is a simple but novel hybrid of task- and data- 1) False Dichotomy of Task- and Data-Parallelism. parallelism we call model hopper parallelism (MOP) that Prior work on model selection systems, primarily from the fulfills all of our desiderata. MOP is based on our insight ML world, almost exclusively focus on the task-parallel set- about a formal optimization theoretic property of SGD: ro- ting [32, 42, 43]. This ignores a pervasive approach to scale bustness to the random ordering of the data. Figure 1(C) po- to large data on clusters: data partitioning (sharding). A sitions MOP against prior approaches: it is the first known disjoint line of work on data-parallel ML systems do con- form of “Bulk Asynchronous” parallelism, a hybridization of sider partitioned data but focus on training one model at a the Bulk Synchronous parallelism common in the database time, not model selection workloads [44, 63]. Model selec- world and task-parallelism common in the ML world. As

2160 inwrlaswt o ere fparallelism. of degrees low also We with workloads days. tion of model for 5.6). hybrid run net and a often deep weigh 5.5 can for which (Sections crucial workloads, box are selection toler- the our features fault of systems-level extend out partitions, Such then elasticity of We and replication ance, 5). support per- (Section to good aspects scheduler surprisingly that all has find on algorithm and formance randomized alter- simulations simple compare a with a We as algorithms (MILP). problem candidate program scheduling nate linear our integer formalize We mixed clusters. small scheduler choice resource-optimal the network–MOP be can and poor memory/storage, has holistically–compute, averaging”) Overall, for “model [20]. approach convergence behavior called BSP convergence ML the (also that better SGD shown much distributed has work offers Prior but memory/storage behavior. BSP and network of the has efficiency MOP PS) shows, 2 varying than Figure has lower size. or reads dataset on (higher remote based **Task- costs full cost. communication leading communication with communication of (PS), lower efficient Parallelism relatively Server a more Parameter to comparison than a parameter. per mechanism controllable axes line uses a cost Dashed key has *Horovod Conceptual approach communication two wastage. that means memory/storage on and efficiency: art epoch resource prior of 2: with MOP/Cerebro Figure vrl,ti ae ae h olwn contributions: following the makes paper this Overall, basis, its as MOP With • • • Communication Cost per Epoch n admzd sn iuain.W n hta setting. that our find in We well works approximate, scheduler simulations. randomized solver, using (MILP randomized) and alternatives 3 compare of problem as scheduling PyTorch. the well and formalize We TensorFlow as with procedures. types, it AutoML integrate and data We tools and learning nets deep multiple deep arbitrary port MOP. using system selection model build We resource-intensiveness. their pressing of the issue with combined due popularity nets growing deep their on to primarily focus We formal SGD. with a to trained applicable exploiting is MOP by SGD. earlier of property listed approach desiderata execution the all SGD parallel call new we a present We Once per Higher Once per partition mini-batch oecetyeeuede e oe eeto on selection model net deep execute efficiently to Server Parameter Horovod* No replica. BSP oe oprparallelism hopper model Cerebro MOP/Cerebro Cerebro (Controllable: Replication rate) Task-Parallel w/ full remote reads** remote full w/ Task-Parallel Memory/storageWastage eea n xesbede net deep extensible and general a , Cerebro ihHrvdfrmdlselec- model for Horovod with (Controllable: Caching rate) Caching (Controllable: nortre setting. target our in osdrn l resources all considering eie an devises MP htsatisfies that (MOP) Cerebro Task-Parallel w/ w/ replication full Task-Parallel any Cerebro Full replica. Full Higher Lmodels ML optimizing a sup- can and 2161 oua a orietruhu is throughput raise to way popular A training of latency Training Net Deep Distributed for Systems SGD. on details 2.2 technical interested more the for refer We 10] [7, accuracy. to of given reader number a for the needed raising epochs typically to lead behavior, inherently may convergence is execution SGD poor so, sequential update. updates. 
mini-batches; from model model deviating of many sequential; a millions as makes make to epoch 1000s and an have gradient datasets the Large estimate to is pass that Each a replacement–called without data. examples the process iterative over an an passes called is multiple SGD (e.g., variants performs [17]). its that or RMSprop SGD and mini-batch [36] by solved Adam is It SGD [25]. lem Mini-batch with Training Net Deep tradeoffs. approaches their 2.1 and existing training compare net then deep We parallel for nets. deep training TRADEOFFS AND BACKGROUND 2. n oe tatm npriinddt u ttefiner the at but data partitioned on time a at [65]. model training net one deep for Fine-grained. used Centralized never almost the Alas, mod- is non-convex average it highly epoch. so, for every master, els; poorly this converges the repeat approach on and this gradients), worker’s models each (or all weights on independently collect broad- models train partition, They model, efficiency. a work- memory/storage across cast dataset high the partition yielding They parallelize ers, time. [1] a averaging at model model one with TensorFlow and 2.. Spark Table (BSP). in Parallel Synchronous calculation Bulk cost realistic redundant a wastefully extra see magnitude reads e.g., of data, repeated orders with such network the data But read flood and epoch. S3) every (e.g., storage remotely remote use can one natively, datasets. also partitioned is each workers large sources to all to to copied scale datasets be Copying not must does dataset which whole dur- worker, the workers but across SGD, re- training, communication also sequential ing no is is logically This There efficiency. use convergence producible. different best can the run worker yields can which Each [48] task-parallel a Ray manner. in workers and different on [23], configurations training Vizier Celery, categories: Dask, Fig- Parallel. 4 as into Task desiderata, approaches Embarrassingly some these group on We short shows. fall 3 them ure of been All have approaches studied. execution parallel multi-node various otde erigtos(.. esrlw ou nthe on focus TensorFlow) (e.g., tools learning deep Most prob- optimization non-convex a is training net Deep for used method the SGD, mini-batch explain briefly We • • xmmr/trg an vrprl akprle sys- task-parallel tems. to purely up over and gains memory/storage systems 8x data-parallel purely over gains time datasets: ML ImageNet benchmark large two selec- with model workloads real tion on elasticity. experiments extensive and perform tolerance We fault support also and extend We ohmmr n trg,wihrie ot.Alter- costs. raises which storage, and memory both , epoch Cerebro na pc,i admysmlsabthof batch a samples randomly it epoch, an In . and Cerebro n oe tatime a at model one Criteo loehbt iersedpbehavior. speedup linear exhibits also oepotprildt replication data partial exploit to . Cerebro hs ytm loparallelize also systems These ol uha Python as such Tools ihywseu fre- of wasteful highly ffr xt 0 run- 10x to 3x offers o nthroughput. on not , S ytm uhas such systems BSP mini-batch parallelism aduses –and Thus, . Data Parallelism Embarrassing Model Hopper Desiderata Task Parallelism Bulk Synchronous Centralized Fine-grained Decentralized Fine-grained Parallelism (e.g., Dask, Celery, Vizier) (e.g., Spark, Greenplum) (e.g., Async Parameter Server) (e.g., Horovod) (Our Work)

Data Scalability No (Full Replication) Yes Yes Yes Yes Wasteful (Remote Reads)

Per-Epoch High High Lowest Low High Efficiency

SGD Convergence Highest Lowest Medium High Highest Efficiency

Memory/Storage Lowest High High High High Efficiency

Reproducibility Yes Yes No Yes Yes

Figure 3: Qualitative comparisons of existing systems on key desiderata for a model selection system.

Table 1: Notation used in Section 3 runs for k epochs–we relax this later1. Shuffle the dataset Symbol Description once and split into p partitions, with each partition located S Set of training configurations on one of p worker machines. Given these inputs, MOP works as follows. Pick p configs from S and assign one per p Number of data partitions/workers worker (Section 5 explains how we pick the subset). On each k Number of epochs for S to be trained worker, the assigned config is trained on the local partition for a single sub-epoch, which we also call a training unit. Model size (uniform for exposition sake) Completing a training unit puts that worker back to the b Mini-batch size idle state. An idle worker is then assigned a new config that D Training dataset (hDi : dataset size, |D| : has not already been trained and also not being currently number of examples) trained on another worker. Overall, a model “hops” from one worker to another after a sub-epoch. Repeat this process until all configs are trained on all partitions, completing one granularity of each mini-batch. The most prominent ex- epoch for each model. Repeat this every epoch until all ample is Parameter Server (PS) [44]. PS is a set of sys- configs in S are trained for k epochs. The invariants of tems for data-parallel ML. A typical PS consists of servers MOP can be summarized as follows: and workers; servers maintain the globally shared model weights, while workers compute SGD gradients on a locally • Completeness: In a single epoch, each training config stored data partition. Workers communicate with servers is trained on all workers exactly once. periodically to update and retrieve model weights. Based • Model training isolation: Two training units of the on the nature of these communications, PS has two variants: same config are not run simultaneously. synchronous and asynchronous. Asynchronous PS is highly • Worker/partition exclusive access: A worker executes scalable but unreproducible; it often has poorer convergence only one training unit at a time. than synchronous PS due to stale updates but synchronous • Non-preemptive execution: An individual training unit PS has higher overhead for synchronization. is run without preemption once started. All PS-style approaches have high communication due to their centralized all-to-one communications, which is pro- Insights Underpinning MOP. MOP exploits a formal portional to the number of mini-batches and orders of mag- property of SGD: any random ordering of examples suf- nitude higher than BSP, e.g., 1000x in Table 2. fices for convergence [7, 10]. Each of the p configs visits Decentralized Fine-grained. The best example is the data partitions in a different (pseudorandom) yet in se- Horovod [63]. It adopts HPC-style techniques to enable syn- quential order. Thus, MOP offers high accuracy for all mod- chronous all-reduce SGD. While this approach is bandwidth els, comparable to sequential SGD. While SGD’s robustness optimal, communication latency is still proportional to the has been exploited before in ML systems, e.g., in Parameter number of workers, and the synchronization barrier can be- Server [44], MOP exploits it at the partition level instead of come a bottleneck. The total communication overhead is at the mini-batch level to reduce communication costs. This also proportional to the number of mini-batches and orders is possible because we connect this property with model se- of magnitude higher than BSP, e.g., 500x in Table 2. 
lection workloads instead of training one model at a time. Positioning MOP. As Figure 1(C) shows, MOP is a new hybrid of task- and data-parallelism that is a form 3. MODEL HOPPER PARALLELISM of “bulk asynchronous” parallelism. Like task-parallelism, We first explain how MOP works and its properties. Ta- MOP trains many configs in parallel but like BSP, it runs ble 1 presents some notation. We also theoretically compare on partitions. So, MOP is more fine-grained than task par- the communication costs of MOP and prior approaches. allelism but more coarse-grained than BSP. MOP has no global synchronization barrier within an epoch. Later in 3.1 Basic Idea of MOP 1Section 4.2 (Supporting Multiple AutoML Procedures) ex- We are given a set S of training configurations (“configs” plains further how Cerebro can support different configs for short). For simplicity of exposition, assume for now each being trained for different numbers of epochs.

2162 Table 2: Communication cost analysis of MOP and other approaches. ?Full replication. †Remote reads. Interactions Extensible Components ‡ Parameters for the example: k = 20, |S| = 20, p = Invokes (1) Register workers and data

10, m = 1GB, hDi = 1TB, and |D|/b = 100K. Flow of data, results, (2) Launch model selection workload Comm. Cost Example‡ and information and get results

Model Hopper Parallelism kmp|S| + m|S| 4 TB Cerebro API Catalog ? Task Parallelism (FR ) phDi + m|S| 10 TB Hyperband Data Task Parallelism (RR†) k|S|hDi + m|S| 400 TB Catalog Grid Search Scheduler Bulk Synchronous Parallelism 2kmp|S| 8 TB Resource   PBT Catalog |D| Centralized Fine-grained 2kmp|S| bp 80 PB TensorFlow Task Executor   Handler Decentralized Fine-grained kmp|S| |D| 40 PB Task Model bp Launcher Hopper Resource PyTorch Handler Monitor

Section 5, we dive into how Cerebro uses MOP to sched- ule S efficiently and in a general way. Overall, while the core Cluster idea of MOP is simple–perhaps even obvious in hindsight–it has hitherto not been exploited in its full generality in ML Figure 4: System architecture of Cerebro. systems. then invokes model fn to instantiate the neural architec- Reproducibility. MOP does not restrict the visit ordering. ture and potentially restore the model state from a previ- So, reproducibility is trivial in MOP: log the worker visit ous checkpointed state. The train fn is invoked to perform order for each configuration per epoch and replay with this one sub-epoch of training. We assume validation data is order. Crucially, this logging incurs very negligible overhead also partitioned and use the same infrastructure for evalu- because a model hops only once per partition, not for every ation. During evaluation, Cerebro marks model param- mini-batch, at each epoch. eters as non-trainable before invoking train fn. We also support higher-level API methods for AutoML procedures 3.2 Communication Cost Analysis that resemble the popular APIs of Keras [57]. Note that We summarize the communication costs of MOP and model fn is highly general, i.e., Cerebro supports all neu- other approaches in Table 2. It also illustrates the com- ral computational graphs on all data types supported by munication costs in bytes for a realistic example based on the underlying deep learning tool, including CNNs, RNNs, our case study in Section 1. MOP reaches the theoretical transformers, etc. on structured data, text, images, video, minimum cost of kmp|S|. Crucially, note that this cost does etc. Due to space constraints, more details of our APIs, in- not depend on batch size, which underpins MOP’s higher cluding full method signatures and a fleshed out example of efficiency. BSP also has the same asymptotic cost but un- how to use Cerebro are provided in the appendix of our like MOP, BSP typically converges poorly for deep nets and technical report [53]. lacks sequential-equivalence. Fine-grained approaches like PS and Horovod have communication costs proportional to 4.2 System Architecture the number of mini-batches, which can be orders of mag- We adopt an extensible architecture, as Figure 4 shows. nitude higher. In our setting, p is under low 10s, but the This allows us to easily support multiple deep learning tools number of mini-batches can even be 1000s to millions based and AutoML procedures. There are 5 main components: (1) on the batch size. API, (2) Scheduler, (3) Task Executor, (4) Catalog, and (5) Resource Monitor. Scheduler is responsible for orchestrating 4. SYSTEM OVERVIEW the entire workload. It relies on worker and data availabil- ity information from the Catalog. Task Executor launches We present an overview of Cerebro, an ML system that uses MOP to execute deep net model selection workloads. training units on the cluster and also handles model hops. Resource Monitor is responsible for detecting worker failures 4.1 User-facing API and updating the Resource Catalog. Section 5 explains how the Scheduler works and how we achieve fault tolerance and Cerebro API allows users to do 2 things: (1) register elasticity. Next, we describe how Cerebro’s architecture workers and data; and (2) issue a deep net model selection enables high system generality. workload. Workers are registered by IP addresses. As for datasets, Cerebro expects a list of data partitions and their Supporting Multiple Deep Learning Tools. 
The func- availability on each worker. We assume shuffling and parti- tions input fn, model fn, and train fn are written by users tioning are already handled by other means, since these are in the deep learning tool’s APIs. We currently support Ten- well studied. This common data ETL step is also orthogonal sorFlow and PyTorch (it is simple to add support for more). to our focus and is not a major part of the total runtime for To support multiple such tools, we adopt a handler-based ar- iterative deep net training. chitecture to delineate tool-specific aspects: model training, Cerebro takes the reference to the dataset, set of ini- checkpointing and restoring. Note that checkpointing and tial training configs, the AutoML procedure, and 3 user- restoring is how Cerebro realizes model hops. Task Ex- defined functions: input fn, model fn, and train fn. It ecutor automatically injects the tool-specific aspects from first invokes input fn to read and pre-process the data. It the corresponding tool’s handler and runs these functions

2163 Model 1 A B E C D A A B C D E A B C D E Config 2 B D A C B E D B D C E A

Runtime 6 3 3 3 6 Worker 3 C E C D B E A C E A B D

1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 (A) Per-epoch runtimes (B) An optimal task-parallel schedule (C) A non-optimal MOP schedule (D) An optimal MOP schedule Figure 5: Gantt charts of task-parallel and MOP schedules for a sample model selection workload. 4.3 System Implementation Details Table 3: Additional notation used in the MOP MILP formulation We prototype Cerebro in Python using XML-RPC client-server package. Scheduler runs on the client. Each Symbol Description worker runs a single service. Scheduling follows a push-based |S|×p T ∈ IR Ti,j is the runtime of unit model–Scheduler assigns tasks and periodically checks the th th si,j (i configuration on j responses from the workers. We use a shared network file worker) system (NFS) as the central repository for models. Model hopping is realized implicitly by workers writing models to C Makespan of the workload and reading models from this shared file system. Tech- |S|×p X ∈ IR Xi,j is the start time of the ex- nically, this doubles the communication cost of MOP to ecution of ith configuration on 2kmp|S|, still a negligible overhead. Using NFS greatly re- jth partition/worker duces engineering complexity to implement model hops. |S|×p×p Y ∈ {0, 1} Y 0 = 1 ⇐⇒ X < X 0 i,j,j i,j i,j 5. CEREBRO SCHEDULER |S|×|S|×p Z ∈ {0, 1} Zi,i0,j = 1 ⇐⇒ Xi,j < Xi0,j Scheduling training units on workers properly is critical V Very large value (Default: sum because pathological orderings can under-utilize resources of training unit runtimes) substantially, especially when deep net architectures and/or workers are heterogeneous. Consider the model selection workload shown in Figure 5(A). Assume workers are homo- on the workers. Overall, Cerebro’s architecture is highly geneous and there is no data replication. For one epoch general and supports virtually all forms of data types, deep of training, Figure 5(B) shows an optimal task-parallel net architectures, loss functions, and SGD-based optimizers. schedule for this workload with a 9-unit makespan. Fig- ure 5(C) shows a non-optimal MOP schedulewith also 9 Supporting Multiple AutoML Procedures Meta- units makespan. But as Figure 5(D) shows, an optimal MOP heuristics called AutoML procedures are common for explor- schedule has a makespan of only 7 units. Overall, we see that ing training configs. We now make a key observation about MOP’s training unit-based scheduling offers more flexibility such procedures that underpins our Scheduler. Most Au- to raise resource utilization. Next, we formally define the toML procedures fit a common template: create an initial set MOP-based scheduling problem and explain how we design of configs (S) and evaluate them after each epoch (or every our Scheduler. few epochs). Based on the evaluations, terminate some con- figurations (e.g., as in Hyperband [42] and PBT [32]) or add 5.1 Formal Problem Statement as MILP new configurations (e.g., as in PBT). Grid/random search Suppose the runtimes of each training unit, aka unit times, is a one-shot instance of this template. Thus, we adopt this are given. These can be obtained with, say, a pilot run for template for our Scheduler. Given S, Cerebro trains all a few mini-batches and then extrapolating (this overhead models in S for one epoch and passes control back to the will be marginal). For starters, assume each of the p data corresponding AutoML procedure for convergence/termina- partitions is assigned to only one worker. The objective tion/addition evaluations. 
Cerebro then gets a potentially and constraints of the MOP-based scheduling problem is as 0 modified set S for the next epoch. This approach also lets follows. Table 3 lists the additional notation used here. Cerebro support data re-shuffling after each epoch. But the default (and common practice) is to shuffle only once Objective: min C (1) C,X,Y,Z upfront. Grid/random search (perhaps the most popular in Constraints: practice), Hyperband, and PBT (and more procedures) con- 0 0 form to this common template and are currently supported. ∀i, i ∈ [1,..., |S|] ∀j, j ∈ [1, . . . , p] ASHA [43] and Hyperopt [6] are two notable exceptions (a) Xi,j ≥ Xi,j0 + Ti,j0 − V · Yi,j,j0 to the above template, since they do not have a global (b) Xi,j0 ≥ Xi,j + Ti,j − V · (1 − Yi,j,j0 ) synchronized evaluation of training configs after an epoch (c) X ≥ X 0 + T 0 − V · Z 0 (2) and are somewhat tied to task-parallel execution. While i,j i ,j i ,j i,i ,j MOP/Cerebro cannot ensure logically same execution as (d) Xi0,j ≥ Xi,j + Ti,j − V · (1 − Zi,i0,j ) ASHA or HyperOpt on task-parallelism, it is still possi- (e) Xi,j ≥ 0 ble to emulate them on MOP/Cerebro without any mod- (f) C ≥ Xi,j + Ti,j ifications to our system. In fact, our experiments with ASHA show that ASHA on Cerebro has comparable–even We need to minimize makespan C, subject to the con- slightly better!–convergence behavior than ASHA on pure straints on C, unit start times X, model training isolation task-parallelism (Section 6.3). matrix Y , and worker/partition exclusive access matrix Z.

2164 The constraints enforce some of the invariants of MOP listed Algorithm 1 Randomized Scheduling in Section 3. Equations 2.a and 2.b ensure model training 1: Input: S isolation. Equations 2.c and 2.d ensure worker exclusive ac- 2: Q = {si,j : ∀i ∈ [1,..., |S|], ∀j ∈ [1, . . . , p]} cess. Equation 2.e ensures that training unit start times are 3: worker idle ← [true,..., true] non-negative and Equation 2.f ensures that C captures the 4: model idle ← [true,..., true] time taken to complete all training units. 5: while not empty(Q) do Given the above, a straightforward approach to schedul- 6: for j ∈ [1, . . . , p] do ing is to use an MILP solver like Gurobi [27]. The start 7: if worker idle[j] then times X then yield the actual schedule. But our problem 8: Q ← shuffle(Q) is essentially an instance of the classical open-shop schedul- 9: for si,j0 ∈ Q do ing problem, which is known to be NP-Hard [24]. Since |S| 10: if model idle[i] and j0 = j then can even be 100s, MILP solvers may be too slow (more in 11: Execute si,j0 on worker j Section 5.4); thus, we explore alternative approaches. 12: model idle[i] ← false 13: worker idle[j] ← false 5.2 Approximate Algorithm-based Scheduler 14: remove(Q, si,j0 ) For many special cases, there are algorithms with good 15: break approximation guarantees that can even be optimal un- 16: wait WAIT TIME der some conditions. One such algorithm is “vector rear- rangement” [21, 70]. It produces an optimal solution when |S|  p, which is possible in our setting. Algorithm 2 When s finishes on worker j The vector rearrangement based method depends on two i,j 1: model idle[i] ← true values: Lmax (see Equation 3), the maximum load on any 2: worker idle[j] ← true worker; and Tmax (see Equation 4), the maximum unit time of any training configuration in S. |S| X Lmax = max Ti,j (3) j∈[1,...,p] Gurobi. We set a maximum optimization time of 5min for i=1 tractability sake. We compare the scheduling methods on 3 Tmax = max Ti,j (4) i∈[1,...,|S|],j∈[1,...,p] dimensions: 1) number of training configs (two values: 16 and 256), 2) number of workers (two values: 8 and 16), 3) 2 If Lmax ≥ (p + p − 1) · Tmax, this algorithm’s output is homogeneity/heterogeneity of configs and workers. optimal. When there are lots of configs, the chance of the Sub-epoch training time (unit time) of a training con- above constraint being satisfied is high, yielding us an opti- fig is directly proportional to the compute cost of the con- mal schedule. But if the condition is not met, the schedule fig and inversely proportional to compute capacity of the ∗ produced yields a makespan C ≤ C + (p − 1) · Tmax, where worker. For the homogeneous setting, we initialize all train- C∗ is the optimal makespan value. This algorithm scales to ing config compute costs to be the same and also all worker large |S| and p because it runs in polynomial time in contrast compute capacities to be the same. For the heterogeneous to the MILP solver. For more details on this algorithm, we setting, training config compute costs are randomly sampled refer the interested reader to [21, 70]. (with replacement) from a set of popular deep CNNs (n=35) obtained from [3]. The costs vary from 360 MFLOPS to 5.3 Randomized Algorithm-based Scheduler 21000 MFLOPS with a mean of 5939 MFLOPS and stan- The approximate algorithm is complex to implement in dard deviation of 5671 MFLOPS. Due to space constraints some cases, and its optimality condition may be violated of- we provide these computational costs in the Appendix of ten. 
Thus, we now consider a much simpler scheduler based our technical report [53]. For worker compute capacities, on randomization. This approach is simple to implement we randomly sample (with replacement) compute capacities and offer much more flexibility (explained more later). Al- from 4 popular Nvidia GPUs: Titan Xp (12.1 TFLOPS/s), gorithm 1 presents our randomized scheduler. K80 (5.6 TFLOPS/s), GTX 1080 (11.3 TFLOPS/s), and Given S, create Q = {si,j : ∀i ∈ [1, ..., |S|], j ∈ [1, .., p]}, P100 (18.7 TFLOPS/s). For each setting, we report the the set of all training units. Note that si,j is the train- average of 5 runs with different random seeds set to the ing unit of configuration i on worker j. Initialize the state scheduling algorithms and also the min and max of all 5 of all models and workers to idle state. Then find an idle runs. All makespans reported are normalized by the ran- worker and schedule a random training unit from Q on it. domized scheduler’s makespan. This training unit must be such that its configuration is not The MILP scheduler sometimes performs poorer than the scheduled on another worker and it corresponds to the data other two because it has not converged to the optimal in partition placed on that worker (Line 10). Then remove the the given time budget. The approximate scheduler performs chosen training unit from Q. Continue this process until poorly when both the configs and workers are heterogeneous. no worker is idle and eventually, until Q is empty. After It is also slower than the randomized scheduler. a worker completes training unit si,j mark its model i and Overall, the randomized approach works surprisingly well worker j as idle again as per Algorithm 2. on all aspects: near-optimal makespans with minimal vari- ance across runs and very fast scheduling. We believe this 5.4 Comparing Different Scheduling Methods interesting superiority of the randomized algorithm against We use simulations to compare the efficiency and the approximation algorithm is due to some fundamental makespans yielded by the three alternative schedulers. The characteristics of deep net model selection workloads, e.g., MILP and approximate algorithm are implemented using large number of configurations and relatively low differences

2165 hc ewe h ceue n okr.Bcuethe Because train- workers. between checkpointed always and are states scheduler model trained the between future check 1. a Algorithm at in workers loop the the those execute of on avail-iteration successfully units also will training are scheduler worker corresponding the failed from elsewhere, the back on able moved training present be the partitions before will data fails it worker finishes, the if unit But worker. from assigned removed be will from it scheduled, is Q unit both training until a continues When 1 Algorithm in tures just of Instead tolerant. fault Elasticity and Tolerance Fault worker. that 5.6 on 10: available Line is partition in relevant is the 1 if Algorithm check to checking needed of schedul- change instead replica-aware randomized only for the shop The contrast, extended open In ing. easily the skip be too. also and to can that it is scheduler coupled skip use we tightly algorithm not so, is approximate do problem; it the we because So, Modifying sched- with non-trivial MILP slower. problem the details. got this extended only its We of it case one. but of special uler factor a still replication is is a problem it shop but open scheduling, configura- shop training open a MOP, avail- need replica-aware (an tion In workers on map). such ability partitions exploit of now information plans. availability We faster and flexibility sake. scheduling reliability replicate more often for for HDFS) replicas say, (e.g., files, systems file data some But worker. Scheduling Replica-Aware 5.5 default on the Based as approach in work. randomized Scheduler the future use to we algorithm results, these randomized anal- the theoretical thorough of a ysis leave We capacities. compute in hetero- and cluster configs. training Heterogeneous train- geneous (B) homogeneous configs. and Randomized. ing cluster of that Homogeneous to (A) respect with Makespans the normalized settings. are of different makespans in and produced runtimes schedules Scheduler 6: Figure Cerebro scheduler randomized our make we how explain now We input: additional an requires scheduler replica-aware The one only on available is partition a that assumed we far So

sbfr u o also now but before as Sched. Time (s) Makespan A Q Q 0 and Homo. clusterandconfigs hni ucsflycmltsistann nthe on training its completes successfully it when 16 Configs not Randomized Approximate MILP Q Cerebro eet alrsvateproi heart-beat periodic the via failures detects ii l okr.Ti xeso osbeyond goes extension This workers. all visit 0 . Cluster Size Q 0 j 256 Configs siiilzdt eepy h process The empty. be to initialized is 0 . = j osl h viaiiympto map availability the consult , added Q emiti w aastruc- data two maintain we , to B 16 Configs Q Hetero. clusterandconfigs Q 0 NP-Hard twl eremoved be will It . and Cluster Size Q 0 Q to 256 Configs 0 eas the because r empty. are Q fthe If . 2166 vdaP0 P.AlndsrnUut 60.W use We 16.04. Ubuntu run nodes 10 as extra All v1.12.0 an and 10- TensorFlow has GPU. HDD Xeon node P100 worker master 1TB Intel cluster Nvidia memory, two 1 GPU Each has 192GB and network. clusters CPUs, nodes Gbps both GHz worker in 2.20 8 node core has Each cluster Each node. [19]. Lab for SGD Setup. our Experimental as on results [36] ASHA present Adam and also HyperOpt use we dataset. for generality, We demonstrate each To hyper- details. for the method. for configs offers search training 5 Table grid 16 yielding and parameters, architectures neural different used is Only Workloads. dataset the statistics. data. of dataset the the sample densify en- summarizes random and one-hot 2.5% We features a categorical representation. sparse the under code shipped features. is categorical and re- It numeric and with version dataset 112 classification 2012 to images the the choose shape We dataset. classification geNet Datasets. We selection. Cerebro model net evaluate deep then of efficiency and throughput EVALUATION EXPERIMENTAL 6. [53]. hy- report this technical our explain in we across further constraints, MOP architecture sub-clusters space brid and virtual to sub-cluster into Due each cluster basic sub-clusters. within the The Horovod divide run reproducible. and simple: is hybrid is the idea so, SGD; with sequential like MOP Just hybridizing of Horovod. hybrid we doubly a implementing limitation, of by this data-parallelism possibility overcome the To explored cluster. the under-utilize hsmasa h ae tgs emyecutrasitua- a encounter convergence. may till we train stages, may later where configs the tion few at a means only This So, epochs. large Hybrid Horovod Extension: 5.7 and in resources can elasticity compute mechanism scale-out new safely seamless same of be availability support The detect can to model others used storage. checkpointed and be reclaiming recovery last failure for very the deleted the for Only needed units training is failed restarted. the be and recovered can be can they units, ing datasets. pre- the after of are sampling numbers All and processing details. Dataset 4: Table a opeei esnbeaon ftm.Ti decision This time. of amount reasonable does in complete can 2 emd hsdcso nys htalo u experiments our of all that so only decision this made We eeprclyvldt if validate empirically We oeAtM rcdrs(.. yebn)satwith start Hyperband) (e.g., procedures AutoML Some mgNt20G .MHF 1000 Binary TFRecords 100M HDF5 Class GB 1.2M 400 Format GB 250 Count Criteo size On-disk ImageNet Dataset Criteo not | S 1]and [18] | u hnkl oennpoiigcng fe some after configs non-promising some kill then but le h aewy rmorexperiments. our from takeaways the alter saiiyt upr utpeAtM procedures. 
AutoML multiple support to ability ’s n P-nbe for GPU-enabled and | S euetolrebnhakdatasets: benchmark large two use We | Cerebro osbelow goes o u rtedt-n et euetwo use we test, end-to-end first our For Criteo Cerebro Cerebro [14]. × ndph ial,w demonstrate we Finally, depth. in euetocutr:CPU-only clusters: two use We 1 pixels 112 p nsc cases, such In . ooo sas qiaetto equivalent also is Horovod , Cerebro Cerebro ImageNet sudryn eplearning deep underlying ’s ImageNet 2 . Criteo nScin6.3. Section in Cerebro a mrv overall improve can sapplrimage popular a is oho Cloud- on both , Cerebro Cerebro sa dclick ad an is 2 al 4. Table . . Ima- and can TF Model Averaging Cerebro Horovod ImageNet Criteo TF Parameter Server - Async. Celery GPU Storage CPU Storage System Runtime Runtime 100 Utili. Footprint Utili. Footprint (hrs) (hrs) (%) (GB ) (%) (GB ) 85 TF PS - 19.00 8.6 250 28.80 6.9 400 Async 70 Horovod 5.42 92.1 250 14.06 16.0 400 TF Model 1.97 72.1 250 3.84 52.2 400 55

Averaging (%) Error Validation Top-5

Celery 1.72 82.4 2000 3.95 53.6 3200 40 1 2 3 4 5 6 7 8 9 10 Cerebro 1.77 79.8 250 3.40 51.9 400 Epoch (A) Per-epoch makespans and CPU/GPU utilization. (B) Learning curves of the resp. best configs on ImageNet.

Figure 7: End-to-end results on ImageNet and Criteo. For Celery, we report the runtime corresponding to the lowest makespan schedule. Celery’s per-epoch runtime varies between 1.72-2.02 hours on ImageNet; on Criteo, 3.95-5.49 hours. Horovod uses GPU kernels for communication; hence its high GPU utilization.

Table 5: Workloads.?architectures similar to VGG16 and ResNet50, respectively.†serialized sizes. Dataset Model arch. Model size/MB† Batch size Learning rate Regularization Epochs ImageNet {VGG16?, ResNet50?} VGG16: 792, ResNet50: 293 {32, 256}{10−4, 10−6}{10−4, 10−6} 10 Criteo 3-layer NN, 1000+500 hidden units 179 {32, 64, 256, 512}{10−3, 10−4}{10−4, 10−5} 5

(A) Speedup (Strong Scaling) (B) Fault Tolerance tool. For GPU nodes, we use CUDA version 9.0 and cuDNN 9 40 W2 Fails W2 Recovers version 7.4.2. Both datasets are randomly shuffled and split 8 38 7 36 into 8 equi-sized partitions. 6 34 5 32 4 30 3 2 28 6.1 End-to-End Results 1 26 1 2 4 8 Time (minutes) Per-Epoch 1 2 3 4 5 Speedup Against 1 Worker Epoch We compare Cerebro with 5 systems: 4 data-parallel– Cluster Size W1 Fails W1 Recovers synchronous and asynchronous TensorFlow Parameter Server, Horovod, BSP-style TensorFlow model averaging– Figure 8: (A) Speedup plot (strong scaling). (B) and 1 task-parallel (Celery). For Celery, we replicate Fault-tolerance. datasets to each worker beforehand and stream them from disk, since they do not fit in memory. I/O time is trivial for report severe CPU under-utilization (< 7%). Cerebro is deep nets, where computation dominates; thus, they can be also 4x faster than Horovod. Cerebro’s runtime is compa- interleaved. We use TensorFlow features to achieve this. For rable to model averaging, with about 52% CPU utilization. all other systems, each worker node has one in-memory data Celery is somewhat slower than Cerebro due to a straggler partition. We do not include data copying in the end-to-end issue caused by the highly heterogeneous model configs for runtimes. For scheduling, Celery uses a FIFO queue and Criteo. Cerebro’s MOP approach offers higher flexibility Cerebro uses the randomized scheduler. All other systems to avoid such straggler issues. A more detailed explanation train models sequentially. is given in the appendix of our technical report [53]. All Figure 7 presents the results. Cerebro significantly im- methods have almost indistinguishable convergence behav- proves the efficiency and throughput of model selection. On ior on this dataset: all reached 99% accuracy quickly, since ImageNet, Cerebro is over 10x faster than asynchronous the class label is quite skewed. PS, which has a GPU utilization as low as 9%! Synchronous Overall, Cerebro is the most resource-efficient approach PS was even slower. Cerebro is 3x faster than Horovod. when compute, memory/storage, and network are consid- Horovod has high GPU utilization because it uses GPU for ered holistically. It also has the best accuracy behavior, on communication. Cerebro’s runtime is comparable to model par with task-parallelism. averaging, which is as expected. But note model averaging converges poorly. Celery’s runtime is dependent on the ex- 6.2 Drill-down Experiments ecution order and thus we report the runtime on the opti- Unless specified otherwise, we now show experiments on mal schedule. On ImageNet, Celery’s runtime is compara- the GPU cluster, ImageNet, and a model selection work- ble to . But note that Celery has a highly bloated Cerebro load of 8 configs (4 learning rates, 2 regularization values, 8x memory/storage footprint. Overall, Celery and Cere- and ResNet architectures) trained for 5 epochs. Each data have the best learning curves–this is also as expected bro partition is placed on only one worker. because MOP ensures sequential equivalence for SGD, just like task-parallelism. Horovod converges slower due to its Scalability. We study the speedups (strong scaling) of larger effective mini-batch size. Cerebro and Horovod as we vary the cluster sizes. Fig- On Criteo, Cerebro is 14x faster than synchronous PS ure 8(A) shows the speedups, defined as the workload com- and 8x faster than asynchronous PS. 
Both variants of PS pletion time on multiple workers vs a single worker. Cere-

2167 bro exhibits linear speedups due to MOP’s marginal com- (A) Runtime (B) Network Cost munication costs; in fact, it seems slightly super-linear here because the dataset fits entirely in cluster memory com- pared to the minor overhead of reading from disk on the single worker. In contrast, Horovod exhibits substantially sub-linear speedups due to its much higher communication Runtime (hours) costs with multiple workers. Data Scale by a worker (GB) Data read Data Scale Fault Tolerance. We repeat our drill-down workload with Figure 10: Reading data from remote storage. a replication factor of 3. We first inject two node failures (A) Runtime (B) Network Cost and bring the nodes back online later. Figure 8(B) shows the time taken for each epoch and the points where the work- ers failed and returned online. Overall, we see Cerebro’s replica-aware randomized scheduler can seamlessly execute the workload despite worker failures. Runtime (hours)

(A) Runtime (B) Validation Error Replication Factor by a worker (GB) Data read Replication Factor Figure 11: Reading data from distributed storage.

which is over 10x lower than Celery’s. After the initial read, epochs (%) Runtime (hours) Cerebro only communicates models weights during train- Validation error at 10 error Validation

Batch Size Batch Size ing. In situations where reads and networks are not free (e.g., cloud providers), Celery will incur higher monetary Figure 9: Effect of batch size on communication costs than Cerebro. These results show it is perhaps better overheads and convergence efficiency. (A) Runtime to partition the dataset on S3, cache partitions on workers against batch size. (B) The lowest validation error on the first read, and then run Cerebro instead of Celery after 10 epochs against batch size. with full dataset reads from S3 per epoch to avoid copying. Reading from distributed storage (e.g., HDFS). In this set- Effect of Batch Size. We now evaluate the effect of train- ting, the dataset is partitioned, replicated, and stored on 8 ing mini-batch size for Cerebro and Horovod. We evaluate workers. We then load all local data partitions into each 5 batch sizes and report makespans and the validation er- worker’s memory. Celery performs remote reads for non- ror of the best model for each batch size after 10 epochs. local partitions. We vary the replication factor to study its Figure 9 presents the results. With batch size 32, Horovod effect on the makespan and the number of remote reads. is 2x slower than Cerebro. However, as the batch size Figure 10 presents the results. For replication factors 1 (no increases, the difference narrows since the relative commu- replication), 2, and 4, Cerebro incurs 100x less network nication overhead per epoch decreases. Cerebro also runs usage and is slightly faster than Celery. But at a replication faster with larger batch size due to better hardware utiliza- factor of 8 (i.e., full replication), Cerebro is slightly slower tion. The models converge slower as batch size increases. due to the overhead of model hops. For the same reason, The best validation error is achieved by with a Cerebro Cerebro incurs marginal network usage, while Celery has batch size of 32. With the same setting, Horovod’s best almost no network usage other than control actions. Note validation error is higher than Cerebro; this is because its that the higher the replication factor for Celery, the more effective batch size is 256 (32×8). Horovod’s best validation memory/storage is wasted. Cerebro offers the best overall error is closer to Cerebro’s at a batch size of 256. Overall, resource efficiency–compute, memory/storage, and network Cerebro’s efficiency is more stable to the batch size, since put together–for deep net model selection. models hop per sub-epoch, not per mini-batch. Experiments with Horovod Hybrid. Our experiment Network and Storage Efficiency. We study the trade- with the Horovod Hybrid gave an anti-climactic result: the off between redundant remote reads (wastes network) vs intrinsic network overheads of Horovod meant the hybrid is redundant data copies across workers (wastes memory/s- often slower than regular Cerebro with some workers be- torage). Task parallelism forces users to either duplicate ing idle! We realized that mitigating this issue requires more the dataset to all workers or store it in a common reposito- careful data repartitioning. We deemed this complexity as ry/distributed filesystem and read remotely. Cerebro can perhaps not worth it. Instead, we propose a simpler res- avoid both forms of resource wastage. We assume the whole olution: if |S| falls below p but above p/2, use Cerebro; dataset cannot fit on single-node memory. We compare if |S| falls below p/2, just switch to Horovod. 
This switch Cerebro and Celery in the following 2 settings: incurs no extra overhead. Due to space constraints, we skip Reading from remote storage (e.g., S3). In this setting, Cel- the details here and explain this experiment further in our ery reads data from a remote storage repeatedly each epoch. technical report [53]. For Cerebro each worker remotely reads one data parti- tion and caches it. We change the data scale to evaluate 6.3 Experiments with AutoML Procedures effects on the makespan and the amount of remote reads. We experiment with two popular AutoML procedures: Figure 10 shows the results. Celery is slightly slower than HyperOpt [6] and ASHA [43]. For HyperOpt, we compare Cerebro due to remote read overheads. The most signifi- Cerebro and Spark as the execution backends. Spark is cant advantage of Cerebro is its network bandwidth cost, a backend supported natively by HyperOpt; it distributes

2168 (A) ASHA on Celery (B) ASHA on Cerebro Table 6: Parameter grid used to randomly sample configuration for Section 6.3. Values sampled from Model [ResNet18, ResNet34] Learning rate [10−5, . . . , 10−1] Weight decay coefficient [10−5, . . . , 10−1] Top-5 Validation Error (%) Error Validation Top-5 Top-5 Validation Error (%) Error Validation Top-5 Batch size [16, . . . , 256] Hours Hours Figure 13: ASHA learning curves by time.

HyperOpt. We run an experiment using HyperOpt with a max config budget of 32. We train each config for 10 epochs. With this configuration, HyperOpt on Cerebro (resp. Spark) took 31.8 (resp. 25.9) hours. Figure 12 shows all learning curves. We found that the slightly higher (23%) runtime of Cerebro is mainly due to the lower degree of parallelism (|S| = 8). However, this issue can be mitigated by increasing the number of simultaneously trained configs. Although individual configs are not comparable across the two systems, the best errors achieved are close (34.1% on Cerebro; 33.2% on Spark).

ASHA. We use ASHA with a max epoch budget (R) of 9, a selection fraction (η) of 3, and a time limit of 24 hours. With these settings, ASHA trains a config for a maximum of 13 epochs over 3 stages: 1, 3, and 9 epochs. Only the more promising configurations are trained for more epochs. In the given time limit, ASHA on Cerebro (resp. Celery) explored 83 (resp. 67) configs. Figure 13 shows all learning curves. Like HyperOpt, even though the configs are not directly comparable, the best errors achieved are close (31.9% on Cerebro; 33.2% on Celery). More details about this experiment and experiments with another AutoML procedure (HyperBand) are presented in the appendix of our technical report [53].
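To make the stage arithmetic explicit, the short sketch below (ours; a simplified synchronous view of ASHA's rung budgets, not its asynchronous scheduler) derives the per-stage epoch budgets from R = 9 and η = 3 and shows why a config that survives every stage trains for 1 + 3 + 9 = 13 epochs in total.

    # Simplified rung schedule for ASHA-style successive halving; illustration only.
    def stage_budgets(max_epochs_r=9, eta=3):
        """Per-stage epoch budgets R/eta^k, smallest first (e.g., [1, 3, 9])."""
        budgets = []
        b = max_epochs_r
        while b >= 1:
            budgets.append(int(b))
            b /= eta
        return list(reversed(budgets))

    budgets = stage_budgets()       # [1, 3, 9]
    print(budgets, sum(budgets))    # a fully promoted config trains for 13 epochs
    # After each stage, only roughly 1/eta of the configs are promoted further.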

7. DISCUSSION AND LIMITATIONS

Applications. Cerebro is in active use for time series analytics for our public health collaborators. In the case study from Section 1, Cerebro helped us pick 16 deep net configs to compare. To predict sitting vs. not-sitting, these configs had accuracies between 62% and 93%, underscoring the importance of rigorous model selection. The best configs gave a large lift of 10% over their prior RandomForest model based on hand-engineered time series features. We plan to use Cerebro for more domain science applications in the future on time series, video, graph, and text data.

Open Systems. Cerebro is open sourced and available for download [2]. MOP's generality also enabled us to emulate it on existing data-parallel systems. Pivotal/VMware collaborated with us to integrate MOP into Greenplum by extending the MADlib library [28] for running TensorFlow on Greenplum-resident data [47, 67]. Greenplum's customers are interested in this for enterprise ML use cases, including language processing, image recognition, and fraud detection. We have also integrated Cerebro into Apache Spark [16]. Cerebro-Spark can run MOP on existing resource managers such as YARN and Mesos. Alternatively, one can also deploy Cerebro as a standalone application by wrapping it as tasks accepted by the resource manager. We leave such an extension to future work.

Other ML Model Families. We focused primarily on deep nets due to their growing popularity, high sensitivity to model configurations, and resource intensiveness. However, note that MOP and Cerebro's ideas are directly usable for model selection of any ML models trainable with SGD. Examples include linear/logistic regression, some support vector machines, low-rank matrix factorization, and conditional random fields. In fact, since linear/logistic regression can be trivially expressed in the deep learning tools' APIs, Cerebro will work out of the box for them. Cerebro's high memory efficiency makes it easier for users to store entire large datasets in distributed memory, which can significantly reduce the runtimes of such I/O-bound ML models. We leave an empirical analysis of these less compute-intensive models to future work.
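As an example of the last point, logistic regression is just a one-layer network with a sigmoid output, so it can be written directly against a deep learning tool's API and then treated like any other config in an SGD-based model selection workflow. The Keras-style sketch below is a generic illustration of that claim (not code from Cerebro); the feature dimensionality is a placeholder.

    import tensorflow as tf

    NUM_FEATURES = 100  # placeholder; depends on the dataset

    # Logistic regression expressed as a single dense layer trained with SGD.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1, activation="sigmoid", input_shape=(NUM_FEATURES,)),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
                  loss="binary_crossentropy",
                  metrics=["accuracy"])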
Model Parallelism and Batching. Cerebro currently does not support model parallelism (for models larger than single-node memory) or model batching (running multiple models on a worker at a time). It is possible to remove these two limitations from Cerebro. For instance, model parallelism can be supported with the notion of virtual nodes composed of multiple physical nodes that together hold a very large model. Model batching can be supported with multiple virtual nodes mapped to a physical node. We leave these extensions to future work.
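One way to picture the virtual-node idea is as an extra level of indirection in the map of workers the scheduler assigns work to. The sketch below is purely hypothetical (these structures are not part of Cerebro today); it only illustrates how both extensions reduce to changing the virtual-to-physical mapping.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class VirtualNode:
        """Hypothetical unit of scheduling for the future extensions above."""
        name: str
        physical_nodes: List[str] = field(default_factory=list)

    # Model parallelism: one virtual node spans several physical nodes that
    # together hold one very large model.
    big_model_node = VirtualNode("vnode-0", ["worker-1", "worker-2", "worker-3"])

    # Model batching: several virtual nodes map onto the same physical node,
    # so multiple smaller models train on that worker at a time.
    batched_nodes = [VirtualNode(f"vnode-{i}", ["worker-4"]) for i in range(1, 4)]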

8. RELATED WORK

Systems for Model Selection. Google Vizier [23], Ray Tune [46], Dask-Hyperband [61], SparkDL [15], and Spark-Hyperopt [31] are systems for model selection. Vizier, Ray, and Dask-Hyperband are pure task-parallel systems that implement some AutoML procedures. SparkDL and Spark-Hyperopt use Spark for execution but distribute configs in a task-parallel manner with full data replication. Cerebro offers higher overall resource efficiency compared to pure task- or pure data-parallel approaches.

AutoML Procedures. AutoML procedures such as Hyperband [42] and PBT [32] are orthogonal to our work and exist at a higher abstraction level. They fit a common template of per-epoch scheduling in Cerebro. While ASHA [43] does not fit this template, Cerebro can still emulate it well and offer similar accuracy. Bayesian optimization is a class of AutoML procedures, some of which have a high degree of parallelism for searching configs (e.g., Hyperopt [6]); Cerebro supports such procedures. Some others run a sequential search, leading to a low degree of parallelism (e.g., [5, 37]); these may not be a fit for Cerebro.

Distributed SGD Systems. There is much prior work on data-parallel distributed SGD, including centralized fine-grained (e.g., [30, 34, 58, 73]) and decentralized fine-grained (e.g., [45, 58, 69]). These are all complementary to our work because they train one model at a time, while we focus on parallel model selection. As we showed, such approaches have higher communication complexity and thus higher runtimes than MOP in our setting. Also, since Cerebro performs logically sequential SGD, it ensures theoretically best convergence efficiency. CROSSBOW [38] proposes a new variant of model averaging for the single-server multi-GPU setting. But it is also complementary to our work, since it also trains one model at a time. Overall, our work breaks the dichotomy between data- and task-parallel approaches, thus offering better overall resource efficiency.

Hybrid Parallelism in ML Systems. MOP is inspired by the classical idea of process migration in OS multiprocessing [4]. We bring that notion to the data-partitioned cluster setting. This generic idea has been used before in limited contexts in ML systems [8, 39]. The closest to our work is [13], which proposes a scheme for training many homogeneous CNNs on a homogeneous GPU cluster. They propose a ring topology to migrate models, resembling a restricted form of MOP. But their strong homogeneity assumptions can cause stalls in general model selection workloads, e.g., due to heterogeneous neural architectures and/or machines. In contrast, we approach this problem from first principles and formalize it as an instance of open shop scheduling. This powerful abstraction lets Cerebro support arbitrary deep nets and data types, as well as heterogeneous neural architectures and machines. It also enables Cerebro to support replication, fault tolerance, elasticity, and arbitrary AutoML procedures, unlike prior work. SystemML also supports a hybrid of task- and data-parallelism for better plan generation for linear algebra-based classical ML on top of MapReduce [9]. Cerebro is complementary due to its focus on deep nets and SGD's data access pattern, not linear algebra-based classical ML. Finally, a recent benchmark study suggested that communication bottlenecks inherent in pure data-parallelism imply hybrid parallelism is crucial for scalable ML systems [66]. Our work validates that suggestion for deep learning workloads.

Multi-Query and Other System Optimizations. MOP is also inspired by multi-query optimization (MQO) [62]. A recent line of work in the database literature studies MQO for deep learning, including staging and sharing work in CNN transfer learning [49] and batched incremental view maintenance for CNN inference [50, 51, 59]. Cerebro furthers this research direction. All these MQO techniques are complementary and can be used together. Several works optimize the internals of deep net or SGD systems, including communication-computation pipelining [54], new compilation techniques [33], model batching [55], and execution on compressed data [41]. They are complementary to Cerebro, since they optimize lower-level issues. MOP's generality enables Cerebro to be hybridized with such ideas.

Scheduling. Gandiva [71], Tiresias [26], and SLAQ [72] are cluster scheduling frameworks for deep learning. They focus on lower-level primitives such as resource allocation and intra-server locality for reducing mean job completion times. Cerebro is complementary as it exists at a higher abstraction level and focuses on model selection throughput. How compute hardware is allocated is outside our scope. There is a long line of work on job scheduling in the operations research and systems literatures [12, 22, 29]. Our goal is not to create new scheduling algorithms but to apply known techniques to a new ML systems setting.

9. CONCLUSIONS AND FUTURE WORK

Simplicity that still achieves maximal functionality and efficiency is a paragon of virtue in real-world systems. We present a simple but novel and highly general form of parallel SGD execution, MOP, that raises the resource efficiency of deep net model selection without sacrificing accuracy or reproducibility. MOP is also simple to implement, which we demonstrate by building Cerebro, a fault-tolerant deep net model selection system that supports multiple popular deep learning tools and model selection procedures. Experiments with large ML benchmark datasets confirm the benefits of Cerebro. As for future work, we plan to hybridize MOP with model parallelism and batching and also support more complex model selection scenarios such as transfer learning.

Acknowledgments. This work was supported in part by a Hellman Fellowship, the NIDDK of the NIH under award number R01DK114945, an NSF CAREER Award under award number 1942724, and a gift from VMware. The content is solely the responsibility of the authors and does not necessarily represent the views of any of these organizations. We thank the members of UC San Diego's Database Lab and Center for Networked Systems, Natarajan and our public health collaborators at UC San Diego, Frank McQuillan and the Apache MADlib/Greenplum team at VMware, Carlo Curino, Matteo Interlandi, and Julian McAuley for their feedback on this work.

10. REFERENCES

[1] Script for Tensorflow Model Averaging, Accessed January 31, 2020. https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/avg_checkpoints.py.
[2] Cerebro Documentation, Accessed July 5, 2020. https://adalabucsd.github.io/cerebro-system/.
[3] S. Albanie. Memory Consumption and FLOP Count Estimates for Convnets, Accessed January 31, 2020. https://github.com/albanie/convnet-burden.
[4] R. H. Arpaci-Dusseau and A. C. Arpaci-Dusseau. Operating Systems: Three Easy Pieces. Arpaci-Dusseau Books LLC, 2018.
[5] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for Hyper-Parameter Optimization. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011, pages 2546–2554, 2011.
[6] J. Bergstra, D. Yamins, and D. D. Cox. Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, volume 28 of JMLR Workshop and Conference Proceedings, pages 115–123. JMLR.org, 2013.
[7] D. P. Bertsekas. A New Class of Incremental Gradient Methods for Least Squares Problems. Society for Industrial and Applied Mathematics Journal on Optimization, 7(4):913–926, Apr. 1997.
[8] M. Boehm, A. Kumar, and J. Yang. Data Management in Machine Learning Systems. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2019.
[9] M. Boehm, S. Tatikonda, B. Reinwald, P. Sen, Y. Tian, D. R. Burdick, and S. Vaithyanathan. Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML. PVLDB, 7(7):553–564, 2014.
[10] L. Bottou. Curiously Fast Convergence of some Stochastic Gradient Descent Algorithms. Unpublished open problem offered to the attendance of the Symposium on Learning and Data Science 2009 conference, 2009.
[11] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[12] P. Brucker. Scheduling Algorithms. Springer-Verlag, 3rd edition, 2001.
[13] C. R. Chen, G. G. C. Lee, Y. Xia, W. S. Lin, T. Suzumura, and C. Lin. Efficient Multi-training Framework of Image Deep Learning on GPU Cluster. In 2015 IEEE International Symposium on Multimedia, ISM 2015, pages 489–494. IEEE Computer Society, 2015.
[14] CriteoLabs. Kaggle Contest Dataset Is Now Available for Academic Use!, Accessed January 31, 2020. https://ailab.criteo.com/category/dataset.
[15] Databricks. Deep Learning Pipelines for Apache Spark, Accessed January 31, 2020. https://github.com/databricks/spark-deep-learning.
[16] Databricks. Resource-efficient Deep Learning Model Selection on Apache Spark, Accessed May 30, 2020. https://bit.ly/3esN3JT.
[17] Y. N. Dauphin, H. de Vries, J. Chung, and Y. Bengio. RMSProp and Equilibrated Adaptive Learning Rates for Non-convex Optimization. CoRR, abs/1502.04390, 2015.
[18] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. ImageNet: A Large-scale Hierarchical Image Database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), pages 248–255. IEEE Computer Society, 2009.
[19] D. Duplyakin, R. Ricci, A. Maricq, G. Wong, J. Duerig, E. Eide, L. Stoller, M. Hibler, D. Johnson, K. Webb, A. Akella, K. Wang, G. Ricart, L. Landweber, C. Elliott, M. Zink, E. Cecchet, S. Kar, and P. Mishra. The Design and Operation of CloudLab. In Proceedings of the USENIX Annual Technical Conference (ATC), pages 1–14, July 2019.
[20] X. Feng, A. Kumar, B. Recht, and C. Ré. Towards a Unified Architecture for In-RDBMS Analytics. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12, pages 325–336. Association for Computing Machinery, 2012.
[21] T. Fiala. An Algorithm for the Open-shop Problem. Mathematics of Operations Research, 8(1):100–109, 1983.
[22] J. V. Gautam, H. B. Prajapati, V. K. Dabhi, and S. Chaudhary. A Survey on Job Scheduling Algorithms in Big Data Processing. In 2015 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT), pages 1–11, 2015.
[23] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google Vizier: A Service for Black-box Optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '17, pages 1487–1495. Association for Computing Machinery, 2017.
[24] T. Gonzalez and S. Sahni. Open Shop Scheduling to Minimize Finish Time. J. ACM, 23(4):665–679, Oct. 1976.
[25] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[26] J. Gu, M. Chowdhury, K. G. Shin, Y. Zhu, M. Jeon, J. Qian, H. H. Liu, and C. Guo. Tiresias: A GPU Cluster Manager for Distributed Deep Learning. In 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019, pages 485–500. USENIX Association, 2019.
[27] Gurobi. Gurobi Optimization, Accessed January 31, 2020. https://www.gurobi.com.
[28] J. M. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The MADlib Analytics Library: Or MAD Skills, the SQL. PVLDB, 5(12):1700–1711, 2012.
[29] W. Herroelen, B. D. Reyck, and E. Demeulemeester. Resource-constrained Project Scheduling: A Survey of Recent Developments. Computers & Operations Research, 25(4):279–302, 1998.
[30] Y. Huang, T. Jin, Y. Wu, Z. Cai, X. Yan, F. Yang, J. Li, Y. Guo, and J. Cheng. FlexPS: Flexible Parallelism Control in Parameter Server Architecture. PVLDB, 11(5):566–579, 2018.

[31] HyperOpt. Scaling out Search with Apache Spark, Accessed January 31, 2020. http://hyperopt.github.io/hyperopt/scaleout/spark/.
[32] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, C. Fernando, and K. Kavukcuoglu. Population Based Training of Neural Networks. CoRR, abs/1711.09846, 2017.
[33] Z. Jia, M. Zaharia, and A. Aiken. Beyond Data and Model Parallelism for Deep Neural Networks. In Proceedings of Machine Learning and Systems 2019, MLSys 2019. mlsys.org, 2019.
[34] J. Jiang, B. Cui, C. Zhang, and L. Yu. Heterogeneity-Aware Distributed Parameter Servers. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD '17, pages 463–478. Association for Computing Machinery, 2017.
[35] T.-Y. Kim and S.-B. Cho. Predicting Residential Energy Consumption Using CNN-LSTM Neural Networks. Energy, 182:72–81, 2019.
[36] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. CoRR, abs/1412.6980, 2015.
[37] A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter. Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, volume 54 of Proceedings of Machine Learning Research, pages 528–536. PMLR, 2017.
[38] A. Koliousis, P. Watcharapichat, M. Weidlich, L. Mai, P. Costa, and P. Pietzuch. Crossbow: Scaling Deep Learning with Small Batch Sizes on Multi-GPU Servers. PVLDB, 12(11):1399–1412, 2019.
[39] A. Kumar, M. Boehm, and J. Yang. Data Management in Machine Learning: Challenges, Techniques, and Systems. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD '17, pages 1717–1722. Association for Computing Machinery, 2017.
[40] A. Kumar, R. McCann, J. Naughton, and J. M. Patel. Model Selection Management Systems: The Next Frontier of Advanced Analytics. SIGMOD Rec., 44(4):17–22, May 2016.
[41] F. Li, L. Chen, Y. Zeng, A. Kumar, X. Wu, J. F. Naughton, and J. M. Patel. Tuple-Oriented Compression for Large-Scale Mini-Batch Stochastic Gradient Descent. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD '19, pages 1517–1534. Association for Computing Machinery, 2019.
[42] L. Li, K. G. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: A Novel Bandit-based Approach to Hyperparameter Optimization. J. Mach. Learn. Res., 18:185:1–185:52, 2017.
[43] L. Li, K. G. Jamieson, A. Rostamizadeh, E. Gonina, J. Ben-tzur, M. Hardt, B. Recht, and A. Talwalkar. Massively Parallel Hyperparameter Tuning. In Proceedings of Machine Learning and Systems 2020, MLSys 2020. mlsys.org, 2020.
[44] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B. Su. Scaling Distributed Machine Learning with the Parameter Server. In 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI '14, pages 583–598. USENIX Association, 2014.
[45] X. Lian, W. Zhang, C. Zhang, and J. Liu. Asynchronous Decentralized Parallel Stochastic Gradient Descent. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, volume 80 of Proceedings of Machine Learning Research, pages 3049–3058. PMLR, 2018.
[46] R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, and I. Stoica. Tune: A Research Platform for Distributed Model Selection and Training. CoRR, abs/1807.05118, 2018.
[47] MADlib. User Documentation for Apache MADlib, Accessed May 30, 2020. https://bit.ly/3epbEyS.
[48] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan, and I. Stoica. Ray: A Distributed Framework for Emerging AI Applications. In 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2018, pages 561–577. USENIX Association, 2018.
[49] S. Nakandala and A. Kumar. Vista: Optimized System for Declarative Feature Transfer from Deep CNNs at Scale. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD '20, pages 1685–1700. Association for Computing Machinery, 2020.
[50] S. Nakandala, A. Kumar, and Y. Papakonstantinou. Incremental and Approximate Inference for Faster Occlusion-Based Deep CNN Explanations. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD '19, pages 1589–1606. Association for Computing Machinery, 2019.
[51] S. Nakandala, K. Nagrecha, A. Kumar, and Y. Papakonstantinou. Incremental and Approximate Computations for Accelerating Deep CNN Inference. ACM Trans. Database Syst., 0(ja), 2020.
[52] S. Nakandala, Y. Zhang, and A. Kumar. Cerebro: Efficient and Reproducible Model Selection on Deep Learning Systems. In Proceedings of the 3rd International Workshop on Data Management for End-to-End Machine Learning, pages 1–4, 2019.
[53] S. Nakandala, Y. Zhang, and A. Kumar. Cerebro: A Data System for Optimized Deep Learning Model Selection. https://adalabucsd.github.io/papers/TR_2020_Cerebro.pdf, 2020. [Tech report].
[54] D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Zaharia. PipeDream: Generalized Pipeline Parallelism for DNN Training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP '19, pages 1–15. Association for Computing Machinery, 2019.
[55] D. Narayanan, K. Santhanam, A. Phanishayee, and M. Zaharia. Accelerating Deep Learning Workloads through Efficient Multi-Model Execution. In NeurIPS Workshop on Systems for Machine Learning, December 2018.

[56] S. L. Oh, E. Y. Ng, R. S. Tan, and U. R. Acharya. Automated Diagnosis of Arrhythmia Using Combination of CNN and LSTM Techniques with Variable Length Heart Beats. Computers in Biology and Medicine, 102:278–287, 2018.
[57] T. O'Malley. Hyperparameter Tuning with Keras Tuner, Accessed January 31, 2020. https://blog.tensorflow.org/2020/01/hyperparameter-tuning-with-keras-tuner.html?linkId=81371017.
[58] B. C. Ooi, K.-L. Tan, S. Wang, W. Wang, Q. Cai, G. Chen, J. Gao, Z. Luo, A. K. Tung, Y. Wang, Z. Xie, M. Zhang, and K. Zheng. SINGA: A Distributed Deep Learning Platform. In Proceedings of the 23rd ACM International Conference on Multimedia, MM '15, pages 685–688. Association for Computing Machinery, 2015.
[59] A. Ordookhanians, X. Li, S. Nakandala, and A. Kumar. Demonstration of Krypton: Optimized CNN Inference for Occlusion-Based Deep CNN Explanations. PVLDB, 12(12):1894–1897, 2019.
[60] S. Pafka. Big RAM is Eating Big Data - Size of Datasets Used for Analytics, Accessed January 31, 2020. https://www.kdnuggets.com/2015/11/big-ram-big-data-size-datasets.html.
[61] S. Sievert, T. Augspurger, and M. Rocklin. Better and Faster Hyperparameter Optimization with Dask. In Proceedings of the 18th Python in Science Conference, pages 118–125, 2019.
[62] T. K. Sellis. Multiple-query Optimization. ACM Trans. Database Syst., 13(1):23–52, Mar. 1988.
[63] A. Sergeev and M. D. Balso. Horovod: Fast and Easy Distributed Deep Learning in TF. CoRR, abs/1802.05799, 2018.
[64] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[65] H. Su and H. Chen. Experiments on Parallel Training of Deep Neural Network using Model Averaging. CoRR, abs/1507.01239, 2015.
[66] A. Thomas and A. Kumar. A Comparative Evaluation of Systems for Scalable Linear Algebra-Based Analytics. PVLDB, 11(13):2168–2182, 2018.
[67] VMware. Model Selection for Deep Neural Networks on Greenplum Database, Accessed May 30, 2020. https://bit.ly/2AaQLc2.
[68] P. Warden. The Machine Learning Reproducibility Crisis, Accessed January 31, 2020. https://petewarden.com/2018/03/19/the-machine-learning-reproducibility-crisis.
[69] P. Watcharapichat, V. L. Morales, R. C. Fernandez, and P. Pietzuch. Ako: Decentralised Deep Learning with Partial Gradient Exchange. In Proceedings of the Seventh ACM Symposium on Cloud Computing, SoCC '16, pages 84–97. Association for Computing Machinery, 2016.
[70] G. J. Woeginger. The Open Shop Scheduling Problem. In 35th Symposium on Theoretical Aspects of Computer Science, STACS 2018, volume 96 of LIPIcs, pages 4:1–4:12. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018.
[71] W. Xiao, R. Bhardwaj, R. Ramjee, M. Sivathanu, N. Kwatra, Z. Han, P. Patel, X. Peng, H. Zhao, Q. Zhang, F. Yang, and L. Zhou. Gandiva: Introspective Cluster Scheduling for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2018, pages 595–610. USENIX Association, 2018.
[72] H. Zhang, L. Stafman, A. Or, and M. J. Freedman. SLAQ: Quality-Driven Scheduling for Distributed Machine Learning. In Proceedings of the 2017 Symposium on Cloud Computing, SoCC '17, pages 390–404. Association for Computing Machinery, 2017.
[73] Y. Zou, X. Jin, Y. Li, Z. Guo, E. Wang, and B. Xiao. Mariana: Tencent Deep Learning Platform and Its Applications. PVLDB, 7(13):1772–1777, 2014.
