Storage Benchmarking with Deep Learning Workloads

Peng Cheng
University of Chicago, Chicago, IL
[email protected]

Haryadi S. Gunawi
University of Chicago, Chicago, IL
[email protected]

ABSTRACT

In this AI-driven era, while compute infrastructure is often the focus, storage is equally important. Loading many batches of samples randomly is the common workload for deep learning (DL). These iterative small random reads impose nontrivial I/O pressure on storage systems. Therefore, we are interested in exploring the optimal storage system and data format for storing DL data and the possible trade-offs. In the meantime, object storage is usually preferred because of its high scalability, rich metadata, competitive cost, and the ability to store unstructured data.

This motivates us to benchmark two object storage systems: MinIO and Ceph. As a comparison, we also benchmark three popular key-value storage databases: MongoDB, Redis, and Cassandra. We explore the impact of different parameters, including storage location, storage disaggregation granularity, access pattern, and data format. For each parameter, we summarize the benchmark results and give some suggestions. Overall, although the optimal storage system is workload-specific, our benchmarks provide some insights on how to reach it.

KEYWORDS

Storage Systems, Object Storage, Database, Deep Learning, Performance Evaluation

1 INTRODUCTION

Deep learning (DL) has become a key technique for solving complex problems in scientific research and discovery. It is substantially challenging because it has to deal with massive quantities of multi-dimensional data. A distributed storage system, formed by networking together a large number of storage devices, provides a promising mechanism to store this massive amount of data with high reliability and availability. Deep learning system architects strive to design a balanced system in which the computational accelerator (FPGA, GPU, etc.) is not starved for data. However, feeding training data fast enough to keep accelerator utilization high is difficult. In addition, as accelerators are getting faster, the storage media and data buses feeding the data have not kept pace, and the ever-increasing size of training data further compounds the problem.

The random file access pattern in DL is mainly required by the use of the stochastic gradient descent (SGD) [33] model optimizer. SGD requires loading a mini-batch of training samples in a randomized order for every iteration. This is important for accelerating the model's convergence and decreasing the noise learned from the input sequence. However, such iterative loading of samples imposes nontrivial I/O pressure on storage systems, which are typically designed and optimized for large sequential I/O [26, 34].

Potential bottlenecks in storing large collections of small files have been pointed out across the file system community [5, 32], but none of these studies specifically considered the problem in the context of deep learning. In addition, to avoid overfitting caused by predefined input patterns, each batch of sample files must be loaded randomly. This presents another challenge to conventional I/O services that are highly optimized for large sequential I/O. To exploit the best performance, deep learning frameworks (e.g., TensorFlow [1], Caffe [18]) contain their own input methods. For example, TensorFlow introduces the Dataset API for importing datasets from different formats. However, their data loading performance depends on underlying conventional file systems (e.g., standard POSIX file systems) that are designed for general usage instead of being optimized for large-scale training.

In this paper, we explore using object storage systems and key-value databases to store deep learning datasets, and benchmark these systems with different parameters, including storage location, storage disaggregation granularity, access pattern, and data format. The results suggest that although the optimal solution may be workload-specific, proper configurations of various parameters can improve storage system performance for deep learning workloads.

In the remainder of this paper we cover the background in Section 2 and review related work in Section 3. In Section 4, we introduce our methods. Finally, we present our results in Section 5, discuss some future work in Section 6, and summarize our work in Section 7.

2 BACKGROUND

In this section, we review the background of storage disaggregation, I/O in deep learning, and object storage.

2.1 Storage Disaggregation

2.1.1 Resource Disaggregation. In cloud computing, periods of peak demand can be short-lived, and this results in significant resource underutilization [4] because it is difficult to determine an exact amount of resources on every server to meet the dynamic needs of different applications. Similar problems have been observed in large-scale data centers hosted by HPE [15], Intel [17], and Facebook [38]. Thus an emerging paradigm for resource provisioning called resource disaggregation is quickly gaining popularity. Many efforts have been carried out recently on resource disaggregation in both academia [2, 10] and industry [9, 16]. The goal of disaggregation is to decouple different resources, or disaggregate a large amount of a single resource, so that fine-grained allocations can be made to meet the dynamic demands of applications at run-time.

2.1.2 Decoupled Storage and Compute. With resource disaggregation, resources are physically separated from the compute servers and accessed remotely, which allows each resource technology to evolve independently and supports increased configurability of resources. In the context of high performance computing (HPC), several projects have provided mechanisms for disaggregating GPU accelerators [28, 30]. Nowadays, decoupled storage and compute is becoming mainstream considering the benefits of scalability, availability, and cost.

2.1.3 Storage Disaggregation for Deep Learning. In the context of storage disaggregation, provisioning can be particularly hard because the actual demand may be a complex combination of required capacity, throughput, bandwidth, latency, etc. For example, [29] claimed that only 40% or less of the raw bandwidth of the flash memory in SSDs is delivered by the storage system to applications. Disaggregating storage resources can help amortize the total system building cost and maximize the performance-per-dollar of computing infrastructure. There have been storage disaggregation solutions to improve storage resource utilization in large-scale data centers [21, 27, 29]. These proposed solutions aim to solve the problem at the system level. [42] presents DLFS, a user-level, read-optimized file system on top of non-volatile memory (NVM) devices for deep learning applications, but their work focuses on an in-memory sample directory for fast metadata management. To the best of our knowledge, there has been no effort to leverage storage disaggregation for deep learning by storing multiple training samples with their corresponding labels as an individual object.

2.2 I/O in Deep Learning

Gradient descent is one of the most popular algorithms to optimize the parameters of DNNs. Mini-batch gradient descent, often referred to as SGD [33], is commonly used because of its fast training speed and low memory requirements. However, there are a number of challenges for the effective utilization of storage resources.

2.2.1 Increased mini-batch size. The popularity of SGD has led to many optimizations to leverage increasing compute power. For example, it is important to batch more input samples, i.e., increase the mini-batch size in each training iteration. Many studies [11, 14, 36, 39] have reported that they can finish the training of ResNet-50 [14] with ImageNet [8] at fast speeds. In these studies, the batch size has increased from 256 [14] to 80K [39] over the past few years. However, the increased mini-batch size requires much higher throughput so that more samples can be trained in each iteration. Also, using a large mini-batch size leads to significant degradation in the quality of the model, since large-batch methods tend to converge to sharp minimizers of the training function [35].

2.2.2 Many small random samples. In deep learning training, a large number of training samples are expected to arrive in a random order to speed up convergence and reduce the training biases caused by fixed input sequences. Many popular datasets contain small samples. For example, the ImageNet dataset consists of many small samples (every raw image file is an individual sample); about 75% of the samples are less than 147 KB [42]. A similar trend is also shown in [6]. Working with these datasets results in many random reads to the storage system. This is a challenging I/O pattern because it cannot benefit from traditional storage systems that are typically designed for large sequential I/O patterns. Although we can preprocess small samples into large batched files (e.g., the TFRecord format) to avoid small random I/O, the existing sample shuffling method cannot support global data shuffling, and the size of the shuffle buffer limits the shuffling result. For example, when using TFRecord files in TensorFlow, every TFRecord file is sequentially read into a fixed-size shuffle buffer. However, if the size of the shuffle buffer is not large enough, we can only get partially shuffled samples, which reduces the training accuracy significantly.

2.3 Object Storage

2.3.1 Motivation. We can think of object storage as a hybrid of file systems and block storage. On the one hand, many applications nowadays depend less on traditional POSIX file systems. For example, some applications are satisfied with storing images in a file system that only supports a flat namespace and weak consistency. On the other hand, although a reduction of file system features can help scalability, users still need storage services to free them from the burden of managing blocks and various other storage metadata. To this end, object storage systems are often implemented with eclectic designs: they are less sophisticated than file systems but more intelligent than block devices.

2.3.2 Popularity of Object Storage. Object storage systems have been increasingly recognized as a preferred way to expose storage infrastructure to the web. In the meantime, with our world being increasingly connected and enriched by newly emerged computing technologies, the data produced is gigantic [40]. To combat such a data tsunami, object storage has been introduced to fulfill the common need of a scale-out storage infrastructure [3].

The rise of cloud computing has also assisted the emergence of object storage. In fact, object storage echoes the very spirit of cloud computing: it provides users with an unlimited and on-demand data depository accessible from anywhere in the world, and it helps achieve data consolidation and economy of scale, both facilitating cost-effectiveness. With cloud computing getting increasingly adopted, so is object storage [41].

2.3.3 Object Storage Interface. Object storage systems are designed as RESTful web services and are accessed via HTTP requests. To use object storage systems, users create containers and put objects into these containers for storage. Objects are just like files in regular file systems, but they cannot be locked or updated partially. Containers are identical to directories except that they cannot be nested. Both objects and containers are identified by unique names distinguishing one from another.

2.3.4 Properties of Object Storage. In object storage, the data is arranged in discrete units called objects and is kept in a single repository. Object storage volumes work as modular units: each is a self-contained repository that owns the data, a unique identifier that allows the object to be found over a distributed system, and the metadata that describes the data [37]. In general, object storage is well suited to store DL data because of its high scalability, rich metadata, and its ability to store unstructured data.

High Scalability. Object storage is ideal for storing content that can grow without bounds. Use cases include backups, archives, and static web content such as images. For storing a very large amount of data across different locations with extensive metadata, object storage is ideal.

Rich Metadata. File systems place some structure onto data, putting file objects into hierarchies (folders/directories) and attaching metadata to those objects. However, the metadata is typically based only on the information needed to store the file (time created, time updated, access rules). Objects are stored with extensible user-defined metadata that is typically highly searchable.

The Ability to Store Unstructured Data. DL learns from many different data types, which require varying performance capabilities. Therefore, storage systems must include the right mix of storage technologies to meet the simultaneous needs for scale and performance. The ability to store different types of unstructured data makes object storage a great fit for DL data.
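To make the shuffle-buffer limitation from Section 2.2.2 concrete, the following is a minimal TensorFlow sketch; the file names, buffer size, and batch size are illustrative assumptions, not the configuration used in our experiments. With a buffer much smaller than the dataset, each output element is drawn only from a small window of recently read records, so the shuffle is local rather than global.

import tensorflow as tf

# Hypothetical TFRecord shards holding pre-batched training samples.
filenames = ["train-000.tfrecord", "train-001.tfrecord"]

dataset = (
    tf.data.TFRecordDataset(filenames)
    # Records are read sequentially into a fixed-size buffer and shuffled
    # only within that buffer. If buffer_size is much smaller than the
    # number of records, the result is only a partial (local) shuffle.
    .shuffle(buffer_size=10_000)
    .batch(256)
)

for batch in dataset.take(1):
    print(batch.shape)  # one mini-batch of serialized records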

Figure 1: Overview of the Benchmark Approach. Datasets stored on storage nodes are loaded by compute nodes over a high-speed link. (Icons by Icons8.)

3 RELATED WORK

The related work includes performance analysis of deep learning applications' workflows. A few recent studies [12, 13, 19, 24] have documented the evaluation of different DL applications on different HPC systems, but all of these mainly deal with computation characterization. [24] evaluated how various hardware configurations affect the overall performance of deep learning, including storage devices, main memory, CPU, and GPU. They compared training time using four storage devices: single disk, single SSD, disk array, and SSD array. Their work only focuses on the effect of various hardware configurations, whereas our work also explores the impact of different access patterns as well as data formats.

There are also some efforts on I/O profiling and optimization for DL training workloads [6, 23, 25, 31]. For example, [25] evaluates image storage systems for training deep neural networks, including key-value databases, file systems, and in-memory Python/C++ arrays. Our work focuses on general-purpose storage systems, especially object storage systems, and aims to characterize the impact of different storage parameters.

4 METHODS

4.1 Experiment Setup

We benchmark two object storage systems (MinIO, Ceph) and three key-value databases (MongoDB, Redis, Cassandra) using the ext4 file system. The testbed is built with d430 physical nodes at Emulab (https://www.emulab.net/portal/frontpage.php), which are connected by 1 Gbps links, forming a local area network (LAN). The datasets used for benchmarking are MNIST and CIFAR-10 (Table 2), two simple but popular datasets for computer vision tasks. We store these datasets in each storage system and download them to a client node for training purposes.

Table 1: d430 Node Details

Component | Description
CPU       | 2 × Intel Xeon E5-2630v3 (8C16T), 2.4 GHz
Memory    | 64 GB DDR4
NIC       | Intel i350 GbE NIC
Storage   | 200 GB 6 Gbps SATA SSD, 2 × 1 TB 7200 RPM 6 Gbps SATA disks

Table 2: Dataset Details

Dataset  | Size (MB) | Description
MNIST    | 236       | 60,000 images of hand-written digits
CIFAR-10 | 952       | 60,000 32×32 images in 10 classes

4.2 Metrics

4.2.1 Download Time. The end-to-end training time is an important metric for evaluating a model [7]. We believe download time is a prominent part of the I/O cost if we load datasets from remote storage nodes. In the following experiments, download time is defined as the time from the first request for training data to the last one, which consists of the data processing time and data reading time in the storage systems, and the network transfer time.

4.2.2 Disk Usage. Given the limited storage resources, we would like to reduce the disk usage of datasets. Serialization techniques (e.g., Python's pickle, blob) can convert objects into byte streams and thus reduce disk usage. Compression (e.g., LZ4, Snappy) is another common technique to achieve this. We check the disk usage of uploaded data as reported by the target storage systems.

4.3 Benchmark Parameters

4.3.1 Storage Location. At the storage nodes, the data can be stored in memory, on a solid-state drive (SSD), or on disk; in this order both the cost and the speed decrease. Considering the huge difference in cost per GB, we are interested in the impact of storage location on download time.

4.3.2 Storage Disaggregation Granularity. We can store a single image as an object, or combine multiple images into an object. While in the former case we have more fine-grained metadata and can fetch a batch of samples of any size from the remote system, in the latter case fewer requests are needed to fetch the same amount of data, so there exists a trade-off.

4.3.3 Access Pattern. The access pattern includes two parameters: (1) the number of objects per request; and (2) random or sequential access, which depends on whether we fetch the data randomly or sequentially. We expect the access pattern to influence the download time significantly.

4.3.4 Data Format. We can simply store the data in raw formats (e.g., images, text). However, data serialization is a common method for storage, transfer, and distribution purposes. The data format can influence the download time and disk usage directly by changing the data size, or indirectly through the compression ratio. In the Cassandra experiments, we compare the download time and disk usage with three data formats: raw files, Python's pickle, and binary large object (blob).

4.4 Benchmark Implementation

We parse the raw files of each dataset, store one or multiple image(s) with the corresponding label(s) as a single object using Python3's pickle serialization, and then upload them (or the raw files) to each storage system. To benchmark the storage systems on SSD or disk, we manually set the data directories on the mounted SSD or disk and use vmtouch (https://github.com/hoytech/vmtouch) to clean the cache. To fetch data from memory, we pre-load the data into memory by querying the whole dataset and then begin the tests. On the client side, we load the data using the Python drivers corresponding to the storage systems. We first only simulate the process of loading training data, and later incorporate model training (Section 5.5). For random reads in batches, we shuffle the indices using a fixed random seed with NumPy, slice them into batches, and request the data from the storage systems using the batches of indices, which corresponds to mini-batch gradient descent; for sequential reads, we fetch the objects one by one or the full dataset at one time. We found that databases provide operations for querying multiple samples at a time (Table 3), but the object storage systems can only query one object at a time. Each performance test is repeated 3 times and the average value is used.

Table 3: Query Operations for Multiple Objects at a Time

System    | Operation  | Example
MongoDB   | $in        | collection.find({"_id": {"$in": idx_batch}})
Cassandra | WHERE...IN | SELECT * FROM MNIST WHERE id IN ?
Redis     | hmget      | hmget("mnist", filenameList)
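The batched random-read procedure described above can be sketched as follows. This is a minimal illustration assuming a MongoDB collection named "mnist" whose documents use integer _id fields and hold the pickled (image, label) pair in a "data" field; these names are our own, not necessarily those used in the benchmark code.

import pickle
import numpy as np
from pymongo import MongoClient

BATCH_SIZE = 16
NUM_SAMPLES = 60_000

# Assumed layout: one document per sample, keyed by integer _id,
# with the pickled (image, label) pair stored in the "data" field.
collection = MongoClient("mongodb://storage-node:27017")["dl"]["mnist"]

# Fixed seed so every run requests the same shuffled order.
rng = np.random.default_rng(seed=0)
indices = rng.permutation(NUM_SAMPLES)

for start in range(0, NUM_SAMPLES, BATCH_SIZE):
    idx_batch = indices[start:start + BATCH_SIZE].tolist()
    # One request per mini-batch via the $in operator (see Table 3).
    # Note: the server may return documents in a different order than requested.
    docs = collection.find({"_id": {"$in": idx_batch}})
    batch = [pickle.loads(doc["data"]) for doc in docs]
    # ... feed `batch` to the training loop ...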

5 RESULTS

5.1 Storage System and Location

We first analyze the impact of system choice and storage location. We download the pickled datasets from the different systems with sequential reads, loading a single image per request. Figures 2(a) and 2(b) show the download time of MNIST and CIFAR-10, respectively. Note that CIFAR-10 is about 4 times larger than MNIST, but the download time only increases by around 30% (Ceph, disk), indicating that the data processing time (which is not proportional to dataset size) dominates the download time.

The bars in each group present the download time from different storage locations. It is not surprising that memory is faster than SSD, and SSD is faster than disk. We also found that the object storage systems (MinIO, Ceph) are more sensitive to the storage location than the key-value databases (Cassandra, Redis, MongoDB). In the case of Redis, the data is only retrieved from memory, which makes it at least twice as fast as the other storage systems. However, since memory is more expensive than SSD and disk, Redis may not work with large datasets.

Figure 2: Download Time of Pickled Datasets by Storage System and Location (Sequential Read, one Image per Request). (a) MNIST; (b) CIFAR-10. Note that data is retrieved only from memory in Redis.

As mentioned, the data processing time is not proportional to the dataset size, so we drill down into the Cassandra experiment, trying to break down the download time into reading data time, processing data time, and network transfer time (Figure 3). Running the client program on the server node is around 10% faster than running it on the client node, indicating that network transfer time takes up approximately 10% of the download time. It takes only 4.2 seconds (SSD) or 6.6 seconds (disk) to load (sequential read, one image per request) the 60,000 files from the local file system, which means reading data time only takes up approximately 5% of the download time. This indicates that Cassandra's data processing time takes up most of the time (~85%). So far this is still a black box to us, but we think it also means opportunity: there is room for improvement by adjusting some system configurations (e.g., index, compression algorithms).
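For reference, the per-object GET loop used against the object stores in Figure 2 can be sketched with the MinIO Python client as follows; the endpoint, credentials, bucket, and key naming are illustrative assumptions, not the exact values from our testbed.

import pickle
from minio import Minio

# Assumed endpoint/credentials for the storage node running MinIO.
client = Minio("storage-node:9000",
               access_key="minioadmin",
               secret_key="minioadmin",
               secure=False)

samples = []
for i in range(60_000):
    # One HTTP GET per object; keys such as "mnist/00042" are our own convention.
    resp = client.get_object("mnist", f"mnist/{i:05d}")
    try:
        samples.append(pickle.loads(resp.read()))
    finally:
        resp.close()
        resp.release_conn()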

Figure 3: Cassandra Download Time = Reading Data Time + Processing Data Time + Network Transfer Time (Pickled MNIST, Disk). Compared configurations: local file system, Cassandra queried from the server node, and Cassandra queried from the client node.

Summary: As expected, in terms of download time, memory < SSD < disk, but the impact of the system choice is more significant. With a high-speed network link, we found that Cassandra's data processing time dominates the download time. Also, the object storage systems (MinIO, Ceph) are more sensitive to the storage location than the key-value databases (Cassandra, Redis, MongoDB). This suggests we should choose a storage system with proper configurations according to the workload to get the best performance.

5.2 Storage Disaggregation Granularity

5.2.1 Key-value Databases. We then explore the impact of storage disaggregation granularity, using MongoDB as a case study. In this experiment, we use the number of images per sample to reflect the storage disaggregation granularity. As MongoDB supports querying multiple samples at a time, we measured the download time with a varied number of samples per request. As Figure 4 shows, in MongoDB, coarser-grained samples can only reduce the download time by less than 10%. This indicates that the download time is not very sensitive to the storage disaggregation granularity. Therefore, it is better to choose more fine-grained disaggregation because of its flexibility at low overhead. We can also observe that once the number of samples per request reaches 16, the decrease in download time slows down. This suggests 16 samples per request is the optimum to use no matter how many images we put in a single sample. Although this conclusion is specific to the MongoDB-MNIST workload, we can use a similar method to find the optimal value for other workloads and disaggregate the dataset to achieve the optimal loading performance.

Figure 4: MongoDB Download Time by Storage Disaggregation Granularity and Number of Samples per Request (Pickled MNIST, Disk). Curves for 1, 2, 4, and 8 images per sample; 1 to 1024 samples per request.

5.2.2 Object Storage. The object storage systems we selected only support querying one object at a time, so we can only change the number of images per object. Figure 5 shows the download time of MNIST and CIFAR-10 from MinIO (disk) with a varied number of images per object. As expected, the download time decreases as the number of images per object increases, because fewer requests are needed to load the whole dataset. Note that the x-axis is log-2, so we nearly halve the download time by doubling the images per object.

Figure 5: MinIO Download Time by Storage Disaggregation Granularity and Dataset (Disk). 1 to 8 images per object, for MNIST and CIFAR-10.

Summary: The experiments show that storage disaggregation granularity can affect the download time, especially for object storage systems, since they only support querying one object at a time. But the key-value database (MongoDB) is not very sensitive to the storage disaggregation granularity. Therefore, it is better to choose more fine-grained disaggregation because of the flexibility it brings. The balance between performance and fine-grained management may depend on the workload (dataset).
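The granularity parameter studied above boils down to how many (image, label) pairs are pickled into one stored object. A small helper like the following, under our own naming assumptions, is enough to repack a dataset at different granularities before uploading; it is a sketch, not the exact preprocessing script used in our benchmarks.

import pickle

def pack_objects(images, labels, images_per_object):
    # Group (image, label) pairs into pickled objects of a given granularity.
    # Returns a list of (object_key, payload_bytes) ready to be uploaded to an
    # object store or a key-value database.
    objects = []
    for start in range(0, len(images), images_per_object):
        chunk = list(zip(images[start:start + images_per_object],
                         labels[start:start + images_per_object]))
        key = f"mnist/{start // images_per_object:05d}"  # illustrative key scheme
        objects.append((key, pickle.dumps(chunk)))
    return objects

# Example: 60,000 MNIST samples become 7,500 objects at 8 images per object.
# objs = pack_objects(train_images, train_labels, images_per_object=8)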
5.3 Access Pattern

Next we consider how the download time changes with the access pattern. Figure 6 presents the download time from the three databases that support querying one or multiple objects at a time, under different access patterns. In this experiment, we set the number of samples per object to one and use MNIST as the dataset. Figure 6(a) compares downloading samples one by one in order with downloading the full dataset at once. While downloading the full set is 6-20 times faster than downloading one sample at a time, this may not work when the full dataset is large (e.g., YouTube-8M). Therefore, we turn to requesting multiple samples randomly at a time in Figure 6(b), which corresponds to the model training process with stochastic gradient descent. The leftmost points also present the results of querying one sample at a time, but in a random order. The random access brings about a 10% increase in download time, reflecting the penalty of random reads compared with sequential reads.

The good news is that as we increase the number of samples per request, the download time decreases rapidly, from 2x faster at 4 samples per request to 10x faster at 256. We infer that the benefit comes from the decreased number of requests and the corresponding processing data time. After the number of samples per request reaches 16, the decrease in download time slows down. This suggests 16 should be the optimal request size for this given DL workload, which matches our result in Section 5.2.

Figure 6: Download Time of Pickled MNIST by Access Pattern (MongoDB-Raw, Cassandra-Pickled, Redis-Pickled). (a) Sequential: one sample per request vs. the full dataset (60,000 samples) in one request; (b) Random: 1 to 256 samples per request.

Summary: Random access brings some penalty, but increasing the number of samples per request using built-in operations can mitigate it and even match the speed of downloading the full dataset at one time, which is good for model training on batches of samples.

5.4 Data Format

Figure 7 shows the disk usage and download time for different data formats of MNIST in Cassandra. Both blob and pickle take up only about 11% of the disk space used by raw images, and about 2/3 of the memory space when loaded into memory. The reason for the difference is that data serialization improves the compression ratio. In addition to the disk usage, the data format also influences the download time (Figure 7(b)). Downloading the blob or pickled MNIST is about 2x faster than downloading the raw files. Although the benefits of serialization come at the cost of local deserialization, this cost is much smaller than the benefits. For example, local deserialization of MNIST only takes about 5 seconds with Python3's pickle.

Figure 7: Disk Usage and Download Time by Data Format (Cassandra, Pickled MNIST, Disk). (a) Disk Usage; (b) Download Time (Sequential, one Sample per Request). Formats compared: raw, blob, pickle.

Summary: Data serialization reduces not only the disk usage of datasets but also the download time. Since the cost of deserialization is much less than the benefits, we suggest pre-processing raw datasets into suitable formats or using other techniques such as compression. For small images, we think blob is the optimal data type to use.

5.5 Integration with Model Training

So far our experiments have only simulated the process of loading training data from remote servers, without incorporating the actual model training. It is interesting to see what happens when we combine model training with remote data loading. Therefore, we train LeNet-5 [20] on MNIST with PyTorch [22] using the CPU, and load the data either from local storage or from Redis, which performed the best in the previous experiments but may not work for larger datasets that cannot fit in memory.

We define a RedisMnist class, override the __getitem__ method with the Redis Python driver's hget function that retrieves one sample at a time, and train the model with different batch sizes and numbers of workers, which enables multi-process data loading. Figure 8 shows the end-to-end training time, from loading the data to the end of training. When the batch size is small (1 and 4), the training time is long and hides the data loading time, so the results with local storage and Redis are similar. However, as the batch size increases, I/O gradually becomes the bottleneck. In this case, loading from Redis takes 1.2 to 3.4 times as long as loading from local storage. Increasing the number of loading workers alleviates the problem, but it still takes much longer to load the data remotely. MNIST is a relatively small dataset, which means the remote I/O bottleneck can be more severe for larger datasets; this shows the value of our work, which aims to find the optimal remote storage configurations for deep learning workloads.
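A minimal sketch of the RedisMnist dataset described above is shown below. It assumes the pickled (image, label) pairs were stored in a Redis hash named "mnist" keyed by sample index, matching the hmget convention in Table 3; the host name, tensor conversion, and hyperparameters are illustrative, not the exact values from our experiments.

import pickle
import redis
import torch
from torch.utils.data import Dataset, DataLoader

class RedisMnist(Dataset):
    # MNIST samples stored as pickled (image, label) pairs in a Redis hash.

    def __init__(self, host="storage-node", port=6379, key="mnist", length=60_000):
        # For brevity the connection is opened here; with num_workers > 0 it
        # would typically be opened lazily inside each worker process instead.
        self.db = redis.Redis(host=host, port=port)
        self.key = key
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # One hget (one round trip) per sample, as in our experiments.
        image, label = pickle.loads(self.db.hget(self.key, str(idx)))
        return torch.as_tensor(image, dtype=torch.float32), int(label)

# The DataLoader drives the remote reads; batch size and workers as in Figure 8.
loader = DataLoader(RedisMnist(), batch_size=64, shuffle=True, num_workers=2)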

Figure 8: End-to-end Training Time by Data Locations, Number of Workers, and Batch Size (#Samples/Batch). (a) Loading from Local Storage; (b) Loading from Redis. Batch sizes 1 to 256, num_workers = 1 to 4.

Summary: Current deep learning frameworks provide interfaces for loading data from remote sources, but efficient implementations (e.g., exploiting database APIs that retrieve multiple objects at a time) need further investigation. In addition, while a long training time can hide the loading latency, loading data from remote sources becomes a bottleneck as training time decreases. Multi-process data loading can mitigate this, but may not be enough.

6 FUTURE WORK

Better Integration with Model Training. If we only override the __getitem__ method of the Dataset class, the default behavior of DataLoader corresponds to concurrent random single reads, which is bad for performance. For key-value databases, querying multiple samples at a time is more efficient than issuing multiple queries. To exploit this, we may need to carefully customize a data loader in PyTorch.
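One possible direction, sketched below under our own assumptions (it is not an implementation we have evaluated), is to let the sampler yield whole batches of indices so that the dataset can fetch each mini-batch with a single hmget call instead of one hget per sample.

import pickle
import numpy as np
import redis
import torch
from torch.utils.data import Dataset, DataLoader, BatchSampler, RandomSampler

class BatchedRedisMnist(Dataset):
    # Fetches a whole mini-batch of pickled samples with one hmget call.

    def __init__(self, host="storage-node", port=6379, key="mnist", length=60_000):
        self.db = redis.Redis(host=host, port=port)
        self.key = key
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, indices):
        # `indices` is a list produced by the BatchSampler, not a single index.
        blobs = self.db.hmget(self.key, [str(i) for i in indices])
        images, labels = zip(*(pickle.loads(b) for b in blobs))
        return (torch.as_tensor(np.stack(images), dtype=torch.float32),
                torch.as_tensor(labels))

dataset = BatchedRedisMnist()
# batch_size=None disables automatic batching, so each element yielded by the
# BatchSampler (a list of indices) is passed directly to __getitem__.
loader = DataLoader(dataset,
                    sampler=BatchSampler(RandomSampler(dataset),
                                         batch_size=64, drop_last=False),
                    batch_size=None)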
7 CONCLUSIONS

This AI-driven era requires new compute infrastructure as well as high-performance storage systems. We benchmark three key-value databases and two object storage systems, which are potentially better solutions. Our results show that performance improvement can be achieved by choosing proper parameters including storage location, storage disaggregation granularity, access pattern, and data format.

REFERENCES

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 265-283.
[2] Krste Asanović. 2014. FireBox: A hardware building block for 2020 warehouse-scale computers. (2014).
[3] Jeff Barr. 2012. Amazon S3 - 905 Billion Objects and 650,000 Requests/Second. https://aws.amazon.com/cn/blogs/aws/amazon-s3-905-billion-objects-and-650000-requestssecond/
[4] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. 2013. The datacenter as a computer: An introduction to the design of warehouse-scale machines. Synthesis Lectures on Computer Architecture 8, 3 (2013), 1-154.
[5] Doug Beaver, Sanjeev Kumar, Harry C Li, Jason Sobel, Peter Vajgel, et al. 2010. Finding a Needle in Haystack: Facebook's Photo Storage. In OSDI, Vol. 10. 1-8.
[6] Fahim Chowdhury, Yue Zhu, Todd Heer, Saul Paredes, Adam Moody, Robin Goldstone, Kathryn Mohror, and Weikuan Yu. 2019. I/O characterization and performance evaluation of BeeGFS for deep learning. In Proceedings of the 48th International Conference on Parallel Processing. 1-10.
[7] Cody Coleman, Deepak Narayanan, Daniel Kang, Tian Zhao, Jian Zhang, Luigi Nardi, Peter Bailis, Kunle Olukotun, Chris Ré, and Matei Zaharia. 2017. DAWNBench: An end-to-end deep learning benchmark and competition. Training 100, 101 (2017), 102.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248-255.
[9] Facebook. [n.d.]. Introducing Data Center Fabric, The Next-Generation Facebook Data Center Network. https://code.fb.com/production-engineering/introducing-data-center-fabric-the-next-generation-facebook-datacenter-network
[10] Peter X Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. 2016. Network requirements for resource disaggregation. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 249-264.
[11] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017).
[12] Jiazhen Gu, Huan Liu, Yangfan Zhou, and Xin Wang. 2017. DeepProf: Performance analysis for deep learning applications via mining GPU execution patterns. arXiv preprint arXiv:1707.03750 (2017).
[13] Mauricio Guignard, Marcelo Schild, Carlos S Bederián, Nicolás Wolovick, and Augusto J Vega. 2018. Performance characterization of state-of-the-art deep learning workloads on an IBM "Minsky" platform. In Proceedings of the 51st Hawaii International Conference on System Sciences.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770-778.
[15] HP. [n.d.]. Moonshot System: The World's First Software-Defined Servers. http://h10032.www1.hp.com/ctg/Manual/c03728406.pdf
[16] Intel. [n.d.]. Intel Rack Scale Design. https://www.intel.com/content/www/us/en/architecture-and-technology/rack-scale-design-overview.html
[17] Intel. [n.d.]. Intel RSA. https://www.intel.com/content/www/us/en/architecture-and-technology/rack-scale-design-overview.html
[18] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia. 675-678.
[19] Yuriy Kochura, Sergii Stirenko, Oleg Alienin, Michail Novotarskiy, and Yuri Gordienko. 2017. Performance analysis of open source machine learning frameworks for various parameters in single-threaded and multi-threaded modes. In Conference on Computer Science and Information Technologies. Springer, 243-256.
[20] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278-2324.
[21] Sergey Legtchenko, Hugh Williams, Kaveh Razavi, Austin Donnelly, Richard Black, Andrew Douglas, Nathanaël Cheriere, Daniel Fryer, Kai Mast, Angela Demke Brown, et al. 2017. Understanding rack-scale disaggregated storage. In 9th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 17).
[22] Eryk Lewinson. 2020. Implementing Yann LeCun's LeNet-5 in PyTorch. https://towardsdatascience.com/implementing-yann-lecuns-lenet-5-in-pytorch-5e05a0911320
[23] Jingjun Li, Chen Zhang, Qiang Cao, Chuanyi Qi, Jianzhong Huang, and Changsheng Xie. 2017. An experimental study on deep learning based on different hardware configurations. In 2017 International Conference on Networking, Architecture, and Storage (NAS). IEEE, 1-6.
[24] Xiaqing Li, Guangyan Zhang, H Howie Huang, Zhufan Wang, and Weimin Zheng. 2016. Performance analysis of GPU-based convolutional neural networks. In 2016 45th International Conference on Parallel Processing (ICPP). IEEE, 67-76.
[25] Seung-Hwan Lim, Steven R Young, and Robert M Patton. 2016. An analysis of image storage systems for scalable training of deep neural networks. system 5, 7 (2016), 11.
[26] Timothy Prickett Morgan. 2018. Removing The Storage Bottleneck for AI. www.nextplatform.com/2018/03/29/removing-the-storage-bottleneck-for-ai
[27] Mihir Nanavati, Jake Wires, and Andrew Warfield. 2017. Decibel: Isolation and sharing in disaggregated rack-scale storage. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). 17-33.
[28] Minoru Oikawa, Atsushi Kawai, Kentaro Nomura, Kenji Yasuoka, Kazuyuki Yoshikawa, and Tetsu Narumi. 2012. DS-CUDA: a middleware to use many GPUs in the cloud environment. In 2012 SC Companion: High Performance Computing, Networking Storage and Analysis. IEEE, 1207-1214.
[29] Jian Ouyang, Shiding Lin, Song Jiang, Zhenyu Hou, Yong Wang, and Yuanzheng Wang. 2014. SDF: software-defined flash for web-scale internet storage systems. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems. 471-484.
[30] Antonio J Peña, Carlos Reaño, Federico Silla, Rafael Mayo, Enrique S Quintana-Ortí, and José Duato. 2014. A complete and efficient CUDA-sharing solution for HPC clusters. Parallel Comput. 40, 10 (2014), 574-588.
[31] Sarunya Pumma, Min Si, Wu-chun Feng, and Pavan Balaji. 2017. Towards scalable deep learning via I/O analysis and optimization. In 2017 IEEE 19th International Conference on High Performance Computing and Communications; IEEE 15th International Conference on Smart City; IEEE 3rd International Conference on Data Science and Systems (HPCC/SmartCity/DSS). IEEE, 223-230.
[32] Kai Ren and Garth Gibson. 2013. TABLEFS: enhancing metadata efficiency in the local file system. In Proceedings of the 2013 USENIX Conference on Annual Technical Conference. 145-156.
[33] Sebastian Ruder. 2016. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747 (2016).
[34] Open Data Science. 2018. Data Storage Keeping Pace for AI and Deep Learning. https://medium.com/predict/data-storage-keeping-pace-for-ai-and-deep-learning-ad3e75e1c67a
[35] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. 2016. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. arXiv preprint arXiv:1609.04836 (2016).
[36] Samuel L Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V Le. 2017. Don't decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489 (2017).
[37] Arun Taneja. 2019. Advantages of object storage and how it differs from alternatives. https://searchstorage.techtarget.com/feature/How-an-object-store-differs-from-file-and-block-storage
[38] Jason Taylor. 2015. Facebook's data center infrastructure: Open compute, disaggregated rack, and beyond. In Optical Fiber Communication Conference. Optical Society of America, W1D-5.
[39] Masafumi Yamazaki, Akihiko Kasagi, Akihiro Tabuchi, Takumi Honda, Masahiro Miwa, Naoto Fukumoto, Tsuguchika Tabaru, Atsushi Ike, and Kohta Nakashima. 2019. Yet another accelerated SGD: ResNet-50 training on ImageNet in 74.7 seconds. arXiv preprint arXiv:1903.12650 (2019).
[40] N Yezhkova, L Conner, R Villars, and B Woo. 2010. Worldwide enterprise storage systems 2010-2014 forecast: recovery, efficiency, and digitization shaping customer requirements for storage systems. IDC, May (2010).
[41] Qing Zheng, Haopeng Chen, Yaguang Wang, Jiangang Duan, and Zhiteng Huang. 2012. COSBench: A benchmark tool for cloud object storage services. In 2012 IEEE Fifth International Conference on Cloud Computing. IEEE, 998-999.
[42] Yue Zhu, Weikuan Yu, Bing Jiao, Kathryn Mohror, Adam Moody, and Fahim Chowdhury. 2019. Efficient User-Level Storage Disaggregation for Deep Learning. In 2019 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 1-12.

A FULL RESULTS

Table 4: MNIST Download Time (s) by Storage System, Access Pattern, Data Format (Pickle if not specified), and Location. ("LocalFS" = local file system, "Seq" = sequential, "Full" = full set)

System         | Access Pattern | Memory | SSD | Disk
LocalFS        | Seq, 1         | 1.3    | 4.2 | 7
MinIO          | Seq, 1         | 138    | 154 | 169
Ceph           | Seq, 1         | 54     | 66  | 77
Redis          | Seq, 1         | 23     | N/A | N/A
Redis          | Seq, Full      | 2.1    | N/A | N/A
MongoDB-raw    | Seq, 1         | 58     | 62  | 65
MongoDB-raw    | Seq, Full      | 7      | 9.1 | 10.2
Cassandra      | Seq, 1         | 96     | 100 | 102
Cassandra      | Seq, Full      | 4.6    | 5.2 | 5.4
Cassandra-blob | Seq, 1         | 99     | 99  | 101
Cassandra-blob | Seq, Full      | 3      | 3.7 | 3.8
Cassandra-raw  | Seq, 1         | 196    | 200 | 206
Cassandra-raw  | Seq, Full      | 45     | 46  | 47

Table 5: MNIST Download Time (s) by Random Access

# of Samples per Request | 1     | 4     | 16    | 32    | 64    | 128   | 256
MongoDB-Raw              | 70.12 | 35.05 | 22.44 | 11.71 | 9.6   | 9.09  | 8.56
Cassandra-Pickled        | 111.6 | 40.3  | 21.3  | 15.81 | 15.15 | 13.59 | 12.53
Redis-Pickled            | 24.02 | 7.87  | 3.72  | 3.01  | 2.46  | 2.44  | 2.28

Table 6: CIFAR-10 Download Time (s) by Storage System, Access Pattern, Data Format (Pickle if not specified), and Location.

System         | Access Pattern | Memory | SSD  | Disk
LocalFS        | Seq, 1         | 1.4    | 7    | 19
MinIO          | Seq, 1         | 138    | 156  | 198
Ceph           | Seq, 1         | 57     | 69   | 105
Redis          | Seq, 1         | 28     | N/A  | N/A
Redis          | Seq, Full      | 8      | N/A  | N/A
MongoDB-raw    | Seq, 1         | 80     | 88   | 87
MongoDB-raw    | Seq, Full      | 23     | 30   | 31
Cassandra      | Seq, 1         | 96     | 108  | 115
Cassandra      | Seq, Full      | 20     | 20   | 21
Cassandra-blob | Seq, 1         | 97     | 104  | 113
Cassandra-blob | Seq, Full      | 12.5   | 14   | 15
Cassandra-raw  | Seq, 1         | 330    | 350  | 375
Cassandra-raw  | Seq, Full      | 132    | 142  | 139
