An Experiment on Bare-Metal BigData Provisioning

Ata Turk (Boston University), Ravi S. Gudimetla (Northeastern University), Emine Ugur Kaynar (Boston University), Jason Hennessey (Boston University), Sahil Tikale (Boston University), Peter Desnoyers (Northeastern University), Orran Krieger (Boston University)

Abstract

Many BigData customers use on-demand platforms in the cloud, where they can get a dedicated virtual cluster in a couple of minutes and pay only for the time they use. Increasingly, there is a demand for bare-metal BigData solutions for applications that cannot tolerate the unpredictability and performance degradation of virtualized systems. Existing bare-metal solutions can introduce delays of tens of minutes to provision a cluster, because they install operating systems and applications on the local disks of servers. This has motivated recent research into sophisticated mechanisms to optimize this installation. These approaches assume that using network-mounted boot disks incurs unacceptable run-time overhead. Our analysis suggests that while this assumption holds for application data, it is incorrect for operating systems and applications: network mounting the boot disk and applications results in negligible run-time impact while leading to faster provisioning times.

1 Introduction

Today, virtualized IaaS-based BigData analytics solutions such as those provided by Amazon EMR [1] and IBM BigInsights [2] boast significant shares of the BigData analytics market [3]. Virtualization, at least in the way it is enabled in today's clouds, can introduce significant overhead, unpredictability, and security concerns, which are not tolerable for certain applications [4, 5, 6]. To address the needs of applications that are sensitive to these overheads, cloud vendors like IBM [7], Rackspace [8], and Internap [9] have started to serve bare-metal IaaS cloud solutions, with much of the focus being on supporting on-demand bare-metal BigData platforms.

All these bare-metal cloud solutions install the tenant's operating system and applications on the servers' local disks, incurring long delays for the user of the platform. Projects such as Ironic [10], MaaS [11], and Emulab [12] have developed sophisticated mechanisms to make this process efficient. A recent ASPLOS paper by Omote et al. [13] goes a step further to reduce these delays, lazily copying the image to local disk while running the application virtualized, and then de-virtualizing when copying is complete.

Is all this effort really necessary? In fact, Omote et al. [13] observe that network booting was actually faster than their approach, but asserted that it would incur a "continual overhead", directing every disk I/O over the network. However, it is not clear if they considered the natural approach of having a network-mounted boot drive (with OS files and applications) and using the local drive for just application data.

To evaluate this option we created a simple prototype where client machines (24-core 10GbE servers, RHEL 7.1) access their kernel and init ramdisk via standard network boot mechanisms (PXE), mount their root file system with pre-installed applications (Hadoop benchmarks) from an iSCSI volume located on a remote server, and use the local disk for ephemeral storage (i.e., /swap, /tmp, and Hadoop data). With this approach, which involved only a few lines of config file changes, we found that the run-time overhead of having a network-mounted boot drive is in fact negligible. After a short startup phase there are very few subsequent reads from the boot disk (around 3 KB/s over 10 hours), suggesting that file caching is very effective for the boot drive. Boot disk writes, mostly to application log files, average 14 KB/s.

These results strongly suggest that the enormous effort by on-demand bare-metal platforms to reduce the delay and overhead of installing tenants' operating systems on local disks may be misguided. The much simpler approach of separating boot and data disks and handling them differently appears to offer improved provisioning time with little or no runtime degradation. Moreover, a system based on this approach can allow boot drives to be stored in a centralized repository, bringing to bare-metal environments many of the same capabilities available on virtualized platforms today. We are starting to develop a new Bare Metal Imaging (BMI) service based on this approach.
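To make the "few lines of config file changes" concrete, the sketch below illustrates the kind of settings involved; it is our own example, not taken from the paper, and the addresses, IQNs, device names, and labels are hypothetical. We assume a RHEL 7 dracut initramfs, which can attach an iSCSI root from kernel command-line arguments delivered through the PXE configuration, while /etc/fstab inside the image points swap, /tmp, and the Hadoop data directory at the local drive (assumed here to be /dev/sdb).

    # Hypothetical PXE entry (pxelinux.cfg/<node>): hands the node its kernel,
    # initrd, and the iSCSI address of its remote boot disk.
    DEFAULT hadoop
    LABEL hadoop
      KERNEL vmlinuz
      APPEND initrd=initrd.img ip=dhcp rd.iscsi.initiator=iqn.2016-01.edu.example:node01 netroot=iscsi:10.0.0.5::::iqn.2016-01.edu.example:node01 root=LABEL=rootfs rw

    # Hypothetical /etc/fstab inside the image: the root file system is the
    # network-mounted volume; ephemeral storage lives on the local drive.
    LABEL=rootfs   /             xfs    defaults   0 0
    /dev/sdb1      swap          swap   defaults   0 0
    /dev/sdb2      /tmp          xfs    defaults   0 0
    /dev/sdb3      /hadoop/data  xfs    defaults   0 0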

Figure 2: Provisioning time comparison of local disk installation and network (iSCSI) mounting. [Chart: stacked elapsed time (secs) per provisioning phase (HaaS initialization, Ceph cloning, HaaS power cycle, firmware initialization, DHCP request, OS boot including kernel+initrd download, package installation, post-setup software installation, OS reboot, firmware initialization, BigData installation, BigData configuration) for the Local Disk and iSCSI cases.]

Figure 1: Architecture of our network mounted BigData provisioning environment and cluster provisioning flow.

In the remainder of the paper, Section 2 describes the prototype we built to evaluate our approach to bare-metal BigData cluster provisioning, Section 3 presents the evaluation results we obtained, related work is discussed in Section 4, and in Section 5 we conclude with a discussion of our findings.

2 The Prototype

Figure 1 shows the simple prototype we developed to evaluate our approach. HaaS [14] is a service we previously developed to allow users to allocate and provision physical nodes out of a shared pool. A single VM was used in the prototype for PXE services (DHCP, TFTP) and as an iSCSI server, with images (exposed to nodes as iSCSI targets) stored in a shared Ceph file system. Ceph provides us with a distributed storage system that supports efficient cloning of files. Most of the functionality in the prototype was implemented as bash scripts that interact with Ceph (to clone images), the provisioning VM, and the HaaS service.

We chose iSCSI, rather than NFS, as the protocol for mounting the drives because of the simplicity of installation: rather than crafting a shareable file system, we were able to connect a server to a blank iSCSI volume, perform a standard operating system installation, and then copy the resulting image file. Moreover, with the right hardware support, it should be possible to boot an iSCSI-mounted drive with no changes to the operating system being booted. In addition, iSCSI does not incur the overhead NFS pays to validate that potentially shared files have not been modified.

As shown in Figure 1, provisioning a cluster has four main steps (a hypothetical script sketch for steps 2 and 3 appears at the end of this section):

1. Node Reservation: Provisioning scripts interact with HaaS to allocate physical servers.

2. Image Preparation: A golden image is cloned to create an image for each node.

3. Per-node Configuration: Each image is modified (using a loopback mount) to apply per-node configuration such as SSH keys, cluster IP addresses (/etc/hosts), and application-specific settings. Each image is then exposed as an iSCSI volume by the iSCSI server.

4. PXE Boot: On boot, each node requests configuration information via DHCP, downloading its kernel, initial ramdisk, and a configuration file giving the iSCSI address of that node's remote boot disk.

This prototype is designed to provide us with basic performance information, and it has obvious limitations from both a functionality and a performance perspective. A real implementation would expose the functionality now implemented as bash scripts through an API-accessible service. The single provisioning VM will obviously be a performance bottleneck in the long term; for example, multiple iSCSI servers will be needed to scale to large numbers of nodes. Despite these issues, the current implementation provides a proof of concept, and a more carefully constructed system would only provide better performance.
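The paper describes steps 2 and 3 only at a high level. As an illustration, the following is a minimal sketch of what a per-node image-preparation script along these lines might look like; the paths, image names, and IQN prefix are hypothetical, and the iSCSI export uses tgtadm (scsi-target-utils) rather than whatever target software the authors' provisioning VM actually ran.

    #!/bin/bash
    # Hypothetical per-node image preparation (steps 2 and 3), not the authors' scripts.
    set -e
    NODE=$1                                # e.g. node01
    TID=$2                                 # iSCSI target id, e.g. 1
    IMGDIR=/mnt/ceph/images                # assumed CephFS mount on the provisioning VM
    IMG=$IMGDIR/$NODE.img

    # Step 2: clone the golden image; --reflink=auto clones cheaply where the
    # file system supports it and falls back to a plain copy otherwise.
    cp --reflink=auto "$IMGDIR/golden.img" "$IMG"

    # Step 3: loopback-mount the image's root partition and inject per-node config.
    LOOP=$(losetup --find --show --partscan "$IMG")
    mkdir -p /mnt/$NODE
    mount "${LOOP}p1" /mnt/$NODE
    mkdir -p /mnt/$NODE/root/.ssh
    cat /root/tenant_key.pub >> /mnt/$NODE/root/.ssh/authorized_keys   # SSH keys
    cat cluster_hosts >> /mnt/$NODE/etc/hosts     # assumed pre-generated cluster IP list
    umount /mnt/$NODE
    losetup -d "$LOOP"

    # Expose the configured image as an iSCSI volume.
    tgtadm --lld iscsi --mode target --op new --tid "$TID" \
           --targetname "iqn.2016-01.edu.example:$NODE"
    tgtadm --lld iscsi --mode logicalunit --op new --tid "$TID" --lun 1 \
           --backing-store "$IMG"
    tgtadm --lld iscsi --mode target --op bind --tid "$TID" --initiator-address ALL

A script like this would be invoked once per node, between the HaaS reservation (step 1) and the PXE boot (step 4).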

Figure 3: Scalability analysis for network (iSCSI) mounted BigData cluster provisioning. [Chart: elapsed time (secs) broken into HaaS initialization, Ceph cloning, booting, and BigData post-script phases for 2-, 4-, and 8-node clusters.]

Figure 4: Per-node cumulative iSCSI read volume (MB). [Chart: cumulative iSCSI reads per node (MB) across initial provisioning, data generation runs 1-5, and sort runs 1-5, for runs with 128 GB and 256 GB of data.]

Figure 5: Cumulative writes (MB) made by servers on the iSCSI gateway. [Chart: cumulative iSCSI writes per node (MB) across initial provisioning, data generation runs 1-5, and sort runs 1-5, for runs with 128 GB and 256 GB of data.]

3 Evaluation

We tested the prototype on a HaaS-managed 48-node cluster; each server was equipped with two Intel Xeon E5-2630L CPUs, 128 GB memory, 300 GB 10K SAS HDDs (two nodes had 1 TB 7.2K SATA HDDs), and two Intel 82599ES 10 Gbit NICs. Storage was provided by a four-node Fujitsu CD10000 Ceph storage appliance with four 10 Gbit external NICs and an internal 40 Gbit InfiniBand interconnect.

The iSCSI bar in Figure 2 shows the time taken to start up a bare-metal Hadoop image from scratch using our prototype. As we can see from the figure, almost half the time is spent in firmware initialization, and the overall boot time is very rapid (260 seconds) and comparable to the network boot results presented in prior work [13]. As a comparison point, the Local Disk bar shows the time for a full install of a Hadoop environment using standard tools (RedHat Foreman for OS installation, Apache BigTop for Hadoop installation, and so on), as is typically done in managed system environments. The remainder of this section compares the runtime overheads of these two installation mechanisms.

We have made no effort to make the prototype scalable: the provisioning scripts are sequential, cloning each image in turn and then starting the nodes booting, and there is only a single provisioning VM. Figure 3 demonstrates that even this very simple design is sufficient to provision a modest number of nodes in parallel, with relatively modest degradation as we increase the number of concurrently provisioned nodes from two to eight.

The main goal of the prototype was to understand the run-time impact of a network-mounted boot drive for a BigData platform. Figures 4 and 5 show the per-node cumulative read and write iSCSI traffic during initial provisioning and then over five consecutive runs of random data generation followed by sorting, using the Hadoop Sort example, covering a duration of 7 to 17 hours for 128 GB and 256 GB of data respectively. These experiments were performed on two nodes allocated out of the HaaS cluster, with local data stored on the one-terabyte drive.

While we do not have a comparison to provisioning systems that copy (rather than install) an image to the local disk, one interesting data point is how much data would be transferred in the two cases. For the iSCSI case, Figure 4 shows that only around 250 MBytes of the boot disk are read over 10 hours. In contrast, out of the 8 GB boot image, 2.9 GB were actually used. In other words, any image distribution service would need to transfer 2.9 GB over the network to each node. Worse yet, it would then need to write this data to the local disk, at typical speeds of 100 MB/s or less for single-disk systems, or 1/10 of network speed.

In Figure 4, both curves flatten after repeated runs, demonstrating that the file cache is effective at caching the boot drive, even in the 256 GB case where the total data handled (at minimum five runs times 128 GB per machine, times a replication factor of two) is substantially larger than system memory. After initial boot and application startup, the sustained read bandwidth incurred is around 3 KBytes per second; effectively negligible.

Figure 5 shows the writes to the network-mounted storage; in contrast to the read case, log writes continue throughout the experiment, at an average rate of approximately 14 KB/s. On further examination, these writes target paths such as /var/log, /hadoop/log, and /var/run.¹ Most of these writes are log file updates made by Hadoop; although they could be directed to local storage, we did not do so because of their utility for debugging and their negligible rate.

¹ We should note that, in our deployments, /tmp and /swap are configured to reside on the local disk of servers.

The above figures examined the read/write overhead for relatively large data sets on just two nodes, and took more than 17 hours to run.
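The paper does not say how the per-node iSCSI counters behind Figures 4 and 5 were collected. As a hedged illustration, one simple way to gather comparable numbers on each client, assuming the network-mounted boot drive appears there as /dev/sda, is to sample the kernel's block-layer statistics for that device:

    # Hypothetical sampler, not from the paper: log cumulative data read and
    # written on the iSCSI-backed boot device once a minute. In
    # /sys/block/<dev>/stat, field 3 is sectors read and field 7 is sectors
    # written (512-byte sectors, so 2048 sectors = 1 MiB).
    DEV=sda
    while true; do
        read -r _ _ rd_sec _ _ _ wr_sec _ < /sys/block/$DEV/stat
        echo "$(date +%s) read_MiB=$((rd_sec / 2048)) write_MiB=$((wr_sec / 2048))"
        sleep 60
    done >> /var/tmp/iscsi_io.log

Sampling on the iSCSI server side instead would capture the same traffic from the storage end, which is how Figure 5 is labeled.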

To examine the performance difference between the two configurations, we also timed a series of experiments on 8-node clusters as we varied the data set size from 8 GB to 128 GB. In Figure 6 we compare the runtime of standard Hadoop benchmarks (Sort, Grep, WordCount) running on network-mounted and local disk-installed clusters. Reported numbers are the average of five runs, and we observed that the deviations among runs on the same configuration are negligible. As can be seen from the figure, the differences in runtime performance are negligible, with the exception of the 32 GB and 128 GB Sort experiments; we hypothesize that this difference may be caused by the non-deterministic behavior of random data sorting benchmarks.

Figure 6: Performance comparison of WordCount, Sort, and Grep Hadoop applications on local disk based and iSCSI mounted systems. [Chart: elapsed time (secs) for each benchmark in both configurations at data sizes of 8 GB, 16 GB, 32 GB, 64 GB, and 128 GB.]

4 Related Work

Network booting of computers came into widespread use almost 30 years ago [15], with remote access to both initial boot files (e.g. the kernel) and the file system enabling the creation of clusters of diskless workstations. More recently, diskless remote storage has become popular in the field of high-performance computing [16]. Some of the largest installations in the field (e.g. those from Cray [17]) use this technique to allow smaller and more reliable compute nodes, while others (e.g. Gordon [18]) add local high-speed storage (e.g. SSD) for ephemeral data.

In other fields, initial network booting (i.e. PXE) is widely used to initiate OS installation to local disk, but network boot with remote access to storage is rarely used. Instead, a rich set of automated provisioning systems have been developed as commercial products or open source bare-metal provisioning frameworks. Chandrasekar and Gibson [19] provide a comparative analysis of commonly used provisioning systems, namely Emulab [12], OpenStack Ironic [10], Crowbar [20], Razor [21], and Cobbler [22], and evaluate Emulab and Ironic in detail. All of these bare-metal provisioning frameworks copy a disk image to the nodes and use additional configuration management systems to set up the desired applications on the provisioned systems. Canonical's Metal-as-a-Service (MAAS) [11] provides a similar cloud-like solution on hardware owned by the customer.

Although there are many commercial offerings for BigData as a Service, such as Amazon Elastic MapReduce [1], Google Cloud Dataproc [23], and others, most of these are based on virtual machine deployment. Internap [9] and Rackspace [8] offer bare-metal solutions using the OpenStack Ironic [10] provisioning solution, typically coupled with Hadoop BigData application platforms such as Cloudera Enterprise [25], Hortonworks Data Platform [24], or the MapR Converged Data Platform [26].

In addition to open source products and commercial solutions, there exists a flurry of academic studies that investigate problems related to the focus of this study. Ekanayake and Fox [27] study the overhead incurred by Hadoop-based applications run on nodes in a virtual versus a bare-metal cloud infrastructure. Their studies corroborate the performance gains achieved by bare-metal deployment of BigData solutions. Omote et al. [13] investigate mechanisms for fast provisioning of operating systems and for reducing boot time in bare-metal clouds. They argue that the long startup times of bare-metal servers act as a significant inhibitor to providing agility and elasticity in bare-metal clouds, and propose BMCast, an OS deployment system with a special-purpose de-virtualizable Virtual Machine Manager that supports OS-transparent quick startup of bare-metal instances. This approach, we think, is useful, but it takes longer to provision than network mounting and lacks the potential image management benefits that a network-mounted approach can offer.

5 Conclusion

Bare-metal on-demand BigData platforms are becoming increasingly important, with a number of research and commercial offerings. Enormous effort has gone into these systems to reduce the delay of installing software onto the local disks.
2281 128GB 1073 1361 199 201 local disks. While previous work acknowledged that net- ’09. New York, NY, USA: ACM, 2009, pp. 199– work booting is faster than a local installation, they re- 212. jected this approach because of the assumption that it would incur an ongoing unacceptable overhead. We hy- [7] Softlayer, “Big data solutions,” http: pothesized that if we separate boot and data disks, using //www.softlayer.com/big-data, 2015. local storage for data, this overhead would be substan- tially reduced. We demonstrate with a simple prototype [8] Rackspace, “Rackspace cloud big data onmetal,” that this very simple strategy preserves all the advantages http://go.rackspace.com/baremetalbigdata/, 2015. of network booting while incurring negligible runtime [9] Internap, “Bare-metal agileserver,” http: overhead. //www.internap.com/bare-metal/, 2015.

Acknowledgments

We would like to thank Dan Shatzberg for his early suggestions in supporting a network mounted imaging service, and the MOC team in general for support and understanding while performing the experiments. We also thank Cisco and Fujitsu for their generous donations of server hardware and Ceph storage, respectively. This research was supported in part by the MassTech Collaborative Research Matching Grant Program, NSF awards 1347525 and 1414119, and several commercial partners of the Massachusetts Open Cloud who may be found at http://www.massopencloud.org.

References

[1] Amazon, "Amazon elastic mapreduce (amazon emr)," https://aws.amazon.com/elasticmapreduce/, 2015.

[2] IBM, "IBM biginsights for apache hadoop," www.ibm.com/software/products/en/ibm-biginsights-for-apache-hadoop, 2015.

[3] J. Kelly, "Hadoop-nosql software and services market forecast, 2014-2017," http://wikibon.com/hadoop-nosql-software-and-services-market-forecast-2013-2017, 2014.

[4] ZDNet, "Facebook: Virtualisation does not scale," http://www.zdnet.com/article/facebook-virtualisation-does-not-scale/, 2011.

[5] A. Iosup, S. Ostermann, N. Yigitbasi, R. Prodan, T. Fahringer, and D. Epema, "Performance analysis of cloud computing services for many-tasks scientific computing," IEEE Trans. Parallel Distrib. Syst., vol. 22, no. 6, pp. 931–945, Jun. 2011.

[6] T. Ristenpart, E. Tromer, H. Shacham, and S. Savage, "Hey, you, get off of my cloud: Exploring information leakage in third-party compute clouds," in Proceedings of the 16th ACM Conference on Computer and Communications Security, ser. CCS '09. New York, NY, USA: ACM, 2009, pp. 199–212.

[7] Softlayer, "Big data solutions," http://www.softlayer.com/big-data, 2015.

[8] Rackspace, "Rackspace cloud big data onmetal," http://go.rackspace.com/baremetalbigdata/, 2015.

[9] Internap, "Bare-metal agileserver," http://www.internap.com/bare-metal/, 2015.

[10] OpenStack, "Ironic," http://docs.openstack.org/developer/ironic/deploy/user-guide.html, 2015.

[11] Canonical, "Metal as a service (maas)," http://maas.ubuntu.com/docs/, 2015.

[12] D. Anderson, M. Hibler, L. Stoller, T. Stack, and J. Lepreau, "Automatic online validation of network configuration in the Emulab network testbed," in IEEE International Conference on Autonomic Computing (ICAC '06), Jun. 2006, pp. 134–142.

[13] Y. Omote, T. Shinagawa, and K. Kato, "Improving agility and elasticity in bare-metal clouds," in Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '15. New York, NY, USA: ACM, 2015, pp. 145–159.

[14] J. Hennessey, C. Hill, I. Denhardt, V. Venugopal, G. Silvis, O. Krieger, and P. Desnoyers, "Hardware as a service - enabling dynamic, user-level bare metal provisioning of pools of data center resources," in 2014 IEEE High Performance Extreme Computing Conference, Waltham, MA, USA, Sep. 2014. [Online]. Available: https://open.bu.edu/handle/2144/11221

[15] R. Gusella, "The analysis of diskless workstation traffic on an Ethernet," Tech. Rep., Nov. 1987.

[16] C. Engelmann, H. Ong, and S. Scott, "Evaluating the shared root file system approach for diskless high-performance computing systems," in Proceedings of the 10th LCI International Conference on High-Performance Clustered Computing (LCI-09), 2009.

[17] R. Alverson, D. Roweth, and L. Kaplan, "The Gemini system interconnect," in 2010 IEEE 18th Annual Symposium on High Performance Interconnects (HOTI), Aug. 2010, pp. 83–87.

[18] S. M. Strande, P. Cicotti, R. S. Sinkovits, W. S. Young, R. Wagner, M. Tatineni, E. Hocks, A. Snavely, and M. Norman, "Gordon: design, performance, and experiences deploying and supporting a data intensive supercomputer," in Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: Bridging from the eXtreme to the campus and beyond. Chicago, Illinois, USA: ACM, 2012, pp. 1–8.

[19] A. Chandrasekar and G. Gibson, "A comparative study of baremetal provisioning frameworks," Parallel Data Laboratory, Carnegie Mellon University, Tech. Rep. CMU-PDL-14-109, 2014.

[20] OpenCrowbar, "The crowbar project," https://opencrowbar.github.io, 2015.

[21] Puppetlabs, "Provisioning with razor," https://docs.puppetlabs.com/pe/latest/razor_intro.html, 2015.

[22] Cobbler, "Cobbler," https://cobbler.github.io, 2015.

[23] Google, "Google cloud dataproc," https://cloud.google.com/dataproc/overview, 2015.

[24] Hortonworks, "Hortonworks data platform," http://hortonworks.com/hdp/, 2016.

[25] Cloudera, "Cloudera enterprise," http://www.cloudera.com/products.html, 2016.

[26] MapR, "MapR converged data platform," https://www.mapr.com/products/mapr-converged-data-platform, 2016.

[27] J. Ekanayake and G. Fox, "High performance parallel computing with clouds and cloud technologies," in Cloud Computing. Springer, 2010, pp. 20–38.
