An Experiment on Bare-Metal BigData Provisioning
Ata Turk (Boston University), Ravi S. Gudimetla (Northeastern University), Emine Ugur Kaynar (Boston University), Jason Hennessey (Boston University), Sahil Tikale (Boston University), Peter Desnoyers (Northeastern University), Orran Krieger (Boston University)

Abstract

Many BigData customers use on-demand platforms in the cloud, where they can get a dedicated virtual cluster in a couple of minutes and pay only for the time they use. Increasingly, there is a demand for bare-metal BigData solutions for applications that cannot tolerate the unpredictability and performance degradation of virtualized systems. Existing bare-metal solutions can introduce delays of tens of minutes to provision a cluster by installing operating systems and applications on the local disks of servers. This has motivated recent research developing sophisticated mechanisms to optimize this installation. These approaches assume that using network-mounted boot disks incurs unacceptable run-time overhead. Our analysis suggests that while this assumption is true for application data, it is incorrect for operating systems and applications: network mounting the boot disk and applications results in negligible run-time impact while leading to faster provisioning time.

1 Introduction

Today, virtualized IaaS-based BigData analytics solutions such as those provided by Amazon EMR [1] and IBM BigInsights [2] hold significant shares of the BigData analytics market [3]. Virtualization, at least in the way it is enabled in today's clouds, can introduce significant overhead, unpredictability, and security concerns, which are not tolerable for certain applications [4,5,6]. To address the needs of applications that are sensitive to these overheads, cloud vendors like IBM [7], Rackspace [8], and Internap [9] have started to offer bare-metal IaaS cloud solutions, with much of the focus being on supporting on-demand bare-metal BigData platforms.

All these bare-metal cloud solutions install the tenant's operating system and applications onto the server's local disks, incurring long delays for the user of the platform. Projects such as Ironic [10], MaaS [11], and Emulab [12] have developed sophisticated mechanisms to make this process efficient. A recent ASPLOS paper by Omote et al. [13] goes a step further to reduce these delays, lazily copying the image to local disk while running the application virtualized, and then de-virtualizing when copying is complete.

Is all this effort really necessary? In fact, Omote et al. [13] observe that network booting was actually faster than their approach, but assert that it would incur a "continual overhead", directing every disk I/O over the network. However, it is not clear whether they considered the natural approach of having a network-mounted boot drive (with OS files and applications) and using the local drive for just application data.

To evaluate this option we created a simple prototype where client machines (24-core 10GbE servers, RHEL 7.1) access their kernel and init ramdisk via standard network boot mechanisms (PXE), mount their root file system with pre-installed applications (Hadoop benchmarks) from an iSCSI volume located on a remote server, and use local disk for ephemeral storage (i.e. /swap, /tmp, and Hadoop data). With this approach, which involved a few lines of config file changes, we found that the run-time overhead of having a network-mounted boot drive is in fact negligible. After a short startup phase there are very few subsequent reads from the boot disk (around 3KB/s over 10 hours), suggesting that file caching is very effective for the boot drive. Boot disk writes, mostly to application log files, average 14KB/s.

These results strongly suggest that the enormous effort by on-demand bare-metal platforms to reduce the delay and overhead of installing tenants' operating systems on local disks may be misguided. The much simpler approach of separating boot and data disks and handling them differently appears to offer improved provisioning time with little or no runtime degradation. Moreover, a system based on this approach can allow the boot drives to be stored in a centralized repository, bringing to bare-metal environments many of the same capabilities available on virtualized platforms today. We are starting to develop a new Bare Metal Imaging (BMI) service based on this approach.

In the remainder of the paper, Section 2 describes the prototype we built to evaluate our approach to bare-metal BigData cluster provisioning. Section 3 presents the evaluation results we obtained. Related work is discussed in Section 4, and in Section 5 we conclude with a discussion of our findings.

Figure 1: Architecture of our network mounted BigData provisioning environment and cluster provisioning flow.

Figure 2: Provisioning time comparison of local disk installation and network (iSCSI) mounting. [Stacked bars for Local Disk and iSCSI; y-axis: Elapsed Time (Secs), 0 to 1400. Legend: Bigdata Configuration, Bigdata Installation, OS Reboot, Firmware Initialization, Post Setup, Software Installation, Package Installation, OS Boot (inc. Kernel+Initrd Download), DHCP request, Firmware Initialization, HaaS Power Cycle, Ceph Cloning, HaaS Initialization.]

2 The Prototype

Figure 1 shows the simple prototype we developed to evaluate our approach. HaaS [14] is a service we previously developed to allow users to allocate and provision physical nodes out of a shared pool. A single VM was used in the prototype for PXE services (DHCP, TFTP) and as an iSCSI server, with images (exposed to nodes as iSCSI targets) stored in a shared Ceph file system. Ceph provides us with a distributed storage system that supports efficient cloning of files. Most of the functionality in the prototype was implemented as bash scripts that interact with Ceph (to clone images), the provisioning VM, and the HaaS service.

We chose iSCSI, rather than NFS, as the protocol for mounting the drives because of the simplicity of installation: rather than crafting a shareable file system, we were able to connect a server to a blank iSCSI volume, perform a standard operating system installation, and then copy the resulting image file. Moreover, with the right hardware support, it should be possible to boot an iSCSI-mounted drive with no changes to the operating system being booted. In addition, iSCSI does not incur the overhead that NFS pays to validate that potentially shared files have not been modified.

As shown in Figure 1, provisioning a cluster has four main steps:

1. Node Reservation: Provisioning scripts interact with HaaS to allocate physical servers.
2. Image Preparation: A golden image is cloned to create an image for each node.
3. Per-node Configuration: Each image is modified (using loopback mount) to perform per-node configuration such as SSH keys, cluster IP addresses (/etc/hosts), and specifying application-specific functionality. Each image is then exposed as an iSCSI volume by the iSCSI server.
4. PXE Boot: On boot the node requests configuration information via DHCP, downloading its kernel, initial ramdisk, and a configuration file giving the iSCSI address for that node's remote boot disk.

This prototype is designed to provide us with basic performance information, and has obvious limitations from both a functionality and a performance perspective. A real implementation would offer all the functionality currently implemented as bash scripts through an API-accessible service. The single provisioning VM will obviously be a performance bottleneck in the long term, as, for example, multiple iSCSI servers will be needed to scale to large numbers of nodes. Despite these issues, the current implementation provides a proof of concept; a more carefully constructed system should provide even better performance.

3 Evaluation

We tested the prototype on a HaaS-managed 48-node cluster; each server was equipped with two Intel Xeon E5-2630L CPUs, 128 GB memory, 300 GB 10K SAS HDDs (two nodes had 1 TB 7.2K SATA HDDs), and two Intel 82599ES 10 Gbit NICs. Storage was provided by a four-node Fujitsu CD10000 Ceph storage appliance, with four 10 Gbit external NICs and an internal 40 Gbit InfiniBand interconnect.

Figure 3: Scalability analysis for network (iSCSI) mounted BigData cluster provisioning. [Bars for 2 Node, 4 Node, and 8 Node clusters; y-axis: Elapsed Time (Secs). Legend: Bigdata Post Script, Booting, Ceph Cloning, HaaS Initialization.]

Figure 4: Per-node cumulative iSCSI read volume (MB). [Series: runs with 256GB data and runs with 128GB data; x-axis phases: Initial Provisioning, Data Generation 1 to 5, Sort 1 to 5.]

Figure 5: Cumulative writes (MB) made by servers on the iSCSI gateway. [Series: runs with 256GB data and runs with 128GB data; x-axis phases: Initial Provisioning, Data Generation 1 to 5, Sort 1 to 5.]

The iSCSI bar in Figure 2 shows the time taken to start up a bare-metal Hadoop image from scratch using our prototype. As the figure shows, almost half the time is spent in firmware initialization, and the overall boot time is very rapid (260 seconds) and comparable to the network boot results presented in prior work [13]. As a comparison point, the Local Disk bar shows the time for a full install of a Hadoop environment using standard tools (RedHat Foreman for OS installation, Apache BigTop for Hadoop installation, ...), as is typically done in managed system environments. The remainder of this section compares the runtime overheads of these two installation mechanisms.
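
As a rough sanity check on the boot-disk traffic quoted in the Introduction (around 3KB/s of reads and 14KB/s of writes over 10 hours), the following sketch converts those average rates into cumulative per-node volumes and into a share of a 10 Gbit link. The helper names, the use of decimal units (1 KB = 1000 bytes), and the assumption that a single 10 Gbit link carries the iSCSI traffic are illustrative choices, not details taken from the prototype.

```python
# Back-of-the-envelope check of the boot-disk I/O rates quoted in the paper.
# Assumptions (not from the paper): decimal units (1 KB = 1000 B) and one
# 10 Gbit/s link per node carrying the iSCSI traffic.

SECONDS_PER_HOUR = 3600


def cumulative_mb(rate_kb_per_s: float, hours: float) -> float:
    """Total volume in MB moved at a constant average rate over `hours`."""
    return rate_kb_per_s * hours * SECONDS_PER_HOUR / 1000.0


def link_fraction(rate_kb_per_s: float, link_gbit_per_s: float = 10.0) -> float:
    """Fraction of the link's bandwidth consumed by the given average rate."""
    link_bytes_per_s = link_gbit_per_s * 1e9 / 8
    return rate_kb_per_s * 1000.0 / link_bytes_per_s


# Rates reported in the paper: ~3 KB/s reads, ~14 KB/s writes, over 10 hours.
reads_mb = cumulative_mb(3, 10)
writes_mb = cumulative_mb(14, 10)
share = link_fraction(3 + 14)

print(f"reads:  {reads_mb:.0f} MB over 10 hours")
print(f"writes: {writes_mb:.0f} MB over 10 hours")
print(f"combined share of a 10 Gbit link: {share:.6%}")
```

Under these assumptions the averages correspond to roughly 108 MB of reads and 504 MB of writes per node over the 10 hours, which is on the order of 0.001% of the link's capacity.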