PoS(ACAT2010)027 http://pos.sissa.it/ ∗ [email protected] [email protected] [email protected] Speaker. Dynamic virtual organization clusters withtages over user-supplied generic virtual computing machines environments. (VMs) Thesehave advantages a have priori include advan- knowledge the of ability the for scientificvirtualized the tools environment user and as libraries to well available as to the programs other executingsmall-scale details in of testing the the locally, environment. thus The user saving can timeuser-supplied also VMs and perform require conserving contextualization computational in resources. orderronment. to However, Two operate types of properly contextualization in are a necessary: givenof image-level and cluster image-level instance-level. contextualization envi- Examples include one-time configurationity tasks of such ephemeral as ensuring storage, availabil- mounting ofthe a cluster’s cluster-provided batch shared scheduler. filesystem, and Alsosignment integration necessary of with is MAC and instance-level IP contextualization addresses. suchovercome This as those paper challenges the discusses in as- the the challengesUniversity contextualization and cluster of techniques the environment used STAR that to VM isauthors for to exposed use allow to with for the OSG. efficient contextualization Clemson Also oftualized their included grids. VMs are and suggestions recommendations for to future VM vir- ∗ Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike Licence. c

13th International Workshop on Advanced ComputingFebruary and 22-27, Analysis 2010 Techniques in Physics Research Jaipur, India Brookhaven National Laboratory E-mail: Jerome Lauret Sebastien Goasguen Clemson University E-mail: Michael Fenn Clemson University E-mail: Contextualization in Practice: The Clemson Experience PoS(ACAT2010)027 , 5 , are 1 Jerome Lauret ] and illustrated in Figure 9 ]. ] to investigate the integration of VMs into 7 1 ]. 8 ], VOCs are booted directly from that filesystem, 2 4 . 6 , the procedure followed in contextualizing the STAR VM is 3 , the results of testing performed on that image can be found in Section 4 , contextualization principles and recommendations for packaging VMs according to those 2 Virtual Organization Clusters (VOCs), first put forward in [ A key provision of the VOC model is its separation into two administrative domains, the Vir- Jobs are submitted to the VOC through a dedicated head node. This head node contains a The Virtual Organization Cluster (VOC) model has been developed by the Cyberinfrastructure Central to the VOC model is the idea that a Virtual Organization (VO) is able to provide a As cloud computing and virtual machines (VMs) become more popular, there is great interest The remainder of this work is organized as follows: the VOC model is discussed in Sec- tual Administrative Domain (VAD) and the Physicalsite Administrative on Domain the (PAD). Each grid is physical a uniqueVOC PAD containing is all a hardware, distinct infrastructure, VAD and that systems iseach software. managed VO Each by to the have VO. its This own distinction customized, between virtualized PAD environment. and VAD allows standard installation of the Open Science Grid (OSG) Compute Element software stack. Thus 2. Virtual Organization Clusters without the need to copy theis image presented to as each having node. a VOCs normal are grid transparent interface because to a the site end appears user. and principles are presented in Section described in Section and conclusions are presented in Section a cloud-computing construct that enable thehave creation the of properties virtual of environments. These beingnode environments compatible replication across of physical images, sites, transparentmanner, to deployable and end without customizable users, per- by able a tospawned Virtual be Organization. from implemented a VOCs in are single a made image,site non-destructive of uses and virtual a thus distributed machines filesystem their (VMs) such worker as nodes PVFS [ are trivially homogeneous. If a grid Research Group at Clemson University in an attemptprovides a to framework ameliorate for these developing concerns. virtual clusters The that VOCcloud model seamlessly computing integrate with technologies. existing grid This and integrationwell requires as sound procedures principles that are for guided contextualization by as those principles. This workcustomized seeks VM to document image both. to thereal VOC provider. VM In image order should to beVO validate deployed has this on long aspect a of been testbed the asince cluster VOC proponent 2007 and model, of has a a worked leveraging realistic with workload the in run. Open grid Science The computing Grid STAR environments, [ and grid sites. In light ofand the workload STAR used VO’s for virtualization this expertise validation and experiment need, [ it provided the VM image in adapting their use toof the virtualization many technologies, tasks a of given computationalhypervisor. VM science. Also, disk However, a image due VM cannot to image be amodified. that necessarily proliferation This must used process integrate is with with known any pre-existing as given infrastructure contextualization must [ be further tion Contextualization in Practice: The Clemson Experience 1. Introduction PoS(ACAT2010)027 Jerome Lauret mode, in which writes are directed to snapshot 3 ]. This scaling is accomplished via a daemon that monitors the workload 11 A Virtual Organization Cluster, consisting of a set of virtual machines (VOC nodes) instantiated As mentioned above, VOCs are autonomically scaled without the user having to make explicit The topic of VM contextualization merits further discussion. It is a safe assumption that any and takes appropriate action. When the daemonthis detects as an an increase increase in job in queue thethe length, demand demand it for interprets by computational instantiating resources. a Thejoin new daemon VM the will from batch thus the scheduling seek appropriate pool to VO-specifictaken and meet when image. appear the to These daemon VMs the detects a then user decreasedecrease as in the normal demand. number compute The of nodes. daemon VMs attempts in Similarimportant to order steps slowly to to increase are note avoid and inefficiencies that due each to VM a utilizes non-zero QEMU’s VM boot time. It is reservation requests [ given VM image will not successfullyis integrate into due a to VOC a as implemented varietytems, at of the any factors, need given site. to including acquire This but an not IP limited address to: (be it the via need DHCP to or mount some overlay any networking eternal mechanism), filesys- Figure 1: from a single image file and running on a set of hypervisor-equipped computethe nodes. VOC appears to thequeue user on the as head a node normalbe and contextualized site autonomically both on starts at the boot-time and grid. andmechanisms. stops when VMs the A Boot-time in image contextualization VOC response is component is to receivedresources monitors defined by load. from the the in the VMs VOC PAD the job to via must the VOC standard VAD. model grid as the leasing of certain a local copy-on-write image. Thesethat copy-on-write images is store itself the stored differences on fromhypervisor a a node base before network image VM filesystem. instantiation. The base image does not need3. to Contextualization be Principles copied to each Contextualization in Practice: The Clemson Experience PoS(ACAT2010)027 Jerome Lauret ] tool provides functionality 2 [ -img 4 utility. This utility mounts the disk image and extracts the kernel and initial ramdisk from The image layout issue can become much more involved. The two main image layouts are the Important considerations for image-level contextualization are image format, image layout, The simplest image format is that of the raw disk image. A raw image is simply a file contain- partition image layout and thea disk single image disk layout. partition. A Essentially,partition this does partition not layout image contain could contains any be metadata ais referred with representation able regard to to of to as present itself. a individual This partitions filesystem layoutof to image, this. requires a a The since guest hypervisor disk OS. a that image Currently, layout onlyrecord, contains the a boot representation sector, hypervisor of is and an capable partition entirethis disk, table. type including the All of master , image. boot includingof Since KVM, the Xen are VM requires capable image, the of Xen guest utilizing pygrub kernel may and only initial boot ramdisk fromthe to disk image, be images and located when as outside it such,procedure is can for used only converting in between be conjunction partition utilizedto with images with be the and converted a (at disk disk least images. in temporarily)utilized. the to Images the raw There will raw disk are, generally format format. however, need disk in several image order There useful is to is tools the allow same no and standard as set Converting one disk a between tools guiding physical image to disk, principle. formats be and is a a Theplaces. matter partition principle of image Useful is: getting is tools the the include: a correct same disk as structures a into physical the partition. correct that can convert images between manyon of any the particular popular hypervisor formats, image thuscan format. convert freeing between Hypervisor the their vendors user format also from and generally reliance the provide raw a format. tool that 3.1 Image-level contextualization shared filesystem support, and batchtion scheduler of integration. the Image disk’s format data refersplaced within to on the the the image representa- disk file. and to Image what layout other refers disk to structures how are the present. ing various the partitions exact are byte string thatbut would is appear not on space a efficient physicaldevice because device. being the represented. This image Note format file’s that is size raw highlyare images must compatible compress fairly be very easy equal well to with to distribute. gzip the In compression,of order capacity so virtual to they of image mitigate the formats the virtual such in-use as sizementation issue, VMDK, and there VDI, hypervisor has VHD, support, been and a but QCOW2.When proliferation they These utilizing all formats one allow vary of the in these compact imple- data formats, representation present the of on size a the of disk device, thecontextualize instead image. image the of is VM being determined image determined by format, bypatible the the the size image with capacity of must the of simply the hypervisor the be actual used device. converted at to In a a order given format to site. that is The com- Contextualization in Practice: The Clemson Experience the need to handshake withvariables. a Thus, batch the system, image and must theImage-level be contextualization need contextualized occurs to in once define two per any phases: VMization grid-specific occurs disk image-level environment once image and per per instance-level. VM site. instance. Instance-level contextual- PoS(ACAT2010)027 , and Jerome Lauret $OSG_APP , $OSG_GRID 5 option allows a partition image to be mounted, -o loop , certain resources must be leased from the physical site. These resources 2 , allows the exposure of the partitions of a, disk allows image the as running individual of devices, the native tools present in the image if necessary. environment variables. , allows the calculation of partition extents and the creation/modification of partition , when used with the , allows block level copying of defined sections of an image, fdisk tables, dd mount kpartx Network addresses, including both MAC and IP addresses, should be assigned (leased) to the If the VOC nodes are not spawned from a single image, some allocation of disk space must Whereas image-level contextualization can be performed manually by a systems administrator, There must also be a way to get computational jobs into the VM. Either the site’s batch sched- Grid systems have specifications that the compute nodes must adhere to. These specifica- • • • • • VMs in such a wayhypervisor. Leasing as of to IP avoid addresses conflicts. maythis be information performed Leasing by of to the MAC the hypervisor addressessuch if guest it method must (e.g. is of be capable assignment performed of is Xen) passing byhypervisor to or node the implement would may a contact central be a central leasing throughThe service server. the and service Before made would standard VM a then lease instantiation, maintain DHCP request a the protocol.addresses for lease will a database be MAC in One or unique order IP to to address. a avoidto hypervisor duplication. node, that Since that MAC of node and may the IP alsosatisfies VM. use the a As uniqueness function long constraint to without map as its the this address requirement of function a will centralized not service. be cause made an to overlap the hypervisor. in This addresses, could this use hypervisor’s method local disk, but care must be taken to avoid include network addresses, disk space, and scheduler slots. instance-level contextualization occurs once per VMAs instantiation described and in as Section such must be automated. Contextualization in Practice: The Clemson Experience $OSG_DATA uler or a VO-levelscheduler scheduling is system installed, it must is be prudentscheduling to installed pool configure into may the the scheduling be system partitioned VMthe in off image. such constraints from a of the way If the site’s that general the VOC themade VM’s scheduling Model. for site’s pool crossing batch If NAT in boundaries. order a to VO-level scheduler satisfy is installed,3.2 some Instance-level provision contextualization must be These tools, along with the bootloadera installer, should set be of sufficient partition to image assemble or a decompose disk a image disk from image intotions a set generally of require partition images. that variousits filesystems associated be worker nodes. shared amongthose Thus, filesystems. the the In compute particular, image element any must software (CE)must libraries be be and needed installed contextualized to and so the mount grid-provided that the application,must site’s it user-provided shared the application, properly filesystem mounted and mounts user at data the shares of locations defined the by Open the Science CE Grid configuration. which One perscribed such the specification definition is of that the PoS(ACAT2010)027 ] VM. 8 , 3 ]. If this is 12 Jerome Lauret has been per- 3 qemu-img create -f starworker_part.img utility) whereas the partition pygrub ] overlay network, or Kestrel [ 13 6 tool, issue the command qemu-img . ], Condor with the IPOP [ 6 alter the VM image from its original state due to the factors discussed in this ]. Techniques described for leasing network addresses can be easily extended 12 must Some VO’s may wish to have a measure of certainty that their VM has not been altered from A practical application of the principles and techniques discussed in Section The STAR VM image is provided as a Xen partition image named Then, in order to have the appropriate bootloader and kernel installed into the image, a fresh In light of the above discussion, some best practices emerge for VO’s that wish to provide It is best to provide the VM as a disk image layout in the raw format. The disk image layout is If the scheduling system requires the use of fixed slots for compute nodes, then these must Two approaches emerge to mitigate the administrator effort needed to install a compatible the original image. Thisthat guarantee various is sites not easysection. to enforce It in is the possible for generalThis a case. will VO greatly administrator In restrict to fact, sign thesites it the VM’s which is VO usability can likely image utilize because with the the his/her provided VM VM X.509 image will certificate. without only modification. be able4. to be Contextualizing deployed the at STAR VM not chosen, the VM should utilize aas common the distribution various of Debian the or GNU/Linux Red operatingthat Hat system appropriate derivatives. such The packages use will of exist such for as the system system, maximizes thus the minimizing probability administrator effort. This VM contains the programs andexperimental libraries data. necessary for the simulation and analysis of STAR’s with no bootloader or kernel.disk device. Therefore, To to do use thisraw the with image the 10G with KVM, starworker.img it is necessary to createinstallation a of the guest operating system (Scientific Linux in this case) is performed. This instal- formed at Clemson University to enable the contextualization of the STAR experiment’s [ to provide for such a scheduler. 3.3 Best Practices for VOs a VM. In short,scheduling VO’s pool should or use utilize disk a commonbelow. image operating layouts system in distribution. the This advice raw is format expanded that upon eithercompatible join with a all hypervisors global (although Xen requires the also be assigned [ batch scheduler. First, the VM may besuch as configured Condor to Glide-ins join [ a global scheduling pool, via mechanisms Contextualization in Practice: The Clemson Experience exceeding the disk’s capacity, especially whenAnother dynamically solution would resizing be disk to image map formats LUNs are of used. a storage area network to the hypervisor node. image is only natively compatible withbetween Xen. partition Substantial and administrator disk effort layouts. is Similarly,bility. required the Sites to raw may format convert choose is to preferred use duehowever, the to necessary raw its image to broad directly, compress or compati- raw to images convertraw it for images to transport compress their between very preferred well, sites. format. generally It It to is has the been size observed of that the actual data. PoS(ACAT2010)027 . * . start fdisk umount column that Jerome Lauret kpartx -d . Note that the and and calculating the start kpartx where # is the partition number cp -a /mnt/looppart/ $UNITS )) starworker.img * umount /mnt/looppart 7 variable has been set with the value of the depicts the STAR VM’s integration with the prototype wherein the output will contain a to actually create the loop devices, to see which loop devices will be created (some trial and 2 $START was used to mount the partition, the command . Figure variable has been set with the value of the units given by 4 where the kpartx $UNITS is issued to remove the loop devices. At this point, the STAR VM is bootable is used to copy the contents of the partition image into the disk image. Then the . If mount -o loop starworker_part.img /mnt/looppart/ column and the /mnt/loopdisk/ fdisk -lu starworker.img mount -o loop,offset=$(( $START to mount. contains the offset of each partition as well as a header giving the units of the offset, kpartx -a starworker.img mount /dev/mapper/loop0p# /mnt/loopdisk kpartx -l starworker.img error may be necessary to mount the correct partition), Once both images have been mounted, the command To use kpartx, issue the following commands: Instance-level contextualization also needs to be performed on each instance of the image. Due The STAR VO provided an image that was contextualized for the prototype cluster using the Now that the kernel and bootloader are installed into the new image, the contents of the root 1. 2. 2. 3. 1. qemu-kvm -hda starworker.img -m 512 -net nic -net user Once the appropriate partition in thewith disk the image command has been mounted, the partition image is mounted To calculate the offset manually issue the commands /mnt/loopdisk images are unmounted with the commands to the site-local networking and storage environments,VM the instance only is resource a that MAC has address. tohostname This be to leased is a to performed MAC the via address, a avoiding the functional need mapping for from a the central hypervisor’s leasing service. 5. STAR VM Results procedures outlined in Section offset manually. /mnt/loopdisk starworker.img with KVM and the site-specific batch scheduler and shared filesystems are configured as normal. directory must be copied from thebe provided image mounted. to the However, new the image.image Do new must do disk this, be image both mounted. images cannot must There be are mounted two directly, ways the of partition doing inside this: the using Contextualization in Practice: The Clemson Experience lation should be performed with theis target hypervisor. For QEMU/KVM, the invocationcontents command of a this installation willto be apply the completely appropriate replaced partitionioning in scheme abe and later performed to manually step, install if this the desired. installation bootloader. is These simply two steps may PoS(ACAT2010)027 Jerome Lauret . The overhead of booting the VMs plus 1 ], but all VMs did not boot in parallel due to 5 8 Integration of the STAR VM into the prototype VOC Condor reaction time by Job ID as observed by STAR , the VOC’s Condor scheduler was fast for the first two jobs due to the 3 Figure 2: Figure 3: STAR utilized the 16 slots available and submitted 32 jobs. TheThe use job’s of 280MB network-backed copy-on-write of images total means output that the time required to stage the VM As shown in Figure behavior of the VOC sizing daemon described in Section fact that the watchdog was configured to keep two VMs running at all times. Jobs 2 through 16 was streamed back to the Brookhaven National Laboratory (BNL) at 6.8MB/s. image to the node is negligible.overhead The to booting the of 11 the hours VOCVM nodes of takes added overall much approximately walltime 7 less required minutes than of by seven the minutes workload. to boot Note [ than an individual VOC. Once the VM has image-level contextualizationthe performed, same it as appears any to other the STAR-supporting STAR resource. VO to be virtualization overhead gives a total VOC overhead of approximately one percent. Contextualization in Practice: The Clemson Experience PoS(ACAT2010)027 , in (1) 1-18. 08 , in Jerome Lauret 17th Euromicro , in proceedings of ]. . 10 . CHEP 2009 , in proceedings of . . Condor-G: A computation management agent Autonomic Clouds on the Grid, JGC . , in proceedings of 9 PVFS: A Parallel File System for Linux Clusters Dynamic Provisioning of Virtual Organization A Study of a KVM-based Cluster for Grid Computing Virtual Organization Clusters (3) 236-246. 05 CC , 9th IEEE International Symposium on Cluster Computing and the Grid Contextualization: Providing One-Click Virtual Clusters http://www.star.bnl.gov/ http://linux.die.net/man/1/qemu-img http://www.opensciencegrid.org/ 4th annual Linux Showcase and Conference (ALS’00) 47th ACM Southeast Conference (ACMSE ’09) man page, Computing for the RHIC Experiments , in proceeding of Qemu-img proceedings of for multi-institutional grids 4th IEEE International Conference on e-Science International Conference on Parallel, Distributed, and Network-Based Processing (PDP 2009). Clusters (CCGrid ’09). proceedings of Contextualization of virtual machines is an operational necessity. Of the two stages of contex- The Virtual Organization Cluster (VOC) model worked smoothly, with very little (1%) over- [1] Open Science Grid, [2] [3] The STAR Experiment, [4] P.H. Carns, W.B. Ligon, R.B. Ross, R. Thakur, [5] M. Fenn, M. Murphy, S. Goasguen, [6] T. Frey, T. Tannenbaum, M. Livny, I. Foster, S. Teucke, [7] K. Keahey, T. Freeman, [9] M. Murphy, M. Fenn, S. Goasguen, [8] J. Lauret, tualization, instance-level contextualization is the mostization generally easily requires automated. systems Image-level administrator contextual- effortin due each to specific the large image, number hypervisor, of andare variables site presented present configuration. in this The work principles havethe of been complexity contextualization shown that to of be the implementable problem into they practice. develop can a Unfortunately, only due system serve to for as automating general image-level guides. contextualization. A worthwhile futurehead. goal The is model allows a VOCworking to and appear as validated a software normal stack. gridenvironments site, Thus, in thus VOCs a giving show way the great that user is promise confidence maximally for in convenient a providing for customized VOs. References 6. Conclusions Contextualization in Practice: The Clemson Experience started as soon as a VMin was the started queue and because joined there to were only thejobs 16 Condor remained VOC in pool. nodes the Jobs available. queue, 17-32 Once causing were the theof first VOC forced VMs. sizing 16 to daemon jobs wait Thus to completed, decide the 16 against VMs changingable started the to number schedule in the response remaining to 16 theprior jobs work first to that the 16 showed VOC jobs that nodes remained without synthetic running delay. jobs and This were observation well Condor validates mapped was to resources [ [10] M. Murphy, B. Kagey, M. Fenn, S. Goasguen, [11] M. Murphy, L. Abraham, M. Fenn, and S. Goasguen, PoS(ACAT2010)027 Jerome Lauret On the Design of Scalable, . SuperComputing ’09 10 Kestrel: An XMPP-based Framework for Many Task Computing , in proceedings of . 2nd Workshop on Many-Task Computing on Grids and , in proceedings of Self-Configuring Virtual Networks Applications Supercomputers (MTAGS 2009) Contextualization in Practice: The Clemson Experience [12] L. Stout, M. Murphy, S. Goasguen, [13] D.I. Wolinsky, Y. Liu, P. St. Juste, G. Venkatasubramanian, R. Figueiredo,