IZO: Applications of Large-Window Compression to Virtual Machine Management
Total Page:16
File Type:pdf, Size:1020Kb
IZO: Applications of Large-Window Compression to Virtual Machine Management Mark A. Smith, Jan Pieper, Daniel Gruhl, and Lucas Villa Real – IBM Almaden Research Center ABSTRACT The increased use of virtual machines in the enterprise environment presents an interesting new set of challenges for the administrators of today’s information systems. In addition to the management of the sheer volume of easily-created new data on physical machines, VMs themselves contain data that is important to the user of the virtual machine. Efficient storage, transmission, and backup of VM images has become a growing concern. We present IZO, a novel large-window compression tool inspired by data deduplication algorithms, which provides significantly faster and better compression than existing large-window compression tools. We apply this tool to a number of VM management domains, including deep-freeze, backup, and transmission, to more efficiently store, administer, and move virtual machines. Introduction We find this property analogous to the way a suitcase is packed. It is possible to simply stuff all of Virtual machines are becoming an increasingly one’s clothes into a suitcase, sit on the lid, and zip it important tool for reducing costs in enterprise comput- ing. The ability to generate a machine for a task, and up. However, if the clothes are folded first, and then know it has a clean install of a given OS can be an compressed, they will fit in a smaller space in less invaluable time saver. It provides a degree of separa- time. Large-window compression can be thought of as tion that simplifies support (a major cost) as people all the folding step, while small-window gzip style com- get their ‘‘own’’ machines [26]. pression is the sitting and zipping. Using the term compression for both techniques can become confus- A side effect of this, however, is that there can be ing, so in such cases, we refer to large-window com- many machines created in a very short time frame, pression as data folding. each requiring several gigabytes of storage. While storage costs are falling, terabytes are not free. In this paper, we present three applications of large-window compression for the efficient manage- We believe that large-window compression tools ment of virtual machines. We also introduce IZO,a provide much needed relief in this domain, because novel large-window compression tool using data dedu- they often perform extremely well on large data sets. plication, which accomplishes significantly better data Large-window compression has not received much folding than existing large-window compression tools. attention in the past because archiving tasks have tradi- tionally involved only hundreds of megabytes. More- The first application is the use of large-window over, traditional tools such as gzip perform consistently compression in cases where long-term storage of well for small and large data sets. However, falling ‘‘retired’’ machines is desired. This is a common case storage costs and larger storage applications (e.g., vir- when a group of machines have been generated for tual machines) have driven archiving to hundreds of tasks that are no longer active, but which may become gigabytes or even hundreds of terabytes. Large-win- active again. Examples include yearly activities, or dow approaches provide significant additional com- machines for people who are only present periodically pression for data sets of these sizes, and complement (e.g., summer students). In this scenario a number of small-window compression. similar machines are to be placed in ‘‘deep freeze.’’ One of the most exciting results of our experi- The second task is that of distributing machine ments is that large-window compression does not images. Core images of machines are often used for impact the compression factor achieved by a small- the ‘‘base machine’’ with users having their own window compressor. The combined compression ratio mounted filesystems. The core image is usually main- is close to the product of the individual compression tained by experienced IT staff, e.g., ensuring that the ratios. In some cases we even noticed small improve- latest security patches and drivers are installed prop- ments in the compression ratio of gzip when applied to erly. In a global organization these new images need to data that had been processed by our large-window be distributed to remote sites. For example, in the compression algorithm. Moreover, the total time re- financial sector it may be necessary to distribute quired for large-window and small-window compres- updated machine images to all bank branches. Unfor- sion combined is often smaller than that of the small- tunately, these images can be quite large. A problem window compression alone. that can be compounded by low or moderate Internet 22nd Large Installation System Administration Conference (LISA ’08) 121 IZO: Applications of Large-Window Compression ... Smith, Pieper, Gruhl, and Villa Real connectivity for remote sites. We observe that each outline our implementation (Tutorial and Implementa- image tends to be very similar to the previous one. We tion), discuss the experiments and their results (Exper- therefore investigate whether a smaller patch file can iments) before concluding with some thoughts on how be created by calculating the binary delta between these tools can be extended even further (Conclusions similar virtual machines. We use the series of RedHat and Future Work). Enterprise Linux installs from version 4 to create a set of virtual machines from the initial release to updates Background 1-6 and calculate binary deltas between these ma- Virtual machines are an emerging tool for allow- chines. ing multiple ‘‘guest machines’’ to be run on a single The last task is in creating regular backups of piece of hardware. Each guest thinks it has a whole these systems. Traditional backup tools need to be machine to itself and the host machine juggles re- installed on the actual machine and require the ma- sources to support this. A full discussion of the chal- chine to be running during the backup. They are often lenges and opportunities of virtual machines are be- specialized on a specific operating system and filesys- yond the scope of this paper, but the reader is encour- tem. One benefit of these specialized solutions is that aged to consult [19, 28, 21] for a good overview. A they provide incremental backups: the ability to iden- good introduction to the open source virtual machine tify which files were modified since the last backup, in monitor Xen can be found in [5, 6] and the commer- order to save storage space on the backup device. This cial product VMWare in [27, 22]. can be difficult if the virtual machines are running a We are working on a novel large-window com- variety of (potentially arcane) operating systems or pression tool called IZO. IZO draws on concepts from filesystems. We instead look at creating backups of the existing tools and technologies including tar [7], gzip whole virtual machine image, and using large-window [9], rzip [24, 25] and data deduplication [29, 14]. It is deduplication to identify the changes between differ- ent revisions of the backup, to achieve the footprint of invoked on the command line using parameters identi- incremental backups without specialized knowledge of cal to the well-known tar utility to produce an archive the inner workings of these virtual machines. of multiple files or directories. In all these cases there is considerable redundancy Large Window Compression or duplication of data – for example, files that appear Traditional LZW-style compression techniques on multiple filesystems. These duplicates may occur are notoriously poor at handling long-range global gigabytes apart in the archive, and thus traditional redundancy. This is because they only look for redun- streaming short-window compression schemes such as dancy within a very small input window (32KB for gzip are unlikely to ‘‘see’’ both duplicates within their gzip, and 900KB for bzip2 [20]). To address this short- window, and therefore cannot compress them. coming, rzip [24] was developed to effectively in- We are working on a novel compression tool crease this window to 900 MB. Rzip generates a called IZO (Information Zipping Optimizer), which 4-byte hash signature for every offset in the input file was initially targeted at compressing ISO CD images. and stores selected hashes as {hash, offset} tuples in a IZO uses data deduplication algorithms to identify and large hash table. Data at offsets that produce the same remove global redundancies. Because large-window hash signature may be identical. The data is re-read compression tools are expected to be applied to hun- and a longest-matching byte sequence is calculated. If dreds of gigabytes of data, we desire reasonable mem- a matching byte sequence is found, a redirecting (off- ory requirements and fast data processing. Moreover, a set, length) record is written to the rzip file instead of scriptable interface is important to allow automation of the duplicate data. This method achieves extremely the different compression scenarios. For IZO, the com- high compression ratios, but suffers from two main mand line interface has been modeled after tar, provid- drawbacks as described by rzip’s author Andrew ing intuitive ease-of-use to users that are familiar with Tridgell in [24]. First, rzip has large memory require- this tool. IZO also provides the same streaming seman- ments for storing a hash table entry for every offset in tics as tar – it requires a set of files or directories as the input file. Second, it cannot operate on input input, and can stream the compressed output to stdout, streams because it requires random access to the input or to a specified archive. IZO only removes large-win- data to re-read possible matching byte sequences.