COVER STORY Cloop DEEPBlock device compression COMPRESSION with the cloop module KYRO, photocase.com

The cloop module lets you manage compression at the block device 512 bytes), and they are usually used for random access storage like ramdisks, level. Read on to learn how and other Live CDs fit all that CD-ROMs, floppy disks, hard disks, and hard disk partitions. software on a single disc. BY KLAUS KNOPPER Filesystems are a logical representa- tion of ordered data that is often present loop is a kernel block device block-based devices. If you look into the on a block device. A filesystem turns raw module used in Live CDs such output of ls -l /dev, you will easily recog- data into the familiar directory/file view. Cas Knoppix. The cloop module nize these devices by the prefix – c for The mount command is the bridge be- allows the system to read compressed character-based and b for block-based tween a block device partition and its data, usually from a file, thus creating devices – at the beginning of the output projection into a mount point directory. compressed virtual disks. Using cloop, line (see Listing 1). a Linux installation of about 2GB fits on Character-based devices, such as tape Cloop: A Compressed a single 700MB CD-R disc. In this article, drives, mice, and gamepads, provide se- Loopback Block Device I look at how cloop works and provide quential, character-by-character access One block device included in any Linux some insight into general kernel struc- to data. kernel is loop, which maps a plain file to tures of a block device. Block devices instead allow direct ac- a block device node in /dev/loop A Unix system traditionally distin- cess to arbitrary blocks of data, indexed [number]. The loop-attached file can guishes between character-based and by block number or sector (segments of then be mounted like an ordinary hard

32 ISSUE 86 JANUARY 2008

032-036_cloop.indd 32 15.11.2007 10:17:18 Uhr Cloop COVER STORY

been successfully port write operations would probably be Plain file read and decom- very inefficient because, at each write or pressed, and it operation, the compressed block struc- Disk Image never terminates ture would have to be reordered, and or interrupts ac- blocks on a block device can never be cess to the under- “deleted,” since the block device has cloop lying device. That no information about whether or not modprobe cloop file=/path/to/file.cloop way, even partly data is actually needed. damaged data can Also, when compressing a block, the still be restored, resulting block size is not really predict- Filesystem/Mount and crashes be- able. In fact, a compressed block can be cause of read er- slightly larger than an uncompressed mount −r −t iso9660 /dev/cloop /mnt rors are unlikely block, in case that the data is already (though, of stored compressed. When exchanging Mountpoint with visible course, an execut- compressed blocks, the block index table directories and files able program that would have to be recalculated (all offsets / has been damaged after the changed block will also /usr /home will most likely change). Therefore, cloop is very opti- refuse to run and mized for reading data, but writing is terminate with a not currently supported and probably Figure 1: (C)loop maps a hard disk image file to a block device and lets segmentation never will be. you mount the image like a real hard disk partition. fault). This feature If you need write support on read-only may give cloop media, you are better off with a filesys- disk, optionally with encryption. How- a slight advantage against compressing tem such as AuFS or Unionfs that merges ever, this driver still only provides a filesystems, in which an entire file be- the read-only directory with a directory translation between different data repre- comes unreadable if there is a read error offering write support. sentations of the same size. right at the beginning of that file. In Cloop can handle parallel accesses to Cloop, which was first written in 1999 cloop, at maximum, one block is missing multiple cloop image files. If you have for kernel 2.2 by iptables author Paul when read errors occur within this to split files because of size limits in the ‘Rusty’ Russell, uses a blockwise com- block. underlying filesystem, this is important pressed input file that reduces the neces- One disadvantage is that cloop is read- (i.e., the 2GB and 4GB maximum file sary space to around 1/3 of the original only. Attempting to rewrite cloop to sup- size in FAT32 and iso9660, respectively). size for executable files. I ported, and later rewrote, cloop for newer kernel ver- Listing 1: Character and Block Device Nodes sions, extending the module to support 01 crw-rw---- 1 root root 10, 1 2007-10-18 10:41 /dev/psaux 64-bit file access (thus allowing file sizes 02 brw-rw---- 1 root disk 3, 0 2007-10-18 10:41 /dev/hda beyond 4GB), multiple input files, and ioctl extensions to permit the exchange 03 brw-rw---- 1 root disk 240, 0 2007-11-01 21:38 /dev/cloop of input files on-the-fly. Since cloop is a block device and not a Listing 2: Attaching a New Cloop Image filesystem like , the compressed 01 $ sudo losetup /dev/cloop1 /media/cdrom/backup.cloop image can be anything, from raw data to arbitrary filesystem or database struc- 02 $ sudo mount -r -t /dev/cloop1 /mnt tures. On a Live CD, the cloop image 03 $ dmesg | tail -2 usually contains an iso9660 , 04 cloop: Initializing cloop v2.622 which is the most popular read-only and 05 cloop: loaded (max 8 devices) read-optimized filesystem used for data CDs. Cloop supports all filesystem fea- 06 cloop: losetup_file: 3571 blocks, 65536 bytes/block, largest block tures necessary for a Unix system is 65562 bytes. through the standard RockRidge exten- sion (long file names, permissions, Listing 3: Starting the Cloop Block Device symlinks, etc.). 01 register_blkdev(major=240, cloop_name="cloop"); Cloop also comes with several validity checks that return non-fatal error codes 02 for(i=0; i

JANUARY 2008 ISSUE 86 33

032-036_cloop.indd 33 15.11.2007 10:17:26 Uhr COVER STORY Cloop

Cloop images can be attached and de- If a file is given as module parameter Block devices have two methods for tached at run time using the standard file="/path/to/first/image", this file is reading from the device. The direct losetup command (see Listing 2). opened and associated with the first method is setting up a request function Cloop can work with input files that cloop device /dev/cloop0. that will be called each and every time a are mounted over NFS. This feature is How does /dev/cloop0 know what to read() on the device is performed. This important if you plan to run completely do when a process opens the block de- request function is then associated with diskless clients with filesystems vice and reads from it? Unlike filesys- the default block queue that has been mounted over cloop. Another interesting tems, block devices have a more com- created for the major device ID of cloop feature of cloop is that it can read and plex way of performing input and out- when the block device was set up (see decompress data asynchronously using put. Listing 5a). a kernel thread, so it’s not blocking a Depending on which kind of block de- cloop_request_fn() has to find out process group and not staying in kernel vice you are using, there are some opera- which device was accessed and then space for a noticeable time. tions that have to be supported and oth- transfer data from the cloop file into the Access to cloop image files can be sus- ers that won’t. In case of a disk or parti- memory space that was given in buffer_ pended while they are in use (a feature tion, you can use a structure like the one head. This direct method was used until introduced by Fabian Franz). The idea in Listing 4, which tells the kernel what cloop version 2.06. behind this feature is that you can tem- to do if someone opens or closes a /dev/ A disadvantage of this direct method porarily remove a storage device that is cloop* file. is that the entire device is blocked until in use by cloop without getting read er- cloop_open() needs to increase the use cloop_request_fn() returns. Actually, rors. After receiving a special ioctl call, counter of the device so that the module probably because of this, the direct cloop will patiently wait until the under- knows it is in use. cloop_close() does the method ceased to work for cloop starting lying file is activated (and re-analyzed) opposite. cloop_ioctl(), which is a special from kernel 2.6.22. again, so you could eject a Live CD case that handles losetup for exchanging while an OS is running from it. the underlying cloop image file, also sus- Per-Device Wait Queue One drawback is that all processes pends device operations in case the The new approach to block device I/O is that access the cloop device will be CLOOP_SUSPEND ioctl is sent. to use a per-device wait queue. With this blocked until the underlying file is re- inserted. In the worst case, the desktop Listing 5a: Setting Up a Direct Callback will freeze, which could make it hard to 01 static int cloop_request_fn(request_queue_t *q, int rw, struct send the command for unfreezing cloop. buffer_head *bh) Kernel Block Device 02 { /* do something with the read request ... */ } Components 03 To start a kernel 2.6.x block device in the 04 blk_queue_make_request(BLK_DEFAULT_QUEUE(cloop_major=240), module initialization, you need to call a cloop_request_fn); few kernel procedures. Listing 3 should give you the basic idea, although the code in Listing 3 does not include extra Listing 5b: Adding a Wait Queue features such as error checks and bailout 01 /* Called while queue_lock is held by kernel. */ procedures. 02 static void cloop_do_request(struct request_queue *q) { Listing 4: Disk Operations 03 struct request *req; 04 while((req = elv_next_request(q)) != NULL) { 01 static struct block_device_ 05 struct cloop_device *clo = req->rq_disk->private_data;; operations clo_fops = 06 blkdev_dequeue_request(req); /* Dequeue request first. */ 02 { 07 list_add(&req->queuelist, &clo->clo_list); /* Add to working list 03 owner: THIS_MODULE, for thread */ 04 open: cloop_open, 08 wake_up(&clo->clo_event); /* Wake up cloop_thread */ 05 release: cloop_close, 09 } 06 ioctl: cloop_ioctl 10 } 07 }; 11 08 12 ... 09 ... 13 10 14 cloop_dev[i].clo_queue = blk_init_queue(cloop_do_request, 11 ((struct gendisk *) clo_ &clo->queue_lock); disk)->fops = &clo_fops; 15 cloop_dev[i].clo_disk->queue = cloop_dev[i]->clo_queue;

34 ISSUE 86 JANUARY 2008

032-036_cloop.indd 34 15.11.2007 10:17:31 Uhr Cloop COVER STORY

method, the kernel collects requests The queue has to be added to the disk ning in the kernel thread, which is a from various sources in a linked list and associated with the cloop device, and separate process and does not block the occasionally delivers them to a proce- this requires a lock that can be held by entire system. dure in cloop. the kernel when requests are processed Because the request had been assem- For slow physical reads, letting the in order to avoid parallel queue manipu- bled from block I/O segments, it must be kernel collect read requests and deliver lation. A kernel thread is created for divided into data-to-be-read units, which them all at once makes the rest of the handling the real work of processing is what rq_for_each_bio() and bio_for_ system act more efficiently, since less the internal queue of read requests each_segment() do (Listing 7). time is spent in kernel space. (see Listing 6). cloop_load_buffer() consists of a phys- Cloop version 2.622 takes each request cloop_handle_request() will now read ical read procedure and a decompressor out of the block device queue and puts it blocks from the cloop image file, decom- based on the kernel’s internal uncom- into an internal queue, transferring it to press them into memory, and transfer press(), which is also used for decom- a per-device kernel thread, which does the parts of decompressed data that were pressing the initial ramdisk and some the real work with lower priority and no requested by the calling process to that kinds of compressed data packets in net- spinlock. This means that the block de- process’ buffer. work protocols. vice I/O scheduler never blocks, because The instructions can be arbitrarily Explaining kernel-library methods like it does not have to wait for a slow physi- complex and take as long a necessary do_generic_read() would be an article in cal I/O to complete (Listing 5b). because cloop_handle_request() is run- itself, but it is fair to say that they basi- cally do the same thing as read() or Listing 6: A Kernel Thread Processes Requests fread(), which you may know from the C library, just on the (much more com- 01 static int cloop_thread(void *data) { plex) kernel layer, which has a very low- level view of files. 02 struct cloop_device *clo = data; cloop_load_buffer() expects the block 03 current->flags |= PF_NOFREEZE; number of the block-to-read, which can 04 set_user_nice(current, -20); be calculated from the byte offset of the 05 while (!kthread_should_stop()||!list_empty(&clo->clo_list)) { data section requested by a process, and the block size used in the cloop image 06 struct list_head *n, *p; file associated with the device. 07 int err = wait_event_interruptible(clo->clo_event, 08 !list_empty(&clo->clo_list) || kthread_should_stop()); Creating a Cloop Image File 09 if(unlikely(err)) continue; After this short walk through the kernel side of cloop, the next step is to consider 10 list_for_each_safe(p, n, &clo->clo_list) { the image file. The cloop file format (see 11 int uptodate; Figure 2) is much less complicated than 12 struct request *req = list_entry(p, struct request, queuelist); the module source would suggest. 13 spin_lock_irq(&clo->queue_lock); The 128 bytes that appear at the be- ginning of the cloop image file are for 14 list_del_init(&req->queuelist); future extensions and the possibility of 15 spin_unlock_irq(&clo->queue_lock); making an “executable” cloop image file. 16 uptodate = cloop_handle_request(clo, req); /* do the read/ Usually, these 128 bytes contain a shell decompression */ script that will call modprobe with the 17 spin_lock_irq(&clo->queue_lock); right parameters in order to activate this 18 if(!end_that_request_first(req, uptodate, req->nr_sectors)) image, so that you can attach the image to a cloop device by typing its pathname. 19 end_that_request_last(req, uptodate); The version number within the header is 20 spin_unlock_irq(&clo->queue_lock); used to distinguish between older ver- 21 } sions that still use 32-bit numbers. 22 } In general, all numbers contained in the header of a cloop file have an archi- 23 return 0; tecture-independent network byte order, 24 } so that the same image will work on big- 25 endian as well as little-endian systems. 26 ... Cloop assumes uncompressed blocks of constant size (at least within the 27 same image file) because block devices 28 clo->clo_thread = kthread_create(cloop_thread, clo, "cloop%d", always use a constant block size, and it’s cloop_num); easier to calculate block numbers by the total size of a partition that way. The

JANUARY 2008 ISSUE 86 35

032-036_cloop.indd 35 15.11.2007 10:17:31 Uhr COVER STORY Cloop

block size is present in the header of a make compressed cloop file. backups on CD: Preamble (128 Bytes) Compressing blocks of constant size, #!/bin/sh however, leads to different sizes in the mkisofs -l -R / #V2.00 Format U compressed output, depending on how home/username | modprobe cloop file="$0" compressible the contained data is. Be- create_ exit 0 cause of this, the image file needs an compressed_fs - ... index of compressed block locations at 65536 |U the beginning, before the compressed cdrecord -v - block_size : 32−bit number, network order data starts. num_blocks : 32−bit number, network order The compressed data, block after Because create_com- block, follows until the end of file is pressed_fs has to data_index : num_blocks+1 64−bit numbers, network order reached. With the block index in the write the block index header part, cloop knows where the header before the compressed blocks can be found within data, the entire com- the data part. pressed image is Although decompressing a cloop file stored in virtual mem- is most conveniently done by just copy- ory (ram+swap) ing from /dev/cloop to a partition or file, until all data is com- compressed_data : num_blocks compressed data blocks compression is done via a userspace pressed and indexed, program called create_compressed_fs. so cdrecord will start This program is misnamed because writing at the end of create_compressed_fs does not compress the compression pro- a filesystem, but arbitrary data in cloop cess. format. The most basic call to create_ Make sure you have compressed_fs is: enough swap space available when using create_compressed_fs U this method. Figure 2: Structure of a cloop input file. inputfile blocksize > U Adding the -b outputfile (“best”) option will try gzip-0 (un- (the default) alone; the decompressor compressed) to gzip-9 (best gzip com- automatically finds the right decom- The use of - as the input file name will pression) and 7zip compression (gzip- pression method of each compressed cause a read from stdin. create_com- compatible mode) one after the other block. pressed_fs can be used as a pipe, without and use the smallest result. This com- Debian maintainer Eduard Bloch has the need of a temporary file, so you can presses about 7% better than gzip-9 rewritten create_compressed_fs and added an alternative syntax that allows Listing 7: Splitting a Request using temporary files instead of memory, threaded/parallel compression for multi- 01 /* This function does all the real work. */ processor machines, and daemon mode, 02 /* returns "uptodate" */ which allows setting up a cluster of com- 03 static int cloop_handle_request(struct cloop_device *clo, puters for fast compression of huge 04 struct request *req) { amounts of data. When installing the cloop-utils from Debian, you will get this 05 struct bio *bio; newer version of create_compressed_fs, 06 rq_for_each_bio(bio, req) { which is also present in the current 07 struct bio_vec *bvec; cloop source. 08 int vecnr; Sources and Compiling 09 bio_for_each_segment(bvec, bio, vecnr) { The most current source code for cloop 10 /* read compressed data from file, decompress, is always at http://debian-knoppix. 11 transfer data to process */ alioth. debian. org/ . Compilation should 12 ... succeed with 13 buffered_blocknum = cloop_load_buffer(clo,block_offset); make KERNEL_DIR=U 14 ... /path/to/your/kernel/sources 15 } which will build the cloop module cloop. 16 } ko as well as the cloop compression util- 17 } ity create_compressed_fs. I

36 ISSUE 86 JANUARY 2008

032-036_cloop.indd 36 15.11.2007 10:17:31 Uhr