
Emulating Goliath Storage Systems with David


Nitin Agrawal†, Leo Arulraj∗, Andrea C. Arpaci-Dusseau∗, Remzi H. Arpaci-Dusseau∗
†NEC Laboratories America    ∗University of Wisconsin–Madison
[email protected]    {arulraj, dusseau, remzi}@cs.wisc.edu

Abstract

Benchmarking file and storage systems on large file-system images is important, but difficult and often infeasible. Typically, running benchmarks on such large disk setups is a frequent source of frustration for file-system evaluators; the scale alone acts as a strong deterrent against using larger albeit realistic benchmarks. To address this problem, we develop David: a system that makes it practical to run large benchmarks using modest amounts of storage or memory capacities readily available on most computers. David creates a "compressed" version of the original file-system image by omitting all file data and laying out metadata more efficiently; an online storage model determines the runtime of the benchmark workload on the original uncompressed image. David works under any file system as demonstrated in this paper with ext3 and btrfs. We find that David reduces storage requirements by orders of magnitude; David is able to emulate a 1 TB target workload using only an 80 GB available disk, while still modeling the actual runtime accurately. David can also emulate newer or faster devices, e.g., we show how David can effectively emulate a multi-disk RAID using a limited amount of memory.

1 Introduction

File and storage systems are currently difficult to benchmark. Ideally, one would like to use a benchmark workload that is a realistic approximation of a known application. One would also like to run it in a configuration representative of real world scenarios, including realistic disk subsystems and file-system images.

In practice, realistic benchmarks and their realistic configurations tend to be much larger and more complex to set up than their trivial counterparts. File system traces (e.g., from HP Labs [17]) are good examples of such workloads, often being large and unwieldy. Developing scalable yet practical benchmarks has long been a challenge for the storage systems community [16]. In particular, benchmarks such as GraySort [1] and SPECmail2009 [22] are compelling yet difficult to set up and use currently, requiring around 100 TB for GraySort and anywhere from 100 GB to 2 TB for SPECmail2009.

Benchmarking on large storage devices is thus a frequent source of frustration for file-system evaluators; the scale acts as a deterrent against using larger albeit realistic benchmarks [24], but running toy workloads on small disks is not sufficient. One obvious solution is to continually upgrade one's storage capacity. However, it is an expensive, and perhaps an infeasible, solution to justify the costs and overheads solely for benchmarking.

Storage emulators such as Memulator [10] prove extremely useful for such scenarios – they let us prototype the "future" by pretending to plug in bigger, faster storage systems and run real workloads against them. Memulator, in fact, makes a strong case for storage emulation as the performance evaluation methodology of choice. But emulators are particularly tough: if they are to be big, they have to use existing storage (and thus are slow); if they are to be fast, they have to be run out of memory (and thus they are small).

The challenge we face is how can we get the best of both worlds? To address this problem, we have developed David, a "scale down" emulator that allows one to run large workloads by scaling down the storage requirements transparently to the workload. David makes it practical to experiment with benchmarks that were otherwise infeasible to run on a given system.

Our observation is that in many cases, the benchmark application does not care about the contents of individual files, but only about the structure and properties of the metadata that is being stored on disk. In particular, for the purposes of benchmarking, many applications do not write or read file contents at all (e.g., fsck); the ones that do, often do not care what the contents are as long as some valid content is made available (e.g., backup software). Since file data constitutes a significant fraction of the total file system size, ranging anywhere from 90 to 99% depending on the actual file-system image [3], avoiding the need to store file data has the potential to significantly reduce the required storage capacity during benchmarking.

The key idea in David is to create a "compressed" version of the original file-system image for the purposes of benchmarking. In the compressed image, unneeded user data blocks are omitted using novel classification techniques to distinguish data from metadata at scale; file system metadata blocks (e.g., inodes, directories and indirect blocks) are stored compactly on the available backing store. The primary benefit of the compressed image is to reduce the storage capacity required to run any given workload. To ensure that applications remain unaware of this interposition, whenever necessary, David synthetically generates file data on the fly; metadata I/O is redirected and accessed appropriately. David works under any file system; we demonstrate this using ext3 [25] and btrfs [26], two file systems very different in design.

Since David alters the original I/O patterns, it needs to model the runtime of the benchmark workload on the original uncompressed image. David uses an in-kernel model of the disk and storage stack to determine the run times of all individual requests as they would have executed on the uncompressed image. The model pays special attention to accurately modeling the I/O request queues; we find that modeling the request queues is crucial for overall accuracy, especially for applications issuing bursty I/O.

One challenge in building David is how to deal with scale as we experiment with larger file systems containing many more files and directories. Figure 1 shows the percentage of storage space occupied by metadata alone as compared to the total size of the file-system image written; the different file-system images for this experiment were generated by varying the file size distribution using Impressions [2]. Using publicly available data on file-system metadata [4], we analyzed how file-size distribution changes with file systems of varying sizes. We found that larger file systems not only had more files, they also had larger files. For this experiment, the parameters of the lognormal distribution controlling the file sizes were changed along the x-axis to generate progressively larger file systems with larger files therein. The relatively small fraction belonging to metadata (roughly 1 to 10%) as shown on the y-axis demonstrates the potential savings in storage capacity made possible if only metadata blocks are stored; David is designed to take advantage of this observation.

Figure 1: Capacity Savings. Shows the savings in storage capacity if only metadata is stored, with varying file-size distribution modeled by (µ, σ) parameters of a lognormal distribution, (7.53, 2.48) and (8.33, 3.18) for the two extremes.

We believe David will be useful to file and storage developers, application developers, and users looking to benchmark these systems at scale. Developers often like to evaluate a prototype implementation at larger scales to identify performance bottlenecks, fine-tune optimizations, and make design decisions; analyses at scale often reveal interesting and critical insights into the system [16]. David can help obtain approximate performance estimates within limits of its modeling error. For example, how does one measure performance of a file system on a multi-disk multi-TB mirrored RAID configuration without having access to one? An end-user looking to select an application that works best at larger scale may also use David for emulation. For example, which anti-virus application scans a terabyte file system the fastest?

The primary mode of operation of David is the timing-accurate mode in which, after modeling the runtime, an appropriate delay is inserted before returning to the application. A secondary speedup mode is also available in which the storage model returns instantaneously after computing the time taken to run the benchmark on the uncompressed disk; in this mode David offers the potential to reduce application runtime and speed up the benchmark itself. In this paper we discuss and evaluate David in the timing-accurate mode.

David allows one to run benchmark workloads that require file-system images orders of magnitude larger than the available backing store while still reporting the runtime as it would have taken on the original image. We demonstrate that David even enables emulation of faster and multi-disk systems like RAID using a small amount of memory. David can also aid in running large benchmarks on storage devices that are expensive or not even available in the market as it requires only a model of the non-existent storage device; for example, one can use a modified version of David to run benchmarks on a hypothetical 1 TB SSD.

For workloads like PostMark, mkfs, Filebench WebServer, Filebench VarMail, and other microbenchmarks, we find that David delivers on its promise in reducing the required storage size while still accurately predicting the benchmark runtime for both ext3 and btrfs. The storage model within David is fairly accurate in spite of operating in real-time within the kernel, and for most workloads predicts a runtime within 5% of the actual runtime. For example, for the Filebench webserver workload, David provides a 1000-fold reduction in required storage capacity and predicts a runtime within 0.08% of the actual.
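The capacity argument behind Figure 1 can be illustrated with a small back-of-the-envelope script. The sketch below is not the Impressions-based methodology used for the figure: it assumes a hypothetical ext3-like layout (4 KB blocks, 256-byte inodes, 12 direct pointers, 4-byte block pointers), ignores directories and allocation bitmaps, and simply samples file sizes from the two lognormal extremes quoted in the caption to show why metadata is a small fraction of the written image.

```python
import math
import random

BLOCK = 4096                     # bytes per block (assumed ext3-style layout)
INODE = 256                      # bytes per on-disk inode (assumption)
PTRS_PER_BLOCK = BLOCK // 4      # 4-byte block pointers per indirect block
DIRECT_PTRS = 12

def blocks_for_file(size_bytes):
    """Return (data_blocks, metadata_blocks) for one file, crudely estimated."""
    data = math.ceil(size_bytes / BLOCK)
    meta = INODE / BLOCK                        # fractional share of an inode block
    beyond = max(0, data - DIRECT_PTRS)         # blocks reached via indirect pointers
    if beyond:
        singles = math.ceil(beyond / PTRS_PER_BLOCK)
        meta += singles
        if singles > 1:                         # rough double-indirect cost
            meta += math.ceil(singles / PTRS_PER_BLOCK)
    return data, meta

def metadata_fraction(mu, sigma, nfiles=100_000, seed=0):
    rng = random.Random(seed)
    data = meta = 0.0
    for _ in range(nfiles):
        d, m = blocks_for_file(rng.lognormvariate(mu, sigma))
        data += d
        meta += m
    return meta / (meta + data)

for mu, sigma in [(7.53, 2.48), (8.33, 3.18)]:  # extremes quoted in Figure 1
    frac = metadata_fraction(mu, sigma)
    print(f"mu={mu}, sigma={sigma}: metadata ~= {100 * frac:.2f}% of written blocks")
```

Because the sketch omits directories, bitmaps and journal blocks, it under-counts metadata relative to the 1–10% range reported in Figure 1, but the qualitative conclusion — that dropping data blocks removes the vast majority of the required capacity — is the same.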

[Figure 2 (diagram): the benchmark and file system issue requests to the large target device exported by David; David squashes data blocks and remaps metadata blocks onto the much smaller available backing store. Blocks are shown as metadata, data, or unoccupied.]

Figure 2: Metadata Remapping and Data Squashing in David. The figure shows how metadata gets remapped and data blocks are squashed. The disk image above David is the target and the one below it is the available.

2 David Overview

2.1 Design Goals for David

• Scalability: Emulating a large device requires David to maintain additional data structures and mimic several operations; our goal is to ensure that it works well as the underlying storage capacity scales.

• Model accuracy: An important goal is to model a storage device and accurately predict performance. The model should not only characterize the physical characteristics of the drive but also the interactions under different workload patterns.

• Model overhead: Equally important to being accurate is that the model imposes minimal overhead; since the model is inside the OS and runs concurrently with workload execution, it is required to be fairly fast.

• Emulation flexibility: David should be able to emulate different disks, storage subsystems, and multi-disk systems through appropriate use of backing stores.

• Minimal application modification: It should allow applications to run unmodified without knowing the significantly less capacity of the storage system underneath; modifications can be performed in limited cases only to improve ease of use but never as a necessity.

2.2 David Design

David exports a fake storage stack including a fake device of a much higher capacity than available. For the rest of the paper, we use the terms target to denote the hypothetical larger storage device, and available to denote the physically available system on which David is running, as shown in Figure 2. It also shows a schematic of how David makes use of metadata remapping and data squashing to free up a large percentage of the required storage space; a much smaller backing store can now service the requests of the benchmark.

David is implemented as a pseudo-device driver that is situated below the file system and above the backing store, interposing on all I/O requests. Since the driver appears as a regular device, a file system can be created and mounted on it. Being a loadable module, David can be used without any change to the application, file system or the kernel. Figure 3 presents the architecture of David with all the significant components and also shows the different types of requests that are handled within. We now describe the components of David.

First, the Block Classifier is responsible for classifying blocks addressed in a request as data or metadata and preventing I/O requests to data blocks from going to the backing store. David intercepts all writes to data blocks, records the block address if necessary, and discards the actual write using the Data Squasher. I/O requests to metadata blocks are passed on to the Metadata Remapper.

Second, the Metadata Remapper is responsible for laying out metadata blocks more efficiently on the backing store. It intercepts all write requests to metadata blocks, generates a remapping for the set of blocks addressed, and writes out the metadata blocks to the remapped locations. The remapping is stored in the Metadata Remapper to service subsequent reads.

Third, writes to data blocks are not saved, but reads to these blocks could still be issued by the application; in order to allow applications to run transparently, the Data Generator is responsible for generating synthetic content to service subsequent reads to data blocks that were written earlier and discarded. The Data Generator contains a number of built-in schemes to generate different kinds of content and also allows the application to provide hints to generate more tailored content (e.g., binary files).

Finally, by performing the above-mentioned tasks David modifies the original I/O request stream. These modifications in the I/O traffic substantially change the application runtime, rendering it useless for benchmarking. The Storage Model carefully models the (potentially different) target storage subsystem underneath to predict the benchmark runtime on the target system. By doing so in an online fashion with little overhead, the Storage Model makes it feasible to run large workloads in a space- and time-efficient manner. The individual components are discussed in detail in §3 through §6.
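The interplay of these components can be summarized in a few lines of user-space pseudocode. This is only an illustrative sketch under our own naming (CompressedStore, ToyModel and the fixed per-request cost are ours, not David's); the real David is an in-kernel pseudo-device driver operating on bio requests.

```python
from dataclasses import dataclass, field

@dataclass
class Write:
    lba: int            # block address on the (large) target device
    data: bytes
    is_metadata: bool   # decided by the Block Classifier in the real system

@dataclass
class CompressedStore:
    """Backing store that keeps only metadata, at remapped locations."""
    remap: dict = field(default_factory=dict)    # target LBA -> available LBA
    next_free: int = 0
    blocks: dict = field(default_factory=dict)   # available LBA -> contents

    def handle(self, w: Write, model):
        model.account(w.lba)                     # timing is modeled for every request
        if not w.is_metadata:
            return                               # "data squashing": drop the contents
        if w.lba not in self.remap:              # "metadata remapping": pack it densely
            self.remap[w.lba] = self.next_free
            self.next_free += 1
        self.blocks[self.remap[w.lba]] = w.data

class ToyModel:
    def __init__(self):
        self.virtual_time_us = 0.0
    def account(self, lba):
        self.virtual_time_us += 50.0             # placeholder per-request cost

store, model = CompressedStore(), ToyModel()
store.handle(Write(lba=9_000_000_000, data=b"\0" * 4096, is_metadata=True), model)
store.handle(Write(lba=9_000_000_001, data=b"x" * 4096, is_metadata=False), model)
print(len(store.blocks), "block(s) actually stored;",
      f"modeled time so far = {model.virtual_time_us} us")
```

The sketch stores one block out of two written, which is the essence of the capacity reduction; the delay that the real Storage Model inserts before completing each request is what preserves timing accuracy.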

[Figure 3 (diagram): disk requests from the file system are cloned to the Storage Model (target I/O request queue and device model, which compute and insert the target time delay), while the Block Classifier — inode parser, journal snooper, unclassified block store, and implicit/explicit notification mechanisms — routes metadata writes through the Metadata Remapper (remap table and remap bitmap) to the available backing store, squashes data writes with the Data Squasher, and services data reads from the Data Generator.]

Figure 3: David Architecture. Shows the components of David and the flow of requests handled within.

2.3 Choice of Available Backing Store

David is largely agnostic to the choice of the backing store for available storage: HDDs, SSDs, or memory can be used depending on the performance and capacity requirements of the target device being emulated. Through a significant reduction in the number of device I/Os, David compensates for its internal book-keeping overhead and also for small mismatches between the emulated and available device. However, if one wishes to emulate a device much faster than the available device, using memory is a safer option. For example, as shown in §6.3, David successfully emulates a RAID-1 configuration using a limited amount of memory. If the performance mismatch is not significant, a hard disk as backing store provides much greater scale in terms of storage capacity. Throughout the paper, "available storage" refers to the backing store in a generic sense.

3 Block Classification

The primary requirement for David to prevent data writes using the Data Squasher is the ability to classify a block as metadata or data. David provides both implicit and explicit block classification. The implicit approach is more laborious but provides a flexible approach to run unmodified applications and file systems. The explicit notification approach is straightforward and much simpler to implement, albeit at the cost of a small modification in the operating system or the application; both are available in David and can be chosen according to the requirements of the evaluator. The implicit approach is demonstrated using ext3 and the explicit approach using btrfs.

3.1 Implicit Type Detection

For ext2 and ext3, the majority of the blocks are statically assigned for a given file system size and configuration at the time of file system creation; the allocation for these blocks doesn't change during the lifetime of the file system. Blocks that fall in this category include the super block, group descriptors, inode and data bitmaps, inode blocks and blocks belonging to the journal; these blocks are relatively straightforward to classify based on their on-disk location, or their Logical Block Address (LBA). However, not all blocks are statically assigned; dynamically-allocated blocks include directory, indirect (single, double, or triple indirect) and data blocks. Unless all blocks contain some self-identification information, in order to accurately classify a dynamically allocated block, the system needs to track the inode pointing to the particular block to infer its current status.

Implicit classification is based on prior work on Semantically-Smart Disk Systems (SDS) [21]; an SDS employs three techniques to classify blocks: direct and indirect classification, and association. With direct classification, blocks are identified simply by their location on disk. With indirect classification, blocks are identified only with additional information; for example, to identify directory data or indirect blocks, the corresponding inode must also be examined. Finally, with association, a data block and its inode are connected.

There are two significant additional challenges David must address. First, as opposed to SDS, David has to ensure that no metadata blocks are ever misclassified. Second, benchmark scalability introduces additional memory pressure to handle delayed classification. In this paper we only discuss our new contributions (the original SDS paper provides details of the basic block-classification mechanisms).

3.1.1 Unclassified Block Store

To infer when a file or directory is allocated and deallocated, David tracks writes to inode blocks, inode bitmaps and data bitmaps; to enumerate the indirect and directory blocks that belong to a particular file or directory, it uses the contents of the inode. It is often the case that the blocks pointed to by an inode are written out before the corresponding inode block; if a classification attempt is made when a block is being written, an indirect or directory block will be misclassified as an ordinary data block. This transient error is unacceptable for David since it leads to the "metadata" block being discarded prematurely and could cause irreparable damage to the file system. For example, if a directory or indirect block is accidentally discarded, it could lead to file system corruption.
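The deferral strategy David adopts to avoid such misclassification (the Unclassified Block Store, described next) can be sketched in user space as follows. The class and field names here are hypothetical; the real implementation buffers in-kernel bio structures and parses on-disk inodes.

```python
class DeferredClassifier:
    """Toy deferral of ambiguous writes until the owning inode is observed."""

    def __init__(self):
        self.pending = {}          # block LBA -> buffered contents

    def on_block_write(self, lba, data):
        # Cannot yet tell data from indirect/directory blocks: hold the write.
        self.pending[lba] = data

    def on_inode_write(self, inode):
        """inode: dict with 'meta_blocks' and 'data_blocks' LBA lists."""
        kept, dropped = [], []
        for lba in inode["meta_blocks"]:
            if lba in self.pending:
                kept.append((lba, self.pending.pop(lba)))    # remap + persist
        for lba in inode["data_blocks"]:
            if self.pending.pop(lba, None) is not None:
                dropped.append(lba)                          # now safe to squash
        return kept, dropped

c = DeferredClassifier()
for lba in (100, 101, 102):
    c.on_block_write(lba, b"...")
kept, dropped = c.on_inode_write({"meta_blocks": [102], "data_blocks": [100, 101]})
print(f"persisted {len(kept)} metadata block(s), squashed {len(dropped)} data block(s)")
```

The key property is that nothing is discarded until the inode proves a buffered block is data; the price, as the next subsections discuss, is the memory held by blocks awaiting classification.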

4 To rectify this problem, David temporarily buffers in 600000 Without Ext3 Journal Snooping memory writes to all blocks which are as yet unclassi- With Ext3 Journal Snooping 500000 fied, inside the Unclassified Block Store (UBS). These system out of memory Maximum memory limit write requests remain in the UBS until a classification is 400000 made possible upon the write of the correspondinginode. When a correspondinginode does get written, blocksthat 300000

are classified as metadata are passed on to the Metadata 200000 Remapper for remapping; they are then written out to persistent storage at the remapped location. Blocks clas- 100000 Number of 4KB Unclassified Blocks

sified as data are discarded at that time. All entries in the 0 UBS corresponding to that inode are also removed. 0 10 20 30 40 50 60 70 80 90 Time in units of 10 secs The UBS is implementedas a list of block I/O(bio) re- quest structures. An extra reference to the memory pages Figure 4: Memory usage with Journal Snooping. pointed to by these bio structures is held by David as long Figure 4 compares the memory pressure with and they remain in the UBS; this reference ensures that these without Journal Snooping demonstrating its effective- pages are not mistakenly freed until the UBS is able to ness. It shows the number of 4 KB block I/O requests classify and persist them on disk, if needed. In order resident in the UBS sampled at 10 sec intervals during to reduce the inode parsing overhead otherwise imposed the creation of a 24 GB file on ext3; the file system is for each inode write, David maintains a list of recently mounted on top of David in ordered journaling mode written inode blocks that need to be processed and uses with a commit interval of 5 secs. This experiment was a separate kernel thread for parsing. run on a dual core machine with 2 GB memory. Since this workload is data write intensive, without Journal 3.1.2 Journal Snooping Snooping, the system runs out of memory when around Storing unclassified blocks in the UBS can cause a strain 450,000 bio requests are in the UBS (occupying roughly on available memory in certain situations. In particular, 1.8 GB of memory). Journal Snooping ensures that the when ext3 is mounted on top of David in ordered jour- memory consumed by outstanding bio requests does not naling mode, all the data blocks are written to disk at go beyond a maximum of 240 MB. journal-commit time but the metadata blocks are written to disk only at the checkpoint time which occurs much 3.2 Explicit Metadata Notification less frequently. This results in a temporary yet precari- David is meant to be useful for a wide variety of file sys- ous build up of data blocks in the UBS even though they tems; explicit metadata notification provides a mecha- are bound to be squashed as soon as the corresponding nism to rapidly adopt a file system for use with David. inode is written; this situation is especially true when Since data writes can come only from the benchmark ap- large files (e.g., 10s of GB) are written. In order to en- plication in user-space whereas metadata writes are is- sure the overall scalability of David, handling large files sued by the file system, our approach is to identify the and the consequent explosion in memory consumption is data blocks before they are even written to the file sys- critical. To achieve this without any modification to the tem. Our implementation of explicit notification is thus ext3 filesystem, David performs Journal Snooping in the file-system agnostic – it relies on a small modification block device driver. to the page cache to collect additional information. We David snoops on the journal commit traffic for inodes demonstrate the benefits of this approach using btrfs, a and indirect blocks logged within a committed transac- file system quite unlike ext3 in design. tion; this enables block classification even prior to check- When an application writes to a file, David captures point. When a journal-descriptor block is written as part the pointers to the in-memory pages where the data con- of a transaction, David records the blocks that are being tent is stored, as it is being copied into the page cache. 
logged within that particular transaction. In addition, all Subsequently, when the writes reach David, they are journal writes within that transaction are cached in mem- compared against the captured pointer addresses to de- ory until the transaction is committed. After that, the in- cide whether the write is to metadata or data. Once the odes and their corresponding direct and indirect blocks presence is tested, the pointer is removed from the list are processed to allow block classification; the identified since the same page can be reused for metadata writes in data blocks are squashed from the UBS and the iden- the future. tified metadata blocks are remapped and stored persis- There are certainly other ways to implement explicit tently. The challenge in implementing Journal Snooping notification. One way is to capture the checksum of the was to handle the continuousstream of unorderedjournal contents of the in-memory pages instead of the pointer blocks and reconstruct the journal transaction. to track data blocks. One can also modify the file system

5 to explicitly flag the metadata blocks, instead of identi- figure 5 shows some examples of the different types of fying data blocks with the page cache modification. We content that can be generated. believe our approach is easier to implement, does not re- Many systems that read back previously written data quire any file system modification, and is also easier to do not care about the specific content within the files as extend to software RAID since parity blocks are auto- long as there is some content (e.g., a file-system backup matically classified as metadata and not discarded. utility, or the Postmark benchmark). Much in the same way as failure-oblivious computing generates values to 4 Metadata Remapping service reads to invalid memory while ignoring invalid writes [18], David randomly generates content to service Since David exports a target pseudo device of much out-of-bound read requests. higher capacity to the file system than the available stor- Some systems may expect file contents to have valid age device, the bio requests issued to the pseudo device syntax or semantics; the performance of these systems will have addresses in the full target range and thus need depend on the actual content being read (e.g., a desk- to be suitably remapped. For this purpose, David main- top search engine for a file system, or a spell-checker). tains a remap table called Metadata Remapper which For such systems, naive content generation would either maps “target” addresses to “available” addresses. The crash the application or give poor benchmarking results. Metadata Remapper can contain an entry either for one David produces valid file content leveraging prior work metadata block (e.g., super block), or a range of metadata on generating file-system images [2]. blocks (e.g., group descriptors); by allowing an arbitrary Finally, some systems may expect to read back data range of blocks to be remapped together, the Metadata exactly as they wrote earlier (i.e., a read-after-write or Remapper provides an efficient translation service that RAW dependency) or expect a precise structure that can- also provides scalability. Range remapping in addition not be generated arbitrarily (e.g., a binary file or a con- preserves sequentiality of the blocks if a disk is used as figuration file). David provides additional support to run the backing store. In addition to the Metadata Remapper, these demanding applications using the RAW Store, de- a remap bitmap is maintained to keep track of free and signed as a cooperative resource visible to the user and used blocks on the available physical device; the remap configurable to suit the needs of different applications. bitmap supports allocation both of a single remapped Our current implementation of RAW Store is very sim- block and a range of remapped blocks. ple: in order to decide which data blocks need to be The destination (or remapped) location for a request stored persistently, David requires the application to sup- is determined using a simple algorithm which takes as ply a list of the relevant file paths. David then looks up input the number of contiguous blocks that need to be the inode number of the files and tracks all data blocks remapped and finds the first available chunk of space pointed to by these inodes, writing them out to disk us- from the remap bitmap. This can be done statically or at ing the Metadata Remapper just as any metadata block. 
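A minimal user-space sketch of the range-remapping logic described in Section 4 is shown below; the first-fit scan over a free bitmap mirrors the description above, but the class name, single-block bitmap granularity, and error handling are our simplifications rather than David's actual data structures.

```python
class RangeRemapper:
    """First-fit range remapping over a remap bitmap (illustrative only)."""

    def __init__(self, available_blocks):
        self.used = [False] * available_blocks   # the "remap bitmap"
        self.table = {}                          # (target_start, length) -> avail_start

    def _first_fit(self, length):
        run = 0
        for i, bit in enumerate(self.used):
            run = run + 1 if not bit else 0
            if run == length:
                return i - length + 1
        raise RuntimeError("available backing store exhausted")

    def remap(self, target_start, length=1):
        key = (target_start, length)
        if key not in self.table:                # allocate once, reuse for reads
            start = self._first_fit(length)
            for i in range(start, start + length):
                self.used[i] = True
            self.table[key] = start
        return self.table[key]

    def release(self, target_start, length=1):   # on metadata deallocation
        start = self.table.pop((target_start, length))
        for i in range(start, start + length):
            self.used[i] = False

r = RangeRemapper(available_blocks=1024)
print(r.remap(2_000_000_000, length=16))   # e.g., a run of group descriptors -> 0
print(r.remap(2_500_000_123))              # a single dynamically allocated block -> 16
```

Remapping whole ranges with one table entry is what keeps the translation service compact and, when a disk is the backing store, preserves the sequentiality of the original metadata layout.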
runtime; for the ext3 file system, since most of the blocks In the future, we intend to support more nuanced ways to are statically allocated, the remapping for these blocks maintain the RAW Store; for example, specifying direc- can also be done statically to improve performance. Sub- tories instead of files, or by using Memoization [14]. sequent writes to othermetadata blocks are remappeddy- For applications that must exactly read back a signif- namically; when metadata blocks are deallocated, corre- icant fraction of what they write, the scalability advan- sponding entries from the Metadata Remapper and the tage of David diminishes; in such cases the benefits are remap bitmap are removed. From our experience, this primarily from the ability to emulate new devices. simple algorithm lays out blocks on disk quite efficiently. More sophisticated allocation algorithms based on local- 6 Storage Model and Emulation ity of reference can be implemented in the future. Not having access to the target storage system requires 5 Data Generator David to precisely capture the behavior of the entire stor- age stack with all its dependencies through a model. The David services the requirements of systems oblivious to storage system modeled by David is the target system file content with data squashing and metadata remapping. and the system on which it runs is the available system. However, many real applications care about file content; David emulates the behavior of the target disk by send- the Data Generator with David is responsible for gener- ing requests to the available disk (for persistence) while ating synthetic content to service read requests to data simultaneously sending the target request stream to the blocks that were previously discarded. Different systems Storage Model; the model computes the time that would can have different requirements for the file content and have taken for the request to finish on the target system the Data Generator has various options to choose from; and introduces an appropriate delay in the actual request

6 weqw ueqw 8((0jkw the quick brown fox %PDF−1.4 umask 027 weqwe0 289402qw jumps over the lazy dog %[] if ( ! $?TERM ) then setenv TERM qw (*() )) hhkjh)) 2a 5 0 obj endif 900 bgt GG*&& aa a quick movement of <> if ($TERM==unknown) kja kjwert jhuyfg aa jeopardize six gunboats stream then set noglob; sdnjo abcd2 dsll**(1 x<9C>]%q eval ‘tset −s −r −Q"? wqw 421−0 e−w1‘vc) the jaded zombies acted <8D><80> $TERM"‘; unset noglob 2323 d−−POs&^! aab quietly but kept driving trailer << /Size 75 /Root endif qw HUK;0922 dds 12 their oxen forward 1 0 R /Info 2 0 R /ID>> if($TERM==unknown) startxref 1052 %%EOF goto loop endif rand.txt pangrams.txt compress.pdf config.RAW Figure 5: Examples of content generation by Data Generator. The figure shows a randomly generated text file, a text file with semantically meaningful content, a well-formatted PDF file, and a config file with precise syntax to be stored in the RAW Store. Parameter H1 H2 Parameter H1 H2 Disk size 80 GB 1 TB Cache segments 11 500 Rotational Speed 7200 RPM 7200 RPM Cache R/W partition Varies Varies Number of cylinders 88283 147583 Bus Transfer 133 MBps 133 MBps Number of zones 30 30 Seek profile(long) 3800+(cyl*116)/103 3300+(cyl*5)/106 Sectors per track 567 to 1170 840 to 1680 Seek profile(short) 300+p(cyl 2235) 700+√cyl Cylinders per zone 1444 to 1521 1279 to 8320 Head switch 1.4 ms∗ 1.4 ms On-disk cache size 2870 KB 300 MB Cylinder switch 1.6 ms 1.6 ms Disk cache segment 260 KB 600 KB Dev driver req queue∗ 128-160 128-160 Req scheduling∗ FIFO FIFO Req queue timeout∗ 3 ms (unplug) 3 ms (unplug) Table 1: Storage Model Parameters in David. Lists important parameters obtained to model disks Hitachi HDS728080PLA380 (H1) and Hitachi HDS721010KLA330 (H2). ∗denotes parameters of I/O request queue (IORQ). stream before returning control. Figure 3 presented in 2 will be interesting to explore using Disksim for the disk shows this setup more clearly. § model. Disksim is a detailed user-space disk simulator As a general design principle, to support low-overhead which allows for greater flexibility in the types of device modeling without compromising accuracy, we avoid us- properties that can be simulated along with their degree ing any technique that either relies on storing empiri- of detail; we will need to ensure it does not appreciably cal data to compute statistics or requires table-based ap- slow down the emulation when used without memory as proaches to predict performance [6]; the overheads for backing store. such methods are directly proportional to the amount 6.1.1 Disk Drive Profile of runtime statistics being maintained which in turn de- pends on the size of the disk. Instead, wherever applica- The disk model requires a number of drive-specific pa- ble, we have adopted and developed analytical approxi- rameters as input, a list of which is presented in the first mations that did not slow the system down; our resulting column of Table 1; currently David contains models for models are sufficiently lean while being fairly accurate. two disks: the Hitachi HDS728080PLA380 80 GB disk, and the Hitachi HDS721010KLA330 1 TB disk. We To ensure portability of our models, we have refrained have verified the parameter values for both these disks from making device-specific optimizations to improve through carefully controlled experiments. David is en- accuracy; we believe current models in David are fairly visioned for use in environments where the target drive accurate. 
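The kind of per-request arithmetic such a disk model performs can be illustrated with the toy calculation below. The constants loosely mimic the H1 column of Table 1, but the seek-profile formulas, the short/long seek boundary, and the unit assumption (microseconds) are our reading of the table entries and should be treated as placeholders rather than David's calibrated values.

```python
import math

class ToyDiskModel:
    """Simplified RW-style service-time estimate (all times in microseconds)."""
    RPM = 7200
    SECTORS_PER_TRACK = 800      # assumed mid-zone value
    BUS_MBPS = 133               # bus transfer rate

    def __init__(self):
        self.head_cyl = 0
        self.full_rev_us = 60e6 / self.RPM   # ~8333 us per revolution

    def seek_us(self, cyl):
        dist = abs(cyl - self.head_cyl)
        self.head_cyl = cyl
        if dist == 0:
            return 0.0
        if dist < 2235:                      # short-seek regime (assumed boundary)
            return 300 + math.sqrt(dist * 2235)
        return 3800 + dist * 116 / 1000      # long-seek profile

    def service_us(self, cyl, sectors):
        rotation = self.full_rev_us / 2      # expected half-rotation latency
        media = sectors / self.SECTORS_PER_TRACK * self.full_rev_us
        bus = sectors * 512 / (self.BUS_MBPS * 1e6) * 1e6
        return self.seek_us(cyl) + rotation + media + bus

m = ToyDiskModel()
print(f"8 KB read, long seek : {m.service_us(cyl=40_000, sectors=16) / 1000:.2f} ms")
print(f"8 KB read, short seek: {m.service_us(cyl=40_010, sectors=16) / 1000:.2f} ms")
```

A real model additionally tracks zones, head and cylinder switches, and the cache state described above; the sketch only shows how seek, rotation and transfer components combine into a per-request service time.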
The models are also adaptive enough to be eas- itself may not be available; if users need to model addi- ily configured for changes in disk drives and other pa- tional drives, they need to supply the relevant parameters. rameters of the storage stack. We next present some de- Disk seeks, rotation time and transfer times are modeled tails of the disk model and the storage stack model. muchin the same way asproposedin the RW model. The 6.1 DiskModel actual parameter values defining the above properties are specific to a drive; empirically obtained values for the David’s disk model is based on the classical model pro- two disks we model are shown in Table 1. posed by Ruemmler and Wilkes [19], henceforth referred as the RW model. The disk model contains informa- 6.1.2 Disk Cache Modeling tion about the disk geometry (i.e., platters, cylinders and The drive cache is usually small (few hundred KB to a zones) and maintains the current disk head position; us- few MB) and serves to cache reads from the disk me- ing these sources it models the disk seek, rotation, and dia to service future reads, or to buffer writes. Unfortu- transfer times for any request. The disk model also keeps nately, the drive cache is one of the least specified com- track of the effects of disk caches (track prefetching, ponents as well; the cache managementlogic is low-level write-through and write-back caches). In the future, it firmware code which is not easy to model.

7 David models the number and size of segments in the David exports multiple block devices with separate ma- disk cache and the number of disk sector-sized slots in jor and minor numbers; it differentiates requests to dif- each segment. Partitioning of the cache segments into ferent devices using the major number. For the pur- read and write caches, if any, is also part of the informa- pose of performance benchmarking, David uses a - tion contained in the disk model. David models the read gle memory-based backing store for all the compressed cache with a FIFO eviction policy. To model the effects RAID devices. Using multiple threads, the Storage of write caching, the disk model maintains statistics on Model maintains separate state for each of the devices the current size of writes pending in the disk cache and being emulated. Requests are placed in a single request the time needed to flush these writes out to the media. queue tagged with a device identifier; individual Storage Write buffering is simulated by periodically emptying a Model threads for each device fetch one request at a time fraction of the contents of the write cache during idle from this request queue based on the device identifier. periods of the disk in between successive foreground re- Similar to the single device case, the servicing thread cal- quests. The cache is modeled with a write-throughpolicy culates the time at which a request to the device should and is partitioned into a sufficiently large read cache to finish and notifies completion using a callback. match the read-ahead behavior of the disk drive. David currently only provides mechanisms for simple software RAID emulation that do not need a model of a 6.2 Storage Stack Model software RAID itself. New techniques might be needed David also models the I/O request queues (IORQs) main- to emulate more complex commercial RAID configura- tained in the OS; Table 1 lists a few of its impor- tions, for example, commercial RAID settings using a tant parameters. While developing the Storage Model, hardware RAID card. we found that accurately modeling the behavior of the IORQs is crucial to predict the target execution time cor- 7 Evaluation rectly. The IORQs usually have a limit on the maximum number of requests that can be held at any point; pro- We seek to answer four important questions. First, what cesses that try to issue an I/O request when the IORQ is is the accuracy of the Storage Model? Second, how ac- full are made to wait. Such waiting processes are wo- curately does David predict benchmark runtime and what ken up when an I/O issued to the disk drive completes, storage space savings does it provide? Third, can David thereby creating an empty slot in the IORQ. Once wo- scale to large target devices including RAID? Finally, ken up, the process is also granted privilege to batch a what is the memory and CPU overhead of David? constant number of additional I/O requests even when the IORQ is full, as long as the total number of requests 7.1 Experimental Platform is within a specified upper limit. Therefore, for applica- We have developed David for the Linux operating sys- tions issuing bursty I/O, the time spent by a request in the tem. The hard disks currently modeled are the 1 TB IORQ can outweigh the time spent at the disk by several Hitachi HDS721010KLA330 (referred to as D1TB) and orders of magnitude; modeling the IORQs is thus crucial the 80 GB Hitachi HDS728080PLA380 (referred to as for overall accuracy. D80GB); table 1 lists their relevant parameters. 
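The effect described in Section 6.2 — queueing delay dominating device time for bursty workloads — shows up even in a toy replica of a bounded I/O request queue. The sketch below is ours and deliberately ignores request merging and the scheduling policies that David also models.

```python
def replay(requests, service_us, queue_depth=128):
    """Toy replica of a bounded I/O request queue (merging/scheduling omitted)."""
    completions = []       # completion times of requests issued so far
    device_free = 0.0
    for arrival, lba in requests:
        # Requests still outstanding at this arrival occupy queue slots; if all
        # slots are taken, the issuing process sleeps until one completes.
        outstanding = [t for t in completions if t > arrival]
        start = arrival if len(outstanding) < queue_depth else min(outstanding)
        device_free = max(device_free, start) + service_us(lba)
        completions.append(device_free)
    return completions

burst = [(0.0, 4096 * i) for i in range(512)]      # one large burst of requests
done = replay(burst, service_us=lambda lba: 250.0)
print(f"last of {len(done)} requests finishes at {done[-1] / 1000:.1f} ms; "
      f"most of that is time spent waiting for a queue slot, not disk time")
```

With 512 requests arriving at once and only 128 queue slots, most of a request's latency is queueing rather than media time, which is why replicating the request-queue behavior is essential for accurate runtime prediction.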
Unless Disk requests arriving at David are first enqueued into specified otherwise, the following hold for all the experi- a replica queue maintained inside the Storage Model. ments: (1) machine used has a quad-core Intel processor While being enqueued, the disk request is also checked and 4GB running Linux 2.6.23.1 (2) ext3 file sys- for a possible merge with other pending requests: a com- tem is mounted in ordered-journaling mode with a com- mon optimization that reduces the number of total re- mit interval of 5 sec (3) microbenchmarks were run di- quests issued to the device. There is a limit on the num- rectly on the disk without a file system (4) David predicts ber of disk requests that can be merged into a single disk the benchmark runtime for a target D1TB while in fact request; eventually merged disk requests are dequeued running on the available D80GB (5) to validate accuracy, from the replica queue and dispatched to the disk model David was instead run directly on D1TB. to obtain the service time spent at the drive. The replica queue uses the same request scheduling policy as the tar- 7.2 Storage Model Accuracy get IORQ. First, we validate the accuracy of Storage Model in pre- dicting the benchmark runtime on the target system. 6.3 RAID Emulation Since our aim is to validate the accuracy of the Stor- David can also provide effective RAID emulation. To age Model alone, we run David in a model only mode demonstrate simple RAID configurations with David, where we disable block classification, remapping and each component disk is emulated using a memory- data squashing. David just passes down the requests that backed “compressed” device underneath software RAID. it receives to the available request queue below. We run

[Figure 6 panels (CDFs of measured vs. modeled request times): Sequential Reads (Demerit: 24.39), Random Reads (Demerit: 5.51), Sequential Writes (Demerit: 0.08), Random Writes (Demerit: 0.02); the time axis is in 1 µs units for sequential reads and 100 µs units for the rest.]

Figure 6: Storage Model accuracy for Sequential and Random Reads and Writes. The graph shows the cumulative distribution of measured and modeled times for sequential and random reads and writes.
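The demerit figures quoted above compare the measured and modeled distributions; a common way to compute a Ruemmler–Wilkes-style demerit is the root-mean-square horizontal distance between the two CDFs, which the sketch below implements on made-up samples (the exact definition used in the paper's evaluation may differ, and the numbers here are illustrative only).

```python
def demerit(measured, modeled):
    """RMS horizontal distance between two time CDFs (equal sample counts)."""
    a, b = sorted(measured), sorted(modeled)
    assert len(a) == len(b), "resample to equal counts before comparing"
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

measured = [110, 120, 125, 400, 410, 8200, 8300, 8400]   # made-up service times (us)
modeled  = [100, 118, 130, 395, 405, 8150, 8350, 8500]
print(f"demerit = {demerit(measured, modeled):.1f} us")
```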

Figure 7: Storage Model accuracy. The graphs show the cumulative distribution of measured and modeled times for the following workloads from left to right: Postmark, Webserver, Varmail and Tar.

                     Storage (ext3)                                  Implicit Classification – Ext3          Explicit Notification – Btrfs
Workload       Original (KB)   David (KB)   Savings (%)     Orig (s)   David (s)   Error (%)       Orig (s)   David (s)   Error (%)
mkfs           976762584       7900712      99.19           278.66     281.81      1.13            -          -           -
imp            11224140        18368        99.84           344.18     339.42      -1.38           327.294    324.057     0.99
tar            21144           628          97.03           257.66     255.33      -0.9            146.472    135.014     7.8
grep           -               -            -               250.52     254.40      1.55            141.960    138.455     2.47
virus scan     -               -            -               55.60      47.95       -13.75          27.420     31.555      15.08
find           -               -            -               26.21      26.60       1.5             -          -           -
du             -               -            -               102.69     101.36      -1.29           -          -           -
postmark       204572          404          99.80           33.23      29.34       -11.69          22.709     22.243      2.05
webserver      3854828         3920         99.89           127.04     126.94      -0.08           125.611    126.504     0.71
varmail        7852            3920         50.07           126.66     126.27      -0.31           126.019    126.478     0.36
sr             -               -            -               40.32      44.90       11.34           40.32      44.90       11.34
rr             -               -            -               913.10     935.46      2.45            913.10     935.46      2.45
sw             -               -            -               57.28      58.96       2.93            57.28      58.96       2.93
rw             -               -            -               308.74     291.40      -5.62           308.74     291.40      -5.62

Table 2: David Performance and Accuracy. Shows savings in capacity, accuracy of runtime prediction, and the overhead of storage modeling for different workloads. Webserver and varmail are generated using FileBench; virus scan using AVG.
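The derived columns of Table 2 follow from two simple formulas — savings = (1 − with/without) and error = (with − without)/without — which the snippet below applies to a few rows (small differences against the table are due to rounding of the reported values).

```python
rows = {                      # (original, with-David) pairs taken from Table 2
    "webserver storage (KB)": (3_854_828, 3_920),
    "webserver runtime (s)":  (127.04, 126.94),
    "postmark runtime (s)":   (33.23, 29.34),
}
for name, (orig, david) in rows.items():
    if "storage" in name:
        print(f"{name}: savings = {100 * (1 - david / orig):.2f}%")
    else:
        print(f"{name}: error = {100 * (david - orig) / orig:+.2f}%")
```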

David on top of D1TB and set the target drive to be the demerit figures of 24.39, 5.51, 0.08, and 0.02 respec- same. Note that the available system is the same as the tively, as computed using the Ruemmler and Wilkes target system for these experiments since we only want methodology[19]. The large demerit for sequential reads to compare the measured and modeled times to validate is due to a variance in the available disk’s cache-read the accuracy of the Storage Model. Each block request times; modeling the disk cache in greater detail in the fu- is traced along its path from David to the disk drive and ture could potentially avoid this situation. However, se- back. This is done in order to measure the total time that quential read requests do not contribute to a measurably the request spends in the available IORQ and the time large error in the total modeled runtime; they often hit spent getting serviced at the available disk. These mea- the disk cache and have service times less than 500 mi- sured times are then compared with the modeled times crosecondswhile other types of disk requests take around obtained from the Storage Model. 20 to 35 milliseconds to get serviced. Any inaccuracy in Figure 6 shows the Storage Model accuracy for four the modeled times for sequential reads is negligible when micro-workloads: sequential and random reads, and se- compared to the service times of other types of disk re- quential and random writes; these micro-workloads have quests; we thus chose to not make the disk-cache model

9 800 800 more complex for the sake of sequential reads. WOD Space 700 D Space 700 Figure 7 shows the accuracy for four different macro WOD Time workloads and application kernels: Postmark [13], web- 600 D Time 600 server (generated using FileBench [15]), Varmail (mail WOD Space server workload using FileBench), and a Tar workload 500 500 (copy and untar of the linux kernel of size 46 MB). 400 400 The FileBench Varmail workload emulates an NFS 300 300 mail server, similar to Postmark, but is multi-threaded 200 WOD Time 200 instead. The Varmail workload consists of a set of D Time open/read/close, open/append/close and deletes in a 100 100 Runtime (100s of seconds)

Actual Storage space used (GB) D Space single directory, in a multi-threaded fashion. The 0 0 FileBench webserver workload comprises of a mix of 0 100 200 300 400 500 600 700 800 open/read/close of multiple files in a directory tree. In File System Impression size (GB) addition, to simulate a webserver log, a file append oper- ation is also issued. The workload consists of 100 threads Figure 8: Storage Space Savings and Model Accu- issuing 16 KB appends to the weblog every 10 reads. racy. The “Space” lines show the savings in storage space Overall, we find that storage modeling inside David is achieved when using David for the impressions workload with quite accurate for all workloads used in our evaluation. file-system images of varying sizes until 800GB; “Time” lines The total modeled time as well as the distribution of the show the accuracy of runtime prediction for the same workload. individual request times are close to the total measured WOD: space/time without David, D: space/time with David. time and the distribution of the measured request times. tar uses the GNU tar utility to create a gzipped archive of the file-system image of size 10 GB created by imp; 7.3 David Accuracy it writes the newly created archive in the same file sys- Next, we want to measure how accurately David predicts tem. This workload is a data read and data write inten- the benchmark runtime. Table 2 lists the accuracy and sive workload. The data reads are satisfied by the Data storage space savings provided by David for a variety of Generator without accessing the available disk, while the benchmark applications for both ext3 and btrfs. We have data writes end up being squashed. chosen a set of benchmarks that are commonly used and grep uses the GNU grep utility to search for the ex- also stress various paths that disk requests take within pression “nothing” in the content generated by both imp David. The first and second columns of the table show and tar. This workload issues significant amounts of data the storage space consumed by the benchmark workload reads and small amounts of metadata reads. virus scan without and with David. The third column shows the runs the AVG virus scanner on the file-system image cre- percentage savings in storage space achieved by using ated by imp. find and du run the GNU find and GNU du David. The fourth column shows the original bench- utilities over the content generated by both imp and tar. mark runtime without David on D1TB. The fifth column These two workloads are metadata read only workloads. shows the benchmark runtime with David on D80GB. The David works well under both the implicit and ex- sixth column shows the percentage error in the predic- plicit approaches demonstrating its usefulness across file tion of the benchmark runtime by David. The final three systems. Table 2 shows how David provides tremen- columns show the original and modeled runtime, and the dous savings in the required storage capacity, upwards of percentage error for the btrfs experiments; the storage 99% (a 100-fold or more reduction) for most workloads. space savings are roughly the same as for ext3. The sr, David also predicts benchmark runtime quite accurately. rr, sw, and rw workloads are run directly on the raw de- Prediction error for most workloads is less than 3%, al- vice and hence are independent of the file system. though for a few it is just over 10%. 
The errors in the mkfs creates a file system with a 4 KB block size over predicted runtimes stem from the relative simplicity of the 1 TB target device exported by David. This workload our in-kernel Disk Model; for example, it does not cap- only writes metadata and David remaps writes issued by ture the layout of physical blocks on the magnetic media mkfs sequentially starting from the beginning of D80GB; accurately. This information is not published by the disk no data squashing occurs in this experiment. manufacturers and experimental inference is not possible imp creates a realistic file-system image of size 10 GB for ATA disks that do not have a command similar to the using the publicly available Impressions tool [2]. A total SCSI mode page. of 5000 regularfiles and 1000directories are created with an average of 10.2 files per directory. This workload is 7.4 David Scalability a data-write intensive workload and most of the issued David is aimed at providing scalable emulation using writes end up being squashed by David. commodity hardware; it is important that accuracy is

10 Num Disks Rand R Rand W Seq R Seq W 100 % 100 D Mem Measured 3 232.77 72.37 119.29 119.98 80 % 80 CPU lines 2 156.76 72.02 119.11 119.33 60 % 60 1 78.66 71.88 118.65 118.71 40 % 40 Modeled 3 238.79 73.77 119.44 119.40 20 % 20

SM Mem Memory used (MB)

2 159.36 72.21 119.16 119.21 CPU Busy Percentage 0 % 0 1 79.56 72.15 118.95 118.83 0 10 20 30 40 50 60 70 Table 3: David Software RAID-1 Emulation. Shows Time (in units of 5 Seconds) IOPS for a software RAID-1 setup using David with memory as WOD CPU D CPU D Mem backing store; workload issues 20000 read and write requests SM CPU SM Mem through concurrent processes which equal the number of disks in the experiment. 1 disk experiments run w/o RAID-1. Figure 9: David CPU and Memory Overhead. Shows the memory and percentage CPU consumption by David while not compromised at larger scale. Figure 8 shows the creating a 10 GB file-system image using impressions. WOD accuracy and storage space savings provided by David CPU: CPU without David, SM CPU: CPU with Storage Model while creating file-system images of 100s of GB. Using alone, D CPU: total CPU with David, SM Mem: Storage an available capacity of only 10 GB, David can model the Model memory alone, D Mem: total memory with David. runtime of Impressions in creating a realistic file-system image of 800 GB; in contrast to the linear scaling of the using several counters; David provides support to mea- target capacity demanded, David barely requires any ex- sure the memory usage of its different components using tra available capacity. David also predicts the benchmark ioctls. To measure the CPU overhead of the Storage runtime within a maximum of 2.5% error even with the Model alone, David is run in the model-only mode where huge disparity between target and available disks at the block classification, remapping and data squashing are 800 GB mark, as shown in Figure 8. turned off. The reason we limit these experiments to a target ca- In our experience with running different workloads, pacity of less than 1 TB is because we had access to only we found that the memory and CPU usage of David is a terabyte sized disk against which we could validate the acceptable for the purposes of benchmarking. As an ex- accuracy of David. Extrapolating from this experience, ample, Figure 9 shows the CPU and memory consump- we believe David will enable one to emulate disks of 10s tion by David captured at 5 second intervals while cre- or 100s of TB given the 1 TB disk. ating a 10 GB file-system image using Impressions. For this experiment, the Storage Model consumes less than 1 7.5 Davidfor RAID MB of memory; the average memory consumed in total We present a brief evaluation and validation of software by David is less than 90 MB, of which the pre-allocated RAID-1 configurations using David. Table 3 shows a cache used by the Journal Snooping to temporarily store simple experiment where David emulates a multi-disk the journal writes itself contributes 80 MB. Amount of software RAID-1 (mirrored) configuration; each device CPU used by the Storage Model alone is insignificant, is emulated using a memory-disk as backing store. How- however implicit classification by the Block Classifier ever, since the multiple disks contain copies of the same is the primary consumer of CPU using 10% on average block, a single physical copy is stored, further reducing with occasional spikes. The CPU overhead is not an is- the memory footprint. In each disk setup, a set of threads sue at all if one uses explicit notification. which equal in number to the number of disks issue a to- tal of 20000 requests. David is able to accurately emulate the software RAID-1 setup upto 3 disks; more complex 8 Related Work RAID schemes are left as part of future work. 
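A quick sanity check on Table 3: for RAID-1, independent random reads can be spread across the mirrors while every write must reach all of them, so read IOPS should scale roughly with the number of disks and write IOPS should stay flat. The snippet below reproduces that scaling and the modeled-versus-measured error from the table's numbers.

```python
measured = {  # (random-read IOPS, random-write IOPS) from Table 3
    1: (78.66, 71.88), 2: (156.76, 72.02), 3: (232.77, 72.37),
}
modeled = {
    1: (79.56, 72.15), 2: (159.36, 72.21), 3: (238.79, 73.77),
}
base_read, base_write = measured[1]
for disks in (1, 2, 3):
    mr, mw = measured[disks]
    er, ew = modeled[disks]
    # RAID-1 intuition: reads scale with mirrors, writes hit every mirror.
    print(f"{disks} disk(s): read x{mr / base_read:.2f} (expect ~{disks}), "
          f"write x{mw / base_write:.2f} (expect ~1.0), "
          f"model error {100 * (er - mr) / mr:+.1f}% / {100 * (ew - mw) / mw:+.1f}%")
```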
Memulator [10] makes a great case for why storage em- 7.6 David Overhead ulation provides the unique ability to explore nonexistent David is designed to be used for benchmarking and not storage components and take end-to-end measurements. as a production system, thus scalability and accuracy are Memulator is a “timing-accurate” storage emulator that the more relevant metrics of evaluation; we do however allows a simulated storage component to be plugged want to measure the memory and CPU overhead of us- into a real system running real applications. Memula- ing David on the available system to ensure it is prac- tor can use the memory of either a networked machine or tical to use. All memory usage within David is tracked the local machine as the storage media of the emulated

11 disk, enabling full system evaluation of hypothetical stor- dilation for emulating network speeds orders of mag- age devices. Although this provides flexibility in device nitude faster than available [11]. Time dilation allows emulation, high-capacity devices requires an equivalent one to experiment with unmodified applications running amount of memory; David provides the necessary scala- on commodity operating systems by subjecting them to bility to emulate such devices. In turn, David can benefit much faster network speeds than actually available. from the networked-emulation capabilities of Memula- A key challenge in David is the ability to identify data tor in scenarios when either the host machine has limited and meta-data blocks. Besides SDS [21], XN, the stable CPU and memory resources, or when the interference of storage system for the Xok exokernel [12] dealt with sim- running David on the same machine competing for the ilar issues. XN employed a template of metadata trans- same resources is unacceptable. lation functions called UDFs specific to each file type. One alternate to emulation is to simply buy a larger ca- The responsibility of providing UDFs rested with the file pacity or newer device and use it to run the benchmarks. system developer, allowing the kernel to handle arbitrary This is sometimes feasible, but often not desirable. Even metadata layouts without understanding the layout itself. if one buys a larger disk, in the future they would need Specifying an encoding of the on-disk scheme can be an even larger one; David allows one to keep up with this tricky for a file system such as ReiserFS that uses dy- arms race without always investing in new devices. Note namic allocation; however, in the future, David’s meta- that we chose 1 TB as the upper limit for evaluation in data classification scheme can benefit from a more for- this paper because we could validate our results for that mally specified on-disk layout per file-system. size. Having a large disk will also not address the issue of emulating much faster devices such as SSDs or RAID 9 Conclusion configurations. David emulates faster devices through an efficient use of memory as backing store. David is born out of the frustration in doing large-scale Another alternate is to simulate the storage component experimentation on realistic storage hardware – a prob- under test; disk simulators like Disksim [7] allow such an lem many in the storage community face. David makes it evaluation flexibly. However, simulation results are often practical to experiment with benchmarks that were oth- far from perfect [9] – they fail to capture system depen- erwise infeasible to run on a given system, by transpar- dencies and require the generation of representative I/O ently scaling down the storage capacity required to run traces which is a challenge in itself. the workload. The available backing store under David can be orders of magnitude smaller than the target de- Finally, one might use analytical modeling for the stor- vice. David ensures accuracy of benchmarking results age devices; while very useful in some circumstances, by using a detailed storage model to predict the runtime. it is not without its own set of challenges and limita- In the future, we plan to extend David to include support tions [20]. In particular, it is extremely hard to capture for a numberof other useful storage devices and configu- the interactions and complexities in real systems. Wher- rations. 
9 Conclusion

David is born out of the frustration of doing large-scale experimentation on realistic storage hardware, a problem many in the storage community face. David makes it practical to experiment with benchmarks that were otherwise infeasible to run on a given system, by transparently scaling down the storage capacity required to run the workload. The available backing store under David can be orders of magnitude smaller than the target device. David ensures accuracy of benchmarking results by using a detailed storage model to predict the runtime. In the future, we plan to extend David to include support for a number of other useful storage devices and configurations. In particular, the Storage Model can be extended to support flash-based SSDs using an existing simulation model [5]. We believe David will be a useful emulator for file and storage system evaluation.

10 Acknowledgments

We thank the anonymous reviewers and Rob Ross (our shepherd) for their feedback and comments, which have substantially improved the content and presentation of this paper. The first author thanks the members of the Storage Systems Group at NEC Labs for their comments and feedback.

This material is based upon work supported by the National Science Foundation under the following grants: CCF-0621487, CNS-0509474, CNS-0834392, CCF-0811697, CCF-0937959, as well as by generous donations from NetApp, Sun Microsystems, and Google. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF or other institutions.

References

[1] GraySort Benchmark. http://sortbenchmark.org/FAQ.htm#gray.

[2] N. Agrawal, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Generating Realistic Impressions for File-System Benchmarking. In Proceedings of the 7th Conference on File and Storage Technologies (FAST '09), San Francisco, CA, February 2009.

[3] N. Agrawal, W. J. Bolosky, J. R. Douceur, and J. R. Lorch. A Five-Year Study of File-System Metadata. In Proceedings of the 5th USENIX Symposium on File and Storage Technologies (FAST '07), San Jose, California, February 2007.

[4] N. Agrawal, W. J. Bolosky, J. R. Douceur, and J. R. Lorch. A five-year study of file-system metadata: Microsoft longitudinal dataset. http://iotta.snia.org/traces/list/Static, 2007.

[5] N. Agrawal, V. Prabhakaran, T. Wobber, J. D. Davis, M. Manasse, and R. Panigrahy. Design Tradeoffs for SSD Performance. In Proceedings of the USENIX Annual Technical Conference (USENIX '08), Boston, MA, June 2008.

[6] E. Anderson. Simple table-based modeling of storage devices. Technical Report HPL-SSP-2001-04, HP Laboratories, July 2001.

[7] J. S. Bucy and G. R. Ganger. The DiskSim Simulation Environment Version 3.0 Reference Manual. Technical Report CMU-CS-03-102, Carnegie Mellon University, January 2003.

[8] P. M. Chen and D. A. Patterson. A New Approach to I/O Performance Evaluation: Self-Scaling I/O Benchmarks, Predicted I/O Performance. In Proceedings of the 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '93), pages 1–12, Santa Clara, California, May 1993.

[9] G. R. Ganger and Y. N. Patt. Using system-level models to evaluate I/O subsystem designs. IEEE Transactions on Computers, 47(6):667–678, 1998.

[10] J. L. Griffin, J. Schindler, S. W. Schlosser, J. S. Bucy, and G. R. Ganger. Timing-accurate Storage Emulation. In Proceedings of the 1st USENIX Symposium on File and Storage Technologies (FAST '02), Monterey, California, January 2002.

[11] D. Gupta, K. Yocum, M. McNett, A. C. Snoeren, A. Vahdat, and G. M. Voelker. To infinity and beyond: time-warped network emulation. In Proceedings of the 3rd Conference on Networked Systems Design and Implementation (NSDI '06), San Jose, CA, 2006.

[12] M. F. Kaashoek, D. R. Engler, G. R. Ganger, H. Briceño, R. Hunt, D. Mazières, T. Pinckney, R. Grimm, J. Jannotti, and K. Mackenzie. Application Performance and Flexibility on Exokernel Systems. In Proceedings of the 16th ACM Symposium on Operating Systems Principles (SOSP '97), pages 52–65, Saint-Malo, France, October 1997.

[13] J. Katcher. PostMark: A New File System Benchmark. Technical Report TR-3022, Network Appliance Inc., October 1997.

[14] J. Mayfield, T. Finin, and M. Hall. Using automatic memoization as a software engineering tool in real-world AI systems. In Proceedings of the Conference on Artificial Intelligence for Applications, page 87, 1995.

[15] R. McDougall. Filebench: Application level file system benchmark. http://www.solarisinternals.com/si/tools/filebench/index.php.

[16] E. L. Miller. Towards scalable benchmarks for mass storage systems. In Proceedings of the 5th NASA Goddard Conference on Mass Storage Systems and Technologies, 1996.

[17] E. Riedel, M. Kallahalla, and R. Swaminathan. A Framework for Evaluating Storage System Security. In Proceedings of the 1st USENIX Symposium on File and Storage Technologies (FAST '02), pages 14–29, Monterey, California, January 2002.

[18] M. Rinard, C. Cadar, D. Dumitran, D. M. Roy, T. Leu, and W. S. Beebee, Jr. Enhancing Server Availability and Security Through Failure-Oblivious Computing. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI '04), San Francisco, CA, December 2004.

[19] C. Ruemmler and J. Wilkes. An Introduction to Disk Drive Modeling. IEEE Computer, 27(3):17–28, March 1994.

[20] E. Shriver. Performance modeling for realistic storage devices. PhD thesis, New York University, New York, NY, USA, 1997.

[21] M. Sivathanu, V. Prabhakaran, F. I. Popovici, T. E. Denehy, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Semantically-Smart Disk Systems. In Proceedings of the 2nd USENIX Symposium on File and Storage Technologies (FAST '03), pages 73–88, San Francisco, California, April 2003.

[22] Standard Performance Evaluation Corporation. SPECmail2009 Benchmark. http://www.spec.org/mail2009/.

[23] A. Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto, and G. Peck. Scalability in the XFS File System. In Proceedings of the USENIX Annual Technical Conference (USENIX '96), San Diego, California, January 1996.

[24] A. Traeger and E. Zadok. How to cheat at benchmarking. In USENIX FAST Birds of a Feather session, San Francisco, CA, February 2009.

[25] S. C. Tweedie. Journaling the Linux ext2fs File System. In The Fourth Annual Linux Expo, Durham, North Carolina, May 1998.

[26] Wikipedia. Btrfs. http://en.wikipedia.org/wiki/Btrfs, 2009.

[27] M. Wittle and B. E. Keith. LADDIS: The next generation in NFS file server benchmarking. In USENIX Summer Technical Conference, pages 111–128, 1993.

[28] E. Zadok. File and Storage Systems Benchmarking Workshop. UC Santa Cruz, CA, May 2008.
