THE MAGAZINE OF USENIX & SAGE April 2002 • Volume 27 • Number 2

inside: CONFERENCE REPORTS CONFERENCE ON FILE AND STORAGE TECHNOLOGIES (FAST ‘02)

conference reports

This issue's reports focus on the Conference on File and Storage Technologies (FAST 2002), held in Monterey, California, January 28-30, 2002.

OUR THANKS TO THE SUMMARIZERS:
Ismail Ari
Scott Banachowski
Zachary Peterson

CONFERENCE ON FILE AND STORAGE TECHNOLOGIES
MONTEREY, JANUARY 28-30, 2002

KEYNOTE I
STORAGE: FROM ATOMS TO PEOPLE
Robert Morris, IBM Almaden Research Center

Summarized by Zachary Peterson

Dr. Morris began by defining the importance and motivation of the FAST conference. Storage is getting faster and larger; in fact, it has increased by 14 orders of magnitude. However, these increases are only interesting when they aid computer scientists. Morris asserted that "storage determines the way we use computers" and, therefore, is a technology worthy of investigation, the most important existing technology being the disk drive.

Morris enumerated the challenges that face the disk drive and how IBM Research plans to address them. The greatest of these challenges is the hard, physical limit at which the magnetic properties used to store data no longer hold, called the superparamagnetic limit - a limit that has been passed and re-predicted a few times. IBM has pushed this limit out by various means of manipulating the physical organization of the magnetic media. Making the bits more square and smaller, combined with a layering of magnetic substrates, enables current production drives to achieve greater capacities with a higher signal-to-noise ratio. IBM hopes to continue this trend in their future production disks by reducing the size of bits to a single grain and by utilizing electron-beam lithography to create very small and accurate components.

IBM also looks beyond the standard disk drive architecture, and the limitations inherent in such a design, for the future of storage. The disk arm is too confining: "We need disk fingers," said Morris. He went on to introduce microelectromechanical systems, or MEMS-based devices. One MEMS device would contain many read/write heads operating in parallel on a single media surface. IBM has produced a prototype of such a device, called "Millipede," that uses an array of heated heads to make pits in a polymer media surface.

Morris concluded by charging the attending researchers of futuristic storage to consider an ideal case where storage devices will be self-organizing, self-optimizing, and self-protecting. He believes the IBM IceCube is the beginning of such devices. Many IceCubes are placed physically contiguous with each other in three dimensions, reducing the space needed to manage a large storage array. When an IceCube fails, it is simply left in the structure, letting the other devices recover around it. This is the first step IBM Research is making toward self-managing storage, and they hope to continue this trend through an ideology called "autonomic computing." This concept transcends storage and will affect all levels of context-based computing. In general, researchers need to move toward an environment where systems are easy to use and easy to maintain for the end user, while still providing the performance and capacity gains seen in the past.

SESSION: SECURE STORAGE
Summarized by Zachary Peterson

STRONG SECURITY FOR NETWORK-ATTACHED STORAGE
Ethan Miller, Darrell Long, University of California, Santa Cruz; William Freeman, TRW; Benjamin Reed, IBM Research

Ethan Miller presented a set of security protocols that provide an on-disk method of securing data in a network-attached storage system. Even someone who absconds with a disk using strong security cannot gain access to the data. Additionally, the presence of an authentication scheme means that maliciously changed data can be detected.

Miller presented three schemes of security, each offering higher levels of protection with slightly decreased system performance. In scheme 1, each block is secured using public-key encryption and signed using a hash function. Scheme 2 extends this model to include an HMAC for added authentication and security but increases processing time at the client and the server. Scheme 3 avoids using the slow public-key encryption methods used in schemes 1 and 2, and replaces them with a secure keyed-hash approach. Results of these three schemes, compared to a baseline system with no security, showed that the public-key encryption schemes suffer significantly in sequential I/O operations. However, the last scheme shows only slightly degraded performance, about 1% to 20% degradation, compared to the baseline. This work demonstrates that on-disk security and authentication for network-attached storage can be achieved efficiently using a keyed-hash approach.
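The summary does not reproduce the paper's actual protocol, but the flavor of the keyed-hash idea behind scheme 3 can be sketched as follows. The key handling, the block-naming scheme, and the use of HMAC-SHA1 here are illustrative assumptions, not the authors' construction.

```python
# Minimal sketch of keyed-hash block authentication in the spirit of "scheme 3".
# The shared key, block identifiers, and MAC choice are assumptions for
# illustration only; the paper's protocol is not reproduced here.
import hmac
import hashlib
import os

KEY = os.urandom(32)          # shared secret between client and disk (assumed)

def protect_block(block_id: int, data: bytes) -> bytes:
    """Return a MAC binding the block's identity to its contents."""
    msg = block_id.to_bytes(8, "big") + data
    return hmac.new(KEY, msg, hashlib.sha1).digest()

def verify_block(block_id: int, data: bytes, tag: bytes) -> bool:
    """Detect maliciously modified or relocated blocks."""
    return hmac.compare_digest(tag, protect_block(block_id, data))

tag = protect_block(7, b"payload")
assert verify_block(7, b"payload", tag)
assert not verify_block(7, b"tampered", tag)
```

Because only a keyed hash is computed per block, the per-request cost stays close to that of an unprotected system, which is the point the measured 1%-20% overhead illustrates.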

A FRAMEWORK FOR EVALUATING STORAGE SYSTEM SECURITY
Erik Riedel, Mahesh Kallahalla, and Ram Swaminathan, HP Labs

Erik Riedel asserted that there is a need for a quantitative evaluation of storage security. This is because storage has unique properties that differentiate it from other security applications, such as networks. Properties such as sharing, distribution, and persistence make applying network security ideas unsatisfactory. He went on to develop a framework of security variables, such as user operations, encryption methods, and attacks, that, when permuted, expose the benefits and drawbacks of categories of existing storage security. This framework is especially useful for comparing aspects of security and performance for various methods of security. Riedel then showed some trace-driven simulator results that, when applied to the common framework, illustrate that encrypt-on-disk systems are a preferred method of security over encrypt-on-wire, providing the best security for the least effort. The framework and the analysis can be applied to answer questions beyond this particular result and to different environments.

ENABLING THE ARCHIVAL STORAGE OF SIGNED DOCUMENTS
Petros Maniatis and Mary Baker, Stanford University

Consider a situation where two people agree to a contract, the contract is digitally signed by each person, and it is archived. Significantly later, one of the signers challenges the contract. What problems arise with the passage of time? Petros Maniatis addressed these issues, providing one possible solution that extends traditional archival storage to support archiving of long-term contracts.

As time passes, issues arise that make it difficult to ensure the long-term validity of signed data, the sensitivity of keys being the most outstanding issue. Keys are lost, names are changed, and digital certificates expire. This issue begs two questions: "Can one trust a 30-year-old signature key?" and "How does one verify such a signature?" Maniatis introduced KASTS, a key archival service that uses time stamping and timed storage of keys as an answer to these questions. KASTS uses two main components, a Time-Stamping Server (TSS) and a Key Archival Service (KAS), to establish a time of signing and an effective method for verifying old signatures. KASTS uses a versioned and balanced tree for the public keys of signatures. Maniatis argued that this structure is a feasible and effective method of storing keys. For more information, refer to http://identiscape.stanford.edu/.

SESSION: PERFORMANCE AND MODELING
Summarized by Scott Banachowski

WOLF - A NOVEL REORDERING WRITE BUFFER TO BOOST THE PERFORMANCE OF LOG-STRUCTURED FILE SYSTEMS
Jun Wang and Yiming Hu, University of Cincinnati

Log-structured file systems make good use of disk bandwidth by combining several writes into a single sequential disk access. However, one shortcoming of log-structured file systems is the overhead incurred from cleaning. Cleaning is the process of reclaiming space in a segment occupied by obsolete blocks; by rewriting the segment's live blocks to the log, the entire segment is freed.

Jun Wang presented a method (called WOLF) for reducing the cleaning overhead of log-structured file systems. The key idea comes from the observation that file accesses form a bimodal distribution: some files are repeatedly rewritten while others rarely change. If data is classified along this bimodal distribution when written to disk, each type of data can be stored in separate segments. Over time, segments of rewritten data will have almost all their blocks quickly invalidated, and segments of infrequently modified data will accumulate few holes.

WOLF uses an adaptive grouping algorithm to identify active and inactive data and assigns the data to separate log segments. Using this method, rewritten data may be reordered into a bimodal distribution of segments, leaving little work for the cleaner. The algorithm tracks segment buffer block accesses with reference counters over a time window, initialized to 10 minutes, to determine which kind of segment the data belongs to.

Wang described the performance of a WOLF implementation adapted from the Sprite LFS source. The metric used in measurements was overall write cost, a value that incorporates garbage-collection overhead by including the expense of reading and rewriting cleaned blocks. WOLF achieved a 25-35% overall write performance improvement over LFS, and a 53% reduction in cleaning overhead.
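As a rough illustration of the segregation idea described above, the toy sketch below counts recent rewrites per block and directs frequently rewritten blocks and rarely changed blocks into separate segment buffers. The threshold, data structures, and window handling are invented for illustration and are not WOLF's actual algorithm.

```python
from collections import defaultdict

class SegregatingWriteBuffer:
    """Toy sketch of WOLF-style segregation: blocks rewritten often within a
    recent window go to an 'active' segment buffer, everything else to an
    'inactive' one. Threshold and window policy are illustrative only."""

    def __init__(self, hot_threshold=2):
        self.rewrites = defaultdict(int)   # block -> rewrites seen this window
        self.active, self.inactive = [], []
        self.hot_threshold = hot_threshold

    def write(self, block_id, data):
        self.rewrites[block_id] += 1
        target = (self.active if self.rewrites[block_id] >= self.hot_threshold
                  else self.inactive)
        target.append((block_id, data))

    def new_window(self):
        """Called periodically (e.g., every 10 minutes) to age the counters."""
        self.rewrites.clear()
```

Keeping hot and cold blocks in different segments is what lets the active segments empty themselves through invalidation while the inactive segments stay nearly full of live data, so the cleaner has little to do in either case.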

STORAGE-AWARE CACHING: REVISITING CACHING FOR HETEROGENEOUS STORAGE SYSTEMS
Brian C. Forney, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin

Brian Forney spoke about a storage-aware caching algorithm. The research addresses cache buffer policies that do not work well in heterogeneous storage systems. An example problem is a client with a single cache that presents a uniform workload to both a slow and a fast disk. Data from the fast disk pushes data from the slow disk out of the cache, causing the client to repeatedly access the slow disk for cache misses. The presented solution is to make caching policies aware of device performance.

A cache is partitioned into variable-sized buffers, each with its own policy. This allows the cache to adapt to workload changes. The challenge is deciding how to partition the cache: a static allocation is simple yet wasteful, so a desired approach is to dynamically adjust the partition sizes according to access patterns. The dynamic algorithm records delays during a window of disk requests to measure device behavior, and balances the allocation of partitions based on the relative performance.

The caching algorithm was evaluated using a simulation combining a disk model with a network model. The simulation, configured with 16-disk RAIDs, was fed a synthetic workload and a Web server trace. Forney found that their implementation performed similarly to LANDLORD (an algorithm comparison rather than an implementation comparison). The simulation showed that their policy alleviated dramatic performance drops caused by naive caching policies when a slow disk is in the system.

An interesting comment posed to Forney was that the assumption of a uniform workload may not hold, because systems usually try to migrate infrequently accessed data to slower disks. Forney conceded that this will reduce the effect of the algorithm but added that layout is more of a long-term decision whereas storage-aware caching makes short-term decisions. Both need to be made, and made cooperatively. Additionally, there may be situations, such as accessing remote data beyond the administrative control of the user of the data, where changing layout may not be possible.
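The repartitioning step can be pictured with the toy sketch below, which gives each device a share of the cache proportional to the delay recently observed at that device, so slower devices keep more of their blocks cached. The proportional rule and the numbers are illustrative assumptions, not the paper's algorithm.

```python
def rebalance_partitions(partitions, observed_delay, total_buffers):
    """Toy sketch of storage-aware cache partitioning: size each device's
    partition in proportion to the delay recently measured at that device.
    This simple proportional rule stands in for the dynamic algorithm
    described in the talk."""
    total_delay = sum(observed_delay.values()) or 1.0
    for dev in partitions:
        share = observed_delay.get(dev, 0.0) / total_delay
        partitions[dev] = max(1, int(share * total_buffers))
    return partitions

# e.g., a slow disk averaging 9 ms per request vs. a fast disk averaging 1 ms:
print(rebalance_partitions({"slow": 0, "fast": 0},
                           {"slow": 9.0, "fast": 1.0}, total_buffers=1000))
# -> {'slow': 900, 'fast': 100}
```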

TIMING-ACCURATE STORAGE EMULATION
John Linwood Griffin, Jiri Schindler, Steven W. Schlosser, John C. Bucy, and Gregory R. Ganger, Carnegie Mellon University

John Linwood Griffin talked about the Memulator system developed at CMU. The Memulator is an emulator for MEMS-based storage devices that uses timing-accurate storage emulation (TASE). TASE combines a device simulator with a timing manager so that the emulator responds to input with accurate response times. This tool is helpful for testing MEMS storage devices, because although we can simulate the devices, we don't yet have them available to test in real systems with real applications.

The TASE system consists of three components: the communication manager is responsible for simulating the device on a bus and translating bus signals to simulator requests; the data manager uses a RAM-based cache to hold the data stored on the device; and the timing manager keeps the system state, timing information, and the simulation engine. An obvious limitation of a TASE system is that it must be capable of responding to requests faster than the device it emulates, and the memory must be large enough to cache the data required by the application using the device.

The TASE was validated by comparing the performance of an emulated disk drive to the real disk. The emulator's response-time errors were within 0.1% of the real device. To show that non-existent devices may be tested using real benchmarks, Linwood presented performance results of the MEMS Memulator from the PostMark benchmarks.

SESSION: HANDLING DISASTER
Summarized by Zachary Peterson

VENTI: A NEW APPROACH TO ARCHIVAL DATA STORAGE
Sean Quinlan and Sean Dorward, Bell Labs, Lucent Technologies
Best Paper Award

In the absence of his co-workers, Rob Pike spoke on behalf of the authors of the Venti file system. Venti's key contribution is its ability to manage an on-disk archive of data efficiently and quickly. The motivation for this work is that tape is slow and difficult to manage. Additionally, secondary magnetic storage has become cheap and plentiful, meaning that tape is no longer a cost-effective solution. Venti accomplishes a new architecture for tapeless archiving by providing a write-once block interface to all files. In essence, all files are copy-on-written, creating a new, authenticated version of a file for every write performed.

Authentication and manageability are satisfied by using the SHA1 hash function, which distills all data blocks into a 20-byte digest. Pike quipped that "Venti can compress any amount of data to 20 bytes." This SHA1 digest can further be used to identify redundant data blocks for reclamation to free disk space. However, Pike believed this idea of reclamation is transient. With average disk sizes increasing so rapidly, it's time to "let go" of issues of capacity management.

In implementation, the SHA1 performed reasonably fast (60MBps at 700MHz) and provided an efficient random-access facility to any block in the file system. Results from a Venti prototype show that after four years of file system activity, Venti experienced only a 10% increase in file system overhead, and could effectively reduce the size of data stored on a file system by up to 76%.
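The write-once, content-addressed interface described above can be sketched in a few lines: each block is named by the SHA1 digest of its contents, so identical blocks coalesce and every read is self-verifying. This is only a minimal in-memory sketch of the idea; Venti's on-disk layout, indexing, and archival semantics are not modeled.

```python
import hashlib

class WriteOnceBlockStore:
    """Minimal sketch of a Venti-like write-once, content-addressed store:
    blocks are addressed by the SHA1 digest of their contents, duplicate
    writes coalesce, and reads verify the data against its address."""

    def __init__(self):
        self.blocks = {}                       # digest -> data

    def put(self, data: bytes) -> bytes:
        score = hashlib.sha1(data).digest()    # the 20-byte block address
        self.blocks.setdefault(score, data)    # rewriting identical data is a no-op
        return score

    def get(self, score: bytes) -> bytes:
        data = self.blocks[score]
        assert hashlib.sha1(data).digest() == score   # self-verifying read
        return data

store = WriteOnceBlockStore()
s1 = store.put(b"block contents")
s2 = store.put(b"block contents")              # same digest, stored once
assert s1 == s2 and store.get(s1) == b"block contents"
```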

MYRIAD: COST-EFFECTIVE DISASTER TOLERANCE
Fay Chang, Minwen Ji, Shun-Tak Leung, John MacCormick, Sharon Perl, and Li Zhang, Compaq SRC

John MacCormick spoke about achieving fault tolerance in a distributed RAID environment. The research presented introduces a distributed storage system called Myriad, which claims to achieve levels of fault tolerance and performance similar to a local RAID with mirroring. He began with an example of how Myriad operates. Assume a system that has many sites connected by a WAN. To achieve high fault tolerance, each site's data blocks exist in a "redundancy group." This group protects data from disaster by using cross-site checksums and erasure codes. MacCormick argues that this distributed method of checksums is more cost effective than mirroring. Essentially, this architecture can use less disk, hence decreasing the total cost of ownership, while not sacrificing reliability.

Implementing cross-site checksums presented some challenges, which the speaker addressed. First, checksum updates are not idempotent and must use version numbers to enforce consistency. Additionally, the overwriting of data blocks can "unprotect" other blocks in the same redundancy group; therefore, overwrites are not done in place. MacCormick showed results of this architecture that illustrated a high level of reliability, similar to a double-mirrored RAID. Myriad was also shown to be able to reduce the total cost of ownership by up to 25%.

SNAPMIRROR: FILE-SYSTEM-BASED ASYNCHRONOUS MIRRORING FOR DISASTER RECOVERY
R. Hugo Patterson, Stephen Manley, Mike Federwisch, Dave Hitz, Steve Kleiman, and Shane Owara, Network Appliance

Computer data has become so important to its owners that data loss and downtime are not only an inconvenience but, in many cases, can cost an enterprise millions of dollars in revenue. Backing up data has two coarse-grain solutions. Tape can be used to perform daily snapshots, which is cheap but still faces the possibility of losing an entire day's work. A more reliable approach is to use an online, synchronized mirror of the data; however, this can be very expensive for large data sets and can be bandwidth intensive. Hugo Patterson presented work that finds a medium between these two poles, called SnapMirror. SnapMirror is an asynchronous, periodic mirroring tool that batches updates to maintain data integrity. The frequency and size of batches can be tuned to increase reliability, but this also increases the need for network bandwidth. The key idea of this work is that by lagging behind, and not performing synchronous updates, a SnapMirror system can reduce cost and bound potential data loss.

SnapMirror uses the existing WAFL file system metadata to find block-level differences in the updates, thus avoiding full scans of the data. New data is written to new disk blocks, so by copying newly allocated disk blocks one has copied all newly written disk blocks. The results presented showed that, on average, using asynchronous 15-minute updates can reduce data transfers by 50%, hourly updates by 58%, and daily updates by over 75%. The results indicate that by adjusting the "frequency knob" on updates, SnapMirror can fill the cost and performance void between tape-based and synchronous mirroring-based backup systems.
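The "copy the newly allocated blocks" idea can be pictured with the toy sketch below, where each snapshot is reduced to the set of block numbers it has allocated. Because new data always lands in newly allocated blocks, the set difference between the current and previously mirrored snapshot is exactly what needs to be sent. Representing snapshots as plain sets is a simplification for illustration, not WAFL's actual metadata.

```python
def blocks_to_transfer(prev_snapshot, curr_snapshot):
    """Toy sketch of the SnapMirror idea: diff the allocation metadata of two
    snapshots instead of scanning the data. Blocks allocated since the last
    mirrored snapshot are the only ones that must be transferred."""
    return sorted(curr_snapshot - prev_snapshot)

prev = {1, 2, 3, 8}          # blocks allocated at the last mirrored snapshot
curr = {1, 2, 3, 9, 10}      # blocks allocated now (8 was freed, 9-10 written)
print(blocks_to_transfer(prev, curr))   # -> [9, 10]
```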

WORK-IN-PROGRESS REPORTS
Summarized by Ismail Ari

Chair Scott Brandt gathered the speakers and explained the rules: "14 short talks, 8.57 minutes per talk, just enough to introduce your idea." Scott also promised not to tackle them off the podium as long as they obeyed the rules. The players were excited, the whistle was blown, and the game began. These papers can be found at http://www.usenix.org/events/fast02/wips.html.

STORAGEAGENT: AN AGENT-BASED APPROACH FOR DYNAMIC RESOURCE SHARING IN A STORAGE SERVICE PROVIDER (SSP) INFRASTRUCTURE
Sandeep Uttamchandani, IBM Almaden Research Center

The resources to be shared in a storage server are cache, memory, and CPU. This paper presents an agent-based architecture to improve the throughput and latency of data access by leasing resources to agents and reclaiming these resources when required. This way the client sharing is not ad hoc but is controlled and efficiently utilized by a resource manager. Beyond data allocation and performance monitoring, the architecture also envisions agent monitoring and data security via access tickets.

NFS OVER RDMA
Brent Callaghan, Sun Microsystems

NFS traffic over gigabit networks takes 90% of the CPU while resulting in low throughput. Using Remote Direct Memory Access (RDMA) protocols, they expect NFS to make full and efficient use of gigabit networks. NFS over RDMA over transport (GigE, IWARP, FC, InfiniBand, etc.) is claimed to be much more efficient than NFS over TCP/IP or UDP/IP over transport. The audience compared the benefits of this system to DAFS and TCP Offload Engines (TOE).

THE CASE FOR MASSIVE ARRAYS OF IDLE DISKS (MAID)
Dennis Colarelli, Dirk Grunwald, and Michael Neufeld, University of Colorado, Boulder

The talk focused on the power usage of disk arrays and how the power cost could be reduced by file migration and disk spin-down techniques, to make the RAIDs (or MAIDs) comparable in price to tape libraries. Even if we assume the disk and tape unit prices to be the same, the big difference in power usage of these two archival storage systems renders disk arrays a costlier choice. At 7.25 cents per kWh, assuming 24x7 data center operation, it would cost $9,400 to power the tape library system vs. $91,500 to power the disks in the disk array.

FEDERATED FILE SYSTEMS FOR CLUSTERS WITH REMOTE MEMORY COMMUNICATION
Suresh Gopalakrishnan, DiscoLab, Rutgers University

Federated file system (FedFS) provides a global namespace for distributed applications and is built as a layer on top of local file systems. While file access, permissions, and consistency are taken care of by the local file system, load balancing, file migration, and global naming are handled by FedFS. Virtual directory entries are cached at each node, and updates are exchanged periodically. File lookups go through the virtual directory managers.

AN ITERATIVE TECHNIQUE FOR DISTILLING A WORKLOAD'S IMPORTANT PERFORMANCE INFORMATION
Zachary Kurmas, Georgia Tech; Kimberly Keeton, HP Labs

The idea is to extract information from a workload trace to synthetically generate workloads with similar performance. They try to obtain useful attribute sets (mean request size, inter-arrival times, run counts) by subtracting from or adding to initially chosen attributes and measuring whether the last action changed the response-time distribution. If not, then that attribute is not included. The goal is to maximize the potential benefit of all attributes in a certain attribute group. The usefulness and viability of the technique was questioned for systems not trained with that specific workload.

LARGER DISK BLOCKS OR NOT?
Steve McCarthy, Mike Leis, and Steve Byan, Maxtor Corporation

To continue with the 100% doubling of disk capacity, obstacles to increased bits per inch should be removed. In his presentation Steve McCarthy proposed an increase in sector size from 512 bytes to 4096 bytes. The additional capacity gain will be 10-12%. However, the question was whether the gain from a larger sector size would be lost back due to internal fragmentation.

LAZY PARITY UPDATE: A TECHNIQUE TO IMPROVE WRITE I/O PERFORMANCE OF DISK ARRAY TOLERATING DOUBLE DISK FAILURES
Young Jin Nam, Dae-Woong Kim, Tae-Young Choe, and Chanik Park, Pohang University of Science and Engineering, Kyungbuk, Republic of Korea

RAID6 can tolerate double disk failures; however, its write I/O performance is 66% of RAID5. Young Jin Nam presented a lazy parity update (LPU) technique to improve the write I/O performance of RAID6. LPU separates parity groups into a forward parity group (FPG) and a backward parity group (BPG), and updates to the BPG are deferred until the RAID is idle.

THE ARMADA FRAMEWORK FOR PARALLEL I/O ON COMPUTATIONAL GRIDS
Ron Oldfield and David Kotz, Dartmouth College

Ron Oldfield presented the Armada framework for building I/O-access paths for data-intensive grid applications. Filtering the tremendous amounts of raw data is the bottleneck for these applications. They propose the use of distributed filters, trying to reduce the amount of data transferred through the network and to provide a mechanism for arranging access to distributed and replicated data sets. Fault-tolerance issues are not resolved yet.

IBM STORAGE TANK[TM]: A DISTRIBUTED STORAGE SYSTEM
D.A. Pease et al., IBM Almaden Research Center; R.C. Burns, Johns Hopkins University; Darrell D.E. Long, University of California, Santa Cruz

David Pease introduced IBM's Storage Tank, a distributed object file system that allows heterogeneous file sharing in a SAN. It has load balancing, fail-over processing, and integrated backup and restore capabilities. More information can be found at http://www.almaden.ibm.com/cs/storagesystems/stortank.

DATA PLACEMENT BASED ON THE SEEK TIME ANALYSIS OF A MEMS-BASED STORAGE DEVICE
Zachary N.J. Peterson, Scott A. Brandt, Darrell D.E. Long, University of California, Santa Cruz

Zachary Peterson presented his simulation results for access times in MEMS-based storage devices. He pointed out the similarities and differences between disk and MEMS and explained how these differences should affect data layout in MEMS. He identified equivalence regions, or regions of media that share the same seek time, on the device in which data could be placed efficiently. Someone noted that "CMU says their disk layout works well with MEMS"; Zachary replied, "This work is an alternative approach, and further comparisons must be done," which summarizes his presentation.

LOGISTICAL NETWORKING RESEARCH AND THE NETWORK STORAGE STACK
James S. Plank, Micah Beck, and Terry Moore, University of Tennessee

Micah Beck started by saying, "Everything you know about network storage is wrong," to invite people to view the end-to-end picture in remote storage access rather than the networked storage device alone. He presented the proposed network storage stack, which has a Logistical File System, L-Bone, and the Internet Backplane Protocol (IBP), an IP equivalent for storage. For details, see the Logistical Computing and Internetworking (LoCI) project at the University of Tennessee: http://loci.cs.utk.edu.

ENHANCING NFS CROSS-ADMINISTRATIVE DOMAIN ACCESS
Joseph Spadavecchia and Erez Zadok, Stony Brook University

Erez Zadok claims NFS actually stands for "No File Security." The NFS access model is weak. The server depends on the client to specify the user credentials to use and has no flexible mechanism to map or restrict the credentials given by the client. Second, there is no mechanism to hide data from users who do not have privileges to access it. They address these problems with a combination of (1) range-mapping, which allows the NFS server to restrict and flexibly map the credentials set by the client, and (2) file-cloaking, which allows the server to control the data a client is able to view or access beyond normal UNIX semantics.

CONQUEST: BETTER PERFORMANCE THROUGH A DISK/PERSISTENT-RAM HYBRID FILE SYSTEM
An-I A. Wang, Peter Reiher, Gerald J. Popek, UCLA; Geoffrey H. Kuenning, Harvey Mudd College

Andy Wang presented Conquest, which shows what to do with tons of battery-backed RAM or MRAM. Metadata and small files are kept in memory, while only the contents of large files go to disk. Most other file systems are designed for disk.

COOPERATIVE BACKUP SYSTEM
Sameh Elnikety and Willy Zwaenepoel, Rice University; Mark Lillibridge, Compaq SRC; Mike Burrows, Microsoft Research

This paper presents the design of a novel backup system built on top of a peer-to-peer architecture with minimal supporting infrastructure. The system can be deployed for both large-scale and small-scale peer-to-peer overlay networks. It allows computers connected to the Internet to back up their data cooperatively.

KEYNOTE II
AVAILABILITY AND MAINTAINABILITY > PERFORMANCE: NEW FOCUS FOR A NEW CENTURY
David Patterson, University of California, Berkeley

Summarized by Scott Banachowski

David Patterson started by listing the three most important aspects in building systems over the past 15 years: performance, performance, and cost performance. However, with so much emphasis on performance, we are forgetting to make systems that people want to maintain. In the future, the metric for comparing computer servers will shift emphasis from performance to availability.

The price of technology continues to decrease (Moore's law), but salaries increase over time, so the total cost of ownership of systems becomes dominated by the salaries of those maintaining them. Patterson showed recent data revealing that a third to a half of the price of systems goes into keeping them running (i.e., paying people to keep the system up). Patterson assured the audience that the world was behind him in his sentiment that availability and maintainability are serious issues by citing the ideas of such luminaries as Jim Gray, Butler Lampson, John Hennessy, and Bill Gates.

Patterson listed new goals for the research community to investigate in the next century: availability, changeability, maintainability, and evolutionary growth (ACME). Systems being used today are failing to meet desired standards in all four areas. Fault tolerance is not solving availability problems, difficult upgrade procedures hinder changeability, systems are unforgiving in their maintainability, and the back end of systems falls short of providing evolutionary growth.

As systems become more automated, the possibility of error now lies with designers and operators. However, automation typically addresses the easy tasks and hides the implementation, leaving mistake-prone humans to mess with the harder tasks, operating systems of increased complexity and reduced visibility. Should designers build margins of safety into their systems, the same way civil engineers beef up a bridge by adding fudge factors to the design parameters? The challenges that stand in the way of ACME are twofold: hardware and software failures plague us, and human error plagues us. Patterson quoted Shimon Peres (Peres's law): "If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time."

The path Patterson outlined to begin addressing the problem of ACME is to collect data on failures, create ACME benchmarks, start applying margins of safety in designs, and create and evaluate techniques for ACME. For example, a new benchmark might be "time to recover." We can inject failures into systems and measure the QoS during the recovery period. We also need to focus on making designs palatable to operators, not just end users.

Patterson took several questions and comments from the audience; he responded by reiterating the themes of the talk. We need to characterize failures before we can make failure-recovery benchmarks; the goal is to build systems that are forgiving of mistakes. An interesting observation is that people in the field are embarrassed by the state of computers. We are proud of the performance, but know that the ACME goals are lacking. When we come up with benchmarks that measure ACME, academics will be happy to do the research because they will have the ability to quantify their results and progress.

SESSION: WIDE-AREA STORAGE
Summarized by Ismail Ari

SAFETY, VISIBILITY, AND PERFORMANCE IN A WIDE-AREA FILE SYSTEM
Minkyong Kim, Landon Cox, and Brian Noble, University of Michigan

The goal is to help mobile clients reach their home file systems without giving up consistency and sharing, while avoiding WAN overhead. Client updates are held in nearby WayStations for safety and are periodically exchanged between WayStations and servers for update visibility through reconciliation. Due to the danger of out-of-order updates happening between reconciliations, WayStations keep file versions in "escrow" (cache) in the event that they are referenced at other replicas. A WayStation (replica site) that makes an update visible to another via reconciliation must retain a copy of the update for as long as the other replica may refer to it.

A pessimistic bilateral reconciliation, which locks the server until a two-phase commit succeeds, is compared to an optimistic unilateral reconciliation that assumes reconciliation messages will be received. A prototype consisting of a cache manager, a server, and a WayStation is implemented in Java.

Their trace analysis results show that commits by different users are separated by 1.9-2.9 hours, which gives enough time for WayStations to propagate shared data back to the server before it is needed. Only 0.01% of all operations caused sharing within 15 seconds. Fluid replication's update performance does not depend on wide-area connectivity. Ten megabytes of escrow space was enough, even for the worst cases. Mobile data was both safe and visible. Everybody was happy.

OBTAINING HIGH PERFORMANCE FOR STORAGE OUTSOURCING
Wee Teck Ng, Hao Sun, Bruce Hillyer, Elizabeth Shriver, Eran Gabber, and Banu Ozden, Bell Labs

Since storage outsourcing to Storage Service Providers (SSPs) is becoming a big market, it is important to do a performance and viability analysis of remote storage access over various network conditions.

This research implements a real testbed with two routers, hosts, disks, disk arrays, and storage gateways. A storage-over-IP (iSCSI) prototype has been implemented in FreeBSD. The FreeBSD dummynet package was used to introduce variable network delays, bandwidth limitations, and packet losses. A SmartBits program was used to introduce fixed background traffic. Using this setup, a number of application benchmarks (SSH, PostMark, TPC-C) have been tested with remote block-level access. The results show that network delay can adversely affect application performance but can be alleviated by caching and application prefetching. For fast networks, disks are the bottleneck. More information can be found at http://www.bell-labs.com/project/iSCSI.

PERSONALRAID: MOBILE STORAGE FOR DISTRIBUTED AND DISCONNECTED COMPUTERS
Sumeet Sobti, Nitin Garg, Chi Zhang, Xiang Yu, and Randolph Wang, Princeton University; Arvind Krishnamurthy, Yale University

PersonalRAID is designed for a single user to control a number of distributed and disconnected personal storage devices. It addresses the challenges mobile users face: lack of a single transparent storage space, ubiquitously available with a certain degree of reliability assurance; the inconvenience of manual data movement; and poor performance due to synchronization during disconnects and connects. With PersonalRAID the user never has to perform manual hoarding or manual propagation of data.

PersonalRAID has a log-structured file system (LFS) design, where disconnection is analogous to a graceful LFS shutdown, and connection (and replay) is analogous to LFS recovery. LFS was preferred because of its fast replaying and low recording overheads. The key component between the fixed hosts is the mobile storage device (such as the IBM microdrive), which is called Virtual-A, or VA (movable storage on Windows PCs is drive "A"). The fixed host is unusable without the VA plugged in, but this is a personal system, and it is assumed that the user will carry the VA with her.

Removing and adding hosts (reconfiguration) is easy. The system can also recover from both fixed-host and VA device losses, the latter being trickier. Several benchmark analyses show that transparency and reliability are achieved without any serious performance penalty.

SESSION: SELF-ORGANIZING STORAGE SYSTEMS
Summarized by Ismail Ari

The three presentations in this session were all made by groups from HP Labs in Palo Alto. Details can be found at http://www.hpl.hp.com/SSP.

HIPPODROME: RUNNING CIRCLES AROUND STORAGE ADMINISTRATION
Eric Anderson, Michael Hobbs, Kimberly Keeton, Susan Spence, Mustafa Uysal, and Alistair Veitch, HP Labs

Kimberly Keeton talked about storage system configuration challenges and the difficulty of understanding modern workloads. She stated that in today's world, human experts use rules of thumb and trial and error to iteratively configure storage systems, a process that often takes too long and results in incorrectly provisioned systems.

Hippodrome automates the iterative storage system configuration process. In each loop iteration, Hippodrome takes capacity and performance requirements for the workload as input, and efficiently searches a large space of possible storage designs to find a minimal-cost design that meets those requirements. Once a satisfactory design has been chosen, that design is implemented by adding or subtracting storage resources and potentially migrating data from an existing configuration to the target configuration. Finally, the running workload is analyzed to learn more about the performance required by the application, which can be used as input to the next iteration of the loop.

Hippodrome starts with as little as capacity information and converges to a valid storage design without further human intervention, often in only a few loop iterations. Convergence time can be decreased through the use of initial performance hints, even if they are inaccurate. Using near-minimal resources, the storage systems designed by Hippodrome provide performance within 15% of solutions determined by human experts. A prototype implementation constitutes a proof of concept for synthetic and email server workloads.
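The shape of that loop can be sketched as follows. The callables passed in are hypothetical stand-ins for the real analysis, solver, and migration components, and the toy "requirements" are just a capacity number; the sketch only illustrates the analyze/design/migrate iteration and its convergence test.

```python
def design_loop(analyze, solve, migrate, initial_requirements, max_iters=10):
    """Sketch of a Hippodrome-style loop: solve for a design from rough
    requirements, implement it, then re-analyze the running workload to
    refine the requirements, stopping when the design no longer changes."""
    design, requirements = None, initial_requirements
    for _ in range(max_iters):
        new_design = solve(requirements)       # search design space for a min-cost fit
        if new_design == design:               # no change: the loop has converged
            return design
        migrate(design, new_design)            # add/remove resources, move data
        design = new_design
        requirements = analyze(design)         # measure the workload on the new design
    return design

# Toy usage: requirements are a capacity figure; the "solver" rounds it up.
final = design_loop(
    analyze=lambda d: {"capacity": d["capacity"]},          # measurement is a no-op here
    solve=lambda req: {"capacity": ((req["capacity"] + 99) // 100) * 100},
    migrate=lambda old, new: None,
    initial_requirements={"capacity": 250},
)
print(final)   # -> {'capacity': 300}
```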
SELECTING RAID LEVELS FOR DISK ARRAYS
Eric Anderson, Ram Swaminathan, Alistair Veitch, Guillermo A. Alvarez, and John Wilkes, HP Labs

The manual rule of thumb in RAID-level selection is to "tag" the data with a RAID level before seeing the array configuration. In the automated design, a "solver" takes as input the workload description and the target array types and their configuration schemes. The output of the solver is a storage system design capable of supporting that workload.

The initial tests showed that solvers utilizing a pre-tagged workload resulted in expensive solutions, so they switched to using partially adaptive (deciding the RAID level associated with an individual store when first assigning the store to an LU) and fully adaptive (changing the RAID level of an LU even after stores had been assigned) solvers. Using a set of real workloads, they tested the tagger-based (using "rules of thumb" or analytic performance models) and the adaptive-solver schemes. The best results were obtained when the solver was allowed to revise its RAID-level selection decision at any time (fully adaptive). The benefits of the fully adaptive scheme outweighed its computational costs. This approach produces solutions with average cost/performance 14-17% better than the best results for the tagging solutions and 150-200% better than their worst solutions.

APPIA: AUTOMATIC STORAGE AREA NETWORK FABRIC DESIGN
Julie Ward, Troy Shahoumian, and John Wilkes, HP Labs; Michael O'Sullivan, Stanford University

SAN fabric design consists of connecting hosts to storage devices via hubs and switches, and it is an NP-hard problem. Manual designs usually over-provision just to be safe; however, SAN fabric is extremely costly. Appia saved millions of dollars by efficient allocation, and it found results in a few minutes that would have taken days otherwise. An even more important property of Appia is that the resulting designs can be proven to be correct, which reduces the chance of human error.

Two efficient algorithms for automatic SAN design are demonstrated in this paper: FlowMerge and QuickBuilder. FlowMerge tries to eliminate port violations (each flow should be assigned to a single port, and port counts on switches and storage devices are limited) by merging flowsets together. Flowset merges are also performed to eliminate fabric nodes to reduce cost where possible. QuickBuilder assigns flows to ports and then recursively builds fabric modules for each independent port group. In general, FlowMerge works better with sparse connectivity requirements and QuickBuilder with dense connectivity requirements.

SESSION: THE FUTURE OF STORAGE TECHNOLOGY
Summarized by Zachary Peterson

FUTURE MAGNETIC RECORDING TECHNOLOGIES
Mark Kryder, Seagate Research

This presentation examined the details of how magnetic recording works today and how physicists are using various technologies to push out the superparamagnetic limit to increase disk density. The superparamagnetic effect causes bits to destabilize and disappear when the media grains become too small. By using a demagnetizing field, researchers have been able to lengthen the mean time to destabilization, an effect they call "thermal relaxing." Using this technology Seagate has been able to create a 101Gb/in2-density disk, almost three times as dense as the originally predicted limit.

Other methods for increasing density were presented. By making bits more square, a higher signal-to-noise ratio can be achieved while increasing areal density. Similarly, by recording bits orthogonally to the media, instead of longitudinally, disks can reach greater density with less noise. However, complications in writing at such high densities cause problems on current magnetic media. Kryder suggests that changing the media to a more self-ordered magnetic array (SOMA) could provide a more viable solution. By using heat-assisted magnetic recording (HAMR) and a bow-tie antenna in combination with SOMA media, a more focused and exact media head can be developed, promising densities of 50Tb/in2.

NON-MAGNETIC DATA STORAGE: PRINCIPLES, POTENTIALS, AND PROBLEMS
Hans Coufal, IBM Almaden Research Center

Hans Coufal began his talk by showing the rapidly increasing capacity-and-speed growth rates, along with the exponentially decreasing costs, of disk drives. However, this trend will not continue, or else "we will have a disk with infinite capacity, infinite bandwidth, no latency, and we will give it away for free," Coufal joked. He went on to introduce a number of non-magnetic storage devices, or advanced data storage devices, that he feels will play a significant role in the future of storage.

MRAM is an emerging variation of non-volatile RAM that uses magnetics to provide a low-power, non-rotating medium, produced at the cost of DRAM, with the speed of SRAM. Coufal went on to show that the ultimate in capacity can be achieved: individual bits represented by xenon atoms have been demonstrated in a laboratory environment. A practical device using this technology, however, is in the distant future.

The next device shown was the IBM Millipede, a MEMS-based device that uses 32x32 read/write heads in parallel on polymer media, achieving a 40Gb/in2 density. IBM has already produced a working prototype of this device. Holographic storage also promises to be a useful and interesting medium. Shining a focused laser beam into a recording medium produces an entire page of memory. IBM has built a holographic device with a 150Gb/in2 density and a throughput of 1Gb/sec. What is really exciting is that this storage can be organized associatively, so that one could send a filtered reference request beam for "airplanes" and be returned an entire data page of airplane photographs.

Coufal warned that although these new storage devices are technologically feasible, their integration into actual products may take time. It is just a question of economics.

STORAGE BRICKS HAVE ARRIVED
Jim Gray, Microsoft Research

When disk drives first came into existence in the 1950s, every drive came equipped with a software interface that enabled people to use it. Drives have changed dramatically since then, but Jim Gray of Microsoft Research concludes that we will return to the built-in software interface paradigm. Every disk will be its own computer, with processors, memory, and interfacing software, something printers already do today. Consider a disk that runs Oracle or DB2, or both! This level of abstraction provides a pleasant layer to a user who wants performance and ease of use. Gray claimed that the central processing unit is not busy and should be thrown away, or, more precisely, given a more specific task to do.

An on-disk processor could, for instance, optimize disk arm placement, control network connections, or perform parallel I/O operations to many disks controlled by one CPU. In fact, some companies, such as Network Appliance, are already designing disks using this technology. By making disks into computers, a reduction in maintenance and per-byte cost can be achieved. These cost-benefit advantages make storage bricks a more attractive option than tape for large-capacity systems. Additionally, by adding an application layer to the disk, disks become user friendly, essentially a plug-and-play type of device.

SESSION: PARALLEL I/O
Summarized by Scott Banachowski

GPFS: A SHARED-DISK FILE SYSTEM FOR LARGE COMPUTING CLUSTERS
Frank Schmuck and Roger Haskin, IBM Almaden Research Center

IBM's GPFS, a shared-disk file system for parallel clusters, can be configured in a SAN, a symmetric cluster, or as a cluster of I/O nodes. There is an implementation of GPFS running on the ASCI White supercomputer at LLNL that has a throughput of 7GB/sec. The architecture combines large disks and RAIDs into a single file system. Features include wide striping, large blocks, multiple parallel nodes, byte-range locking, log recovery, RAID replication, and online management.

Schmuck chose to focus his talk on distributed locking mechanisms. GPFS uses token-based locks to avoid excess message passing to a central server. There is one token server, which grants permission to modify ranges of bytes within a file. When a node wants a currently held token, the holder must flush its data to disk and relinquish the token. For sequential write sharing, this mechanism has very low overhead. To increase performance for concurrent write sharing, a token request lists both currently required data and data desired in the future. This allows flexibility in the amount of data that nodes must relinquish when a token is requested, and leads to efficient concurrent write sharing that requires only a single revocation per node. In GPFS, there is an efficient way to handle metadata updates for shared writes. To prevent lock conflicts for metadata updates, one node is dynamically elected "metanode" and is responsible for collecting and merging the size and mtime updates from the other nodes that modify the file. The GPFS disk allocation map is interleaved so that a region of the map includes space on all disks; a node accessing a region of the map is able to allocate for striping patterns without contention with other nodes.

Some data on the write-sharing throughput of the system was presented; they found performance to scale linearly with the number of participating nodes, up to the point where the throughput exceeded the capability of the network switch (about 18 nodes).
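A stripped-down picture of the byte-range token mechanism described above is sketched below: a single token server hands out ranges of a file, and a node that asks for a range held by someone else forces that holder to flush and give the range up. Real GPFS tokens (required vs. desired ranges, negotiation, read/write modes) are considerably richer; this is an illustration only.

```python
class ByteRangeTokenServer:
    """Toy sketch of byte-range write tokens in the spirit of GPFS."""

    def __init__(self):
        self.grants = []                     # list of (start, end, node)

    def acquire(self, node, start, end, flush):
        # Find grants held by other nodes that overlap the requested range.
        revoked = [g for g in self.grants
                   if g[2] != node and not (end <= g[0] or g[1] <= start)]
        for g in revoked:                    # conflicting holders flush dirty data
            flush(g[2], g[0], g[1])          # ...and lose their token
            self.grants.remove(g)
        self.grants.append((start, end, node))

server = ByteRangeTokenServer()
server.acquire("node1", 0, 4096, flush=lambda n, s, e: None)
server.acquire("node2", 2048, 8192,
               flush=lambda n, s, e: print(f"{n} flushes bytes {s}-{e}"))
# -> node1 flushes bytes 0-4096
```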
EXPLOITING INTER-FILE ACCESS PATTERNS USING MULTI-COLLECTIVE I/O
Gokhan Memik, UCLA; Mahmut Kandemir, Pennsylvania State University; Alok Choudhary, Northwestern University

One problem facing scientific parallel applications is that several processing nodes may try to access multiple data structures simultaneously. Currently, applications either access multiple data structures stored in multiple files with poor performance or access multiple data structures from a single file, an approach that doesn't scale well. Gokhan Memik presented a technique called Multi-Collective I/O (MCIO), which is optimized for these data access patterns. Collective I/O (CIO) aims to help multiple nodes access a single file, but its access patterns don't match the case where the nodes access different data structures in those files.

In CIO, multiple nodes communicate with each other to consider the best way to access files by determining the storage pattern of all nodes' access requests, and then dividing the combined requests so that disks are accessed in an efficient way. Once fetched, the data is then communicated to the nodes that requested it. MCIO expands CIO by considering the interaction of multiple files and assigning the files to different nodes, then letting each subset of nodes do a collective I/O per file. The problem is shown to be NP-hard for arbitrary-sized files, so it requires heuristic approaches. Two such approaches are presented: a greedy algorithm that assigns a node to the file it reads the most, and a graph algorithm that solves a maximal matching problem using the Netflow package. Memik described an MCIO experiment in which the performance was compared to a naive CIO using some synthetic access patterns. The MCIO algorithm improved response time by 80%. One audience member questioned whether the authors assumed that processor-to-processor bandwidth was greater than the processor-to-I/O bandwidth, which turned out to be the case.

AQUEDUCT: ONLINE DATA MIGRATION WITH PERFORMANCE GUARANTEES
Chenyang Lu, University of Virginia; Guillermo A. Alvarez and John Wilkes, HP Labs

Chenyang Lu presented the Aqueduct system developed for online data migration. Online data migration is used when data must be continuously accessible. The challenges include keeping data consistent and bounding the impact of migration on the performance of other activities. Aqueduct uses feedback control to enforce the QoS latency contracts of foreground operations.

During a migration, the system monitors the average I/O latency over a window of time and, like a classical control circuit, uses the computed error to adjust the rate of migration. The controller uses an integral feedback loop, so that the rate is proportional to the sum of the worst errors (i.e., contract violations) measured in the migration history. Using the tuning parameters of victim latency (highest latency in the sampling period) and process gain (sensitivity of latency to changes in migration rate), control analysis was used to determine a stable but fast-tracking constant for the feedback loop.

In Lu's experiment, latency was bounded between 0.8 and 1 of the latency contract, meaning that the migration was not too conservative yet met the terms of the contract. He looked at performance experiments on a real enterprise-scale storage system by playing traces generated from an OpenMail workload. In this experiment Aqueduct reduced foreground I/O latency by 76% while reducing the latency-contract violation ratio by 78%.
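The integral control idea can be pictured with the toy controller below: each sampling period, the worst ("victim") foreground latency is compared against the contract and the error is accumulated into the migration rate. The gain, the rate units, and the sample values are illustrative assumptions, not the tuned constants from the paper.

```python
def migration_controller(latency_samples, contract, rate, gain=0.1,
                         min_rate=0.0, max_rate=100.0):
    """Toy integral controller in the spirit of Aqueduct: accumulate the
    per-period error between the latency contract and the observed victim
    latency into the migration rate, clamped to a feasible range."""
    for victim_latency in latency_samples:        # worst latency per period
        error = contract - victim_latency         # positive means there is slack
        rate += gain * error                      # integral action on the rate
        rate = max(min_rate, min(max_rate, rate)) # keep the rate feasible
    return rate

# Foreground latency creeping above a 10 ms contract slows migration down:
print(migration_controller([12.0, 13.0, 11.0, 14.0], contract=10.0, rate=20.0))
# -> 19.0
```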

SESSION: LOW-LEVEL STORAGE OPTIMIZATION
Summarized by Scott Banachowski

TRACK-ALIGNED EXTENTS: MATCHING ACCESS PATTERNS TO DISK DRIVE CHARACTERISTICS
Jiri Schindler, John Linwood Griffin, Christopher R. Lumb, and Gregory R. Ganger, Carnegie Mellon University
Best Student Paper Award

It is well known that random disk access isn't as efficient as sustained streaming. However, when accesses are track-aligned, efficiency increases even for small request sizes. Jiri Schindler presented track-aligned extents (traxtents), which allocate related data within track boundaries. If data is allocated in track-sized extents, all I/O may be aligned within track boundaries, meaning that a single access will never incur rotational latency due to crossing a track.

To support traxtents, the file system must have very detailed layout information for each disk. Schindler et al. used an algorithm to determine track boundaries by looking for rotational latencies in disk operations. The operation is time-intensive, but the SCSI command set supports procedures that allow inference of track layout through queries, so the process is simplified for SCSI disks.

Schindler presented the performance results of a traxtents-supporting FFS vs. vanilla FFS. A 1GB file copy resulted in a 20% improvement in runtime, while a diff of two 512MB files resulted in a 19% runtime reduction. A video server providing concurrent data streams was able to support 56% more streams due to the improvements in disk throughput and startup latency. Track-aligned extents also reduced the buffer-space requirements of the file system. At the end of the talk Schindler was asked how it performed for small accesses, and he replied that it performed well in the PostMark benchmarks.

FREEBLOCK SCHEDULING OUTSIDE OF DISK FIRMWARE
Christopher R. Lumb, Jiri Schindler, and Gregory R. Ganger, Carnegie Mellon University

Christopher Lumb explored taking advantage of the free bandwidth offered by disks. When a disk head seeks to a new position, there are two latency sources: the track seek time and the rotational latency after the head is positioned over the target track. If movement of the disk head is delayed so that it arrives at the target track just in time, there is no rotational latency, so the time before the head movement may be exploited to do other useful tasks, such as continuing to fetch sequential data after the current read. Lumb found that this "free" bandwidth is available about a third of the time.

Freeblock scheduling is a scheme to take advantage of this extra disk bandwidth. A difficulty is that its implementation requires very accurate disk models, especially if scheduling happens outside of the disk firmware (in the driver); a conservative fudge factor must be added to estimates, reducing the available free bandwidth. However, they found that they could reduce the fudge factors by supplying queued disk commands to the disk such that it never idles.

Freeblock scheduling was tested with a random small-I/O workload. Lumb found 3.3MB/sec of free disk bandwidth with little impact on foreground operation. However, this was lower than the 5.3MB/sec predicted by their model. Lumb attributes this to disk-model inaccuracies and confusion in the disk controller's prefetching system. Once the model was altered to reflect the new findings, the performance was a closer match to predictions. In concluding remarks, Lumb stated that freeblock scheduling provides 15% of the disk bandwidth for free.
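The feasibility test at the heart of the scheme can be sketched as a simple timing check: a background transfer is "free" only if detouring to it and seeking back still gets the head to the foreground target within the time the foreground request would have spent positioning anyway. This is my own simplification of the idea; the parameter names, the fudge factor, and the figures are illustrative, not the paper's disk model.

```python
def is_free_transfer(window_ms, bg_seek_ms, bg_transfer_ms,
                     seek_back_ms, fudge_ms=0.5):
    """Toy freeblock test: window_ms is the foreground request's seek plus
    rotational latency; the detour (seek to the background block, transfer,
    seek back to the foreground target) must fit inside it, with a fudge
    factor standing in for disk-model inaccuracy."""
    return bg_seek_ms + bg_transfer_ms + seek_back_ms + fudge_ms <= window_ms

# A short background transfer near the head fits inside an 8 ms window:
print(is_free_transfer(window_ms=8.0, bg_seek_ms=1.5,
                       bg_transfer_ms=1.0, seek_back_ms=3.0))   # True
```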
CONFIGURING AND SCHEDULING AN EAGER-WRITING DISK ARRAY FOR A TRANSACTION PROCESSING WORKLOAD
Chi Zhang, Xiang Yu, and Randolph Y. Wang, Princeton University; Arvind Krishnamurthy, Yale University

Chi Zhang presented the concluding talk of the conference. The subject was scheduling I/Os for transaction processing applications. Transaction processing typically produces random-access workloads that exhibit little locality or sequentiality. The research question is "How do we throw away disk space to improve performance?" Eager writing is a policy where writes are placed at the free block closest to the current disk-head position, and mirroring allows a read to be serviced by the copy closest to the request. Zhang examines combining these techniques to improve performance on TPC-C (transaction processing) benchmarks, and calls the result an eager-writing disk array (EW-Array).

To support an EW-Array, Zhang was faced with the problem of properly configuring the system so that it has enough disk space for eager writes, enough replicas to provide mirroring benefits, and the ability to handle striping. Determining the right configuration is a trade-off that depends on the workload. In addition, Zhang shows how to design a disk scheduler to take advantage of the write-anywhere nature of eager writing.

A prototype of an EW-Array was developed as a logical disk driver for Windows, and the performance was compared to other disk array configurations by playing TPC-C traces at both original and accelerated rates. Zhang showed results indicating that the EW-Array achieves better response times and higher I/O rates than other approaches given the same extra storage space.
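The eager-writing placement policy itself is simple enough to state in a few lines: pick the free block nearest the current head position. The sketch below uses plain block-number distance as a stand-in for a real positioning model and is only an illustration of the policy described above.

```python
def eager_write_target(head_pos, free_blocks):
    """Toy sketch of eager writing: place the write at the free block closest
    to the current head position rather than at a fixed home location."""
    return min(free_blocks, key=lambda b: abs(b - head_pos))

free = {12, 480, 1055, 1060}
print(eager_write_target(head_pos=1050, free_blocks=free))   # -> 1055
```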
