THE

THE NEWSLETTER ON PDL ACTIVITIES AND EVENTS • FALL 2 0 0 4 http://www.pdl.cmu.edu/

AN INFORMAL PUBLICATION FROM Ursa Major Starting to Take Shape ACADEMIA’S PREMIERE STORAGE Greg Ganger SYSTEMS RESEARCH CENTER Ursa Major will be the first full system constructed on PDL’s journey towards DEVOTED TO ADVANCING THE self-* storage. Ursa Major will be a large-scale object store, in the NASD and STATE OF THE ART IN STORAGE OSD style, based on standard server technologies (e.g., rack-mounted servers and SYSTEMS AND INFORMATION gigabit-per-second Ethernet). This article discusses the long-range project goals, some of the progress that has been made on Ursa Major, and near-term plans. INFRASTRUCTURES. The Self-* Storage Vision Human administration of storage systems is a large and growing issue in modern CONTENTS IT infrastructures. PDL’s Self-* Storage project explores new storage architec- tures that integrate automated management functions and simplify the human Ursa Major ...... 1 administrative task. Self-* (pronounced “self-star”—a play on the unix shell wild-card character) storage systems should be self-configuring, self-tuning, Director’s Letter ...... 2 self-organizing, self-healing, self-managing, etc. Ideally, human administrators Year in Review ...... 4 should have to do nothing more than provide muscle for component additions, Recent Publications ...... 5 guidance on acceptable risk levels (e.g., reliability goals), and current levels of satisfaction. PDL News & Awards ...... 8 We think in terms of systems composed of networked “intelligent” storage bricks, New PDL Faculty ...... 11 which are small-scale servers consisting of CPU(s), RAM, and a number of disks. Dissertation Abstracts ...... 12 Although special-purpose hardware may speed up network communication and data encode/decode operations, it is the software functionality and distribution of work that could make such systems easier to administer and competitive PDL CONSORTIUM in performance and reliability. Designing self-*-ness in from the start allows MEMBERS construction of high-performance, high-reliability storage infrastructures from weaker, less-reliable base units; with a RAID-like argument at a much larger scale, this is the storage analogue of cluster computing’s benefits relative to EMC Corporation historical supercomputing. Engenio Information Technologies, Inc. We refer to self-* collections of storage bricks as storage constellations. In Hewlett-Packard Labs exploring self-* storage, we plan to develop and deploy a large-scale storage Hitachi, Ltd.

Hitachi Global Storage Technologies reads and writes large-scale storage infrastructure IBM Corporation A storage brick Intel Corporation (CPU, RAM and several disks) Microsoft Corporation Network Appliance Oracle Corporation users Panasas, Inc. Science and data mining researchers needing Seagate Technology large-scale storage Sun Microsystems Figure 1. Brick Based infrastructure: a case study for understanding data center Veritas Software Corporation challenges and proving ground for new self-management approaches. continued on page 12 FROM THE DIRECTOR’S CHAIR THE PDL PACKET Greg Ganger The Parallel Data Laboratory School of Computer Science Hello from fabulous Pittsburgh! Department of ECE 2004 has been a strong year of transition in the Carnegie Mellon University Parallel Data Lab, with several Ph.D. students graduating and the Lab’s major new project 5000 Forbes Avenue (Self-* Storage) ramping up. Along the way, Pittsburgh, PA 15213-3891 a faculty member, a new systems scientist, VOICE 412•268•6716 and several new students have joined the Lab, FAX 412•268•3010 several students spent summers with PDL Consortium companies, and many papers have been published. PUBLISHER For me, the most exciting thing has been the graduations: Garth Goodson, John Greg Ganger Linwood Griffin, Jiri Schindler, Steve Schlosser, and Ted Wong. The first four are the first Ph.D. students to complete their degrees with me as their primary EDITOR advisor, ending long speculation that I never intended to allow any of my Joan Digney students to leave :). All five of these new Doctors are now working with PDL Consortium companies (Garth at Network Appliance, John and Ted at IBM, The PDL Packet is published once per Jiri at EMC, and Steve at Intel); one or two of them are even expected back at year and provided to members of the the 2004 PDL Retreat as industry participants. PDL Consortium. Copies are given to other researchers in industry and The PDL continues to pursue a broad array of storage systems research, and this academia as well. A pdf version resides past year brought completion of some recent projects and good progress on the in the Publications section of the PDL exciting new projects launched last year. Let me highlight a few things. Web pages and may be freely distributed. Contributions are welcome. Of course, first up is the big new project, Self-* Storage, which explores the design and implementation of self-organizing, self-configuring, self-tuning, COVER ILLUSTRATION self-healing, self-managing systems of storage bricks. For years, PDL Retreat attendees pushed us to attack “storage management of large installations,” Skibo Castle and the lands that com- and this project is our response. With generous equipment donations from the prise its estate are located in the Kyle of Sutherland in the northeastern part of PDL Consortium companies, we hope to put together and maintain 100s of Scotland. Both ‘Skibo’ and ‘Sutherland’ terabytes of storage for use by ourselves and other CMU researchers (e.g., in are names whose roots are from Old data mining, astronomy, and scientific visualization). IBM, Intel, and Seagate Norse, the language spoken by the have helped us with seed donations for early development and testing. The Vikings who began washing ashore University administration is also excited about the project, and is showing their regularly in the late ninth century. The support by giving us some of the most precious resource on campus: space! Two word ‘Skibo’ fascinates etymologists, thousand square feet have been allocated in the new Collaborative Innovation who are unable to agree on its original meaning. All agree that ‘bo’ is the Old Center (CIC) building for a large machine room configured to support today’s Norse for ‘land’ or ‘place,’ but they argue high-density computing and storage racks. Planning and engineering the room whether ‘ski’ means ‘ships’ or ‘peace’ has been an eye-opening experience, inducing new collaborations in areas like or ‘fairy hill.’ dynamic thermal management. Some of the eye-popping specifications include 20KW per rack (average), with a total of two megawatts total, 1300 gallons of Although the earliest version of Skibo chilled water per minute, and 150,000 pounds of weight. seems to be lost in the mists of time, it was most likely some kind of fortified The Self-* Storage research challenge is to make the storage infrastructure building erected by the Norsemen. The as self-* as possible, so as to avoid the traditional costs of storage adminis- present-day castle was built by a bishop tration—deploying a real system will allow us to test our ideas in practice. of the Roman Catholic Church. Andrew Towards this end, we are designing Ursa Major (the first system) with a clean Carnegie, after making his fortune, bought it in 1898 to serve as his summer slate, integrating management functions throughout. In realizing the design, home. In 1980, his daughter, Margaret, we are combining a number of recent and ongoing PDL projects (e.g., PASIS donated Skibo to a trust that later sold and self-securing storage) and ramping up efforts on new challenges, such as the estate. It is presently being run as a block-box device and workload modeling, automated decision making, and luxury hotel. automated diagnosis. PDL’s Fates database storage project continues to make strides, and has con-

2 THE PDL PACKET PARALLEL DATA FROM THE DIRECTOR’S CHAIR LABORATORY

CONTACT US nected with scientific computing application domains to begin exploring “com- WEB PAGES PDL Home: http://www.pdl.cmu.edu/ putational databases.” A Fates-based database storage manager transparently Please see our web pages at exploits select knowledge of the underlying storage infrastructure to automati- http://www.pdl.cmu.edu/PEOPLE/ for further contact information. cally achieve robust, tuned performance. As in Greek mythology, there are three FACULTY Fates: Atropos, Clotho, and Lachesis. The Atropos volume manager stripes data Greg Ganger (director) across disks based on track boundaries and exposes aspects of the resulting par- 412•268•1297 [email protected] allelism. The Lachesis storage manager utilizes track boundary and parallelism Anastassia Ailamaki information to match database structures and access patterns to the underlying [email protected] storage. The Clotho dynamic page layout allows retrieval of just the desired table Anthony Brockwell [email protected] attributes, eliminating unnecessary I/O and wasted main memory. Altogether, the Chuck Cranor three Fates components simplify database administration, increase performance, [email protected] Christos Faloutsos and avoid performance fluctuations due to query interference. Computational [email protected] databases are a new approach to scientific computing in which observation data Garth Gibson [email protected] are loaded into general database systems, and then the scientific queries are Seth Goldstein optimized automatically by the DBMS rather than by the scientist producing [email protected] special-purpose code. Fates and other new database systems ideas will be key Mor Harchol-Balter [email protected] to the automated optimization component of computational databases. Chris Long [email protected] Other ongoing PDL projects are also producing cool results. For example, the Todd Mowry [email protected] MEMS-based storage work has culminated in an approach for evaluating wheth- David O’Hallaron er a new storage technology needs a new interface protocol. The self-securing [email protected] devices project continues to explore intrusion survival features of augmenting Adrian Perrig [email protected] devices with security functionality. Among other things, we have extended the Mike Reiter storage-based intrusion detection idea so that it can work in commodity disks, [email protected] Mahadev Satyanarayanan despite the lack of direct network attachment or file-based interfaces. The [email protected] PASIS project has extended the scalable read/write protocols to more general Srinivasan Seshan read-modify-write operations. The context-enhanced semantic file systems work [email protected] Dawn Song has produced some very promising early results. This newsletter and the PDL [email protected] website offer more details and additional research highlights. Chenxi Wang [email protected] On the education front, in Spring 2004, we again offered our storage systems Hui Zhang [email protected] course to undergraduates and masters students at Carnegie Mellon. Topics

STAFF MEMBERS span the design, implementation, and use of storage systems, from the charac- Karen Lindenfelser, 412•268•6716 teristics and operation of individual storage devices to the OS, database, and (pdl business administrator) networking techniques involved in tying them together and making them use- [email protected] Stan Bielski ful. The base lectures were complemented by real-world expertise generously Mike Bigrigg shared by guest speakers from industry. We continue to work on the book, and John Bucy Joan Digney several other schools have picked up Gregg Economou this trend and started teaching similar Manish Prasad Raja Sambasivan storage systems courses. Perhaps the Ken Tew most exciting news is that CMU has Linda Whipkey Terrence Wong given me time off from teaching and GRADUATE STUDENTS committee work to focus on complet- Michael Abd-El-Malek Adam Pennington ing the book (“Storage Systems”) on Mukesh Agrawal Ginger Perng which this class should be based. We Kinman Au David Petrou Shimin Chen Brandon Salmon view providing storage systems edu- Ivan Dobric Minglong Shao cation as critical to the field’s future; Stavros Harizopoulos Vlad Shkapenyuk James Hendricks Shafeeq Sinnamohideen stay tuned. Andrew Klosterman Craig Soules Julio López John Strunk I’m always overwhelmed by the ac- Chris Lumb Eno Thereska Amit Manjhi Niraj Tolia complishments of the PDL students and Michael Mesnier Tiankai Tu staff, and it’s a pleasure to work with Minglong describes her work on Clotho, Jim Newsome Mengzhi Wang them. As always, their accomplish- an aspect of the Fates Database Storage Spiros Papadimitriou Jay Wylie System, to Vlad. Stratos Papadomanolakis Shuheng Zhou ments point at great things to come.

FALL 2004 3 YEAR IN REVIEW

September 2004 lative Tasks” at the Internat’l crosoft Research in Cambridge,  Natassa gave an invited tutorial Conference on Supercomputing UK. Niraj Tolia was also in Cam- on “Database Architectures for (ICS) 2004 in St.-Malo, France. , UK, at the Intel Research New Hardware” at VLDB04 in  Jay Wylie presented “A Protocol Lab. Toronto, Canada. Approach to Survivable April 2004  John Griffin successfully de- Storage Infrastructures” at Fu-  Steve Schlosser successfully fended his Ph.D. dissertation DiCo II in Bertinoro, Italy. defended his Ph.D. dissertation, titled “Timing Accurate Storage  Chenxi Wang spoke on “Dy- “Using MEMS-based Storage Emulation: Evaluating Hypotheti- namic Quarantine of Internet Devices in Computer Systems.” cal Storage Components in Real Worms” at DSN 04 in Florence,  10 PDL faculty and students Computer Systems.” Italy.Jay Wylie also spoke at this attended FAST 04 in San Fran- conference, presenting “Efficient  James Hendricks presented “Se- cisco, CA. Byzantine-tolerant Erasure-coded cure Bootstrap is Not Enough:  Eno Thereska presented “A Storage.” Shoring up the Trusted Comput- Framework for Building Unob-  ing Base” at the 11th SIGOPS Christos gave a tutorial on “In- trusive Disk Maintenance Ap- European Workshop in Leuven, dexing and Mining Streams” at plications” at FAST 04. He and Belgium. SIGMOD 2004 in Paris, France. his co-authors Jiri Schindler,  Mengzhi Wang presented “Storage May 2004 John Bucy, Brandon Salmon, Device Performance Prediction with  6th annual PDL Industry Visit Day. Christopher R. Lumb, Gregory R. Ganger were awarded Best CART Models” at MASCOTS04  Jiri Schindler and Steve Schloss- Student Paper. in Volendam, The Netherlands. er, Greg’s first Ph.D. students to  Also presenting at FAST 04 A poster version was presented at finish, were awarded their de- were Jiri Schindler (“Atropos: A ACM SIGMETRICS 2004. grees in ECE. Ted Wong received Disk Array Volume Manager for  th his Ph.D. from SCS. 12 Annual PDL Retreat and Orchestrated Use of Disks”) and  Workshop. Mike Mesnier presented “File Steve Schlosser (“MEMS-based August 2004 Classification in Self-* Storage Storage Devices and Standard  Garth Goodson successfully Systems” at ICAC-04 in New Disk Interfaces: A Square Peg in defended his Ph.D. dissertation York. He also helped lead a a Round Hole?”) titled “Efficient, Scalable Con- storage workshop on intelligent  Over the past year, several indus- sistency for Highly Fault-tolerant storage devices and worked with try visitors have contributed to Storage.” others in the industry to put to- our Storage Systems course in-  Minglong Shao presented gether a research roadmap for the cluding David Black, EMC; Erik “Clotho: Decoupling Memory next 5-10 years. Riedel, Seagate; and Mike Kazar, Page Layout from Storage Orga-  Chenxi Wang presented “Provid- Network Appliance. ing Content-based Services on nization” at VLDB04 in Toronto, March 2004 top of Peer-to-peer Systems” at Canada.  SDI Speaker: Larry Huston,  Christos attended KDD2004 in DEBS’04 (Distributed Event- Seattle, WA where he had two Based Systems), in Edinburgh, continued on page 19 papers being presented: “Fast Scotland. Discovery of Connection Sub-  Mengzhi Wang graphs” and “Recovering Latent successfully Time-Series from their Observed proposed her Sums: Network Tomography with Ph.D research Particle Filters.” titled “Black-Box July 2004 Storage Device Models with  Chris Long presented “Chame- Learning.” leon: Towards Usable RBAC” at  James Newsome the DIMACS Workshop on Us- and Amit Manjhi able Privacy and Security Soft- spent the summer ware at Rutgers Univ., NJ. interning at Intel’s June 2004 Pittsburgh office,  David Petrou presented “Cluster and Eno Thereska Adam gives the go-ahead for Scheduling for Explicitly-Specu- interned with Mi- a PDL Industry Visit Day demo.

4 THE PDL PACKET RECENT PUBLICATIONS

D-SPTF: Decentralized cancel unfinished tasks if early results Request-level predictor R = (TimeDiff, ..., RW) RT suggest no need to continue. Doing 1 1 Request Distribution in Brick- Aggregate R2 = (TimeDiff, ..., RW) RT 2 CART so matches natural usage patterns, . . . based Storage Systems r 1 Rn = (TimeDiff, ..., RW) RT n making users more effective, and also r 2 Aggregate Lumb, Golding & Ganger enables a new class of schedulers. . Perf. P r n Workload-level predictor Proceedings of ASPLOS’04, Boston, In simulation, we show how batch-ac- W = (ArrivalRate, ..., Seq) CART P Massachusetts, October 7–13 ,2004. tive schedulers reduce user-observed response times relative to conven- Distributed Shortest-Positioning tional models in which tasks are re- Per-request and feature-based predictors. Time First (D-SPTF) is a request quested one at a time (interactively) or ri is the i-th request in the workload; RTi distribution protocol for decentralized in batches without specifying which is the response time of ri; Ri is the set of systems of storage servers. D-SPTF per-request characteristics for ri; W is the tasks are speculative. Over a range set of workload-level characteristics. exploits high-speed interconnects of situations, user-observed response to dynamically select which server, time is about 50% better on average average and 90th percentile response among those with a replica, should and at least two times better for 20% time with an relative error as low as service each read request. In doing of our simulations. Moreover, we 19% when the training workloads are so, it simultaneously balances load, show how user costs can be reduced similar to the testing workloads and exploits the aggregate cache capac- under an incentive cost model of a good interpolation across different ity, and reduces positioning times for charging only for tasks whose results workloads. cache misses. For network latencies are requested. expected in storage clusters (e.g., 10- Toward Automatic Context- 200µs), D-SPTF performs as well as Storage Device Performance based Attribute Assignment for would a hypothetical centralized sys- Prediction with CART Models Semantic File Systems tem with the same collection of CPU, cache, and disk resources. Compared Wang, Au, Ailamaki, Brockwell, Soules & Ganger Faloutsos & Ganger to popular decentralized approaches, Carnegie Mellon University Parallel D-SPTF achieves up to 65% higher Proceedings of the 12th Annual Meet- Data Laboratory Technical Report throughput and adapts more cleanly to ing of the IEEE / ACM International CMU-PDL-04-105. June 2004. heterogeneous server capabilities. Symposium on Modeling, Analysis, Semantic file systems enable users Cluster Scheduling for and Simulation of Computer and Tele- to search for files based on attributes communication Systems (MASCOTS Explicitly Speculative Tasks rather than just pre-assigned names. Volendam, The Netherlands. October This paper develops and evaluates Petrou, Ganger & Gibson 5–7, 2004. several new approaches to automati- Storage device performance predic- cally generating file attributes based Proceedings of the 18th Annual ACM on context, complementing existing International Conference on Super- tion is a key element of self-managed storage systems and application plan- approaches based on content analy- computing (ICS’04), Malo, France, sis. Context captures broader system June 26–July 1, 2004. ning tasks, such as data assignment. This work explores the application state that can be used to provide new Large-scale computing often consists of a machine learning tool, CART attributes for files, and to propagate at- of many speculative tasks to test models, to storage device modeling. tributes among related files; context is hypotheses, search for insights, and Our approach predicts a device’s per- also how humans often remember pre- review potentially finished products. formance as a function of input work- vious items [2], and so should fit the E.g., speculative tasks are issued by loads, requiring no knowledge of the primary role of semantic file systems bioinformaticists comparing DNA se- device internals. We propose two uses well. Based on our study of ten sys- quences and computer graphics artists of CART models: one that predicts tems over four months, the addition of adjusting scene properties. We pro- per-request response times (and then context-based mechanisms, on aver- mote a way of working that exploits derives aggregate values) and one age, reduces the number of files with the inherent speculation in applica- that predicts aggregate values directly zero attributes by 73%. This increases tion-level search made more common from workload characteristics. After the total number of classifiable files by by the cost-effectiveness of grid and being trained on the device in ques- over 25% in most cases, as is shown cluster computing. Researchers and tion, both provide accurate black-box in Figure 1. Also, on average, 71% of end-users disclose sets of speculative models across a range of test traces the content-analyzable files also gain tasks that search an application space, from real environments. Experiments additional valuable attributes. request specific results as needed, and show that these models predict the continued on page 6

FALL 2004 5 RECENT PUBLICATIONS RECENT PUBLICATIONS continued from page 5 STEPS Towards Cache- 96.7% reduction in instruction cache independently-tailored page layouts Resident Transaction misses for each additional concur- for dynamically changing as well as Processing rent transaction, and at the same time compound workloads, and (b) use of eliminates up to 64% of mispredicted alternative technologies at each level Harizopoulos & Ailamaki branches by loading a repeating ex- (e.g., disk arrays or MEMS-based ecution into the CPU. This storage devices). We describe the Proceedings of the 30th VLDB Con- paper (a) describes the design and Clotho design and implementation ference, Toronto, Canada, 29 August implementation of Steps, (b) analyzes using disk array logical volumes and –3 September 2004. Steps using microbenchmarks, and simulated MEMS-based storage de- Online transaction processing (OLTP) (c) shows Steps performance when vices, and we evaluate performance is a multibillion dollar industry with running TPC-C on top of the Shore under a variety of workloads. high-end database servers employing storage manager. state-of-the-art processors to maxi- AutoPart: Automating Schema mize performance. Unfortunately, Clotho: Decoupling Page Design for Large Scientific recent studies show that CPUs are Layout from Storage Databases Using Data far from realizing their maximum Organization Partitioning intended throughput because of delays in the processor caches. When running Shao, Schindler, Schlosser & Papadomanolakis & Ailamaki Ailamaki & Ganger OLTP, instruction-related delays in 16th International Conference on the memory subsystem account for Proceedings of the 30th VLDB Con- Scientific and Statistical Database 25 to 40% of the total execution time. ference. Toronto, Canada, 29 August Management (SSDBM). Santorini In contrast to data, instruction misses –3 September 2004. Island, Greece. June 21–23, 2004. cannot be overlapped with out-of-or- der execution, and instruction caches As database application performance Database applications that use multi- cannot grow as the slower access depends on the utilization of the terabyte datasets are becoming in- time directly affects the processor disk and memory hierarchy, and the creasingly important for scientific speed. The challenge is to alleviate speed gap between the processor and fields such as astronomy and biology. the instruction-related delays without memory components widens, smart Scientific databases are particularly increasing the cache size. data placement plays a central role in suited for the application of automated We propose Steps, a technique that increasing locality and in improving physical design techniques, because minimizes instruction cache misses memory utilization. Existing tech- of their data volume and the com- in OLTP workloads by multiplexing niques, however, do not optimize plexity of the scientific workloads. concurrent transactions and exploiting accesses to all levels of memory Current automated physical design common code paths. One transaction hierarchy and for all the different tools focus on the selection of indexes paves the cache with instructions, workloads, because each storage and materialized views. In large-scale while close followers enjoy a nearly level uses different technology (cache, scientific databases, however, the data miss-free execution. Steps yields up to memory, disks) and each application volume and the continuous insertion accesses data using different (often of new data allows for only limited

thread A thread B thread A thread B conflicting) patterns. This paper in- indexes and materialized views. By CPU CPU contrast, data partitioning does not 00101 00101 00101 00101 code troduces Clotho, a new buffer pool instruction 1001 1001 1001 1001 hits in instruction replicate data, thereby reducing cache 00010 00010 00010 00010 and storage management architecture. cache capacity 1101 1101 1101 1101 window space requirements and minimizing 110 110 110 110 Clotho decouples in-memory page 10011 10011 10011 10011 update overhead. In this paper we 0110 0110 0110 0110 layout from data organization on context- 00110 00110 00110 00110 switch non-volatile storage devices, enabling present AutoPart, an algorithm that before after point independent data layout design at automatically partitions database CPU executes code CPU performs context-switch each level of the storage hierarchy. tables to optimize sequential access Using Clotho, a DBMS can maximize assuming prior knowledge of a rep- Before: As the instruction cache cannot fit cache and memory utilization by resentative workload. The resulting the entire code, when the CPU switches (a) transparently using appropriate schema is indexed using a fraction of (dotted line) to thread B it will incur the data layouts on memory and non- the space required for indexing the same number of misses. After: If we original schema. To evaluate AutoPart “break” the code into several pieces that volatile storage, and (b) dynamically fit in the cache, and switch execution back synthesizing data pages to follow we built an automated schema design and forth between the two threads, thread application access patterns at each tool that interfaces to commercial B will find all instructions in the cache. level as needed. Clotho enables (a) continued on page 7

6 THE PDL PACKET RECENT PUBLICATIONS RECENT PUBLICATIONS

continued from page 6 database systems. We experiment munications. Two worms observed in time stalled on CPU cache misses, with AutoPart in the context of the the traces, however, would be signifi- and explores the use of prefetching Sloan Digital Sky Survey database, cantly slowed down. to improve its cache performance. a real-world astronomical database, Applying prefetching to hash joins running on SQL Server 2000. Our The Safety and Liveness is complicated by the data dependen- experiments demonstrate the benefits Properties of a Protocol cies, multiple code paths, and inherent of partitioning for large-scale systems: Family for Versatile Survivable randomness of hashing. We present Partitioning alone improves query Storage Infrastructures two techniques, group prefetching and execution performance by a factor software-pipelined prefetching, that of two on average. Combined with Goodson, Wylie, Ganger & Reiter overcome these complications. These indexes, the new schema also outper- schemes achieve 2.0–2.9X speedups forms the indexed original schema by Carnegie Mellon University Parallel for the join phase and 1.4–2.6X 20% (for queries) and a factor of five Data Laboratory Technical Report speedups for the partition phase over (for updates), while using only half CMU-PDL-03-105. March 2004. GRACE and simple prefetching ap- the original index space. Survivable storage systems mask proaches. Compared with previous cache-aware approaches (i.e. cache Dynamic Quarantine of faults. A protocol family shifts the decision of which types of faults from partitioning), the schemes are at least Internet Worms implementation time to data-item cre- 50% faster on large relations and do Wong, Wang, Song, Bielski & ation time. If desired, each data-item not require exclusive use of the CPU Ganger can be protected from different types cache to be effective. and numbers of faults with changes Proceedings of the International Con- only to client-side logic. This paper 0 0 code 0 ference on Dependable Systems and presents proofs of the safety and 0 0 1 Networks (DSN-2004). Palazzo dei liveness properties for a family of 1 code 1 Congressi, Florence, Italy. June 28th 1 1 storage access protocols that exploit 2 2 code 2 –July 1, 2004. data versioning to efficiently provide 2 2 3 If we limit the contact rate of worm consistency for erasure-coded data. 3 code 3 3 3 traffic, can we alleviate and ultimately Members of the protocol family may 0 0 code 0 contain Internet worms? This paper assume either a synchronous or asyn- 0 0 stage 1 sets out to answer this question. Spe- chronous model, can tolerate hybrid 1 1 crash-recovery and Byzantine failures 1 cifically, we are interested in analyz- 2 2 ing different deployment strategies of of storage-nodes, may tolerate either memory 2 2 operation 3 rate control mechanisms and the effect crash or Byzantine clients, and may or 3 3 thereof on suppressing the spread of may not allow clients to perform re- 1 3 worm code. We use both analytical pair. Additional protocol family mem- Group prefetching (G = 4) models and simulation experiments. bers for synchronous systems under We find that rate control at individual omission and fail-stop failure models 0 Software-pipelined prefetching hosts or edge routers yields a slow- of storage-nodes are developed. (D = 1) down that is linear in the number of 1 0 Improving Hash Join 0 hosts (or routers) with the rate limit- 2 1 ing filters. Limiting contact rate at the Performance through 1 0 backbone routers, however, is sub- Prefetching 3 2 1 0 stantially more effective -- it renders 3 2 an interation a slowdown comparable to deploying Chen, Ailamaki, Gibbons & Mowry 1 0 3 2 rate limiting filters at every individual Proceedings of the 20th International 1 0 host that is covered. This result holds 3 2 Conference on Data Engineering 1 true even when susceptible and infect- (ICDE 2004). Boston, MA. March 30 3 2 ed hosts are patched and immunized –April 2, 2004. 3 2 dynamically. To provide context for our analysis, we examine real traf- Hash join algorithms suffer from ex- 3 fic traces obtained from a campus tensive CPU cache stalls. This paper computing network. We observe that shows that the standard hash join algo- Intuitive pictures of the prefetching schemes. rate throttling could be enforced with rithm for disk-oriented databases (i.e. minimal impact on legitimate com- GRACE) spends over 73% of its user continued on page 16

FALL 2004 7 AWARDS & OTHER PDL NEWS

September 2004 ence,” and is designed as a three year Sanjay Seshan Welcomes a New appointment. Brother! The Finmeccanica Group ranks Asha, Sanjay among the largest international firms and Srini Ses- in its operating sectors of aerospace, han are proud defense, energy, transportation and to announce information technology. They are ac- the arrival of tive in the design and manufacture of a new baby aircraft, satellites, power generation boy to their components, information and technol- family. Ar- ogy services—to name a few—and vind Seshan the realization of these systems via a r r i v e d a t engineering, electronics, information 6:16 p.m. on September 9th, weighing technology and innovative materials. 6 lbs. 14 oz. and measuring 21 inches. —with info from CMU 8.5x11 News, Jun 3, 2004. Our congratulations to ! May 2004 September 2004 PDL Student Graduates: Garth Goodson, John Griffin join Congratulations Ted! PDL Consortium Companies Garth Goodson successfully defended Ted Wong convocated with a Ph.D. his thesis on “Efficient, Scalable from the School of Computer Science Consistency for Highly Fault-tolerant on May 16. His thesis work was on Storage” on August 28, and has a start “Decentralized Recovery for Surviv- date of September 13 for his new job able Storage Systems” (abstract is at Network Appliance in Sunnyvale, available on page 13). Ted has been CA. John Griffin also successfully with IBM Research since December defended his research on “Timing Ac- 2003, working on self-managing curate Storage Emulation: Evaluating brick-based storage systems. In the Hypothetical Storage Components last twelve months, Ted has also got- in Real Computer Systems” on Sep- ten married, moved across the US ing this spring. The hooding ceremony tember 1. He is set to move to the to start his new job at Almaden, and took place on May 15 and they at- New York City area to begin work at bought a house. tended the University’s convocation IBM’s T.J Watson Research Center in ceremonies on May 16. Jiri has been mid-September. Dissertation abstracts working at EMC in Boston since last begin on page 12. fall, and Steve has taken a position with Intel in Pittsburgh. Abstracts of June 2004 their dissertations are available start- Srinivasan Seshan awarded ing on page 12. Finmeccanica Chair May 2004 Srinivasan Best Student Paper Award at Seshan, as- PAKDD ‘04 sociate pro- f e s s o r i n One of Christos Faloutsos’ students, the CS De- Jia-Yu (Tim) Pan, won the ‘best stu- p a r t m e n t , May 2004 dent paper’ award in PAKDD 2004 for h a s b e e n the paper “AutoSplit: Fast and Scal- a w a r d e d Greg’s First Ph.D. Students to able Discovery of Hidden Variables in the school’s Graduate: Congrats to Jiri and Stream and Multimedia Databases,” Finmecca- Steve! co-authored with Christos Faloutsos, nica Chair. Endowed in 1989, the Greg’s first students to complete Masafumi Hamamoto and Hiroyuki Italian Finmeccanica Fellowship “ac- their Ph.D.s, Steve Schlosser and Jiri Kitagawa. The paper proposed a new, knowledges promising young faculty Schindler, were awarded their degrees generic method for pattern detection; members in the field of computer sci- in Electrical and Computer Engineer- continued on page 9

8 THE PDL PACKET AWARDS & OTHER PDL NEWS continued from page 8 experiments on motion capture data, member companies who support our including a software development images and financial time series, research. From www.hitachigst.com: repository, the PDL web server, and showed that it consistently outper- Hitachi Global Storage Technologies Lab member home directories. forms the traditional SVD method. was founded in 2003 and was formed as February 2004 Congratulations, Tim! a result of the strategic combination of Hitachi and IBM’s storage technology Mowry to Head Intel Lab April 2004 businesses and have their head office Todd Mowry, PDL Researchers Receive Best in San Jose, CA. Other major develop- associate pro- Student Paper Award at FAST ‘04 ment and manufacturing locations are fessor of Com- Congratulations to PDL researchers found worldwide - in Japan, Thailand, puter Science, Eno Thereska, Jiri Schindler, John Singapore and Mexico. The company’s will succeed Bucy, Brandon Salmon and Gregory vision is to enable users to fully engage Mahadev Sa- R. Ganger, who have been awarded in the digital lifestyle by providing ac- tyanarayanan Best Student Paper by the program cess to large amounts of storage capac- as head of In- committee of the USENIX Confer- ity in formats suitable for the office, on tel Research ence on File and Storage technologies the road and in the home. P i t t s b u rg h , (FAST ‘04) for their paper “A Frame- March 2004 effective this work for Building Unobtrusive Disk May. Mowry will bring a new research Maintenance Applications.” CMU Sensor Detects Computer thrust to the lablet at the intersection Hard Drive Failures of databases, architecture, compilers The FAST Best Student Paper award The Critter Temperature Sensor, a and operating systems. According to was also given to PDL researchers in new heat-sensitive sensor to detect Satyanarayanan, in the two short years 2002 when Jiri Schindler, John Lin- computer hard drive failures, which of its existence, Intel Research Pitts- wood Griffin, Christopher R. Lumb, attaches to a user’s desktop computer, burgh is already making a big impact and Gregory R. Ganger were recog- is being deployed across campus to on a number of areas of research, in- nized for their paper “Track-Aligned monitor the working environment of cluding personal computing mobility Extents: Matching Access Patterns to university computers, according to (Internet Suspend/ Resume project), Disk Drive Characteristics.” Michael Bigrigg, a project scientist wide-area sensing (IrisNet project), April 2004 for the PDL. and interactive search of complex data (Diamond project). “We are clearly Best Paper Award at ICDE 2004 “We are trying save the life of the computer hard drive. Hard drives get past the startup phase, and can look Shimin Chen, Anastassia Ailamaki, hot and the sensor is designed to pick forward to continued growth and Phillip Gibbons, and Todd Mowry up the slightest temperature variation.” many more accomplishments in 2004 received the best paper award for Bigrigg added that the sensor will help and beyond,” Satyanarayanan said. “Improving Hash Join Performance researchers understand wasted energy –cmu.misc.news, Feb. 3, 2004 through Prefetching” at the Interna- with the hope of extending the lifes- tional Conference on Data Engineer- January 2004 pan of a computer hard drive by sens- ing (ICDE) 2004. The conference ing how much daily heat a hard drive Spiros Papadimitriou wins a Best took place in Boston, MA from March endures. So far, the new sensor, the Paper Award at VLDB03 30 through April 2. ICDE is one of size of a dime, has been deployed in the top database conferences, with C o m p u t e r offices and labs throughout Carnegie hundreds of submitted papers, and S c i e n c e Mellon’s Hamburg Hall. extremely selective acceptance ratio Ph.D. candi- date Spiros (typically, 1-of-5 to 1-of-7). The paper –with info from ece news & events, Mar. 1, 2004 focuses on the most expensive data- P a p a d i m i - base operation (‘join’), and proposes February 2004 triou has re- novel methods to accelerate it. Network Appliance Donates Filer ceived a Best for PDL Storage Needs Paper Award April 2004 f r o m t h e Network Appliance has donated a Very Large HGST Joins the PDL Industrial FAS900 series filer with 2 Terabytes Research Consortium Data Bases (VLDB) 2003 Conference of raw capacity and all of the software for his paper “Adaptive, Hands-Off The PDL would like to welcome bells and whistles, with a retail value Stream Mining,” which was co-au- Hitachi Global Storage Technologies of $170K, in all. This filer will be (HGST) to the group of Consortium used for critical PDL storage needs, continued on page 11

FALL 2004 9 URSA MAJOR continued from page 1 Second, most read operations com- Goal specification & plete in just a single round-trip; only complaints concurrent sharing or failures require Performance information & additional rounds. Third, most pro- delegation tocol processing is shifted to clients, Management Administrator allowing greater scalability. Fourth, Supervisors hierarchy the protocols do not require digital signatures; cryptographic hashes are instead used to tolerate Byzantine Workers failures. These features, combined, (storage will allow Ursa Major to protect bricks) important data from multiple failures without excessive overhead. Versatil- Additional Services: ity is achieved by allowing each data Directory • object—actually, each range of data • “First contact” I/O request Event log/notify in each object—to be encoded dif- routing • ferently, tolerating different numbers • On-line diagnosis Routers ... and types of failures, and to be stored • on different workers. The data access protocol family, by design, allows the I/O requests & replies fault model and encoding scheme to be selected on a per-data-item basis. Additional versatility is achieved by allowing the encoding and/or distribu- tion to be changed dynamically. Ursa Major’s versatility helps most Clients clearly with tuning, offering fine- Figure 2. Self-* Storage Architecture grained control over key performance constellation, with capacity provided age project have consisted of large and reliability determiners of distrib- to research groups (e.g., in data min- amounts of design and infrastructure uted storage systems. Rather than ing and scientific visualization) building. The foci during this process forcing a single pre-configured data around Carnegie Mellon who rely on has been on enabling high levels of distribution policy to be applied for large quantities of storage for their fault-tolerance, versatility, and instru- all storage, or even each entire dataset, work. We are convinced that such mentation. Together, these can pro- Ursa Major allows this policy choice deployment and maintenance is nec- vide a foundation for the automated to be specialized for each object and essary to evaluate the ability of new decision making, feedback control, to be changed dynamically. This architectures/technologies to simplify diagnosis, and repair mechanisms versatility eliminates three key prob- administration for the system scales needed to achieve self-*-ness. lems. First, finding a single encoding and workload mixes that traditionally choice (e.g., replication vs. erasure Fault tolerance is achieved by spread- codes) that is right for all files requires present difficulties. As well, our use ing data redundantly across a set of of low-cost hardware and immature compromising on performance and workers (i.e., storage bricks) so as to estimating which types of files are software will push the frontiers of ensure availability despite failures. fault-tolerance and automated recov- likely to be most popular. Second, PDL’s PASIS project has developed choosing a single degree of fault toler- ery mechanisms, which are critical for novel data access protocols that pro- storage infrastructures. ance for all data ignores the differing vide efficient fault tolerance by ex- reliability goals of different datasets, Our self-* storage white paper [Gang- ploiting local data versioning within making all data pay the capacity and er 03] categorizes storage administra- each storage-node (as provided, in our performance costs associated with tion tasks and overviews the self-* case, by PDL’s S4 object store). The the maximum fault tolerance desired. storage project in more detail. protocols gain efficiency from several Third, disallowing dynamic change features. First, they allow use of m-of- prevents observation-based setting of Early Foci: Fault-tolerance, n erasure codes in addition to standard Versatility & Instrumentation the policies. Ursa Major’s versatility replication, which can substantially The initial stages of the self-* stor- reduce communication overheads. continued on page 11

10 THE PDL PACKET AWARDS & OTHER PDL NEWS continued from page 9 thored with Anthony Brockwell and January 2004 November 2003 Christos Faloutsos. The conference LSI Logic (now Engenio) Joins Steve and Rachel Married! took place this past September in Ber- PDL Research Consortium Congratulations to Steve Schlosser lin and is one of the most prestigious and Rachel Fielder, who were married and selective database conferences. The PDL is pleased to announce on November 8, 2003, in Pittsburgh – with info from CMU 8.5x11 News, Jan. 8, 2004 that LSI Logic has joined the PDL at the Schenley Park Visitor Center. Consortium of companies that sup- Steve recently graduated from CMU January 2004 port and participate in PDL research. with his Ph.D. in Electrical and Com- Seagate supports the Self-* From the LSI Logic page: Founded in puter Engineering and is now at Intel Storage Project with Equipment 1981, LSI Logic pioneered the ASIC Research in Pittsburgh. Donation (Application Specific Integrated Cir- Greg Ganger (Assoc. Professor, ECE cuit) industry. LSI Logic is a leading and CS) and the Parallel Data Lab designer and manufacturer of com- (PDL) have received a $25K equip- munications, consumer and storage ment grant of high-end SCSI disks semiconductors for applications that from Seagate. The grant significantly access, interconnect and store data, increases the capacity of the testbed voice and video. Today, the company for PDL’s new Self-* Storage proj- focuses on providing highly complex ect, which seeks to create large-scale ASICs, ASSPs (Application Specific self-managing, self-organizing, self- Standard Products), RapidChip™, tuning storage systems from generic host bus adapters, software and stor- servers. age systems.

URSA MAJOR continued from page 10 allows reasonable default choices For example, system-healing services Setting Up the Physical to be used initially and then modi- may subscribe for events indicating Infrastructure fied, object by object, to match each failures or lost data and take corrective A large-scale computing infrastruc- object’s observed workload and any action. Diagnosis services may mine ture project involves logistics chal- load imbalances in the system. substantial portions of the event his- lenges beyond building and maintain- Instrumentation pervades Ursa Ma- tory to identify root causes of recur- ing software. To deploy Ursa Major jor’s design, providing information ring problems. at interesting scales, we need large about the system required for it to A “trace” is a recorded sequence of amounts of equipment and prop- diagnose, repair, and even predict requests and response timings. Many erly-conditioned machine room(s) problems. Two primary forms of components will retain such traces of in which to operate it. For today’s information are collected: events and their activity and responsiveness, in- high-density computing equipment, traces. An “event” is an observation cluding workers, metadata servers, and machine room design involves power, made by some component of the sys- clients wishing to report performance cooling, and structural challenges that tem that it believes may be of interest problems. Traces allow post-hoc were new to us and Carnegie Mellon. to another component or to a diagnosis analysis of workloads and efficiency And, of course, acquiring the equip- agent. Examples include a worker re- and also allow the system to explore ment and funds for renovations are porting corrupted blocks, a metadata “what if?” questions about addition of always challenging. server reporting an attempted quota datasets and workload redistribution. Carnegie Mellon’s administration is violation, and a client reporting an Cross-component activity tracking is excited about this project, and has unresponsive worker. The event ser- used to correlate requests observed demonstrated this excitement in the vice collects these events, which are at different system components, so most tangible ways: space allocations reported to it, and allows components that performance debugging services and renovations. Over the last year, of the system to query for events of can deduce which component(s) are PDL’s existing machine room has interest. (The event service is thus a responsible for performance problems publish-subscribe database system.) observed by clients. continued on page 16

FALL 2004 11 DISSERTATION ABSTRACTS

PH.D. DISSERTATION effect of using that component in a difficult by concurrent updates, partial Timing-Accurate Storage real system. updates made by clients that fail, and Emulation: Evaluating We built a functional timing-accurate failures of storage-nodes. Hypothetical Storage Components storage emulator and demonstrated This thesis demonstrates a novel ap- in Real Computer Systems its use in experiments involving cur- proach to achieving scalable, highly John Linwood Griffin, ECE rently-existing storage products, ex- fault-tolerant storage systems by le- September 1, 2004 periments evaluating the potential of veraging a set of efficient and scalable, nonexistent storage components, and strong consistency protocols enabled Timing-accurate storage emulation experiments evaluating interactions by storage-node versioning. Versions offers the opportunity to investigate between modified external system maintained by storage-nodes can be novel uses of storage in computer sys- architectures and modified or hypo- used to provide consistency, without tems, permitting forays into the space thetical storage device functionality. the need for central serialization, and of hypothetical device functionalities To explore standalone hypothetical despite concurrency. Since versions without the difficulties of developing storage components in computer are maintained for every update, even and supporting extensively nonstan- systems, we configured our emulator if a client fails part way through an dard or novel interface actions in with three device models represent- update, concurrency exists during an prototype or production systems. This ing a currently-available production update, the latest complete version dissertation demonstrates that there is disk drive, a hypothetical 50,000 of the data-item being accessed still a current and pressing need for a new RPM disk drive, and a hypothetical exists in the system—it does not get storage evaluation technique, and that MEMS-based storage device, and ex- destroyed by subsequent updates. Ad- it is feasible to design and construct ecuted three distinct application-level ditionally, versioning enables the use a timing-accurate storage emulator workloads against these three emula- of optimistic protocols. and to use an emulator for interesting tor configurations. To explore new This thesis develops a set of con- systems-level experimentation. system architectures with expanded sistency protocols appropriate for Timing-accurate storage emulation of- device functionality, we applied the constructing blockbased storage and fers a unique performance evaluation principles of timing-accurate storage metadata services. The block-based capability: the flexibility of simulation emulation in an investigation into and the reality of experimental mea- storage-based intrusion detection continued on page 13 surements. This allows a researcher systems. This experimentation dem- to experiment with not-yet-existing onstrates the feasibility of including storage components in the context of intrusion detection capabilities into real systems executing real applica- a standalone processing-enhanced tions. As its name suggests, a timing- disk drive, and also demonstrates accurate storage emulator appears to how existing communications paths the system to be a real storage com- may be used by an operating system ponent with service times matching to transmit and receive information a simulation model (or mathematical regarding the configuration and op- model) of that component. This al- erational status of such an intrusion lows simulated storage components detection-enhanced device. to be plugged into real systems, PH.D. DISSERTATION which can then be used for complete, application-based experiments. To Efficient, Scalable Consistency for accomplish this, the emulator must Highly Fault-tolerant Storage synchronize the simulator’s internal Garth Goodson, ECE time with the real-world clock, insert- August 28, 2004 ing requests into the simulator when they arrive and reporting completions Fault-tolerant storage systems spread when the simulator determines they data redundantly across a set of stor- are done. If the simulator’s model age-nodes in an effort to preserve and represents a real component, the sys- provide access to data despite failures. tem-observed performance will be One difficulty created by this architec- of that component. Thus, the results ture is the need for a consistent view, from application benchmarking will across storage-nodes, of the most re- John looks to his future. represent the end-to-end performance cent update. Such consistency is made

12 THE PDL PACKET DISSERTATION ABSTRACTS continued from page 12 storage protocol is made space effi- model that allows their use to be ex- cient through the use of erasure codes plored under a variety of workloads. and made scalable by off-loading To answer the question of whether a work from the storage-nodes to the new type of device requires changes clients. The metadata service is made to systems, I present a methodology scalable by avoiding the high costs that includes two objective tests for associated with agreement algorithms determining whether the benefit from and by utilizing threshold voting quo- a device is due to a specific difference rums. Fault-tolerance is achieved by in how that device accesses data or developing each protocol in a hybrid is just due to the fact that the device storage-node fault model (a mix of is faster, smaller, or uses less energy Byzantine and crash storage-nodes than current devices. I present a range Eno discusses Freeblock Scheduling with can be tolerated), capable of tolerating of potential uses of MEMStores in Robin Huber of Engenio at the PDL Spring crash or Byzantine clients, and utiliz- computer systems, examining each Visit Day. ing asynchronous communication. under a number of user workloads, us- ing the two objective tests to evaluate failures of servers. Thus, a survivable PH.D. DISSERTATION their efficacy. storage system can guarantee both the Using MEMS-based Storage availability and the confidentiality of Devices in Computer Systems Using the evidence presented and stored data. the two objective tests, I show that Research into survivable storage Steven W. Schlosser, ECE systems can incorporate MEMStores systems investigates the use of m- May 2004 easily and employ the same standard of-n threshold sharing schemes to abstractions and interfaces used with MEMS-based storage is an interesting distribute data to servers, in which disk systems. At a high level, the new technology that promises to bring each server receives a share of the intuition is that MEMStores are me- fast, non-volatile, mass data storage to data. Any m shares can be used to re- chanical storage devices, just like disk computer systems. MEMS-based stor- construct the data, but any m - 1 shares drives, only faster, smaller, and requir- age devices (MEMStores) themselves reveal no information about the data. ing less energy to operate. Accessing consist of several thousand read/write The central thesis of this dissertation data requires an initial seek time that tips, analogous to the read/write is that to truly preserve data for the is distance-dependent, and, once ac- heads of a disk drive, which read and long term, a system that uses threshold cess has begun, sequential access is write data in a recording medium. schemes must incorporate recovery the most efficient. This intuition is This medium is coated on a moving protocols able to overcome server described in more detail, and the result rectangular surface that is positioned failures, adapt to changing availability is shown to hold for the range of uses by a set of MEMS actuators. Access or confidentiality requirements, and presented. times are expected to be less than a operate in a decentralized manner. millisecond with energy consumption PH.D. DISSERTATION To support the thesis, I present the 10-100X less than a low-power disk design and experimental performance drive, while streaming bandwidth and Decentralized Recovery for Survivable Storage Systems analysis of a verifiable secret redistri- volumetric density are expected to be bution protocol for threshold sharing around that of disk drives. Theodore Ming-Tao Wong, SCS schemes. The protocol redistributes This dissertation explores the use of December 1, 2003 shares of data from old to new, pos- MEMStores in computer systems, Modern society has produced a wealth sibly disjoint, sets of servers, such that with a focus on whether systems can of data to preserve for the long term. new shares generated by redistribution use existing abstractions and interfac- Some data we keep for cultural ben- cannot be combined with old shares es to incorporate MEMStores effec- efit, in order to make it available to to reconstruct the original data. The tively, or if they will have to change future generations, while other data protocol is decentralized, and does not the way they access storage to benefit we keep because of legal impera- require intermediate reconstruction from MEMStores. If systems can use tives. One way to preserve such data of the data; thus, one does not create MEMStores in the same way that they is to store it using survivable storage a central point of failure or risk the use disk drives, it will be more likely systems. Survivable storage is distinct exposure of the data during protocol that MEMStores will be adopted when from reliable storage in that it tolerates execution. The protocol incorporates they do become available. confidentiality failures in which unau- a verification capability that enables Since real MEMStores do not yet thorized users compromise compo- new servers to confirm that their exist, I present a detailed software nent storage servers, as well as crash continued on page 13

FALL 2004 13 RECENT PUBLICATIONS continued from page 7 Efficient Byzantine-tolerant reasons hold for MEMS-based stor- system call interface Erasure-coded Storage age. This result is borne out through several case studies of proposed roles system call fs snapshots Goodson, Wylie, Ganger & Reiter translator MEMS-based storage devices may take in future systems, and potential Proceedings of the International Con- background scheduler ference on Dependable Systems and in-kernel API policies that may be used to tailor foreground background freeblock scheduler + systems’ access to MEMS-based stor- Networks (DSN-2004). Palazzo dei scheduler scheduler idle time detector Congressi, Florence, Italy. June 28th age. We argue that when considering disk the use of MEMS-based storage in –July 1, 2004. dispatch params queue device driver systems, their performance should be compared to that of a hypothetical This paper describes a decentralized to storage device consistency protocol for survivable disk drive that matches the speed of a MEMS-based storage device. We dis- storage that exploits local data ver- Freeblock system components. sioning within each storage-node. cuss exceptional workloads that can Such versioning enables the protocol for disk access with a system’s pri- use specific features of MEMS-based to efficiently provide linearizability mary applications. It opportunistically storage devices and that may require and wait-freedom of read and write completes maintenance requests by extensions to current abstractions. operations to erasure-coded data in using disk idle time and free-block Also, we consider the ramifications asynchronous environments with scheduling. In this paper, three disk of the assumptions that are made in Byzantine failures of clients and serv- maintenance applications (backup, today’s models of MEMS-based stor- ers. By exploiting versioning storage- write-back cache destaging, and disk age devices. nodes, the protocol shifts most work layout reorganization) are adapted to to clients and allows highly optimistic the system support and evaluated on Diamond: A Storage operation: reads occur in a single a FreeBSD implementation. All are Architecture for Early Discard round-trip unless clients observe con- shown to successfully execute in busy in Interactive Search currency or write failures. Measure- systems with minimal (e.g., <2%) im- ments of a storage system prototype pact on foreground disk performance. Huston, Sukthankar, using this protocol show that it scales In fact, by modifying FreeBSD’s Wickremesinghe, Satyanarayanan, well with the number of failures toler- cache to write dirty blocks for free, Ganger, Riedel & Ailamaki ated, and its single request response the average read cache miss response Proceedings of the 3rd USENIX Con- time compares favorably with an ef- time is decreased by 15–30%. ference on File and Storage Technolo- ficient implementation of Byzantine- MEMS-based storage devices gies (FAST ‘04). San Francisco, CA. tolerant state machine replication. March 31, 2004. and standard disk interfaces: A A Framework for Building square peg in a round hole? This paper explores the concept of Unobtrusive Disk Maintenance early discard for interactive search of Applications Schlosser & Ganger unindexed data. Processing data inside storage devices using downloaded Proceedings of the 3rd USENIX Con- Thereska, Schindler, Bucy, Salmon, searchlet code enables Diamond to ference on File and Storage Technolo- Lumb & Ganger perform efficient, application-specific gies (FAST ‘04). San Francisco, CA. filtering of large data collections. Ear- March 31, 2004. Proceedings of the 3rd USENIX Con- ly discard helps users who are looking ference on File and Storage Technolo- MEMS-based storage devices are a for “needles in a haystack” by elimi- gies (FAST ‘04). San Francisco, CA. nating the bulk of the irrelevant items March 31, 2004. new technology that is significantly different from both disk drives and as early as possible. A searchlet con- This paper describes a programming semiconductor memories. These sists of a set of application-generated model and system support for clean differences motivate the question filters that Diamond uses to determine construction of disk maintenance ap- of whether they need new abstrac- whether an object may be of interest plications. Such applications expose tions to be utilized by systems, or if to the user. The system optimizes the the disk activity to be done, and then existing abstractions will work well. evaluation order of the filters based process completed requests as they This paper addresses this question by on run-time measurements of each are reported. The system ensures examining the fundamental reasons filter’s selectivity and computational that these applications make steady that the abstraction works for existing cost. Diamond can also dynamically forward progress without competing systems, and by showing that these continued on page 15

14 THE PDL PACKET RECENT PUBLICATIONS continued from page 14 partition computation between the This paper describes how systems storage devices and the host computer can automatically learn to classify to adjust for changes in hardware and the properties of files (e.g., read-only network conditions. Performance access pattern, short-lived, small in Attacker numbers show that Diamond dynami- size) and predict the properties of new Network IDS cally adapts to a query and to run-time files, as they are created, by exploit- Ethernet system state. An informal user study ing the strong associations between Network Router of an image retrieval application a file’s properties and the names and supports our belief that early discard attributes assigned to it. These asso- significantly improves the quality of ciations exist, strongly but differently, interactive searches. in each of four real NFS environments Host studied. Decision tree classifiers can IDS SCSI or Atropos: A Disk Array Volume automatically identify and model such ATA Bus Manager for Orchestrated Use associations, providing prediction of Disks accuracies that often exceed 90%. When network IDS Desktop Such predictions can be used to select is bypassed, host IDS Computer continues monitoring. Schindler, Schlosser, Shao, Ailamaki storage policies (e.g., disk allocation & Ganger schemes and replication factors) for Disk Proceedings of the 3rd USENIX Con- individual files. Further, changes in IDS ference on File and Storage Technolo- associations can expose information gies (FAST ‘04). San Francisco, CA. about applications, helping auto- When host IDS is Locally-attached Disk March 31, 2004. nomic system components distinguish turned off, disk IDS growth from fundamental change. continues monitoring. The Atropos logical volume manager allows applications to exploit charac- On the Feasibility of Intrusion The role of a disk-based intrusion detection teristics of its underlying collection Detection Inside Workstation system (IDS). A disk-based IDS watches over all data and executable files that of disks. It stripes data in track-sized Disks are persistently written to local storage, units and explicitly exposes the monitoring for suspicious activity that boundaries, allowing applications to Griffin, Pennington, Bucy, Choundappan, Muralidharan & might indicate an intrusion on the host maximize efficiency for sequential computer. access patterns even when they share Ganger the array. Further, it supports efficient Carnegie Mellon University Parallel Integrating Portable and diagonal access to blocks on adja- Data Lab Technical Report CMU- Distributed Storage cent tracks, allowing applications to PDL-03-106. December, 2003. orchestrate the layout and access of Tolia, Harkes, Kozuch & two-dimensional data structures, such Storage-based intrusion detection Satyanarayanan as relational database tables, to maxi- systems (IDSes) can be valuable tools Proceedings of the 3rd USENIX Con- mize performance for both row-based in monitoring for and notifying ad- ference on File and Storage Technolo- and column-based accesses. ministrators of malicious software ex- gies (FAST ‘04). San Francisco, CA. ecuting on a host computer, including March 31, 2004. File Classification in Self-* many common intrusion toolkits. This Storage Systems paper makes a case for implementing We describe a technique called loo- IDS functionality in the firmware of kaside caching that combines the Mesnier, Thereska, Ellard, Ganger workstations’ locally attached disks, strengths of distributed file systems & Seltzer on which the bulk of important system and portable storage devices, while files typically reside. To evaluate the negating their weaknesses. In spite of Proceedings of the First International its simplicity, this technique proves to Conference on Autonomic Comput- feasibility of this approach, we built a prototype disk-based IDS into a SCSI be powerful and versatile. By unifying ing (ICAC-04). New York, NY. May distributed storage and portable stor- 2004. disk emulator. Experimental results from this prototype indicate that it age into a single abstraction, lookaside To tune and manage themselves, file would indeed be feasible, in terms of caching allows users to treat devices and storage systems must understand CPU and memory costs, to include they carry as merely performance and key properties (e.g., access pattern, IDS functionality in low-cost desktop availability assists for distant file serv- lifetime, size) of their various files. disk drives. continued on page 18

FALL 2004 15 DISSERTATION ABSTRACTS continued from page 13 shares can be used to reconstruct the system software (here referred to as inherent to a storage device. The two original data. storage manager), which translates ap- described performance attributes plication requests into individual I/Os, mesh well with existing storage PH.D. DISSERTATION can automatically match application manager data structures, requiring Matching Application Access access patterns to the provided charac- minimal changes to their code. Most Patterns to Storage Device teristics. This results in more efficient importantly, they simplify the error- Characteristics utilization of storage devices and thus prone task of performance tuning. improved application performance. Jiri Schindler, ECE Exposing performance characteristics August 22, 2003 This dissertation (i) identifies specific has the biggest impact on systems with Conventional computer systems features of disk drives, disk arrays, regular access patterns. For example have insufficient information about and MEMS-based storage devices not in database systems, when decision storage device performance char- exploited by conventional systems, support (DSS) and on-line transac- acteristics. As a consequence, they (ii) quantifies the potential perfor- tion processing (OLTP) workloads utilize the available device resources mance gains these features offer, and run concurrently, DSS experiences inefficiently, which, in turn, results in (iii) demonstrates on three different a speed up of up to 3X, while OLTP poor application performance. This implementations (FFS file system, da- exhibits a 7% speedup. With a single dissertation demonstrates that a few tabase storage manager, and disk array layout taking advantage of access high-level, device-independent hints logical volume manager) the benefits parallelism, a database table can be encapsulating unique storage device to the applications using these storage scanned efficiently in both dimen- characteristics can achieve signifi- managers. It describes two specific at- sions. Additionally, scan operations cant I/O performance gains without tributes: the access delay boundaries run in time proportional to the amount breaking the established abstraction attribute delineates efficient accesses of query payload; unwanted portions of a storage device as a linear address to storage devices and the parallel- of a table are not touched while scan- space of fixed-size blocks. A piece of ism attribute exploits the parallelism ning at full bandwidth.

URSA MAJOR continued from page 11 been renovated to have more power, experts, among others, having been so far. In addition, Cisco has donated cooling, and communication capabili- helping advise us on how to get things networking equipment to establish ties; it can now house approximately right, and we are planning long-term multi-gigabit paths between racks and 8 racks of storage bricks, assuming collaborations on dynamic thermal from the machine room to the campus that each rack’s equipment draws management and power-savings backbone. 8-10 kilowatts (KW) of power, and research. Status & Plans them to the rest of campus On the equipment front, we still have with multiple gigabit Ethernet links. We have made significant progress a ways to go. We recently submitted in developing the software for Ursa In addition, we have been allocated two sizeable infrastructure grants 2000 square feet in the new build- Major. Several experimental ver- (one to NSF and one to DoD), which sions of the fault-tolerance protocols ing for an additional, larger machine would provide approximately half of room, and this room is being designed and server storage mechanisms have our deployment target (25 racks of been built in previous projects, and we to support high-power computation storage bricks). We are hoping for our infrastructure in addition to our large- have been refining both their designs industry partners to provide the other and their implementations. A client scale storage infrastructure. Overall, half, enabling the large-scale experi- the room provides space for 50-60 library for providing a clean object ences needed to truly explore the self- API has been created, and an NFS racks of equipment plus the associated * storage vision. Good initial progress power distribution and cooling equip- server has been ported atop it; this has been made. Generous donations provides an interface that unmodified ment, assuming a forward-looking from Intel, IBM, and Seagate have average of 20 KW per rack. clients can use to access constellation provided two racks for our current storage. First instances of metadata The power and cooling challenges development and testing of the Ursa and event services have been created are substantial and all new to CMU’s Major software. NSF and DoD funds physical facilities staff; HP and APC have roughly matched these donations continued on page 20

16 THE PDL PACKET NEW PDL FACULTY

Chuck Cranor W e a r e neering from the University of Dela- applications and services. He feels pleased to ware, and M.S. and D.Sc. in Computer that understanding a wide range of w e l c o m e Science from Washington University hardware and kernel-level software Dr. Chuck in St. Louis, Missouri. As part of his structures enables one to design kernel Cranor to graduate research he wrote the UVM and application-level interfaces that t h e P D L . Virtual Memory system, which is in are well optimized, clean, safe, and He joined worldwide use as part of the kernel portable. C a r n e g i e of the NetBSD and OpenBSD open- Other areas Chuck has worked in Mellon as a source operating systems projects. system sci- include high speed network monitor- entist last Chuck is interested in systems re- ing, packet telephony, and system on December and is working with us to search in the areas of computer a chip embedded devices. His past make the Self-* Storage project real. operating systems and networking. experience maintaining and remodel- Prior to moving to Pittsburgh, Chuck In particular, he is interested in the ing his research group’s computer lab worked at AT&T Labs-Research in design of low-level systems software at AT&T is proving invaluable in the Florham Park, New Jersey. Chuck that is used to control a computer’s development of Ursa Major and other received his B.S. in Electrical Engi- resources and build higher-level complete systems.

David O’Hallaron D a v i d and Mike Kozuch) the pilot deploy- Randy Bryant, he recently published O ’ H a l l a r o n ment on the Carnegie Mellon campus a new core computer systems text is an Associ- of Internet Suspend/Resume (ISR), (Computer Systems: A Programmer’s ate Professor which combines virtual machines Perspective, Prentice Hall, 2003) in the depart- and distributed storage systems to that has been adopted by numerous ments of Com- allow people to access their personal schools worldwide. puter Science computers from any other computer. and Electrical The new PDL petabyte data storage and Computer system, Ursa Major, would be an ideal Engineering repository for storing and analyzing at Carnegie Mellon University. He both the massive Quake datasets and received his Ph.D. from University the ISR virtual machine state. of Virginia. After a stint at GE, he In 1998 the CMU School of Computer joined the Carnegie Mellon faculty Science awarded Prof. O’Hallaron in 1989. and the other members of the CMU Prof. O’Hallaron works on parallel Quake Project the Allen Newell and distributed computer systems. Medal for Research Excellence. In He is currently leading (with Jacobo 2000, a benchmark he developed for Bielak) the Carnegie Mellon Quake the Quake project was selected by project, which is developing the ca- SPEC for inclusion in the influential pability to predict the motion of the CPU2000 and CPU2000omp (Open ground during strong earthquakes, and MP) benchmark suites. In November, the Euclid project, which is develop- 2003, Prof O’Hallaron and the other ing computational database systems members of the Quake team won the that represent massive scientific 2003 Gordon Bell Prize, the top in- datasets as database structures and ternational prize in high performance that perform the scientific computing computing. In Spring 2004, he was process by creating, querying, and awarded the Herbert Simon Award updating these databases. He is also for Teaching Excellence by the CMU Greg welcomes our PDC members to the leading (with Satya, Casey Helfrich, School of Computer Science. With PDL Spring Visit Day.

FALL 2004 17 RECENT PUBLICATIONS continued from page 15 ers. Careless use of portable storage decision of which types of faults from has no catastrophic consequences. disclosures, requests task implementation time to data-item cre- & cancellations control Experimental results show that sig- scheduler ation time. If desired, each data-item user 1 supply or processor nificant performance improvements drop outputs stats control can be protected from different types are possible even in the presence of and numbers of faults. This paper start, preempt stale data on the portable device. outputs output store & remove tasks describes and evaluates a family of user n dropped storage access protocols that exploit Design and Implementation of outputs outputs task 1 task n data versioning to efficiently provide a Freeblock Subsystem users task control cluster resources consistency for erasure-coded data. This protocol family supports a wide Thereska, Schindler, Lumb, Bucy, range of fault models with no changes Salmon & Ganger Interaction between users, the scheduler, and the computing center’s resources. to the client-server interface or server Carnegie Mellon University Parallel implementations. Its members also exploring search spaces disclose sets Data Lab Technical Report CMU- shift overheads to clients. Readers of speculative tasks, request results as PDL-03-107, December, 2003. only pay these overheads when they needed, and cancel unfinished tasks actually observe concurrency or fail- Freeblock scheduling allows back- if early results suggest no need to ures. Measurements of a prototype ground applications to access the continue. Doing so matches natural block-store show the efficiency and disk without affecting primary sys- usage patterns, making users more scalability of protocol family mem- tem activities. This paper describes effective, and also enables a new class bers. a complete freeblock subsystem, of schedulers. In simulation, we dem- implemented in FreeBSD. It details onstrate how batchactive schedulers Balancing Locality and significantly reduce user-observed re- new space- and time-efficient algo- Randomness in DHTs rithms that make freeblock scheduling sponse times relative to conventional useful in practice. It also describes models in which tasks are requested Zhou, Ganger & Steenkiste algorithm extensions for using idle one at a time (interactively) or re- time, dealing with multi-zone disks, quested in batches without specifying Carnegie Mellon University Technical reducing fragmentation, and avoiding which are speculative. Over a range Report CMU-CS-03-203, November starvation of the inner- and outer-most of simulated user behavior, for 20% 2003. tracks. The result is an infrastructure of our simulations, user-observed Embedding locations in DHT node that efficiently provides steady disk response time is at least two times IDs makes locality explicit and, access rates to background applica- better under a batchactive scheduler, thereby, enables engineering of the tions, across a range of foreground and about 50% better on average. trade-off between careful placement usage patterns. Batchactive schedulers achieve such and randomized load balancing. This improvements by segregating tasks paper discusses hierarchical, topol- Scheduling Explicitly- into two queues based on whether ogy-exposed DHTs and their benefits speculative Tasks a task is speculative and scheduling for content locality, and administra- these queues separately. Moreover, we tive control and routing locality. Petrou, Ganger & Gibson show how user costs can be reduced under an incentive cost model of A Prototype User Interface Carnegie Mellon University Technical charging only for tasks whose results Report CMU-CS-03-204, November are requested. for Coarse-Grained Desktop 2003. Access Control A Protocol Family for Large-scale computing often consists Long, Moskowitz & Ganger of many speculative tasks to test Versatile Survivable Storage hypotheses, search for insights, and Infrastructures Carnegie Mellon University Technical review potentially finished products. Report CMU-CS-03-200, November For example, speculative tasks are is- Goodson, Wylie, Ganger & Reiter 2003. sued by bioinformaticists comparing Carnegie Mellon University Parallel Viruses, trojan horses, and other DNA sequences and computer graph- Data Lab Technical Report CMU- malware are a growing problem for ics artists adjusting scene properties. PDL-03-103, December 2003. computer users, but current tools and This paper promotes a new computing research do not adequately aid users model for shared clusters and grids Survivable storage systems mask in which researchers and end-users faults. A protocol family shifts the continued on page 19

18 THE PDL PACKET RECENT PUBLICATIONS continued from page 18 in fighting these threats. One approach and use. In addition to a security compromises to field-upgradeable to increasing security is to partition all model that is intrinsically useful, we devices. In particular, secure boot applications and data based on general believe development of this system standards should verify the firmware task types, or “roles,” such as “Person- will inform issues in the design and of all devices in the computer, not al,” “Work,” and “Communications.” implementation of usable security in- just devices that are accessible by the This can limit the effects of malware terfaces, such as refinement of design host CPU. Modern computers contain to a single role rather than allowing it guidelines. many autonomous processing ele- to affect the entire computer. We are ments, such as disk controllers, disks, developing a prototype to investigate Secure Bootstrap is Not network adapters, and coprocessors, the usability of this security model. Enough: Shoring up the that all have field-upgradeable firm- Our initial investigation uses cogni- Trusted Computing Base ware and are an essential component tive walkthrough and think-aloud user of the computer system’s trust model. studies of paper prototypes to look at Hendricks & van Doorn Ignoring these devices opens the sys- this model in the context of realistic tem to attacks similar to those secure tasks, and to compare different user Proceedings of the Eleventh SIGOPS boot was engineered to defeat. interface mechanisms for managing European Workshop, ACM SIGOPS, data and applications in a role-based Leuven, Belgium, September 2004. system. For most participants, our We propose augmenting secure boot interface was simple to understand with a mechanism to protect against

YEAR IN REVIEW continued from page 4

of Intel Pittsburgh, presented age Systems Research at CMU's “Modeling Replication Protocols “Dimond: A Storage Architecture Parallel Data Lab. with Actions and Constraints.” for Early Discard in Interactive  SDI Speaker: Garth Gibson, of Search.” Panasas, Inc., spoke on “Scalable  Christos gave a tutorial on Object-based Storage.” “Stream and Sensor Data Min-  James Newsome presented ing” at EDBT 2004 in Heraklion “GEM: Graph EMbedding for Greece. Routing and Data-centric Storage February 2004 in Wireless Sensor Networks” at ACM SenSys in Los Angeles,  Chenxi Wang presented “On Tim- CA. ing Attacks Against Low-latency Mix-based Systems” at Financial October 2003 Cryptography 04 in Florida.  11th Annual PDL Retreat and Workshop. January 2004  8 PDL faculty and students at-  Christos Faloutsos began his tended SOSP 03 in Bolton Land- sabbatical at IBM Almaden, ing, NY. researching database privacy and  SDI Speaker: Craig Harmer, of data mining on large graphs. Veritas Software Corporation, December 2003 spoke on “Current Considerations  Ted Wong successfully defended and Future Features in a Real his Ph.D. dissertation “Decentral- World File System.” ized Recovery for Survivable  SDI Speaker: Howard Gobioff, of Storage Systems.” Google and also a PDL alumni, presented “The Google File Sys- November 2003 tem.”  Greg spoke at Supercomputing  SDI Speaker: Marc Shapiro, Eno works on his putting game at the PDL 2003 in Phoenix, AZ on "Stor- Microsoft Research, spoke on 2003 Retreat.

FALL 2004 19 URSA MAJOR

Continued from page 18 and are being iteratively extended. large file server WRITE (D) New constellation infrastructure soft- (initially NFS survivable ware is being created for starting the version 3). This storage infrastructure various components, monitoring their design deci- user group 1 liveness, providing for migration and sion was made messaging, etc. to reduce soft- head end 1 Our plan is to have a complete first ware version instance of Ursa Major running by complexities the end of 2004. This first instance and user-visible READ(D) will have the fault-tolerance, versatil- changes—client machines can be ity, and instrumentation mechanisms user group n described earlier. It will allow us to unmodified, and begin exploring performance trade- all bug fixes and head end n unmodified user systems bridges into the store offs and optimization, healing, and experiments can diagnosis algorithms in the context of happen trans- Figure 3. Head-end systems act as bridges into the storage the real system. (Initial explorations parently behind infrastructure, translating NFS commands into the infrastructure’s internal protocols. will occur in simulation.) a standard file By the end of the academic year, we server interface. The resulting archi- The system instrumentation will also hope to deploy a refined instance of tecture is illustrated in Figure 3. Each allow us to capture invaluable long- Ursa Major, with early optimization user community will mount a single term information about workloads, and healing agents, in support of our large file system, as though they were data change rates, and hardware and first real customer. Of course, initial the only users, and communicate with software component failure rates. customers will be carefully selected a specific (virtual) head-end system. Summary to have the appropriate expectations, As the system grows more solid, and PDL’s Self-* Storage project is a big, since we expect to have problems in our equipment infrastructure expands, exciting, scary expedition towards the early stages, possibly resulting we will add additional storage custom- dependable, automated storage infra- in unexpected downtime and even ers. As indicated earlier, the target is a structures. Things are off to a good data loss. The earthquake modeling petabyte of usable storage with a wide start, and Ursa Major is starting to research pursued by Prof. David variety of real customers. Although take shape. Stay tuned, over the next O’Hallaron’s group is an ideal first the final goal is true self-*-ness, all few years, as the project unfolds -- it project, as the large datasets on which early stages will require sysadmin promises to be very interesting. they work are created from much staff for the deployed infrastructure Reference smaller observation datasets that are and also time spent by graduate stu- replicated elsewhere. Of course, we dents. We plan to carefully account for Ganger, G.R. et al. Self-* Storage: will also be exercising the deployed and categorize every man-hour, allow- Brick-based Storage with Automated infrastructure with benchmark work- ing us to understand and report where Administration. Carnegie Mellon loads to provide additional loads. their time goes and to quantify im- University Technical Report, CMU- From customers’ perspectives, the provements made in iterations of the CS-03-178, August 2003. storage infrastructure will look like a system’s self-management features.

PDL Retreat 2003 attendee group photo.

20 THE PDL PACKET