THE MAGAZINE OF USENIX & SAGE April 2003 • volume 28 • number 2

inside: CONFERENCE REPORTS: WIESS '02, OSDI '02

The Advanced Computing Systems Association & The System Administrators Guild

conference reports

This issue's reports focus on WIESS '02 and on OSDI '02.

2nd Workshop on Industrial Experiences with Systems Software (WIESS '02)
BOSTON, MASSACHUSETTS
DECEMBER 9-11, 2002

OUR THANKS TO THE SUMMARIZERS: Scott Banachowski, Richard S. Cox, Steven Czerwinski, Himanshu Raj, Cristian Tapus, Charles P. Wright, Praveen Yalagandula, Wanghong Yuan, Nickolai Zeldovich, Ben Zhao, Yutao Zhong

KEYNOTE ADDRESS
Douglass J. Wilson, IBM

Summarized by Richard S. Cox

MIT's Technology Review recently ran a story titled "Why Software Is So Bad." The key is the problem of integration. CIOs spend 35% of their budgets on integration, because every new system must work with the existing infrastructure. The complexity of integration is driven up by the constraints of the business environment as well as those of the software.

Several lessons can be learned from studying systems usage. First, standards and componentization are proving ineffectual for complex systems. For example, LDAP is a fine protocol, but no two organizations use the same schema. Making matters worse, interoperability is poor due to differing interpretations of standards, edge conditions, and vendor-specific extensions. This is leading to a change from creating solutions by mixing "best-of-breed" products to using a single "best-of-suite" package. Unfortunately, much of the literature on building component systems is academic, failing to deal with the scale of large systems.

Second, systems will fail. Other industries have accepted this, but software engineers are just now realizing that failure is hard. The recovery design must fit the usage, which means the designer must understand the failure modes in practice. This may mean using less sophisticated algorithms that are better fitted to the purpose. It also means accepting that business redundancy may be at odds with IT redundancy. For example, a stock trading system may have several redundant pathways for entering a trade, to protect against trades being lost before they have been entered. For the IT infrastructure, this means the redundant pathways need to be synchronized at some point. This type of problem is rarely considered by researchers or product developers.

Third, error logging and reporting are important. As an industry, we currently support very primitive logging, with no mechanisms for root-cause analysis or correlation of failures. Error messages are often arcane or not useful, and "first-failure" capture is impossible. This is evidenced by a common, though unrealistic, request from support center staff: "Turn logging on and recreate the failure." Because logging events need to be correlated, error tracking and logging should be a basic service of the OS.

SESSION 2
Summarized by Wanghong Yuan

USING END-USER LATENCY TO MANAGE INTERNET INFRASTRUCTURE
Bradley Chen, Michael Perkowitz, Appliant

The problem addressed in this paper is that distributed application performance is important but hard to understand. CDN selection and CRM systems were offered as examples to illustrate the problem. The basic approach proposed is to use end-user latency analysis: (1) content (e.g., an HTML Web page) is tagged to collect data; (2) tagged data is observed on the desktop (end-client system); and (3) data is analyzed on the management server.

The challenges for this approach include (1) technique issues such as large data sets, heavy-tailed data, and the derivation of request properties, and (2) social and economic issues such as privacy. The results show that end-user latency analysis can monitor relevant information that is otherwise obscured.
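The server-side analysis step of the approach above can be sketched in a few lines. This is an illustrative toy, not Appliant's system: the record shape and page names are invented, and the median is used because heavy-tailed latency data makes averages misleading.

```python
from statistics import median

# Hypothetical latency reports a tagged page might send back to the
# management server: (page, end-user latency in milliseconds).
reports = [
    ("/checkout", 420), ("/checkout", 1800), ("/home", 95),
    ("/checkout", 510), ("/home", 130), ("/home", 88),
]

def summarize(reports):
    """Group reports by page and summarize each group's latency."""
    by_page = {}
    for page, ms in reports:
        by_page.setdefault(page, []).append(ms)
    return {page: {"n": len(ms), "median_ms": median(ms), "worst_ms": max(ms)}
            for page, ms in by_page.items()}

summary = summarize(reports)
```

With the sample reports, the /checkout page's 1800 ms worst case stands well apart from its 510 ms median, which is the kind of tail behavior the talk flagged as hard to see without end-user measurement.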

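The keynote's call for "correlation of failures" amounts to being able to gather the log events of one request across independent services. A minimal sketch, with every field name and message invented for illustration:

```python
# Hypothetical log events from three services handling the same requests.
events = [
    {"req": "r1", "svc": "web",  "msg": "500 returned"},
    {"req": "r1", "svc": "db",   "msg": "connection pool exhausted"},
    {"req": "r2", "svc": "web",  "msg": "200 ok"},
    {"req": "r1", "svc": "auth", "msg": "token verified"},
]

def correlate(events, req_id):
    """Collect every service's log lines for one request, so a failure can
    be read across components instead of one arcane message at a time."""
    return [(e["svc"], e["msg"]) for e in events if e["req"] == req_id]

trail = correlate(events, "r1")  # the web error and the db exhaustion line up
```

The point of making this an OS-level service, as the keynote argues, is that the shared request identifier has to exist before the failure occurs; it cannot be reconstructed afterward by "turning logging on and recreating the failure."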
82 Vol. 28, No. 2 ;login:

BUILDING AN "IMPOSSIBLE" VERIFIER ON A JAVA CARD
Damien Deville, Gilles Grimaud, Université des Sciences et Technologies de Lille

The smart card device environment imposes constraints on CPU, memory, and I/O. As a result, the Java Card Virtual Machine needs to be adapted to the smart card. The regular verification approaches do not fit, since unification is costly. The proposed approach addresses the above problems via (1) non-stressing encoding and (2) efficient fixed points using a software cache policy.

ENHANCEMENTS FOR HYPER-THREADING TECHNOLOGY IN THE OPERATING SYSTEM: SEEKING THE OPTIMAL
Jun Nakajima, Venkatesh Pallipadi, Intel

In this talk, Jun Nakajima first gave an overview of Hyper-Threading (HT) technology by comparing it with multiprocessors. The reason behind HT is that CPU units are not fully utilized. To fully utilize CPU units, the HT approach is to use two architectural sets, thereby executing two tasks simultaneously. The HT approach requires the OS scheduler to support HT-aware idle handling, processor-cache affinity, and scalability (per-package run queues). This paper proposes a micro-architecture scheduling assist (MASA) methodology to address the above problems, thereby achieving an optimal process placement.

INVITED TALK

SOFTWARE STRATEGY FROM THE "1980 TIME CAPSULE"
John R. Mashey

Summarized by Yutao Zhong

John Mashey reused the slides from a talk he gave 25 years ago titled "Small Is Beautiful and Other Thoughts on Programming Strategies." It is interesting to see from these old slides and the newly added comments what has changed and what hasn't. The previous talk was given in 1977, when the main computer models were IBM mainframes, the coming VAX, and PDP-11s, while C was taking the place of ASM and structured programming became the dominating idea.

Three approaches to building system software were introduced and compared: "Do it right," "Do it over," and "Do it small, with tools." "Do it right" emphasizes an optimistic up-front requirements analysis that assumes "we know what we are doing." "Do it over" puts more emphasis on early implementation but still starts from scratch. The last approach, by contrast, considers tools instead of systems and builds small and fast so that, if necessary, failures can happen quickly.

In order to see the effect of these strategies, Mashey discussed different metrics to qualitatively measure success and gave statistics and observations of projects in data processing. Figures and numbers showed the low percentage of complete success and indicated that the larger a project is, the higher the overhead it has to pay. Laws of program evolution also state that the entropy of a project increases with time and may result in a complex program used to solve a simple problem.

Several principles were offered to counteract these problems: "build it fast," "keep it small and simple," and "build for change." Existing tools should be utilized whenever possible. It would be good to build tools and consider the interfaces connecting tools. Some "small tactics," including "lifeboat theory," "sinking lifeboat theory," and other considerations about people and consolidation, were also discussed.

Even after 25 years of work, we need to keep these problems in mind, since system complexity is much higher nowadays; fortunately, people are increasingly aware of these issues. Mashey ended the talk by saying, "We have met the enemy and they are us."

SESSION 4
Summarized by Cristian Tapus

AN EXAMINATION OF THE TRANSITION OF THE ARJUNA DISTRIBUTED TRANSACTION PROCESSING SOFTWARE FROM RESEARCH TO PRODUCTS
M.C. Little, HP–Arjuna Labs; S.K. Shrivastava, Newcastle University

Arjuna started in 1986 as a research project at the University of Newcastle, England. Arjuna was a "vehicle for getting Ph.D. degrees." The decision to use C++ was a pragmatic one (expensive Euclid vs. free C++ from AT&T). Arjuna was designed to be a toolkit for the development of fault-tolerant applications which would provide persistence, concurrency control, and replication. Modularity was the key to the longevity of the system.

In 1994 Newcastle University asked them to implement a student registration system because "academic researchers are cheap." The system was supposed to run on multiple platforms, serve about 15,000 students over five days, and could not tolerate failures. There were problems, though. Assumptions were made about network partitions and recovery that made the system fail to distinguish dead machines from slow connections. Intuition is not a good approach to designing systems.

The year 1995 brought standards for transactions: the Object Transaction Service specification (OTSS) from the OMG. It shared many similarities with Arjuna, but it was only a two-phase commit protocol engine (persistence and concurrency control were required from elsewhere). At this time the OTSArjuna system was developed. With only slight changes to the interfaces between modules, the system was complying with OTS. JTSArjuna followed just two years later as the first Java transaction service.

In 1999 the Java and C++ transaction services were marketed; only one year later Bluestone took over Arjuna Solutions Limited and was, in turn, acquired by HP in 2001. When the system was

acquired by Bluestone, the need for real testing became a reality. For the previous decade only about 20 tests had been used, but this was increased to over 4000 tests in order to stretch every feature of the system. The previous method was to get a release out to the users, and users would then report problems back and bugs would be fixed. Not anymore. The industry method was different from many perspectives: write manuals and white papers and train other people. "I used to laugh at white papers, but I realized you need skills. An academic person cannot do it. Academic people do technical reports, which are different," Little said.

The talk continued by describing techniques used to obtain the final product. When you hit bedrock, try to rethink what you are doing, and observe the "rule of holes": if you are in one, stop digging. In the end, certain lessons were learned from the development process: designs and reviews are important, but reviews are not perfect; there needs to be a willingness to stop and change course when necessary and to throw code away, even if it works; and you need someone nearby who's close to the process but objective.

Was it worth it? YES. But it was stressful moving away from R&D. "If you have a family don't do it. If you are in the industry and you feel you are stressed up, move to academia." When asked what they would do differently, the reply was that they would (1) get somebody else to do the failure recovery and (2) make sure that they would have more than 20 tests.

TREE HOUSES AND REAL HOUSES: RESEARCH AND COMMERCIAL SOFTWARE
Susan LoVerso and Margo Seltzer, Sleepycat Software

Susan talked about the process they followed to make a commercial product out of BerkeleyDB. The main argument was that a research prototype is like a tree house – it doesn't last – while a commercial product is the real house. Sleepycat was founded in 1996 and transformed DB 1.85 into a real product. It added transactions, utilities, and recoverability while continuing to be open source. Sleepycat is a "distributed" company; with employees spread across the world, it is hard for them to interact with each other directly. But the heterogeneous environment makes the company more powerful. In order to produce quality software, however, Sleepycat must follow rigorous software practices. Designs and reviews are sent to the entire engineering staff (one advantage of being small), and there are strict coding standards (it is the law).

JOINT WIESS/OSDI PANEL: RESEARCH MEETS INDUSTRY
Chair: Noah Mendelsohn; Panelists: Ramon Caceres, ShieldIP; Mark Day, Cisco Systems; Charles Leiserson, MIT; Dick Flower, HP; Brian Bershad, University of Washington

Summarized by Nickolai Zeldovich

The joint panel discussed issues related to the bridge between research and industrial development and their correlation. Below are capsules of the discussion by the panel and the audience.

Brian Bershad: Knowing how to teach helps in the industry, as does having a degree. You need to know how to manage and motivate people in academia. Coming back to academia, you start asking questions like, "Who cares about this project?" "Will it scale?" and so on.

Someone Whose Name I Forgot: People in academia are generally not interested in details, testing, and usability, which are needed to take something from research to a product. The industry in general is also not very interested in research work, reading papers, and so on.

Mark Day: More incentives are needed to get industry and academia to interact. Currently there are almost no such incentives.

Andrew Hume: I do technology transfer at AT&T. The problem is enticing researchers, because you go for a while without publishing papers. On the other hand, you can then write a different kind of paper, about the real aspects of systems. Academia should care more about results having to do with real details.

Noah Mendelsohn: Academic papers don't line up with industrial interests. If conferences did accept industry papers, would companies write them?

Andrew Hume: Yes. Motivating factors are satisfaction and recognition, perhaps because this is rare.

Noah Mendelsohn: There's also an opportunity cost to writing a paper, of losing developer time.

Charles Leiserson: Students going into industry don't understand company culture; they are used to the academic environment.

Margo Seltzer: There needs to be motivation for companies to write papers. Engineers want to write papers, but they need to sell papers to managers, as a tool for marketing, for example.

Brian Bershad: Often companies don't want intellectual property published. Thus, commercial papers lack technical detail.

Charles Leiserson: Writing papers is usually as useful internally as getting them published. Papers help internal communication.

Roblis(?), Intel: At Intel Labs, writing papers is rewarded and expected. In the product groups, however, it is viewed as a net negative. It would be useful if conferences could accept/reject rough drafts to avoid wasted write-up efforts. Anthropologists studied engineers and found that usually there are a few "leaders" in engineering groups who go to conferences, lead the effort to turn something into a product, etc. Maybe we don't need people transfer; we just need to market things to these "leaders"?

Dick Flower(?): There are groups without leading individuals. Having an advanced development group of some sort could be useful, though.

Brian Bershad: I think some companies have reasonable expectations of the research world, and some companies don't.

John(?): The HotChips conference, for example, only produces presentations and not papers. It's much easier to get a presentation, rather than a paper, from a lead chip designer.

Mark T (MS Research): At PARC, of the people who went to industry, none ever came back for long. Can you ever come back from the industry?

Brian Bershad: No, it's not possible to come back and be the same. Your focus changes to short-term goals.

Charles Leiserson: In my lab, lots of people, including staff, did it OK. Doing so colors your interest, though. You learn about things like barriers to adoption, etc.

Mark Day: Do you mean returning to applied research or to academia?

Mark T (MS Research): PARC returnees were successful in the industry and kept going back to form new startups. Focus on doing something with impact in the world.

Noah Mendelsohn: Having gone to industry before grad school gave me a great perspective on reality, judging the realism of projects, etc. It's very hard to do research part-time.

Bradley Chen (Appliant): What do you think about requiring faculty to have industrial experience?

Charles Leiserson: It depends on the quality of the experience. But yes, there are things from industry to be taught to university people: management, leadership, motivation, educating about teamwork, working with each other.

Fred Douglis (IBM): Were there more core industrial papers back when USENIX took extended abstracts?

Chris Small: We seem to have a hangover after the dot-com boom. There was a huge flux of ideas from research to commerce.

Brian Bershad: The dot-com boom shows what happens when the barriers to adoption from research are removed. The result wasn't so great – too many worthless ideas, no industrial experience. Doing a startup is easier the second time around.

Someone from VMware: We were lucky to have good timing to submit our paper to OSDI – the submission deadline was a few months after an internal deadline, which gave us time to gather results. The community should be more receptive to papers about released or dead products; they are valuable.

Jun Nakajima (Intel): In this economy, R&D costs are being reduced and moved to China. For the cost of one engineer here you can get three to five engineers in China. How do you justify the three-to-five-times cost?

Charles Leiserson: Education. Also location – most other companies are located here.

Erez Zadok (Stony Brook): Academia is not preparing students for life in industry. It's difficult to convince universities to create courses with practical aspects.

5th Symposium on Operating Systems Design and Implementation (OSDI '02)

TECHNICAL SESSIONS

DECENTRALIZED STORAGE SYSTEMS
Summarized by Himanshu Raj

FARSITE: FEDERATED, AVAILABLE, AND RELIABLE STORAGE FOR AN INCOMPLETELY TRUSTED ENVIRONMENT
Atul Adya, William J. Bolosky, Miguel Castro, Gerald Cermak, Ronnie Chaiken, John R. Douceur, Jon Howell, Jacob R. Lorch, Marvin Theimer, Roger P. Wattenhofer, Microsoft Research

The goal of this research was to make a scalable serverless distributed file system while maintaining security against malicious attacks in untrusted systems. Byzantine protocols are used to deal with the untrusted infrastructure. The FARSITE solution is a virtual global file store of encrypted data that is replicated to facilitate availability. Storage is divided into two parts: file data and metadata. Metadata information is a hash-computed form of the actual file data. The system is built around an infrastructure that stores file data and metadata separately, and traditional Byzantine properties are applied only to machines storing metadata.

Since Byzantine operations are costly, they are not performed per file I/O. Instead, the result of a Byzantine operation is valid for the period of a lease. Various types of leases are available to suit different consistency requirements. Batching is another concept used to reduce cost. The system is implemented as a user-level service and a kernel-mode driver that routes the actual file system calls to NTFS. According to the results reported, the system performs better than a central file system, though worse than bare-bones NTFS. The system is not designed to address efficient large-scale write sharing, database semantics, or disconnected operations. The project link is http://research.microsoft.com/sn/Farsite/.
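The lease idea described above can be sketched as follows: the expensive agreement step runs only when no unexpired lease covers the file, so its cost is amortized over many accesses. This is an illustrative toy with invented names and an integer clock, not FARSITE's protocol.

```python
class LeaseTable:
    """Toy sketch of lease-based amortization of an expensive operation."""

    def __init__(self, lease_len):
        self.lease_len = lease_len
        self.expiry = {}            # path -> time the current lease expires
        self.agreement_rounds = 0   # stand-in for costly Byzantine operations

    def access(self, path, now):
        if now >= self.expiry.get(path, 0):   # no valid lease: pay the cost
            self.agreement_rounds += 1
            self.expiry[path] = now + self.lease_len
        # under a valid lease, the access proceeds with no agreement round

table = LeaseTable(lease_len=30)
for now in range(0, 100, 10):   # ten accesses spread over 100 "seconds"
    table.access("/a/file", now)
```

Here ten accesses trigger the expensive path only four times; a longer lease would trade fewer agreement rounds for a longer window in which the leased result may be stale, which is why FARSITE offers multiple lease types for different consistency requirements.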

TAMING AGGRESSIVE REPLICATION IN THE PANGAEA WIDE-AREA FILE SYSTEM
Yasushi Saito, Christos Karamanolis, Magnus Karlsson, Mallik Mahalingam, HP Labs

Pangaea is a scalable distributed file system targeted at the type of WAN infrastructure characteristic of multinational companies with overseas corporate offices and a need to share data. Design goals of Pangaea include hiding WAN link latencies, availability in a high-change environment, and network usage efficiency. The system assumes the presence of an available secure infrastructure, such as a VPN. The system provides eventual consistency, though manual open/close-style consistency could also be provided. The system employs pervasive replication to dynamically replicate each file/directory in the system independently. Benefits drawn from intensive replication are speed, availability, and network efficiency.

The system is implemented based on the SFS API. The NFS client at the kernel level routes I/O requests to the Pangaea service at the user level. The system uses graph-based replica management: a random graph is created for every file/directory in the system, and the edges of this graph are used for update propagation among replicas and for replica discovery. Since the system does not have a central lock manager, it uses a technique called harbingers to compute a spanning tree so that duplicate transmissions can be avoided. This technique also helps reduce the propagation delay. The project link is http://www.hpl.hp.com/research/ssh.

IVY: A READ/WRITE PEER-TO-PEER FILE SYSTEM
Athicha Muthitacharoen, Robert Morris, Thomer M. Gil, Benjie Chen, MIT

The main goal of Ivy is to build a highly available file system out of inexpensive infrastructure that can scale to multiple writers on the same data. The system leverages a DHT in the core, and provides weaker consistency guarantees for metadata for performance reasons. The main idea behind the system is to use a log per user and combine potentially multiple logs to serialize the updates made on a shared object. Serialization is based on a version-vectoring scheme rather than a timestamp technique. The evaluation compares Ivy's performance with a local file system to see the load characteristics, and runtime comparisons are made with NFS over a WAN. Results show that log operations tend to dominate the performance of the system over a WAN, and parallel fetching of log records can be used to hide latency.

The system addresses sharing among only a small number of writers and hence does not address the scalability issues involved with a large number of writers sharing an object. Merging of logs is performed later, as in the Coda file system, and conflict resolution is addressed then. The way to provide effective read sharing in Ivy is to use multiple file systems. The project link is http://pdos.lcs.mit.edu/ivy.

ROBUSTNESS
Summarized by Ben Zhao

DEFENSIVE PROGRAMMING: USING AN ANNOTATION TOOLKIT TO BUILD DOS-RESISTANT SOFTWARE
Xiaohu Qie, Ruoming Pang, Larry Peterson, Princeton University

Qie began by examining how typical DoS attacks work. One attacks a Web server, for example, by intentionally slowing down TCP, faking packet loss, and attempting to tie down as many TCP connections at the server end as possible. It is useful to classify resources as renewable (CPU, network bandwidth, disk bandwidth) or nonrenewable (processes, file descriptors, memory buffers). Renewable resources are vulnerable to "busy attacks," which try to request the resources faster than they can be allocated; the corresponding solution is protection via admission control. Nonrenewable resources are vulnerable to "claim-and-hold attacks," which attempt to request and hold on to them; the corresponding solution is to recycle resources when they are exhausted, reclaiming them from certain applications. Combinations of the two types of resource attacks are harder to deal with. For example, when file descriptors (nonrenewable) are recycled to protect against claim-and-hold attacks, they become a renewable resource and are therefore vulnerable to busy attacks.

The proposal is to use a toolkit containing "sensors" and "actuators" to protect both types of resources, with low programming burden. The toolkit is a combination of techniques from work in protection, static analysis, anomaly detection, and profiling. To protect renewable resources, the approach is to divide functionality into distinct services and balance resources among them, such that the impact of an attack on a single service is limited to that service. To protect nonrenewable resources, they need to be recycled when necessary. The algorithm that chooses the resource instance to reclaim can be driven by a timer, which can be set on idleness or on the length of the service lifetime. The work also proposes a user-defined progress metric (amount of data output or number of state transitions) for reclaiming resources from the "slowest" principal.

The toolkit is implemented as 11 C macros and library functions. The authors also modified gcc for auxiliary code generation at compile time. The evaluation contains case studies of the Flash Web server. The Web server was partitioned into 46 services; 60 annotations were added to the code. Under a slash attack, the annotated server's response time is 5.1 milliseconds, compared to a normal response time of 4.3 milliseconds, and is significantly lower than that of a non-annotated server under attack, which has a response time of 25 seconds. A possible limitation is that the toolkit's effectiveness depends on service granularity. The project link is http://www.cs.princeton.edu/nsg.
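The claim-and-hold defense described above amounts to evicting the least-productive holder when a nonrenewable resource runs out. The real toolkit is C macros woven into a server; the sketch below is a toy with invented names, using "bytes of output" as the user-defined progress metric:

```python
class RecyclingPool:
    """Toy sketch of recycling a nonrenewable resource (e.g., connection
    slots): when the pool is full, reclaim the slot of the principal
    making the least progress by a user-defined metric."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.progress = {}   # principal -> progress metric (bytes of output)

    def note_progress(self, principal, amount):
        self.progress[principal] = self.progress.get(principal, 0) + amount

    def acquire(self, principal):
        if principal not in self.progress and len(self.progress) >= self.capacity:
            victim = min(self.progress, key=self.progress.get)  # "slowest" principal
            del self.progress[victim]                           # recycle its slot
        self.progress.setdefault(principal, 0)

pool = RecyclingPool(capacity=2)
pool.acquire("well-behaved")
pool.note_progress("well-behaved", 4096)  # steadily produces output
pool.acquire("attacker")                  # claims a slot, then holds it idle
pool.acquire("new-client")                # pool full: the idle slot is reclaimed
```

The attacker acquires a slot and then makes no progress, so when a new client arrives at the full pool it is the attacker's slot that gets recycled, while the productive holder is untouched.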

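Ivy, summarized earlier in this session, serializes per-user log records with version vectors rather than timestamps. Ivy's real log format is more involved; this toy sketch, with an invented record shape, shows the two ingredients such a scheme needs: a causal comparison between vectors, and a deterministic tie-break for concurrent records.

```python
def dominates(v1, v2):
    """True if version vector v1 is causally at-or-after v2 in every component."""
    return all(v1.get(k, 0) >= v2.get(k, 0) for k in set(v1) | set(v2))

def serialize(records):
    """Order log records by causal depth, breaking ties between concurrent
    records deterministically by user id; records are (user, vector, op)."""
    return [op for _, _, op in
            sorted(records, key=lambda r: (sum(r[1].values()), r[0]))]

create  = ("alice", {"alice": 1}, "create f")
append1 = ("bob",   {"alice": 1, "bob": 1}, "append f v1")
append2 = ("carol", {"alice": 1, "carol": 1}, "append f v2")  # concurrent with bob's

ops = serialize([append2, append1, create])
```

Because neither append dominates the other, every replica must fall back on the same tie-break to agree on one order; that is the serialization step, while the later, Coda-style log merge handles any application-level conflict the concurrent writes caused.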
USING MODEL CHECKING TO DEBUG DEVICE FIRMWARE
Sanjeev Kumar, Kai Li, Princeton University

Device firmware is a piece of concurrent software that achieves high performance at the cost of software complexity. It contains subtle race conditions that make it difficult to debug using traditional debugging techniques. The problem is further compounded by the lack of debugging support on the devices. Model checking is a promising approach: it can systematically explore all possible scheduling orders and provide counter-examples for the bugs it finds. The general technique is to extract models from programs either manually or via a compiler. The authors extracted models for the Spin model checker from programs written in the ESP language, a language for programmable devices whose compiler is used to generate the models. In the evaluation, the techniques were applied to VMMC (a high-performance communication design that bypasses the OS for data transfers). The VMMC firmware was reimplemented using ESP. Several bugs were found using the abstract models, despite the global nature of some of the bugs (deadlock). These bugs would be hard to find without using a model. Where a full search of the state space is not possible, partial searches can minimize resource costs and still produce useful results.

CMC: A PRAGMATIC APPROACH TO MODEL CHECKING REAL CODE
Madanlal Musuvathi, David Y.W. Park, Andy Chou, Dawson R. Engler, David L. Dill, Stanford University

Many system errors do not emerge unless some intricate sequence of events occurs. In practice, this means that most systems have errors that only trigger after days or weeks of execution. Model checking is an effective way to find such subtle errors. This work contributes the C model checker (CMC), which links to code, emulates a real system, captures the states of the system, and analyzes the results. CMC schedules threads to emulate nodes in the network, where scheduling granularity is on the order of entire event handlers. This means handlers are treated atomistically, and synchronization bugs can be missed. CMC tries to search the entire space, but it can checkpoint at decision points and resume later where different states can be generated.

The work uses three optimizations to reduce the search space: hash compaction, downscaling, and state canonicalization. Hash compaction is the use of hashtables to store previously seen states so they are not examined again, computing hashed signatures for each state to reduce space requirements. Downscaling is the use of a small number of nodes in order to reduce the state space; complex interaction bugs can still be produced, but it might miss bugs seen only in large-scale interactions. State canonicalization is the simplification of similar states down to a single state, which is then evaluated. When applied to AODV routing protocol implementations, the CMC checker found 42 bugs (of which 34 are distinct, and one is a bug in the specification).

KERNELS
Summarized by Charles P. Wright

PRACTICAL, TRANSPARENT OPERATING SYSTEM SUPPORT FOR SUPERPAGES
Juan Navarro, Rice University and Universidad Católica de Chile; Sitaram Iyer, Peter Druschel, and Alan Cox, Rice University

Translation lookaside buffer (TLB) coverage has decreased by a factor of 1000 in 15 years. In 1985 the TLB miss overhead was less than 5%; today it is over 30%. This is primarily due to increases in the size of working sets, yet TLB size has remained constant. Many architectures allow the creation of superpages. A superpage TLB is like a normal TLB, but a size field is added. Navarro presented a practical implementation of superpages for FreeBSD 4.3.

There are three major issues when implementing superpages: (1) superpages require allocation of contiguous, aligned memory; (2) a superpage can be created out of several normal pages (promoted) or broken into several pages (demoted); and (3) internal fragmentation needs to be prevented. Each issue is dealt with in an opportunistic manner. For example, once an application touches the first page of a memory object it will quickly touch every page, so each superpage is created as large as possible and at the earliest point. To do this, reservation is employed, but a reservation may be broken if the memory is needed (the oldest reservations are broken first). The same type of opportunistic algorithm is applied to promotion and demotion. To keep fragmentation low, the page daemon restores contiguity, and wired pages are clustered. On the SPEC CPU2000 integer and floating-point benchmarks, a performance improvement of about 11% was observed. For a large matrix transposition, an improvement of over 600% was observed. More information is available at http://www.cs.rice.edu/~jnavarro/superpages/.

VERTIGO: AUTOMATIC PERFORMANCE-SETTING FOR LINUX
Krisztián Flautner, ARM Limited; Trevor Mudge, University of Michigan

Flautner presented a software framework that does energy management by setting the processor speed. The processor consumes 32% of the power budget on small devices (e.g., PDAs). Vertigo focuses on power management when the CPU is performing work, not when the CPU is idle. The underlying principle is to run just fast enough to meet deadlines, without using higher power consumption. An increase in performance creates an exponential increase in energy usage: it is better to use a smaller amount of computing power for a longer period of time than to use a large amount of power over a short period. Vertigo is a module that monitors system execution to determine

April 2003 ;login: OSDI ‘02 87 how fast things need to go. There are five to reduce overall power consumption), of aggregate queries (count, max, aver- hooks in the kernel (e.g., task switching, VFS modifications, and ext2 modifica- age, etc.). Madden asserts that most some system calls, and swapping) that tions. The goal of these modified com- common data-analysis operations are are used to determine activity. A policy ponents is to cluster I/O operations into aggregate operations. For example, the stack combines multiple simple algo- batches, thus leaving the drive idle for average temperature over all the sensors rithms to determine the best perfor- the longest period of time possible. For (or in a given sector) is a more interest- mance level. Each algorithm can be the Amp MP3 player, 150 lines of code ing indicator than the temperature at specified for a specific performance situ- were modified (the bit rate was used to each individual node. ation. For example, an interactive per- determine timeouts). While this modi- There are several methods that can be formance algorithm may monitor X fied Amp was running, an unmodified used to decrease communication. The server events. mail client using write was used. Using first method is to incrementally com- coop_read, a power consumption was Vertigo was compared to the Crusoe pute values using partial state records reduced to 210 joules from 373 joules. LongRun on-chip power saving system. (PSRs). For example, an average can be This is a better energy savings than an Using application-specific knowledge transmitted to a node’s parent as a sum “Oracle” policy and one which always was very effective. For example, the Cru- and a count, and the parent’s values can makes the right power decision based soe LongRun would cause spikes to full be inserted into this PSR. Additionally, upon a previous trace, without modify- power when the GNOME clock ticked. snooping or guesses can improve perfor- ing the timing of the I/O. 
When playing MPEG movies both Ver- mance. If the desired aggregate is the tigo and LongRun do not drop any max, a node does not need to communi- PHYSICAL INTERFACE frames, but Vertigo used 52% of the cate its own value if it hears a value peak performance level and LongRun Summarized by Charles P. Wright larger than its own. If the root knows the used 80%. The conclusion is that the TAG: A TINY AGGREGATION SERVICE FOR max value is at least 50, then it can kernel has lots of valuable information AD-HOC SENSOR NETWORKS reduce communication by communicat- that is lost on the chip. Samuel Madden, Michael J. Franklin, ing this value to other nodes. Joseph M. Hellerstein, University of COOPERATIVE I/O: A NOVEL I/O California, Berkeley; Wei Hong, Intel More information can be obtained at SEMANTICS FOR ENERGY-AWARE Research http://telegraph.cs.berkeley.edu/tinydb/. APPLICATIONS Sensor networks are a collection of FINE-GRAINED NETWORK TIME Andreas Weissel, Bjorn Beutel, and small, inexpensive battery-run devices SYNCHRONIZATION USING REFERENCE Frank Bellosa, University of Erlangen with sensors and RF interfaces. Pro- BROADCASTS Traditional operating system power gramming a sensor network is a difficult Jeremy Elson, Lewis Girod, Deborah management assumes the timings of task: It took two weeks for two experi- Estrin, UCLA disk operations by user applications are enced students to program a vehicle- To present a consistent view of informa- unknown and cannot be influenced. tracking sensor network. TAG eliminates tion, sensor networks need to have a Additionally, transitioning to a low- the need to program sensor networks by consistent view of time. This problem power mode will actually waste power if using an SQL-like declarative lan- has already been solved on the Internet the transition was unnecessary. 
Cooper- guage—using TAG, the same vehicle (e.g., NTP), but sensor networks do not ative I/O changes this assumption by position network was programmed in have the infrastructure available to introducing three new system calls: two minutes. Sensor networks are Internet hosts. Sensor applications also coop_read, coop_write, and coop_open. installed under harsh conditions (e.g., in have stronger time synchronization Along with the standard parameters for habitat- or earthquake-monitoring requirements than the Internet (tracking these calls, a timeout and an abortable applications). The primary metric used phenomena may require microsecond- flag are passed (e.g., a MPEG player may for sensor networks is power consump- level synchronization). specify that I/O can be deferred until the tion. Berkeley “Mica Motes” run for only frame actually needs to be decoded). two to three days when using full power Elson presented reference broadcast syn- This allows the operating system to but can last up to six months at a 2% chronization (RBS). Traditional syn- schedule I/O intelligently. duty cycle. Communication dominates chronization methods have lots of nondeterministic delay when sending There are three components to coopera- the power consumption cost, so they use packets (e.g., backoff timers or link-level tive I/O: a modified IDE driver that bytes sent as a metric. retransmission). Receiving a packet that shuts down the disk after the break-even To reduce the communications over- a host sent has much less variation than point (the number of seconds required head, TAG allows in-network processing the time it takes to actually send a packet
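The PSR mechanics are easy to see in miniature. The sketch below is illustrative Python, not the TinyDB API; the helper names (init_psr, merge_psrs, finalize) are invented for the example.

```python
# Sketch of TAG-style partial state records (PSRs) for AVERAGE.
# Illustrative only: these helpers are invented, not the TinyDB API.

def init_psr(reading):
    """A leaf node reports its reading as a (sum, count) pair."""
    return (reading, 1)

def merge_psrs(a, b):
    """A parent combines children's PSRs without losing information."""
    return (a[0] + b[0], a[1] + b[1])

def finalize(psr):
    """Only the root turns the merged PSR into the final average."""
    total, count = psr
    return total / count

# Three sensors report temperatures up a two-level routing tree.
leaf1, leaf2 = init_psr(20.0), init_psr(22.0)
parent = merge_psrs(leaf1, leaf2)          # parent forwards (42.0, 2)
root = merge_psrs(parent, init_psr(27.0))  # root holds (69.0, 3)
print(finalize(root))                      # 23.0
```

Because a (sum, count) pair merges losslessly, each node transmits one small record regardless of how many descendants it has; sending raw readings instead would grow linearly with subtree size.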

FINE-GRAINED NETWORK TIME SYNCHRONIZATION USING REFERENCE BROADCASTS
Jeremy Elson, Lewis Girod, Deborah Estrin, UCLA

To present a consistent view of information, sensor networks need to have a consistent view of time. This problem has already been solved on the Internet (e.g., by NTP), but sensor networks do not have the infrastructure available to Internet hosts. Sensor applications also have stronger time-synchronization requirements than the Internet: tracking phenomena may require microsecond-level synchronization.

Elson presented reference broadcast synchronization (RBS). Traditional synchronization methods suffer lots of nondeterministic delay when sending packets (e.g., backoff timers or link-level retransmission). Receiving a packet that a host sent has much less variation than the time it takes to actually send a packet (1 bit width for receive vs. 1,000 for send). Therefore, two hosts can each note the time at which they received a packet sent by a third host; the two receivers then know the difference between their clocks. Clock skew perturbs this observation, however, so a best-fit line is used to determine the difference.

RBS synchronized the clock on a Compaq iPaq to a precision of 6 microseconds, whereas NTP is only able to obtain a precision of 53 microseconds. The clock resolution on the Linux platform is only 1 microsecond; Elson believes that a more accurate clock would yield better results. The performance under a 6Mbps load shows even better results: RBS degrades to 8 microseconds, but NTP degrades to 1,542 microseconds.

RBS effectively removes sender nondeterminism from network time synchronization. This facilitates a wide range of applications, including acoustic ranging and collaborative signal detection.

SUPPORTING TIME-SENSITIVE APPLICATIONS ON A COMMODITY OS
Ashvin Goel, Luca Abeni, Charles Krasic, Jim Snow, Jonathan Walpole, Oregon Graduate Institute

Fast processors enable interactive real-time applications in software: for example, software radio, software modems, voice over IP, video conferencing, and accurate network traffic generators. These applications, however, need millisecond- to microsecond-level timing guarantees. It has long been accepted that providing such timing guarantees requires a special real-time OS and that general-purpose OSes need a complete redesign. Real-time operating systems have many disadvantages (e.g., nonstandard interfaces and small user communities). Goel presented time-sensitive Linux (TSL), which aims to provide real-time performance on commodity general-purpose operating systems using an evolutionary approach.

The requirements for TSL were fine-grained timers, a responsive kernel, and an accurate implementation of a good scheduler. There are two types of kernel timers: fine-grained (soft) and one-shot (firm) timers. There are two overheads to consider when evaluating timers: reprogramming and interrupts. Reprogramming the timer turns out to be inexpensive, but the interrupts are expensive. Soft timers, however, have a potentially unbounded latency. TSL uses firm timers. Firm timers insert checks into kernel paths (e.g., on system call entry and exit, to check the timer) but also use one-shot timers that are configured to overshoot the required delay. This provides some guarantee while, hopefully, reducing the number of interrupts.

Linux has already broken down the big kernel lock, but some locks are still large, so TSL uses voluntary lock yielding to increase kernel responsiveness. Finally, TSL implements a proportional-share scheduler that provides a constant-speed virtual machine. Even heavy-load software modems (which require 4-16-millisecond guarantees) are supported. TSL implements sub-millisecond timing guarantees on a general-purpose operating system. TSL imposes 1.5% overhead, which is low, for firm timers; the additional kernel preemption points introduce an overhead of 0.5%.

PANEL
Summarized by Steven Czerwinski

SELF-ORGANIZING NETWORKS FROM SENSOR NETS TO P2P: PANACEA OR PIPE-DREAM?
Co-Moderators: Peter Druschel, Rice University; David Culler, University of California, Berkeley/Intel; Panelists: Hari Balakrishnan, MIT; Yaneer Bar-Yam, NECSI; John D. Kubiatowicz, University of California, Berkeley

Are self-organized networks superior to traditionally engineered solutions, and are they necessary to solve today's problems? In Culler's opinion, self-organized networks are necessary but require engineering in order to build systems with the desired predictable global behaviors. They are necessary in sensor networks because of the near impossibility of administering and configuring millions of sensor nodes; they must be able to self-organize. However, as he showed with several different self-organizing methods to compute spanning trees, designing the local rules that lead to correct global behavior is difficult. This is where engineering needs to be applied.

Balakrishnan also advocated self-organized networks because they can eliminate human misconfiguration from distributed systems and allow such systems to adapt to errors and change. He argued that distributed systems are all about enabling autonomy at subsystems, but with this autonomy come problems with misconfiguration. Using traces, he showed how a significant portion of invalid DNS queries were caused by human misconfiguration.

Kubiatowicz used a thermodynamics analogy to argue the importance of self-organizing networks. Large systems can exhibit stability through statistics if they possess replicated components that interact and adjust to one another. Energy could be injected into the system through both passive and active correction mechanisms. He labeled such systems “thermospective.” With Moore's Law enabling redundancy and with the need to eliminate human configuration, he saw these systems as being the future.

Druschel presented a spectrum of current distributed systems, with decentralized approaches on one end and self-organizing ones on the other. He argued that natural (biological) systems are the only truly self-organizing systems, with sensor networks being fairly close; systems requiring ACID semantics have difficulty making it onto the self-organized end. He also noted that the systems we engineer are robust to both mundane failures and malicious attacks, while self-organizing ones are robust only to mundane failures. They would require (at the least) a trusted certificate authority to be robust to attacks.

Bar-Yam used analogies from biology to show that we already have the conceptual tools to demystify self-organizing networks. It may be hard to understand the progression of a mouse embryo from a macroscopic perspective, but that's because we don't understand the local rules or patterns of behavior of the smaller components. He showed how different types of patterns of behavior (such as local majority, two-dimensional condensation, and local activation/long-range inhibition) can lead to interesting phenomena, such as the stripes on a zebra's back.

Audience members pointed out the difficulties of creating such a system from an economic and business standpoint (who pays for all of this?) along with privacy concerns (do you really want your data going anywhere and everywhere?). Some also cautioned against the misuse of biology and other non-computer-science metaphors, which can encourage similarities being drawn where none exist.

VIRTUAL MACHINES
Summarized by Praveen Yalagandula

MEMORY RESOURCE MANAGEMENT IN VMWARE ESX SERVER
Carl A. Waldspurger, VMware

This won the Best Paper award. VMware ESX Server is a thin kernel that multiplexes hardware resources among virtual machines. The three main issues that arise in memory resource management are fairness, performance isolation among virtual machines, and efficient utilization of the available machine memory.

To efficiently reclaim memory from a virtual machine, Waldspurger proposes the ballooning technique, where a driver inside the virtual machine allocates some pages, forcing the guest OS to evict pages not in use or to swap some pages out. Experimental results show that there is only a small overhead of 1.4% to 4.4% in using this technique.

Efficient use of available machine memory is provided through memory sharing, where a single page on the machine is shared by multiple VMs (using copy-on-write semantics). A background process computes hashes of pages to determine the duplicate pages. For “best case” workloads, in which multiple Linux VMs are run, about 60% memory savings are observed. For real workloads, the savings ranged from 7% to 32%.

For a memory-allocation scheme that provides fairness among virtual machines while remaining efficient, the author proposes the concept of an “idle memory tax,” where idle pages are charged more than active pages. This new mechanism resulted in a 30% throughput increase for the workload considered in the experiments.

REVIRT: ENABLING INTRUSION ANALYSIS THROUGH VIRTUAL-MACHINE LOGGING AND REPLAY
George W. Dunlap, Samuel T. King, Sukru Cinar, Murtaza A. Basrai, and Peter M. Chen, University of Michigan

The aim of this work is to provide a way to perform post-mortem analysis of intrusions. Typical system logs are subverted by the intruder. The “CoVirt” project aims at enhancing security by running the target OS and all target services inside a virtual machine (VM) and then adding security services in the VM or host platform. ReVirt checkpoints and logs a VM's execution trace so that it can be replayed later. The virtual machine used is UMLinux, a Linux kernel that can be run on any other Linux machine.

To enable complete replay, checkpointing covers the memory, CPU, and disk states, and logging covers all keyboard and network events and interrupts, along with the data corresponding to these events. Replaying the interrupts is a hard problem; the authors use a tuple to uniquely identify the place in execution where each interrupt should happen. The overhead ranged from 1% to 58% for different workloads. The logging overhead on runtime is about 8%, and the log grew at a rate of 1.4GB/day in the worst-case workload and 0.04GB/day in the best case.

SCALE AND PERFORMANCE IN THE DENALI ISOLATION KERNEL
Andrew Whitaker, Marianne Shaw, and Steven D. Gribble, University of Washington

The goal of this work is to enable the execution of untrusted code while providing isolation so that the untrusted code does not interfere with any other process on the system. The Denali “isolation kernel” isolates untrusted software services in separate protection domains. The approach is to use virtual machines to provide isolation, with strategic modifications for scalability, simplicity, and performance.

Denali's virtual machine architecture achieves scalability and performance at the cost of giving up backward compatibility. It omits rarely used features like the BIOS, protection rings, etc.; revises interrupts and the MMU; and simplifies hardware I/O instructions. The resulting core kernel is an order of magnitude smaller than the bare-bones Linux 2.4.16 kernel.

For scalability, Denali employs the following techniques: (1) batched, asynchronous interrupts: instead of invoking a VM when an interrupt arrives, interrupts are batched together and applied when the corresponding VM is scheduled, thus reducing the overhead of context switches; and (2) an idle-with-timeout instruction, which allows a VM to specify how long it yields, thus leading to better scheduling. The first technique provided a 30% improvement in performance in experiments, and the second scheme yielded a 100% throughput improvement.
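Denali's first technique, queueing interrupts and delivering them in one batch when the target VM is next scheduled, can be sketched as a toy model. The class and method names below are invented for illustration; this is not Denali's actual interface.

```python
from collections import defaultdict, deque

class ToyIsolationKernel:
    """Toy model of Denali-style batched, asynchronous interrupts:
    interrupts are queued per VM and delivered together when the VM
    is scheduled, instead of forcing a context switch for each one."""

    def __init__(self):
        self.pending = defaultdict(deque)  # per-VM interrupt queues
        self.context_switches = 0

    def raise_interrupt(self, vm, irq):
        # No context switch here: just record the interrupt for later.
        self.pending[vm].append(irq)

    def schedule(self, vm):
        # One switch delivers the whole accumulated batch.
        self.context_switches += 1
        batch = list(self.pending[vm])
        self.pending[vm].clear()
        return batch

kernel = ToyIsolationKernel()
for irq in ("net-rx", "timer", "net-rx"):
    kernel.raise_interrupt("vm0", irq)
print(kernel.schedule("vm0"))    # ['net-rx', 'timer', 'net-rx']
print(kernel.context_switches)   # 1, instead of 3 with synchronous delivery
```

The trade-off is latency for throughput: a queued interrupt waits until its VM runs again, which is acceptable for the services Denali targets but would not be for interactive guests.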

CLUSTER RESOURCE MANAGEMENT
Summarized by Praveen Yalagandula

INTEGRATED RESOURCE MANAGEMENT FOR CLUSTER-BASED INTERNET SERVICES
Kai Shen, University of Rochester; Hong Tang, University of California, Santa Barbara; Tao Yang, University of California, Santa Barbara, and Ask Jeeves; Lingkun Chu, University of California, Santa Barbara

The challenges involved in hosting large-scale, resource-intensive Internet services on a server cluster are: (1) scalability and robustness, (2) timely response, (3) efficient resource utilization, (4) adaptive resource management, and (5) differentiated services. The goal of the Neptune project is to provide programming and runtime-environment support for effective management of services through partitioning, replication, and aggregation. Instead of using monolithic metrics such as throughput, mean response time, etc., the authors define the “quality-aware service yield” of a request as the economic benefit resulting from servicing that request in a timely fashion, and then try to maximize the aggregate service yield over all requests.

Service differentiation is based on service classes, where service accesses of a particular service class obtain the same level of service support. A service class can be a set of client identities, service types, or data partitions. In Neptune, request distribution and scheduling are done at two levels: gateways randomly poll the servers and try to achieve load balancing, and service differentiation is done at the servers. This two-level architecture provides scalability and robustness at the cost of less isolation and fairness. Within a server, a request scheduler schedules requests from the queues belonging to different classes onto several worker threads such that the aggregate yield is maximized. The offline optimal scheduling problem is NP-complete; hence, the authors use heuristics such as Earliest Deadline First (EDF), Yield-Inflated Deadline (YID), Greedy, and Adaptive techniques. The experimental results show that Adaptive outperforms all the other heuristics on a 16-node cluster.

RESOURCE OVERBOOKING AND APPLICATION PROFILING IN SHARED HOSTING PLATFORMS
Bhuvan Urgaonkar, Prashant Shenoy, University of Massachusetts; Timothy Roscoe, Intel

The goal of this work is to maximize the number of hosted applications on a server cluster while providing resource guarantees to the applications. Taking a worst-case load and assigning that amount of resources is not efficient, since the average load of an application is typically an order of magnitude less than the worst case. So the authors propose the scheme of overbooking resources and show that this scheme is feasible and maximizes the revenue generated by the available resources.

The authors define “capsules” as the components of an application that run on a node. To determine the resource requirements of a capsule, the authors perform “application profiling” using the Linux Trace Toolkit (for CPU and memory requirements) with well-known traces. From typical application profiles, the authors conclude that these capsules exhibit different degrees of burstiness, and they use token buckets to represent the resource requirements. A token bucket for a capsule, with two parameters s and p, states that the resource usage of that capsule over any time period t has to be <= s*t + p. Each capsule also specifies an overbooking tolerance parameter, O, to denote the probability with which the resource requirements of that capsule can be violated. Once capsules' resource requirements are estimated, they are mapped to nodes using a simple greedy algorithm; a capsule can be mapped to a node only if the node can satisfy the capsule's resource requirements. The experimental results show that there is a 100% improvement with just 1% overbooking.

AN INTEGRATED EXPERIMENTAL ENVIRONMENT FOR DISTRIBUTED SYSTEMS AND NETWORKS
Brian White, Jay Lepreau, Leigh Stoller, Robert Ricci, Shashi Guruprasad, Mac Newbold, Mike Hibler, Chad Barb, and Abhijeet Joglekar, University of Utah

Typically, network experiments are done through simulation, emulation, or live networks. While simulation is repeatable but not accurate, live-network experimentation is realistic but not repeatable. Emulation is a hybrid approach that creates a synthetic network environment but requires tedious manual configuration. Netbed complements existing experimental environments by spanning simulation, emulation, and live experimentation, integrating them into a common framework. The integration allows ease of use while remaining realistic. About 2,176 experiments were run on Netbed within the last 12 months by about 365 users.

Netbed uses a virtual-machine approach to network experimentation, and configuration time is improved through automation by two orders of magnitude. Network nodes are emulated using virtual machines on a cluster of nodes, and links, including WAN links, are emulated using VLANs and tunnels. The network topology to be emulated can be specified either in an ns-type Tcl-based specification or in a Java-based GUI. A global resource allocator then assigns local cluster resources to the different components of the requested network topology. Configuring a six-node dumbbell network took just 3 minutes on Netbed, compared to a 3.5-hour effort by a student with significant Linux system administration experience.

For more information, see http://www.netbed.org.
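The (s, p) token bucket used in the overbooking work bounds a capsule's usage over any window of length t by s*t + p. A hedged sketch of how such a profile might be derived from a measured usage trace follows; the function names and the brute-force window scan are mine, not the authors' profiling code.

```python
# For a per-second CPU-usage trace, find the smallest burst parameter p
# such that usage over every window of length t stays <= s*t + p, and
# measure how often a given (s, p) envelope would be violated.

def min_burst(trace, s):
    """Smallest p making usage(window) <= s*len(window) + p everywhere."""
    p = 0.0
    for i in range(len(trace)):
        window_sum = 0.0
        for j in range(i, len(trace)):
            window_sum += trace[j]
            t = j - i + 1
            p = max(p, window_sum - s * t)
    return p

def violation_rate(trace, s, p):
    """Fraction of windows exceeding the (s, p) envelope."""
    total = violations = 0
    for i in range(len(trace)):
        window_sum = 0.0
        for j in range(i, len(trace)):
            window_sum += trace[j]
            total += 1
            if window_sum > s * (j - i + 1) + p:
                violations += 1
    return violations / total

trace = [0.1, 0.1, 0.9, 0.1, 0.1]        # bursty capsule, average rate 0.26
p = min_burst(trace, s=0.3)
print(p)                                  # burst allowance needed on top of rate 0.3
print(violation_rate(trace, 0.3, p))      # 0.0 -- worst-case provisioning
```

Overbooking replaces the worst-case p with a smaller value whose violation rate stays below the capsule's tolerance O, which is what lets the platform pack in more applications.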

PEER-TO-PEER INFRASTRUCTURE
Summarized by Scott Banachowski

SCALABILITY AND ACCURACY IN A LARGE-SCALE NETWORK EMULATOR
Amin Vahdat, Ken Yocum, Kevin Walsh, Priya Mahadevan, Dejan Kostic, Jeff Chase, David Becker, Duke University

Yocum discussed a network traffic emulator designed to provide realistic scenarios for complex systems such as the Internet. Using the emulator, called ModelNet, has advantages over simulation because it allows execution of real code while still providing control over network conditions not possible with live deployment. The goals when developing ModelNet included support for 10K nodes with a 10Gbps bisection bandwidth, and realistic emulation of network failures and cross-traffic.

The emulator organizes networks into two types of nodes: (1) edge nodes that run the code being tested and connect through (2) core nodes that run ModelNet emulation code. A technique called “distillation” is the key to providing the scalability necessary for handling large numbers of nodes. Distillation transforms the topology of core nodes, which represent the Internet, into a smaller subset of nodes that preserves only interesting links, including the first and last hops of the edge nodes. In this approach, instead of injecting packets that incur processing overhead for an emulator, cross-traffic is simulated by changing the characteristics of the connections through the core nodes.

The ModelNet emulator was verified by reproducing experiments from a previously published study of the CFS storage system layered on the Chord distributed hash table. Running Chord/CFS on the edge nodes and substituting ModelNet for the network, the throughput of data transfers closely matched the previously published results. Yocum concluded with the assertion that ModelNet is effective for studying how your code behaves in a large-scale network running on its native OS. Questions from audience members revealed that it is not yet known exactly how far ModelNet scales, and that it does require a lot of storage.

More information is available at http://issg.cs.duke.edu/modelnet.html.

PASTICHE: MAKING BACKUP CHEAP AND EASY
Landon P. Cox, Christopher D. Murray, Brian D. Noble, University of Michigan

Users rarely, if ever, make backups of their personal systems, because it is expensive and time-consuming. Capitalizing on the trend that many disks are less than half full, Pastiche is a system for peer-to-peer backup of files on others' computers. Recognizing that many of the binaries on a disk are identical to the binaries of other users, much of the cost of transferring data is eliminated. The goal of Pastiche is efficient, cost-effective backup, while preserving individual privacy.

As its name implies, Pastiche is assembled from already existing technologies. Pastiche uses content-based indexing of data, the same technique employed by LBFS. Data is fingerprinted and divided into chunks, and a hash function uniquely identifies each chunk. Using only a subset of fingerprints from a disk (for example, a fingerprint from a Windows distribution), Pastiche can identify redundant copies of the data on other machines. To locate machines for backing up data, or “backup buddies,” Pastiche uses two overlay networks determined by Pastry, a peer-to-peer routing infrastructure. A mechanism called “lighthouse sweep” was added to Pastry to ensure a geographically diverse set of nodes.

When participating in Pastiche, your system may contain information that backs up your peers' systems, so the file system must ensure that this data is not deleted or modified. The Chunkstore file system views all data as chunks and assembles files for users in objects called “container files.” When data from a container is modified, it is written to a new chunk, preserving the older versions of the data. The performance of backup and restore operations is comparable to VFS copies.

The talk generated enough controversy that there were long lines at the questioning microphones, mostly people interested in more in-depth comparisons with other backup methods.

SECURE ROUTING FOR STRUCTURED PEER-TO-PEER OVERLAY NETWORKS
Miguel Castro, Microsoft Research; Peter Druschel, Rice University; Ayalvadi Ganesh, Antony Rowstron, Microsoft Research; Dan S. Wallach, Rice University

While peer-to-peer overlay networks are scalable, self-organizing, and robust with respect to node failure, they are susceptible to malicious participants. The talk presented several attacks on these overlays, followed by a discussion of defenses.

Castro began with an overview of the Pastry routing overlay and then described several attacks on this technique. In one type of attack, a node can choose its node ID so that, instead of being random, it is positioned to control another node's network access or prevent availability of objects. A defense against this attack would be to certify node IDs using keys from a trusted source. To prevent users from obtaining a large number of node ID certificates, it was suggested, certificates might require purchasing. Other attacks on overlays affect routing: for example, supplying peers with fake proximity information or bad routing-table information to increase the probability that messages travel through a malicious node. A defense against attacks on routing is to maintain a fallback table, with constrained and more verifiable routing, for use when the performance-based routing table fails. Finally, a malicious node may drop or misroute messages; a solution is to incorporate a routing test and, if it fails, rely on a redundant route.
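The value of redundant routes is easy to see with a back-of-the-envelope model. The sketch below assumes independent, node-disjoint routes and a uniformly random malicious fraction f that simply drops traffic; these are my simplifying assumptions, not the analysis from the talk.

```python
def delivery_prob(f, hops, routes):
    """P(message delivered) over `routes` node-disjoint paths of `hops`
    hops each, when a fraction f of nodes is malicious and drops traffic."""
    p_one_route = (1.0 - f) ** hops          # every hop on one path is honest
    return 1.0 - (1.0 - p_one_route) ** routes

# With 25% malicious nodes and 5-hop paths, a single route usually fails,
# but a handful of redundant routes pushes delivery probability well up.
print(round(delivery_prob(0.25, 5, 1), 2))   # 0.24
print(round(delivery_prob(0.25, 5, 4), 2))   # 0.66
```

This is why the defenses combine: constrained routing keeps hop counts (and thus exposure to malicious nodes) small, and redundant routes multiply the surviving delivery probability.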

Using these security techniques, peer-to-peer protocols may still work even when up to a quarter of the nodes of an overlay network are malicious, and they provide efficiency when the actual number of compromised nodes is small. In the question period, one audience member quipped that the idea of charging for certificates was the work of the presenter's employer and suggested that the alternative, real-world authentication based on a user's identity, is more viable.

WORK-IN-PROGRESS REPORTS
Summarized by Scott Banachowski

DISCOVERING BOTTLENECKS IN DISTRIBUTED SYSTEMS
Athicha Muthitacharoen, MIT; Jeffrey C. Mogul, Janet L. Wiener, HP Labs
Contact: Athicha Muthitacharoen, [email protected]

In large distributed systems it is not always possible to investigate the causes of performance bottlenecks created by internal, proprietary components, because discovering problems often requires instrumenting these components to measure statistics. MIT is developing a tool to identify critical paths using a passive trace of messages. Using the relationships between messages, the tool automatically infers the source of bottlenecks.

WITNESS: LEADER ELECTION WITHOUT MAJORITY
Haifeng Yu and Amin Vahdat, Duke University
Contact: Haifeng Yu, [email protected]

The title must have been inspired by the last presidential election. Many distributed algorithms require that a node be elected as leader, but under some kinds of failures it is impossible to guarantee that the elected leader is unique. The new election algorithm provides probabilistic guarantees of a unique leader and is based on choosing a random set of witnesses to participate in the protocol.

CONFIDENTIAL BYZANTINE FAULT-TOLERANCE
Jian Yin, Jean-Philippe Martin, Arun Venkataramani, Lorenzo Alvisi, Mike Dahlin, University of Texas, Austin
Contact: Arun Venkataramani, [email protected]

As replication systems add more servers and heterogeneity, they become increasingly vulnerable to attack, so providing confidentiality for replicated data is a difficult problem. This system increases the intrusion tolerance of a set of replication servers when a number of the servers fail.

NCRYPTFS: A SECURE AND CONVENIENT CRYPTOGRAPHIC FILE SYSTEM
Charles P. Wright, Michael C. Martino, and Erez Zadok, Stony Brook University
Contact: Charles P. Wright, [email protected]

NCryptfs is a stackable file system based on CryptFS from FiST. The low-level file system is transparent to applications. An attach maps an accessed directory to its associated encrypted directory (which stores the actual data in cipher form). Each attach keeps its own data and authorizations private, and on-exit callbacks purge the clear-text data from the kernel.

SUPPORTING MASSIVELY MULTIPLAYER GAMES WITH PEER-TO-PEER SYSTEMS
Wei Xu and Honghui Lu, University of Pennsylvania
Contact: Honghui Lu, [email protected]

A massively multiplayer game supports up to 200,000 players. Traditionally, games use a client-server architecture, but Wei Xu proposes using peer-to-peer protocols. The talk described a mapping of players to subsets of multicast groups. By trading consistency for performance, only “nearby” players need to synchronize their environments using P2P multicast groups. A prototype game was developed using Scribe.

INCREASING FILE SYSTEM BURSTINESS FOR ENERGY EFFICIENCY
Athanasios E. Papathanasiou, Michael L. Scott, University of Rochester
Contact: Athanasios Papathanasiou, [email protected]

This report describes a method to create longer idle times in disk traffic so that these idle periods may be exploited for power saving. The key is to increase the burstiness of accesses using aggressive prefetching combined with new disk-scheduling algorithms. Trace experiments show that the energy reduction from using this technique during MP3 playback reached 55%.

FAB: FEDERATED ARRAY OF BRICKS
Yasushi Saito, Svend Frolund, Arif Merchant, Susan Spence, Alastair Veitch, HP Labs
Contact: Yasushi Saito, [email protected]

The talk described a logical disk system that uses low-cost commodity CPUs and disks and is intended to replace high-end disk arrays. The decentralized system software, based on Petal, achieves high performance and fail-over ability by replicating disk blocks throughout the cluster.

KELIPS: A FAT BUT FAST DHT
Indranil Gupta, Prakash Linga, Dr. Kenneth Birman, Dr. Al Demers, Dr. Robbert Van Renesse, Cornell University
Contact: Indranil Gupta, [email protected]

Kelips is a peer-to-peer probabilistic protocol for group discovery, in which the lookup cost of a file is reduced by enabling the address of any file to be discovered within a single hop. This is achieved by increasing the size of the file-index tables on each peer and using background communication, or “gossiping,” between nodes to keep state updated.
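Gossip dissemination of the sort Kelips relies on can be simulated in a few lines. This is a generic push-gossip toy, with fanout and seed chosen arbitrarily for the example; it is not the Kelips protocol itself.

```python
import random

def gossip_rounds(n_nodes, fanout=3, seed=1):
    """Rounds of push gossip until every node has one file-index entry."""
    random.seed(seed)
    informed = {0}                 # node 0 inserts the new index entry
    rounds = 0
    while len(informed) < n_nodes:
        rounds += 1
        # Each informed node pushes the entry to `fanout` random peers.
        for node in list(informed):
            for _ in range(fanout):
                informed.add(random.randrange(n_nodes))
    return rounds

print(gossip_rounds(1000))   # the entry spreads in O(log n) rounds
```

The logarithmic spread is what lets Kelips afford its “fat” per-node index tables: keeping them fresh costs only low-rate background traffic rather than per-lookup messages.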

April 2003 ;login: OSDI '02 93

IMPROVISED NETWORK: AUTONOMOUSLY RECONFIGURABLE MOBILE NETWORK
Nobuhiko Nishio, Keio University, Japan
Contact: Nobuhiko Nishio, [email protected]

New applications are emerging that use a combination of wireless networks and distributed sensor nodes, as in cellular phones. In such an ad hoc network, both sensors and sink nodes may be mobile, so the research is developing ways to adapt to the changing environment without hurting performance.

PROBABILISTIC ENERGY SAVING IN SENSOR NETWORKS
Santashil PalChaudhuri and David B. Johnson, Rice University
Contact: Santashil PalChaudhuri, [email protected]

On mobile devices, idle and receive periods use about the same amount of energy, so if idle periods can be replaced with true inactivity, the device stands to save a lot of energy. According to the "birthday paradox," a relatively small number of people is enough to ensure a high probability that two of them share the same birthday. Applying this principle to communication, only a small number of nodes is needed to ensure that a sender and receiver are active simultaneously. Using a probabilistic protocol, the device pre-chooses its waking and sleeping periods, introducing some increase in communication latency but drastically reducing power consumption.

SOLAR: SUPPORTING CONTEXT-AWARE MOBILE APPLICATIONS
Guanling Chen and David Kotz, Dartmouth College
Contact: Guanling Chen, [email protected]

The goal of this research is to provide flexible and scalable pervasive computing. Solar is an infrastructure for context computation. An example is a mobile device that subscribes to a set of interesting events; in Solar, by moving the processing of these events to the infrastructure (called "planets"), applications that subscribe to the events remain lightweight. Sharing the computation among several applications reduces both development and network costs.

THE EXNODE DISTRIBUTION NETWORK
Jeremy Millar, University of Tennessee
Contact: Jeremy Millar, [email protected]

exNode is a content distribution network developed to provide access to time-limited data, such as the release of a software product, and is currently used by RedHat. The exNode architecture is effective at distributing load by implementing a highly distributed wide-area RAID system.

SCALABLE CONSTRAINED ROUTING IN OVERLAY NETWORKS
Xiaohui Gu and Klara Nahrstedt, University of Illinois, Urbana-Champaign
Contact: Xiaohui Gu, [email protected]

This system is a step toward value-added service overlays. In overlay networks, such as those used by peer-to-peer applications, it is desirable to satisfy some end-to-end constraints – for example, establishing a level of quality of service between endpoints. Qualay is a proposed overlay network designed to provide QoS constraints over paths. In the setup phase, service paths are chosen by probing nodes; in the runtime phase, faults are detected and paths are rerouted to maintain QoS.

REVERSE FIREWALLS IN DENALI
Marianne Shaw and Steve Gribble, University of Washington
Contact: Marianne Shaw, [email protected]

Shaw presented a way to introduce policies and mechanisms to protect the Internet from bad services. The system allows untrusted code to run in the network infrastructure on a virtual machine, with a reverse firewall that protects the Internet from malicious traffic generated by the VM. The flexible framework allows policies to be added on the fly; in the example provided in the talk, Shaw focused on a "don't speak unless spoken to" policy for containment of client-server code.

IMPROVING APPLICATION PERFORMANCE THROUGH SYSTEM CALL COMPOSITION
Amit Purohit, Joseph Spadavecchia, Charles Wright, Erez Zadok, Stony Brook University
Contact: Amit Purohit, [email protected]

A problem with application performance is the overhead incurred by system calls that move data across the kernel boundary. This system provides a solution that removes user-level bottlenecks by moving user code into the kernel. Using a tool called Cosy, combined with the gcc compiler, designated code is compiled into special code segments that can be loaded into the kernel at runtime. Static and dynamic checks ensure that kernel security is not violated, and adding preemption to the kernel protects against user segments monopolizing the CPU.

PERFORMANCE OF MACH-KERNEL
Igor Shmukler, OS Research
Contact: Igor Shmukler, [email protected]

Shmukler spoke about enhancements to the Mach kernel aimed at increasing its attractiveness to the user community. Although Mach introduced many good ideas, it never really caught on because it was never fine-tuned for common-case performance. Shmukler tried to clear Mach's bad name by discussing proposed improvements, including changing the memory management subsystem, optimizing the RPC implementation, adding new synchronization primitives, and stomping on a slew of bugs.
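The birthday-paradox arithmetic behind the sensor-network scheme is easy to verify. A short calculation (ours, not the paper's): the probability that at least two of n people share a birthday is one minus the probability that all n birthdays are distinct.

```python
def p_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday:
    1 minus the probability that all n birthdays are distinct."""
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (days - i) / days
    return 1.0 - p_distinct

# With only 23 people the probability already exceeds one half.
p23 = p_shared_birthday(23)
```

The same calculation applies if "days" is read as the number of wake slots in a cycle and "people" as the randomly chosen active slots of a sender and receiver: a surprisingly small number of active slots yields a high probability of overlap.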

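The kernel-crossing overhead that Cosy targets can be observed directly from user space. This is a rough illustration of the cost being avoided (our measurement sketch, not the Cosy mechanism): writing the same megabyte to /dev/null as many small write() calls versus a single large one.

```python
import os
import time

def time_writes(chunks):
    """Time writing the given chunks to /dev/null with one write()
    call per chunk; more calls means more user/kernel crossings."""
    fd = os.open(os.devnull, os.O_WRONLY)
    start = time.perf_counter()
    for chunk in chunks:
        os.write(fd, chunk)
    elapsed = time.perf_counter() - start
    os.close(fd)
    return elapsed

data = b"x" * (1 << 20)  # 1 MiB of payload
many = [data[i:i + 1024] for i in range(0, len(data), 1024)]
t_many = time_writes(many)   # 1024 system calls
t_one = time_writes([data])  # 1 system call
# t_many is typically far larger than t_one, purely from per-call
# crossing overhead -- the cost Cosy removes by running the loop
# inside the kernel.
```

The exact ratio varies by machine, but the per-call cost is what composition amortizes.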
94 Vol. 28, No. 2 ;login:

ELASTIC QUOTAS
Ozgur Can Leonard, Jason Nieh, Erez Zadok, Jeffrey Osborn, Ariye Shater, Charles P. Wright, Kiran-Kumar Muniswamy-Reddy, Stony Brook University
Contact: Jeffrey R. Osborn, [email protected]

"Elastic quotas" for disks are aimed at shared file servers, such as those used by university students, where each user receives a quota of space. By implementing elastic quotas, extra space may be allocated to users for their temporary use, but this space may later be reclaimed. The elastic quota service sets both global and user-assigned policies for how the space occupied by files designated as elastic will be reclaimed, using information such as size or creation time. The next step in their research is to determine whether users will embrace such a system.

A MAIL SERVICE ON OCEANSTORE
Steven Czerwinski, Anthony Joseph, John Kubiatowicz, University of California, Berkeley
Contact: Steven Czerwinski, [email protected]

The Mail Service uses the OceanStore file system to provide low-latency access to email, independent of a user's location. Goals of the system include data durability and relaxed consistency, with application-specific conflict resolution. Following this session, members of the project gave a demo of OceanStore.

NETWORK BEHAVIOR
Summarized by Kenneth Yocum

AN ANALYSIS OF INTERNET CONTENT DELIVERY SYSTEMS
Stefan Saroiu, Krishna P. Gummadi, Richard J. Dunn, Steven D. Gribble, Henry M. Levy, University of Washington

There's a lot more than just Web content being served across the Internet. Now we have CDNs and peer-to-peer systems serving up audio, video clips, and movies. The authors studied HTTP Web traffic, the Akamai CDN, and the Kazaa and Gnutella networks. The basic result of the authors' trace, conducted at the University of Washington, is that peer-to-peer traffic constitutes a large fraction of the bytes, and it's very different from the Web. For example, it may be possible to cache 80–90% of the outbound traffic and 60% of the inbound traffic, but it takes a long time to warm up the cache (about a month). In both directions, P2P objects are three orders of magnitude larger than Web objects, and a small number of objects account for most of the bytes in P2P systems.

TCP NICE: A MECHANISM FOR BACKGROUND TRANSFERS
Arun Venkataramani, Ravi Kokku, Mike Dahlin, University of Texas, Austin

TCP NICE, a building block for background transfers, finds and uses spare bandwidth in the Internet to improve availability, reliability, latency, and consistency. As a new variant of TCP congestion control, TCP NICE is similar to TCP Vegas in that it monitors round-trip times, but it provides three changes: a more sensitive congestion detector, multiplicative reduction in response to increasing RTT, and the possibility of a congestion window of less than one. With NICE you can bound the interference caused by background flows. One use is prefetching: the authors found that NICE could improve performance by a factor of three in a case where using old-style TCP hurt performance by a factor of six.

THE EFFECTIVENESS OF REQUEST REDIRECTION ON CDN ROBUSTNESS
Limin Wang, Vivek Pai, Larry Peterson, Princeton University

We now use replication across geographic distance to deliver content. Client requests are delivered to the "best" candidate based on server load, content server closeness, and cache state. This work describes current schemes and introduces a new one that balances locality, load, and nearness (proximity). The new scheme was shown through simulation to improve system capacity by 60–90% while maintaining low request latencies for clients. One dynamic algorithm, Fine Dynamic Replication (FDR), is especially promising: it keeps fine-grained information on URL popularity to adjust the number of replicas. They're trying to deploy it on PlanetLab.

MIGRATION
Summarized by Richard S. Cox

THE DESIGN AND IMPLEMENTATION OF ZAP: A SYSTEM FOR MIGRATING COMPUTING ENVIRONMENTS
Steven Osman, Dinesh Subhraveti, Gong Su, and Jason Nieh, Columbia University

Zap supports the transparent migration of unmodified applications. The migration of network applications is supported without loss of connectivity, and Zap-migrated processes leave no residual state behind on the previous system. Implementing Zap involves minimal changes to a commodity operating system and requires low overhead.

Three problems must be solved to migrate processes: resource consistency, resource conflicts, and resource dependency. Zap's solution to all three is the process domain (pod). A pod is a private virtual space that may contain a single process, a process group, or a whole user session. As a private space, processes in a pod cannot interact with processes outside the pod. Pods are migrated as a unit. Zap contains pods by introducing a thin layer in the Linux kernel, virtualizing process IDs, IPC, memory, the file system, network, and devices. The overhead of this approach is minimal, and the pod images are small.

More information can be found at http://www.ncl.cs.columbia.edu/research/migrate.
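The elastic-quota reclamation policy described in the Stony Brook talk can be sketched as a simple pass over the elastic files. This is an illustration of the idea, not the authors' implementation; the policy key (oldest-first by creation time here, largest-first would be `key=lambda f: -f["size"]`) stands in for the global and user-assigned policies the service supports.

```python
def reclaim(elastic_files, bytes_needed, key=lambda f: f["ctime"]):
    """Illustrative elastic-quota reclamation pass: remove files
    designated as elastic, in policy order, until enough space is
    freed. Returns the victim names and the bytes freed."""
    victims, freed = [], 0
    for f in sorted(elastic_files, key=key):
        if freed >= bytes_needed:
            break
        victims.append(f["name"])
        freed += f["size"]
    return victims, freed

files = [
    {"name": "a.tmp", "size": 50, "ctime": 3},
    {"name": "b.tmp", "size": 30, "ctime": 1},
    {"name": "c.tmp", "size": 20, "ctime": 2},
]
# Oldest-first: b.tmp (30 bytes) then c.tmp (20 bytes) are reclaimed.
victims, freed = reclaim(files, bytes_needed=40)
```

Separating the mechanism (the pass) from the policy (the sort key) is what lets such a service offer both global and per-user reclamation rules.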

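The three TCP NICE changes described above can be sketched as a per-RTT sender update. This is an illustration of the ideas from the talk, not the published algorithm: an RTT that has climbed a fraction of the way from the minimum toward the maximum observed RTT is treated as early congestion, triggering multiplicative decrease, and the window is allowed to fall below one packet (the sender then waits multiple RTTs per packet).

```python
def nice_update(cwnd, rtt, min_rtt, max_rtt, threshold=0.2,
                min_cwnd=1.0 / 48):
    """Illustrative NICE-style window update. The early-congestion
    detector fires before loss, long before standard TCP would react;
    unlike standard TCP, cwnd may drop below one packet."""
    early_congestion = rtt > min_rtt + threshold * (max_rtt - min_rtt)
    if early_congestion:
        cwnd = max(cwnd / 2.0, min_cwnd)  # multiplicative decrease
    else:
        cwnd += 1.0                        # normal additive increase
    return cwnd

w = 4.0
w = nice_update(w, rtt=10.0, min_rtt=10.0, max_rtt=100.0)  # grows to 5.0
w = nice_update(w, rtt=95.0, min_rtt=10.0, max_rtt=100.0)  # halves to 2.5
```

The sub-one-packet floor is what bounds the interference a background flow can inflict on foreground traffic.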
OPTIMIZING THE MIGRATION OF VIRTUAL COMPUTERS
Constantine P. Sapuntzakis, Ramesh Chandra, Ben Pfaff, Jim Chow, Monica S. Lam, Mendel Rosenblum, Stanford University

By virtualizing the x86 architecture, the VMware GSX server enables an entire virtual machine's (VM) hardware state to be easily suspended and captured. Once saved, the state can be sent to another machine and resumed. However, capturing the entire state generates machine images, or capsules, that are gigabytes in size. This work applies several optimizations to reduce the capsules to a size that can be transferred over a DSL link in under 20 minutes, enabling applications such as user mobility and software updates. The two largest components of a capsule are the disk and memory images.

Using standard copy-on-write techniques, VMware can track the changes to a disk image and transfer only the differences if the target machine already has an old version of the disk image. By hashing each disk block and searching for a block with a matching hash value on the target system, the server can avoid transferring blocks whose contents already exist on the target system. Much of a VM's memory may not be in active use; thus, if VMware could request that the guest OS de-allocate inactive pages, the size of the memory image could be greatly reduced. This is the idea behind ballooning, which utilizes a driver added to the guest OS to reclaim low-priority memory pages prior to suspending the VM. Finally, by demand-paging the disk images, the time to resume the VM on the target can be reduced; demand-paging takes advantage of the disk-latency tolerance already built into modern OSes. Several macro-benchmarks show that the combination of these techniques is effective in reducing both the total data transferred to migrate a capsule and the time-to-start.

LUNA: A FLEXIBLE JAVA PROTECTION SYSTEM
Chris Hawblitzel, Dartmouth College; Thorsten von Eicken, Expertcity

Extensible applications require protection schemes that can isolate extensions while permitting lightweight communication. Java uses language-based approaches to enforce domain separation, enabling cheap communication because of the single address space. However, systems with Java extensions lack clear domain boundaries; all code and objects are stuck together. The resources used by an extension cannot be reclaimed if the extension is terminated, because they may be referenced by other parts of the system.

By introducing a task abstraction, extensions in a Java system can be strongly isolated. Tasks contain all the objects, threads, and code for an extension, and all cross-task communication is explicit. In Luna, regular (local) pointers are not allowed to reference objects in other tasks. Remote pointers, a new type of reference that is allowed to point to objects in other tasks, are Luna's mechanism for intertask communication. Remote pointers may be revoked at any time; if a revoked remote pointer is used, an exception is raised. This allows an entire extension to be removed from the system cleanly, without dangling references in other tasks. Remote pointers are implemented with a two-word structure: the first word is the memory address of the object; the second word is a pointer to the permit, which contains a revocation flag and is checked before each use. As an optimization that removes most checks in common cases, Luna can generate loop code that does not contain any checks. On revocation, threads using the object are suspended, and a breakpoint is placed where the check would have been. If and when the breakpoint is reached, an exception is raised, simulating the effect of the check. Micro-benchmarks, as well as an implementation of an extensible Squid Web cache, confirm that Luna's isolation imposes low overhead.
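The hash-based disk transfer optimization above can be sketched in a few lines. This is an illustration of the technique, not VMware's implementation: split the image into fixed-size blocks, hash each one, and send only blocks whose hash the target does not already hold.

```python
import hashlib

def blocks_to_send(image, target_hashes, block_size=4096):
    """Illustrative content-hash transfer: return (offset, block)
    pairs for every block whose hash the target does not already
    have; matching blocks need not be sent at all."""
    send = []
    for off in range(0, len(image), block_size):
        block = image[off:off + block_size]
        if hashlib.sha256(block).hexdigest() not in target_hashes:
            send.append((off, block))
    return send

old = b"A" * 4096 + b"B" * 4096   # version already on the target
new = b"A" * 4096 + b"C" * 4096   # updated image to migrate
target = {hashlib.sha256(old[i:i + 4096]).hexdigest()
          for i in range(0, len(old), 4096)}
# Only the changed second block needs to cross the network.
sends = blocks_to_send(new, target)
```

Because identical OS and application files recur across machine images, such hash matching can eliminate a large fraction of the bytes a capsule would otherwise transfer.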

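Luna's two-word remote pointer can be sketched directly. This is an illustration of the design described in the talk, not Luna's JVM implementation: one word references the target object, the other a shared permit whose revocation flag is checked on every use, so revoking the permit instantly invalidates every remote pointer that carries it.

```python
class RevokedError(Exception):
    """Raised when a revoked remote pointer is dereferenced."""

class Permit:
    def __init__(self):
        self.revoked = False

class RemotePointer:
    """Illustrative two-word remote pointer: an object reference
    plus a pointer to a shared, revocable permit."""
    def __init__(self, obj, permit):
        self.obj = obj         # first word: address of the object
        self.permit = permit   # second word: pointer to the permit

    def deref(self):
        if self.permit.revoked:  # the per-use check
            raise RevokedError("remote pointer was revoked")
        return self.obj

permit = Permit()
rp = RemotePointer({"data": 42}, permit)
value = rp.deref()["data"]   # works while the permit is live
permit.revoked = True        # terminating the extension revokes it
# rp.deref() now raises RevokedError instead of dangling.
```

Centralizing revocation in the permit is what lets an entire extension's objects be reclaimed cleanly, since no other task can reach them once the flag is set.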