
conference reports

This issue’s reports focus on the USENIX Annual Technical Conference (USENIX ’04), held in Boston, Massachusetts, June 27–July 2, 2004.

Our thanks to the scribe coordinator:
Rik Farrow

Our thanks to the summarizers:
Bill Bogstad
Ming Chow
Brian Cornell
Richard S. Cox
Todd Deshane
Patty Jablonski
Rob Martin
Martin Michlmay
Adam S. Moskowitz
Peter Nilsson
G. Jason Peng
Calicrates Policroniades
David Reveman
Matt Salter
Swaroop Sridhar
Sudarshan Srinivasan
Matus Telgarsky
Wanghong Yuan
Ningning Zhu

USENIX ANNUAL TECHNICAL CONFERENCE (USENIX ’04)
Boston, Massachusetts
June 27–July 2, 2004

PLENARY SESSION

Summarized by Richard S. Cox

Open Source and Proprietary Software: A Blending of Cultures
Alan Nugent, Novell

Alan Nugent opened the USENIX Annual Technical Conference with his plenary session addressing the integration of open source software and procedures at Novell.

Many people believe that open source will destroy the software industry; that it is developed by hackers without discipline; that it is a fad; or that there is no money in open source. Seeking to debunk these myths, Alan first suggested that, rather than wrecking the industry, open source has increased diversity and thus has created opportunities. Second, open source software can be of very high quality, since a majority of open source contributors are professional developers working on projects that interest them. The community is growing daily, and contributors are quick to realize important initiatives. While open source software is free, there is a market for selling the support and maintenance contracts that large customers require before they are willing to build mission-critical systems using a package.

The adoption of open source has allowed Novell to work with customers to build solutions that more closely match their needs and infrastructure. It does this by providing complementary packages (open or closed) that interact with those developed by the open source community. By focusing on complementing existing projects, rather than providing substitutes, they avoid competing with open source developers, an arrangement that benefits all involved.

At Novell, this has required reworking the legal framework under which licenses are sold, expending significant effort in convincing customers to accept solutions combining proprietary and open components, and changing the focus of the organization.

Greg Mitchell asked how the sociology of the company changed as more open source developers were brought in. Alan responded that, while some employees were upset and a few even left, the acquisition of open source teams has been very successful and brought more energy throughout the company. Novell was able to do very well retaining employees from acquired companies.

GENERAL SESSION PAPERS: INSTRUMENTATION AND DEBUGGING

Summarized by Swaroop Sridhar

Making the “Box” Transparent: System Call Performance as a First-Class Result
Yaoping Ruan and Vivek Pai, Princeton University

Mr. Yaoping Ruan presented the “DeBox”ing technique for debugging OS-intensive applications. He began the talk with a motivating example of monitoring system call performance on a server running the SpecWeb99 benchmark. He pointed out that the system call profile as measured from [...] sometimes indicated anomalous kernel behavior. He identified the trade-off between speed, completeness, and accuracy among various profiling tools. Later, Ruan presented the

;LOGIN: OCTOBER 2004 USENIX ’04 ANNUAL TECHNICAL CONFERENCE 41

design of the DeBox system. The key idea is to make the system call performance a first-class result and return it in-band (like errno). Proposing a split between the measurement policy and mechanism, Ruan said that the applications should be able to interactively profile interesting events.

Later, Ruan gave details about the implementation of DeBox. He gave the details of profiling primitives added to the kernel and the interface available to the applications. He also provided details about the various kinds of information that the system offered, the amount of change that had to be done to the kernel and applications, and so on.

Ruan went on to present a case study on Flash Web server performance. He presented various optimizations with a step-by-step performance analysis.

Ruan concluded by stating that DeBox is very effective on OS-intensive applications and complex workloads. He also claimed that the results showed that the system was portable. During the Q&A session, Ruan said that they were investigating the use of DeBox on other OS-intensive applications such as database systems, but the results were not yet available. More information about DeBox can be found at http://www.cs.princeton.edu/~yruan/debox, or by contacting {yruan, vivek}@cs.princeton.edu.

Dynamic Instrumentation of Production Systems
Bryan M. Cantrill, Michael W. Shapiro, and Adam H. Leventhal, Sun Microsystems

In introducing Bryan Cantrill, session chair Val Henson (also from Sun Microsystems) said that she could definitely confirm Sun’s use of DTrace in production. Cantrill began his power-packed speech by stating that all of today’s tools were targeted at development and not production. As a result, the systems are incapable of dealing with systemic problems. Cantrill asserted that for a tool to be used in production, the necessary constraints are that there should be zero probe effect when disabled, and the system must be absolutely safe. To have systemic scope, both kernel and applications must be instrumentable, and the system must be able to prune and coalesce the enormous amount of data into useful information.

Later, Cantrill introduced the various concepts and features of DTrace: dynamic-only instrumentation, unified instrumentation, arbitrary context kernel instrumentation, high-level control language, predicate and arbitrary action specification, data-integrity constraints, facility for user-defined variables, data aggregation, speculative tracing, scripting capacity, boot-time tracing, virtualized consumers, etc.

Next, Cantrill elaborated on the D language: syntax and use, D intermediate form, probes, providers and actions, aggregations, and scalability of the architecture. Cantrill also shared some experiences with DTrace and gave some examples of D scripts and analyzed their results.

Finally, using the example of a bug in gtik2 applet2 (a stock ticker for the GNOME desktop), he showed how a small error could cause widespread damage in a production system such as a SunRay server. Cantrill contended that no other existing tool could trace this problem to its root cause, and that a trace was possible only by the extensive use of aggregation functions and local variables provided by DTrace.

During the Q&A session, Jonathan Shapiro said that he believed that the gtik2 applet2 problem should be attributed to the fundamental problems in monolithic kernel design, and asked the speaker to comment on the use of DTrace for debugging kernel bugs. Cantrill did not totally agree with Shapiro’s views, but only asserted that DTrace was effective in tracing kernel-level bugs. Answering another question, Cantrill said that there was no extra effort required to use this tool with third-party kernel-level modules. When asked whether there were any plans to port their system to other operating systems, Cantrill answered in the negative and quipped, “Use the best OS available!” The authors can be contacted at [email protected]. sum.com.

Flashback: A Lightweight Extension for Rollback and Deterministic Replay for Software Debugging
Sudarshan M. Srinivasan, Srikanth Kandula, Christopher Andrews, and Yuanyuan Zhou, University of Illinois, Urbana-Champaign

With the increase in volume and complexity of software development, there is a proportional increase in software bugs, their effects, and the difficulty in tracing or even reproducing them. Various checkpointing and logging mechanisms and their applications have received a lot of research attention in the last decade. Mr. Sudarshan Srinivasan presented Flashback, a lightweight OS extension to facilitate rollback and replay, as applied to software debugging.

After providing a brief general background and motivation for lightweight checkpointing, Srinivasan went straight into the main idea of Flashback. Flashback achieves checkpointing by forking a shadow process, thus replicating the in-memory state of the process. The process’s interactions with the system are logged so that, during replay from a checkpoint, the (shadow) process gets an execution environment similar to the original run. Srinivasan presented some challenges posed by multi-threading, memory-mapped files, shared memory, and signals. He also presented the approaches adopted in Flashback toward solving these problems.

Srinivasan went on to present some details of the present Linux implementation regarding modifications to the kernel, changes to GDB, etc. Srinivasan identified incorporating replay support for multi-threaded applications as an area for future work. Later, responding to Val Henson’s question regarding possible applications of Flashback other than debugging, Srinivasan said they were investigating uses of Flashback in other avenues, such as lightweight transaction models. Flashback can be obtained at http://carmen.cs.uiuc.edu/.

42 ;LOGIN: VOL. 29, NO. 5

ADVANCED SYSTEM ADMINISTRATION SIG: AUTOMATING SYSTEM AND STORAGE CONFIGURATION

Summarized by Rob Martin

The CHAMPS System: A Schedule-Optimized Change Manager
Alexander Keller, IBM T.J. Watson Research Center

Dr. Alexander Keller began by describing CHAMPS (Change Management with Planning and Scheduling) as “a schedule optimized change management system” that is not yet a product. “It’s a research prototype [providing] change management with planning and scheduling.” Its end product is the schedule: “all the things that are going to be carried out on which machines [and] concrete systems that are going to carry out these tasks.” Keller described this as “a change plan.”

Keller described CHAMPS within the larger context of change management as “trying to assess the impact of a change and figure out what the dependencies between different tasks are and creating a change plan. . . . We are specifically not concerned with actually implementing or rolling out a change, because there are deployment systems that can do this.” Later in the talk, Keller gave examples of such systems: cfengine and Tivoli Intelligent Orchestrator for Service Optimization.

The CHAMPS system consists of two subcomponents: the Task Graph Builder and the Planner and Scheduler. The end product of the system is a change plan depicted in a standard workflow language (BPEL4WS). This, in turn, is fed into an “off-the-shelf” deployment system which “rolls out the changes and provides feedback status information back into the [CHAMPS] system for summary planning.” The workflow engine executes the plan and monitors whether each activity has completed or failed.

A key goal of CHAMPS is optimizing the schedule based on dependencies to carry out tasks in parallel wherever possible. The information used to figure out which tasks are going to be carried out in sequence and which in parallel is “product dependency descriptions.” “The availability of authoritative dependency information [from package developers] is very important.” Once the dependencies are put into the Task Graph Builder, the system generates the Task Graph.

“The Task Graph tells us which tasks are going to be carried out, in what order . . . , and whether they must be in sequence or can be in parallel.” The Task Graph is used as input to the Planner and Scheduler. “The Planning system may make decisions such as ‘we must take away a machine from customer X and give it to customer Y’; in order to do that [the system] must be aware of the service level agreements and policies that the data center has. . . . It is up to the planning system to bind the existing Task Graph to the complete system to generate concrete system names, times, and dates.”

“We put in declarative information about the relationships between tasks, [and the CHAMPS system] automatically generates this schedule and allows the administrator to apply modifications to the schedule.” Estimating individual task duration is crucial, and CHAMPS uses “past deployments” to calculate future durations for individual tasks.

Multiple task graphs, each representing a single change, are input into the Planner and Scheduler, which then binds the changes to services and resources and optimizes a schedule for all of the changes. “We are treating this problem as an optimization problem.” The optimization is done by “fifty pages of Java . . . not visible to the administrator. We support a very general level of objective functions [for] minimizing penalties, maximizing profits. The administrator selects from push-button options that provide choices like ‘maximize profits,’ ‘minimize downtime,’ ‘maximize throughput,’ ‘minimize costs.’ By selecting one [or a combination] of these choices the optimization parameters are automatically set.” The CHAMPS system then calculates the optimum schedule, if necessary deciding that certain changes cannot be accomplished given the overall set of changes requested.

Keller concluded by listing the areas that require more work in the future, including “tooling for deployment descriptors,” reusing change plans (storing them in an XML library, for example), knowing when a plan is running behind schedule, carrying configuration information along with the workflows, and identifying parameters that flow out of one task and are required for other downstream tasks.

During the Q&A session, there was a lively exchange on the “sad state of dependencies in software packages.” Is there a standard for describing dependencies? Work done by the Grid Forum on defining a standard and the use of dependency sniffing tools were mentioned.
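
The sequence-versus-parallel decision that the CHAMPS Task Graph is said to encode can be illustrated with a small sketch. This is not CHAMPS code: the summary describes an optimizer that also weighs task durations, resources, and service level agreements, none of which is modeled here. The sketch only assumes that dependencies are given as a mapping from each task to its prerequisites, and groups tasks into stages whose members could run in parallel. The task names and the `parallel_stages` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def parallel_stages(dependencies):
    """Group tasks into ordered stages.

    `dependencies` maps each task to the set of tasks it depends on.
    All tasks within one stage have their prerequisites satisfied by
    earlier stages, so they could be carried out in parallel.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks with no unmet prerequisites
        stages.append(ready)
        sorter.done(*ready)  # mark the whole stage as finished
    return stages


# Hypothetical change: upgrade a database and a Web server, then redeploy.
deps = {
    "stop_app": set(),
    "upgrade_db": {"stop_app"},
    "upgrade_web": {"stop_app"},
    "deploy_app": {"upgrade_db", "upgrade_web"},
    "start_app": {"deploy_app"},
}
stages = parallel_stages(deps)
print(stages)
```

Here the two upgrade tasks land in the same stage because neither depends on the other, while the remaining tasks must run in sequence. A scheduler in the spirit of the talk would then bind each stage to concrete machines and times.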


Autonomics in System Configuration
Paul Anderson, University of Edinburgh

What is system configuration? Paul Anderson, professor and researcher at the University of Edinburgh, starts out with some background on the general subject. When you want to build a new site, you start off with three things: the hardware (empty disks and bare metal), the software, and specifications and policies about how you want the final system to run. The core of the configuration problem is to take those three things and put them together to get some sort of computer system that performs to the specification. Anderson refers to the final site as a “fabric,” a term he borrows from the recent work in grid computing.

Anderson points out that the “main thing to notice is the big distinction between the software and the configuration policies. The pile of software you start off with has no configuration and in theory can all be put on all the machines that you’ve got. It’s the specifications and the configuration policies that differentiate the individual machines.”

Configuration starts with the base layer of internal services inside your fabric at a lower level than the applications you want to end up with. DNS, NFS, DHCP, and like services form a base layer you have to get going before you build anything on top of it.

The idea of autonomics is to “take some of the low-level decision making away from the system administrator and have a lot of things happen automatically, so the system administrator can move up a level and think of higher-level policies and planning.” As with a compiler, you trust the autonomic system to place low-level data and decide which bits go where.

After the initial configuration, as change occurs due to load balancing, software or hardware failure, and day-to-day use, the autonomic system adjusts the fabric and configuration of the system so that it comes back into alignment with the original specifications and policies. The feedback from the autonomic system does not make changes to the original specification; rather, it brings the fabric back to providing the original services and policies specified.

“Autonomics is not new. Cfengine and lcfg are examples of tools that provide this sort of automatic fixing up of configuration files when something goes wrong at the host level. There are inter-host autonomic systems like fault-tolerance systems, RAID, and load balancing that will adjust systems. What is new is trying to think of this in a uniform way and integrating it into the configuration system.”

Anderson described the major issues under consideration in researching autonomic solutions. He described “a declarative specification of what the system behavior should look like. Some kind of logical statement that is true about the system rather than a recipe about how to get there. If you don’t have a good declarative statement to start with, then you don’t know what to do. . . . The language you need to describe the configuration is not a programming language: We are not talking about a process, we are talking about the description of the configuration data and the way that that system actually is.” Anderson gives an example of a declarative statement. “‘Host X uses host M as a server.’ In most configuration systems you don’t see statements like these. Rather, you see lower-level details, like a script setting parameters for .cf.”

The goal is to use the declarative language to describe the system and “let the autonomic system juggle the details” to make sure the specifications remain true.

An example declaration: “Make sure we have two DHCP servers on each network segment.” This expresses a high-level policy rather than details like “make this machine configured as a DHCP server.” The final goal of an autonomic system is to take these declarative statements and generate the details. “System administrators will specify those criteria that are important for the job without specifying the details. The important point is not to specify too much detail, because you need to give the autonomic system room to move.” If something breaks, the autonomic system needs flexibility in order to fix the problem.

Autonomic systems require a lot of trust in the system. The system automatically makes some serious decisions for you. “System administrators are not normally happy giving that kind of freedom to the system. You want the system to decide things for you but you want to be able to review them and adjust them to make sure they are right.” Autonomic systems will need to provide feedback as to why something has happened. “You should be able to ask the system, ‘Why have you put that there?’” Some mechanism for reviewing system actions and tuning the policy implementation for future actions needs to be provided.

What Anderson is seeking is a compromise between two extremes: on the one hand, a complete expert system that can be given high-level policy goals and perform all reasoning and logic decisions, generating all individual assignments for all machines and services; on the other, a solution that is based on hand-crafting (or scripting) the low-level specifications for machines and services required to deliver the specified policy. What we want is some autonomics but not a completely unpredictable system.

“The autonomic system has to be able to change all aspects of a system configuration dynamically. [...] was never designed to be reconfigured on the fly.” UNIX has all sorts of config files in all sorts of formats; services may need to be stopped and restarted in order to make certain changes. “This is a big problem in incorporating autonomics into system configuration.”

Anderson concluded by reviewing the lcfg system, analyzing where it has useful autonomic capabilities and where it falls short. He pointed to the http://www.lcfg.org Web site and the LISA ’03 Gridweaver paper for those who want to explore the complete details.

GENERAL SESSION PAPERS: SWIMMING IN A SEA OF DATA

Summarized by G. Jason Peng and Wanghong Yuan

Email Prioritization: Reducing Delays on Legitimate Mail Caused by Junk Mail
Dan Twining, Matthew M. Williamson, Miranda J.F. Mowbray, and Maher Rahmouni, Hewlett-Packard Labs

Matthew Williamson discussed the motivation for this paper. In particular, he described the delay problem caused by junk email, the distribution of junk mail, and the source of junk mail. Dan Twining then presented the proposed approach, which combines pre-acceptance (header scanning) and post-acceptance (content scanning) to predict the next message type based on sending history. The pre-acceptance method maintains the number of good and total messages and tells if a server is good based on the ratio. The system is implemented in a lightweight manner and shows good results on a real system.

Redundancy Elimination Within Large Collections of Files
Purushottam Kulkarni, University of Massachusetts; Fred Douglis, Jason LaVoie, and John M. Tracey, IBM T.J. Watson Research Center

Storage needs keep growing as per-byte cost gets cheaper. The goal in storage is to increase efficiency by reducing redundancy. Current techniques (compression, duplicate block-and-chunk suppression, and resemblance detection) have shortcomings. Purushottam Kulkarni proposed a technique called Redundancy Elimination at Block Level (REBL), which first detects duplicate chunks and encodes blocks using the resemblance technique. This paper also evaluates five techniques to quantify the effectiveness of REBL.

Alternatives for Detecting Redundancy in Storage Systems Data
Calicrates Policroniades and Ian Pratt, Cambridge University

Calicrates Policroniades introduced the benefits of redundancy elimination and previous techniques for redundancy elimination, and then compared three frequently used techniques: whole-file content hashing (WF), fixed-size blocking (FSB), and content-defined chunks (CDC). The results show that in terms of compression ratio, CDC is the best, FSB is almost as good, and WF is the worst. When compression, processing overhead, and storage overhead are all considered, however, no one solution wins.

ADVANCED SYSTEM ADMINISTRATION SIG: SYSTEM ADMINISTRATION: THE BIG PICTURE

Summarized by Rob Martin

The Technical Big Picture
Alva Couch, Tufts University

Each fall at Tufts University, Professor Alva Couch presents a talk to his students on the Big Picture in system administration. In Couch’s words, “Where are we going? What are we going to do? How is it going to work? What is going to be the benefit?” This year at USENIX Tech ’04, Professor Couch let us in on a preview of his “technical briefing on next year’s big picture” talk for his students.

According to Couch, the future of system administration is about “cost models” and “supporting the enterprise mission.” Couch summarized it this way: “Based upon a cost model, we can re-define good system administrating. That idea is rather cosmic, because what we are doing right now, what we consider ‘good’ right now, I would claim does not make sense with respect to any cost model. . . . Looking at things from a broader perspective of lifecycle costs, we get a better idea of whether we are doing our jobs.”

Professor Couch refers to traditional SA thinking and practice as “micro-scale reasoning” and “the bottom-up approach.” He includes “adhering to practices, process maturity, six nines at the server, closing tickets quickly, reducing troubleshooting costs” as examples of micro-scale thinking. The opposite of this is the “top-down approach”: “In a top-down approach we start at the organization and mission and work down. It turns out that starting at that point and thinking out the whole nature of the profession leads to different conclusions and that’s the subject of this talk today.”

Professor Couch lists some observations drawn from “macro-scale thinking”: “System administration enables particular things. It enables missions. It supports plans. It manages resources. It enforces policies. There is a very high level at which, Burgess says, ‘the system administrator manages human computer ecologies.’” Professor Couch says this macro-scale thinking will lead to some sacrilegious ideas. “Six nines for the mission does not require six nines at the servers. . . . There is a fundamental idea that we build six nine infrastructures upon six nine servers and six nine foundations. That is actually not true. Meta-stability is enough. Perceived stability is enough. Security that


compromises mission can be inappropriate and counter-productive. Security is not an end unto itself. It is part of a larger mission picture and availability and security have to be balanced.”

Three recent published papers that “got a lot of flak in the last LISA and LISAs before” got Professor Couch thinking this way. The papers were on the cost of downtime (Patterson); the timing of security patches (Beattie et al.); and resource management without quotas for specific users (Burgess).

Patterson says downtime can be quantified. Professor Couch “cannot believe the resistance that this idea got. Nobody ever talked about cost models before. And maybe because of that this was a very controversial idea.”

Couch then talked about the paper that “caused a riot.” “Beattie et al. showed that if uptime was important because downtime was expensive, then waiting a couple of weeks to apply a new security patch was probably optimal. . . . The thinking was about a cost model; it was not about simple religion, about just applying things because you are supposed to, but about understanding that patches of a problem in the very first case . . . have problems themselves and that waiting a couple of weeks to apply all of them was a better enterprise strategy.

“Finally, another very bizarre idea: protect useful work instead of limiting people. Burgess actually proposes a game theoretical approach for quotas. The idea of the game is extremely simple. You have a very simple strategy for deleting pigs and you counter every strategy the user could use to defeat you. It uses random for cleanup to beat the wily user.

“These three examples have common attributes. They consider the broader picture of enabling work and mission rather than fixing systems. They consider costs and values rather than just considering the cost of implementing a change. They also include the cost of not doing things, as well as the cost of doing things.”

Couch proposes that we have to change the way we are doing SA: “In pursuing micro-scale perfection we are pursuing a private game, and nobody except us cares. We have to play a different game. . . . Micro-scale system administration as we know it is doomed. The idea of pursuing six nines at the server will be a solved problem in 10 years.” He referred to the previous USENIX Tech session on autonomics and said, “We are beginning to understand how to automate installs, automate deployments, and automate monitoring and recovery. In the near future much of this will be automated. Unfortunately, system administration is subject to outsourcing. . . . But we can’t . . . automate macro-scale thinking. That’s the future of system administration. Being able to take these boxes with six nines and make them talk to each other to support an enterprise mission.

“The future will be about understanding cost and value and how they relate. It’s going to be about interacting with middleware. It’s not about supporting users; it’s about supporting missions.”

Professor Couch says the new challenge is to “take the human mission and turn it into something the machine can understand.” To do this we will need to research new areas, such as economic models to describe “making day-to-day cost-value decisions.” Professor Couch suggests we need to ask, “How does one best quantify the value of mission support? How does SA work impede or aid mission? Is anything you are doing getting in the way of mission and how can you stop?”

He concluded by reminding us what can’t be automated and outsourced in SA: “We are here at this conference because one can’t outsource community. Outsourcing affects and limits many things. One cannot outsource the value of people being together in one place and thinking about a common problem. That will remain a factor in system administration I think for as long as we live.”

GENERAL SESSION PAPERS: NETWORK PERFORMANCE

Summarized by Sudarshan Srinivasan

Monkey See, Monkey Do: A Tool for TCP Tracing and Replaying
Yu-Chung Cheng, Stefan Savage, and Geoffrey M. Voelker, University of California, San Diego; Urs Hölzle and Neal Cardwell, Google

The authors describe Monkey See, Monkey Do (MS-MD), a tool that generates realistic requests in order to test changes to the back end of Google search engines. Currently, changes to servers are tested either with synthetic (and consequently unrealistic) workloads or with real users, which is risky and effort-consuming. The motivation for developing MS-MD is to overcome the shortcomings of existing approaches to testing.

The tool has two phases of operation: the Monkey See phase, where it observes real network connections, measuring network traffic parameters along the way, and the Monkey Do phase, where it generates realistic workloads based on the previously observed metrics. The recorded parameters include the HTTP header, query parameters, delayed ACK policy, and other measurable quantities (such as response time). All tracing is done in front of the server farm, and the authors assume that congestion (i.e., queueing of requests) happens, if at all, only along the data path. They also assume that the Web servers themselves are well provisioned and that there is no congestion in the intranet. Caching

behavior is also recorded and replayed by MS-MD.

They evaluate the tool with respect to two questions: how accurately it reproduces the workload, and how accurately it predicts server performance with changes effected. Results show that the measured times without changes to the kernel match up more or less between the original run and the simulation using MS-MD. The tool is more accurate when the RTTs are small; they ascribe this behavior to the fact that the client emulators are on Linux systems, which have a more aggressive ACK policy than traditional Windows clients. Experiments also show that the tool accurately predicts changes in network behavior when services are changed. The tool works for Google, and the authors contend that it will also be usable in other domains.

A Transport Layer Approach for Improving End-to-End Performance and Robustness Using Redundant Paths
Ming Zhang, Junwen Lai, Larry Peterson, and Randolph Wang, Princeton University; Arvind Krishnamurthy, Yale University

Ming described mTCP, a transport-level network protocol developed by the authors for aggregating the bandwidth of multiple heterogeneous paths between two hosts. Bandwidth aggregation provides the benefits of improved performance compared to individual network connections, while also improving the resilience of the aggregate connection. The main challenges for providing effective bandwidth aggregation are congestion control, congestion sharing, recovery from failed paths, and selecting which paths to use for packets dynamically.

mTCP uses a single send/receive buffer for all connections, along with per-path congestion control. Packets get striped across the various possible links. This leads to a greater chance of packets getting reordered, though along each channel packets are still in order. This generates too many DUP ACKs. The problem of packet reordering is solved by using SACK TCP. mTCP uses an extended scoreboard algorithm to figure out which packets have been received and which are outstanding. Packets are sent in the order in which they are queued, and the choice of channel is based on proportional scheduling.

For handling shared congestion, mTCP drops one or more of the shared connections in the presence of congestion, so that single-channel connections do not suffer at the expense of aggregated connections. Shared connections are detected by studying the correlations between the different fast retransmissions: closely related fast retransmissions between two links point to a shared connection. For path selection, overlay networks are used to create candidate paths from which a subset of paths is selected greedily with the minimum common links between them. The greedy algorithm chooses paths that are most disjoint so that there is minimum interference between the paths in terms of performance impact and the effect of failed links.

Performance measurements show that the throughput of mTCP is more or less cumulative of the individual network throughputs, as it should ideally be. Separate per-path congestion control provides better throughput than combined control. The failure-detection and recovery mechanisms adopted by mTCP work effectively, allowing the network to recover seamlessly from the failure of one or more of the links. Finally, the throughput of the mTCP system is significantly better than individual paths in the presence of congestion.

Multihoming Performance Benefits: An Experimental Evaluation of Practical Enterprise Strategies
Aditya Akella and Srinivasan Seshan, Carnegie Mellon University; Anees Shaikh, IBM T.J. Watson Research Center

Multihoming is a frequently adopted strategy in enterprise networks for improving performance as well as availability of the network. Route controllers are used to provide the required performance and availability characteristics. Aditya presented the results of experiments conducted by the authors to evaluate the performance benefits achievable from commercially available multihoming network solutions. This work extends an earlier study, which showed potential performance improvements of up to 40% compared to no route control for a three-ISP network under ideal route control. Aditya also presented a simple near-optimal greedy route control algorithm.

The route controller needs to monitor performance on the ISP links and choose the best link to send packets through based on the performance of the ISPs at that time. It also needs a way to direct traffic once this choice is made. The authors use EWMA (exponentially weighted moving average) to track average performance of each of the ISPs and decide on the best path. While redirecting outbound traffic along a particular ISP’s network is trivial, redirecting incoming traffic is a little more difficult. In-bound traffic from within the enterprise is redirected using NAT, and externally originated traffic is handled by modifying the DNS entries.

While monitoring the ISP links’ performance, only the traffic to the top Web servers is observed, since they account for most of the network traffic. The actual monitoring can be either passive or active. In passive measurement, the route controller periodically measures the turnaround time (time between the last HTTP request and the first response); this is an estimate of the RTT of the network link. Route controllers doing active monitoring initiate out-of-band probes to get

;LOGIN: OCTOBER 2004 USENIX ’04 ANNUAL TECHNICAL CONFERENCE 47

network performance metrics. Slid- to the problems (and their solu- Experiences with Large Storage ing windows and frequency counts tions). Environments are two methods used to decide the There is currently virtually no Andrew Hume, AT&T Research frequency of monitoring. automated solution to analyzing a In his talk, subtitled “Inside an The performance measurements customer’s environment (with , no one can hear you show that even simple passive respect to storage needs), no way to scream,” Andrew Hume discussed monitoring of the network connec- translate high-level business goals his experiences trying to actually tions offers significant performance into technology requirements and deal with large quantities of data. gains compared to no route control. policies, no way to plan a storage How large? In one case his project Active monitoring outperforms infrastructure based on these needs collected between 200GB and passive monitoring, but only by a and requirements, and no way to 700GB of new data every day. small margin. The use of history in monitor the solution to ensure that Hume was quick to point out that EWMA does not offer any perform- it does not violate the stated goals “20TB just doesn’t go as far as it ance benefit; the current perform- and requirements. Voruganti’s work used to.” Hume also mentioned ance of the network connections is is limited to a tool to helping plan improvement in compression algo- a good indicator of future perform- the storage infrastructure. He was rithms, some of which study the ance. With regard to the effect of also careful to note that this tool is data and then routinely deliver the frequency of sampling on per- intended to make things easier for ratios well above 90%. [Summa- formance, results show that a very the architect, not replace him. 
rizer’s note: Some of these compres- small sampling interval is harmful There are many daunting tasks fac- sion programs were written by to the performance because of fre- ing an architect when planning a Dave Korn; when I asked him if quent changes to NAT and DNS storage infrastructure: interoper- they would be released to the pub- entries, while a too-large sampling ability concerns, best practices to lic, he said, “We want to, we’re try- interval may lead to stale values follow, gathering data from multi- ing to, but we’re not there yet.”] being used for the network metrics. ple sources (all of which use differ- ent reporting mechanisms), per- To process this data, Hume uses no RAID, no SAN or NAS, no distrib- ADVANCED SYSTEMOO forming complicated “what-if” analysis, and validating that every- uted filesystems, and no special ADMINISTRATION SIG: hardware save for a high-speed LARGE STORAGEOOOO thing was done correctly. SMaestro is intended to make at least some of low-latency network and suitable host adapters on each node. In- Summarized by Adam S. this easier. stead, data is broken into chunks, Moskowitz Typical application requirements shipped to the next available node Autonomic Policy-Based Storage include I/Os per second (both aver- in the cluster, and then processed Management age and peak), bandwidth, percent- locally. In doing this sort of thing, ages of random/sequential reads/ Kaladhar Voruganti, IBM Almaden Hume learned that networks and writes, MTBF, maximum acceptable Research disk controllers are not as reliable outage times, encryption, integrity, as previously thought. To handle Kaladhar Voruganti discussed his wrote-once support, recoverability, these unreported errors, all data is work on the SMaestro Network retention times, scalability, and (of checksummed before and after Storage Planner, which is part of course) cost. All this must be bal- each transfer, sometimes more than IBM’s autonomic computing effort. 
anced against typical storage sys- once. Part of Hume’s work is trying Despite advances in technology, tem capabilities such as latency, I/O to be able to prove that files are storage resources tend to be poorly rates, availability, points of failure, “correct” to a standard high enough utilized, it is difficult to map appli- “hot upgrade,” reliability, and (here to be accepted in court. In part this cation needs to storage systems it is again) cost. SMaestro uses tem- is being driven by legislation such based on their capabilities, and plates for both categories as well as as the Sarbanes-Oxley Act. the number of system administra- for software, backup/ tors needed to manage storage has restore software, and SAN file sys- Hume claims that backup systems not decreased (despite claims of tems. Policies are written in plain are getting worse; they now have to easier configuration and manage- English and then translated, using buy new backup hardware about ment). Customers are moving to data models, into something that every three years. His project cur- SAN and NAS solutions, but these can be used with the templates to rently uses AIT-2, which he finds problems are not going away. Voru- suggest a storage architecture and unacceptably slow. By comparison, ganti believes that there are both to verify that the proposed architec- computers get faster and more reli- process and technological aspects ture meets the requirements. able every time he buys them.
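
The practice Hume described, checksumming data before and after each transfer, can be sketched as follows. This is only an illustration under stated assumptions: SHA-256 and a local file copy stand in for whatever checksum and network transfer his project actually uses, and the function names `file_digest` and `verified_transfer` are hypothetical.

```python
import hashlib
import shutil


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verified_transfer(src, dst, attempts=3):
    """Copy src to dst and accept the copy only if its checksum matches.

    shutil.copyfile is a stand-in for the real transfer (an assumption);
    a mismatch is treated as silent corruption and the copy is retried.
    """
    expected = file_digest(src)  # checksum before the transfer
    for _ in range(attempts):
        shutil.copyfile(src, dst)
        if file_digest(dst) == expected:  # checksum after the transfer
            return expected
    raise OSError(f"checksum mismatch after {attempts} attempts: {dst}")
```

Comparing digests on both sides catches errors that the network or disk controller never reports, at the cost of reading every chunk at least twice.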

In the Q&A session, as a comment on a question from an audience member (to Voruganti), Andrew claimed to be from “the Ken Thompson school of thought on expert systems: there’s table look-up, fraud, and grand fraud.”

PLENARY SESSION

Summarized by Richard S. Cox

Network Complexity: How Do I Manage All of This?

Eliot Lear, Cisco Systems

Eliot presented an appeal for work on network management. While providers were previously concerned with only a few very expensive devices, we now have a huge number and variety of devices ranging from routers and switches to laptops and phones, all of which need to be monitored and managed in order to support critical network applications.

The first step in managing is to know what is connected to the network. Unfortunately, today’s discovery mechanisms won’t work for the millions of devices on future networks. Instead, devices will need to “call home” or send notification when their status changes. This presents issues from what to name the device to determining where to send the notifications. In some cases, there may be multiple interested parties: for example, an ISP, a VPN provider, a voice service provider, as well as the customer, may all be interested in updates from an IP phone; managing this while respecting privacy and security is a major challenge.

Having found the network components, we next need to determine their current status. As a bright point, standards for monitoring and state retrieval from individual devices, such as SNMPv3 and syslog, are maturing and investment is being made in tools. Even simple devices now support an SNMP interface. However, dealing with the large amounts of data generated by all these devices is still an unsolved problem. Correlating reports from multiple sources to identify a fault or anomaly is important, both to limit the amount of information presented to a human and to avoid burdening the infrastructure with excessive management traffic. Some support from the networking hardware (for example, programmable aggregation in the routers) might be useful here.

Having determined the state of the network, closing the loop requires the ability to control the devices. Here, standards are less advanced, though a certain amount of that can be attributed to the cutting-edge nature of the field. The NETCONF protocol is an effort to provide a common transport protocol and syntax for configuration, but vendor-independent configuration schemas are still unspecified.

GENERAL SESSION PAPERS: OVERLAYS IN PRACTICE

Summarized by Ningning Zhu

Handling Churn in a DHT

Sean Rhea, Dennis Geels, and John Kubiatowicz, University of California, Berkeley; Timothy Roscoe, Research

Awarded Best Paper

According to statistics from real P2P networks such as Kazaa, churn (“the continuous process of node arrival and departure”) is prevalent in real life. Through experiment on ModelNet (an emulated network), the authors show that several important distributed hash table (DHT) variants (e.g., Tapestry, Chord, and Pastry) all failed to handle churn very efficiently.

This work identifies and explores three factors affecting DHT performance under churn: reactive versus periodic failure recovery, message timeout calculation, and proximity neighbor selection. Results show that periodic recovery is better than reactive recovery; TCP-style timeout calculations outperform those based on a virtual coordinate scheme; and simple global sampling is as good as other much more sophisticated schemes in neighbor selection.

They’ve used the DHT implementation Bamboo, which is derived from Pastry but has been enhanced with the above techniques to handle churn. The code and additional information can be found at http://bamboo-dht.org.

Q: Is TCP-style timeout doing well because of the absence of background traffic?

A: To some degree, the answer is probably yes, although the benchmark itself also creates some load imbalance. In any case, background traffic would probably hurt performance of a virtual coordinate scheme as well.

Q: How did you have a good timeout for the return path, and how did you measure latency when the return path was different from the lookup path?

A: There is an ACK for each lookup message on every hop to get latencies, and a conservative timeout is used for the return path. There is an average of five hops in the lookup and only one return hop, so this conservative estimate isn’t too far off. We could not have explored the possibilities of using virtual coordinates for only the return hop, or have the return path be along the lookup path.

Q: In this paper, what is the dimension of virtual coordinate space?

A: It is a 2.5-dimensional space with x and y in a plane and z above the plane. The distance between $(x_1, y_1, z_1)$ and $(x_2, y_2, z_2)$ is $z_1 + z_2 + \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$. The latest work seems to indicate that this is a good metric.

Q: Why does Tapestry work fine in simulation but not so well in a more realistic network emulation?

A: Because Tapestry has no leaf set, it really needs to recover quickly from table neighbor failures. This leads it to be built to recover reactively, and therefore to suffer from the same problems that Bamboo/Pastry exhibits under reactive recovery.

Q: Have you tried some sort of periodic/reactive hybrid?

A: It is really hard to do without reverting to the policy of increasing traffic under stress. There’s probably some middle ground, but we’re not sure what it is yet.

A Network Positioning System for the Internet

T.S. Eugene Ng, Rice University; Hui Zhang, Carnegie Mellon University

Knowledge about network distance is essential for the performance optimization of a large distributed system. For an n-node system, directly computing the delay between each pair of nodes takes $O(n^2)$ time. The author proposes to solve the scalability issue by building a network positioning system (NPS) using a Euclidean space model. Each node infers its network coordinates by measuring its distance to several reference points, and the network distance between any pair can then be calculated from their network coordinates.

In building such a network positioning system, there are many practical issues, including system bootstrap, how to support a large number of hosts, how to select reference points, how to maintain position consistency, how to adapt to Internet dynamics, and how to maintain position stability. The system is evaluated on PlanetLab with 127 nodes using an 8D Euclidean model; results showed that position accuracy was fully maintained through the 20-hour testing period.

Q: Is there any technique from NTP that NPS can also benefit from, considering that NTP and NPS have some common issues to deal with?

A: I’m not very familiar with NTP and therefore have no clear answer.

Q: Why did you choose to use 8D Euclidean space? Is there anything particularly important about the number of dimensions?

A: From my experience, 6 to 8 dimensions would all be fine. A too-low number would not yield the desired accuracy, and a too-high number increases the calculation complexity.

Q: Has the system been tested on a practical Internet domain other than PlanetLab, which has a more artificial environment? If yes, what was the result?

A: No. But I expect the scheme to work well on practical Internet domains, too.

Q: Why did you choose Euclidean space instead of another model?

A: Several other geometry models were explored; there wasn’t much difference in the result.

Early Experience with an Internet Broadcast System Based on Overlay Multicast

Yang-hua Chu, Aditya Ganjam, Sanjay G. Rao, Kunwadee Sripanidkulchai, Jibin Zhan, and Hui Zhang, Carnegie Mellon University; T.S. Eugene Ng, Rice University

Internet multicast has been studied for many years; the protocol design and evaluation were mostly based on static analysis and simulation. ESM (End System Multicast) is the first mature and deployed system to use Overlay Multicast for broadcasting video and audio streams. The system has had four publishers and 4000 users, providing unique experiences on Internet broadcasting.

ESM requires no change to network infrastructure; the end hosts in an ESM system are programmable to support application-specific customizations. ESM is a distributed and self-improving protocol optimized for end-to-end bandwidth. It supports heterogeneous receivers, NATs, and firewalls, and has user-friendly management tools. The results indicate that the overlay tree built by ESM is quite optimized and well structured, and 90% of the users get 90% of the bandwidth. In this paper, Kay Sripanidkulchai presented a concept called “Resource Index” to quantify the bandwidth utilization in the system. When Resource Index indicates the system resource utilization is high, some users experience degradation of video quality. One important lesson learned from ESM is that there are a lot of NATs in the system (70%), which tie up resources and cause poor performance.

The ESM system can be accessed online at http://esm.cs.cmu.edu.

Q: There have been several recent proposals, such as CoopNet and SplitStream, to use multiple trees for streaming. From your experience do you think multiple tree approaches are necessary?

A: From what we’ve seen, single trees seem to do fairly well, though for larger-scale groups, multiple trees may become useful.

Q: How do you deal with the data proxy server in your system?

A: It is counted as a NAT behind a firewall.

Q: How do you avoid “free rides,” i.e., when a user lies about the resource that he or she is going to contribute to the system?

A: ESM uses actual measurement instead of trusting a user’s claim.
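
Returning to the network positioning summary above, the scalability argument can be illustrated with a small sketch: once every node holds coordinates, any pairwise distance is computed rather than measured. This assumes plain Euclidean distance over coordinate tuples, as the summary describes; the function names and the sample coordinates are hypothetical and are not taken from the NPS implementation.

```python
import math


def estimated_distance(coord_a, coord_b):
    """Estimate network distance as the Euclidean distance between coordinates."""
    if len(coord_a) != len(coord_b):
        raise ValueError("coordinates must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(coord_a, coord_b)))


def pairwise_measurements(n):
    """Number of direct measurements needed to cover every pair of n nodes."""
    return n * (n - 1) // 2


def reference_measurements(n, num_reference_points):
    """Measurements needed when each node probes only the reference points."""
    return n * num_reference_points
```

For the 127 PlanetLab nodes mentioned above, `pairwise_measurements(127)` gives 8001 direct measurements, whereas the per-node cost with reference points grows only linearly in the number of nodes. The number of reference points is left as a parameter because the summary does not state it.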

GENERAL SESSION PAPERS: SECURE SERVICE

Summarized by Wanghong Yuan

Reliability and Security in the CoDeeN Content Distribution Network

Limin Wang, KyoungSoo Park, Ruoming Pang, Vivek Pai, and Larry Peterson, Princeton University

KyoungSoo Park first introduced CoDeeN, an academic content-distribution network on PlanetLab, and its security problems. The root of these problems is that CoDeeN has no end-to-end authentication. KyoungSoo then described their approach to security, which includes multi-level rate limiting and privilege separation. They achieve reliability by using active local and peer monitoring. In addition, he discussed DNS problems solved via mapping objects in the same proxy and CoDNS. Finally, he summarized the lessons and future work, including robot detection, CoDeploy and CoDNS. More information is available at http://codeen.cs.princeton.edu.

Q: What causes the DNS problems?

A: Local DNS server overload.

Q: What are other solutions for accessing local content?

A: More efficient approaches, with more information; currently the privilege separation is simple.

Building Secure High-Performance Web Services with OKWS

Max Krohn, MIT

The motivation story is SparkMatch version 1, which crashed with 500,000 signups. Version 2 solved some problems in the database, but there were too many connections. Version 3, in 2002, distributed the database, but the development cycle was too long. Max summarized the desired Web service features: thin fast server, smart gzip support, small number of database connections, memory reclamation, and an easy and safe way to run C/C++ code, and mentioned that the major problem is dynamic content. Currently, Apache/PHP does not work well, due to the poor isolation.

Max then described their system, called the OK Web server (OKWS). Its design is that of a multi-service Web site, and its isolation strategy is the least-privilege principle. Max also gave an example of how to build a Web service with OKWS and illustrated how the isolation works in OKWS in detail. OKWS is implemented in C++ with SFS libraries, database translation libraries, and Perl-like tools. The key point is one process and one thread for one service, without synchronization. SparkMatchv4 is built using OKWS, which compared favorably to Apache, Haboob, and Flash. The source code is available at http://www.okws.org.

Q: Is there an advantage in not maintaining the database pool?

A: Yes, based on observation of Apache.

Q: Why not replace script? Why develop a new Web server?

A: Security problems in Apache.

Q: How do you support two requests sharing the same service?

A: It’s possible to do this; the paper has more detail on how.

REX: Secure, Extensible Remote Execution

Michael Kaminsky, Eric Peterson, M. Frans Kaashoek, and Kevin Fu, MIT; Daniel B. Giffin, David Mazières, New York University

The motivation is that remote execution is important but there are features not widely available in current tools. The major problem addressed in REX is locating the simplest abstraction that can support all of these features. Michael described how to establish a session, run a program, pass file descriptors, use forwarding, connect through NAT and dynamic IP address, and forward restricted credentials. The evaluation tries to answer two questions: Is REX reliable and are there any architecture benefits? More information is available at http://www.fs.net.

Q: Are there problems in file access?

A: Only in some access, such as write and read.

Q: What is the relationship between TCP and channel?

A: There is one TCP and there are multiple channels.

USEBSD SIG

Summarized by Adam S. Moskowitz

The NetBSD Update System

Alistair Crooks, The NetBSD Project

Alistair Crooks described a system for downloading and installing binary patches, similar in many ways to Microsoft’s Windows Update Facility, but for use on any number of platforms. Crooks’ system runs on a variety of platforms, including the *BSD variants, Mac OS X, Linux, and Solaris. Like the Microsoft system, NetBSD Update is easy to use and gives the user three options for automatic behavior: inform the user; inform and download appropriate packages; inform, download, and install the packages/patches. Crooks uses a file on the update server to list packages for which vulnerabilities exist and a program that runs on the target system to say which of those packages are present and should be updated.

NetBSD Update includes other important features, the most significant (in my mind) being the ability to digitally sign update packages and the user’s ability to accept or reject those updates based on the validity of the signature(s). Unfortunately, like so many other systems, the lack of a widely accepted public key infrastructure means this feature is still a bit more cumbersome than it ought to be (which, of course, should not be considered a shortcoming of Crooks’ work).

Another feature is the ability to undo the effects of an update. NetBSD Update automatically preserves all files that will be overwritten and stores them for the user in case the update causes more problems than it solves, or if worse vulnerabilities are found in the update than existed in the unpatched system. [Summarizer’s note: Of course, this sort of thing has never happened; the phrase “the patch for the latest jumbo patch” is merely a joke among system administrators.]

A Software Approach to Distributing Requests for DNS Service Using GNU Zebra, ISC BIND 9, and FreeBSD

Joe Abley, Internet Systems Consortium, Inc.

[Summarizer’s note: This talk/system deals with certain aspects of routing that go beyond my general knowledge of the subject; if something you read doesn’t make sense, the error is almost certainly mine and not the speaker’s.]

Joe Abley described a system for distributing DNS requests across multiple hosts without the use of dedicated load balancers, by using a “service address” (sometimes called a “virtual IP address”), anycast, and the Equal Cost Multi-Path (ECMP) feature of the OSPF routing protocol. This system is currently in use for the F root name server (run by ISC, the Internet Systems Consortium), which also provides slave service for 30–40 ccTLDs. Currently, 24 nodes across California and in New York, Tokyo, and Stockholm make up what appears to be a single root name server.

Individual nodes in the cluster are configured with unique unicast IP addresses, and with the service address on the loopback interface. Hosts inform the routers of their readiness to answer requests via OSPF Link State Advertisements; a simple wrapper reports when an instance of “named” fails (internal assertion failures are set to dump core and exit).

For UDP-based DNS queries, it doesn’t matter which node in a cluster provides the answer. For TCP-based operations like zone transfers, all packets in the transaction must be routed to a single node; Abley uses a combination of “flow hashing” (via Cisco Express Forwarding or the “load-balance per-packet” feature on Juniper routers), avoiding ECMP routes for stateful transactions, and using BGP.

GENERAL SESSION PAPERS: THE NETWORK-APPLICATION INTERFACE

Summarized by Calicrates Policroniades

Network Subsystems Reloaded: A High-Performance, Defensible Network Subsystem

Anshumal Sinha, Sandeep Sarat, and Jonathan S. Shapiro, Johns Hopkins University

Anshumal Sinha introduced his talk with several issues observed in in-kernel monolithic network systems: security (they represent a single point of failure), maintainability (robustness-critical code is large and difficult to maintain and debug), and flexibility (lack of support for the simultaneous existence of multiple protocols, difficulty of doing application-specific optimizations). He stressed that monolithic network systems are only used because of their performance benefits. The author mentioned that previous user-level implementations have failed to deliver sufficient throughput, noting, however, their hypothesis that earlier systems failed to provide an appropriate solution to a key problem: performance degradation resulting from data copying from one address space to another in order to provide protection domain boundaries efficiently. Failing to properly manipulate data from different protection domains degrades the performance of the system.

The author then presented the methodology to evaluate their solution. Based on the EROS micro-kernel to support domain factoring, they built two network subsystems to evaluate the costs due to the user-level implementation, the domain factoring, and the micro-kernel performance. In particular, they built an EROS-based monolithic network subsystem and an EROS-based domain factored network subsystem. Anshumal explained in detail the implementation of both systems and their distinctive features. When presenting experimental results and evaluation, he included a conventional Linux in-kernel network stack as a reference baseline for performance comparison. A detailed explanation of their results in terms of latency and throughput showed that the performance exhibited by the domain factored network subsystem was comparable or close to the other two strategies (EROS monolithic approach and conventional Linux network stack).

Anshumal concluded his talk by remarking that domain factoring is more feasible than previously assumed, that the instruction cache plays a significant role in the performance of the system, and that factoring provides the basis for defensible systems.

accept()able Strategies for Improving Web Server Performance

Tim Brecht, David Pariag, and Louay Gammo, University of Waterloo

Tim Brecht discussed how particular strategies to handle connection requests affect the performance of Web servers. He began by mentioning how to improve Web servers’ performance by modifying their corresponding accept strategies. He mentioned that an adequate solution should not only improve Web servers’ peak performance but also be able to maintain it even under overload conditions with a large number of connections. He presented throughput results for three architecturally different Web servers: the event-driven, user-mode micro-server (39–71% improvement); the multi-threaded, user-mode Knot (0–32%); and the kernel-mode server (19–36%).

Tim stressed that current multi-accept servers overemphasize acceptance of new connections and ignore the processing of existing connections. He also mentioned that his work aimed to reduce the gap in performance typically seen between kernel-mode and user-mode servers. Next, he described in detail the three architectures that were analyzed in the paper and explained how the accept-limit, which defines an upper limit on the number of connections accepted consecutively, affects each technique’s performance.

The presenter made a careful analysis of the impact of varying the accept-limit on each of the servers based on their experimental results. The experiments used two workloads: a one-packet workload and a SpecWeb99-like workload that uses httperf to generate overload conditions. The experimental results presented by the author were organized by server performance, queue drop rates, and latencies observed under different accept-limit policies. Tim mentioned that it is necessary to ensure that Web servers accept connections at sufficiently high rates so that a balance between accepting and working times can be adequately established. Finally, he said they were able to demonstrate that performance improvements can be obtained by modifying the accept strategies used in Web servers.

Lazy Asynchronous I/O for Event-Driven Servers

Khaled Elmeleegy, Anupam Chanda, and Alan L. Cox, Rice University; Willy Zwaenepoel, EPFL, Lausanne

Presenter: Khaled Elmeleegy

Khaled Elmeleegy began his presentation explaining why event-driven architectures are used and the difficulties encountered when developing high-performance servers using existing I/O libraries. He remarked that current I/O libraries offer incomplete coverage or leave the burden of state maintenance to applications. In contrast, Lazy Asynchronous I/O (LAIO) is a good option to develop high-performance, event-driven servers with less programming effort, because it covers all possible blocking I/O calls, creates a continuation only when an operation actually blocks, and notifies applications only when a blocking operation has been wholly completed. Next, Khaled explained the way in which event-driven servers generally work and the role that event-handlers play in their operation; he also mentioned how blocking operations degrade server throughput. He presented LAIO as a solution to the typical blocking problems seen in event-driven servers and proceeded to describe the LAIO API, its functions, and its implementation.

After introducing the two different workloads used in their experiments, Khaled compared LAIO’s throughput and performance against other I/O libraries. For this purpose, the authors modified the networking and disk I/O strategies of the Flash Web server and measured the results obtained for different versions of the server. In situations where performance was comparable, the complexity of the programs decreased because it is not necessary to write handlers or to maintain state as in conventional non-blocking I/O, which finally leads to a reduction of the amount of code that needs to be written.

Khaled finished his talk by highlighting LAIO’s generality (covering all the I/O calls) and simplicity (requiring fewer lines of code without handlers or the need to maintain state). In terms of throughput, LAIO meets or exceeds the performance of other methods. During Q&A, the audience mainly focused on comparing LAIO with multi-threaded servers; Khaled mentioned, however, they had decided to focus the study on event-driven strategies.

USEBSD SIG

Summarized by Matus Telgarsky

Building a Secure Digital Cinema Server Using FreeBSD

Nate Lawson, Cryptography Research

Nate’s dense, practical, and often anecdotal talk went beyond general crypto issues and concepts to also discuss the troubles and travails afflicting construction of a digital cinema server and problems with its wide acceptance.

After providing a quick overview of Cryptography Research Inc., Nate presented an extremely cogent diagram that proved one of the strongest bromides of the security circuit: you are only as strong as your weakest link. A graph pitted probability of compromise against effort/cost of attack. The perfect scenario features a deep curve predicting massive effort for even the slightest crack; however, usual circumstances produce an almost inverted curve (symmetric against y=x), where bribing employees, using common scripted attacks, and abusing operating system holes are often easy and constitute the majority of compromises, rather than a crack of an encryption key, a shield many often deem fully sufficient.

Nate continued to describe three fundamental underpinnings of any security paradigm: strength (which encryption provides),

;LOGIN: OCTOBER 2004 USENIX ’04 ANNUAL TECHNICAL CONFERENCE 53

assurance (pragmatic assessment of attack methods, especially easier entrances), and renewability (post-mortem reconstruction).

Traditional analogue cinema (obviously) uses film cameras, is transferred to a digital format for post-processing (effects, editing, etc.), is printed thousands of times (at $3,000 a copy, which degrades after a week of use anyway), and is played on $30,000 projectors. Digital cinema is directly captured to hard drives and obviously omits any tedious conversion; however, distribution and projection standards do not exist, costs are prohibitive, and the market is in flux. Not only does refitting cost around $100,000, but there are even questions whether the theater is responsible for said expenditure. In 2003, only 90 theaters in the U.S. were digital (30 in 2000).

Digi-Flicks enlisted Cryptography Research to design a new security system and, hopefully, extend digital cinema acceptance. Planned design goals were transport independence, thorough use of strong crypto algorithms, multi-factor authentication (i.e., simultaneously utilized smart card, pass code, and key file), flexible authorization policies, reliable playback over imperfect media, and, of course, a rapid development cycle. Amusingly enough, for the 300 target theaters, shipping hard drives was found to be the most cost-effective distribution method.

The sample extant hardware was rather unpleasant: a simple UNIX-like OS over a 33MHz PowerPC with 64MB RAM, too many ASICs, and no documentation. Serial and even Ethernet proved too slow for data transmission, so the SCSI interface was selected to actually transfer the films (FreeBSD was selected partially due to the ease with which the disc access code could be modified to play nice with this chaotic configuration). The encrypted drive would be decoded with a smart card and by dialing in to an auth server. Nate spent a moment covering common pitfalls in encryption selection, including an amusing pair of images presenting a still identifiable pattern in an encrypted image due to naively small block selection.

A recurring suggestion was careful threat-model analysis in order to realistically and sensibly determine circumstances and identify weak points, and then assign design parameters accordingly. Nate detailed many possible scenarios (one-time read-only access, repeated read-only access, one-time read-write access, and repeated read-write access) and the specific dangers within each. Similarly, a good security structure depends on clear top-down study and careful perusal of all possibilities.

Panel: The State of the BSD Projects
Chair: Marshall Kirk McKusick
The FreeBSD Project: Robert Watson, Core Team Member, The FreeBSD Project
The NetBSD Project: Christos Zoulas, President, NetBSD Foundation
The DragonFly BSD Project: Matt Dillon, Project Leader, The DragonFly BSD Project

All attending BSD parties were able to give impromptu presentations detailing past work, current revelations, and future plans. FreeBSD, NetBSD, and DragonFly BSD were represented; rumblings abounded regarding Darwin and OpenBSD, supporters of which were unfortunately concurrently occupied with other tasks.

Robert Watson's speedy flight through FreeBSD 4 (stable) and 5 (development) built a good measure of excitement and confidence in the impressive list of features stabilizing in the new code. 4.10 has survived healthily with minor security updates and has enjoyed hardened security. 5.X has been in continuous development for five years, with hard work solidifying a new SMP model and threading core, among numerous other improvements. 5.3 features include gcc 3.4, PCI and ACPI work, X.org X server, more fine-grained locking work (including heavy "giant" lock removal in many subsystems), SMP thread scheduler tweaks, and the flexible pf packet filter, among others. SMPng is satisfactorily transitioning from bug and correctness checking to performance tuning.

Next came Alistair Crooks presenting NetBSD, alive since 1993 and eagerly awaiting a 2.0 release. Nascent features include SMP work, scheduler activations, kqueues, wireless drivers, and many other features. A primary goal of NetBSD is to function on many architectures; the NetBSD.org sidebar presents an impressive list of 54 disparate devices. Time was also spent describing the package distribution system, pkgsrc, which is easily configurable, consistent, supports multiple versions of installed programs, and currently boasts over 4500 packages.

Matt Dillon discussed his ambitious DragonFly BSD project, which is steadily advancing upon a 1.0 release. Already a veteran hacker (contributor to Linux and FreeBSD, among many other projects), Matt's aggressively progressive plan for restructuring BSD revolves around a message-passing core with a lightweight IPC model. Much of the talk and current focus is extremely low level; for instance, much focus is going into restructuring to optimize cache usage in all subsystems, from the ground up. One corollary goal is minimizing use of fine-grained mutexes. DragonFly is a continuation of FreeBSD 4.X, though a pleasant rapport exists with the parent project, and indeed updates still filter through.

This enthusiastic one and a half hour panel showed the BSD projects to be in excellent condition,

54 ;LOGIN: VOL. 29, NO. 5

with a dedicated group of hackers backing each; in fact, one of the few times I witnessed the BSD hackers leaving the laptop room (and their diligent hackery) was to attend this informative panel. FreeBSD and NetBSD pointed out their foundations, facilitating donations through cheerily tax-deductible exchanges. DragonFly has not yet formed a foundation, though Matt Dillon would certainly enjoy frequent surreptitious anonymous gifts.

PLENARY SESSION

Summarized by Swaroop Sridhar

Thinking Sensibly About Security in an Uncertain World
Bruce Schneier, Counterpane Internet Security, Inc.

Mr. Bruce Schneier delivered his thought-provoking and entertaining talk without any kind of visual aid. He began by saying that we are living in an interesting era called "silly security season." Introducing the notion of all of us being "security consumers," he said that we need to step back and analyze whether this security is really worth it. Is it worth the billions of dollars and the loss of convenience, anonymity, performance, or freedom? Elaborating on security trade-offs, he said that it is well known that trade-offs are ubiquitous, but there is a fundamental security trade-off paradox: We, who claim to be the "most intelligent" species on the face of this earth, always make the wrong security trade-offs. Much of Schneier's talk was based on why this is so and how it could be fixed.

Schneier said that the way to do good security trade-offs is to slow down and have a basis for rational discussion, rather than making quick decisions based on emotions. However, he warned that this could be quite complicated because the meaning of factors such as inconvenience, risk, and privacy are often subjective.

Next, Schneier brought up the topic of risk assessment. He warned that people are always bothered about "spectacular" risks (e.g., risks while flying in a plane) and downplay "pedestrian" or "under control" risks (e.g., risks while driving), which matter much more in our lives. Schneier identified technology and media as two main culprits causing this problem. News, by definition, means that which does not happen every day. The media only show the uncommon happenings, replaying them over and over to create a feeling that they are very common. Technology contributes its bit, obscuring risks by hiding operational details from users.

Again, bringing up the topic of why extreme trade-offs (such as National ID cards) are taken for little gain, Schneier said that security decisions are usually made for non-security reasons. This leads to a notion of an "agenda" among all the "players" of the bigger system, of which security is a part. For example, closing national highways is good according to a police agenda, but bad according to a public agenda. Schneier proposed a model of "Security Utilitarianism," which leads to the greatest security for the greatest number of people.

Schneier stated that one of the fundamental problems is that we often have no control over the security policies that are implemented. The right kind of security should be worked out by means of negotiations and deliberations. He cautioned, however, that the negotiations should be held at the right time, with the right people. Arguing with a security guard at an airport gate, for example, would be a bad idea. Schneier identified four factors that can effect a change in security norms: government rules and laws, market forces (e.g., refusing to use an insecure OS), technology, and social norms. He said that by turning the above four knobs, we must be able to work out the right kind of security.

Later, Schneier said that we need to accept the risks as real, and try to reduce them. He also proposed that one possible solution is to put the person who can best mitigate the risk in charge of the risk. He illustrated this point with the example of a supermarket cash register, where the customer is "used" to guard against cashier malpractices. Schneier concluded by saying that as individuals we have very little power, but as an aggregate we can achieve a lot toward our collective good.

GENERAL SESSION PAPERS: UNPLUGGED

Summarized by Matus Telgarsky

Energy Efficient Prefetching and Caching
Athanasios E. Papathanasiou and Michael L. Scott, University of Rochester

Awarded Best Paper

Modern operating system design prescribes a plethora of heuristics for caching and prefetching to aid in disk performance, with nary a word about energy savings. Studies attribute 9–32% of laptop energy expenditure to hard disk use; hence, heavy savings in that realm will result in drastic overall improvements.

Traditional prefetching techniques aim to reduce disk access latency by attempting to maintain the working set of an application's disk data cached by replacing unused cache elements with simplistically determined prefetch targets. Unfortunately, this does not preclude the possibility of an application reading and writing at arbitrary times, obviating the possibility of simply tacking on any sort of basic power management scheme. Indeed, many of the tests on a stock


showed 100% of the disk idle times to fall beneath one second, not nearly long enough to enter a disk's power-saving state without incurring a net power efficiency loss owing to energy required for transition.

The presentation detailed the design and implementation of Bursty, a mechanism providing highly speculative prefetching, a kernel interface to hint disk access patterns, and a daemon both to monitor and manage the system. Applications may hint improperly, or lack hints entirely, so the monitor must both generate and judge extant hints. The prefetching is also self-aware: the success rate of the algorithm is constantly measured to determine whether further adjustments are required. Additionally, per-application idle times are irrelevant if they are not in phase between applications; hence, the daemon also attempts to organize these patterns to allow for consistent disk avoidance between all applications. Once the predicted idle period is estimated to be beyond the intersection of regular drive use and idle use combined with transition expenditure, the drive is powered down into an appropriate low-power mode. A variety of tests with different applications using a variety of workloads and disk access patterns (and memory configurations; Bursty is hungry!) found 60–80% energy savings, with negligible losses in efficiency.

Time-Based Fairness Improves Performance in Multi-Rate WLANs
Godfrey Tan and John Guttag, MIT

Modern wireless networks theoretically maintain decent throughput when congested, though in practice the common utilization of rate diversity as an automatic signal strengthening scheme causes standard throughput-based fairness schemes to result in unexpectedly poor performance. This paper presents an overview of the problem, design and implementation of a solution, termed "time-based fairness," and experimental verification.

802.11b was used as the test case. Though traditionally known to sport 11Mbps, the standard also defines three other rates: 1, 2, and 5.5. Vendors use these speeds when packet transmission failure becomes a problem, eventually bumping the speed to the slowest rate, which features the highest resilience. Current channel proportioning and access point downlink scheduling techniques result in throughput-based fairness, meaning a slower rate receives a larger portion of channel time, ostensibly aiding the feeble companion but causing the hare to be tied to the turtle.

Time-based fairness apportions channel use equally by time, resulting in much higher possible throughput. The average time for network tasks to complete is also reduced, obviously a benefit to many mobile users and definitely to anyone who would rather have things to do while a laggy transmission completes. The implementation is flexible enough to function properly on extant access points and does not need extensive modification on clients; adoption is easy and backwards-compatible.

EmStar: A Software Environment for Developing and Deploying Wireless Sensor Networks
Lewis Girod, Jeremy Elson, Alberto Cerpa, Thanos Stathopoulos, Nithya Ramanathan, and Deborah Estrin, UCLA

The burgeoning study, experimentation, and deployment of wireless sensor network applications is generating a need for full-fledged development suites. EmStar provides just that for 32-bit embedded MicroServer platforms: tools and libraries providing simulation, visualization, and emulation. Other functionality aids in development and testing of IPC and network communication.

Due to time constraints, a large portion of the talk focused on FUSD (Framework for User-Space Devices), which is a kernel module proxy to device file events. FUSD, though new, is already used by numerous applications to simplify communication with device nodes. It is essentially a micro-kernel extension to Linux. At the moment, performance is sufficient but unsatisfactory; read throughput, for instance, can be 3 to 17 times slower than analogous read performance without the FUSD proxy overhead.

EmSim and EmCee are simulation tools; these in turn are modular to allow for easy extension and minimized footprint. EmRun starts up, maintains, and shuts down an EmStar system according to a policy in a configuration file; it features process respawn, in-memory logging, fast startup, and graceful shutdown. All components (more than are listed here) are written with modularity in mind, and code is heavily reused. It has already proved useful in numerous projects at the CENS labs working with a variegated set of hardware.

SECURITY SIG

Summarized by Ming Chow

Panel: The Politicization of Security
Moderator: Avi Rubin, Johns Hopkins University
Panelists: Ed Felten, Princeton University; Jeff Grove, ACM; Gary McGraw, Cigital

The common theme of this panel was how politicized security, especially that relating to technology, has become. Professor Avi Rubin spoke of his experiences working on the issue of electronic voting (eVoting). He spoke about dealing with policy issues, and about how eVoting has become a partisan, politically charged issue and, as

such, is targeted for abuse. An example is that companies producing eVoting technologies and equipment have strong political ties. The goal from each political party is to "not have the other guy win." Professor Rubin has been on major news sources (e.g., CNN) speaking about technical issues of eVoting, and has received numerous telephone calls from both Democrats and Republicans. Professor Rubin recounted being called to testify in front of Congress about eVoting, and recalled the amount of fighting and bickering on both political sides dealing with the issue. He summed up the current state of politics by saying that "partisanship has never been worse."

Gary McGraw spoke of the long history of politicization of scientific research and development and the degree to which current scientific research and development are influenced by politics (like Galileo and Darwin centuries ago). He stated that security and terrorism are sensitive subjects, and that "we should understand the problem, having worked in an asymmetric situation for years in computer security." McGraw also said that too often "individual rights can be trumped in the name of security" (e.g., DMCA and the Patriot Act).

Jeff Grove has worked with the government on Capitol Hill, and expressed his dissatisfaction on the number of bad laws being implemented, including the DMCA, and the regulation of P2P networks. Grove outlined how the Senate can address and jump on issues and make dumb laws. The problem persists because of bad conclusions, bad assumptions, and lack of basic understanding about technologies. In addition, there is a small handful of powerful players who are effective in influencing the government to create laws fitting their agendas. Bad laws expose developers to liabilities, even when there's no infringement, and provide civil enforcement by encouraging legal actions by the entertainment industry.

Professor Ed Felten spoke about the Digital Millennium Copyright Act (DMCA) and his work, which made national headlines several years ago. Professor Felten stated that the DMCA was created by negotiations in which computer scientists were not involved. His work with advisee John Halderman was discussed: the weak DRM technology created by SunnComm could be bypassed on Windows computers by holding down the shift key. The government was cracking down on Professor Felten's research and he was threatened by the RIAA under the DMCA. Professor Felten settled with both Princeton University and the government by creating educational packets for the government on security research. Professor Felten recalled testifying in Congress about a bill to limit developing tools on decoding technologies, and summarized the atmosphere in one word: "theater." Finally, he gave out his Web site: http://www.freedom-to-tinker.com.

The theme from all of the panel speakers was clear: "We (the computing and scientific communities) need to step up to the plate and educate people on technological issues." The goal can be accomplished by being more involved, by being partisan, and by talking to anyone who is curious. Openness and debate are encouraged and are healthy. It is critical to tell the truth and to convince people about what's really going on. Gary McGraw also said that attacking systems is a necessary part of security and that outlawing attacks makes little sense. Finally, media and politics are great investments: the " effect" helps ridicule bad laws, and working even with your local government is a 10–15-year investment.

FREENIX OPENING REMARKS AND AWARDS

Summarized by Martin Michlmayr

Bart Massey, Portland State University; Keith Packard, Hewlett-Packard Cambridge Research Lab

Bart Massey and Keith Packard opened the FREENIX track, a forum devoted to free and open source software, by giving a brief summary of papers that were submitted this year. Out of 61 papers submitted, 15 were accepted. The organizers were happy to see that among the accepted papers, seven were from students, and seven were non-US papers. They said that the quality of all submitted papers was very high and that the review process was more formal than in the last few years, adding three external reviewers to the program committee. They also thanked DoCoMo for sponsoring student travel for the conference.

In this opening speech, two awards to papers in the FREENIX track were given. The Best Paper award went to "Wayback: A User-level Versioning File System for Linux," and the Best Student Paper was "Design and Implementation of Netdude, a Framework for Packet Trace Manipulation."

There will be another FREENIX track at USENIX '05 in Anaheim, California. Since future USENIX conferences will take place around April, the deadline for FREENIX submissions is October 22, 2004, rather than in December. More information on the next FREENIX track and Call for Papers can be found at http://www.usenix.org/events/usenix05/cfp/freenix..


FREENIX INVITED TALK

Summarized by Martin Michlmayr

The Technical Changes in Qt Version 4
Matthias Ettrich, Trolltech
Linux/Open Source

Matthias Ettrich, founder of the KDE project and a main developer on Qt, gave an overview of the next generation of Qt, a cross-platform C++ GUI toolkit. Qt supports X11, Windows, Mac OS X, and embedded Linux, and offers native look and feel on each of these platforms. Qt provides single-source compatibility: one source code compiles on all target platforms. While Qt mainly offered GUI functions in the past, it is much more than a GUI library these days: It also supports I/O, printing, networking, SQL, process handling, and threading. One aim of Qt is to provide an excellent programming experience.

Qt introduced the signals-and-slot concept in order to allow different GUI components to communicate. You can connect any signal to any number of slots in any module, and communication is done at run time. The sender and receiver don't need to know each other. In version 4, connections can be either synchronous or asynchronous ("equal connections"); this will allow thread communication.

Arthur is Qt's paint subsystem, and version 4 will offer several new features: linear gradient brushes, alpha-blended drawing, anti-aliased lines, painter paths, and an OpenGL backend.

Interview is a model/view framework for tree views, list views, and tables. In the Model-View-Controller (MVC) paradigm, all these components are separated from each other. The model contains data, the view renders data, and the controller transforms interaction with the view into actions to be performed on the model. Qt 4 will introduce semi-transparent windows and flicker-free painting, and it will also implicitly provide double-buffering for all built-in and custom widgets: this is transparent, and no code needs to be rewritten. In addition, it will allow large windows (even modern window systems limit a widget's coordinate system to 16 bit, but Qt 4 won't have this limitation), and there will be improvements in size and performance. Qt 3 was originally designed for desktop computers (with fast CPUs with FPUs, lots of RAM and disk space). On the other hand, Qtopia was designed for embedded systems. Qt 4 aims at merging the benefits of both product lines into one.

In summary, Qt 4 will provide a number of new features that will offer new possibilities for cross-platform development. Ettrich hopes that a first technology preview will be made really soon, with another one following in Q3 2004. A beta of Qt 4 should be released in Q4 2004, with the final version following in Q1 2005.

SECURITY SIG

Summarized by Ming Chow

Panel: Wireless Devices and Consumer Privacy
Organizers: Ari Juels, RSA Laboratories; Richard Smith, Consultant
Panelists: Markus Jakobsson, RSA Laboratories; Frank Schroth, uLocate; Matthew Gray, Newbury Networks

The panel talked about wireless technologies, including GPS and RFID, and privacy issues concerning them. Frank Schroth discussed uLocate's wireless technology, which enables small business users to view the location of phones in their account, including maps and routes. uLocate's technology is based on GPS on the Nextel Network. The interface on cellular phones is a Java-based application that transmits data to server via UDP. Permissioned users can view information on their phone or via Web. The benefits of the technologies to small business include convenience, efficiency, and safety. The security of the uLocate service consists of two layers: a carrier level and an application level. Mr. Schroth also discussed privacy concerns about the technology, namely, addressable IP addresses, privacy, and leadership and ownership of risks.

Matthew Gray of Newbury Networks discussed his concerns about location-tracking technologies. The goal at Newbury Networks is to see what people are accessing without interfering with larger networks (e.g., Starbucks) and to eliminate such false positives to enhance security and privacy. Mr. Gray noted that security and privacy are in opposition to each other. He said that consumers and regulators must understand the risks of location-tracking technologies.

Markus Jakobsson of RSA Laboratories listed three ways in which location privacy can be violated: active attacks (keep asking a device), passive attacks (listen to communication from other devices), and remote attacks (infer location from public information). He said that legislation for location privacy is necessary and meaningful. However, such law will be difficult to enforce because detecting abuse by institutions is hard, and it is even harder to detect abuse by individuals. He noted that countermeasures have been proposed but not deployed. In conclusion, Mr. Jakobsson listed several things that must be done immediately: the threats of location privacy must be studied and understood, legislation must be enacted, and countermeasures must be implemented.

Finally, Ari Juels and Richard Smith discussed radio frequency identification (RFID) tags and privacy concerns about the technology. Mr. Juels presented a brief tutorial of the RFID technology: an RFID tag

uses a chip (IC) antenna slightly larger than a quarter. Currently, many people have tools and gadgets that have RFID tags, such as E-ZPass, Mobil SpeedPass, and physical-access cards. RFID tags are seen as next-generation barcodes; Mr. Juels listed the benefits of RFID tags over barcodes (they're fast, efficient, mobile, can uniquely specify objects, and require little computational power). What this means is that the world will consist of billions of $0.05 computers. The major privacy problem concerning RFID technology is that it can be used to profile a person incredibly easily and quickly, providing detailed information, for example, on artificial body parts and other details of a person. Mr. Juels noted that approximately 42% of Google hits on a search for RFID contain the word "privacy." The solution to the privacy problem is to kill RFID tags. However, RFID tags are too useful. Mr. Juels concluded his talk by saying that there is serious danger to privacy if the technology is deployed naively, but the danger can be mitigated to strike a technical balance with society.

At the end of the talk, the panel discussed what must be done now to mitigate privacy concerns. One question to the panel was whether policy legislation hurts or helps technical development. A member of the panel suggested that a policy of saving data for 90 days would be sufficient. There was also a discussion about disclosure of information to consumers. Mr. Schroth responded that it has been startling to him how people do not care about disclosing information about themselves: people are willing to give lots of information to companies, including passwords.

FREENIX SESSION: SERVER

Summarized by Matus Telgarsky

Migrating an MVS Mainframe Application to a PC
Glenn S. Fowler, Andrew G. Hume, David G. Korn, and Kiem-Phong Vo, AT&T Labs

Rotting at the hearts of many old institutions' organizational frameworks are mainframes and their respective applications. Though the software may be relatively dependable, operational costs are prohibitive (the task emulated within this paper is estimated at $20,000 per month just for mainframe use), and the code consists of thousands, even millions of lines of ancient COBOL and JCL, on a system without a hierarchical file system. Emulating the process is a feasible and cost-effective alternative.

The MVS was to handle a mammoth of data, so a variety of tools were written to efficiently compress it prior to transmission. The OpenCOBOL compiler was extended to handle a few language extensions and different character sets, parse compressed data directly, and also receive a few performance enhancements. An extended sort program was built to enable MVS features, and a flexible JCL interpreter was built with handy features such as ksh script generation. An unsophisticated scheduler was developed to emulate MVS handling of processes.

David Korn took a moment to quip that a 25-year-old tip indicated that sort is optimally performed on a UNIX machine by transferring it to tape, performing it on a mainframe, and transporting it back; yet today the situation is reversed. Two 2.8GHz Pentium 4 machines were used to emulate the mainframes, at under $4,000 total. The 60-hour MVS task took 19 hours on the shiny new silicon. Data transmission ballooned surprisingly to nearly 24 hours due to tapes acting, predictably, unpredictably fussily.

C-JDBC: Flexible Database Clustering Middleware
Emmanuel Cecchet, INRIA; Julie Marguerite, ObjectWeb; Willy Zwaenepoel, EPFL

The general trend in modern high-power computing is toward clusters of commodity machines, achieving a significantly superior price-power ratio over more traditional and expensive many-CPU SMP machines. Though many tiers of server applications have extended to utilize this trend, in general RDBMS installations have lagged behind. Limited support has come from Oracle and IBM, but open source databases either relied on simplistic master-slave replication services or other similar compromises.

Clustered JDBC (C-JDBC) is an open source database middleware which abstracts pools of databases into a virtual database, complete with load balancing, query caching, logging, checkpointing, scheduling, authentication, and other features. JDBC is used to connect to virtually any database (as they basically all provide JDBC drivers), and allows for seamless integration of heterogeneous database farms into single resources. Performance has also been considered deeply: For most workloads, increasing nodes results in linear benchmark improvements, meaning superbly minor overhead and excellent scalability.

Fault tolerance and redundancy are not only accounted for with a flexible load balancer, but C-JDBC controllers themselves can be stacked horizontally to virtualize the same databases and seamlessly provide redundancy. Arbitrary trees may be constructed by attaching C-JDBC controllers as client databases to other C-JDBC controllers. Though only 10 months have passed since its initial beta release, C-JDBC has already been downloaded more than 15,000 times.
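
The "virtual database" idea in the C-JDBC summary can be illustrated with a small sketch. This is not C-JDBC's API or its actual algorithm: the class `VirtualDatabase`, the helper `make_cluster`, and the policy shown (every backend holds a full copy, writes go to all backends, reads rotate round-robin, and a query cache is flushed on any write) are assumptions made for illustration, with in-memory SQLite databases standing in for the backend pool.

```python
import itertools
import sqlite3


class VirtualDatabase:
    """Toy controller over a pool of databases (hypothetical, not C-JDBC's API)."""

    def __init__(self, backends):
        self.backends = list(backends)
        # Round-robin over backend indices: a very simple read load balancer.
        self._next_backend = itertools.cycle(range(len(self.backends)))
        self._cache = {}  # (sql, params) -> rows
        self.reads_served = [0] * len(self.backends)

    def execute_write(self, sql, params=()):
        # Assumption: full replication, so every write is sent to every backend.
        for conn in self.backends:
            conn.execute(sql, params)
            conn.commit()
        # Any write may change query results, so drop the whole cache.
        self._cache.clear()

    def execute_read(self, sql, params=()):
        key = (sql, tuple(params))
        if key in self._cache:
            return self._cache[key]
        index = next(self._next_backend)
        rows = self.backends[index].execute(sql, params).fetchall()
        self.reads_served[index] += 1
        self._cache[key] = rows
        return rows


def make_cluster(size):
    """Build a virtual database over `size` independent in-memory databases."""
    return VirtualDatabase(sqlite3.connect(":memory:") for _ in range(size))


if __name__ == "__main__":
    db = make_cluster(3)
    db.execute_write("CREATE TABLE talks (title TEXT)")
    db.execute_write("INSERT INTO talks VALUES (?)", ("C-JDBC",))
    print(db.execute_read("SELECT title FROM talks"))
    print(db.reads_served)
```

The client sees one object while the pool behind it can grow, which is the point the summary makes about adding nodes. A real controller would also need the logging, checkpointing, and recovery features the summary lists, none of which are modeled here.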


Wayback: A User-Level Versioning File System for Linux
Brian Cornell, Peter A. Dinda, and Fabian E. Bustamante, Northwestern University

Awarded Best Paper

Modern file systems and operating systems, though very tolerant of naughty applications thanks to caching, journaling, prefetching, locking, and a whole pile of other nuances, are still rather unhelpful to the common and simple user errors of accidental file deletion and overwrites. Some efforts have been undertaken to provide generalized undelete operations, and code management repositories provide versioning, but for the most part these sorts of mistakes must be protected against at the application level, if at all.

Wayback provides a versioning history through a hidden undo log for any directory remounted with it. It depends on the underlying file system to provide the storage, hence alleviating complexities arising from partitioning and supporting the disparate menagerie of extant file systems. The undo log is precisely that: it records data, allowing it to return to the previous state of a file. Any write operation causes a new entry to be placed in the log. File-system calls are observed by using the FUSE proxy, greatly facilitating development but unfortunately hampering speed slightly. Even so, Wayback is quite quick, usually not too far regressed from the file system inside FUSE.

Future plans potentially include compression, redo logs to provide bi-directional revision traversal, and even hierarchical version storage. The system is absolutely transparent and provides a simple usage paradigm to return to old versions. Since all changes are stored, a foggy memory and a dim spark of energy are enough to recover practically anything.

SECURITY SIG

Summarized by Ming Chow

Debate: Is an Operating System Monoculture a Threat to Security?
Dan Geer, Verdasys, Inc.; Scott Charney, Microsoft
Moderated by Avi Rubin, Johns Hopkins University

In one of the most anticipated events of the conference, Dan Geer debated Scott Charney on whether an operating system monoculture is a threat to society. Dan Geer (an instrumental contributor in the MIT Athena Project, former CTO at @Stake, Inc., and the former president of USENIX) argued that the problem is avoidable and mitigable, but difficult. He compared the current situation to the natural world and biology, and how a diverse gene set mitigates predation and disease. In a computing context, a diverse series of operating systems mitigates the onslaught of attacks and security breaches. Dr. Geer said that “if there is a monoculture, then the bigger the species, the juicier and more attractive.” He was critical of the current state of computing and Microsoft’s dominance: There are too many gadgets in Windows, leading to confusion as the operating system goes beyond its threshold. In addition, he stated that there are more serious security problems that are not publicly known, and virus writers are two steps ahead of antivirus writers. Dr. Geer concluded his arguments by reflecting on history, that “lessons learned in the real world apply to the computing world: All monocultures live on borrowed time like cotton and potatoes, and we are subject to the laws of nature.”

Scott Charney spent years working in the public sector, most notably combating cybercrimes as chief of the Computer Crime and Intellectual Property Section (CCIPS) in the Criminal Division of the US Department of Justice. Mr. Charney stated that diversity of operating systems is not the issue, nor does it matter; the real issue is “how to catch the bad guys.” He explained that there really isn’t a monoculture of operating systems: not all Windows systems are alike. In fact, Mr. Charney argued that monocultures may be beneficial. He gave the example of Southwest Airlines and how all the planes are Boeing 737s. Southwest Airlines has accepted the risks that all their planes are Boeing 737s, and the major benefit is low-cost maintenance (e.g., pilots can operate any plane without having to learn each one individually). Mr. Charney explained that a computer hacker’s goals are to compromise confidentiality and the integrity and configuration of accounts and systems. Finally, Mr. Charney compared the digital age to the Industrial Revolution: makers do not want security because the benefits of technology outweigh security. He believes that it is unfortunate that our current situation reflects that of the Industrial Revolution, and we need to look at security holistically.

During the Q&A session, Dr. Geer was asked about recommendations regarding the fact that there are only a small handful of operating systems available. He responded by saying that “we do not have enough alternatives” and “standards that matter must be platform independent.” Mr. Charney was asked about the insecurity of Microsoft Internet Explorer. He responded that Microsoft is releasing better for its products as part of its antitrust settlement. He added that “the issue [security] is in large applications.”

Both experts gave their closing remarks after all questions. Mr. Charney concluded by saying that “we have dug ourselves into a deep hole and we need to understand computer security issues holistically to dig out of the hole.” Dr. Geer concluded by ascribing the public sickness to bowing to one operating system, and pointed out

60 ;LOGIN: VOL. 29, NO. 5

that virus writers attack one culture. He reminded the audience of nature’s lessons, and that those with the most to lose are those who are the most interdependent.

PLENARY SESSION

Summarized by Patty Jablonski and Todd Deshane

Cheap Hardware + Fault Tolerance = Web Site
Rob Pike, Google, Inc.

As we were waiting for the presentation to begin, a continuous stream of words and phrases in many different languages scrolled down the open terminal window that showed on the projector screen. This stream of words and phrases contained people’s names, Web sites, and places. Rob Pike opened the presentation by saying that this is an example of unfiltered search queries that people submit to the Google search engine and that this is what Google looks like “from Google’s perspective.”

To demonstrate Google’s global presence, Rob showed a map of the world that depicted queries/square degree/hour for a selected day. Spots of light represented the Google queries received from a given area of the world at that time. He noted that there was a place in Tokyo, Japan, that never goes dark. He chose this particular day of data because on that day, back in August, the northeastern United States had experienced a power outage. This area was seen as a dark area on the map that would otherwise have been lighted. He jokingly said, “Where there’s electricity, there are Google users.”

So, what is Google.com made out of? The search engine consists of a crawler, an indexer, and a query handler. The crawler collects documents and makes “copies of the Internet,” which now contains over 4 billion Web documents and over 880 million images. The indexer processes and represents the data, and the query handler processes user queries. These three components require Web servers to take the queries, index servers to store the names of the documents, document servers to store all of the Web documents, and ad servers to determine what advertisements to show based on auction money.

There are four main sources of failure that could occur in this type of Web server system. The hardware (e.g., disks), software, the network, or power could fail at any given time. Google deals with each of these types of failures through redundancy and replication.

At Google they expect hardware, especially disks, to fail on a daily basis. Rob Pike stated that there is a “mean time to failure of three years for one machine, so for 1,000 computers, expect to lose one per day!” So, at Google, they expect to lose many more than one computer per day. With such a high loss rate and the need for multiple computers to store the tens of terabytes of the Internet, Google needs a lot of machines and disk space. It makes sense that to save money on the large number of computer purchases, Google chose to buy relatively cheap hardware. In the beginning of Google, when it was located at google.stanford.edu, it consisted of a few cheap PCs in a Stanford University computer lab. Shortly after, it moved to someone’s garage, until it finally reached its current location in more sophisticated computer racks contained within multiple data centers across the world. Google’s success story is unlike many others during the dot-com boom, during which time many startup companies spent all of their money on expensive hardware with “gleaming racks,” while Google held their entire systems together with Velcro!

The PCs at Google are unreliable, cheap, and fast. In order to make these computers reliable, Google must use fault-tolerant software. Reliability, scalability, and load balancing are achieved by breaking resources into pieces and replicating everything. The software is aware of the redundancy structure and spreads things around to avoid a single point of failure. Google’s PageRank system (a ranking that is based on the number of links to a given Web page) is used for ordering document names in the index. The index gets broken into “shards” based on its PageRank score. Higher-ranked pages will be replicated more than lower-ranked pages, so that the higher-ranked pages are less likely to be lost in a time of failure. Replication is done at the index level, the document level, and across data centers.

When a query is sent, DNS resolves to 1 of n data centers (presumably the one closest to the sender). The load balancer then chooses one of the Web servers, which then sends the query to one replica of each index shard. Since the index is read-only and replicated across index servers, the search is done in parallel. This entire process takes approximately a quarter of a second. Google’s underlying file system is abbreviated GFS. This file system is large and distributed and contains chunks of files on chunk servers (there are multiple chunks per chunk server). The chunks are replicated, often three times but more often if they are heavily used. The chunk servers act as a master and support automated fail-over. (For more information on GFS, see the GoogleFS paper in SOSP ’03.) As seen here, reliable software is more important than expensive hardware in Google’s case.

As mentioned, network and power outages can also occur. Rob noted that, unlike their reliance on cheap hardware, Google cannot operate effectively with cheap networking equipment, such as switches and routers. It is important to note that network or power failures only reduce capacity (the query request will merely time out and just needs to be reissued). When an entire


rack of servers was accidentally wiped clean, no data was actually lost, just the terabytes of storage capacity. The same automated software-recovery mechanisms are in place if there is an unreachable network or a power failure.

Throughout Google’s existence, they have learned many valuable lessons. The best lesson is “failures will happen, plan for it and survive.” When things break, they break too often for humans to fix, so there needs to be an automated, “self-healing” system in place. Another important lesson when dealing with commodity (cheap) hardware and components is that you need to use better software and be careful not to cut corners too much (e.g., two machines per power supply may seem good on paper, but “it is a false economy,” since system restoration requires you to power down both computers when only one needs to be off). It is also necessary to make sure that there is adequate cooling for all of the computers. And, finally, Google needs to continue to improve by adapting their fault-tolerant techniques, software, and algorithms to newer and faster hardware. They continue to look for new architectures for redundancy, improve automated failure and recovery, and develop their new services (e.g., Gmail) around these principles.

FREENIX SESSION: FREE DESKTOP

Summarized by Brian Cornell

Glitz: Hardware Accelerated Image Compositing Using OpenGL
Peter Nilsson and David Reveman, Umeå University

Desktop computer users have many more demands today than they used to. Users expect features such as translucency, shadows, and transformations to be prevalent. Hardware manufacturers have met these demands and provide graphics processors capable of performing these tasks at high speeds. Peter Nilsson and David Reveman introduced Glitz, which brings this hardware power to the user by providing an interface between OpenGL and the graphics library Cairo, where it can be used easily by developers.

The goals of Glitz’s design are to create a system with efficiency, quality, and consistency. Glitz implements hardware features, including native off-screen drawing, image transformation, polygon rendering, clipping, gradients, and convolution filtering. Glitz also allows for seamless integration between 2D and 3D environments.

Both accuracy and performance in Glitz were compared against other rendering engines. Glitz was found to operate anywhere from 3 to 200 times faster than the XRender extension to X servers. Using Glitz, Peter and David have shown that Cairo can be very fast. Glitz is at http://glitz.freedesktop.org.

High Performance X Servers in the Kdrive Architecture
Eric Anholt, LinuxFund

Eric began by introducing Kdrive, a small X server written by Keith Packard. Since Kdrive is a smaller server, it is easier to change. Eric then went into some background about the capabilities of modern hardware and of current extensions to the X server. Finally, he introduced the Kdrive Acceleration Architecture (KAA), an acceleration extension to the Kdrive X server.

The KAA offers a few improvements, including compositing, blending, and an improved off-screen memory manager. Whereas other implementations are limited to off-screen memory of the same color depth and image width as the screen, the memory manager in KAA allows off-screen memory to be of any type and size. This memory manager also determines what should be kept in which type of memory using a system that keeps track of scores based on the use of buffers.

In his initial implementation, Eric was able to render anti-aliased text five times faster than previously. However, text using alpha for subpixel anti-aliasing was five times slower. Eric would like to implement X video, support for GLX, and a fix for the composite alpha speeds in future versions. His work can be found at http://pdx.freedesktop.org/~anholt/freenix2004.

How Xlib Is Implemented (and What We’re Doing About It)
Jamey Sharp, Portland State University

Jamey began his presentation by explaining how Xlib works. He introduced an abstraction between three layers in Xlib: transport, protocol, and utilities. The transport layer handles communication between the client and server, the protocol layer constructs requests for the server, and the utilities layer does everything else. The transport and protocol layers, though very important, are not a big part of the Xlib implementation. Xlib was designed long ago for the systems that were available then and has been added onto many times since, creating a system that is not well designed.

Jamey introduced a solution to this: XCB. XCB is an implementation of an X client library that is simpler and smaller than Xlib. XCB focuses mainly on the transport and protocol layers. The problem with XCB is that most X programs are written to use Xlib, and rewriting them to use XCB would be a major task. Thus Jamey needed a migration path to make this process easier.

Jamey first tried to implement this migration by reimplementing the Xlib API using XCB. The problem with that is that the Xlib API is enormous. After making little progress into the immense repository of Xlib, Jamey started over, this time going from the bottom up. He began with a complete Xlib implementation, and gradually replaced the more crucial components with XCB equivalents. The result is that current Xlib programs can easily be migrated to use these smaller client libraries, and there is no noticeable change in performance. Jamey’s work can be found at http://xcb.freedesktop.org.

USELINUX SIG

Summarized by Patty Jablonski and Todd Deshane

The FlightGear Flight Simulator
Alexander R. Perry, P.A. Murray

The FlightGear Flight Simulator is a graphics project that simulates flying aircraft in reality. Perry says, “It is not a game.” This is an open source project that has been released under the GNU General Public License (GPL) for Mac, Win32, IRIX, and Linux 32 and 64 bit. The simulator is portable, modular, platform neutral, and uses advanced algorithms: “It uses models, not just guesses.”

The FlightGear project was started in 1996 by David Murr. Today its worldwide developer community, which includes Perry, consists of 89 people and is still growing. This group is inclusive (goes beyond just software engineers) and is multi-disciplinary (includes both technical and nontechnical people). Perry states that beginners are welcome to join in the project’s development as well. For more information on the version releases of FlightGear, visit http://www.flightgear.org/version.html.

FlightGear has many features that make the simulation as realistic as possible. The flight simulator has 3D aircraft and scenery as seen from the pilot’s perspective in the cockpit. The lighting changes are realistic and the aircraft is complex. Various things affect the overall flight experience, such as an open window on the aircraft, weather conditions ( and smog), and temperature. Intentional impairments have been employed selectively, which makes the simulator more difficult to use but accurately depicts behavior and views. Real-life problems and subtle interactions that can distract or confuse the pilot have been accurately modeled. Many of these features, including instrumentation problems and forces of nature, are often reported as bugs, since people who use Microsoft Flight Simulator are not as familiar with the situations that real pilots face as seen in FlightGear.

The FlightGear implementation is modular and quite complex: “It takes a lot of code to make things behave badly.” FlightGear uses networking for remote access and allows a flight instructor to adjust the pilot’s settings without the pilot knowing. It uses XML to allow changes to the flight environment. Additionally, there is a property database that stores all of the scenery for the entire . A large amount of storage space is needed for this. The 3D graphics and audio are made with OpenGC and OpenAL. Third-party extensions to the flight simulator are generally done in Python.

To download the latest version of FlightGear, see http://www.flightgear.org. Related projects and links:
FlightGear Aviation Training Device: http://fgatd.sourceforge.net
OpenAL: http://www.openal.org
OpenGC: http://www.opengc.org
PLIB: http://plib.sourceforge.net

Making RCU Safe for Deep Sub-Millisecond Response Real-Time Applications
Dipankar Sarma and Paul E. McKenney, IBM

RCU, or Read-Copy Update, is a reader-writer synchronization mechanism for the 2.6 Linux kernel. RCU is best for “read-mostly” data structures. Although this works well for most situations, real-time applications in the Linux environment are becoming more popular and are in need of quicker response times. RCU callbacks at the end of grace periods (between context switches) cause too much latency in these real-time applications.

First, it is important to distinguish between readers and writers. Readers can access old versions of files independently of subsequent writers. In this case, garbage collection is needed to remove old or invalid copies of the files. Writers, on the other hand, create new files and delete old ones atomically. Because of this, readers have little to no overhead, while writers have a substantial amount of overhead.

Real-time latencies are 800 microseconds measured under load. Measurements were taken with Andrew Morton’s “amlat” tool. This 800 microsecond latency is too long for such applications as engine controls, where there is a need to have three degrees of control when measuring revolutions per minute (rpm). There are three ways in which Sarma and McKenney are trying to solve this latency problem.

One possible solution for the latency problem described here is “Per-CPU Daemon.” The primary advantage of this approach is that it is transparent to users of “call_rcu().” Disadvantages include proliferation of kernel daemons and tuning parameters.

Another potential solution proposed by Sarma and McKenney is called “Direct Invocation of RCU Callbacks.” Advantages to this option are that there are no kernel daemons and no tuning parameters, and it eliminates “softirqs” for callbacks. Disadvantages are that it is not transparent to the user and it can cause problems if used incorrectly by the user.

Finally, the third option presented by Sarma and McKenney on dealing with real-time latency is “Throttling of RCU Callbacks.” The main advantage of this option is that, like “Per-CPU Daemon,”


it is transparent to users of “call_rcu().” Disadvantages of this method are tuning parameters and that the current implementation is based on iterations and not time.

The initial performance results show that all three approaches have similar, significant performance increases and little complexity. Sarma and McKenney conclude with their argument that RCU can be made safe for real time. Finally, they found that up to this point, “Throttling of RCU Callbacks” is the less intrusive of the two transparent choices. They note that other performance issues exist for real-time applications and that tools to identify problems other than latency are currently under development.

Making Hardware Just Work

The problem with the Linux desktop is that Linux lags behind its competition in terms of hardware management. Instead of having to su to root and mount drives for a simple plug-and-play device, we want our desktop to be able to figure out what to do for us. “Think MacOS X simplicity,” Robert Love says.

Robert Love is one of the main developers of “Project Utopia,” an umbrella project focused on using the 2.6 Linux kernel’s new features, sysfs and hotplug, along with HAL (hardware abstraction layer), udev (a user-level device management system), D-BUS (a message bus), and the GNOME Volume Manager. The goal of this project is to be clean and elegant without the use of kernel hacks, like “supermount.” It is very important to “do things right.”

The 2.6 Linux kernel does not provide central management as is. It also does not have a platform-agnostic daemon to take advantage of sysfs’ device database and hotplug’s ability to provide notification of when a new device is added. The additional components of Project Utopia are needed to provide this support.

HAL is a central repository of device information. HAL has a platform-independent interface and persistent key/value pairs, provides asynchronous notifications of changes to devices, and handles Universal Resource Identifiers (URIs) to access devices. The application programmer’s interface (API) to HAL is “libhal.” Using HAL instead of legacy code significantly reduces the amount of code needed to handle devices (thousands of lines of photo application code can be reduced to less than 10 lines of code with HAL).

On most current Linux distributions, /dev contains about 18,000 device nodes. Realistically, you only want to see a list of the devices that you currently have. The clean, elegant user-space solution to this is called “udev.” /udev only lists devices that you actually have on your system and can be renamed for your convenience.

In order to tie all of this together, Project Utopia needed a message-passing system called D-BUS. The kernel can send out D-BUS messages of new devices. The kernel/D-BUS layer is used by the GNOME Volume Manager.

The GNOME Volume Manager, a manager of disk and other media volumes, allows you to automatically manage volumes, automounting or autoplaying new media/devices, like automatically playing a CD or DVD. It can also create desktop icons based on the type of device or media attached to your system.

Linux is made up of a lot of projects, which often lack integration. The Utopia Project developers hope to bring some unification and integration to the Linux desktop.

FREENIX SESSION: SECURITY

Summarized by Matt Salter

Design and Implementation of Netdude, a Framework for Packet Trace Manipulation
Christian Kreibich, University of Cambridge, UK

Awarded Best Student Paper

Solving a problem that involves manipulating network traffic often requires complex filtering, fine-grained and large-scale editing, and visualization. Finding well-maintained tools with the desired functionality is often a hassle and not always possible. Another approach is to write your own solution. While tools that allow editing of captured network traffic have been created, they are often not reusable at the API level, since their functionality is only available in stand-alone executables.

The network dump data displayer and editor, Netdude, is a framework for packet inspection and manipulation. Netdude has GUI and command line usage paradigms. It allows for scaling of trace sizes and is reusable at all levels, as well as being extensible.

A bottom-up view of Netdude’s layered architecture is as follows: libpcap handles elementary trace file operations. libpcapnav is a wrapper around libpcap that allows one to jump to arbitrary points in the trace file, identified by timestamps or offsets. libpcapnav uses heuristics to get in sync with the packet stream. Above libpcapnav is libnetdude, the core of the framework, which makes the editing of large traces transparent. libnetdude is extensible through two kinds of plugins: feature and protocol. It provides per-packet tcpdump output, as well as an observer/observee API to inform the user of updates. The GUI is GTK-based and extensible through the same kinds of plugins as libnetdude, and

updates itself through libnetdude’s observer API.

Handling of big trace files always involves limiting the number of packets in memory. Since it is not possible to simply use mmap() for inserting and deleting packets, trace files are edited at the granularity of trace areas, which are bounded by timestamps or fractional offsets. Modified trace areas become trace parts, which are flattened onto the original trace file when the file is saved.

Netdude has served as a mechanism to conveniently access suspicious network activity, create traces for network performance evaluation, edit honeypot traffic, and generate IDS signatures. There is much work left to do, including packet resizing and support of scripting environments. Help is welcome!

Trusted Path Execution for the Linux 2.6 Kernel as a Linux Security Module
Niki A. Rahimi, IBM

Trusted Path Execution (TPE) was originally a kernel patch to OpenBSD 2.4 created by Mike Schiffman. It was later modified for OpenBSD 2.8 and 2.9 by the Stephanie project.

Currently, TPE is implemented for Linux 2.5/2.6 as a Linux Security Module (LSM). TPE’s notion of a trusted path is not to be confused with the more common concept of trusted path in a network context. In TPE, a trusted path is root-owned and neither group- nor world-writable. A trusted user, root by default, is any user on the access control list (ACL) determined by the system administrator.

The TPE LSM performs a check upon execution of a file by utilizing the tpe_bprm_set_security hook in the LSM framework. Upon execution of a file, the module verifies whether the user and the path are trusted. The TPE ACL is modified via a sysfs pseudo file-system approach. A directory called tpefs is created containing two files, add and del. Users are added and deleted by writing their UID to the add and del files, respectively. TPE enhances security by preventing execution of untrusted code on the system. The check of path and user occurs exactly before execution is allowed, and if the user and the path are both untrusted, execution is denied.

In addition to trusting code in root-owned directories, TPE LSM trusts code in directories of trusted users. TPE is part of the LSM patch as of 2.5.70. It is open to improvements as it is released under a dual BSD/GPL license. LSM, accepted as the current method to introduce security to the Linux kernel by the kernel community, is a small project with lots of potential, for which many more modules are needed.

Modular Construction of DTE Policies
Serge E. Hallyn, IBM Linux Technology Center; Phil Kearns, College of William and Mary

Domain Type Enforcement (DTE) is a mandatory access control system introduced in the 1970s by Honeywell, TIS. It assigns types to files and domains to processes. A domain is structured as a list of sets. One of these is the entry type set, which specifies through which types the domain may be entered. Another is the type access set, which specifies which types the domain can access. The signal access set specifies which domains the domain can signal, and the transitions set specifies which domains the domain can transition to. There are two types of domain transitions, auto and exec. When a process under some domain executes a file which is an entry point to another domain, either it must switch to the new domain (auto) or exercise the default option of keeping its domain (exec).

A DTE policy contains lots of domains, types, and defaults. The policy module files presented in this paper are a collection of domains, types, and group definitions, as well as the access rules pertaining to them. Type definitions consist of both path assignments and access grants. Both type and domain definitions may contain assert statements that are used for maintenance of policy constraints, which are interpreted and enforced by a policy consistency class. Since domains and types can declare conflicting access rules, priorities for the access rules are defined. These priorities are determined by placing specific rules over general rules, inbound access rules over outbound access rules, and use of the “absolute” keyword. Group definitions facilitate more generic modules. These are achieved through either the keyword “all,” binding a group name to a set of domains or types, or namespace globbing. Groups are expanded only at time of reference and can be dynamically extended.

Modules interact with the system by obtaining system-specific data. They can also be moved between systems and shipped with software. Modules are loaded into the DTE LSM module through a configuration file which is generated by a script that takes a list of module files as input. Work related to this project includes DTEX by Chuck Fox, Fedora, the IBM research project Goyko, and Tresys. This project still needs to be applied to SELinux, which would require object classes and fine-grained permissions, and the modules need to be distributed. Future work also includes possible improvement of the priority specification.
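The four sets that make up a DTE domain, and the difference between auto and exec transitions, can be made concrete with a short sketch. This is an illustration only, not the policy language or kernel code from the paper: the `Domain` class, the helper functions, and the example names (`login_d`, `user_d`, `admin_d`, `shell_t`, `su_t`, `home_t`, `etc_t`) are all hypothetical. It also assumes that an exec transition is taken only when the process explicitly requests the target domain, which is one reading of “the default option of keeping its domain.”

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    name: str
    entry_types: set = field(default_factory=set)    # file types through which the domain is entered
    type_access: dict = field(default_factory=dict)  # file type -> permitted modes, e.g. {"r", "w"}
    signal_access: set = field(default_factory=set)  # names of domains this domain may signal
    transitions: dict = field(default_factory=dict)  # target domain name -> "auto" or "exec"


def can_access(domain, file_type, mode):
    """Type access set: may a process in `domain` use `mode` on `file_type`?"""
    return mode in domain.type_access.get(file_type, set())


def can_signal(sender, receiver):
    """Signal access set: may `sender` signal processes in `receiver`?"""
    return receiver.name in sender.signal_access


def domain_after_exec(current, file_type, domains, requested=None):
    """Domain a process ends up in after executing a file of `file_type`."""
    for target, kind in current.transitions.items():
        if file_type not in domains[target].entry_types:
            continue                      # file is not an entry point to target
        if kind == "auto":
            return target                 # mandatory switch
        if kind == "exec" and requested == target:
            return target                 # voluntary switch (assumed semantics)
    return current.name                   # default: keep the current domain


# Hypothetical three-domain policy
login = Domain("login_d", transitions={"user_d": "auto"})
user = Domain(
    "user_d",
    entry_types={"shell_t"},
    type_access={"home_t": {"r", "w"}, "etc_t": {"r"}},
    signal_access={"user_d"},
    transitions={"admin_d": "exec"},
)
admin = Domain("admin_d", entry_types={"su_t"}, signal_access={"user_d", "admin_d"})
domains = {d.name: d for d in (login, user, admin)}
```

With this policy, a login process that executes a `shell_t` file always lands in `user_d`, while a user process that executes an `su_t` file stays in `user_d` unless it asks for `admin_d`.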


USELINUX SIG

Summarized by Martin Michlmayr

Building and Maintaining an International Volunteer Linux Community
Jenn Vesperman, author and consultant; Val Henson, Sun Microsystems

Val Henson shared her insights about creating a volunteer community based on her experience with the LinuxChix project. LinuxChix is an international community whose focus is to create a friendly, predominantly female Linux community. The project was founded in 1999 by Deb Richardson, who created a Web site, mailing lists, and a logo. By 2001, Deb was burning out as the LinuxChix project struggled to remain active. At that time, Jenn Vesperman was chosen as the new coordinator.

Henson summarized the lessons she learned about building and running an international volunteer community, grouping them into three categories: First, the social category, where she suggests that you have to build a sense of community; second, there are organizational aspects, which boil down to delegation; the third category is the technical one, and Henson emphasized the importance of using technology that distributes well.

Henson suggested that building a sense of community was fairly easy for LinuxChix, because building community was the goal of the project. Women are quite excited to find other women who share their interests, and they create bonds fairly easily. One interesting observation Henson made was that being friendly and nice does not attract incompetent people. While LinuxChix’ explicit goal is being friendly to its members, there is some hostility in other projects to new members, possibly intended as a high barrier to joining that keeps incompetent contributors away. According to Henson, this strategy does not work; being friendly is the better approach. To help create community, LinuxChix uses specialized mailing lists, some of them focused and technical but others allowing completely off-topic discussions. A member-only posting policy on some lists increases the sense of community.

Henson emphasized the importance of delegation as part of organization. She suggested that the first coordinator of the project burned out because she took on too many tasks herself; Henson therefore jokingly said that the number one rule is to do nothing yourself. Instead, delegate to other people; once a task has been delegated, let go and don’t interfere. It is also important to give credit. One task of the main coordinator is to monitor the health of other volunteers, and to act accordingly, for example, by sending an overworked volunteer on vacation and by finding more volunteers to help out. Finally, she also suggested that rules should be kept to a minimum: rules will drive good people away, and trolls won’t abide by them anyway.

Indexing Arbitrary Data with SWISH-E
Josh Rabinowitz, SkateboardDirectory.com

Josh Rabinowitz introduced SWISH-E, a simple Web-indexing system for humans. He summarized the tool as being a fast, powerful, flexible, free, and easy-to-use system for indexing collections of Web pages, and suggested that the definition would not be complete if any of these adjectives were removed. SWISH-E is based on Kevin Hughes’ SWISH project from 1994 and is now maintained by Bill Moseley. The tool is written in C and creates binary indexes. There are C, Perl, and PHP interfaces. There are several alternatives to SWISH-E, such as htdig and MySQL, but Rabinowitz argued that SWISH-E has several attractive features. It is fast and robust, and the tool undergoes steady development and bug fixes. There is good and extensive documentation, as well as a lively and informative mailing list. Finally, SWISH-E offers a “bulk insert” method which doesn’t presume to know how or what you want to index. SWISH-E currently handles XML, HTML, and text, but there are two methods for dealing with arbitrary files. First, an external program can be written that converts a file to a format SWISH-E understands. Second, a FileFilter for each given file type can be created. This approach is more modular, but it’s slower, since it invokes a child process for each file.

Rabinowitz showed in some examples the ease of creating indexes with SWISH-E and introduced SMAN (http://www.joshr.com/src/sman/), a tool to search UNIX man pages that is based on SWISH-E. Finally, Rabinowitz summarized some future development. The 2GB size limit should be removed soon, and UTF-8 support is a major feature that is being worked on. The ranking system should be rewritten, and the main developer of SWISH-E is interested in working together with a graduate student who would like to pursue such a project.

FREENIX SESSION

Demonstration: Croquet, a Networked Collaborative 3D Immersive Environment
Dave Reed, HP Labs

This demonstration of a 3D networked, collaborative Croquet environment is considered a “research project, not a product,” says Dave Reed, project developer. It demonstrates a peer-to-peer networked application that supports collaborative computing and scalable computation.

The demonstration used two PCs networked together on the same switch (the network is meant to

66 ;LOGIN: VOL. 29, NO. 5

provide different views of the same world), but Reed claimed that the environment can support many more peers. There were some networking problems during the demonstration which made one of the views update too slowly or inaccurately show the world. Because of this, Reed presented the environment in one perspective only.

The environment showed clouds on a 3D plane with windows or "portals" on it. These portals acted as hyperlinks or mirrors to other worlds. Dave went from portal to portal showing us what each had inside. One portal showed a recursive pyramid and another had water that rippled when the mouse was moved over it. Still other portals showed images of people, a chess board, and a flag with a spring and mast to show how the flag changes when moved with the mouse.

He entered into one portal that brought us to an underwater world. He explained that with this scenario, you can "change the laws of physics" by making heavy objects float or have whispers only be heard by certain people across the room. In this underwater world, Reed was represented as a fish. He used a Paint program to draw a new character on the fly and then added this character to the world. The funny part was that there were large help signs in the water explaining "how to make a fish" or "how to make your fish swim" for the beginners. He then left this world and showed a vast world with waterfalls and trees in it that he called a "traditional video game world."

This 3D Croquet environment was implemented using OpenAL and OpenGL for its audiovisual features. It works with communicating objects whose messages are replicated or "cloned"; most messages do not go over the network (almost all computation is local). Objects are governed by mouse movements and positions so that the appropriate action is taken. This system is considered real time, and it is therefore very important that each machine in the network has the correct synchronized time.

In the future, Reed would like to have this Croquet environment implemented for small devices, such as cell phones. He also hopes to develop a security model for the system. There is an important unanswered question: What should people be allowed/not allowed to do in this environment (what is considered cheating, what should be allowed to be read/written, how much should you be allowed to see?)?

This Croquet project is approximately nine months old and is being developed in partnership with the University of Minnesota and the University of Wisconsin. The project is scheduled to be released as open source in an open and public forum.

USELINUX SIG

Summarized by Adam S. Moskowitz

Linux and Genomics: The Two Revolutions

Martin Krzywinski and Yaron Butterfield, Genome Sciences Centre

The session started off with Martin Krzywinski, from Canada's Michael Smith Genome Sciences Centre, talking about Linux and genomics, their near-parallel rapid advancements, how Linux is used in genomics, and how genomics has independently adopted many of the same "ideals" as the Linux community.

Martin started by discussing what I believe are the most significant parallels between the Linux and genomics communities, namely, openness and innovation. Just as the Linux community encourages people to build useful things and give them away, the genomics community does that not only with software tools but with the data itself. The most well-known example of this is the human genome project, where the completed genome was uploaded to the public databases pretty much as soon as it was ready. Most public genomic sequencing centers submit new data to Genbank (a public repository of sequence information) pretty much as soon as it is collected.

The genomics community, like the Linux community, actively contributes software to the public on a regular basis. A genome browser from UCSC and the Ensembl browser/data miner/visualizer are both freely available. Jim Kent's genomic assembler, which was instrumental in the public effort to complete the human genome, is also freely available. Finally, Lincoln Stein has contributed numerous CPAN modules, both for genomics and such commonly used modules as the Perl interface to Tom Boutell's libgd and Stein's own CGI.pm.

Using these and other public tools, the Smith Centre was the first to publicly release the full sequence of the coronavirus believed to be responsible for SARS (Severe Acute Respiratory Syndrome). This was accomplished in just five days, using an eight-way Linux system. Krzywinski shared some of the feedback the center received; much of it was positive but some wasn't. One person wrote:

Subject: You have to be NUTS!

My daughter doesn't think its such a good idea to have the gene sequencing for the new coronavirus on the internet. I don't either! There should have been a better way! You must be crazy!

[Summarizer's note: I suppose some people feel the same way about making the Linux source code public.]

By the way, Martin Krzywinski gets my award for most interesting slides: black drawings on a red

;LOGIN: OCTOBER 2004 USENIX ’04 ANNUAL TECHNICAL CONFERENCE 67

background and the funkiest font I've seen in a long time.

Thin Client Linux, a Case Presentation of Implementation

Martin Echt, Capital Cardiology Associates; Jordan Rosen, Lille Corp.

In this presentation, Martin and Jordan described how Martin's medical practice decided to install a Linux-based thin client instead of Windows PCs and how that decision has worked out.

Martin started by describing their practice (with more than 200 employees total, 40+ doctors, seven offices, seven hospitals, in New York and western Massachusetts), their work load (128,000 patient visits, 800 open-heart surgeries, 380,000 services billed per year for $22 million in revenue), and their MIS needs (billing, storing test results, financial planning and analysis, payroll, medical imaging, calendar, email, word processing, and more).

Martin then proceeded to give a fairly detailed cost-benefit analysis of "thick" Windows versus thick Linux versus thin Linux. While their initial outlay was about $15,000 more for thin Linux, subsequent savings more than made up for that difference. Specifically, for Martin's practice, their initial outlay per workstation would have been about $3,000 for Windows compared to about $2,100 for thin Linux. Their first-year operating costs showed a similar savings: $2,800 vs. $1,300. With 200 workstations, the savings from choosing a thin Linux client were clear. The last savings was in significantly reduced support costs for remote sites: because almost everything was done at the central office, and because hardware maintenance was mostly reduced to swapping out bad machines (which could be done by an employee with no special skills), their remote maintenance costs were reduced to nearly nothing.

Martin also pointed out several other benefits of thin Linux, chiefly, that employees were more productive. With Windows, too many applications could be customized and employees spent too much time doing this with no real gain in production; with Linux it was easier to disallow such customizations. Another benefit was that applications could be customized to prevent user-caused "outages." The most obvious example was preventing users from closing an application without properly quitting; they simply removed the "X" button from the menu bar! The last of these savings Martin mentioned was preventing employees from using the computers for personal use (things like playing solitaire, downloading music files, and setting fancy screen savers). He estimated that at 15 minutes per day per employee, such wasted time cost his company over $110,000 each year.

Jordan then took over and presented the technical side of things. The first thing he mentioned was that some applications could not be made to work under Linux; for these the practice kept 10 "thick" Windows systems and set up a single Windows 2000 Server system for data storage; Samba was used for logons and drive mappings.

Next, Jordan discussed the high and low points of the software used on the thin Linux client: OpenOffice worked quite well, but other applications (Evolution and Mozilla) had a few problems, such as tending to crash or not handling certain required functions (some Web sites, calendaring). There were some problems with low-level things such as file permissions and lack of file locking in OpenOffice.

On the whole, user acceptance of the thin Linux client was high. The practice has been running for 300 days without a single server crash, the network has never gone down except through human error, the remote desktop (via VPN) has yet to fail, and no viruses or worms have affected their network.

Towards Carrier Grade Linux Platforms

Ibrahim Haddad, Ericsson Research

The third talk of this session was what appears to be a refereed paper, written and presented by Ibrahim Haddad from Ericsson Research on "Carrier Grade Linux," that is, a Linux operating system capable of being used in servers and switches in a public telecommunications network. Typically, such servers require 99.999% reliability (roughly five minutes of downtime per year), and switches require 99.9999% (roughly 30 seconds of downtime per year). Obviously, no version of Linux is there yet, but Haddad's talk summarized what is needed to get there, as well as what features will be required by carriers before Linux could be used to replace existing, proprietary systems.

Haddad presented an overview of the groups (committees, working groups, associations) working on this problem: the PCI Industrial Computer Manufacturers Group (highly available hardware), the Carrier Grade Linux Working Group of the Open Source Development Labs (Linux improvements), and the Service Availability Forum (defining high-availability APIs). Haddad works with the CGL working group, who released their first public draft in May 2004.

The remainder of Haddad's presentation covered three services not found in the stock Linux release that would be required for mission-critical environments. The first service was TIPC (Transparent Inter-Process Communication), an intra- and inter-cluster protocol that provides a framework for supervising and reporting topology changes. TIPC has been used by Ericsson for several years now and has been available as open source since February 2003. The second service, DigSig (Distributed Digital Signature), is part of the larger Distributed Security Infrastructure (DSI) initiative. This service allows an administrator to embed digital signatures in ELF binaries and adds functionality to the Linux kernel that prevents unsigned or badly signed binaries from executing. DigSig has been available as open source since January 2003. The last service is a package that implements native support for asynchronous events in the Linux kernel. Carrier grade platforms must process huge numbers of events quickly and efficiently, and this package implements the first tier of such services. It was also released as open source in January 2003.

PLENARY SESSION

Summarized by Martin Michlmayr

The State of the Spam

Eric Allman, Sendmail, Inc.

Eric Allman opened his talk humorously by analyzing the way talks on spam traditionally work. They first summarize what spam is all about, mention that spam is bad, go on to say that the situation is really bad, and finally claim that their product will solve all spam problems. Allman did not go this route, and while he suggested several ways to combat spam, he also made it clear that it will take years to come up with effective solutions and that everyone has to work together.

In the beginning, Allman gave some statistics and summarized claims about spam. Apparently, there are about 90 "world class spammers" who pay US$100,000 per month for bandwidth and servers. According to SpamHaus, 200 spam operations account for 90% of all spam. Spam costs mere microcents per message, which is why spammers continue to operate even though AOL rejects 80% of all incoming mail as spam.

Allman proceeded to summarize existing and new technologies used against spam: realtime blackhole lists (RBLs), content filtering, and challenge-response. RBLs are controversial because their false-positive rate is pretty high. Content filtering comes in different classes: heuristic filters work by observing what spammers are doing and creating means to detect and counter them. Unfortunately, this leaves us in a reactive mode; they send spam, we adapt our tools, and in the meantime we suffer from spam. Fingerprinting and collaboration store a fingerprint of a spam message so other people can test the fingerprint and discard spam. Again, this method is reactive and Allman suggested that it is only effective when the fingerprint database is updated every 15 seconds! Machine learning filters let the computer figure out the interesting stuff. This method needs two piles of "training data": spam and not-spam. While this method works fairly well for individuals, this is less the case on the server level, since legitimate mail varies a lot depending on the user.

Newer methods which are currently being worked on are traffic analysis, identity authentication, and economic schemes. Traffic-based filtering observes typical traffic patterns. For example, a host that normally sends 100 messages in a month and suddenly sends millions in a few minutes is very suspicious. One possible way to use this is to greylist a host and slow down the connection significantly. Identity-based filtering almost always requires authentication. You can use allow-lists and block-lists in order to reduce the amount of resources spent on more expensive spam checks. There are two philosophies: everything not explicitly illegal gets through (default to accept) or all not explicitly legal gets blocked (default to deny). Sender authentication is not an anti-spam solution in and of itself, but it is essential for identity-based algorithms. We already have SMTP AUTH and TLS, but both are MTA-to-MTA, not end-to-end. Per-user authentication would be possible with PGP or S/MIME. Finally, there are economic schemes which shift costs from recipient to sender. A very small cost doesn't hurt usual senders (perhaps 100/day) but does hurt bulk senders (millions/day). These systems do not necessarily have to be cash-based since the credit can come in a different form.

At the end of the talk, Allman made several predictions. First, he suggested that spam will never go away completely. Authentication will help but won't solve the problem by itself. He thinks that spam will be "manageable" within two to three years, and that legislation will scare away bit players, but not large commercial spammers.

FREENIX SESSION: SOFTWARE ENGINEERING

Summarized by Brian Cornell

Managing Volunteer Activity in Free Software Projects

Martin Michlmayr, University of Melbourne

Martin is a member of the Debian GNU/Linux team and brought his experience with free software projects to the community. The main problem he introduced was that volunteers will sometimes neglect their duties, and it is hard to figure out when they do. For small projects this can mean that the project dies because nobody finishes it. For large projects this means that the quality of the product suffers and there are delays in the release of new versions.

Martin went on to describe how Debian is organized and what they do about this problem. At Debian there isn't a hierarchical management structure, so developers aren't supervised by a manager. Therefore they have to carefully look through hundreds of developers to figure out who is neglecting to do what they should. They find these people


in many ways: for example, when there is a bug in a package that is release critical or when a newer version of the software a package provides is available. Every once in a while they compile a list of people who appear to be neglecting packages.

Once you know who is not doing what they should, you have to do something about it. Kicking someone off of a project unnecessarily is not a good idea, because they could provide a lot of help for your project, and they may not be easy to replace. Debian contacts maintainers asking them if they are still active, and gives them two to three weeks to respond. If they don't respond, they're contacted again and given more time to respond before they are eventually removed. Because these people are volunteers, Debian cannot be overly demanding of them, and therefore a polite system like this is necessary. Martin also points out that you can try to prevent the dereliction problem by having redundancy throughout the project.

Creating a Portable Programming Language Using Open Source Software

Andreas Bauer, Technische Universität München

Andreas Bauer's talk would have been very welcome in any class on compilers. He gave detailed information about the use of gcc to create new programming languages. Gcc, the GNU Compiler Collection, has support for many different languages, and independently has support for many architectures. Andreas presented the capabilities of gcc through a simple expression language he called Toy.

Andreas talked about the current design of gcc's programming language interface. Gcc uses trees to express the language, and then generates an intermediate language called RTL based on these trees. Programming language interfaces are responsible for generating these trees during parsing. Unfortunately, these trees are somewhat tailored toward C and don't make concepts such as tail recursion, garbage collection, and scope representation easy. For this reason, a new system called SSA is being developed with generic trees and more optimization.

FREENIX INVITED TALK

Summarized by David Reveman and Peter Nilsson

Current GTK+ Development

Mattias Clasen

GTK+ is a multiplatform toolkit for creating graphical user interfaces with excellent internationalization support. GTK+ was initially developed for and used by the GIMP, the GNU Image Manipulation Program. Today GTK+ is used by a large number of applications and is the toolkit used by the GNU project's GNOME desktop. It can be used with a wide range of programming languages.

Mattias described the different components GTK+ is based on. Glib is a low-level core library that provides data-structure handling for C, portability wrappers, and interfaces for such runtime functionality as an event loop, threads, dynamic loading, and an object system. Pango is a library for layout and rendering of text, with an emphasis on internationalization. The ATK library provides a set of interfaces for accessibility, and the GDK library provides a layer of abstraction that sits between GTK+ and the underlying windowing system.

The talk briefly covered the additions to the 2.4 release, like the new, much improved file chooser and the new combo box. He mentioned that current maintenance of the 2.4 release is mainly directed to bug fixing and performance improvements. It is very complex to fix bugs in such a widely used toolkit without breaking backwards compatibility. Performance improvements have been made primarily in three areas: predictive exposes, reduced flicker by unsetting the background, and reduced signal emission overhead.

Current goals for GTK+ are to provide a full platform, close gaps to higher-level software layers, sanitize the GNOME library stack, keep up with evolving UI needs, and maintain binary compatibility.

The 2.6 release is planned for December 2004 and will contain solidified 2.4 add-ons as well as other smaller additions. The new file chooser will work well in 2.6; improvements include shared settings with Nautilus, automatic shortcuts, recent files, and the ability to choose file formats in the Save dialog. Some missing features will be added to the combo box, including separators, scrolling, and so-called insensitive items. New additions to 2.6 will also be made in areas of command line argument parsing, an icon list widget, a progress cell renderer, and some widgets from libgnomeui. Support for rotated text has been added to Pango, and more work will also be going into Pango.

Mattias talked a lot about what we can expect to see in the future of GTK+. Some of the planned changes are a new rendering model, support for RGBA visuals, an improved theme system, a built-in printing system, and full introspection. A big change that will happen to GTK+ is the introduction of a new rendering model. This will be accomplished by moving to the Cairo library for rendering. Cairo is a modern 2D graphics library with a PostScript-like API. It has capabilities similar to Java 2D, SVG, and PDF 1.4; alpha-compositing is a natural part of Cairo. Cairo has output back ends for X, OpenGL, local image buffers and PostScript. Support for RGBA visuals will be added to GTK+ and will make translucent windows, fade-in effects for menus, and drop shadows well supported. The motivation for a new improved theme system is the desire to remove GTK+ dependencies, fully support Cairo's rendering model, and include layout access in the theme system. A full theme system specification should be made available and will most likely use a standard syntax like XML's CSS. The built-in printing system will include appropriate printing dialogs and will be based on Cairo, with back ends for CUPS, lpr, and GDI. Introspection is useful for language bindings, documentation, and IDEs. GTK+ already supports introspection of type hierarchy, properties, and signals, but not yet of virtual functions in class structs and library functions, which will be added in the future.

EXTREME LINUX SIG

Summarized by Matt Salter

A New Distributed Security Model for Linux Clusters

Makan Pourzandi, Open Systems Lab, Ericsson Research

The target applications for distributed security are large distributed applications with a large software base that provide around-the-clock service and require high availability (99–99.999% uptime). The model presented in this paper specifically targets Linux clustered servers and is intended for servers exposed to the public, providing services to different operators, and running untrusted third-party software.

Distributed security has several requirements. One is security isolation, or compartmentalization. This is needed because exploitable vulnerabilities are probable in a large software base; without compartmentalization, a single vulnerability could expose the entire system. Runtime changes to the security context must be possible and reflected immediately, and application-layer security cannot be relied upon, since administrators must then contend with vulnerabilities in applications over which they have no direct control. The challenges of distributed security include creating a coherent implementation that does not leave any security gaps, integrating different security solutions from different vendors, and managing the system to prevent misconfigurations and inconsistencies.

Because most of the target applications have only a few users with whom everything is done, a security policy based on process is needed. At the node level, such security is achieved through mandatory access control. The model presented in this paper extends mandatory access control to the entire cluster. Processes are assigned a unique security ID (ScID), assembled from the ScID of the binary (stored in the ELF header), the ScID of the parent process, and the node security ID (SnID). To achieve compartmentalization, virtual security zones are set up inside the cluster. Security zones are groups of ScIDs and SnIDs. The distributed security policy allows for access control decisions on the process level based on the IDs of the source and target processes. Network, socket, and transition rules also exist. The architecture of the distributed access control implementation is as follows: each cluster has a single security server and each node has a security manager. The security policy is propagated from the security server to the security managers, which enforce policy at the node level via channels.

This model is not intended to replace existing security solutions, but, rather, to serve as an add-on to them. Challenges include creating a comprehensible and acceptable security policy and explicitly defining security zones in the distributed security policy.

Implementing Clusters for High Availability

James E.J. Bottomley, SteelEye Technology

A "highly available" (HA) system is any system that takes action to increase availability beyond what would ordinarily be possible. HA clusters consist of multiple networked local machines with some type of shared storage. There are three types of HA clusters. The simplest type is a two-node-only cluster, which cannot be scaled. A second type is the quorate cluster, which is centrally controlled and will not work without a membership service. A quorate cluster is defined such that no other cluster may be formed from excluded nodes, which means it cannot be split into two clusters. If the cluster is split, the majority of the nodes survive. The final type is the resource-driven cluster, in which resources are grouped by which services they belong to. In a resource-driven cluster, a node must simply establish ownership of a group to export the service. Resource-driven clusters also allow independent subclusters to form. The simplest of these cluster types is two-node-only, followed by resource-driven, and the far more complex quorate cluster. Recovery is much faster in resource-driven clusters than in quorate clusters.

Determining availability is difficult because you need to know what the system's uptime and downtime are in your environment. While duplication of nodes allows you to determine downtime, it does not allow you to determine uptime. Uptime can only be controlled through careful implementation and deployment of the cluster. However, whether availability or downtime is significant depends on the type of service being offered.

Often, it is the application that fails instead of the server. Monitoring applications is important so that application failures can be spotted


and corrected. Local application recovery is important as well, since applications fail more often than nodes, and local recovery decreases downtime and minimizes disruption. Also, monitoring for failures in general is important, since while redundancy protects your system from the first failure, the second failure will take your system down.

Uptime can be improved by assessing cluster hardware and eliminating single points of failure (SPOFs). Clusterwide SPOFs should be eliminated entirely, while individual-node SPOFs should be evaluated to see if eliminating them would improve uptime. In a shared storage cluster, the real SPOF is storage and should be addressed through replication by making sure that the external array is configured as RAID 1. Power supply, mechanical devices, and the connection to the storage are the node SPOFs. One way to eliminate the connection to a storage SPOF is to have multiple connections to the storage from a node. This is called a multipath cluster.

The biggest Linux-specific problem faced by cluster manufacturers with binary modules is simply keeping up with the kernel patches and releases. Another problem is the dreaded "oops," which kills kernel processes and then tries to continue. If the kernel was in a critical section at oops time, the system may hang. Large Block Device support (LBD) is a Linux feature that helps clusters. It is limited to 2TB in the 2.4 kernel. Multipath solutions are different for every vendor in the 2.4 kernel, but an attempt is being made to unify the architecture for 2.6.

FREENIX SESSION: SYSTEM BUILDING

Summarized by Brian Cornell

KDE Kontact: An Application Integration Framework

David Faure, Ingo Klöcker, Tobias König, Daniel Molkentin, Zack Rusin, Don Sanders, and Cornelius Schumacher, KDE Project

Kontact, a Personal Information Manager (PIM) for the K Desktop Environment (KDE), was presented. Kontact was designed to integrate individual components such as KMail, KOrganizer, KAddressBook, KNotes, KNode, and KPilot. The developers wanted an interface with which all of these programs could be used together, without maintaining the separation between the individual projects.

Kontact was designed with the basic goal of keeping all of the components in it as separate as possible without the user being able to tell. To satisfy this, the components had to be integrated on an application level and still be able to run alone. But to maintain the semblance of integration, the components needed an integrated UI, inter-component communication, and shared settings.

With these constraints in mind, the KDE Kontact team designed Kontact to use plugins from each application with a KPart for the user interface. The components then communicate using DCOP. Using the KParts (basically component versions of the applications), Kontact embeds each component into a unified Kontact user interface. The components can also use a unified configuration through KConfig. Kontact is an ongoing project: http://www.kontact.org.

mGTK: An SML Binding of GTK+

Ken Friis Larsen and Henning Niss, IT University of Copenhagen, Denmark

Henning Niss presented mGTK, a binding to the GTK+ graphics toolkit for the Standard ML language. The goal of this project was to provide SML access to a good general-purpose toolkit. Keeping this in mind, the developers wanted a direct binding to the C interface of GTK; they wanted it to work under any SML compiler, and they wanted compile-time type checking. Other interfaces to GTK only give errors at runtime, making it harder to fix bugs and optimize programs.

SML is a functional language with a formal definition. It is separated into two parts: the core language and the module language. There are many implementations of SML, and the mGTK developers targeted two of them, Moscow ML and MLton. They used a system of type constraints, including what are known as phantom types, to enforce type checking.

Using mGTK, GTK+ classes are translated to SML signatures. Class types are represented as SML types, and methods are implemented as functions. mGTK can automatically generate the SML binding based on the .defs file that comes with the GTK API. mGTK is available at http://mgtk.sourceforge.net.

Xen and the Art of Repeated Research

Bryan Clark, Todd Deshane, Eli Dow, Stephen Evanchik, Matthew Finlayson, Jason Herne, and Jeanna Neefe Matthews, Clarkson University

Repeated research is a process often used to verify results in scientific research. Jeanna, Stephen, and Todd presented an application of this process to the world of computer science research. As a class, they tried to reproduce the tests in the paper "Xen and the Art of Virtualization." They wanted to see if they could get the same results, and to apply more tests to Xen in hopes of further examining its performance.
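As a rough illustration of what checking a repeated result against a published one can involve, the sketch below computes the relative difference for each benchmark and flags any that fall outside a chosen tolerance. The benchmark names, the scores, and the 5% default tolerance are hypothetical placeholders chosen for the example; they are not figures from the Xen paper or from the talk.

```python
def relative_difference(original, repeated):
    """Return |repeated - original| / |original|."""
    if original == 0:
        raise ValueError("original measurement must be non-zero")
    return abs(repeated - original) / abs(original)


def compare_runs(original, repeated, tolerance=0.05):
    """Map each benchmark name to (relative difference, within tolerance?)."""
    report = {}
    for name, original_value in original.items():
        diff = relative_difference(original_value, repeated[name])
        report[name] = (diff, diff <= tolerance)
    return report


# Hypothetical scores, invented purely for illustration.
original_scores = {"bench_a": 100.0, "bench_b": 250.0}
repeated_scores = {"bench_a": 97.0, "bench_b": 270.0}

report = compare_runs(original_scores, repeated_scores)
for name, (diff, ok) in report.items():
    print(f"{name}: {diff:.1%} {'reproduced' if ok else 'differs'}")
```

A relative rather than absolute difference is used here so that benchmarks reported in different units can share one tolerance; a real comparison would also need to account for run-to-run variance.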

Reproducing the environment in which the original Xen tests were run was not easy. They first had to obtain the same hardware and then install the same software used in the original tests. The Xen authors listed the benchmarks they used, making it easy to reproduce those, but assembling and running them was time-consuming. Also, some of the benchmarks used were closed benchmarks, so they had to be replaced with similar open source alternatives. The result of all of the work was that their repeated measurements were within 5% of the original measurements.

The team applied many other tests to Xen: they tested its usability as a set of virtual Web servers; they tested it on commodity hardware, rather than the server machine that the original tests had been run on; and, finally, they compared the performance of Xen to that of an IBM zServer. They learned from this that repeating research is not easy, but it is an important reality check in the development of new technologies.

EXTREME LINUX SIG
Summarized by Bill Bogstad

Scaling Linux to Extremes: Experience with a 512-CPU Shared Memory Linux System
Ray Bryant, John Baron, John Hawkes, Arthur Raefsky, and Jack Steiner, Silicon Graphics, Inc.

Ray Bryant spoke about SGI's Altix Itanium 2-based HPC (high performance computing) servers. Non-shared memory computing clusters are frequently talked about today, but SGI believes that NUMA (non-uniform memory access) shared-memory compute servers remain appropriate for many HPC applications. SGI's Altix systems are architecturally similar to their MIPS-based Origin 3000 servers. The basic building block of an Altix system is a computing brick that has two pairs of Itanium 2 CPUs. Each pair of CPUs can have up to 16GB of dedicated local memory. Bricks are connected using SGI's NUMAlink technology, which supports cache coherency and uses specialized routers. Altix systems have achieved records on a number of HPC benchmarks, including SPEComp L2001 in June 2004. The underlying interconnect technology supports up to 2048 CPUs, but SGI currently only supports 256 (soon 512) CPUs in a single SSI (shared system image) under Linux.

SGI believes that porting single-CPU applications to an SSI system can be much easier than porting to a computing cluster, since non-performance-critical code can be left as non-parallel. When SGI decided to develop a NUMA system using Itanium CPUs, a Linux port to the Itanium was already available, and it was decided that it would be easier to start from this port than to move IRIX from MIPS to Itanium. The current goal is that if an application runs on a generic Itanium under RedHat AS 3.0, then it should run on an Altix system.

However, even getting the 2.4 Linux kernel to run well on the Altix hardware has been an interesting challenge. The kernel had to be taught the performance differences between local and remote memory. On a 512-CPU system this is critical, since only 0.4% of total system memory is local (i.e., fast). A new round-robin buffer cache page allocation algorithm is used to avoid having a brick fill up all of its local memory with cached pages, which would leave no local memory in which to run applications. An O(1) scheduler was added, eliminating a global run-queue lock and yielding a sixfold improvement on some benchmarks. Eliminating system global variables in favor of per-CPU variables, with value aggregation as required, was needed to support very large systems. Without these changes, the system would spend all of its time pounding the cache coherency hardware just to keep system status variables updated. A number of kernel hash tables are sized at O(1%) of total system memory, which is more memory than exists at any one brick in the system. These tables are now spread out in the same way that the buffer cache is.

Changes were also made to allow system operators to make effective use of the system. The dplace command gives the operator predictable memory and CPU allocation for the threads in a single process. By specifying the appropriate parameters, performance can be enhanced by taking advantage of knowledge of the memory access patterns of the various threads in a process.

Looking forward, many of SGI's changes have made their way into the 2.6 Linux kernel. As a result, a generic 2.6 kernel will boot on an Altix system. As SGI moves to supporting kernels based on 2.6, they expect improved scalability and the ability to support larger systems.

Quantian: A Single-System Image Scientific Cluster Computing Environment
Dirk Eddelbuettel, Debian Project

Quantian is a Linux distribution that is focused on cluster-based scientific computing. It was first released in March 2003 and has gone through a number of major releases since then. The latest releases can no longer fit onto a single CD and must now be run from a bootable DVD or from a hard disk. Quantian's lineage can be traced back to the popular Debian distribution. The path is from Debian to Knoppix to clusterKnoppix to Quantian. From Knoppix, it inherits read-only media-based simplicity and automatic hardware detection, along with support for persistent data on USB storage devices. clusterKnoppix adds zero-configuration OpenMosix clustering with automatic process migration, along with the cluster-compatible Mosix File System. A single machine can be booted from Quantian media, and then other machines can network-boot via the PXE protocol and form a single Mosix cluster.

Quantian extends clusterKnoppix with a large number of scientific computing applications. In particular, Beowulf-style clustering tools and libraries are included along with the statistical package R and the SNOW extensions. SNOW allows easy access to high-level parallel statistical computing. Some Knoppix packages that are not related to scientific computing or related software development have been dropped in order to make room for Quantian's scientific computing additions.

Currently, Quantian is essentially a one-man operation maintained by Dirk. He responds to requests for the addition of new packages as time and interest allow. Distribution size, network security concerns, and surveying users for their needs and configurations remain open issues for him. Even though it is primarily a repackaging of other components, Quantian deserves a look if you are interested in scientific computing. At the end of his talk, Dirk mentioned that the laptop he was using was running Quantian with a USB flash drive for persistent storage. It seems that his employer will not let him install Linux on the company-supplied laptop, so he has found another way. Let's hope Dirk keeps finding another way.

Cluster Computing in a Computer Major in a College of Criminal Justice
Boris Bondarenko and Douglas E. Salane, John Jay College of Criminal Justice

John Jay College is a specialized liberal arts college within the City University of New York system. It offers degrees in Law and Police Science, Fire Science, and Forensic Science, among others. So, you might ask, just what kind of cluster computing is needed in a College of Criminal Justice? Douglas Salane made it clear that there are a number of areas where significant computing resources can be helpful.

Current and planned projects include simulations of the fires that occurred after the attack on the World Trade Center, database analysis and data mining of the FBI's National Incident-Based Reporting System, and molecular modeling for toxicology studies.

John Jay College has a relatively small cluster-computing facility. The compute cluster consists of 12 nodes with two CPUs each. A separate database cluster has four nodes, and the computing laboratory has 30 Linux machines.

Still, they had to go through much of the same decision-making processes that larger facilities might go through. Blade/rack systems or piles of PCs? What network file system to use? What interconnect technology? How to manage and monitor the cluster? How to test the correct functioning of the cluster? A cluster-specific Linux distribution or self-configuration?

Verifying the correct functioning of the cluster was of particular concern to Douglas. This concern was strengthened when the test software that is included in the BLACS portion of the ScaLAPACK software library reported incorrect results for some of its tests. In the end, the error was traced to a faulty Gigabit Ethernet card in one of the machines. Other cluster-computing packages don't always provide those kinds of tests. On the other hand, ScaLAPACK can be difficult to use.

For a small site, just figuring out what cluster-computing software is available and how to set it up is a significant undertaking. Unfortunately, Linux distributions like the previously mentioned Quantian were not available when they first started working on their cluster. Support for heterogeneous clusters would also help by allowing them to expand the size of their cluster over time without sacrificing performance to the demands of optimizing software to the lowest common denominator.
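The verification concern raised in the John Jay College talk, where a single faulty network card produced wrong test results, suggests a simple cross-node check: run the same deterministic job on every node and compare the answers. The sketch below is illustrative only and is not the BLACS test suite; the function name `find_suspect_nodes`, the node names, and the checksum values are all hypothetical.

```python
from collections import Counter


def find_suspect_nodes(results):
    """Return the nodes whose result differs from the majority result.

    `results` maps a node name to the value that node reported for the
    same deterministic test job (for example, a checksum of its output).
    """
    if not results:
        return []
    # Treat the most common reported value as the reference answer.
    reference, _count = Counter(results.values()).most_common(1)[0]
    return sorted(node for node, value in results.items() if value != reference)


# Hypothetical run on four nodes: node03 disagrees with the others.
reported = {"node01": "9f2c", "node02": "9f2c", "node03": "17ab", "node04": "9f2c"}
print(find_suspect_nodes(reported))  # prints ['node03']
```

A majority vote of this kind only helps when most nodes are healthy, and it points at a suspect machine rather than at the failing component; tracking the fault down to a specific card, as happened here, still takes manual investigation.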
