THE MAGAZINE OF USENIX & SAGE August 2002 volume 27 • number 5

inside:

CONFERENCE REPORTS USENIX 2002

USENIX, The Advanced Computing Systems Association & SAGE, The System Administrators Guild

conference reports

2002 USENIX Annual Technical Conference
MONTEREY, CALIFORNIA, USA
JUNE 10-15, 2002

OUR THANKS TO THE SUMMARIZERS:

For the USENIX Annual Technical Conference:
Josh Simon, who organized the collecting of the summaries in his usual flawless fashion
Steve Bauer
Florian Buchholz
Matt Butner
Pradipta De
Xiaobo Fan
Hai Huang
Scott Kilroy
Teri Lampoudi
Josh Lothian
Bosko Milekic
Juan Navarro
David E. Ott
Amit Purohit
Brennan Reynolds
Matt Selsky
J.D. Welch
Li Xiao
Praveen Yalagandula
Haijin Yan
Gary Zacheiss

ANNOUNCEMENTS
Summarized by Josh Simon

The 2002 USENIX Annual Technical Conference was very exciting. The general track had 105 papers submitted (up 28% from 82 in 2001) and accepted 25 (19 from students); the FREENIX track had 53 submitted (up from 52 in 2001) and accepted 26 (7 from students).

The two annual USENIX-given awards were presented by outgoing USENIX Board President Dan Geer. The USENIX Lifetime Achievement Award (also known as the “Flame” because of the shape of the award) went to James Gosling for his contributions, including the Pascal compiler for Multics, Emacs, an early SMP UNIX, work on X11 and Sun’s windowing system, the first enscript, and Java. The Software Tools Users Group (STUG) Award was presented to the Apache Foundation and accepted by Rasmus Lerdorf. In addition to the well-known Web server, Apache produces Jakarta, mod_perl, mod_tcl, and an XML parser, with over 80 members in at least 15 countries.

KEYNOTE ADDRESS

THE INTERNET’S COMING SILENT SPRING
Lawrence Lessig, Stanford University
Summarized by David E. Ott

In a talk that received a standing ovation, Lawrence Lessig pointed out the recent legal crisis that is stifling innovation by extending notions of private ownership of technology beyond reasonable limits.

Several lessons from history are instructive: (1) Edwin Armstrong, the creator of FM radio technology, became an enemy to RCA, which launched a legal campaign to suppress the technology; (2) packet switching networks, proposed by Paul Baran, were seen by AT&T as a new, competing technology that had to be suppressed; (3) Disney took tales, by then in the public domain, and retold them in magically innovative ways. Should building upon the past in this way be considered an offense?

“The truth is, architectures can allow”; that is, freedom for innovation can be built into architectures. Consider the Internet: a simple core allows for an unlimited number of smart end-to-end applications.

Compare an AT&T proprietary core network to the Internet’s end-to-end model. The number of innovators for the former is one company, while for the latter it’s potentially the number of people connected to it. Innovations for AT&T are designed to benefit the owner, while the Internet is a wide-open board that allows all kinds of benefits to all kinds of groups. The diversity of contributors in the Internet arena is staggering.

The auction of spectrum by the FCC is another case in point. The spectrum is usually seen as a fixed resource characterized by scarcity. The courts have seen spectrum as something to be owned as property. Technologists, however, have shown that capacity can be a function of architecture. As David Reed argues, capacity can scale with the number of users – assuming an effective technology and an open architecture.

Developments over the last three years are disturbing and can be summarized as two layers of corruption: intellectual-property rights constraining technical innovation, and proprietary hardware platforms constraining software innovation.

Issues have been framed by the courts largely in two mistaken ways: (1) it’s their property – a new technology shouldn’t interfere with the rights of current technology owners; (2) it’s just theft – copyright laws should be upheld by suppressing certain new technologies.

In fact, we must reframe the debate from “it’s their property” to a “highways” metaphor that acts as a neutral platform for innovation without discrimination. “Theft” should be reframed as “Walt Disney,” who built upon works from the past in richly creative ways that demonstrate the utility of allowing work to reach the public domain.

In [...], creativity depends upon the balance between property and access, public and private, controlled and common access. Free code builds a free culture, and open architectures are what give rise to the freedom to innovate. All of us need to become involved in reframing the debate on this issue; as Rachel Carson’s Silent Spring points out, an entire ecology can be undermined by small changes from within.

INVITED TALKS

THE IETF, OR, WHERE DO ALL THOSE RFCS COME FROM, ANYWAY?
Steve Bellovin, AT&T Labs – Research
Summarized by Josh Simon

The Internet Engineering Task Force (IETF) is a standards body, but not a legal entity, consisting of individuals (not organizations) and driven by a consensus-based decision model. Anyone who “shows up” – be it at the thrice-annual in-person meetings or on the email lists for the various groups – can join and be a member. The IETF is concerned with Internet protocols and open standards, not LAN-specific protocols (such as AppleTalk) or layer-1 or -2 issues (like copper versus fiber).

The organizational structure is loose. There are many working groups, each with a specific focus, within several areas. Each area has an area director; together the area directors form the Internet Engineering Steering Group (IESG). The six permanent areas are Internet (with working groups for IPv6, DNS, and ICMP), Transport (TCP, QoS, VoIP, and SCTP), Applications (mail, some Web, LDAP), Routing (OSPF, BGP), Operations and Management (SNMP), and Security (IPSec, TLS, S/MIME). There are also two other areas: SubIP is a temporary area for things underneath the IP protocol stack (such as MPLS, IP over wireless, and traffic engineering), and there’s a General area for miscellaneous and process-based working groups.

Internet Requests for Comments (RFCs) fall into three tracks: Standard, Informational, and Experimental. Note that this means that not all RFCs are standards. The RFCs in the Informational track are generally for proprietary protocols or April first jokes; those in the Experimental track are results, ideas, or theories.

The RFCs in the Standard track come from working groups in the various areas through a time-consuming, complex process. Working groups are created with an agenda, a problem statement, an email list, some draft RFCs, and a chair. They typically start out as a BoF session. The working group and the IESG make a charter to define the scope, milestones, and deadlines; the Internet Architecture Board (IAB) ensures that the working group proposals are architecturally sound. Working groups are narrowly focused and are supposed to die off once the problem is solved and all milestones achieved. Working groups meet and work mainly through the email list, though there are three in-person high-bandwidth meetings per year. However, decisions reached in person must be ratified by the mailing list, since not everybody can get to three meetings per year. They produce RFCs which go through the Standard track; these need to go through the entire working group before being submitted for comment to the entire IETF and then to the IESG. Most RFCs wind up going back to the working group at least once from the area director or IESG level.

The format of an RFC is well-defined and requires that it be published in plain 7-bit ASCII. RFCs are freely redistributable, and the IETF reserves the right of change control on all Standard-track RFCs.

The big problems the IETF is currently facing are security, internationalization, and congestion control. Security has to be designed into protocols from the start. Internationalization has shown us that 7-bit-only ASCII is bad and doesn’t work, especially for those character sets that require more than 7 bits (like Kanji); UTF-8 is a reasonable compromise. But what about domain names? While not specified as requiring 7-bit ASCII in the specifications, most DNS applications assume a 7-bit character set in the namespace. This is a hard problem. Finally, congestion control is another hard problem, since the Internet is not the same as a really big LAN.

INTRODUCTION TO AIR TRAFFIC MANAGEMENT SYSTEMS
Ron Reisman, NASA Ames Research Center; Rob Savoye, Seneca Software
Summarized by Josh Simon

Air traffic control is organized into four domains: surface, which runs out of the airport control tower and controls the aircraft on the ground (e.g., taxi and takeoff); terminal area, which covers aircraft at 11,000 feet and below, handled by the Terminal Radar Approach Control (TRACON) facilities; en route, which covers between 11,000 and 40,000 feet, including climb, descent, and at-altitude flight, and runs from the 20 Air Route Traffic Control Centers (ARTCC, pronounced “artsy”); and traffic flow management, which is the strategic arm. Each area has sectors for low, high, and very-high flight. Each sector has a controller team, including one person on the microphone, and handles between 12 and 16 aircraft at a time. Since the number of sectors and areas is limited and fixed, there’s limited system capacity. The events of September 11, 2001, gave us a respite in terms of system usage, but based on past growth patterns, the air traffic system will be oversubscribed within two to three years. How do we handle this oversubscription?

Air Traffic Management (ATM) Decision Support Tools (DST) use physics, aeronautics, heuristics (expert systems), fuzzy logic, and neural nets to help the (human) aircraft controllers route aircraft around. The rest of the talk focused on capacity issues, but the DST also handle safety and security issues. The software follows open standards (ISO, POSIX, and ANSI). The team at NASA Ames made the Center-TRACON Automation System (CTAS), the software for each of the ARTCCs, portable from Solaris to HP-UX and [...] as well. Unlike just about every other major software project, this one really is standard and portable; co-presenter Rob Savoye has experience in maintaining gcc on multiple platforms and is the project lead on the portability and standards issues for the code. CTAS allows the ARTCCs to upgrade and enhance individual aspects or parts of the system; it isn’t a monolithic all-or-nothing entity like the old ATM systems.

Some future areas of research include a head-mounted augmented reality device for tower operators, to improve their situational awareness by automating human factors; and new differential global positioning system (DGPS) technologies which are accurate within inches instead of feet.

Questions centered around advanced avionics (e.g., getting rid of ground control), cooperation between the US and Europe for software development (we’re working together on software development, but the various European countries’ controllers don’t talk well to each other), and privatization.

ADVENTURES IN DNS
Bill Manning, ISI
Summarized by Brennan Reynolds

Manning began by posing a simple question: is DNS really a critical infrastructure? The answer is not simple. Perhaps seven years ago, when the Internet was just starting to become popular, the answer was a definite no. But today, with IPv6 being implemented and so many transactions being conducted over the Internet, the question does not have a clear-cut answer. The Internet Engineering Task Force (IETF) has several new modifications to the DNS service that may be used to protect and extend its usability.

The first extension Manning discussed was Internationalized Domain Names (IDN). To date, all DNS records are based on the ASCII character set, but many addresses in the Internet’s global network cannot be easily written in ASCII characters. The goal of IDN is to provide an encoding for hostnames that is fair, is efficient, and allows for a smooth transition from the current scheme. The work has resulted in two encoding schemes: ACE and UTF-8. Each encoding is independent of the other, but they can be used together in various combinations. Manning expressed his opinion that while neither is an ideal solution, ACE appears to be the lesser of two evils. A major hindrance to getting IDN rolled out into the Internet’s DNS root structure is the increase in zone file complexity.

Manning’s next topic was the inclusion of IPv6 records. In an IPv4 world, it is possible for administrators to remember the numeric representation of an address. IPv6 makes the addresses too long and complex to be easily remembered. This means that DNS will play a vital role in getting IPv6 deployed. Several new resource records (RR) have been proposed to handle the translation, including AAAA, A6, DNAME, and BITSTRING. Manning commented on the difference between IPv4 and v6 as a transport protocol and noted that systems tuned for v4 traffic will suffer a performance hit when using v6. This is largely attributed to the increase in data transmitted per DNS request.

The final extension Manning discussed was DNSSec. He introduced this extension as a mechanism that protects the system from itself. DNSSec protects against data spoofing and provides authentication between servers. It includes a cryptographic signature of the RR set to ensure authenticity and integrity. By signing the entire set rather than each record, the amount of computation is kept to a minimum. The signature itself is stored in a new RR within each zone file on the DNS server.

Manning briefly commented on the use of DNS to provide a PKI, stating that that was not the purpose of DNS and therefore it should not be used in that fashion. The signing of the RR sets can be done hierarchically, resulting in the use of a single trusted key at the root of the DNS tree to sign all sets down to the leaves. However, the job of key replacement and rollover is extremely difficult for a system that is distributed across the globe.

Manning stated that in an operational test bed, with all of these extensions enabled, the packet size for a single query response grew from 518 bytes to larger than 18,000. This results in a large increase in bandwidth usage for high-volume name servers. Therefore, in Manning’s opinion, not all of these features will be deployed in the near future. For those looking for more information, Bill’s Web site can be found at http://www.isi.edu/otdr.

64 Vol. 27, No. 5 ;login: THE JOY OF BREAKING THINGS the cause of a problem is to isolate it by Other lower-profile pieces covered were Pat Parseghian, Transmeta identifying the conditions that repro- laws prohibiting false contact informa- EPORTS Summarized by Juan Navarro duce the problem and repeating those tion in emails and domain registration R conditions in other test platforms and databases. Similar proposals exist that Pat Parseghian described her experience non-Crusoe systems. Then relevant fac- would prohibit misleading subject lines in testing the Crusoe microprocessor at

tors must be identified based on product in emails and require messages to clearly ONFERENCE

the Transmeta Lab for Compatibility C knowledge, past experience, and com- identify if they are advertisements. (TLC), where the motto is, “You make it mon sense. Briefly mentioned were laws impacting ...we break it!” and the goal is to make online gambling. engineers miserable. To conclude, Pat suggested that some of TLC’s testing lessons can be applied to Bills and laws affecting the architectural She first gave an overview of Crusoe, products that the audience was involved design of systems and networks and var- whose most distinctive feature is a code in, and assured us that breaking things is ious efforts to establish boundaries and morphing layer that translates x86 fun. borders in cyberspace were then dis- instructions into native VLIW. Such cussed. These included the Consumer peculiarity makes the testing process TECHNOLOGY, LIBERTY, FREEDOM, AND Broadband and Television Promotion particularly challenging because it WASHINGTON Act and the French cases against Yahoo involves testing two processors (the Alan Davidson, Center for Democracy and its CEO for allowing the sale of Nazi “external” x86 and the “internal” VLIW) and Technology materials on the Yahoo auction site. and also because of variations on how Summarized by Steve Bauer the code-morphing layer works (it may A major part of the talk focused on the “Experience should teach us to be most interpret code the first few times it sees technological aspects of the USA-Patriot on our guard to protect liberty when the an instruction sequence and translate Act passed in the wake of the September Government’s purposes are beneficent. and save in a translation cache after- 11 attacks. Topics included “pen regis- Men born to freedom are naturally alert wards). There are also reproducibility ters” for the Internet, expanded govern- to repel invasion of their liberty by evil- issues, since the VLIW processor may ment power to conduct secret searches, minded rulers. 
The greatest dangers to run at different speeds to save energy. roving wiretaps, nationwide service of liberty lurk in insidious encroachment warrants, sharing grand jury and intelli- Pat then described the tests that they by men of zeal, well-meaning but with- gence information, and the establish- subject the processor to (hardware com- out understanding.”– Louis Brandeis, ment of an identification system for patibility tests, common applications, Olmstead v. U.S. visitors to the US. operating systems, envelope-pushing Alan Davison, an associate director at games, and hard-to-acquire legacy appli- The technological community needs to the CDT (http://www.cdt.org) since cations) and the issues that must be con- be aware and care about the impact of 1996, concluded his talk with this quote. sidered when defining testing policies. law on individual liberty and the systems In many ways it aptly characterizes the She gave some testing tips, including the community builds. Specific sugges- importance to the USENIX community organizational issues like tracking tions included designing privacy and of the topics he covered. The major resources and results. anonymity into architectures and limit- themes of the talk were the impact of ing information collection, particularly To illustrate the testing process, Pat gave law on individual liberty, system archi- any personally identifiable data, to only a list of possible causes of a system crash tectures, and appropriate responses by what is essential. Finally, Alan empha- that must be investigated. If it is the sili- the technical community. sized that many people simply do not con, then it might be because it is dam- understand the implications of law. aged or because of a manufacturing The first part of the talk provided the Thus the technology community has a problem or a design flaw. 
If the problem audience with an overview of legislation important role in helping make these is in the code-morphing layer, is the either already introduced or likely to be implications clear. fault with the interpreter or with the introduced in the US Congress. This included various proposals to protect translator? The fault could also be exter- TAKING AN OPEN SOURCE PROJECT TO nal to the processor: it could be the children, such as relegating all sexual content to a .xxx domain or providing MARKET homemade system BIOS, a faulty main- Eric Allman, Sendmail Inc. board, or an operator error. Or Trans- kid-safe zones such as .kids.us. Other Summarized by Scott Kilroy meta might not be at fault at all: the similar laws discussed were the Chil- crash might be due to a bug in the OS or dren’s Online Protection Act and legisla- Eric started the talk with the upsides and the application. The key to pinpointing tion dealing with virtual child pornography. downsides to open source software. The upsides include rapid feedback, high-

October 2002 ;login: USENIX 2002 65 quality work (mostly), lots of hands, and the simple equation that drives business: sented in a mixture of hue and shape an intensely engineering-driven profit = revenue - expense. variations. There are many pre-attentive approach. The downsides include no visual cues, including size, orientation, Finally, the concept of value should be support structure, limited project mar- intersection, and intensity. The accuracy based on what is valuable to the cus- keting expertise, and volunteers having of these cues can be ranked, with posi- tomer. Eric observed that his customers limited time. tion triggering high accuracy and color needed documentation, extra value in or density on the low end. In 1998 Sendmail was becoming a suc- the product that they couldn’t get else- cess disaster. A success disaster includes where, and technical support. Visual cues combined with different two key elements: (1) volunteers start data types – quantitative (e.g., 10 Eric learned hard lessons along the way. spending all their time supporting cur- inches), ordered (small, medium, large) If you want to do interesting open rent functionality instead of enhancing and categorical (apples, oranges) – can source, it might be best not to be too (this heavy burden can lead to project also be ranked, with spatial position successful. You don’t want to let the stagnation and, eventually, death); being best for all types. Beyond this larger dragons (companies) notice you. (2) competition for the better-funded commonality, however, accuracy varied If you want a commercial user base, you sources can lead to FUD (fear, uncer- widely, with length ranked second for have to manage process, watch corpo- tainty, and doubt) about the project. quantitative data, eighth for ordinal, and rate culture, provide value to customers, ninth for categorical data types. 
Eric then led listeners down the path he watch bottom lines, and develop a thick took to turn Sendmail into a business. skin. These guidelines about visual perception Eric envisioned Sendmail Inc. as a small- can be applied to information visualiza- Starting a company is easy but perpetu- ish and close-knit company but soon tion, which, unlike scientific visualiza- ating it is extremely difficult! realized that the family atmosphere he tion, focuses on abstract data and choice

desired could not last as the company INFORMATION VISUALIZATION FOR of specialization. Several techniques are grew. He observed that maybe only the SYSTEMS PEOPLE used to clarify abstract data, including first 10 people will buy into your vision, Tamara Munzner, University of British multi-part glyphs, where changes in after which each individual is primarily Columbia individual parts are incorporated into an interested in something else (e.g., power, easier to understand gestalt, interactiv- Summarized by J.D. Welch wealth, or status). He warns that there ity, motion, and animation. In large data Munzner presented interesting evidence will be people working for you that you sets, techniques like “focus + context,” that information visualization through do not like. Eric particularly did not care where a zoomed portion of the graph is interactive representations of data can for sales people, but he emphasized how shown along with a thumbnail view of help people to perform tasks more effec- important sales, marketing, finance, and the entire graph, are used to minimize tively and reduce the load on working legal people are to companies. user disorientation. memory. The nature of companies in infancy can Future problems include dealing with A popular technique in visualizing data be summed up as: you need money, and huge databases, such as the Human is abstracting it through node-link rep- you need investors in order to get Genome, reckoning dynamic data, like resentation – for example, a list of cross- money. Investors want a return on the changing structure of the Web or references in a book. Representing the investment, so money management real-time network monitoring, and relationships between references in a becomes critical. Companies therefore transforming “pixel bound” displays into graphical (node-link) manner “offloads must optimize money functions. Eric’s large “digital wallpaper”-type systems. 
cognition to the human perceptual sys- experience with investors were lessons tem.”These graphs can be produced all in themselves. More than giving FIXING NETWORK SECURITY BY HACKING manually, but automated drawing allows money, good investors can provide con- THE BUSINESS CLIMATE for increased complexity and quicker nections and insight, so you don’t want Bruce Schneier, Counterpane Internet production times. investors who don’t believe in you. Security The way in which people interpret visual Summarized by Florian Buchholz A company must have bug history, data is important in designing effective Bruce Schneier identified security as one change management, and people who visualization systems. Attention to pre- of the fundamental building blocks of understand the product and have a sense attentive visual cues are key to success. the Internet. A certain degree of security of the target market. Eric now sees the Picking out a red dot among a field of is needed for all things on the Net, and importance of marketing target identically shaped blue dots is signifi- the limits of security will become limits research. A company can never forget cantly faster than the same data repre- of the Net itself. However, companies are

66 Vol. 27, No. 5 ;login: hesitant to employ security measures. As a second step, Schneier identified the Risks won’t go away; the best we can do Science is doing well, but one cannot see need to allow partners to transfer liabil- is manage them. A company able to EPORTS the benefits in the real world. An ity. Insurance would spread liability risk manage risk better will be more prof- R increasing number of users means more among a group and would be a CEO’s itable, and we need to give CEOs the problems affecting more people. Old primary risk analysis tool. There is a necessary risk-management tools.

problems such as buffer overflows need for standardized risk models and ONFERENCE

There were numerous questions. Listed C haven’t gone away, and new problems protection profiles and for more securi- below are the more complicated ones in show up. Furthermore, the amount of ties as opposed to empty press releases. a paraphrased Q&A format: expertise needed to launch attacks is Schneier predicts that insurance will cre- decreasing. ate a marketplace for security where cus- Q: Will liability be effective; will insur- tomers and vendors will have the ability ance companies be willing to accept the Schneier argues that complexity is the to accept liabilities from each other. risks? A: The government might have to enemy of security and that while secu- Computer-security insurance should step in; it needs to be seen how it plays rity is getting better, the growth in com- soon be as common as household or fire out. plexity is outpacing it. As security is insurance, and from that development, fundamentally a people problem, one Q: Is there an analogy to the real world insurance will become the driving factor shouldn’t focus on technologies but, in the fact that in a lawless society only of the security business. rather, should look at businesses, busi- the rich can afford security? Security ness motivations, and business costs. The next step is to provide mechanisms solutions differentiate between classes. Traditionally, one can distinguish to reduce risks, both before and after A: Schneier disagreed with the state- between two security models: threat software is released; techniques and ment, giving an analogy to front-door avoidance, where security is absolute, processes to improve software quality, as locks, but conceded that there might be and risk management, where security is well as an evolution of security manage- special cases. relative and one has to mitigate the risk ment, are therefore needed. Currently, Q: Doesn’t homogeneity hurt security? with technologies and procedures. 
In the security products try to rebuild the walls A: Homogeneity is oversold. Diverse latter model, one has to find a point – such as physical badges and entry types can be more survivable, but given with reasonable security at an acceptable doorways – that were lost when getting the limited number of options, the dif- cost. connected. Schneier believes this ference will be negligible. “fortress metaphor” is bad; one should After a brief discussion on how security think of the problem more in terms of a Q: Regulations in airbag protection have is handled in businesses today, in which city. Since most businesses cannot afford led to deaths in some cases. How can we he concluded that businesses “talk big proper security, outsourcing is the only keep the pendulum from swinging the about it, but do as little as possible,” way to make security scalable. With out- other way? A: Lobbying will not be pre- Schneier identified four necessary steps sourcing, the concenpt of best practices vented. An imperfect solution is proba- to fix the problems. becomes important and insurance com- ble; there might be reversals of First, enforce liabilities. Today, no real panies can be tied to them; outsourcing requirements such as the airbag one. consequences have to be feared from will level the playing field. Q: What about personal liability? A: This security incidents. Holding people As a final step, Schneier predicts that will be analogous to auto insurance: lia- accountable will increase the trans- rational prosecution and education will bility comes with computer/Net access. parency of security and give an incentive lead to deterrence. He claims that people to make processes public. As possible Q: If the rule of law is to become reality, feel safe because they live in a lawful options to achieve this, he listed indus- there must be a law enforcement func- society, whereas the Internet is classified try-defined standards, federal regula- tion that applies to a physical space. 
You as lawless, very much like the American tion, and lawsuits. The problems with cannot do that with any existing govern- West in the 1800s or a society ruled by enforcement, however, lie in difficulties ment agency for the whole world. warlords. This is because it is difficult to associated with the international nature Should an organization like the UN prove who an attacker is; prosecution is of the problem and the fact that com- assume that role? A: Schneier was not hampered by complicated evidence plexity makes assigning liabilities diffi- convinced global enforcement is possi- gathering and irrational prosecution. cult. Furthermore, fear of liability could ble. Schneier believes, however, that educa- have a chilling effect on tion will play a major role in turning the Q: What advice should we take away development and could stifle new com- Internet into a lawful society. Specifi- from this talk? A: Liability is coming. panies. cally, he pointed out that we need laws Since the network is important to our that can be explained. infrastructure, eventually the problems

will be solved in a legal environment. You need to start thinking about how to solve the problems and how the solutions will affect us.

The entire speech (including slides) is available at http://www.counterpane.com/presentation4.pdf.

LIFE IN AN OPEN SOURCE STARTUP
Daryll Strauss, Consultant
Summarized by Teri Lampoudi

This talk was packed with morsels of insight and tidbits of information about what life in an open source startup is like (though “was like” might be more appropriate); what the issues are in starting, maintaining, and commercializing an open source project; and the way hardware vendors treat open source developers. Strauss began by tracing the timeline of his involvement with developing 3-D graphics support for Linux: from the 3dfx voodoo1 driver for Linux in 1997, to the establishment of Precision Insight in 1998, to its buyout by VA Linux, to the dismantling of his group by VA, to what the future may hold.

Obviously, there are benefits from doing open source development, both for the developer and for the end user. An important point was that open source develops entire technologies, not just products. A common misapprehension concerns the inherent difficulty of managing a project that accepts code from a large number of developers, some volunteer and some paid. The notion that code can just be thrown over the wall into the world is a Big Lie.

How does one start and keep afloat an open source development company? In the case of Precision Insight, the subject matter – developing graphics and OpenGL for Linux – required a considerable amount of expertise: intimate knowledge of the hardware, the libraries, X and the Linux kernel, as well as the end applications. Expertise is marketable. What helps even further is having a visible virtual team of experts, people who have an established track record of contributions to open source in the particular area of expertise. Support from the hardware and software vendors came next. While everyone was willing to contract PI to write the portions of code that were specific to their hardware, no one wanted to pay the bill for developing the underlying foundation that was necessary for the drivers to be useful. In the PI model, the infrastructure cost was shared by several customers at once, and the technology kept moving. Once a driver is written for a vendor, the vendor is perfectly capable of figuring out how to write the next driver necessary, thereby obviating the need to contract the job. This and questions of protecting intellectual property – hardware design in this case – are deeply problematic with the open source mode of development. This is not to say that there is no way to get over them, but that they are likely to arise more often than not.

Management issues seem equally important. In the PI setting, contracts were flexible – both a good and a bad thing – and developers overworked. Further, development “in a fishbowl,” that is, under public scrutiny, is not easy. Strauss stressed the value of good communication and the use of various out-of-sync communication methods, like IRC and mailing lists.

Finally, Strauss discussed what portions of software can be free and what can be proprietary. His suggestion was that horizontal markets want to be free, whereas vertically, one can develop proprietary solutions. The talk closed with a small stroll through the problems that project politics brings up, from things like merging branches to mailing list and source tree readability.

GENERAL TRACK SESSIONS

FILE SYSTEMS
Summarized by Haijin Yan

STRUCTURE AND PERFORMANCE OF THE DIRECT ACCESS FILE SYSTEM
Kostas Magoutis, Salimah Addetia, Alexandra Fedorova, and Margo I. Seltzer, Harvard University; Jeffrey Chase, Andrew Gallatin, Richard Kisley, and Rajiv Wickremesinghe, Duke University; and Eran Gabber, Lucent Technologies

This paper won the Best Paper award.

The Direct Access File System (DAFS) is a new standard for network-attached storage over direct-access transport networks. DAFS takes advantage of user-level network interface standards that enable client-side functionality in which remote data access resides in a library rather than in the kernel. This reduces the overhead of memory copy for data movement and protocol overhead. Remote Direct Memory Access (RDMA) is a direct-access transport mechanism that allows the network adapter to reduce copy overhead by accessing application buffers directly.

This paper explores the fundamental structural and performance characteristics of network file access using a user-level file system structure on a direct-access transport network with RDMA. It describes DAFS-based client and server reference implementations for FreeBSD and reports experimental results, comparing DAFS to a zero-copy NFS implementation. It illustrates the benefits and trade-offs of these techniques to provide a basis for informed choices about deployment of DAFS-based systems and similar extensions to other network file protocols such as NFS. Experiments show that DAFS gives applications direct control over an I/O system and increases the client CPU’s usage while the client is doing I/O. Future work includes how to address longstanding problems related to the integration of the application and file system for high-performance applications.

CONQUEST: BETTER PERFORMANCE THROUGH A DISK/PERSISTENT-RAM HYBRID FILE SYSTEM
An-I A. Wang, Peter Reiher, and Gerald J. Popek, UCLA; Geoffrey M. Kuenning, Harvey Mudd College

Motivated by the declining cost of persistent RAM, the authors propose the Conquest file system, which stores all small files, metadata, executables, and shared libraries in persistent RAM; disks hold only the data content of remaining large files. Compared to alternatives such as caching and RAM file systems, Conquest has the advantages of efficiency, consistency, and reliability at a reduced cost. Using popular benchmarks, experiments show that Conquest incurs little overhead while achieving faster performance. Future work includes designing mechanisms for adjusting the file-size threshold dynamically and finding a better disk layout for large data blocks.

EXPLOITING GRAY-BOX KNOWLEDGE OF BUFFER-CACHE MANAGEMENT
Nathan C. Burnett, John Bent, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison

Knowing what algorithm is used to manage the buffer cache is very important for improving application performance. However, there is currently no interface for finding this algorithm. This paper introduces Dust, a simple fingerprinting tool that is able to identify the buffer-cache replacement policy. Dust automatically identifies the cache size and replacement policy based on the configuring attributes of access orders, recency, frequency, and long-term history. Through simulation, Dust was able to distinguish between a variety of replacement policies found in the literature: FIFO, LRU, LFU, Clock, Segmented FIFO, 2Q, and LRU-K. Further experiments fingerprinting real operating systems such as NetBSD, Linux, and Solaris show that Dust is able to identify the hidden cache replacement algorithm.

By knowing the underlying cache replacement policy, a cache-aware Web server can reschedule the requests on a cached request-first policy to obtain performance improvement. Experiments show that cache-aware scheduling improves average response time and system throughput.

OPERATING SYSTEMS (AND DANCING BEARS)
Summarized by Matt Butner

THE JX OPERATING SYSTEM
Michael Golm, Meik Felser, Christian Wawersich, and Jürgen Kleinöder, University of Erlangen-Nürnberg

The talk opened by emphasizing the need for, and the practicality of, a functional Java OS. Such an implementation in the JX OS attempts to mimic the recent trend toward using highly abstracted languages in application development in order to create an OS as functional and powerful as one written in a lower-level language.

JX is a micro-kernel solution that uses separate JVMs for each entity of the kernel, and, in some cases, for each application. The separated domains do not share objects and have no thread migration, and each domain implements its own code. The interesting part of the presentation was the discussion of system-level programming with Java; key areas discussed were memory management and interrupt handlers. The authors concluded by noting that their type-safe and modular system resulted in a robust system with great configuration flexibility and acceptable performance.

DESIGN EVOLUTION OF THE EROS SINGLE-LEVEL STORE
Jonathan S. Shapiro, Johns Hopkins University; Jonathan Adams, University of Pennsylvania

This presentation outlined current characteristics of file systems, and some of their least desirable characteristics. It was based on the revival of the EROS single-level-store approach for current attempts to capitalize on system consistency and efficiency. Their solution did not relate directly to EROS but, rather, to the design and use of the constructs and notions on which EROS is based. By extending the mapping of the memory system to include the disk, systems are further able to ensure global persistence without regard for a disk structure. In explaining the need for absolute system consistency and an environment which by definition is not allowed to crash, the goal of having such an exhaustive design becomes clear. The cool part of this work is the availability of its design and architecture to the public.

THINK: A SOFTWARE FRAMEWORK FOR COMPONENT-BASED OPERATING SYSTEM KERNELS
Jean-Philippe Fassino, France Télécom R&D; Jean-Bernard Stefani, INRIA; Julia Lawall, DIKU; Gilles Muller, INRIA

This presentation discussed the need for component-based operating systems and respective structures to ensure flexibility for arbitrary-sized systems. Think provides a binding model that maps uniform components for OS developers and architects to follow, ensuring consistent implementations for an arbitrary system size. However, the goal is not to force developers into a predefined kernel but to promote the use of certain components in varied ways.

Think’s concentration is primarily on embedded systems, where short development time is necessary but is constrained by rigorous needs and limited resources. The need to build flexible systems in such an environment can be costly and implementation-specific, but Think creates an environment supported by the ability to dynamically load type-safe components. This allows for more flexible systems that retain functionality because of Think’s ability to bind fine-grained components. Benchmarks revealed that dedicated micro-kernels can show performance improvements in comparison to monolithic kernels. Notable future work includes building components for low-end appliances and the development of a real-time OS component library.

BUILDING SERVICES
Summarized by Pradipta De

NINJA: A FRAMEWORK FOR NETWORK SERVICES
J. Robert von Behren, Eric A. Brewer, Nikita Borisov, Michael Chen, Matt Welsh, Josh MacDonald, Jeremy Lau, and David Culler, University of California, Berkeley

Robert von Behren presented Ninja, an ongoing project that aims to provide a framework for building robust and scalable Internet-based network services, like Web-hosting, instant-messaging, email, and file-sharing applications. Robert drew attention to the difficulties of writing cluster applications. One has to take care of data consistency and issues of concurrency, and for robust applications there are problems related to fault tolerance. Ninja works as a wrapper to relieve the user of these problems. The goal of the project is building network services that are scalable and highly available; maintaining persistent data; providing graceful degradation; and supporting online evolution.

The use of clusters distinguishes this setup from generic distributed systems in terms of reliability and security, as well as providing a partition-free network. The next important feature of this project is the new programming model, which is more restrictive than a general programming model but still expressive enough to write most of the applications. This model, described as “single program multiple connection,” uses intertask parallelism instead of multithreaded concurrency and is characterized by all nodes running the same program, with connections delegated to the nodes by a centralized connection manager (CM). A CM takes care of hiding the details of mapping an external connection to an internal node. Ninja also uses a design called “staged event-driven architecture,” where each service is broken down into a set of stages connected by event queues. This architecture is suitable for a modular design and helps in graceful degradation by adaptive load shedding from the event queues.

The presentation concluded with examples of implementation and evaluation of a Web server and an email system called NinjaMail, to show the ease of authoring and the efficacy of using the Ninja framework for service development.

USING COHORT-SCHEDULING TO ENHANCE SERVER PERFORMANCE
James R. Larus and Michael Parkes, Microsoft Research

James Larus presented a new scheduling policy for increasing the throughput of server applications. Cohort scheduling batches execution of similar operations arising in different server requests. The usual programming paradigm in handling server requests is to spawn multiple concurrent threads and switch from one thread to another whenever a thread gets blocked for I/O. Since the threads in a server mostly execute unrelated pieces of code, the locality of reference is reduced, which in turn reduces the effectiveness of different caching mechanisms. One way to solve this problem is to throw in more hardware. But Larus presented a complementary view where the program behavior is investigated and used to improve the performance.

The problem in this scheme is to identify pieces of code that can be batched together for processing. One simple way is to look at the next program counter values and use them to club threads together. However, this talk presented a new programming abstraction, “staged computation,” which replaces the thread model with “stages.” A stage is an abstraction for operations with similar behavior and locality. The StagedServer library can be used for programming in this model. It is a collection of C++ classes that implement staged computation and cohort scheduling on either a uniprocessor or multiprocessor. It can be used to define stages.

The presentation showed the experimental evaluation of cohort scheduling over the thread-based model by implementing two servers: a Web server, which is mainly I/O bound, and a publish-subscribe server, which is mainly compute bound. The SURGE benchmark was used for the first experiment and the Fabret workload for the second. The results showed that the cohort-scheduling-based implementation gave better throughput than the thread-based implementation at high loads.

NETWORK PERFORMANCE
Summarized by Xiaobo Fan

ETE: PASSIVE END-TO-END INTERNET SERVICE PERFORMANCE MONITORING
Yun Fu, Amin Vahdat, Duke University; Ludmila Cherkasova, Wenting Tang, HP Labs

This paper won the Best Student Paper award.

Ludmila Cherkasova began by listing several questions most Web service providers want answered in order to improve service quality. She reviewed the difficulties of making accurate and efficient end-to-end Web service measurement and the shortcomings of currently available approaches. They propose a passive trace-based architecture, called EtE, to monitor Web server performance on behalf of end users.

The first step is to collect network packets passively. The second module reconstructs all TCP connections and extracts HTTP transactions. To reconstruct Web page accesses, they first build a knowledge base indexed by client IP and URL and then group objects to the related Web pages they are embedded in. Statistical analysis is used to handle non-matched objects. EtE Monitor can generate three groups of metrics to measure Web service performance: response time, Web caching, and page abortion.
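The page-reconstruction step of EtE described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the record layout, the use of the Referer field and the content type to link objects to pages, and the names Transaction, PageAccess, and group_into_pages are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Transaction:
    """One HTTP transaction taken from a reconstructed TCP connection
    (hypothetical layout, chosen for this sketch)."""
    client_ip: str
    url: str
    referer: Optional[str]
    content_type: str
    start: float  # seconds, when the request was first seen
    end: float    # seconds, when the last response byte was seen


@dataclass
class PageAccess:
    client_ip: str
    url: str
    objects: list = field(default_factory=list)

    def response_time(self) -> float:
        # From the first request to the last byte of the last object.
        return max(t.end for t in self.objects) - min(t.start for t in self.objects)


def group_into_pages(transactions):
    """Group transactions into page accesses; return (pages, unmatched)."""
    knowledge_base = {}  # (client_ip, page_url) -> most recent PageAccess
    pages, unmatched = [], []
    for t in sorted(transactions, key=lambda t: t.start):
        if t.content_type.startswith("text/html"):
            # An HTML response opens a new page access for this client.
            page = PageAccess(t.client_ip, t.url, [t])
            knowledge_base[(t.client_ip, t.url)] = page
            pages.append(page)
        else:
            # Assumption: embedded objects name their page in the Referer.
            page = knowledge_base.get((t.client_ip, t.referer))
            if page is not None:
                page.objects.append(t)
            else:
                unmatched.append(t)  # left for separate statistical handling
    return pages, unmatched
```

Keying the knowledge base on the (client IP, URL) pair keeps two clients that fetch the same page at the same time from being merged into one page access. Objects that cannot be matched are only collected here; the statistical analysis mentioned in the summary is not modeled.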

To demonstrate the benefits of EtE Monitor, Cherkasova talked about two case studies and, based on the calculated metrics, gave some insightful explanations about what’s happening behind the variations of Web performance. Through validation they claim their approach provides a very close approximation to the real scenario.

THE PERFORMANCE OF REMOTE DISPLAY MECHANISMS FOR THIN-CLIENT COMPUTING
S. Jae Yang, Jason Nieh, Matt Selsky, Nikhil Tiwari, Columbia University

Noting the trend toward thin-client computing, the authors compared different techniques and design choices in measuring the performance of six popular thin-client platforms – Citrix MetaFrame, Microsoft Terminal Services, Sun Ray, Tarantella, VNC, and X. After pointing out several challenges in benchmarking thin clients, Yang proposed slow-motion benchmarking to achieve non-invasive packet monitoring and consistent visual quality. Basically, they insert delays between separate visual events in the benchmark applications on the server side so that the client’s display update can catch up with the server’s processing speed. The experiments are conducted on an emulated network over a range of network bandwidths.

Their results show that thin clients can provide good performance for Web applications in LAN environments, but only some platforms performed well on the video benchmark. Pixel-based encoding may achieve better performance and bandwidth efficiency than high-level graphics. Display caching and compression should be used with care.

A MECHANISM FOR TCP-FRIENDLY TRANSPORT-LEVEL PROTOCOL COORDINATION
David E. Ott and Ketan Mayer-Patel, University of North Carolina, Chapel Hill

David Ott’s talk presented a revised transport-level protocol optimized for cluster-to-cluster (C-to-C) applications. A C-to-C application class is identified as one set of processes communicating to another set of processes across a common Internet path. The fundamental problem with C-to-C applications is how to coordinate all C-to-C communication flows so that they share a consistent view of the common C-to-C network, adapt to changing network conditions, and cooperate to meet specific requirements. Aggregation points (AP) are placed at the first and last hop of the common data path to probe network conditions (latency, bandwidth, loss rate, etc.). To carry and transfer this information, a new protocol – Coordination Protocol (CP) – is inserted between the network layer (IP) and the transport layer (TCP, UDP, etc.). Ott illustrated how the CP header is updated and used when packets originate from the source and traverse the local and remote AP to arrive at their destination, and how an AP maintains a per-cluster state table and detects network conditions. Through simulation results, this coordination mechanism appears effective in sharing common network resources among C-to-C communication flows.

STORAGE SYSTEMS
Summarized by Praveen Yalagandula

MY CACHE OR YOURS? MAKING STORAGE MORE EXCLUSIVE
Theodore M. Wong, Carnegie Mellon University; John Wilkes, HP Labs

Theodore Wong explained the inefficiency of current “inclusive” caching schemes in storage area networks – when a client accesses a block, the block is read from the disk and is cached at both the disk array cache and at the client’s cache. Then he presented the concept of “exclusive” caching, where a block accessed by a client is only cached in that client’s cache. On eviction from the client’s cache, they have come up with a DEMOTE operation to move the data block to the tail of the array cache.

The new exclusive caching schemes are evaluated on both single-client and multiple-client systems. The single-client results are presented for two different types of synthetic workloads: Random (transaction-processing type workloads) and Zipf (typical Web workloads). The exclusive policy was quite effective in achieving higher hit rates and lower read latencies. They also showed a 2.2 times hit-rate improvement over inclusive techniques in the case of a real-life workload, httpd, with a single-client setting.

The DEMOTE scheme also performed well in the case of multiple-client systems when the data accessed by clients is disjoint. In the case of shared data workloads, this scheme performed worse than the inclusive schemes. Wong then presented an adaptive exclusive caching scheme – a block accessed by a client is placed at the tail in the client’s cache and also in the array cache at an appropriate place determined by the popularity of the block. The popularity of the block is measured by maintaining a ghost cache to accumulate the number of times each block is accessed. This new adaptive scheme has achieved a maximum of 52% speedup in the mean latency in the experiments with real-life workloads.

For more information, visit http://www.cs.cmu.edu/~tmwong/research and http://www.hpl.hp.com/research/itc/csl/ssp.

BRIDGING THE INFORMATION GAP IN STORAGE PROTOCOL STACKS
Timothy E. Denehy, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison

Currently there is a huge information gap between storage systems and file systems. The interface exposed by storage systems to file systems, based on blocks and providing only read/write interfaces, is very narrow. This leads to poor performance as a whole because of duplicated functionality in both systems and

reduced functionality resulting from storage systems’ lack of file information.

The speaker presented two enhancements: ExRAID, an exposed RAID, and I.LFS, Informed Log-Structured File System. ExRAID is an enhancement to the block-based RAID storage system that exposes the following three types of information to the file system: (1) regions – contiguous portions of the address space comprising one or multiple disks; (2) performance information about queue lengths and throughput of the regions, revealing disk heterogeneity to the file systems; and (3) failure information – dynamically updated information conveying the number of tolerable failures in each region.

I.LFS allows online incremental expansion of the storage space, performs dynamic parallelism using ExRAID’s performance information to perform well on heterogeneous storage systems, and provides a range of different mechanisms with different granularities for redundancy of files using ExRAID’s region failure characteristics. These new features are added to LFS with only a 19% increase in the code size.

More information: http://www.cs.wisc.edu/wind.

MAXIMIZING THROUGHPUT IN REPLICATED DISK STRIPING OF VARIABLE BIT-RATE STREAMS
Stergios V. Anastasiadis, Duke University; Kenneth C. Sevcik and Michael Stumm, University of Toronto

There is an increasing demand for the continuous real-time streaming of media files. Disk striping is a common technique used for supporting many connections on the media servers. But even with 1.2 million hours of mean time between failures, there will be roughly one disk failure per week with 7000 disks (7000 disks × 168 hours per week is about 1.18 million disk-hours). Fault tolerance can be achieved by using data replication and reserving extra bandwidth during normal operation. This work focused on supporting the most common variable bit-rate stream formats (e.g., MPEG).

The authors used the prototype media server EXEDRA to implement and evaluate different replication and bandwidth reservation schemes. This system supports a variable bit-rate scheme, does stride-based disk allocation, and is capable of supporting variable-grain disk striping. Two replication policies are presented: deterministic – data blocks are replicated in a round-robin fashion on the secondary replicas; random – data blocks are replicated on randomly chosen secondary replicas. Two bandwidth reservation techniques are also presented: mirroring reservation – disk bandwidth is reserved for both primary and replicas of the media file during the playback; and minimum reservation – a more efficient scheme in which bandwidth is reserved only for the sum of primary data access time and the maximum of the backup data access times required in each round.

Experimental results showed that deterministic replica placement is better than random placement for small disks, minimum disk bandwidth reservation is twice as good as mirroring in the throughput achieved, and fault tolerance can be achieved with a minimal impact on the throughput.

TOOLS
Summarized by Li Xiao

SIMPLE AND GENERAL STATISTICAL PROFILING WITH PCT
Charles Blake and Steven Bauer, MIT

This talk introduced a Profile Collection Toolkit (PCT) – a sampling-based CPU profiling facility. A novel aspect of PCT is that it allows sampling of semantically rich data such as function call stacks, function parameters, local or global variables, CPU registers, or other execution contexts. The design objectives of PCT were driven by user needs and the inadequacies or inaccessibility of prior systems.

The rich data collection capability is achieved via a debugger-controller program, dbctl. The talk then introduced the profiling collector and data collection activation. PCT file format options provide flexibility and various ways to store exact states. PCT also has analysis options; two examples were given – “Simple” and “Mixed User and Linux Code.”

The talk then went into how general sampling helps in the following areas: a debugger-controller state machine; example script fragments; a simple example program; and general value reports in terms of call-path histograms, numeric value histograms, and data generality. Related work, such as gprof and Expect, was then summarized. The contribution of this work is to provide a portable general value sampling tool. The limitations are that PCT offers little support for stripped binaries, and counts can be inaccurate because of its statistical nature. Given this limitation, compiling with -g is preferred over running strip.

Possible future directions included supporting more debuggers, such as dbx and xdb; script generators for other kinds of traces; more canned reports for general values; and a libgdb-based library sampler. PCT is available at http://pdos.lcs.mit.edu/pct. PCT runs on almost any UNIX-like system.

ENGINEERING A DIFFERENCING AND COMPRESSION DATA FORMAT
David G. Korn and Kiem Phong Vo, AT&T Laboratories – Research

This talk began with an equation: “Differencing + Compression = Delta Compression.” Compression removes information redundancy in a data set. Examples are gzip, bzip, compress, and pack. Differencing encodes differences between two data sets. Examples are diff -e, fdelta, bdiff, and xdelta. Delta compression compresses a data set given another, combining differencing and compression, and reduces to pure compression when there’s no commonality.

After presenting an overall scheme for a delta compressor and showing delta compression performance, this talk discussed the encoding format of a newly designed Vcdiff for delta compression. A Vcdiff instruction code table consists of 256 entries, each coding up to a pair of instructions, and recodes 1-byte indices and any additional data.

The talk then showed Vcdiff’s performance with Web data collected from CNN, compared with W3C standard Gdiff, Gdiff+gzip, and diff+gzip, where Gdiff was computed using Vcdiff delta instructions. The results of two experiments, “First” and “Successive,” were presented. In “First,” each file is compressed against the first file collected; in “Successive,” each file is compressed against the one in the previous hour. The diff+gzip combination did not work well because diff was line-oriented. Vcdiff performed favorably compared with other formats.

Vcdiff is part of the Vcodex package; Vcodex is a platform for all common data transformations: delta compression, plain compression, encryption, and transcoding (e.g., uuencode, base64). It is structured in three layers for maximum usability. The base library uses Disciplines and Methods interfaces.

The code for Vcdiff can be found at http://www.research.att.com/sw/tools. The authors are moving Vcdiff toward an IETF standard as a comprehensive platform for transforming data. Please refer to http://www.ietf.org/internet-draft/draft-korn-vcdiff.06.txt.

WHERE IN THE NET . . .
Summarized by Amit Purohit

A PRECISE AND EFFICIENT EVALUATION OF THE PROXIMITY BETWEEN WEB CLIENTS AND THEIR LOCAL DNS SERVERS
Zhuoqing Morley Mao, University of California, Berkeley; Charles D. Cranor, Fred Douglis, Michael Rabinovich, Oliver Spatscheck, and Jia Wang, AT&T Labs – Research

Content Distribution Networks (CDNs) attempt to improve Web performance by delivering Web content to end users from servers located at the edge of the network. When a Web client requests content, the CDN dynamically chooses a server to route the request to, usually the one that is nearest the client. As CDNs have access only to the IP address of the local DNS server (LDNS) of the client, the CDN’s authoritative DNS server maps the client’s LDNS to a geographic region within a particular network and combines that with network and server load information to perform CDN server selection.

This method has two limitations. First, it is based on the implicit assumption that clients are close to their LDNS. This may not always be valid. Second, a single request from an LDNS can represent a differing number of Web clients – called the hidden load factor. Knowledge of the hidden load factor can be used to achieve better load distribution.

The work examines the extent of the first limitation and its impact on CDN server selection. To determine the associations between clients and their LDNS, a simple, non-intrusive, and efficient mapping technique was developed. The data collected was used to study the impact of proximity on DNS-based server selection using four different proximity metrics: (1) autonomous system (AS) clustering – observing whether a client is in the same AS as its LDNS – concluded that LDNS is very good for coarse-grained server selection, as 64% of the associations belong to the same AS; (2) network clustering – observing whether a client is in the same network-aware cluster (NAC) – implied that DNS is less useful for finer-grained server selection, since only 16% of the clients and their LDNS are in the same NAC; (3) traceroute divergence – the length of the divergent paths to the client and its LDNS from a probe point using traceroute – implies that most clients are topologically close to their LDNS as viewed from a randomly chosen probe site; (4) round-trip time (RTT) correlation (some CDNs select servers based on RTT between the CDN server and the client’s LDNS) examines the correlation between the message RTTs from a probe point to the client and its local DNS; results imply this is a reasonable metric to use to avoid really distant servers.

A study of the impact that client-LDNS associations have on DNS-based server selection concludes that knowing the client’s IP address would allow more accurate server selection in a large number of cases. The optimality of the server selection also depends on the server density, placement, and selection algorithms.

Further information can be found at http://www.eecs.berkeley.edu/~zmao/myresearch.html or by contacting [email protected].

GEOGRAPHIC PROPERTIES OF INTERNET ROUTING
Lakshminarayanan Subramanian, Venkata N. Padmanabhan, Microsoft Research; Randy H. Katz, University of California at Berkeley

Geographic information can provide insights into the structure and functioning of the Internet, including interactions between different autonomous systems, by analyzing certain properties of Internet routing. It can be used to measure and quantify certain routing properties such as circuitous routing, hot-potato routing, and geographic fault tolerance.

Traceroute has been used to gather the required data, and the Geotrack tool has been used to determine the location of the nodes along each network path. This enables the computation of the “linearized distance,” which is the sum of the geographic distances between successive pairs of routers along the path.

In order to measure the circuitousness of a path, a metric “distance ratio” has been defined as the ratio of the linearized distance of a path to the geographic distance between the source and destination of the path. From the data, it has been observed that the circuitousness of a route depends on both geographic and network location of the end hosts. A large value of the distance ratio enables us to flag paths that are highly

October 2002 ;login: USENIX 2002 73

circuitous, possibly (though not necessarily) because of routing anomalies. It is also shown that the minimum delay between end hosts is strongly correlated with the linearized distance of the path.

Geographic information can be used to study various aspects of wide-area Internet paths that traverse multiple ISPs. It was found that end-to-end Internet paths tend to be more circuitous than intra-ISP paths, the cause for this being the peering relationships between ISPs. Also, paths traversing substantial distances within two or more ISPs tend to be more circuitous than paths largely traversing only a single ISP. Another finding is that ISPs generally employ hot-potato routing.

Geographic information can also be used to capture the fact that two seemingly unrelated routers can be susceptible to correlated failures. By using the geographic information of routers we can construct a geographic topology of an ISP. Using this we can find the tolerance of an ISP’s network to the total network failure in a geographic region.

For further information, contact [email protected].

PROVIDING PROCESS ORIGIN INFORMATION TO AID IN NETWORK TRACEBACK
Florian P. Buchholz, Purdue University; Clay Shields, Georgetown University

Network traceback is currently limited because host audit systems do not maintain enough information to match incoming network traffic to outgoing network traffic. The talk presented an alternative, assigning origin information to every process and logging it during interactive login creation.

The current implementation concentrates mainly on interactive sessions in which an event is logged when a new connection is established using SSH or Telnet. The method is effective and could successfully determine stepping stones and reliably detect the source of a DDoS attack. The speaker then talked about the related work done in this area, which led to discussion of the latest development in the area of “host causality.”

The speaker ended with the limitations and future directions of the framework. This framework could be extended and could find many applications in the area of security. Current implementation doesn’t take care of startup scripts and cron jobs, but incorporating the origin information in FS could solve this problem. In the current implementation, logging is just implemented as a proof of concept. It could be made safe in many ways, and this could be another important aspect of future work.

PROGRAMMING
Summarized by Amit Purohit

CYCLONE: A SAFE DIALECT OF C
Trevor Jim, AT&T Labs – Research; Greg Morrisett, Dan Grossman, Michael Hicks, James Cheney, Yanling Wang, Cornell University

Cyclone is designed to provide protection against attacks such as buffer overflows, format string attacks, and memory management errors. The current structure of C allows programmers to write vulnerable programs. Cyclone extends C so that it has the safety guarantee of Java while keeping C syntax, types, and semantics untouched.

The Cyclone compiler performs static analysis of the source code and inserts runtime checks into the compiled output at places where the analysis cannot determine that the code execution will not violate safety constraints. Cyclone imposes many restrictions to preserve safety, such as NULL checks. These checks do not exist in normal C.

The speaker then talked about some sample code written in Cyclone and how it tackles safety problems. Porting an existing C application to Cyclone is pretty easy, with fewer than 10% change required. The current implementation concentrates more on safety than on performance – hence, the performance penalty is substantial. Cyclone was able to find many lingering bugs. The final step of the project will be to write a compiler to convert a normal C program to an equivalent Cyclone program.

COOPERATIVE TASK MANAGEMENT WITHOUT MANUAL STACK MANAGEMENT
Atul Adya, Jon Howell, Marvin Theimer, William J. Bolosky, John R. Douceur, Microsoft Research

The speaker described the definitions and motivations behind the distinct concepts of task management, stack management, I/O management, conflict management, and data partitioning. Conventional concurrent programming uses preemptive task management and exploits the automatic stack management of a standard language. In the second approach, cooperative tasks are organized as event handlers that yield control by returning control to the event scheduler, manually unrolling their stacks. In this project they have adopted a hybrid approach that makes it possible for both stack management styles to coexist. Thus, programmers can code assuming one or the other of these stack management styles will operate, depending upon the application. The speaker also gave a detailed example of how to use their mechanism to switch between the two styles.

The talk then continued with the implementation. They were able to preempt many subtle concurrency problems by using cooperative task management. Paying a cost up-front to reduce a subtle race condition proved to be a good investment.

Though the choice of task management is fundamental, the choice of stack management can be left to individual taste. This project enables use of any type of stack management in conjunction with cooperative task management.
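The two stack management styles from the cooperative task management talk can be illustrated with a minimal Python sketch. This is not the authors' mechanism: Scheduler, fake_read, fetch_pair_events, fetch_pair_task, and spawn are hypothetical names, and Python generators merely stand in for automatic stack management. The first task is written as chained event handlers, the second as straight-line code, and a small adapter lets both run on the same cooperative scheduler.

```python
from collections import deque


class Scheduler:
    """Tiny cooperative scheduler: runs one handler at a time, never preempts."""

    def __init__(self):
        self.ready = deque()

    def post(self, fn, *args):
        self.ready.append((fn, args))

    def run(self):
        while self.ready:
            fn, args = self.ready.popleft()
            fn(*args)


def fake_read(sched, key, callback):
    """Hypothetical non-blocking I/O: the result arrives in a later event."""
    sched.post(callback, "<%s>" % key)


# Manual stack management: the task is split into event handlers, and the
# state that must survive each I/O is carried in explicit closures.
def fetch_pair_events(sched, a, b, done):
    def got_a(val_a):
        def got_b(val_b):
            done(val_a + val_b)
        fake_read(sched, b, got_b)
    fake_read(sched, a, got_a)


# Automatic stack management: locals survive across each yield, so the
# task reads like ordinary sequential code.
def fetch_pair_task(a, b):
    val_a = yield a
    val_b = yield b
    return val_a + val_b


def spawn(sched, task, done):
    """Adapter: drive a generator-style task with event-style callbacks."""
    def step(value=None):
        try:
            key = task.send(value)
        except StopIteration as stop:
            done(stop.value)
            return
        fake_read(sched, key, step)
    sched.post(step)


def demo():
    sched = Scheduler()
    results = []
    fetch_pair_events(sched, "x", "y", results.append)
    spawn(sched, fetch_pair_task("p", "q"), results.append)
    sched.run()
    return results
```

Calling demo() runs both tasks to completion on one scheduler. Because control changes hands only when a handler returns or a task yields, neither style needs locks around the shared results list.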

74 Vol. 27, No. 5 ;login:

IMPROVING WAIT-FREE ALGORITHMS FOR INTERPROCESS COMMUNICATION IN EMBEDDED REAL-TIME SYSTEMS
Hai Huang, Padmanabhan Pillai, Kang G. Shin, University of Michigan

The main characteristic of the real-time/time-sensitive system is its predictable response time. But concurrency management creates hurdles to achieving this goal because of the use of locks to maintain consistency. To solve this problem, many wait-free algorithms have been developed, but these are typically high-cost solutions. By taking advantage of the temporal characteristics of the system, however, the time and space overhead can be reduced.

The speaker presented an algorithm for temporal concurrency control and applied this technique to improve three wait-free algorithms. A single-writer, multiple-reader, wait-free algorithm and a double-buffer algorithm were proposed.

Using their transformation mechanism, they achieved an improvement of 17–66% in ACET and a 14–70% reduction in memory requirements for IPC algorithms. This mechanism is extensible and can be applied to other non-blocking IPC algorithms as well. Future work involves reducing the synchronizing overheads in more general IPC algorithms with multiple-writer semantics.

MOBILITY
Summarized by Praveen Yalagandula

ROBUST POSITIONING ALGORITHMS FOR DISTRIBUTED AD-HOC WIRELESS SENSOR NETWORKS
Chris Savarese, Jan Rabaey, University of California, Berkeley; Koen Langendoen, Delft University of Technology

The Pico Radio Network comprises more than a hundred sensors, monitors, and actuators equipped with wireless connectivity with the following properties: (1) no infrastructure, (2) computation in a distributed fashion instead of centralized computation, (3) dynamic topology, (4) limited radio range, (5) nodes that can act as repeaters, (6) devices of low cost and with low power, and (7) sparse anchor nodes – nodes with GPS capability. A positioning algorithm determines the geographical position of each node in the network.

In two-dimensional space, each node needs three reference positions to estimate its geographical position. There are two problems that make positioning difficult in the Pico Radio Network type setting: (1) a sparse anchor node problem, and (2) a range error problem.

The two-phase approach that the authors have taken in solving the problem consists of: (1) a hop-terrain algorithm – in this first phase, each node roughly guesses its location by the distance calculated using multi-hops to the anchor points; and (2) a refinement algorithm – in this second phase, each node uses its neighbors’ positions to refine its own position estimate. To guarantee the convergence, this approach uses confidence metrics: assigning a value of 1.0 for anchor nodes and 0.1 to start for other nodes and increasing these with each iteration.

A simulation tool, OMNet++, was used for both phases and in various scenarios. The results show that they achieved position errors of less than 33% in a scenario with 5% anchor nodes, an average connectivity of 7, and 5% range measurement error.

Guidelines for anchor node deployment are: high connectivity (>10), a reasonable fraction of anchor nodes (>5%), and, for anchor placement, covered edges.

APPLICATION-SPECIFIC NETWORK MANAGEMENT FOR ENERGY-AWARE STREAMING OF POPULAR MULTIMEDIA FORMATS
Surendar Chandra, University of Georgia; and Amin Vahdat, Duke University

The main hindrance in supporting the increasing demand for mobile multimedia on PDA is the battery capacity of the small devices. The idle-time power consumption is almost the same as the receive power consumption on typical wireless network cards (1319mW vs. 1425mW), while the sleep state consumes far less power (177mW). To let the system consume energy proportional to the stream quality, the network card should transition to sleep state aggressively between each packet.

The speaker presented previous work, where different multimedia streams were studied for PDAs: MS Media, Real Media, and Quicktime. It was found that if the inter-packet gap is predictable, then huge savings are possible: for example, about 80% savings in MS media streams with few packet losses. The limitation of the IEEE 802.11b power-save mode comes into play when there are two or more nodes, producing a delay between beacon time and when the node receives/sends packets. This badly affects the streams, since higher energy is consumed while waiting.

The authors propose traffic shaping for energy conservation where this is done by proxy such that packets arrive at regular intervals. This is achieved using a local proxy in the access point and a client-side proxy in the mobile host. Simulations show that traffic shaping reduces energy consumption and also reveals that the higher bandwidth streams have a lower energy metric (mJ/kB).

More information is at http://greenhouse.cs.uga.edu.

CHARACTERIZING ALERT AND BROWSE SERVICES OF MOBILE CLIENTS
Atul Adya, Paramvir Bahl, and Lili Qiu, Microsoft Research

Even though there is a dramatic increase in Internet access from wireless devices, there are not many studies done on characterizing this traffic. In this paper, the authors characterize the traffic observed on the MSN Web site with both notification and browse traces.

Around 33 million browsing accesses and about 3.25 million notification entries are present in the traces. Three types of analyses are done for each one of these two services: content analysis, concerning the most popular content categories and their distribution; popularity analysis, or the popularity distribution of documents; and user behavior analysis.

The analysis of the notification logs shows that document access rates follow a Zipf-like distribution, with most of the accesses concentrated on a small number of messages; and the accesses exhibit geographical locality – users from the same locality tend to receive similar notification content. The browser log analysis shows that a smaller set of URLs are accessed most times, though the access pattern does not fit any Zipf curve; and the highly accessed URLs remain stable. A correlation study between the notification and browsing services shows that wireless users have a moderate correlation of 0.12.

FREENIX TRACK SESSIONS

BUILDING APPLICATIONS
Summarized by Matt Butner

INTERACTIVE 3-D GRAPHICS APPLICATIONS FOR TCL
Oliver Kersting, Jürgen Döllner, Hasso Plattner Institute for Software Systems Engineering, University of Potsdam

The integration of 3-D image rendering functionality into a scripting language permits interactive and animated 3-D development and application without the formalities and precision demanded by low-level C/C++ graphics and visualization libraries. The large and complex C++ API of the Virtual Rendering System (VRS) can be combined with the conveniences of the Tcl scripting language. The mapping of class interfaces is done via an automated process and generates respective wrapper classes, all of which ensures complete API accessibility and functionality without any significant performance penalty. The general C++-to-Tcl mapper SWIG grants the necessary object and hierarchical abilities without the use of object-oriented Tcl extensions. The final API mapping techniques address C++ features such as classes, overloaded methods, enumerations, and inheritance relations. All are implemented in a proof-of-concept that maps the complete C++ API of the VRS to Tcl and are showcased in a complete interactive 3-D-map system.

THE AGFL GRAMMAR WORK LAB
Cornelis H.A. Coster, Erik Verbruggen, University of Nijmegen (KUN)

The growth and implementation of Natural Language Processing (NLP) is the cornerstone of the continued evolution and implementation of truly intelligent search machines and services. In part due to the growing collections of computer-stored human-readable documents in the public domain, the implementation of linguistic analysis will become necessary for desirable precision and resolution. Subsequently, such tools and linguistic resources must be released into the public domain, and they have done so with the release of the AGFL Grammar Work Lab under the GNU Public License, as a tool for linguistic research and the development of NLP-based applications.

The AGFL (Affix Grammars over a Finite Lattice) Grammar Work Lab meshes context-free grammars with finite set-valued features that are acceptable to a range of languages. In computer science terms, “Syntax rules are procedures with parameters and a non-deterministic execution.” The English Phrases for Information Retrieval (EP4IR), released with the AGFL-GWL as a robust grammar of English, is an AGFL-GWL generated English parser that outputs “Head/Modified” frames. The sentences “CompanyX sponsored this conference” and “This conference was sponsored by CompanyX” both generate [CompanyX,[sponsored, conference]]. The hope is that public availability of such tools will encourage further development of grammar and lexical software systems.

SWILL: A SIMPLE EMBEDDED WEB SERVER LIBRARY
Sotiria Lampoudi, David M. Beazley, University of Chicago

This paper won the FREENIX Best Student Paper award.

SWILL (Simple Web Interface Link Library) is a simple Web server in the form of a C library whose development was motivated by a wish to give cool applications an interface to the Web. The SWILL library provides a simple interface that can be efficiently implemented for tasks that vary from flexible Web-based monitoring to software debugging and diagnostics. Though originally designed to be integrated with high-performance scientific simulation software, the interface is generic enough to allow for unbounded uses. SWILL is a single-threaded server, relying upon non-blocking I/O through the creation of a temporary server which services I/O requests. SWILL does not provide SSL or cryptographic authentication but does have HTTP authentication abilities.

A fantastic feature of SWILL is its support for SPMD-style parallel applications which utilize MPI, proving valuable for Beowulf clusters and large parallel machines. Another practical application was the implementation of SWILL in a modified Yalnix emulator by University of Chicago Operating Systems courses, which utilized the added Yalnix functionality for OS development and debugging. SWILL requires minimal memory overhead and relies upon the HTTP/1.0 protocol.
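The general idea of an embedded, single-threaded web server that a host application polls from its own main loop can be sketched with Python's standard library. This is only an illustration of the pattern and does not use or reproduce the SWILL C API; StatusHandler, make_status_server, fetch, and demo are hypothetical names, and the loopback address and the "step" counter are assumptions made for the demo.

```python
import http.server
import socket
import threading


class StatusHandler(http.server.BaseHTTPRequestHandler):
    """Serves a plain-text snapshot of the host application's state."""

    def do_GET(self):
        body = ("step=%d\n" % self.server.app_state["step"]).encode("ascii")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the host application's output clean


def make_status_server(app_state):
    # Port 0 asks the OS for any free port (a demo assumption).
    server = http.server.HTTPServer(("127.0.0.1", 0), StatusHandler)
    server.timeout = 0  # handle_request() returns at once when nobody is waiting
    server.app_state = app_state
    return server


def fetch(port, out):
    """Minimal HTTP/1.0 client, used only to exercise the demo."""
    with socket.create_connection(("127.0.0.1", port), timeout=5) as conn:
        conn.sendall(b"GET / HTTP/1.0\r\n\r\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    out.append(b"".join(chunks))


def demo():
    state = {"step": 0}
    server = make_status_server(state)
    replies = []
    client = threading.Thread(target=fetch,
                              args=(server.server_address[1], replies))
    client.start()
    # The application's own loop: do one unit of work, then poll the server.
    while client.is_alive():
        state["step"] += 1
        server.handle_request()
    client.join()
    server.server_close()
    return replies[0] if replies else b""
```

The application stays single-threaded (the extra thread exists only to play the browser), and the server does work only when the main loop chooses to poll it, which is what lets a long-running computation expose its state without being restructured around a server.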

NETWORK PERFORMANCE
Summarized by Florian Buchholz

LINUX NFS CLIENT WRITE PERFORMANCE
Chuck Lever, Network Appliance; Peter Honeyman, CITI, University of Michigan

Lever introduced a benchmark to measure an NFS client write performance. Client performance is difficult to measure due to hindrances such as poor hardware or bandwidth limitations. Furthermore, measuring application performance does not identify weaknesses specifically at the client side. Thus a benchmark was developed trying to exercise only data transfers in one direction between server and application. For this purpose, the benchmark was based on the block sequential write portion of the Bonnie file system benchmark. Once a benchmark for NFS clients is established, it can be used to improve client performance.

The performance measurements were performed with an SMP Linux client and both a Linux NFS server and a Network Appliance F85 filer. During testing, a periodic jump in write latency time was discovered. This was due to a rather large number of pending write operations that were scheduled to be written after certain threshold values were exceeded. By introducing a separate daemon that flushes the cached write request, the spikes could be eliminated, but as a result the average latency grows over time. The problem could be traced to a function that scans a linked list of write requests. After having added a hashtable to improve lookup performance, the latency improved considerably.

The improved client was then used to measure throughput against the two servers. A discrepancy between the Linux server and the filer test was noticed and the reason for that traced back to a global kernel lock that was unnecessarily held when accessing the network stack. After correcting this, performance improved further. However, the measurements showed that a client may run slower when paired with fast servers on fast networks. This is due to heavy client interrupt loads, more network processing on the client side, and more global kernel lock contention.

The source code of the project is available at http://www.citi.umich.edu/projects/nfs-perf/patches/

A STUDY OF THE RELATIVE COSTS OF NETWORK SECURITY PROTOCOLS
Stefan Miltchev and Sotiris Ioannidis, University of Pennsylvania; Angelos Keromytis, Columbia University

With the increasing need for security and integrity of remote network services, it becomes important to quantify the communication overhead of IPSec and compare it to alternatives such as SSH, SCP, and HTTPS.

For this purpose, the authors set up three testing networks: direct link, two hosts separated by two gateways, and three hosts connecting through one gateway. Protocols were compared in each setup, and manual keying was used to eliminate connection setup costs. For IPSec the different encryption algorithms AES, DES, 3DES, hardware DES, and hardware 3DES were used. In detail, FTP was compared to SFTP, SCP and FTP over IPSec, HTTP to HTTPS and HTTP over IPSec, and NFS to NFS over IPSec and local disk performance.

The result of the measurements were that IPSec outperforms other popular encryption schemes. Overall, unencrypted communication was fastest, but in some cases, like FTP, the overhead can be small. The use of crypto hardware can significantly improve performance. For future work, the inclusion of setup costs, hardware-accelerated SSL, SFTP, and SSH were mentioned.

CONGESTION CONTROL IN LINUX TCP
Pasi Sarolahti, University of Helsinki; Alexey Kuznetsov, Institute for Nuclear Research at Moscow

Having attended the talk and read the paper, I am still unclear about whether the authors are merely describing the design decisions of TCP congestion control or whether they are actually the creators of that part of the Linux code. My guess leans toward the former.

In the talk, the speaker compared the TCP protocol congestion control measures according to IETF and RFC specifications with the actual Linux implementation, which does conform to the basic principles but nevertheless has differences. A specific emphasis was placed on retransmission mechanisms and the congestion window. Also, several TCP enhancements – the NewReno algorithm, Selective ACKs (SACK), Forward ACKs (FACK) – were discussed and compared.

In some instances, Linux does not conform to the IETF specifications. The fast recovery does not fully follow RFC 2582 since the threshold for triggering retransmit is adjusted dynamically and the congestion window’s size is not changed. Also, the roundtrip-time estimator and the RTO calculation differ from RFC 2988 since it uses more conservative RTT estimates and a minimum RTO of 200ms. The performance measures showed that with the additional Linux-specific features enabled, slightly higher throughput, more steady data flow, and fewer unnecessary retransmissions can be achieved.

XTREME XCITEMENT
Summarized by Steve Bauer

THE FUTURE IS COMING: WHERE THE X WINDOW SHOULD GO
Jim Gettys, Compaq Computer Corp.

Jim Gettys, one of the principal authors of the X Window System, outlined the near-term objectives for the system, primarily focusing on the changes and infrastructure required to enable replication and migration of X applications. Providing better support for this functionality would enable users to retrieve or duplicate X applications between their servers at home and work.

One interesting example of an application that currently is capable of migration and replication is Emacs. To create a new frame on DISPLAY try: “M-x make-frame-on-display DISPLAY”.

However, technical challenges make replication and migration difficult in general. These include the “major headaches” of server-side fonts, the nonuniformity of X servers and screen sizes, and the need to appropriately retrofit toolkits. Authentication and authorization issues are obviously also important. The rest of the talk delved into some of the details of these interesting technical challenges.

HACKING IN THE KERNEL
Summarized by Hai Huang

AN IMPLEMENTATION OF SCHEDULER ACTIVATIONS ON THE NETBSD OPERATING SYSTEM
Nathan J. Williams, Wasabi Systems

Scheduler activation is an old idea. Basically, there are benefits and drawbacks to using solely kernel-level or user-level threading. Scheduler activation is able to combine the two layers of control to provide more concurrency in the system.

In his talk, Nathan gave a fairly detailed description of the implementation of scheduler activation in the NetBSD kernel. One important change in the implementation is to differentiate the thread context from the process context. This is done by defining a separate data structure for these threads and relocating some of the information that was embedded in the process context to these thread contexts. Stack was especially a concern due to the upcall. Special handling must be done to make sure that the upcall doesn’t mess up the stack so that the preempted user-level thread can continue afterwards. Lastly, Nathan explained that signals were handled by upcalls.

AUTHORIZATION AND CHARGING IN PUBLIC WLANS USING FREEBSD AND 802.1X
Pekka Nikander, Ericsson Research NomadicLab

802.1x standards are well known in the wireless community as link-layer authentication protocols. In this talk, Pekka explained some novel ways of using the 802.1x protocols that might be of interest to people on the move. It is possible to set up a public WLAN that would support various charging schemes via virtual tokens which people can purchase or earn and later use.

This is implemented on FreeBSD using the utility. It is basically a filter in the link layer that would differentiate traffic based on the MAC address of the client node, which is either authenticated, denied, or let through. The overhead for this service is fairly minimal.

ACPI IMPLEMENTATION ON FREEBSD
Takanori Watanabe, Kobe University

ACPI (Advanced Configuration and Power Interface) was proposed as a joint effort by Intel, Toshiba, and Microsoft to provide a standard and finer-control method of managing power states of individual devices within a system. Such low-level power management is especially important for those mobile and embedded systems that are powered by fixed-capacity energy batteries.

Takanori explained that the ACPI specification is composed of three parts: tables, BIOS, and registers. He was able to implement some functionalities of ACPI in a FreeBSD kernel. ACPI Component Architecture was implemented by Intel, and it provides a high-level ACPI API to the operating system. Takanori’s ACPI implementation is built upon this underlying layer of APIs.

ACCESS CONTROL
Summarized by Florian Buchholz

DESIGN AND PERFORMANCE OF THE OPENBSD STATEFUL PACKET FILTER (PF)
Daniel Hartmeier, Systor AG

Daniel Hartmeier described the new stateful packet filter (pf) that replaced IPFilter in the OpenBSD 3.0 release. IPFilter could no longer be included due to licensing issues and thus there was a need to write a new filter, making use of optimized data structures.

The filter rules are implemented as a linked list which is traversed from start to end. Two actions may be taken according to the rules: “pass” or “block.” A “pass” action forwards the packet and a “block” action will drop it. Where more than one rule matches a packet, the last rule wins. Rules that are marked as “final” will immediately terminate any further rule evaluation. An optimization called “skip-steps” was also implemented, where blocks of similar rules are skipped if they cannot match the current packet. These skip-steps are calculated when the rule set is loaded. Furthermore, a state table keeps track of TCP connections. Only packets that match the sequence numbers are allowed. UDP and ICMP queries and replies are considered in the state table, where initial packets will create an entry for a pseudo-connection with a low initial timeout value. The state table is implemented using a balanced binary search tree. NAT mappings are also stored in the state table, whereas application proxies reside in user space. The packet filter also is able to perform fragment reassembly and to modulate TCP sequence numbers to protect hosts behind the firewall.

Pf was compared against IPFilter as well as Linux’s Iptables by measuring throughput and latency with increasing traffic rates and different packet sizes. In a test with a fixed rule set size of 100, Iptables outperformed the other two filters, whose results were close to each

other. In a second test, where the rule set size was continually increased, Iptables consistently had about twice the throughput of the other two (which evaluate the rule set on both the incoming and outgoing interfaces). A third test compared only pf and IPFilter, using a single rule that created state in the state table with a fixed-state entry size. Pf reached an overloaded state much later than IPFilter. The experiment was repeated with a variable-state entry size and pf performed much better than IPFilter for a small number of states.

In general, rule set evaluation is expensive and benchmarks only reflect extreme cases, whereas in real life, other behavior should be observed. Furthermore, the benchmarks show that stateful filtering can actually improve performance due to cheap state-table lookups as compared to rule evaluation.

ENHANCING NFS CROSS-ADMINISTRATIVE DOMAIN ACCESS
Joseph Spadavecchia and Erez Zadok, Stony Brook University

The speaker presented modification to an NFS server that allows an improved NFS access between administrative domains. A problem lies in the fact that NFS assumes a shared UID/GID space, which makes it unsafe to export files outside the administrative domain of the server. Design goals were to leave protocol and existing clients unchanged, a minimum amount of server changes, flexibility, and increased performance.

To solve the problem, two techniques are utilized: “range-mapping” and “file-cloaking.” Range-mapping maps IDs between client and server. The mapping is performed on a per-export basis and has to be manually set up in an export file. The mappings can be 1-1, N-N, or N-1. In file-cloaking, the server restricts file access based on UID/GID and special cloaking-mask bits. Here users can only access their own file permissions. The policy on whether or not the file is visible to others is set by the cloaking mask, which is logically ANDed with the file’s protection bits. File-cloaking only works, however, if the client doesn’t hold cached copies of directory contents and file-attributes. Because of this the clients are forced to re-read directories by incrementing the mtime value of the directory each time it is listed.

To test the performance of the modified server, five different NFS configurations were evaluated. An unmodified NFS server was compared against one server with the modified code included but not used, one with only range-mapping enabled, one with only file-cloaking enabled, and one version with all modifications enabled. For each setup, different file system benchmarks were run. The results show only a small overhead when the modifications are used, generally an increase of below 5%. Another experiment tested the performance of the system with an increasing number of mapped or cloaked entries on a system. The results show that an increase from 10 to 1000 entries resulted in a maximum of about 14% cost in performance.

One member of the audience pointed out that if clients choose to ignore the changed mtimes from the server and thus still hold caches of the directory entries, the file-cloaking mechanism could be defeated. After a rather lengthy debate, the speaker had to concede that the model doesn’t add any extra security.

Another question was asked about scalability of the setup of range mapping. The speaker referred to application-level tools that could be developed for that purpose.

The software is available at ftp://ftp.fsl.cs.sunysb.edu/pub/enf.

ENGINEERING OPEN SOURCE SOFTWARE
Summarized by Teri Lampoudi

NINGAUI: A LINUX CLUSTER FOR BUSINESS
Andrew Hume, AT&T Labs – Research; Scott Daniels, Electronic Data Systems Corp.

Ningaui is a general purpose, highly available, resilient architecture built from commodity software and hardware. Emphatically, however, Ningaui is not a Beowulf. Hume calls the cluster design the “Swiss canton model,” in which there are a number of loosely affiliated independent nodes, with data replicated among them. Jobs are assigned by bidding and leases, and cluster services done as session-based servers are sited via generic job assignment. The emphasis is on keeping the architecture end-to-end, checking all work via checksums, and logging everything. The resilience and high availability required by their goal of 8-5 maintenance – vs. the typical 24-7 model where people get paged whenever the slightest thing goes wrong, regardless of the time of day – is achieved by job restartability. Finally, all computation is performed on local data, without the use of NFS or network attached storage.

Hume’s message is a hopeful one: despite the many problems encountered – things like kernel and service limits, auto-installation problems, TCP storms, and the like – the existence of source code and the paranoid practice of logging and checksumming everything has helped. The final product performs reasonably well, and it appears that the resilient techniques employed do make a difference. One drawback, however, is that the software mentioned in the paper is not yet available for download.

CPCMS: A CONFIGURATION MANAGEMENT SYSTEM BASED ON CRYPTOGRAPHIC NAMES
Jonathan S. Shapiro and John Vanderburgh, Johns Hopkins University

This paper won the FREENIX Best Paper award.

The basic notion behind the project is the fact that everyone has a pet complaint about CVS, and yet it is currently the configuration manager in most widespread use. Shapiro has unleashed an alternative. Interestingly, he did not begin but ended with the usual host of reasons why CVS is bad. The talk instead began abruptly by characterizing the job

of a software configuration manager, continued by stating the namespaces which it must handle, and wrapped up with the challenges faced.

X MEETS Z: VERIFYING CORRECTNESS IN THE PRESENCE OF POSIX THREADS
Bart Massey, Portland State University; Robert T. Bauer, Rational Software Corp.

Massey delivered a humorous talk on the insight gained from applying Z formal specification notation to system software design rather than the more informal analysis and design process normally used.

The story is told with respect to writing XCB, which replaces the Xlib protocol layer and is supposed to be thread friendly. Where threads are concerned, deadlock avoidance becomes a hard problem that cannot be solved in an ad-hoc manner. Full model checking, however, is also too hard. In this case Massey resorted to Z specification to model the XCB lower layer, abstract away locking and data transfers, and locate fundamental issues. Essentially, the difficulties of searching the literature and locating information relevant to the problem at hand were overcome. As Massey put it, “the formal method saved the day.”

FILE SYSTEMS
Summarized by Bosko Milekic

PLANNED EXTENSIONS TO THE LINUX EXT2/EXT3 FILESYSTEM
Theodore Y. Ts’o, IBM; Stephen Tweedie, Red Hat

The speaker presented improvements to the Linux Ext2 file system with the goal of allowing for various expansions while striving to maintain compatibility with older code. Improvements have been facilitated by a few extra superblock fields that were added to Ext2 not long before the Linux 2.0 kernel was released. The fields allow for file system features to be added without compromising the existing setup; this is done by providing bits indicating the impact of the added features with respect to compatibility. Namely, a file system with the “incompat” bit marked is not allowed to be mounted by code that does not understand the feature. Similarly, a “read-only” marking would only allow the file system to be mounted read-only.

Directory indexing replaces linear directory searches with a faster search using a fixed-depth tree and hashed keys. File system size can be dynamically increased, and the expanded inode, doubled from 128 to 256 bytes, allows for more extensions.

Other potential improvements were discussed as well, in particular pre-allocation for contiguous files, which allows for better performance in certain setups by pre-allocating contiguous blocks. Security-related modifications, extended attributes and ACLs, were mentioned. An implementation of these features already exists but has not yet been merged into the mainline Ext2/3 code.

RECENT FILESYSTEM OPTIMISATIONS ON FREEBSD
Ian Dowse, Corvil Networks; David Malone, CNRI, Dublin Institute of Technology

David Malone presented four important file system optimizations for FreeBSD: soft updates, dirpref, vmiodir, and dirhash. It turns out that certain combinations of the optimizations (beautifully illustrated in the paper) may yield performance improvements of anywhere between 2 and 10 orders of magnitude for real-world applications.

All four of the optimizations deal with file system metadata. Soft updates allow for asynchronous metadata updates. Dirpref changes the way directories are organized, attempting to place child directories closer to their parents, thereby increasing locality of reference and reducing disk-seek times. Vmiodir trades some extra memory in order to achieve better directory caching. Finally, dirhash, which was implemented by Ian Dowse, changes the way in which entries are searched in a directory. Specifically, FFS uses a linear search to find an entry by name; dirhash builds a hashtable of directory entries on the fly. For a directory of size n, with a working set of m files, a search that in certain cases could have been O(n*m) has been reduced, due to dirhash, to effectively O(n + m).

FILESYSTEM PERFORMANCE AND SCALABILITY IN LINUX 2.4.17
Ray Bryant, SGI; Ruth Forester, IBM LTC; John Hawkes, SGI

This talk focused on performance evaluation of a number of file systems available and commonly deployed on Linux machines. Comparisons, under various configurations, of Ext2, Ext3, ReiserFS, XFS, and JFS were presented.

The benchmarks chosen for the data gathering were pgmeter, filemark, and AIM Benchmark Suite VII. Pgmeter measures the rate of data transfer of reads/writes of a file under a synthetic workload. Filemark is similar to postmark in that it is an operation-intensive benchmark, although filemark is threaded and offers various other features that postmark lacks. AIM VII measures performance for various file-system-related functionalities; it offers various metrics under an imposed workload, thus stressing the performance of the file system not only under I/O load, but also under significant CPU load.

Tests were run on three different setups: a small, a medium, and a large configuration. ReiserFS and Ext2 appear at the top of the pile for smaller and medium setups. Notably, XFS and JFS perform worse for smaller system configurations than the others, although XFS clearly appears to generate better numbers than JFS. It should be noted that XFS seems to scale well under a higher load. This was most evident in the large-system results, where XFS appears to offer the best overall results.
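The dirhash idea from the FreeBSD summary above can be illustrated with a small sketch. This is not the FreeBSD implementation: the `DirHash` class, the `linear_lookup` function, and the use of a Python dict as the hash table are all assumptions made for illustration. The sketch contrasts scanning a directory's entries on every lookup with building a hash table once, on first use, and answering later lookups from it.

```python
from typing import Dict, List, Optional


def linear_lookup(entries: List[str], name: str) -> Optional[int]:
    """Scan the entries from the start, as a plain linear directory search does."""
    for index, entry in enumerate(entries):
        if entry == name:
            return index
    return None


class DirHash:
    """Hypothetical in-memory index over directory entries, built on first lookup."""

    def __init__(self, entries: List[str]) -> None:
        self._entries = entries
        self._index: Optional[Dict[str, int]] = None

    def lookup(self, name: str) -> Optional[int]:
        if self._index is None:
            # One O(n) pass builds the table "on the fly";
            # later lookups are O(1) on average.
            self._index = {entry: i for i, entry in enumerate(self._entries)}
        return self._index.get(name)
```

Looking up m names in a directory of n entries costs on the order of n*m comparisons with the linear scan in the worst case, versus one O(n) build plus m average-constant-time probes with the table, which matches the O(n + m) behavior the summary mentions.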

THINGS TO THINK ABOUT
Summarized by Bosko Milekic

SPEEDING UP KERNEL SCHEDULER BY REDUCING CACHE MISSES
Shuji Yamamura, Akira Hirai, Mitsuru Sato, Masao Yamamoto, Akira Naruse, Kouichi Kumon, Fujitsu Labs

This was an interesting talk pertaining to the effects of cache coloring for task structures in the Linux kernel scheduler (Linux kernel 2.4.x). The speaker first presented some interesting benchmark numbers for the Linux scheduler, showing that as the number of processes on the task queue was increased, the performance decreased. The authors used some really nifty hardware to measure the number of bus transactions throughout their tests and were thus able to reasonably quantify the impact that cache misses had in the Linux scheduler.

Their experiments led them to implement a cache coloring scheme for task structures, which were previously aligned on 8KB boundaries and, therefore, were eventually being mapped to the same cache lines. This unfortunate placement of task structures in memory induced a significant number of cache misses as the number of tasks grew in the scheduler.

The implemented solution consisted of aligning task structures to more evenly distribute cached entries across the L2 cache. The result was, inevitably, fewer cache misses in the scheduler. Some negative effects were observed in certain situations. These were primarily due to more cache slots being used by task structure data in the scheduler, thus forcing data previously cached there to be pushed out.

WORK-IN-PROGRESS REPORTS
Summarized by Brennan Reynolds

RESOURCE VIRTUALIZATION TECHNIQUES FOR WIDE-AREA OVERLAY NETWORKS
Kartik Gopalan, Stony Brook University

This work addressed the issue of provisioning a maximum number of virtual overlay networks (VONs) with diverse quality of service (QoS) requirements on a single physical network. Each of the VONs is logically isolated from the others to ensure the QoS requirements. Gopalan mentioned several critical research issues with this problem that are currently being investigated. Dealing with how to provision the network at various levels (link, route, or path) and then enforce the provisioning at run-time is one of the toughest challenges. Currently, Gopalan has developed several algorithms to handle admission control, end-to-end QoS, route selection, scheduling, and fault tolerance in the network.

For more information, visit http://www.ecsl.cs.sunysb.edu/.

VISUALIZING SOFTWARE INSTABILITY
Jennifer Bevan, University of California, Santa Cruz

Detection of instability in software has typically been an afterthought. The point in the development cycle when the software is reviewed for instability is usually after it is difficult and costly to go back and perform major modifications to the code base. Bevan has developed a technique to allow the visualization of unstable regions of code that can be used much earlier in the development cycle. Her technique creates a time series of dependent graphs that include clusters and lines called fault lines. From the graphs a developer is able to easily determine where the unstable sections of code are and proactively restructure them. She is currently working on a prototype implementation.

For more information, visit http://www.cse.ucsc.edu/~jbevan/evo_viz/.

RELIABLE AND SCALABLE PEER-TO-PEER WEB DOCUMENT SHARING
Li Xiao, College of William and Mary

The idea presented by Xiao would allow end users to share the content of their Web browser caches with neighboring Internet users. The rationale for this is that today’s end users are increasingly connected to the Internet over high-speed links, and the browser caches are becoming large storage systems. Therefore, if an individual accesses a page which does not exist in their local cache, Xiao is suggesting that they first query other end users for the content before trying to access the machine hosting the original. This strategy does have some serious problems associated with it that still need to be addressed, including ensuring the integrity of the content and protecting the identity and privacy of the end users.

SEREL: FAST BOOTING FOR UNIX
Leni Mayo, Fast Boot Software

Serel is a tool that generates a visual representation of a UNIX machine’s boot-up sequence. It can be used to identify the critical path and can show if a particular service or process blocks for an extended period of time. This information could be used to determine where the largest performance gains could be realized by tuning the order of execution at boot-up. Serel creates a dependency graph expressed in XML during boot-up. This graph is then used to create the visual representation. Currently the tool only works on POSIX-compliant systems, but Mayo stated that he would be porting it to other platforms. Other extensions that were mentioned included having the metadata include the use of shared libraries and monitoring the suspend/resume sequence of portable machines.

For more information, visit http://www.fastboot.org/.
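The cache coloring effect described in the kernel scheduler summary above can be sketched numerically. The cache geometry (`LINE_SIZE`, `NUM_SETS`), the number of colors, and the way a color is assigned are assumptions chosen for illustration; they are not taken from the talk or from the Linux patch. Only the 8KB alignment comes from the summary. The sketch counts how many distinct cache sets the first line of each task structure lands in, with and without a per-structure offset.

```python
from typing import List

LINE_SIZE = 32           # bytes per cache line (assumed)
NUM_SETS = 2048          # number of cache sets (assumed)
TASK_ALIGN = 8 * 1024    # task structures sit on 8KB boundaries (from the summary)
NUM_COLORS = 16          # distinct offsets ("colors") to rotate through (assumed)

# Number of consecutive 8KB-aligned addresses before the set index wraps around.
BASES_PER_WRAP = (NUM_SETS * LINE_SIZE) // TASK_ALIGN


def cache_set(addr: int) -> int:
    """Return the set an address maps to in a simple physically indexed cache."""
    return (addr // LINE_SIZE) % NUM_SETS


def task_addresses(count: int, colored: bool) -> List[int]:
    """Return start addresses for `count` task structures."""
    addrs = []
    for i in range(count):
        base = i * TASK_ALIGN
        # Structures whose bases would map to the same set get different colors.
        color = (i // BASES_PER_WRAP) % NUM_COLORS if colored else 0
        addrs.append(base + color * LINE_SIZE)
    return addrs


def distinct_sets(count: int, colored: bool) -> int:
    """Count how many different cache sets the structures' first lines occupy."""
    return len({cache_set(a) for a in task_addresses(count, colored)})
```

Under these assumed parameters, 64 uncolored structures share only 8 cache sets, while the colored layout spreads them over 64. The same spreading also explains the negative effect noted in the summary: task data now occupies more of the cache, so other data can be evicted.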

BERKELEY DB XML
John Merrells, Sleepycat Software

Merrells gave a quick introduction and overview of the new XML library for Berkeley DB. The library specializes in storage and retrieval of XML content through a tool called XPath. The library allows for multiple containers per document and stores everything natively as XML. The user is also given a wide range of elements to create indices with, including edges, elements, text strings, or presence. The XPath tool consists of a query parser, generator, optimizer, and execution engine. To conclude his presentation, Merrells gave a live demonstration of the software.

For more information, visit http://www.sleepycat.com/xml/.

CLUSTER-ON-DEMAND (COD)
Justin Moore, Duke University

Modern clusters are growing at a rapid rate. Many have pushed beyond the 5000-machine mark, and deploying them results in large expenses as well as management and provisioning issues. Furthermore, if the cluster is “rented” out to various users, it is very time-consuming to configure it to a user’s specs regardless of how long they need to use it. The COD work presented creates dynamic virtual clusters within a given physical cluster. The goal was to have a provisioning tool that would automatically select a chunk of available nodes and install the operating system and middleware specified by the customer in a short period of time. This would allow a greater use of resources, since multiple virtual clusters can exist at once. By using a virtual cluster, the size can be changed dynamically. Moore stated that they have created a working prototype and are currently testing and benchmarking its performance.

For more information, visit http://www.cs.duke.edu/~justin/cod/.

CATACOMB
Elias Sinderson, University of California, Santa Cruz

Catacomb is a project to develop a database-backed DASL module for the Apache Web server. It was designed as a replacement for the WebDAV. The initial release of the module only contains support for a MySQL database but could be extended to others. Sinderson briefly touched on the performance of the module. It was comparable to the mod_dav Apache module for all query types but search. The presentation was concluded with remarks about adding support for the lock method and including ACL specifications in the future.

For more information, visit http://ocean.cse.ucsc.edu/catacomb/.

SELF-ORGANIZING STORAGE
Dan Ellard, Harvard University

Ellard’s presentation introduced a storage system that tuned itself, based on the workload of the system, without requiring the intervention of the user. The intelligence was implemented as a virtual self-organizing disk that resides below any file system. The virtual disk observes the access patterns exhibited by the system and then attempts to predict what information will be accessed next. An experiment was done using an NFS trace at a large ISP on one of their email servers. Ellard’s self-organizing storage system worked well, which he attributes to the fact that most of the files being requested were large email boxes. Areas of future work include exploration of the length and detail of the data collection stage, as well as the CPU impact of running the virtual disk layer.

VISUAL DEBUGGING
John Costigan, Virginia Tech

Costigan feels that the state of current debugging facilities in the UNIX world is not as good as it should be. He proposes the addition of several elements to programs like ddd and gdb. The first addition is including separate heap and stack components. Another is to display only certain fields of a data structure and have the ability to zoom in and out if needed. Finally, the debugger should provide the ability to visually present complex data structures, including linked lists, trees, etc. He said that a beta version of a debugger with these abilities is currently available.

For more information, visit http://infovis.cs.vt.edu/datastruct/.

IMPROVING APPLICATION PERFORMANCE THROUGH SYSTEM-CALL COMPOSITION
Amit Purohit, Stony Brook University

Web servers perform a huge number of context switches and internal data copying during normal operation. These two elements can drastically limit the performance of an application regardless of the hardware platform it is run on. The Compound System Call (CoSy) framework is an attempt to reduce the performance penalty for context switches and data copies. It includes a user-level library and several kernel facilities that can be used via system calls. The library provides the programmer with a complete set of memory-management functions. Performance tests using the CoSy framework showed a large saving for context switches and data copies. The only area where the savings between conventional libraries and CoSy were negligible was for very large file copies.

For more information, visit http://www.cs.sunysb.edu/~purohit/.

ELASTIC QUOTAS
John Oscar, Columbia University

Oscar began by stating that most resources are flexible but that, to date, disks have not been. While approximately 80 percent of files are short-lived, disk quotas are hard limits imposed by administrators. The idea behind elastic quotas is to have a non-permanent storage area that each user can temporarily use. Oscar suggested the creation of a /ehome directory structure to be used in deploying elastic quotas. Currently, he

has implemented elastic quotas as a stackable file system.

There were also several trends apparent in the data. Many users would ssh to their remote host (good) but then launch an email reader on the local machine which would connect via POP to the same host and send the password in clear text (bad). This situation can easily be remedied by tunneling protocols like POP through SSH, but it appeared that many people were not aware this could be done. While most of his comments on the use of the wireless network were negative, the list of passwords he had collected showed that people were indeed using strong passwords. His recommendations were to educate and encourage people to use protocols like IPSec, SSH, and SSL when conducting work over a wireless network, because you never know who else is listening.

SYSTRACE
Niels Provos, University of Michigan

How can one be sure the applications one is using actually do exactly what their developers said they do? Short answer: you can’t. People today are using a large number of complex applications, which means it is impossible to check each application thoroughly for security vulnerabilities. There are tools out there that can help, though. Provos has developed a tool called systrace that allows a user to generate a policy of acceptable system calls a particular application can make. If the application attempts to make a call that is not defined in the policy, the user is notified and allowed to choose an action. Systrace includes an auto-policy generation mechanism which uses a training phase to record the actions of all programs the user executes. If the user chooses not to be bothered by applications breaking policy, systrace allows default enforcement actions to be set. Currently, systrace is implemented for FreeBSD and NetBSD, with a Linux version coming out shortly.

For more information, visit http://www.citi.umich.edu/u/provos/systrace/.

VERIFIABLE SECRET REDISTRIBUTION FOR SURVIVABLE STORAGE SYSTEMS
Ted Wong, Carnegie Mellon University

Wong presented a protocol that can be used to re-create a file distributed over a set of servers, even if one of the servers is damaged. In his scheme the user must choose how many shares the file is split into and the number of servers it will be stored across. The goal of this work is to provide persistent, secure storage of information, even if it comes under attack. Wong stated that his design of the protocol was complete and he is currently building a prototype implementation of it.

For more information, visit http://www.cs.cmu.edu/~tmwong/research/.

THE GURU IS IN SESSIONS

LINUX ON LAPTOP/PDA
Bdale Garbee, HP Linux Systems Operation
Summarized by Brennan Reynolds

This was a roundtable-like discussion with Garbee acting as moderator. He is currently in charge of the Debian distribution of Linux and is involved in porting Linux to platforms other than i386. He has successfully helped port the kernel to the Alpha, Sparc, and ARM and is currently working on a version to run on VAX machines.

A discussion was held on which file system is best for battery life. The Reiser and Ext3 file systems were discussed; people commented that when using Ext3 on their laptops the disk would never spin down and thus was always consuming a large amount of power. One suggestion was to use a RAM-based file system for any volume that requires a large number of writes and to only have the file system written to disk at infrequent intervals or when the machine is shut down or suspended.

A question was directed to Garbee about his knowledge of how the HP-Compaq merger would affect Linux. Garbee thought that the merger was good for Linux, both within the company and for the open source community as a whole. He stated that the new company would be the number one shipper of systems with Linux as the base operating system. He also fielded a question about the support of older HP servers and their ability to run Linux, saying that indeed Linux has been ported to them and he personally had several machines in his basement running it.

A question which sparked a large number of responses concerned problems people had with running Linux on mobile platforms. Most people actually did not have many problems at all. There were a few cases of a particular machine model not working, but there were only two widespread issues: docking stations and the new ACPI power management scheme. Docking stations still appear to be somewhat problematic, but most people had developed ad-hoc solutions for getting them to work. The ACPI power management scheme developed by Intel does not appear to have a quick solution. Ted Ts’o, one of the head kernel developers, stated that there are fundamental architectural problems with the 2.4 series kernel that do not easily allow ACPI to be added. However, he also stated that ACPI is supported in the newest development kernel, 2.5, and will be included in 2.6.

The final topic was the possibility of purchasing a laptop without paying any charge/tax for Microsoft Windows. The consensus was that currently this is not possible. Even if the machine comes with Linux, or without any operating system, the vendors have contracts in place which require them to pay Microsoft for each machine they ship if that machine is capable of running a Microsoft operating system. The only way this will change is if the demand for

other operating systems increases to the point that vendors renegotiate their contracts with Microsoft, and no one saw this happening in the near future.

LARGE CLUSTERS
Andrew Hume, AT&T Labs – Research
Summarized by Teri Lampoudi

This was a session on clusters, big data, and resilient computing. Given that the audience was primarily interested in clusters and not necessarily big data or beating nodes to a pulp, the conversation revolved mainly around what it takes to make a heterogeneous cluster resilient. Hume also presented a FREENIX paper on his current cluster project, which explained in more detail much of what was abstractly claimed in the guru session.

Hume’s use of the word “cluster” referred not to what I had assumed would be a Beowulf-type system but to a loosely coupled farm of machines in which concurrency was much less of an issue than it would be in a Beowulf. In fact, the architecture Hume described was designed to deal with large amounts of independent transaction processing, essentially the process of billing calls, which requires no interprocess message passing or anything of the sort a parallel scientific application might.

Where does the large data come in? Since the transactions in question consist of large numbers of flat files, a mechanism for getting the files onto nodes and off-loading the results is necessary. In this particular cluster, named Ningaui, this task is handled by the “replication manager,” which generated a large amount of interest in the audience. All management functions in the cluster, as well as job allocation, constitute services that are to be bid for and leased out to nodes. Furthermore, failures are handled by simply repeating the failed task until it is completed properly if that is possible, and if it is not, then looking through detailed logs to discover what went wrong. Logging and testing for the successful completion of each task are what make this model resilient. Every management operation is logged at some appropriate level of detail, generating 120MB/day, and every single file transfer is md5 checksummed. This was characterized by Hume as a “patient” style of management, where no one controls the cluster, scheduling behavior is emergent rather than stringently planned, and nodes are free to drop in and out of the cluster; a downside to this model is the increase in latency, offset by the fact that the cluster continues to function unattended outside of 8-to-5 support hours.

In response to questions about the projected scalability of such a scheme, Hume said the system would presumably scale from the present 60 nodes to about 100 without modifications, but that scaling higher would be a matter for further design. The guru session ended on a somewhat unrelated note regarding the problems of getting data off mainframes and onto Linux machines: issues of fixed- vs. variable-length encoding of data blocks resulting in useful information being stripped by FTP, and problems in converting COBOL copybooks to something useful on a C-based architecture. To reiterate Hume’s claim, there are homemade solutions for some of these problems, and the reader should feel free to contact Mr. Hume for them.

WRITING PORTABLE APPLICATIONS
Nick Stoughton, MSB Consultants
Summarized by Josh Lothian

Nick Stoughton addressed a concern that developers are facing more and more: writing portable applications. He proposed that there is no such thing as a “portable application,” only those that have been ported. Developers are still in search of the Holy Grail of applications programming: a write-once, compile-anywhere program. Languages such as Java are leading up to this, but virtual machines still have to be ported, paths will still be different, etc. Web-based applications are another area that could be promising for portable applications.

Nick went on to talk about the future of portability. The recently released POSIX 2001 standards are expected to help the situation. The 2001 revision expands the original POSIX spec into seven volumes. Even though POSIX 2001 is more specific than the previous release, Nick pointed out that even this standard allows for small differences where vendors may define their own behaviors. This can only hurt developers in the long run. Along these lines, it was pointed out that there may be room for automated tools that have the capability to check application code and determine the level of portability and standards compliance of the source.

NETWORK MANAGEMENT, SYSTEM PERFORMANCE TUNING
Jeff Allen, Tellme Networks Inc.
Summarized by Matt Selsky

Jeff Allen, author of Cricket, discussed network tuning. The basic idea of tuning is measure, twiddle, and measure again. Cricket can be used for large installations, but it’s overkill for smaller installations, for which MRTG is better suited. You don’t need Cricket to do measurement. Doing something like a shell script that dumps data to Excel is good, but you need to get some sort of measurement. Cricket was not designed for billing; it was meant to help answer questions.

Useful tools and techniques covered included looking for the difference between machines in order to identify the causes for the differences; measuring but also thinking about what you’re measuring and why; and remembering to use strace and tcpdump to identify problems. Problems repeat themselves; performance tuning requires an intricate understanding of all the layers involved to solve the problem and simplify the automation. (Having a computer science

background helps but is not essential.) If you can reproduce the problem, you should then investigate each piece to find the bottleneck. If you can’t reproduce the problem, then measure. You need to understand all the system interactions. Measurement can help determine whether things are actually slow or if the user is imagining it.

Troubleshooting begins with a hunch, but scientific processes are essential. You should be able to determine the event stream, or when each event occurs; having observation logs can help. Some hints provided include checking cron for periodic anomalies, slowing down /bin/rm to avoid I/O overload, and looking for unusual kernel CPU usage.

Also, try to use lightweight monitoring to reduce overhead. You don’t need to monitor every resource on every system, but you should monitor those resources on those systems that are essential. Don’t check things too often, since you can introduce overhead that way. Lots of information can be gathered from the /proc file system interface to the kernel.

BSD BOF
Host: Kirk McKusick, Author and Consultant
Summarized by Bosko Milekic

The BSD Birds of a Feather session at USENIX started off with five quick and informative talks and ended with an exciting discussion of licensing issues.

Christos Zoulas for the NetBSD Project went over the primary goals of the NetBSD Project (namely, portability, a clean architecture, and security) and then proceeded to briefly discuss some of the challenges that the project has encountered. The successes and areas for improvement of 1.6 were then examined. NetBSD has recently seen various improvements in its cross-build system, packaging system, and third-party tool support. As for the kernel, a zero-copy TCP/UDP implementation has been integrated, and new pmap code for arm32 has been introduced, as have some other rather major changes to various subsystems. More information is available in Christos’s slides, which are now available at http://www.netbsd.org/gallery/events/usenix2002/.

Theo de Raadt of the OpenBSD Project presented a rundown of various new things that the OpenBSD and OpenSSH teams have been looking at. Theo’s talk included an amusing description (and pictures!) of some of the events that transpired during OpenBSD’s hack-a-thon, which occurred the week before USENIX ’02. Finally, Theo mentioned that, following complications with the IPFilter license, the OpenBSD team had done a fairly extensive license audit.

Robert Watson of the FreeBSD Project brought up various examples of FreeBSD being used in the real world. He went on to describe a wide variety of changes that will surface with the upcoming FreeBSD release, 5.0. Notably, 5.0 will introduce a re-architecting of the kernel aimed at providing much more scalable support for SMP machines. 5.0 will also include an early implementation of KSE, FreeBSD’s version of scheduler activations, large improvements to pccard, and significant framework bits from the TrustedBSD project.

Mike Karels from Wind River Systems presented an overview of how BSD/OS has been evolving following the Wind River Systems acquisition of BSDi’s software assets. Ernie Prabhakar from Apple discussed Apple’s OS X and its success on the desktop. He explained how OS X aims to bring BSD to the desktop while striving to maintain a strong and stable core, one that is heavily based on BSD technology.

The BoF finished with a discussion on licensing; specifically, some folks questioned the impact of Apple’s licensing for code from their core OS (the Darwin Project) on the BSD community as a whole.

CLOSING SESSION

HOW FLIES FLY
Michael H. Dickinson, University of California, Berkeley
Summarized by J.D. Welch

In this lively and offbeat talk, Dickinson discussed his research into the mechanisms of flight in the fruit fly (Drosophila melanogaster) and related the autonomic behavior to that of a technological system. Insects are the most successful organisms on earth, due in no small part to their early adoption of flight. Flies travel through space on straight trajectories interrupted by saccades, jumpy turning motions somewhat analogous to the fast, jumpy movements of the human eye. Using a variety of monitoring techniques, including high-speed video in a “virtual reality flight arena,” Dickinson and his colleagues have observed that flies respond to visual cues to decide when to saccade during flight. For example, more visual texture makes flies saccade earlier (i.e., further away from the arena wall), described by Dickinson as a “collision control algorithm.”

Through a combination of sensory inputs, flies can make decisions about their flight path. Flies possess a visual “expansion detector,” which, at a certain threshold, causes the animal to turn a certain direction. However, expansion in front of the fly sometimes causes it to land. How does the fly decide? Using the virtual reality device to replicate various visual scenarios, Dickinson observed that the flies fixate on vertical objects that move back and forth across the field. Expansion on the left side of the animal causes it to turn right, and vice versa, while expansion directly in front of the animal triggers the legs to flare out in preparation for landing.

If the eyes detect these changes, how are the responses implemented? The flies’ wings have “power muscles,” controlled by mechanical resonance (as opposed to the nervous system), which drive the

wings, combined with neurally activated “steering muscles,” which change the configuration of the wing joints. Subtle variations in the timing of impulses correspond to changes in wing movement, controlled by sensors in the “wing-pit.” A small, wing-like structure, the haltere, controls equilibrium by beating constantly.

Voluntary control is accomplished by the halteres, whose signals can interfere with the autonomic control of the stroke cycle. The halteres have “steering muscles” as well, and information derived from the visual system can turn off the haltere or trick it into a “virtual” problem requiring a response.

Dickinson has also studied the aerodynamics of insect flight, using a device called the “Robo Fly,” an oversized mechanical insect wing suspended in a large tank of mineral oil. Interestingly, Dickinson observed that the average lift generated by the flies’ wings is less than their body weight; the flies use three mechanisms to overcome this, including rotating the wings (rotational lift).

Insects are extraordinarily robust creatures, and because Dickinson analyzed the problem in a systems-oriented way, these observations and analysis are immediately applicable to technology; the response system can be used as an efficient search algorithm for control systems in autonomous vehicles, for example.

THE AFS WORKSHOP
Summarized by Garry Zacheiss, MIT

Love Hornquist-Astrand of the Arla development team presented the Arla status report. Arla 0.35.8 has been released. Scheduled for release soon are improved support for Tru64 UNIX, MacOS X, and FreeBSD, improved volume handling, and implementation of more of the vos/pts subcommands. It was stressed that MacOS X is considered an important platform, and that a GUI configuration manager for the cache manager, a GUI ACL manager for the finder, and a graphical login that obtains Kerberos tickets and AFS tokens at login time were all under development.

Future goals planned for the underlying AFS protocols include GSSAPI/SPNEGO support for Rx, performance improvements to Rx, an enhanced disconnected mode, and IPv6 support for Rx; an experimental patch is already available for the latter. Future Arla-specific goals include improved performance, partial file reading, and increased stability for several platforms. Work is also in progress for the RXGSS protocol extensions (integrating GSSAPI into Rx). A partial implementation exists, and work continues as developers find time.

Derrick Brashear of the OpenAFS development team presented the OpenAFS status report. OpenAFS 1.2.5 was released immediately prior to the conference; it fixed a remotely exploitable denial-of-service attack in several OpenAFS platforms, most notably IRIX and AIX. Future work planned for OpenAFS includes better support for MacOS X, including working around Finder interaction issues. Better support for the BSDs is also planned; FreeBSD has a partially implemented client; NetBSD and OpenBSD have only server programs available right now. AIX 5, Tru64 5.1A, and MacOS X 10.2 (a.k.a. Jaguar) are all planned for the future. Other planned enhancements include nested groups in the ptserver (code donated by the University of Michigan awaits integration), disconnected AFS, and further work on a native Windows client. Derrick stated that the guiding principles of OpenAFS were to maintain compatibility with IBM AFS, support new platforms, ease the administrative burden, and add new functionality.

AFS performance was discussed. The openafs.org cell consists of two file servers, one in Stockholm, Sweden, and one in Pittsburgh, PA. AFS works reasonably well over trans-Atlantic links, although OpenAFS clients don’t determine which file server to talk to very effectively. Arla clients use RTTs to the server to determine the optimal file server to fetch replicated data from. Modifications to OpenAFS to support this behavior in the future are desired.

Jimmy Engelbrecht and Harald Barth of KTH discussed their AFSCrawler script, written to determine how many AFS cells and clients were in the world, what implementations/versions they were (Arla vs. IBM AFS vs. OpenAFS), and how much data was in AFS. The script unfortunately triggered a bug in IBM AFS 3.6–derived code, causing some clients to panic while handling a specific RPC. This has since been fixed in OpenAFS 1.2.5 and the most recent IBM AFS patch level, and all AFS users are strongly encouraged to upgrade. No release of Arla is vulnerable to this particular denial-of-service attack. There was an extended discussion of the usefulness of this exploration. Many sites believed this was useful information and such scanning should continue in the future, but only on an opt-in basis.

Many sites face the problem of managing Kerberos/AFS credentials for batch-scheduled jobs. Specifically, most batch processing software needs to be modified to forward tickets as part of the batch submission process, renew tickets and tokens while the job is in the queue and for the lifetime of the job, and properly destroy credentials when the job completes. Ken Hornstein of NRL was able to pay a commercial vendor to support Kerberos 4/5 credential management in their product, although they did not implement AFS token management. MIT has implemented some of the desired functionality in OpenPBS, and might be able to make it available to other interested sites.

Tools to simplify AFS administration were discussed, including:

AFS Balancer. A tool written by CMU to automate the process of balancing disk usage across all

86 Vol. 27, No. 5 ;login: servers in a cell. Available from storage are data volumes for workstation stopping backups altogether. Their ftp://ftp.andrew.cmu.edu/pub/ software (400GB) and volumes for implementation allows for full volume EPORTS

AFS-Tools/balance-1.1b.tar.gz. course Web sites and assignments restores as well as individual directory R Themis. Themis is KTH’s enhanced (100GB). and file restores. They have finished cod- version of the AFS tool “package,” ing this work and are in the process of AFS usage at Intel was also presented.

for updating files on local disk from testing and documenting it. ONFERENCE

Intel has been an AFS site since 1994. C a central AFS image. KTH’s They had bad experiences with the IBM Peter Honeyman of CITI at the Univer- enhancements include allowing the 3.5 Linux client; their experience with sity of Michigan spoke about work he deletion of files, simplifying the OpenAFS on Linux 2.4.x kernels has has proposed to replace Rx with RPC- process of adding a file, and allow- been much better. They use and are sat- SEC GSS in OpenAFS; this would allow ing the merging of multiple rule isfied with the OpenAFS IA64 Linux AFS to use a TCP-based transport mech- sets for determining which files are port. Intel has hundreds of OpenAFS anism, rather than the UDP-based Rx, updated. Themis is available from 1.2.3 and 1.2.4 clients in many produc- and possibly gain better congestion con- the Arla CVS repository. tion cells, accessing data stored on IBM trol, dynamic adaptation, and fragmen- Stanford was presented as an example of AFS file servers. They have not encoun- tation avoidance as a result. RPCSEC a large AFS site. Stanford’s AFS usage tered any interoperability issues. Intel GSS uses the GSSAPI to authenticate consists of approximately 1.4TB of data has some concerns about OpenAFS; they SUN ONC RPC. RPCSEC GSS is trans- in AFS, in the form of approximately would like to purchase commercial sup- port agnostic, provides strong security, is 100,000 volumes. 3.3TB of storage is port for OpenAFS and to see OpenAFS a developing Internet standard, and has available in their primary cell, ir.stan- support for HP-UX on both PA-RISC multiple open source implementations. ford.edu. Their file servers consist and Itanium hardware. HP-UX support Backward compatibility with existing entirely of Solaris machines running a is currently unavailable due to a specific AFS servers and clients is an important combination of Transarc 3.6 patch-level HP-UX header file being unavailable goal of this project. 
3 and OpenAFS 1.2.x, while their data- from HP; this may be available soon. base servers run OpenAFS 1.2.2. Their Intel has not yet committed to migrating cell consists of 25 file servers, using a their file servers to OpenAFS and are combination of EMC and Sun StorEdge unsure if they will do so without com- hardware. Stanford continues to use the mercial support. kaserver for their authentication infra- Backups are a traditional topic of dis- structure, with future plans to migrate cussion at AFS workshops, and this time entirely to an MIT Kerberos 5 KDC. was no exception. Many users complain Stanford has approximately 3400 clients that the traditional AFS backup tools on their campus, not including SLAC (“backup” and “butc”) are complex and (Stanford Linear Accelerator); approxi- difficult to automate, requiring many mately 2100 AFS clients from outside home-grown scripts and much user Stanford contact their cell every month. intervention for error recovery. An addi- Their supported clients are almost tional complaint was that the traditional entirely IBM AFS 3.6 clients, although AFS tools do not support file- or direc- they plan to release OpenAFS clients tory-level backups and restores; data soon. Stanford currently supports only must be backed up and restored at the UNIX clients. There is some on-campus volume level. presence of Windows clients, but Stan- Mitch Collinsworth of Cornell presented ford has never publicly released or sup- work to make AMANDA, the free ported it. They do intend to release and backup software from the University of support the MacOS X client in the near Maryland, suitable for AFS backups. future. 
Using AMANDA for AFS backups allows All Stanford students, faculty, and staff one to share AFS backup tapes with are assigned AFS home directories with non-AFS backups, easily run multiple a default quota of 50MB, for a total of backups in parallel, automate error approximately 550GB of user home recovery, and provide a robust degraded directories. Other significant uses of AFS mode that prevents tape errors from
