THE MAGAZINE OF USENIX & SAGE February 2001 • volume 26 • number 1 { inside: CONFERENCE REPORTS#

& The Advanced Computing Systems Association & The System Administrators Guild conference reports

This issue’s reports are on the 4th 4th Annual Showcase Coar’s description of the HTTP project Annual Showcase & Conference, & Conference, Atlanta was more illuminating, involving XML (ALS 2000). and Djakarta (=Apache-Java). ATLANTA, GEORGIA, USA OUR THANKS TO THE SUMMARIZERS: OCTOBER 10–14, 2000 FOR ALS: Peter Salus GOODBYE, ATLANTA Vikram V.Asrani by Peter H. Salus Laurel Fan Thomas Naughton The Atlanta Linux Showcase was founded Zhedong Yu in 1996 by the Atlanta Linux Enthusiasts, which had been founded in December 1994. It grew too fast and this past Octo- ber was run by the USENIX Association, rather than by the amateurs that had made it such a success in 1997, 1998, and Ken Coar 1999. Last year there was an interesting ALS I was the dinner entertainment in 1998 session on Beowulf, so, on Friday, I trot- and in 1999, so I guess I can get away ted along to hear Jim Reese (of Google) with that rather dour beginning. talk about Linux clustering. He referred I enjoyed myself at ALS 4, but it wasn’t to his talk as “Scaling the Web: An the same. And it will be yet further trans- Overview of Google, A Linux Cluster for formed in 2001, when it moves from Fun and Profit.” Atlanta to Oakland. ALS 4, in fact, was Google has waxed tremendously. In June the Annual Linux Showcase, no longer 1999 they had 500 CPU and half a mil- Atlanta....Sic transit gloria mundi. lion hits daily. The show floor, as expected, was bigger In October and better. There were plenty of really 2000 (16 fine folks to talk to. Several of the invited months later) talks (which I’ll get to in a paragraph or it was 6000 so) were very interesting. But the tone CPU and 50M was different and it will be more different hits/day. in Oakland. Google uses USENIX for nearly a decade was an ama- Jim Reese cheap, off-the- teur organization. It has changed. But shelf, PC hard- even though I can generate nostalgia, the ware in which they replicate everything change and professionalization have been on there of Exodus’ sites: Santa Clara and good. Sunnyvale (which are connected by OC12 lines) and Herndon, VA (which is Ken Coar, Director and VP of the Apache connected by 2Gb lines to the West Software Foundation, spoke about life, Coast. They run 80 machine clusters with Apache, and Open Source development four 1Gb uplinks per cluster. All 100% on Thursday, 12 October. His descrip- Linux. tions of how Apache development works, what’s hot right now, and software licens- They get 100 queries/sec.; have 500TB of ing were interesting, but somehow just storage; and tens of GB/sec of I/O. didn’t fire up the audience (nor me). It Google runs redundancy like crazy. I may have been the general “preaching to wrote down lots more, but it was very the choir” aspect. He did remark that the impressive. Apache license was “BSD-ish.”

February 2001 ;login: 5 HACK LINUX TRACK than one RT signal to be sent at a time, hash tables. Lever wanted to know how similar to poll(). This can increase perfor- the performance of these hash tables REFEREED PAPERS mance by decreasing the number of sys- scales when moving from smaller mem- SESSION: KERNEL PERFORMANCE tem calls and the number of passes ory systems to machines with large physi- Summarized by Laurel Fan through the RT signal queue. cal memory. He cited an example in which a particular hash function unex- To test the performance of sigtimedwait4, ANALYZING THE OVERLOAD BEHAVIOR OF A pectedly broke down (placed all items in phhttpd was modified to use this and SIMPLE WEB SERVER only a few buckets) when adding a large compared to the unmodified phhttpd. number of items to the table. Niels Provos, University of Michigan; The overload behavior, the behavior of Chuck Lever, AOL-Netscape; Stephen the server under the load of a large num- Can one increase the size of a given hash Tweedie, Red Hat ber of clients, was examined. table and expect the hash function to Niels Provos analyzed phhttpd, a static continue to work as designed? Lever After a certain request rate, the perfor- Web server using a few different signal examined the performance of hash tables mance of phhttpd, as measured by the handling techniques. used by the page cache, buffer cache, reply rate, decreases dramatically. They directory entry cache, and inode cache in Signal-driven I/O is traditionally done found that merely switching to sigtimed- the Linux 2.2.5 kernel. Using kernel with the signal SIGIO, which is raised wait4() gave only a small improvement, instrumentation and the SPEC SDM when I/O events (such as data received, showing that the system call and signal benchmark, he was able to measure hash data sent, connection closed) occur. With handling are only a small part of the table behavior while controlling the the siginfo_t struct and sigwaitinfo() problem. syscall, information about what type of offered load on the test system. Another benefit of sigtimedwait4() is the event occurred causes the signal. How- Experimental results showed that the additional information available. When a ever, this doesn’t work well with servers hash function works well in the page, server is overloaded, it is too busy accept- with multiple connections, since there is inode, and dentry caches, and scales well ing new requests to take care of the old no information about which socket was as hash table sizes increase. The buffer ones. With sigtimedwait4(), the server can involved. cache hash function was not sufficient to detect when it is being overloaded and randomize the key, and long hash chains POSIX Real-Time signals (RT signals) are drop new connections in favor of com- resulted. All hash tables in the 2.2.5 ker- an improvement over SIGIO in many pleting old requests. ways. First, they allow a signal to be asso- nel were too small for large memory sys- ciated with a file descriptor. Second, sig- With this new enhancement, the reply tems. The inode hash table was so small nals are queued in the kernel, allowing rate no longer showed the steep decline that the hash chains averaged more than true event-driven applications. However, when in overload. Instead, the reply rate 200 entries each. leveled off and then decreased slowly. the queue is fixed size and can overflow, Later versions of the imple- Another interesting result was that drop- in which case it falls back to SIGIO until ment a dynamic hash table size for these ping connections actually decreased the the application clears the queue. The fall- caches, based on the size of a machine’s error rate. back is invoked by a SIGIO signal; the physical memory. Lever believes that in recovery process, however, uses poll() or Further information is available at most situations, the table size generated select(). . by the Linux kernel is appropriate for phhttpd is a static content Web server that good performance. uses RT signals. It is multi-threaded, with LINUX KERNEL HASH TABLE BEHAVIOR: ANALYSIS AND IMPROVEMENTS Lever then spoke about different hash multiple threads that each use sigwaitinfo function types. Modulus hash functions Chuck Lever, AOL Netscape to process events one at a time. Load bal- are generally expensive because they Chuck Lever started work on this project ancing is done by reassigning the listener require a division operation. Table-driven while working on the Linux Scalability socket to the next thread every time a hash functions are not practical because Project. Hash tables are a commonly used connection is accepted. (This is possible memory operations to read the tables are data structure in the Linux kernel because threads in Linux have unique more expensive than computation on because of their fast average insertion pids.) modern CPUs. Shift-add functions suf- and look-up times, compared with lists fice in most cases but should be checked The authors implemented a new system and trees. Performance of the kernel with real data prior to use. Multiplicative call, sigtimedwait4(), which allows more depends on the performance of these hash functions are mostly a reasonable

6 Vol. 26, No. 1 ;login: February 2001 either onthereference for blocks interval based titions isadjusted dynamically, thetwo par- SA-WWR scheme thesize of In the self-adjusted room-size scheme. Jeon’s solutionto thiswasto usea mal. thewaitingroom mini- keep thesize of itisessentialto Thus, erenced blocks. due to thesmallercache size holdingref- bly result performance inadeteriorated thecache could partitioning possi- small, Since cache sizes are prefetched blocks. the waitingroom isusedforthe while to beplaced room, intheweighing All referenced blocksare a waitingroom. room aweighing and into two partitions: Jeon thenproposed splittingthecache blocks prefetched never beused. might the itsdeficiencyisthat60%of However, can improve performance by upto 80%. ahead) policyissimpleandeffective and TheLRU-OBL (oneblocklook ahead. minded approach usingaoneblocklook andasimple- hint-based approach, a areference approach, history-based ing: policieswhich prefetch- incorporated of He mentionedthethree groups erature. replacement inthelit- policiesdescribed Jeon spoke aboutthevarious overview, As an into thereplacement policy. latency prefetching whileincorporating cache managementscheme to reduce I/O hisproposeddescribed dynamic buffer SeokJeon H. In thispresentation, University H. SeokJeonandSamNoh,Hong-Ik P S D isavailable information Further at functions dependsontheinputdataset. hash Performance of vant helps. Keeping cache size smallandrele- bility. dynamic cache sizes provide goodscala- Lever mentionedthat In conclusion, the inputdata. modulus hashfunctionsatrandomizing andare asgood generally instructions, choice require becausethey onlyafew CHEME REFETCHING YNAMIC B ASED B UFFER ;login: O N C ACHE S IMPLE M A ANAGEMENT ND A GGRESSIVE . 4 lne.Thiswindow-level composting blended. andtheoccludingregion canbe region Thus theoccluded occluding windows. alpha valuesforpixelsassigning in waya general to solve theproblem by talked about translucency. wayused inageneral to dealwith But could they notbe environments. incontrolledthe non-opaquewindows There are many techniques to simulate arewindows always completely opaque. But theoverlapping lapping happens. andwhichare are visible notwhenover- each window of defines which portions thecore protocol In X Window System, T byZhedong Summarized Yu S that the explaining by results, theexperimental at theimproved performance obtained expressing hissurprise , there wasaquestionby At theend, performance. SA-WWR doesprovide animproved bound processes before concluding that ered CPU- multiple performances with He alsoconsid- policy forreplacement. replacement policyandtheLRU-OBL mance ascompared to boththeLinux scheme provided animproved perfor- results by Jeon show thattheSA-WWR Experimental added aFIFOwaitqueue. replacement policyandinwhich he function inLinux to usetheSA-WWR tion inwhich hemodifiedthe Jeon thenspoke abouttheimplementa- and i+1. blocksi-1,i depending onthepositionsof are thetwo modified partitions sizes of Jeon explained how the ing thecases, Enumerat- or onthecache missstatistics. Inc. Keith Packard, Xfree86 Core Team, SuSE senter at contact thepre- For more information, function usedforsequentialfileaccess. ESSION HANA IU SHOWCASE LINUX ANNUAL TH RANSLUCENT : XF bread() REE W 86 NOSIN INDOWS function wasnotthe X & CONFERENCE bread() . Associates Adam Thornton,SineNomine . the XFree86 project at More canbeobtainedfrom information to implementadriver.is straightforward it Since XFree86-4.x iswell documented, modules aswell. architecture of canshare thesametype Thus alltheOSsonsamehardware thefullXserver. provided instead of new driver (orextension) moduleto be justthemodified/ allowing extensible, was introduced to make XFree86 more “Module Loading Architecture” Then, foroneOSDevice one driver binary problem andthelogistical of compatible; PCshouldbe VGA- card of the video theproblematic assumptionthat mented; waslargelyuntouched andundocu- part X(ddx) thedevice-dependent document; areal design lackof ous XFree86 design: previ- Cutshaw analyzed theproblems of Hohndel Dirk andRobin In theirpaper, and extensions forXFree86. itandknow howwith to develop drivers it’s to befamiliar important very systems, X Window System forPCUNIX tation of Since XFree86 isthestandard implemen- Cutshaw, Intercore Dirk Hohndel,SuSELinusAG;Robin XF D application development. begreatlyhelpfulfor extension will oueterlaiiyadIOpwro the to andI/Opower usethereliability of would UNIX expertise allow userswith Running Linux onit such as Web serving. making itsuited for tasks CPU power, than rather Its strength isinI/O, frame. The System/390 isIBM’s largestmain- L byLaurelSummarized Fan S NXO THE ON INUX ESSION EVELOPING REE 86-4. : K ERNEL X D S IESAND RIVERS YSTEM P ORTS /390 E TNIN FOR XTENSIONS 7 CONFERENCE REPORTS S/390 without dealing with its less desir- All devices are virtual, and most are major improvement. The ABI, applica- able characteristics, such as EBCDIC. implemented in terms of user-level tion binary interface, defines how the objects. For example, disks are imple- object code is laid out. Because the C++ One interesting feature is VM, Virtual mented as files, and terminals are imple- ABI has changed between GCC releases Machine. This allows a single S/390 to mented as xterms or ptys. The virtual in the past, libraries built with different run multiple virtual S/390s, each running processor is implemented with the ker- versions of GCC are incompatible. A sta- any OS, such as Linux. Using VM and nel’s arch interface. ble ABI for 3.0 and subsequent releases Linux S/390, a large virtual server farm will make distribution of both free and can be implemented on a single machine. Processes in user-mode Linux run as proprietary C++ libraries easier. 41,400 simultaneous copies of Linux user-mode processes in the host kernel. have been run in a test environment, and Syscalls are implemented using a tracing Many other C++ improvements have 3,700 copies have been run in produc- thread which intercepts system calls and also been made. Mangled names, espe- tion. redirects them to the user mode kernel. cially for complex templates, are much shorter, resulting in smaller object files. This has many practical purposes. Multi- A port to user mode presents many chal- Virtual bases are handled more effi- ple different versions of Linux can be run lenges and design problems, which Jeff ciently. A new, more standards-compliant for testing or academic purposes, without Dike addressed in his talk, such as con- C++ standard library will be included. the hassle or expense of multiple text switching and virtual memory. machines. ISPs or other service providers The infrastructure of the compiler itself A user-mode kernel has many applica- can give each of their customers their has also been worked on. Better internal tions. For example, it can be used as a own virtual Linux server, without worry- memory management allows GCC to use sandbox for untrusted code, debugging, ing about customers affecting each other. less memory. Many improvements, such and as a Linux binary compatibility layer A single S/390 can have a lower total cost as creating a parse tree for a whole func- for othere OSes. of ownership than the equivalent num- tion and using flow graphs to allow ber of stand-alone servers. While user-mode Linux is quite func- global optimization, will enable better tional, supporting kernel modules, X optimization techniques. Porting to the S/390 presented several clients, and networking, some work still unique issues, both because of the num- For more information, see needs to be done, such as SMP support, ber of virtual machines and because of . privileged instruction emulation, and the S/390’s unique architecture. One nesting. issue was the timer interrupt. In Linux, a SMP SCALABILITY COMPARISONS OF LINUX timer interrupt fires 100 times every sec- For more information, see KERNELS 2.2.14 AND 2.3.99 ond by default. This can decrease perfor- . Ray Bryant, Bill Hartner, Qi He, and mance significantly with many virtual Ganesh Venkitachalam, IBM Linux Tech- machines. Their current solution is to SESSION: POTPOURRI nology Center decrease the frequency of the interrupt, Summarized by Laurel Fan This study compared the SMP scalability which has the unfortunate effect of (the performance gain from adding more GCC 3.0: THE STATE OF THE SOURCE decreasing the responsiveness of interac- processors) of Linux kernel versions tive applications. A good solution to this Mark Mitchell and Alexander Samuel, 2.2.14 and 2.3.99. At the time of the would be for a virtual kernel to disable CodeSourcery, LLC study, these were, respectively, the newest timer interrupts when idle, and later stable version and the newest develop- restore its time from the host kernels. GCC, the GNU Compiler Collection, is ment version, which should have similar the primary compiler for GNU/Linux performance characteristics to the 2.4 For more information, see and an important part of the system. series. . The next major release, GCC 3.0, will include many improvements, and will Four benchmarks were used: Volano- A USER-MODE PORTOFTHELINUX KERNEL have a more rigorous quality assurance mark, Netperf, FSCache, and SPEC- Jeff Dike process. web99. Volanomark is a chat-room server User-mode Linux is a port of the Linux and client written in Java that makes kernel that itself runs on Linux. It is a full One major improvement is a standard- extensive use of threads and measures Linux kernel, but instead of running on ized C++. ABIGCC 3.0, the next major scheduler and TCP/IP stack perfor- hardware, it runs on a host kernel. release of the GNU Compiler Collection, mance. Netperf measures network will include a standardized C++ ABI, a performance. FSCache measures the

8 Vol. 26, No. 1 ;login: February 2001 open such as Controlled system calls, process. tem callisinvoked by aprivileged Database (ACD) whenacontrolled sys- making acheck onan Access Control The technique here described involves set up. –andiseasyto mal performance penalty applicationsandmini- recompilation of on thekernel nochange –requiring or create asolutionthathasminimalimpact thisapproach wasto The objective of bypassed. orcanbe applications, legitimate break applicationcode, modification of require and usinganon-executable stack, such asaddingboundschecking tions, Existingsolu- systems. other operating problem onLinux and major security a bufferoverflow orsimilarattackisa Subverting applicationsusing privileged Roma “LaSapienza,”Italy and LuigiV. Mancini,Universitadi del Calcolo,Italy;EmanueleGabrielli Massimo Bernaschi,IstitutoApplicazioni A B E Summarized byLaurelFan S SMP performance thanthe2.2kernels. upcoming 2.4kernels shouldhave better the Consequently, SMP scalability. kernels showed increase asignificant in the2.3.99 thesebenchmarks, On allof accessing staticanddynamic content. test a Web by server simulating clients SPECweb99 to isabenchmark designed thefilesystem cache. performance of e al o xml,the ACD for entry For example, tem call. the ACD each controlled with varies sys- contained Theinformation in examined. system callis forthatparticular entry the ACD When asystem callisinvoked, the system. an attacker could useto control gain of HNEET OTHE TO NHANCEMENTS ESSION LOCKING TTACKS , execve : S ECURITY B UFFER and , ;login: -O chmod VERFLOW L INUX are thosethat , -B K RE FOR ERNEL ASED A callto date andsize) abouttheexecutable files. stored (such aslastmodified information o oeifrain see For more information, buffer-overflow-based attacks. tohas beenshown protect againstseveral approach This and never inuser mode. onlyforthecontrolled system calls often: isnotaccessedthe newfunctionality performance impactislimited because 4 isstored information This permissions. and types, about domains, information andwhich contains read onbootup, policy iscontained inafilewhich isread TheDTE DTEforLinux kernel 2.3.38. of Hallyn talked about hisimplementation domains. sion to execute orchange system binaries into adomainthatdoesnothave permis- point anentry making theftpdbinary vented uparoot from shellby giving anftpdaemoncanbepre- For example, point. entry matically by executing afiledefinedasan can change domainsexplicitly orauto- Aprocess domains). andchanging signals files)andbetween domains(sending /etc. read/write/execute (processes of types Access iscontrolled from domainsto “types.” belong to andfilesbelongto “domains,” Processes the system from user. atrusted access control forprotecting method of Domain andType Enforcement isa of William andMary Serge E.HallynandPhilKearns,College D execve process ispermitted to which executable fileseach privileged h C,andachange to the ACD, anewcommand to manage kernel patch, This feature asmall isimplemented with since the ACD wascreated. entry orthetargetfilehasbeenmodified file, does nothave to permission execute the HANA IU SHOWCASE LINUX ANNUAL TH MI AND OMAIN contains about information execve T YPE al feithertheprocess fails if E FREETFOR NFORCEMENT execve chmod & CONFERENCE and , .The L INUX . < see For more information, be relatively insignificant. thisshould which executing filesisrare, in workloads, butfornormal slightly, DTE to thekernel slows performance Adding addingDTEto thekernel. with There performance impact isaslight done. domain (by calling P to (by access calling atype andwhenaprocess attempts in memory, itself. orto anadministrator to take alert action either analyze andfindattacks, thelogs toolsto detection to provide intrusion Piranha’s solutionto thisis read through. forahuman to information unimportant become too longandcontain too much Another problem isthattheauditlogcan additional password. would require bothroot access andan binaries, as theauditlogandPiranha such files, task oreditingsuch important themonitoring such assignaling tasks, some With Piranha, tem inthekernel. is to take steps to protect sys- thelogging Piranha Audit’s solutionto this intrusion. canbealtered thelogs to hidethe tion, tems isthatinaroot compromise situa- sys- existingOne problem logging with meet thoserequirements. Piranha Audit isanattempt to tems. forsecure sys- criteria asetof TCSEC, Auditing in isdescribed in thefuture. towhat doto prevent itfrom happening auditdatacanhelpyou decide ceeds, theattacksuc- Even if be ableto stop it. you might an attackwhenit’s happening, you candetect If system security. of part Auditing isanimportant andlogging Catania, Italy Francesco Pappalardo, Universityof Vincenzo Cutello,Emilio Mastriani,and U domains (with domains (with IRANHA http://www.cs.wm.edu/~hallyn/dte IIISTO TILITIES A UDIT I MPROVE :K execve ERNEL signal A UDIT ,aDTEcheck is ), E HNEET AND NHANCEMENTS ,orchange ), /L open OGGING ,access a ), >. 9 CONFERENCE REPORTS Testing has shown that Piranha Audit can lock the statistics structure, since it is they currently are working with, making detect and prevent attacks, and does not only updated by one CPU. note of the various manufacturers as well cause excessive performance degradation. as multiple BIOSes. These included Pan- In short, Bryant believes that the instru- cake: 36 Compaq Photon nodes, Rock- mented code should study the original SESSION: KERNEL PERFORMANCE II hopper: 128 Intel L440 GX+ SMP, and problem and not deviate to examining Sarnoff: 161 various types. He explained LOCKMETER: HIGHLY INFORMATIVE INSTRU- the instrumented problem. MENTATION FOR SPIN LOCKS IN THE LINUX the current quandary regarding BIOS KERNEL Bryant then described their solution: the and the lack of a sufficient standard. The Lockmeter, which is a set of instru- major vendors like Intel, DEC, Compaq, Ray Bryant, IBM Linux Technology Center; John Hawkes, SGI mented spin-lock routines providing cer- Dell, all have different BIOSes and the tain lock usage statistics on a per call availability of specifications also varies. Summarized by Vikram V. Asrani basis. He described the implementation Since no standard is present he discussed Ray Bryant introduced spin locks as the of both spin locks as well as the rwlocks a few options, one being to use a free low-level synchronization primitives in using the idea of saving a hash index in BIOS, but these currently lack the neces- the SMP (symmetric multiprocessing) some field in the lock structure. Bryant sary maturity to make this a realistic system. The two types of spin locks are showed a large set of useful statistics pro- option. Minnich also noted that the pro- spinlock_t and rwlock_t, the latter sup- vided by the Lockmeter. In addition, the posed Intel standard PXE leaves much to porting multiple read/write access. These authors had also examined the overheads be desired (or reduced, given the over- locks are operated upon using macros. introduced by this instrumentation. This sized technical docs). In light of these Bryant believes that the primary reason instrumentation increased system time issues he asked the question, “Can we get for the use of Linux in the market is so up to 20% and system throughput up to out of the BIOS mess?” Can Linux cold that it can be used as a server operating 14% (because of the larger set of instruc- boot Linux? As it turns out, the answer is system. However, vendor systems in the tions to be executed). Bryant mentioned yes. They can build a 32K hardware variants of UNIX provide a better perfor- that they have examined the problem and startup program and then unzip the ker- mance for SMP. Thus there is a need to have gotten some results. They now need nel. They can thus gain control of the improve Linux SMP performance. to use the results in order to examine machine from power on instead of hav- why the SMP performance does not Bryant described path length and lock ing to deal with intermediate software. scale; they will be continuing work on contention as the two main issues deter- this. The current version of Lockmeter The key question for LinuxBIOS was – mining SMP performance. Path length is available from Why? The ability to gain control over can be examined using profiling tools. .An previously BIOS-managed matters offers However, measuring lock contention was updated version of the Lockmeter paper several attractive options, such as allow- a harder task. can be found at ing log buffers to survive for diagnostic The reasons are as follows: Linux has a or information where they usually get fast implementation for locks. Gathering . zeroed out by default. Possibly the most statistical information can potentially stunning point was his demonstration of increase the overheads, and one wants to EXTREME LINUX TRACK booting to a single user in 3 to 5 seconds

keep this overhead minimal. Since the SESSION: POTPOURRI and approximately 10 seconds to reboot lock structures have been specifically SMP.Also impressive was the fact that Summarized by Thomas Naughton designed to optimize their performance there is no proprietary license and no in the presence of a cache, one cannot more hangs for hit , etc. And on a THE LINUX BIOS increase the size of the lock structure. purely geeky note, they are able to one- Ronald G. Minnich, James Hendricks, up the DEC BIOS’s “tinky Yellow Rose of One way to reduce the overhead of lock and Dale Webster, Los Alamos National Texas” by being able to play an MP3 – instrumentation is to store all lock statis- Laboratories from the BIOS! tics in per-CPU data structures. This has The session chair, Donald Becker, intro- the advantage of not introducing addi- duced Ron Minnich and mentioned that The LinuxBIOS is working on select tional cache traffic between processors he was working with clusters when they Intel, SiS, and VIA boards. It is not cur- that would occur if there were a single (Becker, et al.) began the Beowulf project rently working on Acer and RCC lock statistics structure shared among at CESDIS in the early 1990s. Minnich (Dell/Compaq use this), mainly due to a CPUs. Additionally, there is no need to briefly introduced several of the clusters lack of available details from RCC. But

10 Vol. 26, No. 1 ;login: February 2001 caused themto investigate this new cluster usingotherpopular topologies the costs forconnecting the64nodesof The thiscluster. developed foruse with new network architecture have they Thepresentation discussedthe results. interesting offerssomevery Testbed 2), KLAT2 (Kentucky Linuxtucky, Athlon Ken- The newcluster attheUniversity of of Kentucky Hank DietzandTim Mattox,University KLAT2’ can beobtainedfrom aboutLinuxBIOS information Further tions fortheSiS port. thecode contribu- has beenthecasewith as manufacturers insupport participate theirendgoalisto have Also, reasonable. boards shouldbe asubsetof of support andhence arethey targetingclusters, Minnich explained that from vendors. mainboards the ever-growing number of for inmaintainingsupport the difficulty Another pointthatwasmentioned uponreboot whendesirable. ing memory could easilybehandledby possiblyzero- mentioned andnoted assomethingthat for multi-user environments wasbriefly between reboots security Theissueof nel. FreeBSD venture to move thisto theker- possiblygotheroutemight usedby a He noted thatthey Power Management. Currently for there isnosupport devices. BIOS could beusedforembedded He alsoconfirmed thatLinux Intels.” “Five Minnich’s response, toasted?” “How many boards haveaptly, you The firstquestionfrom theaudience was same way. bootthe will vendor, regardless of node, andevery ordisk; CD-ROM, floppy, don’t needworking ior like you want; nodebehav- Band-Aids forBIOSdefects; no nodefrom own startup; lowing: mentionedthefol- The closingsummary information. to helpwith Dell andCompaq are trying S F LAT ;login: N EIGHBORHOOD N ETWORK ...all . obtained $2.75/MFLOPS and (They tional FluidDynamics) code. for theirresults onafullCFD(Computa- Gordon BellPrice/Performance award are They currently afinalistfor FNN. mance results from KLAT2 the with have price-to-perfor- seensignificant DietzandMattox network was~$8,100. The total cost forthe64-nodeKLAT2 cases. FNNs appearto besufficientformost however of thedefaultproperties straints, optimize foraspecificprogram’s con- TheGA can NICs). same number of NICsare used(notallnodesneedthe of the networks sothataminimum number GAisalsousedto The optimize node. routing tablesthatare usedforeach aswell as the diagram coded wiring outputfrom The theGAisacolor- cally. canbeproduced automati-and wiring optimize thenetwork sothattherouting TheGAisusedto nection network. theintercon- andconstruction of uration theconfig- (GA)to assistwith Algorithm aGenetic prompted themto make useof multiple NICsandswitches with inconfiguring manyThe difficulty nodes performance. while stillmaintaininglow cost andhigh several switches forfullconnectivity node to beusedto attach thenodesto allows This multiple NICsper topology. Flat Neighborhood Network (FNN) the promptedswitch thedevelopment of inexpensiveing allthenodesonasingle connect- of infeasibility The architecture. 4 HANA IU SHOWCASE LINUX ANNUAL TH Keynote SpeakerLarryWall and TheodoreTs’o & CONFERENCE oto thesite. root of otherrelated atthe with information obtained at aboutFNNcanbe information Further hardware. and Mattox use generally saidthey network cableswasasked,locating faulty Afinalquestionabout week. do every butit’s notsomethingyou wantto time, needed inareasonably smallamountof if canrecable everything They GA). the rerouting through software (from the donothavethey to do recable butrather elicited theinteresting pointthatoften Aquestionaboutcabling all nodes. to getto try they exceeding cache if arp also mentionedthatthere are issueswith and theswitch isactingasa He “subnet.” that currently each NIChasadifferent IP andMattox explained NICs wasasked, it? A questionregarding IPaddresses/ of why notmake use NICs are already there; The response wasthattherouting and and thisappearsto beacost tradeoff. the switch hasbeenmoved to thenode, ence membercommented thatsomeof Another audi- mostcommodity PCs. of would exceed thecurrent PCIbandwidth more thanfourorfive NICsatonce using Also, alreadywhich they manage. otherthanconnectivity, much payoff theresponse wasthatthere isnot per PC; NICs increasing the number of issue of theaudience the raised A memberof figuration/design. theGAthatisusedforcon-demonstrate including aCGIthatcanbeusedto their Web FNNs, site forworking with offerseveral tools They at tion network. reductionprice forasoundinterconnec- mance increase aswell asasignificant that FNNsofferasubstantialperfor- Mattox’s concluding remarks pointed out respectively.) precision, double andsingle $1.86/MFLOPS price/performance for and ifconfig to network locate faulty ping 11 CONFERENCE REPORTS SESSION: SYSTEMS that this was indeed a problem. The Evard then asked the panelists, “Is the server daemon simply hangs and other cluster architecture dependent on the THE PORTABLE BATCH SCHEDULER AND THE mechanisms need to be used to restart. computing model in the system adminis- MAUI SCHEDULER ON LINUX CLUSTERS Another person from the audience asked tration solution? If yes, how should it be Brett Bode, David M. Halstead, Ricky about the ability to perform progress changed?” One of the panelists answered Kendall, and Zhou Lei, Ames Labora- migration on clusters. Bode responded that it was a matter of getting the tools to tory; David Jackson, Maui High Perfor- that no such mechanism existed on PBS. work with the clusters, and the tools mance Computing Center To a question on what happens when the (rather than the clusters) needed to be Summarized by Vikram V. Asrani user’s processor utilization time has tweaked. Another panelist asked whether At the start of this talk, Brett Bode intro- reached the allotted amount, he the same set of tools could be used for duced batch systems. He spoke about the answered jobs are killed. In addition, PBS clusters with 32 nodes and clusters with two classes of parallel aware schedulers: generates a signal five minutes before the more than 32 nodes. The same set of namely, cycle stealers and dedicated sys- deadline. On PBS, users can also alter job APIs should exist, was the opinion of one tem schedulers. The Portable Batch request times. panel member. Coghlan mentioned that Scheduler (PBS) is one example of a ded- the tools required to manage small and icated system scheduler. Schedulers must PANEL: HAS CLUSTER large clusters are bound to be different be stable, portable, and should provide ADMINISTRATION BEEN SOLVED? since the complexity lies essentially in the efficient resource management. In his Moderator: Rémy Evard tools. Lindahl found out from the audi- opinion, PBS is probably the most com- Participants: Susan Coghlan, Turbo ence that only ~20% of the audience ran monly used and probably the best solu- Linux; Richard Ferri, IBM; Brian Finley, clusters with more than 64 nodes. VA Linux; Greg Lindahl, HPTi; John-Paul tion available. The Maui scheduler was Evard posed the next set of questions: originally used on HP systems and per- Navarro, Argonne National Laboratory; Lee Ward, Sandia National Laboratory; “What are the biggest scaling issues in formed well. The PBS scheduler is a and Stephen Scor, Oak Ridge National system administration? What scaling FIFO-like scheduler, scheduling jobs in a Laboratory problems have the panelists run into? FIFO order, except when the first task in Summarized by Vikram V. Asrani Why do the panelists consider clusters the FIFO queue is blocked by another with more than 64 nodes large? Where task. The PBS system prevents starvation The organizations represented on this do large clusters stress the existing tools?” using a starving job mechanism. panel are working on clusters of various Answering the question on scalability, sizes ranging up to 1,500 node clusters. Bode then provided an overview of the Lindahl clarified that the use of a data- Maui scheduler on Linux clusters. It is The first question posed by moderator base for cluster administration was a fully parallel aware, as it knows about the Evard to the panelists was, “Why does gross mistake. Brian Finley pointed out attributes, memory, and utilization of everyone have their own cluster solu- that tools to automate tasks was one of each node. It is a time-based reservation tions? Will we ever reach a state when a the main scaling issues. system, and idle nodes are back-filled single common solution will exist?” Greg The next question to the panelists was, with small jobs. Bode then described the Lindahl responded that since people have “What is the right community approach scheduler test for the PBS and Maui different specific requirements, they for cluster administration? Should the scheduler on a 64-node cluster of Pen- develop specific solutions. Another pan- plan be to (a) depend on vendors to pro- tium Pros. The simulation profile con- elist concurred, adding that clusters with vide solutions (which may be proprietary sisted of large, medium, and small debug more than 64 nodes had specific require- or otherwise)? (b) converge on a set of and failed tasks. The results with backfill ments and, hence, vendors developed tools that everyone else uses (presumably turned off showed that the Maui sched- their own solutions. One of the panelists open source)? (c) try to maintain a good uler provides a better processor usage. thought that it was essential to come up mix of solutions, keeping them environ- The Maui scheduler required five hours with a common solution. Susan Coghlan ment-rich in competitive variability? or less to complete the tasks for which a provided an analogy for this problem (d) continue to build their own solu- sequential execution processor took with enterprise management. She said tions?” As before, one panel member sug- between 90 to 100 hours. that one required flexible tools to meet gested the development of a standard API everybody’s needs. However, the present- In response to a question on what hap- set. However, some of the panel members day tools did not even do all that they pens when a node fails, Bode informed were in favor of the open source set of were supposed to. us that PBS does not restart the node and tools licensed under GPL. Finley sug- gested that if a tool breaks in a particular

12 Vol. 26, No. 1 ;login: February 2001 struction isheavily customer dependent. struction Tool con- and whatyour customer says. how much your customer wants, spend, on how much you money to are willing saidthatitalldepends Finley However, gested active management. diagnostic Onepanelmembersug- you cannotdo?” theexisting sysadmintools that do with “What doyou you wish could bers was, The finalquestionputto thepanelmem- the future. cited thisassomethingto work toward in who by theaudience, ion wassupported His opin- soonexist. that thismarket will suggested Finley provide specifictools. there isnomarket forcluster vendors to Lindahlsuggested thatcurrently efits. ben- one canfixitandeverybody source, theopen with then, users environment, Valerie CoxandVe MartinofALS Video game roominaction Illiad signinghisbookfor ;login: his fans 4 HANA IU SHOWCASE LINUX ANNUAL TH Ted T'sogivingBestPaperAward to Robert Ross There areOldFartsevenatALS. & CONFERENCE 13 CONFERENCE REPORTS