Linux 2.5 Kernel Developers Summit
SAN JOSE, CALIFORNIA
MARCH 30-31, 2001
Summarized by Rik Farrow
June 2001 ;login:

conference reports
This issue’s reports are on the Linux 2.5 Kernel Developers Summit.

OUR THANKS TO THE SUMMARIZER:
Rik Farrow, with thanks to La Monte Yarroll and Chris Mason for sharing their notes.

For additional information on the Linux 2.5 Kernel Developers Summit, see the following sites:
<http://lwn.net/2001/features/KernelSummit/>
<http://cgi.zdnet.com/slink?91362:12284618>
<http://www.osdn.com/conferences/kernel/>

The purpose of this workshop was to provide a forum for discussion of changes to be made in the 2.5 release of Linux (a trademark of Linus Torvalds). I assume that many people reading this will be familiar with Linux, and I will attempt to explain things that might be unfamiliar to others. That said, the odd-numbered releases, like 2.3 and now 2.5, are development releases where the intent is to try out new features or make large changes to the kernel. The even-numbered releases are considered the stable releases.

I got my first impression of the people attending the conference at the Thursday night reception. There was only one woman out of the 65 registered attendees, but other than that, this appeared very similar to any other USENIX event. The real difference is that few of these people were system administrators, and most were kernel hackers. I walked around asking people what their focus area in the kernel was. That question turned out to be hard to answer, even though I could make some guesses by looking at the Kernel Developer’s mailing list traffic summary: <http://kt.zork.net/kernel-traffic/latest.html>. For example, Rik van Riel, originally from the Netherlands but now working for Conectiva in Brazil, focuses on virtual memory and memory management (VM and MM). I also met one or two “kernel janitors,” programmers who clean up kernel code, remove defunct code, etc.

I was also pleased to discover that this meeting, put on by USENIX and OSDN (<http://www.osdn.com>) and sponsored by IBM, EMC, and AMD, was the first opportunity for many of these people to meet in person. Perhaps this is not so amazing given the distributed nature of Linux development, but I certainly thought that, in all of this time, someone would have brought this group together before.

Another difference appeared when the first session started on Friday morning. The conference room was set up with circular tables, each with power strips for laptops, and only a few attendees were not using a laptop. USENIX had provided Aeronet wireless setup via the hotel’s T1 link, and people were busy typing and compiling. Chris Mason of OSDN noticed that Dave Miller had written a utility to modulate the speed of the CPU fans based upon the temperature reading from his motherboard. Another person whipped up a quick program to test an assertion made by the first presenter about degraded performance in the 2.4 release.

REQUIREMENTS FOR A HIGH PERFORMANCE DATABASE
Lance Larsh, Oracle Corporation

If you thought that having big business make suggestions about improving the Linux kernel would be poorly received, you would be wrong. In fact, I could tell that attendees were mostly receptive, as they want Linux to become a better commercial OS platform.

Larsh began by explaining a little about how Oracle works. He also told us that databases like to do raw I/O, and that this is not much of a benefit in the current Linux implementation. For example, even if a continuous batch of sectors is to be written, the kernel breaks it up into smaller batches and adds a buffer header to each sector. Another problem had to do with the elevator algorithm, which sorts and merges requests based on their physical location on a hard drive (spindle). Another problem involved io_request_lock, a global lock that Larsh suggested should be per device unless global synchronization was really required.

Larsh also suggested that Linux do away with the elevator algorithm and let the hardware do the work. Linus Torvalds asked if Larsh had tried setting some elvtune parameter to one, and Larsh said he hadn’t. One thing that became clear to me was that most of the Linux kernel developers were software guys (something that Andre Hedrick really made a point of later). Modern hard drives reorder the physical location of tracks on the fly based on the current location of the heads, so using any elevator algorithm makes little sense.

Oracle also has problems with the memory model used by Linux. Some IA32 (Intel x86)-based systems can have in excess of 4GB of RAM, but Linux device drivers handle this by using a bounce buffer to copy data to a region below 1GB, losing performance to the copy. Asynchronous I/O was also a problem, as is the use of the O_SYNC and O_DSYNC flags. Quite a lively debate started at this point, with one participant saying that O_DSYNC was the default in 2.4.

Then Larsh dropped a bombshell. He reported that an SMP system with SCSI drives was 10 to 15 times slower, measured with iozone, in 2.4 than in 2.2. This effect does not show up with IDE drives.

Oracle would also like support for large page sizes. Richard Henderson said that this would also benefit scientific applications. This topic came up again on Saturday during van Riel’s presentation about MM. In general, Oracle wants things standardized across as many OS platforms as possible, in the same way, for example, that shared memory is.

This session went 17 minutes over schedule and ended with an exchange between Linus, Larsh, and Alan Cox. Linus suggested fast semaphores for scheduling rather than spin locks, and Cox suggested that the user space spin locks work like Mozilla. “What Mozilla is doing is counting how many times you spin on locks before giving up, assuming that some other process has been locked but is not scheduled.”

Ted Ts’o, who moderated the event, called a break at that point. Breaks were always 30 minutes, giving ample time for discussion.

SCTP
La Monte H.P. Yarroll, Motorola

La Monte Yarroll described a new protocol that will be peer to UDP and TCP (layer four for OSI fans). SCTP stands for the Stream Control Transmission Protocol (RFC2960) and has several design goals:

- Sequenced delivery of user messages within multiple streams
- Network-level fault tolerance through support of multi-homing at either or both ends of an association
- MTU set at layer four to prevent fragmentation at the IP layer
- Optional bundling of multiple messages within the same packet

SCTP (<http://www.cis.ohio-state.edu/cgibin/rfc/rfc2960.html>) is very long (134 pages), but Yarroll stated that there are already 24 implementations that inter-operate. Essentially, SCTP combines the reliability of TCP with the ability to send messages, even multiple message streams, over the same connection. It has some built-in fail-over because it supports the concept of multi-homing: that is, a single server can listen at multiple interfaces, and if one path quits responding, it can resume the same connection using a different interface and presumably another path.

Someone asked if SCTP can do load balancing, and Yarroll responded that this is an open research issue with no known solution. “First, slag one network, move over to another one, slag that one, move back, and so on,” quipped Yarroll. Someone else asked whether SCTP was any better than TCP in connection setup and breakdown. Yarroll reported that SCTP is somewhat better, as only the third packet used in setup can carry data, and that multiple streams can use the same connection. Also, there are no bitwise flags, and all options are word-aligned.

Someone else asked if there is any talk of moving part of the protocol into hardware. Yarroll answered, “It is a dream. There are a lot of properties that should make SCTP hardware implementable.” Ted Ts’o pointed out that fiber channels are very expensive, and SCSI over SCTP would be a viable option.

During the break, Stephen Tweedie, the next presenter, moved toward the front and Linus intercepted him at the table where I was sitting. Soon, Ben LaHaise joined in a spirited discussion about zero copy writes. Zero copy writes avoid the performance hit of a memory to memory copy, and Linus shared his skepticism about how it is being implemented. My impression was of a professor with not a lot of seniority arguing with his grad students and other professors. At one point, Linus said something that I thought was very revealing: “We don’t want to wind up like Windows NT with lots of subtle bugs. We want stability over performance.”

BLOCK DEVICE LAYER
Stephen Tweedie, RedHat

Ts’o introduced this session by joking that it was completely uncontroversial and not relevant to the kernel. This was a fitting beginning.

Tweedie began by discussing scalability issues. These include large numbers of devices, large devices (current 2TB limit), 512-byte block size being a problem for large disks, and related SCSI issues, like large numbers of logical units. He jumped next to robustness. Currently, information from the SCSI layer does not pass to higher layers, so a one-bit error could result in a RAID disk being taken offline.

Broaching the issue of under-performance, Tweedie stated that the kernel cannot pass in single I/Os that are con-