example, increasing data protection can harm performance or increase purchase cost. Whereas the existing practice is to consult an area expert, John Strunk spoke on how utility functions can convey the cost-benefit structure to an automated provisioning tool. Users are then able to make appropriate trade-offs among various system metrics.

Strunk et al. use utility functions, functions from a set of metrics (e.g., revenue, availability, data value, power usage, or purchase cost) to a utility value (e.g., dollars), to characterize a particular point in the purchase space. To find a desirable point in this (large) space they use a genetic algorithm to refine a configuration population over many generations. Strunk then illustrated the value of this approach through three case studies, including scenarios with a limited budget and where system cost can affect the long-term solution.
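Conceptually, such a utility function just collapses several system metrics into a single dollar value that a search procedure (here, a genetic algorithm) can use to rank candidate configurations. The following is only a toy illustration of that idea, not Strunk's model; the metrics, weights, and penalty shape are invented:

```c
#include <stdio.h>

/* Hypothetical configuration metrics (illustrative only). */
struct config_metrics {
    double revenue_per_year;     /* $ earned if the system meets demand */
    double availability;         /* fraction of time the system is up, 0..1 */
    double purchase_cost;        /* $ up front */
    double power_cost_per_year;  /* $ of electricity */
};

/* Collapse several metrics into one utility value in dollars.  Turning
 * multiple objectives into a single number is what lets a genetic
 * algorithm (or any other search) compare candidate configurations. */
static double utility(const struct config_metrics *m)
{
    double downtime_penalty = (1.0 - m->availability) * 500000.0; /* invented penalty */

    return m->revenue_per_year
         - m->purchase_cost
         - m->power_cost_per_year
         - downtime_penalty;
}

int main(void)
{
    struct config_metrics cheap  = { 800000.0, 0.995,  50000.0, 12000.0 };
    struct config_metrics pricey = { 800000.0, 0.9999, 150000.0, 20000.0 };

    printf("cheap:  $%.0f\n", utility(&cheap));
    printf("pricey: $%.0f\n", utility(&pricey));
    return 0;
}
```

Converting multiple objectives into one utility, as Strunk notes in the Q&A that follows, is exactly this kind of weighted collapse.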
Peter Honeyman asked why linear programming was not used instead of a genetic algorithm. Strunk answered that linear programming's constraints on objective function form rule out many real-world utility functions. Honeyman also asked whether one can maximize multiple objectives; Strunk replied that you would convert these to one utility. Another audience member asked whether they had looked at a method for generating good utility functions, noting that Strunk's seemed simplistic. Strunk said they have, that the paper has more examples, and that this is also an area where they are doing further work. One person asked whether this approach can determine whether it is better to upgrade an existing system or migrate to a new system. Strunk answered that they can do this, but that it is the second part of his thesis. Two audience members asked whether Strunk's approach supported varying input values as a function of time. Strunk answered that their system focuses only on static provisioning. The final questioner asked whether not finding the optimal solution is a problem. Strunk replied that in the real world one often only gets in the ballpark, and that this approach already does at least as well as today's ad hoc approaches.

LSF '08: 2008 Linux Storage & Filesystem Workshop
San Jose, CA
February 25–26, 2008

storage track

Summarized by Grant Grundler ([email protected])
Copyright 2008 Google, Inc. (Creative Commons Attribution License, http://code.google.com/policies.html or http://creativecommons.org/licenses/by/2.5/)

Several themes came up over the two days:

Theme 1: Solid State Drives
SSDs (Solid State Disks) are coming. There was a good presentation by Dongjun Shin (Samsung) on SSD internal operation, including some discussion on which parameters were needed for optimal operation (theme #2). The I/O stack needs both micro-optimizations (performance within driver layers) and architectural changes (e.g., you have to parameterize the key attributes so that file systems can utilize SSDs optimally). Intel presented SCSI RAM and ATA_RAM drivers to help developers tune the SCSI, ATA, and block I/O subsystems for these orders-of-magnitude-faster (random read) devices. Hybrid drives were a hot topic at LSF '07 but were only briefly discussed in the introduction this year.

Theme 2: Device Parameterization
The device parameters discussion is just beginning on how to parameterize device characteristics for the block I/O schedulers and file systems. For instance, SSDs want all writes to be in units of the erase block size if possible, and device mapping layers would like better control over alignment and placement. The key question here is how to provide enough parameters to be useful but not so many that "users" (e.g., the file system) get it wrong. The general consensus was that having more than two or three parameters would cause more problems than it solved.
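To make the erase-block constraint concrete, the arithmetic a file system (or a test harness) would need is just alignment math. A minimal sketch, assuming a hypothetical 512 KB erase block size (real devices vary and do not currently export this number):

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical device parameter; current SSDs do not export this. */
#define ERASE_BLOCK_BYTES (512u * 1024u)

/* A write is "SSD friendly" if both its offset and its length fall on
 * erase-block boundaries, so the flash translation layer never has to
 * read-modify-write a partially filled erase block. */
static int erase_block_aligned(uint64_t offset, uint64_t len)
{
    return (offset % ERASE_BLOCK_BYTES == 0) && (len % ERASE_BLOCK_BYTES == 0);
}

/* Round a write-out batch up to a whole number of erase blocks. */
static uint64_t round_up_to_erase_block(uint64_t bytes)
{
    return (bytes + ERASE_BLOCK_BYTES - 1) / ERASE_BLOCK_BYTES * ERASE_BLOCK_BYTES;
}

int main(void)
{
    printf("aligned: %d\n", erase_block_aligned(1024 * 1024, 512 * 1024));
    printf("300 KB rounds up to %llu bytes\n",
           (unsigned long long)round_up_to_erase_block(300 * 1024));
    return 0;
}
```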

Theme 3: I/O Priorities
I/O priorities and/or bandwidth sharing have lots of folks interested in I/O schedulers. There was discussion of splitting the I/O scheduler into two parts: an upper half to deal with the different needs of feeding the queue (limiting block I/O resource consumption) and a lower half to rate-limit what gets pushed to the storage driver.

Theme 4: Network Storage
Two technologies were previewed for addition to the kernel: pNFS (parallel NFS) and FCoE (Fibre Channel over Ethernet). Neither is ready for kernel.org inclusion, but some constructive guidance was given on what directions specific implementations needed to take. The issues facing iSCSI were also presented and discussed. User- versus kernel-space drivers was a hot topic in Networked Block Storage forums.

n Introduction and opening statements: recap of last year
Chris Mason and James Bottomley

This session was primarily a scorecard of how many of the topics discussed last year are fixed or implemented this year. The bright spots were the new filesystem (BTRFS, pronounced "butter FS," which incorporates B-trees for directories and an extent-based filesystem with a 2^64 maximum file size) and emerging support for OSD (Object-based Storage Device) in the form of bidirectional command integration (done) and long CDB commands (pending); it was also mentioned that Seagate is looking at producing OSD drives. Error handling was getting better, but there's still a lot of work to be done, and we have some new tools to help test error handling.
The 4k sector size, which was a big issue last year, has receded in importance because manufacturers are hiding the problem in firmware.

n ssd
Dongjun Shin, Samsung Electronics

Dongjun gave an excellent introduction to and details of how SSDs are organized internally (sort of a two-dimensional matrix). The intent was to give FS folks an understanding of how data allocation and read/write requests should be optimally structured. "Stripes" and "channels" are the two dimensions used to increase the level of parallelization and thus increase the throughput of the drive. The exact configurations are vendor-specific. The tradeoff is to reduce stripe size to allow multithreaded apps to have multiple I/Os pending without incurring the "lock up a channel during erase operation" penalty for all pending I/Os. Hard disk drives (HDDs) prefer large sequential I/Os, whereas SSDs prefer many smaller random I/Os.

Dongjun presented Postmark (mail server benchmark) performance numbers for various file systems. An obvious performance leader seemed to be nilfs for most cases, and it was never the worst. Successive slides gave more details on some of the FSes tested. Some notable issues were that flush barriers kill XFS performance and that BTRFS performance was better with 4k blocks than with 16k blocks.

Flush barriers are the only block I/O barriers defined today, and the flush barriers killed performance on the SSDs, since the flash translation layer could no longer coalesce I/Os and had to write data out in blocks smaller than the erase block size. Ideally, the file system would just issue writes using erase block sizes.

n error handling
Ric Wheeler, EMC

Ric Wheeler introduced the perennial error-handling topic with the comment that bad sector handling had markedly improved over the "total disaster" it was in 2007. He moved on to silent data corruption, noting that the situation here was improving, with data checksumming now being built into file systems (most notably BTRFS and XFS) and emerging support for T10 DIF. The "forced unmount" topic provoked a lengthy discussion, with James Bottomley claiming that, at least from a block point of view, everything should just work (surprise ejection of USB storage was cited as the example). Ric countered that NFS still doesn't work, and others pointed out that even if block I/O works, the file system might still not release the inodes. Ted Ts'o closed the debate by drawing attention to the paper by Gunawi et al. at FAST '08 showing over 1,300 cases where errors were dropped or lost in the block and filesystem layers.

Error injection was the last topic. Everybody agreed that if errors are forced into the system, it's possible to consistently check how errors are handled. The session wrapped up with Mark Lord demonstrating new hdparm features that induce an uncorrectable sector failure on a SATA disk with the WRITE_LONG and WRITE_UNC_EXT commands. This forces the on-disk CRCs to mismatch, thus allowing at least medium errors to be injected from the base of the stack.

n power management
Kristen Carlson Accardi, Intel

Arjan van de Ven wrote PowerTOP, and it has been useful in tracking down processes that cause CPU power consumption but not I/O. Although kjournald and pdflush are shown as the apps responsible, obviously they are just surrogates for finishing async I/O. For example, postfix uses sockets, which triggers inode updates. Suggestions for preventing this include using lazy update of nonfile inodes and virtual inodes.

With ALPM (Aggressive Link Power Management, http://www.lesswatts.org/tips/disks.php), up to 1.5 watts per disk can be saved on desktop systems. Unlike disk drives, no hardware issues have been seen with repeated powering up or down of the physical link, so this is safer to implement. Performance was of interest, since trading off power means some latency will be associated with coming back up to a full-power state. The transition from SLUMBER to ACTIVE state (mostly the Async Negotiation (AN) performed when restoring power to the PHYs) costs roughly 10 ms. Normal benchmarks show no performance hit, as the drive is always busy. We need to define a bursty power benchmark that is more typical of many environments.
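On ALPM-capable kernels of this era the policy is exposed per SATA host through sysfs, so enabling it from user space is a one-line write; a minimal sketch (host0 and the bare-bones error handling are illustrative):

```c
#include <stdio.h>

/* Request aggressive SATA link power management for one host.
 * Valid policies on ALPM-capable kernels include "min_power",
 * "medium_power", and "max_performance". */
int main(void)
{
    const char *path =
        "/sys/class/scsi_host/host0/link_power_management_policy"; /* host0 is an example */
    FILE *f = fopen(path, "w");

    if (!f) {
        perror("open link_power_management_policy");
        return 1;
    }
    fputs("min_power\n", f);
    fclose(f);
    return 0;
}
```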
Kristen presented three more ideas on where Linux could help save power. The first was to batch and group I/O: 5–30 seconds is the normal interval for flushing data, so instead wait up to 10 minutes before flushing. The second suggestion was a question: Can the block layer provide hints to the low-level driver? For example, "Soon we are going to see I/O; wake up." The third suggestion was making smarter timers to limit CPU power-up events—that is, coordinate the timers so they can wake up at the same time, do the necessary work, and then let the CPU go to a low-power state for a longer period of time.

Ric Wheeler (EMC) opened the discussion on powering down disks, since the savings there are typically 6–15 watts per disk. But powering up disks requires coordination across the data center.

Erik Riedel (Seagate) mentioned EPA requirements: Should we idle the CPU versus the hard drive? One would be trading off power consumption for data access. He said that Seagate can design for higher down/up lifecycles; currently it's not a high count only because Seagate is not getting data from OEMs on how high that count needs to be. It was noted that one version of Ubuntu was killing drives after a few months by spinning them down or up too often.
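The third suggestion is essentially what the kernel's round_jiffies() helpers already do for timers whose exact expiry does not matter: the expiry is rounded to a whole second so that many timers fire in one wakeup. A minimal sketch against the 2.6.2x-era timer API (the module, callback, and 10-second interval are invented for illustration):

```c
#include <linux/module.h>
#include <linux/timer.h>
#include <linux/jiffies.h>

static struct timer_list housekeeping_timer;

static void housekeeping(unsigned long data)
{
    /* ... do periodic work here ... */

    /* Re-arm roughly 10 seconds out, rounded so that this timer fires
     * together with other rounded timers instead of forcing an extra
     * CPU wakeup of its own. */
    mod_timer(&housekeeping_timer,
              jiffies + round_jiffies_relative(10 * HZ));
}

static int __init coalesce_init(void)
{
    setup_timer(&housekeeping_timer, housekeeping, 0);
    mod_timer(&housekeeping_timer,
              jiffies + round_jiffies_relative(10 * HZ));
    return 0;
}

static void __exit coalesce_exit(void)
{
    del_timer_sync(&housekeeping_timer);
}

module_init(coalesce_init);
module_exit(coalesce_exit);
MODULE_LICENSE("GPL");
```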

n Block IO Resources and Cgroups
Fernando Luis Vazquez Cao

Cao touched on three related topics: block I/O (BIO) resources and cgroups, which define arbitrary groupings of processes; I/O group scheduling; and I/O bandwidth allocation (ioband drivers, which manage the I/O bandwidth available to those groups). The proposals were not accepted as is, but the user-facing issues were agreed upon. The use case would be Xen, KVM, or VMware.

Currently, the I/O priority is determined by the process that initiated the I/O. But the I/O priority applies to all devices that process is using. This changed in the month preceding the conference, and the speaker acknowledged that. A more complex scheme was proposed that supports hierarchical assignment of resource control (e.g., CPU, memory, I/O priorities). Proposed was page_cgroup to track write bandwidth. The page would get assigned to a cgroup when the BIO is allocated. One advantage of the get_context() approach is that it does not depend on the current process and thus would also work for kernel threads.

Idea #1 proposed a layer between the I/O scheduler and the driver. This requires some changes to elevator.c and additional infrastructure changes. It was pointed out that one can't control the incoming queue from below the block I/O scheduler. The scheduler needs to be informed when the device is being throttled from below in order to prevent the I/O scheduler queue from getting excessively long and consuming excessive memory resources. Jens suggested they start with #1 since it implements fairness.

Idea #2 was generally not accepted. For idea #3 (group scheduler above LVM make_request), adding a hook so a cgroup can limit the I/O handed to a particular scheduler was proposed, and this idea got some traction. Jens thought #3 would require less infrastructure than #1. Effectively, #3 would lead to a variable-sized Q-depth. And #3 would limit BIO resource allocation.

n ncq emulation
Gwendal Grignou, Google

Gwendal started by explaining what Native Command Queuing (NCQ) is, his test environment (fio), and which workloads were expected to benefit. In general, the idea is to let the device determine (and decide) the optimal ordering of I/Os, since it knows the current head position on the track and the seek times to any I/Os it has in its queue. Obviously, the more choices the device has, the better choices it can make and thus the better the overall throughput the device will achieve. The results he presented bear this out, in particular for small (<32k), random read workloads (e.g., for a classic database).

But the problem is that since the device is deciding the order, it can choose to ignore some I/Os for quite a while too. And thus latency-sensitive applications will suffer occasionally, with I/Os taking more than 1–2 seconds to complete. He implemented and showed the results of a queue-plugging scheme that starved the drive of new I/O requests until the oldest outstanding request was no longer over a given threshold. Other methods to achieve the same effect were discussed, but each had its drawbacks (including this one).
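The latency bound being described fits in a few lines of logic: stop feeding the device whenever the oldest in-flight request has waited longer than some threshold. This is only a schematic sketch of that policy, not Gwendal's implementation; the structure and the 500 ms threshold are invented:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative in-flight request descriptor: issue_time_ms is when the
 * request was handed to the drive. */
struct inflight_req {
    uint64_t issue_time_ms;
    struct inflight_req *next;   /* oldest-first list */
};

#define MAX_AGE_MS 500  /* example latency bound */

/* Plug the queue (send nothing new to the drive) while the oldest
 * outstanding request is older than the bound; once it completes,
 * dispatch resumes.  This trades a little throughput for a cap on
 * worst-case latency under NCQ. */
static bool may_dispatch(const struct inflight_req *oldest, uint64_t now_ms)
{
    if (oldest == NULL)
        return true;                          /* nothing outstanding */
    return (now_ms - oldest->issue_time_ms) <= MAX_AGE_MS;
}
```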
He also showed how, by pushing more I/O to the drive, we affect the block schedulers' ability to coalesce I/O and anticipate which I/Os to issue next. And although NCQ was effective on a best-case benchmark, it was debated how effective it would be in real life (perhaps <5%).

n making the io scheduler aware of the underlying storage topology
Aaron Carroll and Joshua Root, University of New South Wales

Disclosure: Grant Grundler arranged the grant from Google to fund this work. HP is also funding a portion of this work.

Aaron and Joshua have created an infrastructure to measure the performance of any particular block trace and were interested in seeing how I/O schedulers behave under particular workloads. The performance slides are graphs of how the various schedulers perform as one increases the number of processes generating the workload. They tested the following schedulers: AS (Anticipatory Scheduler), CFQ (Completely Fair Queueing), Deadline, FIFO, and NOOP.

They tested a few different configs: RAID 0 sequential, async; single-disk random and sequential; and 10-disk RAID 0 random and sequential. Of the various parameters—queue depth, underlying storage device type, and RAID topology—they wanted to establish which parameters were relevant and find the right way to determine those parameters (e.g., by user input, with runtime microbenchmark measurements, by asking lower layers). Queue depth is generally not as important, nor is it very helpful for any sort of anticipation. For device type, it would be obvious to ask the underlying device, but we need a suitable level of abstraction. For RAID topology, the key info was "stripe boundaries."

Ric Wheeler said that he can see differences in performance depending on the seek profile, on whether most I/Os are to one disk at a time, and on whether the array is doing read-ahead. Random reads for RAID 3/5/6 depend on the worst case (i.e., the slowest drive). Jens mentioned that disk type could be exported easily by plugging (stopping the queue to build a bigger I/O) or through an anticipatory maneuver (starting a new I/O after the previous one has completed but before the application has requested the data/metadata). We discussed how to split fairness/bandwidth sharing/priorities (or whatever you want to call it) so that a component above the SW RAID md driver would manage incoming requests. A lower half of the scheduler would do a time slice. It was also noted that CFQ can unfairly penalize bursty I/O measurements. One suggestion was to use a token bucket to mitigate bursty traffic.
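For reference, a token bucket smooths bursts by letting a flow accumulate credit at a fixed rate up to a cap and only dispatching work it can pay for; a self-contained sketch (the rate, capacity, and cost units are arbitrary):

```c
#include <stdbool.h>
#include <stdint.h>

/* Token bucket limiter: tokens accrue at `rate` per second up to
 * `burst`; an I/O of `cost` tokens may proceed only if the bucket can
 * pay for it. */
struct token_bucket {
    double tokens;        /* current credit */
    double rate;          /* tokens added per second */
    double burst;         /* bucket capacity */
    uint64_t last_us;     /* last refill timestamp, microseconds */
};

static void tb_refill(struct token_bucket *tb, uint64_t now_us)
{
    double elapsed = (now_us - tb->last_us) / 1e6;

    tb->tokens += elapsed * tb->rate;
    if (tb->tokens > tb->burst)
        tb->tokens = tb->burst;
    tb->last_us = now_us;
}

/* Returns true and charges the bucket if the request may go now. */
static bool tb_admit(struct token_bucket *tb, double cost, uint64_t now_us)
{
    tb_refill(tb, now_us);
    if (tb->tokens < cost)
        return false;     /* caller should delay or queue the request */
    tb->tokens -= cost;
    return true;
}
```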

Aaron and Joshua introduced two new schedulers that might be useful in the future: FIFO (a true FIFO, without merging) and V(R) SSTF. There was no discussion of these.

n dma representations: sg_table vs. sg_ring; iommus and lld's restrictions
Fujita Tomonori

(LLD stands for Low Level Driver, e.g., a NIC or an HBA device driver.)

Fujita did an excellent job of summarizing the current mess that is used inside the kernel to represent the DMA capabilities of devices. As Fujita dove straight into the technical material with no introduction, I'll attempt to explain what an IOMMU is and the kernel DMA API. Historically, I/O devices that are not capable of generating physical addresses for all of system RAM have always existed. The solution without an IOMMU is a "bounce buffer," in which you DMA to a low address the device can reach and then memcpy to the target location. I/O Memory Management Units (IOMMUs) can virtualize (a.k.a. remap) host physical address space for a device and thus allow these legacy devices to DMA directly to any memory address. The bounce buffer is no longer necessary, and we save the CPU cost of the memcpy. IOMMUs can also provide isolation and containment of I/O devices (preventing any given device from spewing crap over random memory—think Virtual Machines), merge scatter-gather lists into fewer I/O bus addresses (more efficient block I/O transfers), and provide DMA cache coherency for virtually indexed/tagged CPUs (e.g., PA-RISC).

The PCI DMA Mapping interface was introduced into the Linux 2.4 kernel by Dave Miller primarily to support IOMMUs. James Bottomley updated this to support noncache-coherent DMA and become bus-agnostic by authoring Documentation/DMA-API.txt in the Linux 2.6 kernels. The current DMA API also does not require the IOMMU drivers to respect the max segment length (i.e., IOMMU support is coalescing DMA into bigger chunks than the device can handle). The DMA alignment (i.e., boundaries a DMA cannot cross) has similar issues (e.g., some PCI devices can't DMA across a 4-GB address boundary). Currently, the drivers that have either length or alignment limitations have code to split the DMA into smaller chunks again. The max_seg_boundary_mask in the request queue is not visible to the IOMMU, since only a struct device * is passed to the IOMMU code.
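For readers who have not seen it, the streaming half of that API (as documented in Documentation/DMA-API.txt in the 2.6.25-era kernels) looks roughly like the following hypothetical driver fragment; the device, buffer, and length are placeholders:

```c
#include <linux/dma-mapping.h>
#include <linux/device.h>
#include <linux/errno.h>

/* Map a driver buffer for a device-to-memory transfer, hand the bus
 * address to the hardware, and unmap once the DMA has completed.
 * With an IOMMU the returned dma_addr_t may bear no resemblance to the
 * buffer's physical address; without one, the core may fall back to a
 * bounce buffer for devices with limited addressing. */
static int start_receive(struct device *dev, void *buf, size_t len)
{
    dma_addr_t bus_addr;

    bus_addr = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
    if (dma_mapping_error(bus_addr))   /* one-argument form of this era */
        return -EIO;

    /* ... program the hardware with bus_addr and wait for completion ... */

    dma_unmap_single(dev, bus_addr, len, DMA_FROM_DEVICE);
    return 0;
}
```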
The next issue discussed was IOMMU performance and I/O TLB flushing. The IOMMU driver (and HW) performance is critical to good system performance. New x86 platforms support virtualization of I/O, and thus it's not just a high-end RISC computer problem. Issues included the following:

1. How does one best manage IOMMU address space? Through common code? Some IOMMU drivers use a bitmap (most RISC); Intel uses a "Red Black" tree. Fujita tried converting POWER to use a Red/Black tree and lost 20% performance with netperf. Bottomley and Grundler agree that the address allocation policy needs to be managed by the IOMMU or architecture-specific code, since the I/O TLB replacement policy dictates the optimal method for allocating IOMMU address space.

2. When should we flush the I/O TLB? One would like to avoid flushing the I/O TLB, since (a) it's expensive (as measured in CPU cycles) and (b) it disturbs outstanding DMA (forces reloading the I/O TLB). However, if we flush the entries when the driver claims the DMA is done, we can prevent DMA going to a virtual DMA address that might have been freed and/or reallocated to someone else. The bottom line is that there is a tradeoff between performance and safety (a.k.a. robustness).

3. Should we just map everything once? The performance advantage is that you don't need to map, unmap, and flush the I/O TLB for individual pages, but the tradeoff is isolation (since any device can DMA anywhere); this can be acceptable in some cases (e.g., embedded devices such as an NFS server).

The last DMA-mapping-related issue was SG (scatter-gather) chaining versus SG rings.

n iscsi transport class simplification
Mike Christie and Nicholas Bellinger

The main thrust here is that common libs are needed to share common objects between transport classes. In particular, Mike called out the issues that the iSCSI maintainer has faced across different kernel versions as /sys has evolved. James Bottomley conceded that there were issues with the original implementation. Mike also mentioned problems with parsing /sys under iSCSI devices. The goal is to provide a common starting point for user-space-visible names.

Mike proposed a scsi_transport_template that contained new scsi_port and scsi_i_t_nexus data structures. iSCSI also needs an abstraction between SCSI ports—an I_T_nexus. Other users of I_T_nexus were also discussed.

James Bottomley pointed out that libsas already has an I_T_nexus abstraction. It provides a host/port/phy/rphy/target/lun hierarchy for /sys. However, the exported paths need to be more flexible. Mike floated the idea of a new library to encapsulate the SCSI naming conventions so that tools like lsscsi wouldn't have to struggle.

Development for iSCSI focuses on Linux-iSCSI.org. iSCSI exposed issues with error recovery. The slides neatly summarize most of the points Nicholas wanted to make. The lively but inconclusive debate left me thinking that most of the code will be forced to live in the kernel until evidence is presented otherwise. iSCSI, FC, and SAS would be better in kernel because concurrency control fundamentally resides in the kernel. And LIO-SE assumes most drivers belong, and are implemented, in kernel space, because transport APIs force middle code into the kernel. KVM performance suffers because of movement among virtual kernels.

n request-based multi-pathing
Kiyoshi Ueda and Jun'ichi Nomura, NEC

The key point was the proposed multi-path support below the I/O scheduler; this seems to be the favored design. Problems are expected with request completion and cleaning up the block layer. An RFC for a request stacking framework was posted to the linux-scsi and linux-ide mailing lists. See the last slide (37) for URLs to the postings. The big advantage of request-based DM (device mapper) multi-path is that, since BIOs are already merged, the multi-path driver can do load balancing, since it knows exactly how many I/Os are going to each available path.

Three issues were raised. The first issue was that blk_end_request() will deadlock because the queue lock is held through the completion process. Bottomley suggested moving completions to a tasklet (soft IRQ), since SCSI at one point had the same issue. There was also some discussion about migrating drivers to use blk_end_request() instead of __blk_end_request(). The second issue involved busy stack drivers that won't know when the lower driver is busy, and once a request is removed from the scheduler queue, it's no longer mergeable. Slides 14–21 have very good graphic representations of the problem. Bottomley suggested prep and unprep functions to indicate whether requests are mergeable or not. One basic difference between the existing BIO-based code and the proposed request-based DM is that device locking (the queue lock) will be required for both submission and completion of the request-based DM handler I/Os; it is not required for BIO. The third issue was that req->end_io() is called too late and is called with the queue lock held. Solutions were offered and discussed in the remaining slides (29–36).
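For context on the blk_end_request()/__blk_end_request() distinction mentioned above: in the 2.6.25-era block layer the two helpers differ in who is expected to hold the queue lock, which is exactly what makes completing a request from a lock-holding path deadlock-prone. A rough, hypothetical illustration (not code from the talk):

```c
#include <linux/blkdev.h>

/* Completing a request from a context that does NOT hold the queue
 * lock: blk_end_request() acquires q->queue_lock itself.  A nonzero
 * return means the request still has bytes left to complete. */
static void complete_unlocked(struct request *rq, int error, unsigned int nbytes)
{
    blk_end_request(rq, error, nbytes);
}

/* Completing a request while already holding q->queue_lock (for
 * example from inside the request_fn): only the __ variant is safe,
 * since blk_end_request() would try to take the lock a second time. */
static void complete_locked(struct request *rq, int error, unsigned int nbytes)
{
    __blk_end_request(rq, error, nbytes);
}
```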
Regarding issue 1, one should only allow the use of nonlocking drivers (i.e., drivers that do not lock in the completion path). All SCSI drivers, cciss, and i2o already meet this criterion; the block layer is using locking completion; a DASD driver change is needed. There was a discussion about how much work it would be to convert other drivers.

n fs and volume managers
Dave Chinner, SGI

Dave covered several major areas: a proposal he called "BIO hints" (which Val Henson called "BIO commands"); DM multi-path; chunk sizes; and I/O barriers. BIO hints is an attempt to let the FS give the low-level block layer hints about how the storage is being used. The definition of "hint" was something that the storage device could (but was not required to) implement for correct operation. The function mkfs could provide "space is free" hints, which would be good for RAID devices, transparent security (zeroing released data blocks), and SSDs, which could put unused blocks in their garbage collection.

DM multi-path has a basic trust issue. Most folks don't trust it because the necessary investment wasn't made to make it trustworthy. This is a chicken-and-egg problem. Ric Wheeler said that EMC does certify DM configs. Other complaints were poor performance, the lack of proper partitioning, the poor user interface for management tools, and the total lack of support for existing devices.

Barriers today are only for cache flushing, both to force data to media and to enforce ordering of requests. Bottomley suggested implementing commit on transaction.

n OSD-based pNFS
Benny Halevy and Boaz Harrosh, Panasas

Benny first described the role of the layout driver for OSD-based pNFS. Layouts are a catalog of devices, describing the byte range and attributes of each device. The main advantage of the layout driver is that one can dynamically determine the object storage policy. One suggestion was to store small files on RAID1 and large files on RAID5. Striping across devices is also possible. By caching the layouts (object storage server descriptions), one can defer cataloging all the OSD servers at boot time and implement on-demand access to those servers.

Current device implementations include iSCSI, iSER, and FC. SCSI over USB and FCoE are also possible. Functional testing has been done, and performance was described as being able to "saturate a GigE link." Future work will include OSD 2.0 protocol development, and it's already clear there will be changes to the OSD protocol.

Requirements for the Linux kernel to support OSD pNFS were discussed. Bidirectional SCSI CDB support is in the 2.6.25-rcX kernels. There are no objections to patches for variable-length CDBs, which might go into 2.6.26. Recent patches to implement "Long Sense Buffers" were rejected; a better implementation is required.

The discussion ended on DM and ULDs (Upper Level Drivers; e.g., sd, tape, CD/DVD). DM contains the desired striping functionality, but it also takes ownership of the device. Distributed error handling is not possible unless DM passes errors back up to higher layers. Each ULD is expected to register an OSD type. But the real question is whether we want to represent objects as block devices (a segue to the next talk) and how to represent them in some namespace.

n block-based pNFS
Andy Adamson, University of Michigan; Jason Glasgow, EMC

Afterward, pNFS was summarized to me as "clustered FS folks . . . trying to pull coherency into NFS." The underlying issue is that every clustered filesystem (e.g., Lustre) requires coherency of metadata across the nodes of the cluster. NFS historically has bottlenecked on the NFS server, since it was the only entity managing the metadata coherency.

The first part of this talk explained the volume topologies and how pNFS block devices are identified (fsid). Each fsid can represent arbitrarily complex volume topologies, which under DM get flattened to a set of DM targets. But they didn't want to lose access to the hierarchy of the underlying storage paths in order to do failover.
The proposal for "Failover to NFS" survived Benny's explanation of how a dirty page would be written out via the block path and, if that failed, then via the NFS code path. The main steps for the first path would be write, commit, and layout commit; for the failover path, write to the MDS and commit. This provoked sharp criticism from Christoph Hellwig: this adds complexity without significant benefit. The client has two paths that are completely different, and the corner cases will kill us. The complexity he referred to was the unwinding of work after starting an I/O request down the block I/O code path and then restarting the I/O request down a completely different code path. A lively debate ensued around the changes needed to the block I/O and VFS layers. Christoph was not the only person to object, and this idea right now looks like a nonstarter. The remaining issue covered block size: 4k is working but is not interoperable with other implementations.

n fs and storage layer scalability problems
Dave Chinner, SGI

Dave offered random thoughts on 3- to 5-year challenges. The first comment was "Direct I/O is a solved problem and we are only working on micro-optimizations."

He resurrected and somewhat summarized previous discussion on exposing the geometry and status of devices. He wanted to see independent failure domains being made known to the FS and device mapper so that those could automate recovery. Load feedback could be used to avoid hot spots on media I/O paths. Similarly, failure domains and dynamic online growing could make use of loss-redundancy metrics to automate redistribution of data to match the application or user intent.

Buffered I/O writeback (e.g., using pdflush) raised another batch of issues. It's very inefficient within a file system because the mix of metadata and data in the I/O stream causes both syncing and ordering problems. pdflush is also not NUMA aware and should use CPUsets (not Containers) to become NUMA aware. James Bottomley noted that the I/O completion is on the wrong node as well (the node where the IRQ is handled). Finally, different FSes will use more or less CPU capacity for functionality such as checksumming data, and an aging FS might saturate a single CPU. He gave an example where the raw HW can do 8 GB/s but only sees 1.5 GB/s of throughput with the CPU 90% utilized. Dave also revisited the topic of error handling with the assertion that, given enough disks, errors are common. He endorsed the use of the existing error injection tools, especially the scsi_debug driver.

His last rant was on the IOPS (I/O per second) challenge SSDs present. He questioned whether Linux drivers and HBAs are ready for 50k IOPS from a single spindle. Raw IOPS are limited by poor HBA design with excessive per-transaction CPU overhead. HBA designers need to look at NICs. Using MSI-X direct interrupts intelligently would help, but both SW and HW design need to evolve. I'd like to point folks to mmio_test (see gnumonks.org) so they can measure this for themselves. Disclaimer: I'm one of the contributors to mmio_test (along with Robert Olsson, Andi Kleen, and Harald Welte). Jörn Engel added that about 2 years ago tasklets were added that now do the equivalent of NAPI ("New API" for NIC drivers). NAPI was added about 5 or 6 years ago to prevent incoming NIC traffic from live-locking a system. All the CPU cycles could be consumed exclusively in handling interrupts. This interrupt mitigation worked pretty well even if the HW didn't support interrupt coalescing.

n T10 DIF
Martin Petersen, Oracle

Martin pointed to the FAST '08 paper "An Analysis of Data Corruption in the Storage Stack" by Bairavasundaram et al. (See the related article in this issue of ;login:.)

His first point was that data can get corrupted in nearly every stage between host memory and the final storage media. The typical data-at-rest corruption (a.k.a. "grown media defects") is just one form of corruption. The remaining data corruption types are grouped as "while data is in flight," and applications need to implement the first level of protection here. He also characterized Oracle's HARD as the most extreme implementation and compared others to "bong hits from outer space." Given the volume of data being generated, there was agreement that trivial CRCs would not be sufficient.

Although some vendors are pushing file systems with "logical block cryptographically strong checksumming" and similar techniques as bullet-proof, they only detect the problems at read time. This could be months later, when the original data is long gone. The goal of the T10 DIF (Data Integrity Field) standard is to prevent bad data from being written to disk in the first place.

HW RAID controllers routinely reformat FC and SCSI drives to use 520-byte sectors to store additional data integrity/recovery bits on the drive. The goal of T10 DIF is to have end-to-end data integrity checks by standardizing those extra 8 bytes and transmitting them from the application all the way down to the media. They can be validated at every stop on the way to the media, providing end-to-end integrity checking of the data.
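Those extra 8 bytes per 512-byte sector are the DIF protection-information tuple; its commonly described layout (shown here as an illustrative struct, with the reference tag as used by DIF Type 1) is:

```c
#include <stdint.h>

/* The 8 bytes of protection information appended to each 512-byte
 * sector under T10 DIF, yielding 520-byte sectors on the wire and on
 * the media; the fields are big-endian in the protocol. */
struct t10_dif_tuple {
    uint16_t guard_tag;   /* CRC-16 of the 512 data bytes */
    uint16_t app_tag;     /* owner/application-defined */
    uint32_t ref_tag;     /* Type 1: low 32 bits of the target LBA */
} __attribute__((packed));
```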
He pointed out which changes are needed in the SCSI stack; one of those (variable-length CDBs) is already in the kernel. James Bottomley observed that he could no longer get SCSI specs to implement new features like this one, owing to recent changes in distribution. He also pointed out that the FS developers could use some of the tag CRC bits to implement a reverse-lookup function they were interested in. The best comment, which closed the discussion, came from Boaz Harrosh: Integrity checks are great! They catch bugs during development!

n FCoE
Robert Love and Christopher Leech

Robert and Christopher took turns giving a description of the project, providing an update on current project status, and leading a discussion of issues they needed help with. FCoE is succinctly described as "an encapsulation protocol to carry Fibre Channel frames over Ethernet" and is standardized in T11. The main goal is to integrate existing FC SANs into a 10-GigE network and continue to use the existing FC SAN management tools. The discovery protocol is still under development. James Bottomley observed that a VLAN would allow the FC protocol to pretend there is no other traffic on the Ethernet network, since the on-wire protocol supports 802.1Q tags.

Open-FCoE.org seems to be making good progress in several areas, but it's not ready for production use yet. Current problems discussed included the complexity of the code, frustration with the (excessive) number of abstractions, and wanting to take advantage of current NIC offload capabilities. Current rework is taking direction from James Smart, making better use of existing Linux SCSI/FC code and then determining how much code could be shared with existing FC HBA drivers.

Discussion covered making use of the proposed "IT_Nexus" support. Robert and Christopher agreed that IT_Nexus would be useful for FCoE as well, since they have the same issues as others in managing connection state. James Bottomley also pointed out that their current implementation didn't properly handle error states; he got a commitment back that Robert would revisit that code.

n Linux Storage Stack Performance
Kristen Carlson Accardi and Matthew Wilcox, Intel

Kristen and Matthew "willy" Wilcox provided forward-looking performance tools to address expected performance issues with the Linux storage stack when used with SSDs (see http://www.usenix.org/events/lsf08/tech/Carlson_Accardi_powermgmt.pdf). This follows the "provide data and the problem will get solved" philosophy. Storage stacks are tuned for seek avoidance (a waste of time for SSDs), and SSDs are still fairly expensive and uncommon. The underlying assumption is that the lack of SSDs in the hands of developers means the data won't get generated and no one will accept optimizations that help SSDs.

Kristen's first slides, providing a summary of SSD cost/performance numbers (e.g., Zeus and Mtron), showed that a single device is now capable of 50,000+ IOPS (I/Os per second). Current rotational media can only do 150–250 IOPS per device on a random read workload (3000–4000 if it's only talking to the disk cache) and largely depend on I/O request merging (larger I/O sizes) to get better throughput. Ric Wheeler pointed out that EMC's disk arrays can actually do much more, but that requires racks of disks. Kristen's point was that this level of performance will be in many laptops soon, and it would be great if Linux could support that level of performance.
n sysfs representations
Hannes Reinecke and Kay Sievers, SuSE
Summarized by James Bottomley ([email protected])

Hannes Reinecke and Kay Sievers led a discussion on sysfs in SCSI. They first observed that SCSI represents pure SCSI objects as devices, with the upper-layer drivers (except SCSI Generic) being SCSI bus drivers. However, everything else, driven by the transport classes, gets stored as class devices. Kay and Greg want to eliminate class devices from the tree, and the SCSI transport classes are the biggest obstacle to this. The next topic was object lifetime. Hannes pointed to the nasty race SCSI has so far been unable to solve, where a nearly dead device gets re-added to the system and currently cannot be activated (because the dying device is in the way). Hannes proposed the resurrection patch set (bringing dead devices back to life). James Bottomley declared that he didn't like this. A heated discussion ensued, during which it was agreed that perhaps simply removing the dying device from visibility and allowing multiple devices representing the same SCSI target into the device list, but only allowing one to be visible, might be the best way to manage this situation and the references that depend on the dying device.

Noncontroversial topics were reordering target creation at scan time to try to stem the tide of spurious events they generate, and moving SCSI attributes to default attributes so that they would all get created at the correct time, solving a race today where the upward propagation of the device-creation uevent races with the attribute creation and may result in the root device not being found if the uevent wins the race.

The session wound up with Bottomley demanding that Greg and Kay show exactly what the sysfs people have in store for SCSI.

For the complete summaries, see http://www.usenix.org/events/lsf08/lsf08sums.pdf.
