PRO: A Popularity-Based Multi-Threaded Reconstruction Optimization for RAID-Structured Storage Systems

Lei Tian and Dan Feng, Huazhong University of Science and Technology; Hong Jiang, University of Nebraska—Lincoln; Ke Zhou, Lingfang Zeng, Jianxi Chen, and Zhikun Wang, Huazhong University of Science and Technology and Wuhan National Laboratory for Optoelectronics; Zhenlei Song, Huazhong University of Science and Technology

Hong Jiang began his talk by discussing the importance of data recovery. Disk failures have become more common in RAID-structured storage systems. The improvement in disk capacity has far outpaced improvements in disk bandwidth, lengthening the overall RAID recovery time. Also, disk drive reliability has improved slowly, resulting in a very high overall failure rate in a large-scale RAID storage system. Disk-oriented reconstruction (DOR) is one of the existing I/O parallelism-based recovery mechanisms. DOR follows a sequential order of stripes in reconstruction, regardless of user access patterns. Workload access patterns need to be considered because 80% of accesses are directed to 20% of the data, according to Pareto's Principle, and 10% of the files accessed on a Web server typically account for 90% of the server requests. The authors present a popularity-based multi-threaded reconstruction optimization (PRO) that takes advantage of data popularity to improve reconstruction performance. PRO divides the data units on the spare disks into hot zones, and each hot zone has a reconstruction thread. The priority of each thread is dynamically adjusted according to the current popularity of its hot zone: PRO keeps track of user accesses and adjusts the popularity of each hot zone accordingly. PRO selects the reconstruction thread with the highest priority and allocates a time slice to it. When a thread's time slice runs out, PRO assigns a time slice to the next-highest-priority thread. The process repeats until all of the data units have been rebuilt. Priority-based scheduling is used so that the regions being reconstructed are always the hottest regions; time-slicing is used to exploit the I/O bandwidth of hard disks and access locality.

PRO was compared to DOR in the evaluation because DOR is arguably the most effective of the existing reconstruction algorithms. The evaluation examined reconstruction performance by measuring user response time, reconstruction time, and algorithm complexity. PRO was integrated into the original DOR approach implemented in the RAIDframe software to validate and evaluate it, and the evaluation was performed by replaying three different Web traces consisting of read-only Web search activity. PRO consistently outperformed DOR, improving reconstruction time and user response time by up to 44.7% and 23.9%, respectively. PRO's effectiveness relies on the existence of popularity and locality in the workload as well as on the intensity of the workload. PRO also uses extra memory for each thread descriptor. The computation overhead of PRO is O(n), although if a priority queue is used in the PRO algorithm the computation overhead can be reduced to O(log n). The entire PRO implementation in the RAIDframe software added only 686 lines of code.

Work on PRO is ongoing. Future work includes optimizing the time slice, scheduling strategies, and hot zone length. Currently, PRO is being ported into the Linux software RAID. Finally, the authors plan to further investigate the use of access patterns to help predict user accesses and the use of filesystem semantic knowledge to explore accurate reconstruction.

The first questioner asked about the average rate of recovery for PRO. Hong answered that the average reconstruction time is several hundred seconds in the experimental setup. The second questioner asked how well PRO reconstruction compares to DOR reconstruction under no workload; Hong commented that when there is no workload the reconstruction performance of PRO and DOR is the same. In response to a question about write overhead, Hong stated that his research team is actively looking into this. The last question involved the sensitivity of the results to the number of threads and to the time slice. Hong explained that the impact of the number of threads and time slice is negligible in the current experimental configuration; however, a more elaborate sensitivity study is underway in the project.
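The summary does not reproduce any of the PRO code itself, but the scheduling loop it describes (per-hot-zone popularity counters, with the hottest unfinished zone served one time slice at a time) is easy to sketch. The user-space C fragment below is only an illustration under assumed names and parameters (hot_zone, pro_pick_zone, ZONES, TIME_SLICE), not the RAIDframe implementation; it uses the O(n) linear scan mentioned above, which a max-heap keyed on popularity would reduce to O(log n).

/*
 * Illustrative sketch of PRO-style scheduling (not the RAIDframe code).
 * Each hot zone has a popularity counter updated on user accesses; the
 * reconstruction loop always serves the hottest unfinished zone for one
 * time slice.
 */
#include <stdio.h>
#include <stdlib.h>

#define ZONES          8    /* hot zones on the spare disk (assumed)  */
#define UNITS_PER_ZONE 100  /* data units to rebuild per zone         */
#define TIME_SLICE     10   /* units rebuilt per scheduling quantum   */

struct hot_zone {
    unsigned popularity;    /* recent user accesses that map here     */
    unsigned rebuilt;       /* data units already reconstructed       */
};

static struct hot_zone zones[ZONES];

/* Called on every user I/O that falls into zone z. */
static void pro_account_access(int z)
{
    zones[z].popularity++;
}

/* Pick the unfinished zone with the highest popularity (O(n) scan;
 * a priority queue would make this O(log n)). */
static int pro_pick_zone(void)
{
    int best = -1;
    for (int z = 0; z < ZONES; z++) {
        if (zones[z].rebuilt >= UNITS_PER_ZONE)
            continue;
        if (best < 0 || zones[z].popularity > zones[best].popularity)
            best = z;
    }
    return best;
}

int main(void)
{
    srand(42);
    int z;
    while ((z = pro_pick_zone()) >= 0) {
        /* Rebuild one time slice worth of units from the hottest zone. */
        for (int i = 0; i < TIME_SLICE && zones[z].rebuilt < UNITS_PER_ZONE; i++)
            zones[z].rebuilt++;

        /* Meanwhile, user I/O keeps arriving and shifts popularity
         * (simulated here with a skewed random distribution). */
        for (int i = 0; i < 20; i++)
            pro_account_access(rand() % ZONES < 2 ? 0 : rand() % ZONES);

        printf("rebuilt slice in zone %d (popularity %u)\n",
               z, zones[z].popularity);
    }
    return 0;
}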
LSF '07: 2007 Linux Storage & Filesystem Workshop

San Jose, CA
February 12–13, 2007
Summarized by Brandon Philips ([email protected])

Fifty members of the Linux storage and filesystem communities met in San Jose, California, to give status updates, present new ideas, and discuss issues during the two-day Linux Storage & Filesystem Workshop. The workshop was chaired by Ric Wheeler and sponsored by EMC, NetApp, Panasas, Seagate, and Oracle.

JOINT SESSION

Ric Wheeler opened the workshop by explaining the basic contract that storage systems make with the user: to guarantee that the complete set of data will be stored, that bytes are correct and in order, and that raw capacity is utilized as completely as possible. It is so simple that it seems there should be no open issues, right?

Today these basic demands are met most of the time, but Ric posed a number of questions. How do we validate that no files have been lost? How do we verify that bytes are correctly stored? How can we utilize disks efficiently for small files? How do errors get communicated between the layers?

Through the course of the next two days some of these questions were discussed, others were raised, and a few ideas were proposed. Continue reading for the details.

Ext4 Status Update

Mingming Cao gave a status update on ext4, the recent fork of the ext3 file system. The primary goal of the fork was the move to 48-bit block numbers; this change allows the file system to support up to 1024 petabytes of storage. The feature was originally designed to be merged into ext3 but was seen as too disruptive [1]. The patch is also built on top of the patch set that replaces the indirect block map with extents [2] in ext4. Support for more than 32K directory entries will also be merged into ext4.

On top of these changes a number of ext3 options will be enabled by default in ext4; these include directory indexing to improve file access for large directories, the resize inode, which reserves space in the block group descriptor for online growing, and 256-byte inodes. Users of ext3 can get these features today with mkfs.ext3 -I 256 -O resize_inode,dir_index /dev/device.

A number of RFCs are also being considered for inclusion in ext4. These include a patch that adds nanosecond timestamps [3] and the creation of persistent file allocations [4], which will be similar to posix_fallocate but won't waste time writing zeros to the disk.

Currently, ext4 stores a limited number of extended attributes in-inode and has space for one additional block of extended attribute data, but this may not be enough to satisfy xattr-hungry applications. For example, Samba needs additional space to support Vista's heavy use of ACLs, and eCryptfs can store arbitrarily large keys in extended attributes. This led everyone to the conclusion that someone needs to collect data on how xattrs are being used, to help developers decide how best to implement xattrs.
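Since the persistent-allocation RFC is described only by comparison with posix_fallocate, a small example of that existing call may help. This is a minimal sketch using the standard posix_fallocate(3) interface; the file path and size are made up, and on filesystems without native preallocation glibc emulates the call by writing blocks, which is exactly the cost the proposed ext4 feature is meant to avoid.

/*
 * Preallocating space with posix_fallocate(3).  The path and size are
 * illustrative only.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/prealloc.dat", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Reserve 64 MB starting at offset 0. */
    int err = posix_fallocate(fd, 0, 64 * 1024 * 1024);
    if (err != 0)
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));

    close(fd);
    return 0;
}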
…a fsck and, if Val's fsck estimates for 2013 come true, having a strategy to pass the time will become very important.

With her system booted, Val presented an estimate of 2013 fsck times. She first measured a fsck of her 37-GB home directory with 21 GB in use, which took 7.5 minutes and read 1.3 GB of filesystem data. Next, she used projections of disk technology from Seagate to estimate the time to fsck a 2013 home directory, which will be 16 times larger. Although 2013 disks will have a fivefold bandwidth increase, seek times will only improve by 20%, to 10 ms, leading to a fsck time of 80 minutes! The primary reason for long fscks is seek latency, since fsck spends most of its time seeking over the disk, discovering and fetching dynamic filesystem data such as directory entries, indirect blocks, and extents.

Reducing seeks and avoiding the seek latency punishment are key to reducing fsck times. Val suggested one solution: keeping a bitmap on disk that tracks the blocks that contain filesystem metadata, which would allow all of that data to be read in a single arm sweep. In the best case this optimization would make one sequential sweep over the disk, and on the 2013 disk reading all of the filesystem metadata would take only around 134 seconds, a big improvement. A full explanation of the findings and possible solutions can be found in the paper "Repair-Driven File System Design" [5]. Val also announced that she is working full time on a file system called chunkfs [6] that will make speed and ease of repair a primary design goal.

Zach Brown presented a blktrace of e2fsck. The basic outcome of the trace is that the disk can stream data at 26 MB/s while fsck achieves only 12 MB/s. This situation could be improved to some degree without on-disk layout changes if the developers had a vectorized I/O call. Zach explained that in many cases you know the block locations …
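Val's bitmap suggestion is straightforward to picture in code: walk an on-disk bitmap of metadata blocks in block order, coalesce runs of set bits, and read them at monotonically increasing offsets so the disk arm only moves forward. The sketch below is purely illustrative; the bitmap, block size, device path, and function names are assumptions, and a real fsck would of course do much more with the blocks it reads.

/*
 * Illustration of the single-sweep metadata read suggested in the talk:
 * walk a (hypothetical) metadata bitmap in block order, coalesce runs of
 * set bits into extents, and pread() them in ascending offset order so
 * the disk arm moves in one direction.  Not a real fsck pass.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

#define BLOCK_SIZE 4096u

static int bit_set(const uint8_t *bitmap, uint64_t blk)
{
    return bitmap[blk / 8] & (1u << (blk % 8));
}

/* Read every metadata block marked in 'bitmap' with one forward sweep. */
static void sweep_metadata(int fd, const uint8_t *bitmap, uint64_t nblocks)
{
    char *buf = malloc(BLOCK_SIZE);
    uint64_t blk = 0;

    while (blk < nblocks) {
        if (!bit_set(bitmap, blk)) {        /* skip non-metadata blocks */
            blk++;
            continue;
        }
        uint64_t start = blk;               /* coalesce a run of set bits */
        while (blk < nblocks && bit_set(bitmap, blk))
            blk++;
        for (uint64_t b = start; b < blk; b++) {
            if (pread(fd, buf, BLOCK_SIZE, (off_t)(b * BLOCK_SIZE)) < 0)
                perror("pread");
            /* ...hand the block to the checking passes here... */
        }
    }
    free(buf);
}

int main(void)
{
    /* Assumed inputs: the device and a preloaded metadata bitmap. */
    int fd = open("/dev/sdb1", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    uint64_t nblocks = 1024;                /* toy size */
    uint8_t *bitmap = calloc(nblocks / 8, 1);
    bitmap[0] = 0xff;                       /* pretend the first 8 blocks are metadata */

    sweep_metadata(fd, bitmap, nblocks);

    free(bitmap);
    close(fd);
    return 0;
}

Replaying the reads in ascending offset order is what turns the scattered metadata fetches into something close to the streaming-speed 134-second estimate above.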