Improving Block I/O for Enterprise Workloads

Peter Wai Yee Wong, Badari Pulavarty, Shailabh Nagar, Janet Morgan, Jonathan Lahr, Bill Hartner, Hubertus Franke, Suparna Bhattacharya IBM Linux Technology Center {wpeter,pbadari,nagar,janetinc,lahr,bhartner,frankeh}@us.ibm.com, [email protected] http://lse.sourceforge.net/

Abstract

The block I/O subsystem of the Linux kernel is one of the critical components affecting the performance of server workloads. Servers typically scale their I/O bandwidth by increasing the number of attached disks and controllers. Hence, the scalability of the block I/O layer is also an important concern.

In this paper, we examine the performance of the 2.4 Linux kernel's block I/O subsystem on enterprise workloads. We identify some of the major bottlenecks in the block layer and propose kernel modifications to alleviate these problems in the context of the 2.4 kernel. The performance impact of the proposed patches is shown using a decision-support workload, a microbenchmark, and profiling tools. We also examine the newly rewritten block layer of the 2.5 kernel to see if it addresses the performance bottlenecks discovered earlier.

1 Introduction

Over the past few years, Linux has made remarkable progress in becoming a server operating system. The release of Version 2.4 of the Linux kernel has been heralded as helping Linux break the enterprise barrier [5]. Since then, the kernel developer community has redoubled its efforts in improving the scalability of Linux on a variety of server platforms. All major server vendors, such as IBM, HP, SGI, Compaq, Dell, and Sun, not only support Linux on their platforms but are investing considerable effort in improving Linux's enterprise capabilities. The Linux Technology Center (LTC) of IBM, in particular, has been a major contributor in improving Linux kernel performance and scalability. This paper highlights the efforts of the LTC in improving the performance and scalability of the block I/O subsystem of the Linux kernel.

Traditionally, the kernel block I/O subsystem has been one of the critical components affecting server workload performance. While I/O hardware development has made impressive gains in increasing disk capacity and reducing disk size, there is an increasing gap between disk latencies and processor speeds or memory access times. Disk accesses are slower than memory accesses by two orders of magnitude. Consequently, servers running I/O intensive workloads need to use large numbers of disks and controllers to provide sufficient I/O bandwidth to enterprise applications. In such environments, the kernel's block I/O layer faces a twofold challenge: it must scale well with a large number of I/O devices and it must minimize the kernel overhead for each I/O transfer.

This paper examines how the Linux kernel's block I/O subsystem handles these twin goals of scalability and performance. Using version 2.4.17 of the kernel as a baseline, we systematically identify I/O performance bottlenecks using kernel profiling tools. We propose solutions in the form of kernel patches, all but one of which have been developed by the authors. The performance improvements resulting from these patches are presented using a decision-support workload, a disk I/O microbenchmark, and profiling data. In brief, the I/O performance bottlenecks addressed are as follows:

• Avoiding the use of bounce buffers: The kernel can directly map only the first gigabyte of physical memory. I/O to high memory (beyond 1 GB) is done through buffers defined in low memory and involves an extra copy of the data being transferred. Capitalizing on the ability of PCI devices to directly address all 4 GB, the block-highmem patch written by Jens Axboe can circumvent the need to use bounce buffers.

• Splitting the I/O request lock: Each I/O device in the system has an associated request queue which provides ordering and memory resources for managing I/O requests to the device. In the 2.4 kernel, all I/O request queues are protected by a single io_request_lock, which can be highly contended on SMP machines with multiple disks and a heavy I/O load. We propose a solution that effectively replaces the io_request_lock with per-queue locks.

• Page-sized raw I/O transfers: Raw I/O, which refers to unbuffered I/O done through the /dev/raw interface, breaks I/O requests into 512-byte units (even if the device hardware and associated driver are capable of handling larger requests). The 512-byte requests end up being recombined within the request queue before being processed by the device driver. We present an alternative that permits raw I/O to be done at a page-size granularity.

• Efficient support for vector I/O: I/O intensive applications often need to perform vector (scatter/gather) raw I/O operations which transfer a contiguous region on disk to discontiguous memory regions in the application's address space. The Linux kernel currently handles vectored raw I/O by doing a succession of blocking I/O operations on each individual element of the I/O vector. We implement efficient support for vector I/O by allowing the vector elements to be processed together as far as possible.

• Lightweight kiobufs: The main data structure used in raw I/O operations is the kiobuf. As defined in 2.4.17, the kiobuf data structure is very large. When raw I/O is performed on a large number of devices, the memory consumed by kiobufs is prohibitive. We demonstrate a simple way to reduce the size of the kiobuf structure and allow more I/O devices to be used for a given amount of system memory.

Most of the kernel performance bottlenecks listed above stem from the basic design of the 2.4 block I/O subsystem, which relies on buffer heads and kiobufs. The need to maintain compatibility with a large number of device drivers has limited the scope for kernel developers to fix the subsystem as a whole. In the 2.5 development kernel, however, the challenging task of overhauling the block I/O layer has been taken up. One of the goals of the rewrite has been addressing the scalability problems of earlier designs [2]. This paper discusses the new design in light of the performance bottlenecks described earlier.

The rest of the paper is organized as follows. Section 2 presents an overview of the 2.4 kernel block I/O subsystem. The benchmark environment and workloads used are described in Section 3. Sections 4 through 8 describe the performance and resource scalability bottlenecks, proposed solutions, and results. The newly written 2.5 kernel block layer is addressed in Section 9. Section 10 concludes with directions for future work.

2 Linux 2.4 Block I/O

For the purpose of this paper, our review of the 2.4 kernel block I/O subsystem will be limited in scope. Specifically, it will focus on the "raw" device interface, which was added by Stephen Tweedie during the Linux 2.3 development series.

UNIX has traditionally provided a raw interface to some devices, block devices in particular, which allows data to be transferred between a user buffer and a device without copying the data through the kernel's buffer cache. This mechanism can boost performance if the data is unlikely to be used again in the short term (during a disk backup, for example), or for applications such as large database management systems that perform their own caching.

To use the raw interface, a device binding must be established via the raw command; for example, raw /dev/raw/raw1 /dev/sda1. Once bound to a block device, a raw device can be opened just like any other device.

A sampling of the kernel code path for a raw open is as follows:

sys_open
. raw_open
. . alloc_kiovec

Notice the call to alloc_kiovec to allocate a kernel I/O buffer, also known as a kiobuf. The kiobuf is the primary I/O abstraction used by the Linux kernel to support raw I/O. The kiobuf structure describes the array of pages that make up an I/O operation. The fields of a kiobuf structure include:

// number of pages in the kiobuf
int nr_pages;

// number of bytes in the data buffer
int length;

// offset to first valid byte
// of the buffer
int offset;

// list of device block numbers
// for the I/O
ulong blocks[KIO_MAX_SECTORS];

// array of pointers to
// 1024 pre-allocated
// buffer heads
struct buffer_head *bh[KIO_MAX_SECTORS];

// array of up to 129 page
// structures, one for each
// page of data in the kiobuf
struct page **maplist[KIO_STATIC_PAGES];

The maplist array is key to the kiobuf interface, since functions that operate on pages stored in a kiobuf deal directly with page structures. This approach helps hide the complexities of the virtual memory system from device drivers, a primary goal of the kiobuf interface.

Once the raw device is opened, it can be read and written just like the block device to which it is bound. However, raw I/O to a block device must always be sector aligned, and its length must be a multiple of the sector size. The sector size for most devices is 512 bytes.

Let us examine the code path for a raw device read:

sys_read
. raw_read
. . rw_raw_dev
. . . map_user_kiobuf(READ, &mykiobuf,
                      vaddr, len)

The result of the call to map_user_kiobuf() is that the buffer at virtual address vaddr of length len is mapped into the kiobuf, and each entry of the kiobuf maplist[] is set to the page structure for the associated page of data. Note that some or all of the user buffer may first need to be paged into memory:

. . . map_user_kiobuf
. . . . find_vma
. . . . handle_mm_fault

Once all of the pages of the data buffer are locked in memory, read processing continues with a call to brw_kiovec(), where, for each sector-size chunk of the data buffer, a pre-allocated buffer head associated with the kiobuf is initialized and passed down to __make_request(). __make_request() calls create_bounce() to create a bounce buffer as needed, acquires the io_request_lock, and uses the buffer head information to merge/enqueue the request onto the device-specific request queue.

. brw_kiovec(READ, num_kiobufs=1,
      &mykiobuf, dev, mykiobuf->blocks,
      sector_size=512)
. . submit_bh
. . . generic_make_request
. . . . __make_request(&request_queue,
                       &buff_head)
. . . . . create_bounce
. . . . . generic_plug_device
. . . . . add_request (enqueue)
. kiobuf_wait_for_io

Requests are dequeued when the scheduled tq_disk task calls run_task_queue(), which invokes generic_unplug_device(). In the case of SCSI, generic_unplug_device() invokes scsi_request_fn(), which dequeues requests and sends them to the driver associated with the request_queue/device.

run_task_queue
. generic_unplug_device
. . q->request_fn (scsi_request_fn)
. . . blkdev_dequeue_request (dequeue)
. . . scsi_dispatch_cmd

The read() system call returns once the I/O has completed; that is, after all buffer heads associated with the kiobuf have been processed for completion.
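To make the alignment constraints concrete, the following user-space sketch reads one block from a raw device. It assumes a binding such as "raw /dev/raw/raw1 /dev/sda1" has already been established; the device name and transfer size are illustrative only.

/* Illustrative sketch: read 4 KB from a previously bound raw
 * device.  Raw I/O must be sector aligned in offset, length,
 * and buffer address; the device name is an example. */
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SECTOR_SIZE 512
#define IO_SIZE     4096        /* a multiple of the sector size */

int main(void)
{
    void *buf;
    ssize_t n;
    int fd;

    /* The buffer address itself must be sector aligned. */
    if (posix_memalign(&buf, SECTOR_SIZE, IO_SIZE) != 0) {
        perror("posix_memalign");
        return 1;
    }

    fd = open("/dev/raw/raw1", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Bypasses the buffer cache; data comes straight off the disk. */
    n = read(fd, buf, IO_SIZE);
    if (n < 0)
        perror("read");
    else
        printf("read %zd bytes\n", n);

    close(fd);
    free(buf);
    return 0;
}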

3 Workload and experimental setup

We have been using a decision-support benchmark and a disk I/O microbenchmark to study the performance of block I/O. The decision-support workload (henceforth called DSW) consists of a suite of highly complex queries accessing a 30 GB database. We use IBM DB2 UDB 7.2 as the database management system.

The disk I/O microbenchmark (henceforth called DM) is a multi-threaded disk test. There are a total of 32 raw devices which are mapped to 32 physical disks. DM creates 32 processes. For the read test, each process issues 4096 reads of 64 KB each to a raw device. The readv test issues the same number of reads, but uses 16 iovecs of 4 KB each.

For both benchmarks, the system was rebooted before each set of runs. For DSW, each set consisted of a sequence of queries run back to back three times. For DM, each set consisted of the read/readv runs performed back to back three times. We took the average of the three runs for the score and CPU utilization.

The benchmarks were run on an 8-way 700 MHz Pentium III machine with 4 GB of main memory. The system used for DSW had a 2 MB L2 cache and 6 RAID controllers. The system used for DM had a 1 MB L2 cache and 4 RAID controllers. Each controller was connected to two storage enclosures, with each enclosure containing ten 9.1 GB, 10000 RPM drives. The large number of attached disks allowed a high degree of parallel data access and is typical of the environments in which decision-support workloads are run.

Our baseline (henceforth called Baseline) was Linux 2.4.17 with Ingo Molnar's SMP timer patch applied, plus a number of resource-related changes. In addition, readv was used by the database management system for I/O prefetching. The four main patches discussed in subsequent sections are block-highmem, io_request_lock, rawvary, and readv/writev. To measure their performance impact incrementally, we used four kernels: SB for Baseline+block-highmem, SBI for SB+io_request_lock, SBIR for SBI+rawvary, and SBIRV for SBIR+readv/writev.

As a first step towards identifying I/O bottlenecks, the Baseline kernel was profiled using the Kernprof tool [4]. Table 1 shows the percentage of time spent in the most time-consuming kernel functions while running a DSW query on the Baseline kernel. We see that bounce_end_io_read() is the most expensive function in terms of non-idle time. This function is used when the kernel performs I/O using bounce buffers. The problem caused by bounce buffers and its resolution are described in the next section.

Kernel Function        % Total Time
default_idle           52
bounce_end_io_read      8
do_softirq              7
tasklet_hi_action       6
__make_request          3

Table 1: Profiling data showing percentage of time spent in different kernel functions while running a DSW query on the Baseline kernel.

4 Avoiding the use of bounce buffers

To explain the bounce buffer problem, we first take a look at how the Linux 2.4 kernel addresses physical memory. The discussion assumes an x86 architecture, though most of the concepts apply to all 32-bit systems. The 4 GB address space defined by 32 bits is divided into two parts: a user virtual address space (0-3 GB) and a kernel virtual address space (3-4 GB). The physical memory of a system (which is not limited to 4 GB) is divided into three zones:

• DMA Zone (0-16 MB): ISA cards with only 24-bit DMA space use this zone.

• Normal Zone (16 MB-896 MB): Memory in this range is directly mapped into the kernel's 1 GB of virtual address space starting at PAGE_OFFSET (normally 0xC0000000).

• High Memory Zone (896 MB-64 GB): Page frames in this zone need an explicit mapping into the kernel virtual address space (via the kmap() function) before they can be used by the kernel.
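The zone boundaries just listed can be restated in a few lines of C. The following standalone sketch simply maps a physical address to its zone; the cutoff values repeat the x86 figures given above and are illustrative only.

/* Maps a physical address to the 2.4 x86 memory zone described
 * above.  Boundary values restate the text; illustrative only. */
#include <stdio.h>

static const char *zone_of(unsigned long long phys)
{
    if (phys < 16ULL << 20)
        return "DMA zone (ISA, 24-bit DMA)";
    if (phys < 896ULL << 20)
        return "Normal zone (directly mapped at PAGE_OFFSET)";
    return "High memory zone (needs kmap())";
}

int main(void)
{
    unsigned long long samples[] = {
        1ULL << 20,      /* 1 MB   */
        512ULL << 20,    /* 512 MB */
        2ULL << 30,      /* 2 GB   */
    };
    for (int i = 0; i < 3; i++)
        printf("%#llx -> %s\n", samples[i], zone_of(samples[i]));
    return 0;
}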

Kernel     Increase in MOI (%)    CPU Utilization (%)
                                  user    kernel    idle
Baseline   —                      16      43        41
SB         37                     22      71         7
SBI        78                     41      37        22
SBIR       16                     47      34        19
SBIRV      18                     55       9        36

Table 2: Performance impact of various patches on the metric of interest (MOI) and CPU utilization for the decision-support workload (DSW). Increases are reported w.r.t. the kernel on the previous line.

              I/O transfer rate        CPU Idle Time
Kernel        Value     Increase       Value    Increase
              (MB/s)    (%)            (%)      (%)
Using read
Baseline       54        —             64        —
SB            133       147            21       -68
SBI           235        77            61       192
SBIR          240         2            94        55
2.5.17        243        —             97        —
Using readv
SBIR          104        —             41        —
SBIRV         241       132            94       130
2.5.17        150        —             61        —

Table 3: Performance impact of various patches on the I/O transfer rate and CPU utilization for the disk I/O microbenchmark (DM). Increases are reported w.r.t. the kernel on the previous line. Results are also shown for the 2.5.17 kernel.

DMA operations on memory by I/O devices use physical addresses. Since the kernel cannot address high-memory DMA buffers directly while setting up a buffer for DMA, it allocates an area in low memory called the bounce buffer. It then supplies the bounce buffer's physical address to the I/O device. Consequently, data transfer between the device and the high-memory target buffer necessitates an extra copy through the bounce buffer. This degrades system performance by using up low memory (for the bounce buffer) and adding the overhead of a memory copy for each I/O transfer.

The bounce buffer is unnecessary for 32-bit PCI devices, which can normally address 4 GB of physical memory directly. Such devices can access high memory directly even though the kernel cannot. The block-highmem patch utilizes this property to permit high-memory DMA to occur without the use of bounce buffers.

To make use of the block-highmem patch, most device drivers require a few changes, which are documented in the I/O Performance HOWTO [9].

The elimination of bounce buffers is illustrated by Table 4, which again shows the most time-consuming kernel functions while running DSW, this time using the SB kernel. Comparing the entries to those shown in Table 1, we find that bounce buffers are no longer being used.

Kernel Function        % Total Time
__make_request         35
default_idle           17
scsi_dispatch_cmd       4
do_ipsintr              4
scsi_request_fn         4

Table 4: Profiling data showing percentage of time spent in different kernel functions while running a DSW query on the SB kernel.

The second row of Table 2 indicates the performance improvement seen by DSW using the block-highmem patch. The metric of interest (MOI) increases by 37%. Similar trends are seen in the performance of DM in Table 3, with the I/O throughput of the read test increasing from 54 MB/s to 133 MB/s (corresponding to a 147% improvement).

Eliminating bounce buffer usage causes another I/O bottleneck to appear. Comparing Tables 4 and 1, we find that __make_request is now the most expensive kernel function and that the idle time has been reduced from 64% to around 21% under DM, and from 41% to 7% under DSW. Both these changes are due to the I/O request lock, which is the next bottleneck discussed.

5 Splitting the I/O request lock

As mentioned at the end of the previous section, Tables 1 and 4 indicate a large fraction of time spent in __make_request and a large drop in idle time when DSW is run on SB. Using the Lockmeter [3] profiling tool allows us to investigate whether there are any highly contended locks (spinlocks or reader/writer locks). Table 5 shows the Lockmeter statistics for the io_request_lock when DSW is run on SB. It shows that 66.2% of 8 CPUs are consumed by spinning on the global io_request_lock, and the function in which the lock sees the highest contention also corresponds to the most expensive kernel function in Table 4.

The io_request_lock, which is a global serialization device, imposes system-wide serialization on enqueuing block I/O requests. The request enqueuing functions use the lock to protect all request queues collectively, which means that only one request can be queued at a time.

During normal I/O operations, request queues are accessed and modified by enqueuing and dequeuing functions. Since multiple threads execute these functions, queue integrity must be protected. Code analysis shows that queuing operations on a given queue involve access to queue-specific data: the request list anchor (queue_head), the request free list (rq), and the plug state (plugged). They do not require access to data used by queuing operations on other queues. This means that maintaining queue data integrity does not require serialization of queuing to different queues. Queuing operations on different queues are logically independent and can execute concurrently. Of course, multiple queuing operations to the same queue must still be serialized.

To implement concurrent enqueuing, we replaced io_request_lock in enqueuing functions with per-queue locks (request_queue.queue_lock). This serializes enqueuing to the same queue while allowing concurrent enqueuing to different queues. With this change, dequeuing functions can no longer rely on io_request_lock to serialize with enqueuing functions. To restore this serialization, dequeuing functions were modified to acquire queue_lock in addition to io_request_lock when accessing queues.

Kernel function            Lock              Mean Lock         Lock Spin Time         Number of lock
holding lock               Utilization (%)   Hold Time (µs)    Mean (µs)   % CPU      acquisitions
All spinlocks              —                 3.7               62.0        66.8       68774051
io_request_lock            50.2              5.2               65.0        66.2       15640659
. __make_request           23.5              3.8               64.0        42.8        9973270
. do_ipsintr                8.3             20.0               66.0         3.1         660212
. scsi_dispatch_cmd         6.8             13.0               66.0         3.9         877838
. generic_unplug_device     4.5              8.8               65.0         3.2         835530

Table 5: Lockmeter data for io_request_lock with DSW on the SB kernel.

To minimize interlocking between dequeueing and enqueueing functions, we added another level of locks inside dequeueing functions. This allows us to maintain our focus on enqueuing and avoid the impact of further reducing the scope of the io_request_lock.

When the above modifications to the generic block I/O code were published for comment, the Linux development community expressed concern about making such major changes to the mature 2.4 kernel. Since the patch modified the locking structure in code which affected all block I/O devices, many viewed the code impact as undesirably pervasive. Unforeseen impacts on other code, such as IDE and some device drivers, were also pointed out. Since SCSI configurations represent a significant part of our scalability goal and concurrent queuing can be implemented for SCSI without affecting generic I/O code, we decided to isolate SCSI code for our development purposes. Fortunately, the block I/O subsystem provides for such isolation through dynamically assigned I/O queuing functions stored in the request queue and indirectly invoked as function pointers.

To contain code impact within the SCSI subsystem, generic enqueuing and dequeuing functions were copied, renamed, and modified for concurrent queuing. The following generic block I/O (ll_rw_blk.c) functions provided baselines for the SCSI functions:

__make_request        => scsi_make_request
generic_plug_device   => scsi_plug_device
generic_unplug_device => scsi_unplug_device
get_request           => scsi_get_request
get_request_wait      => scsi_get_request_wait
blk_init_queue        => scsi_init_queue

Concurrent queuing is activated for all devices under an adapter driver by setting the new concurrent_queue field of the Scsi_Host_Template structure used for driver registration. This allows control over which drivers use concurrent queuing and preserves the original request queuing behavior by default. Drivers which enable concurrent queuing must protect any request queue access with queue locks.

With the application of the io_request_lock patch (IORL), the MOI of DSW improves by 78% over SB, as seen in row three of Table 2. The transfer rate of DM also increases significantly, from 133 MB/sec to 235 MB/sec (Table 3). Note that there is a significant increase in idle time in both cases due to the reduction in spin time. Table 6 verifies that the lock contention seen by the io_request_lock has been reduced. scsi_make_request() is shown using a per-queue lock, and the aggregate contention on the per-queue locks is reduced as well.
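The locking split itself is easy to illustrate outside the kernel. The following user-space analogy, using a pthread mutex in place of a spinlock, shows the pattern the IORL patch applies: the queue structure carries its own queue_lock, so enqueues to different queues (different disks) no longer contend. The structure layout and names are illustrative, not the kernel's.

/* User-space sketch of the per-queue locking pattern.  In 2.4, one
 * global io_request_lock covered every queue; here each queue has
 * its own lock, as with request_queue.queue_lock in the patch. */
#include <pthread.h>
#include <stdio.h>

struct request {
    struct request *next;
    long sector;
};

struct request_queue {
    pthread_mutex_t queue_lock;   /* was: the global io_request_lock */
    struct request *head;
};

static void enqueue(struct request_queue *q, struct request *rq)
{
    /* Serializes enqueues to *this* queue only; enqueues to other
     * queues can proceed concurrently on other CPUs. */
    pthread_mutex_lock(&q->queue_lock);
    rq->next = q->head;
    q->head = rq;
    pthread_mutex_unlock(&q->queue_lock);
}

int main(void)
{
    struct request_queue q = { PTHREAD_MUTEX_INITIALIZER, NULL };
    struct request rq = { NULL, 42 };
    enqueue(&q, &rq);
    printf("queued sector %ld\n", q.head->sector);
    return 0;
}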

Kernel function          Lock              Mean Lock         Lock Spin Time         Number of lock
holding lock             Contention (%)    Hold Time (µs)    Mean (µs)   % CPU      acquisitions
All spinlocks            —                 2.1               15.0        13.9       63777886
io_request_lock          39.6              8.7               32.0         7.6        2490263
. do_ipsintr             16.3             26.0               32.0         1.4         339486
. scsi_unplug_device     11.7             18.0               32.0         1.2         357540
. scsi_dispatch_cmd       8.4             13.0               31.0         1.4         363421
scsi_make_request        15.3              0.9               13.0         0.3        9520872

Table 6: Lockmeter data showing benefits of the IORL patch for DSW on the SBI kernel.

Table 7 lists the most expensive kernel functions for DSW running on SBI. A significant fraction of kernel time is spent in brw_kiovec() and many SCSI mid-layer functions. One reason for this is the use of 512-byte blocks for raw I/O, as explained in the next section.

Kernel Function          % Total Time
default_idle             41
schedule                  4
ips_make_passthru         4
tasklet_hi_action         3
do_softirq                3
brw_kiovec                3
scsi_back_merge_fn_dc     3
scsi_release_buffers      3
scsi_back_merge_fn_c      2
scsi_dispatch_cmd         2
end_buffer_io_kiobuf      2

Table 7: Kernprof data for DSW on the SBI kernel.

6 Raw I/O optimization patch

This section provides information on the optimization patch that we developed to increase the block size used for raw I/O. The patch can significantly improve CPU utilization by reducing the number of buffer heads needed for such operations.

As explained in Section 2, rw_raw_dev calls map_user_kiobuf to map the user buffer into a kiobuf, and then invokes brw_kiovec to submit the I/O. brw_kiovec breaks up each mapped page into sector-size pieces (normally 512 bytes) and passes them one at a time to make_request. Each sector-size piece is represented using one of the 1024 pre-allocated buffer heads associated with the kiobuf. Assuming a sector size of 512 bytes, brw_kiovec would use 512 buffer heads and invoke make_request 512 times to process a 256 KB raw read or write.

make_request uses the buffer head information to enqueue the request on the device-specific request queue and returns to brw_kiovec. When the lesser of all mapped pages or KIO_STATIC_PAGES of the kiobuf have been processed in this way, brw_kiovec calls kiobuf_wait_for_io. kiobuf_wait_for_io returns after the I/O completion routine has been called for all of the mapped buffer heads of the kiobuf.

While the block I/O subsystem will normally merge buffer heads into larger requests, there is still overhead incurred with each buffer head. For example, the interrupt handler for the block device must invoke the b_end_io method for each buffer head at I/O completion. The second column of Table 8 shows function call frequencies in a call graph trace for 128 reads of 128 KB each using a 512-byte block size.

The large number of calls to submit_bh() indicates the severity of the problem.

                                     Frequency
Kernel function                      Baseline    Baseline+rawvary
sys_read                             138         138
. raw_read                           128         128
. . rw_raw_dev                       128         128
. . . brw_kiovec                     128         128
. . . . submit_bh                    32768       4096
. . . . . generic_make_request       32789       4160
. . . . . . __make_request           32789       4160
. . . . . . elevator_linus_merge     32659       4029
. . . . . . scsi_back_merge_fn_c     32641       4013

Table 8: Reduction in frequencies of function calls using the rawvary patch for 128 reads of 128 KB each.

The patch we developed can reduce eight-fold the number of buffer heads required for a raw I/O operation. This was accomplished by changing brw_kiovec to break up the user buffer into sector-size pieces only until the buffer address is aligned on a page boundary. Once properly aligned, the remainder of the mapped pages are submitted to make_request with a block size (b_size) of 4 KB instead of sector-size. Note that the last buffer head may have a b_size which is neither sector-size nor 4 KB, depending on the total length of the I/O request.
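The chunking rule is simple enough to show directly. The standalone sketch below prints the buffer-head sizes this rule would generate for a misaligned transfer; it is an illustration of the algorithm just described, not the kernel code, and the sample address and length are arbitrary.

/* Standalone illustration of the rawvary chunking rule: emit
 * 512-byte pieces only until the buffer address reaches a page
 * boundary, then 4 KB pieces, with a possibly odd-sized tail. */
#include <stdio.h>

#define SECTOR 512
#define PAGE   4096

static void chunk(unsigned long addr, unsigned long len)
{
    int heads = 0;
    while (len > 0) {
        unsigned long sz;
        if (addr % PAGE)                    /* not yet page aligned */
            sz = SECTOR;
        else
            sz = len >= PAGE ? PAGE : len;  /* tail may be short */
        if (sz > len)
            sz = len;
        printf("buffer head: %5lu bytes at %#lx\n", sz, addr);
        addr += sz;
        len  -= sz;
        heads++;
    }
    printf("total buffer heads: %d\n", heads);
}

int main(void)
{
    chunk(0x1200, 12288);   /* a misaligned 12 KB transfer */
    return 0;
}

For this example the rule produces 10 buffer heads instead of the 24 sector-size heads the unpatched code would generate.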

Since we could not practically determine whether a given device driver can support buffer heads of variable block sizes in a merged request, the patch enables the optimization for the Adaptec, Qlogic SCSI, and IBM ServeRAID drivers only. Other drivers can make use of the patch by setting the can_do_varyio bit in the Scsi_Host_Template structure before calling scsi_register.

The third column of Table 8 highlights the reduction in kernel overhead as a result of using the patch. The number of calls to submit_bh is reduced by a factor of 8. The MOI of DSW improved by 16% over SBI, as seen in the fourth row of Table 2. The transfer rate of DM also increased slightly, from 235 MB/sec to 240 MB/sec (Table 3). However, there was an improvement of 55% in the idle time.

The raw I/O optimization patch, also known as the rawvary patch, has been integrated into Andrea Arcangeli's 2.4.18pre7aa2 kernel and Alan Cox's 2.4.18pre9-ac2 kernel.

7 Efficient support for vector I/O

Scatter-gather I/O is needed by an application when it needs to transfer data between a contiguous portion of a disk file and non-contiguous memory buffers in its address space. Typically this is done by invoking the readv()/writev() system calls and passing an array of struct iovec entries. Each iovec entry represents a contiguous memory buffer of length iov_len located at iov_base. This entry is henceforth called an iochunk, since the kernel does not define a distinct name for it and the term iovec suggests an array rather than an individual element. To simplify the discussion, we refer only to the readv operation. For raw I/O operations, writev differs mainly in the direction of data transfer.

In the 2.4 kernel, the readv system call using a file descriptor is implemented by calling the corresponding file's readv function. When there is no readv function exported, as is the case for raw I/O, the kernel defaults to using repeated invocations of the file's read function, which is always defined. Each iochunk of the iovec leads to a separate blocking read being performed. This imposes a dual penalty on the application. It imposes the overhead of multiple calls to various functions in the entire I/O processing path, from the top level sys_readv() down to the SCSI layer elevator and merging functions. Worse, it serializes the I/O requests seen by the low-level device driver. Since a separate read/write is performed for each iochunk and these calls block until I/O completes, the kernel's ability to take advantage of large DMA operations is severely limited. The elevator code invoked by the make_request() function cannot merge requests from different iochunks, and hence the SCSI device driver cannot create large scatter-gather lists for the controller.

To reduce this inefficiency, we created a patch defining readv and writev functions for raw devices. The functions operate in two phases while processing an iovec. In the first phase, they map the pages of several iochunk buffers into a single kiobuf. The number of pages mapped to a single kiobuf is limited by the KIO_STATIC_PAGES limit (which is 65 when the system page size is 4 KB). Once this limit is reached (or if the entire iovec has been mapped), brw_kiovec() is invoked to submit the I/O represented by the kiobuf. As explained in Section 2, brw_kiovec() is a blocking function that returns only when the corresponding I/O is complete or if there is an error. The two phases are repeated until all iochunks of the iovec are processed.

The patch relies upon one important modification to struct kiobuf. As explained in Section 2, struct kiobuf has only one offset and one length field. The offset field represents the offset into the (virtual) memory buffer. When the pages of multiple memory buffers are mapped into the same kiobuf, we need per-page offset and length information. We modified struct kiobuf to add this information using the following structure:

struct pinfo
{
    int poffset[KIO_STATIC_PAGES];
    int plen[KIO_STATIC_PAGES];
};

struct kiobuf
{
    :
    :
    struct pinfo *pinfo;
}

There are other approaches to providing readv/writev support. In an earlier attempt, we tried to map an iovec onto a kiovec consisting of multiple kiobufs. However, that approach increased memory consumption, since struct kiobuf is quite heavyweight, and also because the brw_kiovec() function only submits I/O for KIO_STATIC_PAGES pages at a time. Mapping one iochunk onto one kiobuf would have resulted in wasted pointers in the map_array without increasing the granularity at which I/O was submitted to the lower layers. Our current approach fits in well with the 2.4 kernel's practice of using only one kiobuf per file. The issue of the heavyweight struct kiobuf is discussed in Section 8, though the changes shown there do not warrant reexamining our choice to map multiple iochunks into a single kiobuf.

Using the readv/writev patch improves the MOI of DSW by 18% (Table 2) and the I/O transfer rate of DM from 104 MB/s to 241 MB/s (Table 3). CPU utilization also decreases significantly for both cases.
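For reference, a vectored raw read in the style of the DM readv test described in Section 3 looks as follows from user space. The device path is illustrative, and each iochunk is sector aligned as raw I/O requires.

/* Sketch of a vectored raw read: 16 iochunks of 4 KB gathered
 * into one readv() call, as in the DM readv test. */
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#define NVEC 16
#define VLEN 4096

int main(void)
{
    struct iovec iov[NVEC];
    ssize_t n;
    int fd, i;

    for (i = 0; i < NVEC; i++) {
        if (posix_memalign(&iov[i].iov_base, 512, VLEN) != 0) {
            perror("posix_memalign");
            return 1;
        }
        iov[i].iov_len = VLEN;   /* each iochunk is one 4 KB buffer */
    }

    fd = open("/dev/raw/raw1", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Without the readv/writev patch, the 2.4 kernel services this
     * as 16 separate blocking reads; with it, the iochunks are
     * mapped into one kiobuf and submitted together. */
    n = readv(fd, iov, NVEC);
    printf("readv returned %zd\n", n);

    close(fd);
    return 0;
}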

8 Lightweight kiobufs

In the 2.4.17 kernel, a kiobuf is allocated for each raw device open. The allocated kiobuf is saved in the f_iobuf field of the file object for the device special file and is used for doing reads/writes on the raw device. Each kiobuf is 8792 bytes in size and is allocated from vmalloc() space, which is generally 128 MB. Middleware such as database managers often keep a large number of files open. For raw I/O, the number of open calls generally scales with the number of devices (which are accessed through device special files). In such cases, a heavyweight kiobuf is a drain on the kernel's low memory in general and vmalloc space in particular.

To enable a large number of raw devices to be opened simultaneously, we modified the kiobuf structure to reduce its memory footprint. Much of the memory consumed by a kiobuf is due to two arrays:

struct buffer_head
    *bh[KIO_MAX_SECTORS];
unsigned long
    blocks[KIO_MAX_SECTORS];

With KIO_MAX_SECTORS being 1024, these arrays consume 8192 bytes. We changed the kiobuf structure as follows:

1. The buffer head array bh was replaced by a linked list. To link the various buffer heads of a kiobuf together, we used the b_next_free field of struct buffer_head. This field is not used in buffer-head processing in the raw I/O path.

2. The blocks array was replaced by a single number. Normally, the blocks array contains the physical disk block numbers corresponding to the logical blocks of a file. For accesses which don't go through a filesystem, the logical and physical disk blocks are the same. Hence, for raw I/O, the blocks array contains sequential numbers. We replaced the blocks array by a single number indicating the starting disk block and modified the code doing raw I/O to generate the remaining sequence of disk block numbers.

Together these modifications reduced the size of the kiobuf to 608 bytes and allowed kiobufs to be allocated using kmalloc() instead of vmalloc().

A further reduction in the memory footprint of the kiobuf was enabled by the use of the rawvary patch described in Section 6. Since I/O is done 4 KB at a time, a kiobuf needs only KIO_STATIC_PAGES (65) buffer heads instead of KIO_MAX_SECTORS (1024) to represent the maximum I/O that can be done using a single kiobuf.
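The bulk of the savings can be seen by comparing just the replaced fields. The sketch below models only the two arrays discussed above and their replacements, assuming a 32-bit build (4-byte pointers and longs); it approximates the 8792-to-608-byte reduction rather than reproducing the exact kernel layouts.

/* Rough size comparison of the replaced kiobuf fields.  Only the
 * fields discussed in the text are modelled; totals assume a
 * 32-bit build and are approximations, not exact kernel sizes. */
#include <stdio.h>
#include <stdint.h>

#define KIO_MAX_SECTORS  1024
#define KIO_STATIC_PAGES 65

struct old_kiobuf_arrays {
    void     *bh[KIO_MAX_SECTORS];       /* pre-allocated buffer heads */
    uint32_t  blocks[KIO_MAX_SECTORS];   /* per-sector block numbers   */
};

struct new_kiobuf_fields {
    void     *bh_list;        /* buffer heads chained via b_next_free */
    uint32_t  first_block;    /* blocks[] collapsed to a start block  */
};

int main(void)
{
    printf("old arrays:  %zu bytes\n", sizeof(struct old_kiobuf_arrays));
    printf("replaced by: %zu bytes\n", sizeof(struct new_kiobuf_fields));
    return 0;
}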

9 2.5 changes – tackling the root of the problem?

In part, the block layer rewrite in 2.5 was motivated by some of the well known shortcomings of the 2.4 block layer that we came across in the earlier sections. Of major concern were the suboptimal performance and resource overhead in the case of large I/O requests, I/O on high memory addresses, and I/O operations that do not originate directly from the buffer cache, like raw/direct I/O and page I/O.

Most of these problems stemmed from the use of the buffer head as the unit of I/O at the generic block layer, and the basic limitations on the size and nature of I/O buffers that could be represented by a single buffer head. It could only be a contiguous chunk at a virtually mapped address, of size one blocksize unit, which could not exceed a page and had to be aligned at a block boundary (as per the block size used). This led to the described inefficiencies in handling large I/O requests and readv/writev style operations, as it forced such requests to be broken up into small chunks so that they could be mapped to buffer heads before being passed on one by one to the generic block layer, only to be merged back by the I/O scheduler when the underlying device is capable of handling the I/O in one shot. Also, using the buffer head as an I/O structure for I/Os that didn't originate from the buffer cache unnecessarily added to the weight of the descriptors which were generated for each such chunk. At the same time, one of the good things about the original design was that splitting and merging of requests was a simple matter of breaking or chaining pointers, without requiring any memory allocation or move.

In the context of raw or direct I/O, a second aspect of concern was the weighty nature of the higher level kiobuf data structure, as discussed in earlier sections. One of the shortcomings of the kiobuf is that a single kiobuf can represent only a contiguous user address range, which makes it unsuitable for user space memory vectors of the form supplied by readv/writev. While arrays of kiobufs, namely kiovecs, are defined, they are too heavyweight for use in readv/writev.

Another crucial issue addressed in the rewrite was the matter of the single global I/O request lock bottleneck, especially in the case of independent/parallel I/Os to multiple disks.

9.1 The origin of BIO

The solution implemented in 2.5 by Jens Axboe [2] addresses these inefficiencies at a fundamental level by defining some new data structures. A flexible structure called BIO has been created for the block layer instead of using the buffer head structure directly, thus eliminating any associated baggage and restrictions. The abstraction is sector oriented and is unaware of filesystem block sizes.

The BIO structure uses a generic vector representation pointing to an array of <page, offset, length> tuples to describe the I/O buffer, and has various other fields describing I/O parameters and state that needs to be maintained for performing the I/O. The core memory vector representation is capable of describing a set of non-page-aligned fragments in a uniform manner across various layers, including zero copy network I/O and kernel asynchronous I/O [1]. This makes it possible for the same descriptor to be passed across subsystems and be useful for things like streaming I/O from network to disk and vice-versa. Such a descriptor can directly refer to buffers in a process-context-independent way, and forms an I/O currency similar to that proposed in [7].

The new scheme enables large, as well as vectored, I/Os to be described as a single unit within the limits of the device capabilities, and is adequate for specifying high memory buffers as well, since it doesn't require a virtual address mapping. The underlying DMA mapping functions have been modified to work with this representation. Bounce buffers become necessary only where the device does not support I/O into high memory buffers. In situations where the driver needs to access the buffer by virtual address, it performs a temporary kmap (e.g. if falling back to PIO in IDE).
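The vector representation can be sketched as follows. The fragment fields follow the 2.5 bio_vec; the surrounding bio structure is heavily trimmed to the fields relevant here, and struct page is stubbed so the sketch compiles standalone.

/* Simplified sketch of the 2.5 BIO vector representation.  The
 * bio_vec field names follow the kernel; the bio structure shown
 * is a reduced illustration, not the full definition. */
#include <stdio.h>

struct page;                       /* opaque here */

struct bio_vec {
    struct page *bv_page;          /* physical page of the fragment */
    unsigned int bv_len;           /* length of the fragment        */
    unsigned int bv_offset;        /* offset within the page        */
};

struct bio {
    unsigned long   bi_sector;     /* target sector on the device   */
    unsigned short  bi_vcnt;       /* entries in the vector         */
    unsigned short  bi_idx;        /* next un-transferred fragment  */
    struct bio_vec *bi_io_vec;     /* the <page, offset, len> tuples */
};

int main(void)
{
    struct bio_vec vec[2] = {
        { NULL, 4096, 0 },         /* one full page                 */
        { NULL, 512, 1024 },       /* a partial, unaligned fragment */
    };
    struct bio b = { 0, 2, 0, vec };

    /* No virtual address is involved: each fragment is a page plus
     * offset/length, so high memory needs no bounce buffer here.  */
    printf("bio of %u fragments starting at sector %lu\n",
           b.bi_vcnt, b.bi_sector);
    return 0;
}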

A low level request structure may consist of a chain of BIOs (potentially arising from multiple sources or callers) for a contiguous area on disk, a concept which retains some of the goodness of the original design in terms of ease of request merging and treatment of individual completion units. The BIO structure maintains an index into the vector to help keep track of which fragments have been transferred so far, in case the transfer or a subsequent copy happens in stages. Notice also that, potentially, a single entry in the vector could describe a fragment greater than a page size, i.e. across contiguous physical (or perhaps more accurately, logical) pages. Splitting an I/O request involves cloning the BIO structure and adjusting the indices to cover the desired portions of the original vector.

Using a separate structure introduces a level of allocation and setup in some cases, as a BIO has to be constructed for each I/O (e.g. rather than directly utilizing a bh in the case of buffered I/O). Typically BIOs are allocated from a designated BIO mempool, where mempool refers to Ingo Molnar's new memory pool infrastructure in 2.5. The allocation scheme is designed to avoid deadlocks, as in a scenario when the I/O in question is a writeout issued under memory pressure. A caller avoids the possibility of holding on to a BIO without initiating any action (like starting low level I/O) that would eventually recycle it back to the pool. The situation gets tricky if further BIO allocations become necessary in order to proceed with the request (e.g. a bounce BIO in situations where the device doesn't support highmem I/O, or BIO allocations required for splitting the I/O in the case of lvm/md/evms). To avoid any possibility of a deadlock, multiple allocations held at a time from the same pool by a thread ought to be atomic or pipelined. Alternatively, the allocations could be spread across multiple pools in an established order.

9.2 Elimination of IORL

Another major improvement in 2.5 is the removal of the global I/O request lock present in 2.4. Instead, every queue is associated with a pointer to a lock, which is held during queuing. This enables per-queue locks or shared locks across queues, depending on the level of concurrency supported by the underlying mid/driver layers. The SCSI mid-layer, for example, sets the lock pointer to the same per-adapter value for all request queues associated with the devices connected to a given host adapter. Unlike our patched 2.4 SCSI mid-layer, which serializes enqueuing per device, this locking scheme serializes at a coarser per-adapter granularity.

A notion of command pre-building outside of the queue lock and ahead of request processing by the device has been considered for its potential to improve throughput and interrupt responses, but it has not been explored entirely. Choosing the right moment to prebuild is not trivial: done too early, it would require rebuilding on every subsequent merge; done too late, e.g. at the time of actually issuing a request, it takes up cycles in request processing context, which dilutes the desired effect.

9.3 Better per-queue tuning

Improved modularization at the generic block level now enables better per-queue level tuning and consideration of higher level attributes for I/O scheduler performance under specific configurations and workloads. There is support for efficient I/O barriers in cases where corresponding hardware support exists, which could be useful for transaction-oriented I/O.

9.4 A job to do – utilizing the framework

At this point, work remains to be done in terms of modifying higher levels in the OS to make optimum use of this new infrastructure. Preliminary experiments running DM show that the 2.5.17 kernel outperforms SBIR for reads but does worse than SBIRV when readv is used (Table 3). This is consistent with the current state of implementation of the new block layer, where the readv path has not seen the benefits of the bio structure. In fact, we can even expect a slight degradation for small I/Os, because the memory vector structure is inherently a little more complex than the simple virtually mapped buffer in 2.4. For small single segment I/O, the drivers end up with an added check for the end of the array, and many of the BIO fields become almost redundant.

Therefore, intelligent pre-merging at higher levels makes sense in this context. A 1:1 mapping between buffer heads and BIOs is not quite efficient. There is ongoing work to rewrite some of the filesystem interfaces to move in this direction. Andrew Morton's multi-page read and writeout patches [8] assemble large BIOs for pagecache pages (for as many corresponding blocks as are contiguous on disk) and submit them directly to the request layer, bypassing buffer heads altogether.

From the perspective of raw/direct I/O, which are the main areas of consideration in the current paper, the relatively heavyweight kiobuf infrastructure would have to be replaced by something like the lighter kvec data structures in Ben LaHaise's asynchronous I/O patches [6], which can support readv/writev operations efficiently.

A kvec is pretty close to a bare abstraction of a memory vector array of the form used in a BIO, each tuple of the vector being referred to as a kveclet. It is usually more useful to pass around a kvec_cb structure, which refers to a kvec and its associated callback data for I/O completion purposes.

struct kveclet {
    struct page  *page;
    unsigned      offset;
    unsigned      length;
};

struct kvec {
    unsigned       max_nr;
    unsigned       nr;
    struct kveclet veclet[0];
};

struct kvec_cb {
    struct kvec *vec;
    void       (*fn)(...);
    void        *data;
};

A kvec can be mapped to BIO structures for block I/O and similarly to equivalent skb fragment structures in the case of network I/O. A single kvec may be split across multiple BIO structures (each pointing to the corresponding section of the kvec), each of which acts as a distinct completion unit when more than one low level device request is involved in serving the I/O. A large user space buffer (especially in the case of vectored I/O) might even be mapped to a big kvec a section at a time, and appropriately pipelined for I/O through multiple BIO requests, to potentially enhance throughput and latencies for partial completions.

In the case of direct I/O, extents of non-contiguous blocks would have to be mapped to separate BIO units.

There also has been some discussion on the maximum size of BIOs that may be pushed down to the block layer, from the perspective of avoiding chopping up an I/O unless it violates the underlying device limits. Because this decision is more complex than just a matter of absolute size, and may even depend on request queue state, it has been suggested that drivers could supply a grow_bio helper function to handle this.
Further complications arise in the case of layered drivers like lvm/md/evms. Andrew Morton has proposed a dynamic get_max_bytes interface exported by drivers (cascaded down layered drivers if required) to help build up appropriately sized BIOs and avoid splitting by the lower layers.

Observe that in 2.4, with fixed size (small) buffer heads, the approach was to never split a buffer, but to include it as part of the request or create a new request, depending on whether it could be fitted within the limits allowable for the device in question. In 2.5, the BIO represents larger variable sizes, having a variable number of segments. Such a simplistic approach could result in underutilization of request slots when merging I/Os from different sources. If a buffer exceeds the request size which the device can handle, it breaks up the request. However, splitting up a BIO for a correct fit requires an additional memory allocation. Some points of caution with regard to such allocations at the block layer level have been discussed in an earlier subsection. This is why the question of constructing BIOs of the right size arises.

A suitable solution would have to take into account that splitting is expected to be relatively infrequent. Since the general direction is to move towards merging early, get_max_bytes() could turn out to be a useful hint even for the corresponding clustering decisions. At the same time, it may not always be feasible or efficient in practice to absolutely guarantee elimination of the need to split I/Os. Thus, a provision for splitting may be required, with due caution, possibly with a structured use of multiple (layered) mempools and pipelined piecewise submissions to avoid deadlocks.

10 Conclusion and Future Work

In this paper we have highlighted some of the scalability and performance limitations of the 2.4 Linux kernel's block I/O subsystem. Using a decision-support benchmark that is representative of real-world enterprise workloads, we have shown that the 2.4.17 kernel sees I/O related performance bottlenecks when large I/Os are done on raw devices. We systematically investigated these bottlenecks and proposed solutions (as kernel patches) to alleviate them. As a result of using these patches, the decision-support workload sees a 233% improvement in its metric of interest. The benefits of these patches, all but one of which were written by the authors, are further demonstrated through a disk I/O microbenchmark and profiling data.

Most of the problems that we demonstrated are seen because of the use of the buffer head and kiobuf data structures. The new block I/O layer being written for the 2.5 kernel looks very promising, as it addresses almost all the problems outlined here. Much work remains to be done to efficiently utilize the new data structures introduced in 2.5. We will continue to actively participate in the kernel community's efforts to improve the performance of both the 2.4 and 2.5 kernels for enterprise workloads.

11 Acknowledgments

We would like to thank the many people on the [email protected] mailing list who provided us with valuable comments and suggestions during the development of these patches. In particular, we would like to thank Ruth Forester for helping resolve numerous issues with the decision-support workload and Helen Pang for collecting data on the disk I/O microbenchmark. We also appreciate the excellent DB2 performance analysis provided by John Tran, Karen Sullivan, and James Cho.

This work was developed as part of the Linux Scalability Effort (LSE) on SourceForge (sourceforge.net/projects/lse). All the patches mentioned in this paper can be found in the "I/O Scalability Package" at the LSE site.

This work represents the view of the authors, and does not necessarily represent the view of IBM.

References

[1] Suparna Bhattacharya. Design Notes on Asynchronous I/O (aio) for Linux. http://lse.sourceforge.net/io/aionotes.txt.

[2] Suparna Bhattacharya. Notes on 2.5 Block I/O Layer Changes. http://lse.sourceforge.net/io/bionotes.txt.

[3] R. Bryant and J. Hawkes. Lockmeter: Highly-Informative Instrumentation for Spin Locks in the Linux Kernel. In Proc. Fourth Annual Linux Showcase and Conference, Atlanta, Oct 2000.

[4] John Hawkes et al. (Silicon Graphics Inc.). Kernprof. http://oss.sgi.com/projects/kernprof/index.html.

[5] K. Railsback (InfoWorld Test Center). Linux 2.4 breaks the enterprise barrier. http://www.infoworld.com/articles/tc/xml/01/01/15/010115tclinux.xml.

[6] Benjamin LaHaise. Kernel Asynchronous I/O Patches. http://www.kvack.org/~blah/aio.

[7] Larry McVoy. The Splice I/O Model.

[8] Andrew Morton. Multi-page read and writeout patches. http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.8.

[9] Sharon Snider. I/O Performance HOWTO. http://www.tldp.org/HOWTO/IO-Perf-HOWTO/index.html.

Trademarks

The following terms are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both: IBM, DB2, ServeRAID.

Pentium is a trademark of Intel Corporation in the United States, other countries, or both.

Linux is a trademark of Linus Torvalds.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Other trademarks are the property of their respective owners.
