Disk IO Tuning in AIX 6.1

Technical Education, October 2011
IBM Systems Group

Author: Dan Braden
Presenter: Steve Nasypany
IBM Advanced Technical Skills
http://w3.ibm.com/support/americas/pseries
© 2010 IBM Corporation

Agenda
  The importance of IO tuning
  Disk basics and performance overview
  AIX IO stack
  Data layout
  Characterizing application IO
  Disk performance monitoring tools
  Testing disk subsystem performance
  Tuning

Why is disk IO tuning important?
  Moore's law
    Processors double in price performance every 18 months
  Disk growth
    Disk densities are doubling every 12 months
    Customers are doubling storage capacities every 12-18 months
    Actuator and rotational speed are increasing relatively slowly
  Network bandwidth is doubling every 6 months
  Approximate CPU cycle time       0.0000000005 seconds
  Approximate memory access time   0.000000270 seconds
  Approximate disk access time     0.010000000 seconds
    Memory access takes 540 CPU cycles
    Disk access takes 20 million CPU cycles, or 37,037 memory accesses
  System bottlenecks are being pushed to the disk
  Disk subsystems are using cache to improve IO service times
  Customers now spend more on storage than on servers
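The cycle counts quoted for memory and disk access follow directly from the three approximate access times. A minimal sketch of that arithmetic in Python; the variable and function names are illustrative and not part of the original deck, and Decimal is used only so the divisions are not disturbed by binary floating-point rounding:

```python
from decimal import Decimal

# Approximate access times as quoted in the slide, in seconds.
CPU_CYCLE = Decimal("0.0000000005")
MEMORY_ACCESS = Decimal("0.000000270")
DISK_ACCESS = Decimal("0.010000000")


def whole_ratio(slow: Decimal, fast: Decimal) -> int:
    """Number of complete 'fast' operations that fit in one 'slow' operation."""
    return int(slow / fast)


memory_in_cycles = whole_ratio(MEMORY_ACCESS, CPU_CYCLE)
disk_in_cycles = whole_ratio(DISK_ACCESS, CPU_CYCLE)
disk_in_memory_accesses = whole_ratio(DISK_ACCESS, MEMORY_ACCESS)

print(f"memory access = {memory_in_cycles:,} CPU cycles")
print(f"disk access   = {disk_in_cycles:,} CPU cycles")
print(f"disk access   = {disk_in_memory_accesses:,} memory accesses")
```

The three results reproduce the 540, 20 million and 37,037 figures given above.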
Seagate 15k RPM/3.5" drive specifications, 2002 vs. 2010

  Metric                     2002    2010    Change (chart label)
  Capacity (GB)                73     450    +35%
  Max sustained DR (MB/s)      75     180    +15%
  Read seek (ms)              3.6     3.4    -1%

  Disk IO service time is not improving compared to processors

Performance metrics
  Disk metrics
    MB/s
    IOPS
    With a reasonable service time
  Application metrics
    Response time
    Batch job run time
  System metrics
    CPU, memory and IO
  Size for your peak workloads
  Size based on maximum sustainable thruputs
  Bandwidth and thruput sometimes mean the same thing, sometimes not
  For tuning, it's good to have a short running job that's representative of your workload
  Use a relevant metric for testing
    Should be tied to business costs, benefits or requirements
    Batch job run time
    Maximum or sufficient application transactions/second
    Query run time
  Metrics that typically are not so relevant
    Application transaction time if < a few seconds
  Metrics indicating bottlenecks
    CPU, memory, network, disk
    Important if the application metric goal isn't met
  Be aware of IO from other systems affecting disk performance to shared disk
  If benchmarking two systems, be sure the disk performance is apples to apples and you're not really comparing disk subsystem performance

Disk performance
  "ZBR" geometry
  Interface type
    ATA
    SATA
    SCSI
    FC
    SAS
  IO service times are predominantly seek + rotational latency + queueing time

Disk performance
  When do you have a disk bottleneck?
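As a baseline for answering that question, the unqueued part of the service-time decomposition (seek + rotational latency) can be estimated from the drive figures quoted earlier. A minimal sketch in Python, assuming the 3.4 ms read seek of the 2010 Seagate drive and an average rotational latency of half a revolution. The function names are hypothetical, and the model deliberately ignores transfer time, queueing, command reordering and cache hits, so it is a rough floor for a single random read rather than a prediction:

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency: half of one revolution, in milliseconds."""
    ms_per_revolution = 60_000.0 / rpm
    return 0.5 * ms_per_revolution


def unqueued_service_time_ms(seek_ms: float, rpm: float) -> float:
    """Seek time plus average rotational latency, with no queueing."""
    return seek_ms + avg_rotational_latency_ms(rpm)


def unqueued_iops(seek_ms: float, rpm: float) -> float:
    """IOs per second if the disk handles one such IO at a time."""
    return 1000.0 / unqueued_service_time_ms(seek_ms, rpm)


# Assumed inputs: 15,000 RPM and the 3.4 ms read seek from the Seagate figures.
latency = avg_rotational_latency_ms(15_000)
service = unqueued_service_time_ms(3.4, 15_000)
iops = unqueued_iops(3.4, 15_000)

print(f"rotational latency: {latency:.1f} ms")
print(f"unqueued service time: {service:.1f} ms")
print(f"unqueued IOPS: {iops:.0f}")
```

Under these assumptions, average service times well above this floor come from queueing rather than from the mechanics of a single IO.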
    Random workloads
      Reads average > 15 ms
      With write cache, writes average > 2.5 ms
    Sequential workloads
      Two sequential IO streams on one disk
      You need more thruput

  [Figure: IOPS vs. IO service time for a 15,000 RPM disk. X-axis: IOPS, 25 to 325. Y-axis: IO service time (ms), 0 to 500.]

How to improve disk performance
  Reduce the number of IOs
    Bigger caches
      Application, file system, disk subsystem
    Use caches more efficiently
    No file system logging
    No access time updates
  Improve average IO service times
    Better data layout
    Reduce locking for IOs
    Buffer/queue tuning
    Use SSDs or RAM disk
    Faster disks/interfaces, more disks
    Short stroke the disks and use the outer edge
    Smooth the IOs out over time
  Reduce the overhead to handle IOs

What is %iowait?
  A misleading indicator of disk performance
  A type of CPU idle
    Percent of time the CPU is idle and waiting on an IO so it can do some more work
  High %iowait does not necessarily indicate a disk bottleneck
    Your application could be IO intensive, e.g. a backup
    You can make %iowait go to 0 by adding CPU intensive jobs
  Low %iowait does not necessarily mean you don't have a disk bottleneck
    The CPUs can be busy while IOs are taking unreasonably long times
  If disk IO service times are good, you aren't getting the performance you need, and you have significant %iowait, consider using SSDs or RAM disk
    Improve performance by potentially reducing %iowait to 0

Solid State Disk (SSD)
  High performance electronic disk
    From 14,000 to 27,000 IOPS possible for a single SSD
    SSD IO bandwidth varies across Power and disk subsystems
    Typically small (69-177 GB) and expensive compared to HDDs
    Read or write IOs typically < 1 ms
      About the same IO service time as writes to disk subsystem cache
      About 5-15X faster than reads from disk
  Positioned for high access density (IOPS/GB) random read data
    Implementation involves finding the best data to place on the SSDs
  SSDs can save disk costs by reducing the number of spindles needed
    When high access density data exists
    A mix of SSDs and HDDs is often best

SSD vs. HDD performance
  SSD offers up to 33X to 125X more IOPS
  HDD IO service time typically 5X to 40X slower*
  * Access time is drive-to-drive, ignoring any caching by SAS controller

  [Figure: two bar charts comparing HDD and SSD. IOPS: HDD 1X, SSD 33X to 125X. IO service time: SSD 1X, HDD 5X to 40X.]

RAM disk
  Use system RAM to create a virtual disk
  Data is lost in the event of a reboot or system crash
  IOs complete with RAM latencies
  For file systems, it takes away from file system cache
    Taking from one pocket and putting it into another
  A raw disk or file system only; no LVM support

    # mkramdisk 16M
    /dev/rramdisk0
    # mkfs -V jfs2 /dev/ramdisk0
    mkfs: destroy /dev/ramdisk0 (yes)? y
    File system created successfully.
    16176 kilobytes total disk space.
    Device /dev/ramdisk0:
      Standard empty filesystem
      Size:           32352 512-byte (DEVBLKSIZE) blocks
    # mkdir /ramdiskfs
    # mount -V jfs2 -o log=NULL /dev/ramdisk0 /ramdiskfs
    # df -m /ramdiskfs
    Filesystem    MB blocks      Free %Used    Iused %Iused Mounted on
    /dev/ramdisk0     16.00     15.67    3%        4     1% /ramdiskfs

The AIX IO stack
  Layers, from top to bottom, with notes on each
    Application
      Application memory area caches data to avoid IO
    Logical file system
      Raw disks, raw LVs, JFS, JFS2, NFS, other
      NFS caches file attributes
      NFS has a cached filesystem for NFS clients
    VMM
      JFS and JFS2 cache use extra system RAM
      JFS uses persistent pages for cache
      JFS2 uses client pages for cache
    LVM (LVM device drivers)
    Multi-path IO driver (optional)
    Disk device drivers
      Queues exist for both adapters and disks
    Adapter device drivers
      Adapter device drivers use DMA for IO
    Disk subsystem (optional)
      Disk subsystems have read and write cache
    Disk
      Disks have memory to store commands/data
  Diagram legend: write cache; read cache or memory area used for IO
  IOs can be coalesced (good) or split up (bad) as they go thru the IO stack
    IOs adjacent in a file/LV/disk can be coalesced
    IOs greater than the maximum IO size supported will be split up

Synchronous vs Asynchronous IOs
  Definition depends on the frame of reference
  Programmers/application
    When an application issues a synchronous IO, it waits until the IO is complete
    Asynchronous IOs are handed off to the kernel, and the application continues, using the AIO facilities in AIX
      When a group of asynchronous IOs completes, a signal is sent to the application
      Allows IO and processing to run simultaneously
  Filesystem IO
    Synchronous write IOs to a file system must get to disk
    Asynchronous IOs only need to get to file system cache
  GLVM or disk subsystem mirroring
    Synchronous mirroring requires that writes to both mirrors complete before returning an acknowledgement to the application
    Asynchronous mirroring returns an acknowledgement when the write completes at the local storage
      Writes to remote storage are done in the same order as locally

Data layout
  Data layout affects IO performance more than any tunable IO parameter
  Good data layout avoids dealing with disk hot spots
    An ongoing management issue and cost
  Data layout must be planned in advance
    Changes are often painful
  iostat and filemon can show unbalanced IO
  Best practice: evenly balance IOs across all physical disks
  Random IO best practice
    Spread IOs evenly across all physical disks
    For disk subsystems
      Create RAID arrays of equal size and RAID level
      Create VGs with one LUN from every array
      Spread all LVs across all PVs in the VG
    The SVC can do this automatically, and XIV does

Random IO data layout
  [Figure: a disk subsystem with five RAID arrays (1 to 5); each array presents one LUN or logical disk, which becomes a PV (1 to 5) in datavg.]

    # mklv lv1 -e x hdisk1 hdisk2 … hdisk5
    # mklv lv2 -e x hdisk3 hdisk1 …. hdisk4
    …..

  Use a random order for the hdisks for each LV

Data layout for sequential IO
  Many factors affect sequential thruput
    RAID setup, number of threads, IO size, reads vs. writes
  Create RAID arrays with data stripes a power of 2
    RAID 5 arrays of 5 or 9 disks
    RAID 10 arrays of 2, 4, 8, or 16 disks
  Do application IOs equal to, or a multiple of, a full stripe on the RAID array
    Or use multiple threads to submit many IOs
  N disk RAID 5 arrays can handle no more than N-1 sequential IO streams before the IO becomes randomized
  N disk RAID 10 arrays can do N sequential read IO streams and N/2 sequential write IO streams before the IO becomes randomized
  Sometimes smaller strip sizes (around 64 KB) perform better
  Test your setup if the bandwidth needed is high

Data layout
  Best practice for VGs and LVs
    Use Big or Scalable VGs
      Both support no LVCB header on LVs (only important for raw LVs)
      These can
