Improving Block Level Efficiency with scsi-mq
Blake Caldwell, NCCS/ORNL
March 4th, 2015
ORNL is managed by UT-Battelle for the US Department of Energy

Block Layer Problems
• Global lock on the request queue of each block device
• Cache coherency traffic
– If parts of a request are serviced on multiple cores, the lock must be obtained on the new core and invalidated on the old core
• Interrupt locality
– The hardware interrupt may arrive on the wrong core, requiring a soft-interrupt to be sent to the proper core

Linux Block Layer
• Designed with rotational media in mind
– Time spent in the queue allows sequential request reordering – a very good thing
– Completion latencies of 10ms to 100ms
• Single request queue
– Staging area for merging, reordering, and scheduling
• Drivers are presented with the same interface for each block device

blk-mq (multi-queue)
• Rewrite of the Linux block layer (since kernel 3.13)
• Two levels of queues (modeled in the sketch after the Evaluation Goals slide):
1) Per-core software submission queues
2) One or more hardware dispatch queues with affinity to NUMA nodes/CPUs (device-driver specific)
• IO scheduling within the software queues
– Requests are inserted in FIFO order, then interleaved into the hardware queues
• IOs carry tags that are reused to look them up on completion

[Figure: single-queue vs. blk-mq IO paths. In both, applications on cores 0-3 submit through libaio/ost_io and the page cache to the block device driver (SRP initiator), then to the block device (SRP target) and disk/SSD. With a single queue, all cores share one request queue; with blk-mq, per-core software submission queues feed hardware dispatch queues, exposing the hardware parallelism.]

IO Device Affinity
[Figure: OSS node topology. HCA 1 (LNET) and HCA 2 (to the storage controller) attach via PCIe to CPU 0 (NUMA node 0) and CPU 1 (NUMA node 1), which are linked by QPI.]

Controller Caching (direct)
[Figure: direct path. The OSS HCA reaches each LUN through its owning RAID processor: RP0/RP1 on controller ddn1 serve LUNs 0 and 1; RP0/RP1 on ddn2 serve LUNs 3 and 4.]

Controller Caching (indirect)
[Figure: indirect path. Same topology, but IO enters through the non-owning RAID processor and is forwarded internally to the LUN's owner.]

Evaluation Setup
• Linux 3.18
– blk-mq (3.13)
– scsi-mq (3.17)
– ib_srp multichannel (3.18)
– dm-multipath support (4.0)
• Lustre 2.7.0 rc1
– ldiskfs patches rebased to 3.18
• OSS
– Dual Ivy Bridge E5-2650 (2 NUMA nodes)
– 64GB
– Dual-port QDR IB to the array
• Storage Array
– Dual DDN 10k controllers
– 8GB write-back cache
– RAID 6 (8+2) LUNs
– SATA disks

Evaluation Goals
• Block-level testing: adapt tests done with the null-blk device to a real storage device with scsi-mq
– Does NUMA affinity awareness lead to efficiency?
  • Increased bandwidth
  • Decreased request latency
– Multipath performance
• Explore benefits for filesystems
– Ready for use with fabric-attached storage?
– Are there performance benefits?
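To make the blk-mq design above concrete, here is a toy Python model (illustration only, not kernel code) of the two-level queues: per-core software queues accept IOs in FIFO order, a dispatcher interleaves them into per-NUMA-node hardware queues, and a fixed tag pool gives O(1) lookup on completion. All names (TagSet, submit, dispatch) are invented for the sketch; the depth of 127 mirrors the array's fixed queue size.

    # Toy model of the blk-mq two-level queue design (not kernel code).
    from collections import deque

    class TagSet:
        """Fixed pool of tags; a completed IO's tag is reused for lookup."""
        def __init__(self, depth):
            self.free = deque(range(depth))   # e.g. queue size 127, as on the array
            self.inflight = {}                # tag -> io

        def start(self, io):
            tag = self.free.popleft()         # raises IndexError when the queue is full
            self.inflight[tag] = io
            return tag

        def complete(self, tag):
            io = self.inflight.pop(tag)       # O(1) lookup by tag on completion
            self.free.append(tag)             # tag is recycled for the next IO
            return io

    # Per-core software submission queues: IOs inserted in FIFO order.
    software_queues = {core: deque() for core in range(4)}
    # Hardware dispatch queues, e.g. one per NUMA node.
    hardware_queues = {node: deque() for node in range(2)}
    tags = TagSet(depth=127)

    def submit(core, io):
        software_queues[core].append(io)      # per-core staging, no shared lock

    def dispatch():
        """Interleave software queues into the hardware dispatch queues."""
        for core, swq in software_queues.items():
            node = core // 2                  # cores 0-1 -> node 0, cores 2-3 -> node 1
            while swq:
                io = swq.popleft()
                hardware_queues[node].append((tags.start(io), io))

    for core in range(4):
        submit(core, f"write-from-core-{core}")
    dispatch()
    done_tag, _ = hardware_queues[0].popleft()
    print("completed:", tags.complete(done_tag))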
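The Evaluation Parameters slide that follows lists the affinity and block-queue tunings used in the runs. As a concrete illustration, here is a minimal Python sketch that applies them through the standard sysfs and procfs attributes; the device names, IRQ numbers, and core assignments are hypothetical placeholders for this system (run as root, and check /proc/interrupts for the real IRQs).

    # Minimal sketch of the block tunings on the next slide, applied via
    # standard sysfs attributes. DEVICES and the IRQ/core pairs below are
    # hypothetical; adjust for the actual system.
    DEVICES = ["sdb", "sdc"]          # hypothetical SRP-attached LUNs

    def write_attr(path, value):
        with open(path, "w") as f:
            f.write(str(value))

    for dev in DEVICES:
        q = f"/sys/block/{dev}/queue"
        # noop scheduler (for the legacy single-queue path; 3.18 blk-mq
        # queues have no elevator at all).
        write_attr(f"{q}/scheduler", "noop")
        # rq_affinity=2: complete each request on the core that submitted it.
        write_attr(f"{q}/rq_affinity", 2)
        # max_sectors_kb = max_hw_sectors_kb: issue the largest IOs the HW accepts.
        with open(f"{q}/max_hw_sectors_kb") as f:
            write_attr(f"{q}/max_sectors_kb", f.read().strip())

    # Spread the HCA's MSI-X interrupts among cores on NUMA node 1
    # (hypothetical IRQ numbers and core IDs).
    for irq, core in [(90, 8), (91, 9), (92, 10), (93, 11)]:
        write_attr(f"/proc/irq/{irq}/smp_affinity_list", core)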
Evaluation Parameters
• Affinity
– IB MSI-X interrupts spread among the cores of NUMA node 1
– fio threads bound to NUMA node 1 (closest to the HCA)
– Block device rq_affinity=2 – completion happens on the submitting core
• Block tuning
– noop scheduler
– max_sectors_kb = max_hw_sectors_kb
• ib_srp module
– 16 channels
– max_sect 8192
– max_cmd_per_lun 62
– queue size 127 (fixed by the hardware)

Throughput (direct path), 1 thread/LUN
[Chart: write IOPS (0-350,000) vs. number of LUNs (1, 2, 4, 6) for 2.6.32-431, 3.18 mq, and 3.18 mq+mc.]

Throughput (indirect path), 1 thread/LUN
[Chart: write IOPS (0-350,000) vs. number of LUNs (1, 2, 4, 6) for 2.6.32-431, 3.18 mq, and 3.18 mq+mc.]

CPU Usage (direct path), 1 thread/LUN
[Chart: user and system CPU % (0-90) vs. number of LUNs (1, 2, 4, 6, 8) for 2.6.32-431, 3.18 mq, and 3.18 mq+mc.]

Throughput (direct path), 1 LUN
[Chart: write IOPS (0-600,000) vs. number of threads (1, 2, 4, 8) for 2.6.32-431, 3.18 mq, and 3.18 mq+mc.]

Read Throughput (direct path), 1 LUN
[Chart: read IOPS (0-350,000) vs. number of threads for 2.6 sq, 3.18 mq, and 3.18 mq+mc.]

dm-multipath support
• Evaluated on Linux 4.0rc1 with patches for dm blk-mq support
– 4k IO, libaio, iodepth=127, SRP multichannel support enabled

  Path                 Bandwidth    IOPS
  Multipath            393.2 MB/s   100658
  Direct (no IO FWD)   849.5 MB/s   217468
  Indirect (IO FWD)    854.8 MB/s   218829

Request latency
• Average over 100 4k IOs, fio with the synchronous IO engine

  Kernel            Multipath   Direct (no IO FWD)   Indirect (IO FWD)
  2.6.32-431.17.1   130.50      119.92               169.22
  4.0rc1 (blk-mq)   130.56      103.58               142.84
  Improvement       -0.04%      13.6%                15.6%

Lustre Profiling
• FlameGraph of kernel code using perf at 100Hz; Linux 3.18, Lustre 2.7.0rc1, 1MB writes to a single OST (a sketch of the pipeline appears after the final slide)
[Figure: FlameGraph of OSS kernel stacks during the write workload.]

Lustre Applications
• Metadata IO
– Improve single-request latency
– Is bandwidth necessary while flushing metadata to the MDT?
• Object IO
– Scheduling
  • Request size
  • Request tagging

Future Directions
• Lustre with Linux 4.0
• Testing with hardware capable of 600k+ 4k IOPS
– Random write performance with multiple threads/LUN
• Evaluate sequential writes with multiple threads/LUN
• Read and random tests need further investigation

Conclusion
• scsi-mq has the potential to lower CPU usage even with rotational media
• scsi-mq has lower IO completion latency
• Further evaluation is needed of device drivers that support multiple hardware dispatch queues

Thank You
Blake Caldwell <[email protected]>
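As a closing reference, the FlameGraph on the Lustre Profiling slide can be reproduced with the standard perf-to-FlameGraph pipeline. A minimal Python sketch follows; the FlameGraph checkout path, output file name, and 30-second sampling window are assumptions, and stackcollapse-perf.pl / flamegraph.pl come from Brendan Gregg's FlameGraph repository (https://github.com/brendangregg/FlameGraph).

    # Sketch of the perf -> FlameGraph pipeline behind the profiling slide:
    # sample all CPUs at 100Hz with call graphs while the 1MB-write workload
    # runs, then fold the stacks and render an SVG.
    import subprocess

    FLAMEGRAPH_DIR = "/opt/FlameGraph"    # hypothetical checkout location

    # 1) Record stacks at 100Hz for a 30 s window (duration is an assumption).
    subprocess.run(["perf", "record", "-F", "100", "-a", "-g",
                    "--", "sleep", "30"], check=True)

    # 2) Dump the samples, fold them, and render the flame graph.
    script = subprocess.run(["perf", "script"],
                            capture_output=True, text=True, check=True)
    folded = subprocess.run([f"{FLAMEGRAPH_DIR}/stackcollapse-perf.pl"],
                            input=script.stdout,
                            capture_output=True, text=True, check=True)
    svg = subprocess.run([f"{FLAMEGRAPH_DIR}/flamegraph.pl"],
                         input=folded.stdout,
                         capture_output=True, text=True, check=True)
    with open("lustre-oss.svg", "w") as f:
        f.write(svg.stdout)               # hypothetical output file name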