Improving Block Level Efficiency with scsi-mq
Blake Caldwell, NCCS/ORNL
March 4th, 2015
ORNL is managed by UT-Battelle for the US Department of Energy

Block Layer Problems
• Global lock of the request queue per block device
• Cache coherency traffic
  – If part of a request is serviced on multiple cores, the lock must be obtained on the new core and invalidated on the old core
• Interrupt locality
  – A hardware interrupt may occur on the wrong core, requiring a soft-interrupt to be sent to the proper core
Linux Block Layer
• Designed with rotational media in mind
  – Time spent in the queue allows sequential request reordering – a very good thing
  – Completion latencies of 10ms to 100ms
• Single request queue
  – Staging area for merging, reordering, scheduling
• Drivers are presented with the same interface for each block device
blk-mq (multi-queue)
• Rewrite of the Linux block layer (since kernel 3.13)
• Two levels of queues
  1) Per-core software submission queues
  2) One or more hardware dispatch queues with affinity to NUMA nodes/CPUs (device-driver specific)
• IO scheduling within software queues
  – Inserted in FIFO order, then interleaved to hardware queues
• IOs are tagged on submission; the tags are reused for lookup on completion
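On a blk-mq kernel, the mapping from per-core software submission queues to hardware dispatch queues is exposed through sysfs; a minimal inspection sketch (the device name sdb is an assumption, substitute any blk-mq managed device):

```shell
# Device name is an assumption for illustration.
DEV=sdb

# One directory per hardware dispatch queue.
ls /sys/block/$DEV/mq/

# cpu_list shows which cores' software submission queues feed
# each hardware dispatch queue.
for q in /sys/block/$DEV/mq/*/cpu_list; do
    echo "$q: $(cat "$q")"
done
```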
[Figure: blk-mq vs. the single-queue block layer. Applications submit via libaio/ost_io through the page cache; with blk-mq, requests flow into per-core software submission queues, then hardware dispatch queues in the block device driver (SRP initiator), exploiting hardware parallelism to reach the block device (SRP target) and its disks/SSDs. With the single queue, all cores share one request queue.]

IO Device Affinity
[Figure: OSS node topology. Two NUMA nodes joined by QPI, each with a CPU and a PCIe-attached HCA; HCA 1 carries LNET traffic, HCA 2 connects to the storage controller.]
Controller Caching (direct)
[Figure: direct path. The OSS HCA connects to RAID processors RP0/RP1 on controllers ddn1 and ddn2, which serve LUNs 0, 1, 3, and 4.]
Controller Caching (indirect)
[Figure: indirect path. The OSS HCA connects to RAID processors RP0/RP1 on controllers ddn1 and ddn2; IO to LUNs 0, 1, 3, and 4 is forwarded between controllers.]
Evaluation Setup
• Linux 3.18
  – blk-mq (3.13)
  – scsi-mq (3.17)
  – ib-srp multichannel (3.18)
  – dm-multipath support (4.0)
• Lustre 2.7.0 rc1
  – ldiskfs patches rebased to 3.18
• OSS
  – Dual Ivy Bridge E5-2650 (2 NUMA nodes)
  – 64GB
  – Dual-port QDR IB to array
• Storage Array
  – Dual DDN 10k controllers
  – 8GB write-back cache
  – RAID 6 (8+2) LUNs
  – SATA disk
Evaluation Goals
• Block level testing: adapt tests done with the null-blk device to a real storage device with scsi-mq
  – Does NUMA affinity awareness lead to efficiency?
    • Increased bandwidth
    • Decreased request latency
    • Multipath performance
• Explore benefits for filesystems
  – Ready for use with fabric-attached storage?
  – Are there performance benefits?
Evaluation Parameters
• Affinity
  – IB MSI-X interrupts spread among cores on NUMA node 1
  – Fio threads bound to NUMA node 1 (closest to HCA)
  – Block device rq_affinity=2 – completion happens on the submitting core
• Block tuning
  – Noop scheduler
  – max_sectors_kb = max_hw_sectors_kb
• ib-srp module
  – 16 channels
  – max_sect 8192
  – max_cmd_per_lun 62
  – queue_size 127 (fixed by hardware)
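The tunings above map onto sysfs and SRP login parameters; a sketch, with device names, IRQ numbers, and target identifiers as illustrative assumptions (on a blk-mq device the scheduler line may report "none" instead of accepting "noop"):

```shell
# Spread the HCA's MSI-X interrupts across cores on NUMA node 1
# (IRQ number and CPU mask are assumptions).
echo ff00 > /proc/irq/70/smp_affinity

for DEV in sdb sdc; do
    # rq_affinity=2: complete each request on the core that submitted it.
    echo 2 > /sys/block/$DEV/queue/rq_affinity
    # No reordering by the elevator.
    echo noop > /sys/block/$DEV/queue/scheduler
    # Allow requests as large as the hardware supports.
    cat /sys/block/$DEV/queue/max_hw_sectors_kb > \
        /sys/block/$DEV/queue/max_sectors_kb
done

# SRP login; target identifiers elided, HCA port path is an assumption.
echo "id_ext=...,ioc_guid=...,dgid=...,pkey=ffff,service_id=...,\
max_sect=8192,max_cmd_per_lun=62,queue_size=127,ch_count=16" > \
    /sys/class/infiniband_srp/srp-mlx4_0-1/add_target
```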
Throughput (direct path), thread/LUN
[Chart: write IOPS vs. number of LUNs (1, 2, 4, 6) for 2.6.32-431, 3.18 mq, and 3.18 mq+mc; y-axis 0–350,000 IOPS.]
Throughput (indirect path), thread/LUN
[Chart: write IOPS vs. number of LUNs (1, 2, 4, 6) for 2.6.32-431, 3.18 mq, and 3.18 mq+mc; y-axis 0–350,000 IOPS.]
CPU Usage (direct path), thread/LUN
[Chart: CPU% (0–90) vs. number of LUNs (1, 2, 4, 6, 8), split into usr and sys time for 2.6.32-431, 3.18 mq, and 3.18 mq+mc.]
Throughput (direct path), 1 LUN
[Chart: write IOPS vs. number of threads (1, 2, 4, 8) on one LUN for 2.6.32-431, 3.18 mq, and 3.18 mq+mc; y-axis 0–600,000 IOPS.]
Read Throughput (direct path), 1 LUN
[Chart: read IOPS vs. number of threads (1–8) on one LUN for 2.6 sq, 3.18 mq, and 3.18 mq+mc; y-axis 0–350,000 IOPS.]
dm-multipath support
• Evaluated on Linux 4.0rc1 with patches for dm blk-mq support
  – 4k IO, libaio, iodepth=127, SRP multi-channel support enabled
            Multipath    Direct (no IO FWD)   Indirect (IO FWD)
Bandwidth   393.2 MB/s   849.5 MB/s           854.8 MB/s
IOPS        100658       217468               218829
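A sketch of the fio workload described above (4k IO, libaio, iodepth=127); the multipath device node and NUMA binding via numactl are assumptions for illustration:

```shell
# Job file matching the parameters on this slide.
cat > mq-randwrite.fio <<'EOF'
[global]
direct=1
ioengine=libaio
iodepth=127
bs=4k
rw=randwrite
runtime=60
group_reporting

[lun0]
filename=/dev/dm-0
numjobs=1
EOF

# Bind fio to NUMA node 1, closest to the HCA (see Evaluation Parameters).
numactl --cpunodebind=1 --membind=1 fio mq-randwrite.fio
```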
Request latency
• Average over 100 4k IOs, fio with the synchronous IO engine
Kernel            Multipath   Direct (no IO FWD)   Indirect (IO FWD)
2.6.32-431.17.1   130.50      119.92               169.22
4.0rc1 (blk-mq)   130.56      103.58               142.84
Improvement       -0.04%      13.6%                15.6%
Lustre Profiling
• FlameGraph of kernel code, sampled with perf at 100Hz; Linux 3.18, Lustre 2.7.0rc1, 1MB writes to a single OST
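A FlameGraph like this can be produced with perf sampling at 100Hz plus Brendan Gregg's FlameGraph scripts; a sketch, assuming the scripts are cloned into ./FlameGraph and the workload runs during the 30-second sample window:

```shell
# Sample all CPUs at 100 Hz with call graphs while the write workload runs.
perf record -F 100 -a -g -- sleep 30

# Fold the stacks and render the interactive SVG.
perf script > out.perf
./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
./FlameGraph/flamegraph.pl out.folded > lustre-oss.svg
```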
Lustre Applications
• Metadata IO
  – Improve single-request latency
  – Is bandwidth necessary while flushing metadata to the MDT?
• Object IO
  – Scheduling
    • Request size
    • Request tagging
Future Directions
• Lustre with Linux 4.0
• Testing with hardware capable of 600k+ 4k IOPS
  – Random write performance with multiple threads/LUN
• Evaluate sequential writes with multiple threads/LUN
• Read and random tests need further investigation
Conclusion
• scsi-mq has the potential to lower CPU usage even with rotational media
• scsi-mq has lower IO completion latency
• Further evaluation is needed of device drivers that support multiple hardware dispatch queues
Thank You
Blake Caldwell