Improving Block Level Efficiency with scsi-mq

Blake Caldwell
NCCS/ORNL
March 4th, 2015

Block Layer Problems

• Global lock of the request queue per block device (illustrated by the sketch below)
• Cache coherency traffic
  – If parts of a request are serviced on multiple cores, the lock must be obtained on the new core and invalidated on the old core
• Interrupt locality
  – A hardware interrupt may occur on the wrong core, requiring a soft-interrupt to be sent to the proper core
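As a rough userspace analogy of the problem (illustrative code, not the kernel's): every submitting thread, whichever core it runs on, must take the same lock, so the lock's cache line bounces between cores exactly as described above. Compile with -pthread.

    /* Userspace analogy of a single per-device request queue:
     * all cores funnel through one lock. Not kernel code. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define NREQS    1000000

    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
    static long queued;   /* stand-in for the shared request queue */

    static void *submitter(void *arg)
    {
        (void)arg;
        for (int i = 0; i < NREQS; i++) {
            /* The lock's cache line migrates to the acquiring core
             * and is invalidated on the previous owner. */
            pthread_mutex_lock(&queue_lock);
            queued++;
            pthread_mutex_unlock(&queue_lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];

        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, submitter, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("queued %ld requests through one lock\n", queued);
        return 0;
    }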

Block Layer

• Designed with rotational media in mind
  – Time spent in the queue allows sequential request reordering – a very good thing (see the sketch below)
  – Completion latencies of 10ms to 100ms
• Single request queue
  – Staging area for merging and reordering requests
• Drivers are presented with the same interface for each block device
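To make the staging idea concrete, here is a minimal sketch (illustrative, not the kernel's elevator code) of what time in a single queue buys on rotational media: queued requests can be sorted by sector and contiguous ones merged, so the disk head moves sequentially.

    /* Minimal model of request staging: sort by start sector,
     * then merge contiguous requests. Illustrative only. */
    #include <stdio.h>
    #include <stdlib.h>

    struct req { long sector, nsect; };

    static int by_sector(const void *a, const void *b)
    {
        const struct req *x = a, *y = b;
        return (x->sector > y->sector) - (x->sector < y->sector);
    }

    int main(void)
    {
        struct req q[] = { {100, 8}, {0, 8}, {8, 8}, {200, 8}, {16, 8} };
        int n = sizeof(q) / sizeof(q[0]), m = 0;

        /* Reorder: a rotational disk prefers ascending sectors. */
        qsort(q, n, sizeof(q[0]), by_sector);

        /* Merge requests whose sector ranges touch. */
        for (int i = 1; i < n; i++) {
            if (q[m].sector + q[m].nsect == q[i].sector)
                q[m].nsect += q[i].nsect;   /* grow the merged request */
            else
                q[++m] = q[i];
        }
        for (int i = 0; i <= m; i++)
            printf("dispatch: sector %ld, %ld sectors\n",
                   q[i].sector, q[i].nsect);
        return 0;
    }

Five 8-sector requests collapse into three dispatches, the first covering sectors 0-23 in one sequential sweep.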

blk-mq (multi-queue)

• Rewrite of the Linux block layer (mainline since kernel 3.13)
• Two levels of queues (modeled in the sketch below)
  1) Per-core software submission queues
  2) One or more hardware dispatch queues with affinity to NUMA nodes/CPUs (device-driver specific)
• IO scheduling within the software queues
  – Requests are inserted in FIFO order, then interleaved to the hardware queues
• IOs are tagged at submission; the tag is reused to look the request up on completion
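A minimal userspace model of the two-level design (all names hypothetical, not the kernel's blk-mq API): each CPU appends to its own software queue in FIFO order, a CPU-to-hardware-queue map keeps dispatch NUMA-local, and a small tag table matches completions back to requests without searching.

    /* Toy model of blk-mq's two queue levels and tag-based
     * completion. Hypothetical names; not the kernel's API. */
    #include <stdio.h>

    #define NCPUS  4
    #define NHWQ   2            /* e.g. one per NUMA node */
    #define QDEPTH 8

    struct request { int tag; long sector; };

    /* Per-CPU software submission queues (FIFO); dispatch omitted. */
    static struct request swq[NCPUS][QDEPTH];
    static int sw_tail[NCPUS];

    /* Tag table: a completion finds its request by tag, no scan. */
    static struct request *tag_table[NHWQ * QDEPTH];
    static int next_tag;

    static int cpu_to_hwq(int cpu)
    {
        return cpu * NHWQ / NCPUS;  /* CPUs 0-1 -> hwq 0, 2-3 -> hwq 1 */
    }

    static void submit(int cpu, long sector)
    {
        struct request *rq = &swq[cpu][sw_tail[cpu]++ % QDEPTH];

        rq->tag = next_tag++ % (NHWQ * QDEPTH);
        rq->sector = sector;
        tag_table[rq->tag] = rq;    /* remembered for completion */
        printf("cpu %d: sector %ld -> swq %d -> hwq %d, tag %d\n",
               cpu, sector, cpu, cpu_to_hwq(cpu), rq->tag);
    }

    static void complete(int tag)
    {
        struct request *rq = tag_table[tag];  /* O(1) lookup */
        printf("completed sector %ld (tag %d)\n", rq->sector, tag);
    }

    int main(void)
    {
        for (int cpu = 0; cpu < NCPUS; cpu++)
            submit(cpu, 1000 + cpu * 8);
        complete(2);
        return 0;
    }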

[Figure: side-by-side IO paths for blk-mq and the legacy single queue. In each, an application (libaio, ost_io) submits through the page cache; with blk-mq, per-core software submission queues (Cores 0-3) feed hardware dispatch queues that exploit hardware parallelism in the block device driver (SRP initiator), and requests continue to the block device (SRP target) and its disks/SSDs.]

IO Device Affinity

[Figure: IO device affinity within an OSS node. Two NUMA nodes, each with its own CPU and a PCIe-attached HCA (HCA 1 and HCA 2), are joined by QPI; one HCA carries LNET traffic while the other connects to the storage controller, so IO should stay on the NUMA node local to its HCA.]

Controller Caching (direct)

[Figure: direct path. The OSS HCA issues IO to ports RP0/RP1 on the controller (ddn1 or ddn2) that owns the target LUN (Lun 0/Lun 1 on ddn1, Lun 3/Lun 4 on ddn2), so requests use that controller's cache without forwarding.]

Controller Caching (indirect)

[Figure: indirect path. The OSS HCA issues IO to ports RP0/RP1 on the controller that does not own the target LUN, so requests must be forwarded between ddn1 and ddn2 before reaching Lun 0/Lun 1 or Lun 3/Lun 4.]

Evaluation Setup

• Linux 3.18
  – blk-mq (3.13)
  – scsi-mq (3.17)
  – ib-srp multichannel (3.18)
  – dm-multipath support (4.0)
• Lustre 2.7.0 rc1
  – ldiskfs patches rebased to 3.18
• OSS
  – Dual Ivy Bridge E5-2650 (2 NUMA nodes)
  – 64GB
  – Dual-port QDR IB to array
• Storage Array
  – Dual DDN 10k controllers
  – 8GB write-back cache
  – RAID 6 (8+2) LUNs
  – SATA disk

Evaluation Goals

• Block level testing: adapt tests done with the null-blk device to a real storage device with scsi-mq
  – Does NUMA affinity awareness lead to efficiency?
    • Increased bandwidth
    • Decreased request latency
    • Multipath performance
• Explore benefits for filesystems
  – Ready for use with fabric-attached storage?
  – Are there performance benefits?

Evaluation Parameters

• Affinity
  – IB MSI-X interrupts spread among cores on NUMA node 1
  – fio threads bound to NUMA node 1 (closest to the HCA)
  – Block device rq_affinity=2, so completion happens on the submitting core
• Block tuning (applied in the sketch below)
  – noop scheduler
  – max_sectors_kb = max_hw_sectors_kb
• ib_srp module
  – 16 channels
  – max_sect=8192
  – max_cmd_per_lun=62
  – Queue size of 127 (fixed by hardware)
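A minimal sketch of applying the block tuning above from userspace. The sysfs attribute paths are standard; the device name is an example, and on 3.18 the noop scheduler applies only to the legacy single-queue path (scsi-mq devices had no IO scheduler at the time). Run as root.

    /* Apply the block tuning used in the evaluation via sysfs.
     * "sdb" is an example device name. */
    #include <stdio.h>

    static void set_attr(const char *dev, const char *attr, const char *val)
    {
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, attr);
        f = fopen(path, "w");
        if (!f) { perror(path); return; }
        fputs(val, f);
        fclose(f);
    }

    int main(void)
    {
        const char *dev = "sdb";            /* example device */
        char max_hw[32] = "";
        char path[256];
        FILE *f;

        set_attr(dev, "scheduler", "noop"); /* legacy path only */
        set_attr(dev, "rq_affinity", "2");  /* complete on submitting core */

        /* max_sectors_kb = max_hw_sectors_kb: allow the largest
         * request size the hardware accepts. */
        snprintf(path, sizeof(path),
                 "/sys/block/%s/queue/max_hw_sectors_kb", dev);
        f = fopen(path, "r");
        if (f && fgets(max_hw, sizeof(max_hw), f))
            set_attr(dev, "max_sectors_kb", max_hw);
        if (f)
            fclose(f);
        return 0;
    }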

Throughput (direct path) thread/LUN

[Chart: write IOPS vs. number of LUNs (1, 2, 4, 6) for 2.6.32-431, 3.18 mq, and 3.18 mq+mc; y-axis 0 to 350,000 IOPS.]

Throughput (indirect path) thread/LUN

[Chart: write IOPS vs. number of LUNs (1, 2, 4, 6) on the indirect path for 2.6.32-431, 3.18 mq, and 3.18 mq+mc; y-axis 0 to 350,000 IOPS.]

CPU Usage (direct path) thread/LUN

[Chart: user and system CPU% vs. number of LUNs (1, 2, 4, 6, 8) for 2.6.32-431, 3.18 mq, and 3.18 mq+mc; y-axis 0 to 90%.]

Throughput (direct path) 1 LUN

[Chart: write IOPS vs. thread count (1, 2, 4, 8) on a single LUN for 2.6.32-431, 3.18 mq, and 3.18 mq+mc; y-axis 0 to 600,000 IOPS.]

Read Throughput (direct path) 1 LUN

[Chart: read IOPS vs. thread count (x-axis 0 to 9) on a single LUN for 2.6 sq, 3.18 mq, and 3.18 mq+mc; y-axis 0 to 350,000 IOPS.]

dm-multipath support

• Evaluated on Linux 4.0rc1 with patches for dm blk-mq support
  – 4k IO, libaio, iodepth=127, SRP multichannel support enabled (approximated by the sketch after the table)

              Multipath     Direct (no IO FWD)   Indirect (IO FWD)
Throughput    393.2 MB/s    849.5 MB/s           854.8 MB/s
IOPS          100,658       217,468              218,829
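For reference, a bare-bones libaio loop in the spirit of the fio job above (4k O_DIRECT writes at iodepth=127). The device path is a placeholder, error handling is trimmed, and a real benchmark would keep resubmitting as completions arrive; link with -laio. Note that it writes to the device.

    /* Approximation of the fio workload: 4k O_DIRECT writes via
     * libaio at iodepth=127. Destructive; placeholder device path. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define QDEPTH 127
    #define BS     4096

    int main(void)
    {
        int fd = open("/dev/mapper/mpatha", O_WRONLY | O_DIRECT);
        io_context_t ctx = 0;
        struct iocb cbs[QDEPTH], *ptrs[QDEPTH];
        struct io_event evs[QDEPTH];
        void *buf;

        if (fd < 0 || io_setup(QDEPTH, &ctx) < 0)
            return 1;
        posix_memalign(&buf, BS, BS);       /* O_DIRECT needs alignment */

        /* Fill the queue: 127 writes in flight, as in the fio job. */
        for (int i = 0; i < QDEPTH; i++) {
            io_prep_pwrite(&cbs[i], fd, buf, BS, (long long)i * BS);
            ptrs[i] = &cbs[i];
        }
        io_submit(ctx, QDEPTH, ptrs);

        /* Reap all completions; a benchmark would resubmit here. */
        io_getevents(ctx, QDEPTH, QDEPTH, evs, NULL);

        io_destroy(ctx);
        close(fd);
        return 0;
    }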

Request latency

• Average latency of 100 4k IOs, measured with fio's synchronous IO engine

Kernel            Multipath   Direct (no IO FWD)   Indirect (IO FWD)
2.6.32-431.17.1   130.50      119.92               169.22
4.0rc1 (blk-mq)   130.56      103.58               142.84
Improvement       -0.04%      13.6%                15.6%
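The measurement can be approximated with a timed loop of synchronous 4k direct IOs (a sketch, not the exact fio job; the slide does not state the units, and the device path here is a placeholder). This version reads rather than writes to stay non-destructive.

    /* Average latency of 100 synchronous 4k direct IOs, in the
     * spirit of the fio measurement above. Placeholder device. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    #define NIOS 100
    #define BS   4096

    int main(void)
    {
        int fd = open("/dev/sdb", O_RDONLY | O_DIRECT); /* example dev */
        struct timespec t0, t1;
        void *buf;
        double total_us = 0;

        if (fd < 0)
            return 1;
        posix_memalign(&buf, BS, BS);   /* O_DIRECT needs alignment */

        for (int i = 0; i < NIOS; i++) {
            clock_gettime(CLOCK_MONOTONIC, &t0);
            pread(fd, buf, BS, (off_t)i * BS);  /* one IO at a time */
            clock_gettime(CLOCK_MONOTONIC, &t1);
            total_us += (t1.tv_sec - t0.tv_sec) * 1e6 +
                        (t1.tv_nsec - t0.tv_nsec) / 1e3;
        }
        printf("avg latency: %.2f us over %d IOs\n",
               total_us / NIOS, NIOS);
        close(fd);
        return 0;
    }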

Lustre Profiling

• FlameGraph of kernel code, sampled at 100Hz; Linux 3.18, Lustre 2.7.0rc1, 1MB writes to a single OST

Lustre Applications

• Metadata IO
  – Improve single-request latency
  – Is bandwidth necessary when flushing metadata to the MDT?
• Object IO
  – Scheduling
    • Request size
    • Request tagging

Future Directions

• Lustre with Linux 4.0
• Testing with hardware capable of 600k+ 4k IOPS
  – Random write performance with multiple threads/LUN
• Evaluate sequential writes with multiple threads/LUN
• Read and random tests need further investigation

Conclusion

• scsi-mq has the potential to lower CPU usage even with rotational media
• scsi-mq has lower IO completion latency
• Device drivers that support multiple hardware dispatch queues need further evaluation

Thank You

Blake Caldwell
