Improved Storage Performance Using the New Kernel I/O Interface

John Kariuki (Software Application Engineer, Intel) Vishal Verma (Performance Engineer, Intel)

2019 Storage Developer Conference. © Intel Corporation. All Rights Reserved.

Agenda

. Existing Linux IO interfaces
. io_uring: The new efficient IO interface
. Liburing library
. Performance
. Upcoming features
. Summary

Existing IO Interfaces

. Synchronous I/O interfaces:

o Thread starts an I/O operation and immediately enters a wait state until the I/O request has completed

o read(2), write(2), pread(2), pwrite(2), preadv(2), pwritev(2), preadv2(2), pwritev2(2)

. Asynchronous I/O interfaces:

o Thread submits an I/O request to the kernel and continues doing other work until the kernel signals that the I/O request has completed

o POSIX AIO: aio_read, aio_write

o Linux AIO: aio
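The two models above can be sketched in C: a synchronous pread(2) blocks the caller until the data is in the buffer, while a POSIX AIO read is queued and the thread is free until it checks for completion. This is a minimal sketch; the busy-wait loop is for brevity only (real code would use aio_suspend() or a completion notification).

```c
#define _GNU_SOURCE
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Synchronous read: the calling thread blocks until the data is available. */
ssize_t read_sync(int fd, void *buf, size_t len, off_t off) {
    return pread(fd, buf, len, off);
}

/* POSIX AIO read: enqueue the request, do other work, then collect the result. */
ssize_t read_posix_aio(int fd, void *buf, size_t len, off_t off) {
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = off;
    if (aio_read(&cb) < 0)
        return -1;
    /* The thread could process other jobs here. */
    while (aio_error(&cb) == EINPROGRESS)
        ;                 /* busy-wait for brevity; use aio_suspend() in real code */
    return aio_return(&cb);
}
```

Note that POSIX AIO is implemented in glibc with a user-space thread pool, which is one reason the kernel-native interfaces discussed next exist.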

Existing Linux User-space IO Interfaces

. SPDK: Provides a set of tools and libraries for writing high-performance, scalable, user-mode storage applications

. Asynchronous, polled-mode, lockless design

. https://spdk.io

This talk will cover Linux Kernel IO Interfaces

The Software Overhead Problem

[Charts: left (Intel® Optane™ DC SSD P4800X): 4K random read average latency at queue depth 1, relative latency (lower is better) for sync, psync, vsync, pvsync, pvsync2, and libaio; right (Intel® SSD DC P4610): 4K random read IOPS at numjobs=1 (higher is better) vs. queue depth 1 to 256 for the same interfaces.]

Over 30% SW overhead with most of the I/O interfaces when running a single I/O to an Intel® Optane™ P4800X SSD. Single-thread IOPS scale with increasing iodepth using libaio, but the other I/O interfaces do not scale with iodepth > 1.

Test configuration details: slide 24

io_uring: The new IO interface

. High I/O performance and scalable:

o Zero-copy: the Submission Queue (SQ) and Completion Queue (CQ) are placed in memory shared between the application and the kernel

o No locking: uses single-producer, single-consumer ring buffers

o Allows batching to minimize syscalls: efficient in terms of per-I/O overhead

. Allows asynchronous I/O without requiring O_DIRECT

. Supports both block and file I/O

. Operates in interrupt-driven or polled I/O mode
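The lockless ring design can be illustrated with a toy single-producer/single-consumer ring in C. This is a hypothetical sketch of the idea, not io_uring's actual SQ/CQ layout: because the head index is written only by the consumer and the tail only by the producer, no lock is needed, and masking free-running indices with a power-of-two size maps them onto slots, the same scheme io_uring's rings use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8                 /* power of two, like io_uring's rings */
#define RING_MASK (RING_SIZE - 1)

/* Toy SPSC ring: producer advances tail, consumer advances head.
 * Each index has exactly one writer, so no locking is required. */
struct spsc_ring {
    _Atomic uint32_t head;          /* written by the consumer only */
    _Atomic uint32_t tail;          /* written by the producer only */
    uint64_t entries[RING_SIZE];
};

int ring_push(struct spsc_ring *r, uint64_t v) {
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SIZE)
        return -1;                  /* ring full */
    r->entries[tail & RING_MASK] = v;
    /* release: the entry is visible before the new tail is published */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 0;
}

int ring_pop(struct spsc_ring *r, uint64_t *v) {
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return -1;                  /* ring empty */
    *v = r->entries[head & RING_MASK];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 0;
}
```

In io_uring the application is the producer of the SQ ring and the consumer of the CQ ring, with the kernel on the other side of each.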

Introduction to Liburing library

. Provides a simplified API and an easier way to set up an io_uring instance

. Initialization / de-initialization:

o io_uring_queue_init(): sets up an io_uring instance and creates a communication channel between the application and the kernel

o io_uring_queue_exit(): removes the existing io_uring instance

. Submission:

o io_uring_get_sqe(): gets a submission queue entry (SQE)

o io_uring_prep_readv(): prepares an SQE with a readv operation

o io_uring_prep_writev(): prepares an SQE with a writev operation

o io_uring_submit(): tells the kernel that the submission queue is ready for consumption

Introduction to Liburing library

. Completion:

o io_uring_wait_cqe(): waits for a completion queue entry (CQE)

o io_uring_peek_cqe(): peeks at a completion, but does not wait for the event to complete

o io_uring_cqe_seen(): called once a completion event has been consumed; increments the CQ ring head, which lets the kernel fill a new event into that slot

. More advanced features are not yet available through liburing

. For further information about liburing: http://git.kernel.dk/cgit/liburing

I/O Interface Comparison

| SW Overhead | Synchronous I/O | Libaio | io_uring |
|---|---|---|---|
| System calls | At least 1 per I/O | 2 per I/O batch | 1 per batch, zero when using the SQ submission thread; batching reduces per-I/O overhead |
| Memory copy | Yes | Yes (SQE & CQE) | Zero-copy for SQE & CQE |
| Context switches | Yes | Yes | Minimal context switching with polling |
| Interrupts | Interrupt driven | Interrupt driven | Supports both interrupts and polled I/O |
| Blocking I/O | Synchronous | Asynchronous | Asynchronous |
| Buffered I/O | Yes | No | Yes |

Performance

Single Core IOPS: libaio vs. io_uring

4K Rand Read IOPS at QD=128, 4x Intel® Optane™ SSDs, 1 Xeon CPU Core, FIO

[Chart: IOPS vs. IO submission/completion batch size (1 to 32) for io_uring and libaio.]

io_uring: 1.87M IOPS/core. libaio: ~900K IOPS/core.

IO submission and completion batch sizes [1,32]. Test configuration details: slide 24

Single Core IOPS: libaio vs io_uring vs SPDK

4K Rand Read IOPS at QD=128, 21x Intel® Optane™ P4800X SSDs, 1 Xeon CPU Core

libaio: 0.90M IOPS
io_uring: 1.87M IOPS
SPDK: 10.38M IOPS

io_uring: 2x more IOPS/core vs. libaio. SPDK: 5.5x more IOPS/core vs. io_uring.

IO submission/completion batch sizes of 32 for libaio & io_uring with 4x Intel® Optane™ P4800X SSDs. libaio data collected with fio, io_uring data collected with fio t/io_uring & SPDK with . Test configuration details: slide 24

I/O Latency: libaio vs. io_uring vs. SPDK

Submission/Completion Latency (4K Read, QD=1) With Intel® Optane™ SSD

| I/O interface | Submission (ns) | Completion (ns) |
|---|---|---|
| libaio | 1526 | 489 |
| io_uring (without fixedbufs) | 704 | 155 |
| io_uring (with fixedbufs) | 647 | 154 |
| SPDK | 150 | 160 |

Submission + completion SW latency: io_uring is 60% lower vs. libaio; SPDK is 60% lower vs. io_uring.

Submission Latency: captures TSC before and after the I/O submission. Completion Latency: captures TSC before and after the I/O completion check. Test configuration details: slide 24

libaio vs io_uring I/O path

[perf call-graph excerpt: libaio submits through entry_SYSCALL_64 → do_syscall_64 → __x64_sys_io_submit → io_submit_one → aio_read → blkdev_read_iter → generic_file_read_iter → blkdev_direct_IO, and completes through a second syscall, __x64_sys_io_getevents → do_io_getevents → read_events → aio_read_events, with roughly 17% of time in schedule. io_uring handles both phases under one __x64_sys_io_uring_enter call: submission via io_ring_submit → io_submit_sqe → io_read → blkdev_read_iter, and completion via io_iopoll_check → io_iopoll_getevents → blkdev_iopoll → nvme_poll.]

io_uring: submission + completion in 1 syscall

Interrupt and Context Switch

| Metric | libaio | io_uring | Rationale |
|---|---|---|---|
| HW interrupts | 172,417.78 | 251.80 | io_uring polling eliminates interrupts |
| Context switches | 112.27 | 1.47 | Reduces context switches by 99% |

Workload: 4K Rand Read, 60 sec, 4x P4800X, no batching. HW interrupt and context switch metrics are per second. We used fio for the libaio test and fio t/io_uring for io_uring. Test configuration details: slide 24

Top-down Microarchitecture Analysis Methodology (TMAM) Overview

Source: https://fd.io/wp-content/uploads/sites/34/2018/01/performance_analysis_sw_data_planes_dec21_2017.pdf

TMAM Level-1 Analysis

| I/O interface | % Frontend Bound Stalls | % Backend Bound Stalls | % Retiring | % Bad Speculation | CPI (lower is better) |
|---|---|---|---|---|---|
| libaio | 26.53 | 42.89 | 27.38 | 3.20 | 1.36 |
| io_uring | 31.72 | 17.45 | 48.02 | 2.81 | 0.58 |
| io_uring + batching | 26.44 | 10.40 | 59.02 | 4.13 | 0.45 |

io_uring with batching: 32% reduction in backend bound stalls vs. libaio, and 32% improvement in µOps retired vs. libaio. 66% lower CPI for io_uring vs. libaio.

Workload: 4K Rand Read, 60 sec, 4x P4800X. Test configuration details: slide 24

TMAM Level-3 Analysis: Cache, Branch & TLB (libaio vs. io_uring)

[Chart: relative counts of L1-icache-load-misses, L1-dcache-load-misses, LLC-load-misses, LLC-store-misses, branch-load-misses, dTLB-load-misses, dTLB-store-misses, and iTLB-load-misses for io_uring vs. libaio (AIO), on a relative scale of 0 to 1.4.]

io_uring reduces icache & iTLB misses by over 60% vs. libaio

Workload: 4K Rand Read, 60 sec, 4x P4800X. Test configuration details: slide 24

TMAM Level-3 Analysis: Cache, Branch & TLB (SPDK vs. io_uring)

[Chart: relative counts of L1-dcache, L1-icache, LLC load/store, branch-load, dTLB load/store, and iTLB load misses, plus IOPS, for SPDK vs. io_uring, on a relative scale of 0 to 7.]

SPDK: 90% fewer iTLB and L1-icache misses, and 6x better IOPS/core vs. io_uring

Workload: 4K Rand Read, 60 sec. Test configuration details: slide 24

What's Next for IO_URING

. io_uring for socket-based I/O

o Support already added for sendmsg(), recvmsg()

. Support for devices like RAID (md) and Logical Volumes (dm)

. Async support for more system calls

o E.g.: open+read+ in a single call

Summary

. io_uring is the latest high-performance I/O interface in the Linux kernel (available since the 5.1 release)

. Eliminates limitations of the existing Linux kernel async I/O interfaces

. Building an application for the next generation of NVMe SSDs? io_uring enables:

o Less than 1 usec SW latency to submit/complete I/Os

o 1-2 million IOPS/core

NOTICES AND DISCLAIMERS

. Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration.

. No product or component can be absolutely secure.

. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. For more complete information about performance and benchmark results, visit http://www.intel.com/benchmarks.

. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/benchmarks.

. Intel® Advanced Vector Extensions (Intel® AVX)* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you can learn more at http://www.intel.com/go/turbo.

. Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

. Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

. Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

. © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

Backup

Performance Configuration

Performance configuration for slide 5 data: Relative Latency: SuperMicro SYS-2029U-TN24R4T, Intel(R) Xeon(R) Platinum 8270 CPU @ 2.70GHz, 384GB DDR4, Ubuntu 18.04 LTS, Linux Kernel 5.2.0, 1x Intel® Optane™ DC SSD P4800X 375GB SSD, fio-3.14-6-g97134, 4K 100% Random Reads, iodepth=1, ramp time = 30s, direct=1, runtime=300s. Data collected at Intel Storage Lab 07/17/2019

Throughput: SuperMicro SYS-2029U-TN24R4T, Intel(R) Xeon(R) Platinum 8270 CPU @ 2.70GHz, 384GB DDR4, Ubuntu 18.04 LTS, Linux Kernel 5.2.0, 1x Intel® SSD DC P4610 1.6TB, fio-3.14-6-g97134, 4K 100% Random Reads, Iodepth=1 to 256 varied (exponential 2), ramp time= 30s, direct=1, runtime=300s, Data collected at Intel Storage Lab 07/17/2019

Performance configuration for slide 11, 12 & 19 data: Intel Server S2600WFT, Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz, 192GB DDR4, Fedora 27, Linux Kernel 5.0.0-rc6, 4x Intel® Alderstream 503GB SSD, SPDK commit 41b7f1ca2189, SPDK bdevperf, runtime = 60s, Data collected at Intel Storage Lab 09/12/2019

Performance configuration for slide 14 data: SuperMicro SYS-2029U-TN24R4T, Intel(R) Xeon(R) Platinum 8270 CPU @ 2.70GHz, 384GB DDR4, Ubuntu 18.04 LTS, Linux Kernel 5.2.0, 4x Intel® Optane™ DC SSD P4800X 375GB SSD, fio-3.14-6-g97134, fio t/io_uring app used with varied batch sizes. Data collected at Intel Storage Lab 07/17/2019

Performance configuration for slides 15, 17 & 18 data: SuperMicro SYS-2029U-TN24R4T, Intel(R) Xeon(R) Platinum 8270 CPU @ 2.70GHz, 384GB DDR4, Ubuntu 18.04 LTS, Linux Kernel 5.2.0, 4x Intel® Optane™ DC SSD P4800X 375GB SSD, SPDK commit c223ba3b0f, fio-3.14-6-g97134, runtime = 60s. Data collected at Intel Storage Lab 09/6/2019

Performance configuration for slide 25 data: SuperMicro SYS-2029U-TN24R4T, Intel(R) Xeon(R) Platinum 8270 CPU @ 2.70GHz, 384GB DDR4, Ubuntu 18.04 LTS, Linux Kernel 5.2.0, 2x Intel® Optane™ DC SSD P4800X 375GB SSD, 2x Intel® SSD DC P4610, fio-3.14-6-g97134, runtime = 300s. Data collected at Intel Storage Lab 07/17/2019

Relative IOPS Performance: Single Core: IO_Uring vs. Libaio

FIO: 4K 100% Random Reads

[Charts: relative IOPS, libaio vs. io_uring (higher is better), at queue depths 1 to 256; left: 2x Intel® SSD DC P4610; right: 2x Intel® Optane™ SSDs.]

- Up to 10-15% improvement with io_uring on Intel® SSD DC P4610 at lower queue depths
- io_uring performs up to 1.8x better at lower queue depths on Intel® Optane™ SSDs
