Better Utilization of Storage Features from KVM Guest via virtio-

Masaki Kimura Information & Telecommunication Systems Company IT Platform Division Group Hitachi, Ltd.

© Hitachi, Ltd. 2013. All rights reserved. Better Utilization of Storage Features from KVM Guest via virtio-scsi

Contents 1. Background (Use Cases and Requirements) 2. KVM features for guest SCSI commands 3. Current Status of these features 4. Summary 5. Future work

© Hitachi, Ltd. 2013. All rights reserved. 1 Better Utilization of Storage Features from KVM Guest via virtio-scsi

1. Background

© Hitachi, Ltd. 2013. All rights reserved. 2 1-1. Background

 Enterprise systems expect that virtualized environment has the same level of manageability, availability, and reliability achieved in bare-metal.  For example: 1. Thin-provisioned Storage for manageability, 2. HA cluster for availability, 3. Backup for reliability.  In bare-metal environment, some of these requirements are achieved by using storage features, such as SCSI commands.  In virtualized environment, the same use cases exists for guests.  Issuing SCSI commands from guests are required.

Three use cases, thin-provisioned storage, HA cluster, and backup server will be explained in the next slides.

© Hitachi, Ltd. 2013. All rights reserved. 3 1-2. Use case #1: Thin-provisioned Storage

- Many types of enterprise storage have thin-provision function. - For achievement of thin-provision, a disk block is allocated on access. - However, once it is allocated, it can not be reclaimed by storage automatically even when the disk block becomes unused by OS. - To reclaim unused disk block, OS needs to let storage know unused blocks by issuing WriteSame SCSI command.

Bare-metal KVM Guest Guest Guest OS OS QEMU

OS Host OS

Server Server WriteSame WriteSame

reclaiming disk blocks

This use case exists for both bare-metal and KVM guests.  WriteSame is required to be issued to storage from guests. © Hitachi, Ltd. 2013. All rights reserved. 4 1-3. Use Case #2: HA cluster (1/2)

- To improve availability, HA cluster is commonly used in bare-metal.

- For HA cluster, Persistent Reservation SCSI command is generally used to guarantee an exclusive access from an active system.

Bare-metal KVM Guest HA Cluster Active Standby Guest Guest Guest Guest OS OS OS OS HA Cluster QEMU QEMU Active Standby OS OS Host OS Host OS

Server Server Server Server

Persistent blocked blocked Reservation Persistent LU Reservation LU

This use case also exists for KVM guests.  Persistent Reservation is required to be issued to storage from guest.

© Hitachi, Ltd. 2013. All rights reserved. 5 1-4. Use Case #2: HA cluster (2/2)

- Persistent Reservation is held by so-called I_T nexus, the combination of initiator ID and target ID.

- I/Os from standby system are blocked, because the I_T nexus is different.

- Therefore, I_T nexus is required to be unique for Persistent Reservation to work properly.

Bare-metal KVM Guest HA Cluster Active Standby Guest Guest Guest Guest OS OS OS OS HA Cluster initiator_3 initiator_4 Active Standby QEMU QEMU OS OS Host OS Host OS initiator_1 initiator_2 Server Server Server Server

Persistent blocked Persistent Reservation Reservation target_B blocked target_A Reserved LU (initiator_1, target_A) LU Reserved (initiator_3, target_B)

initiator_* : Initiator ID. target_* : Target ID.

© Hitachi, Ltd. 2013. All rights reserved. 6 1-5. Use Case #3: Backup Server

- Persistent Reservation is also used by some backup-server products to guarantee an exclusive access from a backup server on backup.

- Persistent Reservation to storage and unique I_T nexus are required by these products.

Bare-metal KVM Guest Business systems Backup Server Guest Guest Guest OS OS OS initiator_4 initiator_5 initiator_6 Business systems Backup QEMU Server OS OS OS Host OS initiator_1 initiator_2 initiator_3 Server Server Server Server

blocked Persistent blocked Persistent Reservation Reservation target_A target_B LU LU LU LU

initiator_* : Initiator ID. target_* : Target ID.

© Hitachi, Ltd. 2013. All rights reserved. 7 1-6. Summary of Requirements

 Requirements from the use cases:  Requirement #1: SCSI commands to storage from guests. - Thin-provisioned storage requires WriteSame to storage from guests. - HA cluster and backup server require Persistent Reservation to storage from guests.  Requirement #2: Unique initiator ID across guests. - HA cluster and backup server require I_T nexus to be unique.

© Hitachi, Ltd. 2013. All rights reserved. 8 Better Utilization of Storage Features from KVM Guest via virtio-scsi

2. KVM features for guest SCSI commands

© Hitachi, Ltd. 2013. All rights reserved. 9 2-1. KVM features for guest disks

 This presentation focuses on following three device types and their configurations.

Configuration # Device Type Initiator Target Backend 1 File 2 (1) virtio-blk - Device 3 LUN 4 File 5 - (a) Device 6 LUN (2) virtio-scsi 7 block - (b)lio 8 pscsi 9 (c) libiscsi - iSCSI storage 10 (3)PCI device (a)Legacy - PCI device 11 assignment (b)VFIO - PCI device

© Hitachi, Ltd. 2013. All rights reserved. 10 2-2. (1) virtio-blk (1/2)

(1)virtio-blk Characteristic

Block /dev/vdx - Para-virtualized disk. Guest Layer virtio-blk - Shown as /dev/vdX on guests. - Maximum number of disks is limited by kernel SCSI Layer maximum number of PCI devices (32). - Improving performance with virtio-blk plane Device and bio-based I/O. Layer virtio-blk QEMU

Block /dev/sdx Layer Host SCSI kernel Layer Device Layer

Hardware Device

© Hitachi, Ltd. 2013. All rights reserved. 11 2-3. (1) virtio-blk (2/2)

 Configurations and how SCSI commands are handled. /dev/vdx /dev/vdx /dev/vdx Guest Block virtio-blk virtio-blk virtio-blk

SCSI SCSI Command Capability

Device virtio-blk virtio-blk virtio-blk QEMU - SCSI command from guest reaches to storage only when Pass- attached as LUN. VFS File through Block Host LUN kernel Block

Hardware Device

Backend KVM Command Line XML SCSI command Disk -device virtio-blk-pci,scsi=off Not Supported (file) Disk -device virtio-blk-pci,scsi=off Not Supported (block) LUN -device virtio-blk-pci,scsi=on Pass-through

© Hitachi, Ltd. 2013. All rights reserved. 12 2-4. (2) virtio-scsi

 virtio-scsi has three types of configurations. (a)Qemu target (b)lio target (c)libiscsi

Block /dev/sdx /dev/sdx /dev/sdx Guest Layer kernel SCSI scsi_mod scsi_mod scsi_mod Layer virtio-scsi virtio-scsi virtio-scsi Device virtio-scsi Layer virtio-scsi libiscsi QEMU

Block tcm_vhost /dev/sdx LIO Layer /dev/sdx Host SCSI kernel Layer Device Layer

Hardware Device

© Hitachi, Ltd. 2013. All rights reserved. 13 2-5. (2) virtio-scsi: (a) qemu target (1/2)

(a)Qemu target Characteristic Block /dev/sdx Guest Layer [virtio-scsi] - Para-virtualized SCSI transport. kernel SCSI scsi_mod - Shown as /dev/sdX on Linux guests. Layer virtio-scsi Device [Qemu target] Layer virtio-scsi - target QEMU

Block /dev/sdx Layer Host SCSI kernel Layer Device Layer

Hardware Device

© Hitachi, Ltd. 2013. All rights reserved. 14 2-6. (2) virtio-scsi (a) qemu-target (2/2)

 Configurations and how SCSI commands are handled. /dev/sdx /dev/sdx /dev/sdx Guest Block SCSI scsi_mod scsi_mod scsi_mod virtio-scsi virtio-scsi virtio-scsi SCSI Command Capability

Device virtio-scsi QEMU virtio-scsi virtio-scsi - SCSI command from guest Emulated reaches to storage only when result Pass- attached as LUN. File through VFS Device - Emulated results return to Host Block LUN guest when attached as file kernel or device.

Hardware Device

Backend KVM Command Line Libvirt XML SCSI command Disk -device scsi-hd Emulated (file) Disk -device scsi-hd Emulated (block) LUN -device scsi-block Pass-through

© Hitachi, Ltd. 2013. All rights reserved. 15 2-7. (2) virtio-scsi: (b) lio target

(b)lio target Characteristic /dev/sdx Block [virtio-scsi] Guest Layer - Para-virtualized SCSI transport. kernel SCSI scsi_mod - Shown as /dev/sdX on Linux guests. Layer virtio-scsi [lio target] Device - Kernel space target. Layer QEMU - Using LIO (linux-.org ) as backend. - LIO supports following back-stores: - block Block tcm_vhost - fileio Layer LIO - pscsi /dev/sdx - ramdisk Host SCSI kernel Layer Device SCSI Command Capability Layer - Not yet evaluated. (pscsi is pass-through SCSI, therefore Hardware Device it is expected to work well.)

© Hitachi, Ltd. 2013. All rights reserved. 16 2-8. (2) virtio-scsi: (c) libiscsi

(c)libiscsi Characteristic /dev/sdx Block [virtio-scsi] Guest Layer - Para-virtualized SCSI transport. kernel SCSI scsi_mod - Shown as /dev/sdX on Linux guests. Layer virtio-scsi [libiscsi] Device virtio-scsi - User space iSCSI initiator. Layer libiscsi QEMU - Only support iSCSI. - QEMU directly talk to iSCSI storage, therefore host does not see guest disks.

SCSI Command Capability Host - SCSI command from guest reaches to storage. kernel Directly talk to iSCSI storage.

Hardware Device

© Hitachi, Ltd. 2013. All rights reserved. 17 2-9. (3) PCI Device assignment

 PCI Device assignment has two types:

(a) Legacy (b) VFIO Characteristic

/dev/sdx /dev/sdx - Assign PCI device to guests. Block - Host PCI device is dedicated to Guest Layer one guest, therefore the number kernel SCSI scsi_mod scsi_mod of guests is limited to the number Layer SCSI LLD SCSI LLD of PCI devices (or their ports.) Device PCI device PCI device Layer QEMU SCSI Command Capability - SCSI command from guest reaches to storage in both legacy and VFIO configurations. Host kernel Device Layer pci-stub vfio PCI device PCI device Hardware Device

© Hitachi, Ltd. 2013. All rights reserved. 18 2-10. Summary of SCSI command capability

Configuration Whether guest # SCSI commands Device Type Initiator Target Backend reach to storage 1 File No 2 virtio-blk - Device No 3 LUN Yes 4 File No 5 - qemu Device No 6 LUN Yes virtio-scsi 7 block No - lio 8 pscsi ??? 9 libiscsi - iSCSI storage Yes 10 PCI Device Legacy - PCI device Yes 11 assignment VFIO - PCI device Yes In the next chapter, we will see SCSI command capabilities deeper only with configurations marked “Yes” in above table.

© Hitachi, Ltd. 2013. All rights reserved. 19 Better Utilization of Storage Features from KVM Guest via virtio-scsi

3. Current Status of these features

© Hitachi, Ltd. 2013. All rights reserved. 20 3-1. Evaluation Item and Configuration

 Evaluation Items: # Evaluation Item Remarks (a) Whether SCSI commands reach to storage. Requirement #1 (b) Whether unique initiator ID is assigned. Requirement #2 (c) Whether SCSI commands return proper results. -

 Configurations: # Device Type Initiator Target Backend 1 virtio-blk - LUN 2 - qemu LUN virtio-scsi 3 libiscsi - iSCSI storage 4 PCI Device Legacy - PCI device 5 assignment VFIO - PCI device

From next slide, I will share what problems remain in which configurations.

© Hitachi, Ltd. 2013. All rights reserved. 21 3-2. (a) Whether SCSI commands reach to storage(1/4)

Problem - There is a permission check in host kernel, when guest SCSI commands is issued via virtio-blk or virtio-scsi with qemu target.

- Libvirt-managed KVM guests run as qemu user, who lacks CAP_SYS_RAWIO.

 Some SCSI commands, such as Persistent Reserve and Write Same, are blocked by this check unless KVM is running as root user. [virtio-blk] [virtio-scsi] Block /dev/vdx /dev/sdx Permission Check Guest Layer virtio-blk kernel SCSI scsi_mod Layer virtio-scsi CAP_SYS_RAWIO Device No Yes Layer virtio-blk virtio-scsi QEMU read_ok command

No Block /dev/sdx /dev/sdx write_ok command && Layer Permission Check has write permission Host SCSI kernel Layer No Device Not allowed Allowed Layer : Inquiry, Report LUN, etc Hardware Device : Persistent Reserve, WriteSame, etc © Hitachi, Ltd. 2013. All rights reserved. 22 3-3. (a) Whether SCSI commands reach to storage(2/4)

To solve this issue, following patches have been submitted.

Subject : [PATCH v3 0/2] add per-device knob to enable unrestricted, unprivileged SG_IO Date : November 13, 2012 Committer: Paolo Bonzini URL : https://lkml.org/lkml/2012/11/13/440

Subject : [PATCH 00/13] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542) Date : January 24, 2013 Committer: Paolo Bonzini URL : https://lkml.org/lkml/2013/1/24/279

Subject : [PATCH v3 part2] Add per-device sysfs knob to enable unrestricted, unprivileged SG_IO Date : May 23, 2013 Committer: Paolo Bonzini URL : https://lkml.org/lkml/2013/5/23/294

However, neither of them has been merged yet.

© Hitachi, Ltd. 2013. All rights reserved. 23 3-4. (a) Whether SCSI commands reach to storage(3/4)

 Concept of these patches is to introduce a flag, unpriv_sgio, to allow non-root users to issue SCSI commands. Permission Check

CAP_SYS_RAWIO || * Kernel side interface (Not merged yet.) unpriv_sgio No # cat /sys/block/sda/queue/unpriv_sgio Yes 0 read_ok command # echo 1 > /sys/block/sda/queue/unpriv_sgio (*) Unpriv_sgio flag can be set per disk. No write_ok command && has write permission No Not allowed Allowed

: Inquiry, Report LUN, etc : Persistent Reserve, WriteSame, etc If this kernel patch is merged, KVM guest running as qemu user will be able to configure to issue any SCSI commands to storages.

© Hitachi, Ltd. 2013. All rights reserved. 24 3-5. (a) Whether SCSI commands reach to storage(4/4)

Why these patches have not been merged yet?  Still under discussion on how to avoid opcodes-overlap problem. When issued from host When issued from guest READ SUBCHANNEL Intended Guest in guest. Permission Check 0x42 QEMU Share same opcode. READ UNMAP Interpreted SUBCHANNEL Permission in host. Host 0x42 0x42 Check Permission Permission w/ unpriv_sgio. Check Check 0x42 => UNMAP

Hardware

Different device classes. [Problem] : Destructive commands might pass-throughed to the host from guests. [Implementation]: Split permission check by device class (*1) or Introduce per-device filter w/o unpriv_sgio (*2) (*1) https://lkml.org/lkml/2013/1/24/299 (*2) https://lkml.org/lkml/2013/5/27/230 © Hitachi, Ltd. 2013. All rights reserved. 25 3-6. (b) Whether unique initiator ID is assigned(1/3)

Problem When both guest1 and guest2 are on the same host and use the same HBA, they share the same initiator ID. (virtio-blk or vitio-scsi w/ qemu target)

[virtio-blk] [virtio-scsi] Guest 1 Guest 2 Guest 1 Guest 2 Block /dev/vda /dev/vda /dev/sda /dev/sda Persistent Guest Layer virtio-blk virtio-blk Reservation kernel SCSI scsi_mod scsi_mod Layer Persistent virtio-scsi virtio-scsi Device Reservation Layer virtio-blk virtio-scsi QEMU

Block /dev/sdx /dev/sdx Layer Host SCSI kernel Layer Device Layer HBA initiator_1 HBA initiator_2 Hardware Device target_a target_b

initiator_* : Initiator ID. target_* : Target ID.  In such a condition, exclusive access is not guaranteed.

© Hitachi, Ltd. 2013. All rights reserved. 26 3-7. (b) Whether unique initiator ID is assigned(2/3)

Solution With NPIV (N-Port ID ), virtio-blk and vitio-scsi w/ qemu target can assign unique initiator ID.

[virtio-blk] [virtio-scsi] Guest 1 Guest 2 Guest 1 Guest 2 Block /dev/vda /dev/vda /dev/sda /dev/sda Persistent Guest Layer virtio-blk virtio-blk Reservation kernel SCSI scsi_mod scsi_mod Layer Persistent virtio-scsi virtio-scsi Device Reservation Layer virtio-blk virtio-scsi QEMU

Block /dev/sdx /dev/sdy /dev/sdx /dev/sdy Layer NPIV Host SCSI kernel Layer Device Layer initiator_1_1 initiator_1_2 initiator_2_1 initiator_2_2 HBA HBA initiator_1 blocked initiator_2 blocked Hardware Device target_a target_b

initiator_* : Initiator ID. target_* : Target ID.

© Hitachi, Ltd. 2013. All rights reserved. 27 3-8. (b) Whether unique initiator ID is assigned(3/3)

FYI With libiscsi or PCI device assignment, exclusive access is guaranteed, because initiator IDs are unique with these configurations.

[virtio-scsi + libiscsi] [PCI device assignment] Guest 1 Guest 2 Guest 1 Guest 2 Block /dev/sda /dev/sda /dev/sda /dev/sda Guest Layer kernel SCSI scsi_mod scsi_mod scsi_mod scsi_mod Layer virtio-scsi virtio-scsi SCSI LLD SCSI LLD Device virtio-scsi virtio-scsi pci_device pci_device Layer libiscsi libiscsi QEMU initiator_1 initiator_2

Block Layer Persistent Persistent Reservation Reservation Host SCSI kernel Layer Device Layer pci-stub/vfio pci-stub/vfio HBA HBA blocked initiator_3 initiator_4 Hardware Device target_a target_b blocked

initiator_* : Initiator ID. target_* : Target ID.

© Hitachi, Ltd. 2013. All rights reserved. 28 3-9. (c) Whether SCSI commands return proper results

Problem With virtio-blk, Report LUNs returns the list of LUNs including LUNs which are not assigned to the guest.

[virtio-blk] [virtio-scsi]

Block /dev/vdx Report LUNs /dev/sdx Report LUNs Guest Layer virtio-blk * The list of LUNs * The list of LUNs kernel SCSI that host sees. scsi_mod that guest sees.

Layer LUN0 virtio-scsi LUN1 Device LUN1 Layer virtio-blk virtio-scsi QEMU LUN2 Returns emulated results, Just pass-through depending on SCSI commands. Block /dev/sda /dev/sdb /dev/sdc /dev/sda /dev/sdb /dev/sdc Layer Host SCSI kernel Layer Device Layer HBA HBA Hardware Device LUN0 LUN1 LUN2 LUN0 LUN1 LUN2

 Virtio-blk needs emulation functions to return proper results for particular SCSI commands, such as Report LUNs. © Hitachi, Ltd. 2013. All rights reserved. 29 Better Utilization of Storage Features from KVM Guest via virtio-scsi

4. Summary

© Hitachi, Ltd. 2013. All rights reserved. 30 4. Summary

 Enterprise system requires SCSI commands in virtualized environments.  KVM has some configurations which can issue SCSI commands to storage from guests, however each configuration has some restrictions.

SCSI command Unique # Configuration Persistent Write Report Restriction Inquiry WWN Reservation Same LUNs Requires NPIV 1 virtio-blk - No No Yes No Yes for unique WWN. Requires NPIV 2 qemu No No Yes Yes Yes for unique virtio-scsi WWN. 3 libiscsi Yes Yes Yes Yes Yes iSCSI only. Max number of 4 Legacy Yes Yes Yes Yes Yes guests is PCI Device limited by assignment 5 VFIO Yes Yes Yes Yes Yes the number of HBA ports. : Already available. : Patch exists, but not merged yet. : No patch, but could be fixed.

© Hitachi, Ltd. 2013. All rights reserved. 31 Better Utilization of Storage Features from KVM Guest via virtio-scsi

5. Future Work

© Hitachi, Ltd. 2013. All rights reserved. 32 5. Future work

1. To allow qemu user to issue Persistent Reservation and WriteSame, proper permission check in host kernel is needed.

2. To make Report LUNs return proper results with virtio-blk, an emulation function for Report LUNs is needed.

3. To make SCSI command capability of virtio-scsi w/ lio target clear, evaluation is needed for virtio-scsi w/ lio target.

© Hitachi, Ltd. 2013. All rights reserved. 33 Questions?

© Hitachi, Ltd. 2013. All rights reserved. 34 END

Better Utilization of Storage Features from KVM Guest via virtio-scsi

LinuxCon North America Masaki Kimura Information & Telecommunication Systems Company IT Platform Division Group Hitachi Ltd.

© Hitachi, Ltd. 2013. All rights reserved. 35 Trademarks

 Linux® is the registered trademark of in the U.S. and other countries.

© Hitachi, Ltd. 2013. All rights reserved. 36