The KVM/QEMU storage stack

Christoph Hellwig

Storage in Virtual Machines – Why?

- A disk is an integral part of a normal computer
- Most operating systems work best with local disks
  - Boot from NFS / iSCSI still has problems
- Simple integrated local storage is:
  - Easier to set up than network storage
  - More flexible to manage
- For the high end use PCI passthrough or Fibre Channel NPIV instead

A view from 10,000 feet

- The host system, or virtual machine monitor (VMM):
  - exports virtual disks to the guest
- The guest uses them like real disks
- The virtual disks are backed by real devices ..
  - Whole disks / partitions / logical volumes
- .. or files
  - Either raw files on a filesystem or image formats

A virtual storage stack

[Diagram: the stack, top to bottom]
  Guest filesystem
  Guest storage driver
  Storage hw emulation
  Image format
  Host filesystem
  Host manager
  Host storage driver

- We have two full storage stacks, in the host and in the guest
- Potentially also two filesystems
- Potentially also an image format (aka mini filesystem)

Requirements (high level)

- The traditional storage requirements apply:
  - Data integrity – data should actually be on disk when the user / application requires it
  - Space efficiency – we want to store the user / application data as efficiently as possible
  - Performance – do all of the above as fast as possible
- Additionally there is a strong focus on:
  - Manageability – we potentially have a lot of hosts to deal with

Requirements – guest

- None - Guests should work out of the box
  - Migrating old images to virtual machines is a typical use case
- Any guest changes should be purely optimizations for:
  - Storage efficiency or
  - Performance

Requirements – host

- The host is where all the intelligence sits
- Ensures data integrity
  - Aka: the data really is on disk when the guest thinks so
- Optimizations of storage space usage

A practical implementation: QEMU/KVM

- KVM is the major virtualization solution for Linux
- Included in the mainline kernel, with lots of development from Red Hat, Novell, Intel, IBM and various individual contributors

What is QEMU and what is KVM?

- QEMU primarily is a CPU emulator
  - Grew a device model to emulate a whole computer
  - Actually not just one but a whole lot of them
- KVM is a kernel module to expose hardware virtualization capabilities
  - e.g. Intel VT-x or AMD SVM
- KVM uses QEMU for device emulation
- As far as storage is concerned they're the same

QEMU Storage stack overview

[Diagram: the stack, top to bottom]
  Guests: Linux guest, Windows guest, other guests
  QEMU:
    Transports: ATA, SCSI, Virtio
    Image formats: Qcow2, VMDK, etc.
    Backend: Raw Posix
  Host kernel

Storage transports

- QEMU provides a simple Intel ATA controller emulation by default
  - Works with almost every operating system because it is so common
- Alternatively QEMU can emulate a Symbios SCSI controller

Paravirtualization

- Paravirtualization means providing interfaces more optimal than real hardware
- Advantage: should be faster than full virtualization
- Disadvantage: requires special drivers for each guest

Paravirtualized storage transport

- QEMU provides paravirtualized devices using the virtio framework
- Virtio-blk provides a simpler block driver on top of virtio
  - Just simple read/write requests
  - And SCSI requests through ioctls
  - And, and, and..

QEMU storage requirements - AIO

- The qemu main loop is effectively single threaded:
  - Time spent there blocks execution of the guest
  - I/O needs to be offloaded as fast as possible

QEMU storage requirements - vectors

- Typical I/O requests from the guest are split into non-contiguous parts
  - scatter/gather lists
- In the optimal case a whole SG list is sent to the host kernel in one request
  - preadv/pwritev system calls
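
As a minimal sketch of the vectored I/O idea (not QEMU's actual code), the Python snippet below writes a whole scatter/gather list with one os.pwritev call and reads it back into separate buffers with one os.preadv call. The temporary file stands in for a disk image, and the helper names and buffer sizes are assumptions made for illustration. Both calls need a Unix-like host that provides them.

```python
import os
import tempfile

def write_sg_list(fd, segments, offset):
    """Write every segment of a scatter/gather list with one system call."""
    return os.pwritev(fd, segments, offset)

def read_sg_list(fd, lengths, offset):
    """Read into separate buffers with one system call and return them."""
    buffers = [bytearray(n) for n in lengths]
    os.preadv(fd, buffers, offset)
    return buffers

# A temporary file stands in for the disk image (illustrative only).
with tempfile.TemporaryFile() as image:
    fd = image.fileno()
    # Three non-contiguous guest buffers that form one logical request.
    segments = [b"A" * 512, b"B" * 1024, b"C" * 512]
    written = write_sg_list(fd, segments, 4096)
    result = read_sg_list(fd, [512, 1024, 512], 4096)
```

Without the vectored calls the same request would need either one system call per segment or an extra copy into a contiguous bounce buffer.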

Life of an I/O request

[Diagram: an I/O request passing between Guest Kernel, QEMU and Host Kernel. Labels: DMA, MMIO/PIO trap, interrupt injection, completion interrupt]

Posix storage backend

- The primary storage backend
  - Almost all I/O eventually ends up there
- Simply backs disk images using a regular file or device file
  - Also forwards some management commands (ioctls) in case of device files

Posix storage backend

- Except life isn't THAT simple..

Posix storage backend - AIO

- Needs to implement asynchronous semantics
- AIO support in hosts is severely lacking
- Use a thread pool to hand off I/O by default
- Alternatively support for native Linux AIO:
  - Only works for uncached access (O_DIRECT)
  - Still can be synchronous for many use cases
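
The thread pool approach can be sketched roughly as follows. This is an illustration in Python, not QEMU's implementation: blocking os.pread calls are handed to worker threads so that the submitting loop returns immediately, and a callback records each completion. The submit_read helper, the pool size and the completed dictionary are assumptions made for the example. Native Linux AIO is not shown.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def submit_read(pool, fd, length, offset, on_complete):
    """Queue a blocking pread on a worker thread and return immediately."""
    future = pool.submit(os.pread, fd, length, offset)
    future.add_done_callback(lambda f: on_complete(offset, f.result()))
    return future

completed = {}  # offset -> data, filled in by the completion callbacks

with tempfile.TemporaryFile() as image, ThreadPoolExecutor(max_workers=4) as pool:
    image.write(bytes(range(256)) * 16)  # 4 KiB of test data
    image.flush()
    fd = image.fileno()
    futures = [
        submit_read(pool, fd, 512, off, completed.__setitem__)
        for off in range(0, 4096, 512)
    ]
    # The submitting loop would be free to do other work here.
    for f in futures:
        f.result()
# Leaving the "with" block shuts the pool down and waits for the workers,
# so every completion callback has run by this point.
```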

Posix storage backend – more fun

- Hosts often have I/O restrictions
  - Uncached I/O requires strict alignment and specific I/O sizes
  - Posix backend needs to perform read/modify/write cycles
- Want to pass through commands (ioctls) to host devices
  - Different for every OS or even driver
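
A read/modify/write cycle for an unaligned request might look like the sketch below. It is a simplified model: a bytearray stands in for a device that only accepts whole-sector transfers, and the 512-byte sector size and the aligned_write helper are assumptions for illustration.

```python
SECTOR = 512  # assumed alignment unit

def aligned_write(device, offset, data, sector=SECTOR):
    """Apply an unaligned write to `device` using only sector-aligned I/O.

    `device` is a bytearray standing in for a disk that accepts only
    whole-sector transfers. Returns the aligned range that was touched.
    """
    start = (offset // sector) * sector                # round start down
    end = -(-(offset + len(data)) // sector) * sector  # round end up
    bounce = bytearray(device[start:end])              # aligned read
    pos = offset - start
    bounce[pos:pos + len(data)] = data                 # modify in the bounce buffer
    device[start:end] = bounce                         # aligned write
    return start, end

disk = bytearray(b"." * 2048)
span = aligned_write(disk, 700, b"hello")
```

A 5-byte write at offset 700 turns into a read and a write of the whole sector from 512 to 1024, which is the extra cost the slide refers to.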

Performance – large sequential I/O

[Chart: throughput in MB/s (axis 0 to 140 MB/s) for sequential read 8GB and sequential write 8GB, comparing Native, QEMU pthreads and QEMU AIO]

Performance – 256 kilobyte random I/O

[Chart: throughput in MB/s (axis 0 to 180 MB/s) for random read 256KB and random write 256KB, comparing Native, QEMU pthreads and QEMU AIO]

Performance – 16 kilobyte random I/O

[Chart: throughput in MB/s (axis 0 to 80 MB/s) for random read 16KB and random write 16KB, comparing Native, QEMU pthreads and QEMU AIO]

The quest for formats

- Users want volume-manager like features in image files
  - Copy-on-write snapshots
  - Encryption
  - Compression
- Also VM snapshots need to store additional metadata

Disk Image formats - Qcow2

- Qcow was the initial QEMU image format to provide copy-on-write snapshots
- In QEMU 0.8.3 Qcow2 was added to provide additional features and is now the image format for QEMU
  - Provides cluster-based copy-on-write snapshots
  - Supports encryption and compression
  - Allows storing additional metadata for VM snapshots

Disk Image formats

- QEMU also supports various foreign image formats:
  - Cow - User Mode Linux
  - Vpc - Microsoft Virtual PC
  - Vmdk - VMware
  - Bochs
  - Parallels
  - Cloop - popular Linux compressed loop patch
  - Dmg - MacOS native filesystem images

Non-image backends

- The curl backend allows using VM images from the internet over http and ftp connections
- The nbd backend allows direct access to nbd servers
- The vvfat backend allows exporting host directories as an image with a FAT format filesystem

Benchmarks..

Data integrity in QEMU / caching modes

- cache=none
  - uses O_DIRECT I/O that bypasses the filesystem cache on the host
- cache=writethrough
  - uses O_SYNC I/O that is guaranteed to be committed to disk on return to userspace
- cache=writeback
  - uses normal buffered I/O that is written back later by the operating system
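
To make the three modes concrete, the sketch below maps each one to the open(2) flags named on this slide. The open_flags helper is hypothetical and only mirrors the description here, it is not taken from the QEMU source. os.O_DIRECT is assumed to be available, which in practice means a Linux host.

```python
import os

def open_flags(cache_mode):
    """Return open(2) flags matching the cache modes described on the slide."""
    flags = os.O_RDWR
    if cache_mode == "none":
        flags |= os.O_DIRECT   # bypass the host filesystem cache (Linux)
    elif cache_mode == "writethrough":
        flags |= os.O_SYNC     # a write returns only once data is committed
    elif cache_mode == "writeback":
        pass                   # plain buffered I/O, written back later
    else:
        raise ValueError(f"unknown cache mode: {cache_mode}")
    return flags
```

For example, `os.open(path, open_flags("writethrough"))` would give a descriptor on which every write is synchronous.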

Data integrity - cache=writethrough

- This mode is the safest as far as qemu is concerned
  - There are no additional volatile write caches in the host
- The downside is that it's rather slow

Data integrity - cache=writeback

- When the guest writes data we simply put it in the filesystem cache
  - No guarantee that it actually goes to disk
  - Which is actually very similar to how modern disks work

Data integrity - cache=writeback

- The guest needs to issue a cache flush command to make sure data goes to disk
  - Similar to real modern disks with writeback caches
  - Modern operating systems can deal with this
- And the host needs to actually implement the cache flush command and advertise it:
  - The QEMU SCSI emulation has always done this
  - IDE and virtio only started this very recently

Data integrity - cache=none

- Direct transfer to disk should imply it's safe
- Except that it is not:
  - Does not guarantee disk caches are flushed
  - Does not give any guarantees about metadata
- Thus also needs an explicit cache flush.
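
On the host side, an explicit flush after a write can be sketched as below. This is an illustration only: the durable_write helper is hypothetical, and whether fsync also flushes the disk's own volatile cache depends on the host filesystem and its configuration.

```python
import os
import tempfile

def durable_write(fd, data, offset):
    """Write data, then ask the host kernel to flush it to stable storage."""
    written = os.pwrite(fd, data, offset)
    os.fsync(fd)  # flushes file data and metadata, like a guest cache flush
    return written

# A temporary file stands in for the disk image (illustrative only).
with tempfile.NamedTemporaryFile() as image:
    n = durable_write(image.fileno(), b"x" * 4096, 0)
    size = os.path.getsize(image.name)
```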

Thin provisioning

- Technical term for overcommitting storage resources
- A simple example is a sparse file that doesn't actually have blocks allocated to it before use
- Full Thin Provisioning also means reclaiming space again when data is deleted
- A big topic both for high-end storage arrays and virtualization
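
The sparse file example can be tried with a few lines of Python. This is a sketch under the assumption that the host filesystem supports sparse files. The file name and the 64 MiB size are arbitrary, and both helpers are made up for the illustration.

```python
import os
import tempfile

def create_sparse_image(path, size):
    """Create an image of `size` bytes without writing any data blocks."""
    with open(path, "wb") as f:
        f.truncate(size)

def allocated_bytes(path):
    """Bytes actually allocated on the host (st_blocks counts 512-byte units)."""
    return os.stat(path).st_blocks * 512

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "disk.img")
    create_sparse_image(path, 64 * 1024 * 1024)  # 64 MiB apparent size
    apparent = os.path.getsize(path)
    allocated = allocated_bytes(path)
```

The guest sees the full apparent size, while the host only allocates blocks as they are written. This sketch does not cover reclaiming space after deletes.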

Thin provisioning - standards

- The T10 SBC standard for SCSI disks / arrays contains TP support in its newest revisions
  - The UNMAP and WRITE SAME commands allow telling the storage device to free data
  - Perfect use case for qemu to know that the guest has freed the storage
  - Needs extensive guest support
- The ATA spec has a similar TRIM command for Solid State Drives (SSDs)

Thin provisioning - implementation

- On the guest side leverage the support for SSDs / Arrays
- On the host side the command decoding in qemu is easy
  - But the standard filesystem API does not allow punching holes into files
  - Some filesystems (e.g. XFS) offer extensions for it

Thin provisioning - demo

Avoiding duplicate data

- Often many similar virtual machines are running on one host
- Aim for storing duplicate data only once
- Two approaches:
  - Image clones – start with a common image and track changes with a copy-on-write scheme
  - Data deduplication – find duplicate blocks and merge them after the fact

Backing images

- QEMU allows for backing devices in the QCOW2 format.
  - Very easy to use
- LVM supports copy-on-write volumes
  - Similarly easy to use
  - But requires full block devices, not files
- Filesystems like btrfs and ocfs allow file-level snapshots
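
The idea behind a backing image can be illustrated with a toy copy-on-write model. This is not the QCOW2 on-disk format. The CowImage class, the in-memory dictionary of clusters and the 4-byte cluster size are all assumptions chosen to keep the example short, and writes are assumed to stay within the size of the backing image.

```python
CLUSTER = 4  # tiny cluster size so the example is easy to follow

class CowImage:
    """Toy copy-on-write image: unmodified clusters come from the backing image."""

    def __init__(self, backing):
        self.backing = backing  # read-only bytes shared between clones
        self.clusters = {}      # cluster index -> private bytearray

    def read_cluster(self, index):
        if index in self.clusters:
            return bytes(self.clusters[index])
        return self.backing[index * CLUSTER:(index + 1) * CLUSTER]

    def write(self, offset, data):
        for i, byte in enumerate(data):
            index, pos = divmod(offset + i, CLUSTER)
            if index not in self.clusters:
                # First write to this cluster: copy it from the backing image.
                self.clusters[index] = bytearray(self.read_cluster(index))
            self.clusters[index][pos] = byte

base = b"AAAABBBBCCCC"
clone1, clone2 = CowImage(base), CowImage(base)
clone1.write(5, b"xy")
```

After the write only one cluster is private to clone1. clone2 and the base image are unchanged, so the common data is stored once.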

Data deduplication

- All these have one common disadvantage:
  - The sharing needs to be planned from the beginning.
- Data deduplication is the process of finding these duplicates later
  - It's an expensive and slow process without additional metadata
  - Not currently implemented in a way usable by QEMU
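
Finding duplicates after the fact can be sketched as a scan that hashes every block. This is a naive illustration, assuming fixed 4096-byte blocks and SHA-256 digests. The find_duplicate_blocks helper is hypothetical, and a real implementation would also compare the block contents before merging them.

```python
import hashlib
from collections import defaultdict

def find_duplicate_blocks(image, block_size=4096):
    """Return lists of offsets whose blocks have identical content."""
    seen = defaultdict(list)  # digest -> offsets with that content
    for offset in range(0, len(image), block_size):
        digest = hashlib.sha256(image[offset:offset + block_size]).digest()
        seen[digest].append(offset)
    # Keep only content that occurs more than once.
    return [offsets for offsets in seen.values() if len(offsets) > 1]

image = b"A" * 4096 + b"B" * 4096 + b"A" * 4096 + b"C" * 4096
duplicates = find_duplicate_blocks(image)
```

The scan has to read and hash the whole image, which is why the process is expensive and slow without additional metadata.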

Questions?

- Thanks for your attention!
- Feel free to contact me at: [email protected]