The KVM/qemu storage stack
Christoph Hellwig
Storage in Virtual Machines – Why?
A disk is an integral part of a normal computer Most operating systems work best with local disks Boot from NFS / iSCSI still has problems Simple integrated local storage is: Easier to setup than network storage More flexible to manage For the highend use PCI passthrough or Fibre channel NPIV instead
A view 10.000 feet
The host system or virtual machine (VMVM): exports virtual disks to the guest The guest uses them like real disks The virtual disks are backed by real devices.. Whole disks / partitions / logical volumes .. or files Either raw files on a filesystem or image formats
A virtual storage stack
Guest filesystem We have two full storage stacks in the Guest storage driver host and in the guest
Storage hw emulation Potentially also two filesystem Image format Potentially also a image format (aka mini Host filesystem filesystem)
Host volume manager
Host storage driver
Requirements (high level)
The traditional storage requirements apply: DataData integrityintegrity – data should actually be on disk when the user / application require it SpaceSpace efficiencyefficiency – we want to store the user / application data as efficient as possible PerformancePerformance – do all of the above as fast as possible Additionally there is a strong focus on: ManageabilityManageability – we potentially have a lots of hosts to deal with
Requirements – guest
None - Guests should work out of the box Migrating old operating system images to virtual machines is a typical use case Any guest changes should be purely optimizations for: Storage efficiency or Performance
Requirements – host
The host is where all the intelligence sits Ensures data integrity Aka: the data really is on disk when the guest thinks so Optimizantions of storage space usage
A practical implementation: QEMU/KVM
KVM is the major virtualization solution for Linux Included in the mainline kernel, with lots of development from RedHat, Novell, Intel, IBM and various individual contributors
What is QEMU and what is KVM?
QEMUQEMU primarily is a CPU emulator Grew a device model to emulate a whole computer Actually not just one but a whole lot of them KVMKVM is a kernel module to use expose hardware virtualization capabilities e.g. Intel VT-x or AMD SVM KVM uses QEMU for device emulation As far as storage is concerned they're the same
QEMU Storage stack overview
Linux Windows Others Guest Guest Guests
QEMU
Transports ATA SCSI Virtio
Image Formats Qcow2 VMDK / etc..
Backend Raw Posix
Host Kernel
Storage transports
QEMU provides a simple Intel ATAATA controller emulation by default Works with about every operating systems because it is so common Alternatively QEMU can emulate a Symbios SCSISCSI controller
Paravirtualization
ParavirtualizationParavirtualization means providing interfaces more optimal than real hardware AdvantageAdvantage: should be faster than full virtualization DisadvantageDisadvantage:: requiresrequires specialspecial driversdrivers forfor eacheach guestguest
Paravirtualized storage transport
QEMU provides paravirtualized devices using the virtio framework Virtio-blkVirtio-blk providesprovides aa simplersimpler blockblock driverdriver ontopontop ofof virtiovirtio JustJust simplesimple read/writeread/write requestsrequests AndAnd SCSISCSI requestsrequests throughthrough ioctlsioctls And,And, and,and, and..and..
QEMU storage requirements - AIO
The qemu main loop is effectively singe threaded: Time spent there blocks execution of the guest I/O needs to be offloaded as fast as possible
QEMU storage requirements - vectors
Typical I/O requests from guest are split into non-contingous parts scatter/gather lists In the optimal case a whole SG list is sent to the host kernel in one request preadv/pwritev system calls
Life of an I/O request
Guest Kernel
DMA
QEMU
MMIO/PIO Trap Interrupt Completion Injection Interrupt
Host Kernel
Posix storage backend
The primary storage backend AlmostAlmost allall I/OI/O eventuallyeventually endsends upup therethere Simply backs disk images using a regular file or device file AlsoAlso forwardsforwards somesome managementmanagement commandscommands (ioctls)(ioctls) inin casecase ofof devicedevice filesfiles
Posix storage backend
Except life isn't THATTHAT simple..
Posix storage backend - AIO
Needs to implement asynchronous semantics AIO support in hosts is severly lacking Use a thread pool to hand off I/O by default Alternatively support for native Linux AIO: Only works for uncached access (O_DIRECT) Still can be synchronous for many use cases
Posix storage backend – more fun
Hosts often have I/O restrictions Uncached I/O requires strict alignment and specific I/O sizes Posix backed needs to perform read/modify/write cycles Want to pass-through commands (ioctls) to host devices Different for every OS or even driver
Performance – large sequential I/O
140 MB/s
120 MB/s
100 MB/s
80 MB/s
Native QEMU pthreads QEMU AIO 60 MB/s
40 MB/s
20 MB/s
0 MB/s sequential read 8GB sequential write 8GB
Performance – 256 kilobyte random I/O
180 MB/s
160 MB/s
140 MB/s
120 MB/s
100 MB/s
Native QEMU pthreads 80 MB/s QEMU AIO
60 MB/s
40 MB/s
20 MB/s
0 MB/s random read 256KB random write 256KB
Performance – 16 kilobyte random I/O
80 MB/s
70 MB/s
60 MB/s
50 MB/s
40 MB/s Native QEMU pthreads QEMU AIO
30 MB/s
20 MB/s
10 MB/s
0 MB/s random read 16KB random write 16KB
The quest for disk image formats
Users want volume-manager like features in image files Copy-on write snapshots Encryption Compression Also VM snapshots need to store additional metadata
Disk Image formats - Qcow2
QcowQcow was the initial QEMU image format to provide copy on write snapshots In Qemu 0.8.3 Qcow2Qcow2 was added to add additional features and now is the standard image format for QEMU Provides cluster based copy on write snaphots Supports encryption and compression Allows to store additional metadata for VM snaphots
Disk Image formats
QEMU also supports various foreign image formats: Cow - User Mode Linux Vpc - Microsoft Virtual PC Vmdk – Vmware Bochs Parallels Cloop – popular Linux compressed loop patch Dmg – MacOS native filesystem images
Non-image backends
TheThe curlcurl backendbackend allowsallows usingusing VMVM imagesimages fromfrom thethe internetinternet overover httphttp andand ftpftp connectionsconnections. The nbdnbd backend allows direct access to nbd servers. The vvfatvvfat backend allows exporting host directories as image with a far format filesystem
Benchmarks..
Data integrity in QEMU / caching modes
cache=none uses O_DIRECT I/O that bypasses the filesystem cache on the host cache=writethrough uses O_SYNC I/O that is guaranteed to be commited to disk on return to userspace cache=writeback uses normal buffered I/O that is written back later by the operating system
Data integrity - cache=writethrough
ThisThis modemode isis thethe safestsafest asas farfar asas qemuqemu isis concernedconcerned ThereThere areare nono additionaladditional volatilevolatile writewrite cachescaches inin thethe hosthost TheThe downsidedownside isis thatthat it'sit's ratherrather slowslow
Data integrity - cache=writeback
WhenWhen thethe guestguest writeswrites datadata wewe simplysimply putput itit inin thethe filesystemfilesystem cachecache NoNo guaranteeguarantee thatthat itit actuallyactually goesgoes toto diskdisk WhichWhich isis actuallyactually veryvery similarsimilar toto howhow modernmodern disksdisks workwork
Data integrity - cache=writeback
TheThe guestguest needsneeds toto issueissue aa cachecache flushflush commandcommand toto makemake suresure datadata goesgoes toto diskdisk SimilarSimilar toto realreal modernmodern disksdisks withwith writebackwriteback cachescaches ModernModern operatingoperating systemssystems cancan dealdeal withwith thisthis AndAnd thethe hosthost needsneeds toto actuallyactually implementimplement thethe cachecache flushflush commandcommand andand advertiseadvertise it:it: TheThe QEMUQEMU SCSISCSI emulationemulation hashas alwaysalways donedone thisthis IDEIDE andand virtiovirtio onlyonly startedstarted thisthis veryvery recentlyrecently
Data integrity - cache=none
DirectDirect transfertransfer toto diskdisk shouldshould implyimply it'sit's safesafe ExceptExcept thatthat itit isis not:not: DoesDoes notnot guaranteeguarantee diskdisk cachescaches areare flushedflushed DoesDoes notnot givegive anyany guranteesgurantees aboutabout metadatametadata ThusThus alsoalso needsneeds anan explicitexplicit cachecache flush.flush.
Thin provisioning
TechnicalTechnical termterm forfor overcommitingovercommiting storagestorage resourcesresources AA simplesimple exampleexample isis aa sparsesparse filefile thatthat doesn'tdoesn't actuallyactually havehave blocksblocks allocatedallocated toto itit beforebefore useuse FullFull ThinThin ProvisioningProvisioning alsoalso meansmeans reclaimingreclaiming spacespace againagain whenwhen datadata isis deletedelete AA bigbig topictopic bothboth forfor high-endhigh-end storagestorage arraysarrays andand virtualizationvirtualization
Thin provisioning - standards
TheThe T10T10 SPCSPC standardstandard forfor SCSISCSI disksdisks // arraysarrays containscontains TPTP supportsupport inin it'sit's newestnewest revisionsrevisions TheThe UNMAPUNMAP andand WRITEWRITE SAMESAME commandscommands allowallow tellingtelling thethe storagestorage devicedevice toto freefree datadata PerfectPerfect useuse casecase forfor qemuqemu toto knowknow thatthat thethe guestguest hashas freedfreed thethe storagestorage NeedsNeeds extensiveextensive guestguest supportsupport TheThe ATAATA specspec hashas aa similarsimilar TRIMTRIM commandcommand
forfor SolidSolid StateState DrivesDrives (SSDs)(SSDs) Thin provisioning - implementation
OnOn thethe guestguest sideside leverageleverage thethe supportsupport forfor SSDsSSDs // ArraysArrays OnOn thethe hosthost sideside thethe commandcommand decodingdecoding inin qemuqemu isis easyeasy ButBut thethe standardstandard filesystemfilesystem APIAPI doesdoes notnot allowallow punchingpunching holesholes intointo filesfiles SomeSome filesystemfilesystem (e.g.(e.g. XFS)XFS) offeroffer extensionsextensions forfor itit
Thin provisioning - demo
Avoiding duplicate data
OftenOften manymany similarsimilar virtualvirtual machinesmachines areare runningrunning onon oneone hosthost AimAim forfor stroingstroing duplicateduplicate datadata onlyonly onceonce TwoTwo approaches:approaches: ImageImage clonesclones –– startstart withwith aa commoncommon imageimage andand tracktrack changeschanges withwith aa copycopy onon writewrite schemescheme DataData deduplicationdeduplication –– findfind duplicateduplicate blocksblocks andand mergemerge themthem afterafter thethe factfact
Backing images
QEMUQEMU allowsallows forfor backingbacking devicesdevices inin thethe QCOW2QCOW2 format.format. VeryVery easyeasy toto useuse LVMLVM supportssupports copycopy onon writevolumeswritevolumes SimilarlySimilarly easyeasy toto useuse ButBut requiresrequires aa fullfull blockblock devices,devices, notnot filesfiles FilesystemsFilesystems likelike btrfsbtrfs andand ocfsocfs allowsallows filefile levellevel snapshotssnapshots
Data deduplication
AllAll thesethese havehave oneone commoncommon disadvantage:disadvantage: TheThe sharingsharing needsneeds toto bebe plannedplanned fromfrom thethe beginning.beginning. DataData deduplicationdeduplication isis thethe processprocess ofof findingfinding thesethese duplicatesduplicates laterlater It'sIt's anan expensiveexpensive andand slowslow processprocess withoutwithout additionaladditional metadatametadata NotNot currentlycurrently implementedimplemented inin aa wayway usableusable byby QEMUQEMU currentlycurrently
Questions?
Thanks for your attention! Feel free to contact me at: [email protected]