Exploiting the Benefits of Native Programming Access to NVM Devices
Ashish Batwara
Principal Storage Architect, Fusion-io
2012 Storage Developer Conference. © Fusion-io. All Rights Reserved.

Traditional Storage Stack
• User space: Application
• Kernel space: Filesystem, Block Device Driver (addressed by LBA)
• Hardware: Device
• LBA view enforced by storage protocols (SCSI/SATA etc.)

Flash is Different From Disk

Area                              Hard Disk Drives                   Flash Devices
Logical to Physical Blocks        Nearly 1:1 mapping                 Remapped at every write
Read/Write Performance            Largely symmetrical                Heavily asymmetrical
Sequential vs Random Performance  An order of magnitude difference   Minimal difference
Background operations             Rarely impact foreground           Regular occurrence; if unmanaged, can impact foreground
Wearout                           Largely unlimited writes           Limited writes
IOPS                              100s to 1000s                      100Ks to millions
Latency                           10s of ms                          10s to 100s of µs
TRIM                              Do not benefit                     Improves performance

Flash Translation Layer 101
• Input: Logical Block Address (LBA)
• Flash Translation Layer
• Output: Commands to NAND flash

Flash in Traditional Storage Stack
• User space: Application
• Kernel space: Filesystem, Block Device Driver (addressed by LBA)
• Hardware: Flash Translation Layer (LBA → PBA), Device
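The "remapped at every write" behavior of a flash translation layer can be illustrated with a small model. This is only a sketch under stated assumptions: `ToyFTL` and its fields are hypothetical names, the flash is a Python list, and garbage collection and wear leveling are not modeled. It does not describe how any Fusion-io product is implemented.

```python
class ToyFTL:
    """Toy flash translation layer: maps logical block addresses (LBAs)
    to physical page addresses (PBAs) on an append-only medium."""

    def __init__(self, num_pages):
        self.pages = [None] * num_pages  # simulated NAND pages
        self.l2p = {}                    # forward map: LBA -> PBA
        self.next_free = 0               # next unwritten physical page
        self.stale = set()               # PBAs that hold superseded data

    def write(self, lba, data):
        if self.next_free >= len(self.pages):
            raise RuntimeError("no free pages (garbage collection not modeled)")
        old = self.l2p.get(lba)
        if old is not None:
            self.stale.add(old)          # the old copy becomes garbage
        pba = self.next_free             # flash pages are not overwritten in place
        self.pages[pba] = data
        self.l2p[lba] = pba              # remap the LBA to the new page
        self.next_free += 1
        return pba

    def read(self, lba):
        pba = self.l2p.get(lba)
        return None if pba is None else self.pages[pba]


ftl = ToyFTL(num_pages=8)
first = ftl.write(7, b"v1")
second = ftl.write(7, b"v2")   # same LBA lands on a different physical page
```

Writing the same LBA twice leaves one stale physical page behind, which is the work that TRIM and background garbage collection exist to reclaim.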
Virtual Storage Layer
Fusion-io's host based FTL
• Cut-thru architecture avoids traditional storage protocols
• Scales with multi-core
• Provide a large virtual address space
• HW/SW functional boundary defined as optimal for flash
• Traditional block access methods for compatibility
• New access methods, functionality and primitives natively supported by flash
[Diagram: the Host (DRAM / Memory / CPU and cores; Operating System and Application Memory; Virtualization Tables; Virtual Storage Layer (VSL™)) exchanges commands and data transfers over PCIe with the ioDrive (ioMemory Data-Path Controller; wide channels and banks).]

Flash Memory Evolution | Native Access
[Figure: software stacks compared across five stages: Traditional SSDs, Flash as a drive, Flash as a cache, Flash with direct I/O semantics, Flash with memory semantics. Labeled components include Application, Open Source Extensions, OS Block I/O, Direct-access I/O API Family, Memory access API Family, File System, directFS (native file system service), Host Block Layer, Block Layer, SAS/SATA, Network, directCache, RAID Controller, VSL and Flash Layer. Access modes range from Remote and Read/Write to Load/Store.]
Direct-access I/O API family | Native Access
[Figure: the same stack comparison (Legacy SSDs, Flash as a drive, Flash as a cache, Flash with direct I/O semantics, Flash with memory semantics), overlaid with the lists below.]
direct I/O Primitives
• Sparse Addressing
• Atomic multi-block Operations: Write, PTRIM
• Exists, Range Exists
• Conditional Write
• Range Read
direct Key-Value Store
• NVM optimized with transactional semantics

Sparse Addressing

Excess work by conventional cache
• Application
• Conventional Block Cache: HDD block -> Flash LBA; metadata, persistence, recovery logic etc.
• Cache hit: Flash FTL: LBA -> PBA
• Cache miss: Backend Store Block Device
• Two Translations
• Additional metadata, logic etc.

Sparse address mapping
[Figure: backend store blocks mapped to the sparse address space.]

VSL based cache
• Application
• VSL Based Cache: minimal fixed metadata, leverages primitives
• Cache hit: VSL: Sparse HDD LBA -> PBA
• Cache miss: Backend Store Block Device
• Fewer Translations
• Minimal additional metadata, logic etc.
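The difference between the two cache designs can be sketched in a few lines. The example below assumes a flash layer that exposes a large sparse address space and an existence query. `SparseFlash` and `SparseCache` are hypothetical stand-ins written for illustration, not the VSL or directCache interfaces. The cache uses the backend LBA directly as the flash address, so it keeps no block-to-LBA table of its own.

```python
class SparseFlash:
    """Stand-in for a flash layer with a large sparse address space.
    Only addresses that were written consume space."""

    def __init__(self):
        self._map = {}  # models the single sparse-address -> data mapping

    def write(self, addr, data):
        self._map[addr] = data

    def exists(self, addr):
        return addr in self._map

    def read(self, addr):
        return self._map[addr]

    def trim(self, addr):
        self._map.pop(addr, None)


class SparseCache:
    """Cache keyed by the backend (HDD) LBA itself: one translation, no extra table."""

    def __init__(self, flash, backend):
        self.flash = flash
        self.backend = backend  # a dict acting as the backend block device
        self.hits = 0
        self.misses = 0

    def read(self, hdd_lba):
        if self.flash.exists(hdd_lba):       # hit test is an existence query
            self.hits += 1
            return self.flash.read(hdd_lba)
        self.misses += 1
        data = self.backend[hdd_lba]         # miss: fetch from the backend store
        self.flash.write(hdd_lba, data)      # populate at the same address
        return data

    def invalidate(self, hdd_lba):
        self.flash.trim(hdd_lba)


backend = {10**12: b"cold block"}  # a very large LBA, to exercise sparseness
cache = SparseCache(SparseFlash(), backend)
a = cache.read(10**12)  # miss, then populate
b = cache.read(10**12)  # hit
```

In this model the only cache state is the hit and miss counters. Presence, placement and invalidation are all delegated to the flash layer's own map.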
Direct I/O - Atomic Operations
[Figure: three stacks compared.]
• Traditional atomicity (with hard disks): DBMS applications (transaction log) and the file system (metadata journaling, copy-on-write) each provide atomicity above the block I/O layer, which issues sector reads and writes to the disk drive.
• Traditional atomicity (with SSD): the same upper layers. Below the block I/O layer, the Flash Translation Layer in the controller handles re-mapping and wear-leveling and issues page read, page write and block erase to the NAND flash memory of the solid state disk.
• Atomicity in ioMemory: DBMS applications and the file system sit above the block I/O layer. The generalized ioMemory layer handles re-mapping, wear-leveling and the atomic operation, issuing page read/write and block erase to NAND flash memory.

Transactional Block Interface
Application issues call to transactional block interface
• WRITE ALL BLOCKS ATOMICALLY: iov[0], iov[1], iov[2], iov[3], iov[4] passed to the Virtual Storage Layer

Transactional Block Interface
Application issues call to transactional block interface
• WRITE ALL BLOCKS ATOMICALLY: iov[0] through iov[4]
• TRIM ALL BLOCKS ATOMICALLY: LBA 7 + offset, LBA 24 + offset, LBA 42 + offset, LBA 68 + offset

Transactional Block Interface
Application issues call to transactional block interface
TRANSACTION ENVELOPS
• WRITE ALL BLOCKS ATOMICALLY: iov[0] through iov[4]
• TRIM ALL BLOCKS ATOMICALLY: LBA 7 + offset, LBA 24 + offset, LBA 42 + offset, LBA 68 + offset
• WRITE AND TRIM ATOMICALLY: LBA 7 + offset, LBA 24 + offset, iov[3], iov[4]
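The all-or-nothing contract of a transactional block interface can be illustrated with a toy model. Everything here is an assumption made for illustration: `AtomicBlockStore`, `atomic_batch` and the `fail_after` fault-injection parameter are hypothetical names, and staging on a copy is just one simple way to get atomicity in memory, not a description of how the VSL does it.

```python
class AtomicBlockStore:
    """Toy block store where a batch of writes and trims commits as one unit."""

    def __init__(self):
        self.blocks = {}  # LBA -> data

    def atomic_batch(self, writes=(), trims=(), fail_after=None):
        """Apply every write and trim, or none of them.

        writes: iterable of (lba, data) pairs; trims: iterable of LBAs.
        fail_after: hypothetical fault injection; raise before applying the
        operation at that index.
        """
        staged = dict(self.blocks)  # stage on a copy, never on live state
        ops = [("write", lba, data) for lba, data in writes]
        ops += [("trim", lba, None) for lba in trims]
        for i, (kind, lba, data) in enumerate(ops):
            if fail_after is not None and i >= fail_after:
                raise RuntimeError("simulated failure mid-transaction")
            if kind == "write":
                staged[lba] = data
            else:
                staged.pop(lba, None)
        self.blocks = staged        # single commit point


store = AtomicBlockStore()
store.atomic_batch(writes=[(7, b"a"), (24, b"b")])

# A batch that fails partway leaves the store exactly as it was.
try:
    store.atomic_batch(writes=[(42, b"c")], trims=[7], fail_after=1)
except RuntimeError:
    pass
after_failure = dict(store.blocks)

# The same kind of mixed write-and-trim batch, this time completing.
store.atomic_batch(writes=[(42, b"c"), (68, b"d")], trims=[7])
```

The point for a caller such as a database is that no intermediate state is ever visible, so it does not need its own journal or double write to protect a multi-block update.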
InnoDB (MySQL) Double-Write
• DB Server: Page A, Page B, Page C
• Pages copied into memory (OS RAM)
• Two writes to the ioDrive: once to the double-write buffer, once to the actual tablespace

InnoDB (MySQL) Atomic-Write
• DB Server: Page A, Page B, Page C
• Pages copied into memory (OS RAM)
• One write through the VSL (Atomic-Write) to the ioDrive
• Fewer writes
• Better performance
• ACID compliance

Sysbench Performance With Atomic-Write
MySQL extension for Atomic-Write
• 43% transactions/sec increase
• 2x endurance increase
• Processor: Xeon X5472 @ 3.00GHz
• DRAM: 16GB DDR3 4x4GB DIMMs
• OS: Fedora 14, Linux kernel 2.6.35
• Sysbench config: 1 million inserts in 8, 2-million-entry tables, using 16 threads

Native raw performance comparison
Significantly more functionality with NO additional performance impact
• Test system: 1U HP blade server with 16 GB RAM, 8 CPU cores (Intel(R) Xeon(R) CPU X5472 @ 3.00GHz), with a single 1.2 TB ioDrive2 mono

TRIM
TRIM: a storage primitive to assist SSD FTLs in garbage collection
Classic TRIM: trim(LBA)
• Hint to FTL to unmap LBA->PBA
• Improves wear leveling
• Improves write performance
• However, it is only an advisory command

Persistent Trim and Exists
Persistent TRIM (Virtual Address)
• Has all the positive properties of TRIM
• Improves wear leveling
• Improves write performance
• However, it is well defined with respect to failures
• Deterministic return of zeros for read
• Survives power failures (transactional)
EXISTS (Virtual Address)
• Query the existence of a particular element
• Enables sparse stores with well defined "presence" semantics
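The semantics listed for persistent TRIM and EXISTS can be made concrete with a small simulation. The sketch below is illustrative only: `PtrimDevice`, its method names, the 512-byte block size and the log-replay recovery are all assumptions chosen to demonstrate the stated properties (zeros on read after a trim, and a trim that survives a power cycle), not the actual device interface or recovery mechanism.

```python
BLOCK_SIZE = 512  # assumed block size for this sketch


class PtrimDevice:
    """Toy device whose trims are recorded durably, like writes."""

    def __init__(self):
        self._log = []  # durable, append-only record of operations
        self._map = {}  # volatile address map, rebuilt from the log

    def write(self, addr, data):
        block = data.ljust(BLOCK_SIZE, b"\0")[:BLOCK_SIZE]
        self._log.append(("write", addr, block))
        self._map[addr] = block

    def ptrim(self, addr):
        self._log.append(("trim", addr, None))  # the trim itself is logged
        self._map.pop(addr, None)

    def exists(self, addr):
        return addr in self._map

    def read(self, addr):
        # Unmapped addresses deterministically read back as zeros.
        return self._map.get(addr, bytes(BLOCK_SIZE))

    def power_cycle(self):
        """Drop volatile state and rebuild the map by replaying the log."""
        self._map = {}
        for op, addr, block in self._log:
            if op == "write":
                self._map[addr] = block
            else:
                self._map.pop(addr, None)


dev = PtrimDevice()
dev.write(42, b"hello")
dev.write(7, b"keep")
dev.ptrim(42)
dev.power_cycle()
```

With a purely advisory TRIM, a read of address 42 after the power cycle could legitimately return the old data. Here it must return zeros, and `exists(42)` must be false, which is what lets an application treat presence as reliable state.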
Conventional Key-Value Store
• Application
• KV Store: Key -> block mapping (overhead per key), block allocation, metadata, persistence mechanisms, logging, recovery logic etc.
• Block Read/Write
• VSL: Dynamic provisioning, block allocation, persistence mechanisms, logging, recovery etc.

VSL based Direct Key-Value Store
• Application
• Key Value API and Library: fixed "zero" metadata, leverages VSL; atomic write/delete, coordinated garbage collection
• VSL: Dynamic provisioning, block allocation, persistence mechanisms, logging, recovery etc.

directKey-Value Store API Overview

KV Store Administration               KV Store Data Operations
kv_create()                           kv_put()
kv_pool_create()                      kv_batch_put()
kv_pool_delete()                      kv_get()
kv_open()                             kv_batch_get()
kv_get_store_info()                   kv_delete()
kv_get_key_info()                     kv_delete_all()
kv_register_notification_handler()    kv_batch_delete()
kv_close()                            kv_begin()
kv_destroy()                          kv_next()
kv_get_pool_info()
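The slides list the API function names but not their signatures, so the following sketch does not attempt to reproduce them. It only illustrates the general idea of a key-value store that keeps no per-key block map: the key is hashed straight into a large sparse address space, and presence at that address is the index. `ToyDirectKV`, `key_to_address` and the 48-bit address width are hypothetical choices for this example, and hash collisions are detected but not resolved.

```python
import hashlib

ADDRESS_BITS = 48  # assumed width of the sparse address space


def key_to_address(key: bytes) -> int:
    """Hash a key to an address in the sparse space (no per-key table)."""
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % (1 << ADDRESS_BITS)


class ToyDirectKV:
    """Toy key-value store layered on a sparse address space."""

    def __init__(self):
        self._sparse = {}  # stands in for the flash layer's sparse mapping

    def put(self, key: bytes, value: bytes) -> None:
        addr = key_to_address(key)
        stored = self._sparse.get(addr)
        if stored is not None and stored[0] != key:
            raise KeyError("hash collision (not resolved in this sketch)")
        # Key and value go to one address in a single assignment.
        self._sparse[addr] = (key, value)

    def get(self, key: bytes):
        stored = self._sparse.get(key_to_address(key))
        if stored is None or stored[0] != key:
            return None
        return stored[1]

    def delete(self, key: bytes) -> bool:
        addr = key_to_address(key)
        stored = self._sparse.get(addr)
        if stored is not None and stored[0] == key:
            del self._sparse[addr]  # the analogue of trimming the address
            return True
        return False


kv = ToyDirectKV()
kv.put(b"user:1", b"alice")
found = kv.get(b"user:1")
removed = kv.delete(b"user:1")
missing = kv.get(b"user:1")
```

In this model a put corresponds to one write at a computed address and a delete to one trim. That is the sense in which such a library could lean on atomic write and persistent trim primitives instead of maintaining its own allocation metadata and recovery log.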