
Pangolin: A Fault-tolerant Persistent Memory Programming Library

Lu Zhang and Steven Swanson
Non-Volatile Systems Laboratory
Department of Computer Science & Engineering
University of California, San Diego

Persistent memory (PMEM) finally arrives
• Works alongside DRAM, behind the same memory controller
• New programming model:
  – Byte addressability
  – Memory semantics
  – Direct access (DAX)

Challenges with PMEM programming
• Crash consistency
  – Volatile CPU caches
  – Only 8-byte store atomicity
• Fault tolerance
  – Media errors
  – Software bugs

Persistent memory error types
• Persistent memory and its controller implement ECC:
  – ECC-detectable and correctable errors need no software intervention: the controller auto-corrects the error, and the application receives good data.
  – ECC-detectable but uncorrectable errors require signal handling: the application receives SIGBUS.
  – ECC-undetectable errors demand software detection and correction: the application silently receives bad data.

Handling uncorrectable and undetectable errors
• Prepare some redundancy for recovery
• Implement software-based error detection and correction

DAX filesystems cannot protect mmap'ed data
• Some filesystems (e.g., NOVA) provide protection, but only via read()/write()
• No known filesystem can protect DAX-mmap'ed PMEM data
• Pangolin fills this gap as a user-space library layered over the mmap'ed PMEM file

Pangolin design goals
• Ensure crash consistency
• Protect application data against media and software errors
• Require very low storage overhead (about 1%) for fault tolerance

Pangolin – Replication, parity, and checksums
• Combines replication (for metadata) and parity (for object data) as redundancy
  – Performance similar to replication
  – Low space overhead (about 1% of a gigabyte-sized object store)
• Checksums all metadata and object data

Pangolin – Transactions with micro-buffering
• Provides micro-buffering-based transactions
  – Buffers application changes in DRAM
  – Atomically updates objects, checksums, and parity

Pangolin's data redundancy
• Reserves space in the mapped PMEM file for a metadata replica and object parity
• Organizes object data pages into "rows"
• Default row size is 160 MB, 1% of a data "zone"

Pangolin's parity coding
• Computes each parity page "vertically": the XOR of the pages at the same offset across all rows
• Can afford losing one whole row of data
• By default, Pangolin implements 100 rows per data zone

Micro-buffering provides transactions
• Moves object data into DRAM and performs a data-integrity (checksum) check
• Buffers writes to objects and writes them back to PMEM on commit
• Guarantees consistency with redo logging (replicated)

Updating parity using only modified ranges
• Compute a delta over just the modified range: Δ = old data ⊕ new data
• New parity = old parity ⊕ Δ

Parity's crash consistency depends on object logs
• After a power failure, apply all redo logs (if any exist) and then re-compute the affected parity

Multithreaded update – Lock parity ranges
• Lock a range of parity and serialize parity updates to it

Multithreaded update – Atomic XORs
• A parity range can instead be updated lock-free with atomic XORs
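Pangolin's vertical parity coding boils down to XOR-ing the page at the same offset in every row; a lost row is then the XOR of the parity with the surviving rows. A minimal sketch in C (the function name is illustrative, not Pangolin's actual API):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Compute one parity page as the XOR of the same page offset across
 * all data rows. Losing any single row is recoverable by XOR-ing the
 * parity page with the surviving rows' pages. */
static void parity_compute(uint8_t *parity, uint8_t *const *rows,
                           size_t nrows, size_t page_len)
{
    memset(parity, 0, page_len);
    for (size_t r = 0; r < nrows; r++)
        for (size_t i = 0; i < page_len; i++)
            parity[i] ^= rows[r][i];
}
```

With the default of 100 rows per zone, one parity row protects 100 data rows, which is where the roughly 1% space overhead comes from.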
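The modified-range parity update (Δ = old data ⊕ new data, new parity = old parity ⊕ Δ) never touches the unmodified pages of the zone. A sketch with a hypothetical helper, assuming the caller passes the old and new contents of the modified byte range:

```c
#include <stddef.h>
#include <stdint.h>

/* Fold one object's modified range into the parity:
 *   delta  = old_data XOR new_data
 *   parity = parity   XOR delta
 * Only `len` bytes of parity are read and written, regardless of how
 * many rows the zone contains. */
static void parity_update_range(uint8_t *parity, const uint8_t *old_data,
                                const uint8_t *new_data, size_t len)
{
    for (size_t i = 0; i < len; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
}
```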
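The micro-buffering flow above (copy into DRAM with an integrity check, modify the buffer, write back with a fresh checksum on commit) can be sketched as follows. All names are illustrative, and the toy checksum stands in for a real one such as CRC32; replicated redo logging and the parity update are elided:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OBJ_SIZE 64

struct pmem_obj { uint8_t data[OBJ_SIZE]; uint32_t csum; };

static uint32_t csum32(const uint8_t *p, size_t n)
{
    uint32_t c = 0;                        /* toy checksum for the sketch */
    for (size_t i = 0; i < n; i++)
        c = c * 31 + p[i];
    return c;
}

/* Copy the object from PMEM into a DRAM micro-buffer, verifying its
 * checksum on the way in; a mismatch would trigger parity recovery. */
static int obj_open(const struct pmem_obj *src, uint8_t *ubuf)
{
    if (csum32(src->data, OBJ_SIZE) != src->csum)
        return -1;                          /* corruption detected */
    memcpy(ubuf, src->data, OBJ_SIZE);
    return 0;
}

/* Commit: redo-log and parity update would happen here; then write the
 * buffer and a fresh checksum back to PMEM. */
static void obj_commit(struct pmem_obj *dst, const uint8_t *ubuf)
{
    memcpy(dst->data, ubuf, OBJ_SIZE);
    dst->csum = csum32(dst->data, OBJ_SIZE);
}
```

Because every read passes through `obj_open`, an ECC-undetectable error surfaces as a checksum mismatch instead of silently reaching the application.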
Multithreaded update – Hybrid scheme
• Atomic XORs can be slower than vectorized ones
• Use a shared mutex to coordinate both methods:
  – Small updates (< 8 KB): take a shared lock on a parity range (8 KB) and update parity concurrently with atomic XORs
  – Large updates (≥ 8 KB): take exclusive locks on the parity ranges (8 KB each) and update parity with vectorized, non-atomic XORs

Performance – Single-object transactions
• Evaluated single-object overwrite latencies (64–4096-byte objects) on Intel Optane DC persistent memory, against Intel's libpmemobj
• On average, Pangolin's latency is 11% lower than libpmemobj with replication

Performance – Multi-object transactions
• On insertion and removal benchmarks (ctree, rbtree, btree, skiplist, rtree, hashmap), Pangolin's performance is 90% of libpmemobj's with replication
• Pangolin incurs about 100× less space overhead

Conclusion
• PMEM programming libraries should also consider fault tolerance for critical applications.
• Parity-based redundancy provides performance similar to replication and significantly reduces space overhead.
• Micro-buffering-based transactions can both support crash consistency and provide fault tolerance.
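The lock-free path of the multithreaded parity update can be sketched with C11 atomics: small updates XOR their deltas into the parity word by word with atomic fetch-xor, so threads sharing a parity range need no exclusive lock. This is a sketch with illustrative names, not Pangolin's real API; a large update would instead hold the range lock exclusively and use plain, vectorizable XORs:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* XOR a delta into a parity range lock-free. Each 8-byte word is
 * updated with an atomic fetch-xor, so concurrent small updates to the
 * same range commute and never lose each other's bits. */
static void parity_xor_atomic(_Atomic uint64_t *parity,
                              const uint64_t *delta, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        atomic_fetch_xor_explicit(&parity[i], delta[i],
                                  memory_order_relaxed);
}
```

XOR's commutativity is what makes this safe: the final parity is the same regardless of the order in which concurrent deltas land.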