Pangolin: A Fault-tolerant Persistent Memory Programming Library
Lu Zhang and Steven Swanson
Non-Volatile Systems Laboratory, Department of Computer Science & Engineering, University of California, San Diego
Persistent memory (PMEM) finally arrives
• Working alongside DRAM
• New programming model
  – Byte addressability
  – Memory semantics
  – Direct access (DAX)
[Figure: CPU caches and the memory controller attach to both DRAM and PMEM DIMMs]
Challenges with PMEM programming
• Crash consistency
  – Volatile CPU caches
  – 8-byte store atomicity
• Fault tolerance
  – Media errors
  – Software bugs
[Figure: a MOV store must traverse the CPU's volatile L1/L2 caches before reaching PMEM]
Persistent memory error types
• Persistent memory and its controller implement ECC
  – ECC-detectable & correctable errors do not need software intervention
  – ECC-detectable but uncorrectable errors require signal handling
  – ECC-undetectable errors demand software detection and correction
[Figure: three cases: a correctable error is auto-corrected by the PMEM controller and the application receives good data; a detected-but-uncorrectable error raises SIGBUS; an undetectable error silently delivers bad data to the application]
Handle uncorrectable & undetectable errors
• Prepare some redundancy for recovery
• Implement software-based error detection and correction
[Figure: a detected-but-uncorrectable error raises SIGBUS; an undetectable error returns bad data to the application]
DAX filesystems cannot protect mmap'ed data
• Some filesystems (e.g., NOVA) provide protection only via read()/write()
• No known filesystem can protect DAX-mmap'ed PMEM data
• Pangolin interposes in user space, between the DAX-mmap application and the mapped PMEM file, to provide this protection
[Figure: an application using read()/write() is protected by the file system; a DAX-mmap application accesses persistent memory directly from user space and is unprotected without Pangolin]
Pangolin design goals
• Ensure crash consistency
• Protect application data against media and software errors
• Require very low storage overhead (1%) for fault tolerance
Pangolin – Replication, parity, and checksums
• Combines replication and parity as redundancy
  – Similar performance compared to replication
  – Low space overhead (1% of a gigabyte-sized object store)
• Checksums all metadata and object data
[Figure: metadata and its replica sit alongside the object rows; all object rows share one parity row]
Pangolin – Transactions with micro-buffering
• Provides micro-buffering-based transactions
  – Buffers application changes in DRAM
  – Atomically updates objects, checksums, and parity
[Figure: modified objects are staged in DRAM micro-buffers; on commit they are written back to the PMEM object rows and the parity is updated]
Pangolin's data redundancy
• Reserve space for metadata replication and object parity
• Organize object data pages into "rows"
• Row size: 160 MB by default (1% of a data "zone")
[Figure: the application maps a PMEM file laid out as metadata, a metadata replica, data pages 0–9 arranged as rows 0–3, and a parity row p]

Pangolin's parity coding
• Compute a parity page vertically across all rows
• Afford losing one whole row of data
• By default, Pangolin implements 100 rows per data zone
[Figure: each parity page in row p is the XOR of the pages in the same column of rows 0–3, e.g. Page 8 = Page 0 ⊕ Page 2 ⊕ Page 4 ⊕ Page 6]

Micro-buffering provides transactions
• Move object data into DRAM and perform a data-integrity check
• Buffer writes to objects and write them back to PMEM on commit
• Guarantee consistency with (replicated) redo logging
[Figure: a transaction on obj 1: (1) buffer the object into a DRAM micro-buffer, verifying its checksum; (2) the application updates the buffer through ptr1, producing D1'; (3) write the redo log and replicate it; (4) update the parity row; (5) write D1' and its new checksum back to PMEM]

Updating parity using only modified ranges
[Figure: compute the delta Δ1 = D1 ⊕ D1' over just the modified range, then the new parity P1' = P1 ⊕ Δ1; only the modified bytes of the parity row are rewritten]
Parity's crash consistency depends on object logs
• Apply all redo logs (if any exist) and then re-compute parity
[Figure: a power failure can strike in the middle of a parity update, leaving P1' partially written; because the replicated redo log still holds D1', recovery can replay the log and then recompute the parity row from the data rows]
Multithreaded update – Lock parity ranges
• Lock a range of parity and serialize parity updates
[Figure: Thread 1 and Thread 2 each compute a delta in DRAM (Δ1 from D1/D1', Δ7 from D7/D7'), then apply them to the parity row one at a time under the range lock (steps 1–4)]
Multithreaded update – Atomic XORs
• Parity ranges can be updated lock-free with atomic XORs
[Figure: Thread 1 and Thread 2 XOR their deltas Δ1 and Δ7 into the same parity range concurrently, with no lock]
Multithreaded update – Hybrid scheme
• Atomic XORs can be slower than vectorized ones
• Use a shared mutex to coordinate both methods
• Small updates (< 8 KB)
  – Take the shared lock of a parity range (8 KB)
  – Update parity concurrently with atomic XORs
• Large updates (≥ 8 KB)
  – Take exclusive locks of parity ranges (8 KB each)
  – Update parity using vectorized (non-atomic) XORs
Performance – Single-object transactions
• Evaluated on Intel's Optane DC persistent memory against libpmemobj
• On average, Pangolin's latency is 11% lower than libpmemobj with replication
[Figure: single-object overwrite latencies (microseconds) for libpmemobj, libpmemobj-replication, and pangolin at object sizes of 64, 256, 1024, and 4096 bytes]
Performance – Multi-object transactions
• Performance of Pangolin is 90% of libpmemobj's with replication
• Pangolin incurs about 100× less space overhead
[Figure: average insertion and removal latencies (microseconds) for ctree, rbtree, btree, skiplist, rtree, and hashmap under libpmemobj, libpmemobj-replication, and pangolin]
Conclusion
• PMEM programming libraries should also consider fault tolerance for critical applications.
• Parity-based redundancy provides performance similar to replication while significantly reducing space overhead.
• Micro-buffering-based transactions can provide both crash consistency and fault tolerance.