Pangolin: A Fault-tolerant Persistent Memory Programming Library
Lu Zhang and Steven Swanson
Non-Volatile Systems Laboratory, Department of Computer Science & Engineering, University of California, San Diego
Persistent memory (PMEM) finally arrives
• Working alongside DRAM
• New programming model
  – Byte addressability
  – Memory semantics
  – Direct access (DAX)
[Figure: CPU caches and the memory controller attach to both DRAM and PMEM DIMMs]
Challenges with PMEM programming
• Crash consistency
  – Volatile CPU caches
  – 8-byte store atomicity
• Fault tolerance
  – Media errors
  – Software bugs
[Figure: a MOV store must traverse the CPU's volatile L1/L2 caches before reaching PMEM]
Persistent memory error types
• Persistent memory and its controller implement ECC
  – ECC-detectable & correctable errors do not need software intervention
  – ECC-detectable but uncorrectable errors require signal handling
  – ECC-undetectable errors demand software detection and correction
[Figure: three cases: a correctable error is auto-corrected by the PMEM controller and the application receives good data; a detected-but-uncorrectable error raises SIGBUS; an undetectable error silently delivers bad data to the application]
Handle uncorrectable & undetectable errors
• Prepare some redundancy for recovery
• Implement software-based error detection and correction
[Figure: a detected-but-uncorrectable error raises SIGBUS; an undetectable error returns bad data to the application]
DAX filesystems cannot protect mmap'ed data
• Some filesystems (e.g., NOVA) provide protection only via read()/write()
• No known filesystem can protect DAX-mmap'ed PMEM data
• Pangolin interposes in user space, between the DAX-mmap application and the mapped PMEM file, to provide this protection
[Figure: an application using read()/write() is protected by the file system; a DAX-mmap application accesses persistent memory directly from user space and is unprotected without Pangolin]
Pangolin design goals
• Ensure crash consistency
• Protect application data against media and software errors
• Require very low storage overhead (1%) for fault tolerance
Pangolin – Replication, parity, and checksums
• Combines replication and parity as redundancy
  – Similar performance compared to replication
  – Low space overhead (1% of a gigabyte-sized object store)
• Checksums all metadata and object data
[Figure: metadata and its replica sit alongside the object rows; all object rows share one parity row]
Pangolin – Transactions with micro-buffering
• Provides micro-buffering-based transactions
  – Buffers application changes in DRAM
  – Atomically updates objects, checksums, and parity
[Figure: modified objects are staged in DRAM micro-buffers; on commit they are written back to the PMEM object rows and the parity is updated]
Pangolin's data redundancy
• Reserve space for metadata replication and object parity
• Organize object data pages into "rows"
• Row size: 160 MB by default (1% of a data "zone")
[Figure: the application maps a PMEM file laid out as metadata, a metadata replica, data pages 0–9 arranged as rows 0–3, and a parity row p]

Pangolin's parity coding
• Compute a parity page vertically across all rows
• Afford losing one whole row of data
• By default, Pangolin implements 100 rows per data zone
[Figure: each parity page in row p is the XOR of the pages in the same column of rows 0–3, e.g. Page 8 = Page 0 ⊕ Page 2 ⊕ Page 4 ⊕ Page 6]

Micro-buffering provides transactions
• Move object data into DRAM and perform a data-integrity check
• Buffer writes to objects and write them back to PMEM on commit
• Guarantee consistency with (replicated) redo logging
[Figure: a transaction on obj 1: (1) buffer the object into a DRAM micro-buffer, verifying its checksum; (2) the application updates the buffer through ptr1, producing D1'; (3) write the redo log and replicate it; (4) update the parity row; (5) write D1' and its new checksum back to PMEM]

Updating parity using only modified ranges
[Figure: compute the delta Δ1 = D1 ⊕ D1' over just the modified range, then the new parity P1' = P1 ⊕ Δ1; only the modified bytes of the parity row are rewritten]
Parity's crash consistency depends on object logs
• Apply all redo logs (if any exist) and then re-compute parity
[Figure: a power failure can strike in the middle of a parity update, leaving P1' partially written; because the replicated redo log still holds D1', recovery can replay the log and then recompute the parity row from the data rows]
Multithreaded update – Lock parity ranges
• Lock a range of parity and serialize parity updates
[Figure: Thread 1 and Thread 2 each compute a delta in DRAM (Δ1 from D1/D1', Δ7 from D7/D7'), then apply them to the parity row one at a time under the range lock (steps 1–4)]
Multithreaded update – Atomic XORs
• Parity ranges can be updated lock-free with atomic XORs
[Figure: Thread 1 and Thread 2 XOR their deltas Δ1 and Δ7 into the same parity range concurrently, with no lock]
Multithreaded update – Hybrid scheme
• Atomic XORs can be slower than vectorized ones
• Use a shared mutex to coordinate both methods
• Small updates (< 8 KB)
  – Take the shared lock of a parity range (8 KB)
  – Update parity concurrently with atomic XORs
• Large updates (≥ 8 KB)
  – Take exclusive locks of parity ranges (8 KB each)
  – Update parity using vectorized (non-atomic) XORs
Performance – Single-object transactions
• Evaluated on Intel's Optane DC persistent memory against libpmemobj
• On average, Pangolin's latency is 11% lower than libpmemobj with replication
[Figure: single-object overwrite latencies (microseconds) for libpmemobj, libpmemobj-replication, and pangolin at object sizes of 64, 256, 1024, and 4096 bytes]
Performance – Multi-object transactions
• Performance of Pangolin is 90% of libpmemobj's with replication
• Pangolin incurs about 100× less space overhead
[Figure: average insertion and removal latencies (microseconds) for ctree, rbtree, btree, skiplist, rtree, and hashmap under libpmemobj, libpmemobj-replication, and pangolin]
Conclusion
• PMEM programming libraries should also consider fault tolerance for critical applications.
• Parity-based redundancy provides performance similar to replication while significantly reducing space overhead.
• Micro-buffering-based transactions can provide both crash consistency and fault tolerance.