
Pangolin: A Fault-tolerant Persistent Memory Programming Library

Lu Zhang and Steven Swanson

Non-Volatile Systems Laboratory, Department of Computer Science & Engineering, University of California, San Diego

Persistent memory (PMEM) finally arrives

• Working alongside DRAM
• New programming model
  – Byte addressability
  – Memory semantics
  – Direct access (DAX)

[Figure: CPU caches and a memory controller attached to both DRAM and PMEM DIMMs]
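To make the new programming model concrete, here is a minimal sketch of DAX access from C: map a file that lives on a DAX filesystem and store to it with ordinary memory semantics. This is not Pangolin code; the mount point /mnt/pmem and the file name are assumptions, and msync() stands in for the cache-line flush instructions a PMEM library would use.

/*
 * Minimal DAX sketch (not Pangolin code). Assumes Linux 4.15+, glibc 2.28+,
 * and a DAX-capable filesystem mounted at /mnt/pmem (hypothetical path).
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/pmem/example", O_CREAT | O_RDWR, 0644);
    if (fd < 0 || ftruncate(fd, 4096) != 0)
        return 1;

    /* MAP_SHARED_VALIDATE | MAP_SYNC guarantees direct access (DAX):
     * loads and stores reach persistent media with no page-cache copy. */
    char *pmem = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (pmem == MAP_FAILED)
        return 1;

    /* Byte-addressable update with ordinary memory semantics. */
    strcpy(pmem, "hello, persistent memory");

    /* Make the store durable (portable stand-in for clwb + sfence). */
    msync(pmem, 4096, MS_SYNC);

    munmap(pmem, 4096);
    close(fd);
    return 0;
}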

Challenges with PMEM programming

• Crash consistency
  – Volatile CPU caches
  – 8-byte store atomicity
• Fault tolerance
  – Media errors
  – Software bugs
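A hedged sketch of the crash-consistency building blocks listed above (not Pangolin code; persist() and publish() are illustrative names, and x86 CLWB support is assumed): cache lines must be flushed and fenced explicitly, and only an aligned 8-byte store is failure-atomic.

/*
 * Illustrative helpers a PMEM library builds on; compile with -mclwb on x86.
 */
#include <immintrin.h>   /* _mm_clwb, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64

/* Hypothetical helper: force a range out of the volatile CPU caches. */
static void persist(const void *addr, size_t len)
{
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    for (; p < (uintptr_t)addr + len; p += CACHELINE)
        _mm_clwb((void *)p);     /* write the cache line back to PMEM */
    _mm_sfence();                /* order write-backs before later stores */
}

/* An aligned 8-byte store is the largest update the hardware makes
 * failure-atomic; anything bigger needs logging or copy-on-write. */
static void publish(volatile uint64_t *valid, uint64_t value)
{
    *valid = value;                          /* 8-byte atomic store */
    persist((const void *)valid, sizeof *valid);
}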

[Figure: a MOV instruction's data can sit in the volatile L1/L2 caches before it reaches PMEM]

Persistent memory error types

• Persistent memory and its controller implement ECC
  – ECC-detectable & correctable errors do not need software intervention

[Figure: the PMEM controller auto-corrects the error; the application receives good data]

Persistent memory error types

• Persistent memory and its controller implement ECC
  – ECC-detectable & correctable errors do not need software intervention
  – ECC-detectable but uncorrectable ones require signal handling

[Figure: the PMEM controller detects the error but cannot correct it; the application receives SIGBUS]

Persistent memory error types

• Persistent memory and its controller implement ECC
  – ECC-detectable & correctable errors do not need software intervention
  – ECC-detectable but uncorrectable ones require signal handling
  – ECC-undetectable errors demand software detection and correction

[Figure: the error escapes ECC entirely; the application silently receives bad data]

Handle uncorrectable & undetectable errors

• Prepare some redundancy for recovery
• Implement software-based error detection and correction

[Figure: uncorrectable errors surface as SIGBUS; undetectable errors surface as silently corrupted data]
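For the uncorrectable case, one way a user-space library can react to the SIGBUS is sketched below. This is not Pangolin's implementation; restore_page_from_parity() is a hypothetical placeholder for the library's actual recovery path.

/*
 * Sketch: catch the SIGBUS raised by an uncorrectable PMEM error so the
 * library can rebuild the page from its redundancy instead of crashing.
 */
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

static sigjmp_buf read_env;
static void *volatile bad_addr;

static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    bad_addr = info->si_addr;     /* PMEM address that raised the error */
    siglongjmp(read_env, 1);      /* abandon the failed load */
}

/* Copy from PMEM, but survive an uncorrectable error instead of crashing. */
int protected_read(void *dst, const void *pmem_src, size_t len)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, NULL);

    if (sigsetjmp(read_env, 1) != 0) {
        fprintf(stderr, "uncorrectable PMEM error at %p\n", bad_addr);
        /* restore_page_from_parity(bad_addr);  -- rebuild from redundancy */
        return -1;                /* caller retries once the page is rebuilt */
    }
    memcpy(dst, pmem_src, len);   /* may fault with SIGBUS on a bad line */
    return 0;
}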

DAX-filesystem cannot protect mmap’ed data

• Some filesystems (e.g., NOVA) provide protection only via read()/write()
• No known filesystem can protect DAX-mmap’ed PMEM data

[Figure: data accessed through read()/write() goes through the file system and is protected; data accessed through mmap(/mnt/pmem/file) bypasses the file system and reaches persistent memory unprotected]

DAX-filesystem cannot protect mmap’ed data

[Figure: the same diagram with Pangolin added in user space, between the application and the DAX-mmap’ed region, protecting the mmap’ed PMEM data]

Pangolin design goals

• Ensure crash consistency
• Protect application data against media and software errors
• Require very low storage overhead (1%) for fault tolerance

Pangolin – Replication, parity, and checksums

• Combines replication and parity as redundancy
  – Similar performance compared to replication
  – Low space overhead (1% of a gigabyte-sized object store)
• Checksums all metadata and object data

[Figure: metadata and its replica sit alongside the object data, which is protected by a parity region]
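A sketch of the checksum half of the design: verify a per-object checksum before trusting data read from PMEM, and refresh it on every update. The object header layout is hypothetical, and zlib's crc32() is only a stand-in; the slide does not say which checksum function Pangolin uses.

/* Software checksum verification against undetectable errors; link with -lz. */
#include <stdint.h>
#include <zlib.h>

/* Hypothetical object header layout: checksum stored next to the data. */
struct pmem_obj {
    uint32_t csum;
    uint32_t size;
    char     data[];
};

/* Recompute and compare before trusting object data read from PMEM. */
int obj_verify(const struct pmem_obj *obj)
{
    uint32_t c = (uint32_t)crc32(crc32(0L, Z_NULL, 0),
                                 (const Bytef *)obj->data, obj->size);
    return c == obj->csum;   /* 0 means corruption: restore from redundancy */
}

/* Update the checksum whenever the object is (re)written. */
void obj_update_csum(struct pmem_obj *obj)
{
    obj->csum = (uint32_t)crc32(crc32(0L, Z_NULL, 0),
                                (const Bytef *)obj->data, obj->size);
}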

Pangolin – Transactions with micro-buffering

• Provides micro-buffering-based transactions
  – Buffers application changes in DRAM
  – Atomically updates objects, checksums, and parity

[Figure: two objects are buffered in DRAM; the PMEM zone holds the objects and their parity]

[Figure: after commit, the DRAM buffers are written back to the PMEM objects and the parity is updated]

Pangolin’s data redundancy

• Reserve space for metadata replication and object parity
• Organize object data pages into “rows”

[Figure: the application’s address space maps a PMEM file containing metadata, a metadata replica, data pages grouped into rows (Row 0–Row 3), and a parity row (Row p)]

Row size: default 160 MB (1% of a data “zone”)

Pangolin’s parity coding

• Compute a parity page vertically across all rows
• Can afford to lose one whole row of data
• By default, Pangolin implements 100 rows per data zone
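The figure's vertical XOR can be written down directly; the sketch below computes one parity page as the XOR of the same-column page in every data row. The row count and page size are illustrative, not Pangolin's exact layout.

/* Column-wise parity: parity page = XOR of that column's page in every row. */
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define NROWS     100          /* data rows per zone (default per the slide) */

typedef uint8_t page_t[PAGE_SIZE];

/* rows[r] points to row r's pages; parity receives the column-wise XOR. */
void compute_parity_column(page_t *const rows[NROWS], page_t parity,
                           size_t column)
{
    for (size_t i = 0; i < PAGE_SIZE; i++)
        parity[i] = 0;

    for (size_t r = 0; r < NROWS; r++)
        for (size_t i = 0; i < PAGE_SIZE; i++)
            parity[i] ^= rows[r][column][i];
}

/* Any single lost row can be rebuilt by XOR-ing the parity page with the
 * surviving rows' pages at the same column. */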

[Figure: Pages 0, 2, 4, and 6 (one per data row) XOR into parity Page 8; Pages 1, 3, 5, and 7 XOR into parity Page 9 in Row p]

Micro-buffering provides transactions

• Move object data into DRAM and perform a data integrity check
• Buffer writes to objects and write back to PMEM on commit
• Guarantee consistency with redo logging (replicated)
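A sketch of that flow with hypothetical tx_open()/tx_commit() helpers (Pangolin builds on libpmemobj's transactional interface, which is not reproduced here); the commented-out calls mark where integrity checking, logging, parity update, and flushing would go.

/* Micro-buffering sketch: edit a DRAM copy, then commit it to PMEM. */
#include <stdlib.h>
#include <string.h>

struct ubuf {                  /* DRAM micro-buffer for one PMEM object */
    void  *pmem_addr;          /* canonical location of the object in PMEM */
    size_t size;
    char   copy[];             /* private DRAM copy the application edits */
};

/* Step 1, buffering: copy the object into DRAM and check its integrity. */
struct ubuf *tx_open(void *pmem_addr, size_t size)
{
    struct ubuf *u = malloc(sizeof *u + size);
    if (u == NULL)
        return NULL;
    u->pmem_addr = pmem_addr;
    u->size = size;
    memcpy(u->copy, pmem_addr, size);
    /* obj_verify(u->copy, size);          -- detect latent corruption    */
    return u;                              /* caller edits u->copy freely */
}

/* Commit: log, update parity, write back, flush; then free the buffer. */
void tx_commit(struct ubuf *u)
{
    /* redo_log_append(u->pmem_addr, u->copy, u->size);   replicated log  */
    /* parity_update(u->pmem_addr, u->copy, u->size);     delta-based     */
    memcpy(u->pmem_addr, u->copy, u->size);              /* write back    */
    /* persist(u->pmem_addr, u->size);                    flush + fence   */
    free(u);
}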

[Figure: a transaction on obj 1 — (1) buffer the object into DRAM, (2) modify it through the returned pointer (D1 becomes D1’), (3) write the redo log and replicate it, (4) update the parity, (5) write D1’ back to Row 0; checksums accompany each copy]

Updating parity using only modified ranges
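The technique named in the title reduces to two XORs over just the modified byte range: the delta is old ⊕ new, and XOR-ing that delta into the parity at the same offset gives the new parity, P' = P ⊕ (D ⊕ D'). A minimal sketch, with an illustrative helper name:

/* Range-based parity update: only the modified bytes contribute, so the
 * whole page never has to be re-XORed. */
#include <stddef.h>
#include <stdint.h>

void parity_update_range(uint8_t *parity,         /* parity bytes at the same offset */
                         const uint8_t *old_data, /* D: bytes currently in PMEM */
                         const uint8_t *new_data, /* D': bytes about to be written */
                         size_t len)              /* length of the modified range */
{
    for (size_t i = 0; i < len; i++)
        parity[i] ^= old_data[i] ^ new_data[i];   /* fold the delta into parity */
}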

[Figure: XOR-ing the old data D1 with the new data D1’ yields the delta Δ1; XOR-ing Δ1 into the old parity P1 yields the new parity P1’, touching only the modified range of Row p]

Parity’s crash consistency depends on object logs

• Apply all redo logs (if any exist) and then re-compute parity
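That one bullet is the whole recovery rule, and the order matters: replaying the redo logs first makes the data rows consistent, so the parity recomputed afterwards cannot bake in a half-applied update. A self-contained sketch with illustrative types and helpers (not Pangolin's recovery code):

/* Recovery sketch: replay outstanding redo logs, then rebuild parity. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct redo_entry { uint8_t *dst; const uint8_t *src; size_t len; bool valid; };
struct zone_logs  { struct redo_entry *entries; size_t count; };

/* Stub for illustration: a real implementation re-runs the column-wise XOR
 * from the parity-coding sketch over every affected zone. */
static void recompute_parity(void) { }

void zone_recover(struct zone_logs *logs)
{
    for (size_t i = 0; i < logs->count; i++) {
        struct redo_entry *e = &logs->entries[i];
        if (e->valid) {
            for (size_t b = 0; b < e->len; b++)
                e->dst[b] = e->src[b];      /* replay the logged new bytes */
            e->valid = false;               /* log entry consumed */
        }
    }
    recompute_parity();                     /* parity follows consistent data */
}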

[Figure: a power failure interrupts the transaction before the parity update completes; Row p still holds the stale parity P1]


[Figure: after recovery replays the redo log and re-computes parity, obj 1 holds D1’ and Row p holds the updated parity P1’]

Multithreaded update – Lock parity ranges

• Lock a range of parity and serialize parity updates

[Figure: Thread 1 computes Δ1 from D1’ and folds it into the parity to get P1’; Thread 2 computes Δ7 from D7’ and folds it in to get P7’; the range lock serializes the two parity updates]

[Figure: obj 1 in Row 0 and obj 7 in Row 2 map to the same region of the parity row (Row p), so their parity updates conflict]

Multithreaded update – Atomic XORs

• Parity ranges can be updated lock-free with atomic XORs
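A sketch of the lock-free path using C11 atomics; it assumes the parity delta is 8-byte aligned and that parity is only read back after updates quiesce, which is an assumption of this sketch rather than something the slide states.

/* Fold a delta into the parity row 8 bytes at a time, without a lock. */
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

void parity_xor_atomic(_Atomic uint64_t *parity, const uint64_t *delta,
                       size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        atomic_fetch_xor_explicit(&parity[i], delta[i], memory_order_relaxed);
}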

[Figure: Thread 1 and Thread 2 compute their deltas (Δ1, Δ7) in DRAM and XOR them into the shared parity range with atomic operations, without taking a lock]

Multithreaded update – Hybrid scheme

• Atomic XORs can be slower than vectorized ones
• Use a shared mutex to coordinate both methods
• Small updates (< 8 KB)
  – Take a shared lock on a parity range (8 KB)
  – Update parity concurrently with atomic XORs
• Large updates (≥ 8 KB)
  – Take exclusive locks on parity ranges (8 KB each)
  – Update parity using vectorized XORs (non-atomic)
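A sketch of this hybrid policy using a POSIX reader-writer lock; the helper names are illustrative, and a single lock stands in for the per-range locks described above.

/* Hybrid parity update: shared lock + atomic XORs for small writes,
 * exclusive lock + plain (vectorizable) XORs for large writes. */
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define RANGE_SIZE (8 * 1024)            /* parity range granularity: 8 KB */

static pthread_rwlock_t range_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Same primitive as the previous sketch: lock-free delta folding. */
static void xor_atomic(_Atomic uint64_t *parity, const uint64_t *delta,
                       size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        atomic_fetch_xor_explicit(&parity[i], delta[i], memory_order_relaxed);
}

/* Plain XOR loop; with exclusive access the compiler is free to vectorize. */
static void xor_plain(uint64_t *parity, const uint64_t *delta, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        parity[i] ^= delta[i];
}

void parity_update(uint64_t *parity, const uint64_t *delta, size_t len)
{
    size_t nwords = len / sizeof(uint64_t);

    if (len < RANGE_SIZE) {
        /* Small update: share the range; atomic XORs keep it safe.
         * (The cast is a simplification of this sketch.) */
        pthread_rwlock_rdlock(&range_lock);
        xor_atomic((_Atomic uint64_t *)parity, delta, nwords);
        pthread_rwlock_unlock(&range_lock);
    } else {
        /* Large update: exclude other writers, then XOR without atomics. */
        pthread_rwlock_wrlock(&range_lock);
        xor_plain(parity, delta, nwords);
        pthread_rwlock_unlock(&range_lock);
    }
}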

Performance – Single-object transactions

• Evaluation based on Intel’s Optane DC persistent memory
• On average, Pangolin’s latency is 11% lower than libpmemobj with replication

[Figure: “Single-object Overwrite Latencies” — latency in microseconds vs. object size (64, 256, 1024, 4096 bytes) for libpmemobj, libpmemobj-replication, and pangolin]

Performance – Multi-object transactions

• Performance of Pangolin is 90% of libpmemobj’s with replication
• Pangolin incurs about 100× less space overhead

[Figure: “Average Insertion Latencies” and “Average Removal Latencies” in microseconds for ctree, rbtree, btree, skiplist, rtree, and hashmap, comparing libpmemobj, libpmemobj-replication, and pangolin]

Conclusion

• PMEM programming libraries should also consider fault tolerance for critical applications.
• Parity-based redundancy provides performance similar to replication’s while significantly reducing space overhead.
• Micro-buffering-based transactions can provide both crash consistency and fault tolerance.
