Efficient Fault Tolerance using MPX and TSX

Oleksii Oleksenko, Pascal Felber Dmitrii Kuvaiskii, Pramod Bhatotia, Christof Fetzer

DSN’16 Data corruption

● Performance-critical systems ➜ in a low-level language (/C++) ● Low-level language ➜ no ○ Applications are more vulnerable to hardware faults

DSN’16 2 Data corruption

● Performance-critical systems ➜ in a low-level language (C/C++) ● Low-level language ➜ no memory protection ○ Applications are more vulnerable to hardware faults ● A pointer gets corrupted ➜ stays undetected

Memory Memory

Corrupted object

Object Object

Corrupted Pointer pointer

DSN’16 2 Overview

Problem:

● Existing solutions are expensive ○ They harden the entire program

Approach:

● Partial protection for efficient fault-tolerance ○ Protect only data pointers

DSN’16 3 Hypothesis

Leverage the new ISA extensions in modern CPUs for fault tolerance

DSN’16 4 Hypothesis

Leverage the new ISA extensions in modern CPUs for fault tolerance

Fault detection Fault recovery Intel MPX Intel TSX (Memory Protection Extension) (Transaction Sync Extensions)

DSN’16 4 Fault detection via Memory Protection Extension

Intel MPX: in the H/W Memory

Insight: Pointer error will cause bound violation with high probability Object bounds

Pointer

DSN’16 5 Fault detection via Memory Protection Extension

Intel MPX: Bounds checking in the H/W Memory

Insight: Pointer error will cause bound violation with high probability Object bounds

Corrupted pointer

DSN’16 5 Fault detection via Memory Protection Extension

Intel MPX: Bounds checking in the H/W Memory Detected by MPX Out-of-bound Insight: Pointer error will cause bound access violation with high probability Object bounds

Corrupted pointer

DSN’16 5 Fault recovery using transactions

Intel TSX: Transactions for optimistic Transaction concurrency control ...

Approach: add mul ● Detect faults using MPX ● Use transactions for recovery mov

...

DSN’16 6 Fault recovery using transactions

Intel TSX: Transactions for optimistic Transaction concurrency control ...

Approach: add mul ● Detect faults using MPX mov ● Use transactions for recovery MPX detects a fault ...

DSN’16 6 Fault recovery using transactions

Intel TSX: Transactions for optimistic Transaction

concurrency control ... Rollback the transaction and re-execute Approach: add mul ● Detect faults using MPX mov ● Use transactions for recovery MPX detects a fault ...

DSN’16 6 Performance overheads

lower better

DSN’16 7 Performance overheads

lower better

DSN’16 7 Performance overheads

lower better

~ 50% slowdown on average

DSN’16 7 Summary

Leverage the new ISA extensions for Transaction fault tolerance: ... Rollback the transaction ● MPX: fault detection add and re-execute mul ● TSX: fault recovery mov MPX detects a Improved efficiency: fault ...

● ~ 50% slowdown ○ State-of-the-art full hardening - 100%

DSN’16 8 Summary

Leverage the new ISA extensions for Transaction fault tolerance: ... Rollback the transaction ● MPX: fault detection add and re-execute mul ● TSX: fault recovery mov MPX detects a Improved efficiency: fault ...

● ~ 50% slowdown ○ State-of-the-art full hardening - 100%

Thanks! [email protected] DSN’16 8 DSN’16 9 Backups

DSN’16 19 Generic Solutions

There are solutions for full-program hardening:

- and -level ● Duplicated execution redundancy ○ high performance overhead ○ additional hardware ○ x2 on average ○ i.e., more cores used

add add add add mul mul mul mul mov mov mov mov

DSN’16 20 Approach

...

add Recover (TSX)

mul

mov

Detect (MPX)

DSN’16 21