Fault Tolerance and Security

Heechul Yun

Safety Failures in CPS

Therac-25
• Computer-controlled medical X-ray treatments
• Six people died/injured due to massive overdoses (1985-1987)
• Caused by synchronization mistakes

Ariane 5
• 7 billion dollar rocket was destroyed after 40 secs (6/4/1996)
• “caused by the complete loss of guidance and attitude information”
→ Caused by 64-bit floating point to 16-bit integer conversion
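The conversion hazard named above can be illustrated with a small sketch. It is not a reconstruction of the flight software: the function names are hypothetical, and the wrapping variant only models what an unchecked narrowing conversion may do on some platforms. The point is that a value that fits comfortably in a 64-bit float can fall outside the 16-bit signed range, so the conversion needs an explicit policy.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_wrapping(x: float) -> int:
    """Model an unchecked conversion: keep only the low 16 bits (two's complement)."""
    low = int(x) & 0xFFFF
    return low - 0x10000 if low >= 0x8000 else low


def to_int16_checked(x: float) -> int:
    """Convert only when the truncated value fits in 16 signed bits."""
    v = int(x)  # truncates toward zero
    if not INT16_MIN <= v <= INT16_MAX:
        raise OverflowError(f"{x!r} does not fit in a 16-bit signed integer")
    return v


def to_int16_saturating(x: float) -> int:
    """Clamp out-of-range values to the nearest representable bound."""
    return max(INT16_MIN, min(INT16_MAX, int(x)))
```

Which policy is appropriate (raise, saturate, or something else) depends on the system; the sketch only shows that the choice should be explicit rather than left to the default behavior.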

Safety Failures in CPS

http://www.nytimes.com/2015/01/28/us/white-house-drone.html

http://petapixel.com/2015/12/23/crashing-camera-drone-narrowly-misses-top-skiier/

http://rochester.nydatabases.com/map/domestic-drone-accidents

http://www.nytimes.com/interactive/2016/07/01/business/inside-tesla-accident.html

Failures in CPS have consequences

Safety Threats in CPS

• Cyber System (i.e., computer)
  – Software bugs
    • Logical, temporal
  – Hardware bugs
    • Permanent, transient

• Physical System (i.e., plant)
  – Sensor inaccuracies
  – Actuator malfunctions/physical damages

Safety Challenges: Software

• Increasing complexity
  – E.g., Linux: > 15M SLOC
    https://www.quora.com/How-many-lines-of-code-are-in-the-Linux-kernel
• Concurrency
  – Multithreading is hard
• Timing unpredictability
  – Shared resource contention affects timing

→ Software bugs are hard to weed out
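One reason multithreading is hard is that a simple `counter += 1` is really a read followed by a write, and another thread can run in between. The sketch below replays fixed interleavings of two workers instead of using real threads, so the outcome is deterministic. The function name and the schedule format are hypothetical, chosen only for illustration.

```python
def run_interleaved(schedule):
    """Replay a fixed interleaving of workers that each perform counter += 1.

    Each increment is split into a "read" step and a "write" step, which is
    where a real preemption could occur.
    """
    counter = 0
    local = {}  # per-worker private copy of the value it last read
    for worker, step in schedule:
        if step == "read":
            local[worker] = counter
        elif step == "write":
            counter = local[worker] + 1
        else:
            raise ValueError(f"unknown step: {step}")
    return counter


# Worker A finishes before worker B starts: both increments are kept.
serial = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]
# Both workers read before either writes: one increment is lost.
racy = [("A", "read"), ("B", "read"), ("A", "write"), ("B", "write")]

print(run_interleaved(serial))  # 2
print(run_interleaved(racy))    # 1
```

In a real program the interleaving is chosen by the scheduler, so the lost update may appear only rarely, which is part of what makes such bugs hard to find by testing.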

Safety Challenges: Hardware

• Hardware bugs
  – Pentium floating point bug (FDIV bug), circa 1994
  – CPU bugs in 2015: http://danluu.com/cpu-bugs/
    • “Certain Combinations of AVX Instructions May Cause Unpredictable System Behavior”
    • “Processor May Experience a Spurious LLC-Related Machine Check During Periods of High Activity”
    • …
• Transient hardware faults (soft errors)
  – Single event upset (SEU) in SRAM, logic
    • Due to alpha particles, cosmic radiation
  – Manifested as software failures
    • Crashes, or silently wrong output
  – Bigger problem in advanced CPUs
    • Increased density, frequency → higher soft error rate

http://www.cotsjournalonline.com/articles/view/102279
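To see why a single event upset can show up as silently wrong output, it helps to flip one bit of a stored value and look at what the software would read back. The sketch below does this for an IEEE 754 double using Python's `struct` module. The helper name `flip_bit` is hypothetical, and injecting the fault in software is only a stand-in for a real upset in SRAM or logic.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its 64-bit IEEE 754 encoding inverted."""
    if not 0 <= bit < 64:
        raise ValueError("bit index must be in [0, 63]")
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


print(flip_bit(1.0, 0))   # lowest mantissa bit: 1.0000000000000002
print(flip_bit(1.0, 52))  # lowest exponent bit: 0.5
print(flip_bit(1.0, 62))  # highest exponent bit: inf
print(flip_bit(1.0, 63))  # sign bit: -1.0
```

None of these flips raises an error by itself. The program simply continues with a different number, and the size of the deviation depends on which bit was hit.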

Safety Challenges: Hardware

• SRAM soft error rate (SER) vs. technology scaling
  – Per-bit SER decreases
  – Per-chip SER increases (due to higher density)

Ibe et al., “Scaling Effects on Neutron-Induced Soft Error in SRAMs Down to 22nm Process” (Hitachi)
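The two trends above are compatible because the per-chip rate is roughly the per-bit rate multiplied by the number of bits. The sketch below works this through with made-up, normalized numbers. They are not taken from Ibe et al., and the model assumes bit errors are independent so that rates simply add up.

```python
def chip_ser(per_bit_ser: float, bits: int) -> float:
    """Per-chip soft error rate, assuming independent and identical bits."""
    return per_bit_ser * bits


# Hypothetical technology generations (normalized units, illustration only).
old = chip_ser(per_bit_ser=1.0, bits=1_000_000)
new = chip_ser(per_bit_ser=0.5, bits=4_000_000)  # per-bit SER halves, density quadruples

print(new / old)  # 2.0: per-chip SER doubles even though per-bit SER fell
```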

Security Challenges: Software

• Insecure software in CPS → safety hazards
• Stuxnet: first reported cyber warfare, targeting Iranian nuclear plants (destroying centrifuges)
• Vermont power grid hack by Russia
• Remote hack into cars (Jeep)
• Police drone hacking
• Sensor hacking: GPS spoofing, IMU spoofing

Security Challenges: Hardware

• Disturbance errors in DRAM (*)
• a.k.a. RowHammer bug
• Repeatedly opening/closing a DRAM row can cause bit flips in adjacent rows.
• Found in more than 80% of DRAM modules from 2010-2013
• Google demonstrated a successful attack exploiting the bug (**)
  – Manipulates page tables from user level
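The disturbance mechanism described above can be pictured with a toy model in which every activation of a row slightly disturbs its two neighbors, and a neighbor flips a bit once it has absorbed enough disturbance without being refreshed. Everything in this sketch is an assumption made for illustration: the class name, the threshold value, and the simplification that refresh fully resets the accumulated disturbance. It does not model real DRAM timing or cell physics.

```python
class ToyDram:
    """Toy model: each activation of a row disturbs its immediate neighbors."""

    def __init__(self, num_rows: int, flip_threshold: int):
        self.num_rows = num_rows
        self.flip_threshold = flip_threshold   # hypothetical, not a measured value
        self.disturbance = [0] * num_rows      # disturbance absorbed since last refresh
        self.flipped = [False] * num_rows      # True once a row has suffered a bit flip

    def activate(self, row: int) -> None:
        """Open and close `row` once, disturbing the rows directly next to it."""
        for victim in (row - 1, row + 1):
            if 0 <= victim < self.num_rows:
                self.disturbance[victim] += 1
                if self.disturbance[victim] >= self.flip_threshold:
                    self.flipped[victim] = True

    def refresh(self) -> None:
        """Clear accumulated disturbance; bits that already flipped stay flipped."""
        self.disturbance = [0] * self.num_rows


dram = ToyDram(num_rows=8, flip_threshold=100)
for _ in range(100):
    dram.activate(3)  # hammer the aggressor row
print([i for i, f in enumerate(dram.flipped) if f])  # [2, 4]
```

In this model the aggressor row itself is never corrupted. Only the rows the program did not access are affected, which is what makes the bug exploitable across protection boundaries.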

(*) Yoongu Kim et al., “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” ISCA’14
(**) Google Project Zero, “Exploiting the DRAM rowhammer bug to gain kernel privileges,” 2015

[Figure: DRAM chip, showing rows of cells and their wordlines; an aggressor row is repeatedly opened and closed (wordline voltage toggling between high and low) between two victim rows]

Repeatedly opening and closing a row induces disturbance errors in adjacent rows

This slide is from Dr. Yoongu Kim’s ISCA 2014 presentation

How to Improve Safety of a System?

• Correct by design
  – Formal-methods-based software development
    • Difficult for a complex system
  – Use reliable hardware
    • e.g., radiation-hardened processors
    • Expensive and low performance

• Deal with failures
  – Run-time monitoring and redundancy
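Run-time monitoring with redundancy can be sketched as a small decision function that normally forwards the output of a complex, untrusted controller, and falls back to a simple, trusted one when the monitored state leaves a safe envelope or the complex controller fails. This is only a minimal sketch in the spirit of a Simplex-style design. The one-dimensional state, the envelope check, and all names are hypothetical.

```python
from typing import Callable

Controller = Callable[[float], float]


def select_command(
    state: float,
    complex_ctrl: Controller,
    safety_ctrl: Controller,
    is_safe: Callable[[float], bool],
) -> float:
    """Use the complex controller only while the state is inside the safe envelope."""
    if is_safe(state):
        try:
            return complex_ctrl(state)
        except Exception:
            pass  # a crash of the complex controller is treated as a failure
    return safety_ctrl(state)


def crashing_ctrl(x: float) -> float:
    return 1.0 / 0.0  # stands in for a software bug


def aggressive_ctrl(x: float) -> float:
    return 10.0  # high-performance but unverified


def safe_ctrl(x: float) -> float:
    return -0.5 * x  # simple controller that steers back toward 0


def inside_envelope(x: float) -> bool:
    return abs(x) <= 1.0


print(select_command(0.2, aggressive_ctrl, safe_ctrl, inside_envelope))  # 10.0
print(select_command(1.5, aggressive_ctrl, safe_ctrl, inside_envelope))  # -0.75
```

The design choice is that only the monitor and the safety controller need to be trusted, which keeps the part that must be correct small.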

This Week:

• A Simplex Architecture for Intelligent and Safe Unmanned Aerial Vehicles, RTCSA'16

• Application and System-Level Software Fault Tolerance Through Full System Restarts, ICCPS'2017 (optional)

• SAFER: System-level Architecture for Failure Evasion in Real-time Applications. RTSS’12

• ROS: an open-source Robot Operating System, ICRAOSS'09 (optional)

Next Week: Security

• Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors. ISCA, 2014

• Drammer: Deterministic Rowhammer Attacks on Mobile Platforms, CCS'16 [blog] (optional)

• Drone hack: Spoofing attack demonstration on a civilian unmanned aerial vehicle. GPS World, August 2012.

• Comprehensive Experimental Analyses of Automotive Attack Surfaces, USENIX Security, 2011 (optional)

• Rocking Drones with Intentional Sound on Gyroscopic Sensors, USENIX Security’15 (optional)

Fault Tolerance

• Goal: Logical correctness
• Threats
  – Computer systems
    • Software bugs
    • Hardware bugs
  – Physical systems
    • Sensor inaccuracies
    • Actuator malfunctions
• Approaches
  – Redundancy
    • TMR (triple modular redundancy), N-versioning
  – Hardening
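As a concrete instance of the redundancy approach, a TMR voter takes the outputs of three replicas and returns the value that at least two of them agree on, which masks a single faulty replica. The sketch below is a minimal software version. The function name and the choice to raise an error when all three disagree are assumptions made for illustration.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the majority value of three replica outputs.

    Raises RuntimeError when all three disagree, since no single fault
    can then be masked.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority among replicas")
    return value


print(tmr_vote(42, 42, 42))  # 42: all replicas agree
print(tmr_vote(42, 7, 42))   # 42: one faulty replica is outvoted
```

Exact equality works for integer or discrete outputs. Replicas that produce floating point values may need a tolerance-based comparison instead.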

RowHammer