Understanding and Improving Device Access Complexity

Understanding and Improving Device Access Complexity Asim Kadav (with Prof. Michael M. Swift) University of Wisconsin-Madison Devices enrich computers ★ Keyboard ★ Flash storage ★ Graphics ★ WIFI ★ Keyboard ★ Headphones ★ Sound ★ SD card ★ Printer ★ Camera ★ Network ★ Accelerometers ★ Storage ★ GPS ★ Touch display ★ NFC 2 Huge growth in number of devices New I/O devices: accelerometers, GPUS, GPS, touch Many buses: USB, PCI-e, thunderbolt Heterogeneous OS support: 10G ethernet vs card readers 3 Device drivers: OS interface to devices diversity applications OS Expose device abstractions and hide device complexity Expose kernel device drivers abstractions and hide OS complexity buses devices diversity Allow diverse set of applications and OS services to access diverse set of devices 4 Evolution of devices hurts device access Low Simplicity latency Tools and mechanisms to address Efficient device Cost increasing device complexity Reliabilitysupport in OS effective Growth in Run in number and challenging diversity environments Complex Hardware firmware and failures (like configuration CMOS Evolution issues) of devices modes 5 Growth in drivers hurts understanding of drivers Lines of code in Linux 3.8 applications OS 0 1750000 3500000 5250000 7000000 memory mgmt 60,000 device devicekernel drivers66,000 OS drivers file systems 760,000 kernel drivers buses Contribute 60% of Linux kernel code and more than 90% in Windows devices 6,700,000 Understand the software complexity and improve driver code 6 Last decade: Focus on the driver-kernel interface + device drivers OS 3rd party developers kernel Recipe for disaster 7 Re-use lessons from existing driver research Improvement System Validation Drivers Bus Classes New functionality Shadow driver migration [OSR09] 1 1 1 RevNIC [Eurosys 10] 1 1 1 Reliability Nooks [SOSP 03] 6 1 2 XFI [ OSDI 06] 2 1 1 CuriOS [OSDI 08] 2 1 2 Type Safety SafeDrive [OSDI 06] 6 2 3 Singularity [Eurosys 06] 1 1 1 Limited kernel changes + Applicable to lots of drivers => Specification Nexus [OSDI 08] 2 1 2 Real Impact Termite [SOSP 09] 2 1 2 Static analysis tools Windows SDV [Eurosys 06] Many Many Many DesignLarge kernel goal: subsystems CompleteCoverity and solution [CACM validity 10] that of few Alllimits deviceAll kernel typesAll resultchanges in limited and adoptionCocinelle applies [Eurosys of 08] to research all driversAll solutionsAll All 8 Goal: Address software and hardware complexity ★ Understand and improve device access in the face of rising hardware and software complexity Increasing hardware Increasing hardware complexity complexity Reliability against Low latency hardware failures 1 device availability 2 Increasing software Better understanding 3 complexity of driver code 9 My approach Narrow approach and solve specific problems in all drivers Tolerate device failures Broad approach and have a Understand drivers and holistic view of all drivers potential opportunities Known approach and apply to all Transactional approach for drivers low latency recovery Minimize kernel changes and apply to all drivers 10 Contributions/Outline SOSP ’09 First research consideration of hardware failures in drivers Tolerate device failures ASPLOS ’12 Largest study of drivers to Understand drivers and understand their behavior and potential opportunities verify research assumptions ASPLOS ’13 Introduce checkpoint/restore in Transactional approach for drivers for low latency low latency recovery fault tolerance 11 What happens when devices misbehave? ★ Drivers make it better ★ Provide error detection and recovery ★ Drivers make it worse ★ Assume perfect hardware or panic when anything bad happens What assumptions do modern drivers make about hardware ? 12 Current state of OS-hardware interaction ★ Many device drivers often assume device perfection - Common Linux network driver: 3c59x.c while&(ioread16(ioaddr&+&Wn7_MasterStatus)) &&&&0x8000); HANG! Hardware dependence bug: Device malfunction can crash the system 13 Sources of hardware misbehavior ★ Sources of hardware Driver misbehavior Bus ★ Firmware/Design bugs ★ Device wear-out, Cache insufficient burn-in ★ Bridging faults Firmware ★ Electromagnetic interference, radiation, Electrical heat Mechanical Device 14 Sources of hardware misbehavior ★ Sources of hardware ★ Results of misbehavior misbehavior ★ Firmware/Design bugs ★ Corrupted/stuck-at inputs ★ Device wear-out, ★ Timing errors insufficient burn-in ★ Interrupt storms/missing ★ Bridging faults interrupts ★ Electromagnetic ★ Incorrect memory access interference, radiation, heat 15 An evidence: Transient hardware failures caused 8% of all crashes and 9% of all unplanned reboots [1] ★ Systems work fine after reboots ★ Vendors report returned device was faultless Existing solution is hand-coded hardened drivers ★ Crashes reduce from 8% to 3% [1] Fault resilient drivers for Longhorn server, May 2004. Microsoft Corp. 16 How do hardware dependence bugs manifest? 1 Drivers use device printk(“%s”,msg[inb(regA)]); data in critical control and data paths 2 if&(inb(regA)!=&5)&&{ Drivers do not report device &&return&Q22;&//do&nothing malfunction to system log } 3 if&(inb(regA)!=&5)&{ Drivers do not detect or recover from &panic(); device failures } 17 Vendor recommendations for driver developers Recommendation Summary Recommended by Intel Sun MS Linux Validation Input validation ! ! ! Read once& CRC data ! ! ! DMA protection ! ! Timing Infinite polling ! ! ! Stuck interrupt ! Lost request ! Avoid excess delay in OS ! Unexpected events ! ! Reporting Report all failures ! ! ! Recovery Handle all failures ! ! Cleanup correctly ! ! Goal: AutomaticallyDo not crash on failure implement! as many! ! recommendations as possible in commodity drivers Wrap I/O memory access ! ! ! ! 18 Carburizer [SOSP ’09] Goal: Tolerate hardware device failures in software through hardware failure detection and recovery Static analysis component Runtime component ★ Detect and fix hardware ★ Detect interrupt dependence bugs failures ★ Detect and generate ★ Provide automatic missing error reporting recovery information 19 Carburizer architecture Bug detection and Recovery and interrupt automatic fix generation watchdog OS Kernel Kernel Interface Carburizer If&(c==0)&{ . print&(“Driver&init”); } . If&(c==0)&{ Compiler . Carburizer . print&(“Driver& init”); } . Runtime . Hardened Driver Binary Driver Faulty Hardware 20 Hardening drivers • Goal: Remove hardware dependence bugs ★ Find driver code that uses data from device ★ Ensure driver performs validity checks • Carburizer detects and fixes hardware bugs : Unsafe Unsafe System Infinite array pointer panic polling reference reference calls 21 Finding sensitive code ★ First pass: Identify tainted variables that contain data from device Tainted&Variables int&test&()&Types of { device I/O OS &a&=&readl(); a b ★ &b&=&Port I/O :!inb/inwinb(); c ★ &c&=&Memory-mappedb; I/O : readl/readw &d&=&c&+&2; d ★ DMA buffers &return&d; test() ★ Data from USB packets } e int&set()& { &&&&&&&e&=&test(); } network card 22 Detecting risky uses of tainted variables ★ Second pass: Identify risky uses of tainted variables ★ Example: Infinite polling ★ Driver waiting for device to enter particular state ★ Solution: Detect loops where all terminating conditions depend on tainted variables ★ Extra analyses to existing timeouts 23 Infinite polling ★ Infinite polling of devices can cause system lockups static&int&amd8111e_read_phy(………) { &... &&reg_val&=&readl(mmio&+&PHY_ACCESS); !!while!(reg_val!&!PHY_CMD_ACTIVE) & reg_val&=&readl(mmio&+&PHY_ACCESS); &&... } AMD&8111e&network&driver(amd8111e.c) 24 Hardware data used in array reference ★ Tainted variables used as array indexes ★ Detect existing range/not NULL checks static&void&__init&attach_pas_card(...) { &&&if&((pas_model&=&pas_read(0xFF88)))& &&&{& &&&&&... &&&&&sprintf(temp,&“%s&rev&%d”,& &&&&&&&pas_model_names[(int)&pas_model],&pas_read(0x2789));& &&&&&... } Pro&Audio&Sound&driver&(pas2_card.c) 25 Analysis results over the Linux kernel Driver class Infinite polling Static array Dynamic array Panic calls net 117 2 21 2 scsi 298 31 22 121 sound 64 1 0 2 Lightweight and usable technique to video 174find hardware0 dependence22 bugs 22 other 381 9 57 32 Total 860 43 89 179 ★ Analyzed/Built 6300 driver files (2.8 million LOC) in 37 min ★ Found 992 hardware dependence bugs in driver code ★ False positive rate: 7.4% (manual sampling of 190 bugs) 26 Repairing drivers ★ Carburizer automatically generates repair code ★ InsertsCall failure recovery detection and recovery service service callout Timeout Array bounds Not null checks check checks Unsafe Unsafe System Infinite array pointer panic polling reference reference calls 27 Runtime fault recovery : Shadow drivers Driver-Kernel Interface • Carburizer calls generic recovery service if check fails Taps • Low cost transparent recovery Shadow ★ Based on shadow drivers Driver ★ Records state of driver at all times ★ Transparently restarts and replays Device recorded state on failure Driver • No isolation required (like Nooks) Device Swift [OSDI ’04] 28 Carburizer automatically fixes infinite loops timeout!=!rdtscll(start)!+!(cpu/khz/HZ)*2; reg_val&=&readl(mmio&+&PHY_ACCESS); while&(reg_val&&&PHY_CMD_ACTIVE)& { & reg_val&=&readl(mmio&+&PHY_ACCESS);& & if!(_cur!<!timeout)! ! !!!!rdtscll(_cur); ! else ! !!!!__recover_driver(); Timeout code added } AMD&8111e&network&driver(amd8111e.c) *Code/simplified/for/presentation/purposes 29 Carburizer automatically adds bounds checks static&void&__init&attach_pas_card(...) Array bounds { detected and &&if&((pas_model&=&pas_read(0xFF88)))&

Load more