Hardware Error Detection Using AN-Codes

Dissertation for the academic degree of Doktoringenieur (Dr.-Ing.)
submitted to Technische Universität Dresden, Faculty of Computer Science
by Dipl.-Inf. Ute Schiffel, born 8 July 1980 in Sebnitz

Reviewers:
Prof. Christof Fetzer, PhD, Technische Universität Dresden
Prof. Dr. Wolfgang Ehrenberger, Hochschule Fulda

Date of defense: 20 May 2011
Dresden, 10 June 2011

Abstract

Due to continuously decreasing feature sizes and the increasing complexity of integrated circuits, commercial off-the-shelf (COTS) hardware is becoming less and less reliable. However, dedicated reliable hardware is expensive and usually slower than commodity hardware. Thus, economic pressure will most likely result in the use of unreliable COTS hardware in safety-critical systems. This use in turn creates the need for software-implemented solutions for handling the execution errors caused by the unreliable hardware.

In this thesis, we provide techniques for detecting hardware errors that disturb the execution of a program. The detection provided facilitates handling of these errors, for example, by retry or graceful degradation. We realize the error detection by transforming unsafe programs, which are not guaranteed to detect execution errors, into safe programs that detect execution errors with high probability. To this end, we use arithmetic AN-, ANB-, ANBD-, and ANBDmem-codes. These codes detect errors that modify data during storage or transport as well as errors that disturb computations. Furthermore, the error detection provided is independent of the hardware used.
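The basic idea behind an AN-code can be illustrated with a minimal sketch. It assumes the standard definition of an AN-code (a value n is stored as A * n, and only multiples of A are valid code words). The constant A = 61 and all function names are hypothetical choices made for this illustration; they are not taken from the thesis.

```python
# Minimal AN-code sketch. A = 61 is an arbitrary odd constant assumed
# for illustration; it is not a parameter recommended by the thesis.
A = 61


def encode(n: int) -> int:
    """Encode a plain value as a multiple of A."""
    return A * n


def is_valid(code_word: int) -> bool:
    """A code word is valid only if it is a multiple of A."""
    return code_word % A == 0


def decode(code_word: int) -> int:
    """Check the code word and return the plain value."""
    if not is_valid(code_word):
        raise ValueError("invalid code word: execution error suspected")
    return code_word // A


def flip_bit(value: int, bit: int) -> int:
    """Simulate a hardware error that flips a single bit."""
    return value ^ (1 << bit)


# Addition can be performed directly on encoded values,
# because A*x + A*y = A*(x + y) is again a multiple of A.
total = encode(20) + encode(22)
plain = decode(total)

# A single bit flip changes the value by +/- 2**k, which an odd A
# never divides, so the corrupted value is not a valid code word.
corrupted = flip_bit(total, 3)
detected = not is_valid(corrupted)
```

In this sketch, an error in the stored operand and an error in the addition itself are both caught by the same divisibility check, which is the property the abstract refers to. Errors that happen to change a code word by a multiple of A would remain undetected.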
We present the following novel encoding approaches:

• Software Encoded Processing (SEP), which transforms an unsafe binary into a safe execution at runtime by applying an ANB-code, and
• Compiler Encoded Processing (CEP), which applies encoding at compile time and provides different levels of safety by using different arithmetic codes.

In contrast to existing encoding solutions, SEP and CEP allow encoding applications whose data and control flow are not completely predictable at compile time. For encoding, SEP and CEP use our set of encoded operations, also presented in this thesis. To the best of our knowledge, we are the first to present the encoding of a complete RISC instruction set including boolean and bitwise logical operations, casts, unaligned loads and stores, shifts, and arithmetic operations.

Our evaluations show that encoding with SEP and CEP significantly reduces the amount of erroneous output caused by hardware errors. Furthermore, our evaluations show that, in contrast to replication-based approaches for detecting errors, arithmetic encoding facilitates the detection of permanent hardware errors. This increased reliability does not come for free. However, unexpectedly, the runtime costs for the different arithmetic codes supported by CEP compared to redundancy increase only linearly, while the gained safety increases exponentially.

For Arthur

The wind blows a leaf from the tree,
one leaf out of many.
That one leaf, one hardly notices it,
for one is as good as none.
Yet this one leaf alone
was part of our life.
And so this one leaf alone
will be missed forever and ever.

Hermann Hesse

Acknowledgments

Over the last years, many people helped me to complete this thesis. Now it is time to thank them for their support. My advisor Christof Fetzer always believed in encoding, even when I did not. He was always open for discussions and an endless source of ideas. His constant request: "You could publish at conference XYZ."
ensured steady progress of my work. Thank you.

My colleagues at the chair for Systems Engineering at TU Dresden provided a friendly and enjoyable working environment. They were always open to discussing ideas and problems, and gave lots of feedback on paper drafts and presentations. I especially thank Martin Süßkraut, whose ideas and suggestions considerably helped to improve the Encoding Compiler and this thesis, which he proofread from the first to the very last page; André Schmitt, who transformed my ideas for the ANB-encoding Compiler into a compiler pass during his diploma thesis; Thomas Knauth, who implemented the list- and tree-based version management; Gert Pfeifer, who always had time for me, either for just listening or for explaining some interesting P2P or DNS technique; Andrey Brito, on whom I could count to drop by on those long evenings before a paper deadline; Martin Nowack, who proofread the short version of this thesis; and Claudia Einer and Karina Wauer, who helped me to survive in an otherwise women-free work environment.

Special thanks go to my husband Stephan. He endlessly discussed problems and possible solutions with me, proofread this thesis, and supported me wherever he could.

And last but not least: thank you, my dear parents, for always supporting and encouraging me, for still being there for Stephan and me today, and for accompanying all our plans with advice and help.

Contents

1. Introduction
2. Reliability of Hardware
2.1. Terminology
2.2. Causes and Effects of Hardware Errors
2.2.1. Causes for Increasing Unreliability of Hardware
2.2.2. (Un)Reliability of Hardware
2.3. Impact of Hardware Errors
2.4. Conclusions from the State of Hardware Reliability
2.5. Software-level Symptoms of Hardware Errors
3. Arithmetic Codes
3.1. Berger Code
3.2. Residue Codes
3.3. AN-Codes
3.3.1. Error Correcting AN-Codes
3.3.2. Systematic AN-Codes
3.3.3. |gAN|_M Code
3.3.4. Conclusions for AN-Codes
3.4. ANB-Codes
3.5. ANBD-Codes
3.6. Comparison of the Codes
4. Encoding an Instruction Set
4.1. Implementation of Encoding and Decoding
4.1.1. Provided Functions
4.1.2. Encoding
4.1.3. Conversion: Signed Encoded Unsigned Encoded
4.1.4. Decoding
4.2. Encoded Operations
4.2.1. Encoded Base Operations
4.2.2. Encodable Replacement Operations
4.2.3. Floating Point Operations
4.3. Encoded Constants
4.4. Calls to External Libraries
4.5. Encoded Data and Control Flow
4.6. Encoding Dynamic Memory Access
4.7. Version Management
4.7.1. The List
4.7.2. The Tree
4.7.3. Performance Evaluation
4.8. Outlook: Application of Encoded Basic Building Blocks
5. Choice of Encoding Parameters
5.1. Choice of A
5.1.1. How A Influences the Probability of Detecting Errors
5.1.2. Practical Evaluation: How Many Errors Are Undetectable?
5.2. Choice of the Signatures
5.3. Version
5.4. Conclusion
6. The Vital Coded Processor (VCP)
6.1. System Overview
6.2. Workflow
6.3. Program Encoding
6.4. Discussion of VCP
7. Software Encoded Processing (SEP)
7.1. System Overview
7.2. Workflow
7.3. Program Encoding
7.3.1. Critical Combinations of Error Symptoms
7.3.2. Encoding of the Process Image and the Instruction Pointer
7.3.3. Encoded Program Execution
7.3.4. Encoding of Control Flow Instructions
7.3.5. Input and Output
7.3.6. Code Checking
7.4. Evaluation
7.4.1. Error Detection Capabilities
7.4.2. Runtime Overhead
7.5. Summary of SEP
8. Compiler Encoded Processing (CEP)
8.1. System Overview
8.2. Workflow
8.3. Program Encoding
8.3.1. LLVM Bitcode
8.3.2. Preparations for Encoding
8.3.3. Encoding
8.4. Checking the Correctness of the Execution
8.5. Evaluation
8.5.1. Benchmarks Used
8.5.2. Other Error Detection Approaches Evaluated
8.5.3. Error Detection Capabilities
8.5.4. Runtime Overhead
8.5.5. Costs vs Gains
8.6. Summary of CEP
9. Symptom-based Error Injection Tools
9.1. Related Work
9.1.1. Error Injectors
9.1.2. Error Injectors Used in Recent Research Papers
9.1.3. Slicing
9.1.4. Design Decisions Derived
9.2. FITgrind
