Flexible Error Handling for Embedded Real-Time Systems – Operating System and Run-Time Aspects –
Total Page:16
File Type:pdf, Size:1020Kb
Flexible Error Handling for Embedded Real-Time Systems – Operating System and Run-Time Aspects – Dissertation zur Erlangung des Grades eines Doktors der Ingenieurwissenschaften der Technischen Universität Dortmund an der Fakultät für Informatik von Andreas Heinig Dortmund 2015 Tag der mündlichen Prüfung: 30. April 2015 Dekan / Dekanin: Prof. Dr. Gernot A. Fink Gutachter / Gutachterinnen: Prof. Dr. Peter Marwedel Prof. Dr. Hermann Härtig (TU Dresden) Acknowledgments I would like to thank my advisor Prof. Peter Marwedel for the initial ideas that lead to this dissertation. Without his advises, writing this thesis in this form would not be feasible. Very special thanks go to Dr. Michael Engel for all the discussions, the advices he gave over the years, and his intensive proof-reading support of this thesis and my publications. I would also like to thank my collaborating colleagues at the Department of Computer Science 12 with special thanks to Florian Schmoll. He provided the compiler support which is required by the flexible error handling approach presented in this thesis. Furthermore, he was available for high valuable technical discussions. Thanks go also to Björn Bönninghoff for providing the timing model for the H.264 decoder and to Ingo Korb for providing various bug fixes for the H.264 library. For proof-reading, I would like to thank Helana Kotthaus, Olaf Neugebauer, Florian Schmoll, and Björn Bönninghoff. My research visit at NTU Singapore was enabled by Prof. Dr. Vincent Mooney and Prof. Dr. Krishna Palem. I like to thank both for the advices that leaded to a publication which received a best paper award. Additional thanks go to the German Research Foundation (DFG) for supporting the FEHLER-Project (Grant No. MA-943/10) as part of the Priority Programme SPP1500 which leads to this thesis. The initial parts of this dissertation were supported by the European Community as part of the Artist Design Network of Excellence EU FP7 ICT Grant No. 39316. Finally, I would like to thank my mother. Without her continual support and encouragement, I would never made it this far. Abstract Due to advancements of semiconductor fabrication that lead to shrinking geome- tries and lowered supply voltages of semiconductor devices, transient fault rates will increase significantly for future semiconductor generations [Int13]. To cope with transient faults, error detection and correction is mandatory. However, addi- tional resources are required for their implementation. This is a serious problem in embedded systems development since embedded systems possess only a limited number of resources, like processing time, memory, and energy. To cope with this problem, a software-based flexible error handling approach is proposed in this dis- sertation. The goal of flexible error handling is to decide if, how,andwhen errors have to be corrected. By applying this approach, deadline misses will be reduced by up to 97% for the considered video decoding benchmark. Furthermore, it will be shown that the approach is able to cope with very high error rates of nearly 50 errors per second. Publications Parts of this thesis have been published in proceedings of the following conferences and workshops (in chronological order): 1. A. Heinig, M. Engel, F. Schmoll, and P. Marwedel, “Improving transient memory fault resilience of an H.264 decoder,” in Proc. of Embedded Systems for Real-Time Multimedia (ESTIMedia), pp. 121–130, 2010. 2. A. Heinig, M. Engel, F. Schmoll, and P. Marwedel, “Using application knowl- edge to improve embedded systems dependability,” in Proc. of HotDep, Oct. 2010. 3. A. Heinig, “R2G: Supporting POSIX like semantics in a distributed RTEMS system,” Technical Reports in Computer Science, no. 836, TU Dortmund, Fac- ulty of Computer Science, Dec. 2010. 4. M. Engel, F. Schmoll, A. Heinig, and P. Marwedel, “Unreliable yet useful – reliability annotations for data in cyber-physical systems,” in Proc. of Work- shop on Software Language Engineering for Cyber-physical Systems (WS4C), 2011. 5. D. Cordes, A. Heinig, P. Marwedel, and A. Mallik, “Automatic Extraction of Pipeline Parallelism for Embedded Software Using Linear Programming,” in Proc. of Parallel and Distributed Systems (ICPADS), 2011, pp. 699–706. 6. M. Engel, A. Heinig, F. Schmoll, and P. Marwedel, “Temporal properties of error handling for multimedia applications,” in Proc. of Conference on Electronic Media Technology (CEMT), 2011, pp. 1–6. 7. A. Heinig, V. J. Mooney, F. Schmoll, P. Marwedel, K. V. Palem, and M. En- gel, “Classification-Based improvement of application robustness and quality of service in probabilistic computer systems,” in Proc. of international con- ference on Architecture of Computing Systems (ARCS), 2012. – Best Paper Award 8. A. Heinig, I. Korb, F. Schmoll, P. Marwedel, and M. Engel, “Fast and Low- Cost Instruction-Aware Fault Injection,” in Proc. of Software-Based Methods for Robust Embedded Systems (SOBRES), 2013. 9. F. Schmoll, A. Heinig, P. Marwedel, and M. Engel, “Improving the fault resilience of an H.264 decoder using static analysis methods,” Transactions on Embedded Computing Systems (TECS), vol. 13, no. 1, pp. 1–27, Nov. 2013. viii 10. F. Schmoll, A. Heinig, P. Marwedel, and M. Engel, “Passing error handling information from a compiler to runtime components,” Technical Reports in Computer Science, no. 844, TU Dortmund, Faculty of Computer Science, 2014. 11. A. Heinig, F. Schmoll, P. Marwedel, and M. Engel, “Who’s using that mem- ory? A subscriber model for mapping errors to tasks,” in Proc of Workshop on Silicon Errors in Logic - System Effects (SELSE), Stanford, 2014. 12. A. Heinig, F. Schmoll, B. Bönninghoff, P. Marwedel, and M.Engel, “FAME: Flexible Real-Time Aware Error Correction by Combining Application Knowl- edge and Run-Time Information,” in Proc of Workshop on Silicon Errors in Logic - System Effects (SELSE), Austin, 2015. Contents 1 Introduction 1 1.1 Motivation ................................ 3 1.2 Faults, Errors, and Failures . ...................... 3 1.2.1 Masking .............................. 4 1.2.2 Duration . ............................ 5 1.3 Embedded Real-Time Systems . .................... 5 1.3.1 Real-Time . ............................ 5 1.3.2 Efficiency . ............................ 7 1.3.3 Dependability . .......................... 7 1.4 Contribution of this Work ........................ 8 1.5 Outline .................................. 10 1.6 Author’s Contribution to this Dissertation ............... 10 2 Related Work 13 2.1FaultTolerance.............................. 13 2.2 Reliability in Real-Time Operating Systems . ............. 16 2.3CheckpointandRecovery........................ 19 2.4 Fault Injection .............................. 21 2.5 Scheduling of Real-Time Tasks . .................... 25 2.6 Timing Estimations . .......................... 27 3 Models and Infrastructure 31 3.1 Reliable Computing Base ........................ 31 3.2RAPModel................................ 32 3.3 Transient Memory Fault Model . .................... 34 3.3.1 Thread-Based Fault Injection . ................. 35 3.3.2 Fault Injection in CoMET .................... 35 3.3.3 Injection into Demonstrator Platform ............. 37 3.4 Probabilistic Error Model ........................ 42 3.4.1 Supply Voltage Schemes . .................... 43 3.4.2 Three-Stage Model ........................ 43 3.4.3 Probabilistic ARM Platform . ................. 47 3.5TaskModel................................ 48 3.6 Quality of Service Metrics ........................ 48 3.6.1 ΔE ................................ 49 3.6.2 Peak Signal-To-Noise Ratio . .................. 49 x Contents 4 Flexible Error Handling 51 4.1 Flexible Error Handling Strategy .................... 51 4.2 Error Classification ............................ 53 4.2.1 H.264 Application Analysis . .................. 54 4.2.2 LEGO Mindstorms Application ................. 56 4.2.3 Further Examples ........................ 59 4.2.4 Classification Annotation .................... 59 4.2.5 Implications ............................ 61 4.3 Real-Time Aspect ............................ 62 4.3.1 A kind of Priority Inversion Problem . ............. 63 4.3.2 Subscriber Model . ........................ 63 4.3.3 Using the Subscriber Model to Schedule Error Correction . 67 4.4 Summary . ................................ 69 5 FAME: Fault Aware Microvisor Ecosystem 71 5.1 REPAIR Compiler ............................ 72 5.1.1 Classification Parsing ...................... 73 5.1.2 Subscriber Model Support .................... 75 5.1.3 Code Generation . ........................ 76 5.1.4 Summary . ............................ 77 5.2 Runtime Overview ............................ 77 5.3Microvisor................................. 78 5.3.1 RCB Components ........................ 79 5.3.2 Virtualization Concepts . .................... 79 5.3.3 Memory Management ...................... 85 5.3.4 Subscriber Model Support and Data Object Management . 88 5.3.5 Checkpoint-and-Recovery .................... 89 5.3.6 Summary . ............................ 92 5.4FAMERE................................. 93 5.4.1 Para-Virtualization Support . .................. 93 5.4.2 Checkpoint-and-Recovery .................... 94 5.4.3 Scheduling and Subscriber Model Support ........... 94 5.4.4 Memory Allocation ........................ 95 5.4.5 Error Handling .......................... 95 5.4.6 Summary . ............................ 95 5.5 Para-Virtualizing RTEMS for FAME . ................. 96 5.5.1 RTEMS Overview ........................ 96 5.5.2 Board Support