Compiler-Directed Soft Error Mitigation for Embedded Systems
IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 9, NO. 2, MARCH/APRIL 2012

Antonio Martínez-Alvarez, Sergio A. Cuenca-Asensi, Member, IEEE, Felipe Restrepo-Calle, Francisco R. Palomo Pinto, Hipólito Guzmán-Miranda, Student Member, IEEE, and Miguel A. Aguirre, Senior Member, IEEE

Abstract: Protecting processor-based systems against the harmful effects of transient faults (soft errors) is gaining importance as technology shrinks. At the same time, for large segments of the embedded market, parameters like cost and performance remain as important as reliability. This paper presents a compiler-based methodology for facilitating the design of fault-tolerant embedded systems. The methodology is supported by an infrastructure that makes it easy to combine hardware and software soft error mitigation techniques in order to best satisfy both the usual design constraints and dependability requirements. It is based on a generic microprocessor architecture that facilitates the implementation of software-based techniques, providing a uniform hardening core, isolated from the target, that allows the automatic generation of protected source code (hardened code). Two case studies are presented. In the first one, several software-based mitigation techniques are implemented and evaluated, showing the flexibility of the infrastructure. In the second one, a customized fault-tolerant embedded system is designed by combining selective protection in both hardware and software. Several trade-offs among performance, code size, reliability, and hardware costs have been explored. Results show the applicability of the approach. Among the developed software-based mitigation techniques, a novel selective version of the well-known SWIFT-R is presented.

Index Terms: Fault tolerance, reliability, soft error, single event upset (SEU), embedded systems design, hardware/software codesign, design space exploration.
A. Martínez-Alvarez, S.A. Cuenca-Asensi, and F. Restrepo-Calle are with the Computer Technology Department, University of Alicante, Carretera San Vicente del Raspeig s/n, 03690 Alicante, Spain. E-mail: {amartinez, sergio, frestrepo}@dtic.ua.es.

F.R. Palomo Pinto, H. Guzmán-Miranda, and M.A. Aguirre are with the Department of Electrical Engineering, University of Sevilla, Camino de los Descubrimientos s/n, 41092 Sevilla, Spain. E-mail: {rogelio, hipolito, aguirre}@zipi.us.es.

Manuscript received 7 Oct. 2010; revised 14 Apr. 2011; accepted 4 Oct. 2011; published online 26 Oct. 2011. IEEECS Log Number TDSC-2010-10-0175. Digital Object Identifier no. 10.1109/TDSC.2011.54. 1545-5971/12/$31.00 © 2012 IEEE. Published by the IEEE Computer Society.

1 INTRODUCTION

The ever-increasing miniaturization of electronic components has led to important advances in microprocessors, such as a dramatic increase in performance. However, miniaturization also has adverse consequences: supply voltage levels and noise margins are reduced as well, which makes electronic devices less reliable and microprocessors more susceptible to transient faults induced by radiation [1], [2]. These intermittent faults do not cause permanent damage, but they may lead to incorrect program execution by altering signal transfers or stored values; they are also called soft errors [3]. Although such faults are more frequent in the space environment, they are also present, to a lesser extent, in the atmosphere [4] and even at ground level [5], [6]. The need to mitigate soft errors is also reflected in several reports published by technical committees around the world, which define detailed qualification requirements that electronic components must meet. Examples include ESA PSS-01-609 (The Radiation Design Handbook) [7] for aerospace applications; IEC/TS 62396 (Process Management for Avionics - Atmospheric Radiation Effects) [8] for avionic systems; and MIL-HDBK-817 (System Development Radiation Hardness Assurance) [9] for military systems.

The usual solution to this problem has been redundant hardware. It ranges from low-level structures that use techniques such as Error-Correcting Codes (ECC), parity bits, Triple Modular Redundancy (TMR) [10], or Single Error Correction (SEC) Hamming codes [11]; through more complex components such as functional units [12] and coprocessors [13]; to exploiting the multiplicity of hardware blocks available on multithreaded/multicore architectures [14], [15]. More recent techniques propose selective hardening of the system, adding protection only to the most vulnerable parts [16]; reducing the performance degradation by applying partial redundant threading [17], [18]; or, as in [19], using simple hardware structures to protect stored values by replicating parts of the operands (those that store narrow values) in their unused parts. In general, most hardware-based approaches are very effective at mitigating soft errors. However, they incur penalties in performance, power, die size, design time, and economic cost, which makes them infeasible in many cases.

In addition, several proposals based on redundant software have been developed in recent years, providing both fault detection and fault correction capabilities to applications. These works are motivated mainly by the need for low-cost solutions that still ensure an acceptable level of reliability [20], [21]. Most software-based approaches aim to detect faults: some apply redundancy to high-level source code by means of automatic transformation rules, whereas others use instruction redundancy at low level (assembly code) in order to reduce the code overhead and performance degradation and to improve the detection rates [22], [23], [24]. However, only a few of these techniques have been extended to allow the recovery of the system [25], [26]. Software-based approaches are less expensive than hardware-based ones, although they cannot achieve the same performance and reliability, since they have to execute additional instructions. It is worth noting that some of these techniques have already been used in systems for satellites and real space missions [27].

Despite the large number of fault tolerance methods based on hardware only or software only, in many cases the optimal solution is an intermediate point between these two extremes. Thus, there is a need to design the system by combining software and hardware mitigation strategies (hardware/software codesign). The design of embedded systems is a clear example: there are large application domains where factors like cost and performance are as important as reliability. Therefore, there is an increasing need for suitable development tools for fault-tolerant codesign. Some recent works have shown promising results in this sense [28], [29]. Nevertheless, they are very specific and lack the flexibility to reach the best trade-offs between design constraints and dependability requirements.

In this context, this paper presents a flexible compiler-based methodology for codesigning fault-tolerant embedded systems. It guides designers through the fault-tolerant design space, facilitating the design, application, and evaluation of different hardware/software protection strategies. The infrastructure that supports this methodology is made up of two suites of tools: the Software Hardening Environment (SHE) [30] and the fault emulation tool called FTUnshades [31]. The software environment

code overhead, performance, reliability and hardware costs have been represented.

The main contributions of this work are:

• The use of the principles of the codesign methodology [34] applied to the development of hybrid soft error mitigation strategies for embedded systems.
• A flexible compiler-based hardening infrastructure that supports the proposed methodology and assists the designer during the design space exploration.
• A novel selective version of the software-based technique known as SWIFT-R [26].

To the best of our knowledge, this is the first hardware/software codesign methodology aimed at designing fine-tuned hybrid hardening strategies that combine both hardware and software partial protection approaches. The fine-grained design space exploration makes it possible not only to customize systems that best meet design constraints and dependability requirements, but also to avoid the excessive use of costly protection mechanisms (hardware and software).

2 FAULT MODEL AND TERMINOLOGY

In this paper, we focus on the well-known Single Event Upset (SEU) fault. This radiation effect is caused by the ionization produced when a charged particle strikes an electronic component. We use the bit-flip fault model to represent this fault. In this model, only one bit-flip of a storage cell occurs throughout the circuit execution. Despite its simplicity, this fault model is widely used in the fault tolerance community to model real faults because it closely matches the real fault behavior [35].

To evaluate the reliability of the system, injected faults are classified according to their effect on the expected program behavior, as first proposed by Mukherjee et al. [36]. If the fault provokes that the program completes