IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 9, NO. 2, MARCH/APRIL 2012

Compiler-Directed Soft Error Mitigation for Embedded Systems

Antonio Martínez-Álvarez, Sergio A. Cuenca-Asensi, Member, IEEE, Felipe Restrepo-Calle, Francisco R. Palomo Pinto, Hipólito Guzmán-Miranda, Student Member, IEEE, and Miguel A. Aguirre, Senior Member, IEEE

Abstract—The protection of processor-based systems to mitigate the harmful effects of transient faults (soft errors) is gaining importance as technology shrinks. At the same time, for large segments of embedded markets, parameters like cost and performance continue to be as important as reliability. This paper presents a compiler-based methodology for facilitating the design of fault-tolerant embedded systems. The methodology is supported by an infrastructure that makes it possible to easily combine hardware/software soft error mitigation techniques in order to best satisfy both usual design constraints and dependability requirements. It is based on a generic architecture that facilitates the implementation of software-based techniques, providing a uniform, isolated-from-target hardening core that allows the automatic generation of protected source code (hardened code). Two case studies are presented. In the first one, several software-based mitigation techniques are implemented and evaluated, showing the flexibility of the infrastructure. In the second one, a customized fault-tolerant embedded system is designed by combining selective protection on both hardware and software. Several trade-offs among performance, code size, reliability, and hardware costs have been explored. Results show the applicability of the approach. Among the developed software-based mitigation techniques, a novel selective version of the well-known SWIFT-R is presented.

Index Terms—Fault tolerance, reliability, soft error, single event upset—SEU, embedded systems design, hardware/software codesign, design space exploration.

1 INTRODUCTION

THE ever increasing miniaturization of electronic components has led to important advances, such as the dramatic increase of their performance. Nevertheless, this fact has also provoked adverse consequences, because voltage source levels and noise margins are also reduced, making electronic devices less reliable and microprocessors more susceptible to transient faults induced by radiation [1], [2]. These intermittent faults do not provoke permanent damage, but may result in the incorrect execution of the program by altering signal transfers or stored values; these faults are also called soft errors [3].

Although these faults are more frequent in the space environment, they are also present to a lesser extent in the atmosphere [4] and even at ground level [5], [6]. The need to mitigate soft errors has also been reflected in several reports published by technical committees around the world, which define detailed qualification requirements that electronic components must meet for their use. Some examples are: for aerospace applications, ESA PSS-01-609 (The Radiation Design Handbook) [7]; for avionic systems, IEC/TS 62396 (Process Management for Avionics - Atmospheric Radiation Effects) [8]; for military systems, MIL-HDBK-817 (System Development Radiation Hardness Assurance) [9]; among others.

In order to overcome this problem, applying redundant hardware has been the usual solution: from low-level structures, using techniques like Error-Correcting Codes (ECC), parity bits, Triple Modular Redundancy (TMR) [10], or Single Error Correction (SEC) Hamming codes [11]; up to more complex components like functional units [12], [13]; or by means of exploiting the multiplicity of hardware blocks available on multithreaded/multicore architectures [14], [15]. More recent techniques propose selective hardening of the system, adding protection only to the most vulnerable parts [16]; reducing the performance degradation by applying partial redundant threading [17], [18]; or, as in [19], using simple hardware structures to protect stored values, replicating parts of the operands (those that store narrow values) in their unused parts.

In general, most hardware-based approaches provide a very effective solution for mitigating soft errors. However, these techniques produce a penalty in performance, power, die size, design time, and economic cost, making them unfeasible in many cases.

Besides, in recent years, several proposals based on redundant software have been developed, providing both detection and fault correction capabilities to applications. These works are especially motivated by the need for low-cost solutions ensuring an acceptable level of reliability [20], [21]. Most software-based approaches are aimed at detecting faults; some of them apply redundancy to high-level

---
A. Martínez-Álvarez, S.A. Cuenca-Asensi, and F. Restrepo-Calle are with the Computer Technology Department, University of Alicante, Carretera San Vicente del Raspeig s/n, 03690 Alicante, Spain. E-mail: {amartinez, sergio, frestrepo}@dtic.ua.es.
F.R. Palomo Pinto, H. Guzmán-Miranda, and M.A. Aguirre are with the Department of Electrical Engineering, University of Sevilla, Camino de los Descubrimientos s/n, 41092 Sevilla, Spain. E-mail: {rogelio, hipolito, aguirre}@zipi.us.es.
Manuscript received 7 Oct. 2010; revised 14 Apr. 2011; accepted 4 Oct. 2011; published online 26 Oct. 2011.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TDSC-2010-10-0175.
Digital Object Identifier no. 10.1109/TDSC.2011.54.
source code by means of automatic transformation rules, whereas some others use instruction redundancy at low level (assembly code) in order to reduce the code overhead and performance degradation, and to improve the detection rates [22], [23], [24]. However, only a few of these techniques have been extended to allow the recovery of the system [25], [26]. The software-based approaches are less expensive than the hardware-based ones, although they cannot achieve the same performance and reliability, since they have to execute additional instructions. It is worth noting that some of these techniques have already been used in systems for satellites and real space missions [27].

Despite the wide number of different fault tolerance methods based on hardware-only and software-only approaches, in many cases the optimal solution is an intermediate point between these two extremes. Thus, there is a need for designing the system by combining software and hardware mitigation strategies (hardware/software codesign). The design of embedded systems is a clear example of this fact. In this case, there are large domains of applications where factors like cost and performance are as important as reliability. Therefore, there is an increasing need for suitable development tools for fault-tolerant codesign. Some recent works have shown promising results in this sense [28], [29]. Nevertheless, they are very specific and lack the flexibility to obtain the best tradeoffs between design constraints and dependability requirements.

In this context, this paper presents a flexible compiler-based methodology aimed at the codesign of fault-tolerant embedded systems. It guides designers through the fault-tolerant design space, facilitating the design, application, and evaluation of different hardware/software protection strategies. The infrastructure which supports this methodology is made up of two suites of tools: the Software Hardening Environment (SHE) [30] and the fault emulation tool called FTUnshades [31]. The software environment is aimed at implementing, automatically applying, and evaluating software-based fault tolerance techniques. It comprises a flexible hardening multitarget compiler and an instruction set simulator to assist the design decisions. The tool-chain is complemented by FTUnshades, which is an FPGA-based fault emulation tool that permits assessing several dependability parameters of the real system implementation. In this work, mitigation techniques are applied in a technology-independent way; thus, the final deployment platform may be an ASIC, or an antifuse or flash FPGA. In the case of SRAM FPGAs, additional protection mechanisms have to be considered for the configuration memory, such as configuration scrubbing, partial reconfiguration, or re-routing [32], [33].

Two case studies for validating our approach have been explored. First, and with the aim to illustrate the flexibility of the infrastructure, three software-based protection techniques have been implemented and evaluated. Second, and in order to show the applicability of the proposed methodology, an embedded application, based on a widely used soft-microprocessor, has been codesigned by combining hardware/software fault-tolerant strategies. In addition, these strategies have been applied to both parts, hardware and software, in a selective way. Different tradeoffs among code overhead, performance, reliability, and hardware costs have been explored.

The main contributions of this work are:

- The use of the principles of the codesign methodology [34] applied to the development of hybrid soft error mitigation strategies for embedded systems.
- A flexible compiler-based hardening infrastructure that supports the proposed methodology and assists the designer during the design space exploration.
- A novel selective version of the software-based technique known as SWIFT-R [26].

To the best of our knowledge, this is the first hardware/software codesign methodology aimed at designing fine-tuned hybrid hardening strategies that combine both hardware and software partial protection approaches. The fine-grained design space exploration allows not only customizing systems that best meet design constraints and dependability requirements, but also avoiding the excessive use of costly protection mechanisms (hardware and software).

2 FAULT MODEL AND TERMINOLOGY

In this paper, we will focus on the well-known Single Event Upset—SEU fault. This is a radiation effect that is caused by the ionization provoked by an incident charged particle on an electronic component. We will use the bit-flip fault model to represent this fault. In this model, only one bit-flip of a storage cell occurs throughout the circuit execution. Despite its simplicity, this fault model is widely used in the fault tolerance community to model real faults because it closely matches the real fault behavior [35].

To evaluate the reliability of the system, injected faults are classified according to their effect on the expected program behavior, as first proposed by Mukherjee et al. [36]. If the fault provokes that the program completes its execution but does not produce the expected output, the fault is called Silent Data Corruption—SDC. If the program completes its execution and produces the expected output, the fault is categorized as unnecessary for Architecturally Correct Execution—unACE. Finally, if the fault causes the program to finish its execution abnormally or to remain forever in an infinite loop, the fault is categorized as a Hang. SDC and Hang are both undesirable effects (categorized together as ACE faults).

It is worth noting that soft error mitigation techniques necessarily introduce redundancy, which causes two important facts to consider. First, these techniques increase the execution time of the programs; therefore, the probability of fault occurrence is higher than for the nonhardened program, which shows a shorter execution time. Second, redundancy increases the number of bits of the system, increasing the number of bits prone to soft errors. Therefore, the reliability offered by a specific hardening strategy is directly related to the percentage of unACE faults and the execution time overhead.

In the experiments, we used the Mean Work To Failure metric—MWTF to estimate reliability improvements. MWTF is a generalized metric proposed by Reis et al. [29] in order to capture the tradeoff between reliability and performance. The MWTF can be calculated as follows:

MWTF = (raw error rate × AVF × execution time)^(-1),   (1)

where the raw error rate is determined by the circuit technology; the execution time is the time to execute a given unit of work (in our case, the unit of work is the execution of a program); and the AVF (Architectural Vulnerability Factor) is a commonly used reliability metric. This is the probability that a fault in a particular structure will result in an error [36]. It can be calculated as follows:

AVF = (number of ACE bits in the structure) / (total number of bits in the structure).   (2)

3 METHODOLOGY AND INFRASTRUCTURE

We propose to apply the codesign methodology to the development of hybrid soft error mitigation strategies, that is, the process of exploring the design space of hardware and software techniques to achieve a customized fault-tolerant version of the system that best meets design and dependability requirements. Taking into account that there is an inherent tradeoff involving both software and hardware resources, the proposed methodology is concerned with both of them. Its application is carried out by means of a compiler-and-simulator-directed hardener over the software side of the full system, providing the designer with useful information to best delimit the software/hardware partition. In this way, the optimal hardened software solution can be automatically calculated, while the hardened hardware solution is manually implemented following the precalculated codesign results from the software side.

Following the codesign methodology, the first step is the derivation of a set of system requirements from the embedded application point of view. Such requirements can directly be a system constraint or may target any kind of dependability parameter associated with a specific application. System constraints are generally related to silicon area, performance, power consumption, and costs, whereas dependability parameters are concerned with fault coverage, reliability, availability, safety, security, recovery time, etc. These requirements feed into the generation of a test bench to guide the codesign of the system, where constraints and dependability parameters motivate design decisions.

The complete design flow is driven by the application requirements. First, the adoption of software-based techniques determines a set of suitable implementations of the software side of the system, aimed at avoiding hardware costs. Then, if necessary, the mitigation strategy is complemented by hardware-based protection techniques. The solution is not necessarily a single point in the design space but may result in a range of tradeoffs.

The proposed codesign flow can be summarized as follows:

1. The specific requirements of the application (design constraints and dependability parameters) are fully defined.
2. Several software-based fault tolerance techniques are applied to obtain n candidate implementations of the software.
3. Each software candidate is evaluated to estimate the program overheads compared to the original program, in terms of code and execution time. All solutions that meet the specified maximum overheads are selected to run on the nonhardened real microprocessor implementation, i.e., n′ software versions are selected, where n′ ≤ n.
4. The n′ programs running on the real system are evaluated in terms of the dependability requirements.
5. If the results still do not meet all the application requirements, the protection strategy is complemented by applying hardware-based methods. Several possible implementations of the hardware are obtained (m hardware candidates).
6. According to the design constraints (cost, energy consumption, ...), some of the m hardware implementations could be discarded at this point, i.e., m′ hardware candidates are selected, where m′ ≤ m.
7. The reliability of the n′ × m′ possible hardware/software configurations of the embedded system is evaluated.
8. Several tradeoffs are explored between design constraints and dependability requirements.

The result is a set of hardware/software implementations that achieve an optimized fault-tolerant version of the system.

This methodology is supported by a design infrastructure for soft error mitigation in embedded systems. It comprises two suites of tools: the Software Hardening Environment and the fault-injection-based reliability evaluator FTUnshades.

3.1 Software Hardening Environment—SHE

The main features of SHE can be summarized as:

- Flexible: it is easy to extend its hardening capabilities.
- API-based internals: for easy interfacing.
- Hardware-agnostic: it provides a uniform, isolated hardening core for each supported microprocessor/microcontroller.
- Re-targetable output: to promote the reuse of code.
- Flow-control-based analyses and code-injection routines.
- Automatic reallocation/minimization of architecture resources.

The need for a deep knowledge of the way the code is transformed has led us to develop an entire instruction-level hardening tool from scratch, i.e., a hardening strategy based on assembly code. The proposed environment establishes a complete tool for fault-tolerant software development, allowing the design and implementation of software-based mitigation techniques, which can be automatically applied to programs. The environment is made up of a multitarget compiler (Hardener), an Instruction Set Simulator (ISS), and several compiler front-ends and back-ends (Fig. 1).

A given compiler front-end takes the original source code from a supported target, performs lexical, syntactic, and semantic analyses, and finally generates a Generic Instruction Flow (GIF) as output.

This flow represents an intermediate high-level abstraction of a program. Then, the hardening tasks are performed within the Generic Hardening Core. Finally, the Hardener produces a Hardened-GIF (HGIF), which is then re-targeted to a supported microprocessor.

Fig. 1. Software hardening environment—SHE.

The hardening environment is also flexible in the sense that it is possible to process code written for a supported architecture and generate protected code targeting the same original architecture or a different one (using the various back-ends). The main advantages of our proposal are:

- It is based on a Microprocessor Generic Architecture that permits handling multiple microprocessors.
- The Generic Hardening Core allows designing, implementing, applying, and evaluating different software-based fault tolerance techniques in a platform-independent way.
- For hardening purposes, our tools classify the instructions in a special way to delimit the Sphere of Replication [37], which is the logic domain of redundant execution.

3.1.1 Microprocessor Generic Architecture

The generic architecture provides a workspace for the Generic Hardening Core. It gathers together the common elements from different architectures in order to facilitate the design and implementation of state-of-the-art software-based techniques in a technology-independent way.

This architecture is defined by means of Generic Instructions (GIs). A GI is actually an instruction from the ISA (Instruction Set Architecture) of an MGA (Microprocessor Generic Architecture), built from an architecture-specific instruction (original instruction). It is a high-level abstraction of the original instruction that adds more information to it. Thus, a GI can be considered as a wrapping of an original instruction allowing an interaction with the MGA. In this way, this wrapping defines a hardware abstraction layer that permits handling multiple microprocessor targets within our Generic Hardening Core. That is, the MGA makes no distinction between a given set of GIs coming from diverse microprocessors. This feature permits applying the fault-tolerance techniques using a microprocessor-agnostic methodology.

The identification of the Control Flow Graph (CFG) and the insertion of instructions into the source code during compilation time are the keys for software-based techniques [22], [23], [26]. Accordingly, our tool also provides the necessary functionalities to suitably perform Memory Management during compilation-time changes and analyses of the program's CFG.

Regarding Memory Management tasks, it is possible to identify the memory map, extract memory sections, and perform modifications on them. In addition, similarly to other approaches, it is assumed that the code being hardened does not exploit dynamic memory allocation, i.e., every data structure is defined statically at compilation time. This is not a significant limitation for developers of embedded applications, who sometimes are forced to follow coding standards that avoid dynamic memory usage [38]. The following three mechanisms are supported to keep the memory map updated: dilation, displacement, and reallocation.

- Dilation. When one or more instructions are inserted in a memory section during compilation time, this section grows and some of the instruction addresses inside this memory section must be reassigned.
- Displacement. If dilation provokes that two or more memory sections share some addresses, which is an illegal situation, then the complete memory section must be moved and all its instruction addresses updated.
- Reallocation. If there is a memory overflow caused by previous instruction insertions, then it is necessary to perform a reallocation of the complete memory map. During this process, free memory space among memory sections is fully used. A memory overflow could take place because of the typically reduced memory size in embedded systems.

On the other hand, the Microprocessor Generic Architecture allows the identification of the CFG from a given GIF. The CFG is represented by a directed graph, where each node is defined by a program basic block. A basic block is a group of instructions that are executed sequentially, without any jump instruction or function call, excepting possibly the last instruction. Also, a basic block does not contain instructions that are the destination of a call or jump instruction, excepting the first instruction. The flow control changes are represented in the graph as links among nodes. Fig. 2 shows an example of a simple CFG.

Fig. 2. Control flow graph—CFG.

3.1.2 Generic Hardening Core—GH-Core

The GH-Core comprises two main components: the Hardener and the ISS.

The Hardener is based on the Microprocessor Generic Architecture. It comprises an Application Programming Interface (API) of hardening routines typical of software-based techniques. Developing new hardening algorithms is a significantly easier task using this API. Fig. 3 shows a programming example using the proposed API. It presents a simplified version of the Harden method, which selects one from the different available fault-tolerant techniques to apply to the program.
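The basic-block partitioning behind the CFG construction described above follows the classic leader rule. As an illustration only (the instruction format and opcode names below are assumptions, not the GIF's actual representation), a minimal sketch in Python:

```python
# Hedged sketch of basic-block identification over a toy instruction list.
# Each instruction is (opcode, branch_target_or_None); names are illustrative.

BRANCHES = {"JUMP", "CALL", "RETURN"}

def leader_indices(program):
    """A leader starts a basic block: the first instruction, any branch
    target, and any instruction that follows a branch."""
    leaders = {0}
    for i, (op, target) in enumerate(program):
        if op in BRANCHES:
            if target is not None:
                leaders.add(target)
            if i + 1 < len(program):
                leaders.add(i + 1)
    return sorted(leaders)

def basic_blocks(program):
    """Slice the program at each leader to obtain the CFG nodes."""
    ls = leader_indices(program)
    return [program[a:b] for a, b in zip(ls, ls[1:] + [len(program)])]

prog = [("LOAD", None), ("ADD", None), ("JUMP", 0), ("OUTPUT", None)]
print([len(b) for b in basic_blocks(prog)])  # [3, 1]
```

The edges of the CFG would then link each block to the blocks reachable from its (possibly branching) last instruction.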

Fig. 4. Sphere of replication—SoR.
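Among the recovery procedures the Hardener can inject are majority voters. As a minimal, hedged illustration of the idea (this is not SHE's actual code), a bitwise 2-of-3 voter over three copies of a value:

```python
# Sketch of a 2-of-3 majority voter, the typical recovery primitive of
# triplication-based techniques. Purely illustrative, not SHE's API.

def majority(a, b, c):
    """Bitwise majority vote: each result bit takes the value present in at
    least two of the three copies, so one corrupted copy is outvoted."""
    return (a & b) | (a & c) | (b & c)

def recover(copies):
    """Vote, then refresh all three copies with the voted value."""
    v = majority(*copies)
    return [v, v, v]

print(majority(0b1010, 0b1010, 0b0010))  # 10: the flipped copy loses the vote
print(recover([7, 7, 5]))  # [7, 7, 7]
```

In software-based techniques, this voting step is inserted before data leave the protected domain, so that a single bit-flip in one copy cannot propagate.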

Fig. 3. API programming example method: Harden.

Available options in the Hardener allow the designer to decide which protection technique to apply, which replication level is to be used in redundant instructions (i.e., duplication, triplication), and the preferred recovery procedure to use (i.e., among different implementations of majority voters, in the case of recovery techniques, or routines to be applied when a fault is detected). In this way, the Hardener offers complete control to the user for configuring the protection.

Also, it allows designing the hardening strategy by means of the partial application of one or more software-based techniques. That is, the designer can define and control a selective hardening strategy, being able to determine which resources must be hardened. This feature permits quickly exploring the design space provided by the technique, obtaining a set of hardened versions to evaluate and determine their code overhead, performance degradation, and reliability level. For instance, if applying a particular set of hardening routines is inconvenient according to the requirements of an application (e.g., if the maximum execution time is exceeded), the technique can be applied partially, depending on the critical program resources or sections.

The Instruction Set Simulator—ISS has two main functionalities. First, it assists the designer in the implementation of new software-based techniques. Second, the ISS performs different analyses on the original and hardened GIF, and generates useful information to aid the designer in the codesign process.

As is usual for instruction set simulators, the ISS presents information about the state of the resources of the architecture during and after the simulation process. Likewise, the tool allows verifying whether the functionality of the hardened programs matches that of the original nonhardened programs. This is possible by means of the check-hardening option, which uses information stored in the source code through a compiler pragma to know what the expected results are.

After the simulation process, the ISS presents a brief summary reporting the code and execution time overheads of the applied hardening technique. Furthermore, it performs a characterization of the simulated programs, informing about the percentage of executed instructions by type: arithmetic, logical, control flow, etc.; and about the program resource utilization. From these results, the register lifetime is calculated for each register during the program simulation. The register lifetime represents the time when useful data are present in the register. Any fault occurring to the register during that time destroys data integrity [39]. Using this information, an estimation of the register file vulnerability can be easily deduced as well. It is worth mentioning that a hardware-independent metric to estimate the software vulnerability has been proposed in the literature, called the Program Vulnerability Factor (PVF) [40]. However, estimations of the register file vulnerability using lifetimes are very useful to the designer at this point, since they can be used to select the most vulnerable registers to be selectively protected, avoiding the very time-consuming process of gathering the AVF by fault injection when performing preliminary software vulnerability analyses.

Moreover, in order to evaluate the reliability provided by the techniques, the ISS is able to simulate SEU faults by means of bit-flips in the architectural resources. Although the ISS has access to the most important resources, such as the register file, program counter, stack pointer, and ALU flags, it does not have access to the microarchitectural resources like those registers located in the pipeline. Thus, the reliability results obtained by using the ISS should be considered preliminary estimations, which are useful information for tuning the hardening strategy and comparing different techniques. However, to obtain more realistic reliability results, a hardware SEU-emulation tool is included in our infrastructure: FTUnshades.

3.1.3 Sphere of Replication—SoR

As suggested by Reis et al. [24], SHE classifies in a special way those instructions whose function involves crossing the borders of the SoR. When an instruction causes some data to enter the SoR (e.g., reading an input port, loading a value into a register, or reading a value from memory), it is classified as inSoR; and, consequently, when an instruction provokes data to go out from the SoR (e.g., writing on an output port, storing a value into the memory), it is classified as outSoR (Fig. 4).

The boundaries of the SoR and, consequently, the coverage of the protection could change according to the implemented technique. For instance, in EDDI [22] the memory subsystem is inside the SoR, so the instructions responsible for performing read/write operations over the memory do not cause any data to cross the SoR borders. In the same way, if the memory subsystem is considered outside of the SoR, those instructions reading from memory or writing into memory cause some data to cross the sphere frontiers and must be handled in a special way. In cases when selective hardening is applied, it is possible to have partial replication of the register bank (i.e., not all registers are inside the SoR). Thus, instructions operating on data stored in unprotected registers cause some data to cross the SoR frontiers.
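The register-lifetime analysis described in Section 3.1.2 can be approximated from an instruction trace: a register is vulnerable from each write until its last read before the next write. A hedged sketch (the trace format is an assumption, not the ISS's actual output):

```python
# Illustrative register-lifetime estimation from a (cycle, op, reg) trace,
# with op in {"write", "read"}. Faults landing inside a live interval can
# corrupt useful data; longer lifetimes suggest more vulnerable registers.

from collections import defaultdict

def register_lifetimes(trace):
    last_write = {}
    last_read = {}
    live = defaultdict(int)
    for cycle, op, reg in trace:
        if op == "write":
            # Close the previous live interval, if the value was ever read.
            if reg in last_write and reg in last_read:
                live[reg] += last_read[reg] - last_write[reg]
            last_write[reg] = cycle
            last_read.pop(reg, None)
        else:  # read
            last_read[reg] = cycle
    # Close intervals still open at the end of the trace.
    for reg, w in last_write.items():
        if reg in last_read:
            live[reg] += last_read[reg] - w
    return dict(live)

trace = [(0, "write", "s0"), (5, "read", "s0"), (9, "read", "s0"),
         (10, "write", "s0"), (11, "read", "s0")]
print(register_lifetimes(trace))  # {'s0': 10}
```

Ranking registers by accumulated lifetime gives the cheap vulnerability ordering used to pick candidates for selective protection, without running full fault-injection campaigns.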
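The AVF and MWTF metrics of Section 2, (1) and (2), capture how such estimates trade reliability against execution time. A small sketch with made-up numbers (the error rate and bit counts are illustrative only):

```python
# Sketch of the MWTF and AVF computations, (1) and (2) in Section 2.

def avf(ace_bits, total_bits):
    # (2): probability that a fault in the structure results in an error
    return ace_bits / total_bits

def mwtf(raw_error_rate, avf_value, execution_time):
    # (1): MWTF = (raw error rate x AVF x execution time)^(-1)
    return 1.0 / (raw_error_rate * avf_value * execution_time)

# A hardened program runs longer, but if it lowers the AVF enough,
# the amount of work completed between failures still increases.
baseline = mwtf(1e-9, avf(400, 1000), 1.0e6)   # original program
hardened = mwtf(1e-9, avf(50, 1000), 1.8e6)    # 1.8x slower, far fewer ACE bits
print(round(hardened / baseline, 2))  # 4.44
```

This is why a hardening strategy is judged by the combination of its unACE percentage and its execution time overhead, not by either figure alone.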

Fig. 5. Subnodes.

Considering that when an instruction sends data outside of the SoR it may provoke an unrecoverable error (if those data are corrupted), it is desirable to verify the data correctness before they leave the sphere. Thus, SHE decomposes the nodes (basic blocks) of the CFG into subnodes after each instruction classified as outSoR (Fig. 5).

3.2 SEU-Emulation Tool: FTUnshades

The second main component of the hardening infrastructure is FTUnshades [31], an FPGA-based platform for the study of digital circuit reliability against soft errors. SEUs affecting the circuit are emulated by inducing bit-flips in the circuit under study by means of dynamic partial reconfiguration. The system is composed of an FPGA emulation board and a suite of software tools for testing the emulated design and analyzing the test results.

In the original version of FTUnshades, two instances of the circuit or module under test (MUT) are instantiated in the implemented design: Target and Gold. In order to determine the effect that a fault causes in the system, an SEU is emulated in the Target instance and its outputs are compared with those of the Gold instance in every clock cycle.

The system has been extended for the study of microprocessor architectures and software-based mitigation techniques. Since software techniques recover the fault-free state by comparing and recomputing, the Target, where the bit-flip was inserted, may need more clock cycles than the Gold to output the correct value. An exhaustive description of this extension can be found in [41]. Instead of two instances of the MUT (Gold and Target), the implemented design has just one instance of the MUT (Target), and the Gold instance is substituted by a Smart Table (see Fig. 6).

Fig. 6. FTUnshades approach using Smart Table.

The Smart Table is an automaton that implements the relaxed time restrictions needed for the fault injection testing of microprocessors that implement software-based techniques. This is needed because the typical cycle-by-cycle comparison would classify as an output error the effect of faults that could be corrected if the Target microprocessor were given more processing time. The exact additional time, measured in clock cycles, that the affected microprocessor needs to output the correct value is called the recovery time. The Smart Table can be configured at emulation time (that is, after synthesis, implementation, and FPGA programming). First, the Smart Table must be configured with the outputs of a Golden Run of the Target microprocessor: a whole emulation of the circuit processing the workload is performed, but without injecting any bit-flips. When being configured during a Golden Run, the Smart Table not only memorizes the sequence of correct outputs, but also the times (in clock cycles) at which the outputs change.

As FTUnshades is based on predictive injection, the platform provides information about the impact of each one of the microprocessor submodules on reliability, thanks to hierarchical injection [42]. The tool can thus detect the internal parts that are the best candidates to be protected.

In this work, FTUnshades is used to assess the reliability of the full hardware/software mitigation strategy applied in the physical implementation of the system. By emulating SEUs at the hardware level, more accurate fault coverage results are obtained than with other simulation-based fault injection techniques, because it is possible to evaluate the designs even at the microarchitectural level (a real system implementation is considered). Moreover, unlike the ISS, FTUnshades can be used to evaluate the hardened versions of the hardware.

4 CASE STUDY

To assess the effectiveness of our approach, a compiler front-end and a back-end for the PicoBlaze [43] microprocessor have been developed. PicoBlaze is an eight-bit soft-micro with severe limitations in performance and resources, but widely used in FPGA-based embedded systems. These facts make it specially appropriate for our case study, taking into account that software-only techniques cannot be completely applied to the whole system because of their high code and performance overheads. The main features of the microprocessor are: 16 byte-wide general-purpose data registers, 1K instructions of programmable on-chip program store, a byte-wide Arithmetic Logic Unit (ALU) with CARRY and ZERO indicator flags, and a 64-byte internal scratchpad RAM. In this work, a technology-independent version of PicoBlaze has been specially developed (RTL PicoBlaze), which allows validating the proposal for ASIC and FPGA technologies. This version is cycle accurate and RTL equivalent to the original PicoBlaze-3 version of the microprocessor.

The PicoBlaze front-end takes the original source code (KCPSM3 syntax), performs lexical, syntactic, and semantic analyses, and finally generates a GIF as output. After the hardening process, a hardened GIF is produced, which is taken by the compiler back-end, transforming the flow back to the KCPSM3 syntax.

This case study is divided into two parts. The first one is aimed at analyzing the usefulness of the infrastructure for developing and evaluating new software-based protection techniques. Several traditional techniques were adapted and their effectiveness was estimated by means of a benchmark suite, before being applied to a real system for an exhaustive verification. This experiment demonstrates that SHE not only assists the designer in the establishment
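The Golden Run bookkeeping just described can be illustrated with a small model. The sketch below is ours, not FTUnshades code: it records the output-change events of a fault-free run and then classifies a faulty run as an error only if the expected value does not appear within the allowed recovery window.

```python
# Illustrative model of the Smart Table idea (our sketch, not FTUnshades code).

def record_golden_run(output_trace):
    """Memorize (cycle, value) pairs at every output change of a fault-free run."""
    events, last = [], object()
    for cycle, value in enumerate(output_trace):
        if value != last:
            events.append((cycle, value))
            last = value
    return events

def classify_run(events, faulty_trace, critical_time):
    """'unACE' if every expected output value shows up within `critical_time`
    extra cycles of its golden-run change point; 'ACE' otherwise."""
    for cycle, value in events:
        window = faulty_trace[cycle:cycle + critical_time + 1]
        if value not in window:
            return 'ACE'  # correct value never produced inside the recovery window
    return 'unACE'
```

A strict cycle-by-cycle comparison is the special case `critical_time = 0`; relaxing it is what allows faults that are eventually corrected by the software techniques to be classified as recoverable.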

Fig. 7. Hardened program using TMR1 and TMR2.

of the mitigation strategy, but also reduces and defines the region of the design space to explore. A second experiment has been proposed to show the possibilities of the methodology in the codesign of a customized fault-tolerant system guided by the application requirements.

4.1 SHE Evaluation

As part of the case study, three software-based fault-recovery techniques have been implemented and evaluated in order to validate the main features of SHE: the ease of developing fine-grain software techniques, by means of the API, and the possibility of applying them selectively; the accuracy of the ISS in assessing the reliability provided by the techniques; and the relevance of the information offered by the ISS to drive the codesign of the hardening strategy. The implemented techniques were chosen based on these criteria and not on the reliability that they could offer.

4.1.1 Developing Software Techniques

The implemented techniques are based on the well-known Triple Modular Redundancy (TMR) approach [10]. TMR1 and TMR2 were designed to be selectively applied to the different kinds of instructions, whereas the third one is an adaptation of an overall technique proposed by Reis et al. [26] called SWIFT-R.

TMR1 can be summarized as follows:

1. First, identify the nodes (basic blocks) and subnodes in the program.
2. Build the CFG of the program.
3. Triplicate the instructions that perform any data operation (e.g., arithmetic, logic, shift/rotate).
4. Check the consistency of the data involved in the following instructions (by inserting majority voters and recovery procedures before executing them): outSoR instructions, e.g., a store into a memory position or a write to an output port; and instructions located just before a conditional branch. This verification is necessary because these instructions can affect the flags, so if a register value is corrupted, the resultant flag might be erroneous too, provoking an erroneous branch somewhere in the CFG.
5. Redundant registers can only be released if they are not used anymore in the program (this condition implies a detailed analysis of the CFG).

The second implemented technique is called TMR2. It consists of detecting and correcting faults in the program data by computing the values twice and recomputing a third time if a discrepancy between the first two values occurs. Fig. 7 shows an example of the hardening of a simple program (KCPSM3 syntax) using TMR1 and TMR2.

Due to the versatility of the API of SHE, these two techniques can be incrementally applied to every type of instruction by means of user modifiers. For example, the designer can apply these techniques only to the arithmetic instructions in the program or, in other cases, to the arithmetic, logical, and shift/rotate instructions.

The third technique, SWIFT-R, is an overall method aimed at recovering from faults in the data section, mainly related to the register file of the microprocessor. This method can be explained as follows:

1. Just like in TMR1, identify the nodes and subnodes of the program and build its CFG.
2. Triplicate data the first time that any data come into the SoR (i.e., after inSoR instructions). In this case, only the register file is considered as included in the SoR, whereas the memory subsystem is not, because it is assumed that memory already has its own protection mechanism [26]. Therefore, for every instruction classified as inSoR (read an input port, read from memory, load a value into a register), two additional copies will be created. These redundant copies are always created by just copying the register values, without repeating memory or port accesses.
3. Triplicate the desired instructions. Redundant instructions must operate using the redundant register copies.
4. Insert majority voters and recovery procedures for the protected registers at the following points: just before the last instruction of each node/subnode, and also just before any instruction that is the destination of a jump or call.
5. Release the redundant register copies after each majority voter. Once the correctness of a register value has been verified, its copies can be released.
6. Optionally, additional majority voters and recovery procedures can be dynamically inserted in case there are insufficient available registers to replicate. By means of this, register copies will be released to continue with the hardening process.

Fig. 8. API programming example method: SWIFT-R.

Fig. 8 presents a reduced version of the implementation of the SWIFT-R method using the proposed API of hardening routines.

Notice that TMR1 and TMR2 handle the selected instructions (arithmetic and logic) as hardware structures. TMR1 injects checkpoints and recovery procedures just
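To make the flavor of these transformations concrete, the sketch below applies a SWIFT-R-style pass to a toy three-address IR. It is a simplified illustration of the rules above (copy on inSoR, triplicate data operations, vote on outSoR), not the SHE implementation; the opcode names and the `vote` pseudo-instruction are our own.

```python
# Toy SWIFT-R-style pass over a minimal three-address IR
# (illustrative only; not the SHE implementation).
COPIES = ('_c1', '_c2')

def harden(program):
    out = []
    for op, dst, src in program:
        if op == 'load':                 # inSoR: copy, never re-access memory/ports
            out.append((op, dst, src))
            for c in COPIES:
                out.append(('move', dst + c, dst))
        elif op == 'store':              # outSoR: vote before data leave the SoR
            out.append(('vote', dst, dst))
            out.append((op, dst, src))
        else:                            # data operation: triplicate on the copies
            out.append((op, dst, src))
            for c in COPIES:
                out.append((op, dst + c, src + c))
    return out
```

Hardening the four-instruction program `load s0; load s1; add s0,s1; store s0` yields eleven instructions: each load followed by two register-to-register copies, a triplicated add, and a majority vote just before the store.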

Fig. 9. Normalized static code size overhead.

Fig. 10. Normalized execution time overhead.

before leaving a node/subnode, whereas TMR2 does so after each instruction. This protects the result generated by a single node/subnode or by a selected instruction, but it increases the number of cycles during which the original register is vulnerable to bit-flips that potentially become ACE faults. For example, considering one cycle per instruction, in the original code of Fig. 7 the register s0 is vulnerable during cycles 1 and 2. In the TMR1 version, s0 is vulnerable during cycles 1, 2, 7, and 9. SWIFT-R, on the contrary, is a better software version than TMR1 and TMR2 because it takes into account the whole life of each variable and maintains its copies longer (through jumps between basic blocks as well), until they are not needed anymore.

4.1.2 Evaluating Software Techniques

The benchmark software suite used in the experiments is made up of the following test programs: advanced encryption standard (aes) or Rijndael, bubble sort (bub), scalar division (div), Fibonacci (fib), finite impulse response filter (fir), greatest common divisor (gcd), matrix addition (madd), matrix multiplication (mmult), scalar multiplication (mult), proportional integral derivative controller (pid), exponentiation (pow), and quick sort (qsort).

Every test program was automatically hardened with the compiler by applying the three software techniques. Although these techniques provide mechanisms to protect both the control flow and the data, only the latter was enabled because of the PicoBlaze memory restrictions. In the same way, only arithmetic and logic instructions were protected with TMR1 and TMR2 in order to accommodate the benchmarks to the memory size. Therefore, this is a typical case where a hardware mechanism has to be added to complement the hardening of the system.

Figs. 9 and 10 present the resulting overheads reported by SHE after applying the techniques to the test programs. Both memory and execution time overheads are normalized to a baseline built with the nonhardened version. As can be seen, TMR1 and TMR2 present almost the same impact on the code size: the geometric means (calculated across all benchmarks) of their normalized static code size overheads are 2.59× and 2.65×, respectively, whereas SWIFT-R has a higher code overhead of 3.08×. Regarding Fig. 10, note again that TMR1 and TMR2 have quite similar impacts on the execution time overhead, with geometric means of 2.58× and 2.41×, respectively, while SWIFT-R causes a slightly higher increase, up to 2.83×.

SHE also assesses the fault coverage offered by the software techniques. The following experimental setup was configured to estimate the fault coverage of the three techniques. In this experiment, for each test program (original and hardened versions), 5,000 executions were performed using the ISS. According to the bit-flip fault model, only one SEU was simulated during each program execution. The fault was simulated as a bit-flip in a randomly selected bit of the microprocessor register file (16 byte-wide registers for PicoBlaze) at a random point of the program execution. Fig. 11 shows the percentages of the fault classification for every test program, and it also presents, on a secondary axis (logarithmic scale), the normalized MWTF for each program.

On average, the unACE percentages are: 81.50 percent for the nonhardened version, 86.32 percent for TMR1, 85.34 percent for TMR2, and 97.31 percent for SWIFT-R. These values represent the percentages of injected faults that do not provoke any undesirable behavior during the program execution, without taking the processor microarchitecture into account. Although they can be considered only an estimation of the fault coverage offered by the techniques, this is valuable information, since it allows comparing the preliminary reliability results offered by the different techniques.

Fig. 11. Fault classification percentages and normalized MWTF for the nonhardened version (O), TMR1 (A), TMR2 (B), and SWIFT-R (C) using the ISS (test campaign execution against the microprocessor Register File).

In addition, the reliability results have to be taken into account jointly with the overhead results in the next decisions about the system design, representing several tradeoffs among code size, performance, and reliability. In this case study, the TMR1 and TMR2 techniques do not offer a significant improvement in reliability. SWIFT-R, on the other hand, presents an important enhancement of the fault coverage
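The campaign loop itself is simple; the sketch below shows its structure with a stub standing in for the ISS (the stub "program" just adds two registers, and all names are illustrative, not the real simulator interface). Because the stub uses only two of the sixteen registers, roughly 14/16 of the injected faults land in unused registers, reproducing in miniature the text's observation that faults in unused register bits are classified as unACE.

```python
import random

# Sketch of the ISS fault-injection campaign (one SEU per run).
REGISTERS, BITS = 16, 8          # PicoBlaze register file: 16 byte-wide registers

def run_program(fault=None):
    """Stub program: returns regs[0] + regs[1] (mod 256), with an optional
    single bit-flip fault = (register, bit) applied before execution."""
    regs = [3, 4] + [0] * (REGISTERS - 2)
    if fault is not None:
        reg, bit = fault
        regs[reg] ^= 1 << bit
    return (regs[0] + regs[1]) & 0xFF

def campaign(runs, seed=0):
    """Fraction of injected faults that are unACE (output equals the golden run)."""
    rng = random.Random(seed)
    golden = run_program()
    unace = sum(run_program((rng.randrange(REGISTERS), rng.randrange(BITS))) == golden
                for _ in range(runs))
    return unace / runs
```

With 5,000 runs, the measured unACE fraction converges to about 87.5 percent (14 unused registers out of 16), illustrating why per-register usage matters when interpreting raw coverage numbers.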

Fig. 13. Selective SWIFT-R examples.

but it has a remarkable impact on code size and performance that has to be considered depending on the system constraints. These analyses can motivate important design decisions about the maximum level of protection that can be applied with software techniques for a specific application without exceeding the system constraints. Moreover, they permit redesigning the protection strategy if the evaluated technique provokes unsuitable overheads. The Mean Work To Failure (MWTF) metric should be calculated for this purpose, as it captures the balance between reliability and execution time overhead.

Note from Fig. 11 that not only does SWIFT-R offer better fault coverage than TMR1 and TMR2, it also reaches a normalized MWTF 2.42× higher than that of the nonhardened program on average, whereas the normalized MWTF of TMR1 and TMR2 is only 0.52× on average (the same value for both). This latter value (under the normalized baseline of 1.00) means that TMR1 and TMR2 are more prone to failure than the nonhardened program, which is in accordance with the previous analysis about the goodness of each technique.

To validate the estimations made using the ISS, FTUnshades was used to evaluate the real system running all the test programs. The system has been implemented using a technology-independent version of PicoBlaze (RTL PicoBlaze), specifically developed for this work. The fault injection campaign was equal to the one used in the ISS: 5,000 executions (one SEU emulated per execution) have been performed. Each fault has been emulated in a randomly selected clock cycle of the whole workload duration. A critical time of 1,023 clock cycles has been chosen for these experiments, giving the microprocessor 1,023 more clock cycles than usual to recover the system to a fault-free state.

Fig. 12. Fault classification percentages and normalized MWTF for the nonhardened version (O), TMR1 (A), TMR2 (B), and SWIFT-R (C) using FTUnshades (test campaign execution against the microprocessor Register File).

Fig. 12 shows the fault classification percentages and the normalized MWTF (logarithmic scale) obtained for each version of the system in the FTUnshades. Results have been classified in the same way as in the ISS evaluation. In this case, the percentages of unACE faults offered by the methods are: 87.32 percent for the nonhardened version, 90.77 percent for TMR1, 90.08 percent for TMR2, and 98.10 percent for SWIFT-R. As can be seen, these results confirm the simulation results obtained during the preliminary reliability evaluation with the ISS. Although the FTUnshades percentages are slightly higher than the ones obtained with the ISS, they maintain the same tendencies. This fact can be observed in the MWTF results as well. Thus, the ISS can serve as a good predictor of reliability when comparing different software techniques.

Notice that the high fault coverage results obtained for the unprotected versions are due to the fact that the fault injection test was performed over the complete register file, even though the programs do not use all sixteen available general-purpose registers. Thus, injecting a fault into an unused register bit is classified as unACE because it does not affect the expected program output. This kind of testing has been used in other works as well [22], [23], [24], [26], [35], which allows obtaining homogeneous result sets that are comparable to each other.

This experiment shows how SHE assists the designer in the establishment of the mitigation strategy, and in the reduction and definition of the region of the design space to explore.

4.2 Codesign of Soft-Error Mitigation Strategies

In order to illustrate the applicability of the proposal, an in-depth case study has been explored, running the program of an FIR (Finite Impulse Response) filter on the PicoBlaze soft-micro. In this experiment, the design space area to explore is reduced to the SWIFT-R technique on the software side and the TMR technique on the hardware side.

The flexibility of SHE allows exploring the application of the software-based techniques in a selective way, that is, protecting only specific parts of the program or of the architectural resources. In this case study, SWIFT-R has been applied to different register subsets of the microprocessor register file, looking for a reduction in the overheads while keeping a high fault coverage. For instance, a hardened version of the software obtained by applying SWIFT-R only to the register subset 0-2-F means that only these registers are protected (out of the sixteen possible PicoBlaze registers, hex numbered). This is possible by moving the remaining registers outside the SoR. Fig. 13 shows several code examples applying SWIFT-R to diverse register subsets. Note that majority voter insertion is necessary when an instruction causes information to leave the SoR.

Moreover, to help the designer prioritize which registers to protect in software, SHE gathers information about the number of accesses to the registers (the number of times the registers have been used for writing, reading, or reading/writing operations) and the number of clock cycles during which useful data are present in the registers, i.e., the register lifetime. The first parameter has a remarkable impact on code size and execution time overheads, since protecting a highly accessed register by software implies a higher number of redundant instructions. The second parameter has a high impact on reliability, since the higher the lifetime, the longer the register is prone to soft errors.
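The normalized MWTF values quoted above can be reproduced from the averages already given. Assuming, as in the hybrid fault-detection work of Reis et al. [29], that MWTF is proportional to 1/(AVF × execution time), and approximating the AVF as 1 minus the unACE fraction, normalization against the unhardened version reduces to a simple ratio. The sketch below recovers the ≈0.52× and ≈2.42× figures (up to rounding of the published averages):

```python
# Normalized MWTF from average unACE fractions and execution-time overheads,
# assuming MWTF ∝ 1 / (AVF × execution time) and AVF ≈ 1 − unACE.
def normalized_mwtf(unace_base, unace_hard, time_overhead):
    return (1.0 - unace_base) / ((1.0 - unace_hard) * time_overhead)

# ISS averages quoted in the text:
BASE = 0.8150
mwtf_tmr1   = normalized_mwtf(BASE, 0.8632, 2.58)   # ≈ 0.52
mwtf_tmr2   = normalized_mwtf(BASE, 0.8534, 2.41)   # ≈ 0.52
mwtf_swiftr = normalized_mwtf(BASE, 0.9731, 2.83)   # ≈ 2.43 (2.42 in the text)
```

Values below 1.00 confirm the interpretation given earlier: the extra execution time of TMR1 and TMR2 outweighs their modest coverage gain, so they complete less work per failure than the unprotected program.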

TABLE 1 Registers Usage for FIR

Table 1 presents the information about register usage (number of accesses and lifetime) for the FIR program (note that this program only uses registers 0, 1, 2, 3, and F). Register lifetime is expressed as a percentage of the total program time. The register with the highest lifetime is F (100.0 percent of the total clock cycles of the program). In addition, notice that this register is the least accessed, and consequently it is the first candidate to be hardened.

Fig. 14. Normalized static code size and execution time overheads for FIR selective hardened versions.

Fig. 14 shows the overhead results for several selectively hardened register subsets using software protection. These results are normalized against a baseline built with the nonhardened version. Notice that when SWIFT-R is applied to some highly accessed registers, such as two, the overheads increase considerably. However, protecting these registers does not guarantee improved reliability, since the vulnerability of each register depends on its lifetime during the program execution. The lifetime parameter is not always correlated with the number of times the register is accessed (e.g., register F).

On the hardware side, the fault-tolerant codesign strategy was complemented by applying selective hardening to several microprocessor resources. In this way, TMR has been applied to different sets of microarchitectural registers. The following five versions of the processor have been developed:

. Nonhardened RTL PicoBlaze (P0).
. Microprocessor with hardware redundancy for the Program Counter (PC), Flags, and Stack Pointer (SP) (P1).
. All registers in the pipeline protected (P2).
. Hardware redundancy for PC, Flags, SP, and pipeline (P3).
. Fully protected version, i.e., microprocessor with redundancy for the register file, PC, Flags, SP, and pipeline (P4).

Using the information obtained by SHE and the synthesis tool, the designer can select the best candidates for further analyses. Nevertheless, in this case study, and for demonstration purposes, all the systems have been synthesized and implemented using the ISE 10.1 suite: 32 different software versions running on four different versions of the microprocessor (P0, P1, P2, and P3), and the nonhardened program running on the fully protected microprocessor (P4); altogether, 129 different hardware/software system configurations.

In order to evaluate the reliability of the different hardware/software configurations, this time the injection of faults has been performed over the whole system (not only the register file as in Section 4.1). For each system (selectively hardened in software and hardware), a fault injection campaign has been executed in the FTUnshades. Every test campaign makes selective attacks on the microprocessor register subsets (including register file, PC, flags, SP, and pipeline). For each subset, 5,000 SEUs (one per execution) have been emulated in a randomly selected clock cycle of the workload duration. This is a total of 25,000 faults injected in every version of the system. Fig. 15 presents the fault classification percentages obtained for each system. These results are the weighted average of the results of the selective attacks on the internal microprocessor register subsets, assuming the same fault probability for all of the bits on target. The AVF of each system was estimated as the sum of the SDC and Hang faults.

Note that the SWIFT-R technique offers a considerable fault coverage increase, even on the nonhardened hardware (up to 92.20 percent unACE faults for the fully hardened program, "All" in the figure), which is much higher than the reliability of every hardware-hardened approach running the nonhardened program ("None" versions in the figure). Results for the P4 approach are not shown in Fig. 15 because 100 percent of the injected faults were classified as unACE, as expected. Furthermore, notice that combining SWIFT-R with hardware protection in only a few critical registers, such as PC, Flags, and SP (P1 approach), increases the fault coverage markedly (up to 97.91 percent unACE faults). As can be seen, hardware redundancy on the pipeline does not improve the fault coverage of the system considerably (processor P2), even though there are more protected registers in P2 than in P1, whose fault coverage is better.

Taking into account the requirements of the application being designed, the joint analysis of overheads and reliability is a key element of the codesign process. These results make it easier to find the solutions with the best reliability/overhead compromise. For instance, the software with SWIFT-R applied only to the register subset 0-3-F running on the P3 microprocessor is an interesting choice, because it offers both high reliability (95.52 percent unACE faults) and acceptable code size and execution time overheads (1.73× and 1.46×, respectively).

Although remarkable reliability increases can be obtained when combining software-hardened programs with hardware-redundant approaches (for instance, up to 99.52 percent unACE faults for the P3 microprocessor), the hardware cost increases as well. This is an important fact that has to be considered when exploring the design space. In this work, the results reported by the synthesis tool (Xilinx XST v10.1) have been used as an estimation of the hardware cost. These results
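The per-system percentages in Fig. 15 are obtained by weighting the per-subset results, and the AVF is then estimated as the SDC plus Hang fraction. A minimal sketch of that aggregation follows; the subset sizes (in bits) and class fractions below are illustrative placeholders, not measured data.

```python
# Sketch of the Fig. 15 aggregation: weighted average of per-subset
# fault-classification fractions, with the AVF estimated as SDC + Hang.
def aggregate(subsets):
    """subsets: list of (bits, {class: fraction}) per attacked register subset;
    weights are proportional to the number of target bits in each subset."""
    total = sum(bits for bits, _ in subsets)
    avg = {}
    for bits, classes in subsets:
        for name, frac in classes.items():
            avg[name] = avg.get(name, 0.0) + frac * bits / total
    avf = avg.get('SDC', 0.0) + avg.get('Hang', 0.0)   # AVF estimate
    return avg, avf

subsets = [
    (128, {'unACE': 0.95, 'SDC': 0.03, 'Hang': 0.02}),  # placeholder: register file
    (64,  {'unACE': 0.80, 'SDC': 0.15, 'Hang': 0.05}),  # placeholder: pipeline regs
]
```

Weighting by bit count implements the stated assumption of equal fault probability for every bit on target.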

Fig. 15. Fault classification percentages for every test program running on each microprocessor version (P0 to P3).

are expressed in terms of sequential logic (flip-flops and latches), combinational logic, and RAM blocks.

Fig. 16, on the one hand, shows the hardware cost of each approach normalized against a baseline built with the nonhardened RTL PicoBlaze (P0); on the other hand, it also depicts, on a secondary axis, the MWTF of the hybrid systems normalized to a baseline built with the nonhardened software/hardware version. Since the MWTF captures the balance between reliability and performance, this figure shows at a glance several tradeoffs among reliability, execution time, and hardware cost for each of the systems. In this figure, the fully hardware-protected microprocessor (P4) is not represented, considering its high hardware cost with respect to P0: 2.92× in sequential logic and 1.93× in combinational logic. In addition, as none of the hardware protection mechanisms used increases the RAM blocks, this component is not represented in the figure either.

It is worth noting that the hardware cost increases considerably when the registers in the pipeline are hardened (P2 and P3 microprocessors), whereas the normalized MWTF only improves slightly in these cases (nearly equal to the nonhardened processor), or even decreases compared with cheaper approaches (P2+SWIFT-R). In the case of P4, the high hardware cost may be unsuitable for many embedded systems, although its reliability is 100 percent, i.e., 100 percent of the injected faults did not cause any unexpected behavior in the system (unACE faults).

From the results in Fig. 16, one can observe that the most remarkable increases in the MWTF are for the microprocessors running the fully protected software ("All" versions in the figure), whose normalized MWTF values are 1.17×, 4.36×, 1.48×, and 19.09× for the P0, P1, P2, and P3 microprocessors, respectively. Notice that even though the registers with hardware protection in the P2 processor (pipeline) are more numerous than those protected in the P1 approach, this does not imply higher reliability. Protecting only a few critical registers can cause a remarkable MWTF improvement (as in P1).

In addition, there are quite interesting selective approaches, like those with the 0-2-3-F register subset hardened with SWIFT-R, whose normalized MWTF values are 1.20×, 3.14×, 1.48×, and 6.13× for the P0, P1, P2, and P3 microprocessors, respectively. These approaches can be better candidates for applications where not only reliability is important, but also execution time.

Sometimes protecting all registers using a software technique may yield the best fault coverage, but at the same time it provokes a higher performance degradation, reducing the system MWTF compared to some partially protected software versions with less execution time overhead. For instance, this can be observed in Fig. 16 for the P0 and P2 microprocessors when executing the software versions F and 0-F.

By making this kind of analysis, the design space exploration can be driven to best fit the specific constraints and dependability requirements of every application. For this
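Such an exploration can be mechanized as a constrained selection over the measured configurations: discard the ones violating the cost and performance budgets, then rank the rest by normalized MWTF. In the toy routine below, only the "P1 + 0-2-3-F" and "P3 + All" MWTF values (3.14× and 19.09×) and the 2.05× execution-time overhead come from the text; the hardware costs and the remaining overheads are illustrative placeholders.

```python
# Toy design-space selection: filter configurations by cost/performance
# budgets, then rank by normalized MWTF (placeholder data except where noted).
def best_config(configs, max_hw_cost, max_time_overhead):
    feasible = [c for c in configs
                if c['hw_cost'] <= max_hw_cost
                and c['time_overhead'] <= max_time_overhead]
    return max(feasible, key=lambda c: c['mwtf']) if feasible else None

configs = [
    {'name': 'P0 + none',    'hw_cost': 1.00, 'time_overhead': 1.00, 'mwtf': 1.00},
    {'name': 'P1 + 0-2-3-F', 'hw_cost': 1.10, 'time_overhead': 2.05, 'mwtf': 3.14},
    {'name': 'P3 + All',     'hw_cost': 1.80, 'time_overhead': 2.83, 'mwtf': 19.09},
]
```

Tightening the budgets shifts the choice toward the selective hybrid configurations, which is exactly the tradeoff the figure illustrates.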

Fig. 16. Normalized hardware cost and normalized mean work to failure (MWTF) by hybrid system.

case study, a suitable configuration might be the system with the P1 microprocessor and SWIFT-R applied to the register subset 0-2-3-F on the software side, because it offers a high increase in the mean work to failure (3.14× more than the nonprotected system) with low hardware costs and an acceptable execution time overhead (2.05×). In some other applications, with less severe cost and performance constraints, the fully protected software version with the microprocessor P3 might be the best configuration of the joint approach, achieving 19.09× more MWTF than the nonhardened embedded system.

5 CONCLUSIONS

This paper presents a compiler-directed methodology that is able to guide the codesign of fault-tolerant hardware/software systems. It is supported by an infrastructure that facilitates the exploration of the design space between hardware-only and software-only soft-error mitigation strategies. In addition, a novel codesign strategy has been studied by means of the selective application of the fault-tolerance techniques on both sides, software and hardware.

The advantages of the resulting hybrid hardware/software configurations are illustrated by means of two case studies. The first one shows the flexibility of the infrastructure for applying and evaluating different software-based fault-tolerance techniques, facilitating the choice of the strategies that best suit the characteristics of the application. The second one shows how, by applying the codesign methodology, it is possible to obtain a hybrid configuration that fits the dependability requirements and design constraints better than the original techniques.

As a result, this new methodology suggests the implementation of automatic hardening tasks within the presented platform and opens up interesting new boundaries in the design space exploration of hybrid systems. In this way, thanks to the advantages and flexibility of our software hardening environment, a multiobjective optimization algorithm, such as NSGA-II [44], will be applied to automatically explore the design space on the software side. Furthermore, SHE will be extended to support 32-bit soft-core microprocessors, like LEON3. Moreover, we do not discard a further interaction with a high-level compiler such as GCC, as it can generate assembly code, which is the entry point of our hardening platform.

ACKNOWLEDGMENTS

This work has been funded by the Ministry of Science and Innovation in Spain under the project "Integral Analysis of Digital Circuits and Systems for Aerospace Applications (RENASER+)" (TEC2010-22095-C03-01), and by the Generalitat Valenciana in Spain under the project "Aceleración de algoritmos industriales y de seguridad en entornos críticos mediante hardware" (GV/2009/098). FTUnshades was developed under contract with the European Space Agency (ESA) (ESTEC/TEC/NL 17540).

REFERENCES

[1] R. Baumann, "Radiation-Induced Soft Errors in Advanced Semiconductor Technologies," IEEE Trans. Device and Materials Reliability, vol. 5, no. 3, pp. 305-316, Sept. 2005.
[2] P. Shivakumar, M. Kistler, S.W. Keckler, D. Burger, and L. Alvisi, "Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic," Proc. Int'l Conf. Dependable Systems and Networks, pp. 389-398, 2002.
[3] T. Karnik, P. Hazucha, and J. Patel, "Characterization of Soft Errors Caused by Single Event Upsets in CMOS Processes," IEEE Trans. Dependable and Secure Computing, vol. 1, no. 2, pp. 128-143, Apr.-June 2004.
[4] R. Edwards, C. Dyer, and E. Normand, "Technical Standard for Atmospheric Radiation Single Event Effects (SEE) on Avionics Electronics," Proc. IEEE Radiation Effects Data Workshop (REDW '04), pp. 1-5, 2004.
[5] R. Baumann, "Soft Errors in Commercial Semiconductor Technology: Overview and Scaling Trends," IEEE 2002 Reliability Physics Symp. Tutorial Notes, Reliability Fundamentals, pp. 121-01.1-121-01.14, IEEE Press, Apr. 2002.
[6] S.E. Michalak, K.W. Harris, N.W. Hengartner, B.E. Takala, and S.A. Wender, "Predicting the Number of Fatal Soft Errors in Los Alamos National Laboratory's ASC Q Supercomputer," IEEE Trans. Device and Materials Reliability, vol. 5, no. 3, pp. 329-335, Sept. 2005.
[7] ESA, "The Radiation Design Handbook, ESA PSS-01-609," technical report, European Space Agency, 1993.
[8] IEC, "IEC/TS 62396-1," technical report, Int'l Electrotechnical Commission, Mar. 2006.
[9] DoD, "MIL-HDBK-817, Military Handbook System Development Radiation Hardness Assurance," technical report, Dept. of Defense, USA, 1994.
[10] J. von Neumann, "Probabilistic Logics and Synthesis of Reliable Organisms from Unreliable Components," Automata Studies, C.E. Shannon and J. McCarthy, eds., pp. 43-98, Princeton Univ., 1956.
[11] R. Naseer, R.Z. Bhatti, and J. Draper, "Analysis of Soft Error Mitigation Techniques for Register Files in IBM Cu-08 90nm Technology," Proc. 49th IEEE Int'l Midwest Symp. Circuits and Systems, pp. 515-519, Aug. 2006.
[12] T.M. Austin, "DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design," Proc. 32nd Ann. Int'l Symp. Microarchitecture (MICRO-32), pp. 196-207, Nov. 1999.
[13] A. Mahmood and E.J. McCluskey, "Concurrent Error-Detection Using Watchdog Processors," IEEE Trans. Computers, vol. 37, no. 2, pp. 160-174, Feb. 1988.
[14] S.S. Mukherjee, M. Kontz, and S.K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives," Proc. 29th Int'l Symp. Computer Architecture, pp. 99-110, 2002.
[15] M.A. Gomaa, C. Scarbrough, T.N. Vijaykumar, and I. Pomeranz, "Transient-Fault Recovery for Chip Multiprocessors," IEEE Micro, vol. 23, no. 6, pp. 76-83, Nov.-Dec. 2003.
[16] P.K. Samudrala, J. Ramos, and S. Katkoori, "Selective Triple Modular Redundancy (STMR) Based Single-Event Upset (SEU) Tolerant Synthesis for FPGAs," IEEE Trans. Nuclear Science, vol. 51, no. 5, pp. 2957-2969, Oct. 2004.
[17] A. Parashar, S. Gurumurthi, and A. Sivasubramaniam, "SlicK: Slice-Based Locality Exploitation for Efficient Redundant Multithreading," ACM SIGPLAN Notices, vol. 41, no. 11, pp. 95-105, Nov. 2006.
[18] V.K. Reddy, S. Parthasarathy, and E. Rotenberg, "Understanding Prediction-Based Partial Redundant Threading for Low-Overhead, High-Coverage Fault Tolerance," ACM SIGPLAN Notices, vol. 41, no. 11, pp. 83-94, Nov. 2006.
[19] O. Ergin, O.S. Unsal, X. Vera, and A. Gonzalez, "Reducing Soft Errors through Operand Width Aware Policies," IEEE Trans. Dependable and Secure Computing, vol. 6, no. 3, pp. 217-230, July-Sept. 2009.
[20] N. Oh, S. Mitra, and E.J. McCluskey, "ED4I: Error Detection by Diverse Data and Duplicated Instructions," IEEE Trans. Computers, vol. 51, no. 2, pp. 180-199, Feb. 2002.
[21] M. Rebaudengo, M.S. Reorda, and M. Violante, "A New Software-Based Technique for Low-Cost Fault-Tolerant Application," Proc. Ann. Reliability and Maintainability Symp., pp. 25-28, 2003.
[22] N. Oh, P.P. Shirvani, and E.J. McCluskey, "Error Detection by Duplicated Instructions in Super-Scalar Processors," IEEE Trans. Reliability, vol. 51, no. 1, pp. 63-75, Mar. 2002.
[23] N. Oh, P.P. Shirvani, and E.J. McCluskey, "Control-Flow Checking by Software Signatures," IEEE Trans. Reliability, vol. 51, no. 1, pp. 111-122, Mar. 2002.
[29] G.A. Reis, J. Chang, N. Vachharajani, S.S. Mukherjee, R. Rangan, and D.I.

[24] G.A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D.I. August, “SWIFT: Software Implemented Fault Tolerance,” Proc. Int’l Symp. Code Generation and Optimization, pp. 243-254, 2005.
[25] M. Rebaudengo, M.S. Reorda, and M. Violante, “A New Approach to Software-Implemented Fault Tolerance,” J. Electronic Testing: Theory and Applications, vol. 20, no. 4, pp. 433-437, Aug. 2004.
[26] G.A. Reis, J. Chang, and D.I. August, “Automatic Instruction-Level Software-Only Recovery,” IEEE Micro, vol. 27, no. 1, pp. 36-47, 2007.
[27] M. Pignol, “COTS-Based Applications in Space Avionics,” Proc. 13th Design, Automation and Test in Europe Conf. (DATE ’10), p. 1213, Mar. 2010.
[28] P. Bernardi, L.M. Bolzani Poehls, M. Grosso, and M.S. Reorda, “A Hybrid Approach for Detection and Correction of Transient Faults in SoCs,” IEEE Trans. Dependable and Secure Computing, vol. 7, no. 4, pp. 439-445, Oct.-Dec. 2010.
[29] G.A. Reis, J. Chang, N. Vachharajani, S.S. Mukherjee, R. Rangan, and D.I. August, “Design and Evaluation of Hybrid Fault-Detection Systems,” Proc. 32nd Int’l Symp. Computer Architecture, pp. 148-159, June 2005.
[30] F. Restrepo-Calle, A. Martínez-Alvarez, S. Cuenca-Asensi, F.R. Palomo, and M.A. Aguirre, “Hardening Development Environment for Embedded Systems,” Proc. Second HiPEAC Workshop Design for Reliability (DFR ’10), held with the Fifth Int’l Conf. High Performance and Embedded Architectures and Compilers, pp. 1-10, Jan. 2010.
[31] J. Napoles, H. Guzman, M. Aguirre, J. Tombs, F. Munoz, V. Baena, A. Torralba, and L. Franquelo, “Radiation Environment Emulation for VLSI Designs: A Low Cost Platform Based on Xilinx FPGAs,” Proc. IEEE Int’l Symp. Industrial Electronics, pp. 3334-3338, June 2007.
[32] F.L. Kastensmidt, L. Carro, and R. Reis, Fault-Tolerance Techniques for SRAM-Based FPGAs. Springer, 2006.
[33] L. Sterpone and M. Violante, “A New Reliability-Oriented Place and Route Algorithm for SRAM-Based FPGAs,” IEEE Trans. Computers, vol. 55, no. 6, pp. 732-744, June 2006.
[34] G. De Micheli and R.K. Gupta, “Hardware/Software Co-Design,” Proc. IEEE, vol. 85, no. 3, pp. 349-365, Mar. 1997.
[35] M. Rebaudengo, M.S. Reorda, M. Violante, and M. Torchiano, “A Source-to-Source Compiler for Generating Dependable Software,” Proc. First IEEE Int’l Workshop Source Code Analysis and Manipulation, pp. 33-42, 2001.
[36] S.S. Mukherjee, C. Weaver, J. Emer, S.K. Reinhardt, and T. Austin, “A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor,” Proc. 36th Int’l Symp. Microarchitecture, pp. 29-40, Dec. 2003.
[37] S.K. Reinhardt and S.S. Mukherjee, “Transient Fault Detection via Simultaneous Multithreading,” Proc. 27th Int’l Symp. Computer Architecture, pp. 25-36, June 2000.
[38] MISRA, MISRA-C:2004 Guidelines for the Use of the C Language in Critical Systems, Motor Industry Software Reliability Assoc., 2004.
[39] J. Lee and A. Shrivastava, “Compiler-Managed Register File Protection for Energy-Efficient Soft Error Reduction,” Proc. Asia and South Pacific Design Automation Conf., pp. 618-623, 2009.
[40] V. Sridharan and D.R. Kaeli, “Quantifying Software Vulnerability,” Proc. Workshop Radiation Effects and Fault Tolerance in Nanometer Technologies (WREFTNT ’08), pp. 323-328, 2008.
[41] H. Guzman-Miranda, M.A. Aguirre, and J. Tombs, “Noninvasive Fault Classification, Robustness and Recovery Time Measurement in Microprocessor-Type Architectures Subjected to Radiation-Induced Errors,” IEEE Trans. Instrumentation and Measurement, vol. 58, no. 5, pp. 1514-1524, May 2009.
[42] M.A. Aguirre, J.N. Tombs, F. Munoz, V. Baena, H. Guzman, J. Napoles, A. Fernandez-Leon, F. Tortosa-Lopez, and D. Merodio, “Selective Protection Analysis Using a SEU Emulator: Testing Protocol and Case Study over the LEON2 Processor,” IEEE Trans. Nuclear Science, vol. 54, no. 4, part 2, pp. 951-956, Aug. 2007.
[43] K. Chapman, PicoBlaze KCPSM3: 8-Bit Micro Controller for Spartan-3, Virtex-II and Virtex-II Pro. Xilinx Ltd., Oct. 2003.
[44] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II,” IEEE Trans. Evolutionary Computation, vol. 6, no. 2, pp. 182-197, Apr. 2002.

Antonio Martínez-Alvarez received the MS and PhD degrees in electronics engineering from the University of Granada, Spain, in 2002 and 2006, respectively. From 2002 to 2006, he was with the Department of Computer Architecture and Technology at the University of Granada, Spain. He is currently an associate professor with the Department of Computer Technology, University of Alicante, Spain. His main research interests deal with methods and tools for the dependable design of digital integrated circuits and FPGAs, high-performance image-processing architectures, and embedded systems based on reconfigurable devices. He is also interested in neuroengineering and neuroprosthesis devices. Currently, he is working on the design and development of SHE (software hardening environment).

Sergio A. Cuenca-Asensi is an associate professor in the Computer Architecture and Technology Department at the University of Alicante, Spain. He received the BS degree in electronic physics from the University of Granada, Spain, in 1990, and the PhD degree in computer engineering from the University Miguel Hernández of Elche, Spain, in 2002. His current research interests include reconfigurable computing, hardware/software codesign, and soft error mitigation in embedded systems. He is a member of the IEEE.

Felipe Restrepo-Calle received the computer engineering degree from the Technological University of Pereira in 2004. He is currently a PhD student in the Technologies for the Information Society program in the Computer Technology Department at the University of Alicante, Spain, supported by a scholarship from the same university. His fields of interest include methods and tools for the dependable design of embedded systems. He is also interested in human-computer interaction and assistive technologies.

Francisco R. Palomo Pinto received the Fundamental Physics degree from the University of Sevilla, Spain, in 1993. In 1999, he joined the Dpto. Ingeniería Electrónica, Escuela Superior de Ingenieros de Sevilla, University of Sevilla, Spain, where he is an assistant professor. His current research interests include the simulation of radiation effects in electronics, radiation testing, and dynamical systems modeling. He is a member of the IEEE Nuclear and Plasma Sciences Society.

Hipo´ lito Guzma´n-Miranda was born in Sevilla, Miguel A. Aguirre (M’91) received the master’s Spain, in 1982. He received the degree in degree in 1991, in electrical and electronic telecommunications engineering from the Uni- engineering from the University of Sevilla, Spain, versity of Sevilla, Spain, in May 2006. After that, and the PhD degree, in 1994, from the same he joined the Department of Electronic Engi- University. He is currently teaching digital neering at the University of Sevilla, where he is microelectronics in the Electronic Engineering currently working as a microelectronics teacher Department of the University of Sevilla, Spain, and researcher. He obtained the masters de- as an assistant professor. He is the author of gree, in May 2008, and the PhD degree, in June more than 15 publications in journals of the 2010. His research interests include fault injec- IEEE, and more than 80 communications in tion emulation and protection of complex microelectronic circuits under conferences. His research interests include methods and tools for the effects of radiation. He is a student member of the IEEE. dependable design of digital integrated circuits and FPGAs. Currently, he is working toward the design and development of the next generation of the FT-UNSHADES system. He is a senior member of the IEEE.
