Evaluation of Failures Masking Across the Software Stack

Thiago Santini, Paolo Rech, Anderson Sartor, Ulisses B. Corrêa, Luigi Carro, and Flávio R. Wagner
Instituto de Informática, Universidade Federal do Rio Grande do Sul, Porto Alegre, RS, Brazil
Email: {tcsantini, prech, alsartor, ubcorrea, carro, flavio}@inf.ufrgs.br

Abstract: In this paper, we analyze how implementing an application in different software layers impacts the failure rate of embedded computing systems. We investigate an ARM-based System-on-Chip executing an application on top of the Dalvik virtual machine, using Android's Java Native Interface (JNI), and as a native executable. The different versions are then exposed to a controlled neutron beam, and the outcomes of the resulting executions are logged and analyzed. We additionally classify observable failures based on the events observed in the logs during the time window of each failure. Our experimental results show that the Dalvik version presents the lowest failure rate, followed by the JNI version and then by the native version, suggesting that the higher the software layer, the higher the failure masking.

I. INTRODUCTION

Nowadays, Personal Mobile Devices (PMDs), such as tablets and smartphones, are becoming the mainstream of computing devices [1]. The amount of resources available in PMDs is continuously increasing, and it is very common to have PMDs powered by parallel embedded processors. To manage the available resources, a large software platform is required. The PMD software platform also supports a faster time-to-market for third-party software, profoundly changing traditional software design and development for embedded systems.

For complex systems, one of the traditional approaches to speed up the time-to-market of applications is the use of high abstraction levels in the application design and implementation processes. These abstractions facilitate embedded resource management and the software development process, making software more readable, easier to maintain, and highly portable [2]. In fact, embedded systems projects nowadays typically include object-oriented languages, like Java and C++. These high-abstraction languages are widely applied even on resource-constrained microcontroller platforms, such as Arduino, whose official Software Development Kit (SDK) uses C++ in an object-oriented way [3].

Android, the dominant software platform in the market for PMDs [4], uses a four-layer software stack to help third-party practitioners develop applications. This software platform is intended to be used in a broad range of devices (e.g., wearables, smartphones, TVs, automobile media centers). Since the Android platform stack abstracts several implementation details, code reuse becomes ubiquitous among different devices.

With the shrinking of transistor dimensions and the growth in the amount of resources available in modern devices, the radiation-induced error rate cannot be considered negligible even in consumer electronics and PMDs. For instance, the user-observable Apple iPhone 3 Mean Time To Failure (MTTF) can be as short as 1 year when operating at commercial aircraft altitudes (i.e., 35,000 ft), considering only those failures that negatively impact the user experience [5]. When scaled to the average number of passengers flying per year, which the International Air Transport Association estimated to be three billion in 2013 [6], and conservatively supposing that each one uses a smartphone for an hour per trip, such an MTTF translates to about 350,000 user-observable errors per year, i.e., roughly one observable failure per 10,000 passenger trips (about 0.01%). As complexity is expected to increase in future generations, and parallelism is becoming the new computing standard, a software stack is going to be mandatory to ease application development and portability. In this scenario, software stack reliability needs to be carefully evaluated to understand the behaviour of applications executed in embedded systems when exposed to radiation.

The objective of this work is to evaluate how such large software abstraction stacks impact user-observable errors. To reach this goal, three variants of a matrix multiplication application were developed. The first variant is a full Java implementation, executing over the Dalvik Virtual Machine (DVM). The second variant uses the Java Native Interface, in which parts of the application's code are implemented in a native shared library. The last variant is a native Linux application. Our experimental results show that the Dalvik version presents the lowest failure rate, followed by the JNI version and then by the native version, suggesting that the higher the software layer, the higher the failure masking.

This paper is organised as follows. Section II presents background on the Android software stack, and Section III gives an overview of our experimental setup. Section IV then presents and discusses the experimental results. Section V concludes the paper and presents future work.

II. ANDROID SOFTWARE STACK

The Android platform is a software stack composed of four levels. The lowest level is the Linux kernel, responsible for task scheduling, device drivers, power management, resource access, and other low-level tasks. The second level comprises native libraries and the Android runtime. The Android runtime comprises core libraries and the Dalvik Virtual Machine (DVM). Dalvik's purpose is to provide a platform-independent programming environment that abstracts details of the underlying hardware and operating system. To do so, the Dalvik bytecode, called DEX, is interpreted for the target architecture during the execution of the application. In addition, the DVM includes a trace-based Just-In-Time (JIT) compiler to translate the bytecode of frequently used execution paths into native instructions, and the result of this translation is cached to avoid reinterpretation overhead. Moreover, the DVM was designed to run in memory-constrained environments and to allow multiple instances of the virtual machine, so every application runs in a private instance, which provides security, isolation, and effective memory management. The third level is the application framework, which provides high-level services in the form of Java classes accessible through the Java Development Kit (JDK). The top level is the applications layer, where all applications available in the Android device reside, using the resources provided by the layers below.

A developer has multiple options when developing an Android application, each with its strengths and weaknesses. The application may be developed purely in Java code, with a combination of Java and native code, or purely in native code. Java applications run on the Dalvik Virtual Machine and therefore have the portability and security provided by this virtual machine. However, the additional software layer (i.e., the virtual machine) may affect application performance.

The combination of Java and native code can be done through the Java Native Interface (JNI) framework or through Native Activities; the former provides an interface for native methods to be called from the Java side; the latter consists of whole Android activities implemented in native code, which can be used along with Java activities. Through the JNI framework, Java code can interact with C or C++ code by calling methods implemented in native code. This allows the reuse of legacy code and can be used to increase application performance in some situations. Nevertheless, JNI compromises the application's portability and security, since the native code must be compiled for each target architecture and no longer runs on the DVM. Furthermore, JNI introduces an overhead because of the context switches, which involve copying operands in memory between the Java and the native side.

Applications developed purely in native code, referred to as ELF applications in this work, are ordinary C or C++ applications compiled to run on an Android device. ELF applications have low portability and low execution overhead, as the code is executed directly by the processor and does not need to be interpreted by the DVM.

From a radiation reliability point of view, increasing the abstraction level may significantly modify the error rate of an application. Moving from one level of abstraction to a higher one may benefit device reliability, as some errors could be masked. In fact, not all failures occurring at the physical level actually propagate to the output of an application. Similarly, the Dalvik Virtual Machine may absorb some errors that would otherwise affect the application execution. Nevertheless, increasing the abstraction level requires additional resources that, if corrupted, may lead to errors or functional interruption.

III. SETUP

A. Device Under Test

The Device Under Test (DUT) is the Xilinx Zynq-7000 AP SoC, implemented in a 28 nm CMOS technology. The DUT features two ARM Cortex-A9 cores with a maximum frequency of 667 MHz. Each core has 32 KB Level 1 4-way set-associative instruction and data caches, and the cores share a 512 KB 8-way set-associative Level 2 cache [7]. During the experiments only one of the two available cores was used, caches were not protected, and the device's programmable logic was left unused.

B. Software Under Test

As a benchmark, we selected matrix multiplication since it is typically used in both safety-critical applications (e.g., filter and control operations) and user applications (e.g., media applications). One application execution was defined as 200 multiplications of 25 × 25 integer matrices to keep run-time and output throughput tractable. A greater workload would increase the probability of radiation-induced errors, eventually allowing more than one neutron to generate a failure in a single execution. As detailed in the next subsection, this must be avoided so that the experimentally observed error rate can be scaled down to the natural radiation environment. A smaller workload would impede the gathering of a statistically significant amount of data.

As shown in Algorithm 1, after DUT initialization the application starts: all matrices are initialized and, for each of the 200 sets of input matrices, A_i is multiplied by B_i; the resulting matrix is then compared to the expected result G_i and, if they differ, a failure flag is raised. After the 200 matrix multiplications are completed, errors are reported, and a new application execution is triggered.

ALGORITHM 1: Application under test.

    setup_caches();
    print_banner();
    while True do                      // Application start
        Fail ← False;
        for i ← 1 to 200 do            // Unrolled
            init(A_i, B_i, G_i);
        end
        for i ← 1 to 200 do            // Unrolled
            C ← A_i ∗ B_i;
            if C ≠ G_i then
                Fail ← True;
            end
        end
        print(Fail);                   // Application end
    end

In total, three variants of this application were produced:

Dalvik: The application was entirely implemented in the Java language, from which an Android Application Package (APK) was generated.

JNI: The application was implemented using the JNI framework. The main body of the application was implemented in the Java language, and the core of the application (i.e., the matrix multiplications) was implemented in the C language. The main body calls the core through the JNI framework. This variant was also packaged into an APK.

ELF: The application was implemented in the C language and compiled to a static ELF executable.

All variants run on the same system, namely Android 2.3.7 on top of the Linux kernel 3.6.0. The C language implementations were compiled with gcc 4.6.3 using the -O2 optimization level. After the system finishes booting and reaches a stable state, one of the three variants is started; the produced logs are collected and timestamped by a test manager application running on a host PC placed outside of the radiation chamber.

Table I shows the resulting execution time of a single application run for each variant and the speed-up of each version relative to the Dalvik variant, which is the slowest one. As expected, the software stack significantly impacts the execution time of the application.

TABLE I. EXECUTION TIME AND SPEED-UP RELATIVE TO THE DALVIK VARIANT FOR EACH VERSION.

Version   Execution Time (ms)   Speed-up
Dalvik    754.97                 1.00
JNI        43.14                17.50
ELF        32.16                23.48

To better understand the implementation differences between variants that lead to the execution time trend listed in Table I, and that may affect the device's radiation sensitivity, we profiled the system at run-time using ARM's gator daemon to collect periodic samples. For each version, samples were collected every 1 ms during a radiation-free session of 30 minutes while the application was continuously executing. Each sample records which code was being executed when the sample was collected. By analysing this information, we can infer which codes were executed the most, giving us insight into the behaviour of each variant and how the variants differ.

Table II lists the codes that appear in more than 1% of the samples for each variant. As can be seen, the ELF variant's execution time is dominated by the application itself (mm-binary). The JNI variant's execution time is dominated by the code in the shared library that implements the matrix multiplication core (mm-shared-lib), but a portion of the time is also spent in Android's C standard library (bionic). The Dalvik variant spent most of the time executing already translated code present in its JIT cache (dalvik-jit-cache); a significant portion of the time was also spent in the Dalvik virtual machine (dalvik-vm), especially in garbage collection and object allocation functions. Curiously, the Dalvik variant also spent a notable portion of the time in the idle process (idle), which suggests a memory bottleneck, as the CPU is repeatedly waiting in an idle state.

TABLE II. MOST EXECUTED CODES FOR EACH VERSION.

Dalvik                      JNI                       ELF
dalvik-jit-cache  62.42%    mm-shared-lib  97.17%     mm-binary  99.16%
dalvik-vm         20.90%    bionic          2.36%
idle              14.48%
Others             2.20%    Others          0.47%     Others      0.84%

C. Experimental Setup

Radiation experiments were performed at the ISIS facility in the Rutherford Appleton Laboratories (RAL) in Didcot, UK. ISIS provides a white neutron source that mimics the energy spectrum of the atmospheric neutron flux. The available neutron flux was approximately 5.5 × 10⁴ n/(cm²·s) for energies above 10 MeV. It is worth noting that, even though the neutron flux at ISIS is several orders of magnitude higher than the natural one (which is estimated to be about 13 n/(cm²·h) [8]), the test was tuned to make negligible the probability of more than one neutron generating a failure in a single code execution (observed error rates were lower than 1 × 10⁻² errors/execution). This allows the experimental data to be scaled to the natural radiation environment without introducing artificial behaviours. In fact, with the low atmospheric neutron flux, it is very unlikely for more than one neutron to generate failures in one application execution.

The beam was focused on a spot with a diameter of 2 inches, which provided uniform irradiation of the SoC without directly affecting nearby board power control circuitry and DRAM chips.

To reduce the uncertainty of the experimental results, three DUTs were irradiated in parallel (see Fig. 1). The three boards, all with the same hardware revision, were aligned with the beam and placed at 43, 45, and 47 inches from the source, respectively. A flux de-rating factor was calculated for each board to take into account beam degradation due to the distance from the source. To minimize the statistical error and to avoid biasing the experimental results toward a particular board and distance de-rating factor, the variants were executed alternately on all three devices. Each version was executed for more than 80 hours under the beam, receiving a total fluence of 2 × 10¹¹ n/cm². During our experiments the boards received the radiation equivalent of 1.7 × 10⁶ years of exposure in the natural environment.

[Fig. 1. Experimental setup mounted at ISIS.]

IV. EXPERIMENTAL RESULTS

Application output and system logs were continuously collected and timestamped while the DUT was exposed to radiation. This information allows us to correlate system events and observable failures, providing a deep analysis of system behaviour under radiation. Each application execution was classified as follows:

Correct: The application produced the output expected in a fault-free environment. No error was detected when comparing the result to a golden copy.

Silent Data Corruption (SDC): The application produced an output different from that of a fault-free environment. This category includes errors detected by comparing the output to a golden copy (mismatch) and cases in which the application produced garbage (garbage), e.g., because the communication channel was corrupted.

Functional Interruption (FI): The system functionality was interrupted. This category includes cases in which the kernel panicked (panic), the application died (app-death), Android's system server died (server-death), or the system hung (hang).

We report our results as cross-sections, which represent the sensitivity of the device to radiation. The cross-section is obtained experimentally by dividing the number of observed errors by the total particle fluence (i.e., the number of particles hitting the device per unit area) [8]. Note that the execution time is normalised when calculating the cross-section. Thus, the cross-section is not influenced by the execution time shown in Table I but only by the amount of resources required to complete execution in each variant.

Figure 2 shows the cross-section for SDCs and FIs and the overall device cross-section for each variant. It is clear that the software layer significantly influences the application's radiation sensitivity. The results can be further analysed to understand the observed differences.

[Fig. 2. Cross-section for each variant. Bar chart; y-axis: cross-section (cm²), 0 to 2.5e-08; groups: Overall (SDC + FI), SDC, FI; series: Dalvik, JNI, ELF.]

We can discriminate the effects of radiation-induced errors by dividing the cross-section into the different error types observed, as shown in Figure 3. Occurrences of garbage and server-death were very rare, which was expected as their code is rather small. Thus, we cannot draw any statistically significant conclusion about them, and they will not be discussed further. The hang cross-section remained unaltered between versions, suggesting independence from software: most likely, these failures originate from corruptions in the microarchitectural state of the processor, which is independent of the running application.

[Fig. 3. Comprehensive contributors to the overall cross-section of each variant. Bar chart; y-axis: cross-section (cm²), 0 to 1.8e-08; groups: mismatch, garbage, panic, app-death, server-death, hang; series: dalvik, jni, elf.]

We start our analysis by comparing the Dalvik and ELF variants. While the Dalvik mismatch cross-section is significantly smaller than that of ELF, the app-death cross-section is larger for Dalvik. We conjecture that these changes are closely related: the increase in app-death occurrences is most likely a symptom of typical programming assertions. In the Dalvik variant, part of the radiation-induced corruptions that cause mismatch occurrences in ELF are caught at run-time by assertions in the Android framework, causing the application to be killed. This means that the software stack would make the application less sensitive to SDCs but more sensitive to FIs. This behaviour may not be desirable in multimedia or non-critical user applications, in which SDCs can hardly be distinguished by the user while FIs significantly degrade the user experience. On the contrary, in applications in which data consistency is mandatory, like financial transactions, it may be better to have a detectable FI than an SDC.

Furthermore, panic occurrences were also reduced for the Dalvik variant; this indicates that the Dalvik variant causes Linux kernel data to be less susceptible to corruption. Besides the actual application, the Dalvik variant must also provide other functionalities, such as interpreting code not present in the JIT cache and providing garbage collection. As such, the Dalvik variant causes more cache conflicts than the ELF variant because the former uses significantly more memory to operate than the latter. An explanation for the reduction of panic occurrences is therefore that, in the Dalvik variant, kernel data, whose corruption causes panic occurrences, were present for less time in the cache memories, which have been shown to be the most sensitive parts of the system [9].

Although the ELF and JNI variants are very similar, the use of JNI had a significant influence on the mismatch cross-section. As Table II suggests, the execution time of these variants is dominated by the matrix multiplication core, which can be considered similar for both. The main difference between them is the use of a different standard C library: the ELF variant is statically linked against GNU libc 2.15, while the JNI variant dynamically loads bionic. Nonetheless, this difference does not seem to impact the execution time (the difference is about 2.36%). Even though execution time does not impact the cross-section, it indicates that the applications are indeed very similar. Further investigation will be necessary to fully justify the observed variation in the mismatch cross-section between ELF and JNI.

V. CONCLUSION AND FUTURE WORK

Our experimental results show that the software stack significantly impacts the reliability of an application and device. In particular, the Dalvik version presents the lowest failure rate, followed by the JNI version and then by the native version, suggesting that the higher the software layer, the higher the silent failure masking. As future work, we plan to verify whether other relevant applications exhibit the same behaviour and to extend the presented analysis to embedded parallel devices. Furthermore, we are interested in pinpointing the failure path of masked failures to determine whether a major masking point exists.

REFERENCES

[1] IDC. (2014, Nov.). [Online]. Available: http://www.idc.com/getdoc.jsp?containerId=prUS24314413
[2] I. Sommerville, Software Engineering, 9th ed. Harlow, England: Addison-Wesley, 2010.
[3] Arduino. (2014, Nov.). [Online]. Available: http://arduinio.cc/
[4] IDC. (2014, Nov.). [Online]. Available: http://www.idc.com/prodserv/smartphone-os-market-share.jsp
[5] Y. Chen, "Cosmic ray effects on personal entertainment applications for smartphones," in Radiation Effects Data Workshop (REDW), 2013.
[6] IATA. (2013) Air passenger market analysis. [Online]. Available: http://www.iata.org/publications/economics/Documents/passenger-analysis-dec2013.pdf
[7] Digilent. (2014) Zedboard Data Sheet Overview. [Online]. Available: http://www.xilinx.com/support/documentation/data_sheets/ds190-Zynq-7000-Overview.pdf
[8] JEDEC, "Test method for beam accelerated soft error rate," JESD89-3A.
[9] M. Manoochehri et al., "Cppc: Correctable parity protected cache," in Proceedings of the 38th Annual International Symposium on Computer Architecture, ser. ISCA '11. New York, NY, USA: ACM, 2011.
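
The control flow of Algorithm 1 (Section III-B) can be sketched in Python as follows. This is an illustrative sketch only, not the authors' Java or C code: the random input generation, the `seed` parameter, and the `corrupt_at` hook (which flips one bit of one result to stand in for a radiation-induced upset) are assumptions introduced here.

```python
import random

N = 25      # matrix dimension used in the paper
RUNS = 200  # multiplications per application execution


def matmul(a, b):
    """Plain triple-loop integer matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def init(seed, runs=RUNS, n=N):
    """Build the input pairs (A_i, B_i) and their golden results G_i.

    The random 8-bit inputs are an assumption; the paper does not state
    how the input matrices were chosen.
    """
    rng = random.Random(seed)
    sets = []
    for _ in range(runs):
        a = [[rng.randrange(256) for _ in range(n)] for _ in range(n)]
        b = [[rng.randrange(256) for _ in range(n)] for _ in range(n)]
        sets.append((a, b, matmul(a, b)))
    return sets


def run_once(sets, corrupt_at=None):
    """One application execution; returns True if any product mismatches.

    corrupt_at is a hypothetical fault-injection hook, not part of the
    original application.
    """
    fail = False
    for i, (a, b, golden) in enumerate(sets):
        c = matmul(a, b)
        if i == corrupt_at:
            c[0][0] ^= 1  # single-bit flip in the computed result
        if c != golden:
            fail = True
    return fail
```

Calling `run_once(init(0))` models a fault-free execution, while passing `corrupt_at=3` models an execution in which the fourth product is corrupted and the failure flag is raised.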
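
The outcome classification of Section IV maps the events logged during an execution's time window to one of three classes. A minimal sketch is given below, assuming the events arrive as a list of the labels used in the paper; the function name and the rule that an FI takes precedence over an SDC when both are seen are assumptions, since the paper does not state how overlapping events were resolved.

```python
# Event labels as named in Section IV.
SDC_EVENTS = {"mismatch", "garbage"}
FI_EVENTS = {"panic", "app-death", "server-death", "hang"}


def classify(events):
    """Return 'Correct', 'SDC', or 'FI' for one application execution."""
    seen = set(events)
    unknown = seen - SDC_EVENTS - FI_EVENTS
    if unknown:
        raise ValueError(f"unknown event(s): {sorted(unknown)}")
    if seen & FI_EVENTS:      # assumption: an interruption dominates
        return "FI"
    if seen & SDC_EVENTS:
        return "SDC"
    return "Correct"          # no failure event in the time window
```

For instance, `classify(["mismatch"])` yields `"SDC"` and `classify(["hang"])` yields `"FI"`.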
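
The cross-section definition (Section IV) and the scaling to the natural environment (Section III-C) reduce to a few lines of arithmetic. The sketch below uses the fluence and natural flux quoted in the paper; the function names are introduced here, and the error count in the usage note is hypothetical, since the paper reports cross-sections only graphically.

```python
HOURS_PER_YEAR = 24 * 365

TOTAL_FLUENCE = 2e11  # n/cm^2, total fluence per version (Section III-C)
NATURAL_FLUX = 13.0   # n/(cm^2 h), natural flux estimate cited from [8]


def cross_section(errors, fluence):
    """Cross-section in cm^2: observed errors divided by fluence (n/cm^2)."""
    return errors / fluence


def equivalent_natural_years(fluence, natural_flux=NATURAL_FLUX):
    """Years of natural exposure delivering the same fluence."""
    return fluence / natural_flux / HOURS_PER_YEAR


def natural_error_rate_per_hour(sigma, natural_flux=NATURAL_FLUX):
    """Expected errors per hour in the natural environment."""
    return sigma * natural_flux
```

With these inputs, `equivalent_natural_years(TOTAL_FLUENCE)` evaluates to about 1.76 × 10⁶ years, in line with the 1.7 × 10⁶ years quoted in Section III-C. A hypothetical count of 2,000 errors over the same fluence would give a cross-section of 1 × 10⁻⁸ cm².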
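
The in-flight failure estimate in Section I can be checked in the same way. This sketch assumes a constant failure rate, so that the expected number of failures is simply the total exposure time divided by the MTTF; the variable names are introduced here, and the inputs are the figures quoted in the paper.

```python
HOURS_PER_YEAR = 24 * 365

passengers_per_year = 3e9        # IATA estimate for 2013 [6]
hours_per_trip = 1.0             # conservative usage assumed in the paper
mttf_hours = 1 * HOURS_PER_YEAR  # MTTF of 1 year at flight altitude [5]

# Total device-hours of exposure at altitude per year.
device_hours = passengers_per_year * hours_per_trip

# Expected failures per year under a constant failure rate.
failures_per_year = device_hours / mttf_hours

# Fraction of one-hour trips that see an observable failure.
fraction_of_trips = failures_per_year / passengers_per_year

print(f"{failures_per_year:,.0f} failures/year, "
      f"{fraction_of_trips:.4%} of trips")
```

Under these assumptions the estimate comes to roughly 342,000 failures per year, consistent with the "about 350,000" quoted in the text, or about 0.011% of trips.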