Linköping University | Department of Computer and Information Science Master thesis, 30 ECTS | Datateknik 2019 | LIU-IDA/LITH-EX-A--19/074--SE

Taint analysis for automotive safety using the LLVM compiler infrastructure

Éléonore Goblé

Supervisor : Ulf Kargén Examiner : Nahid Shahmehri

Linköpings universitet, SE–581 83 Linköping, +46 13 28 10 00, www.liu.se

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Éléonore Goblé

Abstract

Software safety is getting more and more important in the automotive industry as mechanical functions are replaced by complex embedded computer systems. Errors during development can lead to accidents and threaten users’ lives. The safety level of the software must therefore be monitored through Automotive Safety Integrity Levels (ASILs), according to the standard ISO 26262. This thesis presents the development of a static taint analysis tool using the LLVM compiler infrastructure in order to identify safety-critical components and analyze their dependencies in automotive software. The aim was to provide a useful visualization tool to help safety engineers in their work and save time during development. It was concluded that this static taint analysis tool can facilitate and improve the precision of the ASIL decomposition of automotive software.

Acknowledgments

First and foremost, I would like to thank ARCCORE for giving me the opportunity to conduct this master thesis. In addition, I would like to thank my supervisor Daniels Umanovskis and my colleague John Tinnerholm for their valuable help. I would also like to thank all my colleagues at ARCCORE for their friendly welcome and their support. Furthermore, I would like to thank my supervisor Ulf Kargén and my examiner Nahid Shahmehri for providing me with valuable feedback. I would also like to thank my sister Morgane for proofreading my thesis. Finally, I would like to thank Linköping University and the University of Technology of Compiègne for giving me the possibility to carry out this double-degree project.

Éléonore Goblé

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Company
  1.2 Motivation
  1.3 Aim
  1.4 Research questions
  1.5 Delimitations
  1.6 Outline

2 Theory
  2.1 Automotive industry
  2.2 Functional safety
  2.3 Static Analysis
  2.4 Pointer and Alias Analysis
  2.5 LLVM
  2.6 Related Work
  2.7 Visualization
  2.8 Evaluation

3 Method
  3.1 LLVM
  3.2 Taint analysis
  3.3 Visualization
  3.4 Evaluation

4 Results
  4.1 LLVM
  4.2 Taint analysis
  4.3 Visualization
  4.4 Evaluation

5 Discussion
  5.1 Taint analysis
  5.2 Results
  5.3 Method
  5.4 Source criticism
  5.5 The work in a wider context

6 Conclusion
  6.1 Consequences
  6.2 Further work

Bibliography

List of Figures

1.1 Master thesis outline
2.1 Compilation process
3.1 An overview of the LLVM Value inheritance
3.2 UML Diagram, describing the architecture of the taint analysis pass
3.3 SafeValue and SafeInstruction classes
4.1 The list of tainted functions and global variables in each file
4.2 An example of the tree view, whose initiator is the variable safe
4.3 The alias view of the variable safe in the function testInterProcedural
4.4 Visualization tool overview
4.5 Which aspect has been used to find the ASIL rating of an object?
4.6 An overview of the result of the taint analysis pass on the project (real names have been modified)

List of Tables

3.1 Taint propagation policy
3.2 Linear scale questions
3.3 Tasks
3.4 Store test cases
3.5 Load address test case
3.6 Pointer parameter test cases
3.7 Global initialization test case
3.8 File test case
3.9 Call test case
3.10 Violation test case
4.1 Linear scale questions
4.2 LLVM IR metrics
4.3 Taint information
4.4 Taint analysis results
4.5 Results
4.6 Program execution time results

1 Introduction

The importance of safety in the automotive industry has significantly increased in recent years. Purely mechanical functions have been replaced by complex embedded computer systems, which require high levels of safety. In fact, errors during development can lead to accidents and threaten users’ lives. The safety level of the software must therefore be assessed and monitored. ISO 26262 [1] is an industry-specific standard for functional safety of road vehicles, similar to the broader standard IEC 61508 which defines Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems [2]. According to ISO 26262, the safety level of an application can be measured by Automotive Safety Integrity Levels (ASILs). This standard recommends separating safety-critical objects from non-hazardous objects in the memory.

1.1 Company

This master thesis was carried out in collaboration with ARCCORE AB [3], headquartered in Gothenburg, Sweden. ARCCORE is a fully-owned subsidiary of Vector Informatik GmbH, headquartered in Stuttgart, Germany. ARCCORE provides leading solutions for embedded systems development in the automotive industry. ARCCORE aims to develop its software in compliance with the automotive standard AUTOSAR [4].

1.2 Motivation

In the automotive industry, the embedded code supplier needs to provide guarantees to the Original Equipment Manufacturer (OEM) with regard to safety requirements. To establish that the software is safe, the company needs to analyze the code. Dynamic analysis techniques such as testing and verification are common ways to check software safety; however, these methods are tedious. The number of possible paths grows exponentially with the size of the program, so testing only provides a “partial verification”, according to Silva et al. [5].

Hardware protection can also be developed to ensure safety. AUTOSAR [4] defines a standard for the architecture of Electronic Control Units (ECUs) and recommends functional measures for safety-relevant systems. In embedded systems, a hardware Memory Protection Unit (MPU) [6] allows memory protection by defining access rights to different parts of memory. In a safety-critical system, the MPU can be used to partition the memory and prevent unsafe components from writing into the safe memory during run-time [7].

Static analysis consists of analyzing the source code before executing it, and thus enables engineers to prove code safety. Static analysis could be used to identify the components to be placed in the safe partition. Static analysis can be combined with dynamic analysis to improve the efficiency of the analysis [8]. However, developing a sound static analyzer is expensive in terms of complexity.

Moreover, safe components which have a higher ASIL need “Freedom from interference” (FFI) [9] from lower level components, which ensures that “a fault in a less safety critical software component will not lead to a fault in a more safety critical component”, according to Leitner-Fischer et al. [7]. Nevertheless, monitoring the safety of the entire software can be costly, according to Azevedo et al. [10]. For a developer of automotive software, it is desirable to limit the number of ASIL components. In fact, such components have to be developed according to additional requirements imposed by ISO 26262, which significantly increases the effort during the implementation and testing phases. The goal is to reduce the volume of code subject to high safety levels as much as possible, in order to be able to study these slices precisely and to limit the risks.

Currently, a manual code inspection is performed in order to identify the dependencies related to the variables used in safety-critical modules. The challenge is to develop software that automatically identifies dependencies between the safe objects of the program, and thus gives safety engineers a basis for partitioning the memory.

Taint analysis [11] consists of detecting data coming from untrusted sources and propagating the taint to the variables related to this data. Taint analysis can be used to identify data which can influence safety-critical components.

The Low Level Virtual Machine (LLVM) [12] is a compiler infrastructure composed of a set of libraries and reusable objects. LLVM provides several modules for compiler construction, which can be used for static code analysis. The Clang compiler utilizes LLVM in order to transform C code into LLVM IR, which is an intermediate representation. This representation facilitates the analysis of the relations between variables. LLVM also provides the LLVM Pass Framework [13], which makes it possible to develop an “LLVM analysis pass”, a plugin built on top of LLVM to analyze source code.

1.3 Aim

The aim of the thesis is to develop a static analysis tool, composed of a taint analysis pass based on existing static analyzers such as LLVM analysis modules and the LLVM Pass Framework, and a visualization tool to present the results. Through this, this thesis aims at examining how taint analysis can be used to ensure embedded systems safety. This could be done by analyzing C code and generating the dependencies related to safety-critical components. The output of the program should be easily understandable for the safety architects, which means it should be easy and quick to learn how to use the tool, and the output should be precise enough to provide them with additional information for their work.

1.4 Research questions

The first meetings and discussions made it possible to highlight the most important aspects of the thesis and to raise the following questions:

1. Is LLVM suitable to perform static analysis on automotive software? The first task is to study the possibility of implementing a new module on top of LLVM.

2. How can static taint analysis be used to track dependencies related to safe components in automotive software? This thesis aims at studying the best method to implement a static taint analyzer for automotive software. This analyzer should efficiently identify the components which can influence variables marked as safe.

3. How can results be represented in an understandable way so that engineers can improve the safety development process? This thesis aims at generating an understandable output which focuses on the most important and relevant information and presents useful data for safety engineers. One task is thus to study the best way to represent the dependencies between safe and unsafe components.

4. Is the taint analysis accuracy sufficient for the application? How does taint analysis visualization affect the usefulness of the output? The results of the tool can also be compared to manual analysis results performed on existing projects to evaluate the accuracy. The output visualization can be submitted to safety engineers, so that they can evaluate the usefulness of the result.

1.5 Delimitations

This thesis only aims at analyzing dependencies from safety components provided by the user. Thus, the thesis does not provide identification of the initial components considered as safe. This thesis aims at developing a standalone tool, so the integration of the tool into the development process is not included. Moreover, this tool should be compatible with LLVM-5 and should work on Windows, according to the company's technical configuration. Finally, this thesis aims at analyzing embedded code for the automotive industry which follows the rules described in the MISRA C Guidelines [14].

1.6 Outline

Figure 1.1 illustrates the outline of the master thesis and highlights the main steps of the study. First, a pre-study was conducted in order to define the subject and plan the thesis work. The Introduction (Chapter 1) and the research questions were written following this. Literature and technical research was then done in order to write the Theory (Chapter 2) and to start the development phase. This research was useful to design the architecture of the taint analysis LLVM pass, based on the LLVM Pass Framework, presented in the Method (Chapter 3). Then, the development of the taint analysis LLVM pass and the visualization tool was done iteratively. The main functionalities of the taint analysis pass were tested. A qualitative study was performed on the visualization tool, and the taint analysis pass was tested on a real project from the company in order to evaluate its accuracy. The Results (Chapter 4) present the results of the evaluation and the static taint analyzer composed of the taint analysis pass and the visualization tool. The Discussion (Chapter 5) presents feedback and improvements made following the different studies. Finally, the Conclusion (Chapter 6) summarizes the results of the master thesis and suggests further work.

[Figure: flowchart of the thesis work. Defining the subject and planning lead to the Introduction and the research questions; literature research (taint analysis, automotive systems) and technical research (LLVM, C++) lead to the Theory; the architecture of the taint analysis pass and the visualization lead to the Method; iterative development, testing, the qualitative study and the accuracy evaluation lead to the Results (the static taint analyzer); feedback and improvements lead to the Discussion and the Conclusion.]

Figure 1.1: Master thesis outline

2 Theory

This chapter presents the background and the related work relevant to this thesis. First, section 2.1 presents software development in the automotive industry. Then, section 2.2 defines functional safety standards and concepts. A review of the different types of static analysis is provided in section 2.3. A brief explanation of pointer analysis is given in section 2.4, and an overview of LLVM is provided in section 2.5. Section 2.6 presents existing studies related to this topic. Finally, section 2.7 introduces software visualization and section 2.8 reviews methods to evaluate software usability and accuracy in the context of static analyzers.

2.1 Automotive industry

The automotive industry deals with safety-critical systems whose malfunctions could lead to serious consequences, including injury to people, environmental issues and large losses of money [5]. Vehicles are increasingly automated and use many embedded computer systems [15]. These systems require more and more checks to ensure the safety of vehicle passengers.

Automotive systems architecture

Automotive systems are divided into a physical hardware part, such as Electronic Control Units (ECUs), and a software part [16]. ECUs are embedded systems composed of “a microcontroller and a set of sensors” [15], aiming at controlling an electrical system in a vehicle through an embedded software. These systems need to implement protection methods to ensure safety, both at hardware and software levels. The Memory Protection Unit (MPU) [6] is a hardware protection in ECUs aiming at restricting the access to the safe partition during run-time. A memory access violation generates an exception that terminates program execution. This thesis focuses on software methods to ensure embedded systems safety.

Embedded software development

According to Freund [17], embedded software involves many constraints such as “real-time scheduling, reliability and production requirements”, which influence software development methods. Embedded software is usually developed in C because this language has been used in critical systems for a long time, and efficient machine code can be generated from C programs [14].

MISRA C Guidelines [14] provide “a subset of the C language” which is intended to reduce the possibility of making mistakes during development. This is done by removing C language constructs which could lead to undefined behaviour, misuse or misunderstanding. These guidelines are recommended for the development of embedded applications and safety-related systems.

2.2 Functional safety

Functional safety aims at detecting hazardous situations and applying preventive solutions. These solutions should prevent systematic or hardware failures from having serious consequences [2]. Therefore, standards have been developed to assess functional safety and to provide common methods to solve these issues.

Functional safety standards

The automotive industry is regulated by several standards which aim at standardizing product development. IEC 61508 defines Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems [2]. ISO 26262 [1] is adapted from IEC 61508 and deals with functional safety of road vehicles.

AUTOSAR (AUTOmotive Open System Architecture) [4] is an automotive standard for the software architecture of ECUs. This standard recommends measures and mechanisms to improve the development of safety-related software, such as memory partitioning [6]. Unsafe applications are run in user mode whereas safe applications are run in supervisor mode, in order to access the MPU without restriction.

Automotive Safety Integrity Levels (ASILs)

Part 9 of ISO 26262 defines “ASIL-oriented and safety-oriented analyses” in order to decompose the software into safety-related components and non-safety-related components. Automotive Safety Integrity Levels (ASILs) have been developed to check the safety level of an embedded system. Therefore, respecting ASILs aims at convincing the manufacturers that the products meet safety requirements. In order to develop ASIL software, designers must identify safety-critical components whose malfunctions could lead to serious issues [10], such as the brake system. Therefore, risks related to hazardous situations are defined and classified into four different levels (ASIL A, ASIL B, ASIL C, ASIL D) according to their severity, probability of exposure, and controllability [1]. ASIL components must be monitored through safety measures and require more development effort [10]. Components which do not require specific safety measures are identified as Quality Management (QM).

Freedom from interference

Freedom from interference (FFI) is defined by ISO 26262 Part 9 Section 6.2 [1] as the absence of “cascading failures” from a lower ASIL element to a higher ASIL element. This means that components with a lower ASIL should not influence components with a higher ASIL. This should prevent an error that happens in an unsafe module from propagating to a safety-critical module [7].

Therefore, ASIL components should be separated from QM components inside the memory. ASIL components should be placed in the memory partition protected by the Memory Protection Unit (MPU) [6].

Finally, static code analysis can be performed in order to identify the components related to safety-critical modules.
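The freedom-from-interference rule can be illustrated with a minimal sketch. It assumes that each component carries one ASIL and that a list of "influences" pairs is already known; the component names, the `ASIL_ORDER` table and the function `find_ffi_violations` are hypothetical and are not part of the tool described in this thesis.

```python
# Hypothetical ordering of safety levels: QM is the lowest, ASIL D the highest.
ASIL_ORDER = {"QM": 0, "A": 1, "B": 2, "C": 3, "D": 4}


def find_ffi_violations(levels, influences):
    """Return the (source, target) pairs where a lower-level component
    can influence a higher-level one.

    levels:     dict mapping a component name to "QM", "A", "B", "C" or "D"
    influences: iterable of (source, target) pairs, meaning that `source`
                writes to or otherwise affects `target`
    """
    violations = []
    for source, target in influences:
        if ASIL_ORDER[levels[source]] < ASIL_ORDER[levels[target]]:
            violations.append((source, target))
    return violations


# Illustrative component names only.
levels = {"brake_ctrl": "D", "speed_filter": "B", "infotainment": "QM"}
influences = [
    ("brake_ctrl", "speed_filter"),   # higher level -> lower level: allowed
    ("infotainment", "brake_ctrl"),   # QM -> ASIL D: interference
    ("speed_filter", "brake_ctrl"),   # ASIL B -> ASIL D: interference
]
violations = find_ffi_violations(levels, influences)
```

In this sketch only the direction from a lower level to a higher level is reported, which mirrors the definition of cascading failures quoted above.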


2.3 Static Analysis

Static analysis refers to the analysis of a program without running it [18]. Contrary to dynamic analysis, which is performed on programs during run-time [11], static analysis can be performed directly on source code or on intermediate code, for example on the LLVM intermediate representation (IR) [19].

Although dynamic analysis is popular, this method has some limitations. One execution path is generated for each input set, and one path is tested for a program at a time. Thus, achieving a high percentage of code coverage is challenging when the number of paths increases, and dynamic methods can “encounter [...] paths explosion problems”, according to Feng and Zhang [20]. Dynamic testing tends to provide only “partial verification” according to Silva et al. [5]: some paths can be missed and inaccurate results can be provided. Static analysis makes it possible to simulate all the execution paths of the program during compile-time, which is called symbolic execution, according to Liang et al. [21].

However, static analysis tools are not always fully reliable [11]. They provide either over-approximation or under-approximation. These tools can be incomplete, and produce false positives (an error is reported where there is none), or unsound, and produce false negatives (an error is not reported), depending on the chosen approximation method. According to Mock et al. [22], if the static analysis method is too precise, then the complexity of the algorithm can become a limit when running the analysis on large programs.

Analysis methods

Static analysis can be performed by applying formal methods, that is to say, by analyzing the source code mathematically in order to prove some results. According to P. Cousot and R. Cousot [23], abstract interpretation approximates the possible values of variables using abstract sets, which turns infinite spaces into finite ones. For example, as far as the sign of a variable’s value is concerned, the set of integers can be abstracted to the set {(+), (−), (0)}. Another technique is deductive verification, which aims at proving the algorithm by dividing it into a list of mathematical proof obligations, according to Silva et al. [5]. Furthermore, symbolic execution consists of simulating the execution of the program during compile-time, according to Liang et al. [21].

Static analysis can also be based on compiler technology. According to Arroyo et al. [11], modern compilers enable developers to build upon their structural elements, such as Abstract Syntax Trees (ASTs), Control Flow Graphs (CFGs) and Call Graphs (CGs), in order to perform data and control flow analysis. Data-flow analysis consists of analyzing the operations performed on a data set, whereas control-flow analysis is used to study the flow of tasks and the structure of the program.
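The sign abstraction mentioned above can be sketched in a few lines. This is an illustration only: the extra element "?" (sign unknown) is an assumption added here, because the sum of a positive and a negative value has no determined sign, and the function names `alpha`, `abs_mul` and `abs_add` are hypothetical.

```python
def alpha(n):
    """Abstraction function: map a concrete integer to its sign."""
    if n > 0:
        return "+"
    if n < 0:
        return "-"
    return "0"


def abs_mul(a, b):
    """Abstract multiplication over the signs "+", "-", "0" and "?"."""
    if a == "0" or b == "0":
        return "0"            # zero times anything is zero
    if "?" in (a, b):
        return "?"            # unknown sign stays unknown
    return "+" if a == b else "-"


def abs_add(a, b):
    """Abstract addition over the signs "+", "-", "0" and "?"."""
    if a == "0":
        return b
    if b == "0":
        return a
    if a == b:
        return a              # (+) + (+) = (+), (-) + (-) = (-)
    return "?"                # e.g. (+) + (-): the sign cannot be decided
```

The abstract operations never look at concrete values, yet their result always covers the concrete one: for any integers x and y, `abs_mul(alpha(x), alpha(y))` equals `alpha(x * y)`, and `abs_add(alpha(x), alpha(y))` is either `alpha(x + y)` or the over-approximation "?".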

Taint Analysis

The field of static analysis developed in this thesis work is taint analysis. According to Arroyo et al. [11], taint analysis is based on information flow and “non-interference”: information flow analysis is used to check that tainted information does not interfere with information which should not be tainted.

Usually, in software security, taint analysis consists of marking data coming from untrusted sources, such as user input, as unsafe, because external data is always a security risk [11]. As far as software safety is concerned, unsafe data does not necessarily come from the user, but also from unsafe modules. Then, taint analysis can be used to track the unsafe variables which can influence the safety-critical components. In the context of this thesis, tainted data is classified into different safety levels. A lower ASIL data should not influence a higher ASIL data, otherwise both data should be tainted with the higher ASIL level.


Taint analysis is usually divided into three phases [20]. The first one is taint information, which aims at tainting the initiators (source objects). The second phase is taint propagation, which aims at broadcasting the taint to all the other objects in relation with the initiators. The last phase is taint checking, which consists of checking whether an object that has been tainted is one that should not be tainted, in order to detect unauthorized behavior. According to Schwarz et al. [24], the taint policy should define how the new objects are tainted, which operations propagate the taint, and how the taint is checked at the end.
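The three phases can be sketched with a small worklist algorithm. This is a simplified illustration under stated assumptions, not the pass developed in this thesis: dependencies are given as a hypothetical dictionary `influenced_by` (object to the objects that can influence it), taint is an ASIL level, and the higher level wins when two taints meet.

```python
from collections import deque

ASIL_ORDER = ["QM", "A", "B", "C", "D"]  # lowest to highest


def rank(level):
    return ASIL_ORDER.index(level)


def taint_analysis(initiators, influenced_by, must_stay_untainted):
    """Run the three phases on a dependency graph.

    initiators:          dict object -> ASIL level chosen by the user
    influenced_by:       dict object -> objects that can influence it
    must_stay_untainted: set of objects that are not allowed to be tainted
    """
    # Phase 1, taint information: taint the initiators.
    taint = dict(initiators)
    worklist = deque(initiators)

    # Phase 2, taint propagation: spread the taint to related objects,
    # keeping the highest level seen so far for each object.
    while worklist:
        obj = worklist.popleft()
        for dep in influenced_by.get(obj, ()):
            if dep not in taint or rank(taint[dep]) < rank(taint[obj]):
                taint[dep] = taint[obj]
                worklist.append(dep)

    # Phase 3, taint checking: report tainted objects that should be clean.
    violations = sorted(obj for obj in taint if obj in must_stay_untainted)
    return taint, violations


# Illustrative object names only.
taint, violations = taint_analysis(
    initiators={"brake_cmd": "D", "lamp_state": "A"},
    influenced_by={
        "brake_cmd": ["pedal_pos", "gain"],
        "lamp_state": ["gain"],
        "pedal_pos": ["adc_raw"],
        "gain": ["debug_buf"],
    },
    must_stay_untainted={"debug_buf"},
)
```

Here `gain` influences both initiators and ends up with level D, the higher of the two. The loop terminates because a level can only increase and the set of levels is finite.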

2.4 Pointer and Alias Analysis

During taint analysis, propagating operations need to be identified. As far as the C language is concerned, the main challenge is pointer and alias analysis. According to Avots et al. [25], C is an unsafe language and is difficult to analyze. In fact, operations can be performed on pointers, and pointers can point to stack objects, heap objects or functions. There are also multi-level pointers. All of this increases the complexity of the analysis, according to Andersen [26].

Thus, a sound pointer analysis is really hard to achieve. A pointer analyzer must make compromises to obtain readable and reasonable results. Therefore, different properties can be used to identify the level of precision needed for the pointer analysis. According to Hind [27], this level should be in line with the customer’s needs.

Andersen presents in his Ph.D. thesis [26] a pointer analysis for the C language based on subset constraints. This analysis is inter-procedural, which means that the relationships between the functions are taken into account. Steensgaard [28] presents another inter-procedural pointer analysis, which is based on equality constraints.

Definitions

Andersen [26] defines two fundamental concepts regarding pointer analysis: “alias pair” and “point-to information”.

Alias pair: if p = &x is an assignment, then *p is aliased with x. The alias pair is written ⟨*p, x⟩. “When the lvalue of two objects coincides, the objects are said to be aliased” [26].

Point-to information: if p = &x and p = &y are two assignments, then the point-to information of p is the set {x, y}, and is written p ↦ {x, y}. Point-to information denotes “the set of objects a pointer may point to” [26].
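Both definitions can be made concrete with a small fixed-point computation in the spirit of a subset-based (Andersen-style) analysis. This is a toy sketch, not Andersen's algorithm as published: the statement encoding ("addr", "copy", "load", "store") and the names `solve_points_to` and `alias_pairs` are assumptions made for the illustration.

```python
def solve_points_to(statements):
    """Compute flow-insensitive points-to sets by iterating to a fixed point.

    Each statement is a triple (kind, lhs, rhs):
      ("addr",  p, x)  models  p = &x
      ("copy",  p, q)  models  p = q
      ("load",  p, q)  models  p = *q
      ("store", p, q)  models  *p = q
    """
    pts = {}

    def include(dst, objs):
        """Add objs to pts[dst] and report whether the set grew."""
        before = len(pts.setdefault(dst, set()))
        pts[dst] |= objs
        return len(pts[dst]) != before

    changed = True
    while changed:
        changed = False
        for kind, lhs, rhs in statements:
            if kind == "addr":
                changed |= include(lhs, {rhs})
            elif kind == "copy":
                changed |= include(lhs, set(pts.get(rhs, ())))
            elif kind == "load":
                for obj in list(pts.get(rhs, ())):
                    changed |= include(lhs, set(pts.get(obj, ())))
            elif kind == "store":
                for obj in list(pts.get(lhs, ())):
                    changed |= include(obj, set(pts.get(rhs, ())))
    return pts


def alias_pairs(pts):
    """Derive the alias pairs <*p, x> from the points-to sets."""
    return {("*" + p, x) for p, objs in pts.items() for x in objs}


pts = solve_points_to([
    ("addr", "p", "x"),    # p = &x
    ("addr", "p", "y"),    # p = &y
    ("copy", "q", "p"),    # q = p
    ("addr", "pp", "p"),   # pp = &p
    ("load", "r", "pp"),   # r = *pp
])
```

For this input, `pts["p"]` is {x, y}, which matches the notation p ↦ {x, y}, and `alias_pairs(pts)` contains ⟨*p, x⟩ and ⟨*p, y⟩.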

Properties

Pointer analysis properties aim at defining the level of precision needed by the application.

Field-sensitivity: Field-sensitivity deals with aggregate data types such as structures and arrays. A field-sensitive analysis studies each field of each structure separately, whereas a field-insensitive analysis considers each access to aggregate data as an access to the whole structure [29].

Intra-procedural or inter-procedural: The intra-procedural pointer analysis performs data-flow analysis only inside functions. This is much easier than inter-procedural analysis, which performs a pointer analysis considering the interaction between functions. Inter-procedural analysis consists of analyzing each function call separately [26].


Flow-sensitivity: Flow-sensitive analysis takes the execution order of the program, called control-flow, into consideration. This analysis is more precise because it can detect a dependency at a given line in the source code, which is also called program-point specific analysis. Contrary to flow-sensitive analysis, flow-insensitive analysis can only summarize the dependencies between pointers in the whole program. Pointers which are aliases only at a given moment of the program are referred to as “may-alias” [26].
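The difference between the two approaches can be shown on straight-line code. The sketch below assumes a program made only of assignments of the form p = &x, given as (pointer, target) pairs; the function names are hypothetical and branches, loops and function calls are deliberately left out.

```python
def flow_insensitive(assignments):
    """One summary for the whole program: every target is kept."""
    pts = {}
    for ptr, target in assignments:
        pts.setdefault(ptr, set()).add(target)
    return pts


def flow_sensitive(assignments):
    """One points-to state after each statement, in execution order.

    On straight-line code a new assignment overwrites the previous
    target of the pointer instead of being added to it.
    """
    state = {}
    states = []
    for ptr, target in assignments:
        state = dict(state)        # copy the state of the previous point
        state[ptr] = {target}
        states.append(state)
    return states


program = [("p", "x"), ("p", "y")]     # p = &x; p = &y;
summary = flow_insensitive(program)    # {"p": {"x", "y"}}
per_point = flow_sensitive(program)    # [{"p": {"x"}}, {"p": {"y"}}]
```

The flow-insensitive summary reports that p may point to x or y anywhere in the program, whereas the flow-sensitive result shows that p points to x only between the two statements.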

2.5 LLVM

The Low Level Virtual Machine (LLVM) Project [12] is a compiler framework developed at the University of Illinois. This framework is composed of “modular and reusable compiler and toolchain technologies” [30]. LLVM aims at being a long-term code analysis and optimization system by providing built-in optimization and analysis passes, and the possibility to develop new passes.

Compilation

The compilation is usually divided into three phases [Fig. 2.1]. First, a static compiler front-end, such as Clang, parses the source code and translates it into LLVM intermediate representation (IR). Then, LLVM modules analyze LLVM IR to optimize the code, and finally machine code compatible with the chosen platform is generated.

Figure 2.1: Compilation process [31]

LLVM Intermediate representation (IR)

LLVM IR [19] is an intermediate representation used during compilation. It provides “a human readable assembly language representation” (.ll) and a binary representation called “bitcode” (.bc) which can be executed and on which optimizations are performed.

LLVM IR is a “language independent type-system”, which uses common low-level primitives to implement complex high-level functions. Its architecture is a “load/store architecture”: all the accesses to the memory are done using load (read from memory) or store (write to memory) instructions [32]. It means that all more complex operations which require an access to the memory will be divided into load and store instructions.

LLVM bitcode files can be linked together into one single file thanks to the LLVM linker [33], which aims at resolving the definition of functions and variables declared in different files.
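The load/store principle can be illustrated by a toy lowering of a C-like statement `dest = lhs op rhs` on global integers. The helper `lower_assignment` is hypothetical, and the strings it produces are hand-written textual IR in the typed-pointer syntax of older LLVM releases such as LLVM 5 (alignment attributes omitted); the exact output of Clang for a given program may differ.

```python
def lower_assignment(dest, lhs, op, rhs):
    """Lower `dest = lhs op rhs` on i32 globals into load/store form.

    Each memory operand is first loaded into a temporary, the operation
    works on temporaries only, and the result is stored back to memory.
    """
    return [
        f"%t1 = load i32, i32* @{lhs}",
        f"%t2 = load i32, i32* @{rhs}",
        f"%t3 = {op} i32 %t1, %t2",
        f"store i32 %t3, i32* @{dest}",
    ]


# The C statement `total = a + b;` on three global ints.
lines = lower_assignment("total", "a", "add", "b")
```

The single source statement becomes two loads, one arithmetic instruction and one store, which is why an analysis on LLVM IR mostly has to reason about load and store instructions when it tracks memory.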

Static Single Assignment (SSA)

LLVM IR is a “Static Single Assignment (SSA)” [34] based language: each new assignment of a value to a variable results in a new version of the variable being created. Data-flow analysis is facilitated by the SSA representation, which expresses a variable as a function of its previous versions. According to Braun et al. [35], SSA form aims at improving the efficiency of the analysis by “compactly representing use-def chains”. A use-def chain is a data structure composed of an instruction (use) of a variable, and all the possible definitions of this variable. The def-use information is the list of all the instructions which use a given variable.

LLVM SSA is built according to Cytron et al.’s algorithm [34]. This algorithm first identifies the different definitions of the variable. Then, if there are concurrent definitions, due to an if-statement for example, the multiple definitions are merged (in SSA form, by a φ-function placed where the control-flow paths join) and propagated. Finally, the new definition of the variable replaces the old variable in its different uses.
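The renaming idea behind SSA can be sketched for straight-line code. This is a simplification and not Cytron et al.'s algorithm: there are no branches, so no φ-functions are needed, and the statement encoding (target, list of operands) and the name `to_ssa` are assumptions made for the illustration.

```python
def to_ssa(statements):
    """Rename straight-line assignments into SSA form.

    statements: list of (target, operands), e.g. ("x", ["a", "b"])
                for x = a + b.
    Returns the renamed statements and the def-use information, which
    maps each SSA name to the SSA names of the instructions using it.
    """
    version = {}     # current version number of each source variable
    ssa = []
    def_use = {}
    for target, operands in statements:
        # Uses refer to the latest version; inputs keep their plain name.
        used = [f"{v}.{version[v]}" if v in version else v for v in operands]
        version[target] = version.get(target, 0) + 1
        name = f"{target}.{version[target]}"
        ssa.append((name, used))
        def_use[name] = []
        for u in used:
            def_use.setdefault(u, []).append(name)
    return ssa, def_use


# x = a + b;  x = x + 1;  y = x;   (constants are left out of the operands)
ssa, def_use = to_ssa([("x", ["a", "b"]), ("x", ["x"]), ("y", ["x"])])
```

Each assignment to x creates a new version (`x.1`, `x.2`), so every name has exactly one definition, and `def_use` directly answers which instructions depend on a given definition: `x.1` is used by `x.2`, and `x.2` is used by `y.1`.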

LLVM Pass Framework

The LLVM project provides an LLVM Pass Framework [13]. An LLVM pass can be used to transform, analyze and optimize source code. New LLVM passes can also be developed in C++. Several types of passes are available, which enable the analysis of the source code on different scales, such as modules, functions or basic blocks.

2.6 Related Work

Clang static analyzer

Clang static analyzer is an open-source tool, part of the Clang and the LLVM projects [36]. The formal analysis is based on symbolic execution: a core engine simulates the different execution paths of the program, while the constraint manager checks if the path is satisfiable. The algorithm is path sensitive, so all the possible paths are explored.

Arroyo et al. [11] developed a “user configurable static analyzer taint checker” plugin for Clang static analyzer, which aims at checking the propagation of tainted data in C, C++ and Objective-C programs. Their tool provides a configuration file so that users can define the sources, propagators, sinks and filters of the taint analysis. Sinks are defined as “critical functions” which should not be influenced by tainted data. Filters are sanitizers which can generate safe data from tainted data. This tool can be used to detect security flaws which could be triggered by malicious user inputs.

Sparse Value Flow (SVF)

SVF (Sparse Value-Flow) [37] is an open-source static analysis tool developed at the School of Computer Science and Engineering, UNSW, Australia. It is implemented on top of LLVM and aims at analyzing inter-procedural pointer dependencies for C and C++ programs. The tool resolves both data and control flow dependencies, thus enabling a more precise analysis. The value-flow construction module, based on Andersen’s points-to information, generates an “inter-procedural memory SSA” [37] representation, providing def-use chains for pointers, whereas LLVM only provides an intra-procedural memory dependence analysis pass, according to Sui et al. [37]. The inter-procedural analysis is performed sparsely, that is to say, by first over-approximately computing def-use chains and then eliminating unnecessary propagation, thus refining the data-flow analysis. SVF can be used to detect bugs involving value-flow reachability, such as memory leaks. SABER [38] is a memory leak detector developed on top of SVF. SVF can also be used to implement “scalable and precise pointer analyses” [38].

Frama-C

The Frama-C platform [39] is an open-source static analysis tool, which aims at performing safety verification on industrial C code. This tool is supposed to be correct, which means that it over-approximates the program behaviour in order to guarantee that no error remains undetected. Frama-C uses abstract interpretation, deductive verification and concolic testing, which is a form of dynamic symbolic execution, to prove the assertions. Frama-C is developed in the OCaml language and aims at being an extensible platform, composed of several plugins which enable more sophisticated approaches. The Frama-C Evolved Value Analysis plugin aims at identifying the set of possible values of a variable at a given moment of the execution. Frama-C also provides the possibility to slice the program in order to simplify it, and to navigate the use-def chains. Thus, Frama-C can be used to verify that the source code respects its specifications, which can be expressed as ACSL (a formal specification language) annotations. However, Frama-C does not currently provide a taint analysis plugin, although it is possible to compute the dependencies between variables.

Assisted Assignment of Automotive Safety Requirements

Azevedo et al. [10] have developed a tool aiming at automating ASIL allocation and decomposition during the design phase. According to ISO 26262 Part 9.5 [1], if several independent safety requirements are responsible for the ASIL rating of a common element, then it is possible to assign a lower ASIL to these requirements. For example, if an element is rated ASIL D because of two ASIL D sub-elements, then these two sub-elements can be decomposed into two ASIL B requirements, since two ASIL B sub-elements are equivalent to an ASIL D element. This is done by associating an integer with each ASIL rating (i.e. A=1, B=2, C=3, D=4), so that B + B = 2 + 2 = 4 = D. In order to compute the ASIL allocation, this tool first generates the fault trees using an existing safety analyzer and design optimizer called HiP-HOPS (hierarchically performed hazard origin and propagation studies) [40]. Then, ASIL decomposition is computed by running a constraint solving algorithm on the “minimal cut set” [10], which refers to the smallest set of events that causes an element to be marked as ASIL. This tool can be used to reduce development costs by limiting the number of high-ASIL elements.
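The integer encoding makes the decomposition rule easy to check mechanically. The Python sketch below assumes the simple additive reading described above (the values of the decomposed requirements must add up to at least the value of the original rating). The function name `is_valid_decomposition` and the mapping of QM to 0 are assumptions of this example, not part of the tool described above.

```python
# Integer encoding from the text; QM = 0 is an assumption of this sketch.
ASIL_VALUE = {"QM": 0, "A": 1, "B": 2, "C": 3, "D": 4}

def is_valid_decomposition(original, parts):
    """Return True if the ratings in `parts` add up to at least `original`."""
    return sum(ASIL_VALUE[p] for p in parts) >= ASIL_VALUE[original]

print(is_valid_decomposition("D", ["B", "B"]))  # 2 + 2 compared with 4
print(is_valid_decomposition("D", ["A", "B"]))  # 1 + 2 compared with 4
```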

2.7 Visualization

Software visualization refers to the visual representation of software components [41]. The challenge related to software visualization is to provide understandable and useful information for developers so that they can work more effectively [42]. In fact, software visualization aims at reducing the effort spent by developers on development and maintenance tasks [43]. According to Shahin et al.’s systematic review [41], the most used visualization technique is graph-based visualization.

Graph representation

When static analysis is used to examine relations between objects in the source code, a graph representation can be a suitable solution. In fact, graphs can be used to represent these relationships graphically, nodes being objects and edges being relations. Some graphs are commonly used in static analysis, such as call graphs, program dependence graphs and control-flow graphs. Call graphs display the calling relationship between functions, nodes being functions and edges being calls. Program dependence graphs are used to show the dependencies between variables, nodes being statements or values, and edges being relations between them. Control-flow graphs present the different execution paths of a program [42], nodes being instructions and edges being instruction jumps.

The SVF tool [37] can generate a value-flow graph in order to display program dependencies. Different kinds of nodes exist and are highlighted by different colors in order to identify them. The dependencies between elements are represented by edges.

As far as graph representation is concerned, the developer should be able to easily find the useful information and to understand the relationships between objects. Providing interactive features enables the user to hide information which is not currently important and to expand useful details [42]. Information visualization can be facilitated by navigation interactions such as zooming, moving or expanding nodes.

The key issues related to graph representation are due to the information layout. To make a graph easy to read and understand, information should be organized clearly and follow specific rules, according to Herman et al. [44]. Graph drawing also has aesthetic and practical rules, such as equal space distribution between nodes. Moreover, edge crossing should be avoided if the graph is planar. One of the most common graph layouts is the tree layout, which is convenient to display hierarchical information. “Tree layout algorithms have the lowest complexity and are simpler to implement” [44].
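As a small illustration of the node-and-edge view described above, the following Python sketch stores a call graph as an adjacency mapping and serializes it to the Graphviz DOT text format, leaving the layout to an external tool. The function names in the example graph and the helper `to_dot` are hypothetical.

```python
def to_dot(call_graph, tainted=()):
    """Serialize a call graph {caller: [callees]} to Graphviz DOT text.

    Nodes listed in `tainted` are filled in red so they stand out.
    """
    lines = ["digraph callgraph {"]
    nodes = sorted(set(call_graph) | {c for cs in call_graph.values() for c in cs})
    for node in nodes:
        style = " [style=filled, fillcolor=red]" if node in tainted else ""
        lines.append(f'  "{node}"{style};')
    for caller in sorted(call_graph):
        for callee in call_graph[caller]:
            lines.append(f'  "{caller}" -> "{callee}";')
    lines.append("}")
    return "\n".join(lines)

graph = {"main": ["read_sensor", "update_state"],
         "update_state": ["write_safe_register"]}
dot_text = to_dot(graph, tainted={"write_safe_register", "update_state"})
```

Using color for one node category follows the same idea as the colored node kinds mentioned for the SVF value-flow graph.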

Code annotation

As static analysis is used to analyze source code, a common visualization method is to annotate the code directly with the results. Usually, a plugin can be developed and integrated into the IDE (integrated development environment).

The results of the taint checker developed by Arroyo et al. [11] are displayed as code annotations in order to warn the developer against untrusted data during development. Frama-C [39] also provides a user interface and a source code browser to display the results on the code.

The advantage of code annotation is to let developers see the context of a result [42], that is to say, to read the code and locate the information inside the project. However, annotations on code do not give a global representation of the dependencies. LaToza and Myers [42] developed an Eclipse plugin composed of both code annotation and graph representation in order to navigate the call graph of a module. In fact, inter-procedural dependencies are easily represented through a call graph. Thus, the user can get context information from the Eclipse IDE and global information from the graph.

Useful properties in software visualization tools

According to Bassil and Keller [43], “appropriate visualization can significantly reduce the effort spent on system comprehension and maintenance”. In order to define what an “appropriate visualization” is, Bassil and Keller conducted a survey about software visualization tools. They aimed at evaluating the usefulness and the importance of different visualization aspects. Bassil and Keller [43] report the most essential properties according to the results of the questionnaire:

1. “Search tools for graphical and/or textual elements”

2. “Source code visualization (textual views)”

3. “Hierarchical representation”

4. “Use of colors”

5. “Source code browsing”

6. “Navigation across hierarchies”

7. “Easy access, from the symbol list, to the corresponding source code”

Some useful but not essential properties have also been reported, such as “saving of views for future use”, the “possibility of having multiple [...] instances of the same object being highlighted in all the views”, or the “visualization of different levels of detail in separate window”.

Bassil and Keller [43] have also questioned experts about the code analysis support of software visualization tools. It has been reported that the most important functionalities are “visualization of function calls”, “visualization of inheritance graph” and “visualization of different levels of detail in separate window”.

2.8 Evaluation

In the context of static analysis in automotive safety, the accuracy of the tool should be measured so that users can assess whether they can rely on the results. Moreover, the tool aims at helping engineers to be more efficient in their work. Thus, the usefulness of the results should be evaluated to check whether the tool fulfils its goal.

Evaluating the usefulness of the results

According to Seaman [45], qualitative research methods are increasingly used to take into account human behaviour when evaluating software. Qualitative data cannot be represented as numbers, contrary to quantitative data. Two data collection methods are commonly used: “participant observation and interviewing” [45]. The first one consists in observing software developers while they are working and taking notes about their behaviour and thoughts. The second one consists in asking developers a series of questions. After collecting data, results should be analyzed in order to extract “a statement or proposition” [45].

LaToza and Myers [42] evaluated the “potential productivity benefits [...] and the usability” of their tool, called REACHER, by conducting a lab study on 12 participants. This tool aims at reducing the time required for a task, by allowing the developers to understand and navigate the code more effectively. The study consisted in comparing the time the participants needed to perform a task with Eclipse to the time needed to perform the same task with REACHER. To make the two tools comparable, all the participants completed two tutorials on Eclipse and REACHER before taking part in the study, in order to familiarize themselves with both interfaces. Each task involved the understanding of “control flow between events” in the program and the use of a call graph, which is REACHER’s focus. Each task focused on a particular aspect of REACHER.

Evaluating the accuracy of the results

According to Anderson [46], ISO 26262 requires static analyzers to be qualified by assessing the tool confidence level (TCL). The TCL is derived from the possibility that a failure in the tool prevents the requirements from being met (tool impact, TI), and the probability that the failure can be detected (tool error detection, TD). Thus, the accuracy of the tool should be assessed and the functional requirements should be tested.

Arroyo et al. [11] evaluated the accuracy of their taint checker based on Clang static analyzer following these criteria:

• “capacity for finding usage of tainted data”: this refers, for example, to the capacity of the tool to detect the use of a tainted variable in a given instruction. Each type of usage was tested in a test case.

• “the number of false positives”: this refers to the wrong propagation of tainted data generating false errors.

• “scalability”: the tool was tested on a real case, the Heartbleed vulnerability of OpenSSL.

Sui et al. [38] performed an experimental evaluation in order to measure the accuracy of their static memory leak detector, called SABER. They define accuracy as the “ability to detect memory leaks with a low false positive rate”. To conduct the study, they tested their tool on “15 SPEC2000 C programs (620 KLOC) and seven open-source applications”. They reported the number of faults found by SABER, and the number of false positives. Then, they computed the false positive rate as defined in [eq. (2.2)]. Finally, they compared these results to the results obtained with other analyzers.

Recall that the number of faults reported can be expressed as follows:

\[
\text{number of faults reported} = \text{false positives} + \text{true positives} \tag{2.1}
\]

Then, the false positive rate can be defined as:

\[
\text{false positive rate} = \frac{\text{false positives}}{\text{number of faults reported}} \tag{2.2}
\]

They concluded that their detector is “neither complete [...] nor sound” [38] due to some approximations, such as treating multi-dimensional arrays monolithically or bounding the number of loop iterations.

Imparato et al. [8] have reported “a comparative study of static analysis tools for AUTOSAR”. They have evaluated the tools according to their precision and recall, which can be expressed as follows:

\[
\text{precision} = \frac{\text{true positives}}{\text{number of faults reported}} \tag{2.3}
\]

\[
\text{recall} = \frac{\text{true positives}}{\text{false negatives} + \text{true positives}} \tag{2.4}
\]

A high precision saves time because it limits the number of false alerts that developers will have to check. The recall measures the proportion of errors detected out of the total number of errors. If the recall equals 1, then the tool detects all the errors.
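The metrics of equations (2.1) to (2.4) can be computed directly from the counts of true positives, false positives and false negatives. The following Python sketch is only an illustration: the function names and the example counts are assumptions, not results from any of the cited studies, and the functions assume that at least one fault is reported or present (otherwise the denominator is zero).

```python
def false_positive_rate(true_positives, false_positives):
    """Equation (2.2): false positives over the number of faults reported."""
    reported = true_positives + false_positives   # equation (2.1)
    return false_positives / reported

def precision(true_positives, false_positives):
    """Equation (2.3): true positives over the number of faults reported."""
    return true_positives / (true_positives + false_positives)

def recall(true_positives, false_negatives):
    """Equation (2.4): true positives over the true number of faults."""
    return true_positives / (true_positives + false_negatives)

# Hypothetical tool run: 8 real faults reported, 2 false alerts, 2 faults missed.
tp, fp, fn = 8, 2, 2
print(false_positive_rate(tp, fp), precision(tp, fp), recall(tp, fn))
```

Since both share the same denominator, the false positive rate of equation (2.2) and the precision of equation (2.3) always add up to 1.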

3 Method

This chapter describes the implementation of the taint analyzer on top of LLVM in sections 3.1 and 3.2, the development of the visualization tool in section 3.3 and the evaluation of the accuracy of the results and the usefulness of the visualization tool in section 3.4.

3.1 LLVM

The first research question was to examine whether it was possible to utilize LLVM to develop a static analysis tool for automotive software. This study was done in three steps.

The first step was to study how to develop a plugin on top of LLVM. One of the advantages of the compiler infrastructure is the LLVM Pass Framework [13], presented in section 2.5. LLVM passes can be used to transform, analyze and optimize source code in a modular way. Moreover, it is possible to develop new LLVM passes easily thanks to a set of reusable functions and application programming interfaces (APIs) written in C++. LLVM also provides detailed documentation [47] intended for developers. New passes inherit from one of the Pass child classes: ModulePass, CallGraphSCCPass, FunctionPass, LoopPass, RegionPass and BasicBlockPass. In the context of the thesis, the Module pass was selected because it can analyze the whole program. Therefore, it enables inter-procedural analysis, whereas the Function pass only provides the possibility of analyzing the content of each function separately and independently. Finally, the runOnModule function should be overridden and is the entry point of the pass. Thus, any object-oriented application can be developed on top of a Module pass.

The second step was to study how to perform the taint analysis based on the functions and APIs provided by the LLVM infrastructure. LLVM APIs give the possibility to iterate over several objects of the LLVM IR inside the module. For example, it is possible to iterate over each instruction, each function or each global variable of the program. It is also possible to iterate over the def-use chains, defined in section 2.5, making LLVM especially well suited to perform taint analysis.

The last step was to study how to run the pass on the projects of the company. Once the pass is developed, it must be compiled with Clang in order to generate a shared library. Then, a pass can be run on an LLVM bitcode file through the command line interface thanks to “the modular optimizer, opt”, according to Lattner and Adve [48]. Thus, in order to analyze the source files of the different projects of the company, the projects had to be compiled with Clang to obtain the corresponding bitcode file of each module, where a module is a single C translation unit.

3.2 Taint analysis

The second research question was to examine how taint analysis can be used to analyze the dependencies between safe variables in the automotive industry. The first phase was to define the way to identify the source safe variables, called taint information, and how to implement them in the tool. The second phase was to determine the taint propagation policy, that is to say the set of operations or actions propagating the taint, according to the automotive industry requirements and ISO 26262 [1]. The last phase was to study how to implement the taint analysis algorithm to analyze the LLVM IR.

Taint information

Taint information, also called source information, represents the data set tainted at the initialization of the taint analysis algorithm. Thereafter, tainted data refers to safety-critical data, divided into four ASIL ratings (A, B, C, D), whereas untainted data refers to quality management (QM) data.

Specification

The specifications related to taint information should state the type of objects which can be tainted by the user at the beginning. These specifications have been discussed during a meeting with the safety engineers of the company. In the context of the thesis, according to the needs of the company, taint information should be user-configurable, which means that the user can define the list of tainted values as an input of the taint analysis tool. Then, it has been decided that the source objects that a user can taint at the initialization could be:

• a global variable, identified by its name,

• a memory region, identified by an address range,

• a source-code file, identified by its name.

In fact, specifying the name of a safe global variable is sufficient to identify it in the source code. Moreover, specifying a memory region can be used to taint the safe registers and the partitions which should be protected in the MPU. Specifying a file is useful if a lot of functions that have to be tainted are located in the same file. This spares developers from writing the name of each tainted function one at a time. Finally, each user input can be associated with an ASIL rating (A, B, C, D).

Implementation

To implement a user-configurable analyzer, taint variables are defined by the user in an XML configuration file, which is read by the taint analysis pass using the C++ XML processing library Pugixml [49]. Then, user input is converted into several instances of the Input class [Fig. 3.2]. This class is composed of the name of the object or of a memory region (start and end addresses), and an ASIL rating. All the instances of the Input class are stored in a list, which is a member of the taint analysis pass. Thus, this list represents the set of taint information.

Then, these inputs need to be associated with an LLVM class, that is to say an instance of LLVM::Value, which is the most generic LLVM class used to define a variable. An LLVM function is used to select the LLVM::Value corresponding to a name or a memory region. The child classes of LLVM::Value are presented in [Fig. 3.1]. Taint information can either be a global variable (LLVM::GlobalVariable), an address (LLVM::ConstantExpr) or a function (LLVM::Function). LLVM::AllocaInst and LLVM::Argument cannot be part of taint information since they define local variables. However, they will be used later in the analysis of the dependencies.

Once the LLVM::Value instance corresponding to the Input has been identified, taint information is converted to an instance of the SafeValue class [Fig. 3.3], which is composed of:

• an LLVM::Value instance

• an instance of the enumeration ASIL (QM, A, B, C, D)

This class is the key of the taint analyzer because every LLVM::Value instance analyzed by the algorithm is stored in a SafeValue instance. All the instances being ASIL A, B, C or D are tainted information, whereas instances being QM are untainted information.
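The configuration step described above can be sketched as follows. The thesis pass reads the file with Pugixml in C++; this Python sketch uses the standard `xml.etree.ElementTree` module instead, purely for illustration. The XML element and attribute names, the example object names, the `kind` field and the function `read_inputs` are all assumptions, since the actual configuration layout is not given in the text.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional

class Asil(IntEnum):
    QM = 0
    A = 1
    B = 2
    C = 3
    D = 4

@dataclass
class Input:
    """One user-defined taint source: a named object or an address range."""
    kind: str                      # "global", "region" or "file" (assumed tags)
    asil: Asil
    name: Optional[str] = None
    start: Optional[int] = None
    end: Optional[int] = None

# Hypothetical configuration layout; element and attribute names are assumptions.
CONFIG = """
<taint>
  <global name="safe_speed" asil="D"/>
  <region start="0x1000" end="0x1FFF" asil="B"/>
  <file name="brake_control.c" asil="C"/>
</taint>
"""

def read_inputs(xml_text):
    """Convert every child element of the configuration root into an Input."""
    inputs = []
    for element in ET.fromstring(xml_text):
        asil = Asil[element.get("asil")]
        if element.tag == "region":
            inputs.append(Input(element.tag, asil,
                                start=int(element.get("start"), 16),
                                end=int(element.get("end"), 16)))
        else:  # "global" or "file": identified by name
            inputs.append(Input(element.tag, asil, name=element.get("name")))
    return inputs

inputs = read_inputs(CONFIG)
```

Encoding the ratings as an ordered enumeration, with QM as the lowest value, makes it straightforward to compare two ratings later in the analysis.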

Figure 3.1: An overview of the LLVM Value inheritance [30]

Figure 3.2: UML Diagram, describing the architecture of the taint analysis pass

Taint propagation policy

After defining the taint information, the second phase is to identify the kind of operations which can propagate the taint to other variables, which can be either global variables, local variables, addresses or functions.

Specification

The taint propagation policy has been defined in accordance with the opinion of the engineers of the company, based on their experience with safety requirements. Seven cases have been defined and are presented below [Tab. 3.1]. If an object is tainted by several objects, then the highest ASIL should be assigned to it, according to ISO 26262 Part 9 [1].

Store: If a new value is assigned to an ASIL variable, resulting in the variable being modified, then the function where the assignment is done should be tainted. A memory write access is always translated into a store instruction in LLVM IR [32]. It is considered that an instruction modifies tainted data if its memory location or its content is overwritten. Thus, if tainted data is a pointer, any assignment to the pointer or to the dereferenced pointer will be considered as a modification.

Load address: If an ASIL hard-coded address is assigned to a scalar variable, or converted and assigned to a pointer, then the variable or pointer should be tainted.

Pointer parameter: If an ASIL pointer is passed as a parameter to a function, then the content of the function should be analyzed to check if the pointer is modified inside, that is to say, if its memory location or its content is overwritten by another value. In order to do this, the function behaviour is first over-approximated: the calling function and the parameter inside the function are tainted. Then, the content of the called function is analyzed to determine whether the pointer is effectively modified. If there appears to be a modification, then the called function is tainted as well. If the pointer is not modified inside the function, then the called function is not tainted.

Function call: If a function is tainted, then each function calling this function should also be tainted. Thus, the taint is propagated to the functions of the call graph originating from this function.

Global: If a global value is initialized with tainted data, then this global value should also be tainted.

File: A file can only be tainted if the user includes its name in the configuration file. If a file is tainted, then all global variables and functions defined in this file should also be tainted.

Violation: When the scalar value of a tainted variable is assigned to a QM variable or a lower ASIL variable, it is not a safety-critical operation, because the safe memory is not likely to be modified. So, no tainted value is added. However, if a tainted pointer is stored in another QM or lower ASIL pointer, the safe memory could be modified later through this unsafe pointer. Thus, this case should not happen in a safe application, except if the tainted variable is a hard-coded address, or if it is a global variable definition. Assigning an ASIL variable to a lower ASIL or QM variable is inconsistent with safety recommendations. Thus, this case is considered as a violation.
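The distinction between the Store, Load address, Global and Violation cases can be summarized as a small decision function. The Python sketch below is one possible reading of the policy above, not the implementation of the thesis: the function `classify_assignment`, its boolean parameters and the returned labels are assumptions, and treating equal ratings as a Store is a choice made for this example only.

```python
from enum import IntEnum

class Asil(IntEnum):
    QM = 0
    A = 1
    B = 2
    C = 3
    D = 4

def classify_assignment(dest_asil, src_asil, src_is_pointer,
                        src_is_hard_coded_address=False,
                        is_global_definition=False):
    """Classify `dest = src` under a simplified reading of the policy.

    Returns "store", "load_address", "global", "violation" or "none".
    """
    if src_asil > dest_asil:                  # tainted source, less critical destination
        if src_is_hard_coded_address:
            return "load_address"             # the destination becomes tainted
        if is_global_definition:
            return "global"                   # the global value becomes tainted
        if src_is_pointer:
            return "violation"                # safe memory reachable via an unsafe pointer
        return "none"                         # scalar copy: safe memory is not modified
    if dest_asil > Asil.QM:
        return "store"                        # a safe variable is modified (assumed for equal ratings too)
    return "none"

print(classify_assignment(Asil.QM, Asil.D, src_is_pointer=True))
print(classify_assignment(Asil.D, Asil.QM, src_is_pointer=False))
```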

Table 3.1: Taint propagation policy

| Name | Description | Taint information | Taint propagation | Examples |
|---|---|---|---|---|
| Store | Modification of a safe variable inside a function | Lvalue (any type) | Function | variable_asil = variable_qm; variable_asil = function_qm(); pointer_asil = &variable_qm; |
| Load Address | A safe address is loaded into a variable inside a function | Rvalue (address) | Function and lvalue | int* pointer = (int *) 0x0F; uint32 address = 0x0F; |
| Pointer parameter | A safe pointer is passed as a parameter to a function | The pointer parameter | Parameter, calling function and called function | called_fn(&variable_asil); Definition: void called_fn(int* pointer) { *pointer = variable_qm; } |
| Call | A call to a safe function | Called function | Calling function | void calling_function() { function_modifying_ASIL() } |
| Global | A global value definition | Rvalue | Global variable | int* global = &global_asil int* global = 0x00001002 |
| File | A file is marked as safe | File | Global variables, functions | |
| Violation | A safe pointer is loaded into an unsafe pointer | Rvalue (not an address) | Violation | pointer_qm = pointer_asil; pointer_qm = &variable_asil; |

Implementation

The first step of the implementation was to define the scope of the taint analysis pass. Then, the second step was to develop the algorithm to parse and analyze the LLVM IR, in order to identify the different cases presented in the propagation policy [Tab. 3.1]. The last step was to compile the project with Clang to generate LLVM IR.

MISRA C Guidelines

Some assumptions have been made throughout the development process of the analyzer according to MISRA C Guidelines [14]. The following rules apply to the embedded project analyzed by the taint analysis pass:

• Each line of code is reachable.

• Variables should always have distinct names.

• Dynamic allocation and deallocation functions are not used.

These rules allow some simplifications. All the lines of the LLVM bitcode file are analyzed as there is no unreachable code. A variable can be identified by its name since two different variables should have different names. Dynamic allocation and deallocation are not taken into account during the analysis. Only hard-coded memory addresses are studied.

Pointer analysis

A pointer analysis can have different levels of accuracy, as presented in section 2.4. The level of accuracy needed by the tool has been established according to the needs of the company. The taint analyzer should be field-insensitive, which means that each access to a sub-element is equivalent to an access to the whole aggregate. In fact, according to ISO 26262 Part 9 Section 6.2 [1], elements composed of sub-elements should be developed according to “the highest ASIL applicable to the element”. The taint analyzer should be inter-procedural so that relationships between functions can be analyzed, in order to identify when a tainted pointer parameter is modified inside a function. Finally, the taint analyzer should be flow-insensitive, which means that the execution order of the program is not taken into account. This is an over-approximation which aims at simplifying the analysis, because flow-sensitive analysis is costly in terms of complexity.

Instruction level

The taint analysis pass only analyzes source code at the instruction level. Thus, analyzing machine code such as assembly language is out of scope.

LLVM IR analysis

At initialization time, taint information is defined. The taint should then be propagated to other data according to the taint propagation policy. The users of each taint information, that is to say, in this case, the list of instructions involving a given LLVM::Value instance, can be listed using the iterator over the users. Once a user is detected, it needs to be analyzed to identify the taint propagation policy case that it corresponds to. Here, a user can either be an instruction or a constant expression. The AnalyzerFactory selects the child class of the Analyzer corresponding to the LLVM IR instruction type, as described in the UML diagram [Fig. 3.2]. The LLVM language reference manual [19] describes the different LLVM IR instructions.
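The selection performed by the AnalyzerFactory can be pictured as a lookup from instruction type to analyzer class. The Python sketch below illustrates this dispatch on textual IR lines. Only the names `Analyzer` and `AnalyzerFactory` come from the text; the subclasses, the `opcode_of` helper and the string-based matching are assumptions, since the actual pass works on LLVM instruction objects rather than on text.

```python
import re

class Analyzer:
    """Base class: one subclass per LLVM IR instruction type."""
    opcode = None

    def describe(self):
        return f"analyzing a {self.opcode} instruction"

class StoreAnalyzer(Analyzer):
    opcode = "store"

class LoadAnalyzer(Analyzer):
    opcode = "load"

class CallAnalyzer(Analyzer):
    opcode = "call"

class AnalyzerFactory:
    """Maps the opcode of a non-empty textual IR line to an Analyzer subclass."""
    _registry = {cls.opcode: cls for cls in (StoreAnalyzer, LoadAnalyzer, CallAnalyzer)}

    @staticmethod
    def opcode_of(ir_line):
        # Drop an optional "%result = " prefix, then take the first word.
        body = re.sub(r"^\s*%[\w.]+\s*=\s*", "", ir_line)
        return body.split()[0]

    @classmethod
    def create(cls, ir_line):
        analyzer_class = cls._registry.get(cls.opcode_of(ir_line))
        return analyzer_class() if analyzer_class else None

analyzer = AnalyzerFactory.create("%0 = load i32, i32* @speed, align 4")
print(analyzer.describe())
```

Keeping the mapping in a registry means that supporting another instruction type only requires adding one subclass and one registry entry.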

Listing 3.1: Store Inst

    store {type} {source}, {type} * {destination}, align {type_alignment}

The store instruction writes a value to an address in memory. It is the only instruction which can modify the content of an existing variable in memory (at the LLVM IR level) [19]. Thus, this instruction is related to the Store case of the taint propagation policy, if the destination operand has a higher ASIL than the source operand. Otherwise, it is a violation. Finally, if the source operand is a safety-critical address, then it is related to the Load Address case of the taint propagation policy. A store instruction is often preceded by a load instruction which aims at loading the destination address or the source value of the store instruction.

Listing 3.2: Load Inst

    {result} = load {type}, {type} * {source}, align {type_alignment}

The load instruction reads the content of an address in memory and stores it in an SSA result. This instruction is used each time the content of a memory address needs to be read. For example, a load instruction can be used to load the address stored in a pointer. In order to access the value pointed to by the pointer, a second load instruction should be used to load the content stored at that address. A load instruction does not necessarily indicate that the loaded operand will be modified. In fact, the address of a pointer can be loaded either to modify its content or to read it. Therefore, the instructions following the load should be analyzed until a store instruction or a call instruction is found. The call instruction is a special case related to inter-procedural analysis.

Listing 3.3: Call Inst

    {result} = call {type} {function}({function arguments})

The call instruction is used for function calls. The return value is stored in an SSA result. When performing inter-procedural analysis, if safety-critical data is passed as a parameter to the function, then the content of the function needs to be analyzed as well. This instruction is related to the Pointer Parameter and Call cases of the taint propagation policy.

Listing 3.4: Alloca Inst

    {result} = alloca {type}, align {type_alignment}

The alloca instruction is used to allocate memory on the stack frame during the execution of a function. It enables the declaration of local variables, which are released after the function returns. An argument of a function is later assigned to a local value which is declared with an alloca instruction.

Listing 3.5: GetElementPtr Inst

    {result} = getelementptr inbounds {type} * {source}, {type} {index}

The getElementPtr instruction is used to “get the address of a sub-element of an aggregate data structure” [19], such as arrays or structures. Like the load instruction, it does not necessarily lead to the modification of the operand, so the following instructions need to be analyzed.

Listing 3.6: Global variable

    @{globalVarName} = {global | constant} {type} {initializer}, align {type_alignment}

The global instruction is used to declare a global variable. A global variable can be initialized with another global initializer, which can be a global variable or a constant. This is related to the Global case of the taint propagation policy.

Listing 3.7: An example of constant expression: inttoptr

    {destination_type} inttoptr ({type} {value} to {destination_type})

Finally, a user can also be a constant expression, which is used to perform operations on constants [19]. If a global value, which inherits from the LLVM::Constant class, is used by a constant expression, then the users of this constant expression should also be analyzed. For example, the constant expression inttoptr [List. 3.7] can be used to convert a constant integer, such as an address, to a pointer.

Propagation policy

New tainted variables are stored in a SafeValue instance [Fig. 3.3], in the same way as taint information. It is useful to recall that the SafeValue class aims at storing an LLVM::Value analyzed by the pass, which thus can be associated with an ASIL (A, B, C, D) or classified as QM. The SafeValue objects store a list of all their users corresponding to a propagation case, in a map whose keys are the users’ locations. In fact, each time a user is identified as a case of the taint propagation policy, it is stored in an instance of SafeInstruction [Fig. 3.3], which is composed of the tainted value, its alias, the propagation type, and its location. If, at some point, the lvalues of two variables are equal, they are said to be aliases, as explained in section 2.4. The location is a global object which refers either to the tainted function where the user is located, or to a tainted global variable if the user is a global declaration. Finally, SafeValue instances are stored in a SafeMap whose keys are the LLVM::Value instances. Thus, it is possible to find out which functions and aliases have been tainted because of a given value, and then to find out which case of the taint propagation policy was responsible for the taint.
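The bookkeeping described above can be sketched with three small structures. In this Python illustration, the class names `SafeValue` and `SafeInstruction` follow the text, while the field names, the use of strings as stand-ins for LLVM values, the list kept per location and the helper `why_tainted` are assumptions made for this example.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, List, Optional

class Asil(IntEnum):
    QM = 0
    A = 1
    B = 2
    C = 3
    D = 4

@dataclass
class SafeInstruction:
    tainted_value: str            # stand-in for the LLVM value being used
    alias: Optional[str]          # alias tainted through this user, if any
    propagation_type: str         # e.g. "store", "call", "global"
    location: str                 # tainted function or global holding the user

@dataclass
class SafeValue:
    value: str
    asil: Asil = Asil.QM
    # Users that matched a propagation case, grouped by their location.
    user_map: Dict[str, List[SafeInstruction]] = field(default_factory=dict)

    def add_user(self, instruction):
        self.user_map.setdefault(instruction.location, []).append(instruction)

def why_tainted(safe_map, value, location):
    """Return the propagation types that tainted `location` because of `value`."""
    users = safe_map[value].user_map.get(location, [])
    return [user.propagation_type for user in users]

# The SafeMap: analyzed values indexed by the value itself.
safe_map: Dict[str, SafeValue] = {}
safe_map["@safe_speed"] = SafeValue("@safe_speed", Asil.D)
safe_map["@safe_speed"].add_user(
    SafeInstruction("@safe_speed", None, "store", "update_speed"))
print(why_tainted(safe_map, "@safe_speed", "update_speed"))
```

Indexing users by location is what allows the question "why is this function tainted?" to be answered with a single lookup.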


Figure 3.3: SafeValue and SafeInstruction classes

Taint propagation algorithm

The taint propagation algorithm developed in the context of this thesis is summarized below in pseudo-code. Each instance of the taint information is tainted at initialization. Then, the users of tainted variables are analyzed. If a user corresponds to a propagation case of the taint propagation policy, then the taint is propagated to the function or the alias. Finally, the user is converted to an instance of SafeInstruction, which is inserted in the user map of the SafeValue instance.

Listing 3.8: Taint propagation algorithm

// This is the initialization.
taint_information = list_of_safe_values
for each safe_value in taint_information
    propagating_taint(safe_value)

// This function propagates the taint to the safe value and analyzes its users.
void function propagating_taint(safe_value) {
    if not (safe_value.tainted) {
        safe_value.tainted = true
        for each user in safe_value.users() {
            if user corresponds to a propagation case {
                if STORE or LOAD or PARAMETER or CALL
                    propagating_taint(function)

                if LOAD or PARAMETER or GLOBAL
                    propagating_taint(alias)

                convert user to safe_instruction
                append safe_instruction to safe_value.user_map
            }
        }
    }
}
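The pseudo-code of Listing 3.8 could be written in C++ roughly as follows. This is a minimal sketch, not the implementation of the pass: values, functions and aliases are reduced to names, and the types Case, User and TaintState are hypothetical. The conversion of users to SafeInstruction objects is left out.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

enum class Case { Store, Load, Parameter, Call, Global, None };

// Hypothetical, simplified user record: the policy case it matches, the
// enclosing function, and the alias it creates (empty when not applicable).
struct User {
    Case kind;
    std::string function;
    std::string alias;
};

struct TaintState {
    std::map<std::string, std::vector<User>> users;  // value name -> its users
    std::set<std::string> tainted;                    // names already tainted
};

// Taints `name`, then follows its users as in Listing 3.8. The early return
// on already-tainted names guarantees termination on cyclic dependencies.
void propagateTaint(TaintState& state, const std::string& name) {
    if (!state.tainted.insert(name).second) return;  // already tainted
    auto it = state.users.find(name);
    if (it == state.users.end()) return;
    for (const User& user : it->second) {
        if (user.kind == Case::None) continue;  // not a propagation case
        // STORE, LOAD, PARAMETER and CALL taint the enclosing function.
        if (user.kind != Case::Global && !user.function.empty())
            propagateTaint(state, user.function);
        // LOAD, PARAMETER and GLOBAL taint the alias.
        if ((user.kind == Case::Load || user.kind == Case::Parameter ||
             user.kind == Case::Global) && !user.alias.empty())
            propagateTaint(state, user.alias);
    }
}
```

The check on the tainted flag plays the same role as the test `if not (safe_value.tainted)` in the listing: a value that has already been visited is never analyzed twice.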


Compiling a project with Clang

To run the analysis pass on a project, the project has to be compiled with Clang in order to generate the LLVM bitcode files. The following command should be executed for each source file to generate the corresponding bitcode file.

Listing 3.9: Build
clang -g -emit-llvm -o file.bc -c file.c

The linking part should be done with the LLVM linker, presented in section 2.5, which combines several bitcode files into a single bitcode file.

Listing 3.10: Linking
llvm-link *.bc -o output.bc

3.3 Visualization

The third research question was “How to represent results in an understandable way so that engineers can improve the safety development process?”. The development of the visualization tool was done in two phases: data structure and serialization, and the development of the graph representation.

Data structure and serialization

The main information to be stored is the list of tainted variables (the instances of the SafeValue class), and the userMap of each safe value, containing the list of functions, safe instructions and aliases related to this tainted value.

Listing 3.11: Example of JSON representation
trees[’safeValue’] = {
  "name": "safeValue",
  "userMap": [
    {
      "name": "function1",
      "safeInstructionList": [
        {
          "alias": "alias1",
          "propagationType": "store"
        },
        [...]
      ]
    },
    {
      "name": "function2",
      "safeInstructionList": [
        [...]
      ]
    }
  ]
}


It has been decided to use the JSON (JavaScript Object Notation) format to store the information. JSON is a human-readable file format used to represent objects as pairs of keys and values. It can be used to represent simple data as well as aggregate data such as arrays and lists.

Data contained in each object has to be serialized, which means that data has to be translated into a storable format, so that the results can be read later. The result of the tool is serialized into a JSON file using a C++ function inside the pass. In the context of this thesis, each SafeValue instance corresponds to a JSON entry, as presented in [List. 3.11]. Each safe value has at least a name and a list of users. Each user is composed of a function, which is associated with a list of safe instructions. Each safe instruction has at least an alias and a propagation type.
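A serialization of this kind can be sketched with the standard library alone. The record types and the functions quote and toJson below are hypothetical and only mirror the fields of [List. 3.11]; the thesis does not specify how the actual C++ function is written, and a real implementation would also have to escape control characters.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical plain records mirroring the fields shown in Listing 3.11.
struct InstructionRecord {
    std::string alias;
    std::string propagationType;
};

struct UserRecord {
    std::string name;  // function in which the safe instructions are located
    std::vector<InstructionRecord> instructions;
};

struct ValueRecord {
    std::string name;
    std::vector<UserRecord> users;
};

// Wraps `text` in double quotes, escaping only quotes and backslashes.
std::string quote(const std::string& text) {
    std::string out = "\"";
    for (char c : text) {
        if (c == '"' || c == '\\') out += '\\';
        out += c;
    }
    return out + "\"";
}

// Serializes one value record as a compact JSON object.
std::string toJson(const ValueRecord& value) {
    std::ostringstream os;
    os << "{\"name\":" << quote(value.name) << ",\"userMap\":[";
    for (std::size_t i = 0; i < value.users.size(); ++i) {
        const UserRecord& user = value.users[i];
        if (i > 0) os << ",";
        os << "{\"name\":" << quote(user.name) << ",\"safeInstructionList\":[";
        for (std::size_t j = 0; j < user.instructions.size(); ++j) {
            const InstructionRecord& inst = user.instructions[j];
            if (j > 0) os << ",";
            os << "{\"alias\":" << quote(inst.alias)
               << ",\"propagationType\":" << quote(inst.propagationType) << "}";
        }
        os << "]}";
    }
    os << "]}";
    return os.str();
}
```

Writing the separators before every element except the first avoids the trailing commas that strict JSON parsers reject.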

Graph representation

JavaScript library D3

It has been decided to use JavaScript to render the output of the tool. JavaScript makes it easy to develop interactive graphs. Moreover, it is possible to construct a JavaScript graph from a JSON file. The JavaScript library D3 (Data-Driven Documents) [50] has been used. This library is used to create documents to visualize data and to generate SVG (Scalable Vector Graphics). It aims at simplifying the manipulation of the Document Object Model (DOM).

Visualization properties

The most useful visualization properties, according to Bassil and Keller’s study [43], presented in section 2.7, have been implemented in the visualization tool.

1. “Search tools for graphical and/or textual elements”: The possibility of searching for a specific value by entering its name has been added. The target node is highlighted and the ASIL rating is displayed.

2. “Hierarchical representation”: The tree representation has been chosen in order to present the results hierarchically.

3. “Use of colors”: The colors have been used to easily identify the different types of the taint propagation policy [Tab. 3.1] and the ASIL (A, B, C, D) or QM ratings.

4. “Navigation across hierarchies”: The user can expand some function nodes to display the aliases of the tainted value, and minimize some branches of the tree.

Access to the source code has not been implemented since this visualization tool is not part of an IDE, so it has been replaced by the possibility of accessing debugging information such as the file location.

The development of the visualization tool has been done iteratively, by regularly presenting the results to the engineers to get their feedback. The tool was gradually improved until it met their expectations. During these meetings, they could assess the user experience by navigating the website, and ask for other functionalities. Thus, new functionalities have been added, such as the possibility to display the list of tainted functions and tainted global variables in each file.

Tree layout

A tree is an undirected and acyclic graph characterized by the fact that each node can only have one parent [51]. The tree layout has been chosen because it can provide a clear hierarchical organization: all the nodes of the same hierarchical level are drawn on the same line. Moreover, equal spatial distribution of the nodes is easy to achieve: the D3 tree layout [50] is obtained using the Reingold-Tilford “tidy” algorithm [52]. Furthermore, the tree layout avoids crossing edges, which is an important aspect according to Herman et al. [44]. Finally, in the context of this thesis, each branch of the tree is meaningful, because it represents a safety-critical path.

However, the tree layout also has some drawbacks. Since each branch represents a path, a node can appear several times in different branches of the tree, for example if a function is tainted by several initiators. In the case of recursive functions, there could be cycles in the tree. To avoid cycles and to prevent the generation of endless trees, if a child node is equal to one of its ancestor nodes, its descendant nodes are not displayed.

An alternative to the tree layout would have been to construct a spanning tree. A spanning tree is a subset of a graph where all the nodes are present only once and linked together with the minimum number of edges. As it is a tree, each pair of nodes must be linked by only one path according to Graham and Kennedy [51]. Nevertheless, the issue with a spanning tree, in the context of this thesis, is that some dependencies would not be displayed in the graph.

These dependencies could also have been represented with a multiple tree structure [51]. A multiple tree is a combination of several single trees. The advantage of this structure is that a node can have several parents, while keeping the hierarchical organization. However, the issue of crossing edges remains. Traditional graph structures can be complex and overwhelming according to Mukherjea et al. [53]. Thus, dividing the graph into several simpler trees seems to be a worthwhile solution.
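The rule used to cut cycles can be illustrated with a short sketch. The visualization tool itself is written in JavaScript; the C++ below only demonstrates the rule, and the Graph and TreeNode types as well as the function names are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical dependency graph: node name -> names of the nodes it taints.
using Graph = std::map<std::string, std::vector<std::string>>;

struct TreeNode {
    std::string name;
    std::vector<TreeNode> children;
};

// Unfolds the graph into a tree rooted at `name`. `path` holds the ancestors
// of the node being built; a node equal to one of them is kept as a leaf.
TreeNode buildTree(const Graph& graph, const std::string& name,
                   std::vector<std::string>& path) {
    TreeNode node{name, {}};
    for (const std::string& ancestor : path)
        if (ancestor == name) return node;  // cycle: do not expand further
    auto it = graph.find(name);
    if (it == graph.end()) return node;
    path.push_back(name);
    for (const std::string& child : it->second)
        node.children.push_back(buildTree(graph, child, path));
    path.pop_back();
    return node;
}

// Counts the nodes of the unfolded tree, duplicates included.
std::size_t countNodes(const TreeNode& node) {
    std::size_t total = 1;
    for (const TreeNode& child : node.children) total += countNodes(child);
    return total;
}
```

Only the ancestors on the current path are checked, not every node already drawn, so a function reached through two different branches still appears in both, which is the duplication described above.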

3.4 Evaluation

The evaluation was conducted to answer the research questions “How does taint analysis visualization affect the usefulness of the output?” and “Is the taint analysis accuracy sufficient for the application?”.

Usefulness

According to Seaman [45], qualitative research methods are suitable to evaluate human interactions with software. Thus, it has been decided to conduct a survey to assess the usefulness of the visualization tool. This survey was sent by email to the participants, who were all employees of the company. In the context of this study, a “useful” functional aspect is defined as a functionality which provides new and relevant information to the developer, and that the developer will be able to use to improve their work performance.

The first set of questions [Tab. 3.2] aimed at assessing the usefulness of the visualization tool. The evaluation method was based on the survey conducted by Bassil and Keller [43]. Thus, each question was written to evaluate one of the functional aspects of the visualization tool. The questions were formulated as assertions. Each question could be answered on a linear scale from one to five, one meaning that the interviewee does not agree at all with the assertion, and five meaning that the interviewee agrees with the assertion. Finally, the usefulness of the tool could be assessed by computing the average grade.


Table 3.2: Linear scale questions
Q1   The results provided by the visualization tool seem to be useful
Q2   Colors make the result easier to be read
Q3   The hierarchical representation is relevant
Q4   The information displayed in each node is useful
Q5   The alias overview provides new information
Q6   The detailed overview provides new information
Q7   The graph representation is well suited to visualize the relationships between tainted variables
Q8   The tree representation (avoiding crossing edges) improves the visualization, despite the fact that some variables appear several times
Q9   It is useful to minimize some branches of the tree
Q10  The file location helps to understand the context
Q11  The search tool is useful
Q12  The files view is useful

The second set of questions [Tab. 3.3] was a list of tasks that the interviewees had to complete, based on the method used by LaToza and Myers [42]. These questions aimed at testing if the developers could understand the results and the notation used in the visualization tool, and thus assessing whether the user interface of the visualization tool was clear and understandable enough. All the tasks were performed on the test project created for the unit tests, which covers all cases of the taint propagation policy.

Table 3.3: Tasks
Q13  What is the ASIL rating of the following functions?
Q14  Please list one function which has been tainted by the safe variable “safe_ptr”.
Q15  Which alias is assigned to the safe variable “safe” in the function “safeInteger”?
Q16  Into which register(s) is the hard-coded address “0x4465” loaded?
Q17  Which functions and variables are tainted in the file called “file.c”?

Accuracy

Two studies were conducted to evaluate the accuracy of the tool. First, a test suite was developed to check the reliability of the functionalities requested by the taint propagation policy. Then, the pass was tested on an existing project of the company to assess the scalability of the tool and to compare the results of the analysis to the manual ASIL decomposition already done on this project.

Unit tests

Unit tests aim at testing the tool’s functional requirements, which can be used to assess the tool confidence level (TCL) [46]. Unit tests have been written using the Google Test Framework [54], which facilitates the implementation of unit tests in C++. At least one test case has been written for each case of the taint propagation policy described in [Tab. 3.1]. The unit tests are described below [Tab. 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10]. In each test case, the new objects added to the list of safe values are tested, as well as the ASIL propagation and the case of the taint propagation policy.


Table 3.4: Store test cases
Test 1   A negative constant is assigned to a safe integer variable
Test 2   A constant float is assigned to a safe float variable
Test 3   An integer variable is assigned to a safe integer variable
Test 4   The return value of a function is assigned to a safe integer variable
Test 5   An integer variable is assigned to a safe dereferenced pointer
Test 6   A pointer is assigned to a safe pointer
Test 7   A hard-coded address is assigned to a safe pointer
Test 8   An integer variable is inserted into a safe array of integers
Test 9   Two different integer variables are assigned to a safe variable in an if-statement
Test 10  An integer variable is assigned to a safe address

Table 3.5: Load address test case
Test 11  A safe hard-coded address is assigned to a pointer

Table 3.6: Pointer parameter test cases
Test 12  A safe pointer is passed as a parameter to a function and is modified inside the function
Test 13  A safe pointer is passed as a parameter to a function and is not modified inside the function

Table 3.7: Global initialization test case
Test 14  A global variable is initialized with a safe variable

Table 3.8: File test case
Test 15  A file is tainted

Table 3.9: Call test case
Test 16  A safe function is called by another function

Table 3.10: Violation test case
Test 17  The address of a safe integer is assigned to an unsafe pointer
Test 18  A safe pointer is assigned to an unsafe pointer

A case study: A real-world project

The tool was tested on a real project of the company. This project, which was requested by a supplier in order to configure an ECU, was selected because it was developed according to safety rules and was qualified ASIL. The accuracy of the taint analysis was assessed according to the “number of false positives” and “scalability” criteria, in the same way as Arroyo et al. [11].

The study was conducted as follows. The initial taint information was identified together with the safety engineers in order to configure and run the pass on the project. In order to compute the false-positive rate of the tool, the results obtained with the taint analysis tool were compared to the existing ASIL decomposition of this project. This project had already been decomposed with respect to the ASIL, which means that safe modules were already separated from QM modules. This decomposition had been done manually. Thus, the ASIL functional blocks (composed of one or more modules), listed in the memory map of the project, were compared to the ASIL functional blocks detected by the taint analysis pass. The results of the taint analysis pass were taken from the file view, which summarizes the results of the analysis.


In the context of this study, a functional block was considered as ASIL as long as one of its modules (a source-code file and a header file) was ASIL. A module was considered as ASIL if one of its global objects was tainted by the taint analysis pass. Finally, this project is also a large project composed of more than 1 000 files. Thus, this project could also be used to assess the scalability of the taint analyzer and the visualization tool. The program execution time of the taint analysis pass was measured using the command time. This command displays the real time, the user CPU time, and the system CPU time spent to execute the command. Only the real time was considered in this study.
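The two classification rules and the comparison with the manual decomposition can be sketched as follows. All names are hypothetical, and treating a block as a false positive when the pass reports it but the manual decomposition rates it QM is an assumption of this sketch; the thesis does not spell out the computation.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical inputs: tainted global objects per module, modules per block.
using TaintedObjects = std::map<std::string, std::vector<std::string>>;
using BlockModules = std::map<std::string, std::vector<std::string>>;

// A module is ASIL if at least one of its global objects is tainted.
bool isAsilModule(const TaintedObjects& tainted, const std::string& moduleName) {
    auto it = tainted.find(moduleName);
    return it != tainted.end() && !it->second.empty();
}

// A functional block is ASIL as soon as one of its modules is ASIL.
std::set<std::string> asilBlocks(const BlockModules& blocks,
                                 const TaintedObjects& tainted) {
    std::set<std::string> result;
    for (const auto& block : blocks) {
        for (const std::string& moduleName : block.second) {
            if (isAsilModule(tainted, moduleName)) {
                result.insert(block.first);
                break;
            }
        }
    }
    return result;
}

// Blocks reported by the pass that the manual decomposition rates QM
// (assumed definition of a false positive in this sketch).
std::set<std::string> falsePositives(const std::set<std::string>& detected,
                                     const std::set<std::string>& manual) {
    std::set<std::string> result;
    for (const std::string& block : detected)
        if (manual.count(block) == 0) result.insert(block);
    return result;
}
```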

Listing 3.12: Program execution time
time opt -load ./ModulePass.so -TaintAnalysisPass -config-file-name "config/config.xml" project.bc 2> log.txt

4 Results

This chapter describes the results obtained at the end of this thesis. The first section 4.1 is related to the use of LLVM to perform static analysis on automotive software. The second section 4.2 presents the results of the implementation of the taint analysis pass. The third section 4.3 presents the visualization tool. The last section 4.4 presents the results of the evaluation of the usefulness and the accuracy of the taint analysis tool.

4.1 LLVM

The taint analysis pass was first tested on two embedded AUTOSAR example projects of the company: HelloWorld and SafeInteriorLight. The HelloWorld project is not safety-critical, but it contains a complete ECU configuration, used to simulate a real-world ECU. The SafeInteriorLight project is composed of safety-critical parts and aims at controlling the interior lights of a car. These projects were used to check whether it was possible to compile embedded projects with Clang to generate LLVM IR. Both projects were successfully compiled with Clang, on the condition that the inline assembly parts were removed. All the bitcode files were linked together into a single LLVM IR file using the LLVM linker, as explained in section 3.2. Then, the pass was run successfully on these two LLVM IR files. The configuration files were composed of a variable, a file, and a memory range. Minor changes were made to the taint analyzer as a result of these tests, such as increasing the size of the title frame in the visualization to accommodate longer names, and allowing the taint propagation when a global variable is initialized with safe aggregate data, such as a structure. Thus, these projects served as a proof of concept that a taint analysis pass based on LLVM can be developed to analyze automotive software.

4.2 Taint analysis

A taint analysis LLVM pass has been developed, which is an object-oriented C++ application composed of eleven classes. The taint analysis pass takes as input the configuration file of the user and provides the results of the analysis in the form of a JSON file. The pass can be run on a project using the following command line.


Listing 4.1: Running the pass
opt -load ./ModulePass.so -TaintAnalysisPass project.bc 2> log.txt

The configuration file path can also be passed as an option through the command line, by specifying -config-file-name.

4.3 Visualization

An interactive visualization tool written in JavaScript has been developed, based on the D3 tree layout [55], and is presented in [Fig. 4.4]. The tool takes as input the JSON file supplied by the pass, and generates a visualization composed of four views. First, the main view [Fig. 4.2] shows the dependency tree. This view shows the functions which are tainted by the initiators, and the call graph of those functions. The aliases are not displayed in the main view, because the ASIL decomposition is usually done at the function level, so the most important information to show is the list of tainted functions. Second, the alias view shows which aliases are related to a safe variable inside a function [Fig. 4.3]. This view can be accessed by clicking on the functions tainted by the initiator. If the alias is also ASIL, then the functions tainted by the alias and their call graph are displayed in the alias view. Third, the detailed view, which can be accessed from a node, provides more details about the selected tainted variable, such as its name, its ASIL, its type and its file location. Finally, the files view displays a summary of all the functions and global variables tainted in each file [Fig. 4.1].

Figure 4.1: The list of tainted functions and global variables in each file


Figure 4.2: An example of the tree view, whose initiator is the variable safe.

Figure 4.3: The alias view of the variable safe in the function testInterProcedural

Figure 4.4: Visualization tool overview


4.4 Evaluation

This section first presents the results of the survey to evaluate the usefulness of the visualization tool. Then, the results of the evaluation of the accuracy of the tool are described.

Usefulness

The survey was conducted as described in section 3.4. Ten participants took part in this evaluation. Among them, 30% had already worked with automotive safety. Participants were asked to evaluate the usefulness of several functional aspects of the visualization tool. The results are presented in [Tab. 4.1].

The participants rated the usefulness of the visualization tool (Q1) 4.5 out of 5. The functional aspects which received the best grade (4.8) are the search tool (Q11) and the alias overview (Q5). 80% of the participants rated the search tool’s usefulness 5 out of 5 and 90% of the participants rated the alias view’s usefulness 5 out of 5.

The usefulness of the graph representation to visualize relationships between tainted variables (Q7) was rated 4.7, but the choice of the tree layout to avoid crossing edges (Q8) was rated 4.4. The hierarchical representation (Q3) was rated 4.6. The usefulness of the information displayed in each node (Q4) was rated 4.4. Finally, the possibility to minimize the branches of the tree (Q9) was rated 4.3. The file view (Q12) was rated 4.6 and the detailed overview (Q6) was rated 4.4. The functional aspects which received the lowest grades are the use of colors (Q2, rated 4.2) and the file location (Q10, rated 4.1).

The final average grade of the visualization tool, combining all the grades of the previous questions, was 4.48 out of 5.

Table 4.1: Linear scale questions
Question  Average grade
Q1        4.5
Q2        4.2
Q3        4.6
Q4        4.4
Q5        4.8
Q6        4.4
Q7        4.7
Q8        4.4
Q9        4.3
Q10       4.1
Q11       4.8
Q12       4.6
Total     4.48
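As a check of the total, the average can be recomputed from the twelve grades of Table 4.1, assuming that the total is the unweighted mean of the per-question averages:

```latex
\bar{g} \;=\; \frac{1}{12}\sum_{i=1}^{12} g_{\mathrm{Q}i}
       \;=\; \frac{53.8}{12}
       \;\approx\; 4.48
```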

Then, participants were asked to complete some tasks, described in section 3.4. The first task (Q13) was to find the ASIL rating of a list of functions. Four of the tainted functions were displayed in the main tree view, whereas one of them was displayed in the alias view. The answers given for the first four functions were 100% correct. However, the question related to the function hidden in the alias view received 90% correct answers: one participant could not find the node in the tree. Following this question, participants were asked which aspects they had used to answer it: the search tool, the tree, or both. The results are presented in [Fig. 4.5]. All the participants used the search tool to answer these questions: 30% used only the search tool, and 70% used both the search tool and the tree.


Figure 4.5: Which aspect has been used to find the ASIL rating of an object?

[Pie chart: Search tool only, 30%; Tree and search tool, 70%]

Question 14, which was related to the functions tainted by an initiator, received 80% correct answers. Question 15, which was related to the alias of a variable, received 100% correct answers. However, question 16, which was related to the aliases of an address, only received 80% correct answers. Question 17, which was related to the file view, received 90% correct answers.

Some participants also commented on the visualization tool, in addition to answering the survey questions. They suggested adding the possibility to minimize all nodes in the tree, and to reduce minimized nodes to a minimal rectangle containing only the variables, in order to make the tree shorter. They also suggested adding an explanation of the colors. Regarding the search tool, which was mainly used to find the ASIL of the objects, they suggested adding an automatic scroll to the node in the tree. They also asked that the initiators of the trees be clearly marked as such.

Accuracy

Unit tests

A set of 18 unit tests has been implemented, covering each case of the taint propagation policy described in [Tab. 3.1]. The output of the unit tests is presented below [List. 4.2].

Listing 4.2: Google Test Output
[======] 18 tests from 1 test suite ran. (150 ms total)
[ PASSED ] 18 tests.

A case study: A real-world project

The study of the real-world project was conducted as explained in section 3.4. The project was compiled with Clang to generate the corresponding LLVM IR. The results of the conversion of the project to LLVM IR are presented in [Tab. 4.2].

Table 4.2: LLVM IR metrics
Number of files                                     318
Number of global objects                            6569
Number of lines (without debugging information)     168 830
Number of lines (including debugging information)   577 668

The following step was to identify the taint information. The safety-critical functional blocks of this project were the serial peripheral interface (SPI) driver, the MPU, the mechanical sensors, and the watchdog manager (WdgM). The SPI driver is used to communicate with the system basis chip (SBC) [56] inside the ECU. The watchdog manager aims at detecting a program flow error during runtime [57].

The safety-critical source-code objects identified in those blocks were the SPI registers, the MPU registers, the mechanical sensors input registers and variables, and the WdgM variables. Registers here refer to hard-coded addresses. The number of objects classified as ASIL in the source code is presented in [Tab. 4.3].

Table 4.3: Taint information
Module                               Number of objects
Mechanical sensors input variables   5
WdgM variables                       5
Mechanical sensors input registers   1
SPI registers                        256
MPU registers                        8237

These variables and registers have been copied to a configuration file like the one presented below [List. 4.3]. The global variables are identified by their name and the memory regions are identified by a starting and an ending address.

Listing 4.3: An example of configuration file
wdgM_variable A
sensors_input A

0xffffC000 0xffffE02C C
0xf000B124 0xf000B124 D
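How a memory-region entry might be interpreted can be sketched as follows. The MemoryRange structure and both functions are hypothetical; the sketch assumes 32-bit addresses and that the starting and ending addresses are both inclusive, which the thesis does not state explicitly.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical in-memory form of one memory-region entry of the configuration.
struct MemoryRange {
    std::uint32_t start;  // first address of the region
    std::uint32_t end;    // last address of the region (assumed inclusive)
    char asil;            // 'A', 'B', 'C' or 'D'
};

// Number of addresses covered, assuming both bounds are inclusive.
std::uint64_t addressCount(const MemoryRange& range) {
    return static_cast<std::uint64_t>(range.end) - range.start + 1;
}

// Returns the rating of the first region containing `address`, or 'Q' for QM.
char ratingOf(const std::vector<MemoryRange>& ranges, std::uint32_t address) {
    for (const MemoryRange& range : ranges)
        if (address >= range.start && address <= range.end) return range.asil;
    return 'Q';
}
```

Under the inclusive reading, the first region of the example covers 8237 addresses and the second covers a single address. These are the same figures as the MPU registers and mechanical sensors input register counts in Table 4.3, although the listing does not say which module each region belongs to.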

The visualization of the results of the taint analysis pass on this project is presented in [Fig. 4.6], and the number of objects tainted by the pass is listed in [Tab. 4.4].

Table 4.4: Taint analysis results
Number of functional blocks   16
Number of modules             27
Number of functions           75
Number of global variables    12


[Figure: screenshot of the visualization tool (tree.html, 5/23/2019) showing the trees obtained for the real-world project. Nodes are anonymized functions and global variables (variable_name_N), each labeled with its ASIL. Legend entries: store, load, parameter, global, violation; ASIL A, ASIL B, ASIL C, ASIL D, QM; initiator. Controls: About, Files view, Minimize All, Search.]

Function   Function   Function  variable_name_123 variable_name_122 variable_name_121 ASIL: A ASIL: A ASIL: A variable_name_123 variable_name_122 variable_name_121 ASIL: A ASIL: A ASIL: A

Global   Function  Function  Function   Function  Global   Function  Function  Function   Function  variable_name_125 variable_name_124 ASIL: A ASIL: A variable_name_24 variable_name_23 variable_name_22 variable_name_125 variable_name_124 ASIL: D ASIL: D ASIL: D ASIL: A ASIL: A variable_name_24 variable_name_23 variable_name_22 ASIL: D ASIL: D ASIL: D

Function   Function   Function   Function 

Function   Function   Function   Function  variable_name_120 variable_name_119 variable_name_118 variable_name_117 ASIL: A ASIL: B ASIL: B ASIL: B variable_name_120 variable_name_119 variable_name_118 variable_name_117 ASIL: A ASIL: B ASIL: B ASIL: B

Function  Function 

Function  Function  variable_name_20 variable_name_19 ASIL: B ASIL: B variable_name_20 variable_name_19 ASIL: B ASIL: B

Function  Function   Function   Function   Function   Function 

Function  Function   Function   Function   Function   Function  variable_name_115 variable_name_114 variable_name_113 variable_name_112 variable_name_111 variable_name_110 ASIL: A ASIL: A ASIL: A ASIL: B ASIL: B ASIL: B variable_name_115 variable_name_114 variable_name_113 variable_name_112 variable_name_111 variable_name_110 ASIL: A ASIL: A ASIL: A ASIL: B ASIL: B ASIL: B

Function   Function   Function 

Function   Function   Function 

  variable_name_17 variable_name_16 variable_name_15 Global ASIL: B ASIL: B ASIL: B Function    variable_name_17 variable_name_16 variable_name_15 Global ASIL: B ASIL: B ASIL: B Function  variable_name_116 ASIL: A      variable_name_18 variable_name_116 Function Function Function ASIL: B Address   Function  ASIL: A      variable_name_18 Function Function Function ASIL: B Address   Function  variable_name_108 variable_name_107 variable_name_106 ASIL: A ASIL: A ASIL: A variable_name_21 variable_name_14 variable_name_108 variable_name_107 variable_name_106 ASIL: B ASIL: B ASIL: A ASIL: A ASIL: A variable_name_21 variable_name_14 ASIL: B ASIL: B

Function 

Function  variable_name_109 Function  ASIL: A variable_name_109 Function  ASIL: A variable_name_13 ASIL: B variable_name_13 ASIL: B Function   Function   Function   Function  Function   Function   Function  Function   Function   Function   Function  Function   Function   Function  variable_name_105 variable_name_104 variable_name_103 variable_name_102 ASIL: A ASIL: B ASIL: B ASIL: B variable_name_11 variable_name_10 variable_name_9 variable_name_105 variable_name_104 variable_name_103 variable_name_102 ASIL: B ASIL: B ASIL: B ASIL: A ASIL: B ASIL: B ASIL: B variable_name_11 variable_name_10 variable_name_9 ASIL: B ASIL: B ASIL: B

Function 

Function  variable_name_12 ASIL: B variable_name_12 ASIL: B

Function  Function   Function   Function   Function   Function  Function   Function   Function  Function  Function   Function   Function   Function   Function  Function   Function   Function  variable_name_100 variable_name_99 variable_name_98 variable_name_97 variable_name_96 variable_name_95 ASIL: A ASIL: A ASIL: A ASIL: B ASIL: B ASIL: B variable_name_8 variable_name_7 variable_name_6 variable_name_100 variable_name_99 variable_name_98 variable_name_97 variable_name_96 variable_name_95 ASIL: B ASIL: B ASIL: B ASIL: A ASIL: A ASIL: A ASIL: B ASIL: B ASIL: B variable_name_8 variable_name_7 variable_name_6 ASIL: B ASIL: B ASIL: B

Global  

Global   variable_name_101 ASIL: A      variable_name_101 Function Function Function Function  Function  ASIL: A Function   Function   Function  Function  Function  variable_name_93 variable_name_92 variable_name_91 ASIL: A ASIL: A ASIL: A variable_name_4 variable_name_3 variable_name_93 variable_name_92 variable_name_91 ASIL: D ASIL: D ASIL: A ASIL: A ASIL: A variable_name_4 variable_name_3 ASIL: D ASIL: D

Function  Address   Function  Address   variable_name_94 ASIL: A variable_name_5 variable_name_94 ASIL: D ASIL: A variable_name_5 ASIL: D

Function   Function   Function   Function  Function  Function  Function   Function   Function   Function  Function  Function  variable_name_90 variable_name_89 variable_name_88 variable_name_87 ASIL: A ASIL: B ASIL: B ASIL: B variable_name_2 variable_name_1 variable_name_90 variable_name_89 variable_name_88 variable_name_87 ASIL: D ASIL: D ASIL: A ASIL: B ASIL: B ASIL: B variable_name_2 variable_name_1 ASIL: D ASIL: D

Function  Function   Function 

Function  Function   Function  variable_name_85 variable_name_84 variable_name_83 ASIL: A ASIL: A ASIL: D variable_name_85 variable_name_84 variable_name_83 ASIL: A ASIL: A ASIL: D

Function  Function   Function 

Function  Function   Function  file:///C:/Users/elgo/projects/helloWorldPass/HelloWorldPasses/Javascript/html/tree.html 1/2 file:///C:/Users/elgo/projects/helloWorldPass/HelloWorldPasses/Javascript/html/tree.html 2/2 file:///C:/Users/elgo/projects/helloWorldPass/HelloWorldPasses/Javascript/html/tree.html 1/2 file:///C:/Users/elgo/projects/helloWorldPass/HelloWorldPasses/Javascript/html/tree.html 2/2 Figure 4.6: An overview of the result of the taint analysis pass on the project (real names have been modified)


The taint analysis pass identified 27 modules as ASIL, corresponding to 16 functional blocks. A functional block is considered ASIL if at least one of its modules is tainted, and a module is considered ASIL if at least one of its global objects is tainted. In total, 89 global objects were classified as ASIL. 21 modules were tainted without being directly influenced by the initiators, which means that none of their global objects or hard-coded addresses were tainted at initialization. Although the MPU registers were listed in the configuration file, no objects were tainted because of them. All the other variables and registers from the configuration file propagated their taint to some objects.

As described in section 3.4, the list of functional blocks which were tainted according to the manual decomposition was compared to the results of the taint analysis pass. The results of the comparison are presented in [Tab. 4.5].

Table 4.5: Results

Metrics            Number of functional blocks
True positives     11
False positives    5
False negatives    4
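As a cross-check, the counts of [Tab. 4.5] can be turned into the three rates used in this evaluation with a few lines of arithmetic. The sketch below is only an illustration: the function name `rates` and the dictionary keys are invented here, and the ratios are written the way they are applied in this chapter (both the false positives rate and the precision are taken relative to the 16 functional blocks flagged by the pass).

```python
def rates(tp, fp, fn):
    """Return the rates of this evaluation as percentages.

    tp, fp, fn are numbers of functional blocks (true positives,
    false positives, false negatives).
    """
    flagged = tp + fp    # blocks reported as ASIL by the pass
    expected = tp + fn   # blocks that are ASIL in the manual decomposition
    return {
        "false_positives_rate": 100.0 * fp / flagged,
        "precision": 100.0 * tp / flagged,
        "recall": 100.0 * tp / expected,
    }


# Counts taken from Tab. 4.5.
table_4_5 = rates(tp=11, fp=5, fn=4)
for name, value in table_4_5.items():
    print(f"{name}: {value:.2f}%")
```

Keeping the counts separate from the formulas makes it easy to recompute the rates when the classification of a functional block is revised.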

From the counts in [Tab. 4.5], the false positives rate, the precision and the recall were computed as explained in section 3.4.

false positives rate = 5/16 = 31.25% (4.1)

precision = 11/16 = 68.75% (4.2)

recall = 11/15 = 73.33% (4.3)

In order to assess the scalability of the taint analysis pass, the execution time of the pass was measured as explained in section 3.4. This evaluation was conducted on a Dell M2800 computer with the following characteristics:

• Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz

• 16.0 GB RAM

• Windows 7 Professional

The results are presented in [Tab. 4.6].

Table 4.6: Program execution time results

Average          Median          Min             Max
4 m 45.0378 s    4 m 45.668 s    4 m 38.953 s    4 m 49.912 s
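Summary statistics such as those of [Tab. 4.6] can be derived from a list of wall-clock measurements as sketched below. The durations in `samples` are placeholders chosen for the illustration and are not the measurements of this thesis; the helper names `summarize` and `format_duration` are likewise invented.

```python
import statistics


def format_duration(seconds):
    """Format a duration in seconds as 'M m S.SSS s'."""
    minutes, rest = divmod(seconds, 60)
    return f"{int(minutes)} m {rest:.3f} s"


def summarize(samples):
    """Average, median, minimum and maximum of a list of durations."""
    return {
        "average": statistics.mean(samples),
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
    }


# Placeholder durations in seconds (NOT the values measured in the thesis).
samples = [280.0, 285.5, 290.0]
summary = summarize(samples)
for name, value in summary.items():
    print(name, format_duration(value))
```

Reporting the median next to the average helps to spot a single slow run that would otherwise distort the result.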

Regarding the scalability of the visualization tool, [Fig. 4.6] shows an overview of the tree graph. The view expands with the number of initiators, so that the distance between nodes remains sufficient and nodes do not overlap. This shows how the tool can cope with a large number of objects.

5 Discussion

This chapter first deals with the possible improvements of the taint analysis pass in section 5.1. Then, the results of the evaluations, the method and the sources are discussed in sections 5.2, 5.3 and 5.4, respectively. Finally, the results of the thesis are analyzed in a wider context in section 5.5.

5.1 Taint analysis

Improvements

Some improvements could be made to the taint analysis pass. The pass could be integrated into the company’s IDE, since using a standalone tool takes more time than using a tool already integrated in an IDE. As presented in section 2.7, developing a plugin to display the results of the pass on the source code, in addition to the graph view which provides a clear overview of the results, would have been an interesting solution.

For the tool to be usable on a larger scale, it would be necessary to automate the compilation of the projects with Clang to generate LLVM IR. Currently, most of the company’s projects are built with the GCC compiler. Although Clang is often compatible with GCC, generating LLVM IR requires additional handling. An efficient improvement could be to include the pass in the company’s continuous integration process.

Currently, the pass does not handle inline assembly or architecture-specific machine code. Inline assembly parts have to be ignored during the compilation so that the pass can be run on the project. In fact, the pass has been developed to analyze C code only. Future versions of this taint analyzer could include the analysis of other languages, which requires handling some new cases. However, the algorithm used to propagate the taint should be reusable because it does not depend on a specific language.

One of the limits of the visualization tool is that the functions tainted by an alias of an initiator are not displayed in the main tree view. Thus, it can be difficult to find the context of the tainted functions resulting from tainted aliases. Moreover, the current visualization tool cannot display the root tree that a given alias belongs to. It would have been possible to add the aliases directly in the main tree view, but the advantage of the alias view is that it simplifies the main visualization and prevents graph explosion. This alias view keeps the main graph easily readable.


5.2 Results

Usefulness

The results of the survey show that the tool was generally considered useful by the participants. Judging from the answers to the survey, the visualization tool also seems to be easy to understand: each task received at least 80% of correct answers.

Looking more closely at the results for the different aspects of the tool, the search tool was evaluated as very useful, as predicted by the study conducted by Bassil and Keller [43]. Participants mainly used the search tool, rather than the tree view, to find the ASIL rating of a given object. The tree view is mainly used to explore a safety-critical path from an initiator. As suggested by the participants, the possibility to scroll to the first occurrence of the search input in the tree was added, to facilitate the reading of the tree graph.

The alias overview was also evaluated as useful. This overview can be used to gain information about the tainting context of a function and to show the functions tainted by the aliases of the initiators. It was evaluated as more useful than the file view. This can be explained by the fact that the alias overview is accessible directly from the tree, whereas the file view is located on a different page. Despite its usefulness, some participants reported that the alias view was at first a bit hard to understand. The function hidden in the alias view received only 90% of correct answers because some participants could not find it in the tree, while functions displayed in the tree view received 100% of correct answers. In fact, the search tool only highlighted the objects displayed in the main tree view. The possibility to display the alias tree with the search tool has since been added. Moreover, the task involving the address loaded into a register, whose answer was also in the alias view, received 80% of correct answers. It is possible to deduce from this that the propagation policy case in the alias view was a bit unclear. Therefore, explanations have been added to the header.

Contrary to the survey conducted by Bassil and Keller [43], the use of colors received the lowest grade. This can be explained by a poor choice of colors, or by the fact that the meaning of the colors was not explained. To solve this problem, a caption was added to the header of the visualization tool to explain the meaning of each color.

The file view was rated 4.6. This view is useful to summarize the results of the pass and provides a global overview of the project.

The detailed overview was rated 4.4. This overview should be used to show the source-code context, that is to say, the file location, which also received the lowest grade. This view was added to compensate for the lack of source-code browsing. It is possible to assume that this information was not enough to understand the context. Some debugging information could have been added, such as the line of the definition, or the C instruction which led to the tainting of an object.

The information displayed in each node was rated 4.4. This information was redundant with the information displayed in the detailed view. Participants reported that it was not clear which objects were the initiators. Their names are now written in italics to overcome this issue.

Unit tests

The unit test suite was very helpful during the development. Developing unit tests is time-consuming, but it is worth it. The unit tests can be used to check that the functional requirements are fulfilled, at least on specific cases. They also ensure that functionalities already developed are not lost: if additions to the code break the tests, this is detected immediately. Thus, the test suite increases the tool confidence level (TCL), presented in section 2.8, and also facilitates maintenance.


A case study: a real-world project

Testing the pass on this project made it possible to observe the real conditions of an embedded project. The results of the taint analysis pass have been compared to the initial decomposition of this project.

Recall. Four false negatives were reported: these functional blocks were tainted in the original decomposition but not detected by the taint analysis pass, which leads to a recall of 73.33%. These functional blocks were related to the functions which were used to access the MPU. However, the MPU addresses were not analyzed by the pass and the taint did not propagate. In fact, the MPU registers are not accessed with C instructions, but with instructions which are specific to the architecture of the embedded system. These instructions are translated into calls in LLVM IR and are not recognized by the tool as store or load instructions. This is an issue, as it means that some safety-critical instructions may be missed, and therefore safety-critical files may not be detected. The result of the analysis also depends on the objects initially tainted: an omission in the taint information has an impact on the results.
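The missed MPU accesses can be illustrated with a deliberately simplified model. The sketch below is not the pass itself: instructions are plain Python tuples rather than LLVM IR, and every name in it (`write_flag`, `configure_mpu`, `mpu_write_intrinsic`, `MPU_REGION_REG`, `safe_flag`) is hypothetical. With the default setting only `load` and `store` records are inspected, so a register that is reached through a call goes unnoticed. The optional `calls_as_accesses` switch is an assumption about one possible remedy, not a feature of the tool.

```python
# Toy instruction records: (enclosing function, opcode, operands).
# This is an invented stand-in for LLVM IR, used only for illustration.
PROGRAM = [
    # Ordinary C assignment to a safe global: appears as a store.
    ("write_flag", "store", ("safe_flag",)),
    # Architecture-specific register access: appears as a call.
    ("configure_mpu", "call", ("mpu_write_intrinsic", "MPU_REGION_REG")),
]

# Objects tainted from the start (hypothetical initiators).
INITIATORS = {"safe_flag", "MPU_REGION_REG"}


def tainted_functions(program, initiators, calls_as_accesses=False):
    """Return the functions that touch an initiator.

    By default only load/store records count as memory accesses.
    With calls_as_accesses=True the arguments of a call are also
    treated as accessed objects (hypothetical extension).
    """
    tainted = set()
    for function, opcode, operands in program:
        if opcode in ("load", "store"):
            touched = operands
        elif opcode == "call" and calls_as_accesses:
            touched = operands[1:]  # skip the callee name
        else:
            touched = ()
        if any(obj in initiators for obj in touched):
            tainted.add(function)
    return tainted


print(sorted(tainted_functions(PROGRAM, INITIATORS)))
print(sorted(tainted_functions(PROGRAM, INITIATORS, calls_as_accesses=True)))
```

In the first call `configure_mpu` is absent from the result, which corresponds to the kind of false negative described above.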

False positives rate and precision. Five functional blocks were identified as false positives, which leads to a false positives rate of 31.25% and a precision of 68.75%. Some over-approximations can increase the number of functions tainted by the pass. For example, if a global pointer is passed as a parameter to a function but is not modified in that function, the calling function remains tainted, due to the over-approximation of the inter-procedural analysis.

By analyzing these five cases in detail together with the safety engineers, it appeared that two of them should have been tainted in the initial decomposition. One tainted function was a generated function, which was supposed to read a value, but which actually wrote into a pointer. The other function was a hand-coded function used to validate a checksum. If these two cases are counted as true positives, the updated results are the following:

false positives rate = 3/16 = 18.75% (5.1)

precision = 13/16 = 81.25% (5.2)

Thus, the precision achieved is 81.25%, which means that the tool does not perform many over-approximations. Therefore, the tool is not likely to increase the work of engineers, because it does not taint many more objects than are actually safety-critical.
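The over-approximation described above for global pointers passed as parameters can be illustrated with a small model. This is a sketch under stated assumptions, not the algorithm of the pass: call sites are invented tuples, all function and variable names are hypothetical, and the "precise" variant assumes that the set of callees writing through their pointer parameter is known, which the conservative rule does not require.

```python
# Call sites: (caller, callee, global pointer passed as argument).
# All names are invented for the illustration.
CALLS = [
    ("log_speed", "format_value", "speed_ptr"),   # callee only reads
    ("update_speed", "set_value", "speed_ptr"),   # callee writes
]

# Callees that really write through their pointer parameter
# (information assumed to be available for the precise variant).
WRITERS = {"set_value"}


def tainted_callers(calls, tainted_globals, writers=None):
    """Return the callers tainted by passing a tainted global.

    writers=None models the conservative rule: every call is assumed
    to modify its argument, so every such caller is tainted.
    Otherwise only calls to a known writer taint the caller.
    """
    result = set()
    for caller, callee, argument in calls:
        if argument not in tainted_globals:
            continue
        if writers is None or callee in writers:
            result.add(caller)
    return result


conservative = tainted_callers(CALLS, {"speed_ptr"})
precise = tainted_callers(CALLS, {"speed_ptr"}, writers=WRITERS)
print(sorted(conservative))
print(sorted(precise))
```

The difference between the two results (`log_speed` here) is the kind of function that can end up as a false positive. The conservative rule has the advantage of never missing a write.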

ASIL decomposition. Furthermore, the initial manual decomposition had been made only at the functional block level, whereas the taint analysis pass can mark modules, functions and global variables as ASIL. Thus, the taint analysis pass allows for a more detailed decomposition.

The results of the taint analysis pass show that one of the system’s main runnables (AUTOSAR terminology for a periodically scheduled C function), which was qualified ASIL, could be split into ASIL and QM parts. In fact, the tool indicates that this function was qualified ASIL because of the modification of a safe global variable and the call to a safe function, whose logical blocks represent around 24 lines of code out of 76. Thus, the split would be around 70% QM and 30% ASIL code, and while the ASIL code in question is functionally simpler, such a split would have significantly decreased the development effort for the runnable in question according to the safety engineers of the company.

On a module level, one of the system’s main modules that is qualified as an ASIL module could be reduced in size by moving parts of code that do not interact with ASIL data to a separate module. The share of module-scoped functions that could be moved in this way is between 35% and 45%, depending on how conservative the approach to the functional safety architecture is. Decreasing the size of the ASIL module this way would likely provide a significant decrease in the time required for post-implementation activities such as the module’s safety qualification and analysis.

Another ASIL-classified module turned out to only have a small part of code, around 15%, that is actually safety-relevant. In this case as well, splitting the safety-relevant functionality off would have significantly reduced the later documentation, analysis and qualification effort.
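The finer granularity discussed above rests on the roll-up rule used in the evaluation: a module is ASIL if at least one of its global objects is tainted, and a functional block is ASIL if at least one of its modules is. The sketch below applies that rule to an invented project layout (block, module and object names are hypothetical) and lists the modules that stay QM inside an ASIL functional block, that is, the candidates for a more detailed decomposition. It is a minimal illustration and does not reproduce the output format of the pass.

```python
# functional block -> module -> {global object: is it tainted?}
# All names are invented for the illustration.
PROJECT = {
    "BlockA": {
        "mod_a1": {"g_speed": True, "g_log": False},
        "mod_a2": {"g_cache": False},
    },
    "BlockB": {
        "mod_b1": {"g_ui": False},
    },
}


def module_is_asil(objects):
    """A module is ASIL if at least one of its global objects is tainted."""
    return any(objects.values())


def block_is_asil(modules):
    """A functional block is ASIL if at least one of its modules is ASIL."""
    return any(module_is_asil(objects) for objects in modules.values())


def qm_modules_in_asil_blocks(project):
    """Modules without tainted objects that sit inside an ASIL block."""
    return sorted(
        module
        for modules in project.values()
        if block_is_asil(modules)
        for module, objects in modules.items()
        if not module_is_asil(objects)
    )


asil_blocks = sorted(b for b, m in PROJECT.items() if block_is_asil(m))
print(asil_blocks)
print(qm_modules_in_asil_blocks(PROJECT))
```

A decomposition made only at the functional block level would treat all of `BlockA` as ASIL, whereas the module-level view shows that `mod_a2` contains no tainted object.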

Scalability. The pass has been tested on a large project. As shown in [Fig. 4.6], the visualization tool is well suited to a large number of nodes. The time needed to run the pass on this project remains reasonable because it does not exceed five minutes, which is acceptable for a static analysis tool.

5.3 Method

Validity

Regarding the evaluation of the usefulness, the participants were all working in the company. Thus, the survey was carried out in a controlled environment, which supports the seriousness of the participants and the reliability of the answers. Their involvement is also shown by the detailed comments they provided. To increase the validity of the experiment and to obtain more significant results, the survey could have been submitted to more participants, as noted by Bassil and Keller [43].

Regarding the evaluation of the accuracy, the tool has been tested on a real project. This allowed the tool to be evaluated under real conditions.

Replicability

Regarding the evaluation of the usefulness, the level of experience with automotive safety may affect the replicability of the evaluation. In fact, different participants can have different opinions regarding the most important aspects of a visualization tool, depending on their knowledge of the needs related to software safety. Some questions, related to user experience, were more subject to personal interpretation. These factors could affect the results of a similar evaluation.

The results of the evaluation of the accuracy depend on the project which is tested and on its previous manual decomposition. Nevertheless, it can be expected that the general pattern will be similar, that is to say, that the tool would allow engineers to identify the safety-critical components of an application in more detail.

Reliability

Regarding the usefulness of the tool, the results were quite significant as the average grade of the tool was above 4 out of 5. This means that participants agreed that the tool was useful.

Regarding the accuracy of the tool, the results are quite reliable because the tool has been subjected to unit tests. This tool can be used as a basis by a safety engineer during the ASIL decomposition, but it is important to compare the results with another analysis, human or automated. A best practice would be to apply the same development process to the analysis tool as to the tested project in order to increase the TCL [46].


5.4 Source criticism

Peer-reviewed papers have mostly been used as primary sources. It was quite easy to find information about automotive safety and software development, including static analysis, software visualization and tool evaluation.

International standards have been used to obtain recommendations and detailed information about safety for road vehicles and embedded systems.

However, there is unfortunately a lack of concrete information regarding the identification of patterns generating “cascading failures” [1] at the software level. Thus, the taint propagation policy has been mainly based on the experience of the engineers with automotive safety. ISO 26262 Part 9 Section 7.4.4 [1] recommends the use of “checklists based on field experience” to assess “potential dependent failures plausibility”.

Regarding the implementation of the pass, the LLVM Project provides clear and detailed documentation, which has been widely used to develop the pass.

5.5 The work in a wider context

This tool aims at facilitating the work of engineers by providing them with an analysis tool to support their work. This tool identifies the safety-critical components, so that engineers can focus on the safe development of these components.

Automotive safety is a societal challenge. Vehicles are composed of more and more embedded computer systems. Users and manufacturers require safety guarantees in order to trust the vehicles. These safety expectations increase especially in autonomous driving: “The reason for the large amount of software requirement is the electrification of the automobile and autonomous driving systems” according to Sari and Reuss [58].

Vision Zero [59] is a road safety project created in Sweden in 1997. This philosophy considers users’ serious injuries, due to road vehicles or the road transport system, as “unacceptable”. Therefore, safety should not be “traded against” mobility [59]. Thus, it is useful to recall that emphasis should be placed on safety in the automotive industry.

However, it is still hard or impossible to “reduce the risk to zero” [46]. Therefore, a static analysis tool aims at reducing the risk “as low as reasonably practicable” according to Anderson [46].

6 Conclusion

This chapter summarizes the purpose of this work and the answers to the research questions:

1. Is LLVM suitable to perform static analysis on automotive software? This research question aimed at determining whether it was possible to develop a static analyzer using the LLVM compiler infrastructure. The features offered by the LLVM compiler infrastructure, such as the LLVM Pass Framework, were studied. It was deduced that it was possible to compile a project with Clang to generate LLVM IR, and to develop a pass to analyze this intermediate representation. The pass was successfully run on three automotive projects. Thus, it was concluded that it was possible to develop an LLVM pass to analyze automotive software.

2. How can static taint analysis be used to track dependencies related to safe components in automotive software? This research question aimed at examining how to use taint analysis in order to track the dependencies related to safe components in automotive software. An inter-procedural, field-insensitive and flow-insensitive taint analyzer was developed to analyze the dependencies between safety-critical components in automotive software. To this end, the taint propagation policy was defined and implemented. LLVM IR was analyzed to identify the safety-critical operations. A taint analysis algorithm was developed to propagate the taint to the new users related to the taint information.

3. How to represent results in an understandable way so that engineers can improve the safety development process? This research question aimed at studying the alternatives to represent results in an understandable way. A JavaScript tool was developed to visualize the results provided by the LLVM pass. The dependencies between safe objects were represented in a tree graph, in order to highlight the safety-critical paths of the software. A file view, showing the functions and global variables tainted in each file, was used to summarize the results of the pass.

4. Is the taint analysis accuracy sufficient for the application? How does taint analysis visualization affect the usefulness of the output?


These research questions aimed at evaluating the results of the thesis. The usefulness of the visualization was assessed using a survey submitted to the employees of the company. This survey showed that the search tool and the alias view were the most useful aspects of the visualization. Overall, the tool was considered useful and understandable by the participants.

The accuracy of the tool was assessed through unit tests and the analysis of a case study. The unit tests were used to check the functionalities of the tool. The case study showed that the tool was incomplete due to over-approximations, and unsound because it could not detect the dependencies related to the MPU. Nevertheless, the tool was able to detect two functional blocks which should have been tainted in the initial decomposition. Finally, the case study revealed that the tool could effectively improve the precision of the ASIL decomposition.

6.1 Consequences

This thesis can be used as a proof of concept to show that it is possible to develop a static taint analysis tool using the LLVM compiler infrastructure to analyze automotive software. According to Anderson [46], static analysis tools exist to check MISRA C rules and coding best practices, but fewer tools exist to check the requirements of the ISO 26262 certification.

It is hoped that this taint analysis tool can help safety engineers in their work. This tool should allow them to save time and development effort, by highlighting the safety-critical components in automotive software. Separating the safety-critical parts from the QM parts in automotive software would allow engineers to save time and money. It would prevent developers from classifying an entire application as ASIL.

According to Heling et al. [6], “it is not necessary to assume that every requirement of the basic software must be generally classified as safety related”. Therefore, ASIL decomposition is important. It allows engineers to focus effort on the components which require safety-oriented development.

Thus, taint analysis can be used to support and improve the precision of ASIL decomposition.

6.2 Further work

The taint analysis pass could be integrated into the development process of the company. In fact, ASIL decomposition should be prepared early in the development cycle. This would allow engineers to identify the safety-critical components of the software iteratively and to develop them according to the ISO 26262 requirements.

The tool could also be integrated into the company’s IDE, which would facilitate the selection of taint information. This would make it possible to use the tool during the development phase. Because automotive projects are quite large, displaying the information directly in the IDE would improve the usability of the results. In fact, the visualization would be clearer because the results would be annotated on the source-code files in addition to the dependency graph. Moreover, the integration of the tool would simplify the LLVM IR generation step. This step could be added to the compilation process of the project, although it requires a compilation with Clang instead of the GCC compiler.
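The LLVM IR generation step mentioned above could be scripted along the following lines. This is a sketch under assumptions, not the build setup of the company: the function name `llvm_ir_commands`, the output directory `ir` and the source paths are invented, and a real project would also need its own include paths and preprocessor definitions. Only the Clang options `-S` and `-emit-llvm`, which request textual LLVM IR, are taken as given. The script only builds the command lines and does not execute them.

```python
from pathlib import Path


def llvm_ir_commands(sources, out_dir="ir", clang="clang"):
    """Build one Clang invocation per C file that emits textual LLVM IR.

    Nothing is executed here. Note that two sources with the same file
    name in different directories would map to the same output file.
    """
    commands = []
    for source in sources:
        target = Path(out_dir) / (Path(source).stem + ".ll")
        commands.append(
            [clang, "-S", "-emit-llvm", str(source), "-o", target.as_posix()]
        )
    return commands


# Hypothetical source files of a project.
commands = llvm_ir_commands(["src/main.c", "src/can/driver.c"])
for command in commands:
    print(" ".join(command))
```

In a continuous integration job, each command could then be run with `subprocess.run(command, check=True)` before the taint analysis pass is applied to the generated files.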

Bibliography

[1] ISO 26262-9:2018(en), Road vehicles - Functional safety - Part 9: Automotive safety integrity level (ASIL)-oriented and safety-oriented analyses. URL: https://www.iso.org/obp/ui/#iso:std:iso:26262:-9:ed-2:v1:en (visited on 04/02/2019).

[2] IEC Functional Safety and IEC 61508. URL: https://www.iec.ch/functionalsafety/ (visited on 03/01/2019).

[3] ARCCORE - Company. URL: https://www.arccore.com/company (visited on 03/01/2019).

[4] AUTOSAR development cooperation. About. URL: https://www.autosar.org/about/ (visited on 03/01/2019).

[5] R. A. B. e Silva, N. N. Arai, L. A. Burgareli, J. M. P. de Oliveira, and J. S. Pinto. “Formal Verification With Frama-C: A Case Study in the Space Software Domain”. In: IEEE Transactions on Reliability 65.3 (Sept. 2016), pp. 1163–1179. ISSN: 0018-9529. DOI: 10.1109/TR.2015.2508559.

[6] Günther Heling and Jochen Rein. “SilentBSW – Silent AUTOSAR Basic Software for Safety Related ECUs”. In: 2012, p. 4. URL: https://assets.vector.com/cms/content/know-how/_technical-articles/AUTOSAR/AUTOSAR_SilentBSW_ATZ_Elektronik_201211_PressArticle_EN.pdf.

[7] Florian Leitner-Fischer, Stefan Leue, and Sirui Liu. “Automated Freedom from Interference Analysis for Automotive Software”. In: CARS 2016 - 4th International Workshop on Critical Automotive applications : Robustness & Safety. Ed. by Matthieu Roy. CARS 2016 - Critical Automotive applications : Robustness & Safety. Göteborg, Sweden, Sept. 2016. (Visited on 02/15/2019).

[8] A. Imparato, R. R. Maietta, S. Scala, and V. Vacca. “A Comparative Study of Static Analysis Tools for AUTOSAR Automotive Software Components Development”. In: 2017 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW). Oct. 2017, pp. 65–68. DOI: 10.1109/ISSREW.2017.21.

[9] A. Goebel, R. Mader, and O. Tripon. “Performance and Freedom From Interference - a contradiction in embedded automotive multi-core applications?” In: ARCS 2017; 30th International Conference on Architecture of Computing Systems. Apr. 2017, pp. 1–9.

46 Bibliography

[10] L. d S. Azevedo, D. Parker, M. Walker, Y. Papadopoulos, and R. E. Araújo. “Assisted Assignment of Automotive Safety Requirements”. In: IEEE Software 31.1 (Jan. 2014), pp. 62–68. ISSN: 0740-7459. DOI: 10.1109/MS.2013.118. [11] M. Arroyo, F. Chiotta, and F. Bavera. “An user configurable clang static analyzer taint checker”. In: 2016 35th International Conference of the Chilean Computer Science Society (SCCC). IEEE, Oct. 2016, pp. 1–12. DOI: 10.1109/SCCC.2016.7835996. [12] C. Lattner and V. Adve. “LLVM: A compilation framework for lifelong program analy- sis & transformation”. en. In: International Symposium on Code Generation and Optimiza- tion, 2004. CGO 2004. San Jose, CA, USA: IEEE, 2004, pp. 75–86. ISBN: 978-0-7695-2102- 2. DOI: 10.1109/CGO.2004.1281665. URL: http://ieeexplore.ieee.org/ document/1281665/ (visited on 03/01/2019). [13] Writing an LLVM Pass — LLVM 9 documentation. URL: https://llvm.org/docs/ WritingAnLLVMPass.html (visited on 04/16/2019). [14] Motor Industry Software Reliability Association, ed. MISRA C:2012: guidelines for the use of the C language in critical systems. en. OCLC: 847117002. Nuneaton: Misra, 2013. ISBN: 978-1-906400-10-1 978-1-906400-11-8. [15] Rajeshwari Hegde, Geetishree Mishra, and Gurumurthy. “Software and Hardware De- sign Challenges in Automotive Embedded System”. en. In: International Journal of VLSI Design & Communication Systems 2.3 (Sept. 2011), pp. 165–174. ISSN: 09761357. DOI: 10. 5121/vlsic.2011.2314. URL: http://www.aircconline.com/vlsics/V2N3/ 2311vlsics14.pdf (visited on 04/01/2019). [16] K. Lind and R. Heldal. “Automotive System Development Using Reference Architec- tures”. In: 2012 35th Annual IEEE Software Engineering Workshop. Oct. 2012, pp. 42–51. DOI: 10.1109/SEW.2012.11. [17] U. Freund. “Multi-level system integration based on AUTOSAR”. In: 2008 ACM/IEEE 30th International Conference on Software Engineering. May 2008, pp. 581–582. DOI: 10. 1145/1368088.1368168. [18] Static Code Analysis. 
URL: https://www.owasp.org/index.php/Static_Code_ Analysis. [19] Language Reference Manual — LLVM 9 documentation. URL: https : / / llvm . org / docs/LangRef.html (visited on 02/26/2019). [20] C. Feng and X. Zhang. “A Static Taint Detection Method for Stack Overflow Vulnerabil- ities in Binaries”. In: 2017 4th International Conference on Information Science and Control Engineering (ICISCE). July 2017, pp. 110–114. DOI: 10.1109/ICISCE.2017.33. [21] H. Liang, S. Liu, Y. Zhang, and M. Wang. “Improving the precision of static analysis: Symbolic execution based on GCC abstract syntax tree”. In: 2017 18th IEEE/ACIS In- ternational Conference on Software Engineering, Artificial Intelligence, Networking and Paral- lel/Distributed Computing (SNPD). June 2017, pp. 395–400. DOI: 10.1109/SNPD.2017. 8022752. [22] Markus Mock, Manuvir Das, Craig Chambers, and Susan J. Eggers. “Dynamic points- to sets: a comparison with static analyses and potential applications in program un- derstanding and optimization”. en. In: Proceedings of the 2001 ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering - PASTE ’01. Snowbird, Utah, United States: ACM Press, 2001, pp. 66–72. ISBN: 978-1-58113-413-1. DOI: 10 . 1145/379605.379671. URL: http://portal.acm.org/citation.cfm?doid= 379605.379671 (visited on 02/25/2019).

47 Bibliography

[23] Patrick Cousot and Radhia Cousot. “Abstract Interpretation: A Unified Lattice Model for Static Analysis of Programs by Construction or Approximation of Fixpoints”. In: Proceedings of the 4th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages. POPL ’77. event-place: Los Angeles, California. New York, NY, USA: ACM, 1977, pp. 238–252. DOI: 10.1145/512950.512973. URL: http://doi.acm.org/ 10.1145/512950.512973 (visited on 02/25/2019). [24] E. J. Schwartz, T. Avgerinos, and D. Brumley. “All You Ever Wanted to Know about Dy- namic Taint Analysis and Forward Symbolic Execution (but Might Have Been Afraid to Ask)”. In: 2010 IEEE Symposium on Security and Privacy. May 2010, pp. 317–331. DOI: 10.1109/SP.2010.26. [25] D. Avots, M. Dalton, V. B. Livshits, and M. S. Lam. “Improving software security with a C pointer analysis”. In: Proceedings. 27th International Conference on Software Engineering, 2005. ICSE 2005. IEEE, May 2005, pp. 332–341. DOI: 10.1109/ICSE.2005.1553576. URL: https://ieeexplore.ieee.org/document/1553576. [26] Lars Ole Andersen. Program Analysis and Specialization for the C Programming Language. Tech. rep. 1994. [27] Michael Hind. “Pointer analysis: Haven’t we solved this problem yet?” In: Paste’01. ACM Press, 2001, pp. 54–61. [28] Bjarne Steensgaard. “Points-to Analysis in Almost Linear Time”. In: Proceedings of the 23rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. POPL ’96. event-place: St. Petersburg Beach, Florida, USA. New York, NY, USA: ACM, 1996, pp. 32–41. ISBN: 978-0-89791-769-8. DOI: 10 . 1145 / 237721 . 237727. URL: http : //doi.acm.org/10.1145/237721.237727 (visited on 03/05/2019). [29] Sheng-Hsiu Lin. Alias Analysis in LLVM. en. Theses and Dissertations. Lehigh Univer- sity, 2015. [30] The LLVM Compiler Infrastructure Project. URL: https : / / llvm . org/ (visited on 02/26/2019). [31] The Architecture of Open Source Applications: LLVM. URL: http://www.aosabook. org/en/llvm.html (visited on 02/21/2019). 
[32] Kaleidoscope: Extending the Language: Mutable Variables — LLVM 8 documentation. URL: http://releases.llvm.org/8.0.0/docs/tutorial/LangImpl07.html (visited on 05/16/2019). [33] llvm-link - LLVM bitcode linker — LLVM 9 documentation. URL: http://llvm.org/ docs/CommandGuide/llvm-link.html (visited on 05/24/2019). [34] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. “An efficient method of computing static single assignment form”. en. In: Proceedings of the 16th ACM SIGPLAN-SIGACT symposium on Principles of programming languages - POPL ’89. Austin, Texas, United States: ACM Press, 1989, pp. 25–35. ISBN: 978-0-89791-294-5. DOI: 10. 1145/75277.75280. URL: http://portal.acm.org/citation.cfm?doid= 75277.75280 (visited on 02/12/2019). [35] Matthias Braun, Sebastian Buchwald, Sebastian Hack, Roland Leißa, Christoph Mal- lon, and Andreas Zwinkau. “Simple and Efficient Construction of Static Single Assign- ment Form”. In: Proceedings of the 22Nd International Conference on Compiler Construction. CC’13. event-place: Rome, Italy. Berlin, Heidelberg: Springer-Verlag, 2013, pp. 102–122. ISBN: 978-3-642-37050-2. DOI: 10 . 1007 / 978 - 3 - 642 - 37051 - 9 _ 6. URL: http : //dx.doi.org/10.1007/978-3-642-37051-9_6 (visited on 02/19/2019). [36] Checker Developer Manual. URL: https://clang-analyzer.llvm.org/checker_ dev_manual.html#start (visited on 03/20/2019).

48 Bibliography

[37] Yulei Sui and Jingling Xue. SVF: Interprocedural Static Value-Flow Analysis in LLVM. en. Tech. rep. Australia: School of Computer Science and Engineering, UNSW Australia. URL: https://github.com/SVF-tools/SVF. [38] Yulei Sui, Ding Ye, and Jingling Xue. “Detecting Memory Leaks Statically with Full- Sparse Value-Flow Analysis”. en. In: IEEE Transactions on Software Engineering 40.2 (Feb. 2014), pp. 107–122. ISSN: 0098-5589, 1939-3520. DOI: 10 . 1109 / TSE . 2014 . 2302311. URL: http://ieeexplore.ieee.org/document/6720116/ (visited on 03/21/2019). [39] Florent Kirchner, Nikolai Kosmatov, Virgile Prevosto, Julien Signoles, and Boris Yakobowski. “Frama-C: A software analysis perspective”. en. In: Formal Aspects of Com- puting 27.3 (May 2015), pp. 573–609. ISSN: 1433-299X. DOI: 10.1007/s00165-014- 0326- 7. URL: https://doi.org/10.1007/s00165- 014- 0326- 7 (visited on 03/06/2019). [40] Yiannis Papadopoulos, Martin Walker, David Parker, Erich Rüde, Rainer Hamann, An- dreas Uhlig, Uwe Grätz, and Rune Lien. “Engineering failure analysis and design op- timisation with HiP-HOPS”. In: Engineering Failure Analysis. The Fourth International Conference on Engineering Failure Analysis Part 1 18.2 (Mar. 2011), pp. 590–608. ISSN: 1350-6307. DOI: 10.1016/j.engfailanal.2010.09.025. URL: http://www. sciencedirect.com/science/article/pii/S1350630710001779 (visited on 04/02/2019). [41] Mojtaba Shahin, Peng Liang, and Muhammad Ali Babar. “A systematic review of soft- ware architecture visualization techniques”. en. In: Journal of Systems and Software 94 (Aug. 2014), pp. 161–185. ISSN: 01641212. DOI: 10.1016/j.jss.2014.03.071. URL: https://linkinghub.elsevier.com/retrieve/pii/S0164121214000831 (visited on 04/01/2019). [42] T. D. LaToza and B. A. Myers. “Visualizing call graphs”. In: 2011 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). Sept. 2011, pp. 117–124. DOI: 10.1109/VLHCC.2011.6070388. [43] S. Bassil and R. K. Keller. 
“Software visualization tools: survey and analysis”. In: Pro- ceedings 9th International Workshop on Program Comprehension. IWPC 2001. May 2001, pp. 7–17. DOI: 10.1109/WPC.2001.921708. [44] I. Herman, G. Melancon, and M. S. Marshall. “Graph visualization and navigation in information visualization: A survey”. In: IEEE Transactions on Visualization and Computer Graphics 6.1 (Jan. 2000), pp. 24–43. ISSN: 1077-2626. DOI: 10.1109/2945.841119. [45] C. B. Seaman. “Qualitative methods in empirical studies of software engineering”. In: IEEE Transactions on Software Engineering 25.4 (July 1999), pp. 557–572. ISSN: 0098-5589. DOI: 10.1109/32.799955. [46] Paul Anderson. “More Software Safety A Static Analysis Tools Perspective”. en. In: ATZelektronik worldwide 12.1 (Feb. 2017), pp. 16–21. ISSN: 2192-9092. DOI: 10.1007/ s38314-016-0101-z. URL: https://doi.org/10.1007/s38314-016-0101-z (visited on 04/01/2019). [47] Documentation — LLVM 9 documentation. URL: https://llvm.org/doxygen/ (vis- ited on 04/16/2019). [48] Chris Lattner and Vikram Adve. The LLVM Instruction Set and Compilation Strat- egy. Tech. Report UIUCDCS-R-2002-2292. CS Dept., Univ. of Illinois at Urbana- Champaign, Aug. 2002. URL: https : / / llvm . org / pubs / 2002 - 08 - 09 - LLVMCompilationStrategy.html (visited on 04/16/2019). [49] Arseny Kapoulkine. Light-weight, simple and fast XML parser for C++ with XPath support: zeux/pugixml. original-date: 2012-07-06T10:51:03Z. May 2019. URL: https://github. com/zeux/pugixml (visited on 05/09/2019).

49 Bibliography

[50] Tree Layout - D3 wiki. URL: https://d3-wiki.readthedocs.io/zh_CN/master/ Tree-Layout/ (visited on 04/01/2019). [51] Martin Graham and Jessie B. Kennedy. “A survey of multiple tree visualisation”. In: Information Visualization 9 (2010), pp. 235–252. DOI: 10.1057/ivs.2009.29. [52] E.M. Reingold and J.S. Tilford. “Tidier Drawings of Trees”. en. In: IEEE Transactions on Software Engineering SE-7.2 (Mar. 1981), pp. 223–228. ISSN: 0098-5589. DOI: 10.1109/ TSE.1981.234519. URL: http://ieeexplore.ieee.org/document/1702828/ (visited on 04/01/2019). [53] Sougata Mukherjea, James D. Foley, and Scott Hudson. “Visualizing complex hyperme- dia networks through multiple hierarchical views”. en. In: Proceedings of the SIGCHI con- ference on Human factors in computing systems - CHI ’95. Denver, Colorado, United States: ACM Press, 1995, pp. 331–337. ISBN: 978-0-201-84705-5. DOI: 10 . 1145 / 223904 . 223947. URL: http://portal.acm.org/citation.cfm?doid=223904.223947 (visited on 05/20/2019). [54] Googletest: Google Testing and Mocking Framework. Contribute to google/googletest develop- ment by creating an account on GitHub. original-date: 2015-07-28T15:07:53Z. Apr. 2019. URL: https://github.com/google/googletest (visited on 04/17/2019). [55] Zhulinpinyu. D3 layout tree. URL: https://codepen.io/zhulinpinyu/details/ EaZrmM (visited on 05/22/2019). [56] Markus Schwarz. SBC and CANbedded. en. Tech. Report. 2005, p. 4. URL: https:// assets.vector.com/cms/content/know-how/_application-notes/AN- ISC-1-1027_SBC_and_CANbedded.pdf. [57] Matthias Krause and Carsten Weich. Intrinsic Safety of AUTOSAR Basic Software. en. Tech. Report. 2012, p. 4. [58] Bulent Sari and Hans-Christian Reuss. “A model-driven approach for the development of safety-critical functions using modified architecture description language (ADL)”. 
In: 2016 International Conference on Electrical Systems for Aircraft, Railway, Ship Propulsion and Road Vehicles & International Transportation Electrification Conference (ESARS-ITEC). Toulouse, France: IEEE, Nov. 2016, pp. 1–5. ISBN: 978-1-5090-0814-8. DOI: 10.1109/ ESARS-ITEC.2016.7841346. URL: http://ieeexplore.ieee.org/document/ 7841346/ (visited on 05/03/2019). [59] Claes Tingvall and Narelle Haworth. “Vision Zero - An ethical approach to safety and mobility”. en. In: (), p. 14.

50