Faculty of Informatics, Masaryk University


Compiling Applications for Analysis with DIVINE

Master’s thesis

Zuzana Baranová

Brno, 2019

Declaration

Hereby I declare that this thesis is my original work, which I have created on my own. All sources and literature used in writing the thesis, as well as any quoted material, are properly cited, including full reference to its source.

Advisor: doc. RNDr. Petr Švenda, Ph.D. Consultant: RNDr. Petr Ročkai, Ph.D.

Abstract

Verification and other formal analysis techniques are often demanding tasks, both in skill and time. This is why non-critical software is seldom subjected to the same rigorous analysis as safety-critical software. However, all software would benefit from an extra level of assurance of its reliability, and there has been a prolonged effort on the side of analysis tool developers to make their use easier. Ideally, the aim is to integrate analysis techniques into the normal software development process. Among other tools, DIVINE is one such verifier whose long-term key goal is to bring verification closer to the developers of everyday software. A big step forward was direct verification of C and C++ programs. The programs are compiled into a more analysis-friendly form to be verified, notably LLVM bitcode (LLVM IR). Another big step in lowering barriers for adopting formal verification is re-using automated build tools and existing build instructions of projects, which would remove the need for manual compilation of software. The purpose of this thesis is to replace the existing compilation toolchain of DIVINE with a tool which can be transparently used in automated build systems and which produces bitcode of the whole program. We have successfully implemented such a tool, and evaluated its practicality on a number of real projects which use automated build systems. The resulting bitcode can be loaded into DIVINE and verified.

Keywords

C, C++, DIVINE, LLVM IR, ELF, compilation, formal verification, POSIX, build automation, implementation


Acknowledgements

Firstly, I would like to thank the ParaDiSe group, for welcoming me, and for the endless supply of obscure words and programming language concepts. I would also like to thank my Finnish family, for making sure I survived the long winter. Most of all, I have to thank Mornfall, for the unreasonable amount of time he dedicated to fixing my own problems, for explaining the same things to me over and over again, and for not giving up on me.

Notes

This thesis is loosely based on our papers Reproducible Execution of POSIX Programs with DiOS [35] and Compiling C and C++ Programs for Dynamic White-Box Analysis [34].


Contents

1 Introduction
  1.1 Motivation
  1.2 Goals
  1.3 LLVM IR
  1.4 DIVINE

2 Compilation Process
  2.1 Binaries
  2.2 Header files and Libraries
  2.3 Compiler Architecture
  2.4 Anatomy of an ELF file

3 Related Work
  3.1 Program Analysis
  3.2 Link-Time Optimizations
  3.3 wllvm

4 Compilation Automation
  4.1 Build Systems
  4.2 Make
  4.3 CMake
  4.4 configure script

5 DiOS
  5.1 Motivation
  5.2 Building verified with DiOS
  5.3 The DiVM
  5.4 DiOS Libraries
  5.5 Kernel
  5.6 Changes in DiOS

6 dioscc
  6.1 divcc and dioscc
  6.2 Intended Use
  6.3 Limitations
  6.4 Design
  6.5 Implementation
  6.6 Linking
  6.7 API & ABI Compatibility
  6.8 Internal Structure

7 Evaluation
  7.1 System Information
  7.2 Feasibility with Real Projects
  7.3 DiOS

8 Conclusions
  8.1 Future Work

A Archive and Manual

Bibliography

1| Introduction

The rise of information systems has brought the need to prove that the programs these systems use function correctly. This is true not only for safety-critical systems, since an undetected error in, say, a ticket reservation system can also prove very costly. In fact, anyone who writes programs wishes them to behave as intended.

To ensure the correctness of programs, various analysis tools can be used, with wide-ranging goals: from simple testing of correct behaviour, through more rigorous methods, including memory error detection (such as the tool valgrind [31]), to more formal methods – like static analysis or model checking, used for instance for parallel-program-specific analysis (checking for deadlocks, starvation, etc.). The analyses can be performed at different stages of development of the program, or even on the final executable. As a result, the tools necessarily work with different representations of the program, e.g. the source code, or machine code – processor instructions.1

Finally, some tools require that the program is first transformed (manually or automatically) into a representation that the analysis tool can work with. There are two basic approaches: the program is either described in a modelling language that can better articulate its properties (these often have to be written by hand, based on domain-specific knowledge); or the program is first translated using another tool into a language that is more analysis-friendly. Examples of the former are:

• the DVE modelling language, used to model asynchronous systems,
• the SPIN model checker and its input language (PROMELA) for system descriptions.

Depending on the type of analysis, some of the intermediate languages that are used in the second approach, i.e. those that are obtained automatically from the source language using a specialized compiler, include (the following examples assume C and C++ programs as inputs):

• GOTO programs (a type of control-flow graph representation), which are generated by goto-cc2, and which in turn serve as input for the CBMC analyser,
• the LLVM intermediate representation, which is used in a number of tools, including the symbolic executor KLEE, a static analyser called PhASAR, or the DIVINE model checker [3]. For C and C++ programs, LLVM IR can be obtained using clang, although other languages (such as Rust) can also be translated into this intermediate form.

1Some static analysis tools work directly with the source code; other tools (like valgrind) work with the resulting executable program, analysing it at runtime and making use of debug metadata. The respective stages that a program goes through and the forms a program takes are described in Chapter 2.
2More information on goto-cc can be found in Related Work (Section 3.1).

With programs written in a compiled, high-level programming language, it is desirable for the analysis to be performed as late as possible in the compilation process, to ensure that the verification result applies to the executable program. Ideally, verification should be done on the fully-compiled, executable binary, which consists of the actual instructions that will be used once the program is run. However, the binary is not portable, i.e. it can only be executed on the system it was compiled for. As such, the choice of the level of abstraction poses a trade-off between veracity and applicability of analysis (analysis of low-level code is more accurate but only applies to the particular binary, while higher-level analysis is more widely applicable, but less precise).

With this in mind, and because the analysis should cover a broad range of systems, many tools instead opt for an intermediate representation of the code as verification input. In this thesis, we will particularly work with LLVM IR3. Another reason for the abstraction of code into LLVM IR is that at the level of processor instructions, after the compiler has possibly performed many optimisations already, it is extremely difficult to reason about the original source code. It is also not trivial to map the instructions back to the source code, so that a potential problem can be traced. In practice, we have observed that the use of LLVM IR in verification is a good trade-off, common in verification tools.

Program analyses can be broadly classified into two categories: static and dynamic. The term dynamic analysis refers to a type of analysis where the program is inspected at runtime, i.e. during execution or interpretation. This is in contrast with static analysis, during which the program’s source code or the intermediate representation is analysed directly. [2] Therefore, static analysers do not need to have the entire program available to analyse it – they can analyse it partially and still get valuable results. On the other hand, dynamic analysis requires that the whole program is available to the verification tool.

A different classification is based on the availability of the source code of the analysed program. In so-called white-box analysis, the analyser needs the source code of the program to reason about its properties. This can be useful for multiple reasons: in analysis of control flow graphs, program optimizations, and, in cases where intermediate representation is analysed, to generate this representation. On the opposite end is black-box analysis, where only the program binary is available. In black-box testing, for example, the expected inputs and outputs are known to the analyser and tested for discrepancies. The tools usually do not have any knowledge of the source code and instead work with the compiled binary, passing the inputs to the program. However, analysis tools can also examine the binary itself – one of the methods is reverse engineering the binary and obtaining the bitcode in the process known as decompilation. Decompilation can be useful because some software is only available in binary form. There is a variety of tools that work in-between the two extremes, within the so-called gray-box analysis range.

3https://llvm.org/docs/LangRef.html

1.1 Motivation

In general, programmers make extensive use of automated tools, most notably for the process of transforming the program into its executable form.4 This process, called compilation or building, can comprise several sub-commands, each of which might need to be given flags and options specific to the command and the system for which the program is compiled. What is more, build automation is also an important part of other process automation efforts, such as automated testing or continuous integration. Naturally, automated build tools are used because remembering and replicating the process each time would be unnecessarily complicated. Moreover, the instructions to the build tools are usually shipped to users along with the source code, since a laborious compilation process would discourage users from trying the program.

It is likewise reasonable to assume that programmers are unwilling to go through the effort of repeating a similar process by hand when they wish to verify the program. Unfortunately, this is often required with verification tools.

For most programs, verification is not sufficiently cost-effective when testing and other analysis tools are already used. However, for safety-critical systems, much more effort is put into ensuring the program’s correct functioning in all possible scenarios. The amount of work often exceeds that spent on implementation, especially with software in domains such as aviation, spaceflight, medical devices, or software for factories that handle dangerous substances. While the stakes for non-critical software might be lower, it is still desirable for everyday industrial software to be reliable and generally usable without concerns for safety or poor availability.

Accordingly, our aim is to encourage more developers of everyday, non-critical software to verify their programs by making verification simple and intuitive, reusing their existing build systems. We target programs written in the C and C++ programming languages and use the term ‘program’ in the following text to refer to programs written in either of these languages, unless stated otherwise.

1.2 Goals

The primary goal of this thesis is to implement a tool which would:

a. behave like a compiler, i.e. be capable of imitating the different compilation stages and be usable in high-level automated compilation tools,
b. translate the program into a form suitable for analysis (LLVM IR), without interfering with the standard compilation process.

In the ideal case, the two outputs – the result of a normal compilation and the analysis-friendly form of the program – should be created simultaneously in a single pass, while meeting the criteria outlined above. We would like to take advantage of existing build automation systems and build automation processes to obtain the intermediate code. This would allow developers to also automate tasks such as verification or performance analysis, by incorporating the analyses into their continuous integration processes.

4This only applies to compiled programs.

The main motivation for the outlined goals is to make DIVINE, an LLVM-based model checker, more accessible to engineers. In its standard workflow, DIVINE first compiles the (C or C++) program into LLVM IR, and uses this representation in verification. Because we target a model checker, which analyses the program dynamically, we need the bitcode of the whole program, along with its dependencies, to be able to execute it. Apart from DIVINE, there is a large number of tools which can benefit from improving the process of obtaining LLVM bitcode for entire programs. Some examples include: the symbolic executor KLEE [6], the slicing-based bug hunting tool Symbiotic [7] or the MCP model checker [41]. Similarly, stateless model checkers like Nidhugg [25] and RCMC [21], bounded model checkers like LLBMC [37] or the LLVM-based IC3 tool VVT [18] could take advantage of the process, especially when the input program uses external libraries. The main issue with these tools is very limited support for standard library functions. Library substitutions are addressed in more detail in Section 5.4.

The intended purpose of the tool we introduce in this thesis therefore lies in dynamic white-box analysis of programs, especially model checking. We need the source code for multiple reasons: for the main part, to generate the intermediate representation; but also to report to the user where the problem occurs, if one is found.

The result of this thesis is a tool named dioscc, a drop-in replacement for C and C++ compilers, which transparently fits into existing automated build tools and processes. It is tightly coupled with DIVINE, as it links in DiOS libc5 during the linking stage of compilation. Consequently, the resulting binary can be verified by DIVINE in a straightforward way. dioscc generates intermediate and native code in a single pass, ensuring that the final executable is built from the intermediate code that is being analysed. We have also implemented a more general tool, divcc, which does not include the DiOS libc, and instead works with the system libc directly. Binaries produced by divcc do not contain LLVM IR bitcode of all library functions, unless system libraries or an external libc in the form of LLVM bitcode are provided to the build system.

1.3 LLVM IR

In this section, we provide a concise introduction to the LLVM intermediate representation, a term used frequently throughout the text. Most of the information in this section comes from the LLVM Language Reference Manual [20]. LLVM IR is a rather low-level representation of the code, comparable to an assembly language, which is one of the reasons why it is suitable for code transformations and optimizations. The IR strives to be language- and target-platform-independent, and to provide a general-purpose representation along with simple transformation rules.

5DiOS is DIVINE’s internal operating system, which, among other things, provides a standard library, called DiOS libc. DiOS is described in detail in its own chapter, Chapter 5.

LLVM IR is in the Static Single Assignment (SSA) form, meaning that every value is assigned only once. This property, demonstrated in Figure 1.1, is useful in many scenarios, including program analysis and code optimization. A number of compilers rely on an SSA-based intermediate representation for fast optimizations of the code, among others GCC [12], LLVM (the clang compiler) and Java’s just-in-time compiler from the HotSpot virtual machine [32]. As a direct result of the above, LLVM IR has explicit data-flow and control-flow.

x = 1;          x1 = 1;
y = x + 1;      y = x1 + 1;
x = 2;          x2 = 2;
z = x + 1;      z = x2 + 1;

Figure 1.1: Static Single Assignment form. [36] The code on the left shows the original code excerpt, the right column shows an equivalent code translated into SSA form. While a variable can typically be assigned to more than once throughout the program (here, variable x on lines 1 and 3), this is not allowed in SSA – a new variable is created every time. This helps in two major ways: analysers can distinguish when a variable’s value is dependent on an earlier assignment, and code generators can use this information to better manage registers.

1.3.1 Syntax and Semantics

This section describes the LLVM intermediate representation, mainly its syntactical notation. In fact, this concerns the so-called human-readable form of LLVM IR, materialized in the form of .ll files. There are two other representations: the in-memory compiler IR, where LLVM modules can be manipulated using LLVM-provided functions, and the binary on-disk representation (.bc files). [20] The three representations are otherwise equivalent. During the transformation of the program into LLVM IR, the compiler can maintain an unlimited set of virtual registers. This is in contrast with machine code, which works with a small number of registers which have to be reused – loaded from and stored back into main memory (when the working set exceeds the size of the register file). LLVM IR is partially type safe; however, type safety only applies to register manipulation, not to memory in general. The language of LLVM is designed to be easily readable; it uses a number of familiar language constructs and features, some of which we introduce in the following outline [20]:

• global symbols/identifiers are denoted with the @ sigil, while local identifiers (including register names) are prefixed with %
• reserved words – for opcodes (add, ret, etc.) and primitive types (e.g. void, i32)
• ; denotes a beginning of a single-line comment
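As a practical aside, the two on-disk forms mentioned above can be produced and converted with the standard LLVM tools; a minimal sketch (file names are placeholders):

$ clang -S -emit-llvm file.c -o file.ll   # human-readable LLVM IR (.ll)
$ clang -c -emit-llvm file.c -o file.bc   # binary bitcode (.bc)
$ llvm-as file.ll -o file.bc              # assemble the textual form into bitcode
$ llvm-dis file.bc -o file.ll             # disassemble bitcode back into the textual form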

The main unit of code in LLVM IR is the basic block. A basic block is a sequence of (non-branching) instructions, ending with a terminator (e.g. a branching instruction or a function return). Each basic block has a single entry (the first instruction) and a single exit (the terminator), and cannot contain other terminator instructions. Basic blocks belong to

int b = 12;
if (a)
    b++;
return b;

 1  entry:
 2    ; int b = 12
 3    %b = alloca i32                ; allocate a 32-bit int value on the stack, save into register 'b'
 4    store i32 12, i32* %b          ; store value 12 (a 32-bit int) to the address of register 'b'
 5
 6    ; if (a)
 7    %0 = load i32* %a              ; load a 32-bit int value from the address of register 'a',
 8                                   ; save result of load into a new register, '0'
 9    %cond = icmp ne i32 %0, 0      ; save into register 'cond' the (boolean) result of integer
10                                   ; compare of the value stored in register '0' and value 0
11    br i1 %cond, label %then, label %end   ; conditional branch based on value stored in 'cond'
12
13  then:
14    ; b++
15    %1 = load i32* %b              ; load from address of 'b' into register '1'
16    %2 = add i32 %1, 1             ; add 1 to what is now stored in register '1'
17    store i32 %2, i32* %b          ; store the result of operation 'add' back to memory (into 'b')
18    br label %end                  ; go to end
19
20  end:
21    ; return b
22    %3 = load i32* %b
23    ret i32 %3

Figure 1.2: An example code in the LLVM IR (bottom), obtained from the C code at the top. [8] The entry, then and end are so-called labels. They denote the place to jump to if the label is encountered in another statement, used e.g. in branching instructions. Labels are attached to basic blocks, i.e. a label specifies which basic block (a sequence of instructions) to continue with. The respective operations of the bitcode representation are explained using in-code comments in the code excerpt. Registers that are not explicitly named (e.g. for intermediate calculations or loads) are assigned a number from a static counter, starting at 0 (see e.g. lines 7 and 15). Also note that if (a) is equivalent to if (a != 0), int stands for the integer type and ne means “not equal”.

functions (functions are composed of basic blocks), are labelled and can be jumped to. An example computation in the LLVM IR language is demonstrated in Figure 1.2.

1.4 DIVINE

Our tool was developed for the DIVINE model checker [3] as its main use-case (the name dioscc comes from the use of DiOS’s headers and libraries). However, any tool which works with LLVM IR can potentially make use of dioscc.6 DIVINE is an explicit-state model checker, whose main goal is verification of real-world C and C++ programs. It supports a large portion of the standard C library and C++ programs

6We have succeeded in evaluating some programs that were compiled with dioscc in other tools, notably KLEE. However, this requires a patched version of KLEE (see Appendix A) and the evaluation is still in the early stages.

up to the C++17 standard (via DiOS). As mentioned before, programs are not verified at the level of source code, but are rather compiled into an intermediate form – LLVM IR. At the time work on this thesis started, programs had to be compiled into LLVM IR manually to be used in DIVINE. With dioscc, we aim to automate the compilation process, so that verifying a program requires no more effort from the programmer than compiling it into its executable form. We achieve this by extending an existing compiler toolchain with means to produce hybrid binaries, which can be both executed on the native system, with no change in behaviour, and verified using DIVINE.
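A sketch of the intended workflow follows; the exact invocation is described in Chapter 6 and Appendix A, and we assume here that DIVINE accepts the hybrid binary directly (program.c is a placeholder):

$ dioscc -o program program.c   # native executable with embedded LLVM bitcode
$ ./program                     # runs natively, with unchanged behaviour
$ divine verify program         # DIVINE loads the embedded bitcode (assumed; otherwise
                                # the bitcode can be extracted and passed instead)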


2| Compilation Process

This chapter serves to outline the process of transforming text files with human-readable program code in a high-level programming language (called source files) into executable files1 (that can be run) and clarify what components are involved in this process. As we already mentioned in the introduction, the process is known as compilation or building. During building, individual source files are first compiled2 into object code (a form of machine code which is suitable for linking). One source file, together with the header files it includes, then forms one compilation unit. Linking is the process of combining multiple such compilation units to form a single program. [34] A simplified overview of compilation is shown in Figure 2.1.

unit 1 source + headers --(compiler)--> object code
unit 2 source + headers --(compiler)--> object code
object code + libraries --(linker)--> executable

Figure 2.1: A high-level overview of compilation/building, showing the intermediate files and respective inputs and outputs of the two stages. The two main stages are compilation and linking. Figure is from [34].

2.1 Binaries

An executable file is composed of instructions specific to the architecture of the system the program will be executed on, static data of the program, and various metadata. Such a file thus consists of machine code – processor instructions for a given instruction set – that is, the program is in a form understood directly by the processor. However, programs are seldom written by humans in this low-level form of individual processor instructions: it would be very difficult to write and maintain such code. Instead, programmers use high-level programming languages, such as Java or C++, designed for practical programming and ease of use.

1In general use, the term binary file refers to any file that is not plain text, which is to say, one that cannot be directly edited using a text editor. This could mean images, files produced by word processors, MP3s, or object code, to name a few examples. In this thesis, however, we will use the term binary file to refer to executable files – those that contain machine code. We will use the two terms interchangeably. 2One of the stages of compilation is, unfortunately, also called compilation, see section 2.3 for the explanation of terminology and for the description of the different stages of compilation.

There are several components needed to produce a working executable from source code written in C and C++ (which are our main concern). These can be broadly classified into the following:

• source files, each representing a single compilation unit,
• accompanying header files,
• libraries3 (shared, static, system and third-party).

2.2 Header files and Libraries

First, let us look at the distinction between the three kinds of files, most notably the difference between a header and a library. The split between header files, libraries and the idea of linking exists mainly to support what is known as separate compilation. Source files (.c and .cpp, for C and C++, respectively), along with the header files they include, each form an individual compilation unit. Related compilation units can be organized into bigger logical units, which we will refer to as modules. They are parts of a program focusing on a specific task or concept, and related functionality. Take, for example, a module providing functions for file manipulation (reading, writing into, etc.). Typically, a module is composed of one or several header files (.h{pp}) and possibly multiple source files (.c{pp}). While the term module can be useful in describing the logical structure of a program, it does not have a 1:1 mapping to source files, libraries or other well-defined entities. On the other hand, libraries are more easily defined. A (static) library is a collection of object files that resulted from the compilation of (usually a number of) source files/compilation units. A single library may contain multiple modules (as is the case of the standard C library), or it can correspond to a single logical module, like the standard math library libm.

2.2.1 Header Files

A header file contains declarations of functions and effectively specifies the functions that a library – which contains the compiled source files with the definitions of those functions – makes available to the user, and the interface that the functions adhere to. That is, a header describes the API of the library it belongs to. In case of C++, (public) classes specified within the header files are also part of the interface (the C language has no concept of classes). A single header file typically identifies one module (as in one logical unit; take the stdio.h header as an example), and after including the header file (using the #include [header] preprocessor directive), the program can use the functions declared in the header. In Figure 2.2, we provide some example declarations that could be part of a file manipulation module (the declarations are taken from C’s stdio.h). The declarations state the interface of functions: for instance, the fopen function takes two arguments, the filename and the mode to open the file in (reading only, append, etc.), both of type const char *, and returns a pointer to a FILE, which is an opaque (internal) structure of the module.

3Note that libraries are also built from source files; however, the source files may be unavailable – typically only object files are released.

FILE *fopen(const char *filename, const char *mode);
int fgetc(FILE *stream);
int fputc(int c, FILE *stream);
size_t fwrite(const void *ptr, size_t size, size_t nmemb, FILE *stream);
size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream);

Figure 2.2: Declarations in the header file (.h). The function names are denoted in green.

#include "stdio.h"

FILE *fopen(const char *filename, const char *mode) { FILE *fp; int f; int flags, oflags;

if ((flags = __sflags(mode, &oflags)) == 0) return (NULL); if ((fp = __sfp()) == NULL) return (NULL);

[ ... ]

/* Concrete steps taken: check mode, check that file was specified, that it exists, is not locked, user has rights to access it, possibly set errno, etc.*/

return fp; }

Figure 2.3: Definition of fopen in the source file (.c).

In addition, the header file specifies this FILE type and the mode options for accessing a file, among other things. The source file (or several) then holds the definitions of the functions mentioned in the header file. A definition consists of the exact steps taken to open a file – the logic behind the function.

2.2.2 Libraries

The definitions can all be part of a single source file, or, as is customary in the standard libraries, the functions can each have their own .c file (fopen.c and so on). The headers and source files are then compiled (using the cc command with the -c flag, see section 2.3) into object files (.o extension). As was said, the source files each form an individual compilation unit, which, when compiled, can be put together to form a library. At linking, only the compilation units which contain the necessary definitions of the functions actually used in the program are linked in. A library can be user-defined or come from the system (a so-called system or standard library).

                        static                          dynamic
linking                 compile-time                    run-time (loading)
execution overhead      minimal                         symbols have to be looked up
changes visible         needs to be re-compiled         immediately
memory                  each program has own copy       only one copy of the library exists
disadvantages           takes more space                single point of failure,
                        (bloats the program)            possible compatibility issues
suffix                  .a, .lib                        .so, .dll

Table 2.1: An overview of properties of static and dynamic libraries. The changes visible property denotes that in the case of a static library, if a library is updated, the program needs to be re-compiled (linked with the updated version of the library), while changes in dynamic libraries are immediately visible and there is no need to make any changes to the program. Regarding single point of failure as a disadvantage in dynamic libraries: because all programs that use a certain library have the same version, which is linked at run-time, if the library gets broken or removed, all the programs become unusable. On the other hand, because each program carries its own version of a static library, moving a static library or causing it to be faulty has no effect on already-compiled programs which use it.

Functionality of the library can be used in source files (of other libraries and programs), which is achieved by linking the library with the program. Libraries allow their users to take advantage of existing functions, instead of having to re-invent identical functionality themselves. Moreover, the use of libraries prevents code duplication and thus reduces the chances that errors will be introduced. System libraries often follow standards and are used by a multitude of users, so they are unlikely to contain severe bugs. On POSIX systems, a (static) library is simply a collection (an archive) of object files, obtained by compiling respective source files that are part of the library. A header, on the other hand, is a text file which only contains the declarations (prototypes) of global variables, functions and types. It specifies the names and types of arguments of the available functions, and is expanded in the place of its use (the #include directive is substituted for the contents of the header file), to make the functions available to the program. [14] A library is where the implementation of the functions is located and has to be linked with the program. If the interface of functions in a header file matches that of the definitions in a library, the header file can be used with different libraries. This is closely related to API and ABI compatibility and is discussed in Sections 6.7.1 and 6.7.2.

2.2.3 Static and Dynamic Libraries

The most important distinction is related to the process of linking – libraries can be linked either statically or dynamically. [24] Static libraries become part of the program; the object code of the used library functions is copied in by the linker. This means that the function definitions are fixed at compilation time, when the library was linked with the program to form an executable. A static library on POSIX systems has the .a suffix (as in archive) and can be created from a set of object files using the ar4 command.

4https://linux.die.net/man/1/ar

By contrast, dynamic libraries, also called shared libraries, stay separate from the program; they are external entities that can be shared by multiple running programs simultaneously. Only a reference to a function defined in a shared library is kept in the program and the reference is resolved at runtime. Dynamic libraries typically have the .dll (dynamically linked library) or .so (shared object) suffix. When a shared library is replaced (e.g. during a system update), the next time a program that uses the library is executed, it will transparently use the new definitions of symbols defined by that library. As long as the updated version is binary compatible5 with the former version, no change to the program is required, whereas any modification of a static library must be followed by a re-compilation of the program for the changes in the library to take effect. Dynamic linking requires that the user has a compatible version of the dynamic library on their system. Additionally, dynamic libraries may have a negative impact on (execution and program start-up) speed, as references have to be looked up at runtime: even though the lookup is only performed at start-up of the program (or at most once during runtime), dynamic linking may add an extra level of indirection to function calls. There can be both versions (static and dynamic) of a given library on the system, for instance both libm.a and libm-2.29.so can be present. It depends on the build instructions which library is used. By default, both gcc and clang link dynamically (search for the dynamic version of a library first), unless this is overridden by the build system6 or the -static option is passed to the compiler (or linker directly). Lastly, mixed linking is possible, i.e. both static and dynamic libraries can be linked with a program at the same time.7 Table 2.1 gives an overview of the properties of static and dynamic libraries. An example of creating and using a custom library can be found in section 2.3.7.

2.2.4 Position Independent Code

In this section, we will briefly describe position independent code and how it relates to shared objects. Dynamic linking consists of two separate linking steps: the standard (build-time) linker checks references and emits metadata for use by the runtime linker. The runtime linker then performs symbol resolution in a fashion similar to how the build-time linker treats static libraries. With dynamic libraries, this step is performed either at program start-up, or incrementally during runtime. There are two ways of loading shared libraries into the address space of the program:

1. Load-time relocation
   When the program is linked, the addresses of functions and symbols from shared objects are not final; the loader can move the objects at the start of the program execution and update the relocation table with the actual addresses of the functions. The program then works with the correct addresses of functions, assigned at load-time. Most overhead of the first approach therefore happens at the start of a program, when the relocations are performed and addresses updated.

2. Position-independent code
   Position-independent code is code that is not tied to a particular address in memory and which can be loaded into a number of processes, without a need to be relocated (its physical address can be mapped to a number of virtual addresses, to be used in multiple processes without overlapping with other shared libraries). Position-independent code mostly makes use of relative addresses for locations of functions (or contains a small jump table). All constant addresses are accessed through a global offset table (GOT). [17]

5For details on binary compatibility, see section 6.7.2.
6Dynamic linking can be, for example, disabled at configure time in autotools-based projects by using the --disable-shared flag (see Section 7.2).
7Mixed linking concerns different libraries, i.e. any given library is either linked statically or dynamically. It is theoretically possible to have the same library linked to the same program both statically and dynamically, but this is not very useful.

Instructing the compiler to emit position-independent code can be done using the -fPIC option. More information on the two approaches can be found in [4] and [5].
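To make the distinction concrete, a shared counterpart of a static library can be built roughly as follows (a sketch using the libplus example from Section 2.3.7; exact flags may differ between toolchains):

$ cc -fPIC -c plus.c               # compile as position-independent code
$ cc -shared -o libplus.so plus.o  # produce a shared object
$ cc main.c -L. -lplus             # the dynamic version is preferred by default
$ LD_LIBRARY_PATH=. ./a.out        # help the runtime linker find libplus.so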

2.2.5 System Libraries

System libraries are, simply put, libraries that are part of the operating system, such as libc, the standard C library. They contain the most essential functions and the user does not normally need to link them explicitly; the compiler can find the system libraries by their name, and some of them (like libc) are linked into all programs by default. One of the basic roles of system libraries is to provide a way for the user programs to communicate with the kernel of the operating system they are part of (including filesystem manipulation, process creation and management or memory allocation, for example). One reason for abstracting kernel functions is to offer a standardised interface to the user, such as the interface dictated by POSIX. This makes user programs portable to all operating systems that adhere to a given interface. On UNIX-like operating systems, the system libraries are typically located in /usr/lib. DiOS (see Chapter 5), DIVINE’s internal operating system, supplies libc, libc++, libc++abi, libm, libpthread and librst, the latter being a DiOS-specific library, which provides runtime support for some of DIVINE’s features based on code transformations.
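To illustrate the default linking of system libraries mentioned above: on a typical GNU/Linux system libc is linked in implicitly, while others, such as the math library libm, still have to be named (file names here are placeholders):

$ cc hello.c          # libc is linked automatically
$ cc compute.c -lm    # the math library must be requested explicitly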

2.3 Compiler Architecture

In order to transform a program from source code written in a high-level programming language into a binary executable file, it has to be processed by a series of sub-programs. The overall process is known as compilation. Some of the information on the stages is collected from [11] and from GCC’s online documentation8. For C (and C++) programs, compilation typically comprises 4 core phases: preprocessing, compilation, assembly and linking (out of which preprocessing and assembly were folded into compilation in the high-level overview). Each phase uses the output of the immediately preceding phase as its input and relies on the input to be in a specific format. The following sections (2.3.1–2.3.4) detail the respective parts, while figure 2.4 gives an overview of the

8https://gcc.gnu.org/onlinedocs/

whole process, the sub-processes, their inputs/outputs, and how the stages build on each other.

Compilation in itself means invoking a program called a ‘driver’ (as it drives the compilation), which can be called using the cc command on a file we wish to compile. The driver then gradually spawns a sub-process for each of the individual stages of compilation. These can also be called manually with their respective commands: cpp (C preprocessor) for preprocessing, ld (as in loader) for linking, or as for the assembler. [38] The driver decides which commands to invoke and in what order. The following summary describes the responsibilities of the sub-programs [11]:

1. The preprocessor reads the input source file and any header files it may refer to (using #include directives) and produces a single self-contained source file.
2. The compiler proper is responsible for translating the preprocessed file into assembly instructions, specific to the architecture of the target processor.
3. The assembler then translates the assembly instructions into object code. These are actual instructions for the processor.
4. Finally, the linker resolves the references to library functions and inter-object references.

In clang (and gcc), the following flags are available to instruct the driver to only perform certain phases of the compilation process9:

• -E stop after preprocessing
• -S stop after compilation, do not assemble
• -c assemble, do not link

The output of preprocessing (running gcc or clang with -E) is written to the standard output by default. We can also run the preprocessor directly using “cpp infile > outfile” or “cpp infile -o outfile”. For compilation, running

$ cc -c file.c

creates an object file file.o in the working directory, and

$ cc -S file.c

gives us assembly instructions, in a file called file.s. In the examples, cc denotes the default system compiler and can be substituted for clang or gcc. The compilation process can then be done manually, e.g. we can call the as program on an assembly file .s to get an object file, which can then be linked by calling a linker explicitly. Furthermore, the compiler decides based on the file extensions of input files which stages to perform on them – for example, if file.i10 is encountered (which denotes a preprocessed file), the preprocessing phase is skipped; analogously, object files are fed straight to the linker. The respective stages are described in more detail in the following subsections.

9https://clang.llvm.org/docs/CommandGuide/clang.html 10or file.ii in case of C++
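Putting the stage-selection flags together, the whole pipeline can also be driven by hand, one stage at a time (a sketch; cc again stands for the system compiler):

$ cpp file.c -o file.i     # preprocess
$ cc -S file.i -o file.s   # compilation proper: emit assembly
$ as file.s -o file.o      # assemble into object code
$ cc file.o -o file        # let the driver invoke the linker with the right arguments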

source code (file.c) + headers (header.h)
        |
        |  cpp (preprocessor)
        v
preprocessed file (file.i)
        |
        |  cc -S (compiler)
        v
assembly instructions (file.s)
        |
        |  as (assembler)
        v
object code (file.o) + libraries (lib.a, lib.so)
        |
        |  ld (linker)
        v
executable (a.out)

Figure 2.4: A typical compiler architecture, showing the 4 main phases (preprocessing, compilation, assembly and linking). The whole compilation process is managed by a compiler driver, which invokes the sub-programs. The commands used to invoke the sub-programs are given alongside them. The figure further gives the types of inputs and outputs. There is no explicit command for the phase of compilation proper, as this is the main job of the compiler.

2.3.1 Preprocessing

In the first phase, a text-based transformation is done on the program, as a preparation for the later stages.11 This includes:

• the program is stripped of comments: In C and C++ a comment means any text after a double slash //, up to a newline; or between the special symbols /* and */ for multiline comments.
• split lines are joined: For better readability, long lines can be split using a backslash (\); these are joined back at the preprocessing stage.
• macros are expanded: The preprocessor treats all lines starting with a hash sign (#) as commands, so-called preprocessor directives. There are several kinds of commands available: the already-mentioned #include for including header files; #define for creating a macro – an identifier that gets replaced with its textual definition whenever it is encountered in code during preprocessing12 (macro expansion), e.g. #define MAX_THREADS 5; and #undef for removing the macro definition from that point on. Header files can also contain macro definitions.
• conditionals are evaluated: Some of the preprocessor directives (notably #if, #else, #elif, #ifdef, #ifndef and #endif) specify whether to include a block of code based on a condition. They can also evaluate whether a certain macro is defined (for this, the #ifdef and #ifndef directives are used, but these can also be written as #if defined(MACRO)). These directives control which parts of the code should be skipped because they are not relevant for a given compilation.

11Most information on the C preprocessor in this section comes from GCC’s online manual [13].

There are other special features of the preprocessor, such as stringification of tokens, or digraphs and trigraphs. We will, however, not go into further detail here. The input of preprocessing is the source code (.c, .cpp files) and header files (.h, .hpp); the output is a preprocessed file.
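A small demonstration of these transformations (the file name and macro are made up for the example; -P merely suppresses line markers, and the output is abridged):

$ cat version.c
#define VERSION 2
/* a comment that will be stripped */
#if VERSION >= 2
const char *version = "new";
#else
const char *version = "old";
#endif
$ cpp -P version.c
const char *version = "new";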

2.3.2 Compilation Proper

The second part of compilation is also called compilation, so it will be referred to as compilation proper in the text. The input to compilation proper is the preprocessed source code (with the .i extension, which is either a temporary file that does not persist after compilation, or is not written at all). Among the responsibilities of this phase are syntactic and semantic analysis (checking for any syntax errors), translating the source code into assembly and optionally performing optimizations. There are three main components in compilation proper (in so-called three-stage compilation): frontend, middle-end and backend. Their responsibilities are as follows [34]:

1. the frontend parses and analyses (semantic analysis, type checking, etc.) the source file produced by the preprocessor and generates an intermediate representation out of it,
2. the middle end performs transformations (mainly optimization) on the intermediate format13, generating a new version thereof,
3. the backend, or code generator, translates the optimized intermediate representation into target-specific assembly.

Figure 2.5 shows how the compilation proper stages fit into the whole process of compilation. The intermediate representation can have different forms in different compilers and a single compiler can support several intermediate representations: in the case of clang, it is LLVM IR (as clang is developed by the LLVM Project), however, for instance gcc works with its own kind of intermediate representation, called GIMPLE, but can also use LLVM IR. Another compiler, called LCC or “Little C Compiler” works with data flow graphs (DFGs) as its intermediate representation [26]. The data flow graphs are, however, machine-dependent.

12Macros can also have the form of functions and take arguments, but these are still just textually replaced and the formal arguments are substituted with the actual arguments passed to the function macro.
13The optimizations performed by the middle end are (processor-)architecture-independent, as they work directly with the IR.

source + headers --(preprocessor)--> source --(frontend)--> IR --(middle end)--> IR --(codegen)--> object code
object code + libraries --(linker)--> executable

Figure 2.5: Picture and caption taken from [34]: The architecture of a compiler including the compilation proper stages. The rounded boxes represent compiler components, the squares represent data. Dashed boxes only exist as internal objects within the compiler and will not be written into files unless requested by the user or by the build system. Out of the dashed boxes in the picture, typically only object code is written into a file.
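The three stages can also be observed separately with the LLVM toolchain, which exposes each of them as a command of its own (a sketch; opt’s option syntax differs between LLVM versions):

$ clang -S -emit-llvm file.c -o file.ll   # frontend: C -> LLVM IR
$ opt -O2 -S file.ll -o file.opt.ll       # middle end: IR-to-IR optimisation
$ llc file.opt.ll -o file.s               # backend: IR -> target-specific assembly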

2.3.3 Assembly

The assembler takes the assembly code and translates it to object code, i.e. low-level relocatable machine code. Only the code that is part of the unit is translated; the library functions, for instance those which come from the system libraries, are expected to be in the form of object code when they are linked with the program. Besides generating binary encoding of individual instructions, an important responsibility of the assembler is assignment of numeric addresses to functions, labels and static data. The output of the assembler is a file which contains object code (file.o). In many compilers, the boundary between the backend component of compilation proper and the assembler is blurred.

2.3.4 Linking

While the linker is typically a separate program, it is often the responsibility of the compiler driver to invoke it, with correct arguments. It decides on the object files and libraries to be linked (in the correct order), including system-specific files, like crt0.o, and system libraries (such as libc). During the linking stage, the object files are linked together to produce an executable. The linker resolves references to library functions and performs relocations. The complexity of linking and the reason why linker invocation is usually delegated to the compiler driver is demonstrated in section 2.3.7, which describes the process of creating a library and of linking it with a program.

An important consideration is the mechanics of archive linking: unlike shared libraries, which are indivisible and linked into each program in their entirety, or not at all, static libraries retain individual object files. With archives, the normal behaviour of the linker is to only include object files that provide symbols needed by the files that precede the library in the linker command (which is why order is important). This can affect program behaviour, because, unlike with shared libraries, global constructors defined in object files that are not referenced by the program (directly or indirectly) will not run. [34] This is why we need to replicate the behaviour of native archive linking in the bitcode linker.
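The member-by-member behaviour of archives can be observed directly (names follow the libplus example of Section 2.3.7; with GNU ld the second link typically fails, because the archive is scanned before any symbol from main.o is undefined):

$ ar t libplus.a          # list the members of the archive
plus.o
$ cc main.o libplus.a     # works: plus.o is pulled in to resolve 'plus'
$ cc libplus.a main.o     # typically fails with an undefined reference to 'plus'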

18 2.3.5 Shared Objects

A shared object file holds code and data suitable for linking in two contexts. First, the build-time linker (ld) may process it with other relocatable and shared object files to create another object file. Second, the dynamic linker (ld.so) combines it with an executable file and other shared objects to create a process image. [9] In other words, ld is a linker used at compile-time, whether for static or dynamic libraries, while ld.so is a dynamic linker, used at load-time of a program (when the program is run). The ld.so program finds and loads the shared objects (shared libraries) needed by a program, prepares the program to run, and then runs it. [23] If an absolute path to a shared object is not specified, it is searched for in: the DT_RPATH dynamic section attribute of the binary (if DT_RUNPATH attribute does not exist); then the LD_LIBRARY_PATH environment variable; followed by DT_RUNPATH. Consequently, the dynamically linked and loaded shared objects may differ (see also Section 7.2.3).
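Which shared objects the runtime linker would actually load for a given binary can be inspected with ldd (prog is a placeholder; output abridged and system-dependent):

$ ldd ./prog
        libc.so.6 => /usr/lib/libc.so.6 (0x...)
        /lib64/ld-linux-x86-64.so.2 (0x...)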

2.3.6 General Compiler Options

The output of the entire compilation process is a single binary. The default name for this binary is, for historical reasons, a.out, unless the -o option is passed, which allows the user to specify what the binary should be named:

$ cc -o prog prog.c

produces a binary called prog in the current working directory.

Some other frequently used options include:

• The options to suppress or request warnings, e.g. -Wall (enable all warnings) or -Werror (treat warnings as errors).

• Enabling or disabling optimizations with -O*, where * stands for level or type of optimization: -O0 for disabled optimizations, 1 to 3 for different levels of optimizations or e.g. -Os to optimize for size.

Compilation options are passed on to the respective sub-commands that they are relevant for. For instance, -static is passed to the linker.
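A typical invocation therefore combines several of these options in a single driver call (names are placeholders):

$ cc -Wall -Werror -O2 -o prog prog.c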

2.3.7 Example: Building a Library

Let us consider a simple example, one that does not use any third-party libraries. Typically, the entry point of the program is the main function, defined in a .c or .cpp file (such as main.c, see Figure 2.6). This is a source file. A source file can also contain other functions, classes, and symbols. If we want to reuse the plus function in other source files, we can put it into a common library. To create a static library, we first have to put the plus function into a separate file. Ideally, we put the declaration into a header file (like plus.h) and the definition into a source file, plus.c. We then need to compile the plus.c file into an object file, using

int plus( int a, int b )
{
    return a + b;
}

int main()
{
    int num = plus( 2, 5 );
}

Figure 2.6: A source file: main.c.

#include "plus.h" #include "plus.h"

int plus(int a, int b) int main() int plus(int a, int b); { { return a + b; int num = plus(2, 5); } }

plus.h plus.c main.c

Figure 2.7: Splitting main.c by moving the plus function into a standalone header containing the declaration (plus.h) and a source file with the definition (plus.c); what is left of main.c is shown last. Note that both main.c and plus.c have to include the header file, so that each compilation unit is aware of the so-called function signature – the return value and the number and types of arguments – and it can be checked against the function used in the source files. Using quotation marks instead of angle brackets to include the header file means that the current directory will be searched first. If the header file is not found there, the normal include path will be searched, i.e. headers which belong to system libraries or headers which are installed in locations given by the -I switch to the compiler/preprocessor.

$ cc -c plus.c

which gives us a plus.o file. The object file is needed for inclusion in an archive. We can then run the ar command as follows:

$ ar rcs libplus.a plus.o

to get a static library called libplus. With rcs we specify that we wish to create an archive (c), replace the object file if it already exists in an archive (r) and create an index on the symbols (s). If we then run

$ nm libplus.a

the output is:

plus.o:
0000000000000000 T plus

The nm command lists the symbols from object files. It states that our library has object file plus.o, in which the plus symbol is defined (T means that the symbol is defined in the text

section of the file, see section 2.4 for details on object file sections). For us, the relevant part is that it contains the plus function. If we try to compile main.c now, with

$ cc main.c

the linker will complain that plus is undefined:

/usr/bin/ld: /tmp/cc2I2fwZ.o: in function `main':
main.c:(.text+0x13): undefined reference to `plus'
collect2: error: ld returned 1 exit status

To provide a definition of the function, we can either compile both files simultaneously, with

$ cc main.c plus.c

or we can link in our library, libplus, which contains the plus function

$ cc main.c -lplus -L.

-lplus means link in a library called plus or libplus, while the -L option specifies a path where to look for libraries – here we passed the current directory (“.”). It is important to pass the library after the program, as the linker will then use it to resolve any symbols that are still undefined. The archive can also be passed to the compiler directly (using a relative or absolute path) and the compiler driver will call the linker with the correct arguments:

$ cc main.o libplus.a

The last option is to call the linker directly:

$ ld main.o libplus.a

However, this is more tricky; we will learn that a symbol called _start, which is the default entry symbol, is missing:

ld: warning: cannot find entry symbol _start; defaulting to 0000000000401000

The reason is that main is not the actual entry point of a program. It is a symbol called _start, which comes from the system. [15] The actual command that produces a working executable is the following (note that it is specific to the gcc compiler):

ld -static -o main -L/usr/lib/gcc/x86_64-pc-linux-gnu/8.3.0/ \
   /usr/lib/crt1.o /usr/lib/crti.o main.o /usr/lib/crtn.o \
   --start-group -lc -lgcc -lgcc_eh --end-group libplus.a

Since libgcc and libgcc_eh have a circular dependency, they need to be enclosed within --start-group and --end-group. [15] The first path can be obtained when the -print-file-name option is passed to gcc with no arguments.14 This example serves to demonstrate that it is better to use the compiler for linking, as it invokes the linker with the necessary arguments for us.

14-L`gcc -print-file-name=`

2.4 Anatomy of an ELF file

On UNIX systems, the standard format for storing machine code (i.e. the binary code understood by the CPU) is ELF, short for Executable and Linkable Format. [34] It is the file representation used during various stages of compilation. The first intermediate file that is in the ELF format is an object file, generated by the compiler proper. At this point, the code is relocatable, meaning the routines and static data stored in the file have not been assigned their final addresses. [34] A collection of such object files can be used to form a static library (or archive). A relocatable object file holds code and data suitable for linking with other object files to create an executable or a shared object file. [9] This final step is performed by another program, a linker (often known as ld) and consists of resolving cross-references and performing relocations. The final build product, that is an executable or a shared object, is likewise stored in the ELF format. ELF is standardised and was designed to be portable (it is not bound to any particular processor or instruction set architecture). Data and code in an ELF file is stored in multiple sections. Apart from instructions and data, sections store the symbol table, relocation information, and so on. [9] There are several well-known sections that have special roles in ELF files, the most important of which are:

- .text – contains instructions (machine code) of the program,
- .data and .rodata – constant-initialized data, e.g. string literals,
- .bss – statically-allocated data that are uninitialized/zero-initialized (the zeroes are not stored in the file),
- .interp – stores the path to a dynamic linker/loader,
- other reserved sections include .init or .debug.

As previously mentioned, we store the bitcode in a section named .llvmbc, the same name the LTO subsystem uses. Because we take advantage of existing LLVM infrastructure, some of the tools already recognize this section as the one containing LLVM bitcode, which allows us to re-use these tools with minimal changes. The name .llvmbc is the closest there is to a ‘standard’ way to embed bitcode in object files. [34] A high-level outline of the ELF format is shown in Figure 2.8. While section names prefixed with a dot are typically reserved for the system, this is not enforced and new sections of any name can be created. It is up to the developer adding the sections to avoid potential clashes with existing ones. Therefore, operating systems or compiler toolchains can create or recognize additional sections with the semantics of their choosing. [34]
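As an illustration, the section headers of an ELF file, including a .llvmbc section if one was embedded, can be listed with readelf; a sketch, assuming an executable named main built with our toolchain:

$ readelf -S main                 # lists all section headers
$ readelf -S main | grep llvmbc   # shows the embedded bitcode section, if present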


Figure 2.8: Architecture of ELF files, which are composed of a header and sections. Here, the most common sections are shown, along with our added section, .llvmbc, pictured in red. We use it to store the LLVM bitcode representation.


3| Related Work

Because dioscc was designed to serve as a substitute for the standard C and C++ compilers, and because it has been built atop clang, it is closely related to these compilers, especially clang. On the other hand, its intended use is to aid with (dynamic) analysis of programs, particularly in combination with automated compilation. A number of tools exist that attempt to make program analysis more accessible by automating the process, often enlisting the help of the build system that is already in place. We introduce a few of these tools in this chapter, focusing on tools working with LLVM as the program representation. Notably, the tool closest to our approach and aim is called wllvm. In this chapter, we will compare wllvm to divcc1, describe their methods of operation and outline their differences, with a focus on their strengths and weaknesses.

3.1 Program Analysis

One noteworthy example of automating program analysis is scan-build [1]. It is part of the Clang project and acts as a wrapper for both clang and a static analyser, in this case clang-analyzer. In other words, it overrides the compiler that would be used to build the project and, in addition to compiling the files, it also executes clang-analyzer on the built files. scan-build can work both with individual files and with make, an automated build tool, simply by running scan-build make instead of running make directly. Accordingly, scan-build is quite simple in design. Additionally, because it performs static analysis, it does not require the whole program to be available. In fact, the default behaviour of scan-build is to exclude header files from the analysis; the user needs to explicitly pass a flag to include them.

Another related tool is goto-cc [28], although it only partially satisfies the requirements that we set as our objective. Instead of LLVM IR, the output of compilation is a GOTO-binary, a control flow graph representation of the program. GOTO programs can be verified using CBMC2 and a few other tools. The similarity with our approach further comes from the fact that goto-cc was designed to be compatible with gcc, and as such can be used in its place in build systems. But because it deviates from the standard compiler behaviour, it cannot be used in places with more complex build instructions, which might invoke external tools or execute intermediate helper programs that were compiled as part of the build process.
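Both tools are meant to slot into an existing make-based build with minimal changes; roughly (a sketch, exact invocations depend on the project):

$ scan-build make        # run the Clang static analyser alongside a normal make build
$ make CC=goto-cc        # build GOTO-binaries for CBMC instead of native object files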

1Why divcc and not dioscc will become clear in the given section (3.3).
2https://www.cprover.org/cbmc/

3.2 Link-Time Optimizations

The approach taken by dioscc is similar to the way clang handles link-time optimization (LTO). The two are related in several ways:

1. In order to perform optimizations efficiently, the optimizer needs to have the entire program available. The ability to examine the whole program at once often helps with function inlining, which can further help with dead code elimination, or optimizations such as constant propagation.
2. The optimizations are carried out at the level of LLVM bitcode, for similar reasons why we choose to work with this particular representation of the code.
3. The intermediate representation is stored in an ELF section, alongside the respective object files and libraries, analogously to our approach.
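For comparison, this is roughly how clang-based LTO is driven (a sketch; -flto makes the compiler emit LLVM bitcode into the object files, and the final link step, provided the system linker has LTO support, then optimizes across the whole program):

$ clang -flto -c main.c -o main.o
$ clang -flto -c plus.c -o plus.o
$ clang -flto main.o plus.o -o main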

3.3 wllvm

A tool with goals very similar to ours – to obtain the LLVM bitcode of an entire program (or a library), while transparently reusing existing build systems – is wllvm [33] (whole-program LLVM). The wllvm tool is more similar to divcc than to dioscc in its mode of operation – it does not have the bitcode definitions of library functions available, as these cannot be generated from object files (we would need the original source files). The resulting bitcode produced by wllvm and divcc should therefore be equivalent, while bitcode produced by dioscc includes library functions provided by DiOS. With divcc (and dioscc), we only do a single compilation pass, meaning there are no repeated front-end and middle-end invocations, which reduces the overhead introduced by the tool into the build process.

Unlike divcc and dioscc, wllvm is a comparatively simple Python-based wrapper for the compiler, which forwards the normal compilation instructions but also instructs the compiler to produce LLVM bitcode. That is, wllvm works in two passes: apart from the standard compilation, it does a second pass where it generates LLVM bitcode for each object file. However, this bitcode is stored externally (in hidden files) alongside the object files and a section is inserted into each object file, keeping track of the absolute path to the corresponding bitcode file.

From the project page [33]: “When object files are linked together, the contents of the dedicated sections are concatenated ([to preserve] the locations of any of the constituent bitcode files).” The wllvm project provides a utility (called extract-bc) to extract the bitcode once an executable or a library is built. It reads the paths stored in the special section of the final ELF file and links the bitcode into a single whole-program bitcode file.
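For illustration, a typical wllvm session looks roughly like this (a sketch; wllvm selects the underlying compiler through the LLVM_COMPILER environment variable):

$ export LLVM_COMPILER=clang
$ ./configure CC=wllvm
$ make
$ extract-bc program      # links the referenced bitcode files into program.bc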

There are several downsides to the approach taken by wllvm:[34]

• The extra bitcode files might be problematic during the configuration phase of the build process. The configuration tool does not expect these extra files to be created during compilation and they might conflict with system checks. For instance, we encountered this problem while building the Berkeley DB project, which failed to configure out-of-the-box.3

• Even after installation into a target location, the build products refer to files present in the original build directory, which therefore cannot be cleaned up as would be usual, making builds of systems which consist of multiple independent packages more difficult and error-prone.

The wllvm project provides workarounds for both of the problems. However, with the use of absolute paths, the tool would still be difficult to integrate into a typical software development workflow, which often operates in a distributed environment. On the other hand, one major advantage of wllvm is that sections are linked correctly by the native linker, so no extra work with regard to LLVM sections has to be done during linking.

3See section 7.2 in Evaluation.


4| Compilation Automation

Software projects that involve multiple source files, header files, libraries and external de- pendencies, projects with a complex build process, or those that use intermediate compiler products often come with machine-readable build instructions. They are normally instructions for automatic build systems or other similar programs which do not require manual execution. Build systems serve to simplify and automate two major tasks [34]:

• generation and management of build artefacts and tracking of their dependencies,
• smoothing out the differences between platforms, compiler toolchains and build environments.

The build process can involve many steps and be arbitrarily complex, passing options to sub-stages, invoking external programs, but also checking for external dependencies, such as that a suitable version of a given library is installed on the system. What is more, compilation is a time-consuming and resource-intensive task (in terms of both processor time and main memory). Consequently, it is desirable that only the necessary files (whose source files have changed since the last compilation) are re-compiled and linked with already-compiled object files that remain unchanged. Incremental or partial (re-)compilation is, however, only useful for developers, not for end users.

A dependency can have two distinct meanings in this context. It can be a separate (often third-party) software package which needs to be installed on the system before the build can proceed – usually a library, sometimes a tool used in the build process [34] (an external dependency). Alternatively, a dependency can refer to a file within the project that another file depends on (an intra-project dependency): if the dependency changes, the dependent file also has to be re-compiled for it to be up-to-date.

4.1 Build Systems

There are a number of tools for build automation, including those for programming languages other than C and C++, for different platforms, or some IDE-specific build systems. Some notable non-C and C++ system examples include Ant1 for projects written in Java, Ruby’s Rake2, Mix3 for the functional language Elixir, or Qbs4 for the Qt project. Lately, continuous integration systems are becoming popular (such as Travis CI, Jenkins or Circle CI), and these also include a build automation component. The build systems for C and C++ programs are among the most popular, especially make and CMake. This chapter aims to familiarize the reader with available automated build tools, particularly those that work with C and C++ projects. Where applicable, the build tools are also presented in the context of dioscc.

1https://ant.apache.org/
2https://ruby.github.io/rake/
3https://elixir-lang.org/getting-started/mix-otp/introduction-to-mix.html
4https://wiki.qt.io/Qbs

An example of build system-specific instructions is a Makefile, associated with the Make build system. When invoked using the make command, the tool looks for a file called Makefile, which contains the build instructions. It is common to use a combination of automated build systems; for example, CMake or a configure script is used to generate the Makefile.

The build process carried out by a typical build system is split into 2 phases:5

1. Build configuration is mainly concerned with inspecting the build platform.

   i. The tool, taking into account the build instructions, examines the software installed in the system, to see what is available and whether it is possible to build the program at all.
   ii. To this end, it may attempt to compile and sometimes run feature tests – essentially tiny test programs; if the compilation fails, the tool concludes that the tested functionality is unavailable. Alternatively, it may contain a database of known systems and their properties.
   iii. At the end of the build configuration phase, the build tool will store the configuration information (like compiler flags and feature macro definitions) in a form which can be used during compilation.

2. The actual build, in which the software is compiled and linked. The build system performs the steps specified in the build instructions to produce libraries and executables which make up the package.

Build systems are meant to be compiler-independent and user-configurable, to a degree. This is where we can take advantage of the build system and provide our own compiler to be used to build a project.

4.2 Make

One of the best-known programs in the category of build automation tools is make, developed at Bell Labs in 1977. The entire project it belongs to is known as Make6. Make is a build automation system that was originally created to track (in-project) dependencies of programs. [22] This is a slightly different goal from the requirement that only changed files are re-compiled, although the latter is a direct consequence of explicitly spelling out the dependencies of a program. Identifying dependencies is one of the main reasons for

5The description of the build phases is taken from our divcc paper [34]. 6A comprehensive manual can be found at [39].

automated build systems. As was already mentioned, it can save a significant amount of resources, especially time, to only rebuild the required files. make is part of the POSIX standard, i.e. its presence is expected on POSIX-compliant operating systems [27]. Because make has the ability to execute shell commands, it is not limited to C and C++, but can be used with any programming language that can compile programs using a shell script or any task that can be defined using shell commands. [39]

4.2.1 Makefile

Makefile is the “configuration file” of the Make system, which contains the build instructions. The syntax of a Makefile is as follows:

TARGET/ACTION : DEPENDENCIES
        COMMAND
        COMMAND
        [ ... ]

A concrete example is shown in Figure 4.1. A target distinguishes the sets of commands to run. It either refers to a filename (like in the case of program), in which case the directory is checked for the presence of a file with this name and then the timestamp of the dependencies is checked and compared with that of the target (to see if we are building with the latest version of the file); or a name of a user-defined action (clean). We can state that a target is not a file using .PHONY, as demonstrated in the figure. In the absence of an explicitly specified default goal, the default target is the first target not starting with a “.”. [39] In our case, the default goal, or target, is program, built when make is run without specifying a different target (such as make clean). It is customary to have a clean target, in case we want to re-run make starting with a clean build (otherwise only the changed files are re-built7). There are 3 targets in our Makefile: an executable called program; an object file called program.o, which is a prerequisite for the program target and has its own dependency, the source file (program.c); and the clean target, which removes the created files, to get a “clean” directory. Since program.c is not a target, it has to already exist in the directory where we run make; otherwise, we get the following error message:

make: *** No rule to make target 'program.c', needed by 'program.o'. Stop.

If a target for program.c had been specified, it would have been used.
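The effect of the dependency tracking can be observed directly; a sketch of a session with the Makefile from Figure 4.1:

$ make              # compiles program.c into program.o and links program
$ make              # nothing has changed, so nothing is rebuilt
$ touch program.c   # bump the timestamp of the source file
$ make              # program.o and program are out of date and get rebuilt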

4.3 CMake

CMake was created in 2000, to replace The Autotools8 and to allow for easy building across different platforms. In particular, the aim was to have a single build system for both Windows (as part of IDEs) and Unix-like systems. A single CMake specification was meant to be used in multiple builds, so CMake needs to be able to work in a different directory than where the

7Naturally, this can also be overridden but this thesis does not intend to serve as an exhaustive guide.
8See the next section for details on Autotools.

program: program.o
        cc program.o -o program

program.o: program.c
        cc -c program.c -o program.o

.PHONY: clean
clean:
        rm -f program.o program

Figure 4.1: An example of a Makefile. The Makefile assumes existence of a C source file called program.c and builds it, resulting in an executable called program.

configuration and source files are located. Among other useful features of CMake is the ability to automatically search for programs, libraries, and header files that may be required by the software. [30]

CMake uses files called CMakeLists.txt (provided by the project developer), typically located in each main subdirectory of a project. The syntax of CMake is in the form command (arguments), as shown in the following example (later versions of CMake are case-insensitive regarding commands):

cmake_minimum_required( VERSION 3.14 )

project( Hello )
add_executable( hello main.c )

find_library( ZLIB NAMES z zlib PATHS /usr/lib /usr/local/lib )

if ( ZLIB )
    target_link_libraries( hello ${ZLIB} )
else()
    message( FATAL_ERROR "zlib not found" )
endif()

where project serves as a project identifier (several projects can be built with one CMakeLists) and the add_executable command specifies which source files belong to a given executable (i.e. the name of the executable the source files should be compiled into). More source files can be added (separated by white space) to a single executable and multiple executables can be created as part of a project. Apart from executables, one can also specify libraries which should be created from a list of source files, using add_library(). In this simple example we further asked cmake to find a library, namely zlib. We have to specify the names of the library and the paths that will be tried. The path to the library, if found, will be stored in the

ZLIB variable. Supplying a cmake_minimum_required is mandatory. Variables are referenced as ${VAR} and arguments can be grouped into a list using set. For instance, the cmake command set(Foo a b c) will set the Foo variable to a b c. There are two ways of passing this variable to a command – as command(${Foo}), where the list gets expanded, and the command is called with 3 arguments as command(a b c) or with the list preserved in a single argument, enclosed in quotes: command(“${Foo}”), which is called as command(“a b c”).[30]

With CMake, the usual process of configuring and building the project is done in a new directory, e.g. build, so we will demonstrate this workflow on our example CMakeLists.txt. After creating a build directory and changing to it, we run cmake on the parent directory (cmake ..). The output of cmake includes CMake-specific files, such as CMakeCache.txt, cmake_install.cmake and a directory, called CMakeFiles, which contains more CMake- specific files, along with a preparation for building the projects (directories for each of the built executables). Finally, the build directory contains a Makefile. The Makefile generated from our example CMakeLists.txt is 178 lines long, and produces a number of targets, including make all, make clean, make hello and intermediate targets for all source files, such as make main.o. Under CMakeFiles, in the hello.dir directory, a file called link.txt was created. This file contains linker arguments for the hello executable, including the path to the zlib library. The contents of the link.txt file are:

/usr/bin/cc CMakeFiles/hello.dir/main.c.o -o hello /usr/lib/libz.so
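For reference, the whole out-of-source workflow for this example boils down to the following (a sketch):

$ mkdir build && cd build
$ cmake ..     # configuration: generates the Makefile and the CMake-specific files
$ make         # the actual build, producing the hello executable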

Commands issued by make are tightly coupled with the CMakeFiles directory. CMake is a widespread way of building projects; it is used in projects such as boost, KDE – K Desktop Environment, or even ReactOS.9

4.3.1 Ninja

There are several different backends that CMake can use, the two most notable for UNIX-like systems being the already-mentioned Make and Ninja. Ninja is a build system designed to use another generator for creating its input files [29], so it is often used with CMake. Furthermore, the main focus of the tool is speed, so it is generally faster than Make.

Choosing Ninja over the default Make can be specified with:

cmake -GNinja ..
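The configuration step then generates a build.ninja file instead of a Makefile, and the build itself is started with ninja in place of make:

$ ninja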

4.4 configure script

As the outline in the section on build systems (4.1) suggests, the build phase is often preceded by a configuration phase, during which the system is examined. The system is checked for

9A list of big projects and companies using CMake can be found under https://cmake.org/success/.

availability of functions, system-specific macros are set, and prerequisites are built. With CMake, the configuration phase is initiated by running cmake (the build phase is make).

4.4.1 Autoconf

To avoid having to write configure scripts by hand or re-use existing ones from other projects, and because configure scripts almost always have common components, a tool called Autoconf [16] (part of the Autotools suite) was created to generate them. Autoconf takes a configure.ac file as its input (provided by the project developer10), along with Makefile.in (either generated by another system called Automake or also provided by the developer). Autoconf then generates the configure script, as its primary output, and internally uses a shell script called config.status, which produces a Makefile. This Makefile is created from Makefile.in. Again, Makefile.in comes into the process either as the product of running automake, based on the contents of Makefile.am (given by the project developer); or the Makefile.in itself is provided by the developer. Autoconf is, in fact, a collection of programs, used during the generation process. Some of the more important ones are: autoconf, autoheader, or autoreconf. Another set of programs comes from the Automake project, including automake and aclocal. The two projects closely cooperate and inter-relate in the build process, if both are used. For reference, the entire process of using Autotools when building projects and when preparing them for release is shown in Figure 4.2. This figure is still somewhat abstracted – there are more steps and sub-programs – but the provided level of abstraction is sufficient for the purposes of this thesis. Autoconf and other Autotools (Automake and Libtool) are developer tools: they are usually not required to build the software from release tarballs. Instead, the developer uses Autoconf to generate the configure script, which becomes part of the code distributed to users. The main purpose of Autotools is therefore to make it simple to ship a project to end users and to make a project portable to a wide range of systems.

4.4.2 Running ./configure

On the part of the end user, the two relevant products are the configure script and the Makefile, where the Makefile is obtained by running configure.11 Apart from the Makefile, other configure-related files are generated – a log file called config.log, and config.h, which contains the #define preprocessor directives that characterize the system and that are set as a result of running feature tests on the system12. config.log stores information on how the configuration proceeded, and is useful for diagnostics in case the configuration fails.
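From the end user’s perspective, the conventional sequence is just the following (a sketch; the install step is optional):

$ ./configure    # examine the system and generate the Makefile
$ make           # build the package
$ make install   # install the build products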

10The configure.ac file is written in a simple tool-specific macro language, called M4.
11Again, this is not entirely true. configure actually generates a file called config.status (a shell script), which in turn generates the Makefile from Makefile.in. However, this is not important to follow the thesis, so we will simplify the build process. From the perspective of the end user, running ./configure produces a Makefile.
12For feature tests, see section 4.4.3.

Figure 4.2: An overview of build file generation using Autotools. Picture from [10]. In the context of Autotools, Makefile.am and configure.ac are typically provided by the programmer (configure.ac can be also partially synthesized from the system), the remaining files are generated using Autoconf and Automake.

checking for getopt_long_only... yes
checking whether getopt is POSIX compatible... yes
checking for working GNU getopt function... yes
checking for working GNU getopt_long function... yes
checking whether getpass is declared... no
checking whether fflush_unlocked is declared... yes
checking whether flockfile is declared... yes
checking whether fputs_unlocked is declared... yes
checking whether funlockfile is declared... yes
checking whether putc_unlocked is declared... yes
checking for struct timeval... yes
checking for wide-enough struct timeval.tv_sec member... yes
checking host CPU and C ABI... x86_64
checking if the linker (/sbin/ld) is GNU ld... yes
checking for shared library run path origin... done
checking for the common suffixes of directories in the library search path... lib, lib
checking for iconv... no, consider installing GNU libiconv
checking for inline... inline
checking for off_t... yes
checking whether limits.h has ULLONG_WIDTH etc.... no

Figure 4.3: An example output of configure, from the gzip project when configuring with dioscc.

A small example excerpt from the output of the configure script of gzip is shown in figure 4.3. We can see that the script checks whether functions are declared, whether they adhere to the POSIX standard, what the ABI is, or for the suffixes of files. One of the first tools which is discovered in the configuration phase of a build, after basic sanity checks of the platform, is the system C compiler. The compiler is typically used in subsequent platform checks, as it is expected that it will be used to build the project. Otherwise, the build might fail or it might produce a faulty executable. In the case of gzip, the relevant lines from configure were:

checking for gcc... dioscc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...

implying that the default system compiler is gcc but we have overridden the setting with the CC option set to dioscc.

4.4.3 Feature Tests

The checks are often performed using small programs, called feature tests. For instance, a test can attempt to include a header or use a function to directly check whether the functionality is available. An example feature test from the gzip project, one that checks for the existence of the fpurge function, is shown in Figure 4.4. As demonstrated, a test usually begins with a set of macros (stored in confdefs.h) that describe the system; these are often the results of former feature tests, are used in all subsequent tests and are updated after each feature test runs.

configure:7924: checking for fpurge
configure:7924: dioscc -o conftest -g -O2 conftest.c >&5
failed to link, ld exited with exitcode = 1, signal = 0
stderr:
/usr/bin/ld: conftest.divcc.4535.o: in function `main':
/home/xbaranov/src/divine/next/gzip-1.10/conftest.c:88: undefined reference to `fpurge'

configure:7924: $? = 1
configure: failed program was:
|/* confdefs.h*/
| #define PACKAGE_NAME "gzip"
| #define PACKAGE_VERSION "1.10"
| #define HAVE_SYS_TYPES_H 1
| #define HAVE_SYS_STAT_H 1
[ ... lines omitted for brevity ... ]
| #define _GNU_SOURCE 1
| #define _NETBSD_SOURCE 1
[ ... lines omitted for brevity ... ]
| #define HAVE_FDOPENDIR 1
| #define HAVE_STRERROR_R 1
| #define HAVE_FCNTL 1
[ ... lines omitted for brevity ... ]
|/* end confdefs.h.*/
|/* Define fpurge to an innocuous variant, in case declares fpurge.
|   For example, HP-UX 11i declares gettimeofday.*/
| #define fpurge innocuous_fpurge
|
|/* System header to define __stub macros and hopefully few prototypes,
|   which can conflict with char fpurge(); below.*/
|
| #include
| #undef fpurge
|
|/* Override any GCC internal prototype to avoid an error.
|   Use char because int might match the return type of a GCC
|   builtin and then its argument prototype would still apply.*/
| #ifdef __cplusplus
| extern "C"
| #endif
| char fpurge ();
|/* The GNU C library defines this for functions which it implements
|   to always fail with ENOSYS. Some functions are actually named
|   something starting with __ and the normal name is an alias.*/
| #if defined __stub_fpurge || defined __stub___fpurge
| choke me
| #endif
|
| int
| main (void)
| {
| return fpurge ();
| ;
| return 0;
| }
configure:7924: result: no

Figure 4.4: A feature test that checks whether the fpurge function exists (from the gzip project).

5| DiOS

This chapter aims to present DiOS, a lightweight self-contained operating system, which represents a crucial component of DIVINE. For user-level programs, DiOS fills the same roles as a real operating system. However, one cannot boot DiOS on a real computer and one cannot interact with it as a user would: the only interface DiOS provides is the API – or Application Programming Interface. This interface specifies the conventions for library functions regarding number and type of arguments, the return type, and the required behaviour of the functions. To make DiOS maximally useful with existing programs, this API was modelled after the POSIX specification. The operating system is implemented in a mix of C and C++. The chapter is further organized as follows: the first section explains the motivation behind having a custom operating system in the context of formal verification and the choice of POSIX as the standard to follow. The second part gives a brief overview of the architecture of DiOS and its individual components, which are described in their respective sections. The DiVM virtual machine, as one of the components, is detailed in Section 5.3. It is followed by the most comprehensive section of this chapter, the internal libraries of DiOS (Section 5.4), as this part of DiOS is the most relevant for this work. Finally, the last component of DiOS is the kernel, described in Section 5.5. The last section serves to summarize the necessary changes we have made in the course of this work, with regard to DiOS.

5.1 Motivation

One of the most important requirements on program analysis is reproducibility. We want to deterministically get the same result in every run of the analyser (or test suite), to be able to diagnose the underlying problem. However, real-world programs are generally neither under our full control nor self-contained. If we disregard issues with concurrency and their influence on reproducibility of test results, another major concern for reproducibility in program analysis is the interaction of the program with the operating system and the environment. More specifically, real-world programs are not isolated – they tend to make use of operating system services, such as the filesystem, or memory allocation, and these often have side effects. For instance, a program can create a file in the course of its execution, and this file will still exist once the program has ended, which could possibly interfere with a subsequent run of the same analysis. The need to have control over the program’s behaviour is especially critical in model checking and other dynamic approaches, where the program is executed (or interpreted). As mentioned, another problem, apart from the need to have control over the program’s behaviour, is that programs are typically not self-contained. They use functions (and type

definitions, macros and other functionality) from the standard library, or even import external libraries. Because many of the standard functions are bound to the specifics of the operating system (processes need support from the scheduler, etc.), the operating system is expected to provide some of the standard libraries directly, without the need to install or import them manually. Standard libraries are rarely suitable for inclusion in formal analysis of programs due to their size and complexity. Fortunately, operating system services are almost exclusively used via system libraries, which makes it possible to solve the two problems at once: the library functions provided by DiOS shield the program from the outside environment.

These are some of the reasons why DIVINE supplies its own compact operating system, DiOS. The reasons stated above are considered with regard to the goals of dioscc. The basis of our motivation for creating DiOS is, of course, more complex. Firstly, verification, especially of programs that are intended for other users, should be system-independent, relevant for many operating systems. Secondly, normal operating systems are complicated, and provide parts that are unneeded for formal verification of user programs, such as device drivers. On the other hand, DiOS was designed to be lightweight, only retaining the parts necessary for verification. Moreover, we can better adapt the operating system for use in verification; for instance, we can directly amend functions to serve verification purposes – a good example would be assert, which has a special place in testing and verification; another example would be memory management functions, such as memory (de)allocation, which we need to closely monitor to report any memory errors, including leaks or illegal accesses. Concerning isolation, with DiOS, the program is fully isolated from the surrounding operating system (in which the model checker is running) although there is a “passthrough” mode available, which allows the program to make system calls to the real OS. In other words, DiOS is an operating system that can provide a realistic environment (including, for example, a filesystem, a standard library or process management) for executing POSIX-based programs. Since this environment is fully virtualised and isolated from the host system, program execution is always fully reproducible.

5.1.1 POSIX

POSIX stands for Portable Operating System Interface and defines a set of standards to ensure compatibility among different operating systems. More specifically, the standards comprise 4 volumes and concern the following conventions for the operating system and the C programming language [19]:

1. General terms; conventions for standard utilities (e.g. pwd, wait or umask) and header definitions for the C programming language.
2. Definitions for system service functions and subroutines, language-specific system services for the C programming language, function issues, including portability and error handling; essentially a collection of man pages of the functions defined by the standard.
3. Source code-level interface for command interpretation services (a “shell”) and common utility programs for application programs (utilities such as bc, lex and yacc or ar).
4. Extended rationale, including historical information and reasons for (not) including features.

To sum up, POSIX defines a standard OS interface and behaviour for the command interpreter, utility programs and the C programming language. The latest standard as of writing this thesis is POSIX.1-2017. Our main motive for choosing POSIX as the API to model was its popularity, with both application programmers and operating system vendors. Hence, this choice makes DiOS usable with a broad range of programs. We would also like to note that even though POSIX specifies many aspects of an operating system, only the API specs are relevant for DiOS. In particular, the relevant parts are the standards for header file definitions, the functions, subroutines and system services for C programs – i.e. the API and the behaviour the functions should adhere to.

5.2 Building verified executables with DiOS

When compiling programs with dioscc, the output of compilation is a hybrid (verifiable) executable. That is, we also obtain a bitcode representation of the program, along with the standard output of compilation. The resulting binary can be both executed on the host system, as if it was compiled using a standard compiler, and verified with DIVINE, which was modified to accept such hybrid executables as input directly. With hybrid executables produced by dioscc, DIVINE searches for an .llvmbc section and verifies the bitcode stored in this section. The approach presented above, dioscc, is not the only way of obtaining a bitcode version of the program in DIVINE. DIVINE still works in its former mode of operation (divine cc) – when a C or C++ file is passed to DIVINE, the model checker first compiles the file to LLVM IR before verifying it. It is not straightforward to obtain bitcode from bigger projects this way, which is the reason why dioscc was created.
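As a sketch of the intended workflow (the file names are illustrative), a small program can be compiled once and the very same file both executed and verified:

$ dioscc -o hello hello.c   # native executable with an embedded .llvmbc section
$ ./hello                   # runs natively on the host
$ divine check hello        # DIVINE loads the bitcode from the .llvmbc section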


Figure 5.1: A workflow of compilation with DiOS. [35]

The way DiOS enters the compilation process of dioscc is twofold:

1. DiOS is an integral part of DIVINE: it is used for scheduling threads and processes, it manages the fault handler and generally guides the execution of the verified program.
2. It provides its standard libraries and headers to the verified program, as a substitution for the system libraries of the host.

An overview of the compilation (or build) process is shown in Figure 5.1. The ‘user program’ box represents the source code, in C or C++, supplied by the programmer. In this

workflow, we first compile this source code, with dioscc, using DiOS-supplied headers. The headers provided by DiOS cover the APIs mandated by ISO C and by POSIX. Normally, these headers would come from the real operating system. Using DiOS versions of the headers is important to ensure that the program only uses the APIs implemented by DiOS, and that it uses them correctly. Internally, dioscc first translates the C or C++ code into LLVM IR. When targeting native execution, this bitcode is then further processed into native code and linked with the host libc. Since DiOS takes special precautions to be ABI-compatible with the host OS1, this process gives us an executable which can be deployed the usual way. However, the LLVM IR (that is, the bitcode) is also combined with DiOS to produce a viable input for a formal verification tool.


Figure 5.2: An overview of DiOS architecture. [35]

5.3 The DiVM Virtual Machine

Figure 5.2 gives a high-level overview of the internal architecture of DiOS. There are 4 major components involved: the kernel, DiOS libc, C++ support libraries and the execution platform. The following sections detail the respective parts.

The execution platform is not actually part of DiOS; rather, it is the equivalent of a virtual machine which runs the operating system. Similarly to real operating systems, DiOS is not a standalone, self-supporting system: it requires a “machine” to run on. In the case of DiOS, however, this machine is virtual, and the execution platform serves as an interface between the operating system and the actual machine the program and the operating system are running on. This is analogous to the interface between hardware and software in real operating systems. [35] In DIVINE, this virtual machine is called DiVM (DIVINE Virtual Machine) and it is the original platform DiOS was designed for. Although DiVM is its native platform, DiOS has since been successfully ported to KLEE [6] and can also be compiled and run as a user-mode kernel. The latter has not seen extensive testing or use, so we only mention it for completeness here. Furthermore, because portability of DiOS is only marginally related to the main topic of the thesis, we will only discuss KLEE briefly.

1What binary compatibility means is explained in more detail in Section 6.7.2.

As the main part of this section, we will consider the features of DiVM, its connection to DiOS, and the general specification of an execution platform in the context of DiOS. The execution platform serves a crucial role in both compilation and verification of programs. We have several requirements on the functionality that we expect this execution platform to provide, although most of them are optional and only enhance the platform’s range of capabilities. Lack of certain functionality might prevent some modules from working – these modules are part of the kernel and will be described in more detail in Section 5.5.

5.3.1 Execution Platform Support

Information about active procedure calls and about the local data of each procedure is, on most platforms, stored in the execution stack. One of the more important functions we expect the platform to provide is the means to switch stack frames – to transfer control to a different stack frame, similarly to how coroutines work. Stack switching is necessary to implement thread-based parallelism, but also important in normal program execution. Another related operation that we would like the platform to have is creation of a new execution stack, which is necessary in two scenarios: isolation of the kernel stack from the user-space stack and creation of new tasks (threads, co-routines or other similar high-level constructs). Another aspect of the interface is memory management: user programs allocate memory using both malloc/free and new/delete and the OS is expected to provide this memory to the program. Consequently, the platform needs to be able to allocate memory in variable-sized chunks.

Depending on the level of support from the execution platform, the following functionality can further be available (DiVM, as the native platform of DiOS, supports all of the functionality):

• Non-deterministic choice can be used to randomly choose a value from a given set and is useful, for instance, for thread scheduling or simulating malloc failures. Furthermore, symbolic values can be used if the set is infinite, such as the contents of a string or file. The model checker can, in these cases, explore the whole state space, or check whether an assignment exists where constraints on the symbolic values are violated.
• Passthrough/proxy mode is a form of program execution where the system calls invoked by the program (directly or indirectly during its execution) are propagated to the host system.

Table 5.1 summarises the most important features of DiOS and their dependence on platform features. The only requirement that appears in the table and was not yet mentioned is heap cloning – like memory safety, this feature is only required for process support – the fork system call needs to be able to clone all memory reachable through a given pointer.

feature stack nondet memsafe other ∗ malloc X memory management threads XX sync systems X processes XXX heap cloning signals X system calls supervisor mode∗ filesystem longjmp X exceptions X passthrough mode syscall execution replay mode X

Table 5.1: Summary of available DiOS features and what they require from the underlying verification platform: ‘stack’ means direct manipulation of the execution stack, ‘nondet’ means nondeterministic choice, ‘memsafe’ means that the feature relies on enforcement of memory safety (checking for out-of-bounds accesses, but it must also be impossible to construct “accidentally valid” pointers to existing memory). Optional items are marked with *.

5.4 DiOS Libraries

As mentioned previously, operating systems are expected to provide certain functions in the form of standard libraries. Because we need the entire program for dynamic analysis, including the functions imported from these standard libraries, we also need to compile the library code into the internal representation used as input for the formal verification (in our case LLVM IR). It is, however, problematic to obtain this bitcode from the existing standard libraries, as these are large and complex, often containing a lot of CPU-specific code and, furthermore, they are designed to be compatible with many system architectures and past versions of the library (so-called backward compatibility). This is why DiOS provides its own standard library, which we will refer to as DiOS libc. It is not possible to verify every program with a single tool, so we focus our standard library on the most common functions and architectures. Moreover, we provide the library directly in the form of an archive of LLVM IR bitcode files, making it easy to link to the bitcode generated from the user program.

Besides bitcode libraries, dioscc includes native versions of libc++ and libc++abi, offering support for C++ programs up to and including the C++17 standard. The reason why DiOS supplies its own C++ libraries is that different implementations of C++ libraries are usually not binary compatible with each other, and installing multiple versions of the C++ standard library is rather inconvenient (for one, we use ABI version 2 and many systems are built with ABI version 1). The libc++ and libc++abi libraries are adopted from the LLVM project, and DiOS versions only contain very minor modifications relative to upstream, mainly intended to reduce program size and memory use in verification scenarios. [35] DiOS further comes with its own implementation of libunwind.

Finally, another native library bundled with dioscc is libdios-host.a, which contains native versions of functions which are present in the DiOS libc but may be missing from the system one. Among the functions we have encountered are fpurge, freadahead and fseterr.

Unfortunately, work on verification of entire programs that make extensive use of operating system services is scarce. One tool which allows a degree of such interaction is KLEE, which provides a small subset of the standard C library in a fashion similar to ours. [35] Notably, KLEE used a modified version of the uClibc C library when analysing the coreutils program suite [40] and, similarly to DIVINE, provides its own libc implementation to be linked with the program under test and analysed alongside it. However, the libc used by KLEE is much smaller than the libc provided by DiOS, which allows us to analyse a wider range of programs.

5.4.1 System Calls

System calls form an interface that allows the program to use the services of the operating system. In DiOS, the list of system calls has to be fixed (relative to the host OS), in order for the “passthrough” mode to work. The property of system calls relevant for dioscc is that each system call has its library function counterpart, that is, it has an associated user-space C function, which can be called from user programs and acts as a proxy for kernel-space. The functions are declared in the builtin header files provided by DiOS and the implementations are part of DiOS libc. The functionality offered by DiOS covers thread management, the fork system call, signals (such as kill), various process-related system calls (getpid, getsid, etc.), filesystem (including the *at thread safe versions of functions from POSIX.1) and networking (e.g. sockets) support and more. [35] Notably, we currently do not support exec and it is not clear whether it is feasible to implement it in DiOS.

5.4.2 DiOS libc Coverage2

DiOS comes with a complete ISO C99 standard library and the C11 threading API. The functionality of the C library can be broken down into the following categories:

• Input and output. The functionality required by ISO C is implemented in terms of the POSIX file system API. Number conversion (for formatted input and output) is platform independent and comes from pdclib.
• The string manipulation and character classification routines are completely system-independent. The implementations were also taken from pdclib.
• Memory allocation: new memory needs to be obtained in a platform-dependent way. The library provides the standard assortment of functions: malloc, calloc, realloc and free.
• Support for errno: this variable holds the code of the most recent error encountered in an API call. On platforms with threads (like DiOS), errno is thread-local.

2This section is based on Section 4.4 of [35].

• Multibyte strings: conversion of Unicode character sequences to and from UTF-8 is supported.
• Time-related functions: time and date formatting (asctime) is supported, as is obtaining and manipulating wall time. Interval timers are currently not simulated, although the relevant functions are present as simple stubs.
• Non-local jumps: the setjmp and longjmp functions are supported.

In addition to ISO C99, there are a few extensions (not directly related to the system call interface) mandated by POSIX for the C library:

• Regular expressions. The DiOS libc supports the standard regcomp & regexec APIs, with an implementation based on the TRE library.
• Locale support: a very minimal support for POSIX internationalisation and localisation APIs is present. The support is sufficient to run programs which initialise the subsystem.
• Parsing command line options: the getopt and getopt_long functions exist to make it easy for programs to parse standard UNIX-style command switches. DiOS contains an implementation derived from the OpenBSD code base.

Finally, C99 mandates a long list of functions for floating point math, including trigonometry, hyperbolic functions and so on. A complete set of those functions is provided by DiOS via its libm implementation, based on the OpenBSD version of this library. In Table 5.2, we provide a quantitative measure of the libraries provided by DiOS, measured in lines of code (LoC). The LoC refer to the size of the libraries at the time of writing this thesis.

5.5 Kernel

The kernel is the central component of an operating system, typically responsible for the management of tasks (or processes), memory and other resources. It is essentially responsible for the communication between software and hardware, managing the available resources, checking for permissions and abstracting the low-level features of the operating system. As the architecture picture (Figure 5.2) shows, the DiOS kernel is modular: it is composed of a number of modules, each responsible for specific related tasks. The modules interact with each other and make use of functions of other modules. Only a few of these modules are mandatory (among others the scheduler), while the majority can be omitted in case the platform does not support a particular feature or the user program does not need the functionality provided by the module.

This modularity is useful for a number of reasons [35]:

• resource conservation: some components have non-negligible memory overhead even when they are not actively used. This may be, for instance, because they need to store auxiliary data along with each thread or process, and the underlying verification tool then needs to track this data throughout the execution or throughout the entire state space.

DiOS kernel        5532
    sys            2305
    vfs            2282
    proxy           945

C libc            23770
    pthread        1024
    sys             663
    regex          6755
    stdio          4013
    time           2252
    string         1327
    stdlib         1149
    other          6587

libm              18591

C++ libs
    libc++abi      8755
    libc++        16051

include           11928
    sys            2092
    (internal)     4084

Table 5.2: The size in lines of code (LoC) of the respective libraries and their sub-parts, as provided by DiOS.

• portability: this allows DiOS to be ported to platforms which do not support some of the components, for instance thread scheduling.
• configuration: as a direct consequence of modularity, DiOS can be used in various contexts without the need to modify the source code, by using different combinations of the modules for different verification situations.
• extensibility: it makes the kernel easily extensible with new modules.

We aimed to minimise the interdependence of the individual components, so that these can be used independently. The following modules are currently available in DiOS, some in multiple alternative implementations [35]:

• task scheduling: there are 4 scheduler implementations available:

– the null scheduler, which only allows a single task and does not support any form of task switching,
– a synchronous scheduler, suitable for executing software models of hardware devices,
– 2 asynchronous schedulers, both supporting thread-based parallelism – one for verification of safety properties of parallel programs, while the other includes a fairness provision and is therefore more suitable for verification of liveness properties.

• process management: implements the fork system call and other relevant functions, and stores process-related data (process ID, session ID, etc.). Requires one of the two asynchronous schedulers.

• auxiliary/support components: a notable auxiliary module is the fault handler, which is responsible for handling the problems detected during verification. It further allows the user to specify which of the problems should be reported as counterexamples and which should be ignored.

• POSIX system calls: many of the system calls are part of other modules – for example, the process-related system calls were previously mentioned under the process management module. By far the largest coherent group of system calls deals with files, directories, pipes and sockets, with file descriptors as the unifying concept. A memory-backed filesystem module implements those system calls by default. A smaller group of system calls relate to time and clocks and those are implemented in a separate component (clock) which simulates a system clock. The system calls covered by the filesystem and clock modules can be alternately provided by a proxy module, which forwards the calls to the host operating system, or by a replay module which replays traces captured by the proxy module.

• monitors: for specifying liveness properties of the program under test.

• base (stub) component: the components of the kernel are organised as a stack, where upper components can use services of the components below them. The base component is a fallback, offering stubs of system calls and other functions. That is, if a system call is not implemented in any of the upper components, the base module will report this.

Like in traditional operating systems, the DiOS kernel runs in the supervisor mode of the virtual machine, to enforce the separation of kernel memory from the user program. Supervisor mode is, however, not required, and DiOS can also run on platforms that do not offer it (an example of which is KLEE).

5.6 Changes in DiOS

In order to improve the range of support of DiOS libc and the verification itself, some changes to DiOS were necessary. The most prominent changes are outlined in this section.

5.6.1 Lazy Linking

Linking small binaries with dioscc (such as those created by autoconf feature tests) used to spend most of the time linking DiOS into the binary. The loader in DIVINE now checks whether __boot is defined in the bitcode module it has loaded. If __boot is undefined, we link in DiOS; otherwise the module already carries its own OS with it.

5.6.2 Added Functionality / Improved Coverage

Some of the functionality that was added to DiOS libc includes:

• Binary compatibility: in the course of working on the implementation part of this thesis, we have discovered that some structures break binary compatibility. This is discussed in more detail in Section 6.7.2.

• Weak symbols: we allowed the user to override library functions, e.g. to re-define the function write (a small sketch is shown after this list).

• There are now two versions of strerror_r(), the POSIX version (returns an int) and the GNU version (returns char *). We took the approach of glibc – when _GNU_SOURCE is defined, we provide the char * version, but we generally provide both.

• Improved coverage: Added support for:

– libm: frexp, modf – imported from OpenBSD
– further imported from OpenBSD: dlfcn.h, grp.h, utime.h, utmp.h, select.h, pwd.h, strftime, {get,set}progname, putenv, setenv and unsetenv, reallocarray and sig{empty,fill,add,del}set
– libc: __fpending – imported from pdclib
– added the sigaction system call
– struct timezone was added to sys/time.h
– compiler-provided/builtin headers: user programs assume that some headers are provided by the compiler. We have imported mm_malloc.h, float.h, limits.h and stddef.h from clang; stdarg.h from PDCLib; and promoted stdbool.h to a compiler-provided header.
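As a sketch of what the weak-symbol support enables (the function body is purely illustrative), a test program may supply its own definition of write, which then takes precedence over the weak library definition at link time:

#include <unistd.h>

/* Illustrative override: because the library exports write as a weak symbol,
   this strong definition replaces it when the program is linked. */
ssize_t write( int fd, const void *buf, size_t count )
{
    (void) fd;
    (void) buf;
    return (ssize_t) count; /* pretend that every byte was written */
}

int main( void )
{
    /* this call now resolves to the definition above */
    return write( 1, "hello\n", 6 ) == 6 ? 0 : 1;
}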


6| dioscc

In this chapter, we will describe the implementation of dioscc, its internal structure and how it fits into DIVINE. In the first section, we will discuss its sister tool, divcc, and outline their differences and intended use. Next, in section 6.2, we will give an example of how to use the two tools. We will then note some implementation issues and shortcomings of dioscc. In sections 6.4 and 6.5, we describe some of the design and implementation choices we made. In the following section, we take a closer look at the process of linking implemented in dioscc. Lastly, we will look at API and ABI compatibility with regard to dioscc and the target platform (section 6.7).

6.1 divcc and dioscc

While our main goal was to create a tool which would use the bitcode form of DiOS’ libc (and also libc++ and libc++abi in the case of C++ programs), we have also released a standalone variant of dioscc which works with the system libc. This tool will be referred to as divcc. divcc only obtains the bitcode of the input program, as libraries do not normally contain source files, so we have no way of obtaining their bitcode. As a result, some symbols (functions) will be undefined in the bitcode produced by divcc. It is up to the analysis tool to deal with such incomplete bitcode. The bitcode output of divcc should match that of wllvm, which also works with system libraries. Both tools (divcc and dioscc) can be used in automated build systems in place of the compiler and produce an executable which contains a bitcode section. With divcc, it should be straightforward to provide a different library in place of the DiOS libraries used in dioscc, if a tool supplies its own. Additionally, while dioscc uses header files that come from DiOS libc, divcc uses standard system headers, like a normal compiler would. These can also be easily substituted, as long as the alternate header files preserve binary compatibility1 with the host (or another system the analysis is aimed at).

6.1.1 divc++ and diosc++

Like gcc and clang, which both provide a C++ compiler counterpart, called g++ and clang++, both divcc and dioscc come with a C++ variant, called divc++ and diosc++, respectively. As is the case with the standard compilers, they are not separate tools from their C versions.

1Binary compatibility is one of the key issues with library substitutions: it is the property that both libraries use the same in-memory layouts for data structures, same numeric values for various named constants and so on. [34] A detailed discussion on binary compatibility is given further in this chapter, in Section 6.7.2.

As of the writing of this thesis, the C++ tools are provided as symbolic links to the original C compilation tools (divcc and dioscc), and the detection of the mode to use is delegated to the tools themselves (i.e. divcc checks at run time whether it was called as divcc or divc++). This affects the flags passed to sub-stages and ensures that the necessary libraries are linked in: in divc++, the path to the C++ system libraries has to be added to the link path, while in diosc++ our versions of the C++ libraries have to be linked in.

6.2 Intended Use

The aim of our tool is to maximally simplify analysis, and hence also LLVM bitcode generation, for the users. As such, the only change in the build process should be for the user to specify that they wish to use dioscc in place of the system compiler; the compiler does the rest of the work. Additionally, if the program contains C++ code, they also need to set the C++ compiler to diosc++.2

Here are some examples:

./configure CC=dioscc CXX=diosc++ (with autotools-based builds)
cmake -DCMAKE_C_COMPILER=dioscc -DCMAKE_CXX_COMPILER=diosc++ (with CMake-based builds)
make CC=dioscc CXX=diosc++ (with plain make-based builds)

The remainder of the build process should be unaffected. That is, there should not be any side-effects other than the parallel generation of LLVM bitcode and its manipulation – the extra .llvmbc section and bitcode linking. The choice of dioscc as the compiler should especially not affect the usual processes of configuration and building, which means that the interface of dioscc has to match that of the traditional compilers (in terms of commands, options, etc.), and that dioscc has to produce the intermediate files that the build system expects to be created in the process of building (such as object files). Furthermore, the tool needs to enable the creation of archives with embedded bitcode.

Some analysis tools, including DIVINE, can then directly verify the program by loading the embedded bitcode from the resulting executable. Otherwise, we provide a utility script, called extract-llvm, which takes an ELF file (either an executable or an object file), extracts the LLVM bitcode from the .llvmbc section and writes it to a standalone file (named inputfile.bc3) in the working directory. The use of the tool is illustrated in Figure 6.1 (and later in the next chapter on evaluation, in Figure 7.1). Notably, KLEE can also analyse bitcode built with divcc.

An alternative command for DIVINE (in place of divine check) would be:

$ divine exec --virtual -o nofail:malloc --stdin hello.gz ./gzip -f -d -

which runs a single execution in DIVINE (i.e. not all executions are explored) and produces the desired output:

2Analogously, the user can set the compilers to divcc and divc++ or wllvm and wllvm++.
3The .o extension is dropped if we are extracting from an object file.

> [0] hello world

The options are:

• --virtual instructs DIVINE to use the virtual filesystem,
• -o nofail:malloc disables checking for possible failures of malloc (a default option with divine check),
• the remainder of the options are identical to the previous DIVINE invocation.

6.3 Limitations

There are a few limitations that we are aware of; they are explained in this section.

1. The main compromise in the current implementation is related to shared libraries. When a binary is linked to a shared library, the machine code version is linked in the usual way. However, we still link the bitcode statically, because no analysis tool can currently resolve dynamic dependencies and automatically load the bitcode from shared libraries.

2. When inline assembly is present in the program, it is retained in the form of architecture-specific assembly instructions in the LLVM bitcode.

3. While considerable in size, DiOS libc is not exhaustive and functionality is added on an as-needed basis. A particular function used in the analysed program may be missing if it is rarely used.

4. We might uncover further ABI-compatibility issues. A current notable problem is libpthread, which is unfortunately not yet ABI-compatible (though this is not a fundamental problem and it will be resolved in a future release). Consequently, dioscc may produce broken binaries if they use the pthread API.

$ wget http://ftp.gnu.org/gnu/gzip/gzip-1.10.tar.gz
$ tar xzf gzip-1.10.tar.gz
$ cd gzip-1.10 && ./configure CC=dioscc && make
$ echo hello world | ./gzip - > hello.gz
$ divine check --stdin hello.gz ./gzip -f -d -

$ extract-llvm gzip
$ klee -exit-on-error gzip.bc hello.gz

Figure 6.1: An example use of dioscc (and divcc) to build and analyse gzip. [34] We first retrieve the archive of the gzip project from the official site using wget. With the following commands, we unpack the archive, change to the project directory, run the configure script (passing dioscc as the compiler to use) and run make to build the project. Next, we pass the traditional ‘hello world’ string to gzip to compress it into a file called hello.gz. We used DIVINE to check the behaviour of the gzip executable, or rather the corresponding LLVM IR code. Running the model checker is demonstrated on the 5th line of the example – we run it in check mode on the gzip executable, taking standard input from the hello.gz file. The other options are passed on to gzip and mean: decompress (-d), force decompression even if the input is a terminal (-f) and read the data to decompress from the standard input (-). The last two lines demonstrate use of dioscc with KLEE, a tool which needs to work with the LLVM IR directly, so we first extract it from the executable using extract-llvm.


Figure 6.2: From [34]: The flow of the compiled code within dioscc. The source code along with included header files is processed by the frontend and then middle end to generate LLVM IR. This IR is used by the code generator to produce object code, stored within the same object file. The linker then separately combines bitcode and machine code from object files and libraries to produce an executable which again contains both executable machine code and analysable bitcode.


6.4 Design

The main challenges with our approach concern the bitcode: first its generation, and then linking the bitcode embedded in hybrid object files and archives. To address them, we necessarily have to adapt the underlying compiler (the modifications are highlighted in Figure 6.2).

First, we need to modify the flow of data through the compiler to obtain the LLVM IR after the middle end and to store it alongside machine code in the object file. [34] Because we read the object file back at link time, we need to control its path in cases when object files are temporary.4 The reason for the choice of .llvmbc as the section name was spelled out in Section 2.4.

Second, we have to instruct the compiler driver to also invoke the bitcode linker when linking native code. That is, a new component was added to the compilation process – a bitcode linker. We have to extract the bitcode from the files that are being linked and save the output of bitcode linking to the ELF file that the native linker produces. In addition, the bitcode linker needs to be able to manage bitcode stored in archives. While a bitcode linker is part of LLVM, this linker can only combine individual modules and does not directly support linking bitcode archives, much less archives which consist of object files with embedded bitcode. [34] From the possible options, we have chosen to build our linker around the existing LLVM module-based bitcode linker. Importantly, our linker can also work with archives which only consist of bitcode files, which is the way we provide the DiOS libraries. The details of the actual implementation will be described in section 6.8.

4Having a custom path is not without its own specific issues. We later had to add the process ID of the current process to the filename, due to issues with parallel compilation, where different processes were overwriting each other’s object files.

6.5 Implementation

The dioscc tool is implemented in C++, as part of the DIVINE project5. As such, dioscc builds on clang, in the sense that clang is a reusable library, in addition to being a user-level program, and we can take advantage of its individual components. The use of the clang API is one of the reasons for the choice of C++ as the language of the implementation. (It is also the primary language of DIVINE.) We currently use clang version 7.0.0 in DIVINE and make use of both the Clang and LLVM projects as internal components (i.e. we use them via their public APIs). There are also foundations in place to use lld in a similar way. Most of the option parsing and processing is delegated to the underlying clang (its compiler driver), as well as the construction of correct linker arguments (for details on linking, see section 6.6). Most importantly, clang is responsible for the crucial part of the compilation process – compilation proper. Using clang internally makes it easier to implement a compatible external interface, making compiler substitution in build toolchains straightforward.

One notable implementation issue is related to functions which take a variable number of arguments. A notorious example of such a function is printf, the declaration of which is (taken from the manual page on the printf library function6):

int printf( const char *format, ... );

where any number of arguments can be passed to the function (through the ellipsis). LLVM provides a special instruction (called va_arg), which provides access to arguments passed this way, but clang instead emits an architecture-specific instruction sequence, which directly reads the arguments from machine registers or the execution stack. [34] We chose to emit the va_arg LLVM instruction instead, to preserve compatibility with different systems and to simplify program analysis.7
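For illustration, the following plain ISO C/C++ fragment shows what access to such arguments looks like at the source level; it is these va_* macros (provided by the builtin stdarg.h) that get lowered either to a target-specific sequence by stock clang, or to LLVM’s va_arg instruction by dioscc:

#include <cstdarg>
#include <cstdio>

// Sums `count` int arguments passed through the ellipsis.
int sum( int count, ... )
{
    va_list ap;
    va_start( ap, count );          // start after the last named parameter
    int total = 0;
    for ( int i = 0; i < count; ++i )
        total += va_arg( ap, int ); // fetch the next argument, assuming type int
    va_end( ap );
    return total;
}

int main() { std::printf( "%d\n", sum( 3, 1, 2, 3 ) ); } // prints 6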

6.6 Linking

A very specific part of the build process is linking. At first, only static linking was supported in dioscc, but eventually we added support for dynamic linking, though the bitcode is still linked statically (cf. Section 6.3). Thus, when executing dioscc and divcc with no additional options, the executable will be dynamically linked. When either -static or --static is passed, the output binary is linked statically. This can be checked with:

$ dioscc file.c
$ file a.out
> a.out: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped

5The site of the project can be accessed at https://divine.fi.muni.cz.
6man 3 printf
7The relevant code is located in the stdarg.h builtin header.

$ dioscc file.c --static
$ file a.out
> a.out: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, for GNU/Linux 3.2.0, with debug_info, not stripped

For the moment, we mirror the behaviour of the upstream clang and use the system ld for linking of object files (invoked using fork and exec). However, similarly to the way we use clang and llvm, we have imported and built the sources of lld and code is in place that uses lld for linking. Unfortunately, some of the executables from the coreutils suite (such as ls) were linked incorrectly when DIVINE was used with lld, because of an issue with initializers of global variables, so lld is disabled for now (the --use-lld option of dioscc can be used to force the use of lld for linking and should work for simpler programs). We aim to diagnose the issues and lift the limitations in the near future.

As mentioned before, we delegate the construction of linker arguments to clang.8 Additionally, we need to distinguish whether we link C or C++ code, which is achieved by using two distinct program names: dioscc and diosc++ (see also Section 6.1.1). The main difference between the two tools stems from the additional libraries that C++ programs require: the C++ runtime support library and the C++ standard library (and any system libraries these two language-specific libraries depend on – usually at least libpthread) [34].

6.6.1 Bitcode Linking

In addition to native linking of object files and libraries, we have to ensure that the embedded LLVM bitcode is linked alongside the corresponding object code in object files, and that the final bitcode of the full program can be inserted into the executable. If an archive which contains bitcode is created in the build process, it needs to be possible to transparently link this archive to other object files containing bitcode (without interfering with the normal process of linking of the archive). The reason why we need to replicate the behaviour of archive linking in the bitcode linker component of dioscc was described in Section 2.3.4. With static libraries, the bitcode is simply part of the individual object files, which form the library. Shared libraries are more similar to executables, i.e. are a result of the full build process, including linking, and the bitcode is stored the same way it would be in an executable – shared libraries carry an .llvmbc section, in addition to the native code.
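To make the role of the bitcode linker more concrete, the following is a minimal sketch built directly on LLVM’s module-based linker. It only handles plain .bc files; the actual dioscc linker (built around brick::llvm::Linker) additionally understands archives and bitcode embedded in ELF objects, which is omitted here:

#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IRReader/IRReader.h"
#include "llvm/Linker/Linker.h"
#include "llvm/Support/SourceMgr.h"

#include <memory>
#include <string>
#include <vector>

// Incrementally link a list of bitcode files into a single module.
std::unique_ptr< llvm::Module > link_bitcode( const std::vector< std::string > &files,
                                              llvm::LLVMContext &ctx )
{
    auto composite = std::make_unique< llvm::Module >( "whole-program", ctx );
    llvm::Linker linker( *composite ); // retains one module, links others into it

    for ( const auto &path : files )
    {
        llvm::SMDiagnostic err;
        auto m = llvm::parseIRFile( path, err, ctx );
        if ( !m || linker.linkInModule( std::move( m ) ) ) // true signals an error
            return nullptr;
    }
    return composite;
}

In dioscc, the same incremental approach is extended so that individual members of a bitcode archive can be pulled in one by one.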

6.7 API & ABI Compatibility

This section deals with the question of compatibility of different libraries (and, as such, operating systems in the broader sense). Specifically, we distinguish two kinds of compatibility issues: API compatibility, related to the interface of the functions, and binary (or ABI) compatibility, which concerns finer details, such as the concrete layout of structures. The two

8The relevant function, getJobs(), can be found in divine/cc/driver.cpp.

compatibility issues are described in their respective sections, including their specifics within dioscc.

6.7.1 API Compatibility

API compatibility issues are much easier to understand, discover and fix. They often result in a compiler error, where the program fails to build (and the compiler emits a helpful error message pointing to where the problem occurred). This is unlike ABI compatibility, where the program typically builds successfully but misbehaves or crashes.

API stands for Application Programming Interface and mainly defines the calling conventions of the functions that the library provides, which have to match the ones that the program uses. Collectively, an API of a library is the set of declarations9 of public functions, classes and their methods. In dioscc, we only consider C APIs, so only APIs of functions, C structures and other data are relevant. All of the aforementioned functionality which falls under the term “API” comes from header files, which are included in user programs. Therefore, both the library, which contains the relevant definitions of these functions, and the program which uses them need to adhere to the interface given in the header files. In the case of dioscc, the DiOS libraries also have to be API-compatible with the system libraries, since the native versions of the programs are linked to the host system libraries (but include DiOS header files). As previously mentioned, an example of an API is a function declaration, such as the fopen example:

FILE *fopen(const char *pathname, const char *mode);

The API we use is the one mandated by POSIX, as DiOS (which provides the libraries) was designed to be POSIX-compatible. While ensuring API compatibility seems straightforward, there are some API problems which are not intuitive. For example, one thing to consider when designing a C++ library is name mangling: if we intend to use C functions in C++ code, we have to encapsulate the declarations in an extern “C” block to prevent C++ name mangling. Another issue comes from the ambition of compatibility with a variety of system libraries, even different versions of a single library. It is, of course, not possible to support all functions and their different API versions, and even the task of identifying the functions which would allow us to analyse the greatest number of programs is difficult. [42]

A discrepancy at the level of API can be discovered early, when headers are included. Typically, a C compiler will give a warning but will provide an implicit declaration for a function that is used in the program but undeclared (with the default return type int), while compilation proper will fail with an error with C++ compilers (such as clang++).
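As an illustration of the name-mangling point above (simplified declarations, not taken from the DiOS headers), a C header that should also be usable from C++ code typically guards its declarations like this:

#include <sys/types.h> /* ssize_t, size_t */

#ifdef __cplusplus
extern "C" {
#endif

ssize_t write( int fd, const void *buf, size_t count );
int close( int fd );

#ifdef __cplusplus
} /* extern "C" */
#endif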

6.7.2 ABI (Binary) Compatibility

In this section, we will consider the portability of an executable (as in, the result of compilation) with regard to operating systems other than the one this executable has been compiled on. In

9The calling convention does not typically include the return type, as this is not examined when a function is called.

the context of DIVINE, we would like the executable that was compiled on DiOS to also work on the host system, with no change in behaviour. This is, naturally, crucial for the consistency of the verification result. As we have mentioned in Chapter 5, the workflow of compilation is split into two paths – the bitcode and the native executable – which are in the end combined into one hybrid executable (cf. Figure 5.1). These two paths have the first step in common: they both include header files from DiOS. When the first part of compilation – preprocessing – is performed, the header files get expanded (the #include statements in the program get substituted for the content of the actual header files). The next phase involves generation of LLVM IR bitcode from these preprocessed files. At linking, the two paths fork: the bitcode gets linked to DiOS, including DiOS libc, to get a representation of the program suitable for verification; in the other case, we use the bitcode as a basis for the generation of the native executable, but link in the standard libc of the host system. This can, for instance, be the GNU C library (glibc) in the case of Linux.


Figure 6.3: The relationship between header files and libraries as they enter compilation when building with dioscc.

The connection of header files and libraries of the two systems is more closely conveyed in Figure 6.3. The left hand side represents DiOS libc – both the library (in the form of an archive, libc.a) and the related header files, while the right hand side shows the libc of the host system. In the middle, the relevant parts of compilation and the types of the files output during the compilation phases are shown. In the upper half, the header files are included. In this case, both native execution and bitcode take the left side, DiOS headers. The split happens in the middle, at the level of object files / intermediate bitcode without function definitions. This is when each side links their respective library to the program, to get the hybrid executable as the overall result of compilation. When the right-hand side is taken, this symbolizes a normal compilation process on the host system, using e.g. clang or gcc. We want the native executable to be usable on the host system, with no change in behaviour from the verified LLVM IR bitcode. This requires, among other things, that DiOS header files be binary compatible with the host libc.

The term binary compatibility covers:

a) The layout of data structures which come from included header files has to match. This includes the order and size of the fields of the structure. Some examples of structures that we need to derive from the host system are dirent, sigaction or the stat structure.10

b) Values of constants also need to be the same. Some of the constants this is relevant for are:

• fcntl flags (F_DUPFD, F_{GET,SET}FL, etc.) • MSG_* flags – MSG_PEEK, MSG_DONTWAIT and MSG_WAITALL • SOCK_{NONBLOCK,CLOEXEC} • sigprocmask flags (SIG_{BLOCK,UNBLOCK,SETMASK}) • signal numbers, such as SIGABRT, SIGINT, SIGSEGV, SIGKILL, etc. • and others.

c) Names and type signatures of library functions: for public functions, this is already covered by API compatibility (i.e. if we are API-compatible, the ABI part of it comes for free), but we need to solve it for functions that are not part of an explicit API contract: e.g. what the assert() macro (or some other macros) expands to.

d) Names and types of global variables (e.g. stdin) and whether a given symbol is or is not in fact an actual variable (stdin is a macro on some systems, causing ABI-compatibility problems).

The above suggests that the ABI of DiOS is derived from the host operating system. We achieve this using a perl script, called hostabi.pl, which extracts the structure layout and constants from the host statically (at build time of DiOS) and generates a header file, called sys/hostabi.h. The header file can then be imported in DiOS, at places which use the constants and structures. The host values are also used in the final bitcode version of the program, which links in DiOS libc, as they become part of DiOS header files. As a result of binary compatibility, the native code generated from the verified bitcode can be linked to host libraries and executed as usual.
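To make the idea concrete, the fragment below sketches what a generated, host-derived header together with a layout check against the host could look like; the constant, the structure and the check are purely illustrative and do not come from the real sys/hostabi.h:

#include <cstddef>    // offsetof
#include <sys/stat.h> // the host's definition of struct stat

// Hypothetical host-derived constant, as a generator script might emit it:
#define _HOST_SIGINT 2

// Hypothetical replacement structure which must stay layout-compatible with the host:
struct dios_stat
{
    unsigned long st_dev;
    unsigned long st_ino;
    /* ... remaining fields elided ... */
};

// A build-time check that the replacement layout agrees with the host ABI.
static_assert( offsetof( struct stat, st_ino ) == offsetof( dios_stat, st_ino ),
               "st_ino must sit at the same offset in both definitions" );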

If a library’s header files are ABI-incompatible with the library that will be linked with the program (the object file), the result is not necessarily a program that cannot be executed at all. On the contrary, the errors are often subtle and require deep analysis to uncover. For instance, the problem might be caused by mismatched padding in a structure, resulting in a segmentation fault at runtime, or in a faulty value being retrieved with no warning. Only the native executable is affected in this setup and can misbehave as a result.

10As already mentioned, one of the deficiencies in DiOS with regard to binary compatibility that we are aware of is the libpthread library. The library is, unfortunately, not yet ABI-compatible with the host version due to the pthread structure.

6.8 Internal Structure

In this section, we will describe the actual code structure of dioscc and divcc in the context of DIVINE.11 The two main entry points of the program are tools/divcc.cpp and tools/dioscc.cpp. Code relevant to divcc is located under divine/cc, while dioscc’s code is in divine/rt (for “runtime”) and makes use of the common parts from divcc. The entry point of divcc is a simple wrapper for the Native structure (cc/native.{hpp,cpp}); similarly for dioscc and NativeDiosCC (rt/dios-cc.{hpp,cpp}), which inherits from Native. At this point, we set an attribute, called _cxx, to denote whether we are building in C++ mode (the tool was invoked as divc++ or diosc++). The function get_cpp_header_paths in divcc.cpp uses the clang driver to retrieve system include paths for C++ headers, using the default ToolChain. divc++ further sets the _GNU_SOURCE macro, as libstdc++ expects it.

6.8.1 Options

Relevant code:

• ./divine/cc/options.{hpp,cpp}

The command line arguments are passed directly to Native and, as one of the first things that happens in Native, the arguments are parsed into a representation that we will work with throughout the code (struct ParsedOpts). Some of the options are specific to a particular part of compilation (e.g. linking). These are collected at parsing and will be passed on to the sub-programs, analogously to how other compilers process them. That is to say, dioscc also has a driver which manages the compilation, invokes the relevant tools and passes options to them.12 In addition, some of the options have to be translated before they can be used in a particular part of compilation; for example, cc1 (responsible for the compilation proper) does not directly understand -fPIC; instead, this option is translated to the pair of options -mrelocation-model pic and -pic-level 1. At this time, we also make note of whether only some parts of the compilation should be performed (i.e. -E or -c was given).

If there are no input files, we pass the flags directly to clang to deal with. This is for cases when --help or --version are among the options. In any case, since we try to mirror the behaviour of clang and use it as the underlying compiler, the clang output, when passed these options, should be pertinent to our tool. We print DIVINE’s version in addition to clang’s version where applicable.
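The kind of translation described above can be pictured with the following hypothetical helper; the real handling lives in divine/cc/options.*, and the function name and shape here are made up for illustration:

#include <string>
#include <vector>

// Expand a driver-level flag into the cc1-level options it corresponds to.
std::vector< std::string > to_cc1_flags( const std::string &flag )
{
    if ( flag == "-fPIC" )
        return { "-mrelocation-model", "pic", "-pic-level", "1" };
    return { flag }; // most options are passed through unchanged
}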

6.8.2 Native, NativeDiosCC

Relevant code:

• ./divine/cc/native.{hpp,cpp}
• ./divine/cc/cc1.{hpp,cpp}
• ./divine/rt/dios-cc.{hpp,cpp}

11The paths are based on the code structure of the source tarball that is part of the thesis archive.
12We are aware that not all compiler options that clang offers are supported in dioscc.

The purpose of the attributes of Native is as follows:

• cc::ParsedOpts _po: parsed options • std::vector< std::string > _ld_args: ld-specific options • bool _cxx: true if we are building in C++ mode, • bool _missing_bc_fatal: specifies whether we should abort compilation when we encounter a file or library without bitcode – true in case of dioscc, false for divcc, • cc::CC1 _clang: the compiler driver, • PairedFiles _files: the mapping of input to output file names for the phase of compilation, see section 6.8.3.

As the presence of a compiler driver might suggest, Native is responsible for delegating the work to the sub-processes that the full compilation process is composed of. The main method of Native is run(), where it is decided which parts of compilation will be performed. The individual compilation stages are then invoked on the _clang member, which is of type CC1. The CC1 component of Native is, however, yet another layer over clang, which is the actual backbone that does the work. Most of the code from cc/cc1.* is shared with DIVINE’s traditional compiler, which can be invoked with divine cc, and has thus been re-used in divcc and dioscc. Because we unified the common parts with the original divine cc compiler and worked with these, we have, as a result, also improved that tool. Before the compilation stage, NativeDiosCC only adds DiOS-specific headers and include paths on top of the behaviour of Native. We will highlight differences in further stages where applicable.

6.8.3 Compilation

Additional relevant code:

• ./divine/cc/filetype.{hpp,cpp}
• ./divine/cc/codegen.{hpp,cpp}

Because preprocessing does not require any additional work on the part of our hybrid compiler with regard to bitcode, we will only outline the other stages. Compilation is taken from the point of view of LLVM: because we generate the object code from the LLVM IR, compilation produces hybrid object files as its output (the high-level compilation and assembly stages are merged, see Figure 6.4). At compilation, we need to consider the naming of object files, which are created at this point. Specifically, we have to handle user-specified names (given with -o), or temporary object files in the case when full compilation is requested and the object files are not preserved. Moreover, input files that are in an already-compiled format (object files and libraries) should not be re-compiled. The collected cc1 arguments are passed here, along with other, unrecognised options (options starting with a “-” that are not explicitly identified are passed on to clang to handle).
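As a small illustration of the temporary-object-file issue (and of the process-ID workaround mentioned in the footnote to section 6.4), a naming scheme along the following lines avoids clashes between parallel compiler invocations; the function is hypothetical, not the actual dioscc code:

#include <string>
#include <unistd.h> // getpid

// Derive a temporary object file name that is unique per compiler process.
std::string temp_object_name( const std::string &stem )
{
    return "/tmp/" + stem + "." + std::to_string( getpid() ) + ".o";
}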


Figure 6.4: The compilation proper stage in the context of DIVINE. The output of compilation is an object file, instead of assembly instructions.

During compilation, the code is represented in terms of LLVM’s Modules. From the LLVM Language Reference Manual [20]: LLVM programs are composed of Modules, each of which is a translation unit of the input programs. Each module consists of functions, global variables, and symbol table entries. Modules may be combined together with the LLVM linker, which merges function (and global variable) definitions, resolves forward declarations, and merges symbol table entries.

The compile function takes the input filename, its file type (which can be derived13) and the combined list of cc1 and additional arguments, and returns the llvm::Module that results from compilation of the file into LLVM IR. In CC1’s compile, before returning the Module back to Native, a final pass is performed on the Module, where the instructions generated by LLVM for the ellipsis variable arguments are replaced by LLVM’s va_arg instruction. Because clang’s compile does not automatically write this Module into a file, we then emit the object file ourselves14, taking advantage of LLVM’s infrastructure. In the first step, the .llvmbc section is embedded into the Module, with a copy of its contents, to preserve the code at the LLVM IR stage. Then, this augmented Module is written into a file as object code.
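The embedding step can be sketched as follows: the serialized bitcode is attached to the module as a constant global placed in the .llvmbc section, which the code generator then emits into the object file. This is an assumed illustration against the LLVM 7 API (and the global’s name is borrowed from clang’s -fembed-bitcode convention), not the actual divine/cc/codegen.* code:

#include "llvm/Bitcode/BitcodeWriter.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"

#include <string>

// Serialize the module and attach the bytes to it as a .llvmbc section.
void embed_bitcode( llvm::Module &m )
{
    std::string buf;
    llvm::raw_string_ostream os( buf );
    llvm::WriteBitcodeToFile( m, os ); // snapshot of the module at this point
    os.flush();

    auto *payload = llvm::ConstantDataArray::getString( m.getContext(), buf,
                                                        /* AddNull */ false );
    auto *gv = new llvm::GlobalVariable( m, payload->getType(), /* isConstant */ true,
                                         llvm::GlobalValue::PrivateLinkage, payload,
                                         "llvm.embedded.module" );
    gv->setSection( ".llvmbc" );
}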

6.8.4 Linking

Additional relevant code:

• ./divine/cc/driver.{hpp,cpp}
• ./divine/cc/link.{hpp,cpp}
• ./bricks/brick-llvm-link

13The deduction of the file type is straightforward: most of the time the type can be inferred from the extension. One exception we made concerns shared libraries and object files. If these cannot be inferred directly, and the type is still unknown, we check the magic number of the file – this is to avoid specifying all possible extensions, as, for instance, shared objects are often suffixed with versions (e.g. libz.so.1.2.11).
14Code generation is located under cc/codegen.*.

The majority of additional work takes place during the linking phase. Linking, unlike the other stages, is deferred to Driver15, which keeps an instance of a bitcode linker (brick::llvm::Linker). The bitcode linker internally retains a single Module and incrementally links other Modules to it. One key difference between divcc and dioscc at this stage is that divcc does not abort if it encounters symbols which remain undefined in the linked bitcode, as we expect not to have all library functions available in their bitcode form.

We first outline the workflow of linking in divcc and Native and then highlight the differences in dioscc. The process of linking in divcc has the following steps (also visualized as a sequence diagram in Figure 6.5):

1. If linking is requested (i.e. we are not stopped after other phases), the run() method (of the compiler driver – struct Native) invokes Native’s link(), the main entry point for the linking stage.

2. As the first step, link() initializes linker arguments, using init_ld_args() (which calls ld_args() at cc/link.*, which actually initializes the arguments needed for linker invocation from the parsed options and input files). The options relevant for linking include: library search paths (-L), libraries to link (-l), or explicit naming (-o).

3. We ask the Driver to construct the actual linker arguments that would have been used if clang was used directly (with getJobs()). This is necessary so that we do not have to find paths to system libraries, for instance. In fact, we instruct the Driver to derive all the commands the compilation would have been composed of, but we only pass it the files and options relevant for linking, and we only retrieve the linker job.

4. The Native driver decides whether to use system ld (which is invoked using fork and exec) or lld (using the library directly via lld::elf::link()) for native (object code) linking, based on options passed to divcc and links the native code with the selected linker.

5. The library search paths are cleared and re-set based on the linker job from getJobs().

6. The bitcode is linked using link_bitcode(), which, in divcc, is a simple wrapper for do_link_bitcode(). We want to re-use the common parts of divcc and dioscc and thus had to template the latter function by the driver used. The former, on the other hand, is used in dioscc to perform extra DiOS-specific work.

7. We link each file (link_bitcode_file()) into Driver’s linker incrementally: libraries are linked in directly, while object files are first loaded based on the filename and their bitcode section is retrieved. Next, we verify the final serialized module (i.e. check that it is a correct Module containing valid LLVM IR) and return it.

8. The serialized bitcode is injected into the binary with add_section().

15Driver comes from cc/driver.* and is also used by divine cc.

The dioscc mode reuses much of the functionality from divcc. The following outline gives a high-level overview of the sequence of steps; we focus on the parts which deviate from the divcc workflow. Again, we provide a sequence diagram to visualize the relationships between the components (Figure 6.6, in which dioscc-specific parts are highlighted in green).

1. NativeDiosCC again initializes the linker arguments explicitly (with init_ld_args()) from the options given to dioscc.

2. In link_dios_native(), we write the libraries that DiOS needs for its operation when running on the host system.16 Notably, this concerns libc++, libc++abi and libdios-host, of which the C++ libraries are only written if we are in C++ mode.

3. We delegate the common steps to the parent’s (Native’s) link():

3.1 init_ld_args: The function is idempotent, i.e. the arguments will not be created again.

3.2 Steps 3., 4. and 5. from the divcc workflow are identical. 3.3 In step 6., NativeDiosCC’s link_bitcode() is called instead.

4. The do_link_bitcode() method is used with dioscc’s driver, rt::DiosCC, and returns a serialized Module.

5. Because NativeDiosCC’s link_bitcode() was called from Native’s link(), it can do additional work before returning the control flow to link() to insert the LLVM IR section to the final binary.

5.1 The bitcode linker (from DiosCC, which inherits from Driver) is initialized with the full Module, so that additional libraries can be linked in.

5.2 The default DiOS bitcode libraries are linked in.

5.3 The final Module is checked for undefined symbols and verified.

6. The serialized bitcode of the binary (with linked DiOS libraries) is added into the .llvmbc section of the executable.

16See section 5.4 in the DiOS chapter.


Figure 6.5: A sequence diagram demonstrating linking in divcc and how the Native driver makes use of other components. link specifies an interface, rather than an instance of a class – it symbolizes functions from cc/link.*.


Figure 6.6: A sequence diagram demonstrating linking in dioscc. dioscc-specific parts are highlighted in green.

7| Evaluation

There are several perspectives to our evaluation. We selected a number of non-trivial but not too complex existing projects to demonstrate that dioscc can be used as a drop-in replacement for traditional system compilers. The projects in question are: GNU coreutils, Berkeley DB, Eigen (a C++ template library), gzip, libpng, SQLite and zlib. This evaluation was partly adopted from our paper on divcc [34] and its main goal is to confirm that the tools can be used with real-world projects that use automated build systems. We demonstrate that both divcc and dioscc are able to produce working executables, although not all of the binaries are problem-free (see section 7.2 for more details). Moreover, we have experimented with other analysis tools that take LLVM bitcode as their input, though only in the divcc configuration (without DiOS headers). The evaluation also demonstrates the usability of DiOS as a component of the model checker, as well as of the system libraries it provides to the programs.

7.1 System Information

The parameters of the system used in evaluation were:

make: GNU Make 4.2.1
cmake: cmake version 3.14.3
gcc: 8.3.0
clang: 8.0.0
glibc: 2.29

Operating System: Linux (kernel version 4.19.36)
Architecture: x86-64

7.2 Feasibility with Real Projects

In this section, we introduce the projects that were used in the evaluation. We discuss the results and issues that occurred in the course of building, executing or analysing the projects. We also provide measurements of the build time for each of the packages (collected in Table 7.1).

7.2.1 Summary

In order to assess the feasibility of the implementation, we have taken 7 existing C and C++ projects and built them from source using their respective build systems (which meant either cmake or configure, followed by make). We have approached this part of evaluation as a comparative study and built the packages using five compilation tools, namely: the two most common C compilers, gcc and clang; our two tools (dioscc and divcc); and wllvm, as a juxtaposition to divcc, to emphasize the effect of the two passes wllvm takes. The standard compilers were used for reference, to show how much overhead the generation and linking of bitcode brings, as the remainder of the tools all make use of clang as the underlying compiler.

Out of the tested projects, Eigen and zlib were built using cmake, the remaining projects used an Autotools configure script which generated a Makefile. Each project was built in 5 configurations: with divcc, dioscc, clang version 8, gcc version 8.3 and wllvm. All tools have built all the projects successfully, with some issues described in Section 7.2.2.

The projects and the used versions were as follows: [34]

– coreutils 8.31 is a set of over 100 GNU core utilities and various helper programs for file and text manipulation (such as cat or ls) and shell utilities (env, pwd, and others),
– gzip 1.10 – a data compression and decompression utility,
– Eigen 3.3.7 [C++] is a header-only template library that provides linear algebra structures, such as matrices and vectors, and operations on them,
– SQLite 3.28.0 – a widely used embedded SQL database engine,
– BerkeleyDB 4.6.21 – another database library, more closely coupled with the application,
– libpng 1.6.37 – a library for reading and writing PNG image files,
– zlib 1.2.11 – a compression library, included because it is required by libpng.

The measurements (Table 7.1) show that both divcc and dioscc are slower than upstream clang, in both configuration and building of the software. During configuration, the tools were about 30-60% slower than clang, while build times vary more, depending on the project – with sqlite, which took the longest to build, the difference was negligible; on the other hand, with Berkeley DB, it took up to twice as long to build the package with our tools, compared to clang. The poorer time performance is, however, understandable: clang is used as the underlying compiler in both of our tools (and in wllvm), and the additional overhead of generating and linking bitcode (and verifying the modules) has to be accounted for. The overhead of our implementations is still notably smaller than that of wllvm, which compiles source code in two passes. The times for wllvm also exclude the additional time required to link the bitcode when extract-bc is executed. gcc proved to be faster than clang (and all the other tools, all based on clang) in all cases (up to twice as fast). On the other hand, this is not a sufficiently big and varied sample to derive conclusions about the two mainstream compilers.

7.2.2 Package Details1

• Eigen: This was the only project of the selection which uses CMake exclusively. Since it is also a header-only library, the build instructions mainly exist to build tests (with make buildtests) or build and run them (make check). As some of the tools we used did not manage to build all test files, we did not include compilation of the tests in the time measurements.2

• Berkeley DB: In this case, shared libraries have been disabled,3 to include at least one statically-built library in the evaluation. In dioscc, several of the binaries result in a segmentation fault when run. This is due to the use of the libpthread library, as the system version is not ABI-compatible with the DiOS version of libpthread. In this case, it was also necessary to run the configure script specially for wllvm, passing WLLVM_CONFIGURE_ONLY=1 in the environment, as the bitcode files it otherwise produces confused the build system.

• SQLite: This package was configured with --disable-dynamic-extensions because DiOS (and hence dioscc) does not currently support the dlopen family of functions. SQLite further exhibited the same problem as Berkeley DB when built with dioscc, due to the ABI incompatibility of libpthread.

• libpng: This package was partly included in the evaluation since it has a dependency on a 3rd-party library, namely zlib. We built zlib version 1.2.11 using the same tool as libpng and provided the resulting libz.so or libz.a to libpng at configure time – in this case, we built both a static and a dynamic variant of libpng (along with a matching build of zlib).

            gcc           clang         divcc         dioscc        wllvm
coreutils   1:33 + 0:22   2:28 + 0:31   3:59 + 0:42   3:46 + 0:45   7:01 + 1:04
db          0:18 + 0:20   0:33 + 0:26   0:43 + 0:45   0:51 + 0:47   1:11 + 0:53
eigen       0:14 + 0:00   0:21 + 0:00   0:27 + 0:00   0:30 + 0:00   0:34 + 0:00
gzip        0:21 + 0:02   0:45 + 0:04   1:04 + 0:05   1:13 + 0:05   2:08 + 0:09
libpng      0:05 + 0:09   0:10 + 0:10   0:16 + 0:11   0:17 + 0:12   0:28 + 0:18
sqlite      0:05 + 1:23   0:11 + 2:03   0:17 + 2:08   0:20 + 2:12   0:27 + 3:24
zlib        0:02 + 0:01   0:03 + 0:02   0:05 + 0:03   0:05 + 0:03   0:06 + 0:04

Table 7.1: From [34]: Total elapsed time for the configuration and compilation phases (shown in the table as configuration + compilation time) of different software packages. The clang and gcc columns are baseline compilers which only produce native executables. wllvm was included as a reference point for divcc, as both its goals and mode of operation are closest to that of divcc. The make command was run with 4 jobs in parallel using -j4.

1From 4.2 of [34].
2This is the reason for the zero build time of Eigen for all compilers.
3using the --disable-shared configure flag

7.2.3 Worked Example: libpng

Since the libpng example with an external library (zlib) has extra steps, we will use it as a demonstration (shown in Figure 7.1). There are 3 main steps, distinguished as separate blocks of code in the figure. First, we download the zlib source files and build the library with dioscc. Since zlib’s configure script does not accept the option to set the compiler directly, we run configure in a modified environment under env. When building zlib, both libz.so and libz.a are created and both contain the .llvmbc section (which can be checked with objdump -h [file]). Once we have the necessary prerequisite, in the second part, we download and build the libpng project. Here, dioscc and diosc++ can be passed directly as options of configure. The tests can be built and they all pass. Furthermore, when testing libpng, we have uncovered a bug in DIVINE4, related to the setjmp function.

The libpng library is built under the build/.libs directory (as a dynamic library – libpng16.so.*). When it is checked for dynamic dependencies (with ldd), it shows the system zlib under /usr/lib. This is due to the fact that two linkers exist – the build-time linker (ld) and the runtime linker (ld.so) – and they use separate library search paths. Thus, when a library path is passed using the -L option, this is an instruction for the build-time linker to add the given path to its search paths. If we want to instruct both linkers, we can do this by also setting rpath:

$ ../configure CC=dioscc CXX=diosc++ LDFLAGS="-L$HOME/src/divine/next/zlib-1.2.11/build -Wl,-rpath,$HOME/src/divine/next/zlib-1.2.11/build"

ldd then gives us:

$ ldd .libs/libpng16.so
> linux-vdso.so.1 (0x00007ffc57beb000)
> libz.so.1 => /home/xbaranov/src/divine/next/zlib-1.2.11/build/libz.so.1 (0x00007fe2db1d3000)
> libm.so.6 => /usr/lib/libm.so.6 (0x00007fe2db057000)
> libc.so.6 => /usr/lib/libc.so.6 (0x00007fe2dae92000)
> /usr/lib64/ld-linux-x86-64.so.2 (0x00007fe2db226000)

Finally, as the third step in Figure 7.1, we check the pngimage executable5 with DIVINE. The command we used is shown as the last part of the figure.

If we do not pass the path to the zlib library (with -L), dioscc, and consequently configure, fails when bitcode is not found in the system zlib. The relevant lines in the config.log are:

> configure:9896: checking for zlibVersion in -lz
> configure:9921: dioscc -o conftest -g -O2 conftest.c -lz -lm >&5
> compilation failed: LLVM Error: Bitcode section not found in

4https://divine.fi.muni.cz/trac/ticket/70
5pngimage is a helper binary used in libpng tests.

$ wget https://zlib.net/zlib-1.2.11.tar.gz
$ tar xzf zlib-1.2.11.tar.gz
$ cd zlib-1.2.11
$ mkdir build && cd build
$ env CC=dioscc CXX=diosc++ ../configure
$ make

$ wget https://sourceforge.net/projects/libpng/files/libpng16/1.6.37/libpng-1.6.37.tar.gz/download -o libpng-1.6.37.tar.gz
$ tar xzf libpng-1.6.37.tar.gz
$ cd libpng-1.6.37
$ mkdir build && cd build
$ ../configure CC=dioscc CXX=diosc++ LDFLAGS=-L$HOME/src/divine/next/zlib-1.2.11/build
$ make
$ make test

$ divine check --capture ../contrib/pngsuite/basn0g01.png:follow:/image.png ./pngimage /image.png

Figure 7.1: An example use of dioscc to build the libpng project and verify the pngimage executable, one of the products of the build. The zlib library is a prerequisite of the project, so we build it first.

7.2.4 gzip

We further analysed the gzip compression utility with DIVINE, using the hybrid executable as input to the verifier. While this is a simple utility (compared to typical programs), it serves very well to illustrate problems inherent in automated software verification. Even though the implementation of the compression algorithm used in gzip is entirely platform-independent, there is a range of platform-dependent issues that gzip needs to deal with. This includes input and output, but also resource limits, command line parsing (getopt) and so on. Since POSIX platforms (and pre-POSIX UNIX) have a number of subtle differences, gzip also bundles a portability library known as gnulib which comprises almost 18 thousand lines of code and often relies on minute details of the operating system (for comparison, gzip itself is less than 7000 lines). The steps taken to build and analyse gzip were shown in Figure 6.1. The evaluation shows that the resulting LLVM bitcode can be both executed natively on the host system and loaded and verified using DIVINE.6

7.3 DiOS

We took two approaches to the evaluation of DiOS. Firstly, as part of the model checker, DiOS is used in all verification efforts and DIVINE’s test suite exercises all aspects of the

6The gzip utility (version 1.8), both its configuration and build, is now part of our nightly test suite. We build an un-optimized configuration (-O0) and two configurations with different levels of optimizations enabled (-O1 and -O2).

operating system. The execution platform in this case is DiVM, the original platform DiOS was designed for. The test suite comprises approximately 2200 programs (about 1300 of which are written in C, the rest in C++), covering a broad range of checked properties: memory errors, assertion violations, uninitialized variables or deadlocks. The model checker output is compared against its expected behaviour, i.e. the expected violations are annotated in the programs. About 700 of the programs were evaluated in symbolic verification mode (i.e. they work with symbolic inputs). The evaluation was performed on Linux 4.19 with glibc 2.29 and on OpenBSD 6.5. As a second goal, by compiling existing projects (including the projects from section 7.2), we have evaluated the API and ABI compatibility of DiOS headers and libraries with the host OS. Most of the evaluation concerning real projects was done manually (i.e. the behaviour of the generated native executables and the behaviour of the program when loaded in DiOS was checked manually). In addition to the projects mentioned above, we have also tested DiOS on [35]:

• diffutils 3.7 – programs for computing differences between text files and applying the resulting patches – the diffing programs compiled and diff3 was checked to work correctly, while the patch program failed to build due to lack of exec support on DiOS, • sed 4.7 builds and works as expected, • make 4.2 builds and can parse Makefiles, but it cannot execute any rules due to lack of exec support, • the wget download program failed to build due to lack of gethostbyname support, the cryptographic library nettle failed due to deficiencies in our compiler driver and mtools failed due to missing langinfo.h support.

8| Conclusions

The main contribution of this thesis is an open-source tool, dioscc, which can be used in place of the system compiler and produces hybrid binaries which can be both executed natively and verified using DIVINE. With dioscc, the standard libraries of the host system are substituted with the bitcode versions of the system libraries provided by DiOS, a small operating system bundled with DIVINE. The primary motivation behind dioscc was to facilitate generation of the LLVM intermediate representation of entire projects, by making integration with automated build systems simple and by taking advantage of existing build instructions of the projects. dioscc builds on clang and the LLVM project, which makes the integration into automated build systems straightforward.

Secondly, we provide a sister tool called divcc, which does not link in DiOS libraries, but instead works with the libraries of the host system. It can, therefore, be used with different library substitutions, if a tool supplies its own bitcode libraries. The function of divcc is similar to that of a tool called wllvm, used, for instance, in combination with KLEE (which provides a custom subset of libc). divcc and dioscc (unlike wllvm) embed the bitcode directly inside ELF files. The native code is then generated from the bitcode, not in a second pass, mitigating some of the problems of the approach taken by wllvm.

Finally, we have evaluated the tools on 7 real-world projects, demonstrating their functionality and the practicality of the approach. We have compared the performance of the build process with standard compilers and with the wllvm tool. The executables created by dioscc can be loaded into DIVINE and the embedded LLVM bitcode can be verified, as we have confirmed.

8.1 Future Work

There is still a lot of work to be done on dioscc. Most importantly, while the full-program LLVM bitcode can be input to DIVINE (even as part of a binary), we aim to also make it usable in other tools, which is not a trivial task. Secondly, we want to investigate the issues regarding lld and offer it as an alternative to the system ld, which is currently the default, for the linking phase. Thirdly, we constantly improve coverage of DiOS libc so that a maximal number of programs can be verified. Some examples of functions which are only provided as stubs for now include {get,free}addrinfo, readv, writev, popen, pclose, or utimes. One of the biggest implementation challenges remains exec and we are yet to assess whether it is possible to implement it in DiOS.

Lastly, there is significant duplication of code logic between the Native and Driver structures and we intend to unify the two. The two structures remain independent until we can safely use the new logic in divine cc. For now, the time-tested Driver was kept as the main wrapper for the compiler.

A| Archive and Manual

As an integral part of this work, we include the source code of the implementation, in the form of a tarball, divine-4.4.0.tar.gz. The tarball includes the latest upstream release of DIVINE at the time the thesis was submitted. Most of the project structure has been detailed in Section “Internal Structure” (6.8). The archive also includes the script for bitcode extraction from object files and executables, extract-llvm, mentioned in Section 6.2.

After unpacking the tarball, the key programs, and their dependencies, can be built with the following command (the build will take some time):

$ make divcc dioscc

Symbolic links divc++ and diosc++, pointing to the C compilation tools divcc and dioscc respectively, have to be created in a directory which is in PATH.

The archive further contains the following tarballs:

• klee-2.0+dios.tar.gz – a patched version of KLEE, one that can be used with DiOS • divine-4.3.0+dios.tar.gz – version of DIVINE that contains DiOS that can be used with KLEE

Using the two above files in combination makes it possible to use DiOS ported to KLEE (the port has been mentioned in Section 5.3). Note that KLEE has only been evaluated on simple programs and the patches are not yet upstream. To use KLEE with DiOS, the patched versions need to be built and the resulting klee binary needs to be put on the PATH. DIVINE can be built by simply running make in the main project directory. More information on the DiOS port to KLEE can be found in [35] and at https://divine.fi.muni.cz/2019/dios.


Bibliography

[1] Clang Static Analyzer. scan-build. 2019. url: https://clang-analyzer.llvm.org/scan-build.html (visited on 10/23/2019).

[2] Thomas Ball. “The Concept of Dynamic Analysis”. In: SIGSOFT Softw. Eng. Notes 24 (1999), pp. 216–234. issn: 0163-5948.

[3] Zuzana Baranová, Jiří Barnat, Katarína Kejstová, Tadeáš Kučera, Henrich Lauko, Jan Mrázek, Petr Ročkai, and Vladimír Štill. “Model Checking of C and C++ with DIVINE 4”. In: Automated Technology for Verification and Analysis (ATVA 2017). Vol. 10482. LNCS. Springer, 2017, pp. 201–207.

[4] Eli Bendersky. Load-time relocation of shared libraries. 2011. url: https://eli.thegreenplace.net/2011/08/25/load-time-relocation-of-shared- (visited on 12/07/2019).

[5] Eli Bendersky. Position Independent Code (PIC) in shared libraries. 2011. url: https://commons.wikimedia.org/wiki/File:Autoconf-automake-process.svg (visited on 12/07/2019).

[6] Cristian Cadar, Daniel Dunbar, and Dawson Engler. “KLEE: Unassisted and Automatic Generation of High-coverage Tests for Complex Systems Programs”. In: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation. OSDI’08. 2008, pp. 209–224. url: http://dl.acm.org/citation.cfm?id=1855741.1855756.

[7] Marek Chalupa, Martina Vitovská, Martin Jonáš, Jiri Slaby, and Jan Strejček. “Symbiotic 4: Beyond Reachability”. In: Tools and Algorithms for the Construction and Analysis of Systems. Ed. by Axel Legay and Tiziana Margaria. Springer Berlin Heidelberg, 2017, pp. 385–389.

[8] David Chisnall. Modern Intermediate Representations (IR). 2017. url: https://llvm.org/devmtg/2017-06/1-Davis-Chisnall-LLVM-2017.pdf (visited on 11/11/2019).

[9] TIS Committee. Tool Interface Standard (TIS) Portable Formats Specification: Executable and Linkable Format (ELF). 1993. url: https://refspecs.linuxfoundation.org/elf/TIS1.1.pdf (visited on 12/02/2019).

[10] Wikimedia Commons. Autoconf and automake process. 2011. url: https://commons.wikimedia.org/wiki/File:Autoconf-automake-process.svg (visited on 11/29/2019).

[11] Calle Erlandsson. The Four Stages of Compiling a C Program. 2015. url: https://www.calleerlandsson.com/the-four-stages-of-compiling-a-c-program/ (visited on 02/01/2019).

[12] Inc. Free Software Foundation. GCC online documentation: SSA. 2019. url: https://gcc.gnu.org/onlinedocs/gccint/SSA.html#SSA (visited on 10/31/2019).

[13] Inc. Free Software Foundation. GCC online documentation: The C Preprocessor. 2019. url: https://gcc.gnu.org/onlinedocs/cpp/ (visited on 11/21/2019).

[14] Inc. Free Software Foundation. GNU C manual: Header Files. 2018. url: https://www.gnu.org/software/libc/manual/html_node/Header-Files.html (visited on 12/08/2018).

[15] Fröhlich. The True Story of Hello World. 2019. url: https://lisha.ufsc.br/teaching/os/exercise/hello.html (visited on 11/22/2019).

[16] GNU. Autoconf. 2016. url: https://www.gnu.org/software/autoconf/ (visited on 11/29/2019).

[17] Using the GNU Compiler Collection: Code Gen Options. 3.16 Options for Code Generation Conventions. url: https://gcc.gnu.org/onlinedocs/gcc/Code-Gen-Options.html (visited on 12/07/2019).

[18] Henning Günther, Alfons Laarman, and Georg Weissenbacher. “Vienna Verification Tool: IC3 for Parallel Software”. In: Tools and Algorithms for the Construction and Analysis of Systems. Ed. by Marsha Chechik and Jean-François Raskin. Berlin, Heidelberg: Springer Berlin Heidelberg, 2016, pp. 954–957.

[19] IEEE and The Open Group. The Open Group Base Specifications Issue 7, 2018 edition; IEEE Std 1003.1-2017. 2018. url: https://pubs.opengroup.org/onlinepubs/9699919799/ (visited on 10/26/2019).

[20] University of Illinois at Urbana-Champaign. LLVM Language Reference Manual. 2019. url: https://llvm.org/docs/LangRef.html (visited on 10/30/2019).

[21] Michalis Kokologiannakis, Ori Lahav, Konstantinos Sagonas, and Viktor Vafeiadis. “Effective Stateless Model Checking for C/C++ Concurrency”. In: Proc. ACM Program. Lang. 2.POPL (Dec. 2017), 17:1–17:32. url: http://doi.acm.org/10.1145/3158105.

[22] Feldman Bell Laboratories and S. I. Feldman. “Make — A Program for Maintaining Computer Programs”. In: 9 (1979), pp. 255–265. url: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.9198&rep=rep1&type=pdf.

[23] ld.so(8) Linux Programmer’s Manual. 5.00. Feb. 2019.

[24] LearnCpp. Static and dynamic libraries. 2018. url: https://www.learncpp.com/cpp-tutorial/a1-static-and-dynamic-libraries/ (visited on 02/01/2019).

[25] Carl Leonardsson, Magnus Lång, Kostis Sagonas, and Phong Ngo. Nidhugg. 2019. url: https://github.com/nidhugg/nidhugg (visited on 11/18/2019).

[26] Rainer Leupers and Peter Marwedel. Retargetable Compiler Technology for Embedded Systems: Tools and Applications. Norwell, MA, USA: Kluwer Academic Publishers, 2001. isbn: 0-7923-7578-5.

[27] make(1posix) POSIX Programmer’s Manual. The Open Group. 2003.

[28] Debian jessie Manual. CBMC and goto-cc manual page. 2014. url: https://manpages.debian.org/jessie/cbmc/goto-cc.1.en.html.

[29] Evan Martin, Nico Weber, Scott Graham, Nicolas Despres, and Brad King. Ninja. 2019. url: https://ninja-build.org/ (visited on 12/02/2019).

[30] Ken Martin and Bill Hoffman. Mastering CMake. 4th ed. Kitware, Inc., 2008.

[31] Nicholas Nethercote and Julian Seward. “Valgrind: A framework for heavyweight dynamic binary instrumentation”. In: Proceedings of the 2007 Programming Language Design and Implementation Conference. 2007.

[32] Oracle. The Java HotSpot Performance Engine Architecture. 2019. url: https://www.oracle.com/technetwork/java/whitepaper-135217.html (visited on 10/31/2019).

[33] Tristan Ravitch. github: Whole Program LLVM. 2019. url: https://github.com/travitch/whole-program-llvm (visited on 11/23/2019).

[34] Petr Ročkai and Zuzana Baranová. “Compiling C and C++ Programs for Dynamic White-Box Analysis”. In: FM Workshops Post-Proceedings. LNCS. Oct. 2019.

[35] Petr Ročkai, Zuzana Baranová, Jan Mrázek, Katarína Kejstová, and Jiri Barnat. “Reproducible Execution of POSIX Programs with DiOS”. In: Software Engineering and Formal Methods. Sept. 2019, pp. 333–349. isbn: 978-3-030-30445-4.

[36] J. Singer. Static Single Assignment Book. 2018. url: http://ssabook.gforge.inria.fr/latest/book.pdf (visited on 10/31/2019).

[37] Carsten Sinz, Florian Merz, and Stephan Falke. “LLBMC: A Bounded Model Checker for LLVM’s Intermediate Representation”. In: Tools and Algorithms for the Construction and Analysis of Systems. Ed. by Cormac Flanagan and Barbara König. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 542–544.

[38] Luděk Skočovský. Principy a problémy operačního systému UNIX. 2nd ed. Brno, Czechia: Luděk Skočovský, 2008. isbn: 80-902612-5-6.

[39] Richard M. Stallman, Roland McGrath, and Paul D. Smith. GNU Make. url: https://www.cl.cam.ac.uk/teaching/0910/UnixTools/make.pdf.

[40] The KLEE Team. Tutorial on How to Use KLEE to Test GNU Coreutils. 2008. url: https://klee.github.io/tutorials/testing-coreutils/ (visited on 11/02/2019).

[41] Sarah Thompson, Guillaume Brat, and Karl Schimpf. The MCP Model Checker. 2007.

[42] Chia-Che Tsai, Bhushan Jain, Nafees Ahmed Abdul, and Donald E. Porter. “A Study of Modern Linux API Usage and Compatibility: What to Support when You’re Supporting”. In: Proceedings of the Eleventh European Conference on Computer Systems. EuroSys ’16. ACM, 2016, 16:1–16:16. isbn: 978-1-4503-4240-7.