
Faculty of Informatics, Masaryk University

Undefined Behaviour in the C Language

Bachelor's Thesis

Tobiáš Kamenický

Brno, May 2015

Declaration

I hereby declare that this thesis is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during the elaboration of this work are properly cited and listed with complete reference to the source.

Supervisor: RNDr. Adam Rambousek


Acknowledgements

I am very grateful to my supervisor Miroslav Franc for his guidance, invaluable help and feedback throughout the work on this thesis.


Summary

This bachelor’s thesis deals with the concept of undefined behavior and its aspects. It explains some specific undefined behaviors extracted from the C standard and provides each with a detailed description from the view of a programmer and a tester. It summarizes the possibilities to prevent and to test these undefined behaviors. To achieve that, some compilers and analysis tools are introduced and further described. The thesis contains a set of example programs to ease the understanding of the discussed undefined behaviors.

Keywords undefined behavior, C, testing, detection, secure coding, analysis tools, standard,


Table of Contents

Declaration
Acknowledgements
Summary
Keywords
1 Introduction
2 Compilers and Analysis Tools
  2.1 GCC
  2.2 Clang
  2.3 Splint
  2.4 Cppcheck
  2.5 Valgrind
3 Undefined Behaviors
  3.1 Demotion of one real floating type to another produces a value outside the range that can be represented (6.3.1.5).
    3.1.1 Programmer View
    3.1.2 Tester View
  3.2 An lvalue designating an object of automatic storage duration that could have been declared with the register storage class is used in a context that requires the value of the designated object, but the object is uninitialized. (6.3.2.1).
    3.2.1 Programmer View
    3.2.2 Tester View
  3.3 The program attempts to modify a string literal (6.4.5).
    3.3.1 Programmer View
    3.3.2 Tester View
  3.4 Two pointer types that are required to be compatible are not identically qualified, or are not pointers to compatible types (6.7.6.1).
    3.4.1 Programmer View
    3.4.2 Tester View
  3.5 The } that terminates a function is reached, and the value of the function call is used by the caller (6.9.1).
    3.5.1 Programmer View
    3.5.2 Tester View
  3.6 The signal function is used in a multi-threaded program (7.14.1.1).
    3.6.1 Programmer View
    3.6.2 Tester View
  3.7 The string pointed to by the mode argument in a call to the fopen function does not exactly match one of the specified character sequences (7.21.5.3).
    3.7.1 Programmer View
    3.7.2 Tester View
  3.8 There are insufficient arguments for the format in a call to one of the formatted input/output functions, or an argument does not have an appropriate type (7.21.6.1, 7.21.6.2, 7.29.2.1, 7.29.2.2).
    3.8.1 Programmer View
    3.8.2 Tester View
  3.9 The pointer argument to the free or realloc function does not match a pointer earlier returned by a memory management function, or the space has been deallocated by a call to free or realloc (7.22.3.3, 7.22.3.5).
    3.9.1 Programmer View
    3.9.2 Tester View
  3.10 The value of the result of an integer arithmetic or conversion function cannot be represented (7.8.2.1, 7.8.2.2, 7.8.2.3, 7.8.2.4, 7.22.6.1, 7.22.6.2, 7.22.1).
    3.10.1 Programmer View
    3.10.2 Tester View
4 Conclusion and Future Work
Reference List
Attachments

1 Introduction

The C programming language is not strictly specified and leaves considerable leeway to implementations. Rather than imposing strict and complex rules, C is designed for performance, simplicity, and portability. C programs can be compiled for computers with different processors or operating systems with only slight changes. It is the programmer’s responsibility to write portable code and to avoid constructs specific to one architecture (e.g. assumptions about the size of data types). The programmer also has full control over memory management and therefore has to allocate and free memory explicitly, since there is no garbage collector to take care of it. All this low-level programming has consequences: it is much easier, even for a very experienced programmer, to make a mistake.

With certain mistakes, a program might not compile, might produce wrong results, might crash, or might even behave exactly as the programmer expects. In fact, anything could happen depending on circumstances such as the contents of registers. Such unpredictable behavior is referred to as “undefined behavior” [1, 2]. The C standard [3, Sec. J.2] contains an annex listing some cases of undefined behavior, but it is a summary rather than a full description.

Further undefined behaviors not included in the annex are mentioned in the text of the C standard. It is generally assumed that situations leading to undefined behavior will not occur, and it is up to the programmer to prevent them. Some compilers and analysis tools are able to recognize a few undefined behaviors in the code. This does not mean, however, that programmers need not be familiar with them or need not try to reduce the risk of their occurrence.


2 Compilers and Analysis Tools

In this thesis, the implementation of the C standard was tested on two compilers: the GNU Compiler Collection (GCC) and the C Language Family Frontend for LLVM (Clang). The LLVM Project (formerly Low Level Virtual Machine) is a collection of modular and reusable compiler and toolchain technologies [4].

When testing and inspecting how to discover the selected undefined behaviors in the programs, three testing tools were mostly used: Secure Programming Lint (Splint), Cppcheck, and Valgrind.

Splint and Cppcheck are static analyzers, which means that they search code for specific constructs or more general patterns without ever executing it. In contrast, Valgrind is a dynamic analyzer; it performs its analysis by executing the program. The disadvantage is that many inputs are usually required for the analysis to be thorough. On the other hand, the running program can be analyzed in more detail, e.g. with respect to memory management.

2.1 GCC

The GNU Compiler Collection includes front ends for C, C++, etc. and the libraries for these languages [5]. It is free software developed by the GNU Project [6] and distributed by the Free Software Foundation [7].

GCC is available across many operating systems and is the default compiler for many of them. The version used for testing in this thesis was 4.9.2, which contains AddressSanitizer and UndefinedBehaviorSanitizer, two tools that should detect some undefined behaviors at runtime [5]. Both are used further in the thesis.


2.2 Clang

The C Language Family Frontend for LLVM is a compiler front-end for C, C++, Objective-C, and Objective-C++ with LLVM as a back-end [8]. It is an open-source project developed by Apple. It contains the Clang front-end, the Clang static analyzer, and some code analysis tools [9]. Clang is designed to be highly compatible with GCC, as it supports almost the same flags and options.

An advantage of GCC is that it supports not only languages from the C family but many others like Java, Ada, Go, etc. It further supports some extensions which are not implemented in Clang, and it supports more targets than LLVM.

Clang, in turn, has some advantages over GCC thanks to its specific focus on the C family. The official Clang website claims that “Clang is much faster and uses far less memory than GCC” [10] and offers some features resulting from the use of LLVM as a back-end. The version used for testing in this thesis was 3.5.

2.3 Splint

Splint is a modern version of lint, a program that detected suspicious constructs or bugs in C source code.

The official definition states: “Splint is a tool for statically checking C programs for security vulnerabilities and coding mistakes. With minimal effort, Splint can be used as a better lint. If additional effort is invested adding annotations to programs, Splint can perform stronger checking than can be done by any standard lint.” [11] It is free software but has not been actively developed since 2010 so many new features are not supported, not even the standard.

Despite its obsolescence, this tool may still be useful when detecting some code issues.


2.4 Cppcheck

“Cppcheck is a static analysis tool for C/C++ code. Unlike C/C++ compilers and many other analysis tools, it does not detect syntax errors in the code. Cppcheck primarily detects the types of bugs that the compilers normally do not detect. The goal is to detect only real errors in the code (i.e. have zero false positives)”, claims the official website of the tool [12].

Cppcheck is integrated in many developer tools or available as a plugin.

Some examples of detectable issues are out-of-bounds reads and writes of arrays, memory leaks, and possible null pointer dereferences [12]. It is free software under the GNU General Public License. The great advantage of this tool is that it is still under intensive development, with releases every few months.

2.5 Valgrind

“Valgrind is an instrumentation framework for building dynamic analysis tools. The Valgrind distribution currently includes six production-quality tools: a memory error detector, two thread error detectors, a cache and branch-prediction profiler, a call-graph generating cache and branch-prediction profiler, and a heap profiler. It also includes three experimental tools.” [13]

This tool is very complex and some small parts of it are described in detail further in the thesis.


3 Undefined Behaviors

The goal of this thesis was to analyze and describe a larger set of undefined behaviors, because usable summaries of the possible problems are rare. The thesis describes 10 specific instances of undefined behavior. They were chosen from the list of undefined behaviors in the C standard [3, Sec. J.2]. A thesis on the same topic has already been written, so different undefined behaviors were selected here to broaden the available information. Further selection criteria were the frequency with which these behaviors occur in everyday programming, the associated risks, and the available solutions.

The title of every subchapter describing an undefined behavior is taken literally from the mentioned attachment of the C standard and is followed by one or more numbers referencing the related section of the C standard.

In the thesis, each undefined behavior is first described in brief. This summary states when the behavior might occur and puts forward an example to clarify the related problem.

The next part is the “Programmer View” that includes the analysis of constructs in the code causing the undefined behavior. This is followed by some tips and best practices on what to use instead of these constructs or at least how to check that something is amiss and the undefined behavior is liable to occur.

The second part of the detailed description of each undefined behavior is the “Tester View”. This part focuses on finding at least one tool or compiler option able to detect the given undefined behavior. When there was no tool that would test for the possible appearance of the undefined behavior, I wrote some suggestions on how to keep control of the program’s behavior by, for example, providing a new implementation of the library functions.

There is a very thin line between the tester view and the programmer view, so in some cases the text in one part is closely related to the other.


3.1 Demotion of one real floating type to another produces a value outside the range that can be represented (6.3.1.5).

This undefined behavior may occur when a program uses floating types of different sizes, such as float and double or double and long double, and converts between them. If the value can be represented exactly in the target type, the converted value is unchanged. If the value cannot be represented exactly but is within the range of representable values, the result is either the nearest higher or the nearest lower representable value, chosen in an implementation-defined manner. But if the value is outside the representable range, the behavior is undefined. [3, Sec. 6.3.1.5] This rule does not apply to demotions on implementations that support signed infinities, such as those conforming to IEC 60559, because all values are then within the range. This functionality is described in Annex F of the C standard. Support for IEC 60559 cannot be taken for granted, however: only implementations that define the macro __STDC_IEC_559__ are required to conform to the specifications in Annex F. [3, Sec. F.1] See Figure 1 for an example of a demotion that produces a value outside the range that can be represented.

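#include <float.h>
#include <stdio.h>

int main(void)
{
    /* FLT_MAX + 1 would round back to FLT_MAX, so the value is doubled instead */
    double d = (double)FLT_MAX * 2.0;
    float f = (float)d;
    printf("%f\n", d); //twice FLT_MAX, in range for double on typical implementations
    printf("%f\n", f); //undefined or infinity with IEC 60559
    return 0;
}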

Figure 1 – Value outside the range of the type


3.1.1 Programmer View

The first piece of advice is to use double instead of float, because it usually has a larger range and is more commonly used; float is generally used only in embedded programs with restricted memory. It is also beneficial if programmers are familiar with the environment they are working in, the target architecture, and the compilers that will be used for their program. This way they can check the __STDC_IEC_559__ macro and determine whether the target implementation of C supports IEC 60559. Relying on this is reasonable only for programs with a very short expected lifespan, as compilers and architectures evolve quickly and future compatibility and functionality cannot be guaranteed.

Programs should not be so strictly focused on a single implementation and should be more portable. It is even better to avoid any problems with demotions in advance; think about the possible range of values in the final program and then choose the type that is big enough for the used variables.

On the other hand, types should not be excessive, for example there is no need to represent percentages using long double.
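The advice above can be sketched in code as a range check performed before the demotion. The helper names double_to_float_checked and has_iec_559 are hypothetical and exist only for this illustration; the sketch conservatively rejects every value beyond FLT_MAX, even those that an implementation might round back into range.

```c
#include <float.h>
#include <stdbool.h>

/* Hypothetical helper: converts d to float only when d lies within the
 * finite range of float, so the demotion itself is never undefined. */
bool double_to_float_checked(double d, float *out)
{
    /* NaN compares false with everything, so NaN is rejected here as well. */
    if (!(d >= -FLT_MAX && d <= FLT_MAX)) {
        return false;
    }
    *out = (float)d;   /* in range: at worst rounded to a neighbouring float */
    return true;
}

/* Hypothetical helper: reports whether the implementation claims
 * conformance to Annex F (IEC 60559). */
bool has_iec_559(void)
{
#ifdef __STDC_IEC_559__
    return true;
#else
    return false;
#endif
}
```

A caller would test the return value and handle the out-of-range case explicitly, instead of relying on the result of the cast.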

3.1.2 Tester View

Out-of-range conversions are generally very hard to discover because a value exceeding the range of the target type occurs with a very small probability. To prevent this from happening once the program has been deployed, the tester should always check that the program works correctly with extreme values before the software is released. GCC as well as Clang provides the -Wconversion option, which warns about implicit demotions at compile time [14, Sec. 3.8]. Another way to detect this problem comes with paid software such as Coverity [15], a static analysis tool that should be able to find not only problems leading to this undefined behavior but also other mistakes and bugs. Another option is the less well-known static analysis tool Fortify SCA [16], which should also help to discover vulnerabilities in the program.


3.2 An lvalue designating an object of automatic storage duration that could have been declared with the register storage class is used in a context that requires the value of the designated object, but the object is uninitialized. (6.3.2.1).

This is probably the most common problem among less experienced programmers. Although the stated definition might sound slightly confusing, what lies beneath is much simpler. Objects of automatic storage duration are all local variables. Frequently used variables can also be declared with the register keyword, which serves as a recommendation to the compiler to store them in CPU registers. Since it is only a recommendation, the compiler decides by itself whether it is wise to place them into registers.

When such a variable is declared, space is reserved for it on the stack. The reserved memory lies inside the region assigned to the program by the operating system. This region may have been used by previously called functions, so it contains arbitrary leftover data. If the new variable is not initialized, it takes over this data as its own value. Because anything could be in the accessed memory, the behavior is undefined when the variable is used before a value has been assigned to it. See Figure 2 for an example of an uninitialized variable that is subsequently used.

#include <stdio.h>

int main()
{
    register int a;
    printf("a = %d\n", a); //undefined - random number, program crash
    return 0;
}

Figure 2 – Use of uninitialized variable


3.2.1 Programmer View

In this case, a very simple solution is available: initialization. The general recommendation is to always initialize new variables. The C language, unlike languages such as C# and Java, does not take care of resetting the memory of local variables (overwriting it with zeroes), so it is best practice to do so explicitly.

This not only avoids this particular undefined behavior but might also help with other problems and bugs. Another issue leading to this undefined behavior is assigning the first value to a variable only inside a conditional branch. Some paths through the program may then use the variable while it is still uninitialized. Therefore, it is best to avoid first-time assignments inside conditional branches where this is possible.
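A minimal sketch of this recommendation follows. The function classify_temperature and its thresholds are hypothetical; the point is only that the variable receives a value at its declaration, so no path through the branches can read an indeterminate value.

```c
/* Hypothetical example: classify a temperature reading.
 * `level` is initialized at its declaration, so every path through the
 * function reads a defined value. */
int classify_temperature(int celsius)
{
    int level = 0;            /* default: normal */

    if (celsius > 80) {
        level = 2;            /* critical */
    } else if (celsius > 60) {
        level = 1;            /* warning */
    }
    /* Without the initializer, any reading of 60 or below would reach
     * this return with `level` still uninitialized. */
    return level;
}
```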

3.2.2 Tester View

The tester has many options for detecting this undefined behavior, as it is caused by uninitialized variables. In GCC, the tester might compile the code with the -Wuninitialized option, which is designed exactly for discovering this problem [14, Sec. 3.8]. However, warnings do not occur if the variable is an array, even when it is in the registers. Clang contains the same option with similar functionality. Tools for static analysis are a solution as well. One of them is Cppcheck, which checks individual source files for errors. Another usable tool is Splint. Unfortunately, none of these tools is able to discover uninitialized arrays. On the other hand, Valgrind, a runtime analysis program, and specifically its default tool Memcheck, which is a memory error detector, can be used to discover even uninitialized arrays and structures.

Memcheck keeps a shadow copy of the program memory in a bitmap that records which bits hold defined values. When the CPU loads any bits of the program's memory, Memcheck loads the corresponding shadow bits; when bits are stored, Memcheck stores their shadow bits as well. An error message is shown only when uninitialized bits are actually used, not when they are merely copied. In this way Memcheck can reveal all uses of uninitialized bits. When reporting problems in the code, Memcheck tries to report every bit only once to avoid showing hundreds of consequent messages. [17, Sec. 4.5]
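The shadow-memory idea can be illustrated with a toy model. This is not how Memcheck is implemented: the sketch tracks validity per byte of a small hypothetical arena and reports at load time, whereas Memcheck tracks validity per bit, instruments the running machine code, and defers its report until the value is actually used. All names below are assumptions made for the illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define ARENA_SIZE 16

/* Toy model only: one validity flag per byte of a tiny arena. */
struct shadow_arena {
    unsigned char data[ARENA_SIZE];
    bool          valid[ARENA_SIZE];   /* true once the byte has been written */
};

/* Marks the whole arena as holding no defined values. */
void arena_reset(struct shadow_arena *a)
{
    memset(a->data, 0, sizeof a->data);
    memset(a->valid, 0, sizeof a->valid);
}

/* A store marks the byte as defined. Returns false if i is out of bounds. */
bool arena_store(struct shadow_arena *a, size_t i, unsigned char v)
{
    if (i >= ARENA_SIZE) return false;
    a->data[i] = v;
    a->valid[i] = true;
    return true;
}

/* A load succeeds only for in-bounds bytes that were stored earlier.
 * A real tool would print a diagnostic instead of returning false. */
bool arena_load(const struct shadow_arena *a, size_t i, unsigned char *out)
{
    if (i >= ARENA_SIZE || !a->valid[i]) return false;
    *out = a->data[i];
    return true;
}
```

In this model, loading a byte that was never stored is detected even though the underlying data array always contains some value, which mirrors why a dynamic tool can find uses of uninitialized arrays that the static tools miss.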

When repairing bugs or trying to prevent them, these tools or compiler options should be used in the first place because they ought to be able to discover all instances of the mentioned undefined behavior and its origins.


3.3 The program attempts to modify a string literal (6.4.5).

A string literal is a sequence of zero or more characters enclosed in double quotes, as in “xyz”, according to the C standard [3, Sec. 6.4.5, para. 3]. When the program starts, an array of static storage duration with sufficient size is created for each literal, typically in read-only memory. Attempting to modify any part of this array is undefined behavior; when the array resides in read-only memory, the write typically causes an access violation (see Figure 3 for an example).

#include <stdio.h>

int main(void)
{
    char *sl = "literal";
    sl[3] = 'E'; //undefined behavior - program usually crashes
    printf("%s\n", sl);
    return 0;
}

Figure 3 – Attempt to modify a string literal

3.3.1 Programmer View

When string literals are used in code, the variables referring to them should be declared as arrays of const char (or const wchar_t for wide characters) or as pointers to these types. This helps the compiler recognize the part of memory that should never be changed and therefore enables it to display the appropriate warnings. [18]

When a string literal is created, the initial value as well as the size of the array is assigned, so there is no need to specify the length of an array.

The programmer also has to be careful when passing string literals to functions such as strpbrk(), strchr(), and strstr() [19, Sec. 3]. When the first argument of these functions is a string literal, the returned pointer points into that literal and must be treated as read-only. Similarly, casting away the const qualifier from a pointer to a string literal may lead to this undefined behavior.
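The following sketch illustrates these recommendations under stated assumptions: the names greeting, capitalized_greeting, and find_in_greeting are hypothetical. The literal is only ever reached through const-qualified pointers, and any modification happens on a copy in caller-owned storage.

```c
#include <string.h>

/* Read-only view of a literal: the const qualifier lets the compiler
 * reject writes such as greeting[0] = 'H'. */
static const char *const greeting = "hello";

/* Hypothetical helper: copies the literal into caller-owned storage and
 * modifies the copy. Returns 0 on success, -1 if the buffer is too small. */
int capitalized_greeting(char *buf, size_t size)
{
    size_t len = strlen(greeting);
    if (size < len + 1) return -1;
    memcpy(buf, greeting, len + 1);   /* includes the terminating '\0' */
    buf[0] = 'H';                     /* fine: buf is a modifiable array */
    return 0;
}

/* strchr returns char * even when it is given a pointer into a literal;
 * storing the result in a const char * keeps the read-only contract. */
const char *find_in_greeting(int c)
{
    return strchr(greeting, c);
}
```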


3.3.2 Tester View

For this kind of problem, the compilers offer an interesting solution. According to the list of GCC options, the -Wwrite-strings option can be used when compiling. It gives string constants the type const char[length]. This helps the programmer find the section of code that is trying to write into a string constant, but careful use of const qualifiers in declarations and prototypes is still required for the program to work properly [14, Sec. 3.8].

Clang provides the same option, which works very similarly. Another way to discover a probable appearance of this undefined behavior in the program is to use Splint. When it discovers a modification of a string literal, it warns with the message “Suspect modification of observer” and highlights the corresponding line of code. It is better to use both of the above-mentioned techniques for detecting this undefined behavior, since neither of them is guaranteed to find all mistakes.


3.4 Two pointer types that are required to be compatible are not identically qualified, or are not pointers to compatible types (6.7.6.1).

For two pointer types to be identically qualified, they need to have the same qualifier list, for example *const type_1 and *const type_2. Other possible qualifiers in declarations are volatile, restrict, and _Atomic. For two pointer types to be compatible, they must also point to compatible types, for example volatile int in both cases. A common mistake concerns the declarations of a constant pointer and a pointer to a constant type. Figure 4 shows the difference between the declarations of these two types.

const int *ptr_to_constant;
int *const constant_ptr;

Figure 4 – Declaration of pointer to constant type and constant pointer

The pointer ptr_to_constant points to a constant object that shall not be modified, but the pointer itself might be redirected to another object. On the other hand, constant_ptr will always point to the same location, but the object at this location might be modified arbitrarily. [3, Sec. 6.7.6.1]
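The difference can be made concrete with a short sketch. The function qualifier_demo and its local variables are hypothetical; the commented-out lines are the ones a conforming compiler is expected to reject.

```c
/* Demonstrates which operations each declaration permits.
 * Returns the sum of both objects after the permitted modifications. */
int qualifier_demo(void)
{
    int first = 1;
    int second = 2;

    const int *ptr_to_constant = &first;
    int *const constant_ptr = &second;

    ptr_to_constant = &second;      /* allowed: the pointer itself is not const */
    /* *ptr_to_constant = 5; */     /* rejected: the pointee is const through this pointer */

    *constant_ptr = 10;             /* allowed: the pointee is modifiable */
    /* constant_ptr = &first; */    /* rejected: the pointer is const */

    return first + *ptr_to_constant;   /* first is 1, second is now 10 */
}
```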

3.4.1 Programmer View

A very frequent problem is confusion over the number of asterisks or ampersands when referencing or dereferencing a pointer. This usually happens when the object has multiple dimensions and the programmer is trying to reach specific coordinates. If such a complex structure is unavoidable, it might be better to define a dedicated type with typedef. One of the above-mentioned pointers (constant pointer to integer) could be defined as shown in Figure 5.

int num = 1234;
typedef int* int_ptr;               //pointer to integer
const int_ptr constant_ptr = &num;  //constant pointer to integer

Figure 5 – Defining type of constant pointer to integer


All objects in the program should be designed as simply as possible and named according to their purpose, so there would not be a problem with interchanging two pointers. Using brackets to increase readability when dereferencing pointers might also be helpful. See Figure 6 for an example of confusing and difficult-to-read code.

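#include <string.h>

typedef struct { char *n; } TRecord; //assumed definition, not shown in the original figure

int name(const void*a, const void *b)
{
    TRecord *pA = *(TRecord**)a;
    TRecord *pB = *(TRecord**)b;
    return strcmp(pA->n, pB->n);
}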

Figure 6 – Code with random and vacuous names

When two pointers need to point to the same object, their declarations should be exactly the same. Furthermore, the programmer should not rely on implicit conversions; the target type should be written explicitly. This not only prevents compiler warnings but also increases the readability of the source code.
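As a contrast to Figure 6, the same kind of comparison function might be written as in the following sketch. The Record type, its name field, and the function name compare_records_by_name are assumptions made for the illustration; the function is meant to be passed to qsort for an array of Record pointers.

```c
#include <stdlib.h>   /* qsort, used by callers of the comparator */
#include <string.h>

/* Assumed record layout; only the name field matters for the comparison. */
typedef struct {
    const char *name;
} Record;

/* Comparator for an array of `Record *`, ordered by name.
 * Each argument points to one array element, that is, to a `Record *`. */
int compare_records_by_name(const void *lhs, const void *rhs)
{
    /* The conversion from const void * keeps the const qualifier:
     * the elements are read, never modified. */
    Record *const *left  = lhs;
    Record *const *right = rhs;
    return strcmp((*left)->name, (*right)->name);
}
```

Descriptive names and declarations that state the intended constness make it harder to mix up the levels of indirection than in the original figure.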

3.4.2 Tester View

This problem is easy to discover since both compilers, GCC and Clang, have appropriate tools. GCC warns about incompatible pointer types or implicit casting by default. Clang provides similar functionality under the -Wincompatible-pointer-types option. As for static analysis, Splint warns about sources of this undefined behavior as well. The problem can also be discovered simply by printing the object during debugging, as the output usually shows data that does not make any sense.


3.5 The } that terminates a function is reached, and the value of the function call is used by the caller (6.9.1).

When the end of a non-void function is reached without encountering a return statement, the function returns an unpredictable value, and using that value in the caller is undefined behavior. In fact, this value is not entirely random. It depends on the architecture for which the program is compiled and on the Application Binary Interface (ABI), which defines many details such as the sizes of data types, the calling convention, and the way an application should make system calls to the OS. The calling convention controls the way function arguments are passed and return values retrieved; for example, whether parameters are passed on the stack or some are passed in registers, which registers are used, and whether the first function parameter passed on the stack is pushed first or last [20]. The return value further depends on the data type of the variable and the number of registers in the CPU. All this means that a correct value may sometimes be returned, which makes the bug more difficult to discover. Figure 7 shows an example of an omitted return statement.

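#include <stdio.h>

int max(int a, int b)
{
    int m = a > b ? a : b;
} //no return statement: m is never returned

int main()
{
    printf("Maximum: %d\n", max(7, 5)); //undefined: the missing return value is used
}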

Figure 7 – No return statement in the function

The only exception according to the C standard is the main() function. It reads: "Reaching the } that terminates the main function returns a value of 0." [3, Sec. 5.1.2.2.3, para. 1] As a result, it is possible to omit the return statement in the main() function. This behavior was specified in C99, so programmers should be careful with earlier implementations. Caution is needed, for example, when using GCC, which still uses an older version of the C standard by default. The -std= option with at least c99 should be used to avoid problems associated with the earlier versions.

3.5.1 Programmer View

To avoid this problem, smaller and less complex functions should be written. It is much easier to accidentally omit the return statement if many layers of conditions are implemented in the function. When creating a new function, a return statement with an error value can be written first, right before the closing curly bracket of the function. This makes it possible to notice when execution falls through to the end although a value from an earlier return statement was expected. Depending on the type of the function, it can be useful to check whether the return value of a function call exceeds the allowed range of values. This should help discover the most implausible values at runtime by alerting the programmer to possible mistakes in the program.
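The fallback return described above might look like the following sketch. The function grade_for_score, its thresholds, and the GRADE_ERROR sentinel are hypothetical; what matters is that the final return statement gives every fall-through path a defined, recognizable value.

```c
/* Hypothetical sentinel for "no valid result". */
#define GRADE_ERROR (-1)

/* Maps a score in 0..100 to a grade from 1 (best) to 4 (fail). */
int grade_for_score(int score)
{
    if (score >= 0 && score <= 100) {
        if (score >= 90) return 1;
        if (score >= 75) return 2;
        if (score >= 50) return 3;
        return 4;
    }
    /* Written first, before the branches above were added: any path that
     * falls through ends here with a defined value. */
    return GRADE_ERROR;
}
```

A caller can then compare the result against GRADE_ERROR, which corresponds to the range check on return values suggested above.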

3.5.2 Tester View

All non-void functions have to end with a return statement. Violation of this rule might lead to serious problems. Wrong values might cause a crash of the program with a segmentation fault (when the function evaluates array boundaries) and other unexpected errors. In GCC, the -Wreturn-type option can be turned on to discover even the most hidden program flow that might avoid the return statement [14, Sec. 3.8]. Moreover, Clang has this option enabled by default and should provide a sufficient check. If static analysis is preferred in testing, Splint detects the sources of this undefined behavior as well.


3.6 The signal function is used in a multi-threaded program (7.14.1.1).

The signal function is meant to provide a way to handle signals in C programs. Due to many portability issues, the use of this function is discouraged and sigaction should be used instead. The signal function cannot be used in multithreaded programs, as the signals are not specifically thread-dependent. This means that in a multithreaded program, several signals can act as one, or it may be uncertain which handler to use. [19, Sec. 2-signal]

The signal function should not be used on Windows to handle SIGINT for this signal is not supported by Win32 applications. When it occurs, a new thread is generated to handle this interruption. This means that the application becomes multi-threaded and consequently can cause unexpected behavior [21]. Due to these and many other problems, the behavior of this function in multithreaded programs is undefined.

According to the C standard [3, Sec. 7.14.1.1, para. 7], “The implementation shall behave as if no library function calls the signal function.” This requirement, however, is not always met. See Figure 8 for an example of the signal function that causes undefined behavior.

3.6.1 Programmer View

The advice in this case is obvious: do not use the signal function! That applies not only to multithreaded programs but to other programs as well. POSIX.1 solved the issues of the signal function by specifying

int sigaction(int signum, const struct sigaction *act, struct sigaction *oldact);

which is also declared in the signal.h header. This function is safer to use and should be free of the bugs mentioned in the context of the signal function. On the other hand, it is much more complex and difficult to understand properly. The sigaction function can process all signals except for SIGSTOP and SIGKILL. The signal handler is defined in struct sigaction, which is a parameter of this function. In fact, this function has two parameters of this kind, and by passing NULL or non-NULL values, the function may be used to save the default handler or to query the current handler. The handler structure contains two function pointers for handling, but on some platforms they are stored in a union, so only one of them should be assigned.

Next, various flags as well as the mask attribute can be set so that different threads invoking a signal may be distinguished. All details about this function can be found on the man page [19, Sec. 2-sigaction]. The sigaction function is specified by POSIX and is not available on Windows, where other ways have to be used to handle signals in multithreaded programs.
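A minimal sketch of installing a handler with sigaction follows, assuming a POSIX system. The names on_signal, install_handler, and signal_seen are hypothetical, and the feature-test macro is needed only when compiling in a strict ISO C mode.

```c
#define _POSIX_C_SOURCE 200809L   /* exposes sigaction under strict -std= modes */
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t got_signal = 0;

static void on_signal(int signum)
{
    (void)signum;
    got_signal = 1;   /* only async-signal-safe work belongs in a handler */
}

/* Installs on_signal for signum and stores the previous action in *old
 * (old may be NULL). Returns 0 on success and -1 on error, like sigaction. */
int install_handler(int signum, struct sigaction *old)
{
    struct sigaction act;
    memset(&act, 0, sizeof act);
    act.sa_handler = on_signal;   /* sa_sigaction stays unset: the two may share storage */
    sigemptyset(&act.sa_mask);    /* block no additional signals while handling */
    act.sa_flags = 0;
    return sigaction(signum, &act, old);
}

/* Reports whether the handler has run at least once. */
int signal_seen(void)
{
    return got_signal != 0;
}
```

Passing a non-NULL old pointer saves the previous action so that it can be restored later with another call to sigaction.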

#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

volatile sig_atomic_t flag = 0;

void handler(int signum)
{
    flag = 1;
}

/* Runs until user sends SIGINT */
void *func(void *data)
{
    while (!flag) {
        printf("1 ");
    }
    return NULL;
}

int main(void)
{
    if (signal(SIGINT, handler) == SIG_ERR) { // Undefined!
        puts("Signal error");
        exit(1);
    }
    pthread_t tid;
    if (0 != pthread_create(&tid, NULL, func, NULL)) {
        printf("Error!");
        exit(1);
    }
    raise(SIGINT);
    pthread_join(tid, NULL);
}

Figure 8 – Multithreaded program calling signal function


3.6.2 Tester View

Detecting the signal function in multithreaded programs is easy for a tester if the code can be accessed and searched for this function. Automated detection is more problematic because the usual tools do not contain any suitable techniques. There is an easy but less well-known way, provided by both GCC and Clang, to forbid the use of the signal function. It is done with the pragma statement #pragma GCC poison signal. This causes the compiler to “poison” the signal function, and an error message appears during compilation if it is used somewhere in the code. [22, Sec. 7] Another solution comes with the paid software CodeSonar [23], a static analysis tool that should be able to discover faulty usages of signal.
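A small sketch of the pragma in use is shown below. The names poison_demo and poisoned_demo_flag are hypothetical. Headers that declare signal have to be included before the pragma, because any later appearance of the identifier is an error.

```c
#include <signal.h>   /* must precede the pragma: this header declares signal */

/* GCC and Clang: every later use of the identifier is a compile-time error. */
#pragma GCC poison signal

/* Other names from <signal.h>, such as sig_atomic_t, remain usable. */
static volatile sig_atomic_t poisoned_demo_flag = 0;

int poison_demo(void)
{
    poisoned_demo_flag = 1;
    /* Calling the poisoned function at this point would stop the build
     * with an error about the use of a poisoned identifier. */
    return poisoned_demo_flag;
}
```

Placing the pragma in a project-wide header that is included after the system headers would apply the ban to every source file.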


3.7 The string pointed to by the mode argument in a call to the fopen function does not exactly match one of the specified character sequences (7.21.5.3). The function FILE * fopen (const char * filename, const char * mode); declared in the standard input/output library serves for opening a file specified by a filename and associating a stream with it. The file is opened in a mode that is indicated by another parameter. Possible mode strings consist of the following letters/strings:

• r    read
• r+   read/write
• w    truncate or create and open for writing
• w+   truncate or create and open for read/write
• a    append (write at the end of the file)
• a+   append/read, the file is created if it does not exist

The mode string can also contain the letter ‘b’, but not as its first character. This letter specifies a binary file. All systems conforming to POSIX ignore it since they treat binary and text files in the same way. Other systems, such as Windows, treat those files differently, so ‘b’ is significant on them and should be used for the sake of portability. The GNU C Library (glibc) [24] allows further characters to follow the basic ones in the mode string: ‘c’, ‘e’, ‘m’, and ‘x’. Their purpose is more complicated and is described on the man page. The problem with this extension is that glibc examines only the first 7 characters of the mode string, so longer strings cannot be used. Moreover, versions of glibc prior to 2.14 examined only 6 characters, so a possible string such as “rb+cmxe” could not be parsed. [19, Sec. 3-fopen]


When a letter other than those mentioned above appears in the mode string, or when the letters appear in the wrong order, the function has no specified way of treating the file and the behavior is undefined (see Figure 9 for an example).

/* Even the wrong order causes undefined behavior */
FILE *file = fopen("TestFile", "br+");
if (file == NULL) {
    /* Should get here but not guaranteed */
    fprintf(stderr, "Failed to open the file\n");
    return 1;
}
/* DO SOMETHING WITH THE FILE */
fclose(file);

Figure 9 – Incorrect mode argument to fopen

3.7.1 Programmer View

The fopen function returns a file pointer, provided that the file was opened successfully. If an error occurs while the stream is being created, NULL is returned and errno is set. The return value therefore can and should be compared with NULL so that a failure is handled safely. Since the behavior is undefined, it is not guaranteed that the program will not crash inside fopen, but in most cases checking the return value should reveal an invalid mode string. The return value should always be checked anyway, because many other things can go wrong when handling files.

3.7.2 Tester View

Probably no common tool or compiler option can detect this undefined behavior. A tester has few possibilities to detect this kind of problem, so the programmer has to guard against an invalid mode string. Probably the best way would be to write a new implementation of fopen that checks the mode string and then calls the original function once the mode is known to be correct. Such an implementation can replace the original one by means of the environment variable LD_PRELOAD, which causes the specified library to be loaded before all others [19, Sec. 8-LD.SO]. Unfortunately, implementing such a function is rather complicated and is worthwhile only for a very complex program that works with files a lot. It follows that the best practice is to be really careful about the mode and always check for a NULL return value.
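The checking part of such a replacement can be sketched as follows. This is only a minimal illustration, not a full LD_PRELOAD interposer: the names is_standard_mode and checked_fopen are hypothetical, and the list of accepted strings follows the sequences of C11 7.21.5.3 (including the exclusive ‘x’ variants), so the glibc-only characters ‘c’, ‘e’, and ‘m’ are deliberately rejected.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if mode is one of the sequences listed in C11 7.21.5.3,
   0 otherwise. glibc extensions are rejected on purpose. */
int is_standard_mode(const char *mode)
{
    static const char *const valid[] = {
        "r", "w", "a", "rb", "wb", "ab",
        "r+", "w+", "a+", "r+b", "rb+", "w+b", "wb+", "a+b", "ab+",
        "wx", "wbx", "w+x", "w+bx", "wb+x"
    };
    if (mode == NULL) {
        return 0;
    }
    for (size_t i = 0; i < sizeof valid / sizeof valid[0]; ++i) {
        if (strcmp(mode, valid[i]) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Hypothetical wrapper: refuses to call fopen with a non-standard mode,
   so the undefined behavior is turned into an ordinary NULL result. */
FILE *checked_fopen(const char *filename, const char *mode)
{
    if (filename == NULL || !is_standard_mode(mode)) {
        return NULL;
    }
    return fopen(filename, mode);
}
```

With this wrapper, a call such as checked_fopen("TestFile", "br+") returns NULL without ever reaching fopen, so the usual NULL check catches the mistake reliably.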


3.8 There are insufficient arguments for the format in a call to one of the formatted input/output functions, or an argument does not have an appropriate type (7.21.6.1, 7.21.6.2, 7.29.2.1, 7.29.2.2).

This undefined behavior applies to the functions for reading formatted data, such as fscanf or fwscanf (for wide characters), and likewise to the functions for writing formatted data, such as fprintf and fwprintf. It also applies to the basic functions printf and scanf, which use the standard output and input streams.

The first part of the description of this undefined behavior is obvious: if one or more arguments are missing, the data written or read will most likely be garbage. The other part is trickier, for there is a large number of conversion specifiers and modifiers, and each argument must be of a compatible type. The main reason why this undefined behavior can occur is that all the above-mentioned functions take a variable number of arguments, so the function prototype does not allow the compiler to check the arguments. [25]

3.8.1 Programmer View

Since the input functions expect a pointer for every conversion specifier, a typical mistake leading to this undefined behavior is made while reading input: a variable of a basic type (an integer or a floating-point number) is passed as an argument instead of a pointer to it. Figure 10 shows the difference between the two calls. [19, Sec. 3-scanf]

int a = 0;
fscanf(stdin, "%d", a);  // not correct
fscanf(stdin, "%d", &a); // correct

Figure 10 – Comparison of correct and incorrect argument to fscanf


This is actually a very common mistake among less experienced programmers. The only advice for passing an argument of the appropriate type is to be very careful with pointers and length modifiers so that they correspond to the expected input. When printing data, the specifier should be chosen according to the type of the variable or argument so as to avoid implicit conversions. The return values of the input/output functions should also be checked to see whether the operation was successful and the function has written or read all the expected items. [19, Sec. 3-printf]

Insufficient arguments for a given format are usually the result of chaotic or hard-to-read code. This can be caused by not following the basic conventions of C programming. Another reason could be an attempt to read or write long formatted lines, for example in a program that reads long lines with many items of different types from a file at once, as shown in Figure 11. As the figure shows, it is very easy to get lost in the code and make a mistake in the call to fscanf. It follows that the programmer should avoid using these functions in such a complex way and replace them with other constructs, such as processing the file word by word, or at least split the input/output into several calls.


int main() {
    struct item {
        char code[8];
        int count;
        double price_per_item;
        short int VAT;
        double total;
    };
    struct item my_items[10];
    int counter = 0;
    FILE* iFile;
    iFile = fopen("myFile", "r");
    if (iFile == NULL) {
        printf("failed to open file\n");
        return 1;
    }
    while (!feof(iFile)) {
        if (5 != fscanf(iFile, "%7s %d %lf %hd %*c %lf", my_items[counter].code, &(my_items[counter].count), &(my_items[counter].price_per_item), &(my_items[counter].VAT), &(my_items[counter].total)) && !feof(iFile)) {
            printf("Read was not successful!\n");
            fclose(iFile);
            return 1;
        }
        ++counter;
    }
    fclose(iFile);
    for (int i = 0; i

Figure 11 - Chaotic use of input/output functions for formatted data
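One way of following the advice above is to read the file line by line and convert each line in a small, separate function. The sketch below is only one possible arrangement under stated assumptions: it reuses the record layout and format string of Figure 11, while the function names parse_item_line and read_items and the line buffer size of 256 are hypothetical choices made for this example.

```c
#include <stdio.h>

/* Same record layout as in Figure 11. */
struct item {
    char code[8];
    int count;
    double price_per_item;
    short int VAT;
    double total;
};

/* Converts one already-read line into *out.
   Returns 1 when all five fields were converted, 0 otherwise. */
int parse_item_line(const char *line, struct item *out)
{
    int converted = sscanf(line, "%7s %d %lf %hd %*c %lf",
                           out->code, &out->count, &out->price_per_item,
                           &out->VAT, &out->total);
    return converted == 5;
}

/* Reads at most capacity items from an open stream, one line at a time.
   Returns the number of items stored, or -1 on the first malformed line. */
int read_items(FILE *in, struct item *items, int capacity)
{
    char line[256]; /* assumed upper bound on the line length */
    int count = 0;
    while (count < capacity && fgets(line, sizeof line, in) != NULL) {
        if (!parse_item_line(line, &items[count])) {
            return -1;
        }
        ++count;
    }
    return count;
}
```

Keeping the format string and its five arguments together in one short function makes a mismatch between them easier to spot, and the capacity parameter prevents writing past the end of the array.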

3.8.2 Tester View

This undefined behavior is actually very easy to discover. Because mistakes leading to it are so frequent, both compilers, GCC and Clang, provide the -Wformat option, which discovers many kinds of mistakes in calls to the above-mentioned input/output functions. It reports missing arguments, incompatible types, and similar problems in a detailed message. [14, Sec. 3.8]

As for static analysis, Cppcheck and Splint provide functionality similar to that of the compilers.


3.9 The pointer argument to the free or realloc function does not match a pointer earlier returned by a memory management function, or the space has been deallocated by a call to free or realloc (7.22.3.3, 7.22.3.5).

This undefined behavior appears whenever free or realloc is passed an argument that is neither NULL nor a pointer to memory previously allocated by malloc, realloc, or another memory management function. In the case of free, the undefined behavior is caused by the attempt of this function to free whatever memory the argument points to. This could corrupt the heap or cause other serious problems. Also, when free is called, the memory is deallocated and made available for further allocation. Consequently, new data could be placed in the same area of memory. If free is called again with the same pointer, it could deallocate a part of memory where data crucial for the rest of the program is stored. The realloc function behaves similarly, for it tries to deallocate the memory pointed to by its argument and allocate another memory region of the new size. The undefined behavior also appears when the argument passed to realloc points to memory that was not dynamically allocated.

A common consequence is abnormal termination of the program, usually with SIGSEGV (segmentation fault). [19, Sec. 3-free]

Examples of incorrect memory management can be seen in Figure 12, Figure 13, and Figure 14.


char* original = (char*)malloc(size);
/* . . . */
free(original);
for (size_t i = 0; i

Figure 12 – Use of heap memory after free

char* original = (char*)malloc(size);
/* . . . */
char* changed = original + 10;
free(changed);
/* AddressSanitizer: attempting free on address which was not malloced
   Valgrind: Invalid free() */

Figure 13 – Attempt to free memory pointed to by a modified pointer

char* original = (char*)malloc(size);
/* . . . */
free(original);
/* . . . */
free(original);
/* AddressSanitizer: attempting double-free
   Valgrind: Invalid free() */

Figure 14 – Attempt to free already freed memory

3.9.1 Programmer View

There are some well-known tips that help the programmer prevent this undefined behavior. The most important one is to always assign NULL to the pointer after free has been called on it. Then it does not matter if the same pointer is later handed to a memory management function: free does nothing when given a NULL pointer, and realloc then behaves like malloc. The second piece of advice is never to change the original pointer returned by one of the allocation functions, whose return values should be checked for NULL right after the call.

Also, the size parameter should be stored in a variable of type size_t and always be checked to make sure that it is not zero. These are the main recommendations directly related to the prevention of this undefined behavior, but more can be said about memory management, for example to zero-initialize memory after malloc or realloc, or preferably to use the calloc function, which does this implicitly. [26]
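The recommendations above can be combined into a few small helpers. The following is a minimal sketch; the macro FREE_AND_NULL and the functions alloc_buffer and grow_buffer are hypothetical names introduced only for this illustration.

```c
#include <stddef.h>
#include <stdlib.h>

/* Frees the memory and clears the pointer, so that a repeated
   release becomes a harmless free(NULL). */
#define FREE_AND_NULL(p) \
    do {                 \
        free(p);         \
        (p) = NULL;      \
    } while (0)

/* Hypothetical helper: returns zero-initialized memory of the given
   size, or NULL when size is zero or the allocation fails. */
char *alloc_buffer(size_t size)
{
    if (size == 0) {
        return NULL;
    }
    return calloc(size, 1);
}

/* Hypothetical helper: grows *buf to new_size bytes. The result of
   realloc goes to a temporary first, so the original pointer stays
   valid (and can still be freed) when realloc fails.
   Returns 0 on success, -1 on failure. */
int grow_buffer(char **buf, size_t new_size)
{
    if (buf == NULL || new_size == 0) {
        return -1;
    }
    char *tmp = realloc(*buf, new_size);
    if (tmp == NULL) {
        return -1;
    }
    *buf = tmp;
    return 0;
}
```

A caller keeps a single pointer variable, never modifies it by pointer arithmetic, and releases it only through FREE_AND_NULL, so that an accidental second release has no effect.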

3.9.2 Tester View

Testing is very easy in this case with the help of the compilers, specifically GCC in version 4.8 or higher and Clang in version 3.1 or higher. They both contain AddressSanitizer (ASan), which should discover the majority of problems leading to this undefined behavior as well as other memory problems. ASan is a fast memory error detector that consists of a compiler instrumentation module and a run-time library. The bugs related to this undefined behavior that it should be able to detect are use-after-free, double-free, and invalid free. [27] To use ASan with both compilers, “simply compile and link the program with -fsanitize=address” [28, Sec. Usage].

The detection runs along with the program, so no messages should be expected during compilation. Even though ASan is a “fast” detector, the instrumented program typically runs only about half as fast as the original, so the -O1 option might be used to optimize the program and improve performance. It is also possible to compile with the -g option to trace the origin of potential errors in the code. ASan offers further options for detailed output, as well as macros to exclude certain parts of the code from error detection or to execute code depending on ASan’s presence. [28]

This is actually more useful for a programmer who is responsible for the compilation or has the possibility of compiling the code with specific options. But since it concerns testing, it is placed in the “Tester” section.

Another way to detect this undefined behavior is to use Valgrind, specifically its Memcheck tool already mentioned in section 3.2.2. This comes in handy with older versions of the compilers or when a more thorough detection is needed. Typical Valgrind messages related to this undefined behavior are “Invalid free()”, “Invalid read”, and “Invalid realloc()”. A very simple check can be performed with the command valgrind -v --leak-check=full ./compiled_program. The Memcheck tool does not have to be specified since it is the default tool of Valgrind. It is very useful to have debugging information about the source code, such as the number of the line where the problem appeared. Debugging information can be added by compiling the code with the -g option.


3.10 The value of the result of an integer arithmetic or conversion function cannot be represented (7.8.2.1, 7.8.2.2, 7.8.2.3, 7.8.2.4, 7.22.6.1, 7.22.6.2, 7.22.1).

This undefined behavior concerns every function that performs a conversion to an integer or integer arithmetic. It arises whenever the result of such a function cannot be represented.

The conversion functions convert a string or a wide-character string to an integer type. The arithmetic functions compute the absolute value or perform integer division with a remainder in a single call. The standard functions declared in the stdlib.h header include abs, div, atoi, and their versions for long int and long long int. [3, Sec. 7.22.6]

C99 added further types, which are made available by the inttypes.h header. These are integer types of a fixed width that allow better portability, since the sizes of the basic data types are implementation-defined and differ across architectures. This header also defines macros for formatting these types. The conversion and arithmetic functions related to this undefined behavior that are declared in this header are imaxabs, imaxdiv, strtoimax, and wcstoimax. They are very similar to the basic functions mentioned above. There are also versions for unsigned integers. [3, Sec. 7.8.2]

3.10.1 Programmer View

A common cause of undefined behavior when computing the absolute value is an argument that is outside the range of the function. The advice is to choose the function carefully according to the expected numbers; the range can go up to long long int with the llabs function. [19, Sec. 3-abs] Care is also needed with the most negative value of the given type, whose absolute value cannot be represented, since most modern architectures represent numbers in two’s complement, where the negative and positive ranges are not symmetric. For example, in 8-bit arithmetic, 127 is represented as 01111111 and -127 as 10000001, which is obtained by negating all bits and adding 1. Similarly, since -128 is 10000000, 128 would be 01111111 + 1, which is 10000000, that is, -128 again in the decimal system, and consequently 128 cannot be represented. [29] This principle is illustrated in Figure 15.

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

int main() {
    int a = INT_MIN; // -2147483648
    printf("Absolute value of INT_MIN: %d\n", abs(a));
    // likewise, dividing by -1 does the same but is harder to discover
    return 0;
}
/* Absolute value of INT_MIN: -2147483648 */

Figure 15 – Overflow of integer data type

For the programmer, there are two solutions: one is to use a bigger integer type, the other is to implement a custom absolute value function in which the calculation is handled differently for INT_MIN. Very similar advice applies to the division functions. If the programmer is not sure whether a value might exceed the range, it is better to use the version of the function for a bigger integer type.
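Both solutions can be sketched in a few lines. The example below is an illustration under stated assumptions rather than a definitive implementation: the function names wide_abs and checked_div are hypothetical, and wide_abs relies on long long being wider than int, which holds on common platforms but is not guaranteed by the standard.

```c
#include <limits.h>
#include <stdlib.h>

/* Absolute value computed in a wider type.
   Assumption: long long is wider than int on the target platform,
   so the absolute value of INT_MIN is representable in the result. */
long long wide_abs(int value)
{
    return llabs((long long)value);
}

/* Hypothetical checked division: stores the quotient and remainder in
   *out and returns 0, or returns -1 when the result is not defined
   (division by zero, or INT_MIN / -1, whose quotient does not fit). */
int checked_div(int numer, int denom, div_t *out)
{
    if (out == NULL || denom == 0) {
        return -1;
    }
    if (numer == INT_MIN && denom == -1) {
        return -1;
    }
    *out = div(numer, denom);
    return 0;
}
```

The division check mirrors the special case of the absolute value: the only quotient of two int values that cannot be represented is the one produced by INT_MIN and -1.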

With the conversion functions, there is an easier way out. Instead of the conversion functions atoi, atol, and atoll, the safer functions strtol and strtoll should be used. For unsigned integers, the safe functions are strtoul and strtoull. [30] Safer in this context means that if no conversion can be performed, 0 is returned; if the value is out of range, one of LONG_MIN, LONG_MAX, LLONG_MIN, LLONG_MAX, ULONG_MAX, or ULLONG_MAX is returned (according to the return type and sign of the value, if any), and the value of the macro ERANGE is stored in errno [3, Sec. 7.22.1.4, para. 8]. The functions strtoimax and wcstoimax work very similarly.

The best practice is to always check the return value of these functions to see whether the conversion succeeded; if it did not, some fallback mechanism should be implemented. Also, errno should be checked when the return value is one of the above-mentioned macros.
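The checks described above can be gathered in one helper function. The following is a minimal sketch; the name parse_long and its return convention are hypothetical. Since strtol stores a value in errno only on failure, errno is cleared before the call.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: converts text to long in base 10.
   Returns 0 and stores the value in *out on success; returns -1 when
   no digits were found, trailing characters remain, or the value is
   out of range for long. */
int parse_long(const char *text, long *out)
{
    char *end = NULL;
    if (text == NULL || out == NULL) {
        return -1;
    }
    errno = 0; /* strtol sets errno only on failure, so clear it first */
    long value = strtol(text, &end, 10);
    if (end == text) {
        return -1; /* no conversion was performed */
    }
    if (errno == ERANGE) {
        return -1; /* LONG_MIN or LONG_MAX was returned */
    }
    if (*end != '\0') {
        return -1; /* unparsed characters follow the number */
    }
    *out = value;
    return 0;
}
```

Unlike atoi, this helper distinguishes the input "0" from an input that contains no number at all, because the end pointer reveals whether any characters were consumed.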

3.10.2 Tester View

There is no specific tool that detects all potential sources of this undefined behavior. Some particular causes can be found, for example by scanning the code for the unsafe string-to-integer conversion functions. If any are found, they should be replaced with the safer versions. The absolute value function is a bigger problem, for the overflow cannot be detected and the input value and its size cannot always be predicted. Strictly speaking, this is a problem only on systems using two’s complement. The code can be searched very easily for the abs function, and if it is found, its use should not be accepted without further checks. The search can be automated with #pragma GCC poison abs. Because the standard abs function performs no error checking, a function capable of returning some specific error value should be implemented instead (see Figure 16 for an example).

#include <limits.h>

int abs(int a) {
    if (a == INT_MIN) {
        return -1; // or INT_MAX like strtol()
    }
    return (a < 0 ? -a : a);
}

Figure 16 – Implementation of safe abs function


One way is to return a negative number, in which case it is obvious that the result is not correct; alternatively, errno could be set to some error value. This is more of a programmer’s job, but the tester should be aware of the usage of the abs function and the risks it entails. If a new and more secure abs function is implemented, it can replace the original one by setting the LD_PRELOAD environment variable to the path of the library in which the new function is implemented. This ensures that the new library is loaded before any other library.


4 Conclusion and Future Work

The goal of this thesis was to describe 10 instances of undefined behavior in the C language. The first part gave a brief introduction to the concept of undefined behavior, a summary of the tools used, and a description of the structure of the thesis. The second part was dedicated to a detailed explanation of each undefined behavior from two views, which cover the possible ways to detect and repair it. The thesis can be seen as a summary and further description of several undefined behaviors in one place.

Each undefined behavior was tested on small examples included in the thesis that help the reader understand the given problem. Each example was tested with many tools and compiled with a variety of flags to discover the ways of detecting the undefined behavior. In some cases, no tool detects the behavior, but a different way of avoiding the problem in the source code is always introduced. Table 1 shows an overview of the results.

Behavior  GCC  Clang  Splint  Cppcheck  Valgrind  Other
3.1  -Wconversion  -Wconversion  Coverity
3.2  -Wuninitialized  -Wuninitialized  ✓  ✓  Memcheck
3.3  -Wwrite-strings  -Wwrite-strings  ✓
3.4  Default  -Wincompatible-pointer-types  ✓
3.5  -Wreturn-type  Default  ✓
3.6  #pragma GCC poison  #pragma GCC poison  CodeSonar
3.7  LD_PRELOAD  LD_PRELOAD
3.8  -Wformat  -Wformat  ✓  ✓
3.9  ASan  ASan  Memcheck
3.10  LD_PRELOAD  LD_PRELOAD

Table 1: Summarization of tools used for discovery of undefined behaviors


In a simplified form, this thesis could be used as study material for the C language or as a manual for improving code and discovering its critical parts. The thesis could also serve as a basis for the creation of a tool that would combine the detection options for many undefined behaviors. This would eliminate the need to test code with more than a few tools.


Reference List

[1] C. Lattner. “What Every C Programmer Should Know About Undefined Behavior”. 13 May 2011. [Blog entry]. LLVM Project Blog. Available: http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html [Accessed: 25 Nov 2014]

[2] J. Regehr. “A Guide to Undefined Behavior in C and C++”. 9 Jul 2010. [Blog entry]. Embedded in Academia. Available: http://blog.regehr.org/archives/213 [Accessed: 25 Nov 2014]

[3] WG14 N1570 Committee Draft, ISO/IEC 9899:201x, 2011.

[4] LLVM Project. “The LLVM Compiler Infrastructure”. [Online]. Available: http://llvm.org/ [Accessed: 1 Apr 2015]

[5] GCC team. “GCC, the GNU Compiler Collection”. [Online]. Available: https://gcc.gnu.org/ [Accessed: 1 Apr 2015]

[6] R. Stallman. (14 May 2014). “The GNU Project,” gnu.org. [Online]. Available: https://www.gnu.org/gnu/thegnuproject.html [Accessed: 1 Apr 2015]

[7] FSF. “Free Software Foundation”. [Online]. Available: http://www.fsf.org/ [Accessed: 1 Apr 2015]

[8] The Clang team. “Clang: a C language family frontend for LLVM,” llvm.org. [Online]. Available: http://clang.llvm.org/ [Accessed: 1 Apr 2015]

[9] The Clang team. “Clang Static Analyzer”. [Online]. Available: http://clang-analyzer.llvm.org/index.html [Accessed: 1 Apr 2015]

[10] The Clang team. “Clang vs Other Open Source Compilers: Clang vs GCC (GNU Compiler Collection)”. [Online]. Available: http://clang.llvm.org/comparison.html [Accessed: 1 Apr 2015]

[11] Splint. “Splint”. [Online]. Available: http://www.splint.org/ [Accessed: 1 Apr 2015]

[12] D. Marjamäki (3 Jan 2015). “Cppcheck: a tool for static C/C++ code analysis”. [Online]. Available: http://cppcheck.sourceforge.net/ [Accessed: 1 Apr 2015]

[13] Valgrind Developers. “Valgrind”. [Online]. Available: http://valgrind.org/ [Accessed: 1 Apr 2015]

[14] GCC team. “A GNU Manual,” GCC 4.9.2. [Online]. Available: https://gcc.gnu.org/onlinedocs/gcc/index.html [Accessed: 10 Dec 2014]

[15] Synopsys. “Coverity”. [Online]. Available: http://www.coverity.com/ [Accessed: 10 Dec 2014]

[16] Network Design & Management. “HP Fortify Static Code Analyzer”. [Online]. Available: http://www.ndm.net/sast/hp-fortify-static-code-analyzer [Accessed: 10 Dec 2014]

[17] Valgrind Developers (2014). “Valgrind User Manual”. [Online]. Available: http://valgrind.org/docs/manual/manual.html [Accessed: 18 Nov 2014]

[18] R. Seacord et al. (4 Aug 2014). “STR30-C. Do not attempt to modify string literals,” CERT C Coding Standard. [Online]. Available: https://www.securecoding.cert.org/confluence/display/c/STR30-C.+Do+not+attempt+to+modify+string+literals [Accessed: 18 Nov 2014]


[19] M. Kerrisk (2015). “Linux man pages online”, man7.org. [Online]. Available: http://man7.org/linux/man-pages/index.html [Accessed: 1 May 2015]

[20] Wikipedia. “Application binary interface”, Wikipedia.org. [Online]. Available: http://en.wikipedia.org/wiki/Application_binary_interface [Last Modified: 25 February 2015, 03:02]

[21] Microsoft. “signal”, C Run-Time Library Reference. [Online]. Available: https://msdn.microsoft.com/en-us/library/xdkz3x12.aspx [Accessed: 13 Feb 2015]

[22] GCC team. “The C Preprocessor,” GCC 4.9.2. [Online]. Available: https://gcc.gnu.org/onlinedocs/gcc-4.9.2/cpp/index.html [Accessed: 13 Feb 2015]

[23] GrammaTech. “CodeSonar”, grammatech.com. [Online]. Available: http://www.grammatech.com/codesonar [Accessed: 13 Feb 2015]

[24] Glibc. “The GNU C Library (glibc)”. [Online]. Available: http://www.gnu.org/software/libc/index.html [Accessed: 13 Feb 2015]

[25] J. Pincar et al. (4 Aug 2014). “FIO47-C. Use valid format strings,” CERT C Coding Standard. [Online]. Available: https://www.securecoding.cert.org/confluence/display/c/FIO47-C.+Use+valid+format+strings [Accessed: 5 Jan 2015]

[26] R. Lewis (17 May 2006). “Secure Coding Best Practices for Memory Allocation in C and C++”, CodeProject: For those who code. [Blog entry]. Available: http://www.codeproject.com/Articles/13853/Secure-Coding-Best-Practices-for-Memory-Allocation [Accessed: 20 Apr 2015]

[27] D. Seketeli (2 Dec 2014). “Address and Thread Sanitizers in GCC,” Red Hat Developer Blog. [Blog entry]. Available: http://developerblog.redhat.com/2014/12/02/address-and-thread-sanitizers-gcc/ [Accessed: 20 Apr 2015]

[28] The Clang team. “AddressSanitizer,” Clang 3.7 documentation. [Online]. Available: http://clang.llvm.org/docs/AddressSanitizer.html [Accessed: 1 May 2015]

[29] N. Jones (1 Feb 2012). “The absolute truth about abs(),” Embedded Gurus: Experts on Embedded Software. [Blog entry]. Available: http://embeddedgurus.com/stack-overflow/2012/02/the-absolute-truth-about-abs/ [Accessed: 20 Mar 2015]

[30] D. Svoboda et al. (19 Oct 2014). “ERR07-C. Prefer functions that support error checking over equivalent functions that don't,” CERT C Coding Standard. [Online]. Available: https://www.securecoding.cert.org/confluence/display/c/ERR07-C.+Prefer+functions+that+support+error+checking+over+equivalent+functions+that+don%27t [Accessed: 20 Mar 2015]


Attachments

The attachment to this thesis, Sources.zip, contains a set of example programs for each undefined behavior. There is a makefile in the top folder that triggers the compilation of all source files. Whenever there is a flag that might warn about the examined behavior, the source code is compiled with this flag. Some files are compiled with -fsanitize and the warning message appears when the file is executed. Other files are meant to be executed in order to observe the undefined behavior occurring in them.

Summary of examples:

o 1_1.c, 1_2.c – warnings about conversions when compiled
o 2_1.c – warnings about uninitialized variable when compiled
o 2_2.c – uninitialized array; compiled without warning; command valgrind -v ./2_2 runs Valgrind that detects problem
o 2_3.c – uninitialized structure; compiled without warning; command valgrind -v ./2_3 runs Valgrind that detects problem
o 3_1.c, 3_2.c – warnings about modification of string literals when compiled
o 4_1.c – default warning about incompatible pointer type by GCC
o 5_1.c – warning about missing return statement
o 6_1.c – use of signal with POSIX threads program, part of code can be uncommented to forbid compiler to compile with signal function in code
o 6_2.c – use of signal with C11 threads
o 7_1.c – no warnings; when executed, program should fail to open file
o 8_1.c – warning about missing argument
o 8_2.c – example from Figure 11
o 8_3.c – warning about incompatible argument
o 9_1.c, 9_2.c, 9_3.c – no warnings; 9_#ASan can be run to show output from ASan; 9_# can be run with Valgrind that detects incorrect memory management
o 10_1.c – part of code can be uncommented to forbid compiler to compile with abs function in code
