MASARYK UNIVERSITY
FACULTY OF INFORMATICS

Undefined behaviour in C language

MASTER’S THESIS

Branislav Koščák

Brno, autumn 2014

Declaration

I hereby declare that this paper is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed in complete reference to the due source.

Branislav Koščák

Advisor: Mgr. Marek Grác, Ph.D.

Acknowledgement

I am very grateful to my supervisor Miroslav Franc for his guidance, invaluable help and feedback throughout the work on this thesis.

Abstract

This Master’s Thesis deals with the concept of undefined behaviour in the C language from the point of view of a tester and a developer. It further describes what the purpose of undefined behaviour is, why it is dangerous, and it provides a summary of the current state regarding prevention and detection of undefined behaviour. Blind spots where there is currently no tool available are identified. The thesis contains a set of sample programs.

Keywords

undefined behaviour, C, testing, programming language, exploit, standard, detection

Contents

1 Introduction
  1.1 ISO requirements
  1.2 Analysis Tools
  1.3 Goals
2 Undefined Behaviour
  2.1 Reason for Undefined Behaviour
    2.1.1 Performance – compile/run-time
    2.1.2 Support of HW portability
  2.2 Drawbacks of Undefined Behaviour
3 Analyzed Undefined Behaviours
  3.1 Object Lifetime
    3.1.1 Detection
  3.2 Integer Overflow
    3.2.1 Detection
  3.3 Integer Conversion
    3.3.1 Detection
  3.4 Void Expression
  3.5 Pointer Conversion
  3.6 Pointer to Function
    3.6.1 Detection
  3.7 Modifying String Literal
  3.8 Sequence Points
    3.8.1 Detection
  3.9 Function Prototype
  3.10 Pointer Type
  3.11 Use of Library Functions
    3.11.1 Detection
  3.12 Pointer Arithmetic
    3.12.1 Detection
  3.13 Shifting Overflow
    3.13.1 Detection
  3.14 Pointers Comparison
    3.14.1 Detection
  3.15 Integer Constant Expression
  3.16 Longjmp/Setjmp
    3.16.1 Detection
  3.17 Multiple External Definitions
  3.18 Pointer to FILE
  3.19 Output Functions
  3.20 Copy of Overlapping Objects
    3.20.1 Detection
4 Conclusion
Bibliography
A Shared Library Implementation
B Content of the CD-ROM

1 Introduction

Nowadays there are various types of programming languages available. They differ in programming style (paradigm), in the syntax and semantics they use, in whether they are strongly or weakly typed, and in other respects. Naturally, a single programming language cannot handle all tasks equally efficiently. Performing quick queries on a large database requires a different approach than low-level GPU programming. Every language has its advantages and limitations. Mastering both minimizes errors and also leads to better performance and more effective code.

The C programming language was developed in 1972. It combined the principles and main features of the BCPL and B programming languages. It soon spread among universities and came to be used by several organizations. The problem was that different versions of C were in use, which caused compatibility problems. It was necessary to establish a standard definition of C. The first was ANSI C (later adopted by ISO and thus referred to as ISO C). The standard has been evolving, undergoing several changes and adding new features that were lacking in previous versions. The current standard for the C programming language is C11 [9].

C is an imperative programming language. It is a flexible language whose important feature is that it allows low-level programming while using the syntax of a high-level language. It allows fast program execution and can manipulate memory at a low level. C is weakly typed. The absence of bounds checking and pointer validity checking improves the performance of C programs. This is a powerful tool for an experienced programmer. On the other hand, it can lead to severe problems and undesired results when used inappropriately.

1.1 ISO requirements

The C programming language comes with additional concepts whose knowledge is required when programming. Unspecified behaviour, implementation-defined behaviour and undefined behaviour are aspects of C which may cause programs to misbehave even though the compiler does not report any error. As often happens, after a change in the compiler implementation or the language specification, the program may suddenly stop working.

Unspecified behaviour is defined as the use of an unspecified value, or other behaviour for which the standard provides two or more possibilities and does not specify which one is chosen. An example is the order in which the operands of an assignment operator are evaluated. A program that contains unspecified behaviour but is correct in all other aspects is considered to be correct.

Implementation-defined behaviour is unspecified behaviour for which each implementation documents the choice it makes. An example is the number of bits in a byte.

In ISO C, undefined behaviour is described as "behaviour, upon use of a non-portable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements". As there are no constraints, a program containing undefined behaviour can behave in an expected way, it can crash, or it can run with unpredictable results, with or without the issuance of a diagnostic message.

The specification of the C language contains a full list of program constructs that lead to undefined behaviour. They occur in many areas: preprocessor directives, function declarations, expression evaluation, and calls to library functions. Causing undefined behaviour should be avoided, as it may lead to undesired results. In the better case the program does not work. But because the results can be arbitrary, there is also a possibility that it could be exploited.

As stated in the C specification, signed integer overflow causes undefined behaviour. A real example [7] of a situation in which this could have been exploited is the Linux kernel 2.4.25 and earlier, where an integer overflow in the SCTP_SOCKOPT_DEBUG_NAME SCTP socket option in socket.c allowed local users to execute arbitrary code.

Since the C programming language imposes only a few constraints on a programmer, it can easily produce problems that are difficult to discover. Many tools are available to help programmers avoid problems that can occur in code and may remain hidden because the compiler does not report them. Analysis can be static (compile-time) or dynamic (run-time), and it can be performed on source code or on a binary.

1.2 Analysis Tools

One of the main problems with undefined behaviour is that the range of possible errors is quite wide: the C specification lists more than 190 types of undefined behaviour. Unless a programmer knows the C standard, it is easy to make a mistake. The programmer may not even know that there is something wrong with the code. For example, a null pointer dereference is a well-known bug which usually leads to a program crash. This kind of error is frequent, and programmers usually take care to prevent it. A statement like a[i] = i++; does not look suspicious, yet the standard considers it undefined. Again, consequences may or may not occur. Expressions like this usually do not cause the program to crash; the results of the computation are just unexpected, and it may be difficult to find the reason.

i = i + 1;    // Well-defined behaviour
i = i++ + 1;  // Undefined behaviour

Figure 1.1: Defined versus undefined behaviour
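One well-defined way to express what a[i] = i++; presumably intends is to separate the store from the increment. The following is only a minimal sketch; the helper name store_and_advance is hypothetical and does not come from the thesis.

```c
#include <stddef.h>

/* Hypothetical helper: stores the current index into a[i] and then
 * advances the index. The store and the increment are separate
 * statements, so a sequence point lies between the read of i and its
 * modification, and the behaviour is well defined. */
size_t store_and_advance(int *a, size_t i)
{
    a[i] = (int)i;  /* i is only read in this statement */
    i++;            /* i is only modified in this statement */
    return i;
}
```

The caller remains responsible for passing an index that lies inside the array.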

The need for automated code checking is obvious. Both static and dynamic analyzers are available for detecting undefined behaviour. Clang is a front end for the LLVM compiler [12] which comes with several sanitizers, such as the memory, address and undefined behaviour sanitizers.

Analyzers that do not explicitly look for undefined behaviour can be used as well. An example is Valgrind [17], a programming tool that detects errors at run time. Even though the Valgrind tool Memcheck mainly targets memory issues, it can discover problems resulting from undefined behaviour (out-of-bounds access, freeing memory multiple times). Various static analyzers are available too, such as Clang Static Analyzer [13], Cppcheck [3] and Splint [16]. They provide checkers that can detect numerous problems. Apart from performance, there are areas where the results of static analysis differ from those of a dynamic code check.

1.3 Goals

The main goal of this Master’s Thesis is to describe the severity of undefined behaviour from both the developer’s and the tester’s point of view.

The second chapter provides a detailed description of undefined behaviour in the C programming language. It deals with the positive aspects of undefined behaviour and with the reasons for having it in the C standard. The drawbacks and the necessity to avoid it are described in this chapter too.

The following chapter contains a set of chosen undefined behaviours, which are used to explain in detail in what way undefined behaviour can be undesired. Individual cases are analyzed, and critical constructs which can be potentially dangerous are inspected. The chapter also lists and describes currently available static and dynamic analyzers that can detect the respective undefined behaviour. The thesis also includes a set of sample programs. They demonstrate the occurrence of various cases of undefined behaviour in C code.

2 Undefined Behaviour

As the word undefined implies, the result of a program invoking undefined behaviour is not specified. It depends on the system architecture, compiler implementation, translation environment and settings. The program may translate and behave in an expected and documented way, and its results may be correct. Moving such a program to a different platform may reveal signs that the program is erroneous (unsuccessful translation, terminated execution or arbitrary results). In the C standard, behaviour is identified as undefined if:

• a "shall" or "shall not" requirement is violated, and that requirement appears outside of a constraint

• the behaviour is explicitly marked with the words "undefined behaviour"

• any explicit definition of the behaviour is omitted

Even though the wording of the C standard may not always be completely clear, Annex J.2 contains a list of explicit undefined behaviours. However, the list is only informative, not normative.

2.1 Reason for Undefined Behaviour

Despite all the potential problems that may arise when undefined behaviour is invoked, there are reasons why the C standard leaves certain areas undefined. They go together with the main characteristics of the C language: it does not impose unnecessary overhead on the implementation, the speed of compilation and execution is high, and it is possible to implement C on a wide range of hardware.

2.1.1 Performance – compile/run-time

Generating efficient code improves program performance. One of the ways this is achieved in C involves loops and signed integer overflow, which is described in Section 3.2.

If a signed integer is used inside a for statement to control the number of iterations, then, because signed integer overflow is undefined, the compiler can rely on the fact that the number of iterations will not be infinite, and thus it can perform loop optimizations.

Values of objects with automatic storage duration (objects which could also have been declared with the register storage class) should not be used before the object is initialized; otherwise the behaviour is undefined. This means that C does not need to initialize all variables. This can be used for efficiency reasons: large automatic arrays do not need to be initialized, which avoids the expensive overhead of assigning a default value to all elements.

Undefined behaviour invoked by accessing an array out of bounds frees compilers from checking whether the accessed element is in the array, thus making pointer arithmetic faster.
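The loop argument can be illustrated with a small sketch. The function name sum_upto and its precondition are assumptions made for this illustration only. Because the counter is signed, a compiler may assume that i++ never overflows and therefore that the loop runs exactly n + 1 times for non-negative n. With an unsigned counter and n equal to UINT_MAX, the same loop would never terminate.

```c
/* Sums 0 + 1 + ... + n using a signed loop counter.
 * Precondition assumed by this sketch: n < INT_MAX, so that i++ never
 * overflows. For negative n the loop body is never executed. */
long long sum_upto(int n)
{
    long long sum = 0;
    for (int i = 0; i <= n; i++) {
        sum += i;
    }
    return sum;
}
```

Whether a particular compiler actually applies such an optimization depends on its version and options.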

2.1.2 Support of HW portability

By making the results of certain operations intentionally ambiguous in the C standard, many different platforms are supported without sacrificing performance. Because no specific behaviour is required in these circumstances, compiler vendors are free to do whatever is the most efficient for the target hardware platform.

The undefined behaviour "Conversion of a pointer to an integer type produces a value outside the range that can be represented" [8] is an example of how the ambiguity helps to support more hardware platforms. As described in Section 3.5, there are platforms where the size of pointers is greater than the size of integer variables. The result of converting values that cannot be represented is undefined. This allows C to be implemented on various platforms with different data models.

Table 2.1: Integer overflow on different architectures

+5e9       ARM          ARM      ARM     x86/x64       x86/x64   x86/x64
           32-bit       16-bit   8-bit   32-bit        16-bit    8-bit
unsigned   4294967295   65535    255     705032704     0         0
signed     2147483647   -1       -1      -2147483648   0         0


Table 2.1 shows overflows of unsigned and signed (undefined behaviour) integers on different hardware architectures.

2.2 Drawbacks of Undefined Behaviour

Even though undefined behaviour makes applications faster, invoking it is dangerous and can cause not only non-functional code but also security issues. However, many consequences of undefined behaviour arise only after compiler optimizations are applied.

1. int array[10];
2.
3. bool contains(int value) {
4.     for (int i = 0; i <= 10; i++) {
5.         if (array[i] == value)
6.             return true;
7.     }
8.     return false;
9. }

Figure 2.1: Undefined behaviour before optimization

The function contains in Figure 2.1 takes the variable value and checks whether it is present in the array. The loop condition on line 4 lets the index reach 10, so the array is accessed past its last element. Most compilers would generate the code, and the program would behave in an expected way. However, because accessing an out-of-bounds array element invokes undefined behaviour, the compiler is free to make the following assumptions:

• if array contains value in any of the first 10 iterations (i from 0 to 9), the function returns true

• when i is incremented to 10, the code invokes undefined behaviour. The compiler may assume that this never occurs, because accessing such an element is undefined, and thus can ignore it

• accordingly, the compiler may conclude that i never reaches 10

• therefore all legal code paths return true

Due to undefined behaviour, the function contains can then be optimized into the form shown in Figure 2.2.

bool contains(int value) {
    return true;
}

Figure 2.2: Optimization performed

GCC version 4.4.7 and Clang version 3.5.0, with optimizations both enabled and disabled, do not return true if the array does not contain value. GCC version 4.9.0 returns false without optimizations, causes a segmentation fault with -O1, and returns true with -O2.

An erroneous program may be caused not only by bugs in the source code but also by bugs inserted into the program by the compiler which translates it. However, miscompilation is far less often the source of problems in non-functional applications. Silent bugs may be revealed by optimizations. If the program relies on undefined behaviour, compilers are allowed to perform unexpected optimizations.

Another drawback of undefined behaviour is that C applications need to be tested and debugged with external tools to detect it. Compiler developers are permitted to ignore it: they need not catch certain program errors that are difficult to diagnose, and they may emit code with unpredictable results if the code depends on undefined behaviour. Different tools have to be used for different undefined behaviours. Even though GCC and Clang contain sanitizers to detect it, they are still in development and the range of cases that are caught is still limited.
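For comparison, a version of contains without the out-of-bounds access could look as follows. This is a sketch only; the macro name ARRAY_LEN is introduced here for illustration and is not part of the original figure.

```c
#include <stdbool.h>
#include <stddef.h>

#define ARRAY_LEN 10  /* hypothetical name for the array length */

int array[ARRAY_LEN];

/* Returns true if value is stored in array. The loop condition uses <,
 * so the index stays within 0 .. ARRAY_LEN - 1 and no element past the
 * end of the array is read. */
bool contains(int value)
{
    for (size_t i = 0; i < ARRAY_LEN; i++) {
        if (array[i] == value)
            return true;
    }
    return false;
}
```

With this loop bound there is a legal path that returns false, so the compiler can no longer reduce the function to return true.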

3 Analyzed Undefined Behaviours

The standard contains 191 undefined behaviours. This chapter describes 20 of them. The selected undefined behaviours are those that are likely to occur, may be difficult to detect, and may break the code seriously. Unless otherwise stated, the code was tested for undefined behaviours on Fedora 20 x86_64 with GCC version 4.9.2, Clang version 3.4, Splint 3.1.2, Cppcheck 1.63 and Valgrind 3.9.

3.1 Object Lifetime

Each object in the C programming language has a storage class. It is an attribute which defines the scope and lifetime of the respective object. It specifies where the variable will be stored and what its initial value will be if no value has been assigned in the declaration. C comes with automatic, static and allocated storage duration. During the lifetime of an object, its storage is guaranteed to be reserved. The behaviour is undefined if "An object is referred to outside of its lifetime." [8]

With automatic storage duration, the object is created at the beginning of a code block and deallocated at its end. The scope of a local variable declared inside the body of a function is limited to the function definition, and the variable cannot be accessed by name outside it, because the compiler would not recognize such a variable. However, the programmer is free to access the value of the object outside its lifetime through a pointer, even though this leads to undefined behaviour.

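/* First version: returns a pointer to a local array */
int *initArray(void) {
    int i;
    int array[10];
    for (i = 0; i < 10; i++) {
        array[i] = i;
    }
    return array;     /* undefined: array ends its lifetime here */
}

/* Second version: hands out the address of a local array via pArray */
void initArray(int **pArray) {
    int i;
    int array[10];
    for (i = 0; i < 10; i++) {
        array[i] = i;
    }
    *pArray = array;  /* undefined once the caller uses *pArray */
}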

Figure 3.1: Functions returning local variable


Figure 3.1 shows two examples in which an object is referred to outside of its lifetime. In the first code block, the memory of array may remain accessible after the return statement, but the values of its elements will be indeterminate. If the programmer needs to initialize an array, the array should be passed to the function by reference. The second program tries to follow this approach, but the result is undefined behaviour as well. After the closing brace, the pointer passed back through pArray becomes dangling. It points to an invalid memory location and should not be dereferenced. It is advisable to set such a pointer to NULL so that an attempt to access it causes a NULL pointer dereference, which is easier to debug.

With allocated storage duration, the programmer allocates the required memory block and is also responsible for freeing it carefully, neither leaving memory leaks nor attempting to free the same memory region twice, as the latter invokes undefined behaviour. Writing to or reading from already deallocated space means accessing a dangling pointer, which can result in exploitable vulnerabilities.

Setting a pointer to NULL after it has been freed is a defensive style which protects against such pointer bugs. On most systems, accessing a NULL pointer results in an immediate crash, which may be a better option than corrupting other variables and silent misbehaviour of the application.
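A sketch of the two safe alternatives mentioned above follows: filling storage owned by the caller, and allocated storage that is released exactly once with the pointer reset to NULL. The function names init_array, new_array and release_array are hypothetical and chosen for this illustration.

```c
#include <stddef.h>
#include <stdlib.h>

/* Fills a buffer owned by the caller, so the storage outlives the call. */
void init_array(int *array, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        array[i] = (int)i;
    }
}

/* Allocates and fills an array with allocated storage duration.
 * Returns NULL if the allocation fails; the caller releases the result.
 * This sketch does not guard against len * sizeof(int) overflowing. */
int *new_array(size_t len)
{
    int *array = malloc(len * sizeof *array);
    if (array != NULL) {
        init_array(array, len);
    }
    return array;
}

/* Frees *p and resets it to NULL. free(NULL) is a no-op, so calling
 * this function twice on the same pointer variable is harmless. */
void release_array(int **p)
{
    free(*p);
    *p = NULL;
}
```

A caller would either declare int local[10]; and call init_array(local, 10), or obtain a block with new_array and later pass the address of its pointer to release_array.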

3.1.1 Detection

The use of objects outside of their lifetime is covered by multiple tools. The best coverage of the aforementioned problems is provided by the Valgrind tool Memcheck. This memory error detector maintains a set of shadow registers, where the call stack is stored. All data in both registers and memory (every value that is created as a result of an operation during the program run) is shadowed by metadata called definedness bits (V bits). The V bits of a result are computed from the V bits of all inputs and the operation. If an operation uses a value which could affect the observable behaviour of the program, the operation is checked. If the V bits indicate that a control flow transfer, a conditional move, an address used in a memory access, or data passed to a system call depends on undefined values, an error message is issued. This definedness checking adds overhead. To use shadow bits, double the amount of memory is required. The Memcheck manual states that a program will run 20-30 times slower than normal.

Misuse of freed or allocated but uninitialized memory can be debugged with the malloc tunable parameter M_PERTURB [5]. All bytes that malloc (not calloc) gets from the system and all bytes that are freed are filled with perturb_byte. It can then be checked whether malloced data are still uninitialized, or whether there is an attempt to write to freed memory.

Other tools (Splint, Clang Static Analyzer, Cppcheck and also GCC) reported attempts to return a pointer to an automatic variable. However, they failed to catch double free bugs and dereferencing a pointer after it has been deallocated.

Writing to memory after it has been freed can be found by the memory debugger Electric Fence [1]. The memory allocation functions are overridden by the eFence library. When a memory error is detected, eFence crashes the program so that it can be inspected by a debugger. The downside of this tool is its high memory consumption. Even though eFence is interesting from a historical perspective, Valgrind is superior in most ways.

3.2 Integer Overflow

During the evaluation of an expression, an exceptional condition may occur. This mostly concerns expressions whose result is not mathematically defined or is out of the range of its data type. Even though the C standard does not explicitly define integer overflow and only mentions it as an example of undefined behaviour (C11, section 3.4.3), this section describes situations in which an integer may overflow.

The unsigned integer type is guaranteed by the standard not to overflow, because a value that cannot be represented wraps around modulo a power of two and thus always remains in the range of the unsigned integer type. Signed integer operations overflow if the result of the operation is not representable by the integer data type. There are hardware and compiler implementations on which, when an integer overflow occurs, the most significant bit is overwritten by the low-order bits produced by the overflowing operation. Some operators are safe against overflow, such as the comparison operators > and < and the logical operators &&, || and !. However, arithmetic operators must be used with signed integers in a way that prevents undefined behaviour.

When two 8-bit signed integers (with one sign bit) are added, 9 bits are required for the correct result. This is when a signed integer overflows on addition. The same holds for all signed integer widths (16, 32, 64 bits). The addition overflows if both operands have the same sign, and that sign is opposite to the sign of the result. One way for the programmer to detect an overflow is to check whether the conditions in the code snippet in Figure 3.2 are satisfied. All checks for overflow must be done before performing the operation that may overflow, because afterwards the program is already in an undefined state.

((s1 > 0) && (s2 > 0) && (s1 > (INT_MAX - s2)))
((s1 < 0) && (s2 < 0) && (s1 < (INT_MIN - s2)))

Figure 3.2: Check for overflow in addition
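The conditions of Figure 3.2 can be wrapped into a function that performs the check before the addition. This is a minimal sketch; the names add_overflows and checked_add are hypothetical.

```c
#include <limits.h>
#include <stdbool.h>

/* Returns true if s1 + s2 is not representable in int. The test is done
 * with subtractions that cannot overflow themselves. */
bool add_overflows(int s1, int s2)
{
    if (s2 > 0 && s1 > INT_MAX - s2)
        return true;   /* result would exceed INT_MAX */
    if (s2 < 0 && s1 < INT_MIN - s2)
        return true;   /* result would fall below INT_MIN */
    return false;
}

/* Stores s1 + s2 in *result and returns true only when the sum is
 * representable; otherwise *result is left untouched. */
bool checked_add(int s1, int s2, int *result)
{
    if (add_overflows(s1, s2))
        return false;
    *result = s1 + s2;
    return true;
}
```

Checking only the sign of s2 is sufficient here, because operands with different signs can never overflow on addition.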

Subtraction of signed integers may overflow if the operands have opposite signs. It can be detected by an approach similar to the addition overflow check, with adjusted sign conditions.

Multiplication is a common arithmetic operation whose result can easily grow beyond the representable range. For example, the factorial of 13 is almost 3 times larger than the maximal 32-bit signed integer. To work with numbers this large, it may be advisable to use an additional library, such as the GNU Multiple Precision Arithmetic Library [6]. As with addition and subtraction, it is the programmer’s responsibility to verify that values are within the representable range.

1. ((abs(s1) > abs(INT_MAX / s2)) || (abs(s2) > abs(INT_MAX / s1)))
2. ((s1 < INT_MIN / s2) || (s2 < INT_MIN / s1))

Figure 3.3: Check for overflow in multiplication

If either expression in Figure 3.3 evaluates to true, multiplying the operands s1 and s2 will overflow. Line 1 detects results above INT_MAX and line 2 results below INT_MIN.

Besides these arithmetic operations, overflow may occur in simple expressions which do not look prone to it. For example, multiplying or dividing INT_MIN by -1 overflows. Library functions should also be used with caution. The function abs() from stdlib.h is typically implemented as return i < 0 ? -i : i;, where i is its argument. Without any check, passing INT_MIN to this function causes -i to overflow. On most implementations the result is again INT_MIN, but it may be anything, as it is the result of undefined behaviour.

GCC has two options to handle integer overflow. The option -ftrapv generates traps for signed overflow on addition, subtraction and multiplication. The option -fwrapv instructs the compiler to assume two’s complement wrapping behaviour when a signed integer overflows.
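The checks of Figure 3.3 divide by the operands and call abs(), so they need care themselves when an operand is 0 or INT_MIN. The sketch below shows one way to test multiplication and abs() without those pitfalls by distinguishing the sign combinations. The function names mul_overflows and checked_abs are hypothetical.

```c
#include <limits.h>
#include <stdbool.h>

/* Returns true if s1 * s2 is not representable in int. Each branch
 * divides only by a non-zero operand, and no division can overflow. */
bool mul_overflows(int s1, int s2)
{
    if (s1 > 0) {
        if (s2 > 0)
            return s1 > INT_MAX / s2;          /* both positive */
        return s2 < INT_MIN / s1;              /* s1 > 0, s2 <= 0 */
    }
    if (s2 > 0)
        return s1 < INT_MIN / s2;              /* s1 <= 0, s2 > 0 */
    return s1 != 0 && s2 < INT_MAX / s1;       /* both <= 0 */
}

/* Stores |i| in *result and returns true, or returns false when |i| is
 * not representable (i == INT_MIN on two's complement machines). */
bool checked_abs(int i, int *result)
{
    if (i < -INT_MAX)
        return false;
    *result = i < 0 ? -i : i;
    return true;
}
```

The comparison i < -INT_MAX avoids assuming a particular representation of negative numbers.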

3.2.1 Detection

Detection of signed integer overflow is covered best by the GCC and Clang dynamic analyzers. With the option -fsanitize=undefined, both can warn about overflows in the arithmetic operations +, -, * and /. However, they fail to detect overflows in the abs() function.

3.3 Integer Conversion

Since the C99 standard, the C programming language supports the single-precision floating-point format defined in the IEEE 754 standard [10]. In this encoding, a 4-byte number consists of 1 sign bit, 8 exponent bits and 24 mantissa bits (23 bits are stored, the remaining bit is implied). The 4-byte float type can represent the range between approximately -3.4×10^38 and 3.4×10^38, and the 8-byte double type covers values between approximately -1.79×10^308 and 1.79×10^308.

Because the range that can be represented by float values is much wider than the range of the integer type, there are float values that cannot be represented by the integer type. If such a float value is converted to the integer type, the behaviour is undefined. IEEE 754 defines the values -inf and +inf [10] for results of calculations which cannot be represented accurately. That is why conversions between float and double and floating-point arithmetic operations with these two types are safe and will not cause overflow, that is, undefined behaviour. However, the programmer is informed about such cases. For example, the Clang dynamic analyzer warns that the assigned value DBL_MAX is outside the range of representable values of the type float in float f = (float)DBL_MAX, even though it is not undefined behaviour and the result of this operation is inf.

A simple check that the converted float value is not greater than INT_MAX may seem sufficient. However, the code example in Figure 3.4 shows otherwise.

1. float f = INT_MAX;
2. int i = f;
3. printf("%d\n", i);

Figure 3.4: INT_MAX rounded

The output of line 3 is -2147483648 with both the GCC and Clang compilers. This is evidence that the variable i overflowed. A closer analysis shows that assigning INT_MAX to the float variable f rounds the value 2,147,483,647 (the highest 32-bit integer value) to 2,147,483,648, which overflows a 32-bit integer. With a 32-bit int, the largest float value that still fits into an int is INT_MAX - 127 = 2,147,483,520. The programmer must take this into account, because conversions from int to float and back to int can overflow and thus easily result in undefined behaviour, and in many cases this happens counter-intuitively.
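A range check that takes the rounding of INT_MAX into account can compare against bounds that are exactly representable as float. The sketch below assumes a 32-bit two's complement int and an IEEE 754 single-precision float, so that (float)INT_MIN is exactly -2^31; the function name float_to_int is hypothetical.

```c
#include <limits.h>
#include <stdbool.h>

/* Converts f to int only if the truncated value is representable.
 * Assumption of this sketch: INT_MIN is a negated power of two and
 * therefore converts to float without rounding. */
bool float_to_int(float f, int *result)
{
    const float lower = (float)INT_MIN;   /* -2^31, exact */
    const float upper = -lower;           /*  2^31, exact; first value too large */

    /* Written as a negated conjunction so that NaN, for which every
     * comparison is false, is rejected as well. */
    if (!(f >= lower && f < upper))
        return false;

    *result = (int)f;   /* truncates toward zero, stays in range */
    return true;
}
```

Comparing against (float)INT_MAX with <= would not work here, because that bound is rounded up to 2^31 and the value 2^31 itself would pass the test.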

3.3.1 Detection

Overflows resulting from conversions between int and float variables are detected only by the Clang dynamic analyzer with the option -fsanitize=undefined. The GCC option -Wconversion warns only about implicit conversions between variables of different types which may alter a value, for example int i = 5.0;. However, this option does not inform about explicit casts that are prone to invoke undefined behaviour.

3.4 Void Expression

To represent numbers, characters, strings or structures, the C programming language comes with various data types. The void type is an object type which does not have a value of any type. It is used to declare functions that do not return a value. The void type is also used to specify that a function takes no parameters. The function declaration function() means that the function can take any number of parameters of unknown type. Not providing the compiler with the type and number of arguments is unsafe and can lead to undefined behaviour, as described in Section 3.6. In contrast, function(void) declares a function taking no arguments, which is the advisable approach.

A void expression has the type void, and its value cannot be used. Neither an implicit nor an explicit conversion can be applied to it; otherwise undefined behaviour is invoked. A pointer to void cannot be dereferenced unless it is cast to another type, since it may not be properly aligned and because the void type has no size. For the same reason, it is not possible to use pointer arithmetic on pointers to void.

Violating these rules is by default detected and treated as an error by both the GCC and Clang compilers. Except for explicit conversion of the void type, the Splint static analyzer also reports all misuses of void expressions.
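The rules for pointers to void can be illustrated by a short sketch: the pointer is converted to a pointer to a complete type before it is dereferenced, and byte-wise arithmetic is done through a pointer to unsigned char. The function names read_int and advance_bytes are hypothetical, and the sketch assumes that the caller passes an address that really holds a properly aligned int.

```c
#include <stddef.h>

/* Reads an int through a pointer to void. The pointer is converted to
 * const int * first, because void has no size and cannot be dereferenced. */
int read_int(const void *p)
{
    const int *ip = p;   /* implicit conversion from void * is permitted */
    return *ip;
}

/* Advances a pointer to void by n bytes. The arithmetic is performed on
 * a pointer to unsigned char, whose element size is 1 by definition. */
const void *advance_bytes(const void *p, size_t n)
{
    return (const unsigned char *)p + n;
}
```

For an array int v[2], the call read_int(advance_bytes(v, sizeof v[0])) reads the second element.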

#include <stdio.h>

int function(void) {
    /* function body */
    /* no return statement */
}

int main() {
    printf("%d\n", function());
    return 0;
}

Figure 3.5: Invalid use of void expression

Another potential risk of undefined behaviour concerning the void type arises when control reaches the closing curly brace } of a non-void function without executing a return statement, and the return value of the function call is used. A missing return statement in a non-void function can occur in an if statement whose else branch is missing, or in a switch statement which does not contain a default label, so that control reaches the closing } without returning any value. The program in Figure 3.5 typically prints arbitrary data, because values which are not specified are read from the stack. However, the behaviour is undefined, and any result is possible. An omitted return statement is reported by the GCC (this needs to be enabled by -Wreturn-type) and Clang compilers.
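Both situations described above can be avoided by making sure that every path through the function ends in a return statement. The two functions below are a hypothetical illustration: one closes the if chain with a final return, the other gives the switch statement a default label.

```c
/* Every branch of the if chain returns, including the case value == 0,
 * which a missing else branch would otherwise leave without a result. */
int sign_of(int value)
{
    if (value > 0) {
        return 1;
    } else if (value < 0) {
        return -1;
    }
    return 0;
}

/* The default label guarantees a return value for characters that match
 * none of the cases; -1 is used here as an error indicator. */
int binary_digit_value(char c)
{
    switch (c) {
    case '0':
        return 0;
    case '1':
        return 1;
    default:
        return -1;
    }
}
```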

3.5 Pointer Conversion

Pointers to different types are not required by the standard to be of the same size. As mentioned in Section 3.4, it is safe to cast any pointer type to a void pointer and get back its original value. When a pointer is converted to an integer type, it can produce a value outside the range that can be represented, which leads to undefined behaviour. The size of a pointer depends on the compiler implementation. The C standard requires support for five signed and unsigned integer types. However, only a few requirements are imposed on the sizes of these data types. Different systems use various data models. An overview of the most widely used systems and their data models is given in Table 3.1.

When a variable of one type is converted to a type of a different size, bits are either added to or removed from the value. On most of today’s systems, int is represented by 4 bytes. However, in 64-bit programs the size of pointers is 64 bits. When a 64-bit pointer is converted to the int type, 4 bytes are lost. When this integer value is cast back to the pointer type, dereferencing it will probably lead to a program crash.


Table 3.1: Data models

C type       32-bit system¹   Linux/Unix 64-bit   Windows 64-bit
             ILP32            LP64                LLP64
char         8                8                   8
short        16               16                  16
int          32               32                  32
long         32               64                  32
long long    64               64                  64
pointer      32               64                  64

Sizes are given in bits.

As stated in the C standard, an integer may be converted to any pointer type; the result is implementation-defined and may not point to a valid object. Conversions of an integer to a pointer should therefore be avoided, because dereferencing an invalid pointer produces undefined behaviour.

A programmer who needs to cast a pointer to an integer can use intptr_t and uintptr_t for this purpose. These types were introduced in the C99 standard [8] and can be found in the stdint.h header file. They are guaranteed to be large enough to hold any pointer: a valid pointer to void can be converted to one of these types and back, and the result must compare equal to the original value. However, these types are defined as optional, so they are not guaranteed to exist on all platforms, which limits their usage to implementations that support them. Conversions of a pointer to an integer type of a different size and back are detected by default by the GCC and Clang compilers and by Clang Static Analyzer.
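The round trip through uintptr_t can be sketched as follows. The function names are hypothetical, and the code assumes an implementation that provides the optional uintptr_t type.

```c
#include <stdint.h>

/* Round-trips a pointer through uintptr_t. For a valid pointer to void
 * the result must compare equal to the original pointer. uintptr_t is an
 * optional type, so this only compiles where it exists. */
int roundtrip_is_equal(void *p)
{
    uintptr_t bits = (uintptr_t)p;   /* pointer -> integer */
    void *back = (void *)bits;       /* integer -> pointer */
    return back == p;
}

/* Returns nonzero when a plain int is too narrow to hold a pointer,
 * which is the situation of the LP64 and LLP64 models in Table 3.1. */
int int_is_narrower_than_pointer(void)
{
    return sizeof(int) < sizeof(void *);
}
```

The second function is only a diagnostic aid; its result depends on the data model of the platform.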

3.6 Pointer to Function

Along with pointers to objects, the C programming language has the concept of function pointers. A function pointer holds the address of executable code in memory. A programmer is allowed to cast it to another function pointer type. However, the function which is called through the converted function pointer must be compatible with the pointed-to type, otherwise the behaviour is undefined. Various architectures pass parameters to functions in different ways (pushed on the stack, placed in registers, or a mix of both). These calling conventions are part of the application binary interface. Once the arguments have been pushed on the stack and the jump to the function pointer target has been made, it is too late to check the arguments, so it is the programmer's responsibility to comply with these rules. The C standard guarantees that the original value of the pointer is obtained when it is cast back to the original type.

According to the C standard, function types are compatible when their return types are compatible and when the number and types of their parameters (parameter lists) match. The complete definition can be found in Section 6.7.6.3, Function declarators (including prototypes), of the C standard.

1. Most of today's 32-bit systems are ILP32.

    1. char *(*fp)();
    2.
    3. const char *c;
    4.
    5. fp = strchr;
    6. c = fp('e', "Hello");

Figure 3.6: Pointer to function of different type

3.6.1 Detection

The code in Figure 3.6 invokes undefined behaviour unless the call on line 6 uses the argument types specified by the definition of the library function strchr(). Declaring the function pointer on line 1 without a parameter specification abuses the obsolescent C function declaration style. It allows programmers to pass any arguments, and thus also incompatible ones, to the strchr() function. An empty parameter list means an unspecified number of arguments; it should be avoided by declaring the parameter types, or by using the void keyword when the function takes no arguments. GCC with the option -Wstrict-prototypes warns about declarations that are not prototypes, and using it when compiling is advised.

If a function pointer is declared with a proper prototype, converting it to an incompatible type is detected by default by both the GCC and Clang compilers, which report an assignment from an incompatible pointer type.

The restrictions on comparison of pointers described in Section 3.14 apply also to the comparison of function pointers. Comparing pointers to functions with the equality operators == and != is valid. However, comparing them using the relational operators <, >, <=, >= invokes undefined behaviour, unless the two pointers are equal.
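A function pointer declared with a full prototype can be sketched as follows. The typedef name find_char_fn and the wrapper find_first are hypothetical; only strchr() is a real library function.

```c
#include <string.h>

/* Pointer type that matches the prototype of strchr() exactly:
 * char *strchr(const char *s, int c); */
typedef char *(*find_char_fn)(const char *, int);

/* Calls strchr() through a compatible function pointer. Because the
 * pointer type carries the parameter types, swapping the arguments as
 * in Figure 3.6 would be diagnosed by the compiler. */
const char *find_first(const char *text, int ch)
{
    find_char_fn fp = strchr;
    return fp(text, ch);
}
```

A call such as find_first("Hello", 'e') then returns a pointer to the second character of the string.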

3.7 Modifying String Literal

A string literal in the C programming language is a sequence of characters enclosed in double quotation marks. It has the type array of char or array of wchar_t. The compiler may place a string literal into read-only memory, which makes it unavailable for writing. This treatment of string literals comes from pre-ANSI C, when the language did not have the const keyword. It was the only way to define a function that takes a pointer to char as its argument and to ensure that the object to which it points would not be modified. After the introduction of const in the 1989 C standard, it was not possible to make string literals const, because that would not be backward compatible. That is the reason why the C standard specifies the behaviour "The program attempts to modify a string literal" [8] as undefined. If a string literal is defined:

char *string = "string literal";

then any attempt to modify it results in undefined behaviour, and the program is likely to cause a segmentation fault because it tries to write into read-only memory:

string[7] = 'L';   // undefined behaviour
*string = 'S';     // undefined behaviour

Another situation that can lead to modifying a string literal arises when library functions such as strchr() or strstr() are used.


    char *path = "/home/user/Documents";
    char *pointer;

    pointer = strrchr(path, '/');
    if (pointer) {
        *(pointer+1) = '\0'; // undefined behaviour
    }

Figure 3.7: String literal through function

In the example from Figure 3.7, the string literal is passed as the argument to the strrchr() function. The function returns a pointer that still points into the string literal, and any attempt to modify the literal through this pointer invokes undefined behaviour.

The way to prevent modification of a string literal is to use const qualifiers together with the GCC option -Wwrite-strings. With this option, string literals have the type const char[], and assigning one to a non-const char * produces a warning. Even in more complex code, the user is immediately notified that an assignment to a read-only location has been made. The definition

    const char *string = "string literal";

makes string a pointer to a constant string. An attempt to modify the string through it is reported by the compiler, which helps to prevent undefined behaviour. Besides the GCC option -Wwrite-strings, Splint indicates that the caller may not modify the read-only storage if the aforementioned rules are violated.
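When the text really has to be modified, one option is to copy the literal into an array first. The following sketch shows a well-defined counterpart of the path example from Figure 3.7; the function names keep_directory and directory_length are hypothetical.

```c
#include <string.h>

/* Truncates a path after its last '/' in place. The caller must pass a
 * modifiable buffer; a string literal must not be passed here. */
void keep_directory(char *path)
{
    char *slash = strrchr(path, '/');
    if (slash != NULL) {
        slash[1] = '\0';
    }
}

/* Copies the literal into an array first, so the write is well defined.
 * Returns the length of the truncated path. */
size_t directory_length(void)
{
    char path[] = "/home/user/Documents";  /* array initialised from the literal */
    keep_directory(path);
    return strlen(path);                   /* path is now "/home/user/" */
}
```

The array path is a distinct, writable object; only its initial content comes from the string literal.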

3.8 Sequence Points

The C standard states that if "Between two sequence points, an object is modified more than once, or is modified and the prior value is read other than to determine the value to be stored", the behaviour is undefined [8].

A sequence point in C is a place in program execution where it is guaranteed that all side effects of previously evaluated expressions are complete and no side effects of subsequent expressions have taken place. There are various sequence points, such as the end of a full declarator. The complete list can be found in the C standard [8].

An expression has a side effect when, in addition to returning a value, it also modifies the state of the execution environment. An example of a side effect is the evaluation of the '++' operator. The expression int a = b++; not only initializes the variable 'a'; as a side effect of the '++' operator it also changes the value of 'b'. Understanding side effects is necessary to avoid violating the aforementioned rule, because even a simple expression can consist of multiple sub-expressions with side effects between two sequence points.

    1. a[i++] = a[i];                    // Undefined behaviour
    2. i = i++ + i++;                    // Undefined behaviour
    3. printf("%d %d", a[i++], a[i]);    // Undefined behaviour

Figure 3.8: Sequence points and side effects

The left operand of the assignment operator in line 1 of Figure 3.8 modifies the variable i, which is also read by the right operand. The result is undefined behaviour.

The expression in line 2 of Figure 3.8 is undefined because i is modified more than once between two sequence points, and the order of evaluation of the operands of the '+' operator is not specified. The same holds for most operators, such as '-', '*' and '/'. A way to overcome this is to use one of the operators that introduce a sequence point, such as logical AND '&&', logical OR '||', conditional '?' or comma ',', as shown in Figure 3.9. The end of the first operand of these operators is a sequence point.

    status = abs(x) + abs(y);             // Order of the two calls is unspecified
    status = (a += abs(z), a + abs(y));   // Defined behaviour

Figure 3.9: Sequence points and side effects
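The expressions from Figure 3.8 can be made well defined by splitting them into separate full expressions, so that each modification of i is followed by a sequence point. The sketch below assumes one plausible intent for lines 1 and 2 of that figure; the function names are hypothetical.

```c
/* Figure 3.8, line 1. Intent assumed here: store a[i + 1] into a[i],
 * then advance i. Returns the new index. */
int copy_next(int a[], int i)
{
    a[i] = a[i + 1];   /* i is only read in this full expression */
    i++;               /* the increment is a separate full expression */
    return i;
}

/* Figure 3.8, line 2. Intent assumed here: add two successive values of
 * i, incrementing i after each read. */
int sum_and_advance(int i)
{
    int first = i++;    /* sequence point at the end of the declaration */
    int second = i++;
    return first + second;
}
```

With these versions, GCC's -Wsequence-point has nothing to report.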

3.8.1 Detection

Currently, expressions that can potentially cause this sort of undefined behaviour can be detected by GCC (-Wsequence-point). In general, GCC is effective at detecting this kind of problem, but it may report false positives in some more complicated cases [7]. All examples from this chapter were tested and successfully detected by GCC versions 4.9.0 and 4.4.7. However, the warning messages differ between these two versions of GCC, as shown in Figure 3.10 for the expression from Figure 3.8, line 3. Another tool that detected this undefined behaviour is Splint.

    // GCC version 4.4.7
    warning: operation on ‘i’ may be undefined

    // GCC version 4.9.0
    warning: operation on ‘i’ may be undefined [-Wsequence-point]
    printf("%d %d\n", a[i++], a[i]);
    ^

Figure 3.10: GCC warning about sequence points

3.9 Function Prototype

In old versions of the C programming language, before the C99 standard, functions did not have to be declared before the point of the call. In such a case the function was implicitly declared as a function taking unspecified arguments and returning int. If the implicit declaration was incompatible with the actual definition, the behaviour was undefined. It was the programmer's responsibility to pass correct arguments to the function. The only exception were variadic functions, which required a visible function prototype in scope. Function prototypes were introduced in the ANSI C standard, and implicit declarations have been a violation of the rules since C99.

Even though a function call without a visible declaration is a constraint violation, the declaration is not required to be a function prototype; it can be an old-style declaration that does not specify parameter types. This can result in errors and thus undefined behaviour if not treated properly, as illustrated in Figure 3.11.

    source file #1            source file #2

    int add();             |  int add(int x, int y) {
                           |      return (x + y);
    int main() {           |  }
        int a = add();     |
        return 0;          |
    }                      |

Figure 3.11: Function declaration is not a prototype

In source file 1, adding the declaration int add() before calling the function results in warning-free compilation with GCC and the options -Wall and -Wextra. However, calling the function add() without arguments is clearly undefined behaviour. To prevent similar errors, compiling with strict options is advised. The option -Wstrict-prototypes warns that a function declaration is not a prototype. Clang (version 3.4) lacks an implementation of this warning. Nevertheless, the best practice is to declare a prototype for the function before calling it, as this allows the compiler itself to detect errors.
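A prototype-based variant of the add() example can be sketched as follows. The helper add_defaults is hypothetical and serves only to show a parameterless function declared with void.

```c
/* A prototype declares the parameter types, so a call such as add()
 * with no arguments is rejected by the compiler instead of invoking
 * undefined behaviour at run time. */
int add(int x, int y);

int add(int x, int y)
{
    return x + y;
}

/* A function without parameters is declared with void, not with (). */
int add_defaults(void)
{
    return add(1, 2);
}
```

In a multi-file program the prototype would normally live in a header included by both source files.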

3.10 Pointer Type

The rules for converting pointers to other types are described in Section 3.5. Converting a pointer to an integer type is allowed, but the resulting value must be in the representable range. The C standard specifies another rule concerning pointers and their conversion: if "A pointer is converted to other than an integer or pointer type", the behaviour is undefined [8]. Specifically, a pointer should not be converted to a floating-point type and vice versa.

Situations in which such a conversion may occur concern calls and returns by value and the double data type. If a function is declared to return double but the programmer by mistake returns a double*, and the returned value is assigned to a double variable, the behaviour is undefined. The same undefined behaviour occurs if a double* is passed to a function which expects a double. However, these violations are detected by the GCC and Clang compilers and also by static analysis tools. The Clang compilation errors are shown in Figure 3.12.

error: passing ’double *’ to parameter of incompatible type ’double’;

error: returning ’double *’ from a function with incompatible result type ’double’;

Figure 3.12: Clang compilation error

However, the situation is more complicated with functions taking a variable number of arguments. As described in Section 3.11, the macro va_arg is used to extract the next argument. The type given to va_arg is used to determine the size of each argument. The compiler does not check the types of the passed arguments. If the type and the corresponding argument are not consistent, the resulting data may be misinterpreted or incorrectly aligned, which invokes undefined behaviour.

3.11 Use of Library Functions

Sometimes it is necessary to implement a function with a non-constant number of arguments. The C programming language provides variadic functions for this purpose. An example of such a function is the library function printf(). To access arguments whose number and types are not known to the called function when it is translated, the stdarg.h header has to be included. It provides the type va_list and macros working with this type. First, the va_start macro has to be called. It initializes the variable argument list, that is, it sets a pointer to the first argument of the list. Then the arguments can be accessed sequentially by the macro va_arg.

Figure 3.13 depicts how the stack looks when function arguments are pushed on it (a general case for the purpose of explanation; it does not apply to all system architectures). During initialization, the macro va_start takes the address of the last named function parameter and calculates the position of the next argument. The va_arg macro returns the current argument and advances the pointer by the size of the given type, so that it points to the next argument. When va_arg is called and there are no further arguments, the behaviour is undefined. The compiler itself is unable to determine the number of arguments which the caller intended to pass to the function. A good approach is to make the first argument of a variadic function represent the number of arguments that follow.

Figure 3.13: Macro va_arg called on the stack
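The approach of passing the argument count first can be sketched as follows. The function name sum_ints is hypothetical, and the sketch assumes that all variable arguments have the type int.

```c
#include <stdarg.h>

/* Sums `count` int arguments. The count tells the function when to stop
 * calling va_arg; the caller must still pass exactly `count` values of
 * type int, which the compiler does not verify. */
int sum_ints(int count, ...)
{
    va_list args;
    int total = 0;

    va_start(args, count);           /* count is the last named parameter */
    for (int i = 0; i < count; i++) {
        total += va_arg(args, int);  /* type must match the actual argument */
    }
    va_end(args);

    return total;
}
```

A call such as sum_ints(3, 1, 2, 3) is well defined, whereas sum_ints(3, 1, 2) would make va_arg read an argument that was never passed.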

C library functions such as printf use the format string to specify the number and types of their arguments. If the number of format specifiers exceeds the number of arguments, the va_arg macro is called when there are no further arguments, which invokes undefined behaviour. When a format specifier does not match the data type of the corresponding argument, the behaviour is also undefined: va_arg takes the data type named by the specifier as the type of the next argument and advances to the following argument according to its size.

3.11.1 Detection

An invalid number of arguments and incorrect data types in calls to the library function printf are detected by all tools covered by this Master's Thesis. When using GCC, the option -Wall or the specific options -Wformat and -Wformat-extra-args need to be enabled. Even though calling printf with incorrect arguments may appear trivial, it still invokes undefined behaviour and the results may be unexpected. However, no tool was able to detect undefined behaviour invoked by incorrect arguments to variadic functions in general and to the va_ macros. The function itself has to validate its input arguments to detect these violations.

3.12 Pointer Arithmetic

"Addition or subtraction of a pointer into, or just beyond, an array object and an integer type produces a result that does not point into, or just beyond, the same array object." [8]

The C language has the ability to access memory addresses in an efficient manner. Pointers are a powerful tool for this purpose, but using them requires deep understanding because their incorrect usage is the source of many programming bugs. Array subscripting in C is defined in terms of pointer arithmetic, although arrays are not pointers themselves. In particular, array[i] is equivalent to *(array + i), where array is converted to a pointer to its first element. Because addition is commutative, the expressions *(array + i) and *(i + array) have the same meaning. The latter can be further rewritten as i[array], which is equivalent to array[i].

If an integer value is added to or subtracted from a pointer, the result has the type of the pointer operand. The behaviour is well-defined as long as the resulting pointer points into the respective array. If it points one element past the end of the array, the computation is still guaranteed not to overflow. However, in such a case the pointer must not be dereferenced, or the behaviour is undefined.

Figure 3.14 shows an example of a function that has undefined behaviour. Calling the function with a negative argument invokes undefined behaviour. That is why the check on line 4 has to be altered to if (index >= 0 && SIZE > index). If SIZE == index, the pointer points one element past the array boundary. It is guaranteed not to overflow, but it must not be dereferenced.

    1. char array[SIZE];
    2.
    3. char *func(int index) {
    4.     if (SIZE > index) {
    5.         return index + array;
    6.     }
    7.     return NULL;
    8. }

Figure 3.14: Undefined behaviour in pointer arithmetic

Another way to check the boundaries of an array is to use pointer arithmetic directly in the condition. Line 4 of Figure 3.14 could be rewritten as if (array + index < &array[SIZE]) { /* do something */ }. In this case, if index is too large, the addition can overflow. The result may then be a pointer value that is lower than the address of the array, so it is necessary to add a test for overflow: if (array + index < &array[SIZE] && !(array + index < array)) { /* do something */ }

The problem is that even though the check now appears to ensure that the resulting pointer is within the range of the array object, the code may still be vulnerable to buffer overflow after optimization, thus invoking undefined behaviour. The reason is that arithmetic overflow of pointers is undefined, so the C standard permits a compiler to assume that the test for overflow is always false and to eliminate it from the expression. This behaviour was identified as a vulnerability of GCC version 4.2 [2]. An identical bug was found in lib/vsprintf.c of the Linux kernel.
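A check that compares the integer index against the array size, before any pointer is formed, does not rely on pointer overflow at all. The following sketch reuses the names array and SIZE from Figure 3.14; the value 16 and the function name element_at are assumptions made only for illustration.

```c
#include <stddef.h>

#define SIZE 16   /* assumed value, for illustration only */

static char array[SIZE];

/* Returns a pointer to array[index], or NULL when index is out of range.
 * Only integers are compared, so no out-of-range pointer is ever computed
 * and the compiler has no undefined overflow to reason about. */
char *element_at(int index)
{
    if (index < 0 || index >= SIZE) {
        return NULL;
    }
    return array + index;
}
```

Because the test involves no pointer arithmetic, it cannot be removed on the grounds that pointer overflow is undefined.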

3.12.1 Detection

Expressions that result in undefined behaviour can be detected by Clang Static Analyzer, which yields a warning about the use of an uninitialized value. Clang AddressSanitizer reports that the memory access overflows the variable.

3.13 Shifting Overflow

When using the bitwise shift operators, two types of shift should be considered: arithmetic and logical. A logical shift fills the vacated bits with zeros for both left and right shifts. An arithmetic shift works in the same way for the left shift, but when shifting right, instead of filling the vacated bits with zeros it copies the most significant bit, thus preserving the sign of the operand. A left arithmetic shift is equal to multiplication by a power of 2. However, there is a discrepancy between the logical right shift, which is equivalent to division by a power of 2 for positive and unsigned numbers, and the arithmetic right shift, which is equivalent to division only for non-negative numbers. That is the reason why, according to the C standard, shifts where the right operand is negative, or is greater than or equal to the width of the promoted left operand, invoke undefined behaviour.

    1. int R = E1 << E2;
    2.
    3. if (E2 < 0) {
    4.     //handle to prevent undefined behaviour
    5. }
    6. if (E2 >= sizeof(unsigned int) * CHAR_BIT) {
    7.     //handle to prevent undefined behaviour
    8. }

Figure 3.15: Shift overflow check

The programmer must ensure that the value of the expression E2 is not negative and is not greater than or equal to the width of the promoted expression E1, as seen in Figure 3.15. However, in special cases the check on line 6 may not be sufficient. There are architectures (portable platforms) that use padding bits in unsigned int, which means that not all bits are used to store the value. In such a case the expression sizeof(unsigned int) * CHAR_BIT yields the size of the object representation including padding bits, not the precision of the type. That is why a function such as __builtin_popcount(), applied to the maximum value of the type, should be used to obtain the precision instead.
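The two checks can be placed in front of the shift, as in the following sketch. The function name shift_left_checked is hypothetical, and the code assumes that unsigned int has no padding bits, so that sizeof(unsigned int) * CHAR_BIT equals its width.

```c
#include <limits.h>

/* Shifts value left by count bits. Returns 0 and stores the result in
 * *result on success, or returns -1 when the shift count is negative or
 * not smaller than the number of bits in unsigned int.
 * Assumption: unsigned int has no padding bits. */
int shift_left_checked(unsigned int value, int count, unsigned int *result)
{
    if (count < 0 || (unsigned int)count >= sizeof(unsigned int) * CHAR_BIT) {
        return -1;
    }
    *result = value << count;   /* both operands are now in the valid range */
    return 0;
}
```

An unsigned left operand is used on purpose: bits shifted out of an unsigned value are simply discarded, so only the shift count has to be validated.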


3.13.1 Detection

The detection of shifting overflows is implemented in several tools. GCC's ubsan [14], which is a dynamic analyzer, has this detection fully implemented, as does the Clang dynamic analyzer. Among static analyzers, Splint can detect shift overflow. However, the best practice is to avoid invoking shifting overflows.

3.14 Pointers Comparison

Along with arithmetic operations, pointers can also be compared using relational operators. The pointers that are compared should point to the same type, or one of them should be of the type void*; in that case the other pointer is converted to void*. When pointers are compared, the result depends on the relative locations of the pointed-to objects in the address space of the program.

If there is an array of char and two pointers into this array, using relational operators with these pointers is legal and well-defined in the C standard. The addresses of individual characters in the array increase by the size of the char data type, and thus can be compared. However, the comparison is valid only for pointers pointing into the same object. If the pointers being compared do not point to the same aggregate or union, the behaviour is undefined. This limitation is caused by the existence of different memory models. The C language supports many architectures. Its standard is abstract enough to cover various, even older, platforms. To achieve that, it leaves certain aspects undefined.

The flat (linear) memory model represents memory as a continuous address space. There is no need to apply any form of memory segmentation. With this approach, comparing pointers to distinct objects would be meaningful, because it would be apparent which object has the lower or higher address.

If the memory layout is segmented, the address of the pointed-to object is determined by a segment and an offset. Comparing offsets within a single segment is simple, but segments may be organized non-linearly, which makes comparison difficult.
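The well-defined case, two pointers into the same array, can be sketched as follows. The function name count_range is hypothetical.

```c
#include <stddef.h>

/* Counts the elements in the half-open range [first, last).
 * Both pointers must point into (or one past the end of) the same array;
 * otherwise the relational comparison is undefined. */
size_t count_range(const char *first, const char *last)
{
    size_t n = 0;
    while (first < last) {   /* valid only within a single array object */
        first++;
        n++;
    }
    return n;
}
```

Calling the function with pointers into two distinct arrays would compile without any diagnostic, which is exactly the situation that is hard to detect.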


3.14.1 Detection

Currently no tool covered by this Master's Thesis was able to detect this undefined behaviour. There is an experimental implementation of an invalid-pointer-pair detector in Clang AddressSanitizer. The current implementation has both false positives and false negatives and is disabled by default. Testing it on code which invokes undefined behaviour by comparing pointers to two distinct local arrays produced a false negative, because the comparison was not detected. After the arrays in the same program were made dynamically allocated, the error was reported; the result can be found in Figure 3.16.

$ clang -g -fsanitize=address -mllvm -asan-detect-invalid-pointer-pair=1 program.c
$ ASAN_OPTIONS=detect_invalid_pointer_pairs=1 ./a.out
======22452==ERROR: AddressSanitizer: invalid-pointer-pair: 0x62eff0 0x62efd0
    #0 0x4b8efa in main /home/user/Program/program.c:38
    #1 0x3c1c421734 in __libc_start_main (/lib64/libc.so.6+0x3c1c421734)
    #2 0x4b8b7c in _start (/home/user/Program/a.out+0x4b8b7c)

0x60200000eff0 is located 0 bytes inside of 4-byte region [0x62eff0,0x62eff4)
allocated by thread T0 here:
    #0 0x49ce3b in __interceptor_malloc
    #1 0x4b8dae in main /home/user/Program/program.c:31
    #2 0x3c1c421734 in __libc_start_main (/lib64/libc.so.6+0x3c1c421734)

0x60200000efd0 is located 0 bytes inside of 4-byte region [0x62efd0,0x62efd4)
allocated by thread T0 here:
    #0 0x49ce3b in __interceptor_malloc
    #1 0x4b8def in main /home/user/Program/temp/program.c:32
    #2 0x3c1c421734 in __libc_start_main (/lib64/libc.so.6+0x3c1c421734)

SUMMARY: AddressSanitizer: invalid-pointer-pair /home/user/Program/program.c:38 main
==22452==ABORTING

Figure 3.16: Invalid pointer comparison detection

Run-time detection of comparisons between pointers to different objects can be achieved using the DWARF debugging information [4] emitted by the compiler. There are libraries, such as libdwarf [15], which simplify the implementation of applications working with the DWARF format. The experimental Valgrind tool SGCheck uses DWARF to detect stack and global array overruns.

3.15 Integer Constant Expression

Constants in C are of several kinds: integer constants, which are composed of digits and can be hexadecimal, octal or decimal; integer character constants, which are a character enclosed in single quotes; enumeration constants, which are the identifiers in a definition of an enumeration data type; floating-point constants; and address constants.

A constant expression is created by applying operators to constants. It is an expression that can be evaluated at compile time and cannot be changed at run time. The compiler does not need information about the execution environment to compute such expressions, but strict rules concerning what a constant expression is and what it can and cannot contain must be observed. An integer constant expression must be of an integer type and may contain only the following:

1. Integer, enumeration and character constants

2. Sizeof expressions with an integer constant result

3. Floating constants that are the immediate operands of casts

4. Casts that convert arithmetic types to integer types

Otherwise the behaviour is undefined. Integer constant expressions are required in the following contexts:

1. After the keyword case in a switch statement

2. Value of enumeration constant

3. Bit field width specifier

4. Preprocessor #if directive

In previous versions of the C standard, the size in an array declarator was required to be an integer constant expression. However, C99 adds variable length arrays, making the requirement obsolete. Violation of these rules is detected and treated as an error by the GCC and Clang compilers.

It is important to note that const-qualified variables are different from the aforementioned constant expressions. Const-qualified objects cannot be used in constant expressions. They may be placed by the compiler into read-only memory. An attempt to modify a const object through an lvalue of non-const type invokes undefined behaviour.
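The contexts listed above can be illustrated by a short sketch. All identifiers (BUFFER_SIZE, struct flags, classify) are hypothetical.

```c
enum { BUFFER_SIZE = 4 * 16 };              /* value of an enumeration constant */

struct flags {
    unsigned int ready : 1;                 /* bit-field width */
    unsigned int mode  : BUFFER_SIZE / 16;  /* enumeration constant in an expression */
};

/* case labels must be integer constant expressions. BUFFER_SIZE
 * qualifies, while a `const int` variable would not. */
int classify(int n)
{
    switch (n) {
    case 0:
        return 0;
    case BUFFER_SIZE:
        return 1;
    case (int)2.0:    /* floating constant as the immediate operand of a cast */
        return 2;
    default:
        return -1;
    }
}
```

Replacing BUFFER_SIZE with a const-qualified int variable in the case label makes the compiler reject the program.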

3.16 Longjmp/Setjmp

In the C programming language, various control flow mechanisms exist. Besides conditional expressions, switch-case statements and loops, there are unconditional transfers of control, such as the goto statement. Executing it simply bypasses any code up to the predefined label. However, it is limited to local jumps within a single function.

If there is a need to implement a non-local, far jump in the program execution, the macro setjmp and the function longjmp from setjmp.h are available. The macro setjmp stores the current environment (the program state), which can later be restored by calling the function longjmp. It restores registers, such as the stack pointer and the frame pointer. By restoring the program counter, execution continues where setjmp was originally called.

These non-local jumps can be used to implement an exception mechanism in C. However, they are mainly used when an error is detected deep in a long chain of procedure calls and can be conveniently handled by a higher-level call. On the other hand, there are drawbacks which may be a reason to reconsider the usage of setjmp/longjmp. When the program is restored to a previously stored state, references to dynamically allocated memory may be lost. These functions also make the program hard to understand and maintain, so an alternative approach should be used if possible.

The C standard specifies conditions which lead to undefined behaviour when calling setjmp and longjmp. If an automatic, non-volatile variable defined in the function which calls setjmp is altered between the call to setjmp and the call to longjmp, its value is indeterminate after the jump and the behaviour is undefined. Figure 3.17 shows an example of code that leads to undefined behaviour.

In the main() function, the buffer for storing the environment information is created and passed to the a() function. This function then calls the setjmp macro. Figure 3.18, part A shows the stack after calling the function a(). The variable i is pushed on the top of the stack and the state of the stack is then stored in the jmp_buf array. Control then returns to main(), which calls the function b(). The state of the stack after calling b() is depicted in Figure 3.18, part B. Then the function longjmp() is called. It restores the registers to the values they had when a() called the macro setjmp. At this moment, the program counter is set to point just past the call to setjmp() in a(). The state of the stack is shown in Figure 3.18, part C. The stack is corrupted, because a() expects a char*, where actually the int value 3 is stored.

Because the function a(), which called the macro setjmp(), has already returned, the behaviour is undefined. The C standard specifies that longjmp cannot be used to return to a function which has already terminated its execution.

    int a(jmp_buf env, char *string)   |  int main(int argc, char **argv)
    {                                  |  {
        int i;                         |      jmp_buf env;
                                       |
        i = setjmp(env);               |      if (a(env, "String") != 0)
                                       |      {
        return i;                      |          return 0;
    }                                  |      }
                                       |
    void b(int i, jmp_buf env)         |      b(3, env);
    {                                  |  }
        longjmp(env, i);               |
    }                                  |

Figure 3.17: Invalid longjmp call
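For contrast, a use of setjmp/longjmp that stays within the rules can be sketched as follows: the function that called setjmp is still active when longjmp is executed, and the local variable that is modified in between is declared volatile. The names recovery_point, process and process_all are hypothetical.

```c
#include <setjmp.h>

static jmp_buf recovery_point;

/* Hypothetical worker: reports a failure by jumping back to the caller. */
static void process(int value)
{
    if (value < 0) {
        longjmp(recovery_point, 1);   /* the caller of setjmp is still active */
    }
}

/* Returns the number of values processed before the first negative one.
 * `done` is modified between setjmp and longjmp and read afterwards, so
 * it is declared volatile to keep its value determinate. */
int process_all(const int *values, int count)
{
    volatile int done = 0;

    if (setjmp(recovery_point) != 0) {
        return done;                  /* reached via longjmp */
    }
    for (int i = 0; i < count; i++) {
        process(values[i]);
        done = done + 1;
    }
    return done;
}
```

Unlike in Figure 3.17, the stack frame of process_all still exists when longjmp is called, so the saved environment remains valid.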

3.16.1 Detection

Detection of the aforementioned undefined behaviour was unsuccessful with all tools used in this Master's Thesis. Clang's dynamic analyzer AddressSanitizer reported the message "AddressSanitizer can not provide additional info". However, this is the result of the corrupted stack, not detection of the undefined behaviour itself.


Figure 3.18: Content of the stack, during setjmp/longjmp invocation

3.17 Multiple External Definitions

Objects in the C programming language can be internal, defined inside a function, or external, declared outside a function. Only external objects can be visible across source files. To describe the accessibility of objects between source files, or even within a single file of a C program, the C standard uses the term linkage. Linkage can be of three types: internal, external and no linkage. Objects with internal linkage are not accessible from other files, and thus can be referenced only by functions within the same file. To give an external object internal linkage, the keyword static is used. Table 3.2 summarizes linkage types versus object types.


Table 3.2: Linkage and object types

Linkage type   Object type   Accessibility
external       external      throughout the program
internal       external      a single file
no linkage     internal      local to a function

Because objects with external linkage are visible throughout the program, across source files, there are strict rules that must be observed to avoid conflicts. Multiple external definitions of the same object must not occur in a program, or the behaviour is undefined. This requirement is easily violated when a header file contains a variable declaration without the keyword extern. When a header file is included, it is copied to the place where the #include preprocessor directive is located, and thus into all source files which include the header file. Even though int variable; may look like a variable declaration, it is a tentative definition. A tentative definition is any external declaration without a storage class specifier and without an initializer. It becomes a full definition at the end of the translation unit if no other definition of this variable has appeared. Including a header file with such a variable in more than one source file invokes undefined behaviour.

This misuse of tentative definitions is not detected by any of the tools described in this thesis. If the tentative definition is located directly in a source file, Splint reports: "Function or variable is redefined. One of the declarations should use extern." The extern storage class specifier turns a tentative definition into a mere declaration.
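The usual arrangement, an extern declaration in the header and exactly one definition in a single source file, can be sketched as follows. The file names counter.h and counter.c and the identifiers are hypothetical; both parts are shown in one listing for brevity.

```c
/* --- counter.h (hypothetical header): declarations only ------------- */
extern int counter;            /* extern: no storage is defined here */
void counter_increment(void);

/* --- counter.c (hypothetical source): the single definition --------- */
int counter = 0;               /* exactly one definition in the program */

void counter_increment(void)
{
    counter++;
}
```

Every other source file includes only the header, so it sees a declaration of counter and never creates a second definition.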

3.18 Pointer to FILE

The C standard I/O library is used to work with files and file streams; the header stdio.h must be included. Files are accessed through a pointer to the FILE structure, whose content is implementation-defined. The structure contains the information needed to control the stream. The C standard states that if "The value of a pointer to a FILE object is used after the associated file is closed", the behaviour is undefined [8].

FILE objects are created by calling the functions tmpfile and fopen. They allocate memory, create and populate the FILE structure and return a pointer to it. The memory is deallocated by calling the function fclose. After that the structure is freed (destroyed), but the pointer still holds the address of the structure, whose content is now indeterminate.

If a library function is called with such a released FILE pointer, the Cppcheck tool reports "(error) Dereferencing ’fp’ after it is deallocated / released." The Valgrind tool Memcheck detects invalid reads. However, both fail to detect the use of the standard streams stdin, stdout and stderr after they have been closed.

The solution is to implement replacements of the functions that take a FILE* argument, which check whether the pointer refers to a file that has already been closed. Such a function can then be used to override the library function. The shared library with the overridden library functions can be preloaded via the LD_PRELOAD environment variable. The complete implementation of the shared library with the overridden functions fclose, fopen and fgetc is in Appendix A.

With the provided implementation, calling fopen stores the returned FILE* in the static global array FILE *fileBuffer[FOPEN_MAX]. A pointer to every opened file is stored in this array. The constant FOPEN_MAX is the number of streams that the implementation guarantees can be open simultaneously. Every call to the overridden fgetc function checks whether the provided file pointer is present in fileBuffer[]. If it is not, fgetc fails and reports an error: it cannot be called on a file pointer that is not open. When fclose is called, the provided file pointer is removed from fileBuffer[], because the file has been closed.
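Where the source code of the program can be changed, a lighter complement to the preloaded library is to reset the pointer when the file is closed. The following is a minimal sketch of that idea, not part of the thesis implementation; the helpers close_and_clear and checked_fgetc are hypothetical names.

```c
#include <stdio.h>

/* Closes *fpp (if it is open) and resets it to NULL, so that a later use
 * can be recognised instead of passing an indeterminate pointer around.
 * Returns the result of fclose, or 0 when there was nothing to close. */
int close_and_clear(FILE **fpp)
{
    int result = 0;

    if (fpp != NULL && *fpp != NULL) {
        result = fclose(*fpp);
        *fpp = NULL;
    }
    return result;
}

/* Reads one character, but refuses a pointer that was cleared by
 * close_and_clear. */
int checked_fgetc(FILE *fp)
{
    if (fp == NULL) {
        return EOF;
    }
    return fgetc(fp);
}
```

This only protects the one pointer variable passed to close_and_clear. Copies of the pointer held elsewhere still refer to the released structure, which is the case the preloaded library is meant to catch.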

3.19 Output Functions

According to the C standard, if a formatted output function transmits more than INT_MAX characters, the behaviour is undefined. The formatted output functions include fprintf, printf, snprintf, sprintf, vfprintf, vprintf, vsnprintf and vsprintf.

Even though their functionality differs, all of these functions return a value of type int, as can be seen in Figure 3.19. As described in Section 3.2, overflow of a signed integer value causes undefined behaviour. The same rule applies to formatted input functions.

int __printf (const char *format, ...)
{
  va_list arg;
  int done;

  va_start (arg, format);
  done = vfprintf (stdout, format, arg);
  va_end (arg);

  return done;
}

Figure 3.19: Implementation of printf in glibc-2.20

The implementation of these output functions predates more portable data types such as size_t, which is defined as an unsigned integer type and could be used as the return type instead of int. The C output functions return a negative value if an error occurred. To preserve the ability to signal failure, ssize_t or error codes could be used instead. The C POSIX library also implements the output functions as returning int. If an error occurs during evaluation, the error number is set. The function snprintf returns a negative value and sets the error number to EOVERFLOW if the value of n is greater than INT_MAX, or if the number of bytes needed to hold the output, excluding the terminating null, is greater than INT_MAX.
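Since a negative return value is the only error signal of these functions, the caller has to check it explicitly. The sketch below shows one possible check around snprintf; the function format_pair and its "name=value" format are hypothetical and serve only as an illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Formats "name=value" into buf of the given size.
 * Returns 0 on success and -1 if snprintf reported an error
 * (negative return value) or if the output did not fit into buf. */
int format_pair(char *buf, size_t size, const char *name, int value)
{
    int written = snprintf(buf, size, "%s=%d", name, value);

    if (written < 0) {
        return -1;              /* output or encoding error */
    }
    if ((size_t)written >= size) {
        return -1;              /* output was truncated */
    }
    return 0;
}
```

The cast to size_t is performed only after the negative case has been excluded, so the comparison with size is not affected by the conversion of a negative int to an unsigned type.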

3.20 Copy of Overlapping Objects

According to the C standard, certain library functions must not be used on overlapping memory blocks: "If an attempt is made to copy an object to an overlapping object by use of a library function, other than as explicitly allowed, the behaviour is undefined." [8] Overlapping memory blocks may occur within one object. A common source of overlapping memory regions is pointer arithmetic on strings.


The C language comes with the keyword restrict. It is used in pointer declarations to inform the compiler that pointers of the same type do not alias each other. With restrict the programmer declares that any access to the object will be made through the restrict-qualified pointer or its copies, so that additional optimizations can be performed. If the objects referenced by such function arguments overlap (they share the same memory addresses), the behaviour is undefined. This concerns library functions that access memory, such as strcpy and strcat.

A common function that causes undefined behaviour when the aforementioned restriction is violated is the library function memcpy(). This function makes an exact byte-for-byte copy between two memory blocks. If the memory blocks overlap, it is not guaranteed that parts of the source region will not be overwritten before they are copied to the destination region. In such cases it is necessary to use the safer, but slower, library function memmove(), which behaves as if a temporary array were used to store the characters before copying them.

/* returns 1 if the blocks a[0..n-1] and b[0..n-1] overlap, 0 otherwise */
int overlap(const char *a, const char *b, size_t n) {

    size_t i;

    for (i = 0; i < n; i++) {
        if ((a + i == b) || (b + i == a)) {
            /* memory blocks are overlapping */
            return 1;
        }
    }
    /* memory blocks aren’t overlapping */
    return 0;
}

Figure 3.20: Detect overlapping blocks

However, not every library function with this kind of restriction has an adequate replacement (snprintf(), sprintf()). One way to check memory blocks before such a call is shown in Figure 3.20 as a simple function. The function compares the addresses of the memory regions and checks whether the memory blocks overlap.
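For copies within a single buffer, where overlap is expected, memmove() can be used directly. The following minimal sketch removes a prefix of a string in place; the function name drop_prefix is hypothetical.

```c
#include <string.h>

/* Removes the first k characters of the string s in place.
 * The source (s + k) and the destination (s) may overlap, so memmove
 * is used; memcpy would invoke undefined behaviour in that case. */
void drop_prefix(char *s, size_t k)
{
    size_t len = strlen(s);

    if (k > len) {
        k = len;
    }
    /* copy the remaining len - k characters and the terminating null */
    memmove(s, s + k, len - k + 1);
}
```

For example, calling drop_prefix on an array holding "undefined" with k equal to 2 leaves "defined" in the same array.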

3.20.1 Detection

The use of library functions on overlapping memory regions is currently detected only by dynamic analyzers: Clang with the option -fsanitize=address and the Valgrind tool Memcheck. Another way to discover cases that may exhibit undefined behaviour through overlapping memory objects is the library Memstomp [11]. It identifies function calls that use overlapping memory regions. If such a function call is detected, a backtrace to the problem is displayed. An important advantage of Memstomp in comparison to Valgrind is that it can be used on existing applications without recompiling them and that it is significantly faster. It is preloaded into an application to detect calls to functions with overlapping memory addresses.
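The kind of run-time check such tools perform can be approximated in the program itself. The sketch below is not the implementation of Memstomp or of any other tool mentioned here; it is a hypothetical wrapper (regions_overlap, checked_memcpy) that assumes a flat address space in which converting pointers to uintptr_t preserves their relative distance.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the blocks [a, a + n) and [b, b + n) share at least one
 * byte, 0 otherwise. Assumes a flat address space (see the text above). */
int regions_overlap(const void *a, const void *b, size_t n)
{
    uintptr_t pa = (uintptr_t)a;
    uintptr_t pb = (uintptr_t)b;
    uintptr_t distance = (pa > pb) ? pa - pb : pb - pa;

    return n > 0 && distance < n;
}

/* memcpy wrapper that refuses to copy overlapping blocks.
 * Returns dst on success and NULL when the blocks overlap. */
void *checked_memcpy(void *dst, const void *src, size_t n)
{
    if (regions_overlap(dst, src, n)) {
        return NULL;
    }
    return memcpy(dst, src, n);
}
```

Compared with the loop in Figure 3.20, the check takes constant time, at the cost of relying on the integer representation of pointers.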

4 Conclusion

The aim of this thesis was to describe what exactly undefined behaviour in the C language is. The first part of the thesis contains the formal definition of undefined behaviour and its relation to unspecified and implementation-defined behaviour. C is a powerful language that produces efficient and portable programs. The second chapter explains how performance and portability are achieved by leaving several areas of the C language undefined.

However, if the rules specified in the C standard are not followed and undefined behaviour occurs, anything may happen. The outcome depends on the compiler implementation, the system architecture on which the erroneous application is run and the level of optimization performed. The second chapter also provides examples of undefined behaviour that arise when optimizations are turned on.

Undefined behaviour may occur in various situations. They range from incorrect use of library functions to errors in preprocessor directives. The functions and expressions most likely to invoke undefined behaviour are described in Chapter 3. Undefined behaviour can be detected by compilers during compilation (void expression misuse), but more often an external static or dynamic tool is required.

The Splint static analyzer detected errors that other tools were unable to discover, for example misuse of tentative definitions or modification of a string literal. GCC detected violations of the C standard rules during compilation, but some of the checks had to be enabled explicitly. GCC comes with an undefined behaviour sanitizer too. The option -fsanitize=undefined enables ubsan, a fast undefined behaviour detector. It provides checks for, among others, integer division by zero, shift overflow and signed integer overflow. The Clang dynamic analyzers offer additional error detectors, such as AddressSanitizer or MemorySanitizer.

The experimental checker invalid-pointer-pair of the AddressSanitizer was the only tool that found the comparison of pointers to different objects.

However, there are errors causing undefined behaviour that are currently not detected by any of the tools covered by this thesis: when a C standard formatted output function outputs more than INT_MAX characters, when a pointer to a FILE object is used after the file has been closed, or when variadic functions are called with incorrect arguments. To detect these, explicit checks need to be implemented.

The thesis also includes a set of example programs, which demonstrate the respective undefined behaviours. Their minimalist form helps to understand where the code is prone to cause undefined behaviour.

Bibliography

[1] Bruce Perens. Electric Fence [online]. URL: http://elinux.org/Electric_Fence [cited 2014-12-12].

[2] CERT Software Engineering Institute. C compilers may silently discard some wraparound checks [online]. URL: http://www.kb.cert.org/vuls/id/162289 [cited 2014-12-01].

[3] Daniel Marjamaki. A tool for static C/C++ code analysis [online]. URL: http://cppcheck.sourceforge.net/ [cited 2014-06-12].

[4] DWARF Committee. The DWARF Debugging Standard [online]. URL: http://dwarfstd.org/ [cited 2014-12-01].

[5] GNU Project. The GNU C Library: Malloc Tunable Parameters [online]. URL: http://www.gnu.org/software/libc/manual/html_node/Malloc-Tunable-Parameters.html [cited 2014-11-12].

[6] GNU Project. The GNU Multiple Precision Arithmetic Library [online]. URL: https://gmplib.org/ [cited 2014-09-30].

[7] IBM X-Force. Linux kernel sctp_setsockopt integer buffer overflow [online]. URL: http://xforce.iss.net/xforce/xfdb/16117 [cited 2014-05-09].

[8] ISO/IEC 9899:1999. Programming languages – C [online]. URL: http://www.iso.org/iso/iso_catalogue/catalogue_ics/catalogue_detail_ics.htm?csnumber=29237 [cited 2014-05-09].

[9] ISO/IEC 9899:2011. Information technology – Programming languages – C [online]. URL: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=57853 [cited 2014-05-09].

[10] ISO/IEC/IEEE 60559:2011. Information technology – Microprocessor Systems – Floating-Point arithmetic [online]. URL: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=57469 [cited 2015-01-01].

[11] Jeff Law. Improvements in memstomp [online]. URL: http://developerblog.redhat.com/2014/10/07/improvements-in-memstomp/ [cited 2014-12-21].

[12] LLVM Developer Group. clang: a C language family frontend for LLVM [online]. URL: http://clang.llvm.org/index.html [cited 2014-06-08].

[13] LLVM Developer Group. Clang Static Analyzer [online]. URL: http://clang-analyzer.llvm.org/ [cited 2014-06-12].

[14] Marek Polacek. GCC Undefined Behavior Sanitizer – ubsan [online]. URL: http://developerblog.redhat.com/2014/10/16/gcc-undefined-behavior-sanitizer-ubsan/ [cited 2014-12-21].

[15] SGI. libdwarf - DWARF debugging information [online]. URL: http://libdwarf.sourceforge.net/ [cited 2014-12-01].

[16] The Splint Developers. Splint - Secure Programming Lint [online]. URL: http://www.splint.org/ [cited 2014-06-12].

[17] The Valgrind Developers. Valgrind [online]. URL: http://valgrind.org/ [cited 2014-06-08].

A Shared Library Implementation

fpointers.c

#define _GNU_SOURCE             /* needed for RTLD_NEXT */

#include <stdio.h>
#include <dlfcn.h>

/* pointers to all currently opened files; empty slots are NULL */
static FILE *fileBuffer[FOPEN_MAX];
static int position = 0;

void addOpenedFile(FILE *fp)
{
    int isFull = 0;

    if (fp == NULL) {
        /* fopen failed, there is nothing to store */
        return;
    }
    while (fileBuffer[position] != NULL) {
        position++;
        if (position == FOPEN_MAX) {
            /* when end of the buffer is reached, go to its beginning */
            position = 0;
            if (isFull) {
                /* cannot open more files than FOPEN_MAX */
                return;
            }
            isFull = 1;
        }
    }
    fileBuffer[position] = fp;
}

void removeClosedFile(FILE *fp)
{
    int i = 0;

    for (i = 0; i < FOPEN_MAX; i++) {
        if (fp == fileBuffer[i]) {
            fileBuffer[i] = NULL;
            break;
        }
    }
}

int isOpened(FILE *fp)
{
    int i = 0;

    if (fp == NULL) {
        /* NULL would otherwise match an empty slot */
        return 0;
    }
    for (i = 0; i < FOPEN_MAX; i++) {
        if (fp == fileBuffer[i]) {
            return 1;
        }
    }
    return 0;
}

FILE *fopen(const char *path, const char *mode)
{
    FILE *(*original_fopen)(const char *, const char *);

    original_fopen = dlsym(RTLD_NEXT, "fopen");
    FILE *fp = (*original_fopen)(path, mode);
    addOpenedFile(fp);          /* add fp to fileBuffer[] */
    return fp;
}

int fclose(FILE *fp)
{
    int (*original_fclose)(FILE *) = dlsym(RTLD_NEXT, "fclose");

    removeClosedFile(fp);       /* remove fp from fileBuffer[] */
    return original_fclose(fp);
}

int fgetc(FILE *fp)
{
    int (*original_fgetc)(FILE *) = dlsym(RTLD_NEXT, "fgetc");

    if (isOpened(fp)) {
        return original_fgetc(fp);
    } else {
        /* file has been already closed */
        return EOF;
    }
}

B Content of the CD-ROM

The attached CD-ROM contains:

• directory text: Text of the Master’s Thesis
• directory sources: Source files of the sample programs and the shared library
