The Semantics of C++ Data Types: Towards Verifying Low-Level System Components

Michael Hohmuth, Hendrik Tews
Dresden University of Technology, Department of Computer Science
vfi[email protected]
July 3, 2003

This work was supported by the Deutsche Forschungsgemeinschaft (DFG) through DFG grant Re 874/2-1.

In order to reason formally about low-level system programs, one needs a semantics (for the programming language in question) that can deal with programs that are not statically type-correct. For system-level programs, the semantics must deal with such heretical constructs as casting integers to pointers and converting pointers between incompatible base types. In this paper we describe a formal semantics for the data types of the C++ programming language that is suitable for low-level programs in the above sense. This work is part of a semantics for a large subset of the C++ programming language developed in the VFiasco project. In the VFiasco project we aim at the verification of substantial properties of the Fiasco microkernel, which is written in C++.

1 Introduction

The VFiasco [21] project aims at the verification of substantial properties of the Fiasco [9] microkernel for x86 PC hardware (more precisely, for IA32-based systems). Fiasco is a real-time microkernel operating system. It has been developed in the context of the DROPS project [7] and supports the flexible construction of applications with security or quality-of-service requirements. As a microkernel, Fiasco has a minimal interface and supports only the absolutely necessary operating-system functionality. There are, for instance, no device drivers included in Fiasco. For legacy applications it is possible to boot Linux on top of Fiasco [8].

Fiasco is almost entirely written in C++ [20]. Only those operations that cannot be performed in C++ (like accessing CPU control registers) are written in inline assembler.

The properties that we attempt to prove in the VFiasco project include the following:
• Fiasco’s internal page-fault handler terminates for all kernel page faults
• Fiasco’s internal memory allocation works correctly

For the verification, we plan to employ a state-of-the-art general-purpose theorem prover. Our current development focuses on PVS [17]. The C++ source code of Fiasco will be fed into a semantics compiler that generates the semantics of the input as a shallow embedding¹ in the higher-order logic of PVS. Specifications and properties will be formulated directly in the logic of PVS. The whole verification process can be depicted as follows:

[Figure: verification pipeline. Fiasco source (in C++) ⇒ semantics compiler ⇒ C++ semantics (in HOL) ⇒ PVS ⇒ q.e.d.; the properties enter PVS through their formalization.]

The semantics compiler is currently being developed on the basis of the OpenC++ package [4].

For the translation of Fiasco’s source code we need a semantics for a substantial part of C++. In this paper we concentrate on the built-in types and the type constructions (like structs and unions) of C++. Other details of our semantics of C++ will be described elsewhere, see [10, 11].

We base our formalisation of the data types on the C++ ISO Standard [5]. In the following we call this document simply the standard. The notation §n.m(k) refers to Section n.m, paragraph k in the standard. In the standard the data types are mainly described in §3.9 and §9.

We have the following requirements for our semantics:

1. The semantics is suitable for reasoning about low-level systems programs, especially operating systems written in C++, like Fiasco.
2. The semantics scales well to the formalisation of a specific C++ implementation, which fixes the behaviour of language constructs that the standard leaves unspecified.
3. The semantics supports popular unsafe C++ programming idioms (like allocating a char array and casting its base address into a specific pointer type) together with programming idioms that are necessary for operating-systems programming (like casting unsigned integers to pointers).
4. The semantics is sensitive enough to detect subtle programming errors, like reading data from uninitialised memory.

The motivation for these requirements is clear from the context of this work: after all, we want to use the semantics in a concrete verification project. Let us now discuss the implications of these requirements.

The first consequence of Requirement 1 (reasoning about low-level code) is that we cannot use the traditional way of modelling dynamically allocated structures. For instance, in [3, 15] dynamically allocated structures are modelled as functions of the form heap : address −→ value-type. In this approach the heap is implicitly typed, making it impossible to describe programs that possibly contain type errors (like reading an integer from a memory location that contains a string). However, one of the goals of the VFiasco project is to prove the absence of such typing errors. This is impossible in an implicitly typed semantics. Our model of the heap must support operations like pointer-type conversion and also interpreting the same data at different types.

¹ In a shallow embedding an external tool translates the sources into their semantics. In contrast, in a deep embedding one can represent phrases of the source in the logic and the semantic function is expressed inside the logic.

For the second consequence of the first requirement, recall again our long-term goal: verifying Fiasco. For that we are currently developing an x86 hardware model in PVS. The hardware model gives an abstraction of the execution environment of x86-compatible processors. A state of the hardware model will be an abstraction of the state of a real-world PC. Such a state contains a snapshot of the physical memory together with some important control registers of the CPU (like cr3, which determines the base address of the current page directory for the virtual-memory address translation).
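The contrast between an implicitly typed heap and a memory that stores only raw data can be illustrated with a small C++17 sketch. This is a toy model under our own assumptions, not the PVS formalisation described in this paper: the names `ByteMemory`, `store`, `load`, and `reinterpretation_roundtrip` are hypothetical, and the memory size of 64 bytes is arbitrary. The memory holds plain bytes without type tags, so the same bytes can be read back at different types, and an out-of-range access yields no value at all.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <type_traits>

// Hypothetical toy model (not the paper's PVS formalisation): memory is a
// flat array of bytes that carries no type information.
class ByteMemory {
public:
    static constexpr std::size_t kSize = 64;  // arbitrary size for the sketch

    // Write the object representation of `value` starting at `addr`.
    // Returns false if the access would leave the modelled memory.
    template <typename T>
    bool store(std::size_t addr, const T& value) {
        static_assert(std::is_trivially_copyable_v<T>, "raw copy needs a trivially copyable type");
        if (!in_range(addr, sizeof(T))) return false;
        std::memcpy(bytes_.data() + addr, &value, sizeof(T));
        return true;
    }

    // Interpret the sizeof(T) bytes starting at `addr` as a T, regardless of
    // the type that was used to write them. Returns no value when out of range.
    template <typename T>
    std::optional<T> load(std::size_t addr) const {
        static_assert(std::is_trivially_copyable_v<T>, "raw copy needs a trivially copyable type");
        if (!in_range(addr, sizeof(T))) return std::nullopt;
        T out;
        std::memcpy(&out, bytes_.data() + addr, sizeof(T));
        return out;
    }

private:
    static bool in_range(std::size_t addr, std::size_t len) {
        return addr <= kSize && len <= kSize - addr;
    }

    std::array<unsigned char, kSize> bytes_{};
};

// Stores a 32-bit integer, then reads the same bytes back at two different
// types: once as an integer and once as an array of four raw bytes.
bool reinterpretation_roundtrip() {
    ByteMemory mem;
    const std::uint32_t word = 0x01020304u;
    if (!mem.store(8, word)) return false;
    const auto as_word = mem.load<std::uint32_t>(8);
    const auto as_bytes = mem.load<std::array<unsigned char, 4>>(8);
    if (!as_word || !as_bytes) return false;
    return *as_word == word &&
           std::memcmp(as_bytes->data(), &word, sizeof(word)) == 0;
}
```

A call such as `assert(reinterpretation_roundtrip());` exercises the sketch. A map from addresses to typed values could not express the second `load`, because it would have fixed the type of the location at the time of the `store`.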
For the verification, the semantics of the Fiasco sources will be interpreted on the hardware model. In order to ensure that the verification results apply to the real-world Fiasco, it is necessary that our hardware model models x86 hardware and nothing more. The conclusion is that the semantics of data types must not add information to the states of the hardware model, for instance by associating type tags with memory locations.

The second requirement (scalability of the semantics) originates from the following: the C++ standard leaves the behaviour of many C++ constructs open. For the built-in integer types it does not specify which values they can represent; for signed integer types it is left open whether they can represent negative numbers. One of the few things that are required is that all built-in integer types are at least 7 bits wide.² A specific C++ implementation can define things that are left open in the standard. For instance, the gcc compiler uses 8 bits for char and 32 bits for int. It is impossible to write (and verify!) an operating system without such implementation-specific knowledge. For instance, Fiasco relies on the fact that the standard conversion from any pointer type to unsigned and back is the identity function on the underlying bit representation.

The data-type semantics in this paper will easily scale to (possibly different) implementations. The base version of the semantics formalises the C++ standard. From the base version one can derive an implementation-specific version by just adding the additional properties.

Requirement 3 (support for unsafe idioms) is (again) motivated by our long-term goal in the VFiasco project. The semantics must support all the program idioms that are necessary for writing an operating system. This includes casting (seemingly) arbitrary unsigned integers into pointers and dereferencing these pointers.

Last but not least, we want the semantics to catch all possible programming errors (Requirement 4).
Consider the following C++ statement:³

    T a = * reinterpret_cast<T*>(50);

It accesses an object of type T that is located at address 50. At the time when this statement is executed there might or might not be a valid object of type T stored at address 50. Statements of this kind are necessary in an operating system, for instance in the code that traverses page directories. Therefore we need a semantics for data types that permits the above statement (with the expected semantics), provided one can prove that there is a valid object of type T stored at address 50. However, we also want to catch programming errors where the above statement is executed before the memory at address 50 is properly initialised. Therefore, without any knowledge about the memory contents one should not be able to derive anything about the above statement (not even that it does not crash the hardware).

² Because they all can represent the 96 characters from the basic source character set, see §§ 2.2.1, 3.9.1.2.
³ The C++ standard says that the behaviour of this statement is undefined. However, for gcc it is well defined.

For a second example of the programming errors we want to catch, consider the following piece of code:

    unsigned a[2] = { 1, 2 };
    unsigned b[3];
    memcpy(static_cast<char *>(static_cast<void *>(b)) + 1, a, sizeof(a));
    unsigned c = * static_cast<int *>(static_cast<void *>(static_cast<char *>(static_cast<void *>(b)) + 1));
    unsigned d = b[1];

The memcpy function copies the two integers 1 and 2 in array a into array b. Note that the copy is displaced by an offset of 1 byte. According to the C++ standard such a memcpy is legitimate. Moreover, the standard specifies that the fourth statement (which reads the integer 1 at its new address) initialises c correctly with 1 (§3.9.3), provided the alignment requirements are met.
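The displaced copy can be tried out with the following sketch. It is a variation on the example above under our own assumptions, not the original code: the function name `read_displaced_copy` is hypothetical, and the value is read back with a second `memcpy` instead of dereferencing a cast pointer, so that the sketch does not depend on the alignment proviso.

```cpp
#include <cassert>
#include <cstring>

// Copies the two integers of `a` into `b`, displaced by one byte, and reads
// the first copied integer back from its new (possibly misaligned) address.
// Reading via memcpy is an assumption of this sketch; the example in the text
// dereferences a cast pointer and therefore needs the alignment proviso.
unsigned read_displaced_copy() {
    unsigned a[2] = {1, 2};
    unsigned b[3] = {0, 0, 0};

    // One byte past the start of b; the copy of a still fits inside b because
    // 1 + 2 * sizeof(unsigned) <= 3 * sizeof(unsigned).
    unsigned char* displaced = reinterpret_cast<unsigned char*>(b) + 1;
    std::memcpy(displaced, a, sizeof(a));

    unsigned c;
    std::memcpy(&c, displaced, sizeof(c));
    return c;
}
```

Calling `read_displaced_copy()` returns 1 on any conforming implementation, because the bytes of `a[0]` are copied out and back unchanged.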
