Formal Verification of Object Layout for C++ Multiple Inheritance

Formal Verification of Object Layout for C++ Multiple Inheritance Tahina Ramananandro Gabriel Dos Reis ∗ Xavier Leroy y INRIA Paris-Rocquencourt Texas A&M University INRIA Paris-Rocquencourt [email protected] [email protected] [email protected] Abstract Many programming languages are implemented using straight- Object layout — the concrete in-memory representation of objects forward memory layouts. In Fortran and C, for example, compound — raises many delicate issues in the case of the C++ language, data structures are laid out consecutively in memory, enumerat- owing in particular to multiple inheritance, C compatibility and ing their members in a conventional order such as column-major separate compilation. This paper formalizes a family of C++ object ordering for Fortran arrays or declaration order for C structures. layout schemes and mechanically proves their correctness against Padding is inserted between the members as necessary to satisfy the the operational semantics for multiple inheritance of Wasserrab alignment constraints of the machine. Higher-level languages leave et al. This formalization is flexible enough to account for space- more flexibility for determining data representations, but practi- saving techniques such as empty base class optimization and tail- cal considerations generally result in simple memory layouts quite padding optimization. As an application, we obtain the first formal similar in principle to those of C structures, with the addition of correctness proofs for realistic, optimized object layout algorithms, tags and dynamic type information to guide the garbage collector including one based on the popular “common vendor” Itanium and implement dynamic dispatch and type tests. C++ application binary interface. This work provides semantic This paper focuses on data representation for objects in the C++ foundations to discover and justify new layout optimizations; it is language. C++ combines the many scalar types and pointer types also a first step towards the verification of a C++ compiler front- inherited from C with a rich object model, featuring multiple inheri- end. tance with both repeated and shared inheritance of base classes, object identity distinction, dynamic dispatch, and run-time type tests Categories and Subject Descriptors D.2.4 [Software for some but not all classes. This combination raises interesting Engineering]: Software/Program Verification—Correctness data representation challenges. On the one hand, the layout of ob- proofs; D.3.3 [Programming Languages]: Language Constructs jects must abide by the semantics of C++ as defined by the ISO and Features—Classes and objects; D.3.4 [Programming standards [8]. On the other hand, this semantics leaves significant Languages]: Processors—Compilers; E.2 [Data storage repre- flexibility in the way objects are laid out in memory, flexibility that sentations]: Object representation; F.3.3 [Logics and meanings can be (and has repeatedly been) exploited to reduce the memory of programs]: Studies of program constructs—Object-oriented footprint of objects. A representative example is the “empty base constructs optimization” described in section 2. As a result of this tension, a number of optimized object layout General Terms Languages, Verification algorithms have been proposed [6, 7, 13, 17], implemented in pro- duction compilers, and standardized as part of application binary 1. Introduction interfaces (ABI) [3]. Section 2 outlines some of these algorithms and their evolution. These layout algorithms are quite complex, One of the responsibilities of compilers and interpreters is to rep- sometimes incorrect (see Myers [13] for examples), and often diffi- resent the data types and structures of the source language in terms cult to relate with the high-level requirements of the C++ specifica- of the low-level facilities provided by the machine (bits, pointers, tion. For example, the C++ “common vendor” ABI [3] devotes sev- etc.). In particular, for data structures that must be stored in mem- eral pages to the specification of an object layout algorithm, which memory layout ory, an appropriate must be determined and imple- includes a dozen special cases. mented. This layout determines the position of each component of The work reported in this paper provides a formal framework a compound data structure, relative to the start address of the mem- to specify C++ object layout algorithms and prove their correct- ory area representing this structure. ness. As the high-level specification of operations over objects, we ∗ use the operational semantics for C++ multiple inheritance formal- Partially supported by NSF grants CCF-0702765 and CCF-1035058. ized by Wasserrab et al [19], which we have extended with struc- y Partially supported by ANR grant Arpege` U3CAT. ture fields and structure array fields (section 3). We then formalize a family of layout algorithms, independently of the target archi- tecture, and axiomatize a number of conditions they must respect while leaving room for many optimizations (section 4). We prove Permission to make digital or hard copies of all or part of this work for personal or that these conditions are sufficient to guarantee semantic preserva- classroom use is granted without fee provided that copies are not made or distributed tion when the high-level operations over objects are reinterpreted for profit or commercial advantage and that copies bear this notice and the full citation as machine-level memory accesses and pointer manipulations (sec- on the first page. To copy otherwise, to republish, to post on servers or to redistribute tion 5). Finally, we formalize two realistic layout algorithms: one to lists, requires prior specific permission and/or a fee. based on the popular “common vendor” C++ ABI [3], and its ex- POPL ’11 January 26–28, Austin, TX, USA Copyright c 2011 ACM 978-1-4503-0490-0/11/01. $10.00 tension with one further optimization; we prove their correctness Field alignment: for any field f of scalar type t, the natural by showing that they satisfy the sufficient conditions (section 6). alignment of type t evenly divides its memory address. All the specifications and proofs in this paper have been me- Besides containing fields, C++ objects also have an identity, which chanically verified using the Coq proof assistant. The Coq devel- can be observed by taking the address of an object and comparing opment is available online [15]. it (using the == or != operators) with addresses of other objects of The contribution of this paper is twofold. On one hand, it is the same type. The C++ semantics specifies precisely the outcome (to the best of our knowledge) the first formal proof of semantic of these comparisons, and this semantics must be preserved when correctness for realistic, optimizing C++ object layout algorithms, the comparisons are reinterpreted as machine-level pointer compar- one of which being part of a widely used ABI. Moreover, we hope isons: that large parts of our formalization and proofs can be reused for other, present or future layout algorithms. On the other hand, just Object identity: two pointers to two distinct (sub)objects like the subobject calculus of Rossie and Friedman [16] and the of the same static type A, obtained through conversions operational semantics for multiple inheritance of Wasserrab et al or accesses to structure fields, map to different memory [19] were important first steps towards a formal specification of addresses. the semantics of (realistic subsets of) the C++ language, the work This requirement is further compounded by the fact that C++ presented in this paper is a first step towards the formal verification operates under a simplistic separate compilation model inherited of a compiler front-end for (realistic subsets of) C++, similar in from C, and the fact that every class can be used independently principle and structure to earlier compiler verification efforts for to create complete objects, and every subobject in isolation is other languages such as Java [9], C0 [10], and C [11]. a potential target of most operations supported by any complete object of the same type. 2. Overview Furthermore, some classes are considered to be dynamic: those classes that need dynamic type data to perform virtual function dis- 2.1 The object layout problem patch, dynamic cast, access to a virtual base, or other dynamic op- Generally speaking, an object layout algorithm is a systematic way erations. The concrete representation for objects of these dynamic to map an abstract, source-level view of objects down to machine- classes must include dynamic type data (usually as a pointer to level memory accesses and pointer operations. At the source level, a data structure describing the class), and such data must be pre- a C++ object is an abstract entity over which various primitive served by field updates. operations can be performed, such as accessing and updating a Dynamic type preservation: any scalar field maps to a mem- field, converting (“casting”) an object or an object descriptor to ory area that is disjoint from any memory area reserved to another type, or dispatching a virtual function call. At the machine hold dynamic type data. level, an object descriptor is a pointer p to a block of memory containing the current state of the object. The object layout scheme The separation conditions between two areas holding dynamic type determines, at compile-time, how to reinterpret the source-level data are weaker than the separation conditions for fields. Indeed, object operations as machine instructions: most layout algorithms arrange that the dynamic type data for a class C is shared with that of one of its dynamic non-virtual direct • Accessing a field f defined in the class C of the object becomes bases, called the non-virtual primary base of C and chosen during a memory read or write at address p + δ, where the constant layout, in a way that preserves the semantics of virtual function offset δ is determined by the layout algorithm as a function of dispatch and other dynamic operations.

Load more