High Level Memory Model with Low Level Pointer Cast Support For

ISSN 03617688, Programming and Computer Software, 2015, Vol. 41, No. 4, pp. 197–208. © Pleiades Publishing, Ltd., 2015. Original Russian Text © M.U. Mandrykin, A.V. Khoroshilov, 2015, published in Programmirovanie, 2015, Vol. 41, No. 4. HighLevel Memory Model with LowLevel Pointer Cast Support for Jessie Intermediate Language1 M. U. Mandrykin* and A. V. Khoroshilov** The Institute for System Programming of the RAS, ul. A. Solzhenitsyna 25, Moscow, 109004 Russia email: *[email protected], **[email protected] @ Abstract—The paper presents a target analyzable language used for verification of realworld production GNU C programs (Linux kernel modules). The language represents an extension of the existing intermediate language used by the Jessie plugin for the F@ramaC static analysis framework. Compared to the original Jessie, the extension is fully compatible with the C semantics of arrays, initially supports discriminated unions and prefix (hierarchical) structure pointer casts and provides a limited, but reasonable support for lowlevel pointer casts (reinterpretations of the underlying bytes of memory). The paper describes the approaches to translation of the original C code into the analyzable intermediate language and of the intermediate language into Why3ML i.e. the input language of the Why3 deductive verification platform. DOI: 10.1134/S0361768815040040 1 INTRODUCTION violations and buffer overruns). Third, C is a lowlevel language with generally untyped semantics, which can There are many tools and techniques for static ver substantially influence the choice of an appropriate ification of C source code. Among them are deduction techniques, that are based on translation of the origi memory model. nal C source code supplied with specifications of the Let’s assume the VC for the assertion check at line properties being verified into a set of logical formulas 10 (a[n] = 1) is required. The most straightforward and whose validity is equivalent to the correctness of the also precise solution would be to make use of the low original program with respect to the specified proper level memory model for constructing the correspond ties. These logical formulas a.k.a. verification condi ing logical formula. In this model each variable in the tions (VCs) or proof obligations can be proved valid program memory is represented with a range of ele (discharged) by various tools that can be either com ments in the common logical byte array M indexed by pletely automatic, e.g. SAT and SMT solvers [1, 2] 32bit (or 64bit) integer addresses. A logical array [2] (Z3, CVC3, CVC4, ALTERGO et al.) and saturation (as a logical type supported by SMTsolvers) is a total based provers (VAMPIRE, EPROVER et al.), or interac map from the underlying set of the array index type to tive, e.g. COQ and PVS proof assistants. the underlying set of its element type that is associated Thus any deductive verification tool for C that aims with the corresponding pure (i.e. sideeffect free) ter the most completely automatic verification should nary update operation ·[·←·] defined on triples of the somehow translate the semantics of the original code form (array, index, value). The result of the operation and provided specifications into a set of VCs that the is a new logical array of the same (index and element) existing stateoftheart automatic theorem provers type, that differs from the original one on and only on are capable to discharge. the specified index mapped to the new specified ele Meanwhile the basic features of the C program ment value (i.e. for any other distinct index the value ming language significantly complicate such transla from the original array is retained). Logical formulas tion. First, C is an imperative programming language can only constrain the values of logical arrays on cer and thus the implicit program state should be some tain indexes, while initially the values are treated as how modeled in the resulting logical formulas. It can nondeterministic. The value of an array M on index i be modeled in several ways usually involving theory of is further denoted as M[i]. For the array variable M arrays for representing the state of program memory. representing the program memory we introduce a Second, C assumes manual memory management, which requires generation of additional memory safety finite sequence of logical array constants M1, M2, …, checks, corresponding to possible memory misman Mn corresponding to the states of program memory agements (premature deallocation, garbage, access resulting from several consecutive memory updates. Assuming the variables a, b, c, d, e, n and m have 32 1 The article was translated by the authors. bit addresses represented with constants a, b, c, d, e, n 197 198 MANDRYKIN, KHOROSHILOV and m respectively, the required VC can be encoded addressed by pointers). In particular, the variables that within the lowlevel memory model as follows: never occur as arguments to the addressing operation M1 = M0[a ← M0[e]] ∧ (&), are often regarded as pure. This assumption is M = M [b ← M [a]] ∧ sound if the program under consideration is memory 2 1 1 safe (free of access violations and buffer overruns). M = M [M [b] + M [n] ← 0 ] ∧ 3 2 2 32 2 8 As pure variables can only be addressed by their ← ∧ M4 = M3[M3[b] +32 M3[n] +32 132 08] names, the corresponding access operations on these M5 = M4[M4[b] +32 M4[n] +32 232 ← 08] ∧ variables can be easily statically separated from each M6 = M5[M5[b] +32 M5[n] +32 332 ← 18] ∧ other and from accesses to other variables in program M = M [M [d] + M [m] ← 0 ] ∧ . ∧ memory. Therefore each such variable can be repre 7 2 6 32 6 8 v v v M = M [M [d] + M [m] + 3 ← 2 ] ∧ sented with a separate finite sequence 1, 2, …, k of 10 5 9 32 9 32 8 constants rather than an index in the common array M11 = M10[M10[c] ← 97] ∧ sequence M1, …, Mn. This optimization alone allows ()…ba≤ – 4 ∨ b ≥ a + 4 ∧∧⎫ to decrease the number of inequalities (stating the dis ⎪ jointness/separation of variables in program memory) ()…ea≤ – 16 ∨ ea≥ + 4 ∧ ∧⎬ 42 @@@@@@@. in the above example from 42 to 6, also reducing the ⎪ ()dd≤ – 16 ∨ ed≥ + 16 ⎭ number of logical array constants (common memory states) M1, …, Mn from 11 to 9. The lowlevel memory model allows very precise bitaccurate modeling of C program semantics poten The idea of separation for the common memory tially (with even more precise modeling of CPU regis array M applied to the pure variables can be general ters and caches) achieving nearly full conformance ized to arbitrary a priori separated (up to some static between the program model and the actual execution approximation) memory regions. For that purpose we of the program on a real physical machine. The logical can separate the program memory into a number of formulas involving both logical arrays and bitvector disjoint sets of program variables, such that for any reasoning are often amenable to automatic satisfiabil particular pointer expression in the program source ity checking with the stateoftheart SMT solvers, code the unique corresponding set of program vari though the existing decision procedures for the theory ables possibly addressed by that pointer can be deter of logical arrays are generally incomplete [2] (i.e. the mined once for all the states of the program execution corresponding tools can loop forever or terminate corresponding to the source code location of the without the satisfiability verdict). But even the sim pointer. Thus we obtain a static approximation of pro plest example above demonstrates the redundancy and gram memory separation. The resulting disjoint sets of nonoptimality of the lowlevel memory model. The variables are called regions in direct analogy to the model is poorly scalable in practice and is not usually regions introduced in [9, 10] as a tool for automatic used by deductive verification tools directly, but rather dynamic memory management without garbage col after applying a number of substantial optimizations. lector (regionbased memory management). These papers don’t suggest any concrete algorithm for mem ory region inference (although they provide corre Existing Jessie Memory Model sponding inexplicit inference rules), but the paper [11] It worth to be mentioned here that in this paper we suggests a practical algorithmic approach, allowing to only consider logical formulas in firstorder classical compute contextsensitive (but flowinsensitive) logic with equality modulo several theories (such as approximation of regionbased program memory sep the theory of arrays, uninterpreted functions, bit vec aration statically. This technique is based on fixpoint tors, and linear integer/real arithmetic), i.e. socalled computation starting from the initial approximation SMT formulas. Alternative logics, including the ones assigning each pointer variable in the program with its based on separation logic [3], are usually rather insuf own region and proceeding with unification of the ficiently supported by automatic satisfiability checkers regions according to the set of several simple unifica and therefore usually require significantly greater user tion rules. The result of the algorithm is a region label intervention to determine the satisfiability of the assignment, matching each pointer expression in the resulting VCs. In the field of deductive verification the program with exactly one region label. To apply the techniques based on SMT formulas are still the most suggested technique to arbitrary C programs the often used in practice, particularly in the tools authors

High Level Memory Model with Low Level Pointer Cast Support For

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support