CCured: Type-Safe Retrofitting of Legacy Code

George C. Necula, Scott McPeak, Westley Weimer
University of California, Berkeley
{necula, smcpeak, weimer}@cs.berkeley.edu

Abstract

In this paper we propose a scheme that combines type inference and run-time checking to make existing C programs type safe. We describe the CCured type system, which extends that of C by separating pointer types according to their usage. This type system allows both pointers whose usage can be verified statically to be type safe, and pointers whose safety must be checked at run time. We prove a type soundness result and then we present a surprisingly simple type inference algorithm that is able to infer the appropriate pointer kinds for existing C programs.

Our experience with the CCured system shows that the inference is very effective for many C programs, as it is able to infer that most or all of the pointers are statically verifiable to be type safe. The remaining pointers are instrumented with efficient run-time checks to ensure that they are used safely. The resulting performance loss due to run-time checks is 0-150%, which is several times better than comparable approaches that use only dynamic checking. Using CCured we have discovered programming bugs in established C programs such as several SPECINT95 benchmarks.

This research was supported in part by the National Science Foundation Career Grant No. CCR-9875171, and ITR Grants No. CCR-0085949 and No. CCR-0081588, and gifts from AT&T Research and Microsoft Research. The information presented here does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
POPL '02, Jan. 16-18, 2002, Portland, OR, USA. (c) 2002 ACM ISBN 1-58113-450-9/02/01...$5.00

1 Introduction

The C programming language provides programmers with a great deal of flexibility in the representation of data and the use of pointers. These features make C the language of choice for systems programming. Unfortunately, the cost is a weak type system and consequently a great deal of "flexibility" in introducing subtle bugs in programs.

While in the 1970s sacrificing type safety for flexibility and performance might have been a sensible language design choice, today there are more and more situations in which type safety is just as important as, if not more important than, performance. Errors like array out-of-bounds accesses lead both to painful debugging sessions chasing inadvertent memory updates and to malicious attacks exploiting buffer overrun errors in security-critical code. (Almost 50% of recent CERT advisories result from security violations of this kind [29].) Type safety is desirable for isolating program components in a large or extensible system, without the loss of performance of separate address spaces. It is also valuable for inter-operation with systems written in type-safe languages (such as type-safe Java native methods, for example). Since a great deal of useful code is already written or being written in C, it would be useful to have a practical scheme to bring type safety to these programs.

The work described in this paper is based on two main premises. First, we believe that even in programs written in unsafe languages like C, a large part of the program can be verified statically to be type safe. Then the remaining part of the program can be instrumented with run-time checks to ensure that the execution is memory safe. The second premise of our work is that in many applications, some loss of performance due to run-time checks is an acceptable price for type safety, especially when compared to the alternative cost of reprogramming the system in a type-safe language.

The main contribution of this paper is the CCured type system, an extension of the C type system with explicit types for pointers into arrays, and with dynamic types. It extends previous work on adding dynamic types to statically typed languages: types and capabilities for the statically-typed elements are known at compile time, while the dynamically-typed elements are guarded by run-time checks. Our type system is inspired by common C usage, and includes support for physical type equivalence [8] and special "sequence" pointers for accessing arrays. The second contribution of the paper is a simple yet effective type-inference algorithm that can translate ordinary C programs into CCured mostly automatically and in a matter of seconds even for 30,000-line programs. We have used this inference algorithm to produce type-safe versions of several C programs for which we observed a slowdown of 0-150%. In the process, we have also found programming bugs in the analyzed code, the most surprising being several array out-of-bounds errors in the SPECINT95 compress, go and ijpeg benchmarks.

We continue in Section 2 with an informal overview of the system in the context of a small example program. Then in Section 3 we present a simple language of pointers, with its type system (in Section 4) and operational semantics (in Section 5). In Section 6 we present the type inference algorithm for CCured. We discuss informally the extension of the language presented in this paper to the whole C programming language in Section 7, necessary source code changes in Section 8, and in Section 9 we relate our experience with a prototype implementation.

     1  int *1 *2 a;                  // array
     2  int i;                        // index
     3  int acc;                      // accumulator
     4  int *3 *4 p;                  // elem ptr
     5  int *5 e;                     // unboxer
     6  acc = 0;
     7  for (i=0; i<100; i++) {
     8    p = a + i;
     9    e = *p;
    10    while ((int) e % 2 == 0) {  // check tag
    11      e = * (int *6 *7) e;      // unbox
    12    }
    13    acc += ((int)e >> 1);       // strip tag
    14  }

Figure 1: A short C program fragment demonstrating safe and unsafe use of pointers.

2 Overview of the Approach

To ensure memory safety, for each pointer we must keep track of certain properties of the memory area to which it is supposed to point. Such properties include the area's size and the types of the values it contains. For some pointers this information can be computed precisely at compile time and for others we must compute it at run time, in which case we have to insert run-time safety checks.

These two kinds of pointers appear in the example C program shown in Figure 1. The program operates on a hypothetical disjoint union datatype we call "boxed integer" that has been efficiently implemented in C as follows: if a boxed integer value is odd then it represents a 31-bit integer in the most significant bits along with a least significant tag bit equal to one, otherwise it represents a pointer to another "boxed integer". We use the C datatype int * to represent boxed integers. The variable a is a pointer to an array of boxed integers. The purpose of the function is to accumulate in the variable acc the sum of the first 100 boxed integers in the array. In line 8 we compute the address of a boxed integer and in line 9 we fetch the boxed integer. The loop in lines 10-12 unboxes the integer. The subscripts on the pointer type constructors "*" have been added to simplify cross-referencing from the text.

By inspection of the program we observe that the values of the variables a and p are supposed to point into the same array. Neither of these variables is subject to casts (and they have no other aliases) and thus we know that the type of the values they point to is indeed "int *". Furthermore, we observe that while the pointer a is subject to pointer arithmetic, the pointer p is not. This means that we must check uses of "a" for array out-of-bounds errors but we do not need to do so for the uses of "p" (assuming that a check is performed in line 8 where "p" is initialized). In this paper we refer to "p" as a safe pointer and to "a" as a sequence pointer. To be more precise we associate this information with the pointer type constructors *4 and *2 respectively. Safe and sequence pointers have only aliases that agree on the type of the value pointed to and thus point to memory areas whose contents are statically typed.

Now we turn our attention to the pointer values of "e". These values are used with two incompatible types "int *" and "int * *". This means that we cannot count on the static type of "e" as being an accurate description of its values. In our type system we say that "e" has a dynamic pointer type and we associate this information with the pointer type constructors *5 and *7. Dynamic pointers always point into memory areas whose contents do not have a reliable static type and must therefore store extra information to classify their contents as pointers or integers. Correspondingly, the aliases of dynamic pointers can only be other dynamic pointers; otherwise a safe pointer's static type assumptions could be violated after a memory write through a dynamic pointer alias. This means that the type constructors *1, *3 and *6 must also be classified as dynamic pointers. In this example program, we have a mixture of pointers whose static type can be relied upon and thus require little or no access checks (the safe and sequence pointers), and also pointers whose static type is unreliable and thus require more extensive checking.

Motivated by this and similar examples, the CCured language is essentially the union of two languages: a strongly typed language (containing safe and sequence pointers), and an untyped language for which the type information is maintained and checked at run time.

All values and memory areas in the system are either part of the safe/sequence world, or part of the dynamic world. The only place these worlds touch is when a typed memory area contains a pointer to untyped memory. Untyped memory cannot contain safe or sequence pointers because we cannot assign a reliable static type to the contents of dynamic areas. We shall formalize these invariants starting in Section 4. Before then, in order to provide some intuition for the formal development, we summarize in Figure 2 the capabilities and invariants of various pointer kinds. Since in C the null pointer is used frequently, we allow the safe pointers to be null. Similarly, we allow arbitrary integers to be "disguised" as sequence and dynamic pointers, but not as safe pointers.
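The boxed-integer encoding described above can be sketched in portable C as follows. This is an illustration only and is not taken from the paper: the names boxed, box_int, box_ptr, is_int, unbox and sum_boxed are hypothetical, and the sketch uses intptr_t instead of int so that a pointer fits in one word (on a 64-bit machine the integer payload is then 63 bits rather than 31). It assumes that every pointer chain ends in a tagged integer and that payloads are non-negative.

```c
#include <assert.h>
#include <stdint.h>

/* A boxed value is one machine word: odd means tagged integer, even means
   pointer to another boxed value. All names here are illustrative. */
typedef intptr_t boxed;

/* Store n in the upper bits and set the least significant tag bit. */
static boxed box_int(intptr_t n) {
    return (boxed)(((uintptr_t)n << 1) | 1u);
}

/* Wrap the address of another boxed value; alignment keeps the low bit 0. */
static boxed box_ptr(boxed *p) {
    assert(((uintptr_t)p & 1u) == 0);
    return (boxed)(uintptr_t)p;
}

/* Tag check, the counterpart of line 10 in Figure 1. */
static int is_int(boxed b) {
    return (b & 1) != 0;
}

/* Follow pointers until a tagged integer is reached, then strip the tag. */
static intptr_t unbox(boxed b) {
    while (!is_int(b))
        b = *(boxed *)(uintptr_t)b;
    return b >> 1;
}

/* Sum the first n boxed integers of an array, mirroring the loop of Figure 1. */
static intptr_t sum_boxed(const boxed *a, int n) {
    intptr_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += unbox(a[i]);
    return acc;
}
```

The cast inside unbox is exactly the kind of operation that forces "e" to be a dynamic pointer: the same word is treated as an integer in is_int and as a pointer in the dereference.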


Kind: Safe pointer to τ
  Invariants maintained:
    • Either 0 or a valid address containing a value of type τ.
    • Aliases are either safe or sequence pointers of base type τ.
  Capabilities:
    • Cast from sequence pointer of base type τ.
    • Set to 0.
    • Cast to integer.
  Access checks required:
    • Null-pointer check when dereferenced.

Kind: Sequence pointer to τ
  Invariants maintained:
    • Knows at run-time if it is an integer, and if not, knows the memory area (containing a number of values of type τ) to which it points.
    • Aliases are safe and sequence pointers of base type τ.
  Capabilities:
    • Cast to safe pointer of base type τ.
    • Cast from integer.
    • Cast to integer.
    • Perform pointer arithmetic.
  Access checks required:
    • Non-pointer check (subsumes null-pointer check).
    • Bounds check when dereferenced or cast to SAFE.

Kind: Dynamic pointer
  Invariants maintained:
    • Knows at run-time if it is an integer, and if not, knows the memory area (containing a number of integer or dynamic pointer values) to which it points.
    • The memory area pointed to maintains tags distinguishing integers from pointers.
    • Aliases are dynamic pointers.
  Capabilities:
    • Cast to and from any dynamic pointer type.
    • Cast from integer.
    • Cast to integer.
    • Perform pointer arithmetic.
  Access checks required:
    • Non-pointer check.
    • Bounds check when dereferenced.
    • Maintain the tags in the pointed area when reading and writing.

Figure 2: Summary of the properties and capabilities of various kinds of pointers.

3 A Language of Pointers

There are many constructs in the C programming language that can be misused to violate memory safety and type safety. Among them are type casts, pointer arithmetic, arrays, union types, the address-of operator, and explicit deallocation. To simplify the presentation of the key ideas behind our approach we describe it formally for a small language containing only pointers with casts and pointer arithmetic, and then we discuss informally in Section 7 how we extend the approach to handle the remaining C constructs.

    Types:        τ ::= int | τ ref SAFE | τ ref SEQ | DYNAMIC
    Expressions:  e ::= x | n | e1 op e2 | (τ)e | e1 ⊕ e2 | !e
    Commands:     c ::= skip | c1; c2 | e1 := e2

Figure 3: The syntax of a simple language with pointers, pointer arithmetic and casts.

Figure 3 presents the syntax of types, expressions and commands for a simple programming language that serves as the vehicle for formalizing CCured. At the level of types we have retained only the integers and the pointer types. In C the symbol "*" is used in various syntactic roles in conjunction with pointer types; to avoid confusion we have instead adopted the syntax of ML references for our modeling language. We have three flavors of pointer types corresponding to safe, sequence and dynamic pointers respectively. The type DYNAMIC is a pointer type that does not carry with it the type of the values pointed to. This is indicative of the fact that we cannot count on the referenced type of a dynamic pointer.

Among expressions we have integer literals and an assortment of binary integer operations, such as the arithmetic and relational operations, written generically as op. Relational operations on pointers are done after casting the pointers to integers. The binary operation ⊕ denotes pointer arithmetic and the notation !e denotes the result of reading from the memory location denoted by e (like *e in C).

The language of commands is greatly simplified. The only notable form of commands is memory update through a pointer (p := e is like *p = e in C). Control flow operations are not interesting because our approach is flow insensitive. Function calls are omitted for simplification; instead we discuss briefly in Section 7 how we handle function calls and function pointers. Among other notable omissions are variable updates and the address-of operator on variables. Instead, to simplify the formal presentation, we consider that a variable that is updated or has its address taken in C would be allocated on the heap and operated upon through its address (which is an immutable pointer variable) in our language. Finally, we ignore here the allocation and deallocation of memory (including that of stack frames). Even though the resulting language appears to be much simpler than C, it allows us to expose formally and in a succinct way the major ideas behind our type system, the type inference algorithm and the implementation. The implementation itself handles the entire C language.

Our example program from Figure 1 can be transcribed in this language (with the use of variable declarations and a few additional control-flow constructs) as shown in Figure 4. The major change in this version is that we have replaced the variables i, acc, p and e by pointers, and the accesses to those variables by memory operations. (The lines 1-5 are technically not representable in our language. We show them only to provide a context for the rest of the example. We also ignore the initialization of these variables.) All the newly introduced pointer type constructors are SAFE since the corresponding pointers are used only for reading and writing. Another change is that one or more nested dynamic pointer type constructors are collapsed into the DYNAMIC type.


     1  DYNAMIC ref SEQ a;                       // array
     2  int ref SAFE p_i;                        // index
     3  int ref SAFE p_acc;                      // accumulator
     4  DYNAMIC ref SAFE ref SAFE p_p;           // elem ptr
     5  DYNAMIC ref SAFE p_e;                    // unboxer
     6  p_acc := 0;
     7  for (p_i := 0; !p_i < 100; p_i := !p_i + 1) {
     8    p_p := (DYNAMIC ref SAFE)(a ⊕ !p_i);
     9    p_e := !!p_p;
    10    while ((int) !p_e % 2 == 0) {
    11      p_e := !!p_e;
    12    }
    13    p_acc := !p_acc + ((int)!p_e >> 1);
    14  }

Figure 4: The program from Figure 1, translated to CCured.

4 The Type System

In this section we describe the CCured type system for the language introduced in the previous section. The purpose of this type system is to maintain the separation between the statically typed and the untyped worlds, and to ensure that all well-typed programs can be made to run safely with the addition of appropriate run-time checks. The run-time checks are described as part of the operational semantics in Section 5. For now we concentrate on type checking, and we shall assume that the program contains complete pointer kind information.

The type system is expressed by means of the following three judgments:

    Expression typing:  Γ ⊢ e : τ
    Command typing:     Γ ⊢ c
    Convertibility:     τ ≤ τ'

In these judgments Γ denotes a typing environment mapping variable names to types. Since we do not have declarations in our language, the typing environment is assumed to be provided externally. The derivation rules for the typing judgments are shown in Figure 5.

Observe in the typing rules that we check casts with respect to a convertibility relation on types, whose rules are [...]. Notice also how the type DYNAMIC is used both for pointers into untyped areas and for the values stored into those areas.

The type convertibility relation captures the situations [...]

If this last conversion rule is used then our operational semantics inserts a run-time check to verify that the pointer being cast is within the bounds of its home area.

It is important to point out that in most cases casts act as conversions between different representations of values and in some cases they are also accompanied by run-time checks. The run-time manipulations that accompany casts make convertibility different from subtyping in several respects. First, convertibility extended with transitivity and viewed as a coercion-based subtyping relation [4], would be incoherent. For example, the series of coercions corresponding to DYNAMIC ≤ int ≤ DYNAMIC have a different effect from the identity since even if we start with a perfectly usable pointer we end up with a pointer that has lost its capability to perform memory operations. Because of lack of coherence we cannot allow a general subsumption rule and instead we let the program control the use of conversions, through casts. Consequently, we do not have a transitivity rule and we rely instead on the programmer to obtain the same effect by using a sequence of casts.

5 Operational Semantics

In this section we describe the run-time checks that are necessary for CCured programs to run safely. We do this in the form of an operational semantics for CCured.

The execution environment consists of a mapping Σ from variable names to values, a set of allocated memory areas H (which we call homes), and a mapping M (the memory) from addresses within the homes to values. The mapping Σ is assumed to be provided externally just like the similar mapping Γ from the typing rules. In our language only the memory changes during the execution. In order to better expose the precise costs of using each kind of pointer we use a low-level representation of addresses as natural numbers. A home is represented by its starting address (H ⊆ N) and for all homes we define a function size : H → N whose value is the size of the home. We require the following properties of the set H and the function size:

• NULL: 0 ∈ H and size(0) = 1.

• DISJOINT: ∀h, h', i, i'. (h ≠ h' ∧ 0 ≤ i < size(h) ∧ 0 ≤ i' < size(h')) ⇒ h + i ≠ h' + i'.

[...] M_H : N → Values such that its domain consists exactly of the addresses contained in the non-null homes:

    Dom(M_H) = {h + i | h ∈ H \ {0} ∧ 0 ≤ i < size(h)}


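Expressions:

    Γ(x) = τ
    ----------
    Γ ⊢ x : τ

    ------------
    Γ ⊢ n : int

    Γ ⊢ e1 : int    Γ ⊢ e2 : int
    -----------------------------
    Γ ⊢ e1 op e2 : int

    Γ ⊢ e : τ'    τ' ≤ τ
    ---------------------
    Γ ⊢ (τ)e : τ

    -------------------------------
    Γ ⊢ (τ ref SAFE)0 : τ ref SAFE

    Γ ⊢ e1 : τ ref SEQ    Γ ⊢ e2 : int
    -----------------------------------
    Γ ⊢ e1 ⊕ e2 : τ ref SEQ

    Γ ⊢ e1 : DYNAMIC    Γ ⊢ e2 : int
    ---------------------------------
    Γ ⊢ e1 ⊕ e2 : DYNAMIC

    Γ ⊢ e : τ ref SAFE
    -------------------
    Γ ⊢ !e : τ

    Γ ⊢ e : DYNAMIC
    -----------------
    Γ ⊢ !e : DYNAMIC

Commands:

    ---------
    Γ ⊢ skip

    Γ ⊢ c1    Γ ⊢ c2
    -----------------
    Γ ⊢ c1; c2

    Γ ⊢ e : τ ref SAFE    Γ ⊢ e' : τ
    ---------------------------------
    Γ ⊢ e := e'

    Γ ⊢ e : DYNAMIC    Γ ⊢ e' : DYNAMIC
    ------------------------------------
    Γ ⊢ e := e'

Convertibility:

    τ ≤ τ
    τ ≤ int
    int ≤ τ ref SEQ
    int ≤ DYNAMIC
    τ ref SEQ ≤ τ ref SAFE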

Figure 5: The typing judgments.
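The convertibility axioms of Figure 5 can be read directly as a decision procedure. The following C sketch is an illustration under our own encoding, not the CCured implementation: the type representation (type_tag, type) and the function names type_eq and convertible are hypothetical. It deliberately omits transitivity, as the text requires, so a conversion such as int ref SEQ to DYNAMIC is rejected even though it can be obtained with two explicit casts through int.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical encoding of the types of Figure 3. */
typedef enum { T_INT, T_REF_SAFE, T_REF_SEQ, T_DYNAMIC } type_tag;

typedef struct type {
    type_tag tag;
    const struct type *base;   /* referenced type; NULL for int and DYNAMIC */
} type;

/* Structural equality of types. */
static bool type_eq(const type *a, const type *b) {
    if (a->tag != b->tag) return false;
    if (a->tag == T_INT || a->tag == T_DYNAMIC) return true;
    return type_eq(a->base, b->base);
}

/* Decide from <= to using exactly the five axioms (no transitivity). */
static bool convertible(const type *from, const type *to) {
    assert(from != NULL && to != NULL);
    if (type_eq(from, to)) return true;          /* t <= t */
    if (to->tag == T_INT) return true;           /* t <= int */
    if (from->tag == T_INT)                      /* int <= t ref SEQ, int <= DYNAMIC */
        return to->tag == T_REF_SEQ || to->tag == T_DYNAMIC;
    if (from->tag == T_REF_SEQ && to->tag == T_REF_SAFE)
        return type_eq(from->base, to->base);    /* t ref SEQ <= t ref SAFE */
    return false;
}
```

Note that an integer is never convertible to a safe pointer here; the only way to obtain a safe pointer from an integer is the separate typing rule for the literal (τ ref SAFE)0.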

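(Premises written in [square brackets] are the boxed run-time checks.)

Expressions:

    INT:      --------------
              Σ, M ⊢ n ⇓ n

    VAR:      Σ(x) = v
              --------------
              Σ, M ⊢ x ⇓ v

    OP:       Σ, M ⊢ e1 ⇓ n1    Σ, M ⊢ e2 ⇓ n2
              ---------------------------------
              Σ, M ⊢ e1 op e2 ⇓ n1 op n2

Casts:

    C1:       Σ, M ⊢ e ⇓ n
              -------------------
              Σ, M ⊢ (int)e ⇓ n

    C2:       Σ, M ⊢ e ⇓ ⟨h, n⟩
              -----------------------
              Σ, M ⊢ (int)e ⇓ h + n

    C3:       Σ, M ⊢ e ⇓ n
              ------------------------------
              Σ, M ⊢ (τ ref SEQ)e ⇓ ⟨0, n⟩

    C4:       Σ, M ⊢ e ⇓ ⟨h, n⟩
              ------------------------------
              Σ, M ⊢ (τ ref SEQ)e ⇓ ⟨h, n⟩

    C5:       Σ, M ⊢ e ⇓ n
              ---------------------------
              Σ, M ⊢ (DYNAMIC)e ⇓ ⟨0, n⟩

    C6:       Σ, M ⊢ e ⇓ ⟨h, n⟩
              ---------------------------
              Σ, M ⊢ (DYNAMIC)e ⇓ ⟨h, n⟩

    C7:       Σ, M ⊢ e ⇓ n
              --------------------------
              Σ, M ⊢ (τ ref SAFE)e ⇓ n

    C8:       Σ, M ⊢ e ⇓ ⟨h, n⟩    [0 ≤ n < size(h)]
              ---------------------------------------
              Σ, M ⊢ (τ ref SAFE)e ⇓ h + n

Pointer arithmetic:

    ARITH:    Σ, M ⊢ e1 ⇓ ⟨h, n1⟩    Σ, M ⊢ e2 ⇓ n2
              --------------------------------------
              Σ, M ⊢ e1 ⊕ e2 ⇓ ⟨h, n1 + n2⟩

Memory reads:

    SAFERD:   Σ, M ⊢ e ⇓ n    [n ≠ 0]
              ------------------------
              Σ, M ⊢ !e ⇓ M(n)

    DYNRD:    Σ, M ⊢ e ⇓ ⟨h, n⟩    [h ≠ 0]    [0 ≤ n < size(h)]
              --------------------------------------------------
              Σ, M ⊢ !e ⇓ M(h + n)

Commands:

    SKIP:     -----------------
              Σ, M ⊢ skip ⟹ M

    CHAIN:    Σ, M ⊢ c1 ⟹ M'    Σ, M' ⊢ c2 ⟹ M''
              -----------------------------------
              Σ, M ⊢ c1; c2 ⟹ M''

    SAFEWR:   Σ, M ⊢ e1 ⇓ n    [n ≠ 0]    Σ, M ⊢ e2 ⇓ v2
              -------------------------------------------
              Σ, M ⊢ e1 := e2 ⟹ M[v2/n]

    DYNWR:    Σ, M ⊢ e1 ⇓ ⟨h, n⟩    [h ≠ 0]    [0 ≤ n < size(h)]    Σ, M ⊢ e2 ⇓ v2
              ---------------------------------------------------------------------
              Σ, M ⊢ e1 := e2 ⟹ M[v2/(h + n)]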

Figure 6: The operational semantics. The boxed premises are the run-time checks that CCured uses.
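To make the cast rules concrete, the following C sketch simulates rules C1-C8 and ARITH over abstract addresses. It is a model of the semantics, not CCured's run-time representation: the names value, cast_target, outcome, home_size, plain, pair, cast and ptr_add are all hypothetical, and home_size hard-codes an example set of homes (the null home, plus one assumed home at address 100 with 4 cells).

```c
#include <assert.h>
#include <stdbool.h>

/* Run-time value: a plain integer n, or a pair <h, n> (home, offset). */
typedef struct { bool is_pair; long h; long n; } value;

typedef enum { TO_INT, TO_SEQ, TO_DYNAMIC, TO_SAFE } cast_target;
typedef enum { OK, CHECK_FAILED } outcome;

/* Example home table standing in for size(h); the homes are assumptions. */
static long home_size(long h) {
    switch (h) {
    case 0:   return 1;   /* NULL property: size(0) = 1 */
    case 100: return 4;   /* one example home with 4 cells */
    default:  return 0;
    }
}

static value plain(long n)        { value v = { false, 0, n }; return v; }
static value pair(long h, long n) { value v = { true, h, n };  return v; }

/* One case per destination type and form of the input value. */
static outcome cast(cast_target t, value v, value *out) {
    switch (t) {
    case TO_INT:                                  /* C1, C2 */
        *out = v.is_pair ? plain(v.h + v.n) : v;
        return OK;
    case TO_SEQ:
    case TO_DYNAMIC:                              /* C3-C6: integers get home 0 */
        *out = v.is_pair ? v : pair(0, v.n);
        return OK;
    case TO_SAFE:
        if (!v.is_pair) { *out = v; return OK; }  /* C7 */
        if (v.n < 0 || v.n >= home_size(v.h))     /* C8: the boxed bounds check */
            return CHECK_FAILED;
        *out = plain(v.h + v.n);
        return OK;
    }
    return CHECK_FAILED;
}

/* ARITH: pointer arithmetic only moves the offset and is never checked. */
static value ptr_add(value p, long k) {
    assert(p.is_pair);
    return pair(p.h, p.n + k);
}
```

In this model, casting ⟨100, 2⟩ to int yields 102, and casting 102 back to a sequence pointer yields ⟨0, 102⟩, which fails the bounds check on any later use because size(0) = 1. The same property explains why ⟨0, 0⟩ can still be cast to the safe null pointer.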


The values of integer and safe pointer expressions are plain integers (without any representation overhead over C), while the values of sequence and dynamic pointer expressions are of the form ⟨h, n⟩:

    Values  v ::= n | ⟨h, n⟩

The latter kind of pointer value carries with it its "identity" (represented by its home). The home is used both to check if the pointer is actually an integer converted to a pointer (when the home is 0), or otherwise to retrieve the size of the home while performing a bounds check.

The operational semantics is defined by means of two judgments. We write Σ, M ⊢ e ⇓ v to say that in the environment Σ and in the memory state denoted by M the expression e evaluates to value v. For commands we use a similar judgment Σ, M ⊢ c ⟹ M' but in this case the result is a new memory state. The derivation rules for these two judgments are given in Figure 6.

Notice that we have eight rules for casts, one rule for each combination of destination type and form of the value being cast. The rules C3 and C5 show that an integer is converted to a sequence or to a dynamic pointer by using a null home. The rule C7 applies when we cast the integer 0 or a safe pointer to another safe pointer, while the rule C8 applies for casts from sequence pointers to safe pointers. In this latter case we must perform a bounds check. Here and in the rules to follow we mark such run-time checks with a box around them. Other instances of run-time checks are for memory operations. If the memory operation uses a safe pointer then only a null-pointer check must be done, otherwise a non-pointer check and a bounds check must be done.

The typing rules from Figure 5 suggest that we can perform a sequence of conversions τ ref SEQ ≤ int ≤ τ ref SEQ or even a similar one where the destination type is τ' ref SEQ. This is indeed legal in CCured but the operational rules show that when starting with a pointer value ⟨h, n⟩ we end up after these two conversions with the value ⟨0, h + n⟩, which is a pointer value that cannot be used for reading and writing. This property is quite important in practice: programs that cast pointers into integers and then back to pointers will not be able to use the resulting pointers as memory addresses. We discuss this issue further in Section 8.

5.1 Type Safety

The type system described in Section 4 enforces the separation between the typed and the untyped worlds. The operational semantics of Section 5 describes the run-time checks we perform for each operation on various pointer kinds. In this section we formalize and outline a proof of the resulting safety guarantees we obtain for CCured programs.

For each non-null home we define its kind as either Typed(τ), meaning that it contains a number of values of type τ and has only safe and sequence pointers of base type τ pointing to it, or as Untyped, meaning that it contains a number of values of type DYNAMIC and has only pointers of type DYNAMIC pointing to it. Then we define for each type τ the set ‖τ‖_H of valid values of that type. As the notation suggests, this set depends on the current set of homes:

    ‖int‖_H         = N
    ‖DYNAMIC‖_H     = {⟨h, n⟩ | h ∈ H ∧ (h = 0 ∨ kind(h) = Untyped)}
    ‖τ ref SEQ‖_H   = {⟨h, n⟩ | h ∈ H ∧ (h = 0 ∨ kind(h) = Typed(τ))}
    ‖τ ref SAFE‖_H  = {h + i | h ∈ H ∧ 0 ≤ i < size(h) ∧ (h = 0 ∨ kind(h) = Typed(τ))}

We extend the notation v ∈ ‖τ‖_H element-wise to the corresponding notation for environments Σ ∈ ‖Γ‖_H (meaning ∀x ∈ Dom(Σ). Σ(x) ∈ ‖Γ(x)‖_H).

At all times during the execution, the contents of each memory address must correspond to the typing constraints of the home to which it belongs. We say that such a memory is well-formed (written WF(M_H)), a property defined as follows:

    WF(M_H) ≝ ∀h ∈ H \ {0}. ∀i ∈ N. 0 ≤ i < size(h) ⇒
                  (kind(h) = Untyped ⇒ M(h + i) ∈ ‖DYNAMIC‖_H
                 ∧ kind(h) = Typed(τ) ⇒ M(h + i) ∈ ‖τ‖_H)

There are several reasons why the evaluation of an expression or a command can fail. The most obvious is that a boxed run-time check can fail. We actually consider this to be safe behavior. Another reason for failure is that operands can evaluate to unexpected values, such as if the second operand of ⊕ evaluates to a value of the form ⟨h, n⟩. The third reason is that the operations on memory are undefined if they involve invalid addresses. We state below two theorems saying essentially that the last two reasons for failure cannot happen in well-typed CCured programs.

In order to state a progress theorem we want to distinguish between executions that stop because memory safety is violated (i.e. trying to access an invalid memory location) and executions that stop because of a failed run-time check (the boxed hypotheses in the rules of Figure 6). We accomplish this by introducing a new possible outcome of evaluation. We say that Σ, M ⊢ e ⇓ CheckFailed when one of the run-time checks fails during the evaluation of the expression e. Similarly, we say that Σ, M ⊢ c ⟹ CheckFailed when the execution of the command c results in a failed run-time check. Technically, this means that we add derivation rules that initiate the CheckFailed result when one of the run-time checks fails and also rules that propagate the CheckFailed outcome from the subexpressions to the enclosing expression.

Theorem 1 (Progress and type preservation) If Γ ⊢ [...]

Theorem 2 (Progress for commands) If Γ ⊢ c and Σ ∈ [...]

The proofs of these theorems are fairly straightforward inductions on the structure of the typing derivations. Note that the progress theorems state more than just memory


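Expressions and commands:

    Γ ⊢ e1 : τ ref q ↦ C1    Γ ⊢ e2 : int ↦ C2
    --------------------------------------------
    Γ ⊢ e1 ⊕ e2 : τ ref q ↦ C1 ∪ C2 ∪ {q ≠ SAFE}

    Γ ⊢ e : τ' ↦ C    τ' ≤ τ ↦ C'
    ------------------------------
    Γ ⊢ (τ)e : τ ↦ C ∪ C'

    -----------------------------
    Γ ⊢ (τ ref q)0 : τ ref q ↦ ∅

    Γ ⊢ e : τ ref q ↦ C
    --------------------
    Γ ⊢ !e : τ ↦ C

    Γ ⊢ e1 : τ ref q ↦ C1    Γ ⊢ e2 : τ2 ↦ C2    τ2 ≤ τ ↦ C3
    ----------------------------------------------------------
    Γ ⊢ e1 := e2 ↦ C1 ∪ C2 ∪ C3

Convertibility:

    τ ≤ int ↦ ∅
    int ≤ τ ref q ↦ {q ≠ SAFE}
    τ1 ref q1 ≤ τ2 ref q2 ↦ {q1 ⊑ q2} ∪ {q1 = q2 = DYNQ ∨ τ1 ≈ τ2}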

Figure 7: Constraint generation rules.

safety. They also imply that well-typed computations of non-dynamic type are type preserving, similar to corresponding results for a type-safe language. This means that CCured is memory safe and is also type safe for the non-dynamic fragment.

6 Type Inference

So far we have considered the case of a program that is written using the CCured type system. Our implementation does allow the programmer to write such programs directly in C with the pointer kinds specified using the __attribute__ keyword of GCC. But our main goal is to be able to use CCured with existing, un-annotated C programs. For this purpose we have designed and implemented a type inference algorithm that, given a C program, constructs a set of pointer-kind qualifiers that make the program well-typed in the CCured type system.

Our inference algorithm can operate either on the whole program, or on modules whose interfaces have been annotated with pointer-kind qualifiers. We rely on the fact that the C program already uses types of the form "τ ref". All we need is to discover for each occurrence of the pointer type constructor whether it should be safe, sequence or dynamic.

To describe the inference algorithm we extend the CCured type language with the pointer type "τ ref q", where q is a qualifier variable ranging over the set of qualifier values {SAFE, SEQ, DYNQ} (where DYNQ is the qualifier associated with the DYNAMIC type).

The inference algorithm starts by introducing a qualifier variable for each syntactic occurrence of the pointer type constructor in the C program. We then scan the program and collect a set of constraints on these qualifier variables. Next we solve the system of constraints to produce a substitution S of qualifier variables with qualifier values and finally we apply the substitution to the types in the C program to produce a CCured program.

The substitution S is applied to types using the following rules:

    S(int)       = int
    S(τ ref q)   = DYNAMIC            if S(q) = DYNQ
                 = S(τ) ref S(q)      otherwise

Note that when the qualifier q is substituted with DYNQ we ignore the referenced type (τ) of the pointer, which is consistent with the idea that for the dynamic pointers we should not count on the declared referenced type. DYNAMIC pointers never point to typed areas and thus the inference algorithm is designed to infer only DYNQ qualifiers in the referenced types of DYNQ pointers.

The overall strategy of inference is to find as many SAFE and SEQ pointers as possible. Simply making all qualifiers DYNQ always yields a well-typed solution, but SAFE and SEQ pointers are preferred.

1. Constraint Collection. We collect constraints using a modified typing judgment written Γ ⊢ e : τ ↦ C and meaning that by scanning the expression e in context Γ we inferred type τ along with a set of constraints C. We also use the auxiliary judgments τ ≤ τ' ↦ C to collect constraints corresponding to the convertibility relation, and Γ ⊢ c ↦ C to express constraint collection from commands. The intent is that a solution to the set of constraints C, when applied as a substitution to the elements appearing before the symbol ↦, yields a valid typing judgment of the corresponding syntactic form in CCured. The rules for the constraint collection judgments are shown in Figure 7.

The constraints for pointer arithmetic are fairly straightforward and those for casts are expressed as a separate convertibility judgment. For memory reads and writes, we must bridge the gap between the rules of C and the rules of CCured. Specifically, we allow memory access through SEQ (not just SAFE) pointers, and we allow ints to be read or written through DYNAMIC pointers. In both cases, an implicit cast is inserted to yield a valid CCured program. In a memory write we allow for a conversion of the value being written to the referenced type.

To express the convertibility constraints in a concise way we introduce a convertibility relation on qualifier values, which essentially says SEQ can be cast to SAFE:

    q ⊑ q'  ≝  q = q' ∨ (q = SEQ ∧ q' = SAFE)

Finally, to capture the requirement that all DYNAMIC pointers point only to dynamically typed areas, for each type of the form τ ref q' ref q we collect a POINTSTO constraint q = DYNQ ⇒ q' = DYNQ.
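The substitution rules and the qualifier relation ⊑ above are small enough to sketch directly in C. The encoding below is our own illustration, not the paper's data structures: qual, qtype, qual_conv and apply_subst are hypothetical names, qualifier variables are represented as indices into an array S, and the resulting type is rendered as a string in the notation of Figure 4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef enum { SAFE, SEQ, DYNQ } qual;

/* A type with qualifier variables: int, or "base ref q" where q is an
   index into the substitution array. */
typedef struct qtype {
    const struct qtype *base;  /* NULL means int */
    int qvar;                  /* qualifier variable; unused for int */
} qtype;

/* The relation on qualifier values: equal, or SEQ being cast to SAFE. */
static bool qual_conv(qual q, qual q2) {
    return q == q2 || (q == SEQ && q2 == SAFE);
}

/* Write S(t) into buf. A DYNQ qualifier collapses the whole type to
   DYNAMIC, ignoring the declared referenced type. */
static void apply_subst(const qtype *t, const qual *S, char *buf, size_t cap) {
    assert(cap > 0);
    if (t->base == NULL) { snprintf(buf, cap, "int"); return; }
    if (S[t->qvar] == DYNQ) { snprintf(buf, cap, "DYNAMIC"); return; }
    apply_subst(t->base, S, buf, cap);
    size_t used = strlen(buf);
    snprintf(buf + used, cap - used, " ref %s",
             S[t->qvar] == SAFE ? "SAFE" : "SEQ");
}
```

Applied to the declaration of "a" in Figure 1 (int *1 *2) with q1 mapped to DYNQ and q2 mapped to SEQ, this produces DYNAMIC ref SEQ, which agrees with line 1 of Figure 4.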


After constraint generation we end up with a set containing the following four kinds of constraints:

$$\begin{array}{ll} \text{ARITH:} & q \ne \mathtt{SAFE} \\ \text{CONV:} & q \le q' \\ \text{POINTSTO:} & q = \mathtt{DYNQ} \implies q' = \mathtt{DYNQ} \\ \text{TYPEEQ:} & q = q' = \mathtt{DYNQ} \,\lor\, \tau_1 \approx \tau_2 \end{array}$$

The constraint $\tau_1 \approx \tau_2$ requires that a valid solution is a substitution $S$ that makes the types $S(\tau_1)$ and $S(\tau_2)$ identical. This notion is made more precise below.

2. Constraint Normalization. The next step is to normalize the generated constraints into a simpler form. Notice that the system of constraints we have generated so far has conditional constraints. The POINTSTO constraints are easy to handle because we can ignore them as long as the qualifier $q$ on the left is unknown, and if it becomes DYNQ, we add the constraint "$q' = \mathtt{DYNQ}$" to the system of constraints. If $q$ remains unknown at the end of the normalization process we will make it SAFE or SEQ.

However, the same is not true of the TYPEEQ constraints. If we postpone such constraints and the qualifiers involved remain unconstrained we would like to make them both SAFE or SEQ (to minimize the number of DYNAMIC pointers). But to do that we must introduce in the system the type equality constraint, which might lead to contradictions that require backtracking. Fortunately there is a simple solution to this problem. We start by simplifying the TYPEEQ constraint based on the possible forms of the types $\tau_1$ and $\tau_2$:

$$\begin{array}{lcl} q = q' = \mathtt{DYNQ} \,\lor\, \mathtt{int} \approx \mathtt{int} & \mapsto & \emptyset \\ q = q' = \mathtt{DYNQ} \,\lor\, \mathtt{int} \approx \tau_2\ \mathtt{ref}\ q_2 & \mapsto & \{q = \mathtt{DYNQ},\ q' = \mathtt{DYNQ}\} \\ q = q' = \mathtt{DYNQ} \,\lor\, \tau_1\ \mathtt{ref}\ q_1 \approx \mathtt{int} & \mapsto & \{q = \mathtt{DYNQ},\ q' = \mathtt{DYNQ}\} \\ q = q' = \mathtt{DYNQ} \,\lor\, \tau_1\ \mathtt{ref}\ q_1 \approx \tau_2\ \mathtt{ref}\ q_2 & \mapsto & \{q = q'\} \cup C \\ & & \text{where } q_1 = q_2 = \mathtt{DYNQ} \,\lor\, \tau_1 \approx \tau_2 \mapsto C \end{array}$$

The only subtlety is in the last rule. We observe that a constraint of the form "$q = q' = \mathtt{DYNQ} \,\lor\, \tau_1\ \mathtt{ref}\ q_1 \approx \tau_2\ \mathtt{ref}\ q_2$" arises only when the types "$\tau_1\ \mathtt{ref}\ q_1\ \mathtt{ref}\ q$" and "$\tau_2\ \mathtt{ref}\ q_2\ \mathtt{ref}\ q'$" appear in the program. This means that the following POINTSTO constraints also exist:

$$q = \mathtt{DYNQ} \implies q_1 = \mathtt{DYNQ}$$
$$q' = \mathtt{DYNQ} \implies q_2 = \mathtt{DYNQ}$$

This in turn means that the disjunct $q = q' = \mathtt{DYNQ}$ in the last reduction rule is redundant and can be eliminated.

After simplifying all TYPEEQ constraints, the normalized system has only the following kinds of constraints:

$$\begin{array}{ll} \text{ARITH:} & q \ne \mathtt{SAFE} \\ \text{CONV:} & q \le q' \\ \text{POINTSTO:} & q = \mathtt{DYNQ} \implies q' = \mathtt{DYNQ} \\ \text{ISDYN:} & q = \mathtt{DYNQ} \\ \text{EQ:} & q = q' \end{array}$$

3. Constraint Solving. The final step in our algorithm is to solve the remaining set of constraints. The algorithm is quite simple:

3.1 Propagate the ISDYN constraints using the constraints EQ, CONV, and POINTSTO. After this is done all the other qualifier variables can be made SEQ or SAFE, as follows:

3.2 All qualifier variables involved in ARITH constraints are set to SEQ and this information is propagated using the constraints EQ and CONV (in this latter case the SEQ information is propagated only from $q'$ to $q$, or against the direction of the cast).

3.3 We make all the other variables SAFE.

Essentially, we find first the minimum number of DYNQ qualifiers. Among the remaining qualifiers we find those on which pointer arithmetic is performed and we make them SEQ, and the remaining qualifiers are SAFE. This solution is the best one possible in terms of maximizing the number of SAFE and SEQ pointers.

The whole type inference process is linear in the size of the program. A linear number of qualifier variables is introduced (one for each syntactic occurrence of a pointer type constructor), then a linear number of constraints is created (one for each cast or memory read or write in the program). During the simplification of the TYPEEQ constraints the number of constraints can get multiplied by the maximum nesting depth of a qualifier in a type. Finally, constraint solving is linear in the number of constraints.

7 Handling the Rest of C

In the interest of clarity we have formalized in this paper only a small subset of the CCured dialect of C. Our implementation handles the entire C programming language along with most of the extensions in the GNU C dialect. In this section we discuss informally how we handle the rest of the C programming language. The full details are presented in a forthcoming paper.

In the DYNAMIC world, structures and arrays are simply alternative notations for saying how many bytes of storage to allocate. In the SAFE world, structure accesses are required to respect the types of all fields. For example, it is possible to have a SAFE pointer to a structure field, but you cannot perform arithmetic on such a pointer. We treat unions as syntactic sugar for casts.

Explicit deallocation is currently ignored, and the Boehm-Weiser conservative garbage collector [3] is used to reclaim storage. However, the CCured system maintains enough type information to allow the use of a precise collector; this may be future work.

The address-of operator in C can yield a pointer to a stack-allocated variable. The variable to which the pointer points may be inferred to live in the DYNAMIC or SAFE worlds, depending on how the pointer is used. The only difficulty is that the storage will be deallocated when the function returns, so the CCured run-time checks ensure no stack pointer ever gets written into the heap or globals. This restriction


allows the common use of address-of to implement call-by-reference; for other uses, the storage in question may have to be allocated on the heap instead.

DYNAMIC function pointers and variable-argument functions are also handled in CCured, by passing a hidden argument which specifies the types of all arguments passed. The hidden argument is then checked in the callee, and parameters interpreted accordingly. Among other things, this level of checking is sufficient to detect format string errors.

Certain C library functions must be handled specially. Several functions (of which malloc is the most important) are treated polymorphically, lest all dynamically-allocated data be marked DYNAMIC. A few others impose constraints on argument qualifiers: e.g., memcpy internally does pointer arithmetic, and hence cannot accept SAFE pointers.

8 Source Changes

The CCured type system and inference algorithm are designed to minimize the amount of source changes required to conform to its restrictions. However, there are still a few cases in which legal C programs will stop with a failed run-time check. In those cases manual intervention is necessary.

One common situation is when the program stores a pointer in a variable declared to hold an integer, then casts it back to a pointer and dereferences the pointer. In some cases, it suffices to change the variable's declaration from (say) unsigned long to void*. This type will certainly be marked DYNAMIC, but it will work. For other programs, we may be able to replace casts with pointer arithmetic. For example, if e is a sequence or dynamic pointer expression then the legal CCured expression "e ⊕ (n - (int)e)" is effectively a cast of the integer n to a pointer (with the same home as e). As a last resort, it is possible to query the garbage collector at run time to find the home and type of any pointer, but so far this has not been necessary.

Another problem in otherwise legal C code is the interaction between sizeof and our fat pointers: one must change occurrences of sizeof(type) to sizeof(expression), whenever type contains pointers. A typical example, allocating an array of 5 integer pointers, is

    int **p = (int**)malloc(5 * sizeof(int*));

This code will always allocate space for 5 SAFE pointers, even if p is inferred to point to SEQ or DYNAMIC pointers. This code must be changed to

    int **p = (int**)malloc(5 * sizeof(*p));

so the size passed to malloc is related to the size of *p.

While most uses of address-of are to implement call-by-reference, some programs attempt to store stack pointers into memory. Among the programs we have compiled with CCured, only two (the SPECINT95 benchmarks li and ijpeg) do this. The solution is to annotate certain local variables with a qualifier that causes them to be allocated on the heap. For li, which makes fairly extensive use of this feature, this results in a performance penalty of about 25%.

When CCured changes the representation of pointers, this can lead to problems when calling functions in libraries that were not compiled with CCured. The typical solution is to write wrapper functions which translate between two-word and one-word arguments and return values. The wrapper must do the run-time checks associated with the pointers, before passing them to the underlying library.

The wrapper solution works well for the standard C library. However, we expect to encounter difficulties when interoperating with third-party libraries whose interface involves passing pointers to large structures which themselves contain pointers. We are experimenting with an alternative implementation scheme in which the bookkeeping information for sequence and dynamic pointers that escape the CCured world is kept in a global table so that we do not have to change the representation of exported data structures.

9 Experiments

We ran through our translator several C programs ranging in size from 400 (treeadd) to 30,000 (ijpeg) lines of code (including whitespace and comments), with several goals. First, we wanted to measure the performance impact of the run-time checks introduced by our translator. Second, we wanted to see how effective our inference system is at eliminating these checks. Finally, we investigated what changes to the program source are required to make the program run under the CCured restrictions.

We used several test cases, some from SPECINT95 [26]: compress is LZW data compression; go plays the board game Go; ijpeg compresses image files; li is a Lisp interpreter; and some from the Olden benchmark suite [6], a collection of small, compute-intensive kernels: bh is an n-body simulator; bisort is a sorting algorithm; em3d solves an electromagnetism problem; health simulates Colombia's health care system; mst computes minimum spanning trees; perimeter computes perimeters of regions in images; power simulates power market prices; treeadd simply builds a binary tree; tsp uses a greedy algorithm to approximately solve random Traveling Salesman Problem instances; and voronoi constructs Voronoi diagrams.

Most of the source changes needed to run these benchmarks were simple syntactic adjustments, such as adding (or correcting) prototypes and marking printf-like functions. A few benchmarks required changing sizeof (prevalent in ijpeg) or moving locals into the heap (for li). No program required changes to the data structures or other basic design elements. A number of remaining bugs in our implementation prevent us from applying CCured to the other benchmarks in the SPECINT95 suite.

The running time (median of five) of each of the benchmarks is shown in Figure 8. The measurements were made on an otherwise quiescent 1GHz AMD Athlon, 768MB machine, using gcc-2.95.3 with -O2 or -O3 optimization level (depending on benchmark size).

In all cases the pointer kind inference was performed over the whole program. However, because inference time is linear in the size of the program (as argued in Section 6), we have not observed scalability problems with our whole-program approach. In fact, our biggest scalability problem is with the optimizer in the C compiler that consumes our output (presented as a single, large C source file).

Most of the benchmarks have between 30% and 150% slowdown. To measure the effectiveness of our inference algorithm we used CCured with a naive inference algorithm that makes all pointers DYNAMIC. The slowdown in this case is more significant (6 to 20 times slower) and it approaches that reported by other researchers [2, 13, 18, 19] who tried an all-run-time-checks approach to memory safety for C. For example, the most pointer-intensive benchmark is li, which runs 16 times slower if all pointers are blindly marked DYNAMIC; however, once the inference discovers that 99% of the pointers are SAFE or SEQ, it is only twice as slow.

    Name        Lines     Orig.     CCured     CCured   Purify
                of code   time      sf/sq/d    ratio    ratio
    SPECINT95
    compress     1590     9.586s    87/12/0    1.25      28
    go          29315     1.191s    96/4/0     2.01      51
    ijpeg       31371     0.963s    36/1/62    2.15      30
    li           7761     0.176s    93/6/0     1.86      50
    Olden
    bh           2053     2.992s    80/18/0    1.53      94
    bisort        707     1.696s    90/10/0    1.03      42
    em3d          557     0.371s    85/15/0    2.44       7
    health        725     2.769s    93/7/0     0.94      25
    mst           617     0.720s    87/10/0    2.05       5
    perimeter     395     4.711s    96/4/0     1.077    544
    power         763     1.647s    95/6/0     1.31      53
    treeadd       385     0.613s    85/15/0    1.47     500
    tsp           561     3.093s    97/4/0     1.15      66

Figure 8: CCured versus original performance. The measurements are presented as ratios, where 2.00 means the program takes twice as long to run when instrumented with CCured. The "sf/sq/d" column shows the percentage of (static) pointer declarations which were inferred SAFE, SEQ and DYNAMIC, respectively.

Program size has a big influence on how many of the pointers can be statically verified. Small programs like the Olden benchmark suite tend to have few data types, and they are used in straightforward ways. Large programs, especially those designed to be extended in the future, use pointers in many ways. In the case of ijpeg, it uses object-oriented downcasts throughout, and thus a large number of the pointers become DYNAMIC.

We discovered and fixed several bugs in the SPECINT95 benchmarks: compress and ijpeg each contain one array bounds violation, and go has (at least) eight array bounds violations and one use of an uninitialized integer variable as an array index. In each case we verified that fixing the bug did not change the program's eventual output (for the test vectors considered), which partially explains how these bugs survived for so long in otherwise well-tested programs.

Most of the bugs in go involved erroneous index arithmetic within large, multi-dimensional arrays. Finding these bugs demonstrates an advantage of our type-sensitive approach. If we simply marked all home areas as untyped, and only checked for errors when a pointer strayed out of its home area, we would miss errors that happen to stay within the intended home area. While we were originally motivated by performance to discover safe pointers, we found that doing so enhanced our bug-finding ability too.

The last column in Figure 8 shows the slowdown of these programs when instrumented with Purify (version 2001A) [10], a tool that instruments existing C binaries to detect memory errors and leaks by keeping two bits of storage for each byte in the heap (unallocated, uninitialized and initialized). However, Purify does not catch pointer arithmetic that yields a pointer to a separate valid region [13], a property that Fischer and Patil [20] show to be important. Purify tends to slow programs down by a factor of 10 or more, much more than CCured. Of course, Purify does not require source code, so may be applicable in more situations. Purify did find the uninitialized variable in go, but none of the other bugs, because the accesses in question did not stray far enough to be noticed.

10 Related Work

Abadi et al. [1] study the theoretical aspects of adding a Dynamic type to the simply-typed λ-calculus and discuss extensions to polymorphism and abstract data types. Thatte [28] extends their system to replace the typecase expressions with implicit casts. Their system does not handle reference types or memory updates and Dynamic types are introduced to add flexibility to the language. In contrast our system was designed to handle memory reads and writes, allows DYNAMIC values to be manipulated (e.g., via pointer arithmetic) without checking their tags, and uses DYNAMIC types to guarantee the safety of code that cannot be statically verified.

Chandra and Reps [8] present a method for physical type checking of C programs based on structure layout in the presence of casts. Their inference method can reason about casts between various structure types by considering the physical layout of memory. Our example in Section 2 would fail to type check in their system for the same reason that we must mark some of the pointers DYNAMIC: its safety cannot be guaranteed at compile time. Siff et al. [24] identify that many casts in C programs are safe upcasts and present a tool to check such casts.

The programming languages CLU [17], Cedar/Mesa [16] and Modula-{2+,3} [5] include similar notions of a dynamic type and a typecase statement. This idea can also be seen in CAML's exception type [22].

Other related work in this area falls into three broad categories: (1) extensions to C's type system, (2) adding run-time checks to C, and (3) removing run-time checks from LISP.

Previous efforts to extend C's type system usually deal with polymorphism. Smith et al. [25] present a polymorphic and provably type-safe dialect of C that includes most of C's features (and higher-order functions, which our current system handles weakly) but lacks casts and structures. Evans [9] describes a system in which programmer-inserted annotations and static checking techniques can find errors and anomalies in large programs. Ramalingam et al. [21] have presented an algorithm for finding the coarsest acceptable type for structures in C programs. Most such type systems and inference methods are presented as sources of information. In this paper we present a type and inference system with the goal of making programs safe.
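The distinction between the safe upcasts studied by Siff et al. and the object-oriented downcasts that cause many ijpeg pointers to become DYNAMIC can be pictured with a small C sketch. The struct names, the tag values and both helper functions are hypothetical and chosen only for illustration; they are not taken from ijpeg or from any of the cited tools.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tags identifying the concrete "subclass". */
enum { KIND_SQUARE = 1, KIND_CIRCLE = 2 };

/* "Base class": every shape starts with this header. */
struct Shape { int kind; };

/* "Derived class": the base is the first member, so a Circle
   begins with a valid Shape in memory. */
struct Circle { struct Shape base; int radius; };

/* Upcast: always valid, because a pointer to a struct, suitably
   converted, points to its first member. */
int shape_kind(const struct Circle *c) {
    assert(c != NULL);
    const struct Shape *s = (const struct Shape *)c;
    return s->kind;
}

/* Downcast: valid only if s really is the header of a Circle.
   The layout alone cannot show this, so the sketch checks the tag
   at run time before casting. */
int circle_radius_or(const struct Shape *s, int fallback) {
    assert(s != NULL);
    if (s->kind != KIND_CIRCLE) {
        return fallback;
    }
    return ((const struct Circle *)s)->radius;
}
```

The upcast can be justified from the structure layout alone, whereas the downcast depends on a run-time fact about the object, which is the kind of cast that cannot be verified statically.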


There have been many attempts to bring some measure of safety to C in the past by trading space and speed for security. Previous techniques have been concerned with spatial access errors (array bounds checks and pointer arithmetic) and temporal access errors (touching memory that has been freed) but none of them use a static analysis of the form presented here. Kaufer et al. [14] present an interpretive scheme called Saber-C that can detect a rich class of errors (including uninitialized reads and dynamic type mismatches but not all temporal access errors) but runs about 200 times slower than normal. Austin et al. [2] store extra information with each pointer and achieve safety at the cost of a large (up to 540% speed and 100% space) overhead and a lack of backwards compatibility. Jones and Kelly [13] store extra information for run-time checks in a splay tree, allowing safe code to work with unsafe libraries. This results in a slowdown factor of 5 to 6. Fischer and Patil have presented a system that uses a second processor to perform the bounds checks [19]. The total execution overhead of a program is typically only 5% using their technique but it requires a dedicated second processor. Loginov et al. [18] store type information with each memory location, incurring a slowdown factor of 5 to 158. This extra information allows them to perform more detailed checks and they can detect when stored types mismatch declared types or union members are accessed out of order. While their tool and ours are similar in many respects their goal is to provide rich debugging information and ours is to make C programs safe while retaining efficiency. Steffen's rtcc compiler [27] is portable and adds object attributes to pointers but fails to detect temporal access errors and does not perform any check optimizations. In fact, beyond array bounds check elimination, none of these techniques use type-based static analysis to aggressively reduce the overhead of the instrumented code.

Finally, much work has been done to remove dynamic checks and tagging operations from LISP-like languages. Henglein [11] details a type inference scheme to remove tagging and untagging operations in LISP-like languages. The overall structure of his algorithm is very similar to ours (simple syntax-directed constraint generation, constraint normalization, constraint solving) but the domain of discourse is quite different because his base language is dynamically typed. In Henglein's system each primitive type constructor is associated with exactly one tag, so there is no need to deal with the pointer/array ambiguity that motivates our SEQ pointers. In C it is sometimes necessary to allocate an object as having a certain type and later view it as having another type: Henglein's system disallows this because tags are set at object creation time (that is, true C-style casts or unions are not fully supported [12]). Henglein is also able to sidestep update and aliasing issues because tagging and untagging create a new copy of the object (to which set! can be applied, for example) so one never has tagged and untagged aliases for the same item. His algorithm does not consider polymorphism or module compilation [15]. The CCured system uses a form of physical subtyping for pointers to structures and it is not clear how to extend Henglein's constraint normalization procedure in such a case.

Jagannathan et al. [12] use a more expensive and more precise flow-sensitive analysis called polymorphic splitting to eliminate run-time checks from higher-order call-by-value programs. Shields et al. [23] present a system in which dynamic typing and staged computation (run-time code generation) coexist: all deferred computations have the same dynamic type at compile-time and can be checked precisely at run-time. Such a technique can handle persisting dynamic data, a weakness of our current system. Soft type systems [7] also infer types for procedures and data structures in dynamically-typed programs. Advanced soft type systems [30] can be based on inclusion subtyping and can handle unions, recursive types and other complex language features. Finally, [15] presents a practical ML-style type inference system for LISP. As with Henglein [11], such systems start with a dynamically typed language and thus tackle a different core problem.

11 Conclusion and Future Work

The C programming language is the language of choice for systems programming because of its flexibility and control over the layout of data structures and the use of pointers. Unfortunately, this comes at the expense of type safety. In this paper we propose a scheme that combines program analysis and run-time checking to bring type safety to existing C programs by trading off some performance.

The key insight of this work is that even in C programs most pointers are used in such a way that they can be verified to be type safe using typing rules similar to those of strongly typed languages. Furthermore the rest of the pointers can be checked at run-time to ensure that they are indeed used safely. The entire approach hinges on the ability to infer accurately which pointers need to be checked at run time and which do not. We present a surprisingly simple type inference algorithm that is able to do just that.

Perhaps the most surprising result of our experiments is that in many C programs most pointers are perfectly safe (and our inference is able to discover that), which means that those programs are just as safe as if they had been written in a type-safe language. Consequently the cost of enforcing safety for many C programs is relatively low and even with a prototype implementation we were able to achieve overheads several times smaller than those of comparable tools that rely exclusively on run-time checking.

The two flavors of typed pointers that we present in this paper cover many of the programming paradigms encountered in C programs. But there are still other operations on pointers that could be statically proven safe, which our type system fails to recognize. The most important example is tagged union types with incompatible members, which the current CCured system flags as untyped; a special case is object-oriented "downcasts," used heavily by ijpeg. To handle these situations without resorting to the DYNAMIC sledgehammer, it would be useful to have a more expressive language of pointer types than our current system provides.

12 Acknowledgments

We would like to thank Alex Aiken and Jeff Foster for useful comments on earlier drafts of this paper and also Raymond


To, Aman Bhargava, S. P. Rahul, and Danny Antonetti for helping with the implementation of the CCured system.

References

[1] M. Abadi, L. Cardelli, B. Pierce, and G. Plotkin. Dynamic typing in a statically typed language. ACM Transactions on Programming Languages and Systems, 13(2):237-268, April 1991.

[2] T. M. Austin, S. E. Breach, and G. S. Sohi. Efficient detection of all pointer and array access errors. SIGPLAN Notices, 29(6):290-301, June 1994. Proceedings of the ACM SIGPLAN '94 Conference on Programming Language Design and Implementation.

[3] H.-J. Boehm and M. Weiser. Garbage collection in an uncooperative environment. Software: Practice and Experience, 27:807-820, Sept. 1997.

[4] V. Breazu-Tannen, C. A. Gunter, and A. Scedrov. Computing with coercions. In LISP and Functional Programming, pages 44-60, 1990.

[5] L. Cardelli, J. Donahue, L. Glassman, M. Jordan, B. Kalsow, and G. Nelson. Modula-3 report, 1989.

[6] M. C. Carlisle. Olden: Parallelizing Programs with Dynamic Data Structures on Distributed-Memory Machines. PhD thesis, Princeton University Department of Computer Science, June 1996.

[7] R. Cartwright and M. Fagan. Soft typing. In Proceedings of the '91 Conference on Programming Language Design and Implementation, pages 278-292, 1991.

[8] S. Chandra and T. Reps. Physical type checking for C. In Proceedings of the ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, volume 24.5 of Software Engineering Notes (SEN), pages 66-75. ACM Press, Sept. 6 1999.

[9] D. Evans. Static detection of dynamic memory errors. ACM SIGPLAN Notices, 31(5):44-53, 1996.

[10] R. Hastings and B. Joyce. Purify: Fast detection of memory leaks and access errors. In Proceedings of the Usenix Winter 1992 Technical Conference, pages 125-138, Berkeley, CA, USA, Jan. 1991. Usenix Association.

[11] F. Henglein. Global tagging optimization by type inference. In Proceedings of the 1992 ACM Conference on LISP and Functional Programming, pages 205-215, 1992.

[12] S. Jagannathan and A. Wright. Effective flow analysis for avoiding run-time checks. In Proceedings of the Second International Static Analysis Symposium, volume 983, pages 207-224. Springer-Verlag, 1995.

[13] R. W. M. Jones and P. H. J. Kelly. Backwards-compatible bounds checking for arrays and pointers in C programs. AADEBUG, 1997.

[14] S. Kaufer, R. Lopez, and S. Pratap. Saber-C: an interpreter-based programming environment for the C language. In Proceedings of the Summer Usenix Conference, pages 161-171, 1988.

[15] A. Kind and H. Friedrich. A practical approach to type inference for EuLisp. Lisp and Symbolic Computation, 6(1/2):159-176, 1993.

[16] B. Lampson. A description of the Cedar language. Technical Report CSL-83-15, Xerox Palo Alto Research Center, 1983.

[17] B. Liskov, R. R. Atkinson, T. Bloom, E. B. Moss, R. Schaffert, and A. Snyder. CLU Reference Manual. Springer-Verlag, Berlin, 1981.

[18] A. Loginov, S. Yong, S. Horwitz, and T. Reps. Debugging via run-time type checking. In Proceedings of FASE 2001: Fundamental Approaches to Software Engineering, Apr. 2001.

[19] H. Patil and C. N. Fischer. Efficient run-time monitoring using shadow processing. In Automated and Algorithmic Debugging, pages 119-132, 1995.

[20] H. Patil and C. N. Fischer. Low-cost, concurrent checking of pointer and array accesses in C programs. Software: Practice and Experience, 27(1):87-110, Jan. 1997.

[21] G. Ramalingam, J. Field, and F. Tip. Aggregate structure identification and its application to program analysis. In Symposium on Principles of Programming Languages, pages 119-132, Jan. 1999.

[22] D. Remy and J. Vouillon. Objective ML: A simple object-oriented extension of ML. In Symposium on Principles of Programming Languages, pages 40-53, 1997.

[23] M. Shields, T. Sheard, and S. L. P. Jones. Dynamic typing as staged type inference. In Symposium on Principles of Programming Languages, pages 289-302, 1998.

[24] M. Siff, S. Chandra, T. Ball, K. Kunchithapadam, and T. Reps. Coping with type casts in C. In 1999 ACM Foundations on Software Engineering Conference (LNCS 1687), volume 1687 of Lecture Notes in Computer Science, pages 180-198. Springer-Verlag / ACM Press, September 1999.

[25] G. Smith and D. Volpano. A sound polymorphic type system for a dialect of C. Science of Computer Programming, 32(1-3):49-72, 1998.

[26] SPEC 95. Standard Performance Evaluation Corporation Benchmarks. http://www.spec.org/osg/cpu95/CINT95, July 1995.

[27] J. L. Steffen. Adding run-time checking to the Portable C Compiler. Software: Practice and Experience, 22(4):305-316, Apr. 1992.

[28] S. Thatte. Quasi-static typing. In Conference Record of the 17th ACM Symposium on Principles of Programming Languages (POPL), pages 367-381, 1990.

[29] D. Wagner, J. Foster, E. Brewer, and A. Aiken. A first step toward automated detection of buffer overrun vulnerabilities. In Network Distributed Systems Security Symposium, pages 1-15, Feb. 2000.

[30] A. Wright and R. Cartwright. A practical soft type system for Scheme. ACM Transactions on Programming Languages and Systems, 1997.
