CCured: Type-Safe Retrofitting of Legacy Code
George C. Necula, Scott McPeak, Westley Weimer
University of California, Berkeley
{necula,smcpeak,weimer}@cs.berkeley.edu

Abstract

In this paper we propose a scheme that combines type inference and run-time checking to make existing C programs type safe. We describe the CCured type system, which extends that of C by separating pointer types according to their usage. This type system allows both pointers whose usage can be verified statically to be type safe, and pointers whose safety must be checked at run time. We prove a type soundness result and then we present a surprisingly simple type inference algorithm that is able to infer the appropriate pointer kinds for existing C programs.

Our experience with the CCured system shows that the inference is very effective for many C programs, as it is able to infer that most or all of the pointers are statically verifiable to be type safe. The remaining pointers are instrumented with efficient run-time checks to ensure that they are used safely. The resulting performance loss due to run-time checks is 0-150%, which is several times better than comparable approaches that use only dynamic checking. Using CCured we have discovered programming bugs in established C programs such as several SPECINT95 benchmarks.

This research was supported in part by the National Science Foundation Career Grant No. CCR-9875171, and ITR Grants No. CCR-0085949 and No. CCR-0081588, and gifts from AT&T Research and Microsoft Research. The information presented here does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. POPL '02, Jan. 16-18, 2002 Portland, OR USA (c) 2002 ACM ISBN 1-58113-450-9/02/01...$5.00

1 Introduction

The C programming language provides programmers with a great deal of flexibility in the representation of data and the use of pointers. These features make C the language of choice for systems programming. Unfortunately, the cost is a weak type system and consequently a great deal of "flexibility" in introducing subtle bugs in programs.

While in the 1970s sacrificing type safety for flexibility and performance might have been a sensible language design choice, today there are more and more situations in which type safety is just as important as, if not more important than, performance. Errors like array out-of-bounds accesses lead both to painful debugging sessions chasing inadvertent memory updates and to malicious attacks exploiting buffer overrun errors in security-critical code. (Almost 50% of recent CERT advisories result from security violations of this kind [29].) Type safety is desirable for isolating program components in a large or extensible system, without the loss of performance of separate address spaces. It is also valuable for inter-operation with systems written in type-safe languages (such as type-safe Java native methods, for example). Since a great deal of useful code is already written or being written in C, it would be useful to have a practical scheme to bring type safety to these programs.

The work described in this paper is based on two main premises. First, we believe that even in programs written in unsafe languages like C, a large part of the program can be verified statically to be type safe. Then the remaining part of the program can be instrumented with run-time checks to ensure that the execution is memory safe. The second premise of our work is that in many applications, some loss of performance due to run-time checks is an acceptable price for type safety, especially when compared to the alternative cost of reprogramming the system in a type-safe language.

The main contribution of this paper is the CCured type system, an extension of the C type system with explicit types for pointers into arrays, and with dynamic types. It extends previous work on adding dynamic types to statically typed languages: types and capabilities for the statically-typed elements are known at compile time, while the dynamically-typed elements are guarded by run-time checks. Our type system is inspired by common C usage, and includes support for physical type equivalence [8] and special "sequence" pointers for accessing arrays. The second contribution of the paper is a simple yet effective type-inference algorithm that can translate ordinary C programs into CCured mostly automatically and in a matter of seconds even for 30,000-line programs. We have used this inference algorithm to produce type-safe versions of several C programs for which we observed a slowdown of 0-150%. In the process, we have also found programming bugs in the analyzed code, the most surprising being several array out-of-bounds errors in the SPECINT95 compress, go and ijpeg benchmarks.

We continue in Section 2 with an informal overview of the system in the context of a small example program. Then in Section 3 we present a simple language of pointers, with its type system (in Section 4) and operational semantics (in Section 5), followed by a discussion of the type safety guarantees of CCured programs. In Section 6 we present a simple constraint-based type inference algorithm for CCured. We discuss informally the extension of the language presented in this paper to the whole C programming language in Section 7, necessary source code changes in Section 8, and in Section 9 we relate our experience with a prototype implementation.

2 Overview of the Approach

To ensure memory safety, for each pointer we must keep track of certain properties of the memory area to which it is supposed to point. Such properties include the area's size and the types of the values it contains. For some pointers this information can be computed precisely at compile time and for others we must compute it at run time, in which case we have to insert run-time safety checks.

These two kinds of pointers appear in the example C program shown in Figure 1. The program operates on a hypothetical disjoint union datatype we call "boxed integer" that has been efficiently implemented in C as follows: if a boxed integer value is odd then it represents a 31-bit integer in the most significant bits along with a least significant tag bit equal to one, otherwise it represents a pointer to another "boxed integer". We use the C datatype int * to represent boxed integers. The variable a is a pointer to an array of boxed integers. The purpose of the function is to accumulate in the variable acc the sum of the first 100 boxed integers in the array. In line 8 we compute the address of a boxed integer and in line 9 we fetch the boxed integer. The loop in lines 10-12 unboxes the integer. The subscripts on the pointer type constructors "*" have been added to simplify cross-referencing from the text.

     1  int *1 *2 a;                      // array
     2  int i;                            // index
     3  int acc;                          // accumulator
     4  int *3 *4 p;                      // elem ptr
     5  int *5 e;                         // unboxer
     6  acc = 0;
     7  for (i=0; i<100; i++) {
     8    p = a + i;                      // ptr arith
     9    e = *p;                         // read elem
    10    while ((int) e % 2 == 0) {      // check tag
    11      e = * (int *6 *7) e;          // unbox
    12    }
    13    acc += ((int)e >> 1);           // strip tag
    14  }

Figure 1: A short C program fragment demonstrating safe and unsafe use of pointers.

By inspection of the program we observe that the values of the variables a and p are supposed to point into the same array. Neither of these variables is subject to casts (and they have no other aliases) and thus we know that the type of the values they point to is indeed "int *".

[...] the static type of "e" as being an accurate description of its values. In our type system we say that "e" has a dynamic pointer type and we associate this information with the pointer type constructors *5 and *7. Dynamic pointers always point into memory areas whose contents do not have a reliable static type and must therefore store extra information to classify their contents as pointers or integers. Correspondingly, the aliases of dynamic pointers can only be other dynamic pointers; otherwise a safe pointer's static type assumptions could be violated after a memory write through a dynamic pointer alias. This means that the type constructors *1, *3 and *6 must also be classified as dynamic pointers.

In this example program, we have a mixture of pointers whose static type can be relied upon and thus require little or no access checks (the safe and sequence pointers), and also pointers whose static type is unreliable and thus require more extensive checking.

Motivated by this and similar examples, the CCured language is essentially the union of two languages: a strongly typed language (containing safe and sequence pointers), and an untyped language for which the type information is maintained and checked at run time.
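To make the "boxed integer" encoding concrete, the following is a minimal, self-contained C sketch of it in plain, unchecked C. It is an illustration under stated assumptions, not code from the paper and not CCured output: the names boxed_t, box_int, unbox, sum_boxed and demo_sum are hypothetical, the tag test uses uintptr_t rather than the (int) casts of the original fragment because int may be narrower than a pointer on current platforms, values are assumed to be non-negative and to fit in 31 bits, and pointer-sized objects are assumed to be at least 2-byte aligned so that real pointers always have a zero tag bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A boxed integer lives in an "int *" slot, as in the text:
   odd value  -> 31-bit integer in the upper bits, tag bit 1;
   even value -> pointer to another boxed integer. */
typedef int *boxed_t;

/* Hypothetical helper: tag a non-negative integer (assumed to fit in 31 bits). */
static boxed_t box_int(int v) {
    assert(v >= 0);
    return (boxed_t)(((uintptr_t)v << 1) | (uintptr_t)1);
}

/* Follow pointers until a tagged integer is found, then strip the tag.
   Nothing here checks that an even value really is a valid pointer;
   that is the kind of access the text says needs run-time checking. */
static int unbox(boxed_t e) {
    while (((uintptr_t)e & 1u) == 0) {   /* check tag: even means pointer */
        e = *(boxed_t *)e;               /* unbox: read the boxed integer it points to */
    }
    return (int)((uintptr_t)e >> 1);     /* strip tag */
}

/* Sum the first n boxed integers of the array a. */
static int sum_boxed(boxed_t *a, size_t n) {
    int acc = 0;
    for (size_t i = 0; i < n; i++) {
        boxed_t *p = a + i;              /* pointer arithmetic into the array */
        boxed_t e = *p;                  /* read element */
        acc += unbox(e);
    }
    return acc;
}

/* Hypothetical demo: 10, a doubly indirected 20, and 30. */
static int demo_sum(void) {
    boxed_t inner = box_int(20);         /* tagged integer 20 */
    boxed_t middle = (boxed_t)&inner;    /* pointer to a boxed integer */
    boxed_t a[3];
    a[0] = box_int(10);
    a[1] = (boxed_t)&middle;             /* two levels of indirection */
    a[2] = box_int(30);
    return sum_boxed(a, 3);
}
```

In this sketch, a and p are never cast and are only used for indexing and arithmetic, while e is repeatedly reinterpreted as either an integer or a pointer. That is the distinction the overview draws between pointers whose static type can be trusted and those that need run-time tags.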