<<

University of Pennsylvania ScholarlyCommons

Departmental Papers (CIS) Department of Computer & Information Science

2006

SMART C: A Semantic Replacement Translator for C

Matthew Jacobs University of Pennsylvania, [email protected]

E. Christopher Lewis University of Pennsylvania, [email protected]

Follow this and additional works : https://repository.upenn.edu/cis_papers

Part of the Computer Engineering Commons

Recommended Citation Matthew Jacobs and E. Christopher Lewis, "SMART C: A Semantic Macro Replacement Translator for C", . January 2006.

Suggested Citation: Jacobs, M. and E.C. Lewis. (2006). SMART C: A Semantic Macro Replacement Translator for C. Proceedings of the Sixth IEEE International Workshop on Source Code Analysis and Manipulation. New York: IEEE.

©2006 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

This paper is posted at ScholarlyCommons. https://repository.upenn.edu/cis_papers/438 For information, please contact [email protected]. SMART C: A Semantic Macro Replacement Translator for C

Abstract Programmers often want to transform the source or binary representations of their programs (e.g., to optimize, add dynamic safety checks, or add profile gathering code). Unfortunately, existing approaches to program transformation are either disruptive to the source (hand transformation), difficulto t implement (ad hoc analysis tools), or functionally limited (macros). We propose an extension to the C programming language called the Semantic Macro Replacement Translator (SMART C). SMART C allows for the specification of very general -aware transformations of all operations, statements, and declarations of the C programming language without exposing the programmer to the complexities of the system’s internal representations. We have implemented a prototype SMART C sourceto- source translator and show its use in transforming programs for buffer overflow detection, format string vulnerability detection, and weighted call profiling. eW show that SMART C achieves a pragmatic balance between generality and ease of use.

Disciplines Computer Engineering | Engineering

Comments Suggested Citation: Jacobs, M. and E.C. Lewis. (2006). SMART C: A Semantic Macro Replacement Translator for C. Proceedings of the Sixth IEEE International Workshop on Source Code Analysis and Manipulation. New York: IEEE.

©2006 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

This conference paper is available at ScholarlyCommons: https://repository.upenn.edu/cis_papers/438 SMART C: A Semantic Macro Replacement Translator for C

Matthew Jacobs E Christopher Lewis Department of Computer and Information Science University of Pennsylvania {mrjacobs,lewis}@cis.upenn.edu

Abstract tions on C programs, but they are each limited in that they are disruptive to the program source, beyond the reach of Programmers often want to transform the source or bi- typical programmers, or restricted in the transformations nary representations of their programs (e.g., to optimize, they may describe. Hand transformation clearly suffers add dynamic safety checks, or add profile gathering code). from the first limitation. The powerful approach to Unfortunately, existing approaches to program transforma- program transformation is to augment an existing compiler, tion are either disruptive to the source (hand transforma- such as GCC, to build an ad hoc transformation tool. Un- tion), difficult to implement (ad hoc analysis tools), or func- fortunately, this requires considerable effort and expertise; tionally limited (macros). We propose an extension to the most programmers lack one or both of these. Binary and C programming language called the Semantic Macro Re- dynamic rewriting tools (e.g., ATOM [23] and Pin [16]) are placement Translator (SMART C). SMART C allows for powerful, but they cannot reliably transform source-level the specification of very general type-aware transforma- constructs because some constructs (e.g., structure field ac- tions of all operations, statements, and declarations of the cess) are not necessarily apparent at the instruction level. C programming language without exposing the program- Aspect-oriented programming (AOP) systems [13] are easy mer to the complexities of the system’s internal representa- to use, but existing AOP designs (even those applied to tions. We have implemented a prototype SMART C source- C) are limited in the language-level constructs that may be to-source translator and show its use in transforming pro- transformed. Finally, macro systems such as cpp and grams for buffer overflow detection, format string vulnera- are simple and easy to use, but they are very limited in the bility detection, and weighted call graph profiling. We show transformations they may specify. that SMART C achieves a pragmatic balance between gen- In this paper, we propose a modest extension to the C erality and ease of use. programming language called the Semantic Macro Replace- ment Translator for C (SMART C). Unlike token-based macros (e.g., those of cpp), semantic macros operate on the 1. Introduction abstract syntax of a program and are type aware. SMART C Programmers and users often want to transform the allows for the transformation of any declarative or compu- source or binary representations of their C language pro- tational element of the C language without exposing the grams in particular ways. For example, the performance internal representation of the compiler. As a result, pro- conscious would like to perform domain or application- grammers can freely transform variable and function dec- specific optimization without hard coding these optimiza- larations, statements, and even primitive operations such tions into the source program, thus preserving the natu- as arithmetic or logical operations. SMART C transfor- ral (unoptimized) program logic. Programmers sometimes mations can be predicated on both syntactic (e.g., variable want to encode dynamic checks in programs to ensure cer- names) and semantic (e.g., variable types) properties of tain dynamic properties (e.g., that an array is not accessed the code. In addition, SMART C includes a limited form beyond its bounds), and users often want to use dynamic of transformation- evaluation that balances generality checks to prevent potentially buggy applications from com- and ease of use. Finally, the SMART C design preserves promising user or system integrity (e.g., by restricting sys- the spirit of the C language, introducing little new syntax tem calls). Transformation is also used to inject instrumen- and leveraging existing programmer intuition. In summary, tation code in order to gather profile data about a program’s SMART C is powerful, general, compact, and easy to use. dynamic behavior is useful in guiding offline opti- In order to use SMART C, a programmer defines a mization. set of semantic macros (s-macros). S-macro expansion is A variety of techniques exists to effect such transforma- guided by patterns that determine what source-level con-

Proceedings of the Sixth IEEE International Workshop on Source Code Analysis and Manipulation (SCAM'06) 0-7695-2353-6/06 $20.00 © 2006 Source code: Transformed code: Source code: Transformed code: int a[10]; =⇒ struct {int value[10]; float a, b, c; =⇒ float a, b, c; bool isValid;} a; a = b/c; a = b * (1.0/c); Transformation specification: Transformation specification: around(decl(Integer[] %)) { around(FP%/FP%){ struct{tc type value; bool isValid;} tc name; return tc operand * (1.0/tc operand2); tc body; } }

(a) Floating point division transformation. (b) Array declaration transformation.

Figure 1. Example SMART C transformation specification and their effect on C source.

structs (e.g., floating point division and integer array dec- nents of s-macros, but space constraints preclude complete, laration) are to be transformed. When a pattern specified manual-style presentation. in an s-macro matches a source-level construct, the macro 2.1. Patterns is expanded. For example, suppose we wish to transform floating point division into multiplication of the numera- An s-macro pattern is an abstract description of C tor and the reciprocal of the denominator. The s-macro source-level primitives and can describe any expression, to achieve this appears in Figure 1(a). The pattern “FP % statement, or declaration (variable or function). The match- /FP%” indicates that the s-macro should match all divi- ing process matches on both the primitive (e.g., addition or sion operations that have floating point operands (with any function call) and the types/names of the operands. For - names). The around keyword indicates that the macro ample, the s-macro in Figure 1(a) only matches division of body should replace the division expression (versus be- floating point operands. Patterns have three components, ing inserted before or after it). The macro body com- each of which plays a role in pattern matching: (i) a type putes the product as a function of the values of the numer- specifying the type of an operand (e.g., a floating point num- ator expression (tc operand) and denominator expres- ber), (ii) a specification of the name of an operand (e.g.,a sion (tc operand2). Figure 1(b) illustrates the transfor- function called malloc), and (iii) a C language primitive mation of all integer array declarations to structure decla- (e.g., division or variable declaration). Below, we describe rations containing the original integer array and a boolean each pattern component. flag. This work makes the following contributions. We Pattern types. A pattern type specifies the data type of an present the design of a semantic macro system for C that operand in a pattern. In addition to all the C primitive data is simple (leveraging programmer intuition, requiring lit- types, pattern types may include any user-defined data types tle new syntax, avoiding exposing intermediate representa- or any of the additional (shaded) types appearing in the type tions), powerful (useful and interesting transformations may hierarchy in Figure 2. These additional types are only avail- be specified), and concise (requiring very little SMART C able in SMART C patterns. (i.e., they cannot be used in any code to achieve useful transformations). We describe our C code). Each pattern type in Figure 2 matches all C types SMART C implementation and show its utility in three dif- in the nodes beneath it, thus allowing for the concise spec- ferent application contexts. ification of a set of related types (e.g., all signed integers of any precision). Derived types such as structures, point- 2. SMART C Design ers or arrays may be created from these primitive types. In addition, the Any type can be used to specify any possible SMART C macro expansion is a source-to-source trans- type, including derived types. The Any type may also be formation guided by a set of user-specified transformation refined, as in the case of the pattern type Any , which specifications. Each transformation specification consists * matches pointers to any type. The syntax of pattern types is of both semantic macros (s-macros) and (transformation- borrowed directly from the C language. local) auxiliary code and data declarations required by the s-macros. An s-macro consists of (i) a pattern describing the Pattern names. A pattern name specifies the name of an expressions, statements, or declarations to be transformed, operand in a pattern. This includes variable and function (ii) a body containing code, and (iii) a modifier describ- names as well as literals. Names may include the wild- ing how the matched entity is to be transformed. We call card character %, which matches any number of characters. the untransformed and transformed programs the base code For example, the pattern name %alloc would match both and target code, respectively. A matching entity in the base malloc and calloc. Pattern names can also represent code is called a match site. Below we introduce the compo- literals by using quotes. The literals matched will depend

Proceedings of the Sixth IEEE International Workshop on Source Code Analysis and Manipulation (SCAM'06) 0-7695-2353-6/06 $20.00 © 2006 Primitive declaration. There our four kinds of variable declara- Number void tion patterns: decl, globaldecl, localdecl, and formaldecl. The decl pattern matches all variable dec- FP Integer larations, while the remaining three match global, local, and formal parameter declarations, respectively. Unsigned Signed Int Int SMART C also contains the boolean patterns not, and, long double float and or that combine multiple patterns, which have the nat- double char long int short ural interpretation. This admits, for example, patterns that match additions in which either operand is a pointer type. 2.2. Modifiers unsigned unsigned unsigned unsigned char long int short An s-macro modifier describes how the s-macro body should be inserted at a match site. The modifier may take on the values of before, after,oraround (terms bor- Figure 2. Primitive type hierarchy used for rowed from aspect-oriented programming [13]), indicating matching in SMART C. whether the body of the s-macro should be inserted before, after, or instead of, respectively, the matching expression, upon the pattern type (above). For example, if the pattern statement, or declaration. The method by which bodies are type is Integer, then "1" refers to the integer literal 1, while inserted is described in the next section. "%" refers to any integer literal. Although general regu- 2.3. Bodies lar expressions would provide additional flexibility beyond our wildcard character, we have not found a need for this An s-macro body defines the code that is to be inserted generality. (according to the modifier) into the program (at the match site). The body simply consists of C code, augmented with Pattern primitives. A pattern primitive specifies the C SMART C-specific variables and syntax that allow the body language primitive that the pattern should match. Pattern to be parameterized based on the context in which it is in- primitives are built from names and types (specifying the serted. Context variables describe properties of the matched name and types of the primitive operands, where applica- expression, statement, or declaration; and transformation- ble), and can represent C expressions, statements, or decla- time control statements allow for the body to be customized rations. The syntax of pattern primitives, again, is borrowed based on these properties. from the C language itself. For example, the pattern in the s-macro of Figure 1(a) matches division primitives; the two Context variables. Context variables are place holders for type/name operands specify that we should only match if values, names, declarations, code, or operations that are the division operands are floating point values of any preci- part of the matching expression, statement, or declaration. sion with any name. Context variables allow the body code to be parameterized based on properties of the match site. Each context variable SMART C provides expression patterns for all arith- is recognized by the “tc ” prefix (“this context”). Since metic, comparison, and logical operators, in addition to context variables appear only in SMART C code, the us- other primitive operations of the C language (dereference, age of variables beginning with “tc ” in C remains unre- address-of, array access, structure access, function call). stricted. SMART C provides pattern primitives representing sets of Table 1 shows the context variables available for each related C primitives, thus allowing for the specification of type of pattern (although we will see that the context vari- more generic patterns. For example, the Binop primi- ables in parentheses are not defined for all bodies). The tive matches all binary operations and Unop matches all context variables that are available in a particular body are unary operations. These abstract primitives are further re- determined by the properties of the match site, which are fined based on their potential return types (e.g., arithmetic apparent from the pattern associated with the body. The versus logical binary operations). tc func context variable always denotes the function in Statement patterns (if, ifelse, switch, while, which the match site is located. dowhile, for) do not require any pattern types or names Expression patterns are created out of operands and an (i.e.,anif pattern will match all if statements indepen- operation. The value of a matching expression can be ac- dent of the type of its predicate expression or structure of its cessed via tc . Each operand can, in general, be an body). We have not found a need for more selective match- arbitrary expression of any type. The type and value of the ing. first matched operands are accessed through the tc type Declaration patterns match either variable or function and tc operand context variables. When matching bi-

Proceedings of the Sixth IEEE International Workshop on Source Code Analysis and Manipulation (SCAM'06) 0-7695-2353-6/06 $20.00 © 2006 Pattern Available Context Variables available to the transformation body. This is achieved by Expression tc expr, tc operand, tc type, preceding any context variable with a dollar sign. In this (tc operand2), (tc type2), tc operation, example, we could include the following code in our body: tc rettype,(tc args), (tc numargs), tc func fprintf(log, $tc operand). Statement tc stmt, tc expr,(tc expr2), (tc expr3), Transformation-time control statements. Like con- tc block,(tc block2), tc func text variables, transformation-time control constructs are Declaration tc declscope, tc name, tc type, tc decl, a SMART C-specific construct that can appear within s- tc body,(tc args), tc func macro bodies. The three possible transformation-time con- Table 1. Context Variables. trol constructs are IF-ELSE, FOR, and SWITCH. These statements are formed in the same way as their C counter- nary expressions, the tc type2 and tc operand2 con- parts. However, they are restricted in what they may contain text variables are also available. The operation is accessed so that they may be evaluated at transformation-time. via the tc operation() function. The result type of The IF and IF-ELSE constructs selectively include the operation is denoted by tc rettype. If the ex- code in the target program based on the value of a predi- pression is a function call, then the argument list is via cate (i.e., the predicate is evaluated at transformation time tc args. This is an array of arguments, each possess- and either the then or the else statement—if one exists— ing type and value information. For example, the context is included in the target code). The predicate must be a variables tc args[1].expr and tc args[1].type transformation-time constant expression, consisting of C are place holders for the value of the value and type of literals, context variables, any C operation, and the func- the second argument to the function. The number of - tional subset of the string library. A simple example guments to the function is denoted by the context variable adapted from Engler [10] appears in Figure 3. If a non- tc numargs. reentrant function is called from within a signal handler (by convention named with a “sig ” prefix), the transfor- Statement patterns match control-flow statements. The mation results in an error and program termination. This statement matched by the pattern may be accessed by s-macro matches all calls to non-reentrant functions (just tc stmt. In each control-flow statement, there is an ex- nonreentrant() in this example). SMART C also pression or series of expressions to be evaluated and con- provides transformation-time display (()) and ter- trol goes to a block of statements based upon the expres- mination (()) operations that could be used in this sion value. These expressions are denoted by the tc expr example to report the same error at transformation time. context variables, and the code blocks are denoted by the PRINTF() and EXIT() cannot be nested within C con- tc block context variables. For example, an if-else state- trol flow, but appearing within transformation-time control ment possesses one expression to evaluate, and a choice of flow is allowed. two code blocks to be conditionally executed. The context Unlike the C for loop, the SMART C FOR loop must expressions tc expr, tc block, and tc block2 will iterate a transformation-time constant number of iterations, be available to use in the transformation body. requiring that (i) the initializer must be a simple assignment Declaration patterns match variable or function declara- of an integer transformation-time constant, (ii) the com- tions. Each variable declaration is associated with a state- parator must compare the induction variable to an integer ment block, logically giving each declaration its own scope. transformation-time constant, and (iii) the incrementor may The context variable tc declscope is used to denote increment or decrement the induction variable by an inte- the combination of the declaration and its associated code ger transformation-time constant. During transformation, block. The declaration itself is denoted by tc decl and the number of iterations is computed and the loop is fully the associated code block by tc body. The type and unrolled. name of the declared variable are denoted by tc type and Similarly, the SWITCH statement must be governed by a tc name. For function declarations, the context variables transformation-time constant expression. Although this ex- tc type and tc name denote the return type and func- pression may be any transformation-time constant expres- tion name respectively. In this case, the function body is sion, expressions representing C types will be particularly represented by tc body, and the function prototype by useful, allowing for the insertion of type-specific code dur- tc decl. As in the function call case, the arguments can ing transformation. An example (also adapted from En- be accessed though the tc args context variable. gler [10]) exploiting both the FOR and SWITCH statements It is often useful to have the textual representation of appears in Figure 4. In this example, the programmer calls that which a context variable represents. For example, an output(...) function in the base code that is type- suppose tc operand represents some match site variable unaware. This function call is transformed to an appropriate counter. It is useful to the string "counter" type-dependent printf() call.

Proceedings of the Sixth IEEE International Workshop on Source Code Analysis and Manipulation (SCAM'06) 0-7695-2353-6/06 $20.00 © 2006 before(Any nonreentrant()) { IF(strncmp($tc_func,"sig_",4) == 0) { to C type checking except that context variables can take printf("Sig handling error"); on multiple types. This set of types is determined from the exit(1); pattern associated with the s-macro. The type checker con- } servatively assumes the context variables may have any of } these types and rejects programs that may violate C’s typing rules. Figure 3. S-macro using IF. 3. Implementation around(void output(...)) { char typeString[] = {0}; This section describes how pattern matching and s- FOR(int i=0; i

Proceedings of the Sixth IEEE International Workshop on Source Code Analysis and Manipulation (SCAM'06) 0-7695-2353-6/06 $20.00 © 2006 Source code: Transformed code: int *ptr, result; =⇒ int *ptr, *temp, result; ptr = (int *) ptr = (int *) malloc(sizeof(int)); malloc(sizeof(int)); result = *ptr; if (ptr == 0) error("NULL DEREF"); Source code: Transformed code: temp = ptr; int a, b; =⇒ int a, b; result = *temp; a=b* 2; a=b<<1;

S-macro: S-macro: before(*Any* %){ around(Integer % * "2"){ if (tc operand == 0) error("NULL DEREF") return tc operand << 1; } }

(a) Before transformation on dereference expression. (b) Around transformation on multiply expression.

Figure 5. Example SMART C expression transformations.

Source code: Transformed code: for(i=0; i < 10; i++) =⇒ i=0; {printf("i=%d", i); while (i < 10) { Source code: Transformed code: } printf("i=%d", i); if (x > 0) =⇒ printf(" %s", i++; } foo(x); "x > 0"); if (x > 0) S-macro: foo(x); around(for){ tc expr; S-macro: while(tc expr2) { before(if) { tc block; printf("Test %s", $tc expr); tc expr3; } } }

(a) Before transformation on if statement. (b) Around transformation on for statement.

Figure 6. Example SMART C statement transformations.

resolve both the transformation-time control constructs and placed by the result of the macro body. In this case, the context variables that appear in the transformation body. All macro body must end with a return statement which spec- of the context variables may be statically determined by ex- ifies a value of the same type as the matched expression. amining the match site. Each context variable is replaced by This expression is assigned into the temporary variable de- the appropriate variable, type, operation, expression, decla- scribed above, and the macro body is inserted before this ration or code block, as described above. statement. An example is shown in Figure 5(b).

Expression transformation. Expression transformation is Statement transformation. A statement may be trans- governed by the before, after,oraround s-macro formed by a s-macro using the before, after,or modifiers. Since the expression is part of a statement, it around modifiers. If the modifier is before or after, is desirable to isolate this expression from the rest of the the macro body is inserted directly before or after the match statement for transformation purposes. To do this, SMART site. If the modifier is around, then the statement is re- C creates a temporary variable of the expression’s type and placed with the macro body. Examples of each are shown replaces the appearance of the expression at the match site in Figure 6. with this temporary variable. Declaration transformation. A declaration may be trans- If the transformation modifier is before or after,a formed by a s-macro using the before, around,or new statement is created that assigns the matched expres- after modifiers. This pattern matches on the declaration, sion into the temporary variable from above. At this point, and may replace or add to the declaration and the code in the macro body (with context variables replaced, as above) the scope associated with matched declaration. The modi- is inserted before or after the newly created statement. An fiers specify where code is added so that a before s-macro example is shown in Figure 5(a). will add declarations before the match site and code before If the modifier is around, the expression will be re- the associated scope. The after and around cases are

Proceedings of the Sixth IEEE International Workshop on Source Code Analysis and Manipulation (SCAM'06) 0-7695-2353-6/06 $20.00 © 2006 Source code: Transformed code: Source code: Transformed code: int a; =⇒ int a; foo(int a, float b) =⇒ foo(int a, float b) { foo(); add object(a); { do stuff(); } init table(); foo(); do stuff(); del object(a); delete table(); }

S-macro: S-macro: around(decl(Integer %)) { around(decl(Any %(...))){ tc decl; tc decl { add object(tc name); init table(); tc body; tc body; del object(tc name); delete table(); } } }

(a) Around transformation on variable declaration. (b) Around transformation on function declaration.

Figure 7. Example SMART C declaration transformations.

similar. An example of changing a variable declaration has typedef struct { already been presented in Figure 1(b). Examples demon- void * value; void * base; strating another transformation of a variable declaration and unsigned size; the body of a function declaration are shown in Figure 7. enum {Heap, Local, Global} storageClass; int capability; 3.3. Implementation Status } SafePtr; SMART C is implemented as a source-to-source trans- around(decl(Any * %)) { lator. SMART C takes as input one file containing an arbi- SafePtr tc_name; //make all ptrs into SafePtrs trary number of transformation specifications, and any files tc_body; //keep body the same containing source code that the user wishes to transform. } SMART C outputs the transformed versions of each of these source files. Figure 8. SafeC s-macro that transforms Our SMART C implementation has been built using the pointer declaration to fat pointer declaration C-Breeze compiler infrastructure [15]. C-Breeze provides a translation. number of predefined phases and allows for custom phases to be built. Our implementation uses four phases. First, the source code and the transformation specifications are parsed bounds-checking occurs. Hackers have found numerous into an AST. Next they are dismantled using C-Breeze’s ways to construct malicious input which subverts control built-in dismantler, resulting in a three-address-code-like of the system by overwriting critical control areas through intermediate representation. The next phase walks through the use of carefully constructed which overwrite the dismantled AST to search for C constructs which match unchecked buffers in C code. There are a variety of solu- a s-macro pattern. Upon a pattern match, the transformation tions to this problem; we will examine two here. Both of body is expanded and applied to the match site as discussed the solutions have been proposed and implemented previ- in the previous section. The final phase “undismantles” the ously by modifying a compiler. We will show that SMART code, converting it to a higher-level, more readable version, C provides a way to offer this functionality without having which serves as the output. to deal with the complexity of compiler internals. The first buffer overflow detection mechanism we con- 4. Applications sider is SafeC [2]. SafeC is a program transformation This section demonstrates the use of SMART C across that changes the representation of pointers to “fat pointers,” a wide variety of application domains. Space constraints which are C structures that contain spatial and temporal at- preclude the presentation of complete transformation spec- tributes. This transformation requires that every use of a ifications, so we instead describe only the most important pointer must be transformed to update or check this pointer s-macros. metadata appropriately. In SMART C, these transformations are easy to express. 4.1. Buffer Overflow Detection The SafeC system has been implemented via SMART C us- Recently, there has been a great amount of work done ing fifteen transformations, requiring 150 lines of code (ex- to make C programs safe with respect to buffer overflows. cluding the C runtime library routines). The most important Buffer overflows are possible in C because no explicit SMART C-based SafeC transformation is the conversion

Proceedings of the Sixth IEEE International Workshop on Source Code Analysis and Manipulation (SCAM'06) 0-7695-2353-6/06 $20.00 © 2006 around(Void* %alloc(...)) { around(and(BinOp(Any * %, Any %), tc_rettype ptr; not(BinOp(Any * %, Any * %))) { ptr = tc_expr; //ptr stores result of malloc //matches all expr of form : ptr (op) non-ptr add_object(ptr, tc_args[0].expr); // add to table tc_type base; return ptr; //return address tc_rettype result, retval; } result = tc_operation( get_oob(tc_operand), tc_operand2); //look up ptr in OOB, do original operation Figure 9. CRED s-macro that transforms if (check_ptr(result)) retval=result; //nothing to do if object is inbounds malloc call to allocation plus object inser- else tion. retval = add_oob(result, base_obj(tc_operand)); //if OOB, add (address,base) to OOB return retval; //return result of computation } of pointer declarations to fat pointers. The transformation specification to achieve this appears in Figure 8. Each oper- ation that performs pointer arithmetic is modified to update Figure 10. CRED s-macro that transforms a the correct SafePtr components. Assignments to pointers binary operation to use object and OOB table must also reflect the new SafePtr representation. Pointer information. creation through a malloc call or the ‘&’ operation must generate pointer attributes to place in the SafePtr rep- resentation. The transformation of pointer dereferencing before(Int printf(...) { does not change the semantics of dereference, but inserts char * formatStr = tc_args[0].expr; checks to ensure the spatial and temporal attributes of the int numPercent = parse(formatStr); if (numPercent > tc_numargs-1) { pointer result in a valid dereference. Finally, each function printf("Attack detected!"); body must be transformed to include prologue and epilogue exit(1); code to generate and discard scoping information, which is } used by the SafePtr representation to update and verify } its temporal attributes. Next, we consider the buffer overflow detection tech- Figure 11. FormatGuard s-macro that trans- nique proposed by Ruwase and Lam [19] called CRED (C forms calls to printf to check the number Range Error Detector). CRED does not change the pointer of arguments. representation; rather, it keeps object metadata in an auxil- iary runtime table, and checks each pointers’ value against this table to verify its validity. This involves adding bounds string vulnerabilities are possible in C because no check- information about each object in the program to the table, ing is performed to ensure that the format string contains % and updating it appropriately when objects are deallocated. directives that match the number and types of further argu- These transformations are easy to describe in SMART ments passed to printf. C. The object table, out-of-bounds (OOB) table, and helper FormatGuard [7] detects many forms of this attack by functions which provide an interface to modify the tables dynamically ensuring that the number of arguments to are provided at the level of the transformation specifica- printf() are the same as the number of % directives in tion. Objects must be inserted into the object table for each the format string. While FormatGuard uses cpp to effect malloc object (non-pointer) declaration and each call to . this transformation, SMART C admits a much simpler spec- malloc The code for the case is shown in Figure 9. These ification (Figure 11). objects are deleted upon the termination of a scope or a call to free, respectively. S-macros are also necessary to up- Alternatively, format string vulnerabilities can be stati- date the tables appropriately and to perform checks upon cally detected. Shankar et al. use type qualifiers (specif- pointer dereference. The transformation for binary opera- ically tainted) to discover when format strings are de- tions where the first operand is a pointer is shown in Fig- rived from user-supplied input [20]. SMART C can gener- ure 10. ate code to dynamically compute the same thing. To achieve 4.2. Format String Vulnerability Detection this, taintedness information is maintained and propagated for each string, and all strings used as format strings are Programs written in C are also subject to format string checked to ensure that they are not tainted. This transfor- attacks. These attacks are achieved by giving the program a mation is similar to Safe C, in that SMART C changes the string which will be passed to printf as a format string. representation of strings to include a taintedness bit, and This string can be formed in a particular way to use it’s % each access to a string must be transformed in the obvious directives to write an arbitrary value to memory. Format manner.

Proceedings of the Sixth IEEE International Workshop on Source Code Analysis and Manipulation (SCAM'06) 0-7695-2353-6/06 $20.00 © 2006 4.3. Weighted Call Graph Construction for cpp.Ascpp does, ASTEC matches on keywords in- troduced by the programmer, and is therefore inappropriate The call graph is a useful tool in application profiling. for applying transformations to preexisting base code. Like- Call graphs encode the control flow between functions by wise, the MS2 (Meta Syntactic Macro System) [26] uses a counting how many times and from where each procedure language that may access pieces of the AST directly and use gets dynamically called. Call graphs are typically con- them in code expansion. structed as the program runs and analyzed offline. Aspect-oriented programming (AOP) [13] is a general Weighted call graph construction is easily realized with framework for expressing crosscutting concerns in a modu- SMART C. The main function is augmented with code to lar fashion. The most well-known versions are AspectJ [12] initialize a global data structure as its prologue, and code and AspectC++ [22]. These systems allow programmers to to calculate the call graph from the global data structure match and transform method calls and variable access, and as its epilogue. Internally, pairs of (caller, callee) func- refine the match sites by further matching upon dynamic tions are stored in a table and associated with a count. control flow information. Many systems have brought AOP For each occurrence of a (caller, callee) pair, the count is concepts to C, including AspectC [5], c4 [27], Aspicere [1], incremented. Each function body is transformed to begin Arachne [9] and TinyC [28]. These systems all provide the with an update to the hash table that stores the callee name. ability to match on function calls, and in some cases vari- Each function call is transformed to also include an update able access and dynamic control flow information. We ar- to the hash table that stores the call site. Upon termination gue that SMART C patterns made up of all primitive oper- of the program, there will be a sequence of pairs of func- ations of C allow a more expressive set of transformations tion names (caller, callee) with an associated count. From than solely function calls. In fact, the dynamic control flow this information, it is simple to construct the weighted call matching provided in some of these languages can be ex- graph. pressed as a transformation specification in SMART C, ob- 5. Related Work viating the need for a special language construct. Token-based patterns. Transformation tools that reason Code transformation approaches may be distinguished about a token representation of base code include the cpp by the representations on which they operate. Below we and m4 macro systems. These tools suffer from the disad- summarize systems that operate on the token stream, the vantage that no contextual information is available about the abstract syntax tree, or the machine code of a program. match site. Furthermore, subtle errors may be introduced Syntax-based patterns. Transformation systems which op- due to precedence and side-effects that are not obvious from erate on the syntactic structure of C code have more ex- the macro code. pressiveness than token-based transformations, and allow Binary-based patterns. Systems that operate on the binary transformations to be architecture-independent, unlike bi- representation of the program are designed with a different nary translators. SMART C provides the ability to match set of goals than systems which operate on ASTs. They seek and transform primitives of the C language. MAGIK [10] to provide a machine-specific transformation capability, at provides a library to access AST primitives and transform the loss of semantic information (such as types), potential them. Unlike SMART C, traversals over the AST must be optimization opportunities, and portability across architec- explicitly performed by the programmer. tures. ATOM [23] and EEL [14] are examples of compile- Systems exist that are able to match and transform ar- time transformation systems which operate on the binary bitrary ASTs. These transformations are sensitive to pro- representation of the program. There are also a variety of grammer idioms, and may be more brittle than transfor- run-time transformation systems which allow programmers mations on primitives. Moreover, writing transformational hooks into the binary representation of a program, some ex- code is harder for these systems since there are more pieces amples include Dynamo [3], Pin [16], and DISE [6]. of the AST to reason about. The Code Transformation Tool (ctt) [4] is an example of such a system. The Strat- Metaprogramming. The concept of transformation-time ego [25] system uses term rewriting to express transforma- control in SMART C (IF-ELSE, FOR, SWITCH)isan tions. ASTLOG [8] uses a Prolog variant as a transforma- instance of metaprogramming. Other metaprogramming tion language. ASF+SDF is an environment for automati- systems include Template Haskell [21] and the Scheme cally constructing languages and provides facilities for their language [11]. Metaprogramming allows programmers to transformation [24]. write code which produces other code. Many metapro- Transformation systems may also operate on keywords gramming systems, such as tcc [18], allow this metacode introduced by the programmer that will into C code. to be arbitrary. In general, this makes code written in a ASTEC [17] is a system which operates on code that has not metaprogramming language to be difficult to reason about. yet been pre-processed, and is designed to be a replacement SMART C provides only a limited set of language con-

Proceedings of the Sixth IEEE International Workshop on Source Code Analysis and Manipulation (SCAM'06) 0-7695-2353-6/06 $20.00 © 2006 structs to produce code at transformation-time. Again, this [12] G. Kiczales, E. Hilsdale, J. Hugunin, M. Kersten, J. Palm, is an example where SMART C attempts to limit the com- and . G. Griswold. An Overview of AspectJ. In Proc. of European Conf. on Object-Oriented Programming, 2001. plexity of transformational code, while still maintaining [13] G. Kiczales, J. Lamping, A. Menhdhekar, C. Maeda, enough power to express meaningful transformations. C. Lopes, J.-M. Loingtier, and J. Irwin. Aspect-oriented programming. In M. Aks¸it and S. Matsuoka, editors, Pro- 6. Conclusion ceedings of European Conference on Object-Oriented Pro- gramming, volume 1241, pages 220–242. Springer-Verlag, We have introduced semantic macros to the C program- Berlin, Heidelberg, and New York, 1997. ming language via SMART C (Semantic Macros Replace- [14] J. R. Larus and E. Schnarr. EEL: Machine-Independent Exe- ment Transformer for C). Our SMART C extensions allow cutable Editing. In Proceedings of Conference on Program- ming Language Design and Implementation, pages 291– for far more transformation power than traditional C macro 300, 1995. systems because (i) type information is used the pattern [15] C. Lin, S. Z. Guyer, and D. Jimenez. The C-Breeze Com- matching/replacement process, (ii) any C language prim- piler Infrastructure. Technical Report -01-43, The Uni- versity of Texas at Austin, November 2001. itive may be transformed, and (iii) our macro bodies are [16] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, highly parameterizable. We show the use of SMART C in G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. several practical contexts (buffer overflow detection, format Pin: Building Customized Program Analysis Tools with Dy- string vulnerability detection, and call graph profiling), and namic Instrumentation. In Proceedings of Conference on Programming Language Design and Implementation, pages we find that powerful transformations can very simply and 190–200, 2005. succinctly be represented with SMART C. [17] B. McCloskey and E. Brewer. ASTEC: A New Approach to Refactoring C. In Proceedings of Foundations of Software References Engineering, 2005. [18] M. Poletto, D. R. Engler, and M. F. Kaashoek. tcc: A Sys- [1] B. Adams and T. Tourwe.´ Aspect Orientation for C: Express tem for Fast, Flexible, and High-level Dynamic Code Gener- yourself. In Proceedings of Software-Engineering Proper- ation. In Proceedings of Conference on Programming Lan- ties of Languages and Aspect Technologies, 2005. guage Design and Implementation, pages 109–121, 1997. [2] T. M. Austin, S. E. Breach, and G. S. Sohi. Efficient Detec- [19] O. Ruwase and M. S. Lam. A Practical Dynamic Buffer tion of All Pointer and Array Access Errors. In Proceedings Overflow Detector. In Proceedings of Network and Dis- of Conference on Programming Language Design and Im- tributed System Security Symposium, pages 159–169, 2004. plementation, pages 290–301, 1994. [20] U. Shankar, K. Talwar, J. S. Foster, and D. Wagner. Detect- [3] V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: A Trans- ing Format String Vulnerabilities with Type Qualifiers. In parent Dynamic Optimization System. In Proceedings of Proceedings of USENIX Security Symposium, 2001. Conference on Programming Language Design and Imple- [21] T. Sheard and S. P. Jones. Template Meta-programming for mentation, pages 1–12, 2000. Haskell. In Proceedings of Workshop on Haskell, pages 1– [4] M. Boekhold, I. Karkowski, and H. Corporaal. Trans- 16, 2002. forming and Parallelizing ANSI C Programs Using Pattern [22] O. Spinczyk, A. Gal, and W. Schroder-Preikschat.¨ As- Recognition. In Proceedings of International Conference on pectC++: An Aspect-Oriented Extension to C++. In High-Performance Computing and Networking, pages 673– Proceedings of International Conference on Technology of 682, 1999. Object-Oriented Langiages and Systems, 2002. [5] Y. Coady, G. Kiczales, M. Feeley, and G. Smolyn. Using [23] A. Srivastava and A. Eustace. ATOM: A System for Build- AspectC to Improve the Modularity of Path-Specific cus- ing Customized Program Analysis Tools. In Proceedings of tomization in operating system code. In Foundations of Soft- Conference on Programming Language Design and Imple- ware Engineering, 2001. mentation, pages 196–205, 1994. [6] M. L. Corliss, E. C. Lewis, and A. Roth. DISE: A Pro- [24] M. van den Brand, J. Heering, H. de Jong, M. de Jonge, grammable Macro Engine for Customizing Applications. In T. Kuipers, P. Klint, L. Moonen, P. Olivier, J. Scheerder, Proceedings of International Symposium on Computer Ar- J. Vinju, E. Visser, and J. Visser. The ASF+SDF Meta- chitecture, pages 362–373, 2003. Environment: a Component-Based Language Development [7] C. Cowan, M. Barringer, S. Beattie, and G. Kroah-Hartman. Environment. In Proceedings of Compiler Construction FormatGuard: Automatic Protection From printf Format 2001 (CC 2001), LNCS. Springer, 2001. String Vulnerabilities. In Proceedings of USENIX Security [25] E. Visser. Program transformation with Stratego/XT: Symposium, 2001. Rules, strategies, tools, and systems in StrategoXT-0.9. In [8] R. F. Crew. ASTLOG: A Language for Examining Abstract C. Lengauer et al., editors, Domain-Specific Program Gen- Syntax Trees. In Proceedings of Conference on Domain- eration, volume 3016 of Lecture Notes in Computer Science, Specific Languages, 1997. pages 216–238. Spinger-Verlag, 2004. [9] R. Douence, T. Fritz, N. Loriant, J.-M. Menaud, M. Segura- [26] D. Weise and R. Crew. Programmable Syntax Macros. In Devillechaise, and M. Sudholt. An expressive aspect lan- Proceedings of Conference on Programming Language De- guage for system applications with Arachne. In Proceedings sign and Implementation, pages 156–165, 1993. of International Conference on Aspect-oriented software de- [27] M. Yuen, M. Fiuczysnki, R. Grimm, Y. Coady, and velopment, 2005. D. Walker. Making extensibility of system software prac- [10] D. R. Engler. Incorporating application semantics and con- tical with the C4 toolkit. In AOSD Workshop on Software trol into compilation. In Proceedings of Conference on Engineering Properties of Languages and Aspect Technolo-

Domain-Specific Languages, 1997. 5 gies, 2006. 2 [11] R. Kelsey, W. Clinger, and J. R. (Editors). Revised report [28] C. Zhang and H.-A. Jacobssen. TinyC : Towards Building on the algorithmic language Scheme. ACM SIGPLAN No- a Dynamic Weaving Aspect Language for C. In Foundations tices, 33(9):26–76, 1998. of Aspect-Oriented Languages, 2003.

Proceedings of the Sixth IEEE International Workshop on Source Code Analysis and Manipulation (SCAM'06) 0-7695-2353-6/06 $20.00 © 2006