Lackwit: A Program Understanding Tool Based on Type Inference Robert O’Callahan Daniel Jackson School of Computer Science School of Computer Science Carnegie Mellon University Carnegie Mellon University 5000 Forbes Avenue 5000 Forbes Avenue Pittsburgh, PA 15213 USA Pittsburgh, PA 15213 USA +1 412 268 5728 +1 412 268 5143 [email protected] [email protected] same, there is an abstraction violation. Less obviously, the ABSTRACT value of a variable is never used if the program places no By determining, statically, where the structure of a constraints on its representation. Furthermore, program requires sets of variables to share a common communication induces representation constraints: a representation, we can identify abstract data types, detect necessary condition for a value defined at one site to be abstraction violations, find unused variables, functions, used at another is that the two values must have the same and fields of data structures, detect simple errors in representation1. operations on abstract datatypes, and locate sites of possible references to a value. We compute representation By extending the notion of representation we can encode sharing with type inference, using types to encode other kinds of useful information. We shall see, for representations. The method is efficient, fully automatic, example, how a refinement of the treatment of pointers and smoothly integrates pointer aliasing and higher-order allows reads and writes to be distinguished, and storage functions. We show how we used a prototype tool to leaks to be exposed. answer a user’s questions about a 17,000 line program We show how to generate and solve representation written in C. constraints, and how to use the solutions to find abstract data types, detect abstraction violations, identify unused Keywords variables, functions, and fields of data structures, detect restructuring, abstraction, C, representation simple errors of operations on abstract datatypes (such as failure to close after open), locate sites of possible INTRODUCTION references to a value, and display communication Many interesting properties of programs can be described relationships between program modules. in terms of constraints on the underlying representation of Using a new type system in which representation data values. For example, if a value is an instance of an information is encoded in the types, we express abstract data type, then the client code must not constrain representation constraints as type constraints, and obtain its representation. If a program requires that the and solve them by type inference. The type inference representations of two supposedly abstract types be the algorithm computes new types for the variables and textual expressions of a program, based on their actual usage— 1 Some languages may allow a single value to be viewed as having different types at different points in a program, for example by implicit coercions. However these are just views of a single underlying representation; any meaningful transmission of data must use an agreed common representation. rarely use recursion or pass functions around. Our tool usually consumes space and time little more than linear in C code Output the size of the program being analyzed. Preprocessor/Parser Postprocessor We have built a tool called Lackwit to demonstrate the Parse tree Signatures User feasibility of applying type inference analyses to C Query programs for program understanding tasks, and to Translator Query engine experiment with the kind and quality of information Intermediate code Signatures available. The general architecture of the tool is shown in Figure 1. The multiple downward arrows indicate that C Constraint generator Type inference modules can be processed individually and fed into the Constraints Constraints database; the fat upward arrows show where information about the entire system is being passed. In the remaining Database sections of this paper, we will describe the type inference system, present the results of applying our tool to a real- life C program, and discuss some user interface issues. Figure 1: Tool Architecture Our system has advantages over all other tools we know of that is, their role in primitive operations—and ignores for analyzing source code for program understanding; see explicit type declarations. Figure 2. Lexical tools such as grep [7], Rigi [11] and the Reflection Model Tool [11] lack semantic depth in that The type system underlying this analysis can be very they fail to capture effects such as aliasing that are vital for different from the type system for the source language2. understanding the manipulation of data in large programs. This allows us to make finer distinctions than the Dataflow-based tools such as the VDG slicer [4] and language’s type system would allow, and to handle Chopshop [8] do not scale to handle very large programs. potential sources of imprecision (such as type casts) in a Code checking tools such as LCLint [5] do not try to flexible way. present high-level views of a large system. (Furthermore, The inferred types amount to a set of representation although LCLint does compute some semantic constraints. Given a variable or textual expression of information, it does not have a framework for accurate interest, we examine its inferred type and draw global analysis.) conclusions about the variable’s usage and behavior, and by finding other occurrences of the type (that is, other Tool Semantic Global Scalability values that share the representation), we can track Depth Display propagation and aliasing relationships in the global grep — — program. Rigi, RMT — Type inference is an attractive basis for an analysis for VDG slicer — — many reasons. It is fully automatic. It elegantly handles Chopshop — complex features of rich source languages like C, such as LCLint — recursive pointer-based data structures and function Lackwit pointers. Because our type system is a straightforward F igure 2: Tool feature comparison elaboration of a standard type system [9], we can employ the standard inference algorithm with only minor adaptation, and can be confident in the soundness of our EXAMPLE scheme. Although the algorithm is known to have doubly- Consider the trivial program in Figure 4. Suppose that a exponential time complexity in the worst case, in practice programmer is interested in the value pointed to by r. Our its performance is excellent. It has been implemented in analysis will determine that the contents of s and r may be compilers for Standard ML, where it has not proved to be a the same value, but that the value pointed to by r cannot be bottleneck. Our application is actually less demanding, since in contrast to their ML counterparts, C programs g s 2 It is very important not to get the two notions of “type” c, d confused. In this paper we will normally be referring to f types in our specialized type system. Figure 3: Result of query on g’s argument r int x; intn x; int p1; intz p1; void f(int a, int b, int * c, int * d) void f(intn a, intA b, intB * c, intB * { x = a; d) *c = *d; { x = a; } *c = *d; void g(int p, int * q, int * r, int * s) } { int t1 = 2; void g(intY * q, intX * r, intX * s) int c1 = 3, c2 = 4; { intY t1 = 2; intn c1 = 3, c2 = 4; p = p1; intz p; x++; p = p1; f(c1, p, &t1, q); x++; f(c2, 4, r, s); f(c1, p, &t1, q); } f(c2, c2, r, s); } Figure 4: Trivial program Figure 5: Trivial program annotated with copied to or referenced by any of the other variables in this representations program. The analysis will also determine that the relationship between r and s occurs because of a call from different representations for the values pointed to by q and g to f; the tool can display the information graphically, as r, we can conclude that g does not cause the two locations in Figure 3. to be aliased, nor does it copy any values from one to the other. ANALYSIS OF THE EXAMPLE Examining the calls to f in the body of g shows us how The analysis assigns extended types to variables; the types data flows between the two functions. If we are interested encode information about representations. In C, the in tracking the contents of q passed into g, we can simply declared type of a variable usually determines its look for all the occurrences of its type intY; this includes representation, but that is simply a matter of convenience. occurrences of intB, since one of the occurrences of intB is The program would still operate correctly if we had several instantiated to intY (in the first call to f). We find that different representations of int, provided that each version values may be propagated to t1, and to the contents of of int had the same semantics and that they were used arguments c and d of f. consistently3. Intuitively, we could determine consistency by augmenting the declared type with the representation, Furthermore, suppose that we specify that the arguments of and type-checking the program. all primitive arithmetic operations have representation intn. Then Figure 5 is still consistent. If p is intended to be Figure 5 shows a consistent augmentation of the trivial an abstract file descriptor, we can see from its annotated program of Figure 4. The names superscripted with capital type alone that no arithmetic operations can be performed letters indicate polymorphic type variables: for any choice on it (because it does not have to have type intn); we can A B of representations int and int , the function f would type- conclude that abstraction is preserved. check; we might just as well have different versions of f for A B each choice of int and int . We are thus free to assign q THE MOTIVATION FOR TYPE INFERENCE and s different representations.
Details
-
File Typepdf
-
Upload Time-
-
Content LanguagesEnglish
-
Upload UserAnonymous/Not logged-in
-
File Pages13 Page
-
File Size-