Polymorphic Type Inference for Machine Code ∗

Matthew Noonan, Alexey Loginov, David Cok
GrammaTech, Inc., Ithaca, NY, USA
{mnoonan,alexey,dcok}@.com

Abstract

For many compiled languages, source-level types are erased very early in the compilation process. As a result, further compiler passes may convert type-safe source into type-unsafe machine code. Type-unsafe idioms in the original source and type-unsafe optimizations mean that type information in a stripped binary is essentially nonexistent. The problem of recovering high-level types by performing type inference over stripped machine code is called type reconstruction, and offers a useful capability in support of reverse engineering and decompilation.

In this paper, we motivate and develop a novel type system and algorithm for machine-code type inference. The features of this type system were developed by surveying a wide collection of common source- and machine-code idioms, building a catalog of challenging cases for type reconstruction. We found that these idioms place a sophisticated set of requirements on the type system, inducing features such as recursively constrained polymorphic types. Many of the features we identify are often seen only in the expressive and powerful type systems used by high-level functional languages.

Using these type-system features as a guideline, we have developed Retypd: a novel static type-inference algorithm for machine code that supports recursive types, polymorphism, and subtyping. Retypd yields more accurate inferred types than existing algorithms, while also enabling new capabilities such as reconstruction of pointer const annotations with 98% recall. Retypd can operate on weaker program representations than the current state of the art, removing the need for high-quality points-to information that may be impractical to compute.

Categories and Subject Descriptors  F.3.2 [Logics and Meanings of Programs]: Semantics of Programming Languages; D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement; D.3.3 [Programming Languages]: Language Constructs and Features; F.4.3 [Mathematical Logic and Formal Languages]: Formal Languages

Keywords  Reverse Engineering, Type Systems, Polymorphism, Static Analysis, Binary Analysis, Pushdown Automata

1. Introduction

In this paper we introduce Retypd, a machine-code type-inference tool that finds regular types using pushdown systems.
Retypd includes several novel features targeted at improved types for reverse engineering, decompilation, and high-level program analyses. These features include:

• Inference of most-general type schemes (§5)
• Inference of recursive structure types (Figure 2)
• Sound analysis of pointer subtyping (§3.3)
• Tracking of customizable, high-level information such as purposes and typedef names (§3.5)
• Inference of type qualifiers such as const (§6.4)
• No dependence on high-quality points-to data (§6)
• More accurate recovery of source-level types (§6)

Retypd continues in the tradition of SecondWrite [10] and TIE [17] by introducing a principled static type-inference algorithm applicable to stripped binaries. Diverging from previous work on machine-code type reconstruction, we use a rich type system that supports polymorphism, mutable references, and recursive types. The principled type-inference phase is followed by a second phase that uses heuristics to "downgrade" the inferred types to human-readable types before display.

∗ This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions, and/or findings contained in this material are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. DISTRIBUTION A. Approved for public release; distribution unlimited.

arXiv:1603.05495v2 [cs.PL] 18 Mar 2016
By factoring type inference into two phases, we can sequester unsound heuristics and quirks of the C type system from the sound core of the type-inference engine. This adds a degree of freedom to the design space so that we may leverage a relatively complex type system during type analysis, yet still emit familiar C types for the benefit of the reverse engineer.

Retypd operates on an intermediate representation (IR) recovered by automatically disassembling a binary using GrammaTech's static analysis tool CodeSurfer® for Binaries [5]. By generating type constraints from a TSL-based abstract interpreter [18], Retypd can operate uniformly on binaries for any platform supported by CodeSurfer, including x86, x86-64, and ARM.

During the development of Retypd, we carried out an extensive investigation of common machine-code idioms in compiled C and C++ code that create challenges for existing type-inference methods. For each challenging case, we identified requirements for any type system that could correctly type the idiomatic code. The results of this investigation appear in §2. The type system used by Retypd was specifically designed to satisfy these requirements. These common idioms pushed us into a far richer type system than we had first expected, including features like recursively constrained type schemes that have not previously been applied to machine-code type inference.

Due to space limitations, details of the proofs and algorithms appear in the appendices, which are available in the online version of this paper at arXiv:1603.05495 [22]. Scripts and data sets used for evaluation also appear there.

    T * get_T(void)              get_T:
    {                                call get_S
        S * s = get_S();             test eax, eax
        if (s == NULL) {             jz   local_exit
            return NULL;             push eax
        }                            call S2T
        T * t = S2T(s);              add  esp, 4
        return t;                local_exit:
    }                                ret

Figure 1. A common fortuitous re-use of a known value.

2. Challenges

There are many challenges to carrying out type inference on machine code, and many common idioms that lead to sophisticated demands on the feature set of a type system. In this section, we describe several of the challenges seen during the development of Retypd that led to our particular combination of type-system features.

2.1 Optimizations after Type Erasure

Since type erasure typically happens early in the compilation process, many optimizations may take well-typed machine code and produce functionally equivalent but ill-typed results. We found that there were three common optimization techniques that required special care: the use of a variable as a syntactic constant, early returns along error paths, and the re-use of stack slots.

Semi-syntactic constants: Suppose a function with signature void f(int x, char* y) is invoked as f(0, NULL). This will usually be compiled to x86 machine code similar to

    xor  eax, eax
    push eax        ; y := NULL
    push eax        ; x := 0
    call f

This represents a code-size optimization, since push eax can be encoded in one byte instead of the five bytes needed to push an immediate value (0). We must be careful that the type variables for x and y are not unified; here, eax is being used more like a syntactic constant than a dynamic value that should be typed.

Fortuitous re-use of values: A related situation appears in the common control-flow pattern represented by the snippet of C and the corresponding machine code in Figure 1. Note that on procedure exit, the return value in eax may have come from either the return value of S2T or from the return value of get_S (if NULL). If this situation is not detected, we will see a false relationship between the incompatible return types of get_T and get_S.

Re-use of stack slots: If a function uses two variables of the same size in disjoint scopes, there is no need to allocate two separate stack slots for those variables. Often the optimizer will reuse a stack slot from a variable that has dropped out of scope. This is true even if the new variable has a different type. This optimization even applies to the stack slots used to store formal-in parameters, as in Figure 2; once the function's argument is no longer needed, the optimizer can overwrite it with a local variable of an incompatible type.

More generally, we cannot assume that the map from program variables to physical locations is one-to-one. We cannot even make the weaker assumption that the program variables inhabiting a single physical location at different times will all belong to a single type.

We handle these issues through a combination of type-system features (subtyping instead of unification) and program analyses (reaching definitions for stack variables and trace partitioning [21]).

2.2 Polymorphic Functions

We discovered that, although not directly supported by the C type system, most programs define or make use of functions that are effectively polymorphic. The most well-known example is malloc: the return value is expected to be immediately cast to some other type T*. Each call to malloc may be thought of as returning some pointer of a different type. The type of malloc is effectively not size_t → void*, but rather ∀τ. size_t → τ*.

The problem of a polymorphic malloc could be mitigated by treating each call site p as a call to a distinct function malloc_p, each of which may have a distinct return type T_p*. Unfortunately, it is not sufficient to treat a handful of special functions like malloc this way: it is common to see binaries that use user-defined allocators and wrappers to malloc.
All of these functions would also need to be accurately identified and duplicated for each callsite. A similar problem exists for functions like free, which is polymorphic in its lone parameter. Even more complex are functions like memcpy, which is polymorphic in its first two parameters and its return type, though the three types are not independent of each other. Furthermore, the polymorphic type signatures

    malloc : ∀τ. size_t → τ*
    free   : ∀τ. τ* → void
    memcpy : ∀α, β. (β ⊑ α) ⇒ (α* × β* × size_t) → α*

are all strictly more informative to the reverse engineer than the standard C signatures. How else could one know that the void* returned by malloc is not meant to be an opaque handle, but rather should be cast to some other pointer type?

In compiled C++ binaries, polymorphic functions are even more common. For example, a class member function must potentially accept both base_t* and derived_t* as types for this.

Foster et al. [11] noted that using bounded polymorphic type schemes for libc functions increased the precision of type-qualifier inference, at the level of source code. To advance the state of the art in machine-code type recovery, we believe it is important to also embrace polymorphic functions as a natural and common feature of machine code. Significant improvements to static type reconstruction, even for monomorphic types, will require the capability to infer polymorphic types of some nontrivial complexity.

    struct LL                        close_last:
    {                                    push ebp
        struct LL * next;                mov  ebp, esp
        int handle;                      sub  esp, 8
    };                                   mov  edx, dword [ebp+arg_0]
                                         jmp  loc_8048402
    int close_last(struct LL * list) loc_8048400:
    {                                    mov  edx, eax
        while (list->next != NULL)   loc_8048402:
        {                                mov  eax, dword [edx]
            list = list->next;           test eax, eax
        }                                jnz  loc_8048400
        return close(list->handle);      mov  eax, dword [edx+4]
    }                                    mov  dword [ebp+arg_0], eax
                                         leave
                                         jmp  __thunk_.close

Type scheme inferred from the machine code:

    ∀F. (∃τ. C) ⇒ F  where  C = { F.in_stack0 ⊑ τ,
                                  τ.load.σ32@0 ⊑ τ,
                                  τ.load.σ32@4 ⊑ int ∧ #FileDescriptor,
                                  int ∨ #SuccessZ ⊑ F.out_eax }

Reconstructed C type:

    typedef struct {
        Struct_0 * field_0;
        int field_4;        // #FileDescriptor
    } Struct_0;

    int // #SuccessZ
    close_last(const Struct_0 *);

Figure 2. Example C code (compiled with gcc 4.5.4 on Linux with flags -m32 -O2), disassembly, type scheme inferred from the machine code, and reconstructed C type. The tags #FileDescriptor and #SuccessZ encode inferred higher-level purposes.

2.3 Recursive Types

The relevance of recursive types for decompilation was recently discussed by Schwartz et al. [30], where lack of a recursive type system for machine code was cited as an important source of imprecision. Since recursive data structures are relatively common, it is desirable that a type-inference scheme for machine code be able to represent and infer recursive types natively.

2.4 Offset and Reinterpreted Pointers

Unlike in source code, there is no syntactic distinction in machine code between a pointer-to-struct and a pointer-to-first-member-of-struct. For example, if X has type struct { char*, FILE*, size_t }* on a 32-bit platform, then it should be possible to infer that X + 4 can be safely passed to fclose; conversely, if X + 4 is passed to fclose, we may need to infer that X points to a structure that, at offset 4, contains a FILE*. This affects the typing of local structures as well: a structure on the stack may be manipulated using a pointer to its starting address or by manipulating the members directly, e.g., through the frame pointer.

These idioms, along with casts from derived* to base*, fall under the general class of physical [31] or non-structural [24] subtyping. In Retypd, we model these forms of subtyping using type-scheme specialization (§3.5). Additional hints about the extent of local variables are found using data-delineation analysis [13].

2.5 Disassembly Failures

The problem of producing correct disassembly for stripped binaries is equivalent to the halting problem. As a result, we can never assume that our reconstructed program representation will be perfectly correct. Even sound analyses built on top of an unsound program representation may exhibit inconsistencies and quirks. Thus, we must be careful that incorrect disassembly or analysis results from one part of the binary will not influence the correct type results we may have gathered for the rest of the binary. Type systems that model value assignments as type unifications are vulnerable to over-unification issues caused by bad IR. Since unification is non-local, bad constraints in one part of the binary can degrade all type results.
Another instance of this problem arises from the use of register parameters. Although the x86 cdecl calling convention uses the stack for parameter passing, most optimized binaries will include many functions that pass parameters in registers for speed. Often, these functions do not conform to any standard calling convention. Although we work hard to ensure that only true register parameters are reported, conservativeness demands the occasional false positive.

Type-reconstruction methods that are based on unification are generally sensitive to precision loss due to false-positive register parameters. A common case is the "push ecx" idiom that reserves space for a single local variable in the stack frame of a function f. If ecx is incorrectly viewed as a register parameter of f in a unification-based scheme, whatever type variables are bound to ecx at each callsite to f will be mistakenly unified. In our early experiments, we found these overunifications to be a persistent and hard-to-diagnose source of imprecision.

In our early unification-based experiments, mitigation heuristics against overunification quickly ballooned into a disproportionately large and unprincipled component of type analysis. We designed Retypd's subtype-based constraint system to avoid the need for such ad-hoc prophylactics against overunification.

2.6 Cross-casting and Bit Twiddling

Even at the level of source code, there are already many type-unsafe idioms in common use. Most of these idioms operate by directly manipulating the bit representation of a value, either to encode additional information or to perform computations that are not possible using the type's usual interface. Some common examples include

• hashing values by treating them as untyped bit blocks [1],
• stealing unused bits of a pointer for tag information, such as whether a thunk has been evaluated [20],
• reducing the storage requirements of a doubly-linked list by XOR-combining the next and prev pointers, and
• directly manipulating the bit representation of another type, as in the quake3 inverse square root trick [29].

Because of these type-unsafe idioms, it is important that a type-inference scheme continues to produce useful results even in the presence of apparently contradictory constraints. We handle this situation in three ways:

1. separating the phases of constraint entailment, solving, and consistency checking,
2. modeling types with sketches (§3.5) that carry more information than C types, and
3. using unions to combine types with otherwise incompatible capabilities (e.g., τ is both int-like and pointer-like).

2.7 Incomplete Points-to Information

Degradation of points-to accuracy on large programs has been identified as a source of type-precision loss in other systems [10]. Our algorithm can provide high-quality types even in the absence of points-to information. Precision can be further improved by increasing points-to knowledge via machine-code analyses such as VSA [6], but good results are already attained with no points-to analysis beyond the simpler problem of tracking the stack pointer.

2.8 Ad-hoc Subtyping

Programs may define an ad-hoc type hierarchy via typedefs. This idiom appears in the Windows API, where a variety of handle types are all defined as typedefs of void*. Some of the handle types are to be used as subtypes of other handles; for example, a GDI handle (HGDI) is a generic handle used to represent any one of the more specific HBRUSH, HPEN, etc. In other cases, a typedef may indicate a supertype, as in LPARAM or DWORD; although these are typedefs of int, they have the intended semantics of a generic 32-bit type, which in different contexts may be used as a pointer, an integer, a flag set, and so on.

To accurately track ad-hoc hierarchies requires a type system based around subtyping rather than unification. Models for common API type hierarchies are useful; still better is the ability for the end user to define or adjust the initial type hierarchy at run time. We support this feature by parameterizing the main type representation by an uninterpreted lattice Λ, as described in §3.5.

Table 1. Example field labels (type capabilities) in Σ.

    Label      Variance   Capability
    .in_L         ⊖       Function with input in location L.
    .out_L        ⊕       Function with output in location L.
    .load         ⊕       Readable pointer.
    .store        ⊖       Writable pointer.
    .σN@k         ⊕       Has N-bit field at offset k.

3. The Type System

The type system used by Retypd is based around the inference of recursively constrained type schemes (§3.1). Solutions to constraint sets are modeled by sketches (§3.5); the sketch associated to a value consists of a record of capabilities which that value holds, such as whether it can be stored to, called, or accessed at a certain offset. Sketches also include markings drawn from a customizable lattice (Λ, ∨, ∧, <:), used to propagate high-level information such as typedef names and domain-specific purposes during type inference.

Retypd also supports recursively constrained type schemes that abstract over the set of types subject to a constraint set C. The language of type constraints used by Retypd is weak enough that, for any constraint set C, satisfiability of C can be reduced (in cubic time) to checking a set of scalar constraints κ1 <: κ2, where the κi are constants belonging to Λ.

Thanks to the reduction of constraint satisfiability to scalar constraint checking, we can omit expensive satisfiability checks during type inference. Instead, we delay the check until the final stage, when internal types are converted to C types for display, providing a natural place to instantiate union types that resolve any inconsistencies. Since compiler optimizations and type-unsafe idioms in the original source frequently lead to program fragments with unsatisfiable type constraints (§2.5, §2.6), this trait is particularly desirable.

3.1 Syntax: the Constraint Type System

Throughout this section, we fix a set V of type variables, an alphabet Σ of field labels, and a function ⟨·⟩ : Σ → {⊕, ⊖} denoting the variance (Definition 3.2) of each label. We do not require the set Σ to be finite. Retypd makes use of a large set of labels; for simplicity, we will focus on those in Table 1.

Within V, we assume there is a distinguished set of type constants. These type constants are symbolic representations of elements κ belonging to some lattice, but are otherwise uninterpreted. It is usually sufficient to think of the type constants as type names or semantic tags.

Definition 3.1. A derived type variable is an expression of the form αw with α ∈ V and w ∈ Σ*.

Definition 3.2. The variance of a label ℓ encodes the subtype relationship between α.ℓ and β.ℓ when α is a subtype of β, formalized in rules S-FIELD⊕ and S-FIELD⊖ of Figure 3. The variance function can be extended to Σ* by defining ⟨ε⟩ = ⊕ and ⟨xw⟩ = ⟨x⟩ · ⟨w⟩, where ({⊕, ⊖}, ·) is the sign monoid with ⊕ · ⊕ = ⊖ · ⊖ = ⊕ and ⊕ · ⊖ = ⊖ · ⊕ = ⊖. A word w ∈ Σ* is called covariant if ⟨w⟩ = ⊕, or contravariant if ⟨w⟩ = ⊖.

Definition 3.3. Let V = {αi} be a set of base type variables. A constraint is an expression of the form VAR X ("existence of the derived type variable X") or X ⊑ Y ("X is a subtype of Y"), where X and Y are derived type variables. A constraint set over V is a finite collection of constraints, where the type variables in each constraint are either type constants or members of V. We will say that C entails c, denoted C ⊢ c, if c can be derived from the constraints in C using the deduction rules of Figure 3.

We also allow projections: given a constraint set C with free variable τ, the projection ∃τ. C binds τ as an "internal" variable in the constraint set. See Figure 2 for an example or, for a more in-depth treatment of constraint projection, see Su et al. [34].

The field labels used to form derived type variables are meant to represent capabilities of a type. For example, the constraint VAR α.load means α is a readable pointer, and the derived type variable α.load represents the type of the memory region obtained by loading from α.

Let us briefly see how operations in the original program translate to type constraints, using C-like pseudocode for clarity. The full conversion from disassembly to type constraints is described in Appendix A.

Value copies: When a value is moved between program variables in an assignment like x := y, we make the conservative assumption that the type of x may be upcast to a supertype of y. We will generate a constraint of the form Y ⊑ X.

Loads and stores: Suppose that p is a pointer to a 32-bit type, and a value is loaded into x by the assignment x := *p. Then we will generate a constraint of the form P.load.σ32@0 ⊑ X. Similarly, a store *q := y results in the constraint Y ⊑ Q.store.σ32@0. In some of the pointer-based examples in this paper we omit the final .σN@k access after a .load or .store to simplify the presentation.

Function calls: Suppose the function f is invoked by y := f(x). We will generate the constraints X ⊑ F.in and F.out ⊑ Y, reflecting the flow of actuals to and from formals. Note that if we define A.in = X and A.out = Y, then the two constraints are equivalent to F ⊑ A by the rules of Figure 3. This encodes the fact that the called function's type must be at least as specific as the type used at the callsite.

One of the primary goals of our type-inference engine is to associate to each procedure a most-general type scheme.

Definition 3.4. A type scheme is an expression of the form ∀α. C ⇒ α1, where ∀α = ∀α1 ... ∀αn is quantification over a set of type variables, and C is a constraint set over {αi}i=1..n.

Type schemes provide a way of encoding the pre- and post-conditions that a function places on the types in its calling context. Without the constraint sets, we would only be able to represent conditions of the form "the input must be a subtype of X" and "the output must be a supertype of Y". The constraint set C can be used to encode more interesting type relations between inputs and outputs, as in the case of memcpy (§2.2). For example, a function that returns the second 4-byte element from a struct* may have the type scheme ∀τ. (τ.in.load.σ32@4 ⊑ τ.out) ⇒ τ.

Derived type variable formation:

    (T-LEFT)      α ⊑ β                       ⟹  VAR α
    (T-RIGHT)     α ⊑ β                       ⟹  VAR β
    (T-PREFIX)    VAR α.ℓ                     ⟹  VAR α
    (T-INHERITL)  α ⊑ β,  VAR β.ℓ             ⟹  VAR α.ℓ
    (T-INHERITR)  α ⊑ β,  VAR α.ℓ             ⟹  VAR β.ℓ

Subtyping:

    (S-REFL)      VAR α                       ⟹  α ⊑ α
    (S-TRANS)     α ⊑ β,  β ⊑ γ               ⟹  α ⊑ γ
    (S-FIELD⊕)    α ⊑ β,  VAR α.ℓ,  ⟨ℓ⟩ = ⊕   ⟹  α.ℓ ⊑ β.ℓ
    (S-FIELD⊖)    α ⊑ β,  VAR α.ℓ,  ⟨ℓ⟩ = ⊖   ⟹  β.ℓ ⊑ α.ℓ
    (S-POINTER)   VAR α.load,  VAR α.store    ⟹  α.store ⊑ α.load

Figure 3. Deduction rules for the type system. α, β, γ represent derived type variables; ℓ represents a label in Σ.

3.2 Deduction Rules

The deduction rules for our type system appear in Figure 3.

    f() {          g() {
        p = q;         p = q;
        *p = x;        *q = x;
        y = *q;        y = *p;
    }              }

Figure 4. Two programs, each mediating a copy from x to y through a pair of aliased pointers.
Most of the rules are self-evident under the interpretation in Definition 3.3, but a few require some additional motivation.

S-FIELD⊕ and S-FIELD⊖: These rules ensure that field labels act as co- or contravariant type operators, generating subtype relations between derived type variables from subtype relations between the original variables.

T-INHERITL and T-INHERITR: The rule T-INHERITL should be uncontroversial, since a subtype should have all capabilities of its supertype. The rule T-INHERITR is more unusual, since it moves capabilities in the other direction; taken together, these rules require that two types in a subtype relation must have exactly the same set of capabilities. This is a form of structural typing, ensuring that comparable types have the same shape.

Structural typing appears to be at odds with the need to cast more capable objects to less capable ones, as described in §2.4. Indeed, T-INHERITR eliminates the possibility of forgetting capabilities during value assignments. But we still maintain this capability at procedure invocations due to our use of polymorphic type schemes. An explanation of how type-scheme instantiation enables us to forget fields of an object appears in §3.4, with more details in §E.1.2. These rules ensure that Retypd can perform "iterative variable recovery"; lack of iterative variable recovery was cited by the creators of the Phoenix decompiler [30] as a common cause of incorrect decompilation when using TIE [17] for type recovery.

S-POINTER: This rule is a consistency condition ensuring that the type that can be loaded from a pointer is a supertype of the type that can be stored to a pointer. Without this rule, pointers would provide a channel for subverting the type system. An example of how this rule is used in practice appears in §3.3.

3.3 Modeling Pointers

When modeling pointers, we found that our initial naïve approach suffered from unexpected difficulties when combined with subtyping. Following the C type system, it seemed natural to model pointers by introducing an injective unary type constructor Ptr, so that Ptr(T) is the type of pointers to T. In a unification-based type system, this approach works as expected. In the presence of subtyping, a new issue arises.

Consider the two programs in Figure 4. Since the type variables P and Q associated to p, q can be seen to be pointers, we can begin by writing P = Ptr(α), Q = Ptr(β). The first program will generate the constraint set C1 = {Ptr(β) ⊑ Ptr(α), X ⊑ α, β ⊑ Y}, while the second generates C2 = {Ptr(β) ⊑ Ptr(α), X ⊑ β, α ⊑ Y}. Since each program has the effect of copying the value in x to y, both constraint sets should satisfy Ci ⊢ X ⊑ Y. To do this, the pointer subtype constraint must entail some constraint on α and β, but which one?

If we assume that Ptr is covariant, then Ptr(β) ⊑ Ptr(α) entails β ⊑ α, and so C2 ⊢ X ⊑ Y but C1 ⊬ X ⊑ Y. On the other hand, if we make Ptr contravariant, then C1 ⊢ X ⊑ Y but C2 ⊬ X ⊑ Y.

It seems that our only recourse is to make subtyping degenerate to type equality under Ptr: we are forced to declare that Ptr(β) ⊑ Ptr(α) ⊢ α = β, which of course means that Ptr(β) = Ptr(α) already. This is a catastrophe for subtyping as used in machine code, since many natural subtype relations are mediated through pointers. For example, the unary Ptr constructor cannot handle the simplest kind of C++ class subtyping, where a derived class physically extends a base class by appending new member variables.

The root cause of the difficulty seems to be in conflating two capabilities that (most) pointers have: the ability to be written through and the ability to be read through. In Retypd, these two capabilities are modeled using different field labels .store and .load. The .store label is contravariant, while the .load label is covariant.
The deduction rules of Figure 3 are simple enough that To see how the separation of pointer capabilities avoids the each proof may be reduced to a normal form (see Theo- loss of precision suffered by Ptr, we revisit the two example rem B.1). An encoding of the normal forms as transition programs. The first generates the constraint set sequences in a modified pushdown system is used to pro- vide a compact representation of the entailment closure 10 = Q P,X P.store, Q.load Y = c c . The pushdown system modeling is C { v v v } C { | C ` } C queried and manipulated to provide most of the interesting By rule T-INHERITR we may conclude that Q also has type-inference functionality. An outline of this functionality a field of type .store. By S-POINTER we can infer that appears in §5.2. Q.store Q.load. Finally, since .store is contravariant and v Q P , S-FIELD says we also have P.store Q.store. α = int # SuccessZ v v ∨ load Putting these parts together gives the subtype chain β = int # FileDescriptor 0 ∧ @ > A σ32 X P.store Q.store Q.load Y 0 load v v v v instack > > σ32 @ > 4 β The second program generates the constraint set out eax α A 0 = Q P,X Q.store,P.load Y C2 { v v v } Since Q P and P has a field .load, we conclude that Q Figure 5. A sketch instantiating the type scheme in Figure 2. v has a .load field as well. Next, S-POINTER requires that Q.store Q.load. Since .load is covariant, Q P implies v v from a set of type constraints. Yet there is no notion of what that Q.load P.load. This gives the subtype chain is v a type inherent to the deduction rules of Figure 3. We have defined the rules of the game, but not the equipment with X Q.store Q.load P.load Y v v v v which it should be played. By splitting out the read- and write-capabilities of a We found that introducing C-like entities at the level of pointer, we can achieve a sound account of pointer subtyping constraints or types resulted in too much loss of precision that does not degenerate to type equality. 
Note the importance of the consistency condition S-POINTER: this rule ensures that writing through a pointer and reading the result cannot subvert the type system.

The need for separate handling of read- and write-capabilities in a mutable reference has been rediscovered multiple times. A well-known instance is the covariance of the array type constructor in Java and C#, which can cause runtime type errors if the array is mutated; in these languages, the read capabilities are modeled correctly only by sacrificing soundness for the write capabilities.

3.4 Non-structural Subtyping and T-INHERITR

It was noted in §3.2 that the rule T-INHERITR leads to a system with a form of structural typing: any two types in a subtype relation must have the same capabilities. Superficially, this seems problematic for modeling typecasts that forget about fields, such as a cast from derived* to base* when derived* has additional fields (§2.4). The missing piece that allows us to effectively forget capabilities is instantiation of callee type schemes at a callsite. To demonstrate how polymorphism enables forgetfulness, consider the example type scheme ∀F. (∃τ. C) ⇒ F from Figure 2. The function close_last can be invoked by providing any actual-in type α such that α ⊑ F.in_stack0; in particular, α can have more fields than those required by C. We simply select a more capable type for the existentially-quantified type variable τ in C. In effect, we have used specialization of polymorphic types to model non-structural subtyping idioms, while subtyping is used only to model structural subtyping idioms. This restricts our introduction of non-structural subtypes to points where a type scheme is instantiated, such as at a call site.

3.5 Semantics: the Poset of Sketches

The simple type system defined by the deduction rules of Figure 3 defines the syntax of legal derivations in our type system. The constraint solver of §5.2 is designed to find a simple representation for all conclusions that can be derived from a set of type constraints. Yet there is no notion of what a type is inherent to the deduction rules of Figure 3. We have defined the rules of the game, but not the equipment with which it should be played.

We found that introducing C-like entities at the level of constraints or types resulted in too much loss of precision when working with the challenging examples described in §2. Consequently, we developed the notion of a sketch, a kind of regular tree labeled with elements of an auxiliary lattice Λ. Sketches are related to the recursive types studied by Amadio and Cardelli [3] and Kozen et al. [16], but do not depend on a priori knowledge of the ranked alphabet of type constructors.

Definition 3.5. A sketch is a (possibly infinite) tree T with edges labeled by elements of Σ and nodes marked with elements of a lattice Λ, such that T has only finitely many subtrees up to labeled isomorphism. By collapsing isomorphic subtrees, we can represent sketches as deterministic finite-state automata with each state labeled by an element of Λ. The set of sketches admits a lattice structure, with operations described by Figure 18.

The lattice of sketches serves as the model in which we interpret type constraints. The interpretation of the constraint VAR α.u is "the sketch Sα admits a path from the root with label sequence u", and α.u ⊑ β.v is interpreted as "the sketch obtained from Sα by traversing the label sequence u is a subsketch (in the lattice order) of the sketch obtained from Sβ by traversing the sequence v."

The main utility of sketches is that they are nearly a free tree model [25] of the constraint language. Any constraint set C is satisfiable over the lattice of sketches, as long as C cannot prove an impossible subtype relation in the auxiliary lattice Λ. In particular, we can always solve the fragment of C that does not reference constants in Λ. Stated operationally, we can always recover the tree structure of sketches that potentially solve C. This observation is formalized by the following theorem:

Theorem 3.1. Suppose that C is a constraint set over the variables τᵢ, i ∈ I. Then there exist sketches Sᵢ, i ∈ I, such that w ∈ Sᵢ if and only if C ⊢ VAR τᵢ.w.

Proof. The idea is to symmetrize ⊑ using an algorithm that is similar in spirit to Steensgaard's method of almost-linear-time pointer analysis [33]. Begin by forming a graph with one node n(α) for each derived type variable α appearing in C, along with each of its prefixes. Add an edge n(α) → n(α.ℓ) labeled ℓ for each derived type variable α.ℓ, forming a graph G. Now quotient G by the equivalence relation ∼ defined by n(α) ∼ n(β) if α ⊑ β ∈ C, and n(α′) ∼ n(β′) whenever there are edges n(α) → n(α′) labeled ℓ and n(β) → n(β′) labeled ℓ′ in G with n(α) ∼ n(β), where either ℓ = ℓ′, or ℓ = .load and ℓ′ = .store.

By construction, there exists a path through G/∼ with label sequence u starting at the equivalence class of τᵢ if and only if C ⊢ VAR τᵢ.u; the (regular) set of all such paths yields the tree structure of Sᵢ.
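As an illustration of the symmetrization step, here is a toy of ours (not the full algorithm of §D.4; the constraint syntax is simplified to dotted strings): union-find merges nodes related by ⊑, then propagates merges along edges with matching labels, treating .load and .store as matching.

```python
class UnionFind:
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:                      # path halving
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

def prefixes(dtv):
    parts = dtv.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

def sketch_trees(constraints):
    """Recover the tree structure of sketches from subtype constraints,
    in the spirit of the Theorem 3.1 construction (toy version)."""
    uf, edges = UnionFind(), set()
    for a, b in constraints:
        for dtv in (a, b):
            ps = prefixes(dtv)
            for parent, child in zip(ps, ps[1:]):
                edges.add((parent, child.rsplit(".", 1)[1], child))
        uf.union(a, b)                       # symmetrize the subtype relation
    changed = True
    while changed:                           # propagate along matching labels
        changed = False
        for p1, l1, c1 in list(edges):
            for p2, l2, c2 in list(edges):
                match = l1 == l2 or {l1, l2} == {"load", "store"}
                if match and uf.find(p1) == uf.find(p2) \
                        and uf.find(c1) != uf.find(c2):
                    uf.union(c1, c2)
                    changed = True
    quotient = {}                            # paths in G/~ give each sketch
    for p, l, c in edges:
        quotient.setdefault(uf.find(p), {})[l] = uf.find(c)
    return uf, quotient

# Example: p <= q and q.load <= x imply p's sketch also has a .load path.
uf, g = sketch_trees({("p", "q"), ("q.load", "x")})
assert "load" in g[uf.find("p")]
```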

Working out the lattice elements that should label Sᵢ is a trickier problem; the basic idea is to use the same automaton Q constructed during constraint simplification (Theorem 5.1) to answer queries about which type constants are upper and lower bounds on a given derived type variable. The full algorithm is listed in §D.4.

In Retypd, we use a large auxiliary lattice Λ containing hundreds of elements, including a collection of standard C type names, common typedefs for popular APIs, and user-specified semantic classes such as #FileDescriptor in Figure 2. This lattice helps model ad-hoc subtyping and preserve high-level semantic type names, as discussed in §2.8.

Note. Sketches are just one of many possible models for the deduction rules that could be proposed. A general approach is to fix a poset (T, <:) of types, interpret ⊑ as <:, and interpret co- and contra-variant field labels as monotone (resp. antimonotone) functions T → T. The separation of syntax from semantics allows for a simple way to parameterize the type-inference engine by a model of types. By choosing a model (T, ≡) with ≡ a symmetric relation on T, a unification-based type system similar to SecondWrite [10] is generated. On the other hand, by forming a lattice of type intervals and interval inclusion, we would obtain a type system similar to TIE [17] that outputs upper and lower bounds on each type variable.

4. Analysis Framework

4.1 IR Reconstruction

Retypd is built on top of GrammaTech's machine-code-analysis tool CodeSurfer for Binaries. CodeSurfer carries out common program analyses on binaries for multiple CPU architectures, including x86, x86-64, and ARM. CodeSurfer is used to recover a high-level IR from the raw machine code; type constraints are generated directly from this IR, and resolved types are applied back to the IR and become visible to the GUI and later analysis phases.

CodeSurfer achieves platform independence through TSL [18], a language for defining a processor's concrete semantics in terms of concrete numeric types and mapping types that model flag, register, and memory banks. Interpreters for a given abstract domain A are automatically created from the concrete semantics simply by specifying the abstract domain and an interpretation of the concrete numeric and mapping types. Retypd uses CodeSurfer's recovered IR to determine the number and location of inputs and outputs to each procedure, as well as the program's call graph and per-procedure control-flow graphs. An abstract interpreter then generates sets of type constraints from the concrete TSL instruction semantics. A detailed account of the abstract semantics for constraint generation appears in Appendix A.

4.2 Approach to Type Resolution

After the initial IR is recovered, type inference proceeds in two stages: first, type-constraint sets are generated in a bottom-up fashion over the strongly-connected components of the callgraph. Pre-computed type schemes for externally linked functions may be inserted at this stage. Each constraint set is simplified by eliminating type variables that do not belong to the SCC's interface; the simplification algorithm is outlined in §5. Once type schemes are available, the callgraph is traversed bottom-up, assigning sketches to type variables as outlined in §3.5. During this stage, type schemes are specialized based on the calling contexts of each function. Appendix F lists the full algorithms for constraint simplification (F.1) and solving (F.2).

4.3 Translation to C Types

The final phase of type resolution converts the inferred sketches to C types for presentation to the user. Since C types and sketches are not directly comparable, this resolution phase necessarily involves the application of heuristic conversion policies. Restricting the heuristic policies to a single post-inference phase provides us with the flexibility to generate high-quality, human-readable C types while maintaining soundness and generality during type reconstruction.

Example 4.1. A simple example involves the generation of const annotations on pointers. We decided on a policy that only introduces const annotations on function parameters, annotating the parameter at location L when the constraint set C for procedure p satisfies C ⊢ VAR p.in_L.load and C ⊬ VAR p.in_L.store. Retypd appears to be the first machine-code type-inference system to infer const annotations; a comparison of our recovered annotations to the original source code appears in §6.4.

Example 4.2. A more complex policy is used to decide between union types and generic types when incompatible scalar constraints must be resolved. Retypd merges comparable scalar constraints to form antichains in Λ; the elements of these antichains are then used for the resulting C union type.

Example 4.3. The initial type-simplification stage results in types that are as general as possible. Often, this means that types are found to be more general than is strictly helpful to a (human) observer. A policy is applied that specializes type schemes to the most specific scheme that is compatible with all statically-discovered uses. For example, a C++ object may include a getter function with a highly polymorphic type scheme, since it could operate equally well on any structure with a field of the correct type at the correct offset. But we expect that in every calling context, the getter will be called on a specific object type (or perhaps its derived types). We can specialize the getter's type by choosing the least polymorphic specialization that is compatible with the observed uses. By specializing the function signature before presenting a final C type to the user, we trade some generality for types that are more likely to match the original source.

5. The Simplification Algorithm

In this section, we sketch an outline of the simplification algorithm at the core of the constraint solver. The complete algorithm appears in Appendix D.

5.1 Inferring a Type Scheme

The goal of the simplification algorithm is to take an inferred type scheme ∀α. C ⇒ τ for a procedure and create a smaller constraint set C′, such that any constraint on τ implied by C is also implied by C′.

Let C denote the constraint set generated by abstract interpretation of the procedure being analyzed, and let α be the set of free type variables in C. We could already use ∀α. C ⇒ τ as the constraint set in the procedure's type scheme, since the input and output types used in a valid invocation of f are tautologically those that satisfy C. Yet, as a practical matter, we cannot use the constraint set directly, since this would result in constraint sets with many useless free variables and a high growth rate over nested procedures. Instead, we seek to generate a simplified constraint set C′, such that if c is an "interesting" constraint and C ⊢ c, then C′ ⊢ c as well. But what makes a constraint interesting?

Definition 5.1. For a type variable τ, a constraint is called interesting if it has one of the following forms:

• A capability constraint of the form VAR τ.u
• A recursive subtype constraint of the form τ.u ⊑ τ.v
• A subtype constraint of the form τ.u ⊑ κ or κ ⊑ τ.u, where κ is a type constant.

We will call a constraint set C′ a simplification of C if C′ ⊢ c for every interesting constraint c such that C ⊢ c. Since both C and C′ entail the same set of interesting constraints on τ, it is valid to replace C with C′ in any valid type scheme for τ.

Simplification heuristics for set-constraint systems were studied by Fähndrich and Aiken [12]; our simplification algorithm encompasses all of these heuristics.

5.2 Unconstrained Pushdown Systems

The constraint-simplification algorithm works on a constraint set C by building a pushdown system P_C whose transition sequences represent valid derivations of subtyping judgements. We briefly review pushdown systems and some necessary generalizations here.

Definition 5.2. An unconstrained pushdown system is a triple P = (V, Σ, ∆) where V is the set of control locations, Σ is the set of stack symbols, and ∆ ⊆ (V × Σ*)² is a (possibly infinite) set of transition rules. We will denote a transition rule by ⟨X; u⟩ → ⟨Y; v⟩, where X, Y ∈ V and u, v ∈ Σ*. We define the set of configurations to be V × Σ*. In a configuration (p, w), p is called the control state and w the stack state.

Note that we require neither the set of stack symbols nor the set of transition rules to be finite. This freedom is required to model the derivation rule S-POINTER of Figure 3, which corresponds to an infinite set of transition rules.

Definition 5.3. An unconstrained pushdown system P determines a transition relation −→ on the set of configurations: (X, w) −→ (Y, w′) if there is a suffix s and a rule ⟨X; u⟩ → ⟨Y; v⟩ such that w = us and w′ = vs. The transitive closure of −→ is denoted −→*.

With this definition, we can state the primary theorem behind our simplification algorithm.

Theorem 5.1. Let C be a constraint set and V a set of base type variables. Define a subset S_C of (V ∪ Σ)* × (V ∪ Σ)* by (Xu, Yv) ∈ S_C if and only if C ⊢ X.u ⊑ Y.v. Then S_C is a regular set, and an automaton Q to recognize S_C can be constructed in O(|C|³) time.

Proof. The basic idea is to treat each X.u ⊑ Y.v ∈ C as a transition rule ⟨X; u⟩ → ⟨Y; v⟩ in the pushdown system P. In addition, we add control states #START and #END with transitions ⟨#START; X⟩ → ⟨X; ε⟩ and ⟨X; ε⟩ → ⟨#END; X⟩ for each X ∈ V. For the moment, assume that (1) all labels are covariant, and (2) the rule S-POINTER is ignored. By construction, (#START, Xu) −→* (#END, Yv) in P if and only if C ⊢ X.u ⊑ Y.v. A theorem of Büchi [27] ensures that for any two control states A and B in a standard (not unconstrained) pushdown system, the set of all pairs (u, v) with (A, u) −→* (B, v) is a regular language; Caucal [8] gives a saturation algorithm that constructs an automaton to recognize this language.

In the full proof, we add two novelties: first, we support contravariant stack symbols by encoding variance data into the control states and transition rules. The second novelty involves the rule S-POINTER; this rule is problematic, since the natural encoding would result in infinitely many transition rules. We extend Caucal's construction to lazily instantiate all necessary applications of S-POINTER during saturation. For details, see Appendix D.
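To make the pushdown encoding concrete, here is a toy, illustrative search over the transition relation (our sketch, not the saturation construction of Appendix D, which never enumerates configurations). Each constraint X.u ⊑ Y.v becomes a rule ⟨X; u⟩ → ⟨Y; v⟩, the #START/#END bracketing rules are added as in the proof, and, assuming all labels covariant, C ⊢ p.load ⊑ x corresponds to a path from (#START, p·load) to (#END, x).

```python
from collections import deque

def reachable(rules, variables, start, target, max_stack=4):
    """Bounded search of the transition relation of a pushdown system.
    rules: set of ((X, u), (Y, v)) with u, v tuples of stack symbols.
    A configuration (state, stack) steps to (Y, v + s) when stack = u + s."""
    all_rules = set(rules)
    for x in variables:                       # START/END bracketing rules
        all_rules.add((("#START", (x,)), (x, ())))
        all_rules.add(((x, ()), ("#END", (x,))))
    seen, work = {start}, deque([start])
    while work:
        state, stack = work.popleft()
        if (state, stack) == target:
            return True
        for (x, u), (y, v) in all_rules:
            if state == x and stack[:len(u)] == u:
                nxt = (y, v + stack[len(u):])
                if len(nxt[1]) <= max_stack and nxt not in seen:
                    seen.add(nxt)
                    work.append(nxt)
    return False

# C = {p <= q, q.load <= x} entails p.load <= x:
rules = {(("p", ()), ("q", ())), (("q", ("load",)), ("x", ()))}
assert reachable(rules, {"p", "q", "x"},
                 ("#START", ("p", "load")), ("#END", ("x",)))
```

The `max_stack` bound keeps this toy search finite; the saturation algorithm instead builds the automaton Q recognizing all such pairs directly.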

Since C will usually entail an infinite number of constraints, this theorem is particularly useful: it tells us that the full set of constraints entailed by C has a finite encoding by an automaton Q. Further manipulations on the constraint closure, such as efficient minimization, can be carried out on Q. By restricting the transitions to and from #START and #END, the same algorithm is used to eliminate type variables, producing the desired constraint simplifications.

5.3 Overall Complexity of Inference

The saturation algorithm used to perform constraint-set simplification and type-scheme construction is, in the worst case, cubic in the number of subtype constraints to simplify. Since some well-known pointer analysis methods also have cubic complexity (such as Andersen [4]), it is reasonable to wonder if Retypd's "points-to free" analysis really offers a benefit over a type-inference system built on top of points-to analysis data.

To understand where Retypd's efficiencies are found, first consider the n in O(n³). Retypd's core saturation algorithm is cubic in the number of subtype constraints; due to the simplicity of machine-code instructions, there is roughly one subtype constraint generated per instruction. Furthermore, Retypd applies constraint simplification to each procedure in isolation to eliminate the procedure-local type variables, resulting in constraint sets that only relate procedure formal-ins, formal-outs, globals, and type constants. In practice, these simplified constraint sets are small.

Since each procedure's constraint set is simplified independently, the n³ factor is controlled by the largest procedure size, not the overall size of the binary. By contrast, source-code points-to analyses such as Andersen's are generally cubic in the overall number of pointer variables, with exponential duplication of variables depending on the call-string depth used for context sensitivity. The situation is even more difficult for machine-code points-to analyses such as VSA, since there is no syntactic difference between a scalar and a pointer in machine code. In effect, every program variable must be treated as a potential pointer.

On our benchmark suite of real-world programs, we found that execution time for Retypd scales slightly below O(N^1.1), where N is the number of program instructions. The following back-of-the-envelope calculation can heuristically explain much of the disparity between the O(N³) theoretical complexity and the O(N^1.1) measured complexity. On our benchmark suite, the maximum procedure size n grew roughly like n ≈ N^(2/5). We could then expect that a per-procedure analysis would perform worst when the program is partitioned into N^(3/5) procedures of size N^(2/5). On such a program, a per-procedure O(n^k) analysis may be expected to behave more like an O(N^(3/5) · (N^(2/5))^k) = O(N^((3+2k)/5)) analysis overall. In particular, a per-procedure cubic analysis like Retypd could be expected to scale like a global O(N^1.8) analysis. The remaining differences in observed versus theoretical execution time can be explained by the facts that real-world constraint graphs do not tend to exercise the simplification algorithm's worst-case behavior, and that the distribution of procedure sizes is heavily weighted towards small procedures.

6. Evaluation

6.1 Implementation

Retypd is implemented as a module within CodeSurfer for Binaries. By leveraging the multi-platform disassembly capabilities of CodeSurfer, it can operate on x86, x86-64, and ARM code. We performed the evaluation using minimal analysis settings, disabling value-set analysis (VSA) but computing affine relations between the stack and frame pointers. Enabling additional CodeSurfer phases such as VSA can greatly improve the reconstructed IR, at the expense of increased analysis time.

Existing type-inference algorithms such as TIE [17] and SecondWrite [10] require some modified form of VSA to resolve points-to data. Our approach shows that high-quality types can be recovered in the absence of points-to information, allowing type inference to proceed even when computing points-to data is too unreliable or expensive.

6.2 Evaluation Setup

Our benchmark suite consists of 160 32-bit x86 binaries for both Linux and Windows, compiled with a variety of gcc and Microsoft Visual C/C++ versions. The benchmark suite includes a mix of executables, static libraries, and DLLs. The suite includes the same coreutils and SPEC2006 benchmarks used to evaluate REWARDS, TIE, and SecondWrite [10, 17, 19]; additional benchmarks came from a standard suite of real-world programs used for precision and performance testing of CodeSurfer for Binaries. All binaries were built with optimizations enabled and debug information disabled.

Ground truth is provided by separate copies of the binaries that have been built with the same settings, but with debug information included (DWARF on Linux, PDB on Windows). We used IdaPro [15] to read the debug information, which allowed us to use the same scripts for collecting ground-truth types from both DWARF and PDB data.

All benchmarks were evaluated on a 2.6 GHz Intel Xeon E5-2670 CPU, running on a single logical core. RAM utilization by CodeSurfer and Retypd combined was capped at 10 GB.

Benchmark       | Description                | Instructions
----------------|----------------------------|-------------
CodeSurfer benchmarks
libidn          | Domain name translator     | 7K
Tutorial00      | Direct3D tutorial          | 9K
zlib            | Compression library        | 14K
ogg             | Multimedia library         | 20K
distributor     | UltraVNC repeater          | 22K
libbz2          | BZIP library, as a DLL     | 37K
glut            | The glut32.dll library     | 40K
pngtest         | A test of libpng           | 42K
freeglut        | The freeglut.dll library   | 77K
miranda         | IRC client                 | 100K
XMail           | Email server               | 137K
yasm            | Modular assembler          | 190K
python21        | Python 2.1                 | 202K
quake3          | Quake 3                    | 281K
TinyCad         | Computer-aided design      | 544K
Shareaza        | Peer-to-peer file sharing  | 842K
SPEC2006 benchmarks
470.lbm         | Lattice Boltzmann Method   | 3K
429.mcf         | Vehicle scheduling         | 3K
462.libquantum  | Quantum computation        | 11K
401.bzip2       | Compression                | 13K
458.sjeng       | Chess AI                   | 25K
433.milc        | Quantum field theory       | 28K
482.sphinx3     | Speech recognition         | 43K
456.hmmer       | Protein sequence analysis  | 71K
464.h264ref     | Video compression          | 113K
445.gobmk       | GNU Go AI                  | 203K
400.perlbench   | Perl core                  | 261K
403.gcc         | C/C++/Fortran compiler     | 751K

Figure 7. Benchmarks used for evaluation. All binaries were compiled from source using optimized release configurations. The SPEC2006 benchmarks were chosen to match the benchmarks used to evaluate SecondWrite [10].

Our benchmark suite includes the individual binaries in Figure 7 as well as the collections of related binaries shown in Figure 10. We found that programs from a single collection tended to share a large amount of common code, leading to highly correlated benchmark results. For example, even though the coreutils benchmarks include many tools with very disparate purposes, all of the tools make use of a large, common set of statically linked utility routines. Over 80% of the .text section in tail consists of such routines; for yes, the number is over 99%. The common code and specific idioms appearing in coreutils make it a particularly low-variance benchmark suite.

In order to avoid over-representing these program collections in our results, we treated these collections as clusters in the data set. For each cluster, we computed the average of each metric over the cluster, then inserted the average as a single data point in the final data set. Because Retypd performs well on many clusters, this averaging procedure tends to reduce our overall precision and conservativeness measurements. Still, we believe that it gives a less biased depiction of the algorithm's expected real-world behavior than does an average over all benchmarks.

6.3 Sources of Imprecision

Although Retypd is built around a sound core of constraint simplification and solving, there are several ways that imprecision can occur. First, as described in §2.5, disassembly failures can lead to unsound constraint generation. Second, the heuristics for converting from sketches to C types are lossy by necessity. Finally, we treat the source types as ground truth, leading to "failures" whenever Retypd recovers an accurate type that does not match the original program, a common situation with type-unsafe source code.

A representative example of this last source of imprecision appears in Figure 6. This source code belongs to the miranda32 IRC client, which uses a plugin-based architecture; most of miranda32's functionality is implemented by "service functions" with the fixed signature int ServiceProc(WPARAM, LPARAM). The types WPARAM and LPARAM are used in certain Windows APIs for generic 16- and 32-bit values. The two parameters are immediately cast to other types in the body of the service functions, as in Figure 6.

    INT_PTR Proto_EnumAccounts(WPARAM wParam, LPARAM lParam)
    {
        *( int* )wParam = accounts.getCount();
        *( PROTOACCOUNT*** )lParam = accounts.getArray();
        return 0;
    }

Figure 6. Ground-truth types declared in the original source do not necessarily reflect program semantics. Example from miranda32.

6.4 const Correctness

As a side effect of separately modeling .load and .store capabilities, Retypd is easily able to recover information about how pointer parameters are used for input and/or output. We take this into account when converting sketches to C types: if a function's sketch includes .in_L.load but not .in_L.store, then we annotate the parameter at L with const, as in Figure 5 and Figure 2. Retypd appears to be the first machine-code type-inference system to infer const annotations directly.

On our benchmark suite, we found that 98% of the parameter const annotations in the original source code were recovered by Retypd. Furthermore, Retypd inferred const annotations on many other parameters; unfortunately, since most C and C++ code does not use const in every possible situation, we do not have a straightforward way to detect how many of Retypd's additional const annotations are correct.

Manual inspection of the missed const annotations shows that most instances are due to imprecision when analyzing one or two common statically linked library functions. This imprecision then propagates outward to callers, leading to decreased const correctness overall. Still, we believe the 98% recovery rate shows that Retypd offers a useful approach to const inference.
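Operationally, the const policy of Example 4.1 is a one-line test on the set of capabilities proven for each parameter location. A minimal sketch (the data layout below is hypothetical, not Retypd's internal representation):

```python
def const_params(capabilities):
    """capabilities maps each parameter location to the set of derived
    capabilities proven for it, e.g. {".load"} or {".load", ".store"}.
    A pointer parameter is const if it is read through but never written
    through (the policy of Example 4.1)."""
    return {loc for loc, caps in capabilities.items()
            if ".load" in caps and ".store" not in caps}

# memcpy-like signature: dest is written through, src is only read.
caps = {"in_0": {".load", ".store"},   # dest: not const
        "in_1": {".load"}}             # src: const
assert const_params(caps) == {"in_1"}
```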

[Figure 8 omitted: bar charts of distance and interval size for TIE, SecondWrite, REWARDS, and Retypd over coreutils, SPEC2006, and all benchmarks.]
Figure 8. Distance to ground-truth types and size of the interval between inferred upper and lower bounds. Smaller distances represent more accurate types; smaller interval sizes represent increased confidence.

[Figure 9 omitted: bar charts of conservativeness and pointer accuracy over coreutils, SPEC2006, and all benchmarks.]
Figure 9. Conservativeness and pointer accuracy metric. Perfect type reconstruction would be 100% conservative and match on 100% of pointer levels. Note that the y axis begins at 70%.

6.5 Comparisons to Other Tools

We gathered results over several metrics that have been used to evaluate SecondWrite, TIE, and REWARDS. These metrics were defined by Lee et al. [17] and are briefly reviewed here.

TIE infers upper and lower bounds on each type variable, with the bounds belonging to a lattice of C-like types. The lattice is naturally stratified into levels, with the distance between two comparable types roughly being the difference between their levels in the lattice, up to a maximum distance of 4. A recursive formula for computing distances between pointer and structural types is also used. TIE also determines a policy that selects between the upper and lower bounds on a type variable for the final displayed type.

TIE considers three metrics based on this lattice: the conservativeness rate, the interval size, and the distance. A type interval is conservative if the interval bounds overapproximate the declared type of a variable. The interval size is the lattice distance from the upper to the lower bound on a type variable. The distance measures the lattice distance from the final displayed type to the ground-truth type. REWARDS and SecondWrite both use unification-based algorithms, and have been evaluated using the same TIE metrics. The evaluation of REWARDS using TIE metrics appears in Lee et al. [17].

Distance and interval size: Retypd shows substantial improvements over other approaches in the distance and interval-size metrics, indicating that it generates more accurate types with less uncertainty. The mean distance to the ground-truth type was 0.54 for Retypd, compared to 1.15 for dynamic TIE, 1.53 for REWARDS, 1.58 for static TIE, and 1.70 for SecondWrite. The mean interval size shrunk to 1.2 with Retypd, compared to 1.7 for SecondWrite and 2.0 for TIE.

Multi-level pointer accuracy: ElWazeer et al. [10] also introduced a multi-level pointer-accuracy rate that attempts to quantify how many "levels" of pointers were correctly inferred. On SecondWrite's benchmark suite, Retypd attained a mean multi-level pointer accuracy of 91%, compared with SecondWrite's reported 73%. Across all benchmarks, Retypd averages 88% pointer accuracy.

Conservativeness: The best type system would have a high conservativeness rate (few unsound decisions) coupled with a low interval size (tightly specified results) and low distance (inferred types are close to ground-truth types). In each of these metrics, Retypd performs about as well as or better than existing approaches. Retypd's mean conservativeness rate is 95%, compared to 94% for TIE. But note that TIE was evaluated only on coreutils; on that cluster, Retypd's conservativeness was 98%. SecondWrite's overall conservativeness is 96%, measured on a subset of the SPEC2006 benchmarks; Retypd attained a slightly lower 94% on this subset.

It is interesting to note that Retypd's conservativeness rate on coreutils is comparable to that of REWARDS, even though REWARDS' use of dynamic execution traces suggests it would be more conservative than a static analysis, by virtue of only generating feasible type constraints.

6.6 Performance

Although the core simplification algorithm of Retypd has cubic worst-case complexity, it only needs to be applied on a per-procedure basis. This suggests that the real-world scaling behavior will depend on the distribution of procedure sizes, not on the whole-program size.
Cluster                    | Count | Description        | Instructions | Distance | Interval | Conserv. | Ptr. Acc. | Const
---------------------------|-------|--------------------|--------------|----------|----------|----------|-----------|------
freeglut-demos             | 3     | freeglut samples   | 2K           | 0.66     | 1.49     | 97%      | 83%       | 100%
coreutils                  | 107   | GNU coreutils 8.23 | 10K          | 0.51     | 1.19     | 98%      | 82%       | 96%
vpx-d                      | 8     | VPx decoders       | 36K          | 0.63     | 1.68     | 98%      | 92%       | 100%
vpx-e                      | 6     | VPx encoders       | 78K          | 0.63     | 1.53     | 96%      | 90%       | 100%
sphinx2                    | 4     | Speech recognition | 83K          | 0.42     | 1.09     | 94%      | 91%       | 99%
putty                      | 4     | SSH utilities      | 97K          | 0.51     | 1.05     | 94%      | 86%       | 99%
Retypd, as reported        |       |                    |              | 0.54     | 1.20     | 95%      | 88%       | 98%
Retypd, without clustering |       |                    |              | 0.53     | 1.22     | 97%      | 84%       | 97%

Figure 10. Clusters in the benchmark suite. For each metric, the average over the cluster is given. If a cluster average is worse than Retypd's overall average for a certain metric, a box is drawn around the entry.
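The clustering procedure of §6.2 is easy to restate as code. A small illustration with made-up scores shows how averaging cluster means differs from pooling every binary (compare the last two rows of Figure 10):

```python
def clustered_mean(clusters):
    """clusters: mapping from cluster name to a list of per-binary scores.
    Each cluster contributes one data point (its mean) to the overall mean,
    so a large homogeneous cluster cannot dominate the result."""
    cluster_means = [sum(v) / len(v) for v in clusters.values()]
    return sum(cluster_means) / len(cluster_means)

# Hypothetical conservativeness scores: one large, well-behaved cluster
# and one small, harder cluster.
scores = {"coreutils": [0.98] * 107, "putty": [0.94] * 4}
pooled = (sum(sum(v) for v in scores.values())
          / sum(len(v) for v in scores.values()))
assert round(clustered_mean(scores), 2) == 0.96   # mean of cluster means
assert round(pooled, 2) == 0.98                   # pooled mean: coreutils dominates
```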

[Figure 11 omitted: log-log scatter plot of type-inference time in seconds versus program size in CFG nodes.]
Figure 11. Type-inference time on benchmarks. The line indicates the best-fit exponential t = 0.000725 · N^1.098, demonstrating slightly superlinear real-world scaling behavior. The coefficient of determination is R² = 0.977.

[Figure 12 omitted: log-log scatter plot of type-inference memory usage versus program size in CFG nodes.]
Figure 12. Type-inference memory usage on benchmarks. The line indicates the best-fit exponential m = 0.037 · N^0.846, with coefficient of determination R² = 0.959.
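The fitted curves in Figures 11 and 12 are power laws T = αN^β. As a self-contained illustration (not the authors' fitting code), the closed-form least-squares fit in log-log space recovers such an exponent exactly on noise-free synthetic data; as the Note in §6.6 points out, this variant minimizes error in log T rather than in T, which is why the paper instead fits numerically in (N, T) space.

```python
import math

def loglog_fit(ns, ts):
    """Least-squares fit of log t = log a + b * log n, i.e. t = a * n**b.
    This minimizes error in log t, not in t itself."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(t) for t in ts]
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b

# Synthetic, noise-free timing data following the reported model shape.
ns = [1e3, 1e4, 1e5, 1e6]
ts = [0.000725 * n ** 1.098 for n in ns]
a, b = loglog_fit(ns, ts)
assert abs(b - 1.098) < 1e-6 and abs(a - 0.000725) < 1e-6
```

On real, noisy data the two fitting spaces give different parameters, matching the discrepancy the Note reports between the numerical and log-log models.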

In practice, Figure 11 suggests that Retypd gives nearly models then minimize the error in the predicted values linear performance over the benchmark suite, which ranges of T and m, rather than minimizing errors in log T or in size from 2K to 840K instructions. To measure Retypd’s log m. Linear regression in log-log space results in the less- performance, we used numerical regression to find the best-fit predicative models T = 0.0003 N 1.14 (R2 = 0.92) and · model T = αN β relating execution time T to program size m = 0.08 N 0.77 (R2 = 0.88). · N. This results in the relation T = 0.000725 N 1.098 with co- The constraint-simplification workload of Retypd would · efficient of determination R2 = 0.977, suggesting that nearly be straightforward to parallelize over the directed acyclic 98% of the variation in performance data can be explained by graph of strongly-connected components in the callgraph, fur- this model. In other words, on real-world programs Retypd ther reducing the scaling constant. If a reduction in memory demonstrates nearly linear scaling of execution time. The cu- usage is required, Retypd could swap constraint sets to disk; bic worst-case per-procedure behavior of the constraint solver the current implementation keeps all working sets in RAM. does not translate to cubic behavior overall. Similarly, we found that the sub-linear model m = 0.037 N 0.846 explains 7. Related Work · 96% of the memory usage in Retypd. Machine-code type recovery: Hex-Ray’s reverse engineer- Note. The regressions above were performed by numerically ing tool IdaPro [15] is an early example of type reconstruction fitting exponential models in (N,T ) and (N, m) space, rather via static analysis. The exact algorithm is proprietary, but it than analytically fitting linear models in log-log space. 
Our appears that IdaPro propagates types through unification from library functions of known signature, halting the propagation Related type systems: The type system used by Retypd is when a type conflict appears. IdaPro’s reconstructed IR is related to the recursively constrained types (rc types) of relatively sparse, so the type propagation fails to produce Eifrig, Smith, and Trifonov [9]. Retypd generalizes the rc useful information in many common cases, falling back to type system by building up all types using flexible records; the default int type. However, the analysis is very fast. even the function-type constructor , taken as fundamental → SecondWrite [10] is an interesting approach to static in the rc type system, is decomposed into a record with in IR reconstruction with a particular emphasis on scalability. and out fields. This allows Retypd to operate without the The authors combine a best-effort VSA variant for points- knowledge of a fixed signature from which type constructors to analysis with a unification-based type-inference engine. are drawn, which is essential for analysis of stripped machine Accurate types in SecondWrite depend on high-quality points- code. to data; the authors note that this can cause type accuracy The use of CFL reachability to perform polymorphic sub- to suffer on larger programs. In contrast, Retypd is not typing first appeared in Rehof and Fähndrich [26], extending dependent on points-to data for type recovery and makes use previous work relating simpler type systems to graph reacha- of subtyping rather than unification for increased precision. bility [2, 23]. Retypd continues by adding type-safe handling TIE [17] is a static type-reconstruction tool used as part of pointers and a simplification algorithm that allows us to of Carnegie Mellon University’s binary-analysis platform compactly represent the type scheme for each function. (BAP). 
TIE was the first machine-code type-inference system to track subtype constraints and explicitly maintain upper and lower bounds on each type variable. As an abstraction of the C type system, TIE's type lattice is relatively simple; missing features, such as recursive types, were later identified by the authors as an important target for future research [30].

HOWARD [32] and REWARDS [19] both take a dynamic approach, generating type constraints from execution traces. Through a comparison with HOWARD, the creators of TIE showed that static type analysis can produce higher-precision types than dynamic type analysis, though a small penalty must be paid in conservativeness of constraint-set generation. TIE also showed that type systems designed for static analysis can be easily modified to work on dynamic traces; we expect the same is true for Retypd, though we have not yet performed these experiments.

Most previous work on machine-code type recovery, including TIE and SecondWrite, either disallows recursive types or only supports recursive types by combining type-inference results with a points-to oracle. For example, to infer that x has the type struct S { struct S *, ... } * in a unification-based approach like SecondWrite, first we must have resolved that x points to some memory region M, that M admits a 4-byte abstract location α at offset 0, and that the type of α should be unified with the type of x. If pointer analysis has failed to compute an explicit memory region pointed to by x, it will not be possible to determine the type of x correctly. The complex interplay between type inference, points-to analysis, and abstract-location delineation leads to a relatively fragile method for inferring recursive types. In contrast, our type system can infer recursive types even when points-to facts are completely absent.

Robbins et al. [28] developed an SMT solver equipped with a theory of rational trees and applied it to type reconstruction. Although this allows for recursive types, the lack of subtyping and the performance of the SMT solver make it difficult to scale this approach to real-world binaries. Except for test cases on the order of 500 instructions, precision of the recovered types was not assessed.

CFL reachability has also been used to extend the type system of Java [14] and C++ [11] with support for additional type qualifiers. Our reconstructed const annotations can be seen as an instance of this idea, although our qualifier inference is not separated from type inference.

To the best of our knowledge, no prior work has applied polymorphic type systems with subtyping to machine code.

8. Future Work

One interesting avenue for future research could come from the application of dependent and higher-rank type systems to machine-code type inference, although we rapidly approach the frontier where type inference is undecidable. A natural example of dependent types appearing in machine code is malloc, which could be typed as malloc : (n : size_t) → ⊤_n, where ⊤_n denotes the common supertype of all n-byte types. The key feature is that the value of a parameter determines the type of the result.

Higher-rank types are needed to properly model functions that accept pointers to polymorphic functions as parameters. Such functions are not entirely uncommon; for example, any function that is parameterized by a custom polymorphic allocator will have rank ≥ 2.

Retypd was implemented as an inference phase that runs after CodeSurfer's main analysis loop. We expect that by moving Retypd into CodeSurfer's analysis loop, there will be an opportunity for interesting interactions between IR generation and type reconstruction.

9. Conclusion

By examining a diverse corpus of optimized binaries, we have identified a number of common idioms that are stumbling blocks for machine-code type inference. For each of these idioms, we identified a type-system feature that could enable the difficult code to be properly typed. We gathered these features into a type system and implemented the inference algorithm in the tool Retypd. Despite removing the requirement for points-to data, Retypd is able to accurately and conservatively type a wide variety of real-world binaries. We assert that Retypd demonstrates the utility of high-level type systems for reverse engineering and binary analysis.

Acknowledgments

The authors would like to thank Vineeth Kashyap and the anonymous reviewers for their many useful comments on this manuscript, and John Phillips, David Ciarletta, and Tim Clark for their help with test automation.

References

[1] ISO/IEC TR 19768:2007: Technical report on C++ library extensions, 2007.
[2] O. Agesen. Constraint-based type inference and parametric polymorphism. In Static Analysis Symposium, pages 78–100, 1994.
[3] R. M. Amadio and L. Cardelli. Subtyping recursive types. ACM Transactions on Programming Languages and Systems, 15(4):575–631, 1993.
[4] L. O. Andersen. Program analysis and specialization for the C programming language. PhD thesis, University of Copenhagen, 1994.
[5] G. Balakrishnan, R. Gruian, T. Reps, and T. Teitelbaum. CodeSurfer/x86 — a platform for analyzing x86 executables. In Compiler Construction, pages 250–254, 2005.
[6] G. Balakrishnan and T. Reps. Analyzing memory accesses in x86 executables. In Compiler Construction, pages 5–23, 2004.
[7] A. Carayol and M. Hague. Saturation algorithms for model-checking pushdown systems. In International Conference on Automata and Formal Languages, pages 1–24, 2014.
[8] D. Caucal. On the regular structure of prefix rewriting. Theoretical Computer Science, 106(1):61–86, 1992.
[9] J. Eifrig, S. Smith, and V. Trifonov. Sound polymorphic type inference for objects. In Object-Oriented Programming, Systems, Languages, and Applications, pages 169–184, 1995.
[10] K. ElWazeer, K. Anand, A. Kotha, M. Smithson, and R. Barua. Scalable variable and data type detection in a binary rewriter. In Programming Language Design and Implementation, pages 51–60, 2013.
[11] J. S. Foster, R. Johnson, J. Kodumal, and A. Aiken. Flow-insensitive type qualifiers. ACM Transactions on Programming Languages and Systems, 28(6):1035–1087, 2006.
[12] M. Fähndrich and A. Aiken. Making set-constraint program analyses scale. In Workshop on Set Constraints, 1996.
[13] D. Gopan, E. Driscoll, D. Nguyen, D. Naydich, A. Loginov, and D. Melski. Data-delineation in software binaries and its application to buffer-overrun discovery. In International Conference on Software Engineering, pages 145–155, 2015.
[14] D. Greenfieldboyce and J. S. Foster. Type qualifier inference for Java. In Object-Oriented Programming, Systems, Languages, and Applications, pages 321–336, 2007.
[15] Hex-Rays. Hex-Rays IdaPro. http://www.hex-rays.com/products/ida/, 2015.
[16] D. Kozen, J. Palsberg, and M. I. Schwartzbach. Efficient recursive subtyping. Mathematical Structures in Computer Science, 5(1):113–125, 1995.
[17] J. Lee, T. Avgerinos, and D. Brumley. TIE: Principled reverse engineering of types in binary programs. In Network and Distributed System Security Symposium, pages 251–268, 2011.
[18] J. Lim and T. Reps. TSL: A system for generating abstract interpreters and its application to machine-code analysis. ACM Transactions on Programming Languages and Systems, 35(1):4, 2013.
[19] Z. Lin, X. Zhang, and D. Xu. Automatic reverse engineering of data structures from binary execution. In Network and Distributed System Security Symposium, 2010.
[20] S. Marlow, A. R. Yakushev, and S. Peyton Jones. Faster laziness using dynamic pointer tagging. In International Conference on Functional Programming, pages 277–288, 2007.
[21] L. Mauborgne and X. Rival. Trace partitioning in abstract interpretation based static analyzers. In Programming Languages and Systems, pages 5–20, 2005.
[22] M. Noonan, A. Loginov, and D. Cok. Polymorphic type inference for machine code (extended version). URL http://arxiv.org/abs/1603.05495.
[23] J. Palsberg and P. O'Keefe. A type system equivalent to flow analysis. ACM Transactions on Programming Languages and Systems, 17(4):576–599, 1995.
[24] J. Palsberg, M. Wand, and P. O'Keefe. Type inference with non-structural subtyping. Formal Aspects of Computing, 9(1):49–67, 1997.
[25] F. Pottier and D. Rémy. The essence of ML type inference. In B. C. Pierce, editor, Advanced Topics in Types and Programming Languages, chapter 10. MIT Press, 2005.
[26] J. Rehof and M. Fähndrich. Type-based flow analysis: From polymorphic subtyping to CFL-reachability. In Principles of Programming Languages, pages 54–66, 2001.
[27] J. Richard Büchi. Regular canonical systems. Archive for Mathematical Logic, 6(3):91–111, 1964.
[28] E. Robbins, J. M. Howe, and A. King. Theory propagation and rational-trees. In Principles and Practice of Declarative Programming, pages 193–204, 2013.
[29] M. Robertson. A Brief History of InvSqrt. PhD thesis, University of New Brunswick, 2012.
[30] E. J. Schwartz, J. Lee, M. Woo, and D. Brumley. Native x86 decompilation using semantics-preserving structural analysis and iterative control-flow structuring. In USENIX Security Symposium, pages 353–368, 2013.
[31] M. Siff, S. Chandra, T. Ball, K. Kunchithapadam, and T. Reps. Coping with type casts in C. In Software Engineering—ESEC/FSE'99, pages 180–198, 1999.
[32] A. Slowinska, T. Stancescu, and H. Bos. Howard: A dynamic excavator for reverse engineering data structures. In Network and Distributed System Security Symposium, 2011.
[33] B. Steensgaard. Points-to analysis in almost linear time. In Principles of Programming Languages, pages 32–41, 1996.
[34] Z. Su, A. Aiken, J. Niehren, T. Priesnitz, and R. Treinen. The first-order theory of subtyping constraints. In Principles of Programming Languages, pages 203–216, 2002.

A. Constraint Generation

Type constraint generation is performed by a parameterized abstract interpreter TYPE_A; the parameter A is itself an abstract interpreter that is used to transmit additional analysis information such as reaching definitions, propagated constants, and value-sets (when available).

Let V denote the set of type variables and C the set of type constraints. Then the primitive TSL value- and map-types for TYPE_A are given by

    BASETYPE_{TYPE_A}  = BASETYPE_A × 2^V × 2^C
    MAP_{TYPE_A}[α, β] = MAP_A[α, β] × 2^C

Since type constraint generation is a syntactic, flow-insensitive process, we can regain flow sensitivity by pairing with an abstract semantics that carries a summary of flow-sensitive information. Parameterizing the type abstract interpretation by A allows us to factor out the particular way in which program variables should be abstracted to types (e.g. SSA form, reaching definitions, and so on).

A.1 Register Loads and Stores

The basic reinterpretations proceed by pairing with the abstract interpreter A. For example,

    regUpdate(s, reg, v) =
      let (v', t, c) = v
          (s', m)    = s
          s''        = regUpdate(s', reg, v')
          (u, c')    = A(reg, s'')
      in  ( s'', m ∪ c ∪ { t ⊑ u } )

where A(reg, s'') produces a type variable from the register reg and the A-abstracted register map s''. Register loads are handled similarly:

    regAccess(reg, s) =
      let (s', c)  = s
          (t, c')  = A(reg, s')
      in  ( regAccess(reg, s'), t, c ∪ c' )

For instance, an interpreter A that is aware of register reaching definitions may define A(reg, s) by

    A(reg, s) =
      case reaching-defs(reg, s) of
        { p }  → (reg_p, {})
        defs   → let t = fresh
                     c = { reg_p ⊑ t | p ∈ defs }
                 in (t, c)

where reaching-defs yields the set of definitions of reg that are visible from state s. Then TYPE_A at program point q (e.g. at an instruction mov ebx, eax) will update the constraint set to C ∪ { eax_p ⊑ ebx_q } if p is the lone reaching definition of EAX. If there are multiple reaching definitions P, then the constraint set will become

    C ∪ { eax_p ⊑ t | p ∈ P } ∪ { t ⊑ ebx_q }

A.2 Addition and Subtraction

It is useful to track translations of a value through addition or subtraction of a constant. To that end, we overload the add(x,y) and sub(x,y) operations in the cases where x or y have statically-determined constant values. For example, if INT32(n) is a concrete numeric value then

    add(v, INT32(n)) =
      let (v', t, c) = v
      in  ( add(v', INT32(n)), t.+n, c )

In the case where neither operand is a statically-determined constant, we generate a fresh type variable representing the result and a 3-place constraint on the type variables:

    add(x, y) =
      let (x', t1, c1) = x
          (y', t2, c2) = y
          t = fresh
      in  ( add(x', y'), t, c1 ∪ c2 ∪ { Add(t1, t2, t) } )
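The two cases of add above can be mimicked in a few lines. The sketch below is our own illustrative encoding (strings for type variables, tuples for constraints), not Retypd's TSL implementation; it shows how a constant offset stays attached to the type variable (t.+n) while a variable offset introduces a 3-place Add constraint:

```python
# Sketch of the two `add` reinterpretations above. A "typed value" is a
# (type_variable, constraints) pair; the encoding is illustrative only.

fresh_counter = 0

def fresh():
    global fresh_counter
    fresh_counter += 1
    return f"t{fresh_counter}"

def add(x, y):
    tx, cx = x
    # Constant operand: track the translation on the type variable (t.+n),
    # generating no new constraints.
    if isinstance(y, int):
        return (f"{tx}.+{y}", cx)
    # Neither operand constant: fresh result variable plus a 3-place
    # Add(t1, t2, t) constraint.
    ty, cy = y
    t = fresh()
    return (t, cx | cy | {("Add", tx, ty, t)})

print(add(("eax", set()), 4))             # constant case: ('eax.+4', set())
t, c = add(("eax", set()), ("ebx", set()))
print(t, c)                               # fresh variable + Add constraint
```

The same shape would apply to sub(x, y), with subtraction of the constant offset.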

Similar interpretations are used for sub(x, y).

Example A.1. Suppose that A represents the concrete semantics for x86 and A(reg, ·) yields the type variable reg and no additional constraints, i.e. A(reg, ·) = (reg, {}). Then the x86 instruction mov ebx, eax is represented by the TSL expression regUpdate(S, EBX(), regAccess(EAX(), S)), where S is the initial state (S_conc, C). After abstract interpretation, C will become C ∪ { eax ⊑ ebx }.

By changing the parametric interpreter A, the generated type constraints may be made more precise.

Example A.2. We continue with the example of mov ebx, eax above. Suppose that A represents an abstract semantics that is aware of register reaching definitions, with A(reg, s) defined as in §A.1. The constraints generated at program point q then mention the tagged variables eax_p for the reaching definitions p of EAX, rather than the bare variable eax.

A.3 Memory Loads and Stores

Memory accesses are treated similarly to register accesses, except for the use of dereference accesses and the handling of points-to sets. For any abstract A-value a and A-state s, let A(a, s) denote a set of type variables representing the address a in the context s. Furthermore, define PtsTo_A(a, s) to be a set of type variables representing the values pointed to by a in the context s.

The semantics of the N-bit load and store functions memAccessN and memUpdateN are given by

    memAccessN(s, a) =
      let (s0, cs)    = s
          (a0, t, ct) = a
          cpt = { x ⊑ t.load.σN@0 | x ∈ PtsTo_A(a0, s0) }
      in  ( memAccessN(s0, a0),
            t.load.σN@0,
            cs ∪ ct ∪ cpt )

    memUpdateN(s, a, v) =
      let (s0, cs)     = s
          (a0, t, ct)  = a
          (v0, tv, cv) = v
          cpt = { t.store.σN@0 ⊑ x | x ∈ PtsTo_A(a0, s0) }
      in  ( memUpdateN(s0, a0, v0),
            cs ∪ ct ∪ cv ∪ cpt ∪ { tv ⊑ t.store.σN@0 } )

We achieved acceptable results by using a bare-minimum points-to analysis that only tracks constant pointers to the local activation record or the data section. The use of the .load/.store accessors allows us to track multi-level pointer information without the need for explicit points-to data. The minimal approach tracks just enough points-to information to resolve references to local and global variables.

A.4 Procedure Invocation

Earlier analysis phases are responsible for delineating procedures and gathering data about each procedure's formal-in and formal-out variables, including information about how parameters are stored on the stack or in registers. This data is transformed into a collection of locators associated to each function. Each locator is bound to a type variable representing the formal; the locator is responsible for finding an appropriate set of type variables representing the actual at a callsite, or the corresponding local within the procedure itself.

Example A.3. Consider this simple program that invokes a 32-bit identity function.

    p:  push ebx              ; writes to local ext4
    q:  call id
        ...
    id:                       ; begin procedure id()
    r:  mov eax, [esp+arg0]
        ret

The procedure id will have two locators:

• A locator Li for the single parameter, bound to a type variable id_i.
• A locator Lo for the single return value, bound to a type variable id_o.

At the procedure call site, the locator Li will return the type variable ext4_p representing the stack location ext4 tagged by its reaching definition. Likewise, Lo will return the type variable eax_q to indicate that the actual-out is held in the version of eax that is defined at point q. The locator results are combined with the locator's type variables, resulting in the constraint set

    { ext4_p ⊑ id_i, id_o ⊑ eax_q }

Within procedure id, the locator Li returns the type variable arg0 and Lo returns eax_r, resulting in the constraint set

    { id_i ⊑ arg0, eax_r ⊑ id_o }

A procedure may also be associated with a set of type constraints between the locator type variables, called the procedure summary; these type constraints may be inserted at function calls to model the known behavior of a function. For example, invocation of fopen will result in the constraints

    { fopen_i0 ⊑ char*, fopen_i1 ⊑ char*, FILE* ⊑ fopen_o }

To support polymorphic function invocation, we instantiate fresh versions of the locator type variables that are tagged with the current callsite; this prevents type variables from multiple invocations of the same procedure from being linked.

Example A.4 (cont'd). When using callsite tagging, the callsite constraints generated by the locators would be

    { ext4_p ⊑ id_i^q, id_o^q ⊑ eax_q }

The callsite tagging must also be applied to any procedure summary. For example, a call to malloc at point p will result in the constraints

    { malloc_i^p ⊑ size_t, void* ⊑ malloc_o^p }

If malloc is used twice within a single procedure, we see an effect like let-polymorphism: each use will be typed independently.

A.5 Other Operations

A.5.1 Floating-point

Floating-point types are produced by calls to known library functions and through an abstract interpretation of reads from and writes to the floating-point register bank. We do not track register-to-register moves between floating-point registers, though it would be straightforward to add this ability. In theory, this causes us to lose precision when attempting to distinguish between typedefs of floating-point values; in practice, such typedefs appear to be extremely rare.

A.5.2 Bit Manipulation

We assume that the operands and results of most bit-manipulation operations are integral, with some special exceptions:

• Common idioms like xor reg,reg and or reg,-1 are used to initialize registers to certain constants. On x86 these instructions can be encoded with 8-bit immediates, saving space relative to the equivalent versions mov reg,0 and mov reg,-1. We do not assume that the results of these operations are of integral type.

• We discard any constraints generated while computing a value that is only used to update a flag status. In particular, on x86 the operation test reg1,reg2 is implemented like a bitwise AND that discards its result, only retaining the effect on the flags.

• For specific operations such as y := x AND 0xfffffffc and y := x OR 1, we act as if they were equivalent to y := x. This is because these specific operations are often used for bit-stealing; for example, requiring pointers to be aligned on 4-byte boundaries frees the lower two bits of a pointer for other purposes, such as marking for garbage collection.

A.6 Additive Constraints

The special constraints ADD and SUB are used to conditionally propagate information about which type variables represent pointers and which represent integers when the variables are related through addition or subtraction. The deduction rules for additive constraints are summarized in Figure 13. We obtained good results by inspecting the unification graph used for computing S_{Li}; the graph can be used to quickly determine whether a variable has pointer- or integer-like capabilities. In practice, the constraint set C also should be updated with new subtype constraints as the additive constraints are applied, and a fully applied constraint can be dropped from C. We omit these details for simplicity.

B. Normal Forms of Proofs

A finite constraint set with recursive constraints can have an infinite entailment closure. In order to manipulate entailment closures efficiently, we need finite (and small) representations of infinite constraint sets. The first step towards achieving a finite representation of entailment is to find a normal form for every derivation. In Appendix D, finite models of entailment closure will be constructed that manipulate representations of these normal forms.

Lemma B.1. For any statement P provable from C, there exists a derivation of C ⊢ P that does not use the rules S-REFL or T-INHERIT.

Proof. The redundancy of S-REFL is immediate. As for T-INHERIT, any use of that rule in a proof can be replaced by S-FIELD⊕ followed by T-LEFT, or S-FIELD⊖ followed by T-RIGHT.

Proof of Lemma B.2 (stated below, with its closure hypotheses). By the previous lemma, we may assume that S-REFL and T-INHERIT are not used in the proof. We will prove the lemma by transforming all subproofs that end in a use of T-PREFIX to remove that use. To that end, assume we have a proof ending with the derivation of VAR α from VAR α.ℓ using T-PREFIX. We then enumerate the ways that VAR α.ℓ may have been proven.

• If VAR α.ℓ ∈ C then VAR α ∈ C already, so the entire proof tree leading to VAR α may be replaced with an axiom from C.

• The only other way a VAR α.ℓ could be introduced is through T-LEFT or T-RIGHT. For simplicity, let us only consider the T-LEFT case; T-RIGHT is similar. At this point, we have a derivation tree that looks like

        α.ℓ ⊑ ϕ
        ─────── (T-LEFT)
        VAR α.ℓ
        ─────── (T-PREFIX)
        VAR α

  How was α.ℓ ⊑ ϕ introduced? If it were an axiom of C then our assumed closure property would already imply that VAR α ∈ C. The other cases are:

  – ϕ = β.ℓ and α.ℓ ⊑ ϕ was introduced through one of the S-FIELD axioms. In either case, α is the left- or right-hand side of one of the antecedents to S-FIELD, so the whole subderivation including the T-PREFIX axiom can be replaced with a single use of T-LEFT or T-RIGHT.

  – ℓ = .store and ϕ = α.load, with α.ℓ ⊑ ϕ introduced via S-POINTER. In this case, VAR α.ℓ is already an antecedent to S-POINTER, so we may elide the subproof up to T-PREFIX to inductively reduce the problem to a simpler proof tree.

  – Finally, we may have α.ℓ ⊑ ϕ due to S-TRANS. But in this case, the left antecedent is of the form α.ℓ ⊑ β; so once again, we may elide a subproof to get a simpler proof tree.

Each of these cases either removes an instance of T-PREFIX or results in a strictly smaller proof tree. It follows that iterated application of these simplification rules results in a proof of VAR α with no remaining instances of T-PREFIX.
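The field rules manipulated in these proofs — covariant .load, contravariant .store, and transitivity — can be exercised with a toy closure over derived type variables. The encoding below (tuples of strings) is our own miniature sketch, not Retypd's solver:

```python
# Toy closure for S-FIELD and S-TRANS: .load is covariant, .store is
# contravariant. A constraint (a, b) means a is a subtype of b.

VARIANCE = {"load": "+", "store": "-"}

def s_field(constraints, labels=("load", "store")):
    """Apply S-FIELD once to every constraint x <= y: covariant labels
    give x.l <= y.l, contravariant labels give y.l <= x.l."""
    out = set(constraints)
    for (x, y) in constraints:
        for l in labels:
            if VARIANCE[l] == "+":
                out.add((f"{x}.{l}", f"{y}.{l}"))
            else:
                out.add((f"{y}.{l}", f"{x}.{l}"))
    return out

def s_trans(constraints):
    """Transitive closure (S-TRANS), by naive iteration to fixpoint."""
    out = set(constraints)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(out):
            for (b2, c) in list(out):
                if b == b2 and (a, c) not in out:
                    out.add((a, c))
                    changed = True
    return out

cs = s_trans(s_field({("x", "y"), ("y", "z")}))
assert ("x.load", "z.load") in cs      # covariance composes
assert ("z.store", "x.store") in cs    # contravariance flips the direction
```

Even this tiny example shows why the closure of a recursive constraint set is infinite in general: S-FIELD can be reapplied to its own output, which is what the normal forms above are designed to tame.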

In the following, we make the simplifying assumption that C is closed under T-LEFT, T-RIGHT, and T-PREFIX. Concretely, we are requiring that

1. if C contains a subtype constraint α ⊑ β as an axiom, it also contains the axioms VAR α and VAR β;
2. if C contains a term declaration VAR α, it also contains term declarations for all prefixes of α.

Lemma B.2. If C is closed under T-LEFT, T-RIGHT, and T-PREFIX, then any statement provable from C can be proven without use of T-PREFIX.

Corollary B.1. If C ⊢ VAR α, then either VAR α ∈ C, or C ⊢ α ⊑ β or C ⊢ β ⊑ α for some β.

To simplify the statement of our normal-form theorem, recall that we defined the variadic rule S-TRANS' to remove the degrees of freedom from regrouping repeated application of S-TRANS:

    α1 ⊑ α2    α2 ⊑ α3    ⋯    α(n-1) ⊑ αn
    ─────────────────────────────────────── (S-TRANS')
                   α1 ⊑ αn

Figure 13. Inference rules for ADD(X, Y; Z) and SUB(X, Y; Z). Lower-case letters denote known integer (i) or pointer (p) types; upper-case letters denote inferred types. For example, the first ADD column says that if X and Y are integral types in an ADD(X, Y; Z) constraint, then Z is integral as well.

           ADD                  SUB
    X │ i  I  p  P  I  i │ i  I  P  P  p  p  p
    Y │ i  I  I  i  p  P │ I  i  i  p  P  i  I
    Z │ I  i  P  p  P  p │ I  i  p  I  i  P  p
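The table in Figure 13 can be read as a small forward propagation over known int/pointer tags. The sketch below is our own reading of a few of the ADD columns (the function name and `None`-for-unknown encoding are ours, not the paper's):

```python
# Forward propagation for ADD(X, Y; Z) over known "int"/"ptr" tags,
# mirroring several columns of Figure 13. `None` means unknown.

def propagate_add(x, y, z):
    """Return updated (x, y, z) after one round of ADD inference."""
    if x == "int" and y == "int":
        z = "int"                      # column (i, i; I): int + int = int
    if z == "int":
        x, y = "int", "int"            # column (I, I; i): int result => int operands
    if x == "ptr":
        y, z = "int", "ptr"            # pointer arithmetic: p + I = P
    if y == "ptr":
        x, z = "int", "ptr"
    if x == "int" and z == "ptr":
        y = "ptr"                      # column (i, P; p): the pointer must be Y
    if y == "int" and z == "ptr":
        x = "ptr"
    return x, y, z

print(propagate_add("ptr", None, None))   # pointer operand forces the rest
```

A real solver would run such rules to a fixpoint over the whole constraint set, alongside the subtype constraints, as the text describes.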

Theorem B.1 (Normal form of proof trees). Let be a We also can eliminate the need for the subderivations Q C constraint set that is closed under T-LEFT, T-RIGHT, and and S except for the very first and very last antecedents of T-PREFIX. Then any statement provable from has a deriva- S-TRANS’ by pulling the term out of the neighboring subtype C tion such that relation; schematically, we can always replace

• There are no uses of the rules T-PREFIX, T-INHERIT, or P S S-REFL. α β VAR β.` U • For every instance of S-TRANS’ and every pair of adja- v ··· α.` β.` β.` γ ··· cent antecedents, at least one is not the result of a S-FIELD v v application. T

Proof. The previous lemmas already handled the first point. by As to the second, suppose that we had two adjacent an- U tecedents that were the result of S-FIELD : P β.` γ ⊕ v P Q R S α β VAR β.` U v α β VAR β.` β γ VAR γ.` ··· α.` β.` β.` γ ··· v v v v ··· α.` β.` β.` γ.` ··· T v v T This transformation even works if there are several S-FIELD First, note that the neighboring applications of S-FIELD applications in a row, as in ⊕ must both use the same field label `, or else S-TRANS would not be applicable. But now observe that we may move a P Q S-FIELD application upwards, combining the two S-FIELD α β VAR β.` R v applications into one: α.` β.` VAR β.`.`0 U v P R ··· α.`.`0 β.`.`0 β.`.`0 γ ··· v v α β β γ S T v v α γ VAR γ.` v which can be simplified to ··· α.` γ.` ··· v T U β.`.`0 γ For completeness, we also compute the simplifying transfor- v mation for adjacent uses of S-FIELD : the derivation P VAR β.`.`0 U

α β VAR β.` β.`.`0 γ P Q R S v v α.` β.` AR β.`.` U β α VAR α.` γ β VAR β.` V 0 v v v ··· α.` β.` β.` γ.` ··· ··· α.`.`0 β.`.`0 β.`.`0 γ ··· v v v v T T can be simplified to Taken together, this demonstrates that the VAR β.` an- R P tecedents to S-FIELD are automatically satisfied if the con- sequent is eventually used in an S-TRANS application where Q γ β β α v v β.` is a “middle” variable. This remains true even if there are VAR α.` γ α several S-FIELD applications in sequence. v ··· α.` γ.` ··· v T Since the S-FIELD VAR antecedents are automatically To simplify the presentation, when u = u u we will 1 ··· n satisfiable, we can use a simplified schematic depiction of sometimes write proof trees that omits the intermediate VAR subtrees. In this pop u = pop u pop u simplified depiction, the previous proof tree fragment would 1 ⊗ · · · ⊗ n be written as and P push u = push u push u n ⊗ · · · ⊗ 1 α β v Definition C.2. A monomial in StackOpΣ is called reduced α.` β.` U if its length cannot be shortened through applications of the v ··· α.`.`0 β.`.`0 β.`.`0 γ ··· push x pop y = δ(x, y) rule. Every reduced monomial has v v ⊗ T the form pop u pop u push v push v Any such fragment can be automatically converted to a full 1 ··· n ⊗ m ··· 1 proof by re-generating VAR subderivations from the results for some u , v Σ. of the input derivations P and/or U. i j ∈ Elements of StackOp can be understood to denote B.1 Algebraic Representation Σ possibly-failing functions operating on a stack of symbols Finally, consider the form of the simplified proof tree that from Σ: elides the VAR subderivations. Each leaf of the tree is a subtype constraint c , and there is a unique way to [[0]] = λs . fail i ∈ C fill in necessary VAR antecedents and S-FIELD applications [[1]] = λs . s to glue the constraints into a proof tree in normal form. {ci} [[push x]] = λs . 
cons(x, s) In other words, the normal form proof tree is completely determined by the sequence of leaf constraints c and the [[pop x]] = λs . case s of { i} sequence of S-FIELD applications applied to the final tree. In nil fail → effect, we have a term algebra with constants Ci representing cons(x0, s0) if x = x0 then s0 the constraints c , an associative binary operator that → i ∈ C combines two compatible constraints via S-TRANS, inserting else fail appropriate S-FIELD applications, and a unary operator S ` Under this interpretation, the semiring operations are inter- for each ` Σ that represents an application of S-FIELD. ∈ preted as The proof tree manipulations provide additional relations on this term algebra, such as [[X Y ]] = nondeterministic choice of [[X]] or [[Y ]] ⊕ ( [[X Y ]] = [[Y ]] [[X]] S`(R1 R2) when ` = S`(R1) S`(R2) = h i ⊕ ⊗ ◦ S`(R R ) when ` = [[X∗]] = nondeterministic iteration of [[X]] 2 1 h i The normalization rules described in this section demonstrate Note that the action of a pushdown rule R = P ; u , h i → that every proof of a subtype constraint entailed by can be Q; v on a stack configuration c = (X, w) is given by the C h i represented by a term of the form application

S` (S` ( S` (R1 Rk) )) [[pop P pop u push v push Q]](cons(X, w)) 1 2 ··· n · · · ··· ⊗ ⊗ ⊗

C. The StackOp Weight Domain The result will either be cons(X0, w0) where (X0, w0) is the configuration obtained by applying rule R to configuration c, In this section, we develop a weight domain StackOpΣ that will be useful for modeling constraint sets by pushdown sys- or fail if the rule cannot be applied to c. tems. This weight domain is generated by symbolic constants Lemma C.1. There is a one-to-one correspondence between representing actions on a stack. elements of StackOpΣ and regular sets of pushdown system rules over Σ. Definition C.1. The weight domain StackOpΣ is the idem- potent -semiring generated by the symbols pop x, push x Definition C.3. The variance operator can be extended to ∗ h·i for all x Σ, subject only to the relation StackOp by defining ∈ Σ ( 1 if x = y pop x = push x = x push x pop y = δ(x, y) = h i h i h i ⊗ 0 if x = y 6 D. Constructing the Transducer for a The transition rules ∆ are partitioned into four parts ∆ = Constraint Set ∆ ∆ptr ∆start ∆end where C q q q D.1 Building the Pushdown System [ ∆ = rules(c) C Let be a set of type variables and Σ a set of field labels, c V ∈C equipped with a variance operator . Furthermore, suppose [ h·i ∆ptr = rules(v.store v.load) that we have fixed a set of constraints over . Finally, v v 0 C V ∈V suppose that we can partition into a set of interesting and  # V ∆ = START; v⊕ , v⊕; ε v uninteresting variables: start h i → h L i | ∈ Vi  # START; v , v ; ε v ∪ h i → h L i | ∈ Vi = i u  # V V q V ∆ = v⊕; ε , END; v⊕ v end h R i → h i | ∈ Vi Definition D.1. Suppose that X,Y i are interesting type  # ∈ V vR ; ε , END; v v i variables, and there is a proof X.u Y.v. We will call ∪ h i → h i | ∈ V C ` v the proof elementary if the normal form of its proof tree only Note 1. The , superscripts on the control states are {⊕ } involves uninteresting variables on internal leaves. We write used to track the current variance of the stack state, allowing

i us to distinguish between uses of an axiom in co- and contra- V X.u Y.v C ` elem v variant position. The tagging operations lhs, rhs are used when has an elementary proof of X.u Y.v. to prevent derivations from making use of variables from C v i, preventing from admitting derivations that represent V PC Our goal is to construct a finite state transducer that non-elementary proofs. i recognizes the relation elemV i0 i0 between derived v ⊆ V × V Note 2. Note that although ∆ is finite, ∆ptr contains rules type variables defined by C for every derived type variable and is therefore infinite. We n o i i V = (X.u, Y.v) V X.u Y.v carefully adjust for this in the saturation rules below so that v elem | C ` elem v rules from ∆ptr are only considered lazily as Deriv C is P We proceed by constructing an unconstrained pushdown sys- constructed. tem whose derivations model proofs in ; a modification a b PC C Lemma D.1. For any pair (X u, Y v) in Deriv C , we also of Caucal’s saturation algorithm [8] is then used to build the b a P have (Y · v, X · u) Deriv C . P transducer representing Deriv C . ∈ P Proof. Definition D.2. Define the left- and right-hand tagging rules Immediate due to the symmetries in the construction lhs, rhs by of ∆. ( x x a b L if i Lemma D.2. For any pair (X u, Y v) in Deriv C , the lhs(x) = ∈ V P x if x o relation ∈ V a u = b v ( · h i · h i xR if x i rhs(x) = ∈ V must hold. x if x ∈ Vo Definition D.3. The unconstrained pushdown system Proof. This is an inductive consequence of how ∆ is defined: PC the relation holds for every rule in ∆ ∆ due to the use of associated to a constraint set is given by the triple (e, Σe, ∆) ptr C V C ∪ where the rules function, and the sign in the exponent is propagated from the top of the stack during an application of a rule e = (lhs( i) rhs( i) o) , from ∆start and back to the stack during an application of a V V q V q V × {⊕ } # # rule from ∆ . 
Since every derivation will begin with a rule START, END end ∪ { } from ∆start, proceed by applying rules from ∆ ∆ptr, and ∆ C ∪ We will write an element (v, ) e as v . conclude with a rule from end, this completes the proof. ∈ V The stack alphabet Σe is essentially the same as Σ, with a Definition D.4. Let Deriv0 C i0 i0 be the relation on few extra tokens added to represent interesting variables: P ⊆ V × V derived type variables induced by Deriv C as follows: P Σe = Σ v⊕ v i v v i n o ∪ { | ∈ V } ∪ { | ∈ V } u v Deriv0 C = (X.u, Y.v) (Xh iu, Y h iv) Deriv C P | ∈ P To define the transition rules, we first introduce the helper functions: Lemma D.3. If (p, q) Deriv0 C and σ Σ has σ = ∈ P ∈ h i u v , then (pσ, qσ) Deriv0 C as well. If σ = , then rule⊕(p.u q.v) = lhs(p)h i; u , rhs(q)h i; v ⊕ ∈ P h i v h i → h i (qσ, pσ) Deriv0 C instead. v u P rule (p.u q.v) = lhs(q) ·h i; v , rhs(p) ·h i; u ∈ v h i → h i rules(c) = rule⊕(c), rule (c) { } Proof. Immediate, from symmetry considerations. the rule S-FIELD`. This results in an algebraic expression describing the normalized proof tree as in § B.1, which Lemma D.4. ⊕ Deriv C can be partitioned into Deriv C P P q completes the proof.

Deriv^⊖ P_C, where

    Deriv^⊕ P_C = { (X^⟨u⟩ u, Y^⟨v⟩ v) | (X.u, Y.v) ∈ Deriv′ P_C }
    Deriv^⊖ P_C = { (Y^⟨v̄⟩ v̄, X^⟨ū⟩ ū) | (X.u, Y.v) ∈ Deriv′ P_C }

In particular, Deriv P_C can be entirely reconstructed from the simpler set Deriv′ P_C.

Theorem D.1. C has an elementary proof of the constraint X.u ⊑ Y.v if and only if (X.u, Y.v) ∈ Deriv′ P_C.

D.2 Constructing the Initial Graph

In this section, we will construct a finite state automaton A_C that accepts strings encoding some of the behavior of C; the next section describes a saturation algorithm that modifies the initial automaton to create a finite state transducer for Deriv′ P_C.

The automaton A_C is constructed as follows:

1. Add an initial state #START and an accepting state #END.

2. For each left-hand side ⟨p^a; u_1 ··· u_n⟩ of a rule in ∆_C, add the transitions

    #START --pop p^{a·⟨u_1···u_n⟩}--> p^{a·⟨u_1···u_n⟩}
    p^{a·⟨u_1···u_n⟩} --pop u_1--> p^{a·⟨u_2···u_n⟩} u_1
    ...
    p^{a·⟨u_n⟩} u_1 ... u_{n−1} --pop u_n--> p^a u_1 ... u_n

3. For each right-hand side ⟨q^b; v_1 ··· v_n⟩ of a rule in ∆_C, add the transitions

    q^b v_1 ... v_n --push v_n--> q^{b·⟨v_n⟩} v_1 ... v_{n−1}
    ...
    q^{b·⟨v_2···v_n⟩} v_1 --push v_1--> q^{b·⟨v_1···v_n⟩}
    q^{b·⟨v_1···v_n⟩} --push q^{b·⟨v_1···v_n⟩}--> #END

4. For each rule ⟨p^a; u⟩ ↪ ⟨q^b; v⟩, add the transition

    p^a u --1--> q^b v

Proof (of Theorem D.1). We will prove the theorem by constructing a bijection between elementary proof trees and derivations in P_C.

Call a configuration (p^a, u) positive if a = ⟨u⟩. By Lemma D.2, if (p^a, u) is positive and (p^a, u) →* (q^b, v), then (q^b, v) is positive as well. We will also call a transition rule positive when the configurations appearing in the rule are positive. Note that by construction, every positive transition rule from ∆_C ∪ ∆_ptr is of the form rule^⊕(c) for some axiom c ∈ C or load/store constraint c = (p.store ⊑ p.load).

For any positive transition rule R = rule^⊕(p.u ⊑ q.v) and ℓ ∈ Σ, let S_ℓ(R) denote the positive rule

    S_ℓ(R) = ⟨p^⟨uℓ⟩; uℓ⟩ ↪ ⟨q^⟨vℓ⟩; vℓ⟩  if ⟨ℓ⟩ = ⊕
             ⟨q^⟨vℓ⟩; vℓ⟩ ↪ ⟨p^⟨uℓ⟩; uℓ⟩  if ⟨ℓ⟩ = ⊖

Note that by the construction of ∆, each S_ℓ(R) is redundant, being a restriction of one of the existing transition rules rule^⊕(p.u ⊑ q.v) or rule^⊖(p.u ⊑ q.v) to a more specific stack configuration. In particular, adding S_ℓ(R) to the set of rules does not affect Deriv P_C.

Suppose we are in the positive state (p^⟨w⟩, w) and are about to apply a transition rule R = ⟨p^⟨u⟩; u⟩ ↪ ⟨q^⟨v⟩; v⟩. Then we must have w = u ℓ_1 ℓ_2 ··· ℓ_n, so that the left-hand side of the rule S_{ℓ_n}(···(S_{ℓ_1}(R))···) is exactly ⟨p^⟨w⟩; w⟩. Let R_start, R_1, ..., R_n, R_end be the sequence of rule applications used in an arbitrary derivation (#START, p^⟨u⟩ u) →* (#END, q^⟨v⟩ v), and let R · R′ denote the application of rule R followed by rule R′, with the right-hand side of R exactly matching the left-hand side of R′. Given the initial stack state p^⟨u⟩ u and the sequence of rule applications, for each R_k there is a unique rule R′_k = S_{ℓ_n}(···(S_{ℓ_1}(R_k))···) such that ∂ = R′_start · R′_1 ··· R′_n · R′_end describes the derivation exactly. Now normalize ∂ using the rule S_ℓ(R_i) · S_ℓ(R_j) ↦ S_ℓ(R_i · R_j); this process is clearly reversible, producing a bijection between these normalized expressions and derivations in Deriv P_C.

To complete the proof, we obtain a bijection between normalized expressions and the normal forms of elementary proof trees. The bijection is straightforward: R_i represents the introduction of an axiom from C, · represents an application of the rule S-TRANS, and S_ℓ represents an application of the rule S-FIELD⊑. This results in an algebraic expression describing the normalized proof tree as in §B.1, which completes the proof.

Definition D.5. Following Carayol and Hague [7], we call a sequence of transitions p_1 ··· p_n productive if the corresponding element p_1 ⊗ ··· ⊗ p_n in StackOps_Σ̃ is not 0. A sequence is productive if and only if it contains no adjacent terms of the form pop y ⊗ push x with x ≠ y.

Lemma D.5. There is a one-to-one correspondence between productive transition sequences accepted by A_C and elementary proofs in C that do not use the axiom S-POINTER.

Proof. A productive transition sequence must consist of a sequence of pop edges followed by a sequence of push edges, possibly with insertions of unit edges or intermediate sequences of the form

    push(u_n) ⊗ ··· ⊗ push(u_1) ⊗ pop(u_1) ⊗ ··· ⊗ pop(u_n)

Following a push or pop edge corresponds to observing or forgetting part of the stack. Following a 1 edge corresponds to applying a PDS rule from P_C.

Algorithm D.1 Converting a set of pushdown system rules to a transducer

1: procedure ALLPATHS(V, E, x, y)    ▷ Tarjan's path algorithm. Return a finite state automaton recognizing the label sequences of all paths from x to y in the graph (V, E).
2:     ...
3:     return Q
4: end procedure

5: procedure TRANSDUCER(∆_C)    ▷ ∆_C is a set of pushdown system rules ⟨p^a; u⟩ ↪ ⟨q^b; v⟩
6:     V ← { #START, #END }
7:     E ← ∅
8:     for all ⟨p^a; u_1 ··· u_n⟩ ↪ ⟨q^b; v_1 ··· v_m⟩ ∈ ∆_C do
9:         V ← V ∪ { p^{a·⟨u_1···u_n⟩}, q^{b·⟨v_1···v_m⟩} }
10:        E ← E ∪ { (#START, p^{a·⟨u_1···u_n⟩}, pop p^{a·⟨u_1···u_n⟩}) }
11:        E ← E ∪ { (q^{b·⟨v_1···v_m⟩}, #END, push q^{b·⟨v_1···v_m⟩}) }
12:        for i ← 1 ... n − 1 do
13:            V ← V ∪ { p^{a·⟨u_{i+1}···u_n⟩} u_1 ... u_i }
14:            E ← E ∪ { (p^{a·⟨u_i···u_n⟩} u_1 ... u_{i−1}, p^{a·⟨u_{i+1}···u_n⟩} u_1 ... u_i, pop u_i) }
15:        end for
16:        for j ← 1 ... m − 1 do
17:            V ← V ∪ { q^{b·⟨v_{j+1}···v_m⟩} v_1 ... v_j }
18:            E ← E ∪ { (q^{b·⟨v_{j+1}···v_m⟩} v_1 ... v_j, q^{b·⟨v_j···v_m⟩} v_1 ... v_{j−1}, push v_j) }
19:        end for
20:        E ← E ∪ { (p^a u_1 ... u_n, q^b v_1 ... v_m, 1) }
21:    end for
22:    E ← SATURATED(V, E)
23:    Q ← ALLPATHS(V, E, #START, #END)
24:    return Q
25: end procedure

D.3 Saturation

The label sequences appearing in Lemma D.5 are tantalizingly close to having the simple structure of building up a pop sequence representing an initial state of the pushdown automaton, then building up a push sequence representing a final state. But the intermediate terms of the form "push u, then pop it again" are unwieldy. To remove the necessity for those sequences, we can saturate A_C by adding additional 1-labeled transitions providing shortcuts to the push/pop subsequences. We modify the standard saturation algorithm to also lazily instantiate transitions which correspond to uses of the S-POINTER-derived rules in ∆_ptr. The goal of the saturation algorithm is twofold:

1. Ensure that for any productive sequence accepted by A_C there is an equivalent reduced sequence accepted by A_C^sat.
2. Ensure that A_C^sat can represent elementary proofs that use S-POINTER.

Definition D.6. A sequence is called reduced if it is productive and contains no factors of the form pop x ⊗ push x.

Reduced productive sequences all have the form of a sequence of pops, followed by a sequence of pushes.

Lemma D.6. The reachable states of the automaton A_C can be partitioned into covariant and contravariant states, where a state's variance is defined to be the variance of any sequence reaching the state from #START.

Proof. Immediate, due to the use of the rule constructors rule^⊕ and rule^⊖ when forming ∆_C.

Lemma D.7. There is an involution n ↦ n̄ on A_C defined by

    (x^a · u)‾ = x^ā · ū
    #START‾ = #END
    #END‾ = #START

Proof. By construction of ∆_C and A_C.

In this section, the automaton A_C will be saturated by adding transitions to create a new automaton A_C^sat. The saturation algorithm (Algorithm D.2) proceeds by maintaining, for each state q ∈ A_C, a set R(q) of reaching pushes. The reaching-push set R(q) will contain the pair (ℓ, p) only if there is a transition sequence in A_C from p to q with weight push ℓ. When q has an outgoing pop ℓ edge to q′ and (ℓ, p) ∈ R(q), we add a new transition p --1--> q′ to A_C.

A special propagation clause is responsible for propagating reaching-push facts as if rules from ∆_ptr were instantiated, allowing the saturation algorithm to work even though the corresponding unconstrained pushdown system has infinitely many rules. This special clause is justified by considering the standard saturation rule when x.store ⊑ x.load is added as an axiom in C. An example appears in Figure 14: a saturation edge is added from x^⊕.store to y^⊕.load due to the pointer saturation rule, but the same edge would also have been added if the states and transitions corresponding to p.store ⊑ p.load (depicted with dotted edges and nodes) were explicitly added to A_C.

Once the saturation algorithm completes, the automaton A_C^sat has the property that, if there is a transition sequence from p to q with weight equivalent to

    pop u_1 ⊗ ··· ⊗ pop u_n ⊗ push v_m ⊗ ··· ⊗ push v_1,

then there is a path from p to q that has, ignoring 1 edges, exactly the label sequence

    pop u_1, ..., pop u_n, push v_m, ..., push v_1.

D.4 Shadowing

The automaton A_C^sat now accepts push/pop sequences representing the changes in the stack during any legal derivation in the pushdown system ∆. After saturation, we can guarantee that every derivation is represented by a path which first pops a sequence of tokens, then pushes another sequence of tokens.

Unfortunately, A_C^sat still accepts unproductive transition sequences which push and then immediately pop token sequences. To complete the construction, we form an automaton Q by intersecting A_C^sat with an automaton for the language of words consisting of only pops, followed by only pushes. This final transformation yields an automaton Q with the property that for every transition sequence s accepted by A_C^sat, Q accepts a sequence s′ such that [[s]] = [[s′]] in StackOps and s′ consists of a sequence of pops followed by a sequence of pushes. This ensures that Q accepts only the productive transition sequences in A_C^sat. Finally, we can treat Q as a finite-state transducer by treating transitions labeled pop ℓ as reading the symbol ℓ from the input tape, push ℓ as writing ℓ to the output tape, and 1 as an ε-transition.

Q can be further manipulated through determinization and/or minimization to produce a compact transducer representing the valid transition sequences through P_C. Taken as a whole, we have shown that X^u ↦ Y^v if and only if X and Y are interesting variables and there is an elementary derivation of C ⊢ X.u ⊑ Y.v.

We make use of this process in two places during type analysis. First, by computing Q relative to the type variable of a function, we get a transducer that represents all elementary derivations of relationships between the function's inputs and outputs. Algorithm D.3 is used to convert the transducer Q back to a pushdown system P, such that Q describes all valid derivations in P. Then the rules in P can be interpreted as subtype constraints, resulting in a simplification of the constraint set relative to the formal type variables. Second, by computing Q relative to the set of type constants, we obtain a transducer that can be efficiently queried to determine which derived type variables are bounded above or below by which type constants. This is used by the SOLVE procedure in F.2 to populate the lattice elements decorating the inferred sketches.

Algorithm D.3 Converting a transducer to a pushdown system

procedure TYPESCHEME(Q)
    ∆ ← new PDS
    ∆.states ← Q.states
    for all p --t--> q ∈ Q.transitions do
        if t = pop ℓ then
            ADDPDSRULE(∆, ⟨p; ℓ⟩ ↪ ⟨q; ε⟩)
        else
            ADDPDSRULE(∆, ⟨p; ε⟩ ↪ ⟨q; ℓ⟩)
        end if
    end for
    return ∆
end procedure

E. The Lattice of Sketches

Throughout this section, we fix a lattice (Λ, <:, ∨, ∧) of atomic types. We do not assume anything about Λ except that it should have finite height, so that infinite subtype chains eventually stabilize. For example purposes, we will take Λ to be the lattice of semantic classes depicted in Figure 15.

Our initial implementation of sketches did not use the auxiliary lattice Λ. We found that adding these decorations to the sketches helped preserve high-level types of interest to the end user during type inference. This allows us to recover high-level C and Windows typedefs such as size_t, FILE, HANDLE, and SOCKET that are useful for program understanding and reverse engineering, as noted by Lin et al. [19].

Decorations also enable a simple mechanism by which the user can extend Retypd's type system, adding semantic purposes to the types for known functions. For example, we can extend Λ to add seeds for a tag #signal-number attached to the less-informative int parameter to signal(). This approach also allows us to distinguish between opaque typedefs and their underlying types, as in HANDLE and void*. Since the semantics of a HANDLE are quite distinct from those of a void*, it is important to have a mechanism that can preserve the typedef name.

E.1 Basic Definitions

Definition E.1. A sketch is a regular tree with edges labeled by elements of Σ and nodes labeled with elements of Λ. The set of all sketches, with Σ and Λ implicitly fixed, will be denoted Sk. We may alternately think of a sketch as a prefix-closed regular language L(S) ⊆ Σ* and a function ν : L(S) → Λ such that each fiber ν^{−1}(λ) is regular. It will be convenient to write ν_S(w) for the value of ν at the node of S reached by following the word w ∈ Σ*.
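The reading of Q as a transducer in §D.4 — pop ℓ consumes ℓ from the input tape, push ℓ emits ℓ to the output tape, and 1-labeled edges are ε-moves — can be made concrete in a few lines. This is our own toy encoding of a path's labels, not the paper's data structures; since an accepted path pushes in the order push v_m, ..., push v_1, the emitted tokens are reversed to recover the output word:

```python
# Interpret a pop*-then-push* label sequence as a transducer run:
# pop l reads l from the input tape, push l writes l to the output
# tape, and "unit" (1-labeled) edges are epsilon transitions.

def run_as_transducer(labels):
    inp, out = [], []
    for kind, tok in labels:
        if kind == "pop":        # read tok from the input tape
            inp.append(tok)
        elif kind == "push":     # write tok to the output tape
            out.append(tok)
        # kind == "unit": epsilon move, no tape action
    # pushes appear innermost-first along the path, so reverse them
    return "".join(inp), "".join(out[::-1])

path = [("pop", "A"), ("pop", "B"), ("unit", None),
        ("push", "D"), ("push", "C")]
assert run_as_transducer(path) == ("AB", "CD")
```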

1: procedure SATURATED(V, E)    ▷ V is a set of vertices partitioned into V = V^⊕ ⊔ V^⊖
2:     E′ ← E    ▷ E is a set of edges, represented as triples (src, tgt, label)
3:     for all x ∈ V do
4:         R(x) ← ∅
5:     end for
6:     for all (x, y, e) ∈ E with e = push ℓ do    ▷ Initialize the reaching-push sets R(·)
7:         R(y) ← R(y) ∪ { (ℓ, x) }
8:     end for
9:     repeat
10:        R_old ← R
11:        E′_old ← E′
12:        for all (x, y, e) ∈ E′ with e = 1 do
13:            R(y) ← R(y) ∪ R(x)
14:        end for
15:        for all (x, y, e) ∈ E′ with e = pop ℓ do
16:            for all (ℓ, z) ∈ R(x) do    ▷ The standard saturation rule.
17:                E′ ← E′ ∪ { (z, y, 1) }
18:            end for
19:        end for
20:        for all x ∈ V do    ▷ Lazily apply saturation rules corresponding to S-POINTER.
21:            for all (ℓ, z) ∈ R(x) with ℓ = .store do    ▷ See Figure 14 for an example.
22:                R(x) ← R(x) ∪ { (.load, z) }
23:            end for
24:            for all (ℓ, z) ∈ R(x) with ℓ = .load do
25:                R(x) ← R(x) ∪ { (.store, z) }
26:            end for
27:        end for
28:    until R = R_old and E′ = E′_old
29:    return E′
30: end procedure
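The fixpoint structure of Algorithm D.2 can be modeled compactly: initialize reaching-push sets from push edges, propagate them across 1-edges, add shortcut 1-edges at matching pops, and mirror .store/.load facts to account for the lazily-instantiated S-POINTER rules. The following is an illustrative toy with string-labeled states and edges, not the production implementation:

```python
# Simplified model of Algorithm D.2 (SATURATED). Edges are
# (src, tgt, label) triples with labels "push X", "pop X", or "1".
# R maps each state to its set of reaching-push facts (token, origin).

def saturated(states, edges):
    edges = set(edges)
    R = {x: set() for x in states}
    for (x, y, e) in edges:
        if e.startswith("push "):
            R[y].add((e[5:], x))          # initialize reaching pushes
    changed = True
    while changed:
        changed = False
        for (x, y, e) in list(edges):
            if e == "1" and not R[x] <= R[y]:
                R[y] |= R[x]              # propagate along unit edges
                changed = True
            if e.startswith("pop "):
                l = e[4:]
                for (tok, z) in list(R[x]):
                    if tok == l and (z, y, "1") not in edges:
                        edges.add((z, y, "1"))   # standard saturation rule
                        changed = True
        for x in states:                  # lazy S-POINTER mirroring
            for (tok, z) in list(R[x]):
                other = {".store": ".load", ".load": ".store"}.get(tok)
                if other and (other, z) not in R[x]:
                    R[x].add((other, z))
                    changed = True
    return edges

# A push of .store reaches state b; b pops .load. Only the lazy
# S-POINTER clause lets the pop match, adding the shortcut a --1--> c
# (compare the dashed edge in Figure 14).
sts = ["a", "b", "c"]
es = [("a", "b", "push .store"), ("b", "b", "1"), ("b", "c", "pop .load")]
assert ("a", "c", "1") in saturated(sts, es)
```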

[Figure 14 graph: states #START, A_L^⊕, x.store^⊕, p.store^⊕, p.load^⊖, y.load^⊕, p.load^⊕, p.store^⊖, y^⊕, p^⊕, p^⊖, B_R^⊕, and #END, connected by pop store / pop load, push store / push load, and 1-labeled edges, with a dashed saturation edge from x^⊕.store to y^⊕.load. The program modeled on the right is: { p = y; x = p; *x = A; B = *y; } ]
Figure 14. Saturation using an implicit application of p.store ⊑ p.load. The initial constraint set was { y ⊑ p, p ⊑ x, A ⊑ x.store, y.load ⊑ B }, modeling the simple program on the right. The dashed edge was added by a saturation rule that only fires because of the lazy handling in D.2. The dotted states and edges show how the graph would look if the corresponding rule from ∆_ptr were explicitly instantiated.

(Defining conditions for a solution, used by Definition E.2:)

• If κ is a type constant, then L(S_κ) = {ε} and ν_{S_κ}(ε) = κ.
• If C ⊢ VAR X.v then v ∈ L(S_X).
• If C ⊢ X.u ⊑ Y.v then ν_{S_X}(u) <: ν_{S_Y}(v).
• If C ⊢ X.u ⊑ Y.v then u^{−1}S_X ⊴ v^{−1}S_Y, where v^{−1}S denotes the sketch corresponding to the subtree of S reached by following the path v from the root of S.
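The proof of Theorem E.1 below quotients a graph of derived type variables by a Steensgaard-style equivalence. A minimal union-find sketch of that quotient, including the .load/.store identification required by S-POINTER (the class shape and constraint encoding here are illustrative assumptions, not Retypd's implementation):

```python
# Union-find sketch of the quotient construction in Theorem E.1:
# nodes are derived type variables; unifying two nodes also unifies
# the targets of outgoing edges with matching labels, treating .load
# and .store as matching (per S-POINTER).

class Graph:
    def __init__(self):
        self.parent = {}
        self.edges = {}                  # representative -> {label: target}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:       # find with path halving
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def edge(self, src, label, tgt):
        self.edges.setdefault(self.find(src), {})[label] = self.find(tgt)

    def unify(self, a, b):
        a, b = self.find(a), self.find(b)
        if a == b:
            return
        self.parent[b] = a
        ea = self.edges.setdefault(a, {})
        for lbl, tgt in self.edges.pop(b, {}).items():
            # identify .load and .store field targets (S-POINTER)
            lbl = ".load" if lbl == ".store" else lbl
            if lbl in ea:
                ea_tgt = ea[lbl]
                self.unify(ea_tgt, tgt)  # congruence closure on fields
            else:
                ea[lbl] = tgt

# Unifying alpha and beta forces alpha.load and beta.store into one
# equivalence class, as in the proof's unusual l/l' condition:
g = Graph()
g.edge("alpha", ".load", "alpha.load")
g.edge("beta", ".store", "beta.store")
g.unify("alpha", "beta")
assert g.find("alpha.load") == g.find("beta.store")
```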

The main utility of sketches is that they are almost a free tree model of the constraint language. Any constraint set C is satisfiable over the lattice of sketches, as long as C cannot prove an impossible subtype relation in Λ.

Theorem E.1. Suppose that C is a constraint set over the variables {τ_i}_{i∈I}. Then there exist sketches {S_i}_{i∈I} such that w ∈ S_i if and only if C ⊢ VAR τ_i.w.

Proof. The languages L(S_i) can be computed by an algorithm that is similar in spirit to Steensgaard's method of almost-linear-time pointer analysis [33]. Begin by forming a graph with one node n(α) for each derived type variable α appearing in C, along with each of its prefixes. Add a labeled edge n(α) --ℓ--> n(α.ℓ) for each derived type variable α.ℓ to form a graph G. Now quotient G by the equivalence relation ∼ defined by n(α) ∼ n(β) if α ⊑ β ∈ C, and n(α′) ∼ n(β′) whenever there are edges n(α) --ℓ--> n(α′) and n(β) --ℓ′--> n(β′) in G with n(α) ∼ n(β), where either ℓ = ℓ′, or ℓ = .load and ℓ′ = .store. The relation ∼ is the symmetrization of ⊑, with the first defining rule roughly corresponding to T-INHERITL and T-INHERITR, and the second rule corresponding to S-FIELD⊕ and S-FIELD⊖. The unusual condition on ℓ and ℓ′ is due to the S-POINTER rule.

By construction, there exists a path with label sequence u through G/∼ starting at the equivalence class of τ_i if and only if C ⊢ VAR τ_i.u. We can take this as the definition of the language accepted by S_i.

Working out the lattice elements that should label S_i is a trickier problem; the basic idea is to use the same pushdown system construction that appears during constraint simplification to answer queries about which type constants are upper and lower bounds on a given derived type variable. The computation of upper and lower lattice bounds on a derived type variable appears in §D.4.

By collapsing equal subtrees, we can represent sketches as deterministic finite state automata with each state labeled by an element of Λ, as in Figure 16. Since the regular language associated with a sketch is prefix-closed, all states of the associated automaton are accepting.

Lemma E.1. The set of sketches Sk forms a lattice with meet and join operations ⊓, ⊔ defined according to Figure 18. Sk has a top element ⊤ given by the sketch accepting the language {ε}, with the single node labeled by ⊤ ∈ Λ. If Σ is finite, Sk also has a bottom element ⊥ accepting the language Σ*, with label function ν_⊥(w) = ⊥ when ⟨w⟩ = ⊕, or ν_⊥(w) = ⊤ when ⟨w⟩ = ⊖.

We will use X ⊴ Y to denote the partial order on sketches compatible with the lattice operations, so that X ⊓ Y = X if and only if X ⊴ Y.

Figure 15. The sample lattice Λ of atomic types (⊤ above num and str; url below str; ⊥ at the bottom).

Figure 16. Sketch representing a linked list of strings, struct LL { str s; struct LL* a; }* (a cyclic automaton with .load/.store edges and .σ32@0 nodes labeled str).

Figure 17. Sketch representing _Out_ url * u; (a .store edge whose .σ32@0 target is labeled url).

E.1.1 Modeling Constraint Solutions with Sketches

Sketches are our choice of entity for modeling solutions to the constraint sets of §3.1.

Definition E.2. A solution to a constraint set C over the type variables V is a set of bindings S : V → Sk satisfying the conditions listed after the caption of Figure 14 above.

E.1.2 Sketch Narrowing at Function Calls

It was noted in §3.2 that the rule T-INHERITR leads to a system with structural typing: any two types in a subtype relation must have the same fields. In the language of sketches, this means that if two type variables are in a subtype relation, then the corresponding sketches accept exactly the same languages. Superficially, this seems problematic for modeling typecasts that narrow a pointed-to object, as motivated by the idioms in §2.4.

The missing piece that allows us to effectively narrow objects is instantiation of callee type schemes at a callsite. To demonstrate how polymorphism enables narrowing, consider the example type scheme ∀F. C ⇒ F from Figure 2. The function close_last can be invoked by providing any actual-in type α such that α ⊑ F.in_stack0; in particular, α can have more capabilities than F.in_stack0 itself. That is, we can pass a more constrained ("more capable") type as an actual-in to a function that expected a less constrained input. In this way, we recover aspects of the physical, nonstructural subtyping utilized by many C and C++ programs via the pointer-to-member or pointer-to-substructure idioms described in §2.4.

Example E.1. Suppose we have a reverse DNS lookup function reverse_dns with C type signature void reverse_dns(num addr, url* result). Furthermore, assume that the implementation of reverse_dns works by writing the resulting URL to the location pointed to by result, yielding a constraint of the form url ⊑ result.store.σ32@0. So reverse_dns will have an inferred type scheme of the form

    ∀α, β. (α ⊑ num, url ⊑ β.store.σ32@0) ⇒ α × β → void

Now suppose we have the structure LL of Figure 16 representing a linked list of strings, with instance LL * myList. Can we safely invoke reverse_dns(addr, (url*) myList)?

Intuitively, we can see that this should be possible: since the linked list's payload is in its first field, a value of type LL* also looks like a value of type str*. Furthermore, myList is not const, so it can be used to store the function's output. Is this intuition borne out by Retypd's type system?

The answer is yes, though it takes a bit of work to see why. Let us write S_β for the instantiated sketch of β at the callsite of reverse_dns in question; the constraints require that S_β satisfy .store.σ32@0 ∈ L(S_β) and url <: ν_{S_β}(.store.σ32@0). The actual-in has the type sketch S_X seen in Figure 16, and the copy from actual-in to formal-in will generate the constraint X ⊑ β.

Since we have already satisfied the two constraints on S_β coming from its use in reverse_dns, we can freely add other words to L(S_β), and can freely set their node labels to almost any value we please, subject only to the constraint S_X ⊴ S_β. This is simple enough: we just add every word in L(S_X) to L(S_β). If w is a word accepted by L(S_X) that must be added to L(S_β), we define ν_{S_β}(w) := ν_{S_X}(w). Thus, both the shape and the node labeling on S_X and S_β match, with one possible exception: we must check that X has a nested field .store.σ32@0, and that the node labels satisfy the relation

    ν_{S_β}(.store.σ32@0) <: ν_{S_X}(.store.σ32@0)

since .store.σ32@0 is contravariant and S_X ⊴ S_β. X does indeed have the required field, and url <: str; the function invocation is judged to be type-safe.

Algorithm E.1 Computing sketches from constraint sets

procedure INFERSHAPES(C_initial, B)
    C ← SUBSTITUTE(C_initial, B)
    G ← ∅    ▷ Compute the constraint graph modulo ∼
    for all p.ℓ_1 ... ℓ_n ∈ C.derivedTypeVars do
        for i ← 1 ... n do
            s ← FINDEQUIVREP(p.ℓ_1 ... ℓ_{i−1}, G)
            t ← FINDEQUIVREP(p.ℓ_1 ... ℓ_i, G)
            G.edges ← G.edges ∪ (s, t, ℓ_i)
        end for
    end for
    for all x ⊑ y ∈ C do
        X ← FINDEQUIVREP(x, G)
        Y ← FINDEQUIVREP(y, G)
        UNIFY(X, Y, G)
    end for
    repeat    ▷ Apply additive constraints and update G
        C_old ← C
        for all c ∈ C_old with c = ADD(_) or SUB(_) do
            D ← APPLYADDSUB(c, G, C)
            for all δ ∈ D with δ = X ⊑ Y do
                UNIFY(X, Y, B)
            end for
        end for
    until C_old = C
    for all v ∈ C.typeVars do    ▷ Infer initial sketches
        S ← new Sketch
        L(S) ← ALLPATHSFROM(v, G)
        for all states w ∈ S do
            if ⟨w⟩ = ⊕ then
                ν_S(w) ← ⊤
            else
                ν_S(w) ← ⊥
            end if
        end for
        B[v] ← S
    end for
end procedure

procedure UNIFY(X, Y, G)    ▷ Make X ∼ Y in G
    if X ≠ Y then
        MAKEEQUIV(X, Y, G)
        for all (X′, ℓ) ∈ G.outEdges(X) do
            if (Y′, ℓ) ∈ G.outEdges(Y) for some Y′ then
                UNIFY(X′, Y′, G)
            end if
        end for
    end if
end procedure

F. Additional Algorithms

This appendix holds a few algorithms referenced in the main text and other appendices.

Algorithm F.1 Type scheme inference

procedure INFERPROCTYPES(CallGraph)
    T ← ∅    ▷ T is a map from procedure to type scheme.
    for all S ∈ POSTORDER(CallGraph.sccs) do
        C ← ∅
        for all P ∈ S do
            T[P] ← ∅
        end for
        for all P ∈ S do
            C ← C ∪ CONSTRAINTS(P, T)
        end for
        C_∂ ← INFERSHAPES(C, ∅)
        for all P ∈ S do
            V ← P.formalIns ∪ P.formalOuts
            Q ← TRANSDUCER(C_∂, V ∪ Λ)
            T[P] ← TYPESCHEME(Q)
        end for
    end for
end procedure

procedure CONSTRAINTS(P, T)
    C ← ∅
    for all i ∈ P.instructions do
        C ← C ∪ ABSTRACTINTERP(TypeInterp, i)
        if i calls Q then
            C ← C ∪ INSTANTIATE(T[Q], i)
        end if
    end for
    return C
end procedure

Algorithm F.2 C type inference

procedure INFERTYPES(CallGraph, T)
    B ← ∅    ▷ B is a map from type variable to sketch.
    for all S ∈ REVERSEPOSTORDER(CallGraph.sccs) do
        for all P ∈ S do
            C ← CONSTRAINTS(P, T)
            SOLVE(C, B)
            REFINEPARAMETERS(P, B)
        end for
    end for
    A ← ∅
    for all x ∈ B.keys do
        A[x] ← SKETCHTOAPPXCTYPE(B[x])
    end for
    return A
end procedure

procedure SOLVE(C, B)
    C ← INFERSHAPES(C, B)
    Q ← TRANSDUCER(C, Λ)
    for all λ ∈ Λ do
        for all X^u such that λ →_Q X^u do
            ν_{B[X]}(u) ← ν_{B[X]}(u) ∨ λ
        end for
        for all X^u such that X^u →_Q λ do
            ν_{B[X]}(u) ← ν_{B[X]}(u) ∧ λ
        end for
    end for
end procedure

Algorithm F.3 Procedure specialization

procedure REFINEPARAMETERS(P, B)
    for all i ∈ P.formalIns do
        λ ← ⊥
        for all a ∈ P.actualIns(i) do
            λ ← λ ⊔ B[a]
        end for
        B[i] ← B[i] ⊓ λ
    end for
    for all o ∈ P.formalOuts do
        λ ← ⊤
        for all a ∈ P.actualOuts(o) do
            λ ← λ ⊓ B[a]
        end for
        B[o] ← B[o] ⊔ λ
    end for
end procedure
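REFINEPARAMETERS combines sketches using the ⊓ and ⊔ of Figure 18, which union or intersect languages and combine node labels variance-sensitively. A finite toy model of the sketch meet, restricted to the str/url chain of the sample lattice and to finite languages (all encodings here are illustrative assumptions, not Retypd's data structures):

```python
# Toy model of the sketch meet from Figure 18: a sketch is a finite
# dict word -> atomic label; the atomic lattice is the chain
# bottom <: url <: str <: top; a word is contravariant iff it contains
# an odd number of contravariant tokens (".store" in this toy).

CHAIN = ["bottom", "url", "str", "top"]
meet_atom = lambda a, b: CHAIN[min(CHAIN.index(a), CHAIN.index(b))]
join_atom = lambda a, b: CHAIN[max(CHAIN.index(a), CHAIN.index(b))]

def covariant(word):
    return word.count(".store") % 2 == 0

def sketch_meet(X, Y):
    out = {}
    for w in set(X) | set(Y):            # L(X meet Y) = L(X) U L(Y)
        if w in X and w in Y:
            # meet labels at covariant nodes, join at contravariant ones
            op = meet_atom if covariant(w) else join_atom
            out[w] = op(X[w], Y[w])
        else:                            # word in only one language:
            out[w] = X.get(w, Y.get(w))  # carry its label over
    return out

X = {"": "top", ".load": "str"}
Y = {"": "top", ".load": "url", ".store": "url"}
M = sketch_meet(X, Y)
assert M[".load"] == "url"    # covariant node: meet of str and url
assert M[".store"] == "url"   # present only in Y: label carried over
```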
Language:

    L(X ⊓ Y) = L(X) ∪ L(Y)
    L(X ⊔ Y) = L(X) ∩ L(Y)

Node labels:

    ν_{X⊓Y}(w) = ν_X(w) ∧ ν_Y(w)  if w ∈ L(X) ∩ L(Y) and ⟨w⟩ = ⊕
                 ν_X(w) ∨ ν_Y(w)  if w ∈ L(X) ∩ L(Y) and ⟨w⟩ = ⊖
                 ν_X(w)           if w ∈ L(X) \ L(Y)
                 ν_Y(w)           if w ∈ L(Y) \ L(X)

    ν_{X⊔Y}(w) = ν_X(w) ∨ ν_Y(w)  if w ∈ L(X) ∩ L(Y) and ⟨w⟩ = ⊕
                 ν_X(w) ∧ ν_Y(w)  if w ∈ L(X) ∩ L(Y) and ⟨w⟩ = ⊖

Figure 18. Lattice operations on the set of sketches.

G. Other C Type Resolution Policies

Example G.1. The initial type-simplification stage results in types that are as general as possible. Often, this means that types are found to be more general than is strictly helpful to a (human) observer. A policy called REFINEPARAMETERS is used to specialize type schemes to the most specific scheme that is compatible with all uses. For example, a C++ object may include a getter function with a highly polymorphic type scheme, since it could operate equally well on any structure with a field of the right type at the right offset. But we expect that in every calling context, the getter will be called on a specific object type (or perhaps its derived types). By specializing the function signature, we make use of contextual clues in exchange for generality before presenting a final C type to the user.

Example G.2. Suppose we have a C++ class

class MyFile {
public:
    char * filename() const {
        return m_filename;
    }
private:
    FILE * m_handle;
    char * m_filename;
};

In a 32-bit binary, the implementation of MyFile::filename (if not inlined) will be roughly equivalent to the C code

typedef int32_t dword;

dword get_filename(const void * this)
{
    char * raw_ptr = (char*) this;
    dword * field_ptr = (dword*) (raw_ptr + 4);
    return *field_ptr;
}

Accordingly, we would expect the most-general inferred type scheme for MyFile::filename to be

    ∀α, β. (β ⊑ dword, α.load.σ32@4 ⊑ β) ⇒ α → β

indicating that get_filename will accept a pointer to anything which has a value of some 32-bit type β at offset 4, and will return a value of that same type. If the function is truly used polymorphically, this is exactly the kind of precision that we wanted our type system to maintain.

But in the more common case, get_filename will only be called with values where this has type MyFile* (or perhaps a subtype, if we include inheritance). If every callsite to get_filename passes it a pointer to MyFile*, it may be best to specialize the type of get_filename to the monomorphic type

    get_filename : const MyFile* → char*

The function REFINEPARAMETERS in F.3 is used to specialize each function's type just enough to match how the function is actually used in a program, at the cost of reduced generality.

Example G.3. A useful but less sound heuristic is represented by the reroll policy for handling types which look like unrolled recursive types:

reroll(x):
    if there are u and ℓ with x.ℓu = x.ℓ, and sketch(x) ⊴ sketch(x.ℓ):
        replace x with x.ℓ
    else:
        policy does not apply

In practice, we often need to add other guards which inspect the shape of x to determine if the application of reroll appears to be appropriate or not. For example, we may require x to have at least one field other than ℓ to help distinguish a pointer-to-linked-list from a pointer-to-pointer-to-linked-list.

H. Details of the Figure 2 Example

The results of constraint generation for the example program in Figure 2 appear in Figure 20. The constraint-simplification algorithm builds the automaton Q (Figure 19) to recognize the simplified entailment closure of the constraint set. Q recognizes exactly the input/output pairs of the form

    (close_last.in_stack0.(.load.σ32@0)*.load.σ32@4, int ⊓ #FileDescriptor)

and

    (int ⊔ #SuccessZ, close_last.out_eax)

To generate the simplified constraint set, a type variable τ is synthesized for the single internal state in Q. The path leading from the start state to τ generates the constraint

    close_last.in_stack0 ⊑ τ

The loop transition generates

    τ.load.σ32@0 ⊑ τ

and the two transitions out of τ generate

    τ.load.σ32@4 ⊑ int
    τ.load.σ32@4 ⊑ #FileDescriptor

Finally, the two remaining transitions from start to end generate

    int ⊑ close_last.out_eax
    #SuccessZ ⊑ close_last.out_eax

To generate the simplified constraint set, we gather up these constraints (applying some lattice operations to combine inequalities that only differ by a lattice constant) and close over the introduced τ by introducing an ∃τ quantifier. The result is the constraint set of Figure 2.

Figure 19. The automaton Q for the constraint system in Figure 2 (transitions labeled input / output: .load.32@0 / ε, close_last.in / ε, .load.32@4 / #FileDescriptor, .load.32@4 / int, #SuccessZ / close_last.out, int / close_last.out).

_text:08048420 close_last proc near
_text:08048420 close_last:
_text:08048420    mov edx, dword [esp+fd]
        AR_close_last_INITIAL[4:7] <: EDX_8048420_close_last[0:3]
        close_last.in@stack0 <: AR_close_last_INITIAL[4:7]
        EAX_804843F_close_last[0:3] <: close_last.out@eax
_text:08048424    jmp loc_8048432
_text:08048426    db 141, 118, 0, 141, 188, 39
_text:0804842C    times 4 db 0
_text:08048430 loc_8048430:
_text:08048430    mov edx, eax
        EAX_8048432_close_last[0:3] <: EDX_8048430_close_last[0:3]
_text:08048432 loc_8048432:
_text:08048432    mov eax, dword [edx]
        EDX_8048420_close_last[0:3] <: unknown_loc_106
        EDX_8048430_close_last[0:3] <: unknown_loc_106
        unknown_loc_106.load.32@0 <: EAX_8048432_close_last[0:3]
_text:08048434    test eax, eax
_text:08048436    jnz loc_8048430
_text:08048438    mov eax, dword [edx+4]
        EDX_8048420_close_last[0:3] <: unknown_loc_111
        EDX_8048430_close_last[0:3] <: unknown_loc_111
        unknown_loc_111.load.32@4 <: EAX_8048438_close_last[0:3]
_text:0804843B    mov dword [esp+fd], eax
        EAX_8048438_close_last[0:3] <: AR_close_last_804843B[4:7]
_text:0804843F    jmp __thunk_.close
        AR_close_last_804843B[4:7] <: close:0x804843F.in@stack0
        close:0x804843F.in@stack0 <: #FileDescriptor
        close:0x804843F.in@stack0 <: int
        close:0x804843F.out@eax <: EAX_804843F_close_last[0:3]
        int <: close:0x804843F.out@eax
_text:08048443 close_last endp

Figure 20. The constraints obtained by abstract interpretation of the example code in Figure 2.
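The reroll policy of Example G.3 can be sketched over a finite-language approximation of sketches. Note the direction of the order: since the sketch meet unions languages, sketch(x) ⊴ sketch(x.ℓ) corresponds to L(x.ℓ) ⊆ L(x). The set encoding and helper names below are illustrative assumptions, not Retypd's regular-tree machinery:

```python
# Sketch of the reroll heuristic: x looks like a one-step-unrolled
# recursive type when following some field label l from x yields a
# sub-sketch that x is below in the sketch order, i.e. whose language
# is contained in x's. Languages here are finite sets of words.

def subsketch(lang, l):
    # language of x.l: words of x starting with l, with l stripped
    return {w[len(l):] for w in lang if w.startswith(l)}

def reroll(lang, labels):
    for l in labels:
        sub = subsketch(lang, l)
        # sketch(x) is below sketch(x.l) exactly when L(x.l) <= L(x),
        # because the sketch meet unions languages (Figure 18)
        if sub and sub <= lang:
            return sub               # replace x with x.l
    return None                      # policy does not apply

# Two unrollings of the linked list from Figure 16: a .s payload field
# plus the same shape repeated under .a.
ll = {"", ".s", ".a", ".a.s", ".a.a"}
assert reroll(ll, [".a"]) == {"", ".s", ".a"}
# No recursive structure under .a: the policy does not apply.
assert reroll({"", ".s"}, [".a"]) is None
```

As the text notes, a production version would add shape guards (e.g. requiring a field other than ℓ) before folding.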