<<

Modeling Imperative String Operations with Transducers

Pieter Hooimeijer David Molnar Prateek Saxena University of Virginia Microsoft Research UC Berkeley [email protected] [email protected] [email protected]

Margus Veanes Microsoft Research [email protected]

Abstract lems if the code fails to adhere to that intended structure, We present a domain-specific imperative language, Bek, causing the output to have unintended consequences. that directly models low-level string manipulation code fea- The growing rate of security vulnerabilities, especially in turing boolean state, search operations, and sub- web applications, has sparked interest in techniques for vul- stitutions. We show constructively that Bek is reversible nerability discovery in existing applications. A number of through a semantics-preserving translation to symbolic fi- approaches have been proposed to address the problems as- nite state transducers, a novel representation for transduc- sociated with string manipulating code. Several static anal- ers that annotates transitions with logical formulae. Sym- yses aim to address cross-site scripting or SQL injection bolic finite state transducers give us a new way to marry by explicitly modeling sets of values that strings can take the classic theory of finite state transducers with the recent at runtime [5, 16, 22, 23]. These approaches use analysis- progress in satisfiability modulo theories (SMT) solvers. We specific models of strings that are based on finite automata exhibit an efficient well-founded encoding from symbolic fi- or context-free grammars. More recently, there has been nite state transducers into the higher-order theory of alge- significant interest in constraint solving tools that model braic datatypes. We evaluate the practical utility of Bek as a strings [2, 10, 12, 14, 18, 20, 21]. String constraint solvers constraint language in the domain of web application saniti- allow any client analysis to express constraints (e.g., path zation code. We demonstrate that our approach can address predicates) that include common string manipulation func- real-world queries regarding, for example, the idempotence tions. and relative strictness of popular sanitization functions. Neither existing static approaches nor string constraint solvers are currently well-equiped to model low-level imper- Categories and Subject Descriptors D.2.4 [Software ative string manipulation code. In this paper, we aim to ad- Engineering]: Software/Program Verification—Validation; dress this niche. Our approach is based on the insight that D.2.4 [Software Engineering]: Software/Program Verification— many security-critical string functions can be modeled pre- Model checking; F.3.1 [Logics and Meanings]: Specifying cisely using finite state transducers over a symbolic alpha- and Verifying and Reasoning about Programs—Mechanical bet. We present Bek, a domain-specific language for mod- verification eling string transformations. The language is designed to be (a) sufficiently expressive to model real-world code, and (b) General Terms Algorithms, Languages, Theory, Verifi- sufficiently restricted to allow precise analysis using trans- cation ducers. Bek can model real-world sanitization functions, such as those in the .NET System.Web library, without ap- Bek 1. Introduction proximation. We provide a translation from expressions to the theory of algebraic datatypes, allowing Bek expres- A large fraction of security vulnerabilities (including SQL in- sions to be used directly when specifying constraints for an jection, cross-site scripting, and buffer overflows) arise due SMT solver, in combination with other theories. to errors in string-manipulating code. Developers frequently Key to enabling the analysis of Bek expressions is a new use low-level string operations, like and sub- theory of symbolic finite state transducers, an extension of stitution, to manipulate data that must follow a particular standard form finite transducers that we introduce formally high-level structure, like HTML or SQL. This leads to prob- in this paper. We develop a theory of symbolic transducers, showing its integration with other theories in SMT solvers that support E-matching [6]. We describe a tractable encod- ing of symbolic finite transducers into the theory of alge- braic data types and prove its well-foundedness (i.e., given sufficient resources, any given query yields a finite-length proof) using model-theoretic insights. We introduce the con- cept of join composition, which enables us to preserve the key property of reversibility (i.e, given an output, produce corresponding inputs) that is crucial for checking sanitizer [Copyright notice will appear here once ’preprint’ option is removed.]

1 2010/7/19 correctness. Finally, we present a translation of Bek expres- void site_exec( char *cmd) sions into symbolic finite transducers. { To evaluate our approach, we show that several pop- char buf[MAXPATHLEN], *slash, *t; Bek /* sanitize the command string */ ular sanitization procedures can be ported to with char *sp = ( char *) strchr (cmd, ’ ’); 5 little effort. Each such port matches the behavior of the if (sp == 0) { original procedure without any need for conservative over- while ((slash = strchr (cmd, ’/’)) != 0) approximation. We use a prototype implementation of Bek cmd = slash + 1; that uses the Z3 SMT solver [24] as the back end interpreter. } During our experiments, we were able to generate witnesses else { 10 for a previously-known vulnerability. We demonstrate that while (sp && the prototype can resolve queries that are of practical in- (slash = ( char *) strchr (cmd, ’/ ’)) && terest to both users and developers of sanitization routines, (slash < sp)) cmd = slash + 1; such as “do two sanitizers exhibit deviant behaviors on cer- } 15 tain inputs,” “do multiple applications of a sanitizer intro- for (t = cmd; *t && !isspace(*t); t++) duce errors,” or “given an possibility of attack output, what if (isupper(*t)) *t = tolower(*t); is the maximal set of corresponding inputs that demonstrate /* build the command */ the attack?” int pathlen = strlen (PATH); The primary contributions of this paper are: int cmdlen = strlen (cmd); 20 if (pathlen + cmdlen +2 > sizeof (buf)) • We formally describe a domain-specific language, Bek, return ; for string manipulation. We describe a syntax-driven sprintf(buf, "%s/%s", PATH, cmd); translation from Bek expressions to symbolic finite state /* ... execute buf, store results ... */ transducers. fprintf(remote_socket , cmd); 25 } • We formalize symbolic finite state transducers and their reduction to the theory of algebraic datatypes, including Figure 1. Source code using hand-written sanitization and the intersection and composition operations. checks to avoid a buffer overrun (successfully, line 21) and • We show that Bek can encode real-world string manip- a format string vulnerability (unsuccessfully, line 25), and ulating code used to sanitize untrusted inputs in Web enforce path-related policies (successfully). applications. Within this domain, we demonstrate sev- eral applications that are of direct practical interest. removed from the command, not from the arguments. The The rest of this paper is structured as follows. We provide a strchr invocation on line 5 is used to check if any spaces motivating example in Section 2. Sections 3–5 describe our are present (line 6). If so, a more complicated version of approach. Section 3 describes Bek in terms of its large step the slash-skipping logic is used (lines 10–15) that only ad- operational semantics. Section 4 provides language-theoretic vances cmd past slashes before the first space. Lines 18– definitions and presents the translation from Bek expres- 22 build the command that will be executed (e.g., com- sions to transducers. In Section 5, we formalize a theory of pleting the transformation from "/usr/bin/ls -l *.c" to symbolic finite transducers based on the theory of algebraic "/home/ftp/bin/ls -l *.c") by using sprintf to concatenate datatypes. We present our case studies in Section 6. Finally, the trusted directory, a slash, and the suffix of the user we discuss closely related work in Section 7 and conclude in command. The check on line 21 prevents a buffer overrun Section 8. on the local stack-allocated variable buf by explicitly adding together the two string lengths, one byte for the slash, and 2. Motivating Example one byte for C’s null termination, and comparing the result against the size of buf. We put our work in context using a code fragment from More tellingly, while the code correctly avoids buffers version 2.6.0 of wu-ftpd, a file transfer server, written in overruns and implements its path-based security policy, it C, that has a known format string vulnerability. Figure 1 is vulnerable to a format string attack. Since the user’s shows a fragment of code for handling SITE EXEC commands command is passed as the format string to fprintf (line from wu-ftpd. The SITE EXEC portion of the file transfer 25), if it contains sequences such as %d or %s they will protocol allows remote users to execute certain commands be interpreted by printf’s formatting logic. This typically on the local server. The cmd string holds untrusted data results in random output, but careful use of the uncommon provided by such a remote user; an example benign value %n directive, which instructs printf to store the number of is "/usr/bin/ls -l *.c". This code is an indicative example characters written so far through an integer pointer on the of string processing in the wild: it tries to accomplish several stack, can allow an adversary to take control of the system. tasks at once, it relies on character-level imperative updates An exploit for just such an attack against exactly this code to manipulate its input, and control flow depends on string was made publicly available [4]. values. The variable PATH points to a directory containing ex- ecutable files that remote users are allowed to invoke 3. Modeling Low-Level String Operations (e.g., "/home/ftp/bin"). To prevent the remote user from In this section, we give a high-level description of a small im- invoking other executables via pathname trickery (e.g., perative language, Bek, of low-level string operations. Our cmd == "../../../bin/dangerous"), lines 5–15 sanitize the goal is two-fold. First, it should be possible to model Bek command string by skipping past all slash-delimited path expressions in a way that allows for their analysis using ex- elements. However, skipping past all slashes does not have isting constraint solvers. Second, we want Bek to be suf- the desired effect: "/bin/echo ’10/5=2’" should become ficiently expressive to closely model real-world code (such "/echo ’10/5=2’" and not "5=2’"; slashes should only be as the wu-ftpd example of Section 2). This section defines

2 2010/7/19 Bool Constants B ∈ {T, F} Bool Variables b,... h∅, initi⇓ EB t is fresh Char Constants d ∈ Σ Char Variables c,... ∅ 0 0 (2) 0 N h ,sei⇓hE ,r i E = EB [t 7→ r ] Int Constants n ∈ String Variables t (2) (3) ∗ hE , iter[c1,... in t] () {clist}i ⇓hE ,ri String Literals const ∈ Σ Itr hE, iter[c1 . . . in se] (init) {clist}i⇓h∅,ri Expressions strexpr ::= iter[cseq in strexpr] (init) {clist} | (strexpr) from posexpr | (strexpr) upto posexpr E(s) = [] | strexpr · const | const · strexpr hE, iter[c1 . . . in s] () {clist}i ⇓h∅, []i | t cseq ::= c | c, cseq 0 E(s)= w1 :: w t is fresh init ::= b := B, init |  (2) clist ::= case clist | case EC = E[c1 7→ E(s)(1)] . . . hEC , clisti⇓hE ,ri (2) 0 0 0 case ::= case(bexpr) {cstmt} hE [t 7→ w ], iter[c1,... in t] () {clist}i ⇓hE ,r i cstmt cstmt cstmt pass ::= | ; hE, iter[c ,... in s] () {clist}i ⇓hE0,r · r0i | b := boolexpr; 1 | yield(chexpr); Positions posexpr ::= pterm hE,casei⇓ Skip | (pterm)  n  ∈ {+, −} hE,casei⇓hE0,ri hE,clisti⇓hE0,ri pterm ::= last const Cases 0 0 | first const hE,case clisti⇓hE ,ri hE,case clisti⇓hE ,ri Booleans bexpr ::= bexpr ∨ bexpr | bexpr ∧ bexpr | ¬(bexpr) | chexpr = chexpr hE,bei⇓ T | bexpr = bexpr | B | b 0 Characters chexpr ::= c | d | $ hE,csti⇓hE ,ri hE,bei⇓ F hE, case(be){cst}i⇓hE0,ri hE, case(be){cst}i ⇓ Skip Figure 2. Concrete syntax for Bek. Well-formed Bek ex- pressions are functions of type string -> string; the lan- guage provides basic constructs to filter and transform the Figure 3. Selected operational semantics for the iter con- single input string t. struct, which provides a sliding window over the value of a string expression. Boolean state (declared using init in Itr) is available across iterations, but local to the iter block for Bek, presents its forward operational semantics, and pro- which it is declared. For each iteration, only the body of the Cases vides examples. In the sections that follow, we demonstrate topmost matching case is evaluated ( ). Case state- that Bek can be integrated into existing constraint solvers. ments may update the boolean state, and yield zero or more Figure 3 describes the language syntax. We define a single characters (not shown). string variable, t, to represent an input string, and a number of expressions that can take either t or another expression as their input. The from and upto constructs represent search if they are not escaped already). An iter block declares operations that truncate their input starting at (or ending a single-character window (c1) and a single boolean state with) the occurrence of a constant search string. Without variable b , which is initially false. the integer argument, the results of both from and upto 1 include the matched search constant. iter[c1 in t] (b1 = F) { 0 00 Example 1. The following expression searches for the last case(¬(b1) ∧ (c1 = ∨ c1 = )) { occurrence of foo in its input, returning everything following b1 := F; yield(\); yield(c1); the match (if any). } case(c1 = \) { b := ¬(b ); yield(c ); (t) from (last foo) − 1; 1 1 1 } case(T) { If applied to the string foofoo, the output would be b1 := F; yield(c1); ofoo. If we replaced last with first, the result would also } be ofoo, since there is no earlier occurrence of foo that has } one preceeding character in the string.  The boolean variable b1 is used to track whether the previous The iter construct is designed to model loops that tra- character seen was an unescaped slash. For example, in the verse strings while making imperative updates. Given a input \\00 the double quote is not considered escaped, and string expression (strexpr), a sequence of character binders the transformed output is \\\00. If we apply the expression to (cseq), and an optional initial boolean state (init), an iter- \\\00 again, the output is the same. An interesting question block provides a sliding window over its input. For the ith (0- is whether this holds for any output string. In other words, based) iteration, the character binders c1,...,cn are bound we may be interested in whether a given Bek expression is to characters wi through wi+n−1 in the input. If some wj do idempotent. not exist (i.e., we have reached the end of the input), then If implemented wrongly, double applications of such san- the corresponding character binder is assigned the special itization functions has resulted in duplicate escaping which symbol $. The case statements inside the block can yield have opened real systems to command injection of script- zero or more characters, and update the boolean state (af- injection attacks in the past. Checking idempotence of cer- fecting future iterations). tain functions is practically useful. The transducer transla- Example 2. The following expression represents a basic tion presented in Section 4 can be used to prove such prop-  sanitizer that escapes single and double quotes (but only erties about Bek expressions (including idempotence).

3 2010/7/19 Figure 3 presents the operational semantics for the iter with their usual meaning. We also adopt the convention that construct. We define the evaluation relation: [a,b,c] stands for the list [a|[b|[c|ε]]] and we write l1 · l2 for the concatenation of l with l . When convenient, we will ⇓ ⊆ (context × strexpr) × (context × string) 1 2 use length-bounded lists in the context of finite sets (such where a context E maps variables to values. The iter judg- as the alphabet of an automaton). ments update the environment to carry boolean state across Words are represented by lists. Typically, characters have iterations and to update the character binders for each it- sort bvn for some fixed n> 0, e.g., if words represent strings eration. Each iteration consumes the first character w1 of of ASCII characters, in which case constant characters are the current remaining string. The case block conditions are written as ‘a’ assuming for example ASCII encoding. In checked in sequence; the first case to match is executed. If general however, characters may have compound sorts such none of the case conditions match, we assume an implicit as tuplehbv7, bv7, booli, although finite, e.g., unbounded case (not shown) that outputs the empty string and makes lists will not be considered as characters. no change to the state. We write E(s)(n) for the nth char- We assume basic familiarity with classical automata the- acter in the value of string variable s. If n ≥ len(E(s)), then ory [13]. In this paper we use finite (state) transducers. A E(s)(n) = $. We define $ to be a special character symbol finite transducer is a generalization of a Mealy machine that, that is uncomparable to in-domain characters. in addition to its input and output symbols, has the special Any well-formed derivation under these inference rules symbol  denoting the empty word making it possible to starts with the base case: hE,ti⇓h∅,E(t)i, where E is omit characters in the input and output words. We use the assumed as the initial assignment to t. The out state is following formal definition of a finite transducer. The defini- used only by the evaluation rules for iter. We elide the tion is known as the standard form of a finite transducer [17, judgments for the search operations from and upto and the p. 69]. concatenation-with-a-constant operations; they are defined directly in terms of their input string, yielding only the Definition 1. A Finite Transducer A is defined as a six- tuple (Q, q0, F, Σ, Γ, δ), where Q is a finite set of states, corresponding output string. Note that, in Figure 3, we 0 ignore any state E0 produced by the evaluation of nested q ∈ Q is the initial state, F ⊆ Q is the set of final states, Σ Itr is the input alphabet, Γ is the output alphabet, and δ is the string expression se ( judgment), and we always emit Q×(Γ∪{}) the empty mapping. In other words, the execution of an transition function from Q × (Σ ∪ {}) to 2 . iter block is free of external side-effects. It follows that all We indicate a component of a finite transducer A by using toplevel strexpr judgments are side-effect free. A as a subscript. Instead of (q,b) ∈ δA(p,a) we often use the a/b a/b 4. Translation to Finite Transducers more intuitive notation p −→A q, or p −→ q when A is clear from the context. Given words v and w we let v · w be the We now turn to the translation of Bek expressions to fi- concatenated word. Note that v ·  =  · v = v. Bek ai/bi v/w nite state transducers. For a given expression P , we Given qi −→ A qi+1 for i < n we write q0 −→A qn where write M[[ P ]] for the corresponding finite transducer. We v = a0 · a1 · . . . · an−1 and w = b0 · b1 · . . . · bn−1. A induces Bek ∗ ∗ use this construction to show that programs are re- the binary relation [[A]] ⊆ Σ × Γ as follows for which we Bek A A versible: given a expression P and an output string y, use infix notation we can compute the maximal set R = {x | P (x) = y}, and def 0 v/w R is regular for any such computation. The presentation is v[[A]]w = ∃q ∈ FA (qA −→ q) structured as follows. In Section 4.1, we provide transducer- Given two binary relations R and R , we write R ◦ R related definitions. Section 4.2 exhibits the high-level trans- 1 2 1 2 for the binary relation {(x,y) | ∃z (R (x,z) ∧ R (z,y))}.A lation from Bek to finite transducers. Finally, in Section 5, 1 2 useful composition of finite transducers A and B is the join we extend the definitions of Section 4.1 to a formal encod- composition of A and B, that is a finite transducer A ◦ B ing of symbolic finite transducers. This allows for an imple- such that [[A ◦ B]] = [[A]] ◦ [[B]]. mentation that integrates Bek-program-induced constraints directly with other constraints. Definition 2. Let A and B be finite transducers. The join composition of A and B is the finite transducer 4.1 Definitions def 0 0 We work in the context of a fixed multi-sorted universe of A ◦ B = (QA × QB , (qA, qB ), FA × FB , ΣA, ΓB , δA◦B) values, where each sort σ is (corresponds to) a sub-universe. where δA◦B is defined as follows The basic sorts that we need are the Boolean sort bool, n a/b 0 b/c 0 with the values t and f, and the sort bv of n-bit-vectors, ∃b (b 6=  ∧ p −→A p ∧ q −→B q ) a/c def for n ≥ 1. We also need the sort tuplehσ0,...,σn−1i, for 0 0 a/ 0 0 (p, q) −→A◦B (p , q ) = ∨(p −→A p ∧ c =  ∧ q = q ) n ≥ 1, of n-tuples of elements of sorts σi for i < n. The /c 0 0 sorts are associated with built-in (predefined) functions and ∨(q −→B q ∧ a =  ∧ p = p ) built-in theories. For example, there is a built-in Boolean The first case (disjunct) in the definition of δ means function (predicate) < : bv7 × bv7 → bool that provides a A◦B that some character b is output in state p of A while input in strict total order of all 7-bit-vectors that matches with the the state q of B, thus consuming b in the composed transition standard lexicographic order of ASCII characters. For each that inputs a and outputs c (note that a or c may be ). The n-tuple sort there is a constructor and a projection function second case means that A outputs nothing while inputting π : tuplehσ ,...,σ i → σ , for i < n, that projects the i 0 n−1 i a, thus B stays in the same state. The third case means i’th element from an n-tuple. that B inputs nothing while outputting c, thus A stays in For each sort σ, listhσi is the list sort with element sort the same state. The following property is well-known. σ. Lists are algebraic data types. There is an empty list ε : listhσi and for all e : σ and l : listhσi, [e | l] : listhσi. The Proposition 1. Let A and B be finite transducers. Then accessors are hd : listhσi → σ and tl : listhσi → listhσi [[A ◦ B]] = [[A]] ◦ [[B]].

4 2010/7/19 Similarly to parallel composition of finite automata, the M[[ t ]] = Ident join composition of finite transducers can be done incremen- Ident Const tally using depth first search, avoiding the introduction of M[[ strexpr · w ]] = M[[ strexpr ]] ◦ ( · (w)) states that cannot be reached from the initial state, called M[[ w · strexpr ]] = M[[ strexpr ]] ◦ (Const(w) · Ident) unreachable states. Moreover, all states in a finite transducer from which no final state can be reached, called dead states, can be elmininated through backwards reachability. Both M[[(se) from (first w)  n ]] = M[[ se ]] ◦ FF(w, 0  n) optimizations may significantly decrease the size of the re- sulting composite transducer while preserving equivalence M[[(se) from (last w)  n ]] = M[[ se ]] ◦ FL(w, 0  n) in terms of the denoted relation. M[[(se) upto (first w)  n ]] = M[[ se ]] ◦ UF(w, 0  n) 4.2 Translating Bek Expressions M[[(se) upto (last w)  n ]] = M[[ se ]] ◦ UL(w, 0  n) The evaluation order for Bek programs is relatively straight- forward; each string expression depends either on the input variable t or on another string expression. There are no side M[[ iter[cs in se] (init) {cl} ]] = M[[ se ]] ◦ Iter(cs,init, cl) effects, with the exception of the boolean state available in the iter construct, and that that boolean state is limited in Ident scope to the iter block in which it is defined. This informs = //.-,()*+  {x/x | x∈Σ} our approach: we define the translation function M[[ · ]] re- e cursively, using the composition operator ◦ on transducers to 0 w1 0 Const(w1 :: w ) = /.-,()*+ /.-,()*+  · Const(w ) model nested string expressions. This leads to a single M[[ · ]] / / for each type of strexpr; we give the high-level definition in Const([]) = Figure 4. /.-,()*+/  The Slide function is central to the translations for the first, upto, and iter constructs. For a given finite sort σ, FF(w,n) = Slidebv7 (x) Slideσ takes an integer parameter and produces a transducer: 0 (Q, q , F, σ, tuplehσ ∪ {$}, ..., σ ∪ {$}i, δ) {v/πz(v) | πy (v)...=w} ◦ //.-,()*+  /.-,()*+*  n Y Y so that any input of sort σ is| split into{z partially} overlapping n-tuples. {v/ | πy (v)...6=w} {v/πz(v)} Example 3. We consider a toy example to illustrate how len(w) − n if n ≤ 0 the Slide operation can be implemented using concrete trans- such that x = max len w ,n n> ducers. Figure 5 shows the full transducer for Slide{a,b}(2) ( ( ( ) +1) if 0 (where {a, b} can be modeled using sort bv1). Given an in- put sequence [abba], transducer output is −n if n ≤ 0 0 if n ≤ 0 y = z = [ha,bi (0 if n> 0 (n if n> 0 hb, bi Figure 4. The definition of M[[ · ]], the translation of Bek hb,ai expressions to corresponding finite transducers. The func- ha, $i] tions FL, UF, UL are symmetric with FF. Slide, described in Given a search request (t) from (first b) − 1 applied to this the text, returns a sliding window representation of its in- string, we can output the first a when we see the first pair put to accomodate multi-character search and replacement. ha, bi. Searches that involve last are handled analogously, The integers x,y, and z represent the width of the window, but there we rely on the nondeterminism of the transducer the relative position of the “needle” in the window, and the (i.e., once we see a match we must not see it again).  relative positioning of the desired output, respectively. Intuitively, we use this conversion to provide look-ahead for the search operations first and upto, and to provide the sliding window for iter blocks. For the search opera- yields the updated transducer F 0 and new term P 0. The tion translations (e.g., the definition of FF in Figure 4), we Itr judgment relates the collecting semantics to the output assume implicit special handling of the $ symbol, so that of the function Iter. To construct the transducer, we pro- that symbol never appears in the output of such an opera- ceed as follows. We start with an initial transducer that has tion. Similarly, we ignore yield statements if the character one state for each possible boolean assignment in the Bek value is $. We omit a full algorithmic description of Slide. expression (e.g., 24 states if init declares 4 distinct vari- In practice, a concrete instantiation of Slide-transducers is ables). We assume a mapping from concrete boolean states infeasible because the state space grows exponentially with b to transducer states qb. The transducer’s start state is the n; we discuss our symbolic representation in Section 5. state qb such that b = Red(init), where Red reduces boolean The Iter function converts iter blocks into a correspond- Bek expressions to possibly-open propositional terms. We ing finite transducer. Figure 6 describes a collecting seman- compose this automaton on the left with a Slide transducer tics that defines this transducer. We introduce a judgment to produce a sliding window of the appropriate width. of the form: We process the case blocks in syntactic order (Cases). 0 0 F, P ` expr : F , P Recall that the semantics for case blocks require that we which states that, given an initial transducer F and a execute the first matching case (exclusively). We write F ∪ possibly-open boolean term P , the given expression expr G to denote the transducer F extended with the set of

5 2010/7/19 bc/cc a/aa we add edges qb1 −→ qb2 for each qb1 given the following constraints: Õ a/aa 276540123q3 1. bc defines a feasible character condition. In other words, E /a$ there exists at least one character so that, starting at in b/ab b/ab  boolean state b1, the case condition be is true. a/ 576540123q1 176540123q4 H 2. cc corresponds to the list of yields in the current case. a/aa a/ba Each ci in the character binder is replaced with the ap- /76540123'&%$ q!"#s /b$ propriate projection πi(v), where v refers to the current b/ab b/bb Ø  input vector. We write Yield to indicate the extraction of b/ a/ba 276540123q5 76540123'&%$ q!"#9 list constraints from the case body. ] 29 /b$ 3. b2 is the result of executing the boolean assignments in 76540123q2 a/ba 76540123q6 /a$ the current case, given initial boolean state b and the 3 E 1 b/bb case condition be. We write Symex for the conversion of a

b/bb sequence of boolean assignments to an open propositional term.

Figure 5. Finite transducer for Slide{a,b}(2). For a length Finally, having added the appropriate edges for each case Yield n and a finite sort σ, Slideσ(n) transforms its input to an n- block, we must convert the output alphabet from ’s list tuple output sort tuplehσ ∪ {$},...,σ ∪ {$}i. The outputs sort back to individual characters. Note that the maximum represent a sliding window of the inputs. Note that these length of these lists is bounded by the maximum number of transducers grow very rapidly in the size of σ and n; we yield statements per case. The UnList operation is similar discuss how to avoid concrete instantiation in Section 5. For to Slide (e.g., Figure 5). As with Slide, we avoid instantiating UnList clarity, the diagram elides edges from q1 and q2 to q9. the transducers directly, instead relying on axiomatic definition in the theorem prover. The presentation of the Iter function would be signifi- cantly more complex if we discussed it strictly in terms of (2) (2) concrete character transitions. In the following section, we F, P ` case1 : F , P formalize the notion of symbolic finite transducers. This con- (2) (2) 0 0 F , P ` case2 : F , P cept yields several direct benefits. First, we can avoid instan- Cases 0 0 tiating prohibitively large transducers like those for Slide F, P ` case1 case2 : F , P and UnList by using specialized axioms instead. Second, the symbolic encoding allows us to use the logical definition of Figure 6 directly without much further work. 0 bc/cc F = F ∪ {qb1 −→ qb2 | bc = Red(be ∧ ¬P )(b1) ∧ cc = Yields(cst)(bc) 5. Symbolic Finite Transducers ∧ b2 = Symex(cst)(bc)} P 0 = P ∨ be In this section we develop the theory of symbolic finite trans- 0 0 ducers. The theory lends itself to efficient symbolic analysis F, P ` case(be){cst} : F , P using state-of-the-art satisfiability modulo theories (SMT) solvers, and can be integrated through E-matching [6] with

(2) other theories supported by such solvers. F = Slidebv7 (len(cs)) We first develop a mathematical theory of (symbolic) bv7 bv7 ◦ (Q, qRed(init), Q, , , δ) finite transducers and prove it to be well-defined for the 0 P = false class of well-founded finite transducers. The theory relies on F (2), P 0 ` cases : F 0, P the combination of the theory of algebraic data types [15], F = UnList(F 0) in particular lists, with the theory of uninterpreted function Itr ` Iter(cs, init, cases)= F symbols that builds on the notion of model expansion from model theory [8]. We then discuss how algorithms can be built on top Figure 6. Collecting semantics for constructing transduc- of the symbolic representation of finite transducers with ers that model iter blocks. We represent the boolean states Bek a particular emphasis on symbolic join composition that of the expression using transducer states; we write qb is used in the translation of Bek to finite transducers, as for the states in which boolean expression b is satisfiable. 0 discussed above. We write Red(b)(b ) for the partial application of b as an Finally, we map that theory to a background theory of open propositional term to b. Yields produces a list sort of an SMT solver in terms of universally quantified transducer character constraints. Symex processes case statements and axioms. We discuss how such axioms work in general and converts them to an open propositional term. how they are implemented using the SMT solver Z3, that is currently the only SMT solver that provides built-in support for the given combination of theories. transitions G. We use P to hold the disjunction of the case conditions we have already seen, and for each following case, 5.1 Symbolic finite transducer theory we require that that disjunction be false. In the following, let A = (Q, q0, F, Σ, Γ, δ) be a fixed finite We define the edges that must be added in terms of transducer. We assume that all input characters have the logical conditions. In particalar, for the current case block, same sort sort(Σ) and all output characters have the same

6 2010/7/19 {x/x | x ∈ Σ} {x/ | x ∈ Σ} be a predicate symbol of Th(A) called the acceptor for p. Th(A) contains the following axiom for each Accp:

{x/ | x ∈ Σ} Accp(v : listhsort(Σ)i,w : listhsort(Γ)i) ⇔ q0 q0 q1 (v = ε ∧ w = ε ∧ p ∈ F ) ∨

q∈Q((v 6= ε ∧ w 6= ε ∧ Figure 7. Finite transducer realizing the prefix operation W (p, , ,q) of words in Σ∗ where Σ = Γ. δ (hd(v), hd(w)) ∧ Accq (tl(v), tl(w))) (p, ,,q) ∨(v 6= ε ∧ δ (hd(v)) ∧ Accq(tl(v),w)) (p,, ,q) ∨(w 6= ε ∧ δ (hd(w)) ∧ Accq (v, tl(w))) sort sort(Γ). We use the following definitions to combine all (p,,,q) possible input/output pairs of characters between any fixed ∨(δ ∧ Accq (v,w))) 0 pair (p, q) of states in Q. These definitions play an important The acceptor for A, denoted by AccA, is the acceptor for q . role in a symbolic representation of transitions and in the defintion of the theory of A that is introduced below. Let Note that the acceptor axioms above are written in a δ(p, , ,q)(x,y), δ(p,, ,q)(y), δ(p, ,,q)(x), δ(p,,,q) be predicates, very general form and have not been simplifed. Obviously, where x : sort(Σ) and y : sort(Γ) are free variables, such false disjuncts can simply be eliminated, e.g., when p∈ / F , that, where Σ and Γ are viewed as unary predicates: or when there is no transition from p to q of a certain kind, as illustrated in the following example. The example a/b δ(p, , ,q)(a,b) ⇔ Σ(a) ∧ Γ(b) ∧ p −→ q also illustrates another simplification that can be used to eliminate some reqursive cases. a/ δ(p, ,,q)(a) ⇔ Σ(a) ∧ p −→ q Example 5. Consider the transducer, say Prefix, in Fig- /b δ(p,, ,q)(b) ⇔ Γ(b) ∧ p −→ q ure 7. Assume that both the input and the output alpha- bets contain all characters of sort sort(Σ) (e.g. bv7). Then (p,,,q) / δ ⇔ p −→ q Th(Prefix ) contains the following two axioms.

Note that the predicates can always be represented as ex- Accq0 (v,w) ⇔ (v = ε ∧ w = ε) ∨ plicit disjunctions by combining individual characters, but (v 6= ε ∧ w 6= ε ∧ hd(v)= hd(w) ∧ this would often defeat the purpose of getting a more suc- cinct and more efficient representation for analysis by using Accq0 (tl(v), tl(w))) ∨ built-in functions and implicit symbolic representations. (v 6= ε ∧ Accq1 (tl(v),w)) Acc (v,w) ⇔ (v = ε ∧ w = ε) ∨ Definition 3. We say that A is symbolic if δ is represented q1 by predicates of the above form. (v 6= ε ∧ Accq1 (tl(v),w))  Example 4. Consider the finite transducer in Figure 7. The second axiom is equivalent to Accq1 (v,w) ⇔ w = ε. (q0, , ,q0) The predicate δ (x,y) can be defined as x = y. Both The final simplification in Example 5, say sink-simplifica- (q0, ,,q1) (q1, ,,q1) δ (x) and δ (x) can be defined as t.  tion, can always be applied to acceptor axioms for final states q when Σ contains all characters of sort(Σ), δ(q,x)= We adapt the notion of IDs and step relations from [13] {(q, )} for all x ∈ Σ and δ(q, )= ∅, in which case to finite transducers. An ID is an Instantaneous Description of a possible state of a finite transducer together with an Accq(v,w) ⇔ w = ε input word and output word starting from that state. The list sort formal definition is as follows. Thus, any input v : h (Σ)i is accepted, i.e., the input characters do not have to be individually restricted to Σ Definition 4. An ID of A is a triple (v,q,w) where v ∈ Σ∗, since this is imposed by the sort, while the output must be q ∈ Q, and w ∈ Γ∗. The step relation of A is the binary the empty word (list). Symmetrical simplification rule can relation `A over IDs induced by δ. be applied for output sink states. We write sat(ϕ) for satisfiablity of formula ϕ (modulo (p, , ,q) ([a|v], p, [b|w]) `A (v,q,w) ⇔ δ (a,b) the built-in theories), i.e., sat(ϕ) means that there exists ([a|v],p,w) ` (v,q,w) ⇔ δ(p, ,,q)(a) a model M that provides an interpretation for all the un- A interpreted function symbols in ϕ such that M |= ϕ. Note (p,, ,q) (v,p, [b|w]) `A (v,q,w) ⇔ δ (b) that the uninterpreted function symbols in Th(A) are the (p,,,q) acceptors. Also, given a theory T , we write T for ϕ. (v,p,w) `A (v,q,w) ⇔ δ ϕ∈T The correctness criterion that we need ThV(A) toV fulfill is The following proposition is an immediate consequence sat( Th(A) ∧ AccA(v,w)) if an only if v[[A]]w. To this end, of the definitions. we needV to consider finite transducers whose step relation is 0 ∗ well-founded. Proposition 2. v[A]w ⇔∃q ∈ F ((v, q ,w) `A (, q, )). Theorem 1. If `A is well-founded then v[[A]]w if and only The overall idea behind the theory Th(A) introduced if sat( Th(A) ∧ AccA(v,w)). next is to precisely characterize [[A]]. The definition is es- V sentially an axiomatic formalization of `A. Proof. Assume `A is well-founded. Thus, since Q is finite, Definition 5. Let A be as above. For each p ∈ Q, let there exists a well-ordering Q over Q such that list sort list sort bool + Accp : h (Σ)i× h (Γ)i→ p Q q ⇒ ¬((, q, ) `A (, p, )).

7 2010/7/19 ∗ ∗ Define the lexicographic order over Σ × Γ × Q as : Example 6. Consider the transducer ee: / . def (v,w,q) (v0,w0, q0) = |v| > |v0| ∨ Thus, [[ee]] = {(, )} and Th(ee) is 0 0 (|v| = |v | ∧ |w| > |w |) ∨ {Accee (v,w) ⇔ (v =  ∧ w = ) ∨ Accee (v,w)}. 0 0 0 (|v| = |v | ∧ |w| = |w | ∧ q Q q ) For example let M be a model such that M |= Accee (v,w) The following statement follows by induction over using for all v and w, then M |= Th(ee), but v[[A]]w does not hold Definition 5. For all p ∈ Q, v ∈ Σ∗, and w ∈ Γ∗: for all v and w.  ∗ The following theorem follows from Theorem 1, Propo- ∃q ∈ F ((v,p,w) `A (, q, )) ⇔ sat( Th(A) ∧ Accp(v,w)) sition 3, and Theorem 2, and outlines the algorithm in a 0 V Finally, let p = q and use Proposition 2. nut-shell for creating a soft-theory plug-in for A for an SMT solver, that in our case is Z3 [7]. The following proposition provides a useful condition e Theorem 3. v[[A]]w ⇔ sat( Th(A) ∧ AccA(v,w)) over the structure of A that is equivalent to `A being well- V founded; the proposition reflects the role of Q in the proof When asserting Th(A) as a soft theorye to an SMT solver, of Theorem 1. An -loop is a nonemty path of -moves the first assumption is that the solver actually supports lists / e p −→ q that starts and ends in the same state. as a built-in algebraic data type, which, unlike the acceptors, cannot be defined through uninterpreted functions, since the Proposition 3. `A is well-founded ⇔ A is -loop-free. theory of algebraic data types is not first-order definable [15]. The practical significance of the proposition is that there Note that the proof of Theorem 1 would fail without this as- is an efficient algorithm that given A in symbolic form sumption, where is defined in terms of lengths of words, constructs an equivalent -loop-free finite transducer from A which is well-defined since the notion of counting the ele- in symolic form (provided that disjunction over predicates ments of a list is well-defined. is supported efficiently). 5.2 Symbolic finite transducer algorithms While full -move elimination may cause quadratic in- crease in the number of symbolic transitions (by eliminating The built-in theory integration of state-of-the-art SMT sharing), -loop elimination does not increase the number of solvers can be exploited, to some degree, for directly encod- symbolic transitions. For symbolic analysis, full -move elim- ing finite transducer algorithms symbolically. One particular ination may reduce the performance considerably, similar to algorithm that we need is join composition of finite trans- the case of symbolic finite automata [21]. ducers. The following propostion shows a direct encoding of The following definition provides the key idea behind join composition. the -loop elimination algorithm. Recall the definition of Proposition 4. Assume sort(Γ ) = sort(Σ ). Then -closure, denoted here by (q), as the closure of {q}, for A B e e q ∈ Q, by -moves [13] (where stated for finite automata, but sat( Th(A) ∪ Th(B) ∧∃z (AccA(v,z) ∧ AccB (z,w))) if and is similar for finite transducers). Similarly, define (q) as the onlyV if v[[A ◦ B]]w. def e e closure of {q} by -moves in reverse. Let q = (q)∩(q) (note def Proof. The folowing statements are equivalent: that {q}⊆ q) and lift the notion to sets: P = {p | p ∈ P }. e def 1. sat( Th(A) ∪ Th(B) ∧∃z (Acc e(v,z) ∧ Acc e (z,w))) Definitione 6. Let A = (Q, q0, F , Σ, Γ, δ)e wheree A B e 2. ∃z s.t.V sat( Th(A) ∧ AccA(v,z)) and sat( Th(B) ∧ def e e (p,e , ,qe) (p, , ,q) e δ e = e e e δ e AccB(z,w))V V p∈p,qe ∈qe e e (p,e ,,qe) def W (p, ,,q) 3. ∃z s.t. v[[A]]z and z[[B]]w. δe = p∈p,qe ∈qe δ def (p,,e ,qe) W (p,, ,q) The equivalence between 1 and 2 holds by disjointness of the δe = p∈p,qe ∈qe δ e e def δ(p,,,q) = W δ(p,,,q) uninterpreted function symbols (acceptors) of the theories. e p∈p,qe ∈q,e pe6=qe The equivalenve between 2 and 3 follows from Theorem 3. W Note thate if A is already -loop-free (such as the trans- Finally, use Proposition 1. ducer in Figure 7) then A and A are isomorphic. The al- gorithm for constructing A can be implemented as a graph While absence of -moves is preserved for example by e algorithm that collapses -loops into single nodes and joins parallel composition of finite automata, this is not the case the labels with disjunction.e The algorithm is linear in the for join composition of finite transducers. number of nodes plus edges (symbolic transitions), which Example 7. Consider the transducers e : / and may be an order of magnitude smaller than the number of concrete transitions. The actual complexity depends on the e: / that have no -moves, and where the input complexity of computing disjunctions (that is, without sim- plifications, an O(1) operation in most SMT solvers). and output aphabets are, say bool. Then e ◦ e = ee with The following theorem follows from Definition 6 and by ee as in Example 6. It is therefore interesting to note that using techniques similar to the proof of equivalence between Th(e ) ∪ Th( e) is well-defined by Proposition 4. Note that, nondeterministic finite automata and nondeterministic finite with sink-simplification, as explained after Example 5, the automata with epsilon-moves (see [13]). axioms for Th(e ) and Th( e) are Acce ( ,w) ⇔ w = ε and Acc e (v, ) ⇔ v = ε, respectively.  Theorem 2. A is -loop-free and [[A]] = [[A]]. In general, we can take acceptors for regular [21] and con- Theorem 1 failse if we omit the conditione that `A is well- text free [20] languages an combine them with finite trans- founded as shown by the following example. ducer acceptors and use SMT to solve them. For example,

8 2010/7/19 suppose L is a with a theory Th(L) defin- a lot of the boilerplate for the current project, in particular ing the acceptor AccL such that AccL(v) iff v ∈ L, and A is the language acceptors for regular expressions, that are built a finite transducer then {w | ∃v (AccL(v) ∧ AccA(v,w))} is directly using the internal .NET parser for regexes. the relational image of L under A. While such direct encodings have certain advantages, 6.1 Sliding Window Axioms such as generality, they cannot easily cope with unsatisfi- One of the algorithmic difficulties we encountered, when able solutions when the acceptors are recursive and accept dealing with creating transducers from Bek, is related to infinite languages. For example, a symbolic join composition maintaining a sliding window of characters (providing a algorithm that first constructs A◦B may discover that A◦B look-ahead in the input string) or for outputting multiple is empty, while the direct use of Th(A) ∪ Th(B) does not characters in an iter block, which in some cases may lead terminate. There are many nonobvious algorithmic tradeoffs to an explosion of the transducer state space for general that arise with the symbolic algorithms for finite transduc- purpose Bek programs. For example, assuming a look-ahead ers, similar to the case with finite automata [9], that are a of 4 characters in the output string and a relatively small subject of ongoing research. alphabet size of 20 characters already led to a transducer 5.3 Implementation with SMT solvers with over 200000 states. A technique that enabled us to scale the approach on The general idea behind the encoding of Th(A) of a well- a collection of micro-benchmarks, was to use additional ax- founded finite transucer A as a theory of an SMT solver, is ioms and combine them with the acceptor axioms. In partic- similar to the encoding of language acceptors [20]. We use ular, when outputting a list of strings (that are represented particular kinds of axioms, all of which are equations of the as lists of bounded length) we use an axiom for folding such form lists back to lists of singleton characters that are then fed to ∀x¯(tlhs ⇔ trhs) (1) another transducer acceptor or an automaton acceptor. For example, for upper bound 3, the following axiom is used where FV (tlhs) =x ¯ and FV (trhs) ⊆ x¯. The left-hand- side tlhs of (1) is called the head of (1) and the right- fold(x:listhlisthσii,y:listhσi) ⇔ (x = ε ∧ y = ε) ∨ hand-side trhs of (1) is called the body of (1) While SMT (x 6= ε ∧ hd(x) 6= ε ∧ tl(hd(x)) = ε ∧ solvers support various kinds of patterns in general for y 6= ε ∧ hd(hd(x)) = hd(y) ∧ fold(tl(x), tl(y))) ∨ triggering axioms, here we assume that the pattern of an hd(x) has exactly 2 characters case ∨ equational axiom is always its head. The acceptor symbols hd(x) has exactly 3 characters case are declared to the SMT solver as uninterpreted Boolean function symbols with the given sorts. In our encoding of E.g., fold([[a,b], [c,d,e], [f]], [a,b,c,d,e,f]). Using axioms of Th(A), each acceptor axiom in Th(A) is represented by two this kind, we connect several acceptors in a chain (avoiding axioms, one for the case Accp(ε,w) and one for the case the state space explosion), as in ϕ: Accp([x|v],w), motivated by our application domain where proofs are triggered by input words. ∃xyz (AccA(x,y) ∧ fold(y,z) ∧ AccB (z)), Such axioms are asserted as equations that are expanded where A is a transducer generated from a sanitizer, and during proof search. Expanding the formula up front is prob- B is an acceptor for a regex pattern of disallowed output lematic since the equational axioms are in general mutually strings. Then ϕ is satisfiable iff the sanitizer has a bug, recursive and a naive a priori exhaustive expansion would i.e., when there exists an input x that may produce an in most cases not terminate, while straight-forward depth- unwanted output. Moreover, the actual model generation bounded expansions are impractical as the size of the ex- with Z3 yields concrete witnesses for the existential variables pansion is easily exponential in depth. Well-foundedness of and if no model is found then the sanitizer is correct with A guarantees termination of the expansion process during respect to B. proof search. We use some features that are specific to Z3, including the 6.2 Macrobenchmarks integrated combination of decision procedures for algebraic data-types, integer linear arithmetic, bit-vectors and quanti- We then applied our framework to the analysis of code from fier instantiation. We also make use of incremental features real Web programs. A basic sanitization function in the Web so that we can manipulate logical contexts while explor- context is “HTMLEncode,” which takes a string and “es- ing different combinations of constraints. Working within a capes” characters such as angle brackets. This sanitization context enables incremental use of the solver. A context in- function has been re-implemented multiple times for differ- cludes declarations for a set of symbols, assertions for a set ent Web programs and libraries. Do all these implementa- of formulas, and the status of the last satisfiability check (if tions compute the same function? If not, is the set of char- any). There is a current context and a backtrack stack of acters escaped by one a superset of the characters escaped previous contexts. Contexts can be saved through pushing by another? These questions are important because failing and restored through popping. This feature is used for imple- to escape some characters can directly lead to a cross-site menting the satisfiability checks performed during symbolic scripting attack by an adversary who can use the unescaped join composition of finite transducers. character to change a web browser’s behavior. To answer these questions, we translated five different implementations of the HTMLEncode function to the Bek 6. Implementation and Case Study language by hand. Our implementations were from the Sys- Our implementation contains roughly 5000 lines of C# code tem.Web DLL version 2.0.0.0, an internal Microsoft product, implementing the basic transducer algorithms and Z3 inte- and three versions of the AntiXSS library distributed by gration, and 1000 lines of F# code for translation from Bek. Microsoft. We used the free Reflector decompilation tool to The work builds, in part, on the earlier Rex project where extract C# code from the compiled binary and then trans- basic automata axioms were introduced [21]. Rex provides lated the resulting code to Bek. The translation process took

9 2010/7/19 rougly a day of work, the majority of which was spent un- 7. Related Work derstanding the original implementations. Saner combines dynamic and static analysis to validate The original sanitizer implementations themselves were sanitization functions in web applications[1]. Saner creates 28, 63, 64, 111, and 147 lines of C# code. We discovered finite state transducers for an over-approximation of the that all five implementations of HTMLEncode could be eas- strings accepted by the sanitizer using static analysis of ily represented as a simple Bek iteration over single charac- existing PHP code. In contrast, our work focuses on a ters of the input string. Each iteration had 256 cases, one simple language that is expressive enough to capture existing for each potential character value. We used metaprograms in sanitizers or write new ones by hand, but then compile to Perl to output the C# constructor code needed by our im- symbolic finite state transducers that precisely capture the plementation to create parse trees for Bek programs; these sanitization function. Saner also treats the issue of inputs metaprograms ranged from 66 to 126 lines. Our results show that may be tainted by an adversary, which is not in scope that Bek is expressive enough to handle this crucial Web san- for our work. Our work also focuses on efficient ways to itizer and that the translation effort does not incur undue compose sanitizers and combine the theory of finite state programmer time or overhead. We then focused on charac- transducers with SMT solvers, which is not treated by Saner. ters common in cross-site scripting attacks designed to foil Minamide constructs a string analyzer for PHP code, sanitization. We can easily check whether such characters then uses this string analyzer to obtain context free gram- can be legal outputs of a sanitizer simply by transforming its mars that are over-approximations of the HTML output by Bek program to a symbolic finite state transducer, asserting a server[16]. He shows how these grammars can be used to that the output of the transducer is equal to the character find pages with invalid HTML. Our work treats issues of in question, and then using our framework to solve for an composition and state explosion for finite state transducers input that yields the character. by leveraging recent progress in SMT solvers, which aids Our framework discovered that the single quote character us in reasoning precisely about the transducers created by is a legal output of the System.Web HTMLEncode imple- transformation of Bek programs. mentation. This is a security bug in the System.Web imple- Wasserman and Su also perform static analysis of PHP mentation, because the single quote character can be used code to construct a grammar capturing an over-approximation in some HTML contexts to close string literals and open the of string values. Their application is to SQL injection at- way for a browser to treat subsequent strings as Javascript. tacks, while our framework allows us to ask questions about The System.Web implementation, which also happens to any sanitizer [22]. Follow-on work combines this work with be the most difficult to understand C# implementation of dynamic test input generation to find attacks on full PHP HTMLEncode, simply fails to transform single quotes under web applications [23]. Our focus is specifically on sanitizers any circumstances. While this was difficult for us to deter- instead of on full applications; we emphasize precision of the mine by visual inspection of the C# code, our framework analysis over scaling to large amounts of code. was able to solve for an example input exhibiting the prob- Christensen et al.’s Java String Analyzer is a static anal- lem in less than a second. We discovered that the problem ysis package for deriving finite automata that characterize was previously known independent of our work [3]. an over-approximation of possible values for string variables Our framework also showed that there are no strings of in Java[5]. The focus of their work is on analyzing legacy five characters or less that result in single quotes in a le- Java code and on speed of analysis. In contrast, we focus gal output from the other four sanitizer implementations. on precision of the analysis and on constructing a specific The “five characters or less” is a restriction we specified on language to capture sanitizers, as well as on the integration the search by specifying a recursion depth of five for our with SMT solvers. fold axioms. We noticed that these four implementations Our work is complementary to previous efforts in ex- of HTMLEncode have the property that they do not drop tending SMT solvers to understand the theory of strings. characters from the input on any path. Therefore, our frame- HAMPI [14] and Kaluza [19] extend the STP solver to han- work’s results are sufficient to show that no legal output of dle equations over strings and equations with multiple vari- these sanitizers can contain single quotes. ables. Rex extends the Z3 solver to handle We encountered performance problems, however, when constraints [21], while Hooimeijer et al. show how to solve scaling our framework to check for longer in the subset constraints on regular languages [11]. We in contrast output, such as strings found on the XSS Cheat Sheet web show how to combine any of these solvers with finite au- site. These longer substrings require specifying a deeper re- tomata whose edges can take symbolic values in the theories cursion depth for our fold axioms, because otherwise the understood by the solver. search will not consider strings long enough to yield the de- sired substring. Unfortunately, this deeper recursion depth also yields a larger search space. We were not able to syn- thesize new inputs exhibiting potential cross-site scripting 8. Conclusions attacks for any of our Bek implementations, with a max- imum of 5 minutes for each invocation of the underlying We have shown that combining SMT solvers with finite state constraint solver. transducers is a powerful way to reason precisely about the In the case of the sanitizers we considered, it would be sanitization functions used for enforcing security guarantees. possible to directly encode all five implementations as reg- We introduced a special purpose language for capturing such ular transducers. This would allow bypassing the need for sanitizers and showed that it is expressive enough to model special Z3 axioms. Of course, some sanitizers, such as the real implementations of sanitizers. Future work points to a wu-ftpd sanitizer we showed in Section 2, can not be cap- rich interplay between our style of symbolic finite transduce tured this way. Exploring the performance and expressibility translation to logical formulas, improvement in the ability tradeoffs of a larger scale practical study is future work. of SMT solvers to reason about string constraints, and the expressivity of language required to capture the special cases of transducers seen in practical sanitization examples.

10 2010/7/19 References [21] M. Veanes, P. de Halleux, and N. Tillmann. Rex: Symbolic [1] D. Balzarotti, M. Cova, V. Felmetsger, N. Jovanovic, Regular Expression Explorer. In ICST’10. IEEE, 2010. E. Kirda, C. Kruegel, and G. Vigna. Saner: Composing [22] G. Wassermann and Z. Su. Sound and precise analysis of static and dynamic analysis to validate sanitization in web web applications for injection vulnerabilities. In PLDI’07: applications. In SP ’08: Proceedings of the 2008 IEEE Proceedings of the 2007 ACM SIGPLAN conference on Symposium on Security and Privacy, 2008. Programming language design and implementation, pages [2] N. Bjørner, N. Tillmann, and A. Voronkov. Path feasibility 32–41. ACM, 2007. analysis for string-manipulating programs. In TACAS’09, [23] G. Wassermann, D. Yu, A. Chander, D. Dhurjati, H. Ina- volume 5505 of LNCS, pages 307–321. Springer, 2009. mura, and Z. Su. Dynamic test input generation for web applications. In ISSTA, 2008. [3] N. Calinoiu. Httputility.htmlattributeencode security and rendering issues, 2005. Bug ID 103332, Microsoft Connect [24] Z3. http://research.microsoft.com/projects/z3. Visual Studio Feedback. Appendix: Symbolic Join Composition [4] CERT. CERT advisory CA-2001-33 multiple vulnerabilities in WU-FTPD, 2001. We start by providing some supporting definitions. We then [5] A. S. Christensen, A. Møller, and M. I. Schwartzbach. describe the algorithm that, given two symbolic finite trans- Precise Analysis of String Expressions. In SAS, pages 1–18, ducers A and B constructs a transducer AB such that 2003. [[AB]] = [[A]] ◦ [[B]]. Finally, we provide a correctness argu- [6] L. de Moura and N. Bjørner. Efficient E-matching for SMT ment for the algorithm. solvers. In CADE-21: Proceedings of the 21st International Definition 7. An input/output condition with sort σ/γ is Conference on Automated Deduction, volume 4603 of LNAI, a formula IO(x:σ, y:γ). pages 183–198. Springer, 2007. [7] L. de Moura and N. Bjørner. Z3: An Efficient SMT Solver. Definition 8. Let IO1(x:σ, y:ρ) and IO2(y:ρ,z:γ) be in- In TACAS’08, LNCS. Springer, 2008. put/output conditions. The join of IO 1 with IO2, IO 1 ◦IO2, is the input/output condition ∃y (IO1 ∧ IO2) with sort σ/γ. [8] W. Hodges. Model theory. Cambridge Univ. Press, 1995. [9] P. Hooimeijer and M. Veanes. An evaluation of automata Definition 9. Let C(x:σ) be a predicate and let IO(x:σ, y:γ) algorithms for string analysis. Technical Report MSR-TR- be an input/output conditon. Then C ◦ IO is the predicate 2010-90, Microsoft Research, July 2010. ∃x (C(x) ∧ IO(x,y)). [10] P. Hooimeijer and W. Weimer. A decision procedure for Definition 10. Let C(y:γ) be a predicate and let IO(x:σ, y:γ) subset constraints over regular languages. In PLDI, pages be an input/output conditon. Then IO ◦ C is the predicate 188–198, 2009. ∃y (IO(x,y) ∧ C(y)). [11] P. Hooimeijer and W. Weimer. A decision procedure for Definition 11. Given an input/output condition IO(x,y) subset constraints over regular languages. In PLDI ’09: M M Proceedings of the 2009 ACM SIGPLAN conference on we let [[IO]] denote the binary relation {(x ,y ) | M |= Programming language design and implementation, pages IO}. Given a predicate C with a single free variable x we 188–198, New York, NY, USA, 2009. ACM. let [[C]] denote the set (unary relation) {xM | M |= C}. [12] P. Hooimeijer and W. Weimer. Solving string constraints We assume that sat(ϕ) is a Boolean value that is t iff lazily. In ASE 2010: Proceedings of the 25th IEEE/ACM ϕ is satisfiable, meaning that there exists a model M that International Conference on Automated Software Engineer- assigns values to all the uninterpreted function symbols in ing, 2010. ϕ such that M |= ϕ, where each free variable of ϕ is treated [13] J. E. Hopcroft and J. D. Ullman. Introduction to Automata as an uninterpreted constant symbol. Theory, Languages, and Computation. Addison Wesley, The intuition behind the following definition is that it 1979. symbolically represents δA for a given start state p and a [14] A. Kiezun, V. Ganesh, P. J. Guo, P. Hooimeijer, and M. D. given end state q, i.e., the transition label associated with a Ernst. HAMPI: a solver for string constraints. In ISSTA pair of states (p, q) is the four-tuple: ’09, pages 105–116. ACM, 2009. (p, , ,q) (p, ,,q) (p,, ,q) (p,,,q) [15] A. Mal’cev. The Metamathematics of Algebraic Systems. hδA , δA , δA , δA i North-Holland, 1971. Definition 12. A transition label ` with sort σ/γ, is a four- [16] Y. Minamide. Static approximation of dynamically tuple hIO(x:σ, y:γ),I(x:σ), O(y:γ),ei of predicates, generated web pages. In WWW ’05: Proceedings of the def 14th International Conference on the World Wide Web, • IO(`) = IO is the input/output condition of `; def pages 432–441, 2005. • I(`) = I is the input condition of `; def [17] G. Rozenberg and A. Salomaa, editors. Handbook of Formal • O(`) = O is the output condition of `; Languages, volume 1. Springer, 1997. def • (`) = e, e ∈ {t, f}, is the -condition of `; [18] P. Saxena, D. Akhawe, S. Hanna, F. Mao, S. McCamant, and D. Song. A symbolic execution framework for javascript. ` is feasible if one of sat(IO), sat(I), sat(O), or e is t. Technical Report UCB/EECS-2010-26, EECS Department, University of California, Berkeley, Mar 2010. The following operations are used in the join algorithm.

[19] P. Saxena, D. Akhawe, S. Hanna, S. McCamant, F. Mao, Definition 13. Let `1 and `2 be transition labels. and D. Song. A symbolic execution framework for javascript. In IEEE Security and Privacy, 2010. `1 = hIO1(x,y),I1(x), O1(y),e1i [20] M. Veanes, N. Bjørner, and L. de Moura. Solving extended `2 = hIO2(y,z),I2(y), O2(z),e2i regular constraints symbolically. Technical Report MSR- Then ` = `1 ◦`2, the join of `1 with `2, is the transition label TR-2009-177, Microsoft Research, 2009. hIO(x,z),I(x), O(z), sat(O1 ∧ I2)i

11 2010/7/19 where Update ∆ with (p, q) 7→ hf, f, O(`2), (`2)i;

f, if sat(IO1 ◦ IO2)= f; 0 0 IO = Finish: Let AB =(Q, hqA, qB i,Q ∩ FA × FB , ΣA, ΓB , ∆).  IO1 ◦IO2 , otherwise. [Cleanup:] Remove dead states from AB.A dead state is f, if sat(IO ◦ I )= f; I = 1 2 a noninitial state from which no final state is reachable.  IO1 ◦ I2 , otherwise. The final cleanup step is optional with respect to the in- f, if sat(O ◦ IO )= f; O = 1 2 tended semantics but may reduce the size of AB consider-  O1 ◦ IO2, otherwise. ably. Note that the algorithm does not require A or B to be Note that the -conditions of `1 and `2 do not affect the -loop-free. Also, AB construction is need for Th(AB). The definition of `. However, the -condition of ` is determined -loop elimination algorithm uses the sum operation over g g by whether the output condition of `1 is consistent with the labels and does not require satisfiability checking of labels, input condition of `2 or not. since label-sums cannot produce infeasible transition labels from feasible ones. Definition 14. Let `1 and `2 be transition labels Correctness argument Termination of the algoritm fol- `1 = hIO1(x,y),I1(x), O1(y),e1i lows from the standard argument of DFS, i.e., in this par- `2 = hIO2(x,y),I2(x), O2(y),e2i ticular case, an element is pushed to the stack S only if it Then ` = `1 + `2, the sum of `1 and `2, is the transition is not in Q, i.e., has not already been in S, and the number label hIO1 ∨ IO2,I1 ∨ I2, O1 ∨ O2, sat(e1 ∨ e2)i. of elements is bounded by the size of QA × QB. Obviously, termination is contingent upon termination of satisfiability Recall that a formula ϕ is existential if ϕ is equivalent checking, that is being assumed. The partial correctness of to a formula ∃xψ¯ where ψ is quantifier free. We say that the algorithm, AB is clean and [[AB]] = [[A]] ◦ [[B]], follows a transition label ` is existential if all conditions of ` are from the following arguments. First, cleanness is implied by existential. that in each update of ∆ with (p, q) 7→ `, ` is feasible and fea- Proposition 5. Sums and joins of labels are existential if sibility is preserved by sums. Second, S represents a frontier the arguments are existential. of reachable states, the algorithm computes the closure Q of all reachable states. The order of picking elements from S is Definition 15. We say that A is clean if all transition lables irrelevant. Once a state p is chosen, all possible combinations in A are feasible. of outgoing transitions from p are added. When a transition Proposition 5 is central for efficient implementation of (p, q) 7→ ` is added, it follows from Definition 2 that ` rep- (p,q) sat(ϕ) during the construction of join of transition labels resents δA◦B , where the additional input-only and output- using an SMT solver, in order to maintain cleanness of the only moves correspond to the separate cases in Definition 2. resulting finite transducer of the join algorithm. It follows that [[AB]] = [[A ◦ B]] and thus [[AB]] = [[A]] ◦ [[B]] Each symbolic transducer A is assumed to be represented by Proposition 1. (p,q) so that ∆A(p) yields the set of all pairs hδA , qi from the state p ∈ QA, where (p,q) (p, , ,q) (p, ,,q) (p,, ,q) (p,,,q) δA = hδA , δA , δA , δA i. The join algorithm is described as a DFS algorithm. Sym- bolic transitions of AB are maintained using a dictionary ∆ from QAB ×QAB to transition labels. Updating ∆ with k 7→ ` stands for either adding the new entry k 7→ ` to ∆, if k is not a key in ∆, or updating ∆(k) to ∆(k)+ `, otherwise. Input: The input to the algorithm is a pair of clean finite transducers A and B such that sort(ΓA)= sort(ΣB ). Output: The output of the algorithm is a clean finite trans- ducer AB such that [[AB]] = [[A]] ◦ [[B]]. 0 0 Initialize: Let S =(hqA, qB i) be a stack of AB states. 0 0 Let Q = {hqA, qB i} be a set of AB states. Let ∆ = ∅ be a dictionary from AB states to labels. def s(q) = If q∈ / Q then add q to Q and push q to S; Explore: While S is nonempty:

Pick a state: Pop p = hp1,p2i from S.

For each: h`1, q1i ∈ ∆A(p1) and h`2, q2i ∈ ∆B(p2):

If feasible(`1 ◦ `2): let q = hq1, q2i; s(q);

Update ∆ with (p, q) 7→ `1 ◦ `2;

If (`1)= t or sat(I(`1)): let q = hq1,p2i; s(q);

Update ∆ with (p, q) 7→ hf,I(`1), f, (`1)i;

If (`2)= t or sat(O(`2)): let q = hp1, q2i; s(q);

12 2010/7/19