Modeling Imperative String Operations with Transducers

Modeling Imperative String Operations with Transducers Pieter Hooimeijer David Molnar Prateek Saxena University of Virginia Microsoft Research UC Berkeley [email protected] [email protected] [email protected] Margus Veanes Microsoft Research [email protected] Abstract lems if the code fails to adhere to that intended structure, We present a domain-specific imperative language, Bek, causing the output to have unintended consequences. that directly models low-level string manipulation code fea- The growing rate of security vulnerabilities, especially in turing boolean state, search operations, and substring sub- web applications, has sparked interest in techniques for vul- stitutions. We show constructively that Bek is reversible nerability discovery in existing applications. A number of through a semantics-preserving translation to symbolic fi- approaches have been proposed to address the problems as- nite state transducers, a novel representation for transduc- sociated with string manipulating code. Several static anal- ers that annotates transitions with logical formulae. Sym- yses aim to address cross-site scripting or SQL injection bolic finite state transducers give us a new way to marry by explicitly modeling sets of values that strings can take the classic theory of finite state transducers with the recent at runtime [5, 16, 22, 23]. These approaches use analysis- progress in satisfiability modulo theories (SMT) solvers. We specific models of strings that are based on finite automata exhibit an efficient well-founded encoding from symbolic fi- or context-free grammars. More recently, there has been nite state transducers into the higher-order theory of alge- significant interest in constraint solving tools that model braic datatypes. We evaluate the practical utility of Bek as a strings [2, 10, 12, 14, 18, 20, 21]. String constraint solvers constraint language in the domain of web application saniti- allow any client analysis to express constraints (e.g., path zation code. We demonstrate that our approach can address predicates) that include common string manipulation func- real-world queries regarding, for example, the idempotence tions. and relative strictness of popular sanitization functions. Neither existing static approaches nor string constraint solvers are currently well-equiped to model low-level imper- Categories and Subject Descriptors D.2.4 [Software ative string manipulation code. In this paper, we aim to ad- Engineering]: Software/Program Verification—Validation; dress this niche. Our approach is based on the insight that D.2.4 [Software Engineering]: Software/Program Verification— many security-critical string functions can be modeled pre- Model checking; F.3.1 [Logics and Meanings]: Specifying cisely using finite state transducers over a symbolic alpha- and Verifying and Reasoning about Programs—Mechanical bet. We present Bek, a domain-specific language for mod- verification eling string transformations. The language is designed to be (a) sufficiently expressive to model real-world code, and (b) General Terms Algorithms, Languages, Theory, Verifi- sufficiently restricted to allow precise analysis using trans- cation ducers. Bek can model real-world sanitization functions, such as those in the .NET System.Web library, without ap- Bek 1. Introduction proximation. We provide a translation from expressions to the theory of algebraic datatypes, allowing Bek expres- A large fraction of security vulnerabilities (including SQL in- sions to be used directly when specifying constraints for an jection, cross-site scripting, and buffer overflows) arise due SMT solver, in combination with other theories. to errors in string-manipulating code. Developers frequently Key to enabling the analysis of Bek expressions is a new use low-level string operations, like concatenation and sub- theory of symbolic finite state transducers, an extension of stitution, to manipulate data that must follow a particular standard form finite transducers that we introduce formally high-level structure, like HTML or SQL. This leads to prob- in this paper. We develop a theory of symbolic transducers, showing its integration with other theories in SMT solvers that support E-matching [6]. We describe a tractable encoding of symbolic finite transducers into the theory of algebraic data types and prove its well-foundedness (i.e., given sufficient resources, any given query yields a finite-length proof) using model-theoretic insights. We introduce the con- cept of join composition, which enables us to preserve the key property of reversibility (i.e, given an output, produce corresponding inputs) that is crucial for checking sanitizer [Copyright notice will appear here once ’preprint’ option is removed.] 1 2010/7/19 correctness. Finally, we present a translation of Bek expres- void site_exec( char *cmd) sions into symbolic finite transducers. { To evaluate our approach, we show that several pop- char buf[MAXPATHLEN], *slash, *t; Bek /* sanitize the command string */ ular sanitization procedures can be ported to with char *sp = ( char *) strchr (cmd, ’ ’); 5 little effort. Each such port matches the behavior of the if (sp == 0) { original procedure without any need for conservative over- while ((slash = strchr (cmd, ’/’)) != 0) approximation. We use a prototype implementation of Bek cmd = slash + 1; that uses the Z3 SMT solver [24] as the back end interpreter. } During our experiments, we were able to generate witnesses else { 10 for a previously-known vulnerability. We demonstrate that while (sp && the prototype can resolve queries that are of practical in- (slash = ( char *) strchr (cmd, ’/ ’)) && terest to both users and developers of sanitization routines, (slash < sp)) cmd = slash + 1; such as “do two sanitizers exhibit deviant behaviors on cer- } 15 tain inputs,” “do multiple applications of a sanitizer intro- for (t = cmd; *t && !isspace(*t); t++) duce errors,” or “given an possibility of attack output, what if (isupper(*t)) *t = tolower(*t); is the maximal set of corresponding inputs that demonstrate /* build the command */ the attack?” int pathlen = strlen (PATH); The primary contributions of this paper are: int cmdlen = strlen (cmd); 20 if (pathlen + cmdlen +2 > sizeof (buf)) • We formally describe a domain-specific language, Bek, return ; for string manipulation. We describe a syntax-driven sprintf(buf, "%s/%s", PATH, cmd); translation from Bek expressions to symbolic finite state /* ... execute buf, store results ... */ transducers. fprintf(remote_socket , cmd); 25 } • We formalize symbolic finite state transducers and their reduction to the theory of algebraic datatypes, including Figure 1. Source code using hand-written sanitization and the intersection and composition operations. checks to avoid a buffer overrun (successfully, line 21) and • We show that Bek can encode real-world string manip- a format string vulnerability (unsuccessfully, line 25), and ulating code used to sanitize untrusted inputs in Web enforce path-related policies (successfully). applications. Within this domain, we demonstrate several applications that are of direct practical interest. removed from the command, not from the arguments. The The rest of this paper is structured as follows. We provide a strchr invocation on line 5 is used to check if any spaces motivating example in Section 2. Sections 3–5 describe our are present (line 6). If so, a more complicated version of approach. Section 3 describes Bek in terms of its large step the slash-skipping logic is used (lines 10–15) that only ad- operational semantics. Section 4 provides language-theoretic vances cmd past slashes before the first space. Lines 18– definitions and presents the translation from Bek expres- 22 build the command that will be executed (e.g., com- sions to transducers. In Section 5, we formalize a theory of pleting the transformation from "/usr/bin/ls -l *.c" to symbolic finite transducers based on the theory of algebraic "/home/ftp/bin/ls -l *.c") by using sprintf to concatenate datatypes. We present our case studies in Section 6. Finally, the trusted directory, a slash, and the suffix of the user we discuss closely related work in Section 7 and conclude in command. The check on line 21 prevents a buffer overrun Section 8. on the local stack-allocated variable buf by explicitly adding together the two string lengths, one byte for the slash, and 2. Motivating Example one byte for C’s null termination, and comparing the result against the size of buf. We put our work in context using a code fragment from More tellingly, while the code correctly avoids buffers version 2.6.0 of wu-ftpd, a file transfer server, written in overruns and implements its path-based security policy, it C, that has a known format string vulnerability. Figure 1 is vulnerable to a format string attack. Since the user’s shows a fragment of code for handling SITE EXEC commands command is passed as the format string to fprintf (line from wu-ftpd. The SITE EXEC portion of the file transfer 25), if it contains sequences such as %d or %s they will protocol allows remote users to execute certain commands be interpreted by printf’s formatting logic. This typically on the local server. The cmd string holds untrusted data results in random output, but careful use of the uncommon provided by such a remote user; an example benign value %n directive, which instructs printf to store the number of is "/usr/bin/ls -l *.c". This code is an indicative example characters written so far through an integer pointer on the of string processing in the wild: it tries to accomplish several stack, can allow an adversary to take control of the system. tasks at once, it relies on character-level imperative updates An exploit for just such an attack against exactly this code to manipulate its input, and control flow depends on string was made publicly available [4]. values. The variable PATH points to a directory containing ex- ecutable files that remote users are allowed to invoke 3. Modeling Low-Level String Operations (e.g., "/home/ftp/bin"). To prevent the remote user from In this section, we give a high-level description of a small im- invoking other executables via pathname trickery (e.g., perative language, Bek, of low-level string operations.

Modeling Imperative String Operations with Transducers

Data Structure

Precise Analysis of String Expressions

Regular Languages

Formal Languages

Formal Languages

String Solving with Word Equations and Transducers: Towards a Logic for Analysing Mutation XSS (Full Version)

A Generalization of Square-Free Strings a GENERALIZATION of SQUARE-FREE STRINGS

Mathematical Linguistics

On Simple Linear String Equations

Reference Abstract Domains and Applications to String Analysis

PLDI 2016 Tutorial Automata-Based String Analysis

Optimal Dynamic Strings∗