Improving Application Security with Data Flow Assertions

Alexander Yip, Xi Wang, Nickolai Zeldovich, and M. Frans Kaashoek
Massachusetts Institute of Technology – Computer Science and Artificial Intelligence Laboratory

ABSTRACT

RESIN is a new language runtime that helps prevent security vulnerabilities, by allowing programmers to specify application-level data flow assertions. RESIN provides policy objects, which programmers use to specify assertion code and metadata; data tracking, which allows programmers to associate assertions with application data, and to keep track of assertions as the data flow through the application; and filter objects, which programmers use to define data flow boundaries at which assertions are checked. RESIN’s runtime checks data flow assertions by propagating policy objects along with data, as that data moves through the application, and then invoking filter objects when data crosses a data flow boundary, such as when writing data to the network or a file.

Using RESIN, Web application programmers can prevent a range of problems, from SQL injection and cross-site scripting, to inadvertent password disclosure and missing access control checks. Adding a RESIN assertion to an application requires few changes to the existing application code, and an assertion can reuse existing code and data structures. For instance, 23 lines of code detect and prevent three previously-unknown missing access control vulnerabilities in phpBB, a popular Web forum application. Other assertions comprising tens of lines of code prevent a range of vulnerabilities in Python and PHP applications. A prototype of RESIN incurs a 33% CPU overhead running the HotCRP conference management application.

Categories and Subject Descriptors

D.2.4 [Software Engineering]: Software Verification - Assertion checkers; D.4.6 [Operating Systems]: Security and Protection

General Terms

Security, Languages, Design

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SOSP’09, October 11–14, 2009, Big Sky, Montana, USA.
Copyright 2009 ACM 978-1-60558-752-3/09/10 ...$10.00.

1. INTRODUCTION

Software developers often have a plan for correct data flow within their applications. For example, a user u’s password may flow out of a Web site only via an email to user u’s email address. As another example, user inputs must always flow through a sanitizing function before flowing into a SQL query or HTML, to avoid SQL injection or cross-site scripting vulnerabilities. Unfortunately, today these plans are implemented implicitly: programmers try to insert code in all the appropriate places to ensure correct flow, but it is easy to miss some, which can lead to exploits. For example, one popular Web application, phpMyAdmin [38], requires sanitizing user input in 1,409 places. Not surprisingly, phpMyAdmin has suffered 60 vulnerabilities because some of these calls were forgotten [41].

This paper presents RESIN, a system that allows programmers to make their plan for correct data flow explicit using data flow assertions. Programmers can write a data flow assertion in one place to capture the application’s high-level data flow invariant, and RESIN checks the assertion in all relevant places, even places where the programmer might have otherwise forgotten to check.

RESIN operates within a language runtime, such as the Python or PHP interpreter. RESIN tracks application data as it flows through the application, and checks data flow assertions on every executed path. RESIN uses runtime mechanisms because they can capture dynamic properties, like user-defined access control lists, while integration with the language allows programmers to reuse the application’s existing code in an assertion. RESIN is designed to help programmers gain confidence in the correctness of their application, and is not designed to handle malicious code.

A key challenge facing RESIN is knowing when to verify a data flow assertion. Consider the assertion that a user’s password can flow only to the user herself. There are many different ways that an adversary might violate this assertion, and extract someone’s password from the system. The adversary might trick the application into emailing the password; the adversary might use a SQL injection attack to query the passwords from the database; or the adversary might fetch the password file from the server using a directory traversal attack. RESIN needs to cover every one of these paths to prevent password disclosure.

A second challenge is to design a generic mechanism that makes it easy to express data flow assertions, including common assertions like cross-site scripting avoidance, as well as application-specific assertions. For example, HotCRP [26], a conference management application, has its own data flow rules relating to password disclosure and reviewer conflicts of interest, among others. Can a single assertion API allow for succinct assertions for cross-site scripting avoidance as well as HotCRP’s unique data flow rules?

The final challenge is to make data flow assertions coexist with each other and with the application code. A single application may have many different data flow assertions, and it must be easy to add an additional assertion if a new data flow rule arises, without having to change existing assertions. Moreover, applications are often written by many different programmers. One programmer may work on one part of the application and lack understanding of the application’s overall data flow plan. RESIN should be able to enforce data flow assertions without all the programmers being aware of the assertions.

RESIN addresses these challenges using three ideas: policy objects, data tracking, and filter objects. Programmers explicitly annotate data, such as strings, with policy objects, which encapsulate the assertion functionality that is specific to that data. Programmers write policy objects in the same language that the rest of the application is written in, and can reuse existing code and data structures, which simplifies writing application-specific assertions. The RESIN runtime then tracks these policy objects as the data propagates through the application. When the data is about to leave the control of RESIN, such as being sent over the network, RESIN invokes filter objects to check the data flow assertions with assistance from the data’s policy objects.

We evaluate RESIN in the context of application security by showing how these three mechanisms can prevent a wide range of vulnerabilities in real Web applications, while requiring programmers to write only tens of lines of code. One application, the MoinMoin wiki [31], required only 8 lines of code to catch the same access control bugs that required 2,000 lines in Flume [28], although Flume provides stronger guarantees. HotCRP can use RESIN to uphold its data flow rules, by adding data flow assertions that control who may read a paper’s reviews, and to whom HotCRP can email a password reminder. Data flow assertions also help prevent a range of other previously-unknown vulnerabilities in Python and PHP Web applications. A prototype RESIN runtime for PHP has acceptable performance overhead, amounting to 33% for HotCRP.

The contributions of this work are the idea of an application-level data flow assertion, and a technique for implementing data flow assertions using filter objects, policy objects, and data tracking. Experiments with several real applications further show that data flow assertions are concise, effective at preventing many security vulnerabilities, and incrementally deployable in existing applications.

The rest of the paper is organized as follows. The next section discusses the specific goals and motivation for RESIN. Section 3 presents the design of the RESIN runtime, and Section 4 describes our implementation. Section 5 illustrates how RESIN prevents a range of security vulnerabilities. Sections 6 and 7 present our evaluation of RESIN’s ease of use, effectiveness, and performance. We discuss RESIN’s limitations and future work in Section 8. Section 9 covers related work, and Section 10 concludes.

2. GOALS AND EXAMPLES

RESIN’s main goal is to help programmers avoid security vulnerabilities by treating exploits as data flow violations, and then using data flow assertions to detect these violations. This section explains how faulty data flows cause vulnerabilities, and how data flow assertions can prevent those vulnerabilities.

SQL Injection and Cross-Site Scripting

SQL injection and cross-site scripting vulnerabilities are common and can affect almost any Web application. Together, they account for over a third of all reported security vulnerabilities in 2008, as seen in Table 1. These vulnerabilities result from user input data flowing into a SQL query string or HTML without first flowing through their respective sanitization functions. To avoid these vulnerabilities today, programmers insert calls to the correct sanitization function on every single path on which user input can flow to SQL or HTML. In practice this is difficult to accomplish because there are many data flow paths to keep track of, and some of them are non-intuitive. For example, in one cross-site scripting vulnerability, phpBB queried a malicious whois server, and then incorporated the response into HTML without first sanitizing the response. A survey of Web applications [48] summarized in Table 2 illustrates how common these bugs are with cross-site scripting affecting more than 31% of applications, and SQL injection affecting almost 8%.

    Vulnerability                   Count   Percentage
    SQL injection                    1176        20.4%
    Cross-site scripting              805        14.0%
    Denial of service                 661        11.5%
    Buffer overflow                   550         9.5%
    Directory traversal               379         6.6%
    Server-side script injection      287         5.0%
    Missing access checks             263         4.6%
    Other vulnerabilities            1647        28.6%
    Total                            5768         100%

Table 1: Top CVE security vulnerabilities of 2008 [41].

    Vulnerability                   Vulnerable sites among those surveyed
    Cross-site scripting            31.5%
    Information leakage             23.3%
    Predictable resource location   10.2%
    SQL injection                    7.9%
    Insufficient access control      1.5%
    HTTP response splitting          0.8%

Table 2: Top Web site vulnerabilities of 2007 [48].

If there were a tool that could enforce a data flow assertion on an entire application, a programmer could write an assertion to catch these bugs and prevent an adversary from exploiting them. For example, an assertion to prevent SQL injection exploits would verify that:

DATA FLOW ASSERTION 1. Any user input data must flow through a sanitization function before it flows into a SQL query.

RESIN aims to be such a tool.

Directory Traversal

Directory traversal is another common vulnerability that accounts for 6.6% of the vulnerabilities in Table 1. In a directory traversal attack, a vulnerable application allows the user to enter a file name, but neglects to limit the directories available to the user. To exploit this vulnerability, an adversary typically inserts the “..” string as part of the file name, which allows the adversary to gain unauthorized access to read or write files in the server’s file system. These exploits can be viewed as faulty data flows. If the adversary reads a file without the proper authorization, the file’s data is incorrectly flowing to the adversary. If the adversary writes to a file without the proper authorization, the adversary is causing an invalid flow into the file. Data flow assertions can address directory traversal vulnerabilities by enforcing data flow rules on the use of files. For example, a programmer could encode the following directory traversal assertion to protect against invalid writes:

DATA FLOW ASSERTION 2. No data may flow into directory d unless the authenticated user has write permission for d.

Server-Side Script Injection

Server-side script injection accounts for 5% of the vulnerabilities reported in Table 1. To exploit these vulnerabilities, an adversary uploads code to the server and then fools the application into running that code. For instance, many PHP applications load script code for different visual themes at runtime, by having the user specify the file name for their desired theme. An adversary can exploit this by uploading a file with the desired code onto the server (many applications allow uploading images or attachments), and then supplying the name of that file as the theme to load.

Even if the application is careful to not include user-supplied file names, a more subtle problem can occur. If an adversary uploads a file with a .php extension, the Web server may allow the adversary to directly execute that file’s contents by simply issuing an HTTP request for that file. Avoiding such problems requires coordination between many parts of the application, and even the Web server, to understand which file extensions are “dangerous”. This attack can be viewed as a faulty data flow and could be addressed by the following data flow assertion:

DATA FLOW ASSERTION 3. The interpreter may not interpret any user-supplied code.

Access Control

Insufficient access control can also be viewed as a data flow violation. These vulnerabilities allow an adversary to read data without proper authorization and make up 4.6% of the vulnerabilities reported in 2008. For example, a missing access control check in MoinMoin wiki allowed a user to read any wiki page, even if the page’s access control list (ACL) did not permit the user to read that page [46]. Like the previous vulnerabilities, this data leak can be viewed as a data flow violation; the wiki page is flowing to a user who lacks permission to receive the page. This vulnerability could be addressed with the data flow assertion:

DATA FLOW ASSERTION 4. Wiki page p may flow out of the system only to a user on p’s ACL.

Insufficient access control is particularly challenging to address because access control rules are often unique to the application. For example, MoinMoin’s ACL rules differ from HotCRP’s access control rules, which ensure that only paper authors and program committee (PC) members may read paper reviews, and that PC members may not view a paper’s authors if the author list is anonymous. Ideally, a data flow assertion could take advantage of the code and data structures that an application already uses to implement its access control checks.

Password Disclosure

Another example of a specific access control vulnerability is a password disclosure vulnerability that was discovered in HotCRP; we use this bug as a running example for the rest of this paper. This bug was a result of two separate features, as follows.

First, a HotCRP user can ask HotCRP to send a password reminder email to the user’s email address, in case the user forgets the password. HotCRP makes sure to send the email only to the email address of the account holder as stored in the server. The second feature is an email preview mode, in which the site administrator configures HotCRP to display email messages in the browser, rather than send them via email. In this vulnerability, an adversary asks HotCRP to send a password reminder for another HotCRP user (the victim) while HotCRP is in email preview mode. HotCRP will display the content of the password reminder email in the adversary’s browser, instead of sending the password to that victim’s email address, thus revealing the victim’s password to the adversary.

A data flow assertion could have prevented this vulnerability because the assertion would have caught the invalid password flow despite the unexpected combination of the password reminder and email preview mode. The assertion in this case would have been:

DATA FLOW ASSERTION 5. User u’s password may leave the system only via email to u’s email address, or to the program chair.

2.1 Threat Model

As we have shown, many vulnerabilities in today’s applications can be thought of as programming errors that allow faulty data flows. Adversaries exploit these faulty data flows to bypass the application’s security plan. RESIN aims to prevent adversaries from exploiting these faulty data flows by allowing programmers to explicitly specify data flow assertions, which are then checked at runtime in all places in the application.

We expect that programmers would specify data flow assertions to prevent well-known vulnerabilities shown in Table 1, as well as existing application-specific rules, such as HotCRP’s rules for password disclosure or reviewer conflicts of interest. As programmers write new code, they can use data flow assertions to make sure their data is properly handled in code written by other developers, without having to look at the entire code base. Finally, as new problems are discovered, either by attackers or by programmers auditing the code, data flow assertions can be used to fix an entire class of vulnerabilities, rather than just a specific instance of the bug.

RESIN treats the entire language runtime, and application code, as part of the trusted computing base. RESIN assumes the application code is not malicious, and does not prevent an adversary from compromising the underlying language runtime or the OS. In general, a buffer overflow attack can compromise a language runtime, but buffer overflows are less of an issue for RESIN because languages like PHP and Python are not susceptible to buffer overflows.

3. DESIGN

Many of the vulnerabilities described in Section 2 can be addressed with data flow assertions, but the design of such an assertion system requires solutions to a number of challenges. First, the system must enforce assertions on the many communication channels available to the application. Second, the system must provide a convenient API in which programmers can express many different types of data flow assertions. Finally, the system must handle several assertions in a single application gracefully; it should be easy to add new assertions, and doing so should not disrupt existing assertions. This section describes how RESIN addresses these design challenges, beginning with an example of how a data flow assertion prevents the HotCRP password disclosure vulnerability described in Section 2.

3.1 Design Overview

To illustrate the high-level design of RESIN and what a programmer must do to implement a data flow assertion, this section describes how a programmer would implement Data Flow Assertion 5, the HotCRP password assertion, in RESIN. This example does not use all of RESIN’s features, but it does show RESIN’s main concepts.

Conceptually, the programmer needs to restrict the flow of passwords. However, passwords are handled by a number of modules in HotCRP, including the authentication code and code that formats and sends email messages. Thus, the programmer must confine passwords by defining a data flow boundary that surrounds the entire application. Then the programmer allows a password to exit that boundary only if that password is flowing to the owner via email, or to the program chair. Finally, the programmer marks the passwords as sensitive so that the boundary can identify which data contains password information, and writes a small amount of assertion checking code.

RESIN provides three mechanisms that help the programmer implement such an assertion (see Figure 1):

• Programmers use filter objects to define data flow boundaries. A filter object interposes on an input/output channel or a function call interface.
• Programmers explicitly annotate sensitive data with policy objects. A policy object can contain code and metadata for checking assertions.
• Programmers rely on RESIN’s runtime to perform data tracking to propagate policy objects along with sensitive data when the application copies that data within the system.

Figure 1: Overview of the HotCRP password data flow assertion in RESIN.

    class PasswordPolicy extends Policy {
      private $email;
      function __construct($email) {
        $this->email = $email;
      }
      function export_check($context) {
        // Allowed: email sent to the password owner's own address.
        if ($context['type'] == 'email' &&
            $context['email'] == $this->email) return;
        // Allowed: HTTP output when the current user is the program chair.
        global $Me;
        if ($context['type'] == 'http' &&
            $Me->privChair) return;
        // Any other flow violates the assertion.
        throw new Exception('unauthorized disclosure');
      }
    }

    policy_add($password, new PasswordPolicy('[email protected]'));

Figure 2: Simplified PHP code for defining the HotCRP password policy class and annotating the password data. This policy only allows a password to be disclosed to the user’s own email address or to the program chair.

RESIN by default defines a data flow boundary around the language runtime using filter objects that cover all I/O channels, including pipes and sockets. By default, RESIN also annotates some of these default filter objects with context metadata that describes the specific filter object. For example, RESIN annotates each filter object connected to an outgoing email channel with the email’s recipient address. The default set of filters and contexts defining the boundary are appropriate for the HotCRP password assertion, so the programmer need not define them manually.

In order for RESIN to track the passwords, the programmer must annotate each password with a policy object, which is a language-level object that contains fields and methods. In this assertion, a user’s password will have a policy object that contains a copy of the user’s email address so that the assertion can determine which email address may receive the password data. When the user first sets their password, the programmer copies the user’s email address from the current session information into the password’s policy object.

The programmer also writes the code that checks the assertion, in a method called export_check within the password policy object’s class definition. Figure 2 shows the code the programmer must write to implement this data flow assertion, including the policy object’s class definition and the code that annotates a password with a policy object. The policy object also shows how an assertion can benefit from the application’s data structures; this assertion uses an existing flag, $Me->privChair, to determine whether the current user is the program chair.

Once a password has the appropriate policy object, RESIN’s data tracking propagates that policy object along with the password data; when the application copies or moves the data within the system, the policy goes along with the password data. For example, after HotCRP composes the email content using the password data, the email content will also have the password policy annotation (as shown in Figure 1).

RESIN enforces the assertion by making each filter object call export_check on the policy object of any data that flows through the filter. The filter object passes its context as an argument to export_check to provide details about the specific I/O channel (e.g., the email’s recipient).

This assertion catches HotCRP’s faulty data flow before it can leak a password. When HotCRP tries to send the password data over an HTTP connection, the connection’s filter object invokes the export_check method on the password’s policy object. The export_check code observes that HotCRP is incorrectly trying to send the password over an HTTP connection, and throws an exception which prevents HotCRP from sending the password to the adversary. This solution works for all disclosure paths through the code because RESIN’s default boundary controls all output channels; HotCRP cannot reveal the password without traversing a filter object.

This example is just one way to implement the password data flow assertion, and there may be other ways. For example, the programmer could implement the assertion checking code in the filter objects rather than the password’s policy object. However, modifying filter objects is less attractive because the programmer would need to modify every filter object that a password can traverse. Putting the assertion code in the policy object allows the programmer to write the assertion code in one place.

3.2 Filter Objects

A filter object, represented by a diamond in Figure 1, is a generic interposition mechanism that application programmers use to create data flow boundaries around their applications. An application can associate a filter object with a function call interface, or an I/O channel such as a file handle, socket, or pipe.

RESIN aims to support data flow assertions that are specific to an application, so in RESIN, a programmer implements a filter object as a language-level object in the same language as the rest of the application. This allows the programmer to reuse the application’s code and data structures, and allows for better integration with applications.

When an application sends data across a channel guarded by a
filter object, RESIN invokes a method in that filter object with the data as an argument. If the interposition point is an I/O channel, RESIN will invoke either filter_read or filter_write; for function calls, RESIN will invoke filter_func (see Table 3). Filter_read and filter_write can check or alter the in-transit data. Filter_func can check or alter the function’s arguments and return value.

    Function                            Caller         Semantics
    filter::filter_read(data, offset)   Runtime        Invoked when data comes in through a data flow boundary,
                                                       and can assign initial policies for data; e.g., by
                                                       de-serializing from persistent storage.
    filter::filter_write(data, offset)  Runtime        Invoked when data is exported through a data flow
                                                       boundary; typically invokes assertion checks or
                                                       serializes policy objects to persistent storage.
    filter::filter_func(args)           Runtime        Checks and/or proxies a function call.
    policy::export_check(context)       Filter object  Checks if data flow assertion allows exporting data, and
                                                       throws exception if not; context provides information
                                                       about the data flow boundary.
    policy::merge(policy_object_set)    Runtime        Returns set of policies (typically zero or one) that
                                                       should apply to merging of data tagged with this policy
                                                       and data tagged with policy_object_set.
    policy_add(data, policy)            Programmer     Adds policy to data’s policy set.
    policy_remove(data, policy)         Programmer     Removes policy from data’s policy set.
    policy_get(data)                    Programmer     Returns set of policies associated with data.

Table 3: The RESIN API. A::B(args) denotes method B of an object of type A. Not shown is the API used by the programmer to specify and access filter objects for different data flow boundaries.

For example, in an HTTP splitting attack, the adversary inserts an extra CR-LF-CR-LF delimiter into the HTTP output to confuse browsers into thinking there are two HTTP responses. To thwart this type of attack, the application programmer could write a filter object that scans for unexpected CR-LF-CR-LF character sequences, and then attach this filter to the HTTP output channel. As a second example, a function that encrypts data is a natural data flow boundary. A programmer may choose to attach a filter object to the encryption function that removes policy objects for confidentiality assertions such as the PasswordPolicy from Section 3.1.

3.2.1 Default Filter Objects

RESIN pre-defines default filter objects on all I/O channels into and out of the runtime, including sockets, pipes, files, HTTP output, email, SQL, and code import. Since these default filter objects are at the edge of the runtime, data can flow freely within the application and the default filters will only check assertions before making program output visible to the outside world. This boundary should be suitable for many assertions because it surrounds the entire application. The default boundary also helps programmers avoid accidentally overlooking an I/O channel, which would result in an incomplete boundary that would not cover all possible flows.

The default filter objects check the in-transit data for policies, as shown in Figure 3. If a filter finds a policy that has an export_check method, the filter invokes the policy’s export_check method. As described in Section 3.1, export_check typically checks the assertion and throws an exception if the flow would violate the assertion.

    class DefaultFilter(Filter):
        def __init__(self):
            self.context = {}
        def filter_write(self, buf):
            for p in policy_get(buf):
                if hasattr(p, 'export_check'):
                    p.export_check(self.context)
            return buf

Figure 3: Python code for the default filter for sockets.

Since the policy’s export_check method may need additional information about the filter’s specific I/O channel or function call to check the assertion, RESIN attaches context information, in the form of a hash table, to some of the default filters as described in Section 3.1. RESIN also allows the application to add its own key-value pairs to the context hash table of default filter objects. The context key-value pairs are likely specific to the I/O channel or function call that the filter guards, and the default filter passes the context hash table as an argument to export_check. In the HotCRP example, the context for a sendmail pipe contains the recipient of the email (as shown in Figure 1).

3.2.2 Importing Code

RESIN treats the interpreter’s execution of script code as another data flow channel, with its own filter object. This allows programmers to interpose on all code flowing into the interpreter, and ensure that such code came from an approved source. This can prevent server-side script injection attacks, where an adversary tricks the interpreter into executing adversary-provided script code.

3.2.3 Write Access Control

In addition to runtime boundaries, RESIN also permits an application to place filter objects on persistent files to control write access, because data tracking alone cannot prevent modifications. In particular, RESIN allows programmers to specify access control checks for files and directories in a persistent filter object that’s stored in the extended attributes of a specific file or directory. The runtime automatically invokes this persistent filter object when data flows into or out of that file, or modifies that directory (such as creating, deleting, or renaming files). This programmer-specified filter object can check whether the current user is authorized to access that file or directory. These persistent filter objects associated with a specific file or directory are separate from the filter objects associated by default with every file’s I/O channel.

3.3 Policy Objects

Like a filter object, a policy object is a language-level object, and can reuse the application’s existing code and data structures. A policy object can contain fields and methods that work in concert with filter objects; policy objects are represented by the rounded rectangles in Figure 1.

To work with default filter objects, a policy object should have an export_check method as shown in Table 3. As mentioned earlier, default filter objects invoke export_check when data with a policy passes through a filter, so export_check is where programmers implement an assertion check for use with default filters. If the assertion fails, export_check should throw an exception so that RESIN can prevent the faulty data flow.

The main distinction between policy objects and filter objects is that a policy object is specific to data, and a filter object is specific to a channel. A policy object would contain data-specific metadata and code; for example, the HotCRP password policy contains the email address of the password’s account holder. A filter object would contain channel-specific metadata; for example, the email filter object contains the recipient’s email address.

Even though RESIN allows programmers to write many different filter and policy objects, the interface between all filters and policies remains largely the same, if the application uses export_check. This limits the complexity of adding or changing filters and policies, because each policy object need not know about all possible filter objects, and each filter object need not know about all possible policy objects (although this does not preclude the programmer from implementing special cases for certain policy-filter pairs).

3.4 Data Tracking

RESIN keeps track of policy objects associated with data. The programmer attaches a policy object to a datum, that is, a primitive data element such as an integer or a character in a string (although it is common to assign the same policy to all characters in a string). The RESIN runtime then propagates that policy object along with the data, as the application code moves or copies the data.

To attach a policy object to data, the programmer uses the policy_add function listed in Table 3. Since an application may have multiple data flow assertions, a single datum may have multiple policy objects, all contained in the datum’s policy set.

RESIN propagates policies in a fine-grained manner. For example, if an application concatenates the string “foo” (with policy p1), and “bar” (with policy p2), then in the resulting string “foobar”, the first three characters will have only policy p1 and the last three characters will have only p2. If the programmer then takes the first three characters of the combined string, the resulting substring “foo” will only have policy p1. Tracking data at the character level minimizes interference between different data flow assertions, whose data may be combined in the same string, and minimizes unintended policy propagation, when marshalling and un-marshalling data structures. For example, in the HotCRP password reminder email message, only the characters comprising the user’s password have the password policy object. The rest of the string is not associated with the password policy object, and can potentially be manipulated and sent over the network without worrying about the password policy (assuming there are no other policies).

RESIN tracks explicit data flows such as variable assignment; most of the bugs we encountered, including all the bugs described in Sections 2 and 6, use explicit data flows. However, some data flows are implicit. One example is a control flow channel, such as when a

Figure 4: Saving persistent policies to a SQL database for HotCRP passwords. Uses symbols from Figure 1.

For example, HotCRP stores user passwords in a SQL database. It can be inconvenient and error-prone for the programmer to manually save metadata about the password’s policy object when saving it to the database, and then reconstruct the policy object when reading the password later.

To help programmers of such applications, RESIN transparently tracks data flows to and from persistent storage. RESIN’s default filter objects serialize policy objects to persistent files and database storage when data is written out, and de-serialize the policy objects when data is read back into the runtime.

For data going to a file, the file’s default filter object serializes the data’s policy objects into the file’s extended attributes. Whenever the application reads data from the file, the filter reads the serialized policy from the file’s extended attributes, and associates it with the newly-read data. RESIN tracks policies for file data at byte-level granularity, as it does for strings.

RESIN also serializes policies for data stored in a SQL database, as shown in Figure 4. RESIN accomplishes this by attaching a default filter object to the function used to issue SQL queries, and using that filter to rewrite queries and results. For a CREATE TABLE query, the filter adds an additional policy column to store the serialized policy for each data column. For a query that writes to the database, the filter augments the query to store the serialized policy of each cell’s value into the corresponding policy column. Last, for a query that fetches data, the filter augments the query to fetch the corresponding policy column, and associates each de-serialized policy object with the corresponding data cell in the resulting set of rows.

Storing policies persistently also helps other programs, such as the Web server, to check invariants on file data. For example, if an application accidentally stores passwords in a world-readable file, and an adversary tries to fetch that file via HTTP, a RESIN-aware Web server will invoke the file’s policy objects before transmitting the file, fail the export_check, and prevent password disclosure.

RESIN only serializes the class name and data fields of a policy object.
This allows programmers to change the code for a policy value that has a policy object influences a conditional branch, which class to evolve its behavior over time. For example, a program- then changes the program’s output. Another example of an implicit mer could change the export_check method of HotCRP’s password flow is through data structure layout; an application can store data policy object to disallow disclosure to the program chair without in an array using a particular order. RESIN does not track this order changing the existing persistent policy objects. However, if the information, and a programmer cannot attach a policy object to the application needs to change the fields of a persistent policy, the array’s order. These implicit data flows are sometimes surprising programmer will need to update the persistent policies, much like and difficult to understand, and RESIN does not track them. If the database schema migration. programmer wants to specify data flow assertions about such data, the programmer must first make this data explicit, and only then 3.4.2 Merging Policies attach a policy to it. RESIN uses character-level tracking to avoid having to merge policies when individual data elements are propagated verbatim, 3.4.1 Persistent Policies such as through concatenation or taking a substring. Unfortunately, RESIN only tracks data flow inside the language runtime, and merging is inevitable in some cases, such as when string characters checks assertions at the runtime boundary, because it cannot control with different policies are converted to integer values and added up what happens to the data after it leaves the runtime. However, many to compute a checksum. In many situations, such low-level data applications store data persistently in file systems and databases. 
transformation corresponds to a boundary, such as encryption or hashing, and would be a good fit for an application-specific filter object. However, relying on the programmer to insert filter objects in all such places would be error-prone, and RESIN provides a safety net by merging policy objects in the absence of any explicit actions by the programmer.

By default, RESIN takes the union of policy objects of source operands, and attaches them to the resulting datum. The union strategy is suitable for some data flow assertions. For example, an assertion that tracks user-supplied inputs by marking them with a UserData policy would like to label the result as UserData if any source operand was labeled as such. In contrast, the intersection strategy may be applicable to other policies. An assertion that tracks data authenticity by marking data with an AuthenticData policy would like to label the result as AuthenticData only if all source operands were labeled that way.

Because different policies may have different notions of a safe merge strategy, RESIN allows a policy object to override the merge method shown in Table 3. When application code merges two data elements, RESIN invokes the merge method on each policy of each source operand, passing in the entire policy set of the other operand as the argument. The merge method returns a set of policy objects that it wants associated with the new datum, or throws an exception if this policy should not be merged. The merge method can consult the set of policies associated with the other operand to implement either the union or intersection strategies. The RESIN runtime then labels the resulting datum with the union of all policies returned by all merge methods.

4. IMPLEMENTATION

We have implemented two RESIN prototypes, one in the PHP runtime, and the other in Python. At a high level, RESIN requires the addition of a pointer, which points to a set of policy objects, to the runtime's internal representation of a datum. For example, in PHP, the additional pointer resides in the zval object for strings and numbers. For strings, each policy object contains a character range for which the policy applies. When copying or moving data from one primitive object to another, the language runtime copies the policy set from the source to the destination, and modifies the character ranges if necessary. When merging individual data elements, the runtime invokes the policies' merge functions.

The PHP prototype involved 5,944 lines of code. The largest module is the SQL parsing and translation mechanism at about 2,600 lines. The core data structures and related functions make up about 1,100 lines. Most of the remaining 2,200 lines are spent propagating and merging policy objects. Adding propagation to the core PHP language required changes to its virtual machine opcode handlers, such as variable assignment, addition, and string concatenation. In addition, PHP implements many of its library functions, such as substr and printf, in C, which are outside of PHP's virtual machine and require additional propagation code.

To allow the Web server to check persistent policies for file data, as described in Section 3.4.1, we modified the mod_php Apache module to de-serialize and invoke policy objects for all static files it serves. Doing so required modifying 49 lines of code in mod_php.

The Python prototype only involved 681 lines of code; this is fewer than the PHP prototype for two reasons. First, our Python prototype does not implement all the RESIN features; it lacks character-level data tracking, persistent policy storage in SQL databases, and Apache static file support. Second, Python uses fewer C libraries, so it required little propagation code beyond the opcode handlers.

5. APPLYING RESIN

RESIN's main goal is to allow programmers to avoid security vulnerabilities by specifying data flow assertions. Section 3.1 already showed how a programmer can implement a data flow assertion that prevents password disclosure in HotCRP. This section shows how a programmer would implement data flow assertions in RESIN for a number of other vulnerabilities and applications.

The following examples use the syntax described in Table 3. Additionally, these examples use sock.__filter to access a socket's filter object, and in the Python code, policy_add and policy_remove return a new string with the same contents but a different policy set, because Python strings are immutable.

5.1 Access Control Checks

As mentioned in Section 2, RESIN aims to address missing access control checks. To illustrate how a programmer would use RESIN to verify access control checks, this section provides an example implementation of Data Flow Assertion 4, the assertion that verifies MoinMoin wiki's read ACL scheme (see Section 2).

The MoinMoin ACL assertion prevents a wiki page from flowing to a user who is not on the page's ACL. One way for a programmer to implement this assertion in RESIN is to:

1. annotate HTTP output channels with context that identifies the user on the other end of the channel;
2. define a PagePolicy object that contains an ACL;
3. implement an export_check method in PagePolicy that matches the output channel against the PagePolicy's ACL;
4. attach a PagePolicy to the data in each wiki page.

Figure 5 shows all the code necessary for this implementation. The process_client function annotates each HTTP connection's context with the current user, after parsing the user's request and credentials. PagePolicy contains a copy of the ACL, and implements export_check. The update_body method creates a PagePolicy object and attaches it to the page's data before saving the page to the file system. One reason why the PagePolicy is short is that it reuses existing application code to perform the access control check.

    def process_client(client_sock):
        req = parse_request(client_sock)
        client_sock.__filter.context['user'] = req.user
        ... process req ...

    class PagePolicy(Policy):
        def __init__(self, acl): self.acl = acl
        def export_check(self, context):
            if not self.acl.may(context['user'], 'read'):
                raise Exception("insufficient access")

    class Page:
        def update_body(self, text):
            text = policy_add(text, PagePolicy(self.getACL()))
            ... write text to page's file ...

Figure 5: Python code for a data flow assertion that checks read access control in MoinMoin. The process_client and update_body functions are simplified versions of MoinMoin equivalents.

This example assertion illustrates the use of persistent policies. The update_body function associates a PagePolicy with the contents of a page immediately before writing the page to a file. As the page data flows to the file, the default filter object serializes the PagePolicy object, including the access control list, to the file system. When MoinMoin reads this file later, the default filter will de-serialize the PagePolicy and attach it to the page data in the runtime, so that RESIN will automatically enforce the same access control policy.

In this implementation, the update_body function provides a single place where MoinMoin saves the page to the file system, and thus a single place to attach the PagePolicy. If, however, MoinMoin had multiple code paths that stored pages in the file system, the programmer could assign the policy to the page contents earlier, perhaps directly to the CGI input variables.

In addition to read access checks, the programmer can also define a data flow assertion that verifies write access checks. MoinMoin's write ACLs imply the assertion: data may flow into wiki page p only if the user is on p's write ACL. MoinMoin stores a wiki page as a directory that contains each version of the page as a separate file. The programmer can implement this assertion by creating a filter class that verifies the write ACL against the current user, and then attaching filter instances to the files and directory that represent a wiki page. The filters restrict the modification of existing versions, and also the creation of new versions based on the page's ACL.

5.2 Server-Side Script Injection

Another class of vulnerabilities that RESIN aims to address is server-side script injection, as described in Section 2, which can be addressed with Data Flow Assertion 3. One way for the programmer to implement this assertion is to:

1. define an empty CodeApproval policy object¹;
2. annotate application code and libraries with CodeApproval policy objects;
3. change the interpreter's default input filter (see Section 3.2.2) to require a CodeApproval policy on all imported code.

This data flow assertion instructs RESIN to limit what code the interpreter may use. Figure 6 lists the code for implementing this assertion. When installing an application, the developer tags the application code and system libraries with a persistent CodeApproval policy object using make_file_executable. The filter_read method only allows code with a CodeApproval policy object to pass, ensuring that code from an adversary, which would lack the CodeApproval policy, will not be executed, whether through include statements, eval, or direct HTTP requests.

    class CodeApproval extends Policy {
      function export_check($context) {}
    }

    function make_file_executable($f) {
      $code = file_get_contents($f);
      policy_add($code, new CodeApproval());
      file_put_contents($f, $code);
    }

    class InterpreterFilter extends Filter {
      function filter_read($buf) {
        foreach (policy_get($buf) as $p)
          if ($p instanceof CodeApproval)
            return $buf;
        throw new Exception('not executable');
      }
    }

Figure 6: Simplified PHP code for a data flow assertion that catches server-side script injection. In the actual implementation, filter_read verifies that each character in $buf has the CodeApproval policy.

The programmer must override the interpreter's filter in a global configuration file, to ensure the filter is set before any other code executes; PHP's auto_prepend_file option is one way to do this. If, instead, the application set the filter at the beginning of the application's own code, adversaries could bypass the check if they are able to upload and run their own .php files.

This example illustrates the need for programmer-specified filter objects in addition to programmer-specified context for default filters. The default filter calls export_check on all the policies that pass through, but the default filter always permits data that has no policy. The filter in this script injection assertion requires that data have a CodeApproval policy, and rejects data that does not.

¹ The CodeApproval policy does not need to take the intersection of policies during merge because RESIN's character-level data tracking avoids having to merge file data.

5.3 SQL Injection and Cross-Site Scripting

As mentioned in Section 2, the two most popular attack vectors in Web applications today are SQL injection and cross-site scripting. This section presents two different strategies for using RESIN to address these vulnerabilities.

To implement the first strategy, the programmer:

1. defines two policy object classes: UntrustedData and SQLSanitized;
2. annotates untrusted input data with an UntrustedData policy;
3. changes the existing SQL sanitization function to attach a SQLSanitized object to the freshly sanitized data;
4. changes the SQL filter object to check the policy objects on each SQL query. If the query contains any characters that have the UntrustedData policy, but not the SQLSanitized policy, the filter will throw an exception and refuse to forward the query to the database.

Addressing cross-site scripting is similar, except that it uses HTMLSanitized rather than SQLSanitized. This strategy catches unsanitized data because the data will lack the correct SQLSanitized or HTMLSanitized policy object. The reason for appending SQLSanitized and HTMLSanitized instead of removing UntrustedData is to allow the assertion to distinguish between data that may be incorporated into SQL versus HTML since they use different sanitization functions. This strategy ensures that the programmer uses the correct sanitizer (e.g., the programmer did not accidentally use SQL quoting for a string used as part of an HTML document).

The second strategy for preventing SQL injection and cross-site scripting vulnerabilities is to use the same UntrustedData policy from the previous strategy, but rather than appending a policy like SQLSanitized, the SQL filter inspects the final query and throws an exception if any characters in the query's structure (keywords, white space, and identifiers) have the UntrustedData policy. The HTML filter performs a similar check for UntrustedData on JavaScript portions of the HTML to catch cross-site scripting errors, similar to a technique used in prior work [34].

A variation on the second strategy is to change the SQL filter's tokenizer to keep contiguous bytes with the UntrustedData policy in the same token, and to automatically sanitize the untrusted data in transit to the SQL database. This will prevent untrusted data from affecting the command structure of the query, and likewise for the HTML tokenizer. These two variations require the addition of either tokenizing or parsing to the filter objects, but they avoid relying on trusted quoting functions.

We have experimented with both of these strategies in RESIN, and find that while the second approach requires more code for the parsers, many applications can reuse the same parsing code.

A SQL injection assertion is complementary to the other assertions we describe in this section. For instance, even if an application has a SQL injection vulnerability, and an adversary manages to execute the query SELECT user, password FROM userdb, the policy object for each password will still be de-serialized from the database, and will prevent password disclosure.

5.4 Other Attack Vectors

Finally, there are a number of other attack vectors that RESIN can help defend against. For instance, to address the HTTP response splitting attack described in Section 3.2, a developer can use a filter to reject any CR-LF-CR-LF sequences in the HTTP header that came from user input.

As Web applications use more client-side code, they also use more JSON to transport data from the server to the client. Here, much like in SQL injection, an adversary may be able to craft an input string that changes the structure of the JSON's JavaScript data structure, or worse yet, include client-side code as part of the data structure. Web applications can use RESIN's data tracking mechanisms to avoid these pitfalls as they would for SQL injection.

5.5 Application Integration

One potential concern when using RESIN is that a data flow assertion can duplicate data flow checks and security checks that already exist in an application. As a concrete example, consider HotCRP, which maintains a list of authors for each paper. If a paper submission is anonymous, HotCRP must not reveal the submission's list of authors to the PC members. HotCRP already performs this check before adding the author list to the HTML output. Adding a RESIN data flow assertion to verify read access to the author list will make HotCRP perform the access check a second time within the data flow assertion, duplicating the check that already exists.

If a programmer implements an application with RESIN in mind, the programmer can use an exception to indicate that the user may not read certain data, thereby avoiding duplicate access checks. For example, we modified the HotCRP code that displays a paper submission to always try to display the submission's author list. If the submission is anonymous, the data flow assertion raises an exception; the display code catches that exception, and then displays the string "Anonymous" instead of the author list. This avoids duplicate checks because the page generation code does not explicitly perform the access control check. However, if the application sends HTML output to the browser during a try block and then encounters an exception later in the try block, the previously released HTML might be invalid because the try block did not run to completion.

RESIN provides an output buffering mechanism to assist with this style of code. To use output buffering, the application starts a new try block before running HTML generation code that might throw an exception. At the start of the try block, the application notifies the outgoing HTML filter object to start buffering output. If the try block throws an exception, the corresponding catch block notifies the HTML filter to discard the output buffer, and potentially send alternate output in its place (such as "Anonymous" in the example). However, if the try block runs to completion, the try block notifies the HTML filter to release the data in the output buffer.

Using exceptions, instead of explicit access checks, frees the programmer from needing to know exactly which checks to invoke in every single case, because RESIN invokes the checks. Instead, programmers need to only wrap code that might fail a check with an appropriate exception handler, and specify how to present an exception to the user.

6. SECURITY EVALUATION

The main criterion for evaluating RESIN is whether it is effective at helping a programmer prevent data flow vulnerabilities. To provide a quantitative measure of RESIN's effectiveness, we focus on three areas. First, we determine how much work a programmer must do to implement an existing implicit data flow plan as an explicit data flow assertion in RESIN. We then evaluate whether each data flow assertion actually prevents typical data flow bugs, both previously-known and previously-unknown bugs. Finally, we evaluate whether a single high-level assertion can be general enough to cover both common and uncommon data flows that might violate the assertion, by testing assertions against bugs that use surprising data paths.

6.1 Programmer Effort

To determine the level of effort required for a programmer to use RESIN, we took a number of existing, off-the-shelf applications and examined some of their implicit security-related data flow plans. We then implemented a RESIN data flow assertion for each of those implicit plans. Table 4 summarizes the results, showing the applications, the number of lines of code in the application, and the number of lines of code in each data flow assertion.

Application                 Lang.   App. LOC  Assertion LOC  Known vuln.  Discovered vuln.  Prevented vuln.  Vulnerability type
MIT EECS grad admissions    Python  18,500    9              0            3                 3                SQL injection
MoinMoin                    Python  89,600    8              2            0                 2                Missing read access control checks
                                              15             0            0                 0                Missing write access control checks
File Thingie file manager   PHP     3,200     19             0            1                 1                Directory traversal, file access control
HotCRP                      PHP     29,000    23             1            0                 1                Password disclosure
                                              30             0            0                 0                Missing access checks for papers
                                              32             0            0                 0                Missing access checks for author list
myPHPscripts login library  PHP     425       6              1            0                 1                Password disclosure
PHP Navigator               PHP     4,100     17             0            1                 1                Directory traversal, file access control
phpBB                       PHP     172,000   23             1            3                 4                Missing access control checks
                                              22             4            0                 4                Cross-site scripting
many [3, 11, 16, 23, 36]    PHP     –         12             5            0                 5                Server-side script injection

Table 4: Results from using RESIN assertions to prevent previously-known and newly discovered vulnerabilities in several Web applications.

The results in Table 4 show that each data flow assertion requires a small amount of code, on the order of tens of lines of code. The assertion that checks read access to author lists in HotCRP requires the most changes, 32 lines. This is more code than other assertions because our implementation issues database queries and interprets the results to perform the access check, requiring extra code. However, many of the other assertions in Table 4 reuse existing code from the application's existing security plan, and are shorter.

Table 4 also shows that the effort required to implement a data flow assertion does not grow with the size of the application. This is because implementing an assertion only requires changes where sensitive data first enters the application, and/or where data exits the system, not on every path data takes through the application; RESIN's data tracking handles those data paths. For example, the cross-site scripting assertion for phpBB is only 22 lines of code even though phpBB is 172,000 lines of code.

As a point of comparison for programmer effort, consider the MoinMoin access control scheme that appeared in the Flume evaluation [28]. MoinMoin uses ACLs to limit who can read and write a wiki page. To implement this scheme under Flume, the programmer partitions MoinMoin into a number of components, each with different privileges, and then sets up the OS to enforce the access control system using information flow control. Adapting MoinMoin to use Flume requires modifying or writing about 2,000 lines of application code. In contrast, RESIN can check the same MoinMoin access control scheme using two assertions, an eight-line assertion for reading, and a 15-line assertion for writing, as shown in Table 4. Most importantly, adding these checks with RESIN requires no structural or design changes to the application.

Although Flume provides assurance against malicious server code and RESIN does not, the RESIN assertions catch the same two vulnerabilities (see Section 6.2) that Flume catches, because they do not involve binary code injection. By focusing on a weaker threat model, RESIN's lightweight and easy-to-use mechanisms provide a compelling choice for programmers that want additional security assurance without much extra effort.

6.2 Preventing Vulnerabilities

To evaluate whether RESIN's data flow assertions are capable of preventing vulnerabilities, we checked some of the assertions in Table 4 against known vulnerabilities that the assertion should be able to prevent. The results are shown in Table 4, where the number of previously-known vulnerabilities is greater than zero.

The results in Table 4 show that each RESIN assertion does prevent the vulnerabilities it aims to prevent. For example, the phpBB access control assertion prevents a known missing access control check listed in the CVE [41], and the HotCRP password protection assertion shown in Section 3.1 prevents the password disclosure vulnerability described in Section 2. The assertion to prevent server-side script injection described in Section 5.2 prevents such vulnerabilities in five different applications [3, 11, 16, 23, 36].

Since we implemented these assertions with knowledge of the previously-known vulnerabilities, it is possible that the assertions are biased to thwart only those vulnerabilities. To address this bias, we tried to find new bugs, as an adversary would, that violate the assertions in Table 4. These results are shown in Table 4 where the number of newly discovered vulnerabilities is greater than zero.

These results show that RESIN assertions can prevent vulnerabilities, even if the programmer has no knowledge of the specific vulnerabilities when writing the assertion. For example, we implemented a generic data flow assertion to address SQL injection vulnerabilities in MIT's EECS graduate admissions system. Although the original programmers were careful to avoid most SQL injection vulnerabilities, the assertion revealed three previously-unknown SQL injection vulnerabilities in the admission committee's internal user interface.

As a second example, File Thingie and PHP Navigator are Web-based file managers, and both support a feature that limits a user's write access to a particular home directory. We implemented this behavior as a write access filter as described in Section 3.2.3. Again, both applications have code in place to check directory accesses, but after a careful examination, we discovered a directory traversal vulnerability that violates the write access scheme in each application. The data flow assertions catch both of these vulnerabilities.

As a final example, phpBB implements read access controls so that only certain users can read certain forum messages. We implemented an assertion to verify this access control scheme. In addition to preventing a previously-known access control vulnerability, the assertion also prevents three previously-unknown read access violations that we discovered. These results confirm that data flow assertions in RESIN can thwart vulnerabilities, even if the programmer does not know they exist. Furthermore, these assertions likely eliminate even more vulnerabilities that we are not aware of.

The three vulnerabilities in phpBB are not in the core phpBB package, but in plugins written by third-party programmers. Large-scale projects like phpBB are a good example of the benefit of explicitly specifying data flow assertions with RESIN. Consider a situation where a new programmer starts working on an existing application like HotCRP or phpBB. There are many implicit rules that programmers must follow in hundreds of places, such as who is responsible for sanitizing what data to prevent SQL injection and cross-site scripting, and who is supposed to call the access control function. If a programmer starts writing code before understanding all of these rules, the programmer can easily introduce vulnerabilities, and this turned out to be the case in the phpBB plugins we examined. Using RESIN, one programmer can make a data flow rule explicit as an assertion and then RESIN will check that assertion for all the other programmers.

These results also provide examples of a single data flow assertion thwarting more than one instance of an entire class of vulnerabilities. For example, the single read access assertion in phpBB thwarts four specific instances of read access vulnerabilities (see Table 4). As another example, a single server-side script injection assertion that works in all PHP applications catches five different previously-known vulnerabilities in the PHP applications we tested (see Table 4). This suggests that when a programmer inevitably finds a security vulnerability and writes a RESIN assertion that addresses it, the assertion will prevent the broad class of problems that allow the vulnerability to occur in the first place, rather than only fixing the one specific instance of the problem.

6.3 Generality

To evaluate whether RESIN data flow assertions are general enough to cover the many data flow paths available to an adversary, we checked whether the assertions we wrote detect a number of data flow bugs that use surprising data flow channels.

The results indicate that a high-level RESIN assertion can detect and prevent vulnerabilities even if the vulnerability takes advantage of an unanticipated data flow path. For example, a common way for an adversary to exploit a cross-site scripting vulnerability is to enter malicious input through HTML form inputs. However, there was a cross-site scripting vulnerability in phpBB due to a more unusual data path. In this vulnerability, phpBB requests data from a whois server and then uses the response without sanitizing it first; an adversary exploits this vulnerability by inserting malicious JavaScript code into a whois record and then requesting the whois record via phpBB. The RESIN assertion that protects against cross-site scripting in phpBB, listed in Table 4, prevents vulnerabilities at a high level; the assertion treats all external input as untrusted and makes sure that the external input data flows through a sanitizer before phpBB may use the data in HTML. This assertion is able to prevent both the more common HTML form attack as well as the less common whois-style attack because the assertion is general enough to cover many possible data flow paths.

A second example is in the read access controls for phpBB's forum messages.
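As a rough illustration of this second example, the following toy Python model attaches a read policy to a message body, lets the policy travel with the characters through concatenation, and checks it at a stand-in HTTP boundary. It is not RESIN's implementation: `Tracked`, `ReadPolicy`, `AccessDenied`, and `http_output` are hypothetical names introduced here, and only the `export_check` method name and the `context['user']` convention come from the paper. Under these assumptions, quoting a message inside a reply form is caught by the same check as displaying the message directly.

```python
class AccessDenied(Exception):
    """Raised when data would flow to a user who is not on the ACL."""


class ReadPolicy:
    """Hypothetical policy object: only users on the ACL may receive the data."""

    def __init__(self, acl):
        self.acl = frozenset(acl)

    def export_check(self, context):
        if context["user"] not in self.acl:
            raise AccessDenied("insufficient access")


class Tracked:
    """Toy tracked string: every character carries its own set of policies."""

    def __init__(self, text="", policies=()):
        self.chars = [(ch, frozenset(policies)) for ch in text]

    def __add__(self, other):
        # Concatenation keeps each character's policies unchanged.
        result = Tracked()
        result.chars = self.chars + other.chars
        return result

    def __str__(self):
        return "".join(ch for ch, _ in self.chars)


def http_output(data, user):
    """Stand-in for a default filter on the HTTP output channel."""
    context = {"user": user}
    checked = set()
    for _, policies in data.chars:
        for policy in policies - checked:
            policy.export_check(context)  # raises if the flow is not allowed
        checked |= policies
    return str(data)


if __name__ == "__main__":
    message = Tracked("secret plan", [ReadPolicy({"alice"})])
    # The reply form quotes the original message; the policy comes along.
    reply_form = Tracked("> ") + message + Tracked("\nYour reply: ")
    print(http_output(reply_form, "alice"))
    try:
        http_output(reply_form, "mallory")
    except AccessDenied as exc:
        print("blocked:", exc)
```

Because the policy follows the characters rather than the code path, the sketch needs no separate check in the code that builds the reply form.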
The common place to check for read access is before displaying the message to a user, but one of the read access vulnerabilities, listed in Table 4, results from a different data flow path. When a user replies to a message, phpBB includes a quotation of the original message in the reply message. In the vulnerable version, phpBB also allows a user to reply to a message even if the user lacks permission to read the message. To exploit this vulnerability, an adversary, lacking permission to read a message, replies to the message using its message ID, and then reads the content of the original message, quoted in the reply template. The RESIN assertion that checks the read access controls prevents this vulnerability because the assertion detects data flow from the original message to the adversary's browser, regardless of the path taken.

A final example comes from the two password disclosure vulnerabilities shown in Table 4. As described in Section 5, the HotCRP disclosure results from a logic bug in the email preview and the email reminder features. In contrast, the disclosure in the myPHPscripts login library [33] results from the library storing its users' passwords in a plain-text file in the same HTTP-accessible directory that contains the library's PHP files [35]. To exploit this, an adversary requests the password file with a Web browser. Despite preventing password disclosure through two different data flow paths, the assertions for password disclosure in HotCRP and myPHPscripts are very similar (the only difference is that HotCRP allows email reminders and myPHPscripts does not). This shows that a single RESIN data flow assertion can prevent attacks through a wide range of attack vectors and data paths.

7. PERFORMANCE EVALUATION

Although the main focus of RESIN is to improve application security, application developers may be hesitant to use these techniques if they impose a prohibitive performance overhead. In this section, we show that RESIN's performance is acceptable. We first measure the overhead of running HotCRP with and without the use of RESIN, and then break down the low-level costs that account for the overhead using microbenchmarks. The overall result is that a complex Web application like HotCRP incurs a 33% CPU overhead for generating a page, which is unlikely to be noticeable by end-users.

The following experiments were run on a single core of a 2.3GHz Xeon 5140 server with 4GB of memory running Linux 2.6.22. The unmodified PHP interpreter is version 5.2.5, the same version that the RESIN PHP interpreter is based on.

7.1 Application Performance

To evaluate the system-level overhead of RESIN, we compare a modified version of HotCRP running in the RESIN PHP interpreter against an unmodified version of HotCRP 2.26 running in an unmodified PHP interpreter. We measured the time to generate the Web page for a specific paper in HotCRP, including the paper's title, abstract, and author list (if not anonymized), as if a PC member requested it through a browser. The measured runtime includes the time taken to parse PHP code, recall the session state, make SQL queries, and invoke the relevant data flow assertions. In this example, RESIN invoked two assertions: one protected the paper title and abstract (and the PC member was allowed to see them), and the other protected the author list (and the PC member was not allowed to see it, due to anonymization). We used the output buffering technique from Section 5.5 to present a consistent interface even when the author list policy raised an exception. The resulting page consisted of 8.5KB of HTML.

The unmodified version of HotCRP generates the page in 66ms (15.2 requests per second) and the RESIN version uses 88ms (11.4 requests per second), averaged over 2000 trials. The performance of this benchmark is CPU limited. Despite our unoptimized RESIN prototype, its performance is likely to be adequate for many real-world applications. For example, in the 30 minutes before the SOSP submission deadline in 2007, the HotCRP submission system logged only 390 user actions. Even if there were 10 page requests for each logged action (likely an overestimate), this would only average to 2.2 requests per second and a CPU utilization of 14.3% without RESIN, or 19.1% with RESIN on a single core. Adding a second CPU core doubles the throughput.

7.2 Microbenchmarks

To determine the source of RESIN's overhead, we measured the time taken by individual operations in an unmodified PHP interpreter, and a RESIN PHP interpreter both without any policy and with an empty policy. The results of these microbenchmarks are shown in Table 5.

Operation          Unmodified PHP   RESIN, no policy   RESIN, empty policy
Assign variable    0.196 µs         0.210 µs           0.214 µs
Function call      0.598 µs         0.602 µs           0.619 µs
String concat      0.315 µs         0.340 µs           0.463 µs
Integer addition   0.224 µs         0.247 µs           0.384 µs
File open          5.60 µs          7.05 µs            18.2 µs
File read, 1KB     14.0 µs          16.6 µs            26.7 µs
File write, 1KB    57.4 µs          60.5 µs            71.7 µs
SQL SELECT         134 µs           674 µs             832 µs
SQL INSERT         64.8 µs          294 µs             508 µs
SQL DELETE         64.7 µs          114 µs             115 µs

Table 5: The average time taken to execute different operations in an unmodified PHP interpreter, a RESIN PHP interpreter without any policy, and a RESIN PHP interpreter with an empty policy.

For operations that simply propagate policies, such as variable assignments and function calls, RESIN incurs a small absolute overhead of 4–21ns; in percentage terms this is at most about 10%. This overhead is due to managing the policy set objects.

The overhead for invoking a filter object's interposition method (filter_read, filter_write, and filter_func) is the same as for a standard function call, except that RESIN calls the interposition method once for every call to read or write. Therefore the application programmer has some control over how much interposition overhead the application will incur. For example, the programmer can control the amount of computation the interposition method performs, and the number of times the application calls read and write.

For operations that track byte-level policies, such as string concatenation, the overhead without any policy is low (8%), but increases when a policy is present (47%). This reflects the cost of propagating byte-level policies for parts of the string at runtime as well as more calls to malloc and free. A more efficient implementation of byte-level policies could reduce these calls.

Operations that merge policies (such as integer addition, which cannot do byte-level tracking) are similarly inexpensive without a policy (10%), but are more expensive when a policy is applied (71%). This reflects the cost of invoking the programmer-supplied merge function. However, in all the data flow assertions we encountered, we did not need to apply policies to integers, so this might not have a large impact on real applications.

For file open, read, and write, RESIN adds potentially noticeable overhead, largely due to the cost of serializing, de-serializing, and invoking policies and filters stored in a file's extended attributes. Caching file policies in the runtime will likely reduce this overhead.

The INSERT operation listed in Table 5 inserts 10 cells, each into a different column, and the SELECT operation reads 10 cells, each from a different column. When there is an empty policy, each datum has the policy. The overhead without any policy is 229–540µs (354%–403%), and that with an empty policy is 443–698µs (521%–684%). RESIN's overhead is related to the size of the query, and the number of columns that have policies; reducing the number of columns returned by a query reduces the overhead for a query. For example, a SELECT query that only requests six columns with policies takes 578µs in RESIN compared to 109µs in unmodified PHP. The DELETE operation has a lower overhead because it does not require rewriting queries or results.

RESIN's overhead for SQL operations is relatively high because it parses and translates each SQL query in order to determine the policy object for each data item that the query stores or fetches. Our current implementation performs much of the translation in a library written in PHP; we expect that converting all of it to C would offer significant speedup. Note that, even with our high overhead for SQL queries, the overall application incurs a much smaller performance overhead, such as 33% in the case of HotCRP.

8. LIMITATIONS AND FUTURE WORK

RESIN currently has a number of limitations which we plan to address in future work. First, we would like to provide better support for data integrity invariants. Instead of requiring programmers to specify what writes are allowed using filter objects, we envision using transactions to buffer database or file system changes, and checking a programmer-specified assertion before committing them.

Second, we would like to add the ability to construct internal data flow boundaries within an application. For example, an assertion could prevent clear-text passwords from flowing out of the software module that handles passwords. Attaching filter objects to function calls helps with these boundaries, but languages like PHP and Python allow code to read and write data in another module's scope as if they were global variables. An internal data flow boundary would need to address these data flow paths.

We would also like to extend RESIN to allow data flow assertions to span multiple runtimes, possibly including JavaScript and SQL. We are considering a few possible approaches, including a generic representation of policy objects, or mechanisms to invoke the policy object's original runtime for performing policy checks. Also, we are interested in extending RESIN to propagate policies between machines in a distributed system similar to the way DStar [52] does with information flow labels.

We are looking for easier ways to implement data tracking in a language runtime, perhaps with OS or VMM support. Adding data tracking to PHP required modifying the interpreter in 103 locations to propagate policies; ideally, applying these techniques to new runtimes would require fewer changes. For example, it might be possible to implement RESIN without modifying the language runtime, given a suitable object-oriented system. The implementation would override all string operations to propagate policy objects, and override storage system interfaces to implement filter objects.

Finally, dynamic data tracking adds runtime overheads and presents challenges to tracking data through control flow paths. We would like to investigate whether static analysis or programmer annotations can help check RESIN-style data flow assertions at compile time.

9. RELATED WORK

RESIN makes a number of design decisions regarding how programmers specify policies and how RESIN tracks data. This section relates RESIN's design to prior work.

9.1 Policy Specification

In RESIN, programmers define a data flow assertion by writing policy objects and filter objects in the same language as the rest of the application. Previous work in policy description languages focuses on specifying policies at a higher level, to make policies easier to understand, manage [6, 12, 14], analyze [20], and specify [2]. While these policy languages do not enforce security directly, having a clearly defined policy specification allows reasoning about the security of a system, performing static analysis [17, 18], and composing policies in well-defined ways [1, 7, 43]. Doing the same in RESIN is challenging because programmers write assertions in general-purpose code. In future work, techniques like program analysis could help formalize RESIN's policies [4], to bring some of these benefits to RESIN, or to allow performance optimizations.

Lattice-based label systems [9, 10, 13, 15, 28, 32, 51] control data flow by assigning labels to objects. Expressing policies using labels can be difficult [14], and can require re-structuring applications. Once specified, labels objectively define the policy, whereas RESIN assertions require reasoning about code. For more complex policies, labels are not enough, and many applications use trusted declassifiers to transform labels according to application-specific rules (e.g. encryption declassifies private data). Indeed, a large part of re-structuring an application to use labels involves writing and placing declassifiers. RESIN's design can be thought of as specifying the declassifier (policy object) in the label, thus avoiding the need to place declassifiers throughout the application code.

Since RESIN programmers define their own policy and filter objects, programmers can implement data flow assertions specific to an application, such as ensuring that every string that came from one user is sanitized before being sent to another user's browser. RESIN's assertions are more extensible than specialized policy languages [19], or tools designed to find specific problems, such as SQL injection or cross-site scripting [21, 22, 29, 30, 34, 39, 44, 47, 49].

PQL [30] allows programmers to run application-specific program analyses on their code at development time, including analyses that look for data flow bugs such as SQL injection. However, PQL is limited to finding data flows that can be statically analyzed, with the help of some development-time runtime checks, and cannot find data flows that involve persistent storage. This could miss some subtle paths that an attacker might trigger at runtime, and would not prevent vulnerabilities in plug-ins added by end-users.

FABLE [40] allows programmers to customize the type system and label transformation rules, but requires the programmer to define a type system in a specialized language, and use the type system to implement the application's data flow schemes. RESIN, on the other hand, implements data tracking orthogonal to the type system, requiring fewer code modifications, and allowing programmers to reuse existing code in their assertions.

Systems like OKWS [27] and Privman [25] enforce security by having programmers partition their application into less-privileged processes. By operating in the language runtime, RESIN's policy and filter objects track data flows and check assertions at a higher level of abstraction, avoiding the need to re-structure applications. However, RESIN cannot protect against compromised server processes.

9.2 Data Tracking

Once the assertions are in place, RESIN tracks explicit flows of application data at runtime, as it moves through the system. RESIN does not track data flows through implicit channels, such as program control flow and data structure layout, because implicit flows can be difficult to reason about, and often do not correspond to data flow plans the programmer had in mind. Implicit data flows can lead to "taint creep", or increasingly tainted program control flow, as the application executes, which can make the system difficult to use in practice. In contrast, systems like Jif [32] track data through all channels, including program control flow, and can catch subtle bugs that leak data through these channels. By relying on a well-defined label system, Jif can also avoid runtime checks in many cases, and rely purely on compile-time static checking, which reduces runtime overhead.

RESIN's data tracking is central to its ability to implement data flow assertions that involve data movement, like SQL injection or cross-site scripting protection. Other program checkers, like Spec# [5, 6], check program invariants, but focus on checking function pre- and post-conditions and do not track data. Aspect-oriented programming (AOP) [45] provides a way to add functionality, including security checks, that cuts across many different software modules, but does not perform data tracking. However, AOP does help programmers add new code throughout an application's code base, and could be used to implement RESIN filter objects.

By tracking data flow in a language runtime, RESIN can track data at the level of existing programming abstractions (variables, I/O channels, and function calls), much like in Jif [32]. This allows programmers to use RESIN without having to restructure their applications. This differs from OS-level IFC systems [15, 28, 50, 51] which track data flowing between processes, and thus require programmers to expose data flows to the OS by explicitly partitioning their applications into many components according to the data each component should observe. On the other hand, these OS IFC systems can protect against compromised server code, whereas RESIN assumes that all application code is trusted; a compromise in the application code can bypass RESIN's assertions.

Some bug-specific tools use data tracking to prevent vulnerabilities such as cross-site scripting [24], SQL injection [34, 44], and untrusted user input [8, 37, 42]. While these tools inspired RESIN's design, they effectively hard-code the assertion to be checked into the design of the tool. As a result, they are not general enough to address application-specific data flows, and do not support data flow tracking through persistent storage. One potential advantage of these tools is that they do not require the programmer to modify their application in order to prevent well-known vulnerabilities such as SQL injection or cross-site scripting. We suspect that with RESIN, one developer could also write a general-purpose assertion that can then be applied to other applications.

10. CONCLUSION

Programmers often have a plan for correct data flow in their applications. However, today's programmers often implement their plans implicitly, which requires the programmer to insert the correct code checks in many places throughout an application. This is difficult to do in practice, and often leads to vulnerabilities.

This work takes a step towards solving this problem by introducing the idea of a data flow assertion, which allows a programmer to explicitly specify a data flow plan, and then have the language runtime check it at runtime. RESIN provides three mechanisms for implementing data flow assertions: policy objects associated with data, data tracking as data flows through an application, and filter objects that define data flow boundaries and control data movement.

We evaluated RESIN by adding data flow assertions to prevent security vulnerabilities in existing PHP and Python applications. Results show that data flow assertions are effective at preventing a wide range of vulnerabilities, that assertions are short and easy to write, and that assertions can be added incrementally without having to restructure existing applications. We hope these benefits will entice programmers to adopt our ideas in practice.

Acknowledgments

We thank Michael Dalton, Eddie Kohler, Butler Lampson, Robert Morris, Neha Narula, Jerry Saltzer, Jacob Strauss, Jeremy Stribling, and the anonymous reviewers for their feedback. This work was supported by Nokia Research.

References

[1] G. Ahn, X. Zhang, and W. Xu. Systematic policy analysis for high-assurance services in SELinux. In Proc. of the 2008 POLICY Workshop, pages 3–10, Palisades, NY, June 2008.
[2] A. H. Anderson. An introduction to the web services policy language (WSPL). In Proc. of the 2004 POLICY Workshop, pages 189–192, Yorktown Heights, NY, June 2004.
[3] J. Bae. Vulnerability of uploading files with multiple extensions in phpBB attachment mod. http://seclists.org/fulldisclosure/2004/Dec/0347.html. CVE-2004-1404.
[4] S. Barker. The next 700 access control models or a unifying meta-model? In Proc. of the 14th ACM Symposium on Access Control Models and Technologies, pages 187–196, Stresa, Italy, June 2009.
[5] M. Barnett, B.-Y. E. Chang, R. DeLine, B. Jacobs, and K. R. M. Leino. Boogie: A modular reusable verifier for object-oriented programs. In Proc. of the 4th International Symposium on Formal Methods for Components and Objects, pages 364–387, Amsterdam, The Netherlands, November 2005.
[6] M. Barnett, K. R. M. Leino, and W. Schulte. The Spec# programming system: An overview. In Proc. of the Workshop on Construction and Analysis of Safe, Secure and Interoperable Smart devices, pages 49–69, Marseille, France, March 2004.
[7] L. Bauer, J. Ligatti, and D. Walker. Composing security policies with Polymer. In Proc. of the 2005 PLDI, pages 305–314, Chicago, IL, June 2005.
[8] W. Chang, B. Streiff, and C. Lin. Efficient and extensible security enforcement using dynamic data flow analysis. In Proc. of the 15th CCS, pages 39–50, Alexandria, VA, October 2008.
[9] S. Chong, J. Liu, A. C. Myers, X. Qi, K. Vikram, L. Zheng, and X. Zheng. Secure web applications via automatic partitioning. In Proc. of the 21st SOSP, pages 31–44, Stevenson, WA, October 2007.
[10] S. Chong, K. Vikram, and A. C. Myers. SIF: Enforcing confidentiality and integrity in web applications. In Proc. of the 16th USENIX Security Symposium, pages 1–16, Boston, MA, August 2007.
[11] CWH Underground. Kwalbum arbitrary file upload vulnerabilities. http://www.milw0rm.com/exploits/6664. CVE-2008-5677.
[12] N. Damianou, N. Dulay, E. Lupu, and M. Sloman. The Ponder policy specification language. In Proc. of the 2001 POLICY Workshop, pages 18–38, Bristol, UK, January 2001.
[13] D. E. Denning. A lattice model of secure information flow. Communications of the ACM, 19(5):236–243, May 1976.
[14] P. Efstathopoulos and E. Kohler. Manageable fine-grained information flow. In Proc. of the 3rd ACM EuroSys conference, pages 301–313, Glasgow, UK, April 2008.
[15] P. Efstathopoulos, M. Krohn, S. VanDeBogart, C. Frey, D. Ziegler, E. Kohler, D. Mazières, F. Kaashoek, and R. Morris. Labels and event processes in the Asbestos operating system. In Proc. of the 20th SOSP, pages 17–30, Brighton, UK, October 2005.
[16] Emory University. Multiple vulnerabilities in AWStats Totals. http://userwww.service.emory.edu/~ekenda2/EMORY-2008-01.txt. CVE-2008-3922.
[17] D. Engler, B. Chelf, A. Chou, and S. Hallem. Checking system rules using system-specific, programmer-written compiler extensions. In Proc. of the 4th OSDI, pages 1–16, San Diego, CA, October 2000.
[18] D. Evans and D. Larochelle. Improving security using extensible lightweight static analysis. IEEE Software, 19(1):42–51, January/February 2002.
[19] D. F. Ferraiolo and D. R. Kuhn. Role-based access control. In Proc. of the 15th National Computer Security Conference, pages 554–563, Baltimore, MD, October 1992.
[20] S. Garriss, L. Bauer, and M. K. Reiter. Detecting and resolving policy misconfigurations in access-control systems. In Proc. of the 13th ACM Symposium on Access Control Models and Technologies, pages 185–194, Estes Park, CO, June 2008.
[21] W. G. J. Halfond and A. Orso. AMNESIA: analysis and monitoring for neutralizing SQL-injection attacks. In Proc. of the 20th ACM International Conference on Automated Software Engineering, pages 174–183, Long Beach, CA, November 2005.
[22] W. G. J. Halfond, A. Orso, and P. Manolios. Using positive tainting and syntax-aware evaluation to counter SQL injection attacks. In Proc. of the 14th FSE, pages 175–185, Portland, OR, November 2006.
[23] N. Hippert. phpMyAdmin code execution vulnerability. http://fd.the-wildcat.de/pma_e36a091q11.php. CVE-2008-4096.
[24] S. Kasatani. Safe ERB plugin. http://agilewebdevelopment.com/plugins/safe_erb.
[25] D. Kilpatrick. Privman: A library for partitioning applications. In Proc. of the 2003 USENIX Annual Technical Conference, FREENIX track, pages 273–284, San Antonio, TX, June 2003.
[26] E. Kohler. Hot crap! In Proc. of the Workshop on Organizing Workshops, Conferences, and Symposia for Computer Systems, San Francisco, CA, April 2008.
[27] M. Krohn. Building secure high-performance web services with OKWS. In Proc. of the 2004 USENIX Annual Technical Conference, pages 185–198, Boston, MA, June–July 2004.
[28] M. Krohn, A. Yip, M. Brodsky, N. Cliffer, M. F. Kaashoek, E. Kohler, and R. Morris. Information flow control for standard OS abstractions. In Proc. of the 21st SOSP, pages 321–334, Stevenson, WA, October 2007.
[29] V. B. Livshits and M. S. Lam. Finding security vulnerabilities in Java applications with static analysis. In Proc. of the 14th USENIX Security Symposium, pages 271–286, Baltimore, MD, August 2005.
[30] M. Martin, B. Livshits, and M. Lam. Finding application errors and security flaws using PQL: a program query language. In Proc. of the 2005 OOPSLA, pages 365–383, San Diego, CA, October 2005.
[31] MoinMoin. The MoinMoin wiki engine. http://moinmoin.wikiwikiweb.de/.
[32] A. C. Myers and B. Liskov. Protecting privacy using the decentralized label model. ACM TOSEM, 9(4):410–442, October 2000.
[33] myPHPscripts.net. Login session script. http://www.myphpscripts.net/?sid=7. CVE-2008-5855.
[34] A. Nguyen-tuong, S. Guarnieri, D. Greene, J. Shirley, and D. Evans. Automatically hardening Web applications using precise tainting. In Proc. of the 20th IFIP International Information Security Conference, pages 295–307, Chiba, Japan, May 2005.
[35] Osirys. myPHPscripts login session password disclosure. http://nvd.nist.gov/nvd.cfm?cvename=CVE-2008-5855. CVE-2008-5855.
[36] Osirys. wPortfolio arbitrary file upload exploit. http://www.milw0rm.com/exploits/7165. CVE-2008-5220.
[37] Perl.org. Perl taint mode. http://perldoc.perl.org/perlsec.html.
[38] phpMyAdmin. phpMyAdmin 3.1.0. http://www.phpmyadmin.net/.
[39] T. Pietraszek and C. V. Berghe. Defending against injection attacks through context-sensitive string evaluation. In Proc. of the 8th International Symposium on Recent Advances in Intrusion Detection, pages 124–145, Seattle, WA, September 2005.
[40] N. Swamy, B. J. Corcoran, and M. Hicks. Fable: A language for enforcing user-defined security policies. In Proc. of the 2008 IEEE Symposium on Security and Privacy, pages 369–383, Oakland, CA, May 2008.
[41] The MITRE Corporation. Common vulnerabilities and exposures (CVE) database. http://cve.mitre.org/data/downloads/.
[42] D. Thomas, C. Fowler, and A. Hunt. Programming Ruby: The Pragmatic Programmers' Guide. Pragmatic Bookshelf, 2004.
[43] M. C. Tschantz and S. Krishnamurthi. Towards reasonability properties for access-control policy languages. In Proc. of the 11th ACM Symposium on Access Control Models and Technologies, pages 160–169, Lake Tahoe, CA, June 2006.
[44] W. Venema. Taint support for PHP. http://wiki.php.net/rfc/taint.
[45] J. Viega, J. T. Bloch, and P. Chandra. Applying aspect-oriented programming to security. Cutter IT Journal, 14(2):31–39, February 2001.
[46] T. Waldmann. Check the ACL of the included page when using the rst parser's include directive. http://hg.moinmo.in/moin/1.6/rev/35ff7a9b1546. CVE-2008-6548.
[47] G. Wassermann and Z. Su. Sound and precise analysis of Web applications for injection vulnerabilities. In Proc. of the 2007 PLDI, pages 32–41, San Diego, CA, June 2007.
[48] Web Application Security Consortium. 2007 web application security statistics. http://www.webappsec.org/projects/statistics/wasc_wass_2007.pdf.
[49] Y. Xie and A. Aiken. Static detection of security vulnerabilities in scripting languages. In Proc. of the 15th USENIX Security Symposium, pages 179–192, Vancouver, BC, Canada, July 2006.
[50] A. Yumerefendi, B. Mickle, and L. P. Cox. TightLip: Keeping applications from spilling the beans. In Proc. of the 4th NSDI, pages 159–172, Cambridge, MA, April 2007.
[51] N. Zeldovich, S. Boyd-Wickizer, E. Kohler, and D. Mazières. Making information flow explicit in HiStar. In Proc. of the 7th OSDI, pages 263–278, Seattle, WA, November 2006.
[52] N. Zeldovich, S. Boyd-Wickizer, and D. Mazières. Securing distributed systems with information flow control. In Proc. of the 5th NSDI, pages 293–308, San Francisco, CA, April 2008.