AutoCSP: Automatically Retrofitting CSP to Web Applications

Mattia Fazzini⇤, Prateek Saxena†, Alessandro Orso⇤ Georgia Institute of Technology, USA, mfazzini, orso @cc.gatech.edu ⇤ { } † National University of Singapore, Singapore, [email protected]

Abstract—Web applications often handle sensitive user data, a set of dynamically generated HTML pages whose content is which makes them attractive targets for attacks such as cross- annotated with (positive) taint information. Second, it analyzes site scripting (XSS). Content security policy (CSP) is a content- the annotated HTML pages to identify which elements of these restriction mechanism, now supported by all major browsers, that offers thorough protection against XSS. Unfortunately, pages are trusted. (Basically, the tainted elements are those that simply enabling CSP for a web application would affect the come only from trusted sources.) Third,AUTOCSP uses the re- application’s behavior and likely disrupt its functionality. To sults of the previous analysis to infer a policy that would block address this issue, we propose AUTOCSP, an automated tech- potentially untrusted elements while allowing trusted elements nique for retrofitting CSP to web applications. AUTOCSP (1) to be loaded. Fourth,AUTOCSP automatically modifies the leverages dynamic taint analysis to identify which content should be allowed to load on the dynamically-generated HTML pages server-side code of the web application so that it generates of a web application and (2) automatically modifies the server- web pages with the appropriate CSP. side code to generate such pages with the right permissions. Our To assess the usefulness and practical applicability of evaluation, performed on a set of real-world web applications, AUTOCSP, we developed a prototype that implements our shows that AUTOCSP can retrofit CSP effectively and efficiently. technique and used it to perform an empirical evaluation. In our evaluation, we applied AUTOCSP to seven real-world I. INTRODUCTION web applications and assessed whether it can protect web Web applications are extremely popular, easily accessible, applications without disrupting their functionality. Overall, and often handle personal, confidential, and even sensitive the results of our evaluation are encouraging and show that data. These characteristics, together with the widespread pres- AUTOCSP can effectively retrofit CSP to existing web applica- ence of vulnerabilities in such applications, make them an tions so that (1) the applications are actually protected against attractive target for attackers. Cross-site scripting (XSS) in XSS attacks, (2) their functionality is either not affected particular, is one of the most commonly reported security vul- or minimally affected, and (3) their performance incurs a nerabilities in web applications and often results in successful negligible overhead. Our results also show that automating attacks, whose consequences range from website defacing to this approach is cost-effective, as the number of changes to theft of sensitive information. perform to the server-side code to retrofit CSP is large enough The most common defense mechanisms against XSS are to make a manual approach expensive and error prone. based on input filtering, which can be an effective approach, This work makes the following contributions: but is also error prone and often results in an incomplete pro- The definition of AUTOCSP, a general approach to retrofit • tection. Content security policy (CSP), conversely, is a content- CSP to web applications. restriction scheme that is currently supported by all major A prototype implementation of AUTOCSP that can operate • browsers [1] and offers comprehensive protection against XSS on PHP-based web applications and is available at http:// attacks. In fact, the popularity of CSP is increasing, and com- www.cc.gatech.edu/⇠orso/software/autocsp. panies such as Facebook and GitHub have started to move CSP An empirical evaluation of AUTOCSP that shows the effec- • to production. In a nutshell, web developers can use a CSP tiveness and the practicality of our approach. header to provide, for an HTML page, a declarative whitelist policy that defines which content should be allowed to load II. BACKGROUND AND MOTIVATING EXAMPLE on that page. Unfortunately, simply enabling a default CSP This section provides an overview of the security-related for a web application can dramatically affect the application’s concepts needed to understand our approach and an example behavior and is likely to disrupt the application’s functionality. that we use to motivate our work. On the other hand, manually defining a policy can be difficult and time consuming. A. Cross-Site Scripting (XSS) To support an effective use of CSP, while both reducing Web applications are complex multi-tier applications that developers’ effort and preserving functionality, we propose include a client and a server sides. Through a browser on AUTOCSP, an automated technique and tool for retrofitting the client side, users can issue HTTP requests to access CSP to web applications. Given a web application, AUTOCSP functionality running on the server side, or backend. The operates in four main phases. First, it marks as trusted all server-side code processes the inputs contained in the HTTP “known” values in the web application’s server-side code (e.g., requests and generates web pages with the content requested hardcoded values) and exercises the web application while by the user. These dynamically-generated web pages usually performing dynamic taint tracking. The result of this phase is consist of basic HTML entities together with JavaScript, CSS, and other resources. When the browser receives these pages, A CSP is a set of directives in the form directive-name: it renders their content and executes the code they contain. source-list, where directive-name is the name of the XSS attacks are injection attacks that take advantage of directive, and source-list is the set of domains to which the dynamically generated content in a web page to insert the specific directive applies. These directives define many malicious scripts into otherwise benign and trusted web pages. aspects of a dynamic web page’s content: which scripts are In these cases, malicious code coexists and executes together enabled, from where plugins can be loaded, which styles with benign code, as the web browser is unable to discern are allowed, from where media content can be retrieved, between the two. XSS vulnerabilities can be divided into which resources can be framed, to which hosts the page different types, based on the methodology used to exploit can connect through scripts, and from where it is possible them. XSS vulnerabilities of persistent type are those whose to load fonts. There are two special keywords that can be attacks are performed by injecting and permanently storing used in a policy: unsafe-inline and unsafe-eval. The malicious script content within a resource of the targeted web first keyword enables inline scripts and inline style constructs application. Subsequently, when a user requests that resource, in the protected web page. The second keyword enables the the malicious code executes as if it were generated by the generation of code from string elements in the client-side code. web application itself (i.e., as if it were trusted). This type of These keywords can be very useful but, if allowed, may void vulnerability is widespread because the web application logic the protection offered by CSP. allows for the possibility of using inline scripts and inline style Several instances of XSS attacks can be blocked if CSP directives within the dynamically generated HTML (e.g., using is properly added to the HTML pages of a web applica- the construct). XSS vulnerabilities tion. Persistent XSS attacks, in particular, can be blocked if can also be of reflected type, when the corresponding attacks unsafe-inline is not in the policy, and the policy only are built by having the server-side code processing the content whitelists scripts and style content that were created by the of an HTTP request and attaching it verbatim to the requested developer of the web application. Reflected XSS attacks can web page. This type of vulnerability is also widespread, as be blocked if unsafe-inline is not excluded from the application logic usually permits inline scripts and inline style policy. The probability of DOM-based XSS attacks can be constructs. The DOM-based type identifies vulnerabilities in highly reduced if both unsafe-eval and unsafe-inline which the attacks are created by injecting malicious payload are excluded from the policy. Finally, resource-based XSS directly into the DOM of the victim’s web browser. Because attacks can be prevented by CSP in two different ways. First, of the way this type of exploits are constructed, the server- the attack can be blocked by preventing the resource that side of the application never sees the malicious script content. contains the malicious payload from being fetched (i.e., the This class of vulnerabilities is exploitable due to a combi- domain of the resource must not be whitelisted in the policy). nation of JavaScript and HTML features. One of the most Alternatively, if the resource needs to be fetched and is located important among these features is the ability of JavaScript on the host where the web application is running, the developer to execute code from strings (i.e., eval). Another feature is can specify a CSP that states that no script can be executed that JavaScript code, while executing, can create new elements in the context of the resource. in the DOM, such as inline/event scripts and inline/attribute Given the whitelist nature of CSP, in order to take full styles. Finally, resource-based XSS vulnerabilities can be advantage of the protection mechanism it offers, web appli- exploited by placing malicious scripts within resources being cation developers need to (1) identify a policy for each of fetched by the web browser while rendering a requested the web pages dynamically generated by the web application HTML page. Examples of these attacks are malicious script and (2) rewrite parts of the web application’s server-side code content injected into SVG files andq Chameleon attacks [2]. to make sure they generate pages with the right policy. Such This type of vulnerability is common in web applications a manual process is not only time consuming but also error because these applications generally do not restrict the sources prone. Our approach aims to remove (most of) this burden from which dynamic web pages can fetch their content. from the developers’ shoulders by automating this process. B. Content Security Policy C. Motivating Example Content security policy (CSP) [3], [4] is a declarative To motivate our work, we provide an example from a mechanism that can be used to protect web applications real-world web application called SCHOOLMATE—a school against several classes of XSS attacks. CSP can be seen management system that has been downloaded over 16,000 as a content restriction scheme that web developers can times. Figure 1 shows the server-side code for the functionality use to specify what content can be included and how this that allows students to visualize the assignments related to one content operates within dynamically-generated pages of a web of their classes (slightly modified to make it self-contained and application. CSP is enabled by adding, on the server side, more readable). Figure 2 shows the HTML web page generated a Content-Security-Policy header and corresponding by this server-side code. content to the HTTP response that contains the protected After generating the HTML page header, the server-side resource. When a web browser receives the policy, it enforces code adds an inline script to the HTML page (code lines 4– it on the content of the web page being rendered. 9, HTML lines 7–11). This script contains the functionality 1 1 2 print(""); 2 2 3 ... 3 ... 3 ... 4 print(" 5 function grades(){ 5 5 css" href="sty.css"/> 7 document.student.submit(); 7 "); 9 document.student.page2.value=3; 9 10 ... 10 document.student.submit(); 10 ... 11 print(" 11 13 Grades"); 13 14 15 href="#"> 16 mysql_fetch_row($query)){ 16 ... 16 Grades 17 ... 17 17 ... 18 print(" 18 18 19 19 21 alert("XSS"); 22 ... 22 22 23 } 23 23 24 ... 24 ... 24 25 print(""); 25 25 ... 26 ?> 26 26 27 Fig. 1: SCHOOLMATE server-side code Fig. 2: SCHOOLMATE original web Fig. 3: SCHOOLMATE CSP-enabled snippet. page snippet. web page snippet. necessary to navigate the menu of the web application and is but would allow the XSS attack to succeed. Figure 3 shows the encoded as a constant string in the original program. Lines CSP-enabled web page that would preserve the functionality 11–13 print a link for accessing the functionality offered by of the web application while blocking the XSS attack. the previous script (HTML lines 13–15). Also in this case, The correct CSP would be Content-Security-Policy: the element is encoded as a constant string in the program. default-src 'none';script-src domain;style-src Lines 15–23 consist of a loop that creates a table in the domain, where identifier domain represents the host on HTML page containing the assignments for a given class which the external JavaScript and CSS files linked in the (HTML lines 17–23). In this case, the code prints a row and HTML reside. As Figures 2 and 3 show, there are significant a column of a table by using a constant string for its structure changes between the two HTML pages. The inline script at and a variable for its content (line 20). The value of this lines 7–11 in Figure 2 is transformed into the script at lines variable is read from a and represents comments of 11–12 in Figure 3. The content of the script is moved to the class instructor. This makes the web application vulner- an external file called external.js, which is enabled by able to persistent XSS attacks (see Section II-A). Assume, the script-src CSP directive. The JavaScript URI at lines for instance, that the value retrieved from the database is 13–15 of Figure 2 is also moved to an external file uri.js . When this value is (line 4 in Figure 3) and linked using the newly introduced added to the generated HTML, it is interpreted as an inline id="uri" expression at line 14 in the new web page. A script. At line 19, an inline style is applied to the column of the similar transformation occurs for the inline style attribute table, but this value is hardcoded in the program, so it cannot of line 18 in Figure 2. The style content is moved to file be modified by an attacker. Finally, the server-side code closes sty.css, enabled using the style-src CSP directive, and the HTML page at line 25 and terminates its execution. activated through the newly introduced id="sty" expression (lines 5, 6, and 19 of Figure 3, respectively). The inline This example lets us show why blindly enabling CSP script that contains the malicious payload appears unchanged on the generated web page would either affect its normal in Figure 3 at lines 20–22 and would be blocked by the functionality or add inadequate protection against XSS browser, as the CSP associated with the web page does not attacks. Simply using CSP to block all executions of script allow inline scripts. and style content (i.e., using Content-Security-Policy: As this example shows, defining a suitable CSP can be a default-src 'none' as the policy header) would block difficult, time-consuming, and error-prone task. In the next the XSS attach, but would also prevent normal users from sections, we show how AUTOCSP can automate this process. accessing the menu functionality on the page. This is because both the inline script and JavaScript URI, respectively III. THE AUTOCSP APPROACH at lines 7–11 and 13–15 of Figure 2, would be blocked. Conversely, enabling the inline script, JavaScript URI, and In this section, we present our approach for retrofitting inline style without modifying the web application (with CSP to web applications. We first provide an overview of policy header Content-Security-Policy: default-src AUTOCSP, and then discuss its different phases in detail. 'none'; script-src 'unsafe-inline'; style-src As we discussed in Section II-B, CSP offers a whitelist 'unsafe-inline') would preserve the page’s functionality, based content-restriction mechanism; that is, CSP (1) blocks by default the loading/execution of any web-page node that Algorithm 1: AutoCSP is not specified in the policy but (2) allows for specifying Input : WA, web application; TS, set of test inputs for WA; which nodes are trusted and can thus be loaded/executed. The Output: WACSP, web application using CSP; 1 begin basic idea behind our approach is to automatically find trusted 2 set ES := ; 3 foreach input ti TS do nodes in a dynamically generated web page by analyzing 2 /* Dynamic Tainting */ the execution of the server-side code that generates such 4 buffer HB, TB, SB := ; 5 map TM := page—nodes that are generated using only trusted sources are ; 6 WA.set(ti) marked as trusted, whereas all other nodes are conservatively 7 while WA.finish() do ¬ considered untrusted. 8 instr i := WA.next() 9 i.execute() Figure 4 provides a high-level overview of our approach and 10 taint(i, TM) 11 if i PRINT then ⌘ shows its main phases. Given a web application and a set of 12 HB.add(i.result) test inputs, in its dynamic tainting phase, AUTOCSP marks as 13 TB.add(TM[i.result]) 14 SB.add(i.line) trusted all hardcoded values in the web application server-side /* Web Page Analysis */ code (plus, optionally, additional whitelisted sources specified 15 dom DOMA := parse(HB, TB, SB) /* CSP Analysis */ by the developer) and performs dynamic taint analysis as the 16 policy CSP := STRICT 17 foreach node n DOM do server-side code executes and generates web pages. Then, in 2 A 18 if n.positive() then the web page analysis phase, AUTOCSP analyzes a dynami- 19 if n (SCR OBJ STY IMG MED FRM) then ⌘ _ _ _ _ _ cally generated HTML page and the associated taint informa- 20 node nnew := toCSP (DOMA, n) 21 if nnew n then tion to determine which parts of the page can be considered as 6⌘ 22 edit en := En(linen(DOMA, n), n, nnew) trusted. In the CSP analysis phase of the approach, AUTOCSP 23 ES.add(en) processes an HTML page and its associated taint information 24 CSP.allow(nnew) 25 edit eCSP := ECSP (lineCSP (DOMA), CSP) to infer a policy that would block potentially untrusted parts 26 ES.add(eCSP) while allowing trusted parts to be loaded in the browser. This /* Source Code Transformation */ 27 webapp WPCSP := WA 28 foreach edit e ES do phase also computes how HTML pages should be transformed 2 29 if e E then in order to conform to the inferred policy. Finally, in its ⌘ CSP 30 transformCSP (WPCSP, e) UTO 31 else if e En then source code transformation phase, A CSP modifies the ⌘ source code of the web application so that it generates suitably 32 transformn(WPCSP, e) 33 return WPCSP transformed HTML pages (according to what it computed in the previous phase) in which the inferred CSPs are enabled. There are three main aspects that characterize a dynamic In the remainder of this section, we describe AUTOCSP taint analysis: taint introduction, propagation policy, and taint with the help of Algorithm 1, which represents the approach checking. Taint introduction identifies and labels specific data in pseudo-code. The inputs to AUTOCSP are a web application within the program, called taint sources, using suitable taint WA and a set of test inputs TS. For every test input ti in TS, marks. A taint propagation policy governs how taint marks AUTOCSP performs its dynamic tainting, web page analysis, propagate while the program is executing. Taint checking is and CSP analysis phases (lines 3–26). After processing every the step in which particular actions are performed when taint test input, the approach moves to its source code transforma- marks reach special locations of the program called taint sinks. tion phase (lines 27–33), which produces the final result: the We present now the instance of positive dynamic taint analysis transformed web application WACSP. We now describe the that we implemented in AUTOCSP. This part of AUTOCSP four phases of AUTOCSP in detail. is covered by lines 4–14 in Algorithm 1. 1) Taint Sources: We introduce a taint mark for a given A. Dynamic Tainting piece of data, hence marking it as positively tainted, whenever such data is hardcoded in the program. The rationale is that (1) This phase aims to determine, given a test input to the web server code normally uses hardcoded data, and in particular application, which parts of the generated web page are trusted strings, to generate the different parts of an HTML page and which parts are untrusted, using a whitelist approach. We and (2) hardcoded data are defined by the developer, and consider a DOM node of the generated HTML as trusted thus trusted. In addition, our approach also allows developers if it is defined by the developer or it is in full control of to specify additional data that should be marked as trusted the developer. We consider all remaining content untrusted. (i.e., whitelisted), such as content originating from a specific Specifically, we use an approach called positive dynamic column in a database or return values of specific functions. tainting, which we used successfully in previous work to AUTOCSP keeps track of this taint information using a taint counter SQL injection vulnerabilities [5]. Positive dynamic map (TM variable in the algorithm). tainting marks the trusted data within an application (in this 2) Taint Propagation: A taint propagation policy deter- case, HTML fragments) and propagates taint marks as the mines how the different instructions affect the taint marks application executes. Positive tainting, in contrast to more associated with their operands. To compute accurate results, it traditional negative tainting approaches, is a more conservative is important to precisely model the semantics of such instruc- approach and fits naturally the whitelist approach behind CSP. tions. For each type of instruction, our taint propagation policy Source Code Test Dynamic Tainting Web Page Analysis CSP Analysis Inputs Transformation Transformed I1 I2 HTML CSP CSP T HTML Web C1 STYLE Web Application FILE Application CSP

Source Code STYLE SCRIPT FILE WebApp HTML SCRIPT SCRIPT STYLE FILE SCRIPT WebApp FILE

Fig. 4: High-level overview of AUTOCSP. determines three aspects: the data defined by the instruction The other two buffers are used to annotate the nodes of the (i.e., its output data), the data used by the instruction (i.e., resulting DOM tree. Basically, each of the tokens generated its input data), and a mapping function (i.e., how the input while parsing the HTML content contains two annotations: data affects the output data). After executing an instruction the first one indicates whether the token is trusted, whereas i (execute function at line 9), AUTOCSP updates the taint the second one indicates the locations in the server-side code marks associated with i’s output data based on the taint data of the statements that generated the token. After identifying all associated with i’s input data and i’s mapping function (taint tokens, AUTOCSP produces a corresponding annotated DOM function at line 10). tree. Because each DOM node can correspond to multiple 3) Taint Sinks: Taint sinks are relevant instructions for HTML tokens, to be conservative, AUTOCSP marks a node the type of taint analysis being performed. For this reason, as trusted only if all of its corresponding tokens are. when executed, they trigger checking actions on the taint C. CSP Analysis marks associated with their input data. In AUTOCSP, taint sinks consist of print instructions, as the data printed by the This phase takes as input the annotated DOM tree, infers server-side code is what constitutes the web page that is then the CSP for it, and identifies how trusted DOM nodes should sent back to the client. By checking these sinks, AUTOCSP be transformed to comply with the inferred policy. The high- can check the taint marks associated to the different parts of level algorithm for this phase corresponds to lines 16–26 the HTML pages dynamically generated by the application. of Algorithm 1. For each annotated DOM tree, this phase In the server-side code, there are normally multiple places produces a set of HTML transformations and a CSP for the in which different parts of an HTML page are generated, document. The HTML transformations and the CSP are then and therefore multiple taint sinks. For each sink (line 11), converted to source code edits (lines 22 and 25) and are passed AUTOCSP performs the following actions. First, it stores the to the final phase of the approach as the edit set (ES). characters of the generated HTML page (line 12) in an HTML This phase starts by associating the strictest CSP possi- buffer (HB). AUTOCSP also populates two other buffers: the ble to the HTML of the annotated DOM tree (line 16). taint buffer (TB) and the source buffer (SB). For each element This choice ensures the most effective protection of- in the HTML buffer, AUTOCSP creates a corresponding entry fered by CSP against XSS. The initial CSP corresponds in the taint buffer (line 13) that specifies whether that element to Content-Security-Policy: default-src 'none'. is trusted (if it has a taint mark) or untrusted (otherwise). This policy does not allow the protected HTML resource to Similarly, the approach stores into the source buffer the source use inline scripts, eval constructs, and inline styles. In addition, code location of the statements that generated the content the policy does not allow the guarded resource to fetch any in the corresponding position of the HTML buffer (line 14). content from the web. These three buffers are used by the subsequent phases of the Once the initialization step is completed, AUTOCSP starts approach. processing nodes in the annotated DOM tree (lines 17–24). The key idea of this phase is to incrementally add to the CSP B. Web Page Analysis trusted HTML elements. In addition, AUTOCSP transforms The second phase of our approach (represented by line 15 each such element, if necessary, so that the element’s behavior in Algorithm 1) analyzes the HTML, taint, and source buffers. is not disrupted by the enforcement mechanism of CSP. To In this phase, AUTOCSP parses (parse function at line 15) do this, AUTOCSP identifies which elements in the annotated the HTML generated by the previous phase to build (and then DOM tree relate to CSP and, if necessary, suitably transforms operate on) its DOM representation, as a browser would do. them. In fact, using the same underlining model (DOM), AUTOCSP Our approach identifies whether nodes of the annotated can better mimic the CSP enforcement mechanism that web DOM tree relate to CSP according to what is stated in the CSP browsers would apply. AUTOCSP actually produces an en- specification [4]. Specifically, there are six classes of elements hanced version of the DOM tree that we call the annotated that AUTOCSP identifies as related to CSP (line 19): nodes DOM tree (DOMA). AUTOCSP’s parsing algorithm, which is that (1) enable scripting, (2) load plugins, (3) define the style of based on the WHATWG specification [6], operates primarily the web page, (4) fetch images, (5) connect to media content, on the HTML buffer, which contains the actual HTML content. and (6) frame other resources. If a DOM node that relates to CSP is trusted (positive type="text/css" href="..."/>. The new node fetches function at line 18 returns true), it is enabled in the CSP and, an external style file that is stored on the web server and if necessary, AUTOCSP identifies how to transform it to make has the same content as the original style node. In this case, it conform to the inferred CSP. The transformation process the CSP gets extended by allowing the document to fetch the (toCSP function at line 20) is dependent on the type of node style resource from the server using the style-src: host; considered. If a transformation is necessary, a new node edit directive. The second type of nodes are style attributes, (en) is added to the edit set (line 23). The node edit contains such as HTML elements with attributes in the form

. The approach replaces this type of statement that generated the node (result of function linen at node with a node without the style attribute and moves the line 22), the original node (n), and the modified node (nnew). style content to an external file. It then links the file to After processing every node, AUTOCSP also creates a CSP the corresponding HTML document and allows the domain edit (eCSP) and adds it to the edit set (line 26). A CSP edit where the file is stored in the CSP. The last type of style contains two pieces of information: the source code location node consists of style elements that link to an external style where the CSP header should be added (result of function file, that is, elements in the form . In this case, no trans- considered (CSP). In the reminder of this section, we provide formation is applied, but the domain of the linked style file is details on the transformations performed on specific kinds of added to the CSP. DOM nodes. 3) Other Nodes: The remaining nodes that relate to CSP 1) Script Nodes: There are different types of script are discussed together in this section, as AUTOCSP applies nodes that can appear in the DOM and that must be han- a similar analysis to all such nodes. This part of the ap- dled in different ways. The first type of node is repre- proach analyzes classes of elements that (1) load plugins, sented by inline script elements, that is, script nodes in (2) fetch images, (3) connect to media content, and (4) the form .AUTOCSP transforms frame other resources. AUTOCSP identifies the resource to this type of node to a script node in the form . The new node fetches an external resource is located to the CSP directive corresponding to file stored on the web application server whose content is the the node being analyzed. In addition, if the whitelisted re- original script. In this case, the CSP of the protected document source is coming from the same host of the web applica- gets extended by allowing the web page to fetch the script tion, the approach attaches a Content-Security-Policy: resource from the application server using the script-src: default-src 'none' header to the resource being fetched. host; directive. The second type of node consists of event This is done to avoid XSS attacks that could piggyback on the handlers attributes, such as HTML elements having attributes resource being loaded [2]. The policy used for this resources is in the form . In this as strict as possible to avoid weakening the protection offered case, AUTOCSP replaces the original element with an element by AUTOCSP. Section V shows that this choice did not affect that does not declare the event handler and creates a script the functionality of the web applications in our evaluation. that adds the same event handler to that element. The new script code is placed in an external script file that is linked D. Source Code Transformation to the corresponding HTML document and such that (1) the AUTOCSP’s source code transformation phase modifies the script is activated when the DOM is loaded, and (2) the server-side code of the web application so that it generates domain in which the file is stored is allowed in the CSP. HTML pages (1) with the inferred CSPs enabled and (2) The third type of node consists of elements having attributes conforming to such CSPs. This phase takes as input the that invoke a script using a JavaScript URI, such as . The transformation ap- across multiple executions of the web application and returns plied to this node is similar to the one applied for event a transformed CSP-enabled web application. In the rest of this handlers. First, the new script is placed in an external script section, we illustrate the two main parts of this phase, which file, linked to the HTML document, and activated when the correspond to lines 27–33 in Algorithm 1. DOM is loaded. Then, the domain where the file is stored 1) Enabling CSP: This part of AUTOCSP processes CSP is allowed in the CSP. The final type of script node is a edits created across multiple executions of the web applica- node that links to an external script file in the form . In this case, no transformation is Section III-C, a CSP edit contains the inferred CSP and the applied. However, the domain of the linked script file is added source-code location where the inferred CSP header needs to to the CSP. be added. This latter is normally the code that generates the 2) Style Nodes: Also for style nodes, AUTOCSP treats initial HTML content, as the CSP header needs to be sent to different types of nodes differently. The first type of node the web browser before any other HTML content. are inline style elements, that is, style nodes in the form Because our approach collects information across different .AUTOCSP transforms this type of executions, it is possible to have different CSPs associated node to a node in the form Firefox (v28), Opera (v20), LINPHA 43 231 136 0 and Safari (v7). All exploits were successfully blocked. MYBB 63 598 364 2 For the second part of RQ1, which is about the effects OPENEMR 113 699 533 11 of AUTOCSP on the applications’ functionality. We ran the PHPLIST 77 1224 273 1 retrofitted web applications against the inputs we described SCHOOLMATE 90 16 8 0 SERENDIPITY 65 476 385 6 in Section V-A and checked whether that resulted in er- rors in the browser. Unfortunately, we could not compare A. Experimental Benchmarks and Setup AUTOCSP to any existing tool, as the most related exist- For our empirical evaluation, we used real-world PHP ing approach, DEDACOTA [20], only works on applications web applications that were also used in previous work on written in ASP.NET. However, in order to have a baseline, XSS [10]–[15]. Among the applications used in these papers, we decided to implement two simple approaches: NONE, we selected those that were either used in more than one which simply enables on all generated HTML pages the paper or had a larger code base. This resulted in nine web strictest possible CSP policy (Content-Security-Policy: applications, among which we had to discard two (PHORUM default-src 'none') and SELF, which employes a policy and PHPBB) because they dynamically create the server- that allows guarded resources to fetch content only from side code; that is, in these applications, HTML pages are their same domain of origin (Content-Security-Policy: dynamically created by code that is also dynamically created, default-src 'self'). These simple baselines can give us which is something that AUTOCSP does not handle at the an idea of what would happen if developers would simply moment. apply CSP to a web application without analyzing it. Table I provides a summary description of the seven appli- Table II reports, for each benchmark, the number of test cations we considered. Columns Benchmark, Version, and Type inputs used (TI) and the number of unique (based on the provide name, version, and type of the web application. The client-side code location) client-side errors occurring when last column, KLOC, reports the number of (thousand) lines of using approaches None, Self, and AUTOCSP (and not present PHP code in the benchmark. when executing the original web applications without CSP). We deployed our benchmarks on a server machine with 3GB As expected, all benchmarks generate the largest number of of memory, two Intel Pentium D CPU 3.00GHz processors, errors with approach NONE, which prevents the web browser and running Ubuntu 10.04. We used ZEND 2.4 as the PHP from executing any script, applying any style, and loading application server. To answer our research questions, we ran any external resource in a web page. The number of errors our benchmarks against a set of representative inputs. Because is significant also when using SELF. This is because all the the benchmarks did not include a test suite, and our dynamic benchmarks extensively use inline scripts and style directives taint analysis needs inputs, we deployed our benchmarks and in their HTML pages, and SELF prevents their execution. asked five graduate-level students unfamiliar with AUTOCSP AUTOCSP significantly reduces the number of errors com- to explore the functionality of the web applications. The pared to the other two approaches, but it does not com- students’s sessions were recorded by a Google Chrome ex- pletely eliminate them. More precisely, it transforms three tension we created. These recordings are available, together of the web applications (GALLERY,LINPHA, and SCHOOL- with another Chrome extension for replaying them, on our MATE) without introducing any error, and causes a lim- tool’s website, provided in the Introduction. ited number of errors in the other four applications. We investigated the reasons for these errors and found that B. Results they are of three types: (E1) executions of eval (e.g., a) RQ1: To answer the part of RQ1 about AUTOCSP’s var x = eval('...')), (E2) client-side creation of inline effectiveness, we applied our approach to the benchmarks script nodes in the DOM (e.g., document.write('')), and (E3) client-side creation of applications. Specifically, we created an initial set of seven style nodes in the DOM (e.g., document.write('')). In the applications we considered, there vulnerabilities in our benchmarks (one input per vulnerability). are 4 instances of E1, 5 of E2, and 11 of E3. TABLE III: Performance and modification data for the retrofitted applications in our empirical study. Benchmark To(ms) Tt(ms) Ecsp(#) En(#) F(#) GALLERY 339 391 2 76 12 LINPHA 128 125 2 67 11 MYBB 142 142 5 97 6 OPENEMR 288 288 31 319 52 PHPLIST 193 208 1 33 8 SCHOOLMATE 24 31 1 328 26 SERENDIPITY 473 532 5 103 16 These errors could be easily removed by adding policies 'unsafe-eval' and 'unsafe-inline' to CSP. Doing so, however, would reduce the protection offered by our approach against XSS. To avoid that, a more sophisticated analysis of the semantics of the JavaScript code in the HTML pages would be needed, which is something that AUTOCSP does not do Fig. 5: Number of source code edits performed by AUTOCSP as at the moment. In future work, we plan to extend AUTOCSP more inputs are considered. with (1) an approach similar to the one proposed by Jensen and colleagues [21], to remove errors of type E1, and (2) applications considered. In most cases, the number of edits automated script analysis, to remove errors of types E2 and converges after only 20% of the inputs have been considered, E3. Currently, AUTOCSP simply reports such issues to the and in two cases after 60% are considered (for OPENEMR and web application developers. MYBB, which are the two largest applications in our pool). Based on these results we can answer RQ1 as follows: Overall, these results provide initial evidence that the ap- AUTOCSP is able, for the cases considered, to retrofit CSP to proach is not strongly dependent on the specific inputs used. the web applications and effectively protect them against XSS d) RQ4: The second part of Table III lets us investigate attacks. It may generate false positives that require manual RQ4. It shows the number of modifications performed by intervention in some cases, but the number of such false AUTOCSP on the benchmark applications: the number of positives is significantly lower than if CSP were applied using CSPs added to the web applications (Ecsp), the distinct number a straw man approach. of DOM node edits (En), and the overall number of server- b) RQ2: To answer RQ2, we compared the execution side source code files modified by AUTOCSP (F). For five time of the retrofitted web applications with that of the original out of the seven applications considered, our approach finds applications when run against our set of the test inputs. For more than one CSP to apply in the server-side code, and it each input, we measured the time from when the web browser finds 31 in the worst case. The number of DOM node edits issued an HTTP request to the moment in which the requested per web application is significant and could be in the order page was fully loaded in the web browser. The first part of of a few hundreds even for a small web application (e.g., Table III reports, for each application, the average execution SCHOOLMATE). The number of source code files affected by time in milliseconds over one run of all inputs for the given the changes is significant as well. application. Column To shows the average time for the original These results indicate that automation is necessary to retrofit applications, while Tt shows the average time for retrofitted CSP to existing web applications. applications. As the results show, the overhead is mostly negligible from a practical standpoint. Moreover, it appears VI. LIMITATIONS OF CURRENT TECHNIQUE AND that the larger the subject, the lower the relative overhead, IMPLEMENTATION which is in fact not measurable for OPENEMR, our largest Our current prototype tool does not fully support, in its application. source code transformation phase, web applications that can We can therefore say, for RQ2, that the transformations dynamically generate server-side code. More precisely, the introduced by AUTOCSP do not seem to significantly affect tool cannot apply changes to source code statements that the performance of the retrofitted web applications. are dynamically generated by server-side code. (In this case, c) RQ3: To answer RQ3, we compared the number of however, the prototype could still be used as a reporting tool; source code edits performed by AUTOCSP as more inputs the generated report would help developers understand how to are considered for its taint analysis. For each application retrofit the computed CSP to the analyzed web application.) considered, we computed the total number of edits when In addition, AUTOCSP currently performs only a superficial considering 0%, 20%, 40%, 60%, 80%, and 100% of the inputs analysis of the JavaScript code in the generated HTML pages. of a server-side PHP file, where 100% means running that file As a consequence, it may generate a CSP that is too strict and against all of its inputs in the input set. Because we randomly disrupts the application’s functionality. selected the subset of inputs, we repeated the experiment 30 Despite these limitations, our empirical evaluation shows times and reported the average of these results in Figure 5. In initial evidence that our approach can be practical, effective, the figure, AUTOCSP shows a similar trend for all of the web and produce a low rate of false positives. VII. RELATED WORK dynamic tainting implemented by this technique differs from ours. We propagate taint marks associated to every type of data DEDACOTA [20] is the work most closely related to ours. It in the program, independently from its type. In addition, we differs, however, in the nature of the approach, as it statically create an extension of the PHP engine and do not modify the rewrites a web application to separate data and code in the interpreter, which enables portability across multiple versions generated web pages and applies CSP on the transformed of PHP. CSSE is a method to detect and prevent injection version of the application. Specifically, it performs static data- attacks [30]. Their technique assigns metadata to user inputs, flow analysis to approximate the HTML output of a web propagates and checks metadata based on the concepts of page and then rewrites the HTML such that inline JavaScript metadata-string operations and context-sensitive string evalua- is stored in a separate JavaScript file. DEDACOTA does not tion. The technique also modifies the core of the PHP engine, deal with the problem of rewriting CSS code and inline event which hinders portability. In addition, the technique might handlers (although it could probably be extended to do so). In suffer from imprecision because it deals only with metadata of our evaluation, we found that inline event handlers are a con- string values. Finally, the authors mention that their technique spicuous part of the content that needs to be rewritten. We also can protect against XSS attacks and provide one example for found that rewriting inline event handlers strictly depends on one class of such attacks. AUTOCSP, conversely, by relying the DOM node in which they appear, and that this information on CSP, can protect against different classes of XSS attacks. can be better determined with a dynamic approach. PIXY [13], Client-side protection mechanisms (e.g., [3], [31]) aim [22] is a technique for detecting XSS vulnerabilities based on to enhance the client-side code. BEEP [31], in particular, static data-flow analysis of PHP scripts. PIXY and AUTOCSP whitelists scripts on the server side and specifies a security have similar goals, but operate differently: PIXY aims to find policy for every web page. The security policy enables or dis- vulnerabilities in web applications, whereas AUTOCSP aims ables execution of scripts and is similar to CSP [3]. However, to protect them using CSP. CSP handles more classes of HTML elements and provides Server-side protection mechanisms (e.g., [23]–[26]) range better guarantees of protection on the client side, as it is a from techniques that provide defenses based on input sanitizer W3C standard implemented by all major web browsers. to methods that can differentiate between legal and illegal scripts. Di Lucca and colleagues [23] propose a mixed static VIII. CONCLUSION and dynamic approach to find XSS vulnerabilities. Static We presented AUTOCSP, a novel approach for retrofitting analysis identifies web pages that might be vulnerable to CSP to existing web applications. Given a web application, XSS attacks, whereas dynamic analysis verifies whether such AUTOCSP first performs positive dynamic tainting on the web pages are actually vulnerable. SANER [24] statically serve-side code of the application. It then uses the computed tracks unsafe information from sources to sinks and applies taint information to find trusted elements of dynamically input sanitization. Subsequently, the technique tests for proper generated HTML pages and infer a policy that would block sanitization along the analyzed paths. This technique per- potentially untrusted elements while allowing the trusted ones. forms the opposite type of information-flow tracking than Finally, it automatically modifies the server-side code of the AUTOCSP (i.e., untrusted vs trusted). SCRIPTGARD [25] web application so that it generates web pages with the performs context-sensitive sanitization to match the browser’s appropriate CSP. To assess precision, practicality, and effec- parsing behavior. In a similar fashion, our approach tries to tiveness of AUTOCSP, we implemented it in a tool that targets emulate browser parsing behavior when building the DOM PHP web applications and performed an empirical evaluation tree for an HTML web page. XSS-GUARD [26] learns on a set of real-world web applications. The results of our legal scripts generated by HTML requests and removes illegal evaluation show that, for the cases considered, AUTOCSP was content from the output of dynamically-generated HTML effective in retrofitting CSP to the existing applications while pages. Similarly, our approach tries to identify legal scripts either preserving their functionality or minimally affecting it. but whitelists them instead of removing the untrusted one from In future work, we will first address the limitations of our generated web pages. In general, many of these techniques rely current implementation. We then plan to extend AUTOCSP on some form of negative tainting and sanitization, which is so that it automatically updates the computed CSPs for a error prone and can lead to false negatives. web application as developers modify the application during Researchers also explored XSS protection mechanisms evolution. based on data-flow analysis (e.g., [13], [22], [27]–[30]). ACKNOWLEDGEMENTS Among those, dynamic tainting techniques (e.g., [29], [30]) are most closely relate to our work. Nguyen-Tuong and col- We thank Abhik Roychoudhury for his input and several leagues [29] presented a technique that replaces the standard fruitful discussions. This work is partly supported by NSF PHP interpreter with a modified one to track taint marks awards CCF-1320783 and CCF-1161821, by funding from of string values. Based on the computed taint information, Google, IBM Research, and Microsoft Research to Georgia they check whether elements of the dynamically generated Tech, by the Ministry of Education, Singapore under Grant HTML pages were created from untrusted sources; if found, No. R-252-000-495-133, by a University Research Grant from these elements are suitably removed or sanitized. The kind of Intel, and by a research grant from Symantec. REFERENCES [18] “DOM based XSS Prevention Cheat Sheet - OWASP,” https://www. owasp.org/index.php/DOM based XSS Prevention Cheat Sheet, 2014. [1] “Can I use... Support tables for HTML5, CSS3, etc,” http://www.caniuse. [19] “HTML5 Security Cheatsheet,” https://html5sec.org/, 2014. com/contentsecuritypolicy, 2014. [20] A. Doupe,´ W. Cui, M. H. Jakubowski, M. Peinado, C. Kruegel, and [2] A. Barth, J. Caballero, and D. Song, “Secure Content Sniffing for Web G. Vigna, “deDacota: Toward Preventing Server-Side XSS via Automatic Browsers, or How to Stop Papers from Reviewing Themselves,” in Code and Data Separation,” in Proceedings of the ACM SIGSAC Proceedings of the IEEE Symposium on Security and Privacy (S&P), Conference on Computer amd Communications Security (CCS), 2013. 2009. [21] S. H. Jensen, P. A. Jonsson, and A. Møller, “Remedying the Eval That [3] S. Stamm, B. Sterne, and G. Markham, “Reining in the Web with Men Do,” in Proceedings of the International Symposium on Software Content Security Policy,” in Proceedings of the International World Wide Testing and Analysis (ISSTA), 2012. Web Conference (WWW), 2009. [22] N. Jovanovic, C. Kruegel, and E. Kirda, “Precise Alias Analysis for [4] “Content Security Policy 1.0,” http://www.w3.org/TR/CSP/, 2014. Static Detection of Web Application Vulnerabilities,” in Proceedings of [5] W. Halfond, A. Orso, and P. Manolios, “WASP: Protecting Web Ap- the Workshop on Programming Languages and Analysis for Security plications Using Positive Tainting and Syntax-Aware Evaluation,” IEEE (PLAS), 2006. Transactions on Software Engineering (TSE), vol. 34, no. 1, pp. 65–81, [23] G. D. Lucca, A. Fasolino, M. Mastroianni, and P. Tramontana, “Identi- 2008. fying Cross Site Scripting Vulnerabilities in Web Applications,” in 6th [6] “HTML Standard,” http://www.whatwg.org/specs/web-apps/ IEEE International Workshop on Web Site Evolution (WSE), 2004. current-work/multipage/, 2014. [24] D. Balzarotti, M. Cova, V. Felmetsger, N. Jovanovic, E. Kirda, [7] “PHP: PHP Usage Stats,” http://php.net/usage.php, 2014. C. Kruegel, and G. Vigna, “Saner: Composing Static and Dynamic [8] “jsoup Java HTML Parser, with best of DOM, CSS, and ,” http: Analysis to Validate Sanitization in Web Applications,” in Proceedings //jsoup.org/, 2014. of the IEEE Symposium on Security and Privacy (S&P), 2008. [9] “FreeMarker Java Template Engine - Overview,” http://freemarker.org/, [25] P. Saxena, D. Molnar, and B. Livshits, “ScriptGard: Automatic Context- 2014. Sensitive Sanitization for Large-Scale Legacy Web Applications,” in [10] P. Vogt, F. Nentwich, N. Jovanovic, E. Kirda, C. Kruegel, and G. Vigna, Proceedings of the ACM Conference on Computer and Communications “Cross Site Scripting Prevention with Dynamic Data Tainting and Security (CCS), 2011. Static Analysis,” in Proceedings of the Network and Distributed System [26] P. Bisht and V. Venkatakrishnan, “XSS-GUARD: Precise Dynamic Security Symposium (NDSS), 2007. Prevention of Cross-Site Scripting Attacks,” in Proceedings of the Inter- [11] G. Wassermann and Z. Su, “Static Detection of Cross-Site Scripting national Conference on Detection of Intrusions and Malware (DIMVA), Vulnerabilities,” in Proceedings of the International Conference on 2008. Software Engineering (ICSE), 2008. [27] O. Tripp, M. Pistoia, S. Fink, M. Sridharan, and O. Weisman, “TAJ: [12] L. K. Shar, H. B. K. Tan, and L. C. Briand, “Mining SQL Injection and Effective Taint Analysis of Web Applications,” in Proceedings of the Cross Site Scripting Vulnerabilities using Hybrid Program Analysis,” in ACM SIGPLAN Conference on Programming Language Design and Proceedings of the International Conference on Software Engineering Implementation (PLDI), 2009. (ICSE), 2013. [28] V. B. Livshits and M. S. Lam, “Finding Security Vulnerabilities in [13] N. Jovanovic, C. Kruegel, and E. Kirda, “Pixy: A Static Analysis Tool Java Applications with Static Analysis,” in Proceedings of the USENIX for Detecting Web Application Vulnerabilities,” in Proceedings of the Security Symposium, 2005. IEEE Symposium on Security and Privacy (S&P), 2006. [29] A. Nguyen-Tuong, S. Guarnieri, D. Greene, J. Shirley, and D. Evans, [14] J. Weinberger, P. Saxena, D. Akhawe, M. Finifter, R. Shin, and D. Song, “Automatically Hardening Web Applications Using Precise Tainting,” in “A Systematic Analysis of XSS Sanitization in Web Application Frame- Proceedings of the International Information Security Conference, 2005. works,” in Proceedings of the European Symposium on Research in [30] T. Pietraszek and C. V. Berghe, “Defending against Injection Attacks Computer Security (ESORICS), 2011. through Context-Sensitive String Evaluation,” in Proceedings of the [15] A. Kieyzun, P. J. Guo, K. Jayaraman, and M. D. Ernst, “Automatic Cre- International Symposium on Recent Advances in Intrusion Detection ation of SQL Injection and Cross-Site Scripting Attacks,” in Proceedings (RAID), 2005. of the International Conference on Software Engineering (ICSE), 2009. [31] T. Jim, N. Swamy, and M. Hicks, “Defeating Script Injection Attacks [16] “XSS Filter Evasion Cheat Sheet - OWASP,” https://www.owasp.org/ with Browser-Enforced Embedded Policies,” in Proceedings of the index.php/XSS Filter Evasion Cheat Sheet, 2014. International World Wide Web Conference (WWW), 2007. [17] “XSS (Cross Site Scripting) Prevention Cheat Sheet - OWASP,” https://www.owasp.org/index.php/XSS (Cross Site Scripting) Prevention Cheat Sheet, 2014.