VEX: Vetting Browser Extensions for Security Vulnerabilities



Sruthi Bandhakavi Samuel T. King P. Madhusudan Marianne Winslett University of Illinois at Urbana Champaign {sbandha2,kingst,madhu,winslett}@illinois.edu

Abstract

The browser has become the de facto platform for everyday computation. Among the many potential attacks that target or exploit browsers, vulnerabilities in browser extensions have received relatively little attention. Currently, extensions are vetted by manual inspection, which does not scale well and is subject to human error.

In this paper, we present VEX, a framework for highlighting potential security vulnerabilities in browser extensions by applying static information-flow analysis to the JavaScript code used to implement extensions. We describe several patterns of flows as well as unsafe programming practices that may lead to privilege escalations in Firefox extensions. VEX analyzes Firefox extensions for such flow patterns using high-precision, context-sensitive, flow-sensitive static analysis. We analyze thousands of browser extensions, and VEX finds six exploitable vulnerabilities, three of which were previously unknown. VEX also finds hundreds of examples of bad programming practices that may lead to security vulnerabilities. We show that compared to current Mozilla extension review tools, VEX greatly reduces the human burden for manually vetting extensions when looking for key types of dangerous flows.

1 Introduction

Driving the Internet revolution is the modern web browser, which has evolved from a relatively simple client application designed to display static data into a complex networked operating system tasked with managing many facets of a user's on-line experience. To help meet the varied needs of a broad user population, browser extensions expand the functionality of browsers by interposing on and interacting with browser-level events and data. Some extensions are simple and make only small changes to the appearance of web pages or the browser itself. Other extensions provide more sophisticated functionality, such as NOSCRIPT that provides fine-grained control over JavaScript execution [20], or GREASEMONKEY that provides a full-blown programming environment for scripting browser behavior [6]. These are just a few of the thousands of extensions currently available for Firefox, the second most popular browser today¹.

¹ Firefox now surpasses Internet Explorer in W3schools traffic (www.w3schools.com/browsers/browsers_stats.asp), arguably due to the popularity of Firefox extensions.

Extensions written with benign intent can have subtle vulnerabilities that expose the user to a disastrous attack from the web, often just by viewing a web page. Firefox extensions run with full browser privileges, so attackers can potentially exploit extension weaknesses to take over the browser, steal cookies or protected passwords, compromise confidential information, or even hijack the host system, without revealing their actions to the user. Unfortunately, tens of extension vulnerabilities have been discovered in the last few years, and capable attacks against buggy extensions have already been demonstrated [23].

To help reduce the attack surface for extensions, Mozilla provides a set of security primitives to extension developers. However, these security primitives are discretionary, and can be difficult to understand and use correctly. For example, Firefox provides an evalInSandbox(text, sandbox) function that returns the result of evaluating the text string under the restricted privileges associated with the environment sandbox. Using evalInSandbox correctly requires developers to test the result of a call to evalInSandbox with the non-traditional "===" rather than "==", as the "==" operation may invoke unsafe code as a side effect (see http://developer.mozilla.org/En/Components.utils.evalInSandbox for details).

Current approaches from the research community propose dynamic techniques for improving the security of extensions. The SABRE system tracks tainted JavaScript objects to prevent extensions from accessing sensitive information unsafely [9]. Although SABRE can prevent potentially malicious flows from both exploited extensions and from malicious extensions, SABRE adds overhead to all JavaScript execution within the browser, adding 6.1x overhead for the SunSpider benchmark and 2.36x overhead for the V8 JavaScript benchmark. Furthermore, SABRE's dynamic nature pushes security violation notification to users who are unable to determine if a particular flow is malicious or benign. The Google Chrome Extension System revisits the overall extension API to make it easier for the browser to enforce least privilege and strong isolation on extensions [4]. Their system works by partitioning the full set of extension functionality into different protection domains, and sand-boxing extensions to prevent them from obtaining more privileges than needed. Although this system is likely to limit the damage from some extension attacks, it does little to prevent the vulnerabilities themselves.

In this paper, we propose VEX, a system for finding vulnerabilities in browser extensions using static information-flow analysis. Many vulnerabilities translate to certain types of explicit information flows from injectable sources to executable sinks. For extensions written with benign intent, most attacks involve the attacker injecting JavaScript into a data item that is subsequently executed by the extension under full browser privileges. We identify key flows of this nature that can lead to security vulnerabilities, and we analyze for these flows statically using high-precision static analysis that is both path-sensitive and context-sensitive, to minimize the number of false positive suspect flows. VEX uses precise summaries to analyze code, and has special features to handle the quirks of JavaScript (e.g., VEX does a constant string analysis for expressions that flow into the eval statement). Because VEX uses static analysis, we avoid the runtime overhead induced by dynamic approaches.

Determining whether extensions are malicious or harbor security vulnerabilities is a hard problem. Extensions are typically complex artifacts that interact with the browser in subtle and hard to understand ways. For example, the ADBLOCK PLUS extension performs the seemingly simple task of filtering out ads based on a list of ad servers. However, the ADBLOCK PLUS implementation consists of over 11K lines of JavaScript code. Similarly, the NOSCRIPT extension provides fine-grained control over which domains are allowed to execute JavaScript and basic cross-site scripting protection. The NOSCRIPT extension implementation consists of over 19K lines of JavaScript code. Also, ADBLOCK PLUS had 30 releases in 1/1/06–11/20/09, and NOSCRIPT had 38 releases just in 1/1/09–11/20/09. While Mozilla uses volunteers to vet each new extension and revision before posting it on their official list of approved Firefox extensions, examining an extension to find a vulnerability requires a detailed understanding of the code to reason about anything beyond the most basic type of information flow. Thus tools to help vet browser extensions can be very useful for improving the security of extensions.

We show that VEX can catch several known vulnerabilities, such as a vulnerability in the FIZZLE extension [8], and also find new problems, including exploitable vulnerabilities in BEATNIK and WIKIPEDIA TOOLBAR. In particular, VEX reported a previously unknown vulnerability in WIKIPEDIA TOOLBAR that could lead to an attack, and that resulted in the report CVE-2009-4127. We reported this vulnerability to the WIKIPEDIA TOOLBAR developers, who fixed the extension. We also show that VEX can help to find the use of unsafe programming practices, such as misuse of evalInSandbox, that can result from subtle information flows.

The remainder of the paper is organized as follows. Section 2 describes the threat model and the assumptions under which we analyze the browser extensions. Section 3 provides background material on the architecture of Firefox and the nature of certain key undesirable information flows in its extensions. Section 4 describes our static analysis and the various design choices we made to build VEX. Section 5 lists and describes our results. Section 6 surveys related work, and Section 7 concludes the paper.

2 Threat model, assumptions, and usage model

In this paper, we focus on finding security vulnerabilities in buggy browser extensions. We do not try to identify malicious extensions, bugs in the browser itself, or bugs in other browser extensibility mechanisms, such as plug-ins. We assume that the developer is neither malicious nor trying to obfuscate extension functionality, but we assume the developer could write incorrect code that contains vulnerabilities.

We use two attack models. First, we consider attacks that originate from web sites, and we assume the attacker can send arbitrary HTML and JavaScript to the user's browser. We focus on attacks where this untrusted data can lead to code injection or privilege escalation through buggy extensions. In the second attack model, we consider some web sites as trusted. For example, if an extension gleans information from Facebook, we assume that the Facebook code will not include arbitrary HTML and JavaScript, but only well formatted and trusted data.

According to the Mozilla developer site, Mozilla has a team of volunteers who help vet extensions manually.

Figure 1: The overall analysis process of VEX.

They run new and updated extensions isolated in a virtual machine to test the user experience. The editors also use a validation tool, which uses grep to look for key indicators of bugs. Many of the patterns they search for involve interactions between extensions and web pages, and they use their understanding of these patterns to help guide their inspection of the code. Our goal is to help automate this process, so that analysts can quickly hone in on particular snippets of code that are likely to contain security vulnerabilities. Figure 1 shows our overall work flow for using VEX.

3 Background

3.1 Mozilla privilege levels

Firefox has two privilege levels: page, for the web page displayed in the browser's content pane; and chrome, for elements belonging to Firefox and its extensions, i.e., everything surrounding the content pane. Page privileges are more restrictive than chrome privileges. For example, a page loaded from site x cannot access content from sites other than x. General Firefox code runs with full chrome privileges, which give it access to all browser states and events, OS resources like the file system and network, and all web pages. Firefox provides the extensions with full chrome privileges by exposing a special API called the XPCOM Components to extension JavaScript, thereby allowing the extensions to have access to all the resources Firefox can access.

Extensions can often access objects that run with page privileges and interact with page content, as well as objects that run with full chrome privileges. Extensions can also include user interface components via a chrome document, which also runs with full chrome privileges. For example, the object window refers to the chrome window and the object window.content refers to the content window. To access the document object referring to the content (i.e., the user page), the extension has to access the document property of the content window, i.e., window.content.document.

To make this extension architecture practical, Firefox has APIs for extension code to communicate across protection domains. These interactions are one cause of extension security vulnerabilities. As the Mozilla developer site explains, "One of the most common security issues with extensions is execution of remote code in privileged context. A typical example is an RSS reader extension that would take the content of the RSS feed (HTML code), format it nicely and insert into the extension window. The issue that is commonly overlooked here is that the RSS feed could contain some malicious JavaScript code and it would then execute with the privileges of the extension – meaning that it would get full access to the browser (cookies, history etc) and to user's files" [sic].

3.2 Points of attack

Here we discuss key vulnerable points for code injection and privilege escalation attacks against non-malicious extensions: eval, evalInSandbox, innerHTML, and wrappedJSObject. We focus on these Firefox features because they are key points of interaction between objects with page and chrome privileges, respectively, and this interaction is a key source of security vulnerabilities, as noted above. Though other avenues of attack are possible, we do not consider them here.
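One of these interaction hazards, comparing an untrusted object with "==", can be reproduced in plain JavaScript without any Firefox API. The sketch below is only an illustration under stated assumptions: the object named untrusted is a hypothetical stand-in for a value handed back from page-privileged code, and the counter stands in for whatever code an attacker might place in the overridden methods.

```javascript
// Hypothetical stand-in for an object returned from untrusted page code.
// An attacker controls its valueOf/toString methods.
let sideEffects = 0;
const untrusted = {
  valueOf() {
    sideEffects += 1; // attacker-chosen code would run here
    return 42;
  },
  toString() {
    sideEffects += 1;
    return "42";
  },
};

// Loose equality coerces the object to a primitive, which calls valueOf.
const looseResult = untrusted == 42;
const afterLoose = sideEffects;

// Strict equality compares without coercion, so no method is invoked.
const strictResult = untrusted === 42;
const afterStrict = sideEffects;

console.log({ looseResult, afterLoose, strictResult, afterStrict });
```

In this sketch the loose comparison runs the object's method once and the strict comparison runs nothing, which is the reason a strict comparison is the safer choice for values that originate in a lower-privileged context.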

eval: The eval function call interprets string data as JavaScript, which it executes dynamically. This flexible mechanism can be used to generate JavaScript code dynamically, for example to serialize JSON objects. However, this flexibility can lead to code injection vulnerabilities in extensions. If extensions execute eval functions on un-sanitized strings that come from untrusted web pages, the attacker will be able to inject JavaScript code that will run with full chrome privileges.

innerHTML: Each HTML element for a page has an innerHTML property that defines the text that occurs between that element's opening and closing tag. Extensions can change the innerHTML property to alter existing document object model (DOM) elements, or to add new DOM elements. When an extension modifies the innerHTML property, the browser re-parses and processes the new data. Thus, passing specially crafted un-sanitized strings (e.g., tags with script in their onload attribute) into innerHTML modifications can lead to code injection attacks.

evalInSandbox: One way Firefox facilitates communication across protection domains is through the evalInSandbox method. This method enables extensions to execute JavaScript in the extension's context with restricted privileges, thus enabling extensions to process untrusted data from web pages safely. The sandbox object is an empty JavaScript object created with restricted privileges. For example, the call s = Sandbox("http://www.w3.org/") creates a sandbox s where code can execute with page privileges, as though it came from the domain www.w3.org. One can add properties to this object by calling the evalInSandbox function, and any attempts to access global scope objects from within evalInSandbox, including privileged chrome objects, are denied. evalInSandbox complicates extension programming because objects returned from the method call execute in the extension with full chrome privileges. Since methods associated with the object could have been modified within the sandbox, they should not be called in the chrome context. For example, "==" should not be used on these objects as its evaluation calls the toString or valueOf method, which could have been modified; instead the non-traditional "===" operator needs to be used.

wrappedJSObject: JavaScript objects can be dynamically modified. That means that any web page can modify the properties of the document object. For example, a web page can reassign the getElementById method to return a malicious script. To prevent this script from being executed by the extension when it calls window.content.document.getElementById, Firefox automatically wraps the object so that the window.content.document accesses only use the original document object, not the modified one. However, Firefox also provides the wrappedJSObject method, which lets the extension access the modified version, even when automatic wrapping is turned on; calling wrappedJSObject on a content document is potentially dangerous.

3.3 Suspicious flow patterns

In this section we discuss the five source-to-sink flows that might be vulnerable. Specifically, we track flows from Resource Description Framework (RDF) data (e.g., bookmarks) to innerHTML, content document data to eval, content document data to innerHTML, evalInSandbox return objects used improperly by code running with chrome privileges, or wrappedJSObject return objects used improperly by code running with chrome privileges. These flows do not always result in a vulnerability, and they are by no means an exhaustive list of all possible extension security bugs, but they are the patterns we use in our tool.

RDF is a model for describing hierarchical relationships between browser resources [33]. Extension developers can store persistent extension data in an RDF file, or access browser resources, such as bookmarks, stored in RDF format. RDF data can come from untrusted sources. For example, when a user stores a bookmark, Firefox records the un-sanitized title of the bookmarked page in the RDF file. Extensions that use RDF data need to sanitize it properly if they use it directly in an innerHTML statement that modifies an element in a chrome document.

Content document data flowing to eval or innerHTML can sometimes be exploited. This flow can result in script execution with chrome privileges if specially crafted content from the window.content.document object is passed to eval or to the innerHTML of an element in the chrome document.

For evalInSandbox and wrappedJSObject, problems can only result if the return values of these constructs are executed with chrome privileges. For evalInSandbox this means comparing return values using == or != from code running with chrome privileges. For wrappedJSObject, this means making method calls on returned objects from code running with chrome privileges.

Such flow patterns may occur in only a few of the extensions that use these constructs. According to the Mozilla extension review web page, reviewers have an open-source automatic tool to help with reviews (see https://addons.mozilla.org/en-US/firefox/pages/validation), but this tool just greps for strings that indicate dangerous patterns. Afterward, the reviewer must go through the code of each suspect extension to understand the flows and determine which constitute vulnerabilities and which are benign. As this task is difficult, painful, and error-prone, we designed the VEX tool to help extension reviewers vet the flows in extensions automatically, greatly reducing the number of extensions that need manual review.

4 Static information flow analysis

We develop a general explicit information flow static analysis tool VEX for JavaScript that computes flows between any source and sink, including the flows described in Section 3.3. While we could develop analysis techniques for a particular source and sink, we prefer a more general technique that will perform the analysis once, and from the results, allow us to search for any source-to-sink flow. This allows VEX to be run in a single pass over thousands of extensions, rather than using separate passes for each target pattern.

To support fine-grained information-flow analysis, VEX tracks the precise dependencies of flows from variables to objects created in the JavaScript extension, using a taint-based analysis. Motivated by the fact that every flow reported needs to be checked manually for attacks, which can take considerable human effort, we aim for an analysis that admits as few false positives as possible (false positives are non-existent flows reported by the tool).

Statically analyzing JavaScript extensions for flows is a non-trivial task. JavaScript extensions have a large number of objects and functions. In addition to the objects defined in the program, the extensions can also access the browser's DOM API and the Firefox Extension API provided by XPCOM components. The objects are also dynamic, in the sense that new object properties can be created dynamically at run-time. Functions are objects in JavaScript, and hence can be created, redefined dynamically, and passed as parameters. The challenge is to accurately keep track of such objects, properties, and the corresponding flows to them.

Our analysis keeps track of an abstract heap (AH) that is not a priori bounded, and keeps track of the precise heap nodes and field relations and corresponding flows, but ignores the exact primitive values in the heap (like integers). However, we bound the number of iterations in computing the least fixed-point, and hence the abstract heap gets bounded implicitly.

The abstract heap transformations for any statement closely mimic a big-step operational semantics for JavaScript, except that primitive values are forgotten, and hence conditionals are not evaluated; we refer the reader to work on operational semantics of JavaScript [27, 18]. Apart from tracking heap structures, the abstract heap also records explicit-flow dependencies to heap nodes, and the rules for updating flows naturally depend on the program's semantics. Also, as we talk about in more detail below, there are some aspects of the heap (such as prototype fields) that are not currently supported in our tool. The static analysis itself is flow-sensitive and context-sensitive, and the context-sensitivity is handled using classical function-summary based methods.

The above choices, namely the choice of abstract heaps, and the context-sensitive flow-sensitive analysis, are design choices we have made, based on our experiments with extensions for over a year, and were motivated to reduce false positives. However, we have not tried all variants of these choices, and it is possible that other choices (for example, choosing to bound abstract heaps by merging objects created at a program site), may also work well on extensions. However, we do know that context-sensitivity is important (in several extensions we manually examined) and further flow-sensitivity seems important if the tool is extended to consider sanitization routines as flow-stoppers.

The rest of this section is structured as follows. First we explain our analysis using abstract heaps for a core subset of JavaScript, which does not have statements like eval, associative array accesses, calls to Firefox APIs, etc. Subsequently, we describe how we handle the aspects not covered in the core.

4.1 Analysis of a core subset of JavaScript

Core JavaScript: A core subset of JavaScript is given in Figure 2; this core reflects the aspects of JavaScript described above, but omits certain features (such as eval) which we will describe later.

Abstract Heaps: Our analysis keeps track of one abstract heap at each program point. This abstract heap tracks JavaScript objects and functions and the relationships between them in the form of a graph. Each node in the graph is a heap location generated by the program. Two different nodes, n1 and n2, are connected by an edge labeled f, if node n1's property f may refer to n2. To keep track of the actual information flows between different program variables, we also keep track of all the program variables that flow into the nodes in the abstract heap. Let PVar be the set of all the program variables in the JavaScript program.

More precisely, an abstract heap σ is a tuple (ns, n, d, fr, dm, tm), where:

• ns is a set of heap locations,

• n ∈ (ns ∪ {⊥}) represents the current node, and is either a node in the heap or the symbol ⊥,

• d ⊆ PVar represents the subset of program variables that flow in to the current node n,

• fr ⊆ ns × PVar × (ns ∪ {⊥}) encodes the pointers representing properties (fields). A triple (n1, f, n2) ∈ fr means that the property f of the object n1 may be located at n2.

• dm ⊆ ns × PVar is a relation that denotes a dependency map. A pair (n1, x) ∈ dm denotes that the program variable x flows into the node n1.

• tm : ns × ns is a "this-map" relation, which is actually the relation of a function. A pair (n1, n2) ∈ tm means that the scope of n1 is n2.

Notation: The relation tm will always be a function; we define formally the function tm : ns → ns as tm(n) = n′, where (n, n′) ∈ tm. Let dm : ns → 2^PVar be the function that corresponds to the relation dm, dm(n) = {x | (n, x) ∈ dm}, i.e. the set of all the program variables that flow into the node n.

    EXPRESSIONS ::=
      | c                             (CONSTANT)
      | x                             (VARIABLE)
      | x.f                           (FIELD ACCESS)
      | x.prot                        (PROTO ACCESS)
      | e op e                        (BINARY OP)
      | this                          (THIS)
      | {f1 : e1, ..., fn : en}       (OBJECT LITERAL)
      | function (p1, ..., pn){S}     (FUNCTIONDEF)
      | f(a1, ..., an)                (FUNCTIONCALL)
      | new f(a1, ..., an)            (NEW)

    STATEMENTS ::=
      | skip                          (SKIP)
      | S1 ; S2                       (SEQ)
      | var x                         (VARIABLE DECL.)
      | x := e                        (ASSIGN 1)
      | x.f := e                      (ASSIGN 2)
      | if e then S1 else S2          (CONDITIONAL)
      | while e do S od               (WHILE)
      | return e                      (RETURN)

Figure 2: Core JavaScript syntax.

The Analysis: We now describe our analysis for the core subset of JavaScript. VEX handles functions and objects by creating a node for every object or function and their properties. Relationships between various nodes are accurately generated and tracked in the analysis. JavaScript uses prototype-based inheritance; however, our analysis does not track prototypes. Instead, a new property insertion into the prototype field of an object is treated as if the property is being inserted into the object itself. We found that this is sufficient in case of JavaScript extensions as the inheritance chain is not deep in most cases. VEX keeps track of the accurate scope information using the this-map.

Our analysis consists of a set of rules for generating abstract heaps at program points, and is defined by essentially capturing the effect of statements on the abstract heap. These rules follow a big-step operational semantics adapted to work on the abstracted heap.

The big-step operational semantics on abstract heaps is defined as a relation (Prog, σ) ⇓ σ′, where Prog is a program expression or statement and σ and σ′ are abstract heaps. Such a relation intuitively means that σ′ is the heap obtained from the complete evaluation of Prog starting from the heap σ. This resulting heap, in every iteration, will be merged with the current heap after the program, conservatively taking the union of dependencies. We now define this relation for expressions and statements.

Notation: For any abstract heap σ, let σ = (ns_σ, n_σ, d_σ, fr_σ, dm_σ, tm_σ). In other words, n_σ refers to the second component of σ, etc. The function fresh() creates a new heap location. A special node n_G represents the global heap, which consists of the objects like Object, Array, etc.

Evaluating expressions: Figure 3 gives the rules for evaluating expressions in the program.

Rule (CONSTANT) evaluates to a ⊥ node with empty dependencies. Rule (THIS) extracts the scope of the current node. The next five rules describe the variable and field access expressions.

In case of a variable access, the existence of property x is checked in the current scope, represented by n_σ (rule (VAR)), and the property is returned if it exists. If it is not in the current scope, then the global node (rule (GLOBAL VAR)) is checked for property x. If it exists, then it is returned with dependencies. If the location for a particular variable is found in neither the current scope nor the global scope, using rule (UNINITIALIZED VAR) we create a new node n_new and add it to the global scope. Similar rules apply for field accesses in rules (FIELD ACCESS) and (UNINIT FLD).

For binary operators (rule (BINARY OP)), we return the union of dependencies of both the expressions. When an object literal expression (rule (OBJ.LIT.)) is encountered, a summary is computed by recursively creating heap locations for each of its properties and then creating the graph where the new object node is linked to the properties with the labeled edges.

A function definition (rule (FUN-DEF)) is treated in a similar fashion as the object literal, except that new summary locations are created for each of the function arguments and also for the return variable (i.e. n_new^RET). The function body is evaluated with respect to the new heap. The result of the evaluation is the new heap with the function summary attached to the node n_new^RET. A function call (rule (FUN-CALL1)) uses this summary to compute the node and dependencies of the return value. The return value of the function can be obtained by evaluating each of the function argument expressions, and replacing the appropriate nodes in the function summary with the values returned. If the function is not defined, then the dependencies of the return values are the union of dependencies of the individual function parameters (rule (FUN-CALL2)).

A constructor expression (containing new) is similar to a function call, where if the object being instantiated is retrieved from the local or the global scope, then a copy of the graph starting with this object is created and returned.

\[ \frac{\cdot}{(c, \sigma) \Downarrow (\mathit{ns}_{\sigma}, \bot, \emptyset, \mathit{fr}_{\sigma}, \mathit{dm}_{\sigma}, \mathit{tm}_{\sigma})} \quad \text{(CONSTANT)} \]

\[ \frac{\cdot}{(\mathtt{this}, \sigma) \Downarrow (\mathit{ns}_{\sigma}, \mathit{tm}_{\sigma}(n_{\sigma}), \mathit{dm}_{\sigma}(\mathit{tm}_{\sigma}(n_{\sigma})), \mathit{fr}_{\sigma}, \mathit{dm}_{\sigma}, \mathit{tm}_{\sigma})} \quad \text{(THIS)} \]

\[ \frac{(n_{\sigma}, x, n_x) \in \mathit{fr}_{\sigma}}{(x, \sigma) \Downarrow (\mathit{ns}_{\sigma}, n_x, \mathit{dm}_{\sigma}(n_x), \mathit{fr}_{\sigma}, \mathit{dm}_{\sigma}, \mathit{tm}_{\sigma})} \quad \text{(VAR)} \]

\[ \frac{\nexists n'_x.\,(n_{\sigma}, x, n'_x) \in \mathit{fr}_{\sigma} \qquad (n_G, x, n_x) \in \mathit{fr}_{\sigma}}{(x, \sigma) \Downarrow (\mathit{ns}_{\sigma}, n_x, \mathit{dm}_{\sigma}(n_x), \mathit{fr}_{\sigma}, \mathit{dm}_{\sigma}, \mathit{tm}_{\sigma})} \quad \text{(GLOBAL VAR)} \]

\[ \frac{\nexists n'_x.\,(n_{\sigma}, x, n'_x) \in \mathit{fr}_{\sigma} \qquad \nexists n''_x.\,(n_G, x, n''_x) \in \mathit{fr}_{\sigma} \qquad n_G \neq n_{\sigma}}{(x, \sigma) \Downarrow (\mathit{ns}_{\sigma} \cup \{n_{new}\}, n_{new}, \emptyset, \mathit{fr}_{\sigma} \cup \{(n_G, x, n_{new})\}, \mathit{dm}_{\sigma}, \mathit{tm}_{\sigma} \cup \{(n_{new}, n_G)\})} \quad \text{(UNINITIALIZED VAR)} \]

where \( n_{new} = \mathit{fresh}() \).

\[ \frac{(x, \sigma) \Downarrow \sigma' \qquad (n_{\sigma'}, f, n_f) \in \mathit{fr}_{\sigma'}}{(x.f, \sigma) \Downarrow (\mathit{ns}_{\sigma'}, n_f, d_{\sigma'} \cup \mathit{dm}_{\sigma'}(n_f), \mathit{fr}_{\sigma'}, \mathit{dm}_{\sigma'}, \mathit{tm}_{\sigma'})} \quad \text{(FIELD ACCESS)} \]

\[ \frac{(x, \sigma) \Downarrow \sigma'}{(x.\mathit{prot}, \sigma) \Downarrow \sigma'} \quad \text{(PROT ACCESS)} \]

\[ \frac{(x, \sigma) \Downarrow \sigma' \qquad \nexists n_f.\,(n_{\sigma'}, f, n_f) \in \mathit{fr}_{\sigma'}}{(x.f, \sigma) \Downarrow (\mathit{ns}_{\sigma'} \cup \{n_{new}\}, n_{new}, d_{\sigma'}, \mathit{fr}_{\sigma'} \cup \{(n_{\sigma'}, f, n_{new})\}, \mathit{dm}_{\sigma'}, \mathit{tm}_{\sigma'} \cup \{(n_{new}, n_{\sigma'})\})} \quad \text{(UNINIT FLD)} \]

where \( n_{new} = \mathit{fresh}() \).

\[ \frac{(e_1, \sigma) \Downarrow \sigma_1 \qquad (e_2, \sigma) \Downarrow \sigma_2}{(e_1\ \mathit{op}\ e_2, \sigma) \Downarrow (\mathit{ns}_{\sigma_1} \cup \mathit{ns}_{\sigma_2}, \bot, d_{\sigma_1} \cup d_{\sigma_2}, \mathit{fr}_{\sigma_1} \cup \mathit{fr}_{\sigma_2}, \mathit{dm}_{\sigma_1} \cup \mathit{dm}_{\sigma_2}, \mathit{tm}_{\sigma_1} \cup \mathit{tm}_{\sigma_2})} \quad \text{(BINARY OP)} \]

\[ \frac{(e_1, \sigma) \Downarrow \sigma_1 \quad \ldots \quad (e_n, \sigma) \Downarrow \sigma_n}{(\{f_1 : e_1, \ldots, f_n : e_n\}, \sigma) \Downarrow \sigma'} \quad \text{(OBJ.LIT.)} \]

where

\[ \begin{aligned}
n_{\sigma'} &= \mathit{fresh}() = n_{new} \\
d_{\sigma'} &= \bigcup_{i=1}^{n} d_{\sigma_i} \\
\mathit{ns}_{\sigma'} &= \mathit{ns}_{\sigma} \cup \{n_{new}\} \cup \Big(\bigcup_{i=1}^{n} \mathit{ns}_{\sigma_i}\Big) \\
\mathit{fr}_{\sigma'} &= \mathit{fr}_{\sigma} \cup \Big(\bigcup_{i=1}^{n} (n_{new}, f_i, n_{\sigma_i})\Big) \\
\mathit{dm}_{\sigma'} &= \mathit{dm}_{\sigma} \cup \Big(\bigcup_{i=1}^{n} \mathit{dm}_{\sigma_i}\Big) \\
\mathit{tm}_{\sigma'} &= \mathit{tm}_{\sigma} \cup \Big(\bigcup_{i=1}^{n} (n_{\sigma_i}, n_{new})\Big)
\end{aligned} \]

\[ \frac{(S, \sigma'') \Downarrow \sigma'}{(\mathtt{function}\ (p_1, \ldots, p_n)\{S\}, \sigma) \Downarrow \sigma'} \quad \text{(FUN-DEF)} \]

where

\[ \begin{aligned}
n_{new}^{RET} &= \mathit{fresh}(), \qquad \forall i \in \{1, \ldots, n\}.\ n_{new}^{p_i} = \mathit{fresh}() \\
n_{\sigma''} &= \mathit{fresh}() = n'_{new}, \qquad d_{\sigma''} = \emptyset \\
\mathit{ns}_{\sigma''} &= \mathit{ns}_{\sigma} \cup \{n'_{new}\} \cup \Big(\bigcup_{i=1}^{n} n_{new}^{p_i}\Big) \\
\mathit{fr}_{\sigma''} &= \mathit{fr}_{\sigma} \cup \{(n'_{new}, \mathrm{RET}, n_{new}^{RET})\} \cup \Big(\bigcup_{i=1}^{n} \{(n'_{new}, i, n_{new}^{p_i})\}\Big) \\
\mathit{dm}_{\sigma''} &= \mathit{dm}_{\sigma} \cup \Big(\bigcup_{i=1}^{n} \{(\mathrm{RET}, i), (n_{new}^{p_i}, i)\}\Big) \\
\mathit{tm}_{\sigma''} &= \mathit{tm}_{\sigma} \cup \{(n'_{new}, n_{\sigma})\} \cup \{(n_{new}^{RET}, n'_{new})\} \cup \Big(\bigcup_{i=1}^{n} (n_{new}^{p_i}, n'_{new})\Big)
\end{aligned} \]

\[ \frac{(f, \sigma) \Downarrow \sigma'' \qquad (n_{\sigma''}, \mathrm{RET}, n') \in \mathit{fr}_{\sigma} \qquad (e_1, \sigma) \Downarrow \sigma_1 \quad \ldots \quad (e_n, \sigma) \Downarrow \sigma_n}{(f(e_1, \ldots, e_n), \sigma) \Downarrow (\mathit{ns}_{\sigma}, \bot, d', \mathit{fr}_{\sigma}, \mathit{dm}_{\sigma}, \mathit{tm}_{\sigma})} \quad \text{(FUN-CALL1)} \]

where \( d' = \bigcup_{i=1}^{n} \big(\exists (n', i) \in \mathit{dm}_{\sigma}.\ d_{\sigma_i}\big) \).

\[ \frac{(f, \sigma) \Downarrow \sigma'' \qquad n_{\sigma''} = \bot \qquad (e_1, \sigma) \Downarrow \sigma_1 \quad \ldots \quad (e_n, \sigma) \Downarrow \sigma_n}{(f(e_1, \ldots, e_n), \sigma) \Downarrow (\mathit{ns}_{\sigma}, \bot, \bigcup_{i=1}^{n} d_{\sigma_i}, \mathit{fr}_{\sigma}, \mathit{dm}_{\sigma}, \mathit{tm}_{\sigma})} \quad \text{(FUN-CALL2)} \]

Figure 3: Semantics for all core expressions except new.

Evaluating statements: The statement semantics are given in Figure 4. A variable declaration (rule (VAR.DECL.)) creates a new node in the current scope. If the heap node for that variable already exists, it is replaced by this new node. The assignment statement (rules (ASSIGN1) and (ASSIGN2)) evaluates the left hand side and the right hand side expressions, and replaces the node on the left hand side with the node on the right hand side. Note that conditionals in if-then-else and while statements are, of course, not evaluated as our heaps are symbolic. The while statement is interesting: we evaluate the while body till we reach a fixed point (or till we reach a fixed number of loop un-rollings) as depicted in (WHILE2). However, notice that the abstract heap is also allowed to immediately go across a while-loop (WHILE1). The semantics for the rest of the statements is standard.

\[ \frac{\cdot}{(\mathtt{skip}, \sigma) \Downarrow \sigma} \quad \text{(SKIP)} \qquad\qquad \frac{(C_1, \sigma) \Downarrow \sigma' \qquad (C_2, \sigma') \Downarrow \sigma''}{(C_1; C_2, \sigma) \Downarrow \sigma''} \quad \text{(SEQ)} \]

\[ \frac{(n_{\sigma}, x, n_x) \in \mathit{fr}_{\sigma}}{(\mathtt{var}\ x, \sigma) \Downarrow (\mathit{ns}, n_{\sigma}, d_{\sigma}, \mathit{fr}, \mathit{dm}_{\sigma}, \mathit{tm})} \quad \text{(VAR.DECL.)} \]

where

\[ \begin{aligned}
\mathit{ns} &= (\mathit{ns}_{\sigma} \cup \{n_{new}\}) \setminus \{n_x\} \\
\mathit{fr} &= (\mathit{fr}_{\sigma} \setminus \{(n_{\sigma}, x, n_x)\}) \cup \{(n_{\sigma}, x, n_{new})\} \\
\mathit{tm} &= \mathit{tm}_{\sigma} \cup \{(n_{new}, n_{\sigma})\}
\end{aligned} \]

\[ \frac{(e, \sigma) \Downarrow \sigma'' \qquad (x, \sigma) \Downarrow \sigma_x}{(x := e, \sigma) \Downarrow (\mathit{ns}, n_{\sigma}, d_{\sigma}, \mathit{fr}, \mathit{dm}, \mathit{tm})} \quad \text{(ASSIGN1)} \]

where

\[ \begin{aligned}
\mathit{ns} &= \mathit{ns}_{\sigma_x} \cup \mathit{ns}_{\sigma''} \\
\mathit{fr} &= (\mathit{fr}_{\sigma''} \setminus \{(n_{\sigma}, x, n_{\sigma_x})\}) \cup \{(n_{\sigma}, x, n_{\sigma''})\} \\
\mathit{dm} &= \mathit{dm}_{\sigma''} \\
\mathit{tm} &= \mathit{tm}_{\sigma} \cup \{(n_{\sigma''}, \mathit{tm}_{\sigma_x}(n_{\sigma_x}))\}
\end{aligned} \]

\[ \frac{(e, \sigma) \Downarrow \sigma'' \qquad (x, \sigma) \Downarrow \sigma_x \qquad (x.f, \sigma) \Downarrow \sigma_f}{(x.f := e, \sigma) \Downarrow \sigma'} \quad \text{(ASSIGN2)} \]

where

\[ \begin{aligned}
n_{\sigma'} &= n_{\sigma}, \qquad d_{\sigma'} = d_{\sigma} \\
\mathit{fr}_{\sigma'} &= (\mathit{fr}_{\sigma''} \setminus \{(n_{\sigma_x}, f, n_{\sigma_f})\}) \cup \{(n_{\sigma_x}, f, n_{\sigma''})\} \\
\mathit{dm}_{\sigma'} &= \mathit{dm}_{\sigma''} \cup \{(n_{\sigma''}, y) \mid y \in d_{\sigma_x}\} \cup \{(n_{\sigma_x}, y) \mid y \in d_{\sigma''}\} \\
\mathit{tm}_{\sigma'} &= \mathit{tm}_{\sigma} \cup \{(n_{\sigma''}, n_{\sigma_x})\}
\end{aligned} \]

\[ \frac{(S_1, \sigma) \Downarrow \sigma_1 \qquad (S_2, \sigma) \Downarrow \sigma_2}{(\mathtt{if}\ e\ \mathtt{then}\ S_1\ \mathtt{else}\ S_2, \sigma) \Downarrow (\mathit{ns}_{\sigma_1} \cup \mathit{ns}_{\sigma_2}, n_{\sigma}, d_{\sigma}, \mathit{fr}_{\sigma_1} \cup \mathit{fr}_{\sigma_2}, \mathit{dm}_{\sigma_1} \cup \mathit{dm}_{\sigma_2}, \mathit{tm}_{\sigma_1} \cup \mathit{tm}_{\sigma_2})} \quad \text{(COND)} \]

\[ \frac{(S_1, \sigma) \Downarrow \sigma'}{(\mathtt{while}\ e\ \mathtt{do}\ S_1\ \mathtt{od}, \sigma) \Downarrow \sigma'} \quad \text{(WHILE1)} \]

\[ \frac{(S_1, \sigma) \Downarrow \sigma' \qquad (\mathtt{while}\ e\ \mathtt{do}\ S_1\ \mathtt{od}, \sigma') \Downarrow \sigma''}{(\mathtt{while}\ e\ \mathtt{do}\ S_1\ \mathtt{od}, \sigma) \Downarrow \sigma''} \quad \text{(WHILE2)} \]

\[ \frac{(e, \sigma) \Downarrow \sigma'}{(\mathtt{return}\ e, \sigma) \Downarrow (\mathit{ns}_{\sigma'}, n_{\sigma}, d_{\sigma}, \mathit{fr}_{\sigma'} \cup \{(n_{\sigma}, \mathrm{RET}, n_{\sigma'})\}, \mathit{dm}_{\sigma'}, \mathit{tm}_{\sigma'})} \quad \text{(RET)} \]

Figure 4: Statement semantics.

Given the above rules for abstract heaps, we start analyzing the JavaScript program using an initial state consisting of a global heap, represented by node n_G. This global heap consists of summaries for a few built-in objects like Array. We evaluate the rules either till we converge on a least fixed-point, or till we reach a preset bound on the number of iterations.

4.2 Handling other features of JavaScript

Dynamic code: The eval method in JavaScript allows execution of dynamically formed code, and is widely used in browser extensions. While an accurate analysis [...] statically known but subject to eval are essentially ignored, and this causes our tool to be unsound (see a later note on unsoundness). In most correct extensions, an eval-ed statement is dynamically chosen from a set of constant-strings or taken from trusted sources. Note that if there is a flow from an untrusted source to an eval, VEX will catch this flow, as it is a vulnerable flow pattern.

innerHTML: Modifications of the innerHTML of an HTML page by the extension make the analysis considerably more complex. For instance, if a function a() calls function b() that calls function c(), and c() makes innerHTML modifications, it is hard to summarize this effect in the summary of c(), as the source of the flow is not locally available. We handle this by creating a symbolic representation of the source, computing summaries of innerHTML using this symbolic source, and allowing outside methods to instantiate the symbolic source to a concrete source in whichever context it becomes available.
of the structure of dynamically created code is a research topic in itself, and quite out of the scope of this paper, Object properties accessed in the form of associative we cannot simply ignore eval statements. Our approach arrays: In JavaScript, objects are treated as associative has been to implement a static constant-string analysis arrays. This means that any property of the object can be for strings and subject the strings that are eval-ed to this accessed using the array notation. Array indices could analysis. Our static analysis engine inserts these constant be constant strings, which are then evaluated to get the strings into the code (as though it was static code), parses actual property being accessed; or they could be num- it, and computes the flows for them. Strings that are not bers, which indicate the property number that is being

accessed; or they could be variables, which could be instantiated at run time. VEX treats these cases in a conservative manner. Whenever a property is created in the node scope, its dependencies are added to the dependencies of the node as shown in the (ASSIGN2) rule in Figure 4. If we cannot evaluate the array index for any reason, it would be sufficient to retrieve the dependencies of the object.

Functions that take arbitrary number of arguments: Some functions in JavaScript can have variable numbers of arguments. For example, the push method of the array can be called with any number of arguments and the arguments will be appended to the end of the array. To handle this, the summary of the push method has a special field indicating that it can take a variable number of arguments, and when the method is called, we conservatively append the dependencies of all the arguments to the dependency set of the node representing the array object.

Browser's DOM API and XPCOM components: These objects are treated as uninitialized variables, fields and functions. The rules (UNINITIALIZED VAR), (UNINIT FLD) and (FUN-CALL2) can be applied to their accesses. When we need to keep track of the usage of certain components we introduce the component API function arguments into the dependency set. For example the RDF datasource is accessed using the following command:

    rdf = Components.classes["@mozilla.org/rdf/rdf-service;1"]
              .getService(Components.interfaces.nsIRDFService);

Our analysis introduces the string "@mozilla.org/rdf/rdf-service;1" and the variable nsIRDFService into the dependency set of the left hand side variable rdf.

4.3 Unsoundness and incompleteness

A static analysis tool like VEX is inherently conservative. First, if VEX reports a flow, there may be no such feasible flow in the program (i.e. VEX can have false positives). Though VEX over-approximates flows and tries to perform a sound analysis, there are several aspects of the analysis which, if implemented soundly, will make the tool throw too many infeasible flows, making it useless in practice.

Consider a program where there is an eval of a string that is dynamically created and not determinable statically. Since this string can be assigned any value, it could be any arbitrary program that can create flows between any of the variables in scope. A sound tool must necessarily summarize the eval as causing flows from all variables to all nodes, which would generate plenty of false positives and would essentially be useless. False negatives (i.e. failing to detect programs that have a flow) are also possible because we have several uninitialized and unsummarized objects.

VEX has several sources of unsoundness and incompleteness: handling of eval, handling of prototypes, handling of higher-order functions, fixed number of unrolls of loops, handling of with-scoping, handling of exceptions, etc.

5 Evaluation

5.1 VEX implementation

The VEX tool checks for two kinds of flows: one from injectable sources to executable sinks to check for script-injection vulnerabilities, and the other, also modeled as flows, that checks for unsafe programming practices.

VEX is implemented in Java (∼2000 LOC), and utilizes a JavaScript parser built using the ANTLR parser generator for the JavaScript 1.5 grammar provided by ANTLR [1]. ANTLR outputs Java-based Abstract Syntax Trees (AST) for JavaScript files, and VEX walks through the ASTs computing the flow sets from all interesting sources to all interesting sinks, in a single pass analysis, using the static analysis described in Section 4. For each sink object, VEX collects all the source objects that flow into it and checks for the occurrence of flow patterns. VEX reports these flows to the user along with the source and sink locations in the code.

Flow patterns checked: The current version of VEX checks for the following three flow patterns that capture flows from injectable sources to executable sinks:

- Content Doc to Eval: The source location is any point where the program accesses the API window.content.document, and the source object is the object that is returned from this call. The sink locations are eval statements and the sink objects are the objects being eval-ed.

- Content Doc to innerHTML: The source location and source objects for these flows are the same as above; the sink locations are the places where the extension writes directly into the DOM using innerHTML commands, and the sink objects are the objects being assigned by the innerHTML command. These DOM elements may be executable if they are in the chrome context.

- RDF to innerHTML: The source location and source objects are given by any retrieval of RDF objects (which are often injectable) and the sink locations
and sink objects are innerHTML commands as above.

Furthermore, VEX searches for the following patterns that characterize two documented unsafe programming practices that could lead to security vulnerabilities:

- evalInSandbox object to == or !=: This flow is meant to detect an unsafe programming practice where an object retrieved by an eval in a sandbox is subject to an == or != test (the recommended practice is that such objects must be tested with ===). The source location is hence any evalInSandbox statement and the corresponding source objects are the objects returned by the evalInSandbox call. The sink locations are usages of == and !=, and the sink objects are the objects that are subject to these comparisons.

- Method Call on wrappedJSObject: Objects obtained using wrappedJSObject() commands are usually untrusted, and methods of such objects should not be called. The source locations are hence uses of wrappedJSObject() and source objects are the objects returned by them. Sink locations are method calls and the sink objects are the objects whose methods are called.

The VEX tool can, of course, be adapted to other kinds of suspect flows: source and sink locations are straightforward, and the source and sink objects must be specified carefully as above.

5.2 Evaluation methodology

The extensions we analyzed were chosen as follows. First, in October 2008, we built a suite of extensions using a random sample of 1827 extensions from the Mozilla add-ons web site, by downloading the first extensions in alphabetical order for all subject categories. In November 2009, we downloaded 699 of the most popular extensions. The two sets had 74 extensions in common, for a total of 2452 extensions. Our suite includes multiple versions of some extensions, allowing cross-version comparisons. For instance, we found a vulnerability in a new version of BEATNIK (see Section 5.4), though its authors thought the vulnerabilities in the previous version were fixed.

We extracted the JavaScript files from these extensions and ran VEX on them, using a 2.4GHz 64 bit x86 processor with a maximum heap size of 4GB for the JVM.

5.3 Experimental results

Finding flows from injectable sources to executable sinks: Figure 5 summarizes the experimental results for flows that are from injectable sources to executable sinks (the first three flows outlined above). The first column (grep alerts) is the number of extensions that syntactically have code that could indicate such a flow, identified using a grep-search. For the flow "Content-doc to Eval", the grep was for the string 'eval('; for "Content-doc to InnerHTML" flows, the grep was for the string 'innerHTML'; and for "RDF to InnerHTML" flows, the search was for both the strings "innerHTML" and "@mozilla.org/rdf/rdf-service;1". As the table shows, this search finds hundreds of suspect extensions, far more than can be examined manually.

The third column (VEX alerts) indicates the number of extensions on which VEX reports an alert with corresponding flows. On average, VEX took only 15.5 seconds per extension.

To look for potential attacks, we manually analyzed most of the extensions with suspect flows that VEX alerted us on, spending about two hours per extension on average.

The next column reports the number of extensions on which we could engineer an attack based on the flows reported by VEX. We were able to attack six extensions, of which only three extensions were already known to be vulnerable. The attacks on Wikipedia Toolbar, Fizzle version 0.5.1 and Fizzle version 0.5.2 extensions are new; see Section 5.4 for more details.

The next column shows the extensions where the source is code from a web site, and where an attack is possible provided the web site can be attacked. In other words, these extensions rely on a trusted web site assumption (e.g., that the code on the Facebook website is safe). We think that these are valid warnings that users of an extension (and Mozilla) should be aware of; trusted web sites can after all be compromised, and the code on these sites can be changed, leading to an attack on all users of such an extension.

Not all flows lead to attacks; the next set of columns describes the alerts that we were unable to convert to concrete attacks. Some flows were not exploitable as the input is sanitized correctly (either by the extension or the browser), preventing JavaScript injection, while others were not exploitable as the sinks do not turn out to be chrome executable contexts. These extensions are noted in the next two columns. Finally, VEX, being a conservative flow-analysis tool, does report alerts about flows that do not actually exist; there were very few of these, and they are noted under the column "Non-existent flows". A discussion on flows that do not lead to attacks is given in Section 5.5.

As noted in the last column, there were 13 extensions with VEX alerts that were too complex (or obscurely written) for us to manually analyze for an attack; we do not know whether attacks on these are possible or not.
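
The pipeline behind these alerts (propagate dependencies through statements, then report every source that reaches a sink) can be illustrated with a deliberately tiny sketch. It is not VEX's implementation: the mini statement format, the analyze function, and the label strings "content-doc" and "eval" are hypothetical names chosen for this illustration. The only idea taken from the paper is that both branches of a conditional are analyzed and merged by union without evaluating the test, in the spirit of the (COND) rule of Figure 4; loops, heaps, and function summaries are left out.

```javascript
// Illustrative sketch only; not VEX's actual implementation.
// Hypothetical mini statement format:
//   { op: "source", target, kind }   target is tainted by a source such as content-doc
//   { op: "assign", target, from }   target depends on every variable listed in from
//   { op: "if", then, else }         both branches are analyzed; the test is never evaluated
//   { op: "sink", kind, arg }        arg reaches a sink such as eval

function union(a, b) {
  return new Set([...a, ...b]);
}

function analyze(stmts, deps = new Map(), alerts = []) {
  for (const s of stmts) {
    if (s.op === "source") {
      deps.set(s.target, new Set([s.kind]));
    } else if (s.op === "assign") {
      let d = new Set();
      for (const v of s.from) d = union(d, deps.get(v) || new Set());
      deps.set(s.target, d);
    } else if (s.op === "if") {
      // Analyze each branch on its own copy of the state, then merge by union.
      const branches = [s.then, s.else].map(
        (body) => analyze(body, new Map(deps), alerts).deps
      );
      deps = new Map();
      for (const branch of branches) {
        for (const [name, d] of branch) {
          deps.set(name, union(deps.get(name) || new Set(), d));
        }
      }
    } else if (s.op === "sink") {
      // Report every source label that flows into the sink argument.
      for (const src of deps.get(s.arg) || []) alerts.push(`${src} -> ${s.kind}`);
    }
  }
  return { deps, alerts };
}

// A program in which page content reaches eval on only one branch.
const program = [
  { op: "source", target: "script", kind: "content-doc" },
  {
    op: "if",
    then: [{ op: "assign", target: "code", from: ["script"] }],
    else: [{ op: "assign", target: "code", from: [] }],
  },
  { op: "sink", kind: "eval", arg: "code" },
];

const result = analyze(program);
console.log(result.alerts); // [ 'content-doc -> eval' ]
```

Because the merge is a union, the flow is reported even though it exists on only one branch; this is the kind of over-approximation that produces the "Non-existent flows" column when the tainted branch can never execute.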

                                                      Source is  |---------- Not attackable ----------|
                          grep    VEX     Attackable  trusted    Sanitized  Non-chrome  Non-existent
Flow pattern              alerts  alerts  extensions  web site   input      sinks       flows         Unanalyzed
Content Doc to eval       430     13      2*          1          0          3           5             2
Content Doc to innerHTML  534     46      0           14         6          6           9             11
RDF to innerHTML          60      4       4**         0          0          0           0             0

Attackable extensions: * WIKIPEDIA TOOLBAR V-0.5.7, WIKIPEDIA TOOLBAR V-0.5.9; ** FIZZLE V-0.5, FIZZLE V-0.5.1, FIZZLE V-0.5.2 & BEATNIK V-1.2

Figure 5: Flows from injectable sources to executable sinks.
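
As a quick consistency check on Figure 5, the sketch below transcribes the three rows and confirms that the outcome columns account for every VEX alert, and that the totals match the six attackable and 13 unanalyzed extensions mentioned in the text. The field names are labels invented for this illustration; only the numbers come from the figure.

```javascript
// Rows transcribed from Figure 5; the field names are hypothetical labels.
const rows = [
  { pattern: "Content Doc to eval", grep: 430, vex: 13,
    attackable: 2, trustedSite: 1, sanitized: 0, nonChrome: 3, nonExistent: 5, unanalyzed: 2 },
  { pattern: "Content Doc to innerHTML", grep: 534, vex: 46,
    attackable: 0, trustedSite: 14, sanitized: 6, nonChrome: 6, nonExistent: 9, unanalyzed: 11 },
  { pattern: "RDF to innerHTML", grep: 60, vex: 4,
    attackable: 4, trustedSite: 0, sanitized: 0, nonChrome: 0, nonExistent: 0, unanalyzed: 0 },
];

// Sum of all outcome columns for one row.
function classified(r) {
  return r.attackable + r.trustedSite + r.sanitized +
         r.nonChrome + r.nonExistent + r.unanalyzed;
}

for (const r of rows) {
  const share = ((100 * r.vex) / r.grep).toFixed(1);
  console.log(`${r.pattern}: ${classified(r)} of ${r.vex} alerts classified; ` +
              `VEX alerts on ${share}% of grep hits`);
}

const totalAttackable = rows.reduce((sum, r) => sum + r.attackable, 0);
const totalUnanalyzed = rows.reduce((sum, r) => sum + r.unanalyzed, 0);
console.log(totalAttackable, totalUnanalyzed); // 6 13
```

In every row the outcome columns sum exactly to the VEX alert count, so each alerting extension falls into exactly one outcome category.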

Unsafe programming practice         grep alerts  VEX alerts
evalInSandbox object to == or !=    107          3
Method call on wrappedJSObject      269          144

Figure 6: Results for unsafe programming practices.

Figure 7: Attack script to display directories.

Finding unsafe programming practices: all of the flows we inspected were feasible and real, but we were unable to manually confirm the remainder because there were too many alerts to examine.
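
One commonly cited reason why == and != are discouraged on objects that come back from a sandbox is that loose comparison coerces its operands, which can invoke a method supplied by the untrusted object. The sketch below shows this language behavior in plain JavaScript, without evalInSandbox; the object named untrusted and the counter sideEffects are stand-ins invented for this illustration.

```javascript
// Stand-in for an object returned by untrusted code (hypothetical example).
let sideEffects = 0;
const untrusted = {
  valueOf() {
    sideEffects += 1; // attacker-controlled code would run here
    return 42;
  },
};

// Loose equality converts the object to a primitive, which calls valueOf().
const loose = (untrusted == 42);

// Strict equality compares without conversion, so valueOf() is not called.
const strict = (untrusted === 42);

console.log(loose, strict, sideEffects); // true false 1
```

The single increment of sideEffects comes from the == comparison alone, which is the kind of implicit call that the evalInSandbox flow pattern is meant to flag.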

5.4 Successful attacks

Attack scripts: All our attack scenarios involve a user who has installed a vulnerable extension and who visits a malicious page, and either automatically or through invoking the extension, triggers script written on the malicious page to execute in the chrome context. Figure 7 illustrates an attack payload that can be used in such attacks: this script displays the folders and files in the root directory. The attack payloads could be much more dangerous, where the attacker could gain complete control of the affected computer using XPCOM API functions.

attack script in its <script> tag, and clicks on one of the Wikipedia toolbar buttons (unwatch, purge, etc.), the script executes in the chrome context. The attack works because the extension has the code given in Figure 8 in its toolbar.js file.

    script = window.content.document.getElementsByTagName("script")[0].innerHTML;
    eval(script);

Figure 8: Wikipedia toolbar code.

The first line gets the first