
Updated in September 2017: Require valid versions for library detection throughout the paper. The vulnerability analysis already did so and remains identical. Modifications in Tables I, III and IV; Figures 4 and 7; Sections III-B, IV-B, IV-, IV-F and IV-H. Additionally, highlight Ember’s security practices in Section V.

Thou Shalt Not Depend on Me: Analysing the Use of Outdated JavaScript Libraries on the Web

Tobias Lauinger, Abdelberi Chaabane, Sajjad Arshad, William Robertson, Christo Wilson and Engin Kirda
Northeastern University
{toby, 3abdou, arshad, wkr, cbw, ek}@ccs.neu.edu

Abstract—Web developers routinely rely on third-party JavaScript libraries such as jQuery to enhance the functionality of their sites. However, if not properly maintained, such dependencies can create attack vectors allowing a site to be compromised. In this paper, we conduct the first comprehensive study of client-side JavaScript library usage and the resulting security implications across the Web. Using data from over 133 k websites, we show that 37 % of them include at least one library with a known vulnerability; the time lag behind the newest release of a library is measured in the order of years. In order to better understand why websites use so many vulnerable or outdated libraries, we track causal inclusion relationships and quantify different scenarios. We observe sites including libraries in ad hoc and often transitive ways, which can lead to different versions of the same library being loaded into the same document at the same time. Furthermore, we find that libraries included transitively, or via ad and tracking code, are more likely to be vulnerable. This demonstrates that not only administrators, but also the dynamic architecture and developers of third-party services are to blame for the Web’s poor state of library management. The results of our work underline the need for more thorough approaches to dependency management, code maintenance and third-party code inclusion on the Web.

Permission to freely reproduce all or part of this paper for noncommercial purposes is granted provided that copies bear this notice and the full citation on the first page. Reproduction for commercial purposes is strictly prohibited without the prior written consent of the Internet Society, the first-named author (for reproduction of an entire paper only), and the author’s employer if the paper was prepared within the scope of employment.
NDSS ’17, 26 February - 1 March 2017, San Diego, CA, USA
Copyright 2017 Internet Society, ISBN 1-1891562-46-0
https://doi.org/10.14722/ndss.2017.23414

I. INTRODUCTION

The Web is arguably the most popular contemporary programming platform. Although websites are relatively easy to create, they are often composed of heterogeneous components such as database backends, content generation engines, multiple scripting languages and client-side code, and they need to deal with unsanitised inputs encoded in several different formats. Hence, it is no surprise that it is challenging to secure websites because of the large attack surface they expose.

One specific, significant attack surface comprises vulnerabilities related to client-side JavaScript, such as cross-site scripting (XSS) and advanced phishing. Crucially, modern websites often include popular third-party JavaScript libraries, and thus are at risk of inheriting vulnerabilities contained in these libraries. For example, a 2013 XSS vulnerability in the jQuery [13] library before version 1.6.3 allowed remote attackers to inject arbitrary scripts or HTML into vulnerable pages via a crafted tag. As a result, it is of the utmost importance for websites to manage library dependencies and, in particular, to update vulnerable libraries in a timely fashion.

To date, security research has addressed a wide range of client-side security issues in websites, including validation [30] and XSS ([17], [36]), cross-site request forgery [4], and session fixation [34]. However, the use of vulnerable JavaScript libraries by websites has not received nearly as much attention. In 2014, a series of blog posts presented cursory measurements highlighting that major websites included known vulnerable libraries ([25], [26], [24]). These findings echo warnings from other software ecosystems like Android [3], Java [32] and Windows [21], which show that vulnerable libraries continue to exist in the wild even when they are widely known to contain severe vulnerabilities. Given that JavaScript dependency management is relatively primitive and corresponding tools are not as well-established as in more mature ecosystems, these findings suggest that security issues caused by outdated JavaScript libraries on the Web may be widespread.

In this paper, we conduct the first comprehensive study on the security implications of JavaScript library usage in websites. We seek to answer the following questions:

• Where do websites load JavaScript libraries from (i.e., first or third-party domains), and how frequently are these domains used?
• How current are the libraries that websites are using, and do they contain known vulnerabilities?
• Are web developers intentionally including JavaScript libraries, or are these dependencies caused by advertising and tracking code?
• Are existing remediation strategies effective or widely used?
• Are there additional technical, methodological, or organisational changes that can improve the security of websites with respect to JavaScript library usage?

Note that the focus of this paper is not measuring the security state of specific JavaScript libraries. Rather, our goal (and primary contribution) is to empirically examine whether website operators keep their libraries current and react to publicly disclosed vulnerabilities.

Answering these questions necessitated solving three fundamental methodological challenges. First, there is no centralised repository of metadata pertaining to JavaScript libraries and their versions, release dates, and known vulnerabilities. To address this, we manually constructed a catalogue containing all “release” versions of 72 of the most popular open-source libraries, including detailed vulnerability information on a subset of 11 libraries. Second, web developers often modify JavaScript libraries by reformatting, restructuring or appending code, which makes it difficult to detect library usage in the wild. We solve this problem through a combination of static and dynamic analysis techniques. Third, to understand why specific libraries are loaded by a given site, we need to track all of the causal relationships between page elements (e.g., script s1 in frame f1 injects script s2 into frame f2). To solve this, we developed a customised version of […] that records detailed causality trees of page element creation relationships.

Using these tools, we crawled the Alexa Top 75 k websites and a random sample of 75 k websites drawn from a snapshot of the .com zone in May 2016. These two crawls allow us to compare and contrast JavaScript library usage between popular and unpopular websites. In total, we observed 11,141,726 inline scripts and script file inclusions; 86.6 % of Alexa sites and 65.4 % of .com sites used at least one well-known JavaScript library, with jQuery being the most popular by a large majority.

Analysis of our dataset reveals many concerning facts about JavaScript library management on today’s Web. More than a third of the websites in our Alexa crawl include at least one vulnerable library version, and nearly 10 % include two or more different vulnerable versions. From a per-library perspective, at least 36.7 % of jQuery, 40.1 % of […], 86.6 % of Handlebars, and 87.3 % of YUI inclusions use a vulnerable version. Alarmingly, many sites continue to rely on libraries like YUI and SWFObject that are no longer maintained. In fact, the median website in our dataset is using a library version 1,177 days older than the newest release, which explains why so many vulnerable libraries tend to linger on the Web. Advertising, tracking and social widget code can cause transitive library inclusions with a higher rate of vulnerability, suggesting that these problems extend beyond individual website administrators to providers of Web infrastructure and services.

We also observe many websites exhibiting surprising behaviours with respect to JavaScript library inclusion. For example, 4.2 % of websites using jQuery in the Alexa crawl include the same library version multiple times in the same document, and 10.9 % include multiple different versions of jQuery into the same document. To our knowledge, ours is the first study to make these observations, since existing tools ([27], [20]) are unable to detect these anomalies. These strange behaviours may have a negative impact on security as asynchronous loading leads to nondeterministic behaviour, and it remains unclear which version will ultimately be used.

Perhaps our most sobering finding is practical evidence that the JavaScript library ecosystem is complex, unorganised, and quite “ad hoc” with respect to security. As of this writing, there are no reliable vulnerability databases, no security mailing lists for the most popular libraries, few or no details on security issues in release notes, and often, it is difficult to identify which versions of a library are affected by a reported vulnerability.

Overall, our study makes the following contributions:

• We conduct the first comprehensive study showing that a significant number of websites include vulnerable or outdated JavaScript libraries.
• We present results on the origins of vulnerable JavaScript library inclusions, which allows us to contrast the security posture of website developers with third-party modules such as WordPress, advertising or tracking networks, and social media widgets.
• We show that a large number of websites include JavaScript libraries in unexpected ways, such as multiple inclusions of different library versions into the same document, which may impact their attack surface.
• We find existing remediation strategies to be ineffective at mitigating the threats posed by vulnerable JavaScript libraries. For example, less than 3 % of websites could fix all their vulnerable libraries by applying only patch-level updates. Similarly, only 2 % of websites use the version-aliasing services offered by JavaScript CDNs.

II. BACKGROUND

JavaScript has allowed web developers to build highly interactive websites with sophisticated functionality. For example, communication and production-related online services such as […] and Office 365 make heavy use of JavaScript to create web-based applications comparable to their more traditional desktop counterparts. In this paper, we focus exclusively on aspects of client-side JavaScript executed in a browser, not the recent trend of using JavaScript for server-side programming.

A. JavaScript Libraries

In many cases, to make their lives easier, web developers rely on functionality that is bundled in libraries. For example, jQuery [13] is a popular JavaScript library that makes HTML document traversal and manipulation, event handling, animation, and Ajax much simpler and compatible across browsers.

In the simplest case, a JavaScript library is a plain-text script containing code with reasonably well-defined functionality. The script has full access to the DOM that includes it; the concept of namespaces does not exist in JavaScript, and everything that is created is by default global. More elaborate libraries use hacks and conventions to protect the code against naming conflicts, and expose interfaces for retrieving meta-data such as the name and version of the library. Over the course of this study, we found that JavaScript libraries overwhelmingly use the Semantic Versioning [28] convention of major.minor.patch, such as 1.0.1, where the major version component is increased for breaking changes, the minor component for new functionality, and the patch component for backwards-compatible bug fixes.

To include a library into their website, developers typically use the HTML <script> tag and point to an externally-hosted version of the library or a copy on their own server. Library vendors often provide a minified version that has comments and whitespace removed and local variables shortened to reduce the size of the file. Developers can also concatenate multiple libraries into a single file, create custom builds of libraries, or use advanced minification features such as dead code removal. While custom minification builds are relatively common, more aggressive minification settings are rare in client-side JavaScript because they can break code [9].

CDNs. Many libraries are available on Content Distribution Networks (CDNs) for use by other websites. Google, Microsoft

and Yandex host libraries on their CDNs, some popular libraries (e.g., Bootstrap and jQuery) offer their own CDNs, and some community-based CDNs agree to host arbitrary open-source libraries. JavaScript CDNs enable caching of libraries across websites to increase performance. Another useful feature offered by some CDNs is version aliasing. That is, when including a library, the developer may specify a version prefix instead of the full version string, in which case the CDN returns the newest available version with that prefix. When implemented correctly, the patched version of a library will automatically be used on the website when it becomes available on the CDN. However, this works only for security issues fixed in a backwards-compatible manner, and it conflicts with client-side security mechanisms such as subresource integrity [37]. In addition, version aliasing makes client-side caching of resources less efficient because it must be configured for shorter time spans, that is, hours instead of years. As a result, version aliasing is often discouraged [11].

TABLE I. THE 30 MOST FREQUENT LIBRARIES IN OUR ALEXA CRAWL (OUT OF 72 SUPPORTED LIBRARIES). VERSIONS: s TOTAL IN REFERENCE CATALOGUE FOR STATIC DETECTION; d DYNAMIC DETECTIONS OBSERVED IN CRAWLS (SOME LIBRARIES/VERSIONS NOT SUPPORTED DYNAMICALLY).

Library              Versions     Bower   Wapp    Use on Crawled Sites
                     s     d      Rank    %       ALEXA      COM
jQuery               66    64     1       42 %    83.9 %     61.1 %
jQuery-UI            46    46     13      7 %     23.5 %     8.0 %
[…]                  24    28     18      10 %    21.4 %     8.6 %
Bootstrap            32    10     3       -       12.5 %     4.5 %
jQuery-Migrate       7     0      -       -       11.3 %     10.7 %
Underscore           61    34     12      3 %     5.8 %      2.4 %
SWFObject            2     1      -       -       3.7 %      2.4 %
Moment               54    33     6       -       3.5 %      1.4 %
RequireJS            62    40     -       -       3.4 %      2.3 %
jQuery-Form          14    0      -       -       2.7 %      3.4 %
Backbone             29    19     -       2 %     2.7 %      1.6 %
Angular              110   78     2       -       2.4 %      1.6 %
[…]                  77    57     26      -       2.4 %      2.5 %
jQuery-Tools         8     20     -       -       2.3 %      0.9 %
jQuery-Fancybox      10    0      -       -       2.3 %      1.4 %
GreenSock GSAP       45    45     -       -       2.2 %      2.8 %
Handlebars           25    15     -       -       2.0 %      0.4 %
Prototype            5     14     -       -       1.8 %      1.1 %
MooTools             27    24     -       3 %     1.5 %      1.4 %
WebFont Loader       100   0      -       -       1.5 %      0.9 %
jQuery-Cookie        8     0      21      -       1.4 %      0.2 %
Hammer.js            26    14     -       -       1.2 %      0.4 %
jQuery-Validation    13    0      -       -       1.1 %      0.6 %
Mustache             29    21     -       -       1.1 %      0.9 %
YUI 3                37    26     -       -       1.0 %      1.4 %
Velocity             55    15     -       -       0.9 %      0.2 %
Script.aculo.us      5     12     -       -       0.8 %      0.4 %
[…]                  21    9      -       -       0.8 %      0.1 %
Flexslider           11    0      -       -       0.6 %      0.4 %
React                41    23     28      -       0.5 %      1.6 %

Third Parties. Third-party modules such as advertising, trackers, social media or other widgets that are often embedded in webpages are typically implemented in JavaScript. Furthermore, these scripts can also load libraries, possibly without the knowledge of the site maintainer. If not isolated in a frame, these libraries gain full privileges in the including site’s context. Thus, even if a web developer keeps their own library dependencies updated, outdated versions may still be included by badly maintained third-party content. Also, some JavaScript libraries and many web frameworks contain their own copies of libraries they depend on. Hence, web developers may unknowingly rely on software maintainers to update JavaScript libraries.

B. Vulnerabilities in JavaScript Libraries

While JavaScript is the de-facto standard for developing client-side code on the Web, at the same time it is notorious for security vulnerabilities. A common, lingering problem is Cross-Site Scripting (XSS) [17], which allows an attacker to inject malicious code (or HTML) into a website. In particular, if a JavaScript library accepts input from the user and does not validate it, an XSS vulnerability might creep in, and all websites using this library could become vulnerable.

As an example, consider the popular jQuery library and its $() function, which is overloaded and has different behaviour depending on which type of argument is passed [15]: If a string containing a CSS selector is passed, the function searches the DOM tree for corresponding elements and returns references to them; if the input string contains HTML, the function creates the corresponding elements and returns the references. As a consequence, developers who pass improperly sanitised input to this function may inadvertently allow attackers to inject code even though the developers’ original intent was only to select an existing element. While this API design places convenience over security considerations and the implications could be better highlighted in the documentation, it does not automatically constitute a vulnerability in the library.

In older versions of jQuery, however, the $() function’s leniency in parsing string parameters could lead to complications by misleading developers to believe, for instance, that any string beginning with # would be interpreted as a selector and could be safe to pass to the function, as #test selects the element with identifier test. Yet, jQuery considered parameters containing HTML anywhere in the string as HTML [14], so that a parameter such as #[…] would lead to code execution rather than a selection. This behaviour was considered a vulnerability and fixed.

Other vulnerabilities in JavaScript libraries include cases where libraries do not sanitise inputs that are expected to be pure text, but are passed to eval() or document.write() internally, which could cause them to be executed as script or rendered as markup, respectively. Attackers can use these capabilities to steal data from a user’s browsing session, initiate transactions on the user’s behalf, or place fake content on a website. Therefore, JavaScript libraries must not introduce any attack vectors into the websites where they are used.

III. METHODOLOGY

Identifying client-side JavaScript libraries, finding out how they are loaded by a website, and determining whether they are outdated or vulnerable requires a combination of techniques and data sources. Challenges arise due to the lax JavaScript language, the fragmented library ecosystem, and the complex nature of modern websites. First, we need to collect metadata about popular JavaScript libraries, including a list of available versions, the corresponding release dates, code samples, and known vulnerabilities. Second, we must be able to determine if JavaScript code found in the wild is a known library. Third, we need to crawl websites while keeping track of causal resource inclusion relationships and match them with detected libraries.

A. Cataloguing JavaScript Libraries

In contrast to Maven’s Central Repository in the Java world, JavaScript does not have a similarly popular repository of library

versioning and project dependency metadata. We must therefore collect and correlate this data from various separate sources.

1) Selecting Libraries: The initial construction of our metadata archive involves a certain amount of manual verification work. Since there are thousands of JavaScript libraries (e.g., the community-based cdnjs.com hosts 2,379 projects as of August 2016), we focus our study on the most widely used libraries because they are the most consequential.

To select libraries, we leverage library popularity statistics provided by the JavaScript package manager Bower [6] and the web technology survey Wappalyzer [38]. We extend this list of popular libraries with all projects hosted on the public CDNs operated by Google, Microsoft and Yandex. As we will show in Section IV-C, many websites rely on these commercial CDNs to host JavaScript libraries. We collected the data from Bower, Wappalyzer, and the three CDNs in January 2016.

[Figure 1: one bar per library, labelled with its total number of versions: Angular (110), Backbone (29), Dojo (72), Ember (77), Handlebars (25), jQuery (66), jQuery-Migrate (7), jQuery-Mobile (16), jQuery-UI (46), Mustache (29), YUI 3 (37). x-axis: Fraction of total versions, 0.0 to 1.0. Legend: 0 to 5 distinct known vulnerabilities.]
Fig. 1. Fraction of library versions with i distinct known vulnerabilities each (represented by colours), out of the total library versions in parentheses. Angular 1.2.0 has 5 known vulnerabilities and there are 110 versions overall.

Due to various data availability requirements explained in detail in Section III-A5, we need to exclude certain libraries from our study. Overall, we support 72 libraries: 18 out of the Top 20 installed with Bower, 7 out of the Top 10 frameworks identified on websites by Wappalyzer, 13 of the 14 libraries hosted by Google, 12 of the 18 libraries hosted by Microsoft, and all 11 libraries hosted by Yandex. Table I shows a subset of 30 libraries in our catalogue as well as their rank on Bower and their market share according to Wappalyzer. Although our catalogue appears to cover a sparse set of the libraries on Bower, many of the missing ranks belong to submodules of popular libraries (e.g., rank 5 is Angular Mocks). According to Wappalyzer, we cover 73 % of the most popular libraries.

2) Extracting Versioning Information: Our next step is compiling a complete list of library versions along with their release dates. After unsuccessful experiments with file timestamps and available-since dates on the libraries’ official websites and CDNs, we determined that GitHub was the most reliable source for this kind of information. Nearly all of the open source libraries in our seed lists are hosted on GitHub and tag the source code of their releases, allowing us to extract timestamps and version identifiers from the tags. In naming their releases, they typically follow a major.minor.patch version numbering scheme, which makes it straightforward to identify tags pertaining to releases and ignore all other tags, including “alpha,” “beta” and “release candidate” versions that are not meant to be used in production. As shown in Table I, popular libraries like Angular and jQuery have up to 110 and 66 distinct versions in our catalogue, respectively. However, half of the libraries have fewer than 26 versions.

3) Obtaining Reference […]: Some methods of library detection require us to have access to code samples for each version of a library. We gather library code from two sources: the official website of each library, and CDNs. For the official websites, we manually download all available library versions. However, some official websites do not provide copies of old library versions, or they only provide copies of a subset of versions. In contrast, CDNs typically do host comprehensive collections of old library versions in order to not break websites that depend on older versions. We utilise the API of one such CDN, jsDelivr, to automatically discover all available versions of libraries on five supported CDNs. For the remaining CDNs, we construct download link templates manually, such as https://ajax.googleapis.com/ajax/libs/jquery/{version}/jquery.min.js. In doing so, we make sure that we download all available variants of a library file, including the full development variant and the minified production variant without whitespace or comments.

When comparing files downloaded from official websites and different CDNs, we noticed that even the same version and variant (e.g., minified) of a library may sometimes differ between sources. We observed additional whitespace, removal of comments, or the likely use of a different minifier or minifier setting, especially when the library’s developers do not provide a minified version. This observation highlights the importance of collecting ground-truth JavaScript library samples from as many official and semi-official sources as possible. Therefore, we use official websites as well as dedicated CDNs (Bootstrap CDN and jQuery CDN), commercial CDNs (Google, Microsoft, and Yandex), and open source CDNs (jsDelivr, cdnjs and OssCDN).

In total, we collect 81,027 JavaScript files. We analyse the sizes of the “main” files of each library in our dataset (that is, we exclude files such as plug-ins that cannot be used stand-alone), and find that Script.aculo.us 1.9.0 is the smallest at 996 bytes (minified). After accounting for duplicates and discarding files smaller than 996 bytes (to reduce the likelihood of false positives due to shared ancillary resources such as configuration files, localisations and plug-ins), our final catalogue includes 19,099 distinct files.

4) Identifying Vulnerabilities: The last step towards building our catalogue is aggregating vulnerability information for our 72 JavaScript libraries. Unfortunately, there is no centralised database of vulnerabilities in JavaScript libraries; instead, we manually compile vulnerability information from the Open Source Vulnerability Database (OSVDB), the National Vulnerability Database (NVD), public bug trackers, GitHub comments, blog posts, and the vulnerabilities detected by Retire.js [27].

Overall, we are able to obtain systematically documented details of vulnerabilities for 11 of the JavaScript libraries in our catalogue. In some cases, the documentation for a given flaw specifies an affected range of versions, in which case we consider all library versions within the range to be vulnerable. In other cases, when a flaw is identified in a specific version v of a library, we consider all versions ≤ v to be vulnerable.
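The two matching rules for vulnerable versions, together with the major.minor.patch filtering of release tags, can be sketched as follows. This is a minimal illustration under stated assumptions, not the tooling used in the study: the function names (parseRelease, compareVersions, isVulnerable, countVulnerabilities), the advisory record shape ({from, to} or {atOrBelow}) and the sample advisories are all hypothetical. The sketch also assumes that range bounds are inclusive and that tags which are not plain major.minor.patch are never counted as vulnerable.

```javascript
// Illustrative sketch only; names, record shapes and data are hypothetical.

// Accept only plain major.minor.patch release tags (optionally prefixed "v").
// Pre-release tags such as "1.2.0-rc.1" or "2.0.0-beta" yield null.
function parseRelease(tag) {
  const m = /^v?(\d+)\.(\d+)\.(\d+)$/.exec(tag);
  return m ? m.slice(1).map(Number) : null;
}

// Compare two parsed versions numerically, component by component.
// Returns a negative number, zero, or a positive number.
function compareVersions(a, b) {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return 0;
}

// An advisory either names an inclusive range {from, to}, or a single
// affected version {atOrBelow}; in the latter case every version <= it
// is treated as vulnerable.
function isVulnerable(versionTag, advisory) {
  const v = parseRelease(versionTag);
  if (v === null) return false; // assumption: invalid versions are skipped
  if (advisory.atOrBelow !== undefined) {
    return compareVersions(v, parseRelease(advisory.atOrBelow)) <= 0;
  }
  return compareVersions(v, parseRelease(advisory.from)) >= 0 &&
         compareVersions(v, parseRelease(advisory.to)) <= 0;
}

// Number of distinct advisories that apply to one library version.
function countVulnerabilities(versionTag, advisories) {
  return advisories.filter((a) => isVulnerable(versionTag, a)).length;
}

// Made-up advisories for a hypothetical library.
const advisories = [
  { from: "1.2.0", to: "1.4.9" },
  { atOrBelow: "1.3.2" },
];

console.log(countVulnerabilities("1.3.0", advisories));      // 2
console.log(countVulnerabilities("1.10.0", advisories));     // 0
console.log(countVulnerabilities("1.3.0-beta", advisories)); // 0
```

Comparing the components as numbers rather than as strings matters here: a plain string comparison would order "1.10.0" before "1.4.9" and wrongly place it inside the first range.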

Figure 1 shows details of the 11 libraries with vulnerability information. For each library, we show the total number of versions in our catalogue as well as the fraction of versions with i distinct known vulnerabilities. The worst offender is Angular 1.2.0, which contains 5 vulnerabilities. Overall, we see that 28.3 %, 6.7 %, and 6.1 % of these library versions contain one, two, or three known vulnerabilities, respectively.

5) Limitations: Although we have expended a great deal of effort constructing our catalogue of JavaScript libraries, it is impacted by several limitations. First, by choosing GitHub for versioning and release date information, we need to exclude a small number of libraries that have few or no releases tagged on GitHub or do so in an apparently inconsistent way (e.g., multiple successive releases tagged on the same day). Furthermore, we cannot include closed-source libraries such as […], advertising and tracking libraries like Google Analytics, and social widgets since they typically do not publish version information. Fortunately, the vast majority of such libraries are hosted by their creators at a single, non-versioned URL (e.g., https://www.google-analytics.com/analytics.js), meaning that all clients automatically include the latest version of the library.

Second, our catalogue may miss some revisions of libraries if the author chose to patch the code and not increment the version number. Similarly, we may miss revisions if they are denoted using non-standard notation, such as special suffixes, four-part version numbers, etc., and we may not possess any code samples for a version of a library if it cannot be downloaded from the developer website or a supported CDN.

Third, our library vulnerability assessments are based solely on publicly available documentation. We make no attempts to discover new vulnerabilities, or to quantify the exploitability of libraries as used on websites, for both practical and ethical reasons. Thus, although a website may include a vulnerable library, this does not necessarily imply that the website is exploitable. Furthermore, libraries differ in their release cycles, attack surfaces, functionality, and public scrutiny with respect to vulnerabilities. Thus, we do not claim to provide comparable coverage of vulnerabilities for each library in our catalogue.

B. Library Identification

Identifying an unknown file as a specific version of a JavaScript library is challenging because these libraries are text, which gives web developers, development tools and network software the ability to modify them, e.g., by adding or removing features, concatenating multiple libraries into a single file, or tampering with comments. To reliably detect as many libraries as possible, we use two complementary techniques. These techniques are conceptually similar to those used by the Library Detector Chrome extension [20] and Retire.js.

Static Detection: We compute the file hashes of all observed JavaScript code and compare them to the 19,099 reference hashes in our catalogue. File hashing enables us to identify all cases where libraries are used “as-is.”

Dynamic Detection: During the crawl, we detect the presence of libraries in the browser by fingerprinting the JavaScript runtime environment and by relying on libraries to identify themselves. Specifically, modern libraries typically make themselves available to the environment by means of a global variable that can be detected at runtime. Furthermore, most libraries in our catalogue contain a variable or method that returns the version of the library. As an illustration, the following snippet of JavaScript code detects jQuery:

    1 var jq = window.jQuery || window.$ || window.$jq || window.$j;
    2 if (jq && jq.fn) {
    3   return jq.fn.jquery || null; // version (if known)
    4 } else {
    5   return false; // jQuery not found
    6 }

Line 1 extracts jQuery’s global variable, and line 3 returns the version number if it exists in its fn.jquery attribute. Note that in order to prevent false positives, we check for the global variable and that the fn attribute exists. Later on, we discard all detections with missing or syntactically invalid version strings.

While this dynamic methodology detects libraries even if the source code has been (lightly) modified, it relies on the version attribute to be present. Hence, we can dynamically detect only 39 out of the 72 libraries, and for some, we do not detect (typically older) versions lacking the version string. Table I compares the versions detected dynamically in our crawls d to our static reference catalogue s. Version coverage is often similar; dynamic outperforms static when CDNs are incomplete.

Limitations: Our two detection techniques represent a best-effort approach to identifying JavaScript libraries in the wild. However, there are cases where both techniques can fail. For example, heavily modified libraries will not match our file hashes nor will they match the dynamic signatures. Furthermore, we rely on the correctness of our information sources, i.e., that CDNs contain the version of a library that they claim, and that libraries export the correct version string and do not attempt to conceal their presence. Effectively, these limitations mean that our measurement results should be viewed as lower bounds.

C. Data Collection

A central contribution of our work is to analyse not only whether outdated libraries are being used, but why this may be the case. This implies that detecting whether a library exists in a window or frame is not enough; we must also detect if it was loaded by another script. To model causal inclusion relationships of resources in websites, we introduce the theoretical concept of causality trees and implement it in a modern browser. We integrate our two library detection methods into this modified browser environment and use it to collect data about the usage of JavaScript libraries on the Web.

Causality Trees: The goal of a causality tree is to represent the causal element creation relationships that occur during the loading and execution of a dynamic website in a modern browser. A causality tree contains a directed edge A → B if and only if element A causes element B to load. More specifically, the elements we model include scripts, images and other media content, stylesheets, and embedded HTML documents. A relationship exists whenever an element creates another element (e.g., a script creates an iframe) or changes an existing element’s URL (e.g., a script changes the URL of an iframe or redirects the main document), which is equivalent to creating a new element with a different URL.

While the nodes in a causality tree correspond to nodes in the website’s DOM, their structure is entirely unrelated to

the hierarchical DOM tree. Rather, nodes in the causality tree are snapshots of elements in the DOM tree at specific points in time, and may appear multiple times if the DOM elements are repeatedly modified. For instance, if a script creates an iframe with URL U1 and later changes the URL to U2, the corresponding script node in the causality tree will have two document nodes as its children, corresponding to URLs U1 and U2, but referring to the same HTML

[Figure 2: tree diagram with nodes labelled Root Frame, Included Scripts, Inline Script, Ad Script, and Ad Frame w/ Scripts.]
Fig. 2. Example causality tree.

Chrome Debugging Protocol [8] to minimise the necessity for brittle browser source code modifications.

The Chrome Debugging Protocol provides programmatic access to the browser and allows clients to attach to open windows, inspect network traffic, and interact with the JavaScript environment and the DOM tree loaded in the window. Two prominent uses of this API are the Chrome Developer Tools (an HTML and JavaScript front-end to the protocol) and Selenium’s WebDriver interface to remotely control Chrome.

At a high level, we generate causality trees by observing resource requests through the network view of the debugging protocol. Note that this view includes resources not actually loaded over the network, e.g., inline URL schemas such as data: or […]:. We disable all forms of caching to observe even duplicate resource inclusions within the same frame, which are otherwise handled through an in-memory