Updated in September 2017: Require valid versions for library detection throughout the paper. The vulnerability analysis already did so and remains identical. Modifications in Tables I, III and IV; Figures 4 and 7; Sections III-B, IV-B, IV-C, IV-F and IV-H. Additionally, highlight Ember’s security practices in Section V.
Thou Shalt Not Depend on Me: Analysing the Use of Outdated JavaScript Libraries on the Web
Tobias Lauinger, Abdelberi Chaabane, Sajjad Arshad, William Robertson, Christo Wilson and Engin Kirda
Northeastern University
{toby, 3abdou, arshad, wkr, cbw, ek}@ccs.neu.edu
Abstract: Web developers routinely rely on third-party JavaScript libraries such as jQuery to enhance the functionality of their sites. However, if not properly maintained, such dependencies can create attack vectors allowing a site to be compromised. In this paper, we conduct the first comprehensive study of client-side JavaScript library usage and the resulting security implications across the Web. Using data from over 133 k websites, we show that 37 % of them include at least one library with a known vulnerability; the time lag behind the newest release of a library is measured in the order of years. In order to better understand why websites use so many vulnerable or outdated libraries, we track causal inclusion relationships and quantify different scenarios. We observe sites including libraries in ad hoc and often transitive ways, which can lead to different versions of the same library being loaded into the same document at the same time. Furthermore, we find that libraries included transitively, or via ad and tracking code, are more likely to be vulnerable. This demonstrates that not only website administrators, but also the dynamic architecture and developers of third-party services are to blame for the Web’s poor state of library management. The results of our work underline the need for more thorough approaches to dependency management, code maintenance and third-party code inclusion on the Web.

Permission to freely reproduce all or part of this paper for noncommercial purposes is granted provided that copies bear this notice and the full citation on the first page. Reproduction for commercial purposes is strictly prohibited without the prior written consent of the Internet Society, the first-named author (for reproduction of an entire paper only), and the author’s employer if the paper was prepared within the scope of employment.
NDSS ’17, 26 February - 1 March 2017, San Diego, CA, USA
Copyright 2017 Internet Society, ISBN 1-1891562-46-0
https://doi.org/10.14722/ndss.2017.23414

I. INTRODUCTION

The Web is arguably the most popular contemporary programming platform. Although websites are relatively easy to create, they are often composed of heterogeneous components such as database backends, content generation engines, multiple scripting languages and client-side code, and they need to deal with unsanitised inputs encoded in several different formats. Hence, it is no surprise that it is challenging to secure websites because of the large attack surface they expose.

One specific, significant attack surface are vulnerabilities related to client-side JavaScript, such as cross-site scripting (XSS) and advanced phishing. Crucially, modern websites often include popular third-party JavaScript libraries, and thus are at risk of inheriting vulnerabilities contained in these libraries. For example, a 2013 XSS vulnerability in the jQuery [13] library before version 1.6.3 allowed remote attackers to inject arbitrary scripts or HTML into vulnerable websites via a crafted tag. As a result, it is of the utmost importance for websites to manage their library dependencies and, in particular, to update vulnerable libraries in a timely fashion.

To date, security research has addressed a wide range of client-side security issues in websites, including validation [30] and XSS ([17], [36]), cross-site request forgery [4], and session fixation [34]. However, the use of vulnerable JavaScript libraries by websites has not received nearly as much attention. In 2014, a series of blog posts presented cursory measurements highlighting that major websites included known vulnerable libraries ([25], [26], [24]). These findings echo warnings from other software ecosystems like Android [3], Java [32] and Windows [21], which show that vulnerable libraries continue to exist in the wild even when they are widely known to contain severe vulnerabilities. Given that JavaScript dependency management is relatively primitive and corresponding tools are not as well-established as in more mature ecosystems, these findings suggest that security issues caused by outdated JavaScript libraries on the Web may be widespread.

In this paper, we conduct the first comprehensive study on the security implications of JavaScript library usage in websites. We seek to answer the following questions:

• Where do websites load JavaScript libraries from (i.e., first or third-party domains), and how frequently are these domains used?
• How current are the libraries that websites are using, and do they contain known vulnerabilities?
• Are web developers intentionally including JavaScript libraries, or are these dependencies caused by advertising and tracking code?
• Are existing remediation strategies effective or widely used?
• Are there additional technical, methodological, or organisational changes that can improve the security of websites with respect to JavaScript library usage?

Note that the focus of this paper is not measuring the security state of specific JavaScript libraries. Rather, our goal (and primary contribution) is to empirically examine whether website operators keep their libraries current and react to publicly disclosed vulnerabilities.

Answering these questions necessitated solving three fundamental methodological challenges. First, there is no centralised repository of metadata pertaining to JavaScript libraries and their versions, release dates, and known vulnerabilities. To address this, we manually constructed a catalogue containing all “release” versions of 72 of the most popular open-source libraries, including detailed vulnerability information on a subset of 11 libraries. Second, web developers often modify JavaScript libraries by reformatting, restructuring or appending code, which makes it difficult to detect library usage in the wild. We solve this problem through a combination of static and dynamic analysis techniques. Third, to understand why specific libraries are loaded by a given site, we need to track all of the causal relationships between page elements (e.g., script s1 in frame f1 injects script s2 into frame f2). To solve this, we developed a customised version of Chromium that records detailed causality trees of page element creation relationships.

Using these tools, we crawled the Alexa Top 75 k websites and a random sample of 75 k websites drawn from a snapshot of the .com zone in May 2016. These two crawls allow us to compare and contrast JavaScript library usage between popular and unpopular websites. In total, we observed 11,141,726 inline scripts and script file inclusions; 86.6 % of Alexa sites and 65.4 % of .com sites used at least one well-known JavaScript library, with jQuery being the most popular by a large majority.

Analysis of our dataset reveals many concerning facts about JavaScript library management on today’s Web. More than a third of the websites in our Alexa crawl include at least one vulnerable library version, and nearly 10 % include two or more different vulnerable versions. From a per-library perspective, at least 36.7 % of jQuery, 40.1 % of Angular, 86.6 % of Handlebars, and 87.3 % of YUI inclusions use a vulnerable version. Alarmingly, many sites continue to rely on libraries like YUI and SWFObject that are no longer maintained. In fact, the median website in our dataset is using a library version 1,177 days older than the newest release, which explains why so many vulnerable libraries tend to linger on the Web. Advertising, tracking and social widget code can cause transitive library inclusions with a higher rate of vulnerability, suggesting that these problems extend beyond individual website administrators to providers of Web infrastructure and services.

We also observe many websites exhibiting surprising behaviours with respect to JavaScript library inclusion. For example, 4.2 % of websites using jQuery in the Alexa crawl include the same library version multiple times in the same document, and 10.9 % include multiple different versions of jQuery into the same document. To our knowledge, ours is the first study to make these observations, since existing tools ([27], [20]) are unable to detect these anomalies. These strange behaviours may have a negative impact on security as asynchronous loading leads to nondeterministic behaviour, and it remains unclear which version will ultimately be used.

Perhaps our most sobering finding is practical evidence that the JavaScript library ecosystem is complex, unorganised, and quite “ad hoc” with respect to security. As of this writing, there are no reliable vulnerability databases, no security mailing lists for the most popular libraries, few or no details on security issues in release notes, and often, it is difficult to identify which versions of a library are affected by a reported vulnerability.

Overall, our study makes the following contributions:

• We conduct the first comprehensive study showing that a significant number of websites include vulnerable or outdated JavaScript libraries.
• We present results on the origins of vulnerable JavaScript library inclusions, which allows us to contrast the security posture of website developers with third-party modules such as WordPress, advertising or tracking networks, and social media widgets.
• We show that a large number of websites include JavaScript libraries in unexpected ways, such as multiple inclusions of different library versions into the same document, which may impact their attack surface.
• We find existing remediation strategies to be ineffective at mitigating the threats posed by vulnerable JavaScript libraries. For example, less than 3 % of websites could fix all their vulnerable libraries by applying only patch-level updates. Similarly, only 2 % of websites use the version-aliasing services offered by JavaScript CDNs.

II. BACKGROUND

JavaScript has allowed web developers to build highly interactive websites with sophisticated functionality. For example, communication and production-related online services such as Gmail and Office 365 make heavy use of JavaScript to create web-based applications comparable to their more traditional desktop counterparts. In this paper, we focus exclusively on aspects of client-side JavaScript executed in a browser, not the recent trend of using JavaScript for server-side programming.

A. JavaScript Libraries

In many cases, to make their lives easier, web developers rely on functionality that is bundled in libraries. For example, jQuery [13] is a popular JavaScript library that makes HTML document traversal and manipulation, event handling, animation, and AJAX much simpler and compatible across browsers.

In the simplest case, a JavaScript library is a plain-text script containing code with reasonably well-defined functionality. The script has full access to the DOM that includes it; the concept of namespaces does not exist in JavaScript, and everything that is created is by default global. More elaborate libraries use hacks and conventions to protect the code against naming conflicts, and expose interfaces for retrieving meta-data such as the name and version of the library. Over the course of this study, we found that JavaScript libraries overwhelmingly use the Semantic Versioning [28] convention of major.minor.patch, such as 1.0.1, where the major version component is increased for breaking changes, the minor component for new functionality, and the patch component for backwards-compatible bug fixes.

To include a library into their website, developers typically use the HTML tag and point to an externally-hosted version of the library or a copy on their own server. Library vendors often provide a minified version that has comments and whitespace removed and local variables shortened to reduce the size of the file. Developers can also concatenate multiple libraries into a single file, create custom builds of libraries, or use advanced minification features such as dead code removal. While custom minification builds are relatively common, more aggressive minification settings are rare in client-side JavaScript because they can break code [9].

CDNs. Many libraries are available on Content Distribution Networks (CDNs) for use by other websites. Google, Microsoft
and Yandex host libraries on their CDNs, some popular libraries (e.g., Bootstrap and jQuery) offer their own CDNs, and some community-based CDNs agree to host arbitrary open-source libraries. JavaScript CDNs enable caching of libraries across websites to increase performance. Another useful feature offered by some CDNs is version aliasing. That is, when including a library, the developer may specify a version prefix instead of the full version string, in which case the CDN returns the newest available version with that prefix. When implemented correctly, the patched version of a library will automatically be used on the website when it becomes available on the CDN. However, this works only for security issues fixed in a backwards-compatible manner, and it conflicts with client-side security mechanisms such as subresource integrity [37]. In addition, version aliasing makes client-side caching of resources less efficient because it must be configured for shorter time spans, that is, hours instead of years. As a result, version aliasing is often discouraged [11].

Third Parties. Third-party modules such as advertising, trackers, social media or other widgets that are often embedded in webpages are typically implemented in JavaScript. Furthermore, these scripts can also load libraries, possibly without the knowledge of the site maintainer. If not isolated in a frame, these libraries gain full privileges in the including site’s context. Thus, even if a web developer keeps their own library dependencies updated, outdated versions may still be included by badly maintained third-party content. Also, some JavaScript libraries and many web frameworks contain their own copies of libraries they depend on. Hence, web developers may unknowingly rely on software maintainers to update JavaScript libraries. HTML

TABLE I. THE 30 MOST FREQUENT LIBRARIES IN OUR ALEXA CRAWL (OUT OF 72 SUPPORTED LIBRARIES). VERSIONS: s TOTAL IN REFERENCE CATALOGUE FOR STATIC DETECTION; d DYNAMIC DETECTIONS OBSERVED IN CRAWLS (SOME LIBRARIES/VERSIONS NOT SUPPORTED DYNAMICALLY).

                     Versions    Bower   Wapp    Use on Crawled Sites
Library              s      d    Rank    %       ALEXA     COM
jQuery               66     64   1       42 %    83.9 %    61.1 %
jQuery-UI            46     46   13      7 %     23.5 %    8.0 %
Modernizr            24     28   18      10 %    21.4 %    8.6 %
Bootstrap            32     10   3       -       12.5 %    4.5 %
jQuery-Migrate       7      0    -       -       11.3 %    10.7 %
Underscore           61     34   12      3 %     5.8 %     2.4 %
SWFObject            2      1    -       -       3.7 %     2.4 %
Moment               54     33   6       -       3.5 %     1.4 %
RequireJS            62     40   -       -       3.4 %     2.3 %
jQuery-Form          14     0    -       -       2.7 %     3.4 %
Backbone             29     19   -       2 %     2.7 %     1.6 %
Angular              110    78   2       -       2.4 %     1.6 %
LoDash               77     57   26      -       2.4 %     2.5 %
jQuery-Tools         8      20   -       -       2.3 %     0.9 %
jQuery-Fancybox      10     0    -       -       2.3 %     1.4 %
GreenSock GSAP       45     45   -       -       2.2 %     2.8 %
Handlebars           25     15   -       -       2.0 %     0.4 %
Prototype            5      14   -       -       1.8 %     1.1 %
MooTools             27     24   -       3 %     1.5 %     1.4 %
WebFont Loader       100    0    -       -       1.5 %     0.9 %
jQuery-Cookie        8      0    21      -       1.4 %     0.2 %
Hammer.js            26     14   -       -       1.2 %     0.4 %
jQuery-Validation    13     0    -       -       1.1 %     0.6 %
Mustache             29     21   -       -       1.1 %     0.9 %
YUI 3                37     26   -       -       1.0 %     1.4 %
Velocity             55     15   -       -       0.9 %     0.2 %
Script.aculo.us      5      12   -       -       0.8 %     0.4 %
Knockout             21     9    -       -       0.8 %     0.1 %
Flexslider           11     0    -       -       0.6 %     0.4 %
React                41     23   28      -       0.5 %     1.6 %
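The version aliasing behaviour described in the CDN paragraph (a version prefix resolving to the newest available release with that prefix) can be illustrated with a minimal sketch. The function names `parseVersion`, `compareVersions` and `resolveAlias` are hypothetical and do not correspond to the API of any real CDN, and the release list is sample data chosen for illustration only.

```javascript
// Hypothetical helpers; a sketch of prefix-based version aliasing,
// not the implementation used by any particular CDN.

// Parse "major.minor.patch" into numbers; return null for anything else
// (for example pre-release tags such as "2.0.0-beta").
function parseVersion(s) {
  const m = /^(\d+)\.(\d+)\.(\d+)$/.exec(s);
  return m ? [Number(m[1]), Number(m[2]), Number(m[3])] : null;
}

// Compare two parsed versions component by component.
function compareVersions(a, b) {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] < b[i] ? -1 : 1;
  }
  return 0;
}

// Resolve a prefix such as "1" or "1.11" to the newest matching release,
// or null if no release matches.
function resolveAlias(prefix, available) {
  const want = prefix.split(".").map(Number);
  let best = null;
  for (const s of available) {
    const v = parseVersion(s);
    if (v === null) continue;                        // ignore non-release strings
    if (!want.every((n, i) => v[i] === n)) continue; // prefix must match
    if (best === null || compareVersions(v, parseVersion(best)) > 0) best = s;
  }
  return best;
}

// Sample data (illustrative only).
const releases = ["1.9.1", "1.11.0", "1.11.3", "1.12.4", "2.2.4", "2.0.0-beta"];
console.log(resolveAlias("1.11", releases)); // 1.11.3
console.log(resolveAlias("1", releases));    // 1.12.4
```

Under this sketch, a site that pins the prefix "1.11" picks up patch releases automatically, while a fix that ships only in a new major version is never served, which matches the limitation noted above for fixes that are not backwards compatible.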
versioning and project dependency metadata. We must therefore collect and correlate this data from various separate sources.

1) Selecting Libraries: The initial construction of our metadata archive involves a certain amount of manual verification work. Since there are thousands of JavaScript libraries (e.g., the community-based cdnjs.com hosts 2,379 projects as of August 2016), we focus our study on the most widely used libraries because they are the most consequential.

To select libraries, we leverage library popularity statistics provided by the JavaScript package manager Bower [6] and the web technology survey Wappalyzer [38]. We extend this list of popular libraries with all projects hosted on the public CDNs operated by Google, Microsoft and Yandex. As we will show in Section IV-C, many websites rely on these commercial CDNs to host JavaScript libraries. We collected the data from Bower, Wappalyzer, and the three CDNs in January 2016.

Due to various data availability requirements explained in detail in Section III-A5, we need to exclude certain libraries from our study. Overall, we support 72 libraries: 18 out of the Top 20 installed with Bower, 7 out of the Top 10 frameworks identified on websites by Wappalyzer, 13 of the 14 libraries hosted by Google, 12 of the 18 libraries hosted by Microsoft, and all 11 libraries hosted by Yandex. Table I shows a subset of 30 libraries in our catalogue as well as their rank on Bower and their market share according to Wappalyzer. Although our catalogue appears to cover a sparse set of the libraries on Bower, many of the missing ranks belong to submodules of popular libraries (e.g., rank 5 is Angular Mocks). According to Wappalyzer, we cover 73 % of the most popular libraries.

2) Extracting Versioning Information: Our next step is compiling a complete list of library versions along with their release dates. After unsuccessful experiments with file timestamps and available-since dates on the libraries’ official websites and CDNs, we determined that GitHub was the most reliable source for this kind of information. Nearly all of the open source libraries in our seed lists are hosted on GitHub and tag the source code of their releases, allowing us to extract timestamps and version identifiers from the tags. In naming their releases, they typically follow a major.minor.patch version numbering scheme, which makes it straightforward to identify tags pertaining to releases and ignore all other tags, including “alpha,” “beta” and “release candidate” versions that are not meant to be used in production. As shown in Table I, popular libraries like Angular and jQuery have up to 110 and 66 distinct versions in our catalogue, respectively. However, half of the libraries have fewer than 26 versions.

3) Obtaining Reference Files: Some methods of library detection require us to have access to code samples for each version of a library. We gather library code from two sources: the official website of each library, and from CDNs. For the official websites, we manually download all available library versions. However, some official websites do not provide copies of old library versions, or they only provide copies of a subset of versions. In contrast, CDNs typically do host comprehensive collections of old library versions in order to not break websites that depend on older versions. We utilise the API of one such CDN, jsDelivr, to automatically discover all available versions of libraries on five supported CDNs. For the remaining CDNs, we construct download link templates manually, such as https://ajax.googleapis.com/ajax/libs/jquery/{version}/jquery.min.js. In doing so, we make sure that we download all available variants of a library file, including the full development variant and the minified production variant without whitespace or comments.

When comparing files downloaded from official websites and different CDNs, we noticed that even the same version and variant (e.g., minified) of a library may sometimes differ between sources. We observed additional whitespace, removal of comments, or the likely use of a different minifier or minifier setting, especially when the library’s developers do not provide a minified version. This observation highlights the importance of collecting ground-truth JavaScript library samples from as many official and semi-official sources as possible. Therefore, we use official websites as well as dedicated CDNs (Bootstrap CDN and jQuery CDN), commercial CDNs (Google, Microsoft, and Yandex), and open source CDNs (jsDelivr, cdnjs and OssCDN).

In total, we collect 81,027 JavaScript files. We analyse the sizes of the “main” files of each library in our dataset (that is, we exclude files such as plug-ins that cannot be used stand-alone), and find that Script.aculo.us 1.9.0 is the smallest at 996 bytes (minified). After accounting for duplicates and discarding files smaller than 996 bytes (to reduce the likelihood of false positives due to shared ancillary resources such as configuration files, localisations and plug-ins), our final catalogue includes 19,099 distinct files.

[Fig. 1. Fraction of library versions with i distinct known vulnerabilities each (represented by colours), out of the total library versions in parentheses. Angular 1.2.0 has 5 known vulnerabilities and there are 110 versions overall. Libraries shown: Angular (110), Backbone (29), Dojo (72), Ember (77), Handlebars (25), jQuery (66), jQuery-Migrate (7), jQuery-Mobile (16), jQuery-UI (46), Mustache (29), YUI 3 (37). Horizontal axis: fraction of total versions, 0.0 to 1.0; legend: 0 to 5 vulnerabilities.]

4) Identifying Vulnerabilities: The last step towards building our catalogue is aggregating vulnerability information for our 72 JavaScript libraries. Unfortunately, there is no centralised database of vulnerabilities in JavaScript libraries; instead, we manually compile vulnerability information from the Open Source Vulnerability Database (OSVDB), the National Vulnerability Database (NVD), public bug trackers, GitHub comments, blog posts, and the vulnerabilities detected by Retire.js [27].

Overall, we are able to obtain systematically documented details of vulnerabilities for 11 of the JavaScript libraries in our catalogue. In some cases, the documentation for a given flaw specifies an affected range of versions, in which case we consider all library versions within the range to be vulnerable. In other cases, when a flaw is identified in a specific version v of a library, we consider all versions ≤ v to be vulnerable.
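The two rules for marking versions as vulnerable (an affected range, or all versions ≤ v when a flaw is reported against a single version v) can be sketched as follows. This is only an illustration: the record format with `from`, `to` and `foundIn` fields is hypothetical, the range is assumed to be inclusive at both ends, and the flaw entries are made-up sample data rather than entries from the catalogue.

```javascript
// Hypothetical record format; the catalogue's real schema is not given in the text.

// Parse "major.minor.patch" into numbers; return null for invalid strings.
function parseSemver(s) {
  const m = /^(\d+)\.(\d+)\.(\d+)$/.exec(s);
  return m ? [Number(m[1]), Number(m[2]), Number(m[3])] : null;
}

// Negative if a < b, zero if equal, positive if a > b.
function cmp(a, b) {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return 0;
}

// A flaw either names an affected range {from, to} (assumed inclusive)
// or a single version {foundIn}; in the latter case every version
// less than or equal to foundIn is treated as vulnerable.
function isAffected(version, flaw) {
  const v = parseSemver(version);
  if (v === null) return false; // invalid version strings are never matched
  if (flaw.foundIn !== undefined) {
    return cmp(v, parseSemver(flaw.foundIn)) <= 0;
  }
  return cmp(v, parseSemver(flaw.from)) >= 0 && cmp(v, parseSemver(flaw.to)) <= 0;
}

// Number of distinct known flaws that apply to one library version.
function countVulnerabilities(version, flaws) {
  return flaws.filter((f) => isAffected(version, f)).length;
}

// Made-up sample data for illustration.
const flaws = [
  { id: "flaw-A", from: "1.2.0", to: "1.4.9" },
  { id: "flaw-B", foundIn: "1.3.2" },
];
console.log(countVulnerabilities("1.3.0", flaws)); // 2
console.log(countVulnerabilities("1.4.0", flaws)); // 1
console.log(countVulnerabilities("1.5.0", flaws)); // 0
```

Counting matches per version in this way yields the kind of per-version tally of distinct known vulnerabilities that Fig. 1 reports.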
Figure 1 shows details of the 11 libraries with vulnerability information. For each library, we show the total number of versions in our catalogue as well as the fraction of versions with i distinct known vulnerabilities. The worst offender is Angular 1.2.0, which contains 5 vulnerabilities. Overall, we see that 28.3 %, 6.7 %, and 6.1 % of these library versions contain one, two, or three known vulnerabilities, respectively.

5) Limitations: Although we have expended a great deal of effort constructing our catalogue of JavaScript libraries, it is impacted by several limitations. First, by choosing GitHub for versioning and release date information, we need to exclude a small number of libraries that have few or no releases tagged on GitHub or do so in an apparently inconsistent way (e.g., multiple successive releases tagged on the same day). Furthermore, we cannot include closed-source libraries such as Google Maps, advertising and tracking libraries like Google Analytics, and social widgets since they typically do not publish version information. Fortunately, the vast majority of such libraries are hosted by their creators at a single, non-versioned URL (e.g., https://www.google-analytics.com/analytics.js), meaning that all clients automatically include the latest version of the library.

Second, our catalogue may miss some revisions of libraries if the author chose to patch the code and not increment the version number. Similarly, we may miss revisions if they are denoted using non-standard notation, such as special suffixes, four-part version numbers, etc., and we may not possess any code samples for a version of a library if it cannot be downloaded from the developer website or a supported CDN.

Third, our library vulnerability assessments are based solely on publicly available documentation. We make no attempts to discover new vulnerabilities, or to quantify the exploitability of libraries as used on websites, for both practical and ethical reasons. Thus, although a website may include a vulnerable library, this does not necessarily imply that the website is exploitable. Furthermore, libraries differ in their release cycles, attack surfaces, functionality, and public scrutiny with respect to vulnerabilities. Thus, we do not claim to provide comparable coverage of vulnerabilities for each library in our catalogue.

B. Library Identification

Identifying an unknown file as a specific version of a JavaScript library is challenging because these libraries are text, which gives web developers, development tools and network software the ability to modify them, e.g., by adding or removing features, concatenating multiple libraries into a single file, or tampering with comments. To reliably detect as many libraries as possible, we use two complementary techniques. These techniques are conceptually similar to those used by the Library Detector Chrome extension [20] and Retire.js.

Static Detection: We compute the file hashes of all observed JavaScript code and compare them to the 19,099 reference hashes in our catalogue. File hashing enables us to identify all cases where libraries are used “as-is.”

Dynamic Detection: During the crawl, we detect the presence of libraries in the browser by fingerprinting the JavaScript runtime environment and by relying on libraries to identify themselves. Specifically, modern libraries typically make themselves available to the environment by means of a global variable that can be detected at runtime. Furthermore, most libraries in our catalogue contain a variable or method that returns the version of the library. As an illustration, the following snippet of JavaScript code detects jQuery:

    1  var jq = window.jQuery || window.$ || window.$jq || window.$j;
    2  if (jq && jq.fn) {
    3    return jq.fn.jquery || null; // version (if known)
    4  } else {
    5    return false; // jQuery not found
    6  }

Line 1 extracts jQuery’s global variable, and line 3 returns the version number if it exists in its fn.jquery attribute. Note that in order to prevent false positives, we check for the global variable and that the fn attribute exists. Later on, we discard all detections with missing or syntactically invalid version strings.

While this dynamic methodology detects libraries even if the source code has been (lightly) modified, it relies on the version attribute to be present. Hence, we can dynamically detect only 39 out of the 72 libraries, and for some, we do not detect (typically older) versions lacking the version string. Table I compares the versions detected dynamically in our crawls d to our static reference catalogue s. Version coverage is often similar; dynamic outperforms static when CDNs are incomplete.

Limitations: Our two detection techniques represent a best-effort approach to identifying JavaScript libraries in the wild. However, there are cases where both techniques can fail. For example, heavily modified libraries will not match our file hashes nor will they match the dynamic signatures. Furthermore, we rely on the correctness of our information sources, i.e., that CDNs contain the version of a library that they claim, and that libraries export the correct version string and do not attempt to conceal their presence. Effectively, these limitations mean that our measurement results should be viewed as lower bounds.

C. Data Collection

A central contribution of our work is to analyse not only whether outdated libraries are being used, but why this may be the case. This implies that detecting whether a library exists in a window or frame is not enough; we must also detect if it was loaded by another script. To model causal inclusion relationships of resources in websites, we introduce the theoretical concept of causality trees and implement it in a modern browser. We integrate our two library detection methods into this modified browser environment and use it to collect data about the usage of JavaScript libraries on the Web.

Causality Trees: The goal of a causality tree is to represent the causal element creation relationships that occur during the loading and execution of a dynamic website in a modern browser. A causality tree contains a directed edge A → B if and only if element A causes element B to load. More specifically, the elements we model include scripts, images and other media content, stylesheets, and embedded HTML documents. A relationship exists whenever an element creates another element (e.g., a script creates an iframe) or changes an existing element’s URL (e.g., a script changes the URL of an iframe or redirects the main document), which is equivalent to creating a new element with a different URL.

While the nodes in a causality tree correspond to nodes in the website’s DOM, their structure is entirely unrelated to
the hierarchical DOM tree. Rather, nodes in the causality tree are snapshots of elements in the DOM tree at specific points in time, and may appear multiple times if the DOM elements are repeatedly modified. For instance, if a script creates an iframe with URL U1 and later changes the URL to U2, the corresponding script node in the causality tree will have two document nodes as its children, corresponding to URLs U1 and U2, but referring to the same HTML

[Fig. 2. Example causality tree. Node labels: Root Frame, Included Scripts, Inline Script, Ad Script, Ad Frame w/ Scripts.]

Chrome Debugging Protocol [8] to minimise the necessity for brittle browser source code modifications.

The Chrome Debugging Protocol provides programmatic access to the browser and allows clients to attach to open windows, inspect network traffic, and interact with the JavaScript environment and the DOM tree loaded in the window. Two prominent uses of this API are the Chrome Developer Tools (an HTML and JavaScript front-end to the protocol) and Selenium’s WebDriver interface to remotely control Chrome.

At a high level, we generate causality trees by observing resource requests through the network view of the debugging protocol. Note that this view includes resources not actually loaded over the network, e.g., inline URL schemas such as data: or javascript:. We disable all forms of caching to observe even duplicate resource inclusions within the same frame, which are otherwise handled through an in-memory
creation event since websites routinely contain thousands of script nodes (see Section IV-B).

To detect cases of inclusions that we may miss due to the four-second detection interval, we additionally execute the dynamic detection code post hoc on all scripts found during the crawl. This execution is done in an individual Node.js environment with a fake DOM tree. This offline detection step cannot fully replace in-browser detections, however, since some libraries such as jQuery UI have code or environment prerequisites that cause the offline detection to fail.

Annotation of Ads, Trackers and Widgets: To further clarify the provenance of scripts we observe in our crawl, we aim to determine whether they are related to known advertising, tracking, or social widget code. We achieve this by injecting a customised version of the AdBlock browser extension into each frame. Our version of AdBlock flags content but permits it to load, and we verified that our customised version remains undetected by common ad-block detection scripts. We use EasyList and EasyPrivacy to identify advertising and tracking content, and Fanboy’s Social Blocking List to detect social network widgets. For our analysis, we label an element in the causality tree as ad/tracker/widget-related whenever the corresponding element or any parent in the DOM tree is labelled by AdBlock. Additionally, we propagate these labels downwards to all children of the labelled node in the causality tree.

Crawl Parameters: To gain a representative view of JavaScript library usage on the Web, we collected two different datasets. First, we crawled the Alexa Top 75 k domains, which represent websites popular with users. Second, we crawled 75 k domains randomly sampled from a snapshot of the .com zone, that is, a random sample of all websites with a .com address, which we expect to be dominated by less popular websites. We conducted the two crawls in May 2016 from IP addresses in a /24 range in the US. We observe ∼5 % and ∼17.2 % failure rates in ALEXA and COM, leaving us with data from 71,217 and 62,086 unique domains, respectively. Failures were due to timeouts and unresolvable domains, which is expected especially for COM since the zone file contains domains that may not have an active website.

To preserve the fidelity of our data collection, our crawler is based on Chromium and includes support for Flash. We disable various security mechanisms such as malware and phishing filters. We only crawl the homepage of each visited site due to the presence of many sites that thwart deeper traversal by requiring log-ins. While visiting a page, the crawler scrolls downwards to trigger loading of any dynamic content. As we found page-loaded events to be unreliable, our crawler remains on each page for a fixed delay of 60 seconds before clearing its entire state, restarting, and then proceeding to the next site.

D. Validation

As the final step in our methodology, we validate that our static and dynamic detection methods work in practice.

Dynamic Detection: To investigate the efficacy of our dynamic detection code, we conduct a controlled experiment: we load each of the reference libraries in our catalogue into Node.js, one at a time, and attempt to detect each file with the dynamic detection method. Intuitively, we know the exact version of each loaded library, which enables us to assess the accuracy of the dynamic detection.

Overall, we observe that the dynamic detection code is able to identify the exact name and version of 79.2 % of the libraries, as well as the name (but not the version) of 18.6 % of the libraries. Only 2 % of libraries fail to be identified (i.e., were false negatives). We manually examine the libraries that are only detected by name, and find that the vast majority are older versions that do not include a variable or method that returns the library version. Of course, 100 % of these libraries can be detected based on their file hashes, which reinforces the importance of using multiple techniques to identify libraries.

Static Detection: To investigate the efficacy of library detection using file hashes, we conduct a second controlled experiment: we randomly select from our crawled data 415 unique scripts that the dynamic detection code classifies as being jQuery, and attempt to detect them based on their file hashes. In this case, we are treating the output of the dynamic detection method as ground truth.

Overall, we observe that only 15.4 % of the libraries can be identified as jQuery based on their file hash. Although this is a low detection rate, the result also matches our expectation that developers often deploy customised versions of libraries. For example, 90 % of the jQuery libraries that we fail to detect via file hashing contain fewer than 150 line break characters, whereas non-minified copies of jQuery from our catalogue contain more than 1900. This strongly suggests that the unique scripts are custom-minified versions of jQuery.

Hypothetical “Name-in-URL” Detection: For the last validation step, we consider a simple library detection heuristic. The heuristic flags a script file as jQuery, for instance, whenever the string “jquery” appears in the URL of the script. To evaluate the accuracy of this heuristic, we extract from our ALEXA crawl the set of script URLs that contain “jquery,” and the URLs of scripts detected as jQuery by our dynamic and static methods. Out of these URLs, 22.3 % contain “jquery” and are also detected as the library; 69 % are flagged only by the heuristic, and 8.8 % are detected only by our dynamic and static methods. The heuristic appears to cause a large number of false positives due to scripts named “jquery” without containing the library, and it also seems to suffer from false negatives due to scripts that contain jQuery but have an unrelated name.

We validate this finding by manually examining 50 scripts from each of the two set differences. The scripts in the detection-only sample appear to contain additional code such as application code or other libraries. Only one of these scripts contains not jQuery but Zepto.js, an alternative to jQuery that is partially compatible and also defines the characteristic $.fn variable. On the other hand, none of the scripts in the heuristic-only sample can be confirmed as the library; nearly all of them contain plug-ins for jQuery but not the library itself.

The results for the Modernizr library, which does not have an equivalent to jQuery’s extensive plug-in ecosystem, confirm this trend. The overlap between heuristic and detection is 55.3 % of URLs, heuristic-only 0.8 %, and detection-only 44 %; the simple heuristic misses many Modernizr files renamed by developers. These results underline the need for more robust detection techniques such as our dynamic and static methods; we do not use the heuristic in our analysis.
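The comparison between the name-in-URL heuristic and the detection results can be illustrated with a small Python sketch. It is not the authors' code: the function names `name_in_url` and `compare`, the case-insensitive match, and the example.com URLs are assumptions made for illustration, and the shares are computed over the union of both URL sets, which is one plausible reading of the percentages reported above.

```python
def name_in_url(url: str, name: str = "jquery") -> bool:
    """Heuristic: flag a script if the library name occurs anywhere in its URL.

    Case-insensitive matching is an assumption of this sketch.
    """
    return name in url.lower()


def compare(script_urls, detected_urls, name="jquery"):
    """Return the shares of URLs flagged by both methods, by the heuristic
    only, and by detection only, relative to the union of both sets."""
    heuristic = {u for u in script_urls if name_in_url(u, name)}
    detected = set(detected_urls)
    union = heuristic | detected
    if not union:
        return {"both": 0.0, "heuristic_only": 0.0, "detection_only": 0.0}
    total = len(union)
    return {
        "both": len(heuristic & detected) / total,
        "heuristic_only": len(heuristic - detected) / total,
        "detection_only": len(detected - heuristic) / total,
    }


# Hypothetical crawl excerpt (all URLs are made up for illustration).
script_urls = [
    "https://example.com/js/jquery-1.12.4.min.js",  # the library itself
    "https://example.com/js/jquery.cookie.js",      # plug-in, not the library
    "https://example.com/js/jquery.lightbox.js",    # plug-in, not the library
    "https://example.com/assets/bundle.js",         # bundle that embeds jQuery
]
# URLs that dynamic or static detection (assumed given here) identified as jQuery.
detected_urls = [
    "https://example.com/js/jquery-1.12.4.min.js",
    "https://example.com/assets/bundle.js",
]

result = compare(script_urls, detected_urls)
print(result)
# {'both': 0.25, 'heuristic_only': 0.5, 'detection_only': 0.25}
```

In this toy input, the two plug-in URLs are flagged only by the heuristic, while the bundle that embeds jQuery under an unrelated name is found only by detection, mirroring the two error classes described above.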
Fig. 3. Distribution of JavaScript inclusion type frequency per site in ALEXA. (CDF over Alexa sites; x-axis: # scripts, logarithmic; series: Internal, External, Inline.)

Fig. 4. Distribution of library inclusion type frequency per site in ALEXA. (CDF over Alexa sites; x-axis: # detected libraries; series: Inline, External, Internal, All.)

Fig. 5. Distribution of vulnerable library count versus overall library count per site in ALEXA. (CDF over Alexa sites; x-axis: # detected libraries; series: Vulnerable, All.)
Fig. 6. Distribution of JavaScript inclusion type frequency per site in COM. (CDF over com sites; x-axis: # scripts, logarithmic; series: Internal, External, Inline.)

Fig. 7. Distribution of library inclusion type frequency per site in COM. (CDF over com sites; x-axis: # detected libraries; series: Inline, External, Internal, All.)

Fig. 8. Distribution of vulnerable library count versus overall library count per site in COM. (CDF over com sites; x-axis: # detected libraries; series: Vulnerable, All.)
IV. ANALYSIS

In this section, we analyse the data from our web crawls. First, we present a general overview of the dataset by drilling down into the causality trees, overall JavaScript inclusion statistics, and vulnerable JavaScript library inclusions. Next, we examine risk factors for sites that include vulnerable libraries, and the age of vulnerable libraries (i.e., relative to the latest release of each library). Finally, we examine unexpected, duplicate library inclusions in websites, and investigate whether common remediation practices are useful and used in practice.

A. Causality Trees

We begin our analysis by measuring the complexity of the websites in our crawls. The median causality tree in ALEXA

Since paths correspond to causal relationships, intuitively this means that the inclusion of a node could have been influenced by up to 438 predecessors. In both crawls, images tend to appear further up in the causality trees at a median depth of 1, that is, at least half of them are directly included in the main document, whereas documents tend to appear further down at a median depth of 2, which indicates that they are more frequently dynamically generated.

B. General JavaScript Statistics

Scripts are the most common node type in our causality trees; 97 % of ALEXA sites and 83.6 % of COM sites contain JavaScript. The most common script type is the inline script, which includes script code embedded as text in a