Analysis and Detection of Spying Browser Extensions
Total Page:16
File Type:pdf, Size:1020Kb
I Spy with My Little Eye: Analysis and Detection of Spying Browser Extensions Anupama Aggarwal⇤, Bimal Viswanath†, Liang Zhang‡, Saravana Kumar§, Ayush Shah⇤ and Ponnurangam Kumaraguru⇤ ⇤IIIT - Delhi, India, email: [email protected], [email protected], [email protected] †UC Santa Barbara, email: [email protected] ‡Northeastern University, email: [email protected] §CEG, Guindy, India, email: [email protected] Abstract—In this work, we take a step towards understanding history) and sends the information to third-party domains, and defending against spying browser extensions. These are when its core functionality does not require such information extensions repurposed to capture online activities of a user communication (Section 3). Such information theft is a and communicate the collected sensitive information to a serious violation of user privacy. For example, a user’s third-party domain. We conduct an empirical study of such browsing history can contain URLs referencing private doc- extensions on the Chrome Web Store. First, we present an in- uments (e.g., Google docs) or the URL parameters can depth analysis of the spying behavior of these extensions. We reveal privacy sensitive activities performed by the user on observe that these extensions steal a variety of sensitive user a site (e.g., online e-commerce transactions, password based information, such as the complete browsing history (e.g., the authentication). Moreover, if private data is further sold or sequence of web traversals), online social network (OSN) access exchanged with cyber-criminals [4], it can be misused in tokens, IP address, and geolocation. Second, we investigate numerous ways. Left unchecked, users may be exposed to the potential for automatically detecting spying extensions by various risks, including identity theft [5], and financial loss. applying machine learning schemes. We show that using a Compared to other types of user tracking, spying by Recurrent Neural Network (RNN), the sequence of browser browser extensions deserves special attention. Unlike web API calls made by an extension can be a robust feature, pages, browser extensions persist through the entire browser life cycle, and can access in-depth user browsing activities outperforming hand-crafted features (used in prior work on across all websites. By making privileged API calls to the malicious extensions) to detect spying extensions. Our RNN browser, extensions can access the user’s browsing history, based detection scheme achieves a high precision (90.02%) and current open tabs, record keyboard inputs and more. On the recall (93.31%) in detecting spying extensions. other hand, third-party web trackers can only track users on Index Terms—spying browser extensions; information tracking sites where they are embedded by the publisher, and thus obtain a “limited” or fragmented view of a user’s entire browsing behavior. Also note that extensions are available on most popular browsers (e.g., Chrome, Safari, and Fire- 1. Introduction fox) and have 510 million cumulative installs in Chrome. Therefore, spying extension developers have a large user Many websites today rely on advertising to sustain their base that they can target. services. Online ad platforms have developed sophisticated Identifying spying behavior is challenging. Given an user tracking systems to capture user behavior across web- extension, automatically identifying the flow of sensitive sites in order to generate highly personalized ads. Several user information is a non-trivial task [6]. Moreover, we studies have been conducted to understand third-party user observe spying extensions that obfuscate user information tracking on the web, where a third-party entity (like social before sending it to remote servers, and those that switch media widget), embedded by the first-party site the user between spying and non-spying states during their life-time visits, can log the presence and other activities of the user (e.g., using remote Command-and-Control triggers). Such on that site [1] [2]. Unfortunately, user tracking has reached behavior further complicates the identification. to a state where even privileged software running on a user’s To bootstrap our analysis, we resort to expert manual browser is stealing user information [3]. In this work, we investigation of behavioral logs (traces of network, storage, take a step towards understanding and defending against this and browser API requests made by an extension) of ex- new type of privacy risk—browser extensions that spy on tensions to identify spying behavior (Section 3). To reduce user information. manual effort, we apply several heuristics to automatically A browser extension is an add-on software that enhances short-list a candidate set of possible spying extensions, functionality of the browser. We consider a spying extension among all the 43,521 extensions on the Chrome store. After as one which accesses user information (e.g., browsing manually investigating over 1,000 extensions in the candi- date set, we discover 218 spying extensions. This dataset extensions, most of the concepts apply to other browsers as provides an opportunity to better understand spying behavior well, for example, Firefox with WebExtensions [13]. and to build a defense scheme that can automatically detect Extension Architecture. Browser extensions are software spying extensions (ie without any manual effort). programs running on web browsers, providing extended We start by uncovering the modus operandi of spying browsing functionality to users. An extension consists of extensions and analyze their characteristics (Section 4). Spy- JavaScript, HTML, CSS, and other web resources that are ing extensions steal browsing history, social media access needed for rendering web pages. The main logic of an tokens, IP address and geolocation of users. Surprisingly, extension is usually placed in a background page where spying extensions are as popular (based on user base) and programs keep running during the entire browser session. have similar ratings as other extensions on the Chrome store. Many extensions also use content scripts that are programs We suspect that users are mostly unaware of the spying injected into a web page to gain access to the page resources. behavior as only 12 out of 218 extensions received reports Similar to web pages, extensions have access to all of any suspicious behavior from users. Web APIs, such as standard JavaScript API, XMLHttpRe- Next, we focus on automatically detecting spying exten- quest, and DOM. More importantly, Chrome extensions are sions. While a lot of prior work has focused on HoneyPages exposed to a set of privileged browser APIs, called the to trigger and catch malicious behavior by extensions [7], Chrome API. Based on the functionality, Chrome API is and information flow control based approaches [6] [8] [9]; bundled as various feature endpoints, such as web requests we take an alternate approach of leveraging recent advances (e.g., listening or tampering network requests), bookmarks, in machine learning (ML) for detection. Practicality and history, cookies, storage, and tabs. In order to gain access to effectiveness of ML-based defenses has encouraged the the Chrome API, extension developers have to specifically industry (e.g., Google) to also adopt such approaches to request permissions in the Manifest file. When installing an protect against malicious extensions [10]. extension, users have to decide to accept these permission As is the case with any ML-based defense, we start by requests or not. For example, the cookie permission in the first identifying a robust feature. Prior work on detecting Manifest file enables extensions to make API calls to read malicious extensions relies on extensive feature engineering cookies (cookie.get) and change cookies (cookie.set). With to craft a large number of features (e.g., features based on the necessary permissions, extensions can monitor all user extension meta-data, source code, and network, storage and activities in browser. It should be noted that other browsers, API requests) [10]. Surprisingly, our analysis reveals that Safari and Firefox also provide a similar privileged Browser most of the existing features are ineffective in detecting API to extensions [13] [14]. spying extensions (Section 5.2). Instead, browser API calls modeled as sequential data provides the highest predictive Chrome Web Store. The Chrome Web Store serves as a power, thus obviating the need for complex hand-crafted repository for all extensions and provides a low entry barrier features. We further discuss the benefits and robustness for developers to publish extensions. However, due to the aspects of using API call sequence for detection. rampant increase of malicious extensions, Chrome now asks However, browser API call sequences can contain com- for a one-time sign-up fee to publish extensions [15], and plex sequential patterns which makes traditional n-gram does not allow users to install extensions sourced outside based sequence classification approaches ineffective in our the Chrome Web Store ecosystem [16], unless a deliber- case. We observe that recent developments in a class of deep ate developer mode is enabled by the user. Additionally, neural networks called Recurrent Neural Networks (RNN) Google is known to constantly monitor the extension store are well suited for our scenario. We present an RNN based to identify malicious browser extensions [10]. Despite these detection scheme that is capable of learning sophisticated countermeasures, our findings show persistent presence of patterns in API call sequences [11]. An in-depth evalu- spying extensions on the Chrome Web Store. ation of our RNN based approach shows that it outper- forms traditional ML schemes and achieves a high precision 3. Data Collection (90.02%) and recall (93.31%) in detecting spying extensions (Section 5.4). Further, our RNN-based classifier additionally 3.1. Ground Truth Data of Spying Extensions identified 65 previously unknown spying extensions. Finally, we discuss deployment aspects of our approach. While our To conduct our study, we first need a large sample of spying scheme can be used in a centralized setting by analyzing extensions.