Fingerprinting Keywords in Search Queries Over Tor
Total Page:16
File Type:pdf, Size:1020Kb
Proceedings on Privacy Enhancing Technologies ; 2017 (4):251–270 Se Eun Oh*, Shuai Li, and Nicholas Hopper Fingerprinting Keywords in Search Queries over Tor Abstract: Search engine queries contain a great deal ×10 6 ×10 6 of private and potentially compromising information 2 2 about users. One technique to prevent search engines 1 1 from identifying the source of a query, and Internet ser- 0 0 vice providers (ISPs) from identifying the contents of PCA2 PCA2 -1 -1 queries is to query the search engine over an anony- Backgroud Webpage Trace Backgroud Webpage Trace Google Query Trace Duckduckgo Query Trace -2 -2 mous network such as Tor. -1 0 1 2 3 4 5 -1 0 1 2 3 4 5 In this paper, we study the extent to which Website PCA1 ×10 7 PCA1 ×10 7 Fingerprinting can be extended to fingerprint individual (a) Google (b) Duckduckgo queries or keywords to web applications, a task we call Fig. 1. Principal Components Analysis (PCA) Plot of Google and Keyword Fingerprinting (KF). We show that by aug- Duckduckgo query traces and background webpage traces based menting traffic analysis using a two-stage approach with on CUMUL feature set new task-specific feature sets, a passive network adver- Bing and Google, and ISPs, are in a position to col- sary can in many cases defeat the use of Tor to protect lect sensitive details about users. These search queries search engine queries. have also been among the targets of censorship [29] and We explore three popular search engines, Google, Bing, surveillance infrastructures [17] built through the coop- and Duckduckgo, and several machine learning tech- eration of state and private entities. niques with various experimental scenarios. Our experi- One mechanism to conceal the link between users mental results show that KF can identify Google queries and their search queries is an anonymous network such containing one of 300 targeted keywords with recall of as Tor, where the identities of clients are concealed from 80% and precision of 91%, while identifying the specific servers and the contents and destinations of connec- monitored keyword among 300 search keywords with ac- tions are concealed from network adversaries, by send- curacy 48%. We also further investigate the factors that ing connections through a series of encrypted relays. contribute to keyword fingerprintability to understand However, even Tor cannot always guarantee connection how search engines and users might protect against KF. anonymity since the timing and volume of traffic still reveal some information about user browsing activity. DOI 10.1515/popets-2017-0048 Over the past decade, researchers have applied vari- Received 2017-02-28; revised 2017-06-01; accepted 2017-06-02. ous machine learning techniques to features based on packet size and timing information from Tor connec- tions to show that Website Fingerprinting (WF) attacks 1 Introduction can identify frequently visited or sensitive web pages Internet search engines are the most popular and con- [7, 14, 20, 30, 31, 44]. venient method by which users locate information on In this paper, we describe a new application of the web. As such, the queries a user makes to these these attacks on Tor, Keyword Fingerprinting (KF). In search engines necessarily contain a great deal of pri- this attack model, a local passive adversary attempts vate and personal information about the user. For ex- to infer a user’s search engine queries, based only on ample, AOL search data leaked in 2006 was found to analysing traffic intercepted between the client and the contain queries revealing medical conditions, drug use, entry guard in the Tor network. A KF attack proceeds in sexual assault victims, marital problems, and political three stages. First, the attacker must identify which Tor leanings [18, 38]. Thus, popular search engines such as connections carry the search result traces of a particu- lar search engine against the other webpage traces. The second is to determine whether a target query trace is in a list of “monitored queries” targeted for identification. *Corresponding Author: Se Eun Oh: University of Min- nesota, E-mail: [email protected] The third goal is to classify each query trace correctly Shuai Li: University of Minnesota, E-mail: [email protected] to predict the query (keyword) that the victim typed. Nicholas Hopper: University of Minnesota, E-mail: hop- Note that while KF and WF attacks share some [email protected] common techniques, KF focuses on the latter stages of Unauthenticated Download Date | 1/3/18 2:35 AM Proceedings on Privacy Enhancing Technologies ; 2017 (4):252–270 this attack, distinguishing between multiple results from 2 Threat Model a single web application, which is challenging for WF techniques. Figure 1 illustrates this distinction; it shows Our attack model for KF follows the standard WF at- the website fingerprints (using the CUMUL features tack model, in which the attacker is a passive attacker proposed by Panchenko et al. [30]) of 10,000 popular who can only observe the communication between the web pages in green, and 10,000 search engine queries in client and the Tor entry guard. He cannot add, drop, blue. Search engine queries are easy to distinguish from modify, decrypt packets, or compromise the entry guard. general web traffic, but show less variation between key- The attacker has a list of monitored phrases and hopes words. Hence, while direct application of WF performs to identify packet traces in which the user queries a the first step very well and can perform adequately in search engine for one of these phrases, which we re- the second stage, it performs poorly for differentiating fer to as keywords. We also assume that search engine between monitored keywords. Thus, the different level servers are uncooperative and consequently users relay of application as well as the multi-stage nature of the at- their queries through Tor to hide the link between pre- tack make it difficult to directly use or compare results vious and future queries. Thus, our threat model only from the WF setting. monitors traces entering through the Tor network. The primary contribution of this paper is to demon- The attacker will progress through two sequential strate the feasibility of KF across a variety of set- fingerprinting steps. First, he performs website finger- tings. We collect a large keyword dataset, yielding about printing to identify the traffic traces that contain queries 160,000 search query traces for monitored and back- to the targeted search engine, rather than other web- ground keywords. Based on this dataset, we determine page traffic. Traffic not identified by this step is ignored. that WF features do not carry sufficient information We refer to the remaining traces, passed to the next to distinguish between queries, as Figure 1 illustrates. step, as search query traces or keyword traces. Second, Thus we conduct a feature analysis to identify more use- the attacker performs KF to predict keywords in query ful features for KF; adding these features significantly traces. The attacker will classify query traces into in- improves the accuracy that can be achieved in the sec- dividual keywords, creating a multiclass classifier. De- ond and third stages of the attack. pending on the purpose of the attack, the attacker trains Using these new features, we explore the effective- classifiers with either binary or multiclass classifications. ness of KF, by considering different sizes of “mon- For example, binary classification will detect monitored itored” and “background” keyword sets; both incre- keywords against background keywords while multiclass mental search (e.g. Google Instant1) and “high secu- classification will infer which of the monitored keywords rity” search (with JavaScript disabled); across popu- the victim queried. lar search engines — Google, Bing, and Duckduckgo; We assume the adversary periodically re-trains the and with different classifiers, support vector machines classifiers used in both steps with new training data; (SVM), k-Nearest Neighbor, and random forests (k- since both training and acquiring the data are resource- fingerprinting). We show that well-known high-overhead intensive, we used data gathered over a three-month pe- WF defenses [5, 6] significantly reduce the success of riod to test the generalizability of our results. KF attacks but do not completely prevent them with 3 Data Set standard parameters. We examine factors that differ be- tween non-fingerprintable (those with recall of 0%) and 3.1 Data Collection fingerprintable keywords. Overall, our results indicate We describe the Tor traffic capture tool and data that use of Tor alone may be inadequate to defend the sets collected for our experiment. To collect search content of users’ search engine queries. query traces on Tor, we used the Juarez et al. Tor Browser Crawler [22, 39], built based on Selenium, Stem, and Dumpcap to automate Tor Browser browsing and packet capture. We added two additional features to the crawler for our experiments. The first is to identify abnor- mal Google responses due to rate limiting and rebuild 1 Google Instant is a predictive search for potential responses as the Tor circuit. When an exit node is rate-limited by soon as or before the user types keywords into the search box. By default, Google Instant predictions are enabled only when Google, it responds (most often) with a CAPTCHA, the computer is fast enough. Unauthenticated Download Date | 1/3/18 2:35 AM Proceedings on Privacy Enhancing Technologies ; 2017 (4):253–270 thus, we added a mechanism to detect these cases and Table 1. Data collection settings for Google, Bing, and Duck- reset the Tor process, resulting in a different exit node. duckgo (inc:incremental, one:one-shot, Y:Yes, N:No) Among 101,685 HTML files for Google queries and keyword set Google(inc) Google(one) Bing Duck 125,245 HTML files for Bing, we manually inspected Top-ranked Y Y Y Y those of size less than 100KB, yielding 3,315 HTML re- Blacklisted Y Y N N sponses for Google and 9,985 for Bing, and found that AOL Y N N N a simple size threshold of 10KB for the response doc- ument was sufficient to distinguish rate-limit responses search query traces from January to February in 2017.) from normal results with perfect accuracy.