<<

HTTPOS: Sealing Information Leaks with Browser-side Obfuscation of Encrypted Flows

§ § § † § ‡ Xiapu Luo ∗, Peng Zhou , Edmond W. W. Chan , Wenke Lee , Rocky K. C. Chang , Roberto Perdisci The Hong Kong Polytechnic University§, Georgia Institute of Technology†, University of Georgia‡ § † ‡ {csxluo,cspzhouroc,cswwchan,csrchang}@comp.polyu.edu.hk , [email protected] , [email protected]

Abstract be profiled from traffic features [29]. A common approach to preventing leaks is to obfuscate the encrypted traffic by Leakage of private information from web applications— changing the statistical features of the traffic, such as the even when the traffic is encrypted—is a major security packet size and packet timing information [13,23,35,38]. threat to many applications that use HTTP for data deliv- Existing methods for defending against information ery. This paper considers the problem of inferring from en- leaks, however, suffer from quite a few problems. A major crypted HTTP traffic the web sites or web pages visited by problem is that, as server-side solutions, they require modi- a user. Existing browser-side approaches to this problem fications of web entities, such as browsers, servers, and even cannot defend against more advanced attacks, and server- web objects [13,38]. Modifying the web entities is not fea- side approaches usually require modifications to web enti- sible in many circumstances and cannot easily satisfy differ- ties, such as browsers, servers, or web objects. In this paper, ent applications’ requirements on information leak preven- we propose a novel browser-side system, namely HTTPOS, tion. A second fundamental problem with these methods to prevent information leaks and offer much better scalabil- is that they are still vulnerable to some advanced traffic- ity and flexibility. HTTPOS provides a comprehensive and analysis attacks. For example, although Sun et al. [35] pro- configurable suite of traffic transformation techniques for posed twelve approaches to defeat their traffic-analysis at- a browser to defeat traffic analysis without requiring any tack based on web object size, new attacks based on the server-side modifications. Extensive evaluation of HTTPOS tuple of packet size and direction [26] could still identify on live web traffic shows that it can successfully prevent the the web sites visited by a user. Finally, the efficacy of these state-of-the-art attacks from inferring private information methods has not been validated thoroughly based on actual from encrypted HTTP flows. implementations and live HTTP traffic. An exception is the work from Chen et al. [13] that is implemented as an IIS extension and a add-on. 1 Introduction In this paper we explore a browser-side approach to pre- vent information leaks from encrypted web traffic. Com- pared with the server-side approach, the browser-side ap- Leakage of private information from web applications proach has the scalability advantage, because only the traf- is a major security threat to many applications that use fic between the browser and the visited servers needs to be HTTP for data delivery. Cloud computing and other similar obfuscated. Moreover, it is possible for users to choose service-oriented platforms will only exacerbate this prob- which encrypted flows to be obfuscated in order to con- lem, because these services are usually delivered through serve resources and to reduce impacts on performance, but web browsers. Moreover, it is well known that data encryp- this flexibility advantage is very difficult to obtain from a tion alone is insufficient for preventing information leaks. server-side approach. However, designing a browser-side For instance, traffic-analysis attacks can identify the web method is very challenging, because the server’s behavior sites visited by a user from encrypted traffic [7,22,23,26] cannot be directly modified to evade traffic analysis. That and anonymized NetFlows records [14]. Chen et al. have is, we cannot apply the previously proposed methods that further showed that sensitive personal information, such as assume the capability of modifying the server’s behavior to medical records and financial data, could also be inferred the browser-side approach. through traffic analysis [13]. Besides, a user’s browser We show in this paper that it is possible to devise a could be fingerprinted [39], and her browsing patterns could browser-side method to defeat traffic analysis by presenting

∗Most of the work by this author was performed while at Georgia Tech. HTTPOS (which stands for HTTP or HTTPS with Obfus- cation), our proposed browser-side method. In addition to and HTTPOS’s location. There are five entities in the threat the advantages discussed above, HTTPOS has several other models: a victim user (i.e., client in Figure 1), an attacker, important advantages, such as supporting a wider scope of an encrypted tunnel, HTTPOS, and remote web sites/pages. application scenarios than the previous approaches (see the The threat models for scenarios (a)-(c) were adopted in pre- threat model in Section 2). HTTPOS is a user-level pro- vious works, including Sun et al. [35], Liberatore et al. [26] gram and does not need to modify any web entity. It obfus- and Wright et al. [38], and the threat model for scenario (d) cates encrypted web traffic by modifying four fundamen- was adopted by Chen et al. [13]. tal network flow features on the TCP and the HTTP layers, In scenarios (a)-(c), a client visits a web site through an namely, packet size, timing of packets, web object size, and encrypted tunnel at different layers, for example, wireless flow size. These features, as shown in the evaluation, are channel with WEP/WPA [21], IPSec-based IP tunnel [37], sufficient for diffusing and confusing the existing traffic- and SSH-based TCP tunnel [6], and the attacker attempts to analysis attacks. To modify the traffic from web servers, find outthat web site. In scenario (d), a client visits different HTTPOS exploits a number of basic features in TCP (e.g., web pages at a certain web site. This attack model assumes Maximal Segment Size (MSS) negotiation and advertising that the attacker knows the web site, and she attempts to window) and HTTP (e.g., HTTP Range and HTTP Pipelin- discover the web pages visited by the client. Note that an ing). updated web page due to the client’s interactions with the We have implemented HTTPOS and conducted exten- web site is considered as a new web page from an attacker’s sive experiments on live HTTP traffic to evaluate its per- viewpoint. For example, some web sites (e.g., Google) may formance in terms of evading traffic-analysis attacks and return auto-suggested words upon receiving a user input. impacts on the performance of network flows. The re- These dynamically updated web pages are regarded as dif- sults show that HTTPOS can effectively prevent the state- ferent web pages. of-the-art attacks from inferring private information from In all four scenarios, the attacker eavesdrops the en- encrypted HTTP flows. As will be shown in Section 5.2, crypted tunnel to obtain the encrypted packets sent between without HTTPOS an attacker can achieve high accuracy the victim and web servers, but she cannot decrypt these of inferring the web sites visited by a user. For example, packets. To infer the visited web sites/pages from these some attacks can achieve 94% accuracy on inferring the encrypted packets, the attacker first profiles the character- 100 web sites we tested by selecting the one with the high- istics of the traffic between the victim and each candidate est confidence based on their classification algorithm. With web site/page. The traffic profiling depends on the traffic HTTPOS all attacks’ accuracy drops to zero for at least 98 analysis methods [7,13,22,23,26]. She can easily build web sites. Even if an attacker chooses the top five web sites such profiles by visiting those web sites/pages via the en- as her inference, the accuracy for at least 94 web sites re- crypted tunnels herself. Equipped with the set of traffic pro- mains zero for all attacks. Moreover, some traffic-analysis files, the attacker then performs the inference by classifying attack can easily infer a user’s input to the Google search the captured traffic trace into the traffic profiles prepared box. But when HTTPOS is applied, the output of such at- beforehand. From the viewpoint of pattern classification, tack is reduced to a random guess. the traffic profiling step is known as conducting supervised Section 2 first presents the threat model, and Section 3 learning to train a classifier, whose feature set is the traffic describes our strategies for evading traffic-analysis attacks profile, and the class label is the web site/page. Moreover, and methods for manipulating network flow features. After the inference step corresponds to labeling a traffic trace ac- that, we introduce HTTPOS’s design and implementation cording to the trained classifier [18]. in Section 4, followed by extensive experiment results in Section 5. We finally introduce the related work in Section HTTPOS obfuscates the encrypted traffic by exploiting the protocol features in TCP and HTTP. Since the TCP con- 6 before concluding this paper in Section 7. nection (and therefore the HTTP connection) is end-to-end in scenarios (a), (b), and (d), HTTPOS can be deployed at 2 Threat models the browsers. On the other hand, the browser’s TCP connec- tion is terminated at the tunnel entry in scenario (c). There- Unlike previous works, we consider in this paper both fore, when HTTPOS is deployed at the browsers for TCP (1) the problem of inferring the web sites visited by users tunnels, only the HTTP methods can be used for traffic ob- and (2) the problem of inferring the web pages browsed fuscation. Furthermore, HTTPOS can be deployed at the by users. The three attack scenarios illustrated in Fig- tunnel entry in scenarios (b) and (c). Since we implement ures 1(a)-1(c) concern problem (1), whereas the one in Fig- HTTPOS as an HTTP proxy (which will be discussed in ure 1(d) concerns problem (2). We summarize the threat Section 4), the same HTTPOS can be placed at the browser models for the four scenarios in Table 1 based on the attack or the tunnel entry for both scenarios. However, placing goals, visibility of the packet information to the attacker, HTTPOS at the TCP tunnel’s entry maximizes HTTPOS’s Client Encrypted IP Client Encrypted Wireless Tunnel Channel Web Sites Web Sites Attacker HTTPOS Wireless Attacker Access Point HTTPOS

(a) Wireless (e.g., WEP/WPA). (b) IP tunnel (e.g., IPSec).

Client Encrypted TCP Tunnel Client Encrypted HTTP Web Sites Tunnel

Attacker Web Pages HTTPOS HTTPOS Attacker

(c) TCP tunnel (e.g., SSH). (d) HTTPS (SSL/TLS).

Figure 1: The four attack scenarios considered in this paper.

Table 1: Threat models for the four attack scenarios. Wireless (e.g., WEP/WPA) IP tunnel (e.g., IPSec) TCP tunnel (e.g., SSH) HTTPS (SSL/TLS)

Attacker’s goal Identify web site Identify web site Identify web site Identify web page Visibility of HTTP header No No No No Visibility of TCP header No No Yes Yes Visibility of Destination IP No No No Yes HTTPOS’s location Client Client/tunnel entry Client/tunnel entry Client obfuscation power, because, as mentioned earlier, HTTPOS et al. proposed the SSWRPQ attack, which is the first at- at the browser cannot use TCP’s protocol features to obfus- tack that can identify web sites through the number and size cate encrypted traffic. of web objects [35]. Later on, Bissias et al. proposed us- ing the inter-arrival time between packets and packet size 3 Defending against traffic-analysis attacks to profile a web site in their BLJL attack [7]. Liberatore et al. exploited the tuple (flow direction, packet size) for traffic analysis and proposed two classification algorithms: Jac- In Section 3.1, we first elaborate on the classification al- card coefficient (JC) and naive Bayesian classifier (NBC) gorithms used in the state-of-the-art traffic-analysis attacks. that are referred to as LL-JC attack and LL-NBC attack, Then we propose strategies with formal analysis to deceive respectively [26]. Most recently, Chen et al. employed a those classification algorithms in Section 3.2. In particu- sequence of (flow direction, packet size) to infer the web lar, we identify four basic features that can affect the infor- pages visited by a victim in their CWWZ attack [13]. mation used by those traffic-analysis attacks: packet size, The SSWRPQ attack employs the number and size of web timing of packets, web object size, and flow size. We then objects as features. Since all HTTP traffic is encrypted, an propose methods in Section 3.3 to manipulate these features attacker cannot obtain the exact values of such features. As to evade these traffic-analysis attacks. suggested in [24,35], the amount of bytes from the server between two consecutive requests from the client is used to 3.1 The state•of•the•art attacks approximate the sizes of web objects. After obtaining the number of web objects and their sizes, the attack uses the We consider the five traffic-analysis attacks on encrypted Jaccard coefficient to quantify the similarity between a new HTTP flows [7,13,26,35] in Table 2. We name them by trace and an existing profile. concatenating the first letters of the authors’ names. Sun The BLJL attack employs both inter-arrival time (IAT) be- Table 2: Traffic-analysis attacks studied in this paper. Attacks Features Classification algorithms

SSWRPQ [35] The number and size of web objects Jaccard coefficient BLJL [7] Inter-arrival time between packets and packet size Cross correlation LL-JC [26] Tuples of (flow direction, packet size) Jaccard coefficient LL-NBC [26] Tuples of (flow direction, packet size) Naive Bayesian classifier CWWZ [13] Sequence of tuple (flow direction, packet size) Sequence comparison tween packets and packet size to profile a web site and uses Based on the assumption that all attributes are independent, cross-correlation to measure the similarity between a new n trace and an existing profile. To compare a new trace’s fea- P(Vnew Vi)= ∏ P(dk Vi), (5) | | tures to an existing profile, the BLJL attack computes the k=1 cross correlation of its IAT sequences and packet size se- where dk denotes a feature in Vi. quences by: The CWWZ attack, unlike the LL-JC and LL-NBC attacks that only inspect individual packets, considers a sequence of ∑n [(τ τ¯)(s s¯)] i=1 i i packets [13] to infer user inputs to a web page. A sequence R = n − n− , (1) p∑ (τi τ¯)p∑ (si s¯) i=1 − i=1 − of directional packet sizes is referred to as a flow vector, de- where τ and s denote the new trace’s IAT values and packet noted by C = ct ,ct+1,...,ct+n 1 , where ct+n 1 represents the directional{ packet size observed− } at time t +−n 1, and n sizes, respectively, whereas τ¯ ands ¯ are the respective mean − values of an existing profile’s IAT and packet size. is the sequence length. Let k be the number of all avail- The LL-JC attack employs tuples of (flow direction, able characters and kt be the number of possible character sequences at time t, and these possible characters consti- packet size) as features. Let D = d1,d2,... be a set of tuples in a trace. The JC is defined as{ } tute an ambiguity set. After observing ct , the attacker may know that only kt /αt possible inputs from the ambiguity Dnew TDi set can produce ct , where αt [1,kt ) is defined as a reduc- S(Dnew,Di)= | |, (2) ∈ Dnew SDi tion factor. In the next observation, the ambiguity set’s size | | kt+1 is reduced from k kt to k (kt /αt ). After receiving where Dnew and Di denote the set of tuples in a new trace · · ct ,ct+1,...,ct+n 1 , the size of ambiguity set is reduced and that in the profile of the ith web site, respectively. The { n kn− } n from k to Πn α , where Πi=1(αt+i 1) is referred to as normalized S (i.e., S) is used to determine to which class a i=1( t+i 1) − reduction power. A− victim’s input can be easily recovered given trace belongs: if an attack has large reduction power. S(Dnew,Di) S(Dnew,Di)= , (3) 3.2 Two defense strategies ∑ j U S(Dnew,D j) ∈ where U is the set of all existing profiles. We propose two general strategies to deceive an attack’s The LL-NBC attack only considers the existence of cer- classification algorithm. The first one is inducing the clas- tain tuples without examining the number of tuples in a sification algorithm to make a random guess by introducing flow. This attack employs the Kernel Density Estimation features that have not been involved in training the algo- (KDE) to estimate the probability density function of each rithm. The second strategy is to confuse the classification tuple (i.e., considering the value of each feature) and then algorithm to misclassify a trace. employs a naive Bayesian classifier to reach a decision. We use V to denote the features used in an LL-NBC attack. 3.2.1 The diffusion strategy The relationship between V and D is that for a given fea- ture in V, its value is equal to zero if the correspondingtuple The SSWRPQ, LL-JC, and LL-NBC attacks implicitly as- is not in D and KDE is not used. After using KDE, although sume that packet sizes (or web object sizes) observed in the such features may have values larger than zero, their values training data set will appear in the testing data set (i.e., a may be very small, depending on the parameters used in new trace to be classified). If all packet sizes or web object the kernel function and the location of tuples that are in D. sizes in a new trace never appear in the training data set, these algorithms will be forced to make a random guess. NBC classifies Vnew into a class Vi if and only if Lemma 1 details how to evade algorithms based on the Jac- P(Vnew Vi)P(Vi) > P(Vnew Vj)P(Vj), j = i. (4) card coefficient and the naive Bayesian classifier. | | ∀ 6 Lemma 1. If a flow comprises a set of tuples, denoted as flow’s features. For instance, Lemma 2 first presents an ap- Dnew, which never exist in any training set, then the LL- proach to confuse the LL-JC, LL-NBC, and SSWRPQ at- JC and LL-NBC attacks cannot classify this flow correctly. tacks. Similarly, if a web object size of a flow never appears in any training set, then the SSWRPQ attack cannot identify Lemma 2. Let Di and D j be the respective sets of tuples in the class of this flow. the profiles of sites i and j, andVi and Vj be the respective feature sets of sites i and j used by the LL-NBC attack. If the Proof. In the LL-JC attack, S(D ,D )= 0, because D tuples for a flow of site i become (D j Di) (i.e., the tuples new i new − does not appear in any training set. Similarly, if the sizes that are in D j but not in Di) after changing the packet sizes of web objects does not exist in any training set, the SS- in the flow, the LL-JC and the LL-NBC attacks will regard WRPQ’s Jaccard coefficient becomes zero. In the LL-NBC the transformed flow as a flow of site j. Similarly, after attack, if Dnew is totally new to the classification algorithm, changing the sizes of web objects in a flow of site i to the P(Vnew Vi)= 0 ( i, i U). Although using KDE may al- one in the profile of site j, the SSWRPQ attack will regard | ∀ ∈ low P(Vnew Vi) > 0 due to the kernel function, we can select the transformed flow as a flow of site j. | Dnew whose tuples are not close to any tuples in the training Proof. Let Dnew be the tuples in the transformed flow. Since set, so that P(Vnew Vi) 0. | → S(Dnew,D j) > S(Dnew,Di)= 0, the LL-JC attack regardsthe Defense mechanisms based on Lemma 1 are feasible in transformed flow as a flow of site j rather than a flow of practice, because the packet size is dominated by a rela- site i. Let V new be the feature set in the transformed flow. tively small number of values [34]. In other words, we can According to the relationship between D and V , we know that P(V new Vj) > P(V new Vi)= 0 (or P(V new Vi) 0 if the easily find packet sizes that never appear in any training set. | | | → Figure 2 plots the CDF of the number of uniquepacket sizes KDE is used). Thus the LL-NBC attack will consider the in a flow from two data sets. The UMass data set contains transformed flow as a flow of site j. Similarly, if the trans- packet header traces collected four times a day from Febru- formed flow’s web object size only exists in site j’s profile, ary 2006 to April 2006, and the size of compressed pcap the SSWRPQ attack will find that the transformed flow is files is around 2.6GB [26]. This data set is used for test- more similar to flows of site j than flows of site i. ing the LL-JC and the LL-NBC attacks [26]. The WIDE For the CWWZ attack, if we introduce other flow vec- traces, on the other hand, contain all traffic going through tors (e.g., by entering some useless inputs) in addition to the samplepoint-F of the WIDE backbone networks from the flow vectors induced by the real inputs, the attacker has 30 March 2009 to 2 April 2009, and the size of the pcap to consider all inputs that may result in these flow vectors traces is around 433GB [10]. Since the WIDE data set con- for the following two reasons. First, the attacker could not tains various kinds of traffic, we extract HTTP flows that differentiate between flow vectors caused by useless inputs have at least five packets. We regard the packets sent to a and those induced by real inputs. Second, the attacker could as request packets, and those sent from a web not knowthe start and the end of real inputs. As a result, the server as response packets. Both figures show that in the reduction power can be reduced. majority of flows the number of unique packet sizes is less Lemma 3 presents a sufficient condition to induce the than 100. Moreover, the request packets usually have fewer BLJL attack to reach an incorrect decision. number of unique packet sizes than the response packets. For the CWWZ attack, if a flow vector does not occur Lemma 3. By letting all packets in a flow have the same in any training set, the attacker cannot exclude any possible size sc and their IATs have the same value τc, R in Eq. (1) input and has to consider all the ambiguity set in the next is then determined by sc and τc, instead of the transformed flow vector. Therefore, the reduction power is fixed to one, flow’s original feature. and the final decision is the same as a random guess. For Proof. If τi = τc and si = sc, then R = p(τc τ¯)(sc s¯). the BLJL attack, since it uses a cross-correlation based al- − − gorithm, it makes an implicit assumption that the number of By adjusting sc and τc, we can therefore make a flow simi- packets (i.e., n) should be the same in the training data set lar to any other flows. As a result, the BLJL attack cannot and the testing data set. If this assumption does not hold, identify the original flow. R could not be computed. Thus, it is possible to defeat this attack by changing the number of packets. 3.3 Manipulating the features

3.2.2 The confusion strategy To defeat the traffic-analysis attacks listed in Table 2 and possibly new traffic-analysis attacks, we manipulate four To confuse an attack’s classification algorithm, we could fundamental network flow features, including packet size, manipulate a flow to make its features similar to another web object size, flow size, and timing of packets. Under 1 1

0.8 0.8

0.6 0.6 CDF CDF 0.4 0.4

0.2 0.2 Request flow Request flow Response flow Response flow

0 0 1 2 3 0 0 1 2 3 10 10 10 10 10 10 10 10 Number of unique packet sizes Number of unique packet sizes (a) The UMass data set. (b) The WIDE data set.

Figure 2: The distributions of unique packet sizes in two HTTP data sets.

our threat models, these features can be measured from an each of which only asks for ui (i = 1,...,NU ) bytes, where NU encrypted HTTP flow and exploited to differentiate network ∑i=1 ui = L. flows by an attacker. We describe our basic approaches for TCP MSS negotiation A TCP packet’s payload length is manipulating these features in HTTPOS below. Since the limited by the TCP MSS. TCP allows sender and receiver flow size is determined by the web object size, we discuss to negotiate the MSS through the MSS option in the TCP the flow size together with the web object size. SYN and TCP SYN/ACK packets. By exploiting this fea- ture, HTTPOS can constrain the packet size by announcing 3.3.1 Packet size a small MSS. TCP advertising window Since the advertising window HTTPOS alters the size of outgoing packets on the HTTP controls the number of bytes a sending TCP could dispatch, and TCP layers. On the HTTP layer, HTTPOS increases the size of the incoming packet can be manipulated by ad- the packet size by adding additional bytes to the HTTP justing the advertising window if it is not larger than the header, for example, adding additional fields, appending MSS. characters to the Referer field, or using specific media types to replace the asterisk in the Accept field. On the 3.3.2 Web object size and flow size TCP layer, HTTPOS decreases the packet size by splitting a TCP packet into several smaller TCP packets. If an at- It is nontrivial for a browser-side approach to affect the tacker could not observe the TCP header (i.e., scenarios (a) web object size. A possible method is to adjust the con- and (b)), HTTPOS increases the packet size by extending tent codings [20]. For example, if a web object is com- the TCP option fields (e.g., by adding the TCP No - pressed before being sent to the client, HTTPOS modifies tion option that is usually used to pad the option list). How- the Accept-Encoding header to force the server to send ever, it is more challenging to control the size of incoming an uncompressed web object. A similar method is to ad- packets from a web server, because it is usually determined just the quality value [20]. Although these two methods are by the server application and the server’s TCP/IP stack. quite general, we find that most servers do not support them. HTTPOS exploits the HTTP Range option, TCP maximum To disguise the size of web objects, we therefore adopt segment size (MSS) negotiation, and TCP advertising win- the following methods based on the observation that an at- dow to manipulate the packet size. tacker could only estimate the size of a web object from en- HTTP Range HTTP/1.1 provides a Range option to fetch crypted traffic using the amount of bytes received between portion of a web object to avoid downloading the same con- two requests [24,35]. tent already received by a client [20]. RFC 2616 refers TCP retransmissions If the client is using IP tunnel or en- to a request having a Range header as a partial GET. crypted wireless channel that prevents an attacker from see- To fetch a web object, HTTPOS sends a request with a ing the TCP header, HTTPOS affects the web object size Range header, such as “Range: bytes=0-0.” If and enlarges the flow size by causing TCP retransmissions. the server supports HTTP Range, it will reply with “206 More precisely, since an ACK number informs a sending Partial Content” and a Content-range header, TCP how many bytes have been successfully received by a such as “Content-range: bytes 0-0/L,” where TCP receiver, by acknowledging only part of the received L is the actual length of the requested web object. Af- bytes, HTTPOS could force the server to re-send the unac- ter that, HTTPOS sends NU requests to the web server, knowledgedportion along with the new portion if the packet size permits [28]. In this case, the attacker will overesti- to manipulate the timing of request packets, because the re- mate both web object size and flow size. If the attacker sponse packets are triggered by request packets. The sec- can observe the TCP header, she may ignore the retransmit- ond level is to manipulate the timing of response packets ted packets. In this case, HTTPOS employs the following through delaying ACK packets. The rational behind this HTTP-based approaches, because an attacker cannot ob- method is that without receiving ACK packets the sending serve the content of encrypted packets. TCP may not send out new data packets due to TCP’s ACK- HTTP Pipelining Using HTTP Pipelining to send several based self-clocking feature. Note that the flow sequence can requests together without waiting for the corresponding re- also be changed by reordering the sequence of TCP SYN sponses makes it difficult for an attacker to infer the size of packets or delaying the corresponding HTTP request pack- a web object [20], because an attacker may regard the total ets. By doing so, HTTPOS can help a user evade the traffic- bytes from the server as the size of a web object. To achieve analysis attack in [14]. this, HTTPOS either merges several requests from the client or attaches a useless request with the original request before 4 HTTPOS sending the packet to the server. In both cases, the attacker will overestimate the HTTP request’s size, web object size, 4.1 Design and flow size. When using the latter approach, HTTPOS does not forward the response to the useless request to the HTTPOS acts as a proxy through which a client visits a client. web site. It accepts HTTP requests from the client and mod- HTTP Range Employing HTTP Range to request data that ifies them as needed before sending them out. Figure 3 il- have been downloaded can enlarge the flow size and dis- lustrates the HTTPOS operations. When HTTPOS receives guise the web object size. Its basic idea is similar to the a URL, it checks whether information related to this URL method using retransmitted TCP packet. The difference is is in the cache, which includes whether the URL can be that the latter method operates on TCP and therefore re- fetched through HTTP Range, whether the server supports quires TCP header being invisible to the attacker, whereas HTTP Pipelining, and the web object size. This informa- the HTTP header in the encrypted traffic is already invisible tion determines which module will be used. Each module to the attacker. Although Sun et al. [35] also suggested us- realizes a method described in Section 4.2. ing HTTP Pipelining and HTTP Range, they did not imple- If it is the first time for HTTPOS to handle a URL, ment and evaluate them. We implement these approaches HTTPOS uses the method based on TCP advertising win- and carefully evaluate them using live HTTP traffic. We dow (Section 4.2.1). It manipulates the advertising win- also measure the popularity of supporting HTTP Range and dow in outgoing packets to control the size of response HTTP Pipelining (Section 4.4). Moreover, we combine the packets. To test whether the URL can be fetched through method based on HTTP Pipelining and that based on TCP HTTP Range, HTTPOS inserts “Range: bytes=0-0” advertising window to evade an advanced CWWZ attack to the outgoing HTTP request. To verify whether the server (see Section 5.2.2). supports HTTP Pipelining, HTTPOS duplicates the HTTP Injecting useless requests Injecting a useless request be- request and adds “Range: bytes=1-1” to it and then tween two requests sent by a client can disguise the web sends these two HTTP requests to the server at the same object size. More precisely, after a client sends one request, time. After receiving the responses, the feature information HTTPOS decreases the advertising window to a small value and the web object size are saved in the cache. (say 10 bytes), so that the server cannot return all the re- If the information is found in the cache, HTTPOS selects quested content in one packet. Once a response packet is the proper method based on its effectiveness on evading received, HTTPOS injects a random request. In this case, traffic-analysis attacks and mitigating performance degra- the attacker will underestimate the web object size, because dation. If the URL could not be fetched through HTTP she may observe many small responses to different HTTP Range, HTTPOS uses the method based on TCP MSS requests. There is no restriction on the requests injected by + TCP advertising window (Section 4.2.2). Otherwise, HTTPOS. Its usage is to mislead an attacker into obtaining HTTPOS selects the method based on multiple TCP con- wrong web object size and flow size. nections + HTTP Range (Section 4.2.3) if the server does not supportHTTP Pipelining, or the method based on HTTP 3.3.3 Timing of packets Pipelining + HTTP Range (Section 4.2.4) if the server does. Since TCP-based methods (i.e., methods based on TCP Since the outgoing packets sent from the client go through MSS and TCP advertising window) cannot change the size HTTPOS, it is easy to control their timing by delaying the of web objects, HTTPOS injects a useless HTTP request transmissions. To manage the timing of packets from the between two requests as described in Section 3.3.2 to mis- server, HTTPOS operates on two levels. The first level is lead the attacks that exploit the characteristics of web ob- Store URL Method based on Inject Useless Request information into cache Advertising Window No Method based on Method based on MSS URL URL in cache? multiple TCP End +Advertising Window connections + Range Yes No No Yes Yes Method based on Get URL information Support Range? Support Pipelining? Pipelining + Range

Figure 3: The HTTPOS operations. jects. It is worth noting that if an attacker cannotobserve the and then announces an advertising window larger than r to TCP header, HTTPOS can also use retransmitted packets to get the remaining r bytes. Therefore, if the normal oper- mislead those attacks. Table 3 summarizes all methods in ation needs one RTT to download the R bytes, HTTPOS HTTPOS and the corresponding attacks that can be evaded. may use an additional RTT to download it (i.e., one RTT for R r bytes and another RTT for r bytes). If the server’s − 4.2 Modules TCP stack increases its congestion window based on the number of valid ACK packets, HTTPOS could send cus- 4.2.1 Method based on TCP advertising window tomized ACK packets to induce the server to increase its congestion window quickly. Since a TCP sender cannot send more data than the ad- vertising window permits, HTTPOS controls the sizes of 4.2.3 Method based on multiple TCP connections + incoming response packets by manipulating the advertis- HTTP Range ing window in each outgoing packet. More precisely, given a web object of S bytes, HTTPOS selects Nv integers If a server supports HTTP Range but does not support

v1,v2,...,vNv , where vi is the advertising window in the HTTP Pipelining, HTTPOS establishes multiple TCP con- { } Nv nections and sends partial GET requests for a web object ith outgoing TCP packet and ∑i=1 vi = S. HTTPOS sends a new TCP packet only after receiving a TCP data packet in parallel to the server. As explained in Section 3.3.1, from the server to prevent the server from combiningthe ad- HTTP Range can limit the size of response packets. The vertising windows in several TCP packets and then sending server will process these requests simultaneously if the a large packet that may be recognized by a traffic analysis. server adopts multi-threading and then return the web ob- ject through several packets. HTTPOS will re-organize the 4.2.2 Method based on TCP MSS + TCP advertising responses before delivering the content to the client. window 4.2.4 Method based on HTTP Pipelining + HTTP Since the method based on TCP advertising window allows Range only one TCP packet to be sent in an RTT, it may intro- duce large delay. To address this problem, we propose a If a server supports both HTTP Pipelining and HTTP new methodthat employs both TCP MSS and TCP advertis- Range, HTTPOS puts several partial GET requests into one ing window. This method is motivated by two observations. packet and sends it out. The server will process these re- First, given a web object of S bytes and a default MSS of M quests one by one and send back the responses. Without bytes, a successful traffic-analysis attack usually relies on the need to wait for the arrival of a response before sending the last packet whose size is equal to S mod M. Second, a another partial GET request, the additional delay introduced TCP sender can send several M-byte TCP data packets in an by HTTPOS will decrease. RTT. HTTPOS therefore can either change the default MSS or manipulate the last packet’s size by using TCP advertis- 4.3 Implementations ing window. In the former case, after setting the MSS to M′, the last packet’s size becomes S mod M′. In the latter case, We implemented HTTPOS in C with 3022 lines S HTTPOS sets the advertising window to M M in order to of code (reported by CLOC [31]) and tested it on fetch the first S R bytes, where R = S mod⌊ ⌋M. After that, Ubuntu 9.04 with 2.6.27 kernel. To manipulate TCP it sets the advertising− window to R r bytes, where r is a packets, HTTPOS uses iptables (version 1.4.0) and random positive integer less than R, to− download R r bytes the libnetfilter queue library (version 0.0.16) to − Table 3: Methods in HTTPOS and the corresponding attacks that can be evaded. Methods Layers SSWRPQ BLJL LL-JC LL-NBC CWWZ

Method based on Advertising Window TCP √√ √ √ Method based on MSS + Advertising Window TCP √√ √ √ Method based on Multiple TCP connections + HTTP Range HTTP √ √√ √ √ Method based on HTTP Pipelining + HTTP Range HTTP √ √√ √ √ Inject Useless Request HTTP √ √ √ Inject Packet Delay TCP √ hook outgoing TCP packets of interest. HTTPOS adds ducted two sets of measurement to evaluate whether oper- rules into iptables’ INPUT and OUTPUT chains, so ating systems and web servers support manipulating packet that the packets matching the rules will be queued in size through TCP MSS, TCP advertising window, HTTP the kernel. HTTPOS acquires a packet through the Pipelining, and HTTP Range. In the first set, we tested pop- libnetfilter queue library and then modifies it (e.g., ular operating systems and web servers with their default the advertising window and the MSS option) before re- settings in our test-bed. In particular, we tested Apache leasing it. Additional delay is introduced to the outgoing v2.3.6, nignx v0.8.42 and lighttpd v1.4.26 in a Ubuntu ma- packets if needed. HTTPOS uses raw socket to inject TCP chine (kernel 2.6.28) and IIS v7.5 in a Windows 7 box. packets if necessary and employs the libpcap 1.0.0 li- We selected these web servers, because they represent more brary to capture TCP packets for verification. Moreover, the than 90% market share [30]. Since the Google web server POSIX Threads (pthreads) library was utilized to create cannot be downloaded, we cannot test it in the test-bed. and manage multiple threads for multiple HTTP/TCP con- In the second set, we targeted on the top 2000 web sites nections between clients and HTTPOS, and those between in the Alexa rankings [1]. We modified Pagestats [16] HTTPOS and web servers. to drive Firefox 3.6.3 to automatically visit these web sites. In our measurement experiments to be discussed in Sec- Since Firefox downloads all the necessary web objects, tion 5.1, we established an IPSec tunnel as an example of which may be located in different web servers, we man- IP tunnel, built an SSH tunnel as an example of TCP tunnel, aged to collect 143,333 URLs in 8,845 web servers after and set up a wireless channel encrypted by WPA. HTTPOS Firefox visited the front pages of the 2000 web sites. We uses different modules to handle HTTP requests and re- used NetCraft’s service [5] to identify the operating system sponses for different scenarios. When the IPSec tunnel and and the web server software used by each server. Since the wireless channel are used, HTTPOS acts as an HTTP NetCraft resolved the web server software used in only proxy. When the SSH tunnel is employed, HTTPOS be- 5884 web servers, we employed httprecon-7.3 [32] to haves as a SOCKS proxy for users to visit the Internet. At further infer the web server software in the remaining 2961 the same time, HTTPOS communicates with the SSH tun- servers. There are still 1181 servers whose web server soft- nel through SOCKS 4 [25], because the SSH port forward- ware cannot be identified by httprecon-7.3, and we ing provides service via SOCKS. For the HTTPS channel, refer them to as “others.” Moreover, since NetCraft iden- HTTPOS is implemented as a Firefox add-on to manipulate tified 4957 web servers’ operating systems, we group the HTTP requests before they are sent to the SSL/TLS layer. other 3888 servers as “others.” For the Google web server In all scenarios, HTTPOS can modify the header of HTTP which has different names [4], we crawled 4622 URLs requests and insert useless requests if necessary. starting from http://www.google.com.hk/intl/ We also implemented those traffic-analysis attacks intro- zh-TW/options/ and extracted the names of the web duced in Section 3.1 using Python and Weka 3.6.1 [36] to server software from the Server field in the response evaluate the effectiveness of HTTPOS. In Section 5, we re- header. As a result, we obtained a total of 231 Google port their accuracy with and without HTTPOS. servers.

4.4 Measuring the support rates of the TCP and 4.4.1 TCP MSS and TCP advertising window HTTP based control To test whether a server allows HTTPOS to manipulate the packet size through TCP MSS, we modify the advertised HTTPOS exploits the basic protocol features in TCP MSS values in the TCP option. Let MSSL be the MSS value and HTTP described in RFC 793 and RFC 2616, respec- announced by us in the TCP SYN packet and MSSR be the tively. Since not all servers comply with the RFCs, we con- MSS value returned in the server’s TCP SYN/ACK packet. Table 4: Major operating systems’ support rate of TCP MSS negotiation and TCP advertising window based control.

OSes ADVL = 2000 bytes MSSL = 1460 bytes (the default) ADVL = MSSL bytes (No. of servers) MSSL=128 MSSL=256 MSSL=536 ADVL=128 ADVL=256 ADVL=536 MSSL=128 MSSL=256 MSSL=536

Windows (388) 88.40% 89.43% 100.00% 95.36% 95.36% 97.42% 99.22% 99.48% 100.00% Linux (3875) 97.90% 98.63% 100.00% 99.17% 99.32% 99.50% 99.76% 99.94% 100.00% AIX (19) 84.21% 100.00% 100.00% 94.73% 94.73% 94.73% 100.00% 100.00% 100.00% Solaris (71) 98.59% 100.00% 100.00% 97.18% 97.18% 98.59% 100.00% 100.00% 100.00% FreeBSD (224) 25.89% 99.55% 99.55% 99.10% 99.10% 99.10% 99.10% 100.00% 100.00% BIG-IP (380) 98.68% 99.21% 100.00% 99.47% 99.47% 99.47% 99.73% 100.00% 100.00% Others (3888) 84.90% 96.38% 99.89% 96.94% 97.38% 97.94% 99.61% 99.76% 99.94%

We let MSSL be less than the typical value for MSSR (which for which the support rate for FreeBSD is 99.55%. Finally, is 1460 bytes in most cases). Moreover, MSSL should never even though HTTPOS cannot force the response packets to appear in the flow between the client and web server. If in- be 128 bytes or less, the actual payload size already differs deed MSSR > MSSL, we send an HTTP request to download from the one when default MSSL is used. As a result, the a web object larger than MSSR. If the payload sizes of all new payload size also helps a user evade the traffic-analysis response packets are less than or equal to MSSL, then the attacks. server permits HTTPOS to control its packet size. 1 To test whether a server allows HTTPOS to control the size of response packet through TCP advertising window, 0.8 we first disable TCP window scale option and then change the advertising window, denoted as ADVL, of an outgoing 0.6 TCP packet to a value smaller than MSS. Similar to the pre- CDF vious case, if the response packet’s payload size is less than 0.4 or equal to ADVL, then the server allows HTTPOS to control Header size its packet size. 0.2 Size=128 bytes Size=256 bytes Since TCP MSS and TCP advertising window can be Size=536 bytes 0 set to arbitrary values, we could not enumerate all possible 0 500 1000 1500 2000 2500 combinations. Instead, we investigated three scenarios: (1) Size of HTTP response header (bytes) ADVL = 2000 bytes and MSSL = 128,256,536 bytes; (2) use the default MSS announced by{ the remote server} (i.e., Figure 4: CDF of the size of HTTP response headers based on our data sets of 143,333 URL responses obtained from 8845 MSSL = MSSR = 1460 bytes) and ADVL = 128,256,536 { } web servers. bytes; and (3) MSSL = ADVL = 128,256,536 bytes. Ta- ble 4 summarizes the measurement{ results. Under} most settings, more than 85% servers allow HTTPOS to control their packet size. In particular, the support rate increases 4.4.2 HTTP Range and HTTP Pipelining when either the MSS or the advertising window increases. Moreover, when the MSS and advertising window use the same value, most servers support HTTPOS. For example, Table 5: The support rates of HTTP Range and HTTP Pipelin- for MSSL = ADVL = 536 bytes, 8843 out of 8845 servers allow HTTPOS to control their packet size. ing in terms of the number of servers. Web servers HTTP HTTP HTTP For MSSL = 128 bytes and ADVL = 2000 bytes, only (No. of servers) Range Pipelining Range+Pipelining 25.89% of the FreeBSD servers allow HTTPOS to control their packet size. The reason is that FreeBSD sets its de- Apache (4249) 84.00% 63.90% 58.80% IIS (1738) 76.06% 77.00% 65.88% fault minimal MSS to 216 bytes to prevent TCP MSS re- nginx (1103) 80.15% 75.16% 70.35% source exhaustion attacks [2]. However, this low support lighttpd (367) 84.47% 74.70% 68.94% rate does not mean that HTTPOS cannot evade those traffic- Others (1388) 73.34% 65.13% 55.55% analysis attacks for FreeBSD servers. First, HTTPOS can still control the packet size by setting MSSL = ADVL = 128 bytes, which has 99.10% support rate. Second, Figure 4 We discover that some web applications may ignore shows that the HTTP headers in more than 90% of the re- HTTP Range requests even if the underlying web server sponses from our data sets are larger than 256 bytes. There- supports HTTP Range. Therefore, we measure the support fore, it is sufficient for HTTPOS to use MSSL = 256 bytes rate of HTTP Range on both the URL level and the web jor web servers with default settings support both HTTP Table 6: The support rates of HTTP Range and HTTP Pipelin- Range and HTTP Pipelining. Besides, Tables 5 and 6 ing in terms of the number of URLs. show the measured support rates of the HTTP features Web servers HTTP HTTP HTTP from 143,333 URLs located in 8845 servers. In particu- (No. of URLs) Range Pipelining Range+Pipelining lar, we find that 117,025 URLs (81.6%) from 7103 servers Apache (59698) 89.02% 79.71% 68.80% (80.3%) support HTTP Range, 114,087 URLs (79.6%) IIS (22485) 85.03% 88.24% 73.38% nginx (18714) 83.16% 87.58% 70.74% from 6060 servers (68.5%) support HTTP Pipelining, and lighttpd (5506) 82.64% 84.87% 67.51% 94,458 URLs (65.9%) from 5545 servers (62.7%) support Others (36930) 66.74% 69.31% 53.98% both HTTP Range and HTTP Pipelining. Tables 7 and 8 show the Google web servers’ support rates of HTTP Range and HTTP Pipelining in terms of the number of servers and URLs, respectively. We find that Table 7: The Google web servers’ support rates of HTTP all the Google web servers support HTTP Pipelining, but Range and HTTP Pipelining in terms of the number of servers. only “sffe,” “DFE/largefile,” and a partial of “GSE” support Google web servers HTTP HTTP HTTP HTTP Range. (No. of servers) Range Pipelining Range+Pipelining sffe (38) 100% 100% 100% DFE/largefile (109) 100% 100% 100% 5 Evaluation GSE (24) 58.33% 100% 58.33% codesite (2) 0% 100% 0% In this section, we present the results of evaluating Others (58) 0% 100% 0% HTTPOS in terms of its effectiveness on defeating the traffic-analysis attacks and its impact on the goodput of fetching web objects. Table 8: The Google web servers’ support rates of HTTP Range and HTTP Pipelining in terms of the number of URLs. 5.1 Experiment settings

The Google web servers HTTP HTTP HTTP We first downloaded the front pages from the top 100 (No. of URLs) Range Pipelining Range+Pipelining web sites ranked by Alexa [1]. For web sites that belong sffe (2580) 99.88% 100% 99.88% to the same company and have similar web page layouts, DFE/largefile (906) 100% 100% 100% we tested only the site having the highest rank. For ex- GSE (461) 48.59% 100% 48.59% codesite (335) 0% 100% 0% ample, Google owns several sites having high ranks (such Others (340) 0% 100% 0% as google.com, google.com.hk, and google.de), and we just tested google.com. Moreover, we replaced porn sites with other top web sites. We used Firefox 3.6.3 equipped with Flash plugin 10 to visit the web sites. To au- server level. To test whether a URL supports HTTP Range, tomate the experiments, we prepared a Python script to in- we send partial GET requests to the server and then inspect voke modified Pagestats [16] to visit each web site and the response of Accept-Ranges. If the server replies used TCPDump to capture the trace. We refer to the process with “Accept-Ranges: bytes,” it supports HTTP of visiting all the web sites once as a round, and we per- Range; otherwise, it may send back “Accept-Ranges: formed a total of 100 rounds of measurement experiment. none” or nothing. On the web server level, we regard a The traces in odd-numbered rounds were used to train clas- server as supporting HTTP Range if one URL on that server sification algorithms, and the trained models were tested on supports HTTP Range. traces in even-numbered rounds. Based on Pagestats’s We also discover that if a web server supports HTTP results, we computed the goodput as the ratio of total bytes Pipelining, all web applications running on that web fetched to the download time. server also support HTTP Pipelining. To test whether We examined two deployment scenarios for HTTPOS: a server supports HTTP Pipelining, we send out sev- on the browser side when IP tunnel, encrypted wireless eral HTTP requests together, each of which carries channel, or HTTPS channel is used and at a TCP tunnel “Connection: keep-alive,” without waiting for the entry. To establish an IP tunnel, we employed L2TP v1.2.0 corresponding responses. If the server responds to all these and OpenSwan v2.6.24 to build an IPSec tunnel between requests, it is considered supporting HTTP Pipelining. Oth- two endpoints. To set up the wireless channel, we used a erwise, the server may just respond to the first request and laptop with Intel PRO/Wireless 2200BG Mini-PCI Adapter then close the connection. to connect to an Access Point with WPA1 encryption en- According to our first set of experiments, all those ma- abled in our laboratory and employed AirPcap [12] to cap- ture wireless frames. For TCP tunnels, we used SSH port for the web site is then given by the percentage of successful forwarding to create a TCP-based tunnel between two end- attacks obtained from the 50 rounds of measurement. points following the configuration in [26]. Figure 5 shows the CDF of the attack accuracy for the It is important to point out that our experiment settings four attacks with and without applying HTTPOS to the are actually favorable to an attacker for the following rea- traffic flowing through an IPSec tunnel. Similarly, Figure sons. 6 reports their accuracy for SSH tunnel. The two solid curves show the attack accuracy without HTTPOS, whereas 1. Koukis et al. [24] showed that parsing mixed web ses- the two dashed curves are the results when HTTPOS is sions in packet traces obtained from an encrypted tun- used. The figures show that without using HTTPOS the nel is very difficult. However, in our case we saved all attacks can identify the visited web sites with high fidelity. the packets belonging to a web session into a separate For K = 1, the LL-JC attack on IPSec traffic achieves at pcap file, thus removing this obstacle for the attacker. least 70% accuracy for guessing the 100 web sites, where 2. Coulls et al. [14] pointed out that a web browser’s around 70% of the web sites are correctly identified from caching may significantly affect the accuracy of an at- all 50 rounds (i.e., 100% accuracy). The LL-NBC attack tack, because the browser does not need to download achieves similar performance, where at least 80% accuracy web objects in the cache, thus affecting the flow size. is achieved for each site, and more than 60% of the web This problem, however, does not occur to our case, be- sites are identified with 100% accuracy. When targeting cause Pagestats [16] always clears Firefox’s cache on SSH traffic, the LL-JC (LL-NBC) attack achieves 100% after visiting a web site/page. accuracy for more than 75% (90%) of the web sites with more than 60% (90%) accuracy for each site. We also ob- 3. Liberatore et al. [26] reported that a large delay be- serve that the LL-JC and LL-NBC attacks have better per- tween the training data set and the test data set may formance than the SSWRPQ and BLJL attacks, and all the cause lower accuracy. In particular, they observed a four attacks achieve a better accuracy for K = 5. decrease from 73% to 63% for a delay of four weeks. With HTTPOS, the accuracy of the four attacks drop sig- The delay in our case, however, is small (i.e., around nificantly. Figures 5 and 6 show that, for K = 1, none of the 30 mins), and the 100 rounds of experiments were car- attacks can achieve 100% accuracy for any web site. For ried out continuously. IPSec traffic, the accuracy of these attacks drops to 0% for at least 98% of the web sites, because they fail to make a 4. By using the traces from every other round of mea- single correct decision for the majority of those web sites. surement, we provide the attacker with a much more For K = 5, the SSWRPQ, BLJL, LL-JC, and LL-NBC at- accurate view of the traffic. In a realistic attack sce- tacks still suffer from 0% accuracy for 100%, 98%, 96%, nario, an attacker normally spends some time to learn and 94% of the web sites, respectively. The “better” per- from the captured traffic. Therefore, the traffic pattern formance achieved by the LL-NBC attack is possibly due may not be the same as those she has observed before to the KDE which considers some packet sizes that never when the attack is finally launched. appear but are close to the sizes in the training data set. We also observe similar (poor) performance for SSH traffic, for 5.2 Evasion evaluation which these attacks achieve 0% accuracy for at least 98% (95%) of the web sites for K = 1(K = 5). 5.2.1 Defeating the SSWRPQ, BLJL, LL-JC and LL- NBC attacks 5.2.2 Evading the CWWZ attack To evaluate the effectiveness of HTTPOS against the four attacks targeting on identifying web sites (i.e., the SS- We use only the Google search engine to evaluate HTTPOS WRPQ, BLJL, LL-JC, and LL-NBC attacks), our approach against the CWWZ attack for two main reasons. The first is is to compare the attack accuracy with and without applying that the search engine example in Section IV.D of [13] pro- HTTPOS to the encrypted traffic. To compute the attack ac- vides all the key features of the CWWZ attack. The second curacy for a given web site, we first compute the similarity is that the actual URLs for the web sites (i.e., OnlineHealth, between the trace obtained for the web site and the available OnlineTax and OnlineBank) reported in Chen et al.’s paper profiles based on the attack methods introduced in Section [13] have not been revealed. Since the Googlesearch engine 3.1. For each attack, we then sort the similarity and select provides both HTTP and HTTPS services, we conducted the top K web sites as our inference. The attack is consid- experiments for both services to evaluate HTTPOS’s effec- ered successful if the actual web site is one of the K web tiveness. Since the Google search engine runs on the GWS sites selected by the attack. Clearly, the likelihood of mak- web server and does not support HTTP Range, HTTPOS ing a correct decision increases with K. The attack accuracy uses the TCP-based methods and injects useless requests to 1 1 K=1 SSWRPQ+HTTPOS BLJL+HTTPOS (K=5) SSWRPQ+ BLJL+HTTPOS (K=1) 0.8 HTTPOS (K=5) K=5 SSWRPQ+HTTPOS 0.8 K=1 SSWRPQ 0.6 0.6 K=5 SSWRPQ SSWRPQ+ K=1 BLJL+HTTPOS

CDF CDF BLJL (K=1) 0.4 HTTPOS (K=1) 0.4 SSWRPQ (K=5) K=5 BLJL+HTTPOS 0.2 SSWRPQ (K=1) 0.2 K=1 BLJL BLJL (K=5) K=5 BLJL 0 0 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 Accuracy Accuracy (a) The SSWRPQ attack. (b) The BLJL attack.

1 1 LL−JC+HTTPOS (K=1) 0.8 0.8 LL−NBC+HTTPOS (K=1) LL−NBC+HTTPOS (K=5) LL−JC+HTTPOS (K=5) 0.6 0.6 K=1 LL−JC+HTTPOS K=1 LL−NBC+HTTPOS CDF 0.4 CDF 0.4 K=5 LL−JC+HTTPOS K=5 LL−NBC+HTTPOS LL−NBC (K=1) LL−JC (K=5) 0.2 K=1 LL−JC 0.2 K=1 LL−NBC K=5 LL−JC LL−JC (K=1) K=5 LL−NBC LL−NBC (K=5) 0 0 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 Accuracy Accuracy (c) The LL-JC attack. (d) The LL-NBC attack.

Figure 5: Attack accuracy for the SSWRPQ, BLJL, LL-JC, and LL-NBC attacks with and without applying HTTPOS to the traffic in an IPSec tunnel.

1 1 BLJL+HTTPOS (K=5) 0.8 SSWRPQ+HTTPOS (K=1) SSWRPQ+HTTPOS (K=5) 0.8 BLJL+HTTPOS (K=1) K=1 SSWRPQ+HTTPOS 0.6 K=5 SSWRPQ+HTTPOS 0.6 BLJL (K=1) K=1 BLJL+HTTPOS CDF 0.4 K=1 SSWRPQ CDF 0.4 K=5 SSWRPQ SSWRPQ (K=5) K=5 BLJL+HTTPOS 0.2 SSWRPQ (K=1) 0.2 K=1 BLJL BLJL (K=5) K=5 BLJL 0 0 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 Accuracy Accuracy (a) The SSWRPQ attack. (b) The BLJL attack.

1 1 LL−JC+HTTPOS (K=5) 0.8 0.8 LL−NBC+HTTPOS (K=1) LL−NBC+HTTPOS (K=5) LL−JC+HTTPOS (K=1) 0.6 0.6 K=1 LL−JC+HTTPOS K=1 LL−NBC+HTTPOS CDF 0.4 CDF 0.4 K=5 LL−JC+HTTPOS K=5 LL−NBC+HTTPOS 0.2 K=1 LL−JC LL−JC (K=5) 0.2 K=1 LL−NBC LL−NBC (K=1) K=5 LL−JC LL−JC (K=1) K=5 LL−NBC LL−NBC (K=5) 0 0 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 Accuracy Accuracy (c) The LL-JC attack. (d) The LL-NBC attack.

Figure 6: Attack accuracy for the SSWRPQ, BLJL, LL-JC, and LL-NBC attacks with and without applying HTTPOS to the traffic in a SSH tunnel. evade the CWWZ attack. with various inputs, we use an example to illustrate how We first consider the scenario of communicating with the HTTPOS can evade the CWWZ attack. In this example, Google search engine through an encrypted wireless chan- the values in a flow vector are the payload sizes of wireless nel. The experiment setting is the same as the one in Sec- frames. We entered the word “hi” in the Google search in- tion IV.D of [13]. Before showing the experiment results put box and the sequence of directional packet sizes [13] is (556 , 451 ,557 ,463 ), where the request pack- tor becomes (1065 , 1323 , 1065 , 1304 ). Conse- ets carrying→ “h”← and “hi”→ are← of 556 bytes and 557 bytes, quently, the attacker→ can neither← find a→ similar← vector from respectively, and the corresponding response packets are her training data set nor recover the original vector, thus di- of 451 bytes and 463 bytes. If HTTPOS pads the HTTP minishing the CWWZ attack’s reduction power to one. request headers with useless fields to the same size, sets Figures 7 and 8 plot the CDFs of the CWWZ attack’s re- MSSL = 200 bytes, and disables the TCP window-scaling duction power with and without HTTPOS. A larger reduc- bit, then we get a new flow vector (572 , 260 , 260 , tion power indicates that the CWWZ attack has a stronger 75 , 572 , 260 , 260 , 87 ).→ Since the← CWWZ← capability to infer the visited web pages. Note that the min- attack← cannot→ not infer← the user← input← from this sequence, its imal reduction power is one, meaning that the CWWZ at- reduction power is dampened to one. tack cannot infer any useful informationfrom each observed However, a smart attacker might group several pack- flow vector. In this experiment, we follow the steps in [13] ets together and rebuild a correct flow vector. For exam- and randomly choose 1000 popular search key words from ple, by subtracting 72 bytes—the payload size of a wire- Google Trend [3]. Since these key words contain only char- less frame carrying a TCP ACK packet—from the size of acters a,b,...,z,0,1,...,9,dot,space , the size of the am- { } the first response packet in the original flow vector, an at- biguity set is k = 38. tacker may infer the size of TCP payload as 451 72 = 379 Figures 7(a)-7(d) plot the results when the CWWZ at- bytes. Similarly, by subtracting 72 bytes from− the sizes tack is applied to the HTTP traffic between HTTPOS and of the first three response packets in the new flow vec- the Google search engine through an encrypted wireless tor and then summing the remainder, the attacker obtains channel. We only consider the first four characters (i.e., 260 72 + 260 72 + 75 72 = 451 72 = 379 bytes, n 4), because they are sufficient for showing HTTPOS’s ≤ therefore− recovering− the original− flow vector.− To defeat this effectiveness. That is, only the flow vectors induced by attack, we inject a useless request between two requests the first n 4 characters are examined. Obviously, with- ≤ as described in Section 3.3.2 and therefore obtain another out HTTPOS, the CWWZ attack achieves large reduction flow vector (572 , 260 , 572 , 260 , 75 , 260 , power as n increases. The attacker can then determine the 260 , 82 , 572→ ,← 260 ,→ 260 ←, 572 ←, 87 ←, words sent to the search engine. However, with HTTPOS, 260 ←, 260 ←, 103 →). Since← HTTPOS← can inject→ various← the reduction power is fixed to one and does not change requests← from← the ambiguity← set, the attacker cannot restore along with n. Therefore, the attacker cannot gain any infor- the original flow vector from this new sequence. mation from the observed flow. Yet an even more advanced attacker may still be able to Figures 8(a)-8(d) plot the results for the HTTPS traffic. infer a set of keywords sent by the user from some special Without HTTPOS, the reduction power increases with n. packet sizes (e.g., 75 bytes and 87 bytes in the above exam- When n reaches 4, more than 50% of the key words have ple). We propose the following method to address this chal- the reduction power around 106, which can reduce the huge lenge. Since the Google server supports HTTP Pipelining, original ambiguity set (whose size is 384) to a small set HTTPOS sends out the request for a user input and a use- (whose size is 384/106 2). However, even for n = 4, ≃ less request in one or more successive packets with a zero HTTPOS can force the reduction power for all the 1000 key advertising window. The server will process both requests words to one. In other words, the attacker must guess the and store the responses in its TCP/IP stack, because the zero key word based on the ambiguity set with 384 possibilities. advertising window forbids it to send back the responses immediately. After a short period (e.g., 30 ms according 5.3 Performance evaluation to our evaluation with the Google server), HTTPOS sends an ACK packet with a large advertising window (e.g., 2000 5.3.1 Evaluation of individual methods bytes), and the server is induced to pack the responses into To evaluate the effect of each HTTPOS method on the per- blocks of MSS-byte packets. For the above example, the formance, we randomly selected 1000 URLs that support flow vector becomes (1072 , 837 , 1072 , 870 ), all methods. We set ADV = 200 bytes for the method which is completely different→ from the← original→ one. ← L based on TCP advertising window, let MSS = 200 bytes for We also evaluated HTTPOS using the Google HTTPS- L the method based on TCP MSS, and used three TCP con- based search service (i.e., ://www.google. nections for the multiple connections method. Moreover, com). In this setting, the values in a flow vector are the sizes we splitted each web object into three parts for the HTTP of the TCP packet payload. When we entered “hi” in the Range method. Note that the following results do not rep- search input box, we observed a flow vector (539 20 , resent the best performance that HTTPOS can achieve. 679 , 540 20 , 658 )1. With HTTPOS, the flow± vec-→ ← ± → ← Figure 9 plots the ratio of the resultant goodput for 1The parameter “gs gbg” in the queries to the Google HTTPS-based each HTTPOS method to that without using HTTPOS. For search engine introduces 20 random bytes to each request. brevity, we use MSS to denote the method based on TCP ± 1 1

0.8 0.8

0.6 0.6

CDF 0.4 CDF 0.4

0.2 With HTTPOS 0.2 With HTTPOS Without HTTPOS Without HTTPOS

0 −2 0 2 4 6 8 0 −2 0 2 4 6 8 10 10 10 10 10 10 10 10 10 10 10 10 Reduction power Reduction power (a) Reduction power of the flow vector induced by the first character. (b) Reduction power of the flow vector induced by the first two characters.

1 1 With HTTPOS 0.8 0.8 Without HTTPOS

0.6 0.6

CDF 0.4 CDF 0.4

0.2 With HTTPOS 0.2 Without HTTPOS

0 −2 0 2 4 6 8 0 −2 0 2 4 6 8 10 10 10 10 10 10 10 10 10 10 10 10 Reduction power Reduction power (c) Reduction power of the flow vector induced by the first three characters. (d) Reduction power of the flow vector induced by the first four characters.

Figure 7: Reduction power of the CWWZ attack on the HTTP traffic between HTTPOS and the Google search engine through an encrypted wireless channel.

1 1

0.8 0.8

0.6 0.6

CDF 0.4 CDF 0.4

0.2 With HTTPOS 0.2 With HTTPOS Without HTTPOS Without HTTPOS

0 −2 0 2 4 6 8 0 −2 0 2 4 6 8 10 10 10 10 10 10 10 10 10 10 10 10 Reduction power Reduction power (a) Reduction power of the flow vector induced by the first character. (b) Reduction power of the flow vector induced by the first two characters.

1 1 With HTTPOS 0.8 0.8 Without HTTPOS

0.6 0.6

CDF 0.4 CDF 0.4

0.2 With HTTPOS 0.2 Without HTTPOS

0 −2 0 2 4 6 8 0 −2 0 2 4 6 8 10 10 10 10 10 10 10 10 10 10 10 10 Reduction power Reduction power (c) Reduction power of the flow vector induced by the first three characters. (d) Reduction power of the flow vector induced by the first four characters.

Figure 8: Reduction power of the CWWZ attack on the HTTPS traffic between HTTPOS and the Google search engine.

MSS + TCP advertising window, MultiCon the method AdvWin the method based on TCP advertising window. As based on Multiple TCP Connections + HTTP Range, shown, HTTPOS can achieve at least 80% goodput for 75% Pipelining the method based on HTTP Pipelining + HTTP of the URLs when the MSS methodor the MultiCon method Range, Range the method based on pure HTTP Range, and is applied, and 90% goodputfor 50% of the URLs when the 1 5.3.2 Impacts on the performance of Internet browsing 0.8 To evaluate the overall performance of HTTPOS, we vis- ited each of the top 100 web sites 10 times with and with- 0.6 out applying the HTTPOS operations depicted in Figure

CDF 3. We recorded the download time based on the output 0.4 MSS MultiCon of Pagestats for each site. Figure 11(a) and Figure 0.2 Pipelining 11(b) show the CDFs of the ratio for the download time Range without and with HTTPOS when using IPSec tunnel and AdvWin

0 −2 −1 0 1 SSH tunnel, respectively. The figures show clearly that the 10 10 10 10 Goodput with HTTPOS / Goodput without HTTPOS first-time visit to each site via HTTPOS needs more time than the normal visits (i.e., all values less than one), be- Figure 9: The ratio of the resultant goodput for each cause HTTPOS uses the method based on advertising win- HTTPOS method to that without HTTPOS. dow. When HTTPOS is applied, the time for visiting 60 1 MSS sites through IPSec is at most 1.6 times of the time without MultiCon HTTPOS. 0.8 Pipelining Range However, once the URL information is cached, the time 0.6 AdvWin required for the following visits is close to the time for the

CDF normal visits (i.e., value close to one). The time for visiting 0.4 60 sites through IPSec is at most 1.1 times of the time with- out HTTPOS. Furthermore, the time for visiting more than 0.2 90 sites through IPSec is at most 1.4 times of the time with-

out HTTPOS. The reason is that for each URL, HTTPOS 0 −2 0 2 4 10 10 10 10 Additional delay (ms) will select the method with the least impact on the perfor- mance while not compromising the protection capability ac- Figure 10: Additional delay introduced by each HTTPOS cording to Figure 3. It is also interesting to note that as method. a result of employing multiple TCP connections, HTTPOS may even enjoy better performance than the normal visits (i.e., values larger than one) in some cases. As shown in Figure 11(a), visiting around 40 out of the 100 sites through Pipelining method is applied. IPSec requires less time than the normal visits. The performance degradation introduced by the MSS method is not significant, because, as discussed in Section 5.3.3 Impacts on the performance of Google search 4.2.2, it only needs an additional RTT to finish the trans- mission. On the other hand, we notice that unlike an IP To evaluate the impacts of HTTPOS on the performance of tunnel, a TCP tunnel may multiplex multiple TCP connec- using Google search, we measured the RTT from the epoch tions into a single TCP tunnel which could become the per- when the user sends a query to the epoch when the user formance bottleneck. To tackle this problem, we establish receives the response with and without HTTPOS. Figure several TCP tunnels in advance and divert TCP connections 12(a) illustrates the RTTs obtained from the scenario where into different TCP tunnels to achieve parallel transmissions. a user in Hong Kong visited the Google search service (i.e., 74.125.47.147 Moreover, as expected, the AdvWin method gives the worst ) through a 802.11g wireless link. Fig- performance, because it allows only a single packet trans- ure 12(b) shows the RTTs obtained from the scenario where mission from the server in an RTT. Moreover,we use a very the user accessed the same search service through HTTPS. In both experiments, the user entered 1000 popular search small ADVL (i.e., 200 bytes) for this evaluation. keywords from Google Trend [3] for 30 times, and the re- Although some methods may cause certain URLs to ex- spective median RTTs for each keyword is shown in Figure perience a low goodput, we find that the additional delay 12(a) and Figure 12(b). introduced by these methods is not significant under our pa- In this experiment, HTTPOS employs the technique de- rameter settings. Figure 10 reveals the additional delay in- scribed in Section 5.2.2 to evade the CWWZ attack. More troduced by each HTTPOS method. As shown, the MSS, precisely, HTTPOS puts the real request and a useless re- MultiCon, and Pipelining methods introduce less than 100 quest in one packet, and sets ADVL = 0 before dispatching ms delay for more than 90% of the URLs. The Range the packet to the server. After a small delay, HTTPOS sends method, on the other hand, introduces less than 200 ms de- an ACK packet to announce a large advertising window and lay for more than 80% of the URLs. induce the server to send back responses. The delay is 120 1 1 First−time visit First−time visit Following visit Following visit 0.8 0.8

0.6 0.6 CDF CDF 0.4 0.4

0.2 0.2

0 −1 0 0 −0.9 −0.7 −0.5 −0.3 −0.1 10 10 10 10 10 10 10 Time spent without HTTPOS/Time spent with HTTPOS Time spent without HTTPOS/Time spent with HTTPOS (a) IPSec tunnel. (b) SSH tunnel.

Figure 11: The effect of HTTPOS on the performance of Internet browsing.

0.36 0.36 Normal HTTPS Traffic 0.34 0.34 HTTPOS on HTTPS Traffic Normal Wireless Traffic HTTPOS on Wireless Traffic 0.32 0.32

0.3 0.3

0.28 0.28

Round−trip Time (second) 0.26 Round−trip Time (second) 0.26

0 200 400 600 800 1000 0 200 400 600 800 1000 Index of keywords Index of keywords (a) HTTPOS on wireless traffic. (b) HTTPOS on HTTPS traffic.

Figure 12: The effect of HTTPOS on the performance of using Google search. ms for wireless traffic and 100 ms for HTTPS traffic, re- are usually 1 1 pixel web bugs belonging to online adver- spectively. Figure 12 shows that the additional delay intro- tisement companies∗ that customize their web servers, and duced by HTTPOS is small, because the server can send none of our methods works for them. Since these web bugs back the responses immediately upon receiving the ACK are usually used to track users, a user may just filter them packet. The additional delay is less than 80 ms in Figure to protect privacy. Moreover, since they may exist in many 12(a) and less than 60 ms in Figure 12(b). web pages, their sizes could increase an attack’s false posi- tive rate instead of facilitating the attack. 5.4 Discussion 5.4.2 The CWWZ attack We believe that HTTPOS significantly raises the bar and We also note that if an advanced CWWZ attacker can ob- makes future traffic-analysis attacks much harder to de- serve the payload of HTTPS packets, she may still be able sign. Moreover, as HTTPOS provides fundamental defense to infer the size of a web object even after changing the strategies and basic methods to modify flow features, new packet size. More precisely, an attacker can first identify evasion methods may be developed based on them. More- packets carrying SSL/TLS application data from the type over, we report below our additional findings on web bugs, field in the SSL/TLS header and then use the field of ap- another attack model for the CWWZ attacks and our solu- plication data length to assemble consecutive TCP packets. tions to defending against it, and our measurement results However, such attack does not work if the SSL/TLS packets for the support rate of HTTP Range by HTTPS servers. go through an IPSec tunnel or a wireless channel. HTTPOS tackles this attack through two approaches. 5.4.1 Web bugs The first approach is for the servers that support HTTP Range. We measured the support rate of HTTP Range by In the course of conducting the measurement experiments, HTTPS servers and found that more than 80% URLs in the we observed some cases where the size of packets carrying measured servers support HTTP Range. Further details are certain web objects cannot be adjusted. These web objects given in Section 5.4.3. In this case, HTTPOS first divides the web object into a random number of portions and makes all inputs. For example, if a user inputs “hi” in the search these portions to overlap a random part with one another. In box, the auto-suggestion function may send the first request this way, the web object sizes reported by the field of ap- packet with “h” and then the second packet with “hi.” To plication data length in the SSL/TLS header are incorrect. evade the CWWZ attack, HTTPOS may just transmit the re- The following is an example of downloading a web object quest carrying “hi” but drop the request carrying “h.” How- from Twitter through HTTPS. In the normalcase, the packet ever, this approach may affect web usability. size sequence is (476 , 1460 , 1460 , 1171 ), and the SSL/TLS record size→ sequence← is (471← , 4086← ). 5.4.3 Support rate of HTTP Range by HTTPS servers Although we can use TCP-based methods to→ split packets,← the attacker may still infer the response size from the se- We measured the support rate of HTTP Range by HTTPS quence of the SSL/TLS records’ sizes. After HTTPOS ap- servers from two data sets. The first one contains the plies HTTP Range, the sequence of packet sizes is modi- web sites of the 50 largest banks in America2. By using fied to (497 , 1460 , 245 , 500 , 1460 , 258 , Pagestats to visit these web sites’ front pages or login → ← ← → ← ← 500 , 1460 , 254 ), and the sequence of the SSL/TLS pages (if their front pages do not support HTTPS) through → ← ← records’ sizes becomes (492 , 1700 , 495 , 1713 , HTTPS, we collected 1585 valid URLs from 104 HTTPS → ← → ← 495 , 1709 ). As a result, the attacker is prevented from servers, and 1323 URLs support HTTP Range (i.e., the sup- → ← recovering the response packet size. port rate is 83.47%). The second data set is based on web The second approach is for the servers that do not sup- sites ranked by Alexa [1]. We first connected to the 443 port port HTTP Range (e.g., the Google HTTPS-based search (i.e., the default HTTPS service port) of top 1M web sites service). In this case, HTTPOS can still inject a number of and stopped when we obtained 3000 web sites, to which useless requests from the ambiguity set and set each request the SSL/TLS connections were successfully established. message to the same size. This strategy is motivated by the Since not all web sites that open 443 port provide web ser- observation that an attacker could not know the exact length vice through HTTPS, we found only 1245 sites that offer of the word typed by a user, and the inserted requests result HTTPS-based web service. By crawling their front pages, in many possible words. For example, if a user types “hi,” we gathered 45,401 URLs from 2448 HTTPS servers, and an attacker can observe the following packet size sequence 85.09% (i.e., 38,632) of the URLs support HTTP Range. (539 20 , 679 , 540 20 ,658 ) and the SSL/TLS ± → ← ± → ← record sequence (534 20 ,291 ,378 ,535 20 , 6 Related work 291 , 357 ), where± 291→ is the← length of← the HTTP± → re- sponse← header.← Once HTTPOS inserts useless requests se- quence “card” through HTTP pipelining, the packet size se- Most of the existing proposals on defeating against quence and SSL/TLS record sequence become (1418 , traffic-analysis attacks on encrypted HTTP traffic require 165 , 1418 , 578 , 1418 , 165 , 1418 , 530 →) modifications to web servers, browsers and/or web objects. and →(1578 ←, 291 ←, 378 →, 291 →, 355 ←, 291 ← , In contrast, our HTTPOS is a browser-side solution that 361 , 1578→ , 291← , 352← , 291← , 357← , 291 ←, does not need such modifications. Moreover, since none 336 ←). Based→ on the← packet← size sequence,← the← CWWZ← of the existing techniques changes all four basic flow fea- attack’s← reduction power is reduced to one, because the se- tures, they could be defeated by the traffic-analysis attacks quence has been totally changed. detailed in Section 3.1. On the other hand, our techniques can successfully evade all of these attacks. Although the reduction power of an advanced CWWZ Sun et al. [35] proposed twelve countermeasuresand dis- attack exploiting the SSL/TLS record sequence could not be cussed the related costs. Most of them require the support reduced to one, the attacker still could not know the user’s of the web server and/or some modifications to the web ob- input, because there are at least six possible words, includ- jects. One exception is to use HTTP Range to increase the ing “h,” “hi,” “c,” “ca,” “car,” and “card.” Note that any word size of web objects. However, since these methods camou- that can result in the same SSL/TLS record sequence as any flage just the number and the size of web objects, they may one of the six words is also a possible candidate. For exam- not evade the traffic-analysis attacks based on directional ple, if word “x” and “y” induce the same SSL/TLS record packet size and packet timing information [7,26]. More- size as word “h,” both “x” and “y” will be considered as over, except for padding and pipelining, Sun et al. listed possible inputs by the attack. Therefore, when more useless the properties of each method but without implementing requests are injected, it becomes harder for such attacks to and evaluating them. On the other hand, HTTPOS exploits recover the original sequence. protocol features in both TCP and HTTP to change the Since injecting useless requests may introduce much four basic HTTP flow features. Moreover, we implemented overhead, another approach for evading the CWWZ attack targeting auto-suggestion is to send only one request with 2http://nyjobsource.com/banks.html HTTPOS and carefully evaluated the HTTPOS methods on control. live HTTP traffic. Information leaks through HTTP covert channels and the Hintz [23] and Danezis [15] suggested a number of corresponding detection mechanisms have been examined approaches to evade traffic-analysis attacks, for instance, recently. Feamster et al. employed sequences of HTTP adding useless data to the flow and removing some web requests to transmit covert information [19]. Burnett et objects in a web page. However, such approaches may al. [11] embedded stealthy information into user-generated not evade new attacks [13,26], because they do not change content. We proposed WebShare [27] to leak information the distributions of packet size and timing among packets. through the value of prevalentweb counters. On the defense Modifying a browser’s setting to force all web objects to side, Borders and Prakash designed WebTap [8] to detect be transferred through one connection may evade the attack HTTP covert channels that convey information through the proposed by Coulls et al. [14] that is based on the number content and the timing of HTTP requests, and proposed a of TCP connections belonging to the same web session and framework to quantify such information leaks [9]. Schear the amount of bytes delivered by individual TCP connec- et al. devised Glavlit [33] to throttle content-based HTTP tions. However, the volume of packets from a web server covert channels. Zhang et al. invented Sidebuster [40] to au- between two requests may still be exploited to identify web tomatically detect and quantify possible side-channel leaks sites, because web browsers usually send HTTP requests in web applications. one after the other, and web servers usually process these HTTP requests in sequence [24]. 7 Conclusions Wright et al. proposed traffic morphing to evade the traffic-analysis attacks based on packet size and direction In this paper, we proposed a suite of new browser-side [38]. Their system aims at incurring less additional data techniques to prevent an attacker from inferring web sites to a flow while enabling a flow to evade traffic-analysis at- or web pages visited by a user. These techniques exploit tacks. Their system first profiles the packet size distribution the basic protocol features in TCP and HTTP to manipulate for each web site and prepares a transformation matrix that four fundamental characteristics in encrypted HTTP flows. maps a packet size in one flow to a packet size in another We implemented these techniques into a browser-side sys- flow. Before sending a packet, traffic morphing changes the tem called HTTPOS that does not need to modify any web packet size according to the transformation matrix by either entity. An extensive evaluation of HTTPOS on live HTTP splitting or padding the packet. By doing so, the distribu- traffic shows that it can evade the state-of-the-art attacks tion of packet size observed by an attacker will not be the with low overhead. Future works include further mitigat- same as the distribution of original packets. Due to padding, ing the impact of HTTPOS on the performance and sealing traffic morphing may also change the flow size and pack- other privacy leakages in web browsers. ets’ timing information. However, traffic morphing requires modifications to both web server and web browser, because Acknowledgments it needs to pad or split packets on the serverside and remove padded stuff on the client side. Moreover, transforming all packets at the server site in real time will affect the perfor- We thank the anonymous reviewers for their quality re- mance of the web service. Unfortunately, the authors did views and David Evans, in particular, for shepherding our not provide the evaluation results. In contrast, HTTPOS paper, and Paul Royal for his suggestions and help. This does not modify both web server and web browser. In work is partially supported by a grant (H-ZL17) from The fact, a server may not even know whether a client is using Hong Kong Polytechnic University. This material is based HTTPOS. Moreover, we adopt many approaches to mitigate upon work supported in part by the National Science Foun- possible performance degradation caused by HTTPOS. dation under grant no. 0831300, the Department of Home- land Security under contract no. FA8750-08-2-0141, the Although anonymity networks (e.g., Tor [17]) also pro- Office of Naval Research under grants no. N000140710907 vide anonymoussurfing, there are two major differencesbe- and no. N000140911042. Any opinions, findings, and tween HTTPOS and an anonymity network. First, HTTPOS conclusions or recommendations expressed in this material preventsan attacker from inferringthe web site a user is vis- are those of the authors and do not necessarily reflect the iting, whereas an anonymity network prevents a web server views of the National Science Foundation, the Department from knowing who is visiting it. Moreover, as pointed out of Homeland Security, or the Office of Naval Research. by Sun et al. [35], although multiple proxies are used in anonymity networks, the first link between a client and the first proxy is still vulnerable to those traffic-analysis attacks. References Second, anonymity networks are usually providedby a third party, but HTTPOS is a browser-side solution under a user’s [1] Alexa Internet, Inc. http://www.alexa.com. [2] Freebsd/i386 5.2.1-release release notes. http: [25] Y. Lee. SOCKS: A protocol for TCP proxy across //www..org/releases/5.2.1R/ firewalls. http://ftp.icm.edu.pl/packages/ relnotes-i386.html. socks/socks4/SOCKS4.protocol. [3] Google trends. http://www.google.com/trends. [26] M. Liberatore and B. Levine. Inferring the source of en- [4] Google’s server names. http:// crypted HTTP connections. In Proc. ACM CCS, 2006. googlesystem.blogspot.com/2007/09/ [27] X. Luo, E. Chan, and R. Chang. Crafting web counters into googles-server-names.html. covert channels. In Proc. IFIP SEC, 2007. [5] NetCraft Ltd. http://www.netcraft.com. [28] X. Luo, E. Chan, and R. Chang. CLACK: A network covert [6] D. Barrett, R. Silverman, and R. Byrnes. SSH, The Secure channel based on partial acknowledgment encoding. In Shell: The Definitive Guide. O’Reilly Media, 2005. Proc. IEEE ICC, 2009. [7] G. Bissias, M. Liberatore, D. Jensen, and B. Levine. Privacy [29] G. Macia, Y. Wang, R. Rodriguez, and A. Kuzmanovic. ISP- vulnerabilities in encrypted HTTP streams. In Proc. Privacy enabled behavioral ad targeting without deep packet inspec- Enhancing Technologies Workshop, 2005. tion. In Proc. IEEE INFOCOM, 2010. [8] K. Borders and A. Prakash. Web Tap: Detecting covert web [30] NetCraft Ltd. November 2010 web server survey. http: traffic. In Proc. ACM CCS, 2004. //news.netcraft.com/archives/2010/11/ [9] K. Borders and A. Prakash. Quantifying information leaks 05/november-2010-web-server-survey.html. in outbound web traffic. In Proc. IEEE Symp. Security and [31] Northrop Grumman Corp. Cloc. http://cloc. Privacy, 2009. sourceforge.net/. [10] P. Borgnat, G. Dewaele, K. Fukuda, P. Abry, and K. Cho. [32] M. Ruef. httprecon. http://www.computec.ch/ Seven years and one day: Sketching the evolution of Internet projekte/httprecon/. traffic. In Proc. IEEE INFOCOM, 2009. [33] N. Schear, C. Kintana, Q. Zhang, and A. Vahdat. Glavlit: [11] S. Burnett, N. Feamster, and S. Vempala. Chipping away at Preventing exfiltration at wire speed. In Proc. HotNets-V, censorship with user-generated content. In Proc. USENIX 2006. Security, 2010. [34] R. Sinha, C. Papadopoulos, and J. Heidemann. Internet [12] CACE Technologies, Inc. AirPcap. http://www. packet size distributions: Some observations. Technical Re- cacetech.com. port ISI-TR-2007-643, USC/Information Sciences Institute, [13] S. Chen, R. Wang, X. Wang, and K. Zhang. Side-channel 2007. leaks in web applications: a reality today, a challenge to- [35] Q. Sun, D. Simon, Y. Wang, W. Russell, V. Padmanab- morrow. In Proc. IEEE Symp. Security and Privacy, 2010. han, and L. Qiu. Statistical identification of encrypted web [14] S. Coulls, C. Wright, F. Monrose, M. Collins, and M. Reiter. browsing traffic. In Proc. IEEE Symp. Security and Privacy, On web browsing privacy in anonymized NetFlows. In Proc. 2002. USENIX Security, 2007. [36] The University of Waikato. Weka. http://www.cs. [15] G. Danezis. Traffic analysis of the HTTP protocol over waikato.ac.nz/˜ml/weka/. TLS. http://research.microsoft.com/en-us/ [37] P. Wouters and K. Bantoft. Openswan: Building and Inte- um/people/gdane/papers/TLSanon.pdf, 2007. grating Virtual Private Networks. Packt Publishing, 2006. [16] S. Dedeo. Pagestats. http://web.cs.wpi.edu/ [38] C. Wright, S. Coull, and F. Monrose. Traffic morphing: An ˜cew/pagestats/. efficient defense against statistical traffic analysis. In Proc. [17] R. Dingledine, N. Mathewson, and P. Syverson. Tor: The ISOC NDSS, 2009. second-generation onion router. In Proc. USENIX Security, [39] T. Yen, X. Huang, F. Monrose, and M. Reiter. Browser fin- 2004. gerprinting from coarse traffic summaries: Techniques and [18] R. Duda, P. Hart, and D. Stork. Pattern Classification. implications. In Proc. DIMVA, 2009. Wiley-Interscience, 2nd edition, 2000. [40] K. Zhang, Z. Li, R. Wang, X. Wang, and S. Chen. Side- [19] N. Feamster, M. Balazinska, G. Harfst, H. Balakrishnan, and buster: Automated detection and quantification of side- D. Karger. Infranet: Circumventing censorship and surveil- channel leaks in web application development. In Proc. lance. In Proc. USENIX Security, 2002. ACM CCS, 2010. [20] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee. Hypertext transfer protocol – HTTP/1.1. RFC 2616, June 1999. [21] M. Gast. 802.11 Wireless Networks: The Definitive Guide. O’Reilly Media, 2005. [22] D. Herrmann, R. Wendolsky, and H. Federrath. Website fin- gerprinting: attacking popular privacy enhancing technolo- gies with the multinomial naive-Bayes classifier. In Proc. ACM Workshop on Cloud Computing Security, 2009. [23] A. Hintz. Fingerprinting websites using traffic analysis. In Proc. Privacy Enhancing Technologies Workshop, 2002. [24] D. Koukis, S. Antonatos, and K. Anagnostakis. On the pri- vacy risks of publishing anonymized IP network traces. In Proc. IFIP Communications and Multimedia Security, 2006.