PeerMix – A Peer-To-Peer Anonymous Network

Abstract

PeerMix is a scalable, peer-to-peer anonymizing routing service that offers strong guarantees of source anonymity against both global observers and collaborating mix nodes. PeerMix connections are bidirectional and application-independent, appropriate for Web browsing or file-sharing.

PeerMix functions as a proxy, so proxy-aware applications can add anonymity seamlessly. A subset of the PeerMix nodes operates as an ephemeral mix network; that is, the mix net is a randomly selected and constantly changing subset of the peers. A prototype PeerMix network has been tested on the Stanford University Sweet Hall SPARC machines. This thesis describes previous work on anonymous connections, discusses the PeerMix anonymous routing system, and analyzes its security.

1. Introduction

PeerMix is a scalable, peer-to-peer anonymizing routing service that offers strong guarantees of source anonymity against both global observers and collaborating mix nodes.

Anonymizing services are desirable in a variety of contexts, both good and evil: protecting consumers from profiling by corporations, hiding communication even the existence of which must be kept secret, allowing downloads of material illegal in a given jurisdiction but publishable elsewhere, such as DeCSS or Mein Kampf, and enabling citizens, including criminals, to communicate without the knowledge of their government.

Although users desire anonymity in the abstract, they often do not realize that their online communications are traceable or, if they do realize it, do not worry enough to spend money to gain anonymity. However, a peer-to-peer system allows users to gain anonymity at no expense to them beyond the use of otherwise unused bandwidth and computing power. Furthermore, a peer-to-peer system cannot be shut down by a government or lawsuit and thus is more resistant to attacks by one of its intended adversaries.

PeerMix uses a distributed mix network in which nodes route packets for other peers. A central server is used only to maintain availability information. Such an availability server might not be considered illegal even in jurisdictions where the provision of anonymous browsing services would be illegal. Nor could a peer-to-peer system be rendered useless by a law that requires the providers of an anonymizing network to keep records subject to subpoena. Moreover, PeerMix is scalable to large numbers of nodes, since the computational load on each node remains constant as the number of nodes increases.

2. Review of Previous Work

2.1 Anonymizing Proxies

The simplest approach to anonymizing Web browsing or other forms of communication is to interpose a proxy between the client and the server. The server will see only the address of the proxy and will thus be unable to link clients between visits. There are several such anonymizing proxies available, the best known of which is the Anonymizer (www.anonymizer.com) [1]. Such a proxy provides anonymity only against a very weak adversary: a curious Web server or a local observer at that Web server. For many users, though, such weak adversaries are the only ones worthy of concern.

The anonymizing proxy must be trusted because it knows the mapping between HTTP requests and clients. In fact, because the anonymizing proxy routes all of a user’s traffic, it can know more about a user’s viewing habits than any single Web server. Furthermore, even if the proxy is trustworthy, a local observer at either the proxy or the client will be able to track what sites the client has viewed. A more satisfactory solution requires both encryption and multiple proxies in order to frustrate traffic analysis by more powerful adversaries.

2.2 Mix Networks

Chaum [2] introduces the idea of a mix network as a technique to allow an electronic mail system to hide who communicates with whom. Let A denote an address, M a message, and E_k encryption with the key k. A single mix server takes E_pub(A, M), where pub is its public key, decrypts the pair (A, M), stores it in a buffer, then, after receiving a certain number of pairs, randomly permutes them and sends each message M to its associated address A. If the encryption algorithm has semantic security, observers cannot determine the correspondence between (A, M) and its encryption E_pub(A, M) and so cannot determine which message in the output corresponds to which message in the input or vice versa. Thus, if the batch size is n, it is infeasible to determine which of the n recipients was intended for the encrypted pair E_pub(A, M), nor can the sender be determined simply by knowing A and M.

Similarly, a mix can provide anonymous return addresses for messages, allowing bidirectional communication with sender anonymity. If x wishes to communicate with y, x can send E_pub(A, M) to the mix, as above, including in the message a return address: an E_pub(A, k) pair, where k is a secret key. In order for y to communicate with x, y must prepend the return address to his reply, giving (E_pub(A, k), M). The mix will then decrypt the pair, encrypt the reply with the secret key in the return address, batch and permute it as any other message, and then forward the message E_k(M) to x.

Since it is infeasible to determine the correspondence between a message and its encryption without knowing the secret key specified in the return address, it cannot be determined which of the n recipients of the message is actually x, just as for directly addressed messages.

The technique of sending through a single mix server extends naturally to sending messages through a cascade of mixes. This has the advantage of both reducing the trust placed in any single mix server and allowing the message to be mixed in with a larger number of other messages. If any mix along the route is trustworthy, the unlinkability of any batch should be guaranteed.

Neither inputs nor return addresses may be repeated, or else a simple intersection of the recipients of the two batches of n messages would reveal the corresponding recipient. Mix servers must therefore track inputs and addresses against replay.

2.3 Onion Routing

In a mix net, each message requires a public key decryption to process. As public-key operations are slow compared to secret-key operations, any system that requires real-time performance must either accept only a small number of messages mixed per server or reduce the number of public key operations performed. Onion routing [3] is a system developed to build application-independent, real-time anonymous connections by allowing the public-key operations to be amortized over the lifetime of a connection. Onion routing minimizes the number of public key operations by using asymmetric cryptography only for an initial connection-setup and key-distribution step and subsequently using symmetric cryptography.

In onion routing, the client first sets up an anonymous connection through a source-routed path of onion routers, each of which shares a secret key with the client. A connection setup message is called an onion, composed of layers encrypted with the public key of the intended recipient mix, each of which contains the next address and a secret key. The onion is sent to the first mix along the path, which decrypts the onion, strips off the header, stores the secret key, and sends the remainder of the onion to the address specified in the header, which repeats the procedure.

The onion is processed by the mix servers along the path from the client to the server, each of which saves the secret key and the addresses of the next and previous mixes. When further data is sent along the connection to the server, the data is encrypted with the shared secret key at each hop. Data can also be sent backwards by the same method of repeated encryption, and then decrypted at the client using the secret keys it shares with each onion router.

While onion routing uses connections to decrease the total public key operations required by the system, it eases traffic analysis by requiring that each message follow the same path. Onion routing connections are therefore padded at every hop and encouraged to be short-lived. A connection-oriented protocol is appropriate only for networks that can assume a high degree of robustness in the underlying components, due to the substantial overhead of connection setup and teardown.

Onion routing is secure against global observers and, if deployed at a firewall, against collaborating mixes. However, connections from free-standing machines to the onion routing network can be broken by collaborators with probability c^2/n^2, where c is the number of collaborators in the system and n is the total number of mixes [4].
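To give a sense of scale for the c^2/n^2 figure, the following minimal sketch evaluates it for a hypothetical deployment. The function name and the example numbers are illustrative assumptions, not values taken from [4].

```python
from fractions import Fraction


def compromise_probability(c: int, n: int) -> Fraction:
    """Return c^2/n^2, the probability quoted in the text.

    c is the number of collaborating mixes and n the total number of mixes.
    """
    if n <= 0 or not 0 <= c <= n:
        raise ValueError("require n > 0 and 0 <= c <= n")
    return Fraction(c, n) ** 2


# Hypothetical example: 10 collaborators among 100 mixes.
print(float(compromise_probability(10, 100)))
```

Because the probability falls with the square of the collaborating fraction, halving the share of collaborators cuts the exposure by a factor of four.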

Onion routing was briefly available to the public, but was shut down in January 2000 pending release of the second version. The authors of onion routing envisioned that Internet Service Providers and firewalls would offer onion routing services for the packets that flow through them, so that the onion routing infrastructure could be absorbed in the general infrastructure of the network. Unfortunately, no other providers of onion routing seem forthcoming, nor is anonymity a service demanded by a large number of users from their ISP. A similar service, the Freedom Network by zero-knowledge.com [5], offered anonymity guarantees similar to onion routing, but was discontinued in December 2001 when the company decided that there was too little demand for anonymity at $50 a year.

2.4 Crowds

Crowds [6] is a peer-to-peer system that was designed to offer anonymity for HTTP requests. In Crowds, peers (called jondos) are organized into static paths, each pair of which shares a secret key. Requests are forwarded from one jondo to another along a path, until the final jondo submits the request to the Web server. Paths are reformulated at periodic intervals when new members join the crowd. The work for each peer is constant with respect to the number of active peers.

Crowds uses a central server for both availability and key agreement among paths. Peers, however, carry out all data forwarding, so that the crowd performs the bulk of the encryption operations, not the central servers, increasing scalability.

Crowds is secure against a local eavesdropper and was intended to be secure against collaborating peers. Because every peer views the message in the clear, however, any jondo along the path can view message destinations and use frequency analysis to build a profile of the users it serves. When the paths are periodically reformulated, collaborating jondos can then use these profiles to identify any users that were on both paths and can thus identify the source of connections. Once identified, a user remains exposed until the next reformulation.

3. PeerMix Design and Implementation

3.1 Overview and definitions

PeerMix is a peer-to-peer network of nodes that use a randomly chosen subset of all peers as mixes for the entire network. This mix network implements a source-routed best-effort datagram service, which may be called by analogy anonymous IP. The two endpoints of a connection communicate over the mix network by anonymous TCP, an anonymous end-to-end byte stream connection that guarantees reliability and in-order delivery.

Each PeerMix node acts as a proxy at some user-specified port for local applications. A single PeerMix node can act as a proxy for any number of applications, though currently only browsing is defined. The proxy communicates with an application-dependent connection handler over an anonymous socket, a two-way, reliable byte stream that implements anonymous TCP.

This connection handler will then generally communicate in the clear with an application server outside the PeerMix system to handle the request. The substratum for this communication is the mix net service provided by the peers, which acts as an anonymous IP. The anonymous IP protocol uses fixed-size cells with layered headers called onions, as in onion routing, to specify the chain of nodes through which a cell will pass. A client can generate onions that encapsulate the data sent to the server as well as return addresses called reply onions to define a return route through the network for the server’s reply.

3.1.1 Application Proxy

The application proxy on the client has an open port which accepts TCP connections for anonymizing from the local machine, or, if set up at a firewall, from the internal network. This application proxy performs application-dependent operations on the request, which may involve the cooperation of the connection handler on the other side of the anonymous socket. It then forwards the data through the anonymous socket. For instance, a Web application proxy might implement caching in cooperation with a connection handler that pre-fetched pages.

3.1.2 Connection Handler

The connection handler lives on the server and is created when an anonymous TCP SYN packet is received at a mix node. It reads from the anonymous socket and performs application-dependent operations on the request. Generally, a connection handler connects to another server that is not aware of the anonymizing network, such as a Web server. Together, the application proxy and the connection handler provide the functionality an application would expect from a proxy, allowing proxy-aware applications to be seamlessly anonymized.

3.1.3 Anonymous socket

The anonymous socket attempts to layer an anonymous TCP protocol on top of the anonymous IP presented by the mix network. Anonymous TCP gives a byte stream interface to the mix network, ensures reliability and correct ordering of the bytes, and manages the flow of reply onions from the client to the server. The initiating end of the anonymous connection will be referred to as the client and the passive end as the server. The server endpoint should not be confused with the application server that it generally contacts to process the request. In the case of Web browsing, the server endpoint will make a request to a Web server that is outside the PeerMix network and then relay the response back to the client over the anonymous socket.

The anonymous TCP header, encapsulated within the layered anonymous IP headers, follows.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Header Length  |Protocol       |             Flags             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Connection Number                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Sequence Number                |     Acknowledgment Number     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Length                         |           Reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                        (Optional) Code                        |
+                                                               +
|                                                               |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Reply Onion Length             |       Reply Onion Count       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The header is followed by any reply onions, and then by the data. The Protocol field denotes the protocol: currently 1 denotes a request, 16 is defined for ECHO, 17 is defined for HTTP 1.0, and 18 is reserved for HTTP 1.1. Numbers under 16 are reserved for internal protocols necessary for the working of the mix net or availability. The following bits of the Flags field are currently defined: SYN, FIN, and ACK, which serve much the same purpose as in TCP, and MAC, which indicates that the optional MAC field is included. The Connection Number is an integer that uniquely represents a connection between the source and the destination and is private to both. The Sequence Number is the number of data packets that have been sent before this packet; the Acknowledgment Number is the number of in-order data packets that have been received before this packet. The Length is the length of the packet, including the header. The optional Message Authentication Code is an HMAC calculated over the payload and header to verify that they are uncorrupted. The Reply Onion Length and the Reply Onion Count specify how long each reply onion included in the message is and how many are included. While the size of a reply onion depends on the number of hops it takes, all reply onions for a given connection are constrained to be the same size to ease the burden on the server. These reply onions then follow in packed form for (Reply Onion Length) * (Reply Onion Count) bytes. The data follows immediately afterwards and is then padded with zeroes to a full 16-byte block.
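The layout described above can be illustrated with a minimal packing sketch. The field widths (8 bits each for Header Length and Protocol, 16 bits for Flags and for each half of the two-field rows), the flag bit values, the 16-byte MAC, and the choice to pad the whole packet to a multiple of 16 bytes are all assumptions made for illustration; the text does not fix them.

```python
import struct

# Assumed flag bit values; the text names the flags but not their encoding.
FLAG_SYN, FLAG_FIN, FLAG_ACK, FLAG_MAC = 0x1, 0x2, 0x4, 0x8
MAC_LEN = 16   # assumed size of the optional MAC field
BLOCK = 16     # packets are zero-padded to a full 16-byte block

# Assumed widths: header length, protocol (8 bits each), flags (16),
# connection number (32), then seq, ack, length, reserved (16 each).
_FIXED = struct.Struct("!BBHIHHHH")
_REPLY = struct.Struct("!HH")  # reply onion length, reply onion count


def pack_packet(protocol, flags, conn, seq, ack, data=b"",
                reply_onions=(), mac=None):
    """Serialize header, reply onions and data, then zero-pad."""
    onion_len = len(reply_onions[0]) if reply_onions else 0
    if any(len(o) != onion_len for o in reply_onions):
        raise ValueError("all reply onions must be the same size")
    if mac is not None and len(mac) != MAC_LEN:
        raise ValueError("unexpected MAC length")
    flags = (flags | FLAG_MAC) if mac is not None else (flags & ~FLAG_MAC)
    header_len = _FIXED.size + (MAC_LEN if mac is not None else 0) + _REPLY.size
    body = b"".join(reply_onions) + data
    length = header_len + len(body)  # Length includes the header
    header = _FIXED.pack(header_len, protocol, flags, conn, seq, ack, length, 0)
    header += (mac or b"") + _REPLY.pack(onion_len, len(reply_onions))
    packet = header + body
    return packet + b"\x00" * (-len(packet) % BLOCK)


def unpack_packet(packet):
    """Parse a packet produced by pack_packet, ignoring trailing padding."""
    (header_len, protocol, flags, conn, seq, ack,
     length, _reserved) = _FIXED.unpack_from(packet, 0)
    offset = _FIXED.size
    mac = None
    if flags & FLAG_MAC:
        mac = packet[offset:offset + MAC_LEN]
        offset += MAC_LEN
    onion_len, count = _REPLY.unpack_from(packet, offset)
    offset += _REPLY.size
    onions = [packet[offset + i * onion_len: offset + (i + 1) * onion_len]
              for i in range(count)]
    data = packet[header_len + onion_len * count:length]
    return {"protocol": protocol, "flags": flags, "conn": conn, "seq": seq,
            "ack": ack, "mac": mac, "reply_onions": onions, "data": data}
```

Under these assumptions the header occupies 20 bytes without the MAC and 36 bytes with it.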

After the anonymous IP payload is received from the lower layer, the payload is sent to the anonymous socket specified by the connection number. If the payload contains any reply onions, they are stripped out and placed in a queue. If the ACK flag is set, the acknowledgement is processed immediately. Then the packet is stored for in-order processing by sequence number.

The socket then processes all cells in order up to the highest contiguous sequence number, acting on any non-ACK flags that may be set and then sending the data in the packet to the application. The socket sends an ACK for the highest received sequence number, unless the packet received contained no data and had no flags set other than ACK. The ACK functions just as in TCP, except that whole packets are acknowledged rather than individual bytes.
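The in-order processing described here can be sketched as a small reorder buffer. This is a simplified illustration, assuming sequence numbers start at zero and that the acknowledgment number is the count of in-order packets delivered so far; the class and method names are hypothetical.

```python
class InOrderReceiver:
    """Buffers packets by sequence number and releases contiguous runs."""

    def __init__(self):
        self.next_seq = 0   # number of in-order packets delivered so far
        self.pending = {}   # out-of-order packets keyed by sequence number

    def receive(self, seq, data):
        """Store one packet; return (data delivered in order, ack number)."""
        if seq >= self.next_seq:
            # Keep the first copy; duplicates and old packets are ignored.
            self.pending.setdefault(seq, data)
        delivered = []
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        # Whole packets are acknowledged, not individual bytes.
        return delivered, self.next_seq


receiver = InOrderReceiver()
print(receiver.receive(1, b"second"))  # held back until packet 0 arrives
print(receiver.receive(0, b"first"))   # releases both packets in order
```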

Cells are resent if no acknowledgement is received after a timeout set by a modified version of the Karn/Partridge algorithm used in TCP. In TCP, the timeout is calculated individually for each destination, since the route depends on the destination. Using anonymous IP, however, the route followed by the cell is independent of the destination, so average timeouts are jointly calculated for the network as a whole when any anonymous socket at the local node receives a cell.
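A minimal sketch of such a node-wide timeout estimator follows. It applies the two ideas usually attributed to Karn and Partridge (ignore round-trip samples from retransmitted cells, and back off exponentially after a timeout) to a single estimate shared by every anonymous socket on the node. The smoothing factor, the timeout multiplier, the initial estimate, and the cap are assumed values, and the exact modification used by PeerMix is not specified in the text.

```python
class SharedTimeout:
    """One retransmission timer estimate shared by all local sockets."""

    def __init__(self, initial_rtt=1.0, alpha=0.875, beta=2.0, max_timeout=60.0):
        self.estimate = initial_rtt   # smoothed round-trip time in seconds
        self.alpha = alpha            # weight given to the old estimate
        self.beta = beta              # timeout = beta * estimate
        self.backoff = 1.0            # doubled after every timeout
        self.max_timeout = max_timeout

    def timeout(self):
        return min(self.beta * self.estimate * self.backoff, self.max_timeout)

    def on_ack(self, rtt_sample, retransmitted):
        if retransmitted:
            # The ACK may answer either copy, so the sample is ambiguous.
            return
        self.estimate = self.alpha * self.estimate + (1 - self.alpha) * rtt_sample
        self.backoff = 1.0

    def on_timeout(self):
        self.backoff *= 2.0
```

Sharing one estimator means a socket that has just been opened starts from the timing experience of every other connection on the node rather than from a default.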

Connection setup occurs when the client sends a SYN packet to the server, which responds with a SYN_ACK packet. This is not a handshake as in TCP, however, because the client is free to send data to the server before receiving the SYN_ACK, and normally the SYN packet will contain data as well. The handshake is omitted because the mix network is assumed to have high latency compared with the underlying network, and a full three-way handshake would waste one round-trip time while waiting for a SYN_ACK, requiring at best two round-trip times before the first byte of application data reaches the client. In the case of a mix network, a round-trip time can be on the order of a second for slow connections or mix servers and is at least several hundred milliseconds in the best case.

The client must receive a SYN_ACK before data from the server can be processed, so the connection setup allows for random initial sequence numbers as in TCP or for other shared state to be established.

Either the server or the client can initiate connection termination by sending a FIN packet. A FIN packet can only be sent, however, after all local data has been sent and acknowledged. The receiver of a FIN packet is required to close the incoming stream and then flush any outstanding data over the anonymous socket. After all outstanding data has been acknowledged, the receiver sends a FIN_ACK, which moves it into the TIME_WAIT state. In the TIME_WAIT state, a socket may still send a FIN_ACK if it receives a FIN and may be closed after a period of time. After the initiator receives a FIN_ACK, the anonymous socket is closed and the resources are released.

3.1.4.1 Ephemeral Mix Networks

Not all peers function as mixes; instead, a subset of the peers is randomly chosen at fixed periods of time to form an ephemeral mix network, as described in section 3.1.4.2. The mix network is ephemeral because, at the end of the epoch, new mixes are chosen. This mix network was designed to implement an unreliable service instead of a reliable one primarily because of the dynamic nature of the peers.

When a mix node receives a cell, it decrypts the first 128 bytes with its RSA private key, then decrypts the remainder using the symmetric encryption key derived from the first 32 bytes, which form the header, in a manner similar to that used for onions in onion routing [3]. If the message is destined for that node, the node passes it on to the anonymous socket identified by the connection number, or creates a new anonymous socket and connection handler.

Otherwise, it stores the message in a buffer with other pass-through traffic. The buffer is randomly permuted and then flushed when either there are n messages in the buffer addressed to distinct nodes, where n is generally between 5 and 10, or after tbuf_wait time has passed, where tbuf_wait is typically 200 to 500 ms. If n messages with distinct destinations have not been received after tbuf_wait time, the node generates up to n dummy messages and then sends all the messages to their destinations.
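The buffering rule can be sketched as follows. This is a simplified illustration rather than the prototype's code: the class name, the injected clock and random source, and the placeholder dummy cell are assumptions, and padding here tops the batch up to n messages in total instead of tracking distinct destinations among the dummies. Real dummy cells would have to be indistinguishable from ordinary cells.

```python
import random
import time


class MixBuffer:
    """Holds pass-through cells until n distinct destinations or a timeout."""

    def __init__(self, peers, n=5, buf_wait=0.3, rng=None, clock=time.monotonic):
        self.peers = list(peers)     # candidate destinations for dummies
        self.n = n                   # text suggests 5 to 10
        self.buf_wait = buf_wait     # text suggests 0.2 to 0.5 seconds
        self.rng = rng or random.SystemRandom()
        self.clock = clock
        self.cells = []
        self.deadline = None

    def add(self, destination, cell):
        """Queue a cell; return the flushed batch, or [] if still waiting."""
        if not self.cells:
            self.deadline = self.clock() + self.buf_wait
        self.cells.append((destination, cell))
        if len({dest for dest, _ in self.cells}) >= self.n:
            return self._flush()
        return []

    def poll(self):
        """Call periodically; pads with dummies and flushes after buf_wait."""
        if self.cells and self.clock() >= self.deadline:
            while len(self.cells) < self.n:
                self.cells.append((self.rng.choice(self.peers), b"DUMMY"))
            return self._flush()
        return []

    def _flush(self):
        batch, self.cells = self.cells, []
        self.deadline = None
        self.rng.shuffle(batch)      # random permutation before sending
        return batch
```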

Each peer is identified by its primitive id: the concatenation of its hostname and server port. This is to allow multiple servers to be located at the same IP address. Each server is also assumed to have a public key certificate, signed by a trusted CA. However, this certificate need not authenticate the peer as any particular person or give it any trust. The certificates are merely a means of preventing man-in-the-middle attacks by malicious nodes and providing some access controls on who may join and when, as described below. Thus, for this implementation, a CA can be created with a well-known certificate that simply hands out certificates on a first-come, first-served basis to any server that can maintain a TCP connection.

Each message has a 32-byte header:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|Version      |Hops Left      |Crypto Alg.    |Flags          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Expiration                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Destination Address                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Destination Port               |           Reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                       Key Seed Material                       +
|                                                               |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The first bit of the header must be zero to ensure that the first 1024 bits of the onion are less than the RSA modulus. The Version field is currently defined to be 1. The Hops Left field holds how many hops are left between the sender and the receiver, which may be used in generating dummy messages. The Crypto Alg field defines how the Key Seed Material is turned into keys and what ciphers are used to encrypt the data. Currently, 1 is defined to be triple DES, with keys and IV generated by repeatedly hashing the Key Seed Material. The only bit of Flags currently defined is the REPLY bit, which is set for the innermost header (the one seen by the final destination) if the header is a reply onion as defined below. The Expiration is the date beyond which packets are not considered valid and therefore need no longer be checked against replay. It is defined as a number of seconds since January 1, 1970 (standard UNIX time). The Destination Address holds the IP address of the destination in network byte order, and the Destination Port holds the port. The Reserved part may later be used to implement quality of service levels with regard to mixing.
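The role of the Expiration field in replay protection can be illustrated with a small cache sketch. The class name, the use of SHA-256 digests as cache keys, and the injected clock are assumptions made for illustration; the text only states that expired packets need no longer be checked against replay.

```python
import hashlib
import time


class ReplayCache:
    """Remembers recently seen onions until their Expiration passes."""

    def __init__(self, clock=time.time):
        self.clock = clock
        self.seen = {}  # digest of onion -> expiration (UNIX seconds)

    def accept(self, onion: bytes, expiration: int) -> bool:
        """Return True if the onion is fresh, False if expired or replayed."""
        now = self.clock()
        # Entries past their expiration can be forgotten: any replay of
        # them is rejected by the expiration check below.
        self.seen = {d: e for d, e in self.seen.items() if e > now}
        if expiration <= now:
            return False
        digest = hashlib.sha256(onion).digest()
        if digest in self.seen:
            return False
        self.seen[digest] = expiration
        return True
```

The expiration bounds the size of the cache, since a node only has to remember onions that are still valid.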

Borrowing the terminology from onion routing [3], an onion is a layered-encryption structure that wraps anonymous IP headers around a payload. A client makes an onion by choosing a destination and h-1 intermediate hosts at random from the group of all mix servers, where h is the number of hops the message will take. The onion is constructed iteratively, from the innermost layer outward. At each step, the first 128 bytes of the onion are encrypted using RSA, then the remainder is encrypted using the cryptographic algorithms and key seed material found in the header. Since the innermost layer is only 32 bytes long, it is padded up to 128 bytes with random data.
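The inside-out construction can be illustrated with a deliberately simplified sketch. It is a toy, not the PeerMix cell format: the RSA encryption of each header is omitted (headers are left in the clear), a SHA-256 counter keystream stands in for triple DES, cells are not padded to a fixed size, and the 6-byte address and 16-byte seed lengths are hypothetical. It shows only how each layer carries the next hop and a key seed, and how each hop peels exactly one layer.

```python
import hashlib
import os

ADDR_LEN = 6    # hypothetical fixed-size address
SEED_LEN = 16   # hypothetical key seed size
FINAL = b"\x00" * ADDR_LEN  # marks the layer read by the final destination


def keystream_xor(seed: bytes, data: bytes) -> bytes:
    """Toy stream cipher: XOR with SHA-256(seed || counter) blocks."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def build_onion(route, payload: bytes) -> bytes:
    """Wrap payload for route (first hop first), innermost layer first."""
    onion = payload
    next_addr = FINAL
    for addr in reversed(route):
        seed = os.urandom(SEED_LEN)
        # A real header would be RSA-encrypted to the node at addr.
        onion = next_addr + seed + keystream_xor(seed, onion)
        next_addr = addr
    return onion  # to be sent to route[0]


def peel(onion: bytes):
    """Remove one layer; return (next hop address, remaining onion)."""
    next_addr = onion[:ADDR_LEN]
    seed = onion[ADDR_LEN:ADDR_LEN + SEED_LEN]
    return next_addr, keystream_xor(seed, onion[ADDR_LEN + SEED_LEN:])
```

Each hop learns only the address of its successor, and the payload becomes readable only after the last layer is removed.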

Messages are sent in large, fixed-size units called cells. As processing each cell requires an RSA decryption at every hop, cells should be large in order to minimize the number of decryptions required for a given amount of data. The cell size is currently set to be 4096 bytes. Each cell contains several layers of anonymous IP headers, one for each hop, as well as the anonymous TCP header. For a cell that will take 5 hops, these headers require 276 bytes, leaving 3820 bytes for data. Larger cells would allow large amounts of data to be transmitted with fewer RSA decryptions and allow a higher proportion of payload to cell size, but would waste bandwidth, as many packets (such as ACKs) are very small. Cells are sent from one machine to another using TCP, to allow for cell sizes greater than the maximum transmission unit for any link. The connection is initiated and then closed, as it is assumed that any pair of peers will communicate only infrequently with each other.
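The trade-off in cell size can be made concrete with a short calculation using the figures above (4096-byte cells and 276 bytes of headers for a 5-hop route). The function names are illustrative, and the sketch ignores any space taken by reply onions.

```python
CELL_SIZE = 4096
HEADER_OVERHEAD_5_HOPS = 276  # figure given in the text for 5 hops


def cell_capacity(cell_size=CELL_SIZE, overhead=HEADER_OVERHEAD_5_HOPS):
    """Bytes of application data that fit in one cell."""
    return cell_size - overhead


def cells_needed(data_len, cell_size=CELL_SIZE, overhead=HEADER_OVERHEAD_5_HOPS):
    """Cells required to carry data_len bytes; even an empty ACK takes one."""
    capacity = cell_capacity(cell_size, overhead)
    return max(1, -(-data_len // capacity))  # ceiling division


def rsa_decryptions(data_len, hops=5):
    """One RSA decryption per cell per hop."""
    return cells_needed(data_len) * hops


print(cell_capacity())            # data bytes per 5-hop cell
print(cells_needed(100_000))      # cells for a 100 kB response
print(rsa_decryptions(100_000))   # RSA decryptions along the path
```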

The server needs some form of return address to reply to the client. In this case, the return addresses take the form of reply onions. Reply onions, specifying the return path through the mix network, are created and stored by the client and sent in the anonymous TCP header to the server. A reply onion simply uses the same layered encryption structure as an onion without a payload. When the server needs to send a response to the client, it appends the payload to the reply onion, which is then processed as though it were any other cell. The first 128 bytes are decrypted with RSA, as usual, then the rest are decrypted using the key contained within the header. Thus the payload is decrypted once for each hop, using key seeds known to the source, and is encrypted h times at the client node to recover the plaintext. However, because the cell sent out by the server endpoint should be unlinkable to the plaintext response it received from the application server (i.e. the Web server the client ultimately wished to contact), the data is also decrypted once before sending, using the key seed in the outermost header seen by the server, and then encrypted once more at the client, for a total of h+1 decryptions. If this step were not performed, the response received from the application server could be linked to the cell sent by the server endpoint, easing traffic analysis.

3.1.4.2 Message Volume Control

The mix network will only provide anonymity if the amount of traffic in the system is great enough to allow messages to be hidden. If the amount of traffic per mix server is too low, the message has to wait as long as tbuf_wait time before being sent on to the next mix server. If the rate of arrival is less than n/tbuf_wait, dummy padding messages must be generated in order to fool traffic analysis. In the worst case, where only a single message arrives in time, the mix may need to generate n-1 padding messages every tbuf_wait period. These dummy messages waste both bandwidth and processing power for nodes in the network. If every peer is a mix server, for many reasonable applications like Web browsing, insufficient traffic will be generated per server to avoid large quantities of dummy messages. In addition, high traffic at each node adds to the security by making it more difficult to link cells to connections at a given node.
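The effect of concentrating traffic on fewer mixes can be illustrated with a rough average-case calculation. The model is an assumption made for illustration: traffic is spread evenly over the mixes, each cell is counted once (the multiplication of traffic by the number of hops is ignored), and the example network size and rates are hypothetical.

```python
def per_mix_rate(total_rate, peers, mix_fraction):
    """Average cells per second arriving at each mix."""
    mixes = max(1, round(peers * mix_fraction))
    return total_rate / mixes


def dummies_per_period(arrival_rate, n, buf_wait):
    """Average dummy cells needed per buf_wait period to reach n cells."""
    return max(0.0, n - arrival_rate * buf_wait)


# Hypothetical network: 1000 peers generating 100 cells/s in total.
for fraction in (1.0, 0.1, 0.01):
    rate = per_mix_rate(100.0, 1000, fraction)
    print(fraction, rate, dummies_per_period(rate, n=5, buf_wait=0.5))
```

With n = 5 and tbuf_wait = 0.5 s the threshold n/tbuf_wait is 10 cells per second, which in this example is reached only when about one percent of the peers act as mixes.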

The appropriate fraction of peers that function as mix servers is chosen as a system parameter but depends on the load of the network. Thus, it is chosen to be slightly above the normal load for the network, and peers are encouraged to submit dummy requests if they fall below the normal load expected per peer. This padding is on the order of one request per minute, which is a rather trivial burden on each peer, yet allows the fraction of mixing peers to be set rather than agreed upon in some distributed way by the system.

In order to minimize this loss, it is desirable that a subset of peers be chosen to form the mix network, and that other peers direct their cells to this network to be mixed. The scheme should allow rotation between peers each epoch. In this implementation, each epoch is about sixty-five seconds. The scheme for selecting active mix nodes is as follows. Each peer computes id_t = h(id || t) (mod N), where h is a hash function, t is the time at which the epoch began, and N is a large number, often a power of two. All peers agree on a fraction of mix servers to use as a system parameter. That fraction is rounded to b/N. A peer will then send its messages to those servers with id_t < b.
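The selection rule can be sketched in a few lines. The choice of SHA-256 for h, the "|" separator, the decimal encoding of t, N = 2^32, and the "host:port" form of the primitive id are assumptions made for illustration; the text does not fix them.

```python
import hashlib

N = 2 ** 32          # assumed modulus, a power of two
EPOCH_SECONDS = 65   # epoch length given in the text


def epoch_start(now, epoch=EPOCH_SECONDS):
    """Time at which the current epoch began."""
    now = int(now)
    return now - now % epoch


def id_t(peer_id: str, t: int, n: int = N) -> int:
    """Compute h(id || t) mod N for one peer and one epoch."""
    digest = hashlib.sha256(peer_id.encode() + b"|" + str(t).encode()).digest()
    return int.from_bytes(digest, "big") % n


def active_mixes(peers, now, fraction, n=N):
    """Peers whose id_t falls below b, where b/N approximates fraction."""
    b = round(fraction * n)
    t = epoch_start(now)
    return [p for p in peers if id_t(p, t, n) < b]


peers = ["host%d.example.org:9000" % i for i in range(200)]
print(len(active_mixes(peers, now=1_000_000, fraction=0.1)))
```

Every peer can evaluate the rule locally from the member list and its clock, so no communication is needed to agree on the mix set for an epoch.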

Each peer, even those with id_t >= b, still processes cells and responds to requests, as some computers may see random variations in load that affect b. These servers will typically be forced to generate dummy requests, but will be able to offer the same security guarantees as others. In order to avoid this poorer performance, peers are encouraged to set their clocks from a network time source.

It is desirable that mix servers not be able to predict, when choosing their primitive id (their hostname and port), when they will be able to serve as mixes. If they could, a malicious user could set up a large number of servers which would all be mix servers at the same time, raising the fraction of collaborating nodes during that epoch. To prevent this, a random seed could be chosen, say, once a day, and id_t could be calculated as id_t = h(id || seed || t) (mod N). This seed could be chosen by a central, trusted server if one is available. Alternatively, peers could calculate the seed by an agreed-upon algorithm using public data such as sports scores or the closing values of stocks. Such a calculation would be truly distributed, with trust placed only in the public repositories of such data, like online newspapers or Web portals. In any case, the seed must be computationally infeasible to discover before the time it is set. In order that nodes never be able to predict at what times they would serve as mixes when requesting a certificate, it is sufficient that the daily seed be infeasible to predict and that new nodes not be allowed to mix until a new random seed has been chosen.
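One way the distributed seed calculation might look is sketched below. The hash function, the length-prefixed encoding of the public values, and the example inputs are assumptions; the text specifies only that the seed is derived by an agreed-upon algorithm from public data and then mixed into the hash.

```python
import hashlib


def daily_seed(public_values):
    """Derive a seed from an agreed, ordered list of public strings."""
    h = hashlib.sha256()
    for value in public_values:
        data = value.encode()
        # Length prefix keeps ("ab", "c") distinct from ("a", "bc").
        h.update(len(data).to_bytes(4, "big") + data)
    return h.digest()


def seeded_id_t(peer_id: str, seed: bytes, t: int, n: int = 2 ** 32) -> int:
    """Compute h(id || seed || t) mod N."""
    digest = hashlib.sha256(peer_id.encode() + seed + str(t).encode()).digest()
    return int.from_bytes(digest, "big") % n


# Hypothetical public inputs for one day.
seed = daily_seed(["index close 1234.56", "home 3 away 1"])
print(seeded_id_t("host1.example.org:9000", seed, 975))
```

All peers must use exactly the same inputs in the same order, so the agreed-upon algorithm has to pin down the data sources and their formatting.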

The message volume control generates high and smooth traffic between mix servers, but not between mix servers and non-mixing peers. In order to prevent certain attacks discussed in section 4, non-mixing peers can maintain padded connections to a mix server, and arrange for all their connections to include that mix server as a first hop. If the connections are padded to constant traffic, this allows for much stronger security guarantees for these nodes. However, for many users the added benefit of extra security is not necessary, and the padding requires a significant overhead in terms of bandwidth and performance for the mix node. As this is a connection, the two servers may perform a key exchange beforehand, and link encrypt all outgoing onions with a secret key, reducing the load on the mix server. In either case, this feature is not currently implemented.

3.2 Application protocols

The anonymous IP and anonymous TCP layers are application-independent. Two applications seem most appropriate for anonymizing – Web browsing and file sharing, of which only the Web browsing proxy has been implemented. Other applications, like anonymous e-mail, are also desirable, but can be handled more appropriately by existing systems such as Mixmaster [8] since they do not have real-time constraints.

For the Web application, the client sets the local application proxy as the proxy in his browser. The application proxy then forwards the proxy request to the connection handler on the other end of the protocol, which receives and forwards the request.

One optimization that can be made using the Web proxy and connection handlers is pre-fetching images and sound files before they are requested by the browser. When an HTML file is downloaded, the browser parses the file, determines which other files it must request from the server, and then sends further requests. In this case, the server’s connection handler can do the same parse, pre-fetch those files, and send them to the client’s application proxy before the proxy makes a request to the connection handler. This saves one round-trip time between the connection handler and the proxy, which can be quite significant, and allows the connection handler to fetch the images at the same time the HTML file is passing back through the connection. In practice, this should decrease the latency for a multiple-file download to be the same as the equivalent single-file download.

3.3 Certificate Authority

In general, the PeerMix design assumes a public key infrastructure, with the caveat that public keys are merely tied to hosts and ports, rather than to individuals or companies. No trustworthiness is assumed through possession of a certificate. However, for the requirements of PeerMix, very little proof of identity is required - merely the ability to send and receive data at a given IP address. This allows the CA for PeerMix to be entirely online and therefore far less expensive than a traditional CA.

3.4 Availability Protocols

In a peer-to-peer mixing system, it is important that each peer know the identity of all other peers in the network. If each peer only knew the identities of its neighbors, each packet sent would leak information about the sender. Consider the case where a user contacts a news site every day at a given time. Any observer that noticed this could compile a list of peers who submitted these requests, and intersect those peers’ neighbors to find with high probability the peer that generated the request. In general, if cells are sent to only a subset of the peers chosen by “distance” in some fashion, message frequency analysis would eventually allow a global observer to discover the source of the requests. It is thus important that all peers know the identity and availability status of all other peers.
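The neighbor-intersection argument can be illustrated with a toy example. The sketch below is purely hypothetical: the neighbor table, the peer names, and the function `intersect_candidates` are invented for illustration, and it assumes that the source of a request must be a neighbor of the peer observed submitting it.

```python
def intersect_candidates(observations):
    """Return the peers consistent with every observation.

    Each observation is the set of peers that could have originated one
    request, here the neighbors of the peer seen submitting it.
    """
    candidates = None
    for possible_sources in observations:
        if candidates is None:
            candidates = set(possible_sources)
        else:
            candidates &= set(possible_sources)
    return candidates if candidates is not None else set()


# Hypothetical neighbor table: the peers from which each submitter
# could have received the request.
neighbors = {
    "p1": {"alice", "bob", "carol"},
    "p2": {"alice", "bob", "dave"},
    "p3": {"alice", "carol", "dave"},
}

# The observer sees the daily request leave from p1, then p2, then p3.
observed_submitters = ["p1", "p2", "p3"]
suspects = intersect_candidates(neighbors[p] for p in observed_submitters)
```

After three observations only one peer remains in `suspects`, which is why PeerMix requires that every peer be able to route through every other peer.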

Currently, availability information is maintained through a central server that sends periodic updates to the peers containing a signed list of currently available members and their certificates. The certificate for the central server is well known and distributed with the software.

The load on the central server due to cryptographic operations is only a single signature operation each time the list is distributed. The central server must, however, respond to members’ requests to join and leave the group, as well as initiating the distribution of the list to each member.

A more decentralized method of distribution is possible in which the peers form a multicast tree [9], and the central server need only sign the list and distribute it to a small number of children, who would then distribute it hierarchically to the rest of the peers. Under such a system, a single availability server could support a crowd of thousands of peers. In fact, the only requirement for the availability server is not that it perform extensive computation, but merely that it be trusted by all participants. A malicious root of the multicast tree could alter the list of members to include only its malicious collaborators, exposing all users.
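To give a feel for how little work the root must do under hierarchical distribution, the following sketch counts the tree levels needed to reach a given number of peers. It assumes a uniform fan-out at every node, which the text does not specify; the function name `levels_needed` is likewise hypothetical.

```python
def levels_needed(num_peers, fanout):
    """Tree levels below the root needed to reach `num_peers` peers.

    The root (the availability server) sends the signed list to `fanout`
    children, and every peer forwards it to `fanout` further peers.
    A uniform fan-out is an assumption made for this illustration.
    """
    if num_peers < 0 or fanout < 1:
        raise ValueError("num_peers must be >= 0 and fanout >= 1")
    levels, reached, width = 0, 0, 1
    while reached < num_peers:
        width *= fanout      # peers at the next level down
        reached += width     # peers reached so far
        levels += 1
    return levels
```

With a fan-out of 8, for instance, 4,680 peers are reached in four levels, while the server itself signs once and sends only eight copies.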

4. Security Analysis

4.1 Goals

There are various properties concerning the identities of senders and receivers that we may wish to protect in an anonymous communication substrate. Pfitzmann and Waidner [10] define three goals: sender anonymity, receiver anonymity, and unlinkability. Sender anonymity requires that no attacker be able to connect a particular message to its sender, while receiver anonymity is similarly defined as ensuring no message can be linked to its receiver. Sender-receiver unlinkability merely means that one cannot determine if a given sender and a given receiver are communicating with each other. In this case, we consider these goals as applying to the endpoints of an anonymous connection, not to messages observed in between endpoints.

Following Reiter and Rubin [6], we can further consider a degree of anonymity corresponding to an attacker’s knowledge of the probability that a given node sent or received a given message. In particular, Reiter and Rubin consider a node “beyond suspicion” if it is no more likely to be the sender or receiver of a given message than any other node in the system. In general, one can merely minimize the attacker’s estimate of the probability that two nodes are communicating.
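The “beyond suspicion” criterion can be stated as a simple check over the attacker's probability estimates. The sketch below is an illustration under the assumption that the attacker's knowledge is summarized as one probability per node; the function name and the example values are hypothetical.

```python
def is_beyond_suspicion(probabilities, node, tolerance=1e-12):
    """True if `node` is no more likely than any other node to be the sender.

    `probabilities` maps each node to the attacker's estimate that it sent
    (or received) the message in question.
    """
    p = probabilities[node]
    return all(p <= q + tolerance
               for other, q in probabilities.items() if other != node)


# Hypothetical attacker estimates over four nodes.
uniform = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
skewed = {"a": 0.70, "b": 0.10, "c": 0.10, "d": 0.10}
```

Under the uniform estimate every node is beyond suspicion; under the skewed estimate node `a` is not, although the other three still are.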

We also distinguish between the two goals of identifying the endpoints of a connection given no previous information, and confirming that two given endpoints are, in fact, communicating. A system that allows confirmation of a guess also allows identification by the brute-force method of attempting to confirm all pairs, but protecting against confirmation is more expensive than protection against identification, and may not be required by all peers.

Finally, ensuring anonymity for a single message does not necessarily ensure anonymity for a series of messages if there is a way to link subsequent messages together. This can be done within a connection or across connections if the messages contain some clear-text identifying information, or merely by analyzing user profiles, if any exist. If no packets contain explicitly identifying information, defenses against attacks based on linking can be directed against frequency analysis based on user profiles, either by making profiles difficult to gather or by making the frequency analysis more difficult.

4.2 Adversary Model

There are many ways to attack an anonymous routing system, and each requires different capabilities on the part of the attacker. Following [4], we can consider the following adversaries:

Observer: can observe a connection without altering traffic on it.

Disrupter: can delay, remove, insert, or corrupt traffic on a link.

Collaborating node: can arbitrarily manipulate cells sent through it, as well as creating new cells.

General adversaries can be considered to be a set of collaborating adversaries at certain links or nodes. We notice that these adversaries can be subsumed under the heading of a set of collaborating nodes, as each adversary is less powerful than a collaborating node. In addition, it is often instructive to consider the global observer adversary, as a global observer is more feasible than a globally compromised network, against which the PeerMix system offers no protection.

Both adaptive and static adversaries can be considered, but, as PeerMix offers connectionless routing and the mix networks are ephemeral, lasting only about a minute, only those adaptive adversaries that could compromise computers in a matter of seconds would have any advantage over a static adversary. Since a reasonable time for compromising a server is at least on the order of minutes, only static adversaries are considered. Trust is evenly enough distributed over the entire architecture that an adaptive adversary has very little ability to choose which computers to compromise ahead of time.

4.3 Assumptions

In order to analyze the system, it is important to make certain assumptions about its components.

First, we assume that all servers share a common set of peers available for mixing. This neglects fluctuations based on clock skew or earlier or later response from the availability server. It also assumes that each peer agrees on the appropriate fraction of servers to be mixes.

Second, we assume that each peer that is not a mix has a connection to a random mix that is bandwidth-limited and padded to a constant amount of traffic. That random mix will therefore be serving both itself and many non-mix peers. This padded connection is optional and can be expensive in bandwidth for the peer, so we will note which attacks this assumption protects against and discuss the other factors that protect nodes that choose not to use the padded connection.

Third, we assume that the rate of ingoing and outgoing traffic for a mix server is equal. This assumption would be unreasonable for a single node, but is reasonable given that a mix server receives traffic for itself and a number of non-mix nodes and generates it for others, substantially smoothing the distribution of traffic.

Fourth, we assume that the rate of traffic flowing into a mix server from the network is constant. The message volume control mechanism is meant to reduce the number of mix servers until the total amount of traffic is large enough that every mix server is constantly fully loaded. We are thus disregarding fluctuations based on the random choice of peers when deciding upon routes.

More importantly, we ignore the added traffic at a given node based on any active connections in which it may be participating.

Fifth, we assume that each node on a path is chosen at random by the client, either when creating the onion or the reply onion. The client is assumed never to be compromised and to possess a source of randomness that it is computationally infeasible for the adversary to predict.


4.4 Analysis

As rigorous analyses for mix servers offering provable security are not available except for toy systems, I will discuss how the PeerMix system resists common attacks. A helpful catalog of such attacks is given in [11], which I will follow, considering first the attacks upon a single message or connection, and then attacks upon a sequence of linkable messages.

The brute-force attack: A global observer can conduct the brute-force attack of following every message as it moves through the system. An attacker can only discover that the intended recipient is no more likely than some large fraction of other nodes in the system to have received the message, and any peer can increase its degree of anonymity for any message arbitrarily close to beyond suspicion.

Consider a message tainted if it was mixed in with the original message, or if it was mixed in with a tainted message. A recipient of a tainted message may be suspected of being the recipient of the original message; in fact, each recipient of a tainted message appears equally likely to a global observer to be the intended receiver. Now, each message sent by a node is mixed in with n others, each of which is mixed in with n others, and so forth. Thus, at least n messages are tainted at the end of h hops and no more than $n^h$ are. If there are m total mixes, an approximation for the fraction of the peers that could be suspected of being recipients is on the order of

$$1 - \left(1 - \frac{1}{m}\right)^{n^h} \approx 1 - e^{-n^h/m}$$

if m is large, obtained by assuming all tainted messages go to distinct servers until the last round. The fraction of servers who are not suspected is $e^{-n^h/m}$, which can be made arbitrarily small as h is increased.

Intersection attacks based on brute-force methods: An adversary can repeat the brute-force attack multiple times, each time intersecting the suspected set with a new suspected set, and pruning the set of suspected servers by a factor of $1 - e^{-n^h/m}$, where n is the number of cells mixed per batch, h is the number of hops, and m is the number of mixes in the system. In general, the size of the suspected set after r rounds will be on the order of $m\left(1 - e^{-n^h/m}\right)^r$. Taking logarithms,

$$\ln\left[m\left(1 - e^{-n^h/m}\right)^r\right] = r\ln\left(1 - e^{-n^h/m}\right) + \ln(m),$$

which reaches zero, that is, a suspected set of size one, when $r = -\ln(m)/\ln\left(1 - e^{-n^h/m}\right)$. Since $\ln(1 - x) \approx -x$ for small x, we obtain $r \approx e^{n^h/m}\ln(m)$. This relation limits the amount of traffic that can be communicated anonymously over a single connection, or over multiple connections in a way that can be linked to a single receiver. At n = 6, h = 5, and m = 1000, this allows on the order of 16,000 packets, or about 24.5 MB after discounting ACKs and headers, before anonymity can be compromised by a brute-force intersection attack. The client can increase that amount substantially simply by choosing a larger hop count.
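The figures quoted for n = 6, h = 5, and m = 1000 can be checked numerically. The sketch below reads the suspected fraction as 1 - (1 - 1/m)^(n^h) and the round count as e^(n^h/m) ln(m); this reading is an assumption, chosen because it reproduces the stated figure of roughly 16,000 packets. The function names are hypothetical.

```python
import math


def suspected_fraction(n, h, m):
    """Fraction of mixes suspected after one brute-force pass (exact form)."""
    return 1.0 - (1.0 - 1.0 / m) ** (n ** h)


def suspected_fraction_approx(n, h, m):
    """Large-m approximation 1 - e^(-n^h / m) of the suspected fraction."""
    return 1.0 - math.exp(-(n ** h) / m)


def rounds_to_isolate(n, h, m):
    """Rounds r at which m * (1 - e^(-n^h/m))^r shrinks to one suspect."""
    return -math.log(m) / math.log(1.0 - math.exp(-(n ** h) / m))


def rounds_to_isolate_approx(n, h, m):
    """Approximation r ~ e^(n^h/m) * ln(m), using ln(1 - x) ~ -x."""
    return math.exp((n ** h) / m) * math.log(m)


n, h, m = 6, 5, 1000
rounds = rounds_to_isolate_approx(n, h, m)
```

Because the exponent is n^h/m, each additional hop multiplies it by n, which is why a modest increase in hop count raises the anonymous traffic budget so sharply.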

Cell correlation and timing attacks: If an adversary suspected that two nodes were communicating a single file or across a single anonymous connection, he might attempt to correlate cell totals sent by one end with cell totals received by the other or the timing signature of the packets sent with those received. This attack does not allow identification of communicating nodes, but it does allow confirmation that the two endpoints are communicating. This attack gains extra potency if the server endpoint of the connection is compromised and attempts to send an unexpectedly large number of packets or intentionally embed timing information in them.

However, for mixing peers or those peers with a padded connection to a random mix, we may assume that the message volume control mechanism sufficiently smooths traffic that ingoing cell totals cannot be detected.

This attack would allow a compromise of the system, however, if both the server node and the last mix server on the connection were compromised, as then these two mixes could correlate cell totals and timing information and detect the communication. The chance that both servers will be compromised during any connection is $(c/m)^2$, where c is the number of compromised nodes and m is the number of mixes in the PeerMix system. This probability of compromised anonymity is similar to that of onion routing in the remote configuration [4].
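As a quick illustration of how this probability scales, the sketch below evaluates the squared fraction of compromised mixes. The function name is hypothetical, and the example values of c and m are chosen for illustration only.

```python
def both_ends_compromised(c, m):
    """Probability (c/m)^2 that both relevant servers are compromised.

    c: number of compromised nodes; m: number of mixes in the system.
    """
    if m <= 0 or not 0 <= c <= m:
        raise ValueError("require m > 0 and 0 <= c <= m")
    return (c / m) ** 2


# Illustrative values: 10% and 1% of 1000 mixes compromised.
p_ten_percent = both_ends_compromised(100, 1000)
p_one_percent = both_ends_compromised(10, 1000)
```

Reducing the adversary's share of the mixes by a factor of ten reduces this probability by a factor of one hundred.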

However, the PeerMix system presents adversaries with several practical difficulties not presented by onion routing, difficulties which remain even for peers which have chosen not to use a padded connection with a single first mix server. First, because cells are large and fixed-size, the size of the file downloaded can only be inferred with a coarseness of the nearest 3K bytes (disregarding headers), making it difficult to correlate totals except for very large or very small files. Second, there are many other peers in the system, each of which generates both real and dummy requests at a fixed rate, so that many other peers may have similar cell totals. Third, the connectionless nature of the routing ensures that all messages take different paths unknown to the server end, adding a certain amount of noise to the timing information and thereby increasing the number of packets that a compromised server must send across a connection. Fourth, clients have control over the number of outstanding reply onions available to the server at any time, and can ensure that this number is not allowed to grow too large before the reply onions expire.
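The first point, the coarseness introduced by fixed-size cells, can be made concrete. The sketch below assumes that “3K bytes” means a payload of 3 × 1024 bytes per cell with headers disregarded; that constant and the function names are assumptions for illustration.

```python
# Assumed reading of "3K bytes" of payload per cell, headers disregarded.
CELL_PAYLOAD_BYTES = 3 * 1024


def cells_for(file_size_bytes):
    """Number of fixed-size cells needed to carry a file of the given size."""
    if file_size_bytes < 0:
        raise ValueError("file size must be non-negative")
    # Integer ceiling division avoids floating-point rounding.
    return (file_size_bytes + CELL_PAYLOAD_BYTES - 1) // CELL_PAYLOAD_BYTES


def indistinguishable_range(cell_count):
    """Inclusive range of file sizes that produce exactly `cell_count` cells."""
    if cell_count < 1:
        raise ValueError("cell_count must be at least 1")
    low = (cell_count - 1) * CELL_PAYLOAD_BYTES + 1
    high = cell_count * CELL_PAYLOAD_BYTES
    return low, high
```

Under this assumption a 10,000-byte file and a 12,000-byte file both occupy four cells, so an observer counting cells cannot tell them apart.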

Attacks that succeed based on the fraction of collaborators in the system are particularly problematic for a peer-to-peer system, as any malicious node can join the system and become a collaborator. In the long run, the best defense against such an attack for a peer-to-peer system is for it to be large enough that no organization can sustain enough collaborating nodes to gain more than a very small fraction of the total hosts on the system. In order to sustain a large set of collaborators, the adversary must be able to supply enough computing power to maintain these collaborators as mix servers, which would require substantial computational resources.

The node flushing attack: An adversary can send n-1 messages to a given node and then identify which messages sent from that node were his. The remaining message can then be linked between input and output. Carrying this attack out upon the PeerMix system, however, is impractical, as messages do not take recurrent routes. The advantage offered by any single mix node could be negated, but it would be difficult to totally compromise anonymity as every node along the path would need to be flooded in order to link with certainty the sender and receiver of a message.
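The flushing attack against a single batch mix can be sketched as a toy simulation. The model below is a deliberate simplification: messages are plain labels rather than encrypted onions, the mix simply shuffles a batch of n messages, and the adversary is assumed able to recognize its own n-1 messages in the output. All names are hypothetical.

```python
import random


def flush_batch(batch, rng):
    """A toy batch mix: emit the whole batch in a random order."""
    output = list(batch)
    rng.shuffle(output)
    return output


def flushing_attack(target_message, n, rng):
    """Fill a batch of size n with n-1 attacker messages and one target.

    Returns the outputs the attacker cannot attribute to itself.
    """
    attacker_messages = [f"attacker-{i}" for i in range(n - 1)]
    output = flush_batch([target_message] + attacker_messages, rng)
    known = set(attacker_messages)
    return [msg for msg in output if msg not in known]


unattributed = flushing_attack("victim", 6, random.Random(0))
```

Whatever the shuffle, exactly one output is left unattributed, which is the target. Against PeerMix the adversary would have to repeat this at every hop of a path it cannot predict.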

Message tagging: It is impossible for either the end server or any intermediate compromised node to tag a message. The encryption makes the onions appear to be essentially random data, no matter what the end server includes in the message. Messages cannot be given an unusual length, since all cells are fixed-size. Also, no intermediary can flip bits without the change being recognized by the endpoints in the MAC and the packet discarded.
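The way a MAC defeats bit-flipping can be sketched with Python's standard `hmac` module. The choice of HMAC-SHA256, the key, and the function names `seal` and `open_sealed` are assumptions for illustration; this section does not specify which MAC PeerMix uses.

```python
import hashlib
import hmac

# Tag length for the assumed HMAC-SHA256 construction (32 bytes).
TAG_LEN = hashlib.sha256().digest_size


def seal(key, payload):
    """Append a MAC tag to the payload."""
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return payload + tag


def open_sealed(key, data):
    """Return the payload if the tag verifies, otherwise None (discard)."""
    if len(data) < TAG_LEN:
        return None
    payload, tag = data[:-TAG_LEN], data[-TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest performs a constant-time comparison.
    return payload if hmac.compare_digest(tag, expected) else None


key = b"k" * 16                      # hypothetical shared endpoint key
sealed = seal(key, b"hello")
tampered = bytes([sealed[0] ^ 0x01]) + sealed[1:]   # flip one bit
```

Here `open_sealed(key, sealed)` returns the payload, while `open_sealed(key, tampered)` returns `None`, so a flipped bit causes the packet to be discarded rather than delivered with a recognizable mark.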

Intersection attack: If an adversary can link messages originating from a single source over a long period of time, the adversary can intersect the sets of available servers to eventually find the source of the communication, in a similar way that the brute-force results were intersected. This attack, while far less effective than the brute-force intersection attack, also requires fewer resources.

A user-level solution to both attacks involves not sending messages that include any sort of identifying information for the user. This seems to rule out, for instance, using PeerMix to maintain a pseudonymous personality on the Internet, as each use of the pseudonym could allow another round of an intersection attack. The intersection attack is a well-known open problem for which no good solutions are known other than requiring all users to be continually online.

5. Software Availability

The prototype version of PeerMix has been implemented and can be used for anonymous Web browsing. As a prototype, however, it does not include elements that would be necessary for full deployment of the system, such as a CA, a graphical user interface, an installation package, or several of the optional security features mentioned in the paper. More importantly, PeerMix does not yet have a large user base, which is necessary for anonymity. A single-user anonymity system cannot prevent requests from being linked to the user. A large amount of real traffic is required for any anonymous networking scheme to work well.

The prototype fares well in preliminary tests of latency and bandwidth, with a subsecond latency and throughput of 10KB/s for single files. Browsing the Web anonymously over a 100Mb/s connection is no worse than viewing pages over a highly loaded connection. The tests were conducted on SPARC machines located in Sweet Hall at Stanford University, which are in close network proximity. However, the machines perform RSA decryptions half as fast as a 400MHz Pentium, which is more characteristic of the types of machines that would typically be used for PeerMix.

There are several optimizations that can be made to decrease both the latency of the connection and the load on the mix servers. First, DES encryption, which is currently performed in Java and takes as long as an RSA decryption, can be moved to OpenSSL [7], which should nearly double the number of packets that can be processed per second. Furthermore, requiring the connection handler to pre-fetch the associated files, such as images and sounds, when a page is requested should significantly reduce latency over multi-file connections and improve performance. Finally, a system with many users should, ironically, have better performance than a system with few users, since there will be no need to generate dummy requests, which involve performing several layers of encryption operations.

Development of the prototype will continue over the next few weeks so that it can be deployed in the real world instead of in a test bed system.

6. Conclusions

This thesis describes anonymous connections and their realization in a peer-to-peer setting in PeerMix. Peer-to-peer anonymous connections can be realized with only moderate worsening of latency and throughput for a connection, and without requiring dedicated processing for the mix servers. PeerMix works using a source-routed, anonymous analogue of TCP/IP that runs on all the peers. PeerMix anonymous connections are resistant both to traffic analysis and to eavesdropping.

This thesis also analyzes the security implications of the PeerMix design, discusses the appropriate values for the security parameters, and compares the security and efficiency of PeerMix to other anonymous routing schemes. The prototype system is under development and has been tested on a small set of communicating peers. It will be released to the public in due course, at which point ordinary users will be able to browse the Web anonymously at no cost to them.

Acknowledgements

Many thanks to my advisor, Dan Boneh, for helping me both select a topic and then proceed with design and analysis. He spent many hours working with me to make sure the system was well designed. I’d also like to thank Chi Ming Wong, Darren Lee, and Sheba Najmi for reading drafts of the thesis and for their very helpful comments. Finally, I’d like to dedicate this thesis to my girlfriend, Sheba Najmi, for the immense amount of support and caring that she has given me over the months that I have been writing this thesis. Sustaining the strength to write this thesis would have been difficult if not impossible without her help and encouragement.

References

[1] The Anonymizer. http://www.anonymizer.com, May 9, 2002.

[2] D. Chaum. “Untraceable Electronic Mail, Return Addresses, and Digital Pseudonyms”, Communications of the ACM, vol. 24, no. 2, Feb. 1981, pp. 84-88.

[3] M. Reed, P. Syverson, and D. Goldschlag. “Anonymous Connections and Onion Routing”, IEEE Journal on Selected Areas in Communications, vol. 16 no. 4, May 1998, pp. 482-494.

[4] P. Syverson, G. Tsudik, M. Reed, and C. Landwehr. “Towards an analysis of onion routing security”, In Proc. Workshop on Design Issues in Anonymity and Unobservability (25-26 July 2000), ICSI RR-00-011, pp. 83-100.

[5] I. Goldberg and A. Shostack. “Freedom Network 1.0 Architecture and Protocols,” White Paper, http://www.freedom.net/info/freedompapers/index.html, May 9, 2002.

[6] M. Reiter and A. Rubin. “Crowds: Anonymity for Web Transactions”, ACM Transactions on Information and System Security, vol. 1, no. 1, November 1998, pp. 62-92.

[7] OpenSSL. http://www.openssl.org, May 9, 2002.

[8] L. Cottrell. Mixmaster, http://obscura.obscura.com/~loki/, May 9, 2002.

[9] H. Deshpande, M. Bawa, and H. Garcia-Molina. “Streaming Live Media over a Peer-to-Peer Network”, submitted for publication. http://dbpubs.stanford.edu:8090/pub/2001-31, May 9, 2002.

[10] A. Pfitzmann and M. Waidner. “Networks without user observability – design options”, Advances In Cryptology – Eurocrypt ’85 (1985), vol. 219 of Lecture Notes in Computer Science, Springer-Verlag.

[11] J. Raymond. “Traffic Analysis: Protocols, Attacks, Design Issues and Open Problems”, Berkeley International Computer Science Institute (ICSI) Technical Report TR-00-011, pp. 7-26, July 2000.
