Peer-to-Peer Systems and Distributed Hash Tables
COS 418: Advanced Computer Systems
Lecture 8
Mike Freedman
[Credit: Slides Adapted from Kyle Jamieson and Daniel Suo]

Today

1. Peer-to-Peer Systems
  – Napster, Gnutella, BitTorrent, challenges
2. Distributed Hash Tables
3. The Chord Lookup Service
4. Concluding thoughts on DHTs, P2P

What is a Peer-to-Peer (P2P) system?

[Figure: many nodes connected to each other through the Internet, with no central server]

• A distributed system architecture:
  – No centralized control
  – Nodes are roughly symmetric in function
• Large number of unreliable nodes

Why might P2P be a win?

• High capacity for services through parallelism:
  – Many disks
  – Many network connections
  – Many CPUs
• Absence of a centralized server may mean:
  – Less chance of service overload as load increases
  – Easier deployment
  – A single failure won't wreck the whole system
  – System as a whole is harder to attack

P2P adoption

Successful adoption in some niche areas

1. Client-to-client (legal, illegal) file sharing
2. Digital currency: no natural single owner (Bitcoin)
3. Voice/video telephony: user to user anyway
  – Issues: Privacy and control

Example: Classic BitTorrent

1. User clicks on download link
  – Gets torrent file with content hash, IP address of tracker
2. User's BitTorrent (BT) client talks to tracker
  – Tracker tells it list of peers who have the file
3. User's BT client downloads file from peers
4. User's BT client tells tracker it has a copy now, too
5. User's BT client serves the file to others for a while

Provides huge download bandwidth, without expensive server or network links
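The tracker exchange above can be pictured with a tiny sketch. This is not the real BitTorrent wire protocol (announces are bencoded HTTP requests); the Tracker class and announce() method below are hypothetical stand-ins that just show the information flow.

    # Minimal sketch of the classic tracker-based flow (hypothetical API,
    # not the real BitTorrent announce protocol).

    class Tracker:
        def __init__(self):
            self.swarms = {}                 # infohash -> set of peer addresses

        def announce(self, infohash, peer_addr):
            """A peer reports itself and learns about other peers (steps 2 and 4)."""
            peers = self.swarms.setdefault(infohash, set())
            others = list(peers - {peer_addr})
            peers.add(peer_addr)             # tracker now knows this peer has a copy
            return others                    # list of peers who have the file

    def download(tracker, infohash, my_addr):
        peers = tracker.announce(infohash, my_addr)   # step 2: ask the tracker
        # step 3: fetch pieces from each peer (omitted); step 5: keep seeding afterwards
        return peers

    tracker = Tracker()
    tracker.announce("abc123", "10.0.0.1:6881")           # an existing seeder
    print(download(tracker, "abc123", "10.0.0.2:6881"))   # ['10.0.0.1:6881']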

The lookup problem

[Figure: Publisher (N4) does put(key="Pacific Rim.mp4", value=[content]); elsewhere on the Internet a client does get("Pacific Rim.mp4"). With nodes N1-N6 scattered across the network, which node should answer?]

Centralized lookup (Napster)

[Figure: Publisher (N4) registers SetLoc("Pacific Rim.mp4", IP address of N4) with a central DB; the client sends Lookup("Pacific Rim.mp4") to the DB and then fetches the file from N4]

Simple, but O(N) state and a single point of failure
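A Napster-style central directory fits in a few lines, which also makes its weaknesses concrete: one table holds state for every file in the system, and nothing works if that one server dies. The DirectoryServer class and its method names are hypothetical, chosen to mirror the SetLoc/Lookup labels in the figure.

    # Hedged sketch of a centralized lookup directory (hypothetical names).

    class DirectoryServer:
        def __init__(self):
            self.locations = {}                  # filename -> set of IPs: O(N) state

        def set_loc(self, filename, ip):         # SetLoc("Pacific Rim.mp4", IP of N4)
            self.locations.setdefault(filename, set()).add(ip)

        def lookup(self, filename):              # Lookup("Pacific Rim.mp4")
            return self.locations.get(filename, set())

    db = DirectoryServer()
    db.set_loc("Pacific Rim.mp4", "ip-of-N4")    # publisher N4 registers its copy
    print(db.lookup("Pacific Rim.mp4"))          # client learns where to fetch from
    # If this single server is unreachable, no lookup succeeds.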

Flooded queries (original Gnutella)

[Figure: the client floods its Lookup("Pacific Rim.mp4") query to its neighbors, which re-flood it across nodes N1-N6 until it reaches Publisher (N4), which holds the key/value pair]

Robust, but O(N = number of peers) messages per lookup

Routed DHT queries (Chord)

[Figure: the client issues Lookup(H(data)); the query is routed node to node until it reaches Publisher (N4), which stores key=H(audio data), value=[content]]

Can we make it robust, with reasonable state and a reasonable number of hops?

Today

1. Peer-to-Peer Systems
2. Distributed Hash Tables
3. The Chord Lookup Service
4. Concluding thoughts on DHTs, P2P

What is a DHT (and why)?

• Local hash table:
  key = Hash(name)
  put(key, value)
  get(key) → value
• Service: Constant-time insertion and lookup

How can I do (roughly) this across millions of hosts on the Internet?
Distributed Hash Table (DHT)

What is a DHT (and why)?

• Distributed Hash Table:
  key = hash(data)
  lookup(key) → IP addr (Chord lookup service)
  send-RPC(IP address, put, key, data)
  send-RPC(IP address, get, key) → data
• Partitioning data in large-scale distributed systems
  – Tuples in a global database engine
  – Data blocks in a global file system
  – Files in a P2P file-sharing system

Cooperative storage with a DHT

[Figure: layered architecture. A distributed application calls put(key, data) and get(key) → data on the distributed hash table (DHash); DHash calls lookup(key) → node IP address on the lookup service (Chord); the lookup service runs across many nodes]

• App may be distributed over many nodes
• DHT distributes data storage over many nodes
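The layering in the figure can be written down directly: the DHash layer implements put/get by asking the lookup layer which node is responsible for a key and then sending that node an RPC. The lookup() and send_rpc() functions below are placeholder stubs for the Chord lookup service and the RPC transport, not real APIs.

    # Sketch of the DHash-over-Chord layering (placeholder lookup/RPC stubs).

    def lookup(key):
        """Chord lookup service: key -> IP address of the responsible node (stub)."""
        raise NotImplementedError

    def send_rpc(ip_addr, op, *args):
        """Send an RPC to a remote node and return its reply (stub)."""
        raise NotImplementedError

    # DHash layer: put/get expressed with lookup + RPC, as on the slide.
    def dht_put(key, data):
        ip_addr = lookup(key)                    # lookup(key) -> node IP address
        send_rpc(ip_addr, "put", key, data)      # send-RPC(IP address, put, key, data)

    def dht_get(key):
        ip_addr = lookup(key)
        return send_rpc(ip_addr, "get", key)     # send-RPC(IP address, get, key) -> data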

BitTorrent over DHT

• BitTorrent can use DHT instead of (or with) a tracker
• BitTorrent clients use DHT:
  – Key = file content hash ("infohash")
  – Value = IP address of peer willing to serve file
• Can store multiple values (i.e., IP addresses) for a key
• Client does:
  – get(infohash) to find other clients willing to serve
  – put(infohash, my-ipaddr) to identify itself as willing

Why the put/get DHT interface?

• API supports a wide range of applications
  – DHT imposes no structure/meaning on keys
• Key/value pairs are persistent and global
  – Can store keys in other DHT values
  – And thus build complex data structures
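With a DHT standing in for the tracker, the client side reduces to the two calls on the BitTorrent-over-DHT slide. The dht object below is a hypothetical put/get interface with a toy in-memory stand-in; real clients speak the Kademlia-based Mainline DHT protocol instead.

    # Sketch: trackerless peer discovery over a put/get DHT (toy stand-in).

    class ToyDHT:
        """In-memory stand-in for a real DHT; a key can hold multiple values."""
        def __init__(self):
            self.store = {}
        def put(self, key, value):
            self.store.setdefault(key, []).append(value)
        def get(self, key):
            return self.store.get(key, [])

    def advertise(dht, infohash, my_ipaddr):
        dht.put(infohash, my_ipaddr)             # put(infohash, my-ipaddr): "I can serve this"

    def find_peers(dht, infohash):
        return dht.get(infohash)                 # get(infohash): who else serves it?

    dht = ToyDHT()
    advertise(dht, "infohash-of-file", "10.0.0.7:6881")
    print(find_peers(dht, "infohash-of-file"))   # ['10.0.0.7:6881']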

Why might DHT design be hard?

• Decentralized: no central authority
• Scalable: low network traffic overhead
• Efficient: find items quickly (latency)
• Dynamic: nodes fail, new nodes join

Today

1. Peer-to-Peer Systems
2. Distributed Hash Tables
3. The Chord Lookup Service
  – Basic design
  – Integration with DHash DHT, performance

Chord lookup algorithm properties

• Interface: lookup(key) → IP address
• Efficient: O(log N) messages per lookup
  – N is the total number of servers
• Scalable: O(log N) state per node
• Robust: survives massive failures
• Simple to analyze

Chord identifiers

• Key identifier = SHA-1(key)
• Node identifier = SHA-1(IP address)
• SHA-1 distributes both uniformly
• How does Chord partition data?
  – i.e., map key IDs to node IDs
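Identifier assignment is just SHA-1 reduced onto the ring. A minimal sketch, assuming an m-bit identifier space (real Chord uses the full 160-bit SHA-1 output; the figures in this lecture use a 7-bit ring for readability):

    # Sketch: Chord identifiers = SHA-1, truncated to an m-bit ring.
    import hashlib

    M = 7                                        # identifier bits (160 in real Chord)

    def chord_id(text, m=M):
        digest = hashlib.sha1(text.encode()).digest()
        return int.from_bytes(digest, "big") % (2 ** m)

    print(chord_id("Pacific Rim.mp4"))           # key identifier  = SHA-1(key)
    print(chord_id("128.112.7.1:8001"))          # node identifier = SHA-1(IP address)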

Consistent hashing [Karger '97]

[Figure: a circular 7-bit ID space containing node IDs (e.g., Node 105) such as N10, N32, N90, N105, N120, and key IDs (e.g., Key 5) such as K5, K20, K80]

Key is stored at its successor: node with next-higher ID

Chord: Successor pointers

[Figure: the same ring, with each node holding a pointer to its successor, the next node clockwise (e.g., N10, N32, N60, N90, N105, N120)]
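The successor rule is a one-liner over a sorted list of node IDs: a key lives at the first node ID greater than or equal to the key ID, wrapping around the ring. A minimal sketch using the node IDs from the figure:

    # Sketch: consistent hashing, each key stored at its successor on the ring.
    import bisect

    RING_BITS = 7                                # the figure uses a 7-bit ID space

    def successor(node_ids, key_id):
        """Node responsible for key_id: next-higher ID, wrapping around."""
        ring = sorted(node_ids)
        i = bisect.bisect_left(ring, key_id % (2 ** RING_BITS))
        return ring[i % len(ring)]               # past the largest ID, wrap to the smallest

    nodes = [10, 32, 90, 105, 120]               # N10, N32, N90, N105, N120
    print(successor(nodes, 5))                   # K5  -> 10
    print(successor(nodes, 20))                  # K20 -> 32
    print(successor(nodes, 80))                  # K80 -> 90
    print(successor(nodes, 126))                 # wraps around the ring -> 10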

Basic lookup

[Figure: "Where is K80?" is forwarded around the ring from N10 through N32 and N60 until N90 answers "N90 has K80"]

Simple lookup algorithm

Lookup(key-id)
  succ ← my successor
  if my-id < succ < key-id   // next hop
    call Lookup(key-id) on succ
  else                       // done
    return succ

Correctness depends only on successors
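The pseudocode above hides the circular interval test. A runnable sketch of the same successor-only lookup, assuming each node knows only its own ID and its successor (node objects here are plain dicts; a real node would forward over the network):

    # Sketch: successor-only lookup, O(N) hops in the worst case.

    def between(x, a, b):
        """True if x lies in the circular interval (a, b]."""
        if a < b:
            return a < x <= b
        return x > a or x <= b                   # the interval wraps past zero

    def lookup(node, key_id):
        succ = node["succ"]
        if between(key_id, node["id"], succ["id"]):
            return succ                          # done: succ stores the key
        return lookup(succ, key_id)              # next hop: forward to my successor

    # Build the ring N10 -> N32 -> N60 -> N90 -> N105 -> N120 -> N10.
    ids = [10, 32, 60, 90, 105, 120]
    ring = [{"id": i} for i in ids]
    for a, b in zip(ring, ring[1:] + ring[:1]):
        a["succ"] = b

    print(lookup(ring[0], 80)["id"])             # "Where is K80?" -> 90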

Improving performance

• Problem: Forwarding through successor is slow
• Data structure is a linked list: O(n)
• Idea: Can we make it more like a binary search?
  – Need to be able to halve distance at each step

"Finger table" allows log N-time lookups

[Figure: node N80 with fingers pointing ½, ¼, 1/8, 1/16, 1/32, and 1/64 of the way around the ring]

Finger i points to successor of n+2^i

[Figure: node N80's finger for 80 + 2^5 = 112 (K112) points to N120, the successor of 112]

Implication of finger tables

• A binary lookup tree rooted at every node
  – Threaded through other nodes' finger tables
• Better than arranging nodes in a single tree
  – Every node acts as a root
    • So there's no root hotspot
    • No single point of failure
    • But a lot more state in total
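Finger i of node n points at successor(n + 2^i mod 2^m). A short sketch that builds one node's finger table from a known set of node IDs (a real node would discover each entry by doing lookups, not by reading a global list):

    # Sketch: finger[i] = successor(n + 2^i mod 2^m) for node n.
    import bisect

    M = 7                                        # bits in the identifier space

    def successor(node_ids, ident):
        ring = sorted(node_ids)
        i = bisect.bisect_left(ring, ident % (2 ** M))
        return ring[i % len(ring)]

    def finger_table(node_ids, n):
        return [successor(node_ids, n + 2 ** i) for i in range(M)]

    nodes = [5, 10, 20, 32, 60, 80, 99, 110]
    print(finger_table(nodes, 80))               # [99, 99, 99, 99, 99, 5, 20]
    # The fingers aim at 81, 82, 84, 88, 96, 112, and 16 (mod 128), so each
    # entry roughly halves the remaining distance around the ring.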

Lookup with finger table

Lookup(key-id)
  look in local finger table for
    highest n: my-id < n < key-id
  if n exists
    call Lookup(key-id) on node n   // next hop
  else
    return my successor             // done

Lookups Take O(log N) Hops

[Figure: N32 issues Lookup(K19); the query hops across the ring of N5, N10, N20, N32, N60, N80, N99, N110, roughly halving the remaining distance each time, until it reaches K19's successor N20]
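A runnable version of the finger-table lookup: forward to the farthest finger that still precedes the key, so each hop roughly halves the remaining distance. The Node class and the locally built ring are illustrative only; a real implementation would resolve each hop with an RPC.

    # Sketch: O(log N) lookup via closest-preceding finger (local simulation).
    M = 7

    def between(x, a, b):
        """True if x lies in the circular open interval (a, b)."""
        if a < b:
            return a < x < b
        return x > a or x < b

    class Node:
        def __init__(self, ident):
            self.id = ident
            self.fingers = []                    # finger[i] = successor(id + 2^i)

        def lookup(self, key_id):
            for f in reversed(self.fingers):     # farthest finger first
                if between(f.id, self.id, key_id):
                    return f.lookup(key_id)      # next hop
            return self.fingers[0]               # finger[0] is my successor: done

    ids = [5, 10, 20, 32, 60, 80, 99, 110]
    nodes = {i: Node(i) for i in ids}
    ring = sorted(ids)
    def succ(x):
        x %= 2 ** M
        return nodes[next((n for n in ring if n >= x), ring[0])]
    for node in nodes.values():
        node.fingers = [succ(node.id + 2 ** i) for i in range(M)]

    print(nodes[32].lookup(19).id)               # Lookup(K19) from N32 -> 20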

An aside: Is log(n) fast or slow?

• For a million nodes, it's 20 hops
• If each hop takes 50 ms, lookups take a second
• If each hop has 10% chance of failure, it's a couple of timeouts
• So in practice log(n) is better than O(n) but not great

Joining: Linked list insert

[Figure: nodes N25 and N40 on the ring; N40 stores K30 and K38; new node N36 wants to join between them]

1. Lookup(36)

Join (2)

[Figure: N36 now points to N40 as its successor; N40 still stores K30 and K38]

2. N36 sets its own successor pointer

Join (3)

[Figure: K30 has moved to N36; K38 stays at N40]

3. Copy keys 26..36 from N40 to N36
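The three join steps are a linked-list insert plus a key handoff. A sketch with dict-based nodes standing in for real Chord nodes (the successor-only lookup from the earlier slide is reused to find N36's place on the ring):

    # Sketch: N36 joins between N25 and N40.

    def between(x, a, b):                        # x in circular interval (a, b]
        if a < b:
            return a < x <= b
        return x > a or x <= b

    def lookup(node, key_id):                    # successor-only lookup, as before
        if between(key_id, node["id"], node["succ"]["id"]):
            return node["succ"]
        return lookup(node["succ"], key_id)

    def join(new, some_node):
        succ = lookup(some_node, new["id"])      # 1. Lookup(36) finds successor N40
        new["succ"] = succ                       # 2. N36 sets its own successor pointer
        moved = {k for k in succ["keys"]
                 if not between(k, new["id"], succ["id"])}
        new["keys"] |= moved                     # 3. copy keys 26..36 from N40 to N36
        succ["keys"] -= moved                    # N40 keeps K38

    n25 = {"id": 25, "keys": set()}
    n40 = {"id": 40, "keys": {30, 38}}
    n25["succ"], n40["succ"] = n40, n25          # two-node ring N25 <-> N40

    n36 = {"id": 36, "keys": set()}
    join(n36, n25)
    print(n36["keys"], n40["keys"])              # {30} {38}
    # N25 still thinks its successor is N40; the notify/stabilize protocol
    # on the next slides repairs that pointer.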

Notify maintains predecessors

[Figure: N36 sends "notify N36" to its successor N40, and N25 sends "notify N25" to N36, so each node learns who its predecessor is]

Stabilize message fixes successor

[Figure: N25 still has N40 as its successor (stale link, marked ✘); N25 sends a stabilize message to N40, which replies "My predecessor is N36." N25 then makes N36 its successor]
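The maintenance protocol can be sketched straight from the Chord paper's stabilize/notify pseudocode: each node periodically asks its successor for that node's predecessor, adopts it if it sits in between, and then notifies the successor of its own existence. RPCs, failure handling, and the periodic timer are omitted.

    # Sketch of stabilize/notify (local objects instead of RPCs).

    def between(x, a, b):
        """True if x lies in the circular open interval (a, b)."""
        if a < b:
            return a < x < b
        return x > a or x < b

    class Node:
        def __init__(self, ident):
            self.id = ident
            self.succ = self
            self.pred = None

        def stabilize(self):                     # run periodically on every node
            x = self.succ.pred                   # "who is your predecessor?"
            if x is not None and between(x.id, self.id, self.succ.id):
                self.succ = x                    # a node joined in between: adopt it
            self.succ.notify(self)               # "I might be your predecessor"

        def notify(self, n):
            if self.pred is None or between(n.id, self.pred.id, self.id):
                self.pred = n

    # N36 has joined with successor N40, but N25 still points at N40.
    n25, n36, n40 = Node(25), Node(36), Node(40)
    n25.succ, n40.succ = n40, n25
    n36.succ = n40

    n36.stabilize()                              # notify makes N40's predecessor N36
    n25.stabilize()                              # "My predecessor is N36." fixes N25's successor
    print(n25.succ.id, n40.pred.id)              # 36 36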

Joining: Summary

[Figure: after the join, N36 stores K30 and N40 stores K38]

• Predecessor pointer allows link to new node
• Update finger pointers in the background
• Correct successors produce correct lookups

Failures may cause incorrect lookup

[Figure: ring with N10, N80, N85, N102, N113, N120; N80 issues Lookup(K90) after some of its successors have failed]

N80 does not know correct successor, so incorrect lookup

Successor lists

• Each node stores a list of its r immediate successors
  – After failure, will know first live successor
  – Correct successors guarantee correct lookups
• Guarantee is with some probability

Today

1. Peer-to-Peer Systems
2. Distributed Hash Tables
3. The Chord Lookup Service
  – Basic design
  – Integration with DHash DHT, performance

The DHash DHT

• Builds key/value storage on Chord
• Replicates blocks for availability
  – Stores k replicas at the k successors after the block on the Chord ring
• Caches blocks for load balancing
  – Client sends copy of block to each server it contacted along lookup path
• Authenticates block contents

DHash data authentication

• Two types of DHash blocks:
  – Content-hash: key = SHA-1(data)
  – Public-key: Data signed by corresponding private key
• Chord File System example:
  [Figure: a root block (public key plus signature) points via content hash H(D) to a directory block D, which points via H(F) to a file inode block F, which points via H(B1) and H(B2) to data blocks B1 and B2]
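Content-hash blocks are self-certifying: the key is the SHA-1 of the data, so whoever fetches a block can verify it by re-hashing. A minimal sketch of that check, with a plain dict standing in for the DHT (public-key blocks would instead verify a signature against the key):

    # Sketch: self-verifying content-hash blocks, as in DHash.
    import hashlib

    def put_block(store, data):
        key = hashlib.sha1(data).hexdigest()     # key = SHA-1(data)
        store[key] = data
        return key

    def get_block(store, key):
        data = store[key]
        if hashlib.sha1(data).hexdigest() != key:
            raise ValueError("block fails content-hash check (corrupt or forged)")
        return data

    store = {}
    k = put_block(store, b"a CFS data block")
    print(get_block(store, k) == b"a CFS data block")   # True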

DHash replicates blocks at r successors

[Figure: Block 17 is stored at its successor and replicated at the next r nodes around the ring (N5, N10, N20, N40, N50, N60, N68, N80, N99, N110)]

• Replicas are easy to find if successor fails
• Hashed node IDs ensure independent failure

Today

1. Peer-to-Peer Systems
2. Distributed Hash Tables
3. The Chord Lookup Service
  – Basic design
  – Integration with DHash DHT, performance
4. Concluding thoughts on DHTs, P2P

DHTs: Impact

• Original DHTs (CAN, Chord, Kademlia, Pastry, Tapestry) proposed in 2001-02
• Next 5-6 years saw proliferation of DHT-based apps:
  – Filesystems (e.g., CFS, Ivy, OceanStore, Pond, PAST)
  – Naming systems (e.g., SFR, Beehive)
  – DB query processing [PIER, Wisc]
  – Content distribution systems (e.g., CoralCDN)
  – Distributed databases (e.g., PIER)

Why don't all services use P2P?

1. High latency and limited bandwidth between peers (vs. intra/inter-datacenter)
2. User computers are less reliable than managed servers
3. Lack of trust in peers' correct behavior
  – Securing DHT routing hard, unsolved in practice

DHTs in retrospective

• Seem promising for finding data in large P2P systems
• Decentralization seems good for load
• But: the security problems are difficult
• But: churn is a problem, particularly if log(n) is big
• And: cloud computing addressed many of the economic reasons for P2P, as did the rise of ad-based business models
• DHTs have not had the hoped-for impact

What DHTs got right

• Consistent hashing
  – Elegant way to divide a workload across machines
  – Very useful in clusters: actively used today in Amazon and other systems
• Replication for high availability, efficient recovery
• Incremental scalability
• Self-management: minimal configuration
• Unique trait: no single server to shut down/monitor