Distributed Hash Tables (DHTs): Chord & CAN
CS 699/IT 818

Acknowledgements

 The following slides are borrowed or adapted from talks by Robert Morris (MIT) and Sanjeev Setia (ICSI)

Chord

Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications
Robert Morris, Ion Stoica, David Karger, M. Frans Kaashoek, Hari Balakrishnan
MIT and Berkeley
SIGCOMM Proceedings, 2001

Simplicity

 Resolution entails participation by O(log(N)) nodes
 Resolution is efficient when each node enjoys accurate information about O(log(N)) other nodes
 Resolution is possible when each node enjoys accurate information about 1 other node
 “Degrades gracefully”

Chord Algorithms

 Basic Lookup
 Node Joins
 Stabilization
 Failures and Replication

Chord Properties

 Efficient: O(log(N)) messages per lookup
   N is the total number of servers
 Scalable: O(log(N)) state per node
 Robust: survives massive failures
 Proofs are in paper / tech report
   Assuming no malicious participants


Chord IDs

 Key identifier = SHA-1(key)
 Node identifier = SHA-1(IP address)
 Both are uniformly distributed
 Both exist in the same ID space
 How to map key IDs to node IDs?

Consistent Hashing [Karger 97]

 Target: web page caching
 Like normal hashing, assigns items to buckets so that each bucket receives roughly the same number of items
 Unlike normal hashing, a small change in the bucket set does not induce a total remapping of items to buckets
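To make the ID mapping concrete, here is a minimal Python sketch (the node IP list, key name, and helper names are illustrative, not from the paper): keys and node addresses are hashed with SHA-1 into the same circular ID space, and a key is assigned to the first node ID at or after it.

    import hashlib

    M = 160                       # Chord uses the full 160-bit SHA-1 output as the ID space

    def chord_id(data: str) -> int:
        """Hash a key or an IP address into the circular ID space [0, 2^M)."""
        return int(hashlib.sha1(data.encode()).hexdigest(), 16) % (2 ** M)

    def successor(key_id: int, node_ids: list[int]) -> int:
        """Map a key ID to the node with the next-higher ID (wrapping around the ring)."""
        for nid in sorted(node_ids):
            if nid >= key_id:
                return nid
        return min(node_ids)      # wrap around to the lowest node ID

    # Illustrative use: three nodes identified by IP address, one key
    nodes = [chord_id(ip) for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.3"]]
    print(successor(chord_id("some-file.txt"), nodes))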


Consistent Hashing [Karger 97]

[Figure: circular 7-bit ID space with nodes (e.g. N10, N32, N90, N105) and keys K5, K20, K80 mapped to their successors]

 A key is stored at its successor: the node with the next-higher ID

Basic lookup

[Figure: the query “Where is key 80?” is forwarded from node to node around the ring until it reaches K80’s successor, which replies “N90 has K80”]

Simple lookup algorithm

  Lookup(my-id, key-id)
    n = my successor
    if my-id < n < key-id
      call Lookup(key-id) on node n   // next hop
    else
      return my successor             // done

 Correctness depends only on successors

“Finger table” allows log(N)-time lookups

[Figure: node N80’s fingers point 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, and 1/128 of the way around the ring]
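A runnable sketch of this successor-only lookup, assuming a tiny in-memory ring (the Node class, in_interval helper, and the three-node example are illustrative): the query simply follows successor pointers until the key falls between a node and its successor.

    class Node:
        def __init__(self, node_id: int):
            self.node_id = node_id
            self.successor: "Node" = self   # set properly once the ring is built

        def find_successor(self, key_id: int) -> "Node":
            # If the key falls between me and my successor, my successor owns it.
            if in_interval(key_id, self.node_id, self.successor.node_id):
                return self.successor
            # Otherwise hand the query to the next node around the ring.
            return self.successor.find_successor(key_id)

    def in_interval(x: int, a: int, b: int) -> bool:
        """True if x lies in the circular interval (a, b] of the ID space."""
        if a < b:
            return a < x <= b
        return x > a or x <= b      # interval wraps past zero

    # Illustrative three-node ring: 10 -> 32 -> 90 -> 10
    n10, n32, n90 = Node(10), Node(32), Node(90)
    n10.successor, n32.successor, n90.successor = n32, n90, n10
    print(n10.find_successor(80).node_id)   # -> 90, since 80 lies in (32, 90]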


Lookup with fingers

  Lookup(my-id, key-id)
    look in local finger table for
      highest node n s.t. my-id < n < key-id
    if n exists
      call Lookup(key-id) on node n   // next hop
    else
      return my successor             // done

Finger i points to the successor of n + 2^(i-1)

[Figure: from node N80, the finger target 112 has successor N120; fingers cover 1/2, 1/4, …, 1/128 of the ring]
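A sketch of the finger-based variant, extending the illustrative Node and in_interval from the previous sketch (FingerNode and its table layout are assumptions, not the paper's code): each step jumps to the highest finger that precedes the key, which is what yields the O(log N) hop count.

    class FingerNode(Node):
        def __init__(self, node_id: int):
            super().__init__(node_id)
            # fingers[i] is meant to hold the successor of (node_id + 2**i) mod ring size;
            # it is filled in when the ring is built.
            self.fingers: list["FingerNode"] = []

        def find_successor(self, key_id: int) -> "Node":
            if in_interval(key_id, self.node_id, self.successor.node_id):
                return self.successor
            # Jump as far as possible without overshooting the key, i.e. the
            # highest finger n with my-id < n < key-id.
            for finger in reversed(self.fingers):
                if (in_interval(finger.node_id, self.node_id, key_id)
                        and finger.node_id != key_id):
                    return finger.find_successor(key_id)
            return self.successor.find_successor(key_id)   # fall back to the plain walk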


Lookups take O(log(N)) hops

[Figure: Lookup(K19) issued at N32 reaches K19’s successor N20 in a few finger hops, on a ring with nodes N5, N10, N20, N60, N80, N99, N110]

Node Join - Linked List Insert

[Figure: new node N36 joins a ring containing N25 and N40 (which stores K30 and K38); step 1: N36 does Lookup(36) to find its place]

Node Join (2)

[Figure] 2. N36 sets its own successor pointer to N40

Node Join (3)

[Figure] 3. Keys in 26..36 (here K30) are copied from N40 to N36

Node Join (4)

[Figure] 4. Set N25’s successor pointer to N36

 Update finger pointers in the background
 Correct successors produce correct lookups

Stabilization

 Case 1: finger tables are reasonably fresh
 Case 2: successor pointers are correct; fingers are inaccurate
 Case 3: successor pointers are inaccurate or key migration is incomplete
 Stabilization algorithm periodically verifies and refreshes node knowledge
   Successor pointers
   Predecessor pointers
   Finger tables
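The slide lists what gets refreshed but not how; below is a minimal Python sketch in the spirit of the periodic stabilize/notify scheme described in the Chord paper, reusing in_interval from the earlier sketch (the RingNode type and field names are illustrative).

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class RingNode:
        node_id: int
        successor: Optional["RingNode"] = None      # assumed wired when the ring is built
        predecessor: Optional["RingNode"] = None

    def stabilize(node: RingNode) -> None:
        """Run periodically: if a node has joined between us and our successor, adopt it,
        then remind our successor that we exist."""
        x = node.successor.predecessor
        if (x is not None and x is not node.successor
                and in_interval(x.node_id, node.node_id, node.successor.node_id)):
            node.successor = x
        notify(node.successor, node)

    def notify(node: RingNode, candidate: RingNode) -> None:
        """candidate believes it may be node's predecessor."""
        if node.predecessor is None or in_interval(
                candidate.node_id, node.predecessor.node_id, node.node_id):
            node.predecessor = candidate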


Failures and Replication

[Figure: after a failure, N80 doesn’t know its correct successor, so Lookup(90) on the ring N10, N80, N85, N102, N113, N120 returns an incorrect result]

Solution: successor lists

 Each node knows its r immediate successors
 After a failure, it will know the first live successor
 Correct successors guarantee correct lookups
 Guarantee is with some probability

Choosing the successor list length

 Assume 1/2 of nodes fail
 P(successor list all dead) = (1/2)^r
   i.e., P(this node breaks the Chord ring)
   Depends on independent failures
 P(no broken nodes) = (1 – (1/2)^r)^N
 r = 2 log(N) makes this prob. = 1 – 1/N

Chord status

 Working implementation as part of CFS
 Chord library: 3,000 lines of C++
 Deployed in small Internet testbed
 Includes:
   Correct concurrent join/fail
   Proximity-based routing for low delay
   Load control for heterogeneous nodes
   Resistance to spoofed node IDs
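To see why r = 2 log(N) gives the stated bound (assuming log base 2 and independent failures, as the slide does): (1/2)^(2 log2 N) = 1/N^2, so P(no broken nodes) = (1 − 1/N^2)^N ≈ 1 − 1/N. The snippet below just checks this numerically; the node counts are illustrative.

    import math

    def p_ring_survives(num_nodes: int) -> float:
        """P(no node's entire successor list is dead) when half the nodes fail."""
        r = 2 * math.log2(num_nodes)            # successor list length from the slide
        p_list_all_dead = 0.5 ** r              # = 1 / N^2
        return (1 - p_list_all_dead) ** num_nodes

    for n in (100, 1_000, 10_000):
        print(n, round(p_ring_survives(n), 6), round(1 - 1 / n, 6))
    # The two columns agree closely, matching the 1 - 1/N claim.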


Experimental overview

 Quick lookup in large systems
 Low variation in lookup costs
 Robust despite massive failure
 See paper for more results

 Experiments confirm theoretical results

Chord lookup cost is O(log N)

[Figure: average messages per lookup vs. number of nodes; the cost grows logarithmically, with a constant of 1/2]

Failure experimental setup

 Start 1,000 CFS/Chord servers
   Successor list has 20 entries
 Wait until they stabilize
 Insert 1,000 key/value pairs
   Five replicas of each
 Stop X% of the servers
 Immediately perform 1,000 lookups

Massive failures have little impact

[Figure: failed lookups (percent, 0–1.4) vs. failed nodes (percent, 5–50); annotated “(1/2)^6 is 1.6%”]


Latency Measurements

 180 nodes at 10 sites in US testbed
 16 queries per physical site (sic) for random keys
 Maximum < 600 ms
 Minimum > 40 ms
 Median = 285 ms
 Expected value = 300 ms (5 round trips)

Chord Summary

 Chord provides peer-to-peer hash lookup
 Efficient: O(log(n)) messages per lookup
 Robust as nodes fail and join
 Good primitive for peer-to-peer systems

http://www.pdos.lcs.mit.edu/chord


Content-Addressable Network (CAN)

 CAN: Internet-scale hash table
 Interface
   insert(key, value)
   value = retrieve(key)
 Properties
   scalable
   operationally simple
   good performance
 Related systems: Chord/Pastry/Tapestry/Buzz/Plaxton ...

A Scalable, Content-Addressable Network

Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, Scott Shenker


Problem Scope

 Design a system that provides the interface
   scalability
   robustness
   performance
   security
 Application-specific, higher-level primitives
   keyword searching
   mutable content
   anonymity

CAN: basic idea

[Figure: a distributed hash table pictured as a collection of (K,V) entries spread across nodes]

[Figure: a node calls insert(K1,V1); the (K1,V1) entry is routed to and stored at the node responsible for K1]

[Figure: another node calls retrieve(K1); the lookup is routed to the node holding (K1,V1)]

CAN: solution

 virtual Cartesian coordinate space
 entire space is partitioned amongst all the nodes
   every node “owns” a zone in the overall space
 abstraction
   can store data at “points” in the space
   can route from one “point” to another
   point = the node that owns the enclosing zone

CAN: simple example

[Figure: a 2-d coordinate space owned entirely by a single node, node 1]

[Figure: as nodes 2, 3, and 4 join, the space is repeatedly split so that each of the four nodes owns one zone]

CAN: simple example

  node I::insert(K,V)
  (1) a = hx(K)
      b = hy(K)
  (2) route(K,V) -> (a,b)
  (3) node at (a,b) stores (K,V)

[Figure: node I hashes K to the point (a,b), routes the (K,V) pair through the grid, and the node owning the zone containing (a,b) stores it]

CAN: simple example

  node J::retrieve(K)
  (1) a = hx(K)
      b = hy(K)
  (2) route “retrieve(K)” to (a,b)

[Figure: node J hashes K to (a,b); the query is routed to the node holding (K,V)]

CAN

 Data stored in the CAN is addressed by name (i.e. key), not location (i.e. IP address)
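hx and hy are simply two independent uniform hash functions; the sketch below (the hash construction, Zone class, and owner helper are illustrative, not the paper's code) shows how insert and retrieve both reduce to finding the zone that encloses the hashed point.

    import hashlib

    def hash_point(key: str) -> tuple[float, float]:
        """Map a key to a point (a, b) in the unit square using two independent hashes."""
        def h(data: str) -> float:
            digest = hashlib.sha1(data.encode()).digest()
            return int.from_bytes(digest[:8], "big") / 2 ** 64   # uniform in [0, 1)
        return h("x:" + key), h("y:" + key)

    class Zone:
        """A rectangular region of the coordinate space owned by one node."""
        def __init__(self, x_lo, x_hi, y_lo, y_hi):
            self.bounds = (x_lo, x_hi, y_lo, y_hi)
            self.store: dict[str, object] = {}

        def contains(self, point) -> bool:
            (a, b), (x_lo, x_hi, y_lo, y_hi) = point, self.bounds
            return x_lo <= a < x_hi and y_lo <= b < y_hi

    def owner(zones: list[Zone], point) -> Zone:
        """Stand-in for CAN routing: find the zone whose region encloses the point."""
        return next(z for z in zones if z.contains(point))

    # Illustrative use: two nodes split the unit square in half along x
    zones = [Zone(0.0, 0.5, 0.0, 1.0), Zone(0.5, 1.0, 0.0, 1.0)]
    point = hash_point("some-file.txt")
    owner(zones, point).store["some-file.txt"] = b"...data..."      # insert(K,V)
    print(owner(zones, point).store.get("some-file.txt"))           # retrieve(K)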

CAN: routing table

[Figure: a node’s routing table is the set of its immediate neighbors and their zones]

CAN: routing

[Figure: a message for the point (a,b) is forwarded from the node at (x,y) through successive neighboring zones toward (a,b)]

CAN: routing

 A node only maintains state for its immediate neighboring nodes

CAN: node insertion

[Figure: a new node contacts a bootstrap node to join the CAN]

 1) Discover some node “I” already in CAN

CAN: node insertion

 2) New node picks a random point (p,q) in the space
 3) I routes to (p,q), discovers node J
 4) Split J’s zone in half… new node owns one half

[Figure: the new node’s chosen point (p,q) falls inside J’s zone; after the split, J and the new node each own half of the old zone]
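A sketch of step 4, reusing the illustrative Zone and hash_point from the earlier sketch: J's zone is halved and the entries that now fall in the new half are handed over. (CAN proper cycles through the dimensions when choosing the split axis; splitting the longer side here is a simplification.)

    def split_zone(j: Zone) -> Zone:
        """Hand half of zone j to a newly joined node (simplified: split the longer side)."""
        x_lo, x_hi, y_lo, y_hi = j.bounds
        if (x_hi - x_lo) >= (y_hi - y_lo):                  # split vertically
            mid = (x_lo + x_hi) / 2
            new = Zone(mid, x_hi, y_lo, y_hi)
            j.bounds = (x_lo, mid, y_lo, y_hi)
        else:                                               # split horizontally
            mid = (y_lo + y_hi) / 2
            new = Zone(x_lo, x_hi, mid, y_hi)
            j.bounds = (x_lo, x_hi, y_lo, mid)
        # Hand over the (K,V) entries whose points now fall in the new half.
        for key in list(j.store):
            if new.contains(hash_point(key)):
                new.store[key] = j.store.pop(key)
        return new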

CAN: node insertion

 Inserting a new node affects only a single other node and its immediate neighbors

CAN: node failures

 Need to repair the space
   recover database
    . soft-state updates
    . use replication, rebuild database from replicas
   repair routing
    . takeover algorithm


CAN: takeover algorithm

 Simple failures
   know your neighbor’s neighbors
   when a node fails, one of its neighbors takes over its zone
 More complex failure modes
   simultaneous failure of multiple adjacent nodes
   scoped flooding to discover neighbors
   hopefully, a rare event

CAN: node failures

 Only the failed node’s immediate neighbors are required for recovery

Design recap

 Basic CAN
   completely distributed
   self-organizing
   nodes only maintain state for their immediate neighbors
 Additional design features
   multiple, independent spaces (realities)
   background load balancing algorithm
   simple heuristics to improve performance

Evaluation

 Scalability
 Low-latency
 Load balancing
 Robustness

CAN: scalability

 For a uniformly partitioned space with n nodes and d dimensions
   per node, number of neighbors is 2d
   average routing path is (d/4)(n^(1/d)) hops
   simulations show that the above results hold in practice
 Can scale the network without increasing per-node state
 Chord/Plaxton/Tapestry/Buzz
   log(n) neighbors with log(n) hops

CAN: low-latency

 Problem
   latency stretch = (CAN routing delay) / (IP routing delay)
   application-level routing may lead to high stretch
 Solution
   increase dimensions
   heuristics
    . RTT-weighted routing
    . multiple nodes per zone (peer nodes)
    . deterministically replicate entries
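To get a feel for these formulas, the short calculation below prints the neighbor count 2d and the average path length (d/4)·n^(1/d) for a few dimensions at a fixed, illustrative network size.

    # Illustrative numbers for the CAN scaling formulas: 2d neighbors per node,
    # average path length (d/4) * n^(1/d) hops.
    n = 131_072           # number of nodes (the upper end of the simulated range)
    for d in (2, 3, 6, 10):
        neighbors = 2 * d
        avg_hops = (d / 4) * n ** (1 / d)
        print(f"d={d:2d}  neighbors={neighbors:2d}  avg hops ~ {avg_hops:.1f}")

Raising d shrinks the path length sharply while the routing table only grows linearly, which is the trade-off the low-latency slide exploits.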


CAN: low-latency

[Figure: latency stretch vs. number of nodes (16K–131K), with and without heuristics, for #dimensions = 2 and #dimensions = 10; the heuristics sharply reduce stretch]

CAN: load balancing

 Two pieces
   Dealing with hot-spots
    . popular (key,value) pairs
    . nodes cache recently requested entries
    . overloaded node replicates popular entries at neighbors
   Uniform coordinate space partitioning
    . uniformly spread (key,value) entries
    . uniformly spread out routing load

Uniform Partitioning

 Added check (see the sketch below)
   at join time, pick a zone
   check neighboring zones
   pick the largest zone and split that one
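A sketch of that check, reusing the illustrative Zone class from the earlier sketch (zone_volume and pick_zone_to_split are assumed helper names): instead of always splitting the zone it landed in, the joining node splits the largest zone among that zone and its neighbors, which keeps zone volumes, and hence key and routing load, closer to uniform.

    def zone_volume(z: Zone) -> float:
        x_lo, x_hi, y_lo, y_hi = z.bounds
        return (x_hi - x_lo) * (y_hi - y_lo)

    def pick_zone_to_split(candidate: Zone, neighbors: list[Zone]) -> Zone:
        """Uniform-partitioning check at join time: split the largest nearby zone."""
        return max([candidate, *neighbors], key=zone_volume)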


Uniform Partitioning

[Figure: 65,000 nodes, 3 dimensions; percentage of nodes owning zones of volume V/16 … 8V (where V = total volume / n), with and without the join-time check]

CAN: Robustness

 Completely distributed
   no single point of failure
 Not exploring database recovery
 Resilience of routing
   can route around trouble

Routing resilience

[Figure: a route from a source zone to a destination zone across the coordinate space; subsequent figures show the route detouring around failed nodes]

Routing resilience

  Node X::route(D)
    if (X cannot make progress toward D)
      check if any neighbor of X can make progress
      if yes, forward message to one such neighbor
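One way to read this rule as code (the coordinates, the neighbor tuples, and the "can make progress" test are illustrative assumptions, not the paper's algorithm): prefer the live neighbor closest to the destination; if none is closer than X itself, fall back to a live neighbor whose own neighbors can still make progress.

    import math
    import random

    def dist(p, q) -> float:
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def next_hop(x_center, dest, neighbors):
        """Pick X's next hop toward the point dest.

        'neighbors' is an illustrative list of (center, alive, their_neighbor_centers)
        tuples. Normally forward to the live neighbor closest to dest; if none of them
        makes progress (e.g. the nodes in that direction have failed), fall back to any
        live neighbor that can itself make progress via its own neighbors.
        """
        live = [(c, nn) for c, alive, nn in neighbors if alive]
        progress = [(c, nn) for c, nn in live if dist(c, dest) < dist(x_center, dest)]
        if progress:                                  # the usual greedy step
            return min(progress, key=lambda cn: dist(cn[0], dest))[0]
        detour = [c for c, nn in live
                  if any(dist(m, dest) < dist(x_center, dest) for m in nn)]
        return random.choice(detour) if detour else None   # None: routing is stuck

    # Illustrative call: X at (0.5, 0.5), destination (0.9, 0.5); the neighbor at
    # (0.7, 0.5) is dead, so the detour goes through (0.5, 0.7), whose own neighbor
    # at (0.7, 0.7) is closer to the destination than X is.
    print(next_hop((0.5, 0.5), (0.9, 0.5),
                   [((0.7, 0.5), False, []),
                    ((0.5, 0.7), True, [(0.7, 0.7)]),
                    ((0.3, 0.5), True, [(0.1, 0.5)])]))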


Routing resilience

[Figure: Pr(successful routing) for a 16K-node CAN: one panel varies Pr(node failure) from 0 to 0.75 (with #dimensions = 10), the other varies the number of dimensions from 2 to 10 (with Pr(node failure) = 0.25)]

Summary

 CAN
   an Internet-scale hash table
   potential building block in Internet applications
 Scalability
   O(d) per-node state
 Low-latency routing
   simple heuristics help a lot
 Robust
   decentralized, can route around trouble
