CS 268: Structured P2P Networks: Pastry and Tapestry
Sean Rhea
April 29, 2003

Domain

Structured peer-to-peer overlay networks
- Sometimes called DHTs, DOLRs, CASTs, …
- Examples: CAN, Chord, Pastry, Tapestry, …

Contrasted with unstructured P2P networks
- Gnutella, Freenet, etc.

Today talking about Pastry and Tapestry

Service Model

Let Ι be a set of identifiers
- Such as all 160-bit unsigned integers

Let Ν be a set of nodes in a P2P system
- Some subset of all possible (IP, port) tuples

Structured P2P overlays implement a mapping owner_Ν: Ι → Ν
- Given any identifier, deterministically map to a node

Properties
- Should take O(log |Ν|) time and state per node
- Should be roughly load balanced

Service Model (cont’d)

Owner mapping exposed in a variety of ways
- In Chord, have function n = find_successor (i)
- In Pastry and Tapestry, have function route_to_root (i, m)

In general, can be iterative or recursive
- Iterative = directed by querying node
- Recursive = forwarded through network

May also expose owner⁻¹: Ν → P(Ι)
- Which identifiers a given node is responsible for


Other Service Models

Other models can be implemented on owner

Example: distributed hash table (DHT)

    void put (key, data) {
        n = owner (key)
        n.hash_table.insert (key, data)
    }

    data get (key) {
        n = owner (key)
        return n.hash_table.lookup (key)
    }

Lecture Overview

Introduction

PRR Trees
- Overview
- Locality Properties

Pastry
- Routing in Pastry
- Joining a Pastry network
- Leaving a Pastry network

Tapestry
- Routing in Tapestry
- Object location in Tapestry

Multicast in PRR Trees

Conclusions
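The put/get pseudocode on the Other Service Models slide can be made runnable as a minimal sketch. Everything beyond put, get, owner, and hash_table is an assumption for illustration: the Node class, the 16-bit identifier space, the SHA-1 key hashing, and an owner function that simply picks the numerically closest node from a global list (a real overlay would route to it instead).

```python
import hashlib


class Node:
    """Toy node: an integer identifier plus a local hash table."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.hash_table = {}


def key_to_id(key, bits=16):
    """Hash a string key into an assumed 16-bit identifier space."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


def owner(nodes, ident):
    """Stand-in for the overlay: numerically closest node, ties broken low."""
    return min(nodes, key=lambda n: (abs(n.node_id - ident), n.node_id))


def put(nodes, key, data):
    owner(nodes, key_to_id(key)).hash_table[key] = data


def get(nodes, key):
    return owner(nodes, key_to_id(key)).hash_table.get(key)


nodes = [Node(i) for i in (0x1000, 0x2000, 0x3600, 0x3800)]
put(nodes, "song.mp3", "host-a:4000")
print(get(nodes, "song.mp3"))
```

Because owner is deterministic, get finds the value on whichever node put stored it, without either call knowing in advance which node that is.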


PRR Trees

Work by Plaxton, Rajaraman, Richa (SPAA ’97)
- Interested in a distributed publication system
- Similar to Napster in interface
- Only for static networks (set Ν does not change)
- No existing implementation (AFAIK)

Pastry and Tapestry both based on PRR trees
- Extend to support dynamic node membership
- Several implementations

PRR Trees: The Basic Idea

Basic idea: add injective function node_id: Ν → Ι
- Gives each node a name in the identifier space

owner (i) = node whose node_id is “closest” to i
- Definition of closest varies, but always over Ι

To find owner (i) from node with identifier j
1. Let p = longest matching prefix between i and j
2. Find node k with longest matching prefix of |p|+1 digits
3. If no such node, j is the owner (root)
4. Otherwise, forward query to node k

Step 2 is the tricky part
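The four steps above can be sketched as follows. This is an illustration under assumptions, not PRR's construction: identifiers are equal-length hex strings, each node's known neighbors are supplied in a hypothetical dictionary, and when several neighbors match more digits the sketch picks the shortest improvement so that roughly one digit is resolved per hop.

```python
def shared_prefix_len(a, b):
    """Number of leading digits two identifier strings have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def next_hop(j, i, known):
    """One routing step at node j for identifier i; None means j is the root."""
    p = shared_prefix_len(i, j)                                   # step 1
    better = [k for k in known if shared_prefix_len(i, k) > p]    # step 2
    if not better:
        return None                                               # step 3
    return min(better, key=lambda k: (shared_prefix_len(i, k), k))  # step 4


def find_owner(start, i, neighbors):
    """Follow next_hop until no neighbor matches more digits; return the path."""
    path = [start]
    while True:
        k = next_hop(path[-1], i, neighbors.get(path[-1], []))
        if k is None:
            return path
        path.append(k)


# Hypothetical neighbor lists, for illustration only.
neighbors = {
    "3AF2": ["4633", "9CD0", "3A01"],
    "4633": ["47DA", "45B3"],
    "47DA": ["47EC", "47C1"],
    "47EC": [],
}
print(find_owner("3AF2", "47E2", neighbors))
```

Each hop strictly lengthens the matched prefix, so the walk ends after at most as many hops as the identifier has digits.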


PRR Trees: The Routing Table

Each node n has O(b log_b |Ν|) neighbors
- Each Lx neighbor shares x digits with n
- Set of neighbors forms a routing table

[Figure: node 3AF2 and its neighbors 3A01, 9CD0, 3AFC, 3C57, 5A8F, and 3AF1, with links labeled L0 through L3]

PRR Trees: Routing

To find owner (47E2)
- Query starts at node 3AF2
- Resolve first digit by routing to 4633
- Resolve second digit by routing to 47DA, etc.

[Figure: query path from 3AF2 through 4633 and 47DA to 47EC; other nodes shown: 2974, 45B3, 47C1, 4889, 47F7, 443E]
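As a small sketch of the Routing Table slide's rule that an Lx neighbor shares x digits with n, the function below files each known node under its level and next digit. The table layout (a dictionary of level to digit to node) and the function name are assumptions of this sketch; the identifiers are the ones that appear in the routing-table figure.

```python
import os


def build_routing_table(me, known):
    """Group known nodes by (shared-prefix length, next digit)."""
    table = {}
    for other in known:
        if other == me:
            continue
        # commonprefix compares character by character, so its length is
        # the number of leading digits the two identifiers share.
        level = len(os.path.commonprefix([me, other]))
        table.setdefault(level, {}).setdefault(other[level], other)
    return table


table = build_routing_table(
    "3AF2", ["3A01", "9CD0", "3AFC", "3C57", "5A8F", "3AF1"]
)
for level in sorted(table):
    print("L%d" % level, table[level])
```

With base-16 digits, each level has at most 15 useful slots, which is where the O(b log_b |Ν|) bound on neighbors comes from.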

PRR Trees: Routing (cont’d)

Problem: what if no exact match?
- Consider the following network
- Who is the owner of identifier 3701?

Network is well formed
- Every routing table spot that can be filled is filled
- Can route to all node identifiers

Owner of 3701 not well defined
- Starting from 1000, it’s node 3800
- Starting from 2000, it’s node 3600

Violation of service model

[Figure: network of four nodes: 1000, 2000, 3600, 3800]

PRR Trees: Handling Inexact Matches

Want owner function to be deterministic
- Must have a way to resolve inexact matches

Solved different ways by each system
- I have no idea what PRR did
- Pastry chooses numerically closest node
  • Can break ties high or low
- Tapestry performs “surrogate routing”
  • Chooses next highest match on per digit basis

More on this later


Locality in PRR Trees

Consider a node with id=1000 in a PRR network
- At lowest level of routing table, node 1000 needs neighbors with prefixes 2-, 3-, 4-, etc.
- In a large network, there may be several of each

Idea: choose the “best” neighbor for each prefix
- Best can mean lowest latency, highest bandwidth, etc.

Can show that this choice gives good routes
- For certain networks, routing path from query source to owner is no more than a constant factor worse than routing path in underlying network
- I’m not going to prove this today, see PRR97 for details

Locality in PRR Trees: Experiments

[Figure: experimental results; not recoverable from the extracted text]
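The neighbor-selection idea on the Locality in PRR Trees slide can be sketched as below, taking "best" to mean lowest latency. The candidate identifiers and latency numbers are made up for illustration, and keying slots by (level, digit) is an assumption of this sketch.

```python
import os


def pick_best_neighbors(me, latency_ms):
    """For each routing-table slot, keep the lowest-latency candidate.

    latency_ms maps candidate identifier -> latency measured from `me`.
    """
    best = {}
    for cand, ms in latency_ms.items():
        if cand == me:
            continue
        level = len(os.path.commonprefix([me, cand]))
        slot = (level, cand[level])
        if slot not in best or ms < latency_ms[best[slot]]:
            best[slot] = cand
    return best


# Hypothetical measurements taken by node 1000.
latency_ms = {"2000": 80, "2ABC": 12, "3800": 40, "3600": 25, "1F00": 5}
best = pick_best_neighbors("1000", latency_ms)
print(best)
```

Any candidate in a slot would keep routing correct; choosing the nearest one only affects how long each hop takes.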


Pastry Introduction

A PRR tree combined with a Chord-like ring
- Each node has PRR-style neighbors
- And each node knows its predecessor and successor
  • Called its leaf set

To find owner (i), node n does the following:
- If i is within the range of n’s leaf set, choose numerically closest node
- Else, if appropriate PRR-style neighbor, choose that
- Finally, choose numerically closest from leaf set

A lot like Chord
- Only leaf set necessary for correctness
- PRR-neighbors like finger table, only for performance
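The three-way routing decision on the Pastry Introduction slide can be sketched as follows, using the four-node network (1000, 2000, 3600, 3800) from the PRR routing slides. The assumptions are loud ones: a leaf set of exactly one predecessor and one successor, a 4-hex-digit identifier space, hand-written routing tables, and ties broken toward the lower identifier. This is a simplified illustration rather than Pastry's full algorithm.

```python
import os

M = 16 ** 4  # size of the assumed 4-hex-digit identifier ring


def val(s):
    return int(s, 16)


def ring_dist(a, b):
    d = abs(val(a) - val(b))
    return min(d, M - d)


def closest(cands, target):
    return min(cands, key=lambda c: (ring_dist(c, target), val(c)))


def in_leaf_range(pred, succ, target):
    """True if target lies on the arc from pred clockwise to succ."""
    return (val(target) - val(pred)) % M <= (val(succ) - val(pred)) % M


def next_hop(me, target, pred, succ, table):
    members = [pred, me, succ]
    if in_leaf_range(pred, succ, target):          # rule 1: leaf set
        return closest(members, target)
    p = len(os.path.commonprefix([me, target]))
    hop = table.get(p, {}).get(target[p])          # rule 2: PRR neighbor
    if hop is not None:
        return hop
    return closest(members, target)                # rule 3: fall back


def route(start, target, state):
    path = [start]
    for _ in range(len(state)):                    # bound the walk
        me = path[-1]
        pred, succ, table = state[me]
        hop = next_hop(me, target, pred, succ, table)
        if hop == me:
            break
        path.append(hop)
    return path


# node -> (predecessor, successor, routing table); tables are hypothetical.
state = {
    "1000": ("3800", "2000", {0: {"2": "2000", "3": "3800"}}),
    "2000": ("1000", "3600", {0: {"1": "1000", "3": "3600"}}),
    "3600": ("2000", "3800", {0: {"1": "1000", "2": "2000"}, 1: {"8": "3800"}}),
    "3800": ("3600", "1000", {0: {"1": "1000", "2": "2000"}, 1: {"6": "3600"}}),
}
print(route("1000", "3701", state))
print(route("2000", "3701", state))
```

Both queries end at 3800, so the owner of 3701 no longer depends on where the query starts.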


Pastry Routing Example

PRR neighbors in black
Leaf set neighbors in blue

Owner of 3701 is now well-defined

From 1000
- Resolve first digit routing to 3800
- At 3800, see that we’re done
  • (Numerically closer than 3600)

From 2000
- Resolve first digit routing to 3600
- At 3600, 3701 is in leaf set
  • In range 2000-3800
- Route to 3800 b/c numerically closer

[Figure: network of four nodes: 1000, 2000, 3600, 3800]

Notes on Pastry Routing

Leaf set is great for correctness
- Need not get PRR neighbors correct, only leaf set
- If you believe the Chord work, this isn’t too hard to do

Leaf set also gives implementation of owner⁻¹(n)
- All identifiers from half-way between n and its predecessor to half-way between n and its successor

Can store k predecessors and successors
- Gives further robustness as in Chord
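The owner⁻¹(n) rule on the Notes on Pastry Routing slide, half-way to the predecessor through half-way to the successor, can be computed directly. This sketch assumes a 4-hex-digit ring with integer identifiers and treats both endpoints as inclusive; how an exact midpoint tie is broken is left open here, as it is on the slide.

```python
M = 16 ** 4  # assumed 4-hex-digit identifier ring


def owned_arc(pred, me, succ):
    """Return (low, high): the arc of identifiers that `me` owns."""
    low = (pred + ((me - pred) % M) // 2) % M
    high = (me + ((succ - me) % M) // 2) % M
    return low, high


def owns(pred, me, succ, ident):
    low, high = owned_arc(pred, me, succ)
    return (ident - low) % M <= (high - low) % M


# Node 3800 has predecessor 3600 and successor 1000 (wrapping past FFFF).
low, high = owned_arc(0x3600, 0x3800, 0x1000)
print(hex(low), hex(high))
print(owns(0x3600, 0x3800, 0x1000, 0x3701))
```

The modular arithmetic is what lets the arc for 3800 wrap around the top of the identifier space toward 1000.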


Joining a Pastry Network

Must know of a “gateway” node, g

Pick new node’s identifier, n, uniformly at random (U.A.R.) from Ι

Ask g to find m = owner (n)
- And ask that it record the path that it takes to do so

Ask m for its leaf set

Contact m’s leaf set and announce n’s presence
- These nodes add n to their leaf sets and vice versa

Build routing table
- Get level i of routing table from node i in the join path
- Use those nodes to make level i of our routing table

Pastry Join Example

Node 3701 wants to join
- Has 1000 as gateway

Join path is 1000 → 3800
- 3800 is the owner

3701 ties itself into leaf set

3701 builds routing table
- L0 neighbors from 1000
  • 1000, 2000, and 3800
- L1 neighbors from 3800
  • 3600

Existing nodes on join path consider 3701 as a neighbor

[Figure: nodes 1000, 2000, 3600, 3800 and joining node 3701, with the join path marked]
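The routing-table step of the join can be sketched as follows: the new node copies level i from the i-th node on the join path and also considers that node itself. One assumption of this sketch deserves care: each learned node is filed under the level given by its own shared prefix with the new identifier, so 3800, although learned from 1000, ends up in level 1 of 3701's table because the two share a first digit. The existing tables are hypothetical.

```python
import os


def build_table_from_join_path(new_id, join_path, tables):
    """Seed the new node's routing table from the nodes on its join path."""
    table = {}
    for level, hop in enumerate(join_path):
        # Level `level` of this hop's table, plus the hop itself.
        learned = list(tables[hop].get(level, {}).values()) + [hop]
        for cand in learned:
            if cand == new_id:
                continue
            shared = len(os.path.commonprefix([new_id, cand]))
            table.setdefault(shared, {}).setdefault(cand[shared], cand)
    return table


# Hypothetical routing tables of the existing nodes on the join path.
tables = {
    "1000": {0: {"2": "2000", "3": "3800"}},
    "3800": {0: {"1": "1000", "2": "2000"}, 1: {"6": "3600"}},
}
new_table = build_table_from_join_path("3701", ["1000", "3800"], tables)
print(new_table)
```

The nodes learned match the example slide (1000, 2000, and 3800 via the gateway; 3600 via the owner); only their placement into levels is this sketch's choice.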


Pastry Join Notes

Join is not “perfect”
- A node whose routing table needs new node should learn about it
  • Necessary to prove O(log |Ν|) routing hops
- Also not guaranteed to find close neighbors

Can use routing table maintenance to fix both
- Periodically ask neighbors for their neighbors
- Use to fix routing table holes; replace existing distant neighbors

Philosophically very similar to Chord
- Start with minimum needed for correctness (leaf set)
- Patch up performance later (routing table)

Pastry Join Optimization

Best if gateway node is “close” to joining node
- Gateway joined earlier, should have close neighbors
- Recursively, gateway’s neighbors’ neighbors are close
- Join path intuitively provides good initial routing table
- Less need to fix up with routing table maintenance

Pastry’s optimized join algorithm
- Before joining, find a good gateway, then join normally

To find a good gateway, refine set of candidates
- Start with original gateway’s leaf set
- Keep only a closest few, then add their neighbors
- Repeat (more or less; see paper for details)


Leaving a Pastry Network

Do not distinguish between leaving and crashing
- A good design decision, IMHO

Remaining nodes notice leaving node n down
- Stops responding to keep-alive pings

Fix leaf sets immediately
- Easy if 2+ predecessors and successors known

Fix routing table lazily
- Wait until needed for a query
- Or until routing table maintenance
- Arbitrary decision, IMHO

Dealing with Broken Leaf Sets

What if no nodes left in leaf set due to failures?

Can use routing table to recover (MCR93)
- Choose closest nodes in routing table to own identifier
- Ask them for their leaf sets
- Choose closest of those, recurse

Allows use of smaller leaf sets


Tapestry Routing

Only different from Pastry when no exact match
- Instead of using next numerically closer node, use node with next higher digit at each hop

Example:
- Given 3 node network: nodes 0700, 0F00, and FFFF
- Who owns identifier 0000?

In Pastry, FFFF does (numerically closest)

In Tapestry, 0700 does
- From FFFF to 0700 or 0F00 (doesn’t matter)
- From 0F00 to 0700 (7 is next highest digit after 0)
- From 0700 to itself (no node with digit between 0 and 7)
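The 0700 / 0F00 / FFFF example can be checked with a short sketch that computes both owners from a global view of the network. That global view is the main assumption: real Tapestry resolves each digit hop by hop from per-node routing tables, and this sketch only matches it when every fillable routing-table slot is filled. Breaking Pastry ties toward the lower identifier is also an assumption.

```python
DIGITS = "0123456789ABCDEF"


def tapestry_owner(nodes, target):
    """Surrogate routing: per digit, take the target digit or the next
    higher digit (wrapping around) that some remaining node has."""
    candidates = list(nodes)
    for pos in range(len(target)):
        if len(candidates) == 1:
            break
        start = DIGITS.index(target[pos])
        for step in range(len(DIGITS)):
            digit = DIGITS[(start + step) % len(DIGITS)]
            matching = [n for n in candidates if n[pos] == digit]
            if matching:
                candidates = matching
                break
    return candidates[0]


def pastry_owner(nodes, target):
    """Numerically closest node on the identifier ring."""
    size = 16 ** len(target)

    def dist(n):
        d = abs(int(n, 16) - int(target, 16))
        return min(d, size - d)

    return min(nodes, key=lambda n: (dist(n), n))


nodes = ["0700", "0F00", "FFFF"]
print(tapestry_owner(nodes, "0000"))
print(pastry_owner(nodes, "0000"))
```

The two systems disagree on this identifier, yet each answer is deterministic within its own system, which is all the service model requires.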


Notes on Tapestry Routing

Mostly same locality properties as PRR and Pastry

But compared to Pastry, very fragile

Consider previous example: 0700, 0F00, FFFF
- What if 0F00 doesn’t know about 0700?
- 0F00 will think it is the owner of 0000
- 0700 will still think it is the owner
- Mapping won’t be deterministic throughout network

Tapestry join algorithm guarantees this won’t happen
- All routing table holes that can be filled will be
- Provably correct, but tricky to implement
- Leaf set links are bidirectional, easier to keep consistent

Object Location in Tapestry

Pastry was originally just a DHT
- Support for multicast added later (RKC+01)

PRR and Tapestry are DOLRs
- Distributed Object Location and Routing

Service model
- publish (name)
- route_to_object (name, message)

Like Napster, Gnutella, and DNS
- Service does not store data, only pointers to it
- Manages a mapping of names to hosts


A Simple DOLR Implementation

Can implement a DOLR on owner service model

    publish (name) {
        n = owner (name)
        n.add_mapping (name, my_addr, my_port)
    }

    route_to_object (name, message) {
        n = owner (name)
        m = n.get_mapping (name)
        m.send_msg (message)
    }

Problems with Simple DOLR Impl.

No locality
- Even if object stored nearby, owner might be far away
- Bad for performance
- Bad for availability (owner might be behind partition)

No redundancy
- Easy to fix if underlying network has leaf/successor sets
- Just store pointers on owner’s whole leaf set
  • If owner fails, replacement already has pointers
- But Tapestry doesn’t have leaf/successor sets


Tapestry DOLR Implementation

Insight: leave “bread crumbs” along publish path
- Not just at owner

    publish (name) {
        foreach n in path_to_owner (name)
            n.add_mapping (name, my_addr, my_port)
    }

    route_to_object (name, message) {
        foreach n in path_to_owner (name)
            if ((m = n.get_mapping (name)) != null) {
                m.send_msg (message);
                break;
            }
    }

Tapestry DOLR Impl.: Experiments

[Figure: experimental results; not recoverable from the extracted text]
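The bread-crumb pseudocode on the Tapestry DOLR Implementation slide can be exercised with a small sketch. The overlay is reduced to a hypothetical parent map giving each node's next hop toward the owner of one name; the node names and the returned hop count are assumptions added for illustration.

```python
def path_to_owner(start, parent):
    """Nodes visited from `start` up to the owner (whose parent is None)."""
    path = [start]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path


def publish(name, server, parent, pointers):
    """Leave an object pointer on every node of the publish path."""
    for n in path_to_owner(server, parent):
        pointers.setdefault(n, {})[name] = server


def route_to_object(name, client, parent, pointers):
    """Walk toward the owner; stop at the first node holding a pointer.

    Returns (server, hops walked), or (None, hops) if nothing was published.
    """
    path = path_to_owner(client, parent)
    for hops, n in enumerate(path):
        server = pointers.get(n, {}).get(name)
        if server is not None:
            return server, hops
    return None, len(path) - 1


# Hypothetical next hops toward the owner of "doc".
parent = {"S": "A", "C": "A", "A": "B", "B": "root", "root": None}
pointers = {}
publish("doc", "S", parent, pointers)
print(route_to_object("doc", "C", parent, pointers))
```

Client C finds the pointer at A after one hop and never travels to the owner, which is the locality the simple owner-only design lacks.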


Tapestry DOLR Impl. Notes

Bread crumbs called “object pointers”

PRR show that overlay path from query source to object is no worse than a constant factor longer than underlying network path
- Just like in routing in a PRR tree

True for two reasons:
1. Hops early in path are short (in network latency)
2. Paths converge early

Path convergence is a little subtle
- Two nearby nodes often have same early hops
- Because next hop based on destination, not source
- And because neighbor choice weighted on latency

Path Convergence Examples

[Figure: two eight-node rings, labeled Chord and Pastry, each with nodes 000 through 111; nodes 001, 100, and 110 are “close”]

Multicast in PRR Trees

PRR Tree gives efficient paths between all nodes
- Uses application nodes as routers

Since we control the routers, can implement multicast
- Seems to have been thought of simultaneously by:
  • Pastry group, with SCRIBE protocol (RKC+01)
  • Tapestry group, with Bayeux protocol (ZZJ+01)

I’ll talk about SCRIBE


SCRIBE Protocol

Like Tapestry object location, use bread crumbs

To subscribe to multicast group i
- Walk up tree towards owner (i), leaving bread crumbs
- If find pre-existing crumb, leave one more and stop

To send a message to group i
- Send message to owner (i)
- Owner sends message to each bread crumb it has
  • Its children who are subscribers or their parents
- Each child recursively does the same

SCRIBE Example

[Figure: multicast tree with owner 370 and nodes 37F, 301, 3A4, 74D, BCE, 56B; labels: Sender, Subscribers, “Message stops here”]
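The subscribe and send rules on the SCRIBE Protocol slide can be sketched as below. As in the earlier sketches, the overlay is reduced to a hypothetical parent map (next hop toward owner (i)); the node names, the crumbs and members structures, and the sorted delivery list are assumptions for illustration rather than SCRIBE's actual data structures.

```python
from collections import defaultdict


def subscribe(node, parent, crumbs, members):
    """Walk toward the owner leaving crumbs; stop after adding a crumb
    at a node that already had one."""
    members.add(node)
    child = node
    while parent.get(child) is not None:
        p = parent[child]
        had_crumb = bool(crumbs[p])
        crumbs[p].add(child)
        if had_crumb:
            break
        child = p


def multicast(owner, crumbs, members):
    """Forward from the owner down every crumb; return who received it."""
    delivered, stack = [], [owner]
    while stack:
        n = stack.pop()
        if n in members:
            delivered.append(n)
        stack.extend(sorted(crumbs[n]))
    return sorted(delivered)


# Hypothetical next hops toward the group's owner.
parent = {"owner": None, "a": "owner", "b": "a",
          "s1": "b", "s2": "b", "s3": "a"}
crumbs, members = defaultdict(set), set()
for s in ("s1", "s2", "s3"):
    subscribe(s, parent, crumbs, members)
print(multicast("owner", crumbs, members))
```

Only the first subscriber's walk reaches the owner; s2 and s3 stop at the first node already on the tree, so the owner keeps a single crumb however many subscribers sit below it.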

SCRIBE Notes

Multicast evaluation points
- Stretch, also called Relative Delay Penalty (RDP)
- Network Stress

SCRIBE has constant RDP
- Assuming Pastry is a good PRR tree

SCRIBE has low network stress
- Harder to see, but due to choosing neighbors by latency
- Demonstrated in simulations (CJK03)

Conclusions

PRR trees are a powerful P2P primitive
- Can be used as a DHT
  • Path to owner has low RDP
  • Chord can be “hacked” to do the same
- Can be used as a DOLR
  • Finds close objects when available
  • No clear way to get this from Chord
- Can be used for application-level multicast
  • Also no clear way to get this from Chord

More work to support dynamic membership
- Pastry uses leaf sets like Chord
- Tapestry has own algorithm


For more information

Pastry
- http://research.microsoft.com/~antr/pastry/pubs.htm

Tapestry
- http://oceanstore.cs.berkeley.edu/publications/index.html