
Peer-to-Peer

15-441

Scaling Problem

• Millions of clients ⇒ server and network meltdown

P2P System

[diagram]

Why P2P?

• Harness lots of spare capacity
  – 1 Big Fast Server: 1 Gbit/s, $10k/month++
  – 2,000 cable modems: 1 Gbit/s, $ ??
  – 1M end-hosts: Uh, wow.
• Build self-managing systems / Deal with huge scale
  – Same techniques attractive for both companies / servers / p2p
• Leverage the resources of client machines (peers)
  – Computation, storage, bandwidth
  – E.g., Akamaiʼs 14,000 nodes; Googleʼs 100,000+ nodes

Outline

• p2p file sharing techniques
  – Downloading: Whole-file vs. chunks
  – Searching
    • Centralized index (Napster, etc.)
    • Flooding (Gnutella, etc.)
    • Smarter flooding (KaZaA, …)
    • Routing (Freenet, etc.)
• Uses of p2p - what works well, what doesnʼt?
  – servers vs. arbitrary nodes
  – Hard state (!) vs soft-state (caches)
• Challenges
  – Fairness, freeloading, security, …

P2P file-sharing

• Quickly grown in popularity
  – Dozens or hundreds of file sharing applications
  – 35 million American adults use P2P networks -- 29% of all Internet users in US!
  – Audio/video transfer dominates traffic on the Internet

Whatʼs out there?

                Central     Flood       Super-flood     Route
  Whole File    Napster     Gnutella                    Freenet
  Chunk Based                           KaZaA (bytes,   DHTs,
                                        not chunks)     eDonkey 2000

Searching

[diagram: a Publisher stores (Key=“title”, Value=MP3 data); a Client issues Lookup(“title”) across nodes N1–N6 over the Internet]

Searching 2

• Needles vs. Haystacks
  – Searching for top 40, or an obscure punk track from 1981 that nobodyʼs heard of?
• Search expressiveness
  – Whole word? Regular expressions? File names? Attributes? Whole-text search?
  – (e.g., p2p gnutella or p2p google?)

Framework

• Common Primitives:
  – Join: how do I begin participating?
  – Publish: how do I advertise my file?
  – Search: how do I find a file?
  – Fetch: how do I retrieve a file?


Next Topic...

• Centralized Database – Napster
• Query Flooding – Gnutella
• Intelligent Query Flooding – KaZaA
• Swarming – BitTorrent
• Unstructured Overlay Routing – Freenet
• Structured Overlay Routing – Distributed Hash Tables

Napster: History

• 1999: Shawn Fanning launches Napster
• Peaked at 1.5 million simultaneous users
• Jul 2001: Napster shuts down


Napster: Overview

• Centralized Database:
  – Join: on startup, client contacts central server
  – Publish: reports list of files to central server
  – Search: query the server => return someone that stores the requested file
  – Fetch: get the file directly from peer

Napster: Publish

[diagram: peer 123.2.21.23 tells the central server “Publish: I have X, Y, and Z!”, i.e., insert(X, 123.2.21.23), …]


Napster: Search

[diagram: client sends Query “Where is file A?” to the central server; server Reply: search(A) --> 123.2.0.18; client then Fetches directly from 123.2.0.18]

Napster: Discussion

• Pros:
  – Simple
  – Search scope is O(1)
  – Controllable (pro or con?)
• Cons:
  – Server maintains O(N) State
  – Server does all processing
  – Single point of failure
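The centralized join/publish/search flow can be sketched in a few lines. This is a hypothetical in-memory model (class and method names are illustrative, not Napster's actual protocol), showing why the server holds O(N) state while each search is a single O(1) lookup:

```python
# Hypothetical sketch of a Napster-style central index (not the real protocol).
class IndexServer:
    def __init__(self):
        self.index = {}  # filename -> set of peer addresses: O(N) server state

    def publish(self, peer_addr, files):
        # Join/Publish: on startup a peer reports its file list.
        for name in files:
            self.index.setdefault(name, set()).add(peer_addr)

    def search(self, name):
        # Search is one dictionary lookup on the server.
        return self.index.get(name, set())

server = IndexServer()
server.publish("123.2.21.23", ["X", "Y", "Z"])
server.publish("123.2.0.18", ["A"])
peers = server.search("A")  # the fetch itself then goes directly to a peer
```

Note how the fetch never touches the server: once the client has a peer address, the transfer is peer-to-peer, but the index remains a single point of failure.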


Next Topic...

• Centralized Database – Napster
• Query Flooding – Gnutella
• Intelligent Query Flooding – KaZaA
• Swarming – BitTorrent
• Unstructured Overlay Routing – Freenet
• Structured Overlay Routing – Distributed Hash Tables

Gnutella: History

• In 2000, J. Frankel and T. Pepper from Nullsoft released Gnutella
• Soon many other clients: LimeWire, etc.
• In 2001, many protocol enhancements including “ultrapeers”


Gnutella: Overview

• Query Flooding:
  – Join: on startup, client contacts a few other nodes; these become its “neighbors”
  – Publish: no need
  – Search: ask neighbors, who ask their neighbors, and so on... when/if found, reply to sender.
    • TTL limits propagation
  – Fetch: get the file directly from peer

Gnutella: Search

[diagram: Query “Where is file A?” floods from neighbor to neighbor; nodes holding file A (“I have file A.”) send a Reply back along the query path]
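TTL-limited flooding can be sketched as below. This is a hypothetical toy model (node and function names are illustrative), showing both the ask-your-neighbors recursion and how the TTL bounds how far a query spreads:

```python
# Hypothetical sketch of Gnutella-style TTL-limited query flooding.
class Node:
    def __init__(self, name, files=()):
        self.name, self.files, self.neighbors = name, set(files), []

def flood(node, filename, ttl, seen=None):
    """Ask a node; it asks its neighbors, and so on, until TTL runs out."""
    seen = set() if seen is None else seen
    if node.name in seen:
        return None               # suppress loops: don't revisit nodes
    seen.add(node.name)
    if filename in node.files:
        return node.name          # reply propagates back to the sender
    if ttl == 0:
        return None               # TTL limits propagation
    for nb in node.neighbors:
        hit = flood(nb, filename, ttl - 1, seen)
        if hit is not None:
            return hit
    return None

# A tiny chain a -- b -- c, where only c has file "A".
a, b, c = Node("a"), Node("b"), Node("c", ["A"])
a.neighbors, b.neighbors = [b], [c]
found = flood(a, "A", ttl=2)      # c is reachable within 2 hops
missed = flood(a, "A", ttl=1)     # too far: TTL expires at b
```

The `ttl=1` miss illustrates the haystack trade-off discussed later: a bounded search may fail for rare files and has to be re-issued with a larger TTL.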


Gnutella: Discussion

• Pros:
  – Fully de-centralized
  – Search cost distributed
  – Processing @ each node permits powerful search semantics
• Cons:
  – Search scope is O(N)
  – Search time is O(???)
  – Nodes leave often, network unstable
• TTL-limited search works well for haystacks.
  – For scalability, does NOT search every node. May have to re-issue query later

KaZaA: History

• In 2001, KaZaA created by Dutch company Kazaa BV
• Single network called FastTrack used by other clients as well: Morpheus, giFT, etc.
• Eventually protocol changed so other clients could no longer talk to it
• Most popular file sharing network today with >10 million users (number varies)

KaZaA: Overview

• “Smart” Query Flooding:
  – Join: on startup, client contacts a “supernode” ... may at some point become one itself
  – Publish: send list of files to supernode
  – Search: send query to supernode, supernodes flood query amongst themselves.
  – Fetch: get the file directly from peer(s); can fetch simultaneously from multiple peers

KaZaA: Network Design

[diagram: ordinary nodes attach to “Super Nodes”, which interconnect among themselves]


KaZaA: File Insert

[diagram: peer 123.2.21.23 sends Publish “I have X!” / insert(X, 123.2.21.23) to its supernode]

KaZaA: File Search

[diagram: peer 123.2.0.18 sends Query “Where is file A?” to its supernode; supernodes flood the query among themselves, and Replies return search(A) --> 123.2.22.50]


KaZaA: Fetching

• More than one node may have requested file...
• How to tell?
  – Must be able to distinguish identical files
  – Not necessarily same filename
  – Same filename not necessarily same file...
• Use Hash of file
  – KaZaA uses UUHash: fast, but not secure
  – Alternatives: MD5, SHA-1
• How to fetch?
  – Get bytes [0..1000] from A, [1001...2000] from B
  – Alternative: Erasure Codes

KaZaA: Discussion

• Pros:
  – Tries to take into account node heterogeneity:
    • Bandwidth
    • Host Computational Resources
    • Host Availability (?)
  – Rumored to take into account network locality
• Cons:
  – Mechanisms easy to circumvent
  – Still no real guarantees on search scope or search time
• Similar behavior to gnutella, but better.
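The two fetching ideas above — identify identical files by a content hash, then pull disjoint byte ranges from several peers holding that hash — can be sketched as follows. The helper names are hypothetical, and SHA-1 stands in for KaZaA's actual UUHash:

```python
import hashlib

# Hypothetical sketch: content hashing + multi-peer byte-range fetching.
def file_id(data: bytes) -> str:
    # Identity comes from the *contents*, not the filename.
    return hashlib.sha1(data).hexdigest()

def parallel_fetch(peers, size, chunk=1000):
    """peers: callables mapping (start, end) -> bytes for that range.
    Round-robin byte ranges across peers, then reassemble."""
    parts = []
    for i, start in enumerate(range(0, size, chunk)):
        end = min(start + chunk, size)
        parts.append(peers[i % len(peers)](start, end))
    return b"".join(parts)

original = bytes(range(256)) * 10            # 2560 bytes of sample content
peer = lambda s, e: original[s:e]            # two peers hold identical content
copy = parallel_fetch([peer, peer], len(original))
ok = file_id(copy) == file_id(original)      # same hash => same file
```

Matching hashes is what lets a client safely interleave ranges from peers that advertise different filenames for the same bytes.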


Stability and Superpeers

• Why superpeers?
  – Query consolidation
    • Many connected nodes may have only a few files
    • Propagating a query to a sub-node would take more b/w than answering it yourself
  – Caching effect
    • Requires network stability
• Superpeer selection is time-based
  – How long youʼve been on is a good predictor of how long youʼll be around.

BitTorrent: History

• In 2002, B. Cohen debuted BitTorrent
• Key Motivation:
  – Popularity exhibits temporal locality (Flash Crowds)
  – E.g., Slashdot effect, CNN on 9/11, new movie/game release
• Focused on Efficient Fetching, not Searching:
  – Distribute the same file to all peers
  – Single publisher, multiple downloaders
• Has some “real” publishers:
  – Blizzard Entertainment using it to distribute the beta of their new game


BitTorrent: Overview

• Swarming:
  – Join: contact centralized “tracker” server, get a list of peers.
  – Publish: Run a tracker server.
  – Search: Out-of-band. E.g., use Google to find a tracker for the file you want.
  – Fetch: Download chunks of the file from your peers. Upload chunks you have to them.
• Big differences from Napster:
  – Chunk based downloading (sound familiar? :)
  – “few large files” focus
  – Anti-freeloading mechanisms

BitTorrent: Publish/Join

[diagram: new peers contact the Tracker to join the swarm and learn about each other]

BitTorrent: Fetch

[diagram: peers exchange different chunks of the file with one another]

BitTorrent: Sharing Strategy

• Employ “Tit-for-tat” sharing strategy
  – A is downloading from some other people
    • A will let the fastest N of those download from him
  – Be optimistic: occasionally let freeloaders download
    • Otherwise no one would ever start!
    • Also allows you to discover better peers to download from when they reciprocate
• Goal: Pareto Efficiency
  – Game Theory: “No change can make anyone better off without making others worse off”
  – Does it work? (donʼt know!)
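One round of the tit-for-tat strategy can be sketched as an unchoke selection: reward the fastest N uploaders-to-us, plus one random "optimistic" slot so newcomers and would-be freeloaders can get started. The rates and peer names below are hypothetical, and this is a simplification of BitTorrent's actual choking algorithm:

```python
import random

# Hypothetical sketch of one tit-for-tat unchoke round.
def choose_unchoked(rate_to_us, n=4, rng=random):
    # Reciprocate: unchoke the n peers uploading to us the fastest.
    fastest = sorted(rate_to_us, key=rate_to_us.get, reverse=True)[:n]
    unchoked = set(fastest)
    # Optimistic unchoke: one random peer from the rest, so a peer with
    # nothing to offer yet can still bootstrap (and we discover new peers).
    rest = [p for p in rate_to_us if p not in unchoked]
    if rest:
        unchoked.add(rng.choice(rest))
    return unchoked

rates = {"p1": 50, "p2": 40, "p3": 30, "p4": 20, "free1": 0, "free2": 0}
unchoked = choose_unchoked(rates, n=4)   # top four plus one optimistic slot
```

Re-running the round periodically with fresh rate measurements is what lets peers discover better partners when they reciprocate.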

BitTorrent: Summary

• Pros:
  – Works reasonably well in practice
  – Gives peers incentive to share resources; avoids freeloaders
• Cons:
  – Pareto Efficiency relatively weak condition
  – Central tracker server needed to bootstrap swarm
  – (Tracker is a design choice, not a requirement. Could easily combine with other approaches.)

Next Topic...

• Centralized Database – Napster
• Query Flooding – Gnutella
• Intelligent Query Flooding – KaZaA
• Swarming – BitTorrent
• Unstructured Overlay Routing – Freenet
• Structured Overlay Routing – Distributed Hash Tables (DHT)


Distributed Hash Tables

• Academic answer to p2p
• Goals
  – Guaranteed lookup success
  – Provable bounds on search time
  – Provable scalability
• Makes some things harder
  – Fuzzy queries / full-text search / etc.
• Read-write, not read-only
• Hot Topic in networking since introduction in ~2000/2001

DHT: Chord Summary

• Routing table size?
  – Log N fingers
• Routing time?
  – Each hop expects to halve the distance to the desired id => expect O(log N) hops.
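The two Chord answers above can be made concrete with a toy model. This sketch assumes a fully populated 2^m identifier ring (a deliberate simplification: real Chord has N << 2^m nodes, successors, and stabilization), but it shows the log-size finger table and the distance-halving hops:

```python
# Hypothetical sketch of Chord-style routing on a fully populated ring.
M = 8                     # identifier bits; ring size is 2^M
RING = 2 ** M

def fingers(node):
    """Log-size routing table: finger i points 2^i past us on the ring."""
    return [(node + 2 ** i) % RING for i in range(M)]

def lookup(start, key):
    """Each hop takes the farthest finger that doesn't overshoot the key,
    so the remaining distance at least halves: O(log N) hops total."""
    node, hops = start, 0
    while node != key:
        dist = (key - node) % RING         # clockwise distance to the key
        best = max(i for i in range(M) if 2 ** i <= dist)
        node = (node + 2 ** best) % RING   # jump by the largest safe finger
        hops += 1
    return hops

table_size = len(fingers(0))   # M entries: log2 of the ring size
hops = lookup(0, 200)          # reaches the key in at most M hops
```

Each hop clears the highest set bit of the remaining distance, which is exactly why no lookup takes more than M = log2(ring size) hops.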


DHT: Discussion

• Pros:
  – Guaranteed Lookup
  – O(log N) per node state and search scope
• Cons:
  – No one uses them? (only one file sharing app)
  – Supporting non-exact match search is hard

When are p2p / DHTs useful?

• Caching and “soft-state” data
  – Works well! BitTorrent, KaZaA, etc., all use peers as caches for hot data
• Finding read-only data
  – Limited flooding finds hay
  – DHTs find needles
• BUT


A Peer-to-peer Google?

• Complex intersection queries (“the” + “who”)
  – Billions of hits for each term alone
• Sophisticated ranking
  – Must compare many results before returning a subset to user
• Very, very hard for a DHT / p2p system
  – Need high inter-node bandwidth
  – (This is exactly what Google does - massive clusters)

Writable, persistent p2p

• Do you trust your data to 100,000 monkeys?
• Node availability hurts
  – Ex: Store 5 copies of data on different nodes
  – When someone goes away, you must replicate the data they held
  – Hard drives are *huge*, but cable upload bandwidth is tiny - perhaps 10 Gbytes/day
  – Takes many days to upload contents of 200GB hard drive. Very expensive leave/replication situation!


P2P: Summary

• Many different styles; remember pros and cons of each
  – centralized, flooding, swarming, unstructured and structured routing
• Lessons learned:
  – Single points of failure are very bad
  – Flooding messages to everyone is bad
  – Underlying network topology is important
  – Not all nodes are equal
  – Need incentives to discourage freeloading
  – Privacy and security are important
  – Structure can provide theoretical bounds and guarantees

Extra Slides


KaZaA: Usage Patterns

• KaZaA is more than one workload!
  – Many files < 10MB (e.g., Audio Files)
  – Many files > 100MB (e.g., Movies)

from Gummadi et al., SOSP 2003

KaZaA: Usage Patterns (2)

• KaZaA is not Zipf!
  – FileSharing: “Request-once”
  – Web: “Request-repeatedly”

from Gummadi et al., SOSP 2003

KaZaA: Usage Patterns (3)

• What we saw:
  – A few big files consume most of the bandwidth
  – Many files are fetched once per client but still very popular
• Solution?
  – Caching!

from Gummadi et al., SOSP 2003

Freenet: History

• In 1999, I. Clarke started the Freenet project
• Basic Idea:
  – Employ Internet-like routing on the overlay to publish and locate files
• Additional goals:
  – Provide anonymity and security
  – Make censorship difficult

Freenet: Overview

• Routed Queries:
  – Join: on startup, client contacts a few other nodes it knows about; gets a unique node id
  – Publish: route file contents toward the file id. File is stored at node with id closest to file id
  – Search: route query for file id toward the closest node id
  – Fetch: when query reaches a node containing file id, it returns the file to the sender

Freenet: Routing Tables

• Each routing table entry holds:
  – id – file identifier (e.g., hash of file)
  – next_hop – another node that stores the file id
  – file – file identified by id being stored on the local node
• Forwarding of query for file id:
  – If file id stored locally, then stop
    • Forward data back to upstream requestor
  – If not, search for the “closest” id in the table, and forward the message to the corresponding node
  – If data is not found, failure is reported back
    • Requestor then tries next closest match in routing table
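The forwarding rule above can be sketched as steepest-descent routing with backtracking. The data structures and names here are hypothetical simplifications (integer ids instead of hashes, recursion instead of messages), but the logic follows the slide: stop if stored locally, otherwise try table entries closest-first, report failure back, and cache on the reverse path:

```python
# Hypothetical sketch of Freenet-style routed queries.
class FNode:
    def __init__(self, name, store=None):
        self.name = name
        self.store = dict(store or {})   # file id -> file contents
        self.table = {}                  # file id -> next_hop FNode

def route(node, key, seen=None):
    seen = set() if seen is None else seen
    seen.add(node.name)
    if key in node.store:
        return node.store[key]           # stored locally: send data upstream
    # Otherwise try table entries in order of closeness to the requested id,
    # falling through to the next closest match when a branch fails.
    for fid, nxt in sorted(node.table.items(), key=lambda kv: abs(kv[0] - key)):
        if nxt.name not in seen:
            data = route(nxt, key, seen)
            if data is not None:
                node.store[key] = data   # cache on the reverse path
                return data
    return None                          # failure reported back to requestor

a, b, c = FNode("a"), FNode("b"), FNode("c", {10: "file-10"})
a.table = {9: b}                         # 9 is a's closest entry to id 10
b.table = {12: c}
result = route(a, 10)                    # found via a -> b -> c
```

The reverse-path caching is what makes "close" ids cluster: after this query, node a also serves id 10, so later similar requests stop sooner.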


Freenet: Routing

[diagram: query(10) routed hop by hop through nodes n1–n6, each consulting its (id, next_hop, file) table and backtracking on failure]

Freenet: Routing Properties

• “Close” file ids tend to be stored on the same node
  – Why? Publications of similar file ids route toward the same place
• Network tends to be a “small world”
  – Small number of nodes have large number of neighbors (i.e., ~ “six-degrees of separation”)
• Consequence:
  – Most queries only traverse a small number of hops to find the file


Freenet: Anonymity & Security

• Anonymity
  – Randomly modify source of packet as it traverses the network
  – Can use “mix-nets” or onion-routing
• Security & Censorship resistance
  – No constraints on how to choose ids for files => easy to have two files collide, creating “denial of service” (censorship)
    • Solution: have an id type that requires a private key signature that is verified when updating the file
  – Cache file on the reverse path of queries/publications => attempt to “replace” file with bogus data will just cause the file to be replicated more!

Freenet: Discussion

• Pros:
  – Intelligent routing makes queries relatively short
  – Search scope small (only nodes along search path involved); no flooding
  – Anonymity properties may give you “plausible deniability”
• Cons:
  – Still no provable guarantees!
  – Anonymity features make it hard to measure, debug

