
Peer-to-Peer
15-441

Scaling Problem
• Millions of clients => server and network meltdown

P2P System
• Leverage the resources of client machines (peers)
  – Computation, storage, bandwidth

Why P2P?
• Harness lots of spare capacity
  – 1 Big Fast Server: 1 Gbit/s, $10k/month++
  – 2,000 cable modems: 1 Gbit/s, $??
  – 1M end-hosts: Uh, wow.
• Build self-managing systems / Deal with huge scale
  – Same techniques attractive for both companies / servers / p2p
  – E.g., Akamai's 14,000 nodes
  – Google's 100,000+ nodes

Outline
• p2p file sharing techniques
  – Downloading: Whole-file vs. chunks
  – Searching
    • Centralized index (Napster, etc.)
    • Flooding (Gnutella, etc.)
    • Smarter flooding (KaZaA, …)
    • Routing (Freenet, etc.)
• Uses of p2p - what works well, what doesn't?
  – servers vs. arbitrary nodes
  – Hard state (backups!) vs. soft-state (caches)
• Challenges
  – Fairness, freeloading, security, …

P2P file-sharing
• Quickly grown in popularity
  – Dozens or hundreds of file sharing applications
  – 35 million American adults use P2P networks -- 29% of all Internet users in US!
  – Audio/video transfer now dominates traffic on the Internet

What's out there?

              Central      Flood          Super-flood   Route
Whole File    Napster      Gnutella                     Freenet
Chunk Based   BitTorrent   eDonkey 2000   KaZaA         DHTs
                           (bytes, not chunks)

Searching
[Diagram: nodes N1–N6 connected across the Internet; a Publisher holds Key="title", Value=MP3 data…; a Client issues Lookup("title"). How does the query find the publisher?]

Searching 2
• Needles vs. Haystacks
  – Searching for top 40, or an obscure punk track from 1981 that nobody's heard of?
• Search expressiveness
  – Whole word? Regular expressions? File names? Attributes? Whole-text search?
    • (e.g., p2p gnutella or p2p google?)

Framework
• Common Primitives:
  – Join: how do I begin participating?
  – Publish: how do I advertise my file?
  – Search: how do I find a file?
  – Fetch: how do I retrieve a file?
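The Publish/Search primitives above are easiest to see in the centralized-index style covered next. Below is a minimal sketch of a Napster-like central server; the class name and addresses are illustrative, and a real protocol would add joins, keep-alives, and keyword (not exact-name) search:

```python
class CentralIndex:
    """Napster-style central server: filename -> set of peer addresses."""
    def __init__(self):
        self.index = {}

    def publish(self, host, files):
        # Publish: on join, a peer reports the files it is sharing.
        for f in files:
            self.index.setdefault(f, set()).add(host)

    def search(self, filename):
        # Search: a single O(1) lookup at the server; the fetch itself
        # then goes directly peer-to-peer.
        return self.index.get(filename, set())

server = CentralIndex()
server.publish("123.2.21.23", ["X", "Y", "Z"])   # "I have X, Y, and Z!"
server.publish("123.2.0.18", ["A", "X"])
```

Note how the server holds O(N) state and answers every query, which is exactly the single point of failure discussed below.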
Next Topic...
• Centralized Database – Napster
• Query Flooding – Gnutella
• Intelligent Query Flooding – KaZaA
• Swarming – BitTorrent
• Unstructured Overlay Routing – Freenet
• Structured Overlay Routing – Distributed Hash Tables

Napster: History
• 1999: Sean Fanning launches Napster
• Peaked at 1.5 million simultaneous users
• Jul 2001: Napster shuts down

Napster: Overview
• Centralized Database:
  – Join: on startup, client contacts central server
  – Publish: reports list of files to central server
  – Search: query the server => return someone that stores the requested file
  – Fetch: get the file directly from peer

Napster: Publish
[Diagram: a peer at 123.2.21.23 tells the central server "I have X, Y, and Z!" via insert(X, 123.2.21.23), ...]

Napster: Search
[Diagram: a client asks the central server "Where is file A?" via search(A); the server replies 123.2.0.18; the client fetches A directly from 123.2.0.18]

Napster: Discussion
• Pros:
  – Simple
  – Search scope is O(1)
  – Controllable (pro or con?)
• Cons:
  – Server maintains O(N) state
  – Server does all processing
  – Single point of failure

Next Topic: Query Flooding – Gnutella

Gnutella: History
• In 2000, J. Frankel and T. Pepper from Nullsoft released Gnutella
• Soon many other clients: Bearshare, Morpheus, LimeWire, etc.
• In 2001, many protocol enhancements including "ultrapeers"

Gnutella: Overview
• Query Flooding:
  – Join: on startup, client contacts a few other nodes; these become its "neighbors"
  – Publish: no need
  – Search: ask neighbors, who ask their neighbors, and so on... when/if found, reply to sender.
    • TTL limits propagation
  – Fetch: get the file directly from peer

Gnutella: Search
[Diagram: "Where is file A?" floods from neighbor to neighbor; nodes holding file A reply back toward the sender]
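The TTL-limited flooding just described can be sketched on a toy graph. This is illustrative only: the topology and node names are invented, and real Gnutella uses message IDs for duplicate suppression and routes replies along the reverse path rather than collecting results centrally.

```python
# Toy sketch of Gnutella-style query flooding: each node asks its
# neighbors, who ask theirs, until the TTL runs out.
def flood_search(graph, files, start, filename, ttl):
    """Return the set of nodes holding `filename` within `ttl` hops of `start`."""
    found = set()
    visited = {start}
    frontier = [start]
    while frontier and ttl > 0:
        next_frontier = []
        for node in frontier:
            for nbr in graph[node]:
                if nbr in visited:
                    continue              # duplicate suppression
                visited.add(nbr)
                if filename in files.get(nbr, ()):
                    found.add(nbr)
                next_frontier.append(nbr)
        frontier = next_frontier
        ttl -= 1                          # TTL limits propagation
    return found

graph = {"n1": ["n2", "n3"], "n2": ["n1", "n4"],
         "n3": ["n1", "n4"], "n4": ["n2", "n3", "n5"], "n5": ["n4"]}
files = {"n4": ["A"], "n5": ["A"]}
```

A small TTL finds nearby copies cheaply (good for haystacks); finding a rare file may require re-issuing the query with a larger TTL, which is the scalability problem discussed next.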
Gnutella: Discussion
• Pros:
  – Fully de-centralized
  – Search cost distributed
  – Processing @ each node permits powerful search semantics
• Cons:
  – Search scope is O(N)
  – Search time is O(???)
  – Nodes leave often, network unstable
• TTL-limited search works well for haystacks.
  – For scalability, does NOT search every node. May have to re-issue query later

KaZaA: History
• In 2001, KaZaA created by Dutch company Kazaa BV
• Single network called FastTrack used by other clients as well: Morpheus, giFT, etc.
• Eventually protocol changed so other clients could no longer talk to it
• Most popular file sharing network today with >10 million users (number varies)

KaZaA: Overview
• "Smart" Query Flooding:
  – Join: on startup, client contacts a "supernode" ... may at some point become one itself
  – Publish: send list of files to supernode
  – Search: send query to supernode; supernodes flood query amongst themselves.
  – Fetch: get the file directly from peer(s); can fetch simultaneously from multiple peers

KaZaA: Network Design
[Diagram: ordinary nodes connect to "Super Nodes", which are interconnected with one another]

KaZaA: File Insert
[Diagram: a peer at 123.2.21.23 tells its supernode "I have X!" via insert(X, 123.2.21.23)]

KaZaA: File Search
[Diagram: a client asks its supernode "Where is file A?" via search(A); the query floods among the supernodes; replies point to 123.2.0.18 and 123.2.22.50]

KaZaA: Fetching
• More than one node may have requested file...
• How to tell?
  – Must be able to distinguish identical files
  – Not necessarily same filename
  – Same filename not necessarily same file...
• Use Hash of file
  – KaZaA uses UUHash: fast, but not secure
  – Alternatives: MD5, SHA-1
• How to fetch?
  – Get bytes [0..1000] from A, [1001..2000] from B

KaZaA: Discussion
• Pros:
  – Tries to take into account node heterogeneity:
    • Bandwidth
    • Host Computational Resources
    • Host Availability (?)
  – Rumored to take into account network locality
• Cons:
  – Mechanisms easy to circumvent
  – Still no real guarantees on search scope or search time
• Similar behavior to Gnutella, but better.
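The multi-source fetching idea above ("bytes [0..1000] from A, [1001..2000] from B", identified by a content hash) can be sketched as follows. Peers are simulated as in-memory byte strings rather than network connections, and SHA-1 stands in for KaZaA's actual (insecure) UUHash:

```python
import hashlib

def fetch_chunked(peers, length, chunk_size, expected_sha1):
    """Fetch byte ranges round-robin from peers, then verify the whole file."""
    data = bytearray(length)
    sources = list(peers.values())
    for i, start in enumerate(range(0, length, chunk_size)):
        end = min(start + chunk_size, length)
        src = sources[i % len(sources)]       # e.g. bytes [0..999] from A,
        data[start:end] = src[start:end]      # [1000..1999] from B, ...
    if hashlib.sha1(bytes(data)).hexdigest() != expected_sha1:
        raise ValueError("hash mismatch: corrupt or different file")
    return bytes(data)

original = b"x" * 1500 + b"y" * 1500          # the 3000-byte "file"
digest = hashlib.sha1(original).hexdigest()   # content hash, not filename
peers = {"A": original, "B": original}        # two peers holding the same file
```

Keying on the hash (not the filename) is what lets a client recognize that two differently-named copies are the same file, and the final check catches a peer that served different content.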
  – Alternative: Erasure Codes (cont. from KaZaA: Fetching)

Stability and Superpeers
• Why superpeers?
  – Query consolidation
    • Many connected nodes may have only a few files
    • Propagating a query to a sub-node would take more b/w than answering it yourself
  – Caching effect
    • Requires network stability
• Superpeer selection is time-based
  – How long you've been on is a good predictor of how long you'll be around.

BitTorrent: History
• In 2002, B. Cohen debuted BitTorrent
• Key Motivation:
  – Popularity exhibits temporal locality (Flash Crowds)
  – E.g., Slashdot effect, CNN on 9/11, new movie/game release
• Focused on Efficient Fetching, not Searching:
  – Distribute the same file to all peers
  – Single publisher, multiple downloaders
• Has some "real" publishers:
  – Blizzard Entertainment using it to distribute the beta of their new game

BitTorrent: Overview
• Swarming:
  – Join: contact centralized "tracker" server, get a list of peers.
  – Publish: Run a tracker server.
  – Search: Out-of-band. E.g., use Google to find a tracker for the file you want.
  – Fetch: Download chunks of the file from your peers. Upload chunks you have to them.
• Big differences from Napster:
  – Chunk based downloading (sound familiar? :)
  – "few large files" focus
  – Anti-freeloading mechanisms

BitTorrent: Publish/Join
[Diagram: a new peer contacts the tracker, gets a peer list, and joins the swarm]

BitTorrent: Fetch
[Diagram: peers simultaneously exchange different chunks of the file with one another]

BitTorrent: Sharing Strategy
• Employ "Tit-for-tat" sharing strategy
  – A is downloading from some other people
    • A will let the fastest N of those download from him
  – Be optimistic: occasionally let freeloaders download
    • Otherwise no one would ever start!
    • Also allows you to discover better peers to download from when they reciprocate
• Goal: Pareto Efficiency
  – Game Theory: "No change can make anyone better off without making others worse off"
  – Does it work? (don't know!)
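The tit-for-tat rule above amounts to a simple selection problem: unchoke the N peers currently uploading to you fastest, plus one peer at random (the optimistic unchoke). A minimal sketch, with invented peer names and rates; real clients re-run this selection periodically rather than once:

```python
import random

def choose_unchoked(upload_rates, n, rng=random):
    """upload_rates: peer -> observed rate at which they upload to us (B/s).
    Returns the set of peers we allow to download from us."""
    ranked = sorted(upload_rates, key=upload_rates.get, reverse=True)
    unchoked = set(ranked[:n])            # tit-for-tat: reward the fastest N
    rest = ranked[n:]
    if rest:
        unchoked.add(rng.choice(rest))    # optimistic unchoke: give a
    return unchoked                       # newcomer/freeloader a chance

rates = {"p1": 50_000, "p2": 120_000, "p3": 5_000, "p4": 80_000, "p5": 0}
```

The optimistic slot is what bootstraps the system (a brand-new peer has nothing to trade yet) and is also how a peer discovers faster partners than its current ones.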
BitTorrent: Summary
• Pros:
  – Works reasonably well in practice
  – Gives peers incentive to share resources; avoids freeloaders
• Cons:
  – Pareto Efficiency relatively weak condition
  – Central tracker server needed to bootstrap swarm
  – (Tracker is a design choice, not a requirement. Could easily combine with other approaches.)

Next Topic: Structured Overlay Routing – Distributed Hash Tables (DHT)

Distributed Hash Tables
• Academic answer to p2p
• Goals
  – Guaranteed lookup success
  – Provable bounds on search time
  – Provable scalability
• Makes some things harder
  – Fuzzy queries / full-text search / etc.
• Read-write, not read-only
• Hot Topic in networking since introduction in ~2000/2001

DHT: Chord Summary
• Routing table size?
  – Log N fingers
• Routing time?
  – Each hop expects to halve the distance to the desired id => expect O(log N) hops.

DHT: Discussion
• Pros:
  – Guaranteed Lookup
  – O(log N) per-node state and search scope
• Cons:
  – No one uses them? (only one file sharing app)
  – Supporting non-exact-match search is hard

When are p2p / DHTs useful?
• Caching and "soft-state" data
  – Works well! BitTorrent, KaZaA, etc., all use peers as caches for hot data
• Finding read-only data
  – Limited flooding finds hay
  – DHTs find needles
• BUT...

A Peer-to-peer Google?
• Complex intersection queries ("the" + "who")

Writable, persistent p2p
• Do you trust your data to 100,000 monkeys?
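Returning to the Chord summary above (log N fingers, each hop roughly halving the remaining distance), here is a sketch on a small 6-bit id ring. The ring membership is invented for illustration, and global knowledge of the node set stands in for real per-node routing state:

```python
M = 6                                            # id space: 2^6 = 64 ids
NODES = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]   # example ring membership

def successor(k):
    """First node clockwise from id k (the node responsible for key k)."""
    for n in NODES:
        if n >= k % 2**M:
            return n
    return NODES[0]                              # wrap around the ring

def fingers(n):
    """Finger table of node n: successor(n + 2^i) for i = 0..M-1."""
    return [successor((n + 2**i) % 2**M) for i in range(M)]

def between(x, a, b):
    """True if x lies in the open ring interval (a, b)."""
    return a < x < b if a < b else (x > a or x < b)

def lookup(start, key):
    """Greedy finger routing; returns (node owning key, hop count)."""
    node, hops = start, 0
    while True:
        nxt = successor((node + 1) % 2**M)           # immediate successor
        if key == nxt or between(key, node, nxt):    # key in (node, nxt]
            return nxt, hops
        # forward to the closest preceding finger, which roughly halves
        # the remaining distance to key
        preceding = [f for f in fingers(node) if between(f, node, key)]
        node = max(preceding, key=lambda f: (f - node) % 2**M) if preceding else nxt
        hops += 1
```

With 10 nodes, lookups here finish in a couple of hops, consistent with the O(log N) bound; each node's routing state is just its M fingers.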
A Peer-to-peer Google? (cont.)
  – Billions of hits for each term alone
• Sophisticated ranking
  – Must compare many results before returning a subset to user
• Very, very hard for a DHT / p2p system
  – Need high inter-node bandwidth
  – (This is exactly what Google does - massive clusters)

Writable, persistent p2p (cont.)
• Node availability hurts
  – Ex: Store 5 copies of data on different nodes
  – When someone goes away, you must replicate the data they held
  – Hard drives are *huge*, but cable modem upload bandwidth is tiny - perhaps 10 GBytes/day
  – Takes many days to upload the contents of a 200GB hard drive. Very expensive leave/replication situation!

P2P: Summary
• Many different styles;
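The "many days" claim in the replication discussion above follows directly from the slide's own figures (a 200 GB drive and roughly 10 GBytes/day of cable-modem uplink):

```python
# Back-of-the-envelope check of the replication cost claimed above.
drive_gb = 200            # contents of a departed node's hard drive (GB)
uplink_gb_per_day = 10    # rough cable-modem upload capacity (GB/day)

days_to_replicate = drive_gb / uplink_gb_per_day   # time to re-upload it all
```

Three weeks of saturated uplink per departure is why churn makes writable, persistent p2p storage so expensive.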