CS555: Distributed Systems [Fall 2019] Dept. Of Computer Science, Colorado State University

CS 555: DISTRIBUTED SYSTEMS [ & DISTRIBUTED COMPUTING ECONOMICS]

Shrideep Pallickara
Computer Science
Colorado State University

September 24, 2019
Professor: SHRIDEEP PALLICKARA, Dept. Of Computer Science, Colorado State University

Frequently asked questions from the previous class survey

¨ Difference in routing in the network space vs ID space
¨ Can be viewed as a semi-structured P2P system?

Topics covered in this lecture

¨ BitTorrent

¨ Distributed Computing Economics

BITTORRENT


BitTorrent: Traffic statistics

¨ In November 2004
  ¤ Responsible for 25% of all traffic
¨ February 2013
  ¤ 3.35% of all worldwide bandwidth
  ¤ > 50% of the 6% total bandwidth dedicated to

BitTorrent

¨ Designed for downloading large files
¨ Not intended for real-time routing of content
¨ Relies on capabilities of ordinary user machines


BitTorrent: Key concepts

¨ Instead of downloading a file from a single source server
  ¤ Users join a swarm of hosts to upload-to/download-from simultaneously
¨ Several basic commodity machines can replace large, customized servers
  ¤ Without compromising on efficiency
  ¤ In fact, lower bandwidth usage with swarms prevents large internet traffic spikes

Segmented file transfer [1/2]

¨ File being transferred is divided into fixed-size segments called chunks (or pieces)
  ¤ Chunks are of the same size throughout a single download (10MB file: 10 1MB chunks or 40 256KB chunks)
¨ Chunks are downloaded non-sequentially and rearranged into the correct order by BitTorrent
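The segmented transfer idea can be illustrated with a minimal sketch. The function names `split_into_chunks` and `reassemble` are hypothetical helpers for this example, not part of any BitTorrent client; the sketch assumes the whole file fits in memory and that chunks are tracked by their index.

```python
def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Split data into fixed-size chunks; only the last chunk may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def reassemble(received: dict[int, bytes]) -> bytes:
    """Rebuild the file from chunks that arrived in arbitrary order.

    `received` maps chunk index to chunk contents.
    """
    return b"".join(received[i] for i in sorted(received))


KB = 1024
MB = 1024 * KB

file_bytes = bytes(10 * MB)  # stand-in for a 10MB file
print(len(split_into_chunks(file_bytes, 1 * MB)))    # number of 1MB chunks
print(len(split_into_chunks(file_bytes, 256 * KB)))  # number of 256KB chunks
```

Because every chunk carries its index, the receiver can accept chunks in whatever order they become available and still restore the original byte order at the end.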


Segmented file transfer [2/2]

¨ Advantages:
  ¤ File transfers can be stopped at any time and resumed
    § Without loss of previously downloaded content
  ¤ Clients seek out readily available chunks, rather than waiting for an unavailable (next in sequence) chunk

BitTorrent: Protocol summary

¨ Splits files into fixed-sized chunks
¨ Chunks are then made available at various peers across the P2P network
¨ Clients can download a number of chunks in parallel from different sites
  ¤ Reduces the burden on a particular peer to service the entire download


The BitTorrent protocol

¨ When a file is made available in BitTorrent, a .torrent file is created
  ¤ Holds metadata associated with that file
¨ Metadata
  ¤ The name and length of the file
  ¤ Location of a tracker (URL)
    § Centralized server that manages download for that file
  ¤ Checksum
    § Associated with each chunk
    § Generated using the SHA-1 algorithm

Advantages of hashing chunks

¨ Each chunk has a cryptographic hash in the torrent descriptor
¨ Modifications of chunks can be reliably detected
  ¤ Prevents accidental and malicious modifications
¨ If a node starts with an authentic/legitimate torrent descriptor?
  ¤ It can verify the authenticity of the entire file that it receives
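Per-chunk verification can be sketched as follows. This is an illustration only: the dictionary layout, the function names, and the tracker URL are assumptions made for the example and do not reproduce the real .torrent file encoding. The only library call used is `hashlib.sha1` from the Python standard library.

```python
import hashlib


def make_descriptor(name: str, data: bytes, chunk_size: int) -> dict:
    """Build a simplified, hypothetical torrent descriptor for `data`."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return {
        "name": name,
        "length": len(data),
        "chunk_size": chunk_size,
        "tracker": "http://tracker.example.org/announce",  # hypothetical URL
        # One SHA-1 checksum per chunk, in chunk order.
        "checksums": [hashlib.sha1(c).digest() for c in chunks],
    }


def verify_chunk(descriptor: dict, index: int, chunk: bytes) -> bool:
    """Return True if `chunk` matches the checksum recorded for `index`."""
    return hashlib.sha1(chunk).digest() == descriptor["checksums"][index]


descriptor = make_descriptor("example.bin", b"hello world!", chunk_size=4)
print(verify_chunk(descriptor, 0, b"hell"))  # authentic chunk
print(verify_chunk(descriptor, 0, b"hexl"))  # modified chunk
```

A peer that trusts the descriptor can check each chunk as it arrives and discard any that fail, so a complete set of verified chunks implies an authentic file.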


The swarm or torrent for a particular file includes

¨ Tracker
¨ Seeders
¨ Leechers

Trackers

¨ The use of trackers compromises a core P2P principle
  ¤ But simplifies the system
¨ Trackers are responsible for tracking the download status for a particular file


The roles of participants in BitTorrent: Seeder

¨ Peer with a complete version of a file (i.e. with all its chunks) is known as a seeder
¨ Peer that initially creates the file provides the initial seed for file distribution

The roles of participants in BitTorrent: Leechers

¨ Peers that want to download a file are known as leechers
  ¤ A given leecher, at any given time, contains a number of chunks for that file
¨ Once a leecher downloads all chunks for a file, it can become a seeder for subsequent downloads
  ¤ Files spread virally based on demand


When a peer wants to download a file

¨ Contacts the tracker
¨ Is given a partial view of the torrent
  ¤ The set of peers that can support the download
  ¤ The tracker does not participate in scheduling the downloads
    § Decentralized
¨ Chunks are requested and transmitted in any order

Incentive mechanism: Quid pro quo

¨ Gives downloading preference to peers who have previously uploaded to the site
  ¤ Encourages concurrent uploads/downloads to make better use of bandwidth
¨ A peer supports downloads from n simultaneous peers by unchoking these peers
  ¤ Decisions based on rolling calculations of download rates
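The unchoking decision in the quid pro quo mechanism can be sketched as picking the n peers with the best rolling download rate. The class `RollingRate`, the function `choose_unchoked`, and the window size are hypothetical choices for this sketch; real clients apply further refinements that are left out here.

```python
from collections import deque


class RollingRate:
    """Average of the most recent `window` rate samples (bytes/sec)."""

    def __init__(self, window: int = 5):
        self.samples = deque(maxlen=window)  # old samples fall off automatically

    def record(self, bytes_per_sec: float) -> None:
        self.samples.append(bytes_per_sec)

    def value(self) -> float:
        return sum(self.samples) / len(self.samples) if self.samples else 0.0


def choose_unchoked(rates: dict[str, RollingRate], n: int) -> set[str]:
    """Return the n peers we currently download from fastest."""
    ranked = sorted(rates, key=lambda peer: rates[peer].value(), reverse=True)
    return set(ranked[:n])


rates = {name: RollingRate() for name in ("a", "b", "c")}
for sample in (900, 1100):
    rates["a"].record(sample)   # averages 1000 bytes/sec
rates["b"].record(100)          # averages 100 bytes/sec
rates["c"].record(500)          # averages 500 bytes/sec
print(sorted(choose_unchoked(rates, n=2)))
```

Peers that upload more to us rank higher and are the ones we serve in return, which is the incentive the slide describes.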


Scheduling downloads

¨ Rarest first scheduling policy
¨ Peer prioritizes chunk that is rarest among its set of connected peers
¨ Ensures that chunks that are not widely available spread rapidly

How BitTorrent differs from a classic download

                 BitTorrent                                  Classic download
Connections      Many small data requests over different    One TCP connection
                 IP connections to different machines       to one machine
Download Order   Random or "rarest first" to ensure          Sequential
                 high availability

** Allows BitTorrent to achieve lower cost, higher redundancy, and resistance to abuse
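The rarest first scheduling policy can be sketched as counting, for each chunk still needed, how many connected peers hold it and requesting the one with the fewest holders. The function name `rarest_first` and the tie-breaking rule (lowest chunk index) are assumptions for this sketch, chosen only to keep the result deterministic.

```python
from collections import Counter
from typing import Optional


def rarest_first(needed: set[int],
                 peer_chunks: dict[str, set[int]]) -> Optional[tuple[int, list[str]]]:
    """Pick the needed chunk held by the fewest connected peers.

    Returns (chunk index, peers holding it), or None if no connected
    peer has any chunk we still need.
    """
    counts = Counter()
    for chunks in peer_chunks.values():
        counts.update(chunks & needed)  # only count chunks we still need
    if not counts:
        return None
    # Fewest holders wins; ties go to the lowest index (an assumption).
    chunk = min(counts, key=lambda c: (counts[c], c))
    holders = sorted(p for p, chunks in peer_chunks.items() if chunk in chunks)
    return chunk, holders


peers = {"p1": {0, 1, 2}, "p2": {0, 1}, "p3": {0, 3}}
print(rarest_first({0, 1, 2, 3}, peers))
```

Requesting scarce chunks first creates additional copies of them in the swarm, which is why the policy helps chunks that are not widely available spread rapidly.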


BitTorrent: Advantages

¨ Advantages
  ¤ Lower costs, greater redundancy, higher resistance to abuse or "flash crowds"
¨ Shortcomings
  ¤ Non-contiguous download precludes progressive download
  ¤ No streaming playback
    § Beta BitTorrent Streaming protocol was made available for testing in 2013; this was not successful
    § New service BitTorrent Live was released as Public Beta in Spring 2019.

BitTorrent: Shortcomings

¨ Downloads can take time to rise to full speed
  ¤ May take time for enough peer connections to be established
  ¤ Takes time for a node to receive data to become an effective uploader
¨ Regular (non-BitTorrent/traditional) downloads on the other hand:
  ¤ Rise to full speed very quickly and maintain this speed throughout


But how do you find a torrent?

¨ Browsing the web or by some other means
  ¤ Open it with a BitTorrent client
¨ Client connects to trackers in the torrent file and finds peers
  ¤ If swarm contains only the initial seeder, client connects directly to it and begins to request pieces

Support for trackerless Torrents

¨ Azureus (now ) supported this first
¨ Mainline BitTorrent provides a DHT based implementation
  ¤ Mainline DHT
  ¤ Kademlia-based Distributed Hash Table (DHT) used by BitTorrent clients


DISTRIBUTED COMPUTING ECONOMICS

JIM GRAY. Distributed Computing Economics. Technical Report: MSR-TR-2003-24. Microsoft Research. July 2003.

Outsourcing allows smaller services to benefit from mega services

¨ Automate the routine
  ¤ Harness economies-of-scale
¨ Companies outsource payroll, insurance, web presence, and e-mail
  ¤ Universities have tied-up with for e-mail for instance


Outsourcing works under certain conditions …

¨ Should be a service business
  ¤ And computing should be CENTRAL
    § To operating and supporting the customer
¨ Application should be nearly identical across companies
  ¤ Payroll, E-mail
    § Exception not the rule

Distributed computing does not have an outsourcing or business model

¨ Designed for computer-to-computer interactions
  ¤ No eyeballs involved
¨ Need new business models to make profit
  ¤ Enter the notion of leasing in modern Cloud systems


Baseline hardware parameters

¨ 2 GHz CPU with 2 GB RAM = $2000
¨ 200 GB disk = $200
  ¤ 100 access/sec
  ¤ 50 MB/sec transfer speed
¨ 1 Gbps port-pair = $200
¨ 1 Mbps WAN link = $50/month

Note: Numbers are circa 2003

1 dollar buys you

¨ 1 GB transfer over WAN
¨ 8 hours of CPU time
¨ 10 tops (Tera CPU operations)
¨ 1 GB disk space for 3 years
¨ 10 M database accesses
¨ 10 TB of sequential disk access
¨ 10 TB of LAN Bandwidth (bulk)
¨ 10 KWhrs = 4 days of computer time


Caveats

¨ Beowulf clusters have different networking economics
  ¤ Networking costs comparable to disk bandwidth
    § 10,000 times cheaper than price of Internet transports
  ¤ Do not confuse with Internet-scale computations
¨ If telecom costs drop faster than Moore's law … analysis fails
  ¤ Over past 40 years telecom costs have fallen the slowest

The right abstraction level for Internet Distributed Computing

¨ Disk Block? No
¨ File? No
¨ Database? No
¨ Applications? Yes
  ¤ BLAST search
  ¤ Google search
  ¤ Send/GET e-mail


Computing on-demand enables mobile applications

¨ Tasks are mobile
¨ Computing is dynamically provisioned
¨ Write-once-run-anywhere (WORA)
  ¤ Java
  ¤ COBOL

A computation task has 4 demands that must be met

① Networking: Questions & Answers
② Computation: Transform data/info into new information
③ Database/File Access: Access to reference information
④ Database/File Storage: Long term storage


Ratios of demands and the relative costs are pivotal

¨ OK to send GB of data if it saves years of computation
¨ NOT OK to send KB of data over network
  ¤ If computation can be performed locally

Ideal mobile computation task

¨ Stateless
  ¤ No disk access
¨ Tiny network input or output
¨ Huge computational demand
¨ Examples:
  ¤ Cryptographic search
    § {encrypted text, clear text, key search range}
  ¤ Monte Carlo simulation
  ¤ SETI@HOME


How do you move a Terabyte?

Context      Speed (Mbps)   Rent ($/month)   $/Mbps   $/TB Sent   Time/TB
Home phone   0.04           40               1,000    3,086       6 years
Home DSL     0.6            50               117      360         5 months
T1           1.5            1,200            800      2,469       2 months
T3           43             28,000           651      2,010       2 days
OC3          155            49,000           316      976         14 hours
OC 192       9600           1,920,000        200      617         14 minutes
100 Mbps     100                                                  1 day
Gbps         1000                                                 2.2 hours

Source: TeraScale Sneakernet, Microsoft Research, Jim Gray; Wyman Chong; Tom Barclay; Alex Szalay; Jan vandenBerg

Why SETI@HOME is a good deal

¨ Sends out 10^9 jobs: each is 300 KB
¨ Network costs
  ¤ 1 GB = $1
  ¤ 1 MB = 10^-3 $
  ¤ 100 KB = 10^-4 $
¨ Compute Cost = 0.5$
¨ Compute Cost/Network Cost = 0.5/(3*10^-4)
  ¤ Approx: 1600:1
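The SETI@HOME ratio follows directly from the $1 per GB WAN price. The sketch below redoes the arithmetic; the constant and function names are hypothetical, and decimal units (1 GB = 10^6 KB) are assumed because they match the slide's 100 KB = 10^-4 $ figure.

```python
WAN_DOLLARS_PER_GB = 1.0   # from the slides: 1 GB over the WAN costs $1
KB_PER_GB = 1_000_000      # decimal units (assumption, matches 100 KB = 10^-4 $)


def network_cost_dollars(size_kb: float) -> float:
    """WAN cost of shipping `size_kb` kilobytes."""
    return size_kb / KB_PER_GB * WAN_DOLLARS_PER_GB


job_size_kb = 300          # size of one SETI@HOME job
compute_cost = 0.5         # dollars of compute per job, as quoted on the slide

network_cost = network_cost_dollars(job_size_kb)
ratio = compute_cost / network_cost

print(f"network cost per job: ${network_cost:.4f}")
print(f"compute : network = {ratio:.0f} : 1")
```

The computed ratio lands in the same range as the roughly 1600:1 quoted on the slide: the job ships very little data relative to the computation it buys.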

Consequences

¨ The cheapest & fastest way to move Terabytes cross country is sneakernet
  ¤ 24 hours
  ¤ $50 shipping vs $1000 WAN cost
¨ Sending 10PB CERN data via network is silly:
  ① Buy disk bricks in Geneva
  ② Fill them
  ③ Ship them

TeraScale SneakerNet: Using Inexpensive Disks for Backup, Archiving, and Data Exchange. Jim Gray; Wyman Chong; Tom Barclay; Alex Szalay; Jan vandenBerg. Microsoft Technical Report, May 2002, MSR-TR-2002-54. http://research.microsoft.com/research/pubs/view.aspx?tr_id=569

Web Data processing systems

¨ Network or State intensive
¨ 100 MB FTP task = 10 cents
  ¤ 99% network cost
¨ HTML webpage access
  ¤ 10^-6 dollars, 88% network cost
¨ Hotmail
  ¤ 10^-5 dollars; some balance in CPU and network costs
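The Time/TB figures in the "How do you move a Terabyte?" table, and the 24-hour sneakernet figure they are compared against, can be checked with a short calculation. This is a sketch under stated assumptions: 1 TB is taken as 10^12 bytes, links run at their full nominal speed, and a month is taken as 30 days. The helper names are hypothetical.

```python
def seconds_per_terabyte(speed_mbps: float) -> float:
    """Time to push 1 TB (10^12 bytes) through a link of `speed_mbps`."""
    bits = 8 * 10**12
    return bits / (speed_mbps * 10**6)


# Largest unit first; a month is approximated as 30 days.
UNITS = [("years", 365 * 24 * 3600), ("months", 30 * 24 * 3600),
         ("days", 24 * 3600), ("hours", 3600), ("minutes", 60)]


def humanize(seconds: float) -> str:
    """Express a duration in the largest unit that fits."""
    for name, size in UNITS:
        if seconds >= size:
            return f"{seconds / size:.1f} {name}"
    return f"{seconds:.0f} seconds"


def beats_shipping(speed_mbps: float, shipping_hours: float = 24) -> bool:
    """True if the link moves 1 TB faster than shipping disks would."""
    return seconds_per_terabyte(speed_mbps) < shipping_hours * 3600


links = {"Home phone": 0.04, "Home DSL": 0.6, "T1": 1.5, "T3": 43,
         "OC3": 155, "OC 192": 9600, "100 Mbps": 100, "Gbps": 1000}
for name, mbps in links.items():
    t = seconds_per_terabyte(mbps)
    print(f"{name:10s} {humanize(t):>14s}  beats 24h shipping: {beats_shipping(mbps)}")
```

Only the fastest links in the table finish a terabyte inside the shipping window, and the table's rent column shows what that bandwidth costs per month.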

Why was a good deal

¨ 5 MB song
  ¤ Network cost = 5 x 10^-3 $ = ½ a penny
¨ Both sender and receiver could afford it
¨ Yahoo! Serving web pages
  ¤ 10^-3 $ in advertising revenue per page
  ¤ 10^-5 $ total cost in serving web page
  ¤ ROI: 100:1

Computations that are not economically viable

¨ Data loading and data scanning tasks
  ¤ CPU-intensive; but also data intensive.
  ¤ Therefore not economically viable as mobile applications.
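The per-item costs quoted for the 5 MB song and the 100 MB FTP task, and the web-page ROI, all follow from the $1 per GB WAN price and the revenue and cost figures on the slides. A minimal sketch, assuming decimal units (1 GB = 1000 MB) and using hypothetical names:

```python
WAN_DOLLARS_PER_GB = 1.0  # from the slides: 1 GB over the WAN costs $1
MB_PER_GB = 1000          # decimal units (assumption)


def wan_cost_dollars(size_mb: float) -> float:
    """WAN cost of transferring `size_mb` megabytes."""
    return size_mb / MB_PER_GB * WAN_DOLLARS_PER_GB


song_cost = wan_cost_dollars(5)     # the 5 MB song
ftp_cost = wan_cost_dollars(100)    # the 100 MB FTP task

revenue_per_page = 1e-3             # advertising revenue per page ($)
cost_per_page = 1e-5                # total cost of serving a page ($)
roi = revenue_per_page / cost_per_page

print(f"5 MB song:  ${song_cost:.3f}")
print(f"100 MB FTP: ${ftp_cost:.2f}")
print(f"page ROI:   {roi:.0f}:1")
```

The same price list makes a small transfer affordable for both parties and a bulk data-scanning task expensive, since the latter has to ship every byte it touches.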


Break even point for mobile computation tasks

¨ 10 Tops & 1 GB of networking both cost $1
¨ Break-even point
  ¤ 10,000 instructions per byte of network traffic
¨ Outsourcing becomes attractive when the cost-benefit ratio involves
  ¤ 30,000 instructions per byte

The type of network also matters

¨ LAN is 10,000 times cheaper than WAN
¨ Computational Fluid Dynamics
  ¤ Simulate crack propagation in an Object
  ¤ 100 MB input, 10 GB output, 7 CPU years
  ¤ 10^6 instructions per byte: so good for WAN
    § But needs to be executed in a tightly connected cluster
    § Cluster networking is free when compared to WAN networking
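The break-even point comes from dividing what $1 buys in computation by what it buys in WAN traffic. The sketch below reproduces that division and applies the 30,000 instructions-per-byte threshold from the slide; the constant and function names are hypothetical.

```python
TOPS_PER_DOLLAR = 10        # tera CPU operations that $1 buys
WAN_GB_PER_DOLLAR = 1       # gigabytes of WAN transfer that $1 buys
ATTRACTIVE_THRESHOLD = 30_000  # instructions per byte, from the slide


def break_even_instructions_per_byte() -> int:
    """Instructions per byte at which compute cost equals WAN cost."""
    instructions = TOPS_PER_DOLLAR * 10**12
    network_bytes = WAN_GB_PER_DOLLAR * 10**9
    return instructions // network_bytes


def outsourcing_attractive(instructions_per_byte: float) -> bool:
    """True once the task is dense enough to be worth shipping over the WAN."""
    return instructions_per_byte >= ATTRACTIVE_THRESHOLD


print(break_even_instructions_per_byte())
print(outsourcing_attractive(10**6))   # the CFD example's density
print(outsourcing_attractive(10_000))  # a task sitting exactly at break-even
```

A task at the break-even density only recovers its networking cost, which is why the slide asks for a comfortable margin above it before outsourcing pays off.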


Toy Story 2

¨ A 200 MB image takes several CPU hours to render
¨ Instruction density
  ¤ 200-600 x 10^3 instructions per byte
¨ Send 50 MB task; compute for 10 hours;
  ¤ Return 200 MB image!

Bioinformatics systems

¨ BLAST, FASTA and Smith-Waterman
  ¤ Algorithms for matching DNA sequences against a database (GenBank or SwissProt).
  ¤ Database sizes 50 GB
¨ Does it make sense to send SwissProt (40GB) to a server if processing (7220 hrs) is free?
  ¤ Yes
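Both examples can be checked against the earlier price list. The sketch below is one possible reading, not the lecture's own derivation: it assumes the 2 GHz baseline CPU retires one instruction per cycle, counts both the 50 MB sent and the 200 MB returned as network traffic, and prices the SwissProt case with the "$1 buys 1 GB of WAN transfer" and "$1 buys 8 hours of CPU time" figures. All names are hypothetical.

```python
CPU_HZ = 2 * 10**9          # assumption: 2 GHz baseline CPU, 1 instruction/cycle
WAN_DOLLARS_PER_GB = 1.0    # $1 buys 1 GB of WAN transfer
CPU_HOURS_PER_DOLLAR = 8    # $1 buys 8 hours of CPU time


def instruction_density(cpu_hours: float, bytes_moved: int) -> float:
    """Instructions executed per byte of network traffic."""
    return cpu_hours * 3600 * CPU_HZ / bytes_moved


# Toy Story 2: send a 50 MB task, compute 10 hours, return a 200 MB image.
toy_story_density = instruction_density(10, (50 + 200) * 10**6)

# SwissProt: cost of shipping 40 GB versus the value of 7220 CPU hours.
shipping_cost = 40 * WAN_DOLLARS_PER_GB
cpu_value = 7220 / CPU_HOURS_PER_DOLLAR

print(f"Toy Story 2 density: {toy_story_density:,.0f} instructions/byte")
print(f"SwissProt shipping:  ${shipping_cost:.0f}")
print(f"Value of CPU time:   ${cpu_value:.2f}")
```

Under these assumptions the rendering task falls inside the quoted 200-600 x 10^3 range, and the value of the donated CPU time is much larger than the cost of shipping the database, which is consistent with the slide's "Yes".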


Do not provision databases, provision the searches instead

¨ Does NOT make sense to provision databases on demand
  ¤ Instruction density must exceed 10^5 per byte
¨ Set up dedicated servers instead
  ¤ Use inexpensive servers and processors
  ¤ Provision searches!
¨ 40 GB server costs $20K
  ¤ Can deliver complex 1-hour searches for $1

What does this imply?

¨ Put the computations near the data
¨ Combining data from multiple sites
  ¤ PUSH processing to data sources
    § Filter the data early


The contents of this slide-set are based on the following references

¨ Distributed Systems: Concepts and Design. George Coulouris, Jean Dollimore, Tim Kindberg, Gordon Blair. 5th Edition. Addison Wesley. ISBN: 978-0132143011. [Chapter 10]

¨ JIM GRAY. Distributed Computing Economics. Technical Report: MSR-TR-2003-24. Microsoft Research. July 2003.
