COMP 150-IDS: Internet Scale Distributed Systems (Fall 2020)
Scalability, Performance & Caching
Noah Mendelsohn Tufts University Email: [email protected] Web: http://www.cs.tufts.edu/~noah
Copyright 2012, 2015, 2018, 2019 & 2020 – Noah Mendelsohn

Goals
. Explore some general principles of performance, scalability and caching
. Explore key issues relating to performance and scalability of the Web
© 2010 Noah Mendelsohn

Performance Concepts and Terminology
Performance, Scalability, Availability, Reliability
. Performance
  – Get a lot done quickly
  – Preferably at low cost
. Scalability
  – Low barriers to growth
  – A scalable system isn’t necessarily fast…
  – …but it can grow without slowing down
. Availability
  – Always there when you need it
. Reliability
  – Never does the wrong thing, never loses or corrupts data
Throughput vs. Response Time vs. Latency
. Throughput: the aggregate rate at which a system does work
. Response time: the time to get a response to a request
. Latency: time spent waiting (e.g. for disk or network)
. We can improve throughput by:
  – Minimizing work done per request
  – Doing enough work at once to keep all hardware resources busy…
  – …and when some work is delayed (latency), finding other work to do
  – Using parallelism, often to work on multiple requests independently
. We can improve response time by:
  – Minimizing total work and delay (latency) on the critical path to a response
  – Applying parallel resources to an individual response…including streaming
  – Precomputing response values
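These definitions can be made concrete with a short sketch (in Python, which the slides don't otherwise use; `handle_request` and the 50 ms delay are invented stand-ins for a real request and its I/O latency): running requests in parallel leaves each request's response time unchanged, but multiplies the aggregate throughput.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request(_):
    # Stand-in for one request: 50 ms spent waiting (latency),
    # e.g. on disk or network.
    time.sleep(0.05)

N = 8

# One at a time: response time per request is ~50 ms,
# throughput is about 1/0.05 = 20 requests/sec.
start = time.monotonic()
for i in range(N):
    handle_request(i)
serial = time.monotonic() - start

# In parallel: each response still takes ~50 ms,
# but aggregate throughput multiplies.
start = time.monotonic()
with ThreadPoolExecutor(max_workers=N) as pool:
    list(pool.map(handle_request, range(N)))
parallel = time.monotonic() - start

print(f"serial: {serial:.2f}s  parallel: {parallel:.2f}s")
```

Because the simulated work is pure waiting, threads overlap it almost perfectly; CPU-bound work would behave differently.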
Know How Fast Things Are
Typical “speeds n feeds”
. CPU (e.g. Intel Core i7):
  – A few billion instructions / second per core
  – Memory: 20 GB/sec (20 bytes/instruction executed)
. Long distance network:
  – Latency (ping time): 10-100 ms
  – Bandwidth: 5-100 Mb/sec
. Local area network (Gbit Ethernet):
  – Latency: 50-100 usec (note: microseconds)
  – Bandwidth: 1 Gb/sec (~100 Mbytes/sec)
. Hard disk:
  – Rotational delay: 5 ms
  – Seek time: 5-10 ms
  – Bandwidth from magnetic media: 1 Gbit/sec
. SSD:
  – Setup time: 100 usec
  – Bandwidth: 2 Gbit/sec (typical)

Note: SSD wins big on latency, somewhat on bandwidth.
Making Systems Faster: Single Thread Speed
Sharing the CPU
Multiple programs running at once

[Diagram: applications (e.g. PowerPoint, a browser) running on the operating system, sharing main memory and the CPU]
What affects speed of a single program?
. How well is the code written? In what language?
. Compiler code optimization?

[Diagram: browser code running on the operating system, with main memory and CPU]
What affects speed of a single program?
. How well is the code written? In what language? Compiler code optimization?
. How efficient are system libraries (including malloc, sqrt)?
. How efficient is the OS (including file I/O, networking stack)?

[Diagram: browser code and system library code on the operating system, with main memory and CPU]
What affects speed of a single program?
. How powerful is the GPU? How much memory does it have?
. How well do the application and associated libraries use the GPU?

A GPU is intended to speed graphics operations, with a CPU-like core optimized for parallel work and data streaming. Note that GPUs are also useful for general parallel computation.

[Diagram: browser code and system library code (e.g. sqrt) on the operating system, with main memory and CPU + GPU]
What affects speed of a single program?
. Speed / capacity of storage devices
. Network connection performance

[Diagram: browser code and system library code (e.g. sqrt) on the operating system, with main memory and CPU]
Making Systems Faster: Hiding Latency
Hard disks are slow
[Diagram: disk platter and sectors]
Handling disk data the slow way

The Slow Way:
• Read a block
• Compute on the block
• Read another block
• Compute on the other block
• Rinse and repeat

The computer waits milliseconds while reading the disk: 1000s of instruction times!
Faster way: overlap to hide latency

The Faster Way:
• Read a block
• Start reading another block
• Compute on 1st block
• Start reading 3rd block
• Compute on 2nd block
• Rinse and repeat

Buffering: we’re reading ahead…computing while reading!
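The read-ahead pattern above can be sketched in a few lines of Python (a minimal sketch: `blocks`, `read_ahead`, and the queue depth are invented for illustration, and a thread plus a generator stand in for real asynchronous disk reads). A background thread keeps the next blocks buffered so the consumer computes on one block while the next is being read.

```python
import queue
import threading

def blocks(source, block_size=4):
    # Stand-in for slow disk reads: yield fixed-size blocks of `source`.
    for i in range(0, len(source), block_size):
        yield source[i:i + block_size]

def read_ahead(blocks_iter, depth=2):
    # A producer thread keeps up to `depth` blocks buffered, so reading
    # overlaps with whatever computation the consumer does per block.
    q = queue.Queue(maxsize=depth)
    DONE = object()  # sentinel marking end of input

    def producer():
        for b in blocks_iter:
            q.put(b)
        q.put(DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        b = q.get()
        if b is DONE:
            return
        yield b

data = b"abcdefghijklmnop"
out = b"".join(read_ahead(blocks(data)))
assert out == data  # same data, but reads happened ahead of computation
```

The bounded queue is the "buffer": when the consumer falls behind, the producer blocks instead of reading ahead without limit.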
Making Systems Faster: Bottlenecks and Parallelism
Parallelism and pipelining
Example task: adjust the contrast and sharpness of an image.
Parallelism
Multiple computers each take a piece of the image
Pipelining
Pipeline stages: compute brightness range → adjust brightness → sharpen
Amdahl’s claim: parallel processing won’t scale
1967: Major controversy …will parallel computers work?
“Demonstration is made of the continued validity of the single processor approach and of the weaknesses of the multiple processor approach in terms of application to real problems and their attendant irregularities.” – Gene Amdahl*
* Gene Amdahl, Validity of the single processor approach to achieving large scale computing capabilities, AFIPS spring joint computer conference, 1967. http://www-inst.eecs.berkeley.edu/~n252/paper/Amdahl.pdf
Amdahl: why no parallel scaling?
“The first characteristic of interest is the fraction of the computational load which is associated with data management housekeeping. This fraction […might eventually be reduced to 20%...]. The nature of this overhead appears to be sequential so that it is unlikely to be amenable to parallel processing techniques. Overhead alone would then place an upper limit on throughput of five to seven times the sequential processing rate.” – Gene Amdahl (Ibid)
In short: even if the parallelizable part took zero time, the 20% sequential overhead would limit the speedup to 5x.
Speedup = 1 / (r_s + r_p/n), where r_s and r_p are the sequential and parallel fractions of the computation. As n grows, r_p/n → 0, so Speedup → 1/r_s.
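The formula is easy to evaluate directly (a quick Python sketch; the function name is mine):

```python
def amdahl_speedup(r_s, n):
    # Speedup = 1 / (r_s + r_p / n), with r_p = 1 - r_s the parallel fraction.
    r_p = 1.0 - r_s
    return 1.0 / (r_s + r_p / n)

# Amdahl's 20% sequential "housekeeping" fraction:
print(amdahl_speedup(0.2, 10))     # well short of 10x on 10 processors
print(amdahl_speedup(0.2, 10**6))  # approaches the 1/0.2 = 5x ceiling
```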
So…why does parallelism work after all?
Amdahl missed that as we got more parallelism, we would work on bigger problems!
• Simulations with more data points
• Indexing all the pages on the World Wide Web
• Serving search queries from all users of the Web
• Running word processors “in the cloud” for millions of users
• Etc.
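This observation was later formalized as Gustafson's law (1988): hold the elapsed time fixed and let the problem grow with the machine, and the scaled speedup is n - r_s(n - 1). A quick sketch, reusing the r_s notation from Amdahl's formula:

```python
def gustafson_speedup(r_s, n):
    # Scaled speedup: the problem grows with n, so the parallel part
    # remains a fixed fraction of the (fixed) elapsed time.
    return n - r_s * (n - 1)

# With the same 20% sequential fraction, 1000 processors still pay off
# handsomely on a problem scaled up to match:
print(gustafson_speedup(0.2, 1000))
```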
Web Performance and Scaling
Web Performance & Scalability Goals
. Overall Web goals:
  – Extraordinary scalability, with good performance
  – Therefore…very high aggregate throughput (think of all the accesses being made this second)
  – Economical to deploy (modest cost/user)
  – Be a good citizen on the Internet
. Web servers:
  – Decent performance, high throughput and scalability
. Web clients (browsers):
  – Low latency (quick response for users)
  – Reasonable burden on PCs
  – Minimize memory and CPU on small devices
What we’ve already studied about Web scalability…
. The Web builds on scalable, high-performance Internet infrastructure:
  – IP
  – DNS
  – TCP
. Decentralized administration & deployment
  – The only thing resembling a global, central Web server is the DNS root
  – URI generation
. Stateless protocols
  – Relatively easy to add servers
Web server scaling

[Diagram: browser → web server + application logic → data store holding reservation records]
Stateless HTTP protocol helps scalability

[Diagram: many browsers, each able to talk to any of several web server + application logic instances, all sharing one data store]
Caching
There are only two hard things in Computer Science: cache invalidation and naming things.
-- Phil Karlton
Why does caching work at all?
. Locality:
  – In many computer systems, a small fraction of the data gets most of the accesses
  – In other systems, a slowly changing set of data is accessed repeatedly
. History: use of memory by typical programs
  – Denning’s Working Set Theory*
  – Early demonstration of locality in program access to memory
  – Justified paged virtual memory with LRU replacement algorithms
  – Also indirectly explains why CPU caches work
. But…not all data-intensive programs follow the theory:
  – Video processing!
  – Many simulations
  – Hennessy and Patterson: running vector (think MMX/SIMD) data through the CPU cache was a big mistake in IBM mainframe vector implementations
* Peter J. Denning, The Working Set Model for Program Behavior, 1968 http://denninginstitute.com/pjd/PUBS/WSModel_1968.pdf Also 2008 overview on locality from Denning: http://denninginstitute.com/pjd/PUBS/ENC/locality08.pdf
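Locality is easy to demonstrate: under a skewed access pattern a small cache absorbs most requests, while uniform access defeats it. A minimal sketch in Python (the 90/10 skew, the cache size, and the `LRUCache` class are invented for illustration):

```python
import random
from collections import OrderedDict

class LRUCache:
    # Minimal LRU cache, just enough to count hits and misses.
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()
        self.hits = self.misses = 0

    def access(self, key):
        if key in self.data:
            self.hits += 1
            self.data.move_to_end(key)        # mark most recently used
        else:
            self.misses += 1
            self.data[key] = True
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)  # evict least recently used

random.seed(0)
KEYS = 1000
skewed, uniform = LRUCache(50), LRUCache(50)
for _ in range(10_000):
    # Locality: ~90% of accesses go to 5% of the keys; compare to uniform.
    skewed.access(random.randrange(50) if random.random() < 0.9
                  else random.randrange(KEYS))
    uniform.access(random.randrange(KEYS))

rate = lambda c: c.hits / (c.hits + c.misses)
print(f"skewed hit rate:  {rate(skewed):.2f}")
print(f"uniform hit rate: {rate(uniform):.2f}")
```

A cache holding 5% of the keys serves most of the skewed workload but almost none of the uniform one; caching only pays when access patterns have locality.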
Why is caching hard?
Things change
Telling everyone when things change adds overhead
So, we’re tempted to cheat… and caches fall out of sync with reality.
CPU Caching – Simple System

[Diagram: the CPU sends a read request to the cache; on a miss, the cache forwards the read to memory, keeps a copy of the data, and returns it to the CPU]
CPU Caching – Simple System

Life is good: no traffic to slow memory.

[Diagram: a repeated read request is satisfied directly from the cache]
CPU Caching – Store-Through Writing

Everything is up-to-date… but every write waits for slow memory!

[Diagram: the CPU’s write request updates the cache and is written through to memory]
CPU Caching – Store-In Writing

The write is fast, but memory is out of date!

[Diagram: the CPU’s write request updates only the cache; memory still holds the old value]
CPU Caching – Store-In Writing

If we try to write data from memory to disk, the wrong data will go out!

[Diagram: a disk write sourced from stale memory misses the newer data in the cache]
Cache invalidation is hard!

[Diagram: the cache holds newer data than memory – we can start to see why cache invalidation is hard!]
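The store-through/store-in tradeoff on the last few slides can be sketched in a few lines of Python (a toy model: dicts stand in for cache lines and memory, and all names are invented for illustration):

```python
class Cache:
    # Toy single-level cache over a `memory` dict. Store-through keeps memory
    # current but every write waits for slow memory; store-in is fast but
    # leaves memory stale until the cache is flushed.
    def __init__(self, memory, write_through):
        self.memory = memory
        self.write_through = write_through
        self.lines = {}
        self.dirty = set()

    def write(self, addr, value):
        self.lines[addr] = value
        if self.write_through:
            self.memory[addr] = value   # every write waits for slow memory
        else:
            self.dirty.add(addr)        # fast, but memory is now out of date!

    def flush(self):
        # Must run before anyone reads memory directly (e.g. a disk write),
        # or the wrong data goes out.
        for addr in self.dirty:
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()

mem = {}
store_in = Cache(mem, write_through=False)
store_in.write(0x10, 42)
print(mem.get(0x10))   # prints None: memory is stale
store_in.flush()
print(mem.get(0x10))   # prints 42: memory caught up after the flush
```

The `flush` step is exactly the invalidation problem in miniature: someone has to know which lines are dirty and when they must be written back.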
Multi-core CPU caching

[Diagram: two CPUs, each with a private cache, connected by a coherence protocol above a shared cache and memory; one CPU issues a write request]
Multi-core CPU caching

[Diagram: after one CPU writes, the coherence protocol supplies the updated data to the other CPU’s cache to satisfy its read request]
Multi-core CPU caching

A read from disk must flush all caches!

[Diagram: a disk read into memory forces the coherence protocol to flush the affected data from every cache]
Consistency vs. performance
. Caching involves difficult tradeoffs
. Coherence is the enemy of performance!
  – This proves true over and over in lots of systems
  – There’s a ton of research on weak-consistency models…
. Weak consistency: let things get out of sync sometimes
  – Programming: compilers and libraries can hide or even exploit weak consistency
. Yet another example of leaky abstractions!
What about Web Caching?
Note: update rate on Web is mostly low – makes things easier!
Browsers have caches

[Diagram: browser (e.g. Firefox), which usually includes a cache, talking to a web server (e.g. Apache)]

The browser cache prevents repeated requests for the same representations…even different pages share images, stylesheets, etc.
Web Reservation System

[Diagram: a browser or phone app (e.g. an iPhone or Android reservation application) talks HTTP to a proxy cache (optional!, e.g. Squid), which talks HTTP to a web server running the flight reservation logic; the server reaches the data store holding reservation records via RPC? ODBC? Proprietary?]

Many commercial applications work this way.
HTTP Caches Help Web to Scale

[Diagram: many browsers, each behind a web proxy cache, talking to several web server + application logic instances that share one data store]
Web Caching Details
HTTP Headers for Caching

. Cache-Control: max-age
  – Server indicates how long the response is good
. Heuristics:
  – If there are no explicit times, a cache can guess
. Caches check with the server when content has expired
  – Send an ordinary GET w/ validator headers
  – Validators: modified time (1 sec resolution); ETag (opaque code)
  – Server returns “304 Not Modified” or new content
. Cache-Control: override default caching rules
  – E.g. client forces a check for a fresh copy
  – E.g. client explicitly allows stale data (for performance, availability)
. Caches inform downstream clients/proxies of response age
. PUT/POST/DELETE clear caches, but…
  – No guarantee that updates go through the same proxies as all reads!
  – Don’t mark as cacheable things you expect to update through parallel proxies!
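A sketch of the freshness and validator rules above (in Python; the helper names are mine, and the logic is a deliberate simplification of what real HTTP caches do):

```python
import time

def is_fresh(headers, request_time, now=None):
    # Can a cached response be reused without revalidation? Implements only
    # the Cache-Control: max-age + Age rule sketched above (real caches also
    # handle Expires, no-cache, Vary, heuristic lifetimes, etc.).
    now = time.time() if now is None else now
    max_age = None
    for part in headers.get("Cache-Control", "").split(","):
        part = part.strip()
        if part.startswith("max-age="):
            max_age = int(part.split("=", 1)[1])
    if max_age is None:
        return False                    # no explicit lifetime: revalidate
    age = (now - request_time) + int(headers.get("Age", 0))
    return age < max_age

def revalidation_headers(headers):
    # Validator headers for a conditional GET; the server answers
    # "304 Not Modified" or sends new content.
    out = {}
    if "ETag" in headers:
        out["If-None-Match"] = headers["ETag"]
    if "Last-Modified" in headers:
        out["If-Modified-Since"] = headers["Last-Modified"]
    return out

h = {"Cache-Control": "max-age=60", "Age": "10", "ETag": '"abc123"'}
print(is_fresh(h, request_time=1000, now=1030))   # True: age 30+10=40 < 60
print(is_fresh(h, request_time=1000, now=1100))   # False: expired, revalidate
print(revalidation_headers(h))                    # If-None-Match with the ETag
```

Note how the Age header from an upstream cache counts against the lifetime: that is how caches "inform downstream clients/proxies of response age".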
Summary
. We have studied some key principles and techniques relating to performance and scalability:
  – Hardware performance
  – Single program issues (code quality, compiler, etc.)
  – Hiding latency
  – Parallelism and Amdahl’s law
  – Buffering and caching
  – Stateless protocols, etc.
. The Web is a highly scalable system with mostly OK performance
. Web approaches to scalability:
  – Built on scalable Internet infrastructure
  – Few single points of control (the DNS root changes slowly and is available in parallel)
  – Administrative scalability: no central Web site registry
. Web performance and scalability:
  – Very high parallelism (browsers, servers all run in parallel)
  – Stateless protocols support scale-out
  – Caching