EECS 262a: Advanced Topics in Computer Systems
Lecture 11 – October 8th, 2014
John Kubiatowicz
Electrical Engineering and Computer Sciences, University of California, Berkeley

Today’s Papers
• Lottery Scheduling: Flexible Proportional-Share Resource Management
  Carl A. Waldspurger and William E. Weihl. Appears in Proceedings of the First USENIX Symposium on Operating Systems Design and Implementation (OSDI), 1994
• SEDA: An Architecture for Well-Conditioned, Scalable Internet Services
  Matt Welsh, David Culler, and Eric Brewer. Appears in Proceedings of the 18th Symposium on Operating Systems Principles (SOSP), 2001
• Thoughts?

http://www.eecs.berkeley.edu/~kubitron/cs262


Scheduling Review

• Scheduling: selecting a waiting thread and allocating a resource (e.g., CPU time) to it
• First Come First Serve (FCFS/FIFO) Scheduling:
  – Run threads to completion in order of submission
  – Pros: Simple (+)
  – Cons: Short jobs get stuck behind long ones (-)
• Round-Robin Scheduling:
  – Give each thread a small amount of CPU time (quantum) when it executes; cycle between all ready threads
  – Pros: Better for short jobs (+)
  – Cons: Poor when jobs are the same length (-)

Multi-Level Feedback Scheduling

(Figure: long-running compute tasks get demoted to low-priority queues)
• A scheduling method for exploiting past behavior
  – First used in the Compatible Time-Sharing System (CTSS)
  – Multiple queues, each with a different priority
    » Higher-priority queues often considered “foreground” tasks
  – Each queue has its own scheduling algorithm
    » e.g., foreground – Round Robin, background – First Come First Serve
    » Sometimes multiple RR priorities with the quantum increasing exponentially (highest: 1ms, next: 2ms, next: 4ms, etc.)
• Adjust each job’s priority as follows (details vary):
  – Job starts in the highest-priority queue
  – If its timeout expires, drop it one level
  – If its timeout doesn’t expire, push it up one level (or to the top)
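A minimal sketch of those adjustment rules (the number of levels, the quanta, and the names below are illustrative, not from the slides):

```python
from collections import deque

QUANTUM_MS = [1, 2, 4]                      # quantum doubles at each lower-priority level
queues = [deque() for _ in QUANTUM_MS]      # queues[0] is the highest priority

class Job:
    def __init__(self, name, remaining_ms):
        self.name, self.remaining_ms = name, remaining_ms

def schedule_once():
    """Run the front job of the highest-priority non-empty queue for one quantum,
    then demote it if it used the whole quantum or promote it if it yielded early."""
    for level, q in enumerate(queues):
        if not q:
            continue
        job = q.popleft()
        used = min(job.remaining_ms, QUANTUM_MS[level])
        job.remaining_ms -= used
        if job.remaining_ms == 0:
            return job                                   # finished
        new_level = min(level + 1, len(queues) - 1) if used == QUANTUM_MS[level] else 0
        queues[new_level].append(job)                    # demote on timeout, else back to top
        return job
```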

Lottery Scheduling

• Very general, proportional-share scheduling algorithm
• Problems with traditional schedulers:
  – Priority systems are ad hoc at best: highest priority always wins, starvation risk
  – “Fair share” implemented by adjusting priorities with a feedback loop to achieve fairness over the (very) long term (highest priority still wins all the time, but now the Unix priorities are always changing)
  – Priority inversion: high-priority jobs can be blocked behind low-priority jobs
  – Schedulers are complex and difficult to control, with hard-to-understand behaviors

Lottery Scheduling

• Give each job some number of lottery tickets
• On each time slice, randomly pick a winning ticket and give its owner the resource
• On average, the resource fraction (CPU time) is proportional to the number of tickets given to each job
• Tickets can be used for a wide variety of different resources (uniform) and are machine independent (abstract)
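A minimal sketch of the basic mechanism described above (illustrative Python, not the paper’s implementation):

```python
import random

def lottery_pick(jobs):
    """jobs maps a job name to its ticket count; returns the winner of one
    time slice. Over many slices, each job's share of wins is proportional
    to its tickets."""
    total = sum(jobs.values())
    draw = random.randrange(total)          # winning ticket in [0, total)
    for job, tickets in jobs.items():
        if draw < tickets:
            return job
        draw -= tickets

# e.g., lottery_pick({"short": 10, "long": 1}) returns "short" about 10 times in 11
```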

How to Assign Tickets?

• Priority is determined by the number of tickets each process has:
  – Priority is the relative percentage of all of the tickets competing for this resource
• To avoid starvation, every job gets at least one ticket (everyone makes progress)

Lottery Scheduling Example

• Assume short jobs get 10 tickets, long jobs get 1 ticket each

  # short jobs / # long jobs   % of CPU each short job gets   % of CPU each long job gets
          1 / 1                            91%                            9%
          0 / 2                            N/A                           50%
          2 / 0                            50%                           N/A
         10 / 1                           9.9%                          0.99%
          1 / 10                           50%                            5%
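The table follows directly from the ticket counts; a quick check (assuming the 10:1 ticket split above):

```python
def share(my_tickets, total_tickets):
    # Expected CPU fraction = this job's tickets / all tickets in the system
    return my_tickets / total_tickets

print(share(10, 10 + 1), share(1, 10 + 1))            # 1 short / 1 long: ~0.91 and ~0.09
print(share(10, 10 * 10 + 1), share(1, 10 * 10 + 1))  # 10 short / 1 long: ~0.099 and ~0.0099
```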

How Fair is Lottery Scheduling?

• If a client has probability p = t/T of winning, then the expected number of wins (from the binomial distribution) is np
  – Probabilistically fair
  – Variance of the binomial distribution: σ² = np(1 – p)
  – Accuracy improves with √n
• The geometric distribution yields the number of tries until the first win
• Advantage over strict priority scheduling: behaves gracefully as load changes
  – Adding or deleting a job affects all jobs proportionally, independent of how many tickets each job possesses
• Big picture answer: mostly accurate, but short-term inaccuracies are possible
  – See the (hidden) Stride Scheduling lecture slides for the follow-on

Ticket Transfer

• How to deal with dependencies
• Basic idea: if you are blocked on someone else, give them your tickets
• Example: client-server
  – Server has no tickets of its own
  – Clients give the server all of their tickets during an RPC
  – Server’s priority is the sum of the priorities of all of its active clients
  – Server can use lottery scheduling to give preferential service to high-priority clients
• Very elegant solution to a long-standing problem (not the first solution, however)
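A rough sketch of that transfer (hypothetical names; the paper exposes this through kernel ticket-transfer primitives rather than Python objects):

```python
class Task:
    def __init__(self, name, tickets):
        self.name, self.tickets = name, tickets   # funding the lottery sees for this task

def rpc(client, server):
    """Client blocks on the server, so it lends the server its tickets for the call."""
    lent, client.tickets = client.tickets, 0       # client holds nothing while blocked
    server.tickets += lent                         # server is funded by all active clients
    # ... server computes the reply, scheduled with the combined tickets ...
    server.tickets -= lent                         # reply delivered: tickets return
    client.tickets = lent
```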

Ticket Inflation

• Make up your own tickets (print your own money)
• Only works among mutually trusting clients
• Presumably works best if inflation is temporary
• Allows clients to adjust their priority dynamically with zero communication

Currencies

• Set up an exchange rate with the base currency
• Enables inflation just within a group
  – Also isolates it from other groups
• Simplifies mini-lotteries, such as for a mutex
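One way to picture the exchange rate (a simplified sketch with made-up numbers; the paper’s currency graph is more general):

```python
def base_value(local_tickets, currency_total, currency_funding):
    """A currency's tickets are collectively worth a fixed amount of base
    currency, so inflating within the group only dilutes that group."""
    return local_tickets * currency_funding / currency_total

# A group funded with 100 base tickets has issued 200 local tickets:
# a task holding 50 local tickets is worth 25 base tickets.
print(base_value(50, 200, 100))   # -> 25.0
```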

Compensation Tickets

• What happens if a thread is I/O bound and regularly blocks before its quantum expires?
  – Without adjustment, this implies that the thread gets less than its share of the processor
• Basic idea:
  – If you complete fraction f of the quantum, your tickets are inflated by 1/f until the next time you win
• Example:
  – If B on average uses 1/5 of a quantum, its tickets will be inflated 5x, so it will win 5 times as often and get its correct share overall

Lottery Scheduling Problems

• Not as fair as we’d like:
  – mutex comes out 1.8:1 instead of 2:1, while multimedia apps come out 1.92:1.50:1 instead of 3:2:1
• Practice midterm question:
  – Are these differences statistically significant?
  – Probably, which would imply that the lottery is biased or that there is a secondary force affecting the relative priority (e.g., the X server)
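A one-line sketch of the compensation rule just described (variable names are mine):

```python
def compensated_tickets(base_tickets, fraction_used):
    """A winner that used only fraction f of its quantum holds base/f tickets
    until it next wins, so its long-run CPU share is unchanged."""
    assert 0 < fraction_used <= 1
    return base_tickets / fraction_used

# An I/O-bound job that blocks after 1/5 of its quantum holds 5x the tickets,
# wins ~5x as often, and still gets its proportional share of CPU time.
print(compensated_tickets(10, 0.2))   # -> 50.0
```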


Lottery Scheduling Problems

• Multimedia app:
  – Biased due to the X server assuming uniform priority instead of using tickets
  – Conclusion: to really work, tickets must be used everywhere
    » every queue is an implicit scheduling decision... every spinlock ignores priority...
• Can we force it to be unfair?
  – Is there a way to use compensation tickets to get more time, e.g., quit early to get compensation tickets and then run for the full time next time?
• What about kernel cycles?
  – If a process uses a lot of cycles indirectly, such as through the Ethernet driver, does it get higher priority implicitly? (probably)

Stride Scheduling

• Follow-on to lottery scheduling (not in the paper)
• Basic idea:
  – Make a deterministic version to reduce short-term variability
  – Mark time virtually, using “passes” as the unit
• A process has a stride, which is the number of passes between executions
  – Strides are inversely proportional to the number of tickets, so high-priority jobs have low strides and thus run often
• Very regular: a job with priority p will run every 1/p passes

Stride Scheduling Algorithm

• Algorithm (roughly):
  – Always pick the job with the lowest pass number
  – Update its pass number by adding its stride
• Similar mechanism to compensation tickets
  – If a job uses only fraction f of its quantum, update its pass number by f × stride instead of the full stride
• Overall result:
  – Far more accurate than lottery scheduling, and the error can be bounded absolutely instead of probabilistically
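A minimal sketch of that mechanism (the constant and the names are illustrative):

```python
STRIDE1 = 1 << 20                           # large fixed constant; stride = STRIDE1 / tickets

class Job:
    def __init__(self, name, tickets):
        self.name = name
        self.stride = STRIDE1 / tickets     # fewer tickets -> larger stride -> runs less often
        self.pass_value = 0.0

def run_once(jobs, fraction_used=1.0):
    """Pick the job with the lowest pass and advance its pass by its stride.
    If it used only fraction f of its quantum, advance by f * stride instead
    (the analogue of compensation tickets)."""
    job = min(jobs, key=lambda j: j.pass_value)
    job.pass_value += fraction_used * job.stride
    return job

jobs = [Job("A", 3), Job("B", 2), Job("C", 1)]
print([run_once(jobs).name for _ in range(6)])   # a 3:2:1 mix, e.g. A, B, C, A, B, A
```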

Is this a good paper?

• What were the authors’ goals?
• What about the evaluation/metrics?
• Did they convince you that this was a good system/approach?
• Were there any red flags?
• What mistakes did they make?
• Does the system/approach meet the “Test of Time” challenge?
• How would you review this paper today?

BREAK

Project proposals due Friday at midnight – should have:
1. Motivation and problem domain
2. Description of what you are going to do and what is new about it
3. How you are going to do the evaluation (methodology, base case, …)
4. List of resources you need
5. List of ALL participants (with at least two people from CS262A)
Projects are supposed to be advancing the state of the art in some way, not just redoing something that others have done

SEDA – Threads (Background)

• The standard model (similar to Mesa)
• Concurrent threads with locks/mutex/… for synchronization
• Blocking calls
  – May hold locks for a long time
• Problems with more than 100 or so threads due to OS overhead (why?)
  – See the SEDA graph of throughput vs. number of threads
• Strong support from OS, libraries, debuggers, ....

SEDA – Thread Pools (Background)

(Figure: a master thread places incoming requests on a queue serviced by a bounded thread pool)
• Problem with the previous approach (unbounded threads):
  – When a website becomes too popular, throughput sinks and latency spikes
• Instead, allocate a bounded “pool” of threads (maximum level of multiprogramming) – a crude form of load conditioning
  – Popular with some web servers
• But can’t differentiate between short-running (static content) and long-running (dynamic content) threads

SEDA – Events (Background)

• Lots of small handlers (finite state machines) that run to completion
  – No blocking allowed
• Basic model: an event arrives and runs a handler
  – State is global or part of the handler (not much in between)
• An event loop runs at the core, waiting for arrivals and then calling the handler
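A tiny sketch of that structure (the handler names and in-memory queue are illustrative; a real server would sit on select/epoll):

```python
from collections import deque

events = deque()        # arriving events, e.g. decoded network messages
handlers = {}           # event type -> run-to-completion handler

def on(kind):
    def register(fn):
        handlers[kind] = fn
        return fn
    return register

@on("http_request")
def handle_request(payload):
    # Handlers must not block: slow work is issued asynchronously and its
    # completion shows up later as another event.
    events.append(("send_response", payload.upper()))

@on("send_response")
def send_response(payload):
    print("reply:", payload)

def event_loop():
    """The core loop: wait for an event, dispatch its handler, repeat."""
    while events:
        kind, payload = events.popleft()
        handlers[kind](payload)

events.append(("http_request", "hello"))
event_loop()                                  # prints: reply: HELLO
```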

SEDA – Events (Background)

• Threads exist, but they just run event loop → handler → event loop → …
  – Stack trace is not useful for debugging!
  – Typically one thread per CPU (any more doesn’t add anything, since threads don’t block)
  – Sometimes there are extra threads for things that may block; e.g., OSs that only support synchronous disk reads
• Natural fit with finite-state machines (FSMs)
  – Arrows are handlers that change states
  – Blocking calls are split into two states (before and after the call)
• Allows very high concurrency
  – Multiplex 10,000 FSMs over a small number of threads
  – Approach used in telephony systems and some web servers

Cooperative Task Scheduling I

• Preemptive: tasks may be interrupted at any time
  – Must use locks/mutexes to get atomicity
  – May get preempted while holding a lock – others must wait until you are rescheduled – risk of priority inversion
  – Might want to differentiate short and long atomic sections (short ones should finish up their work)

Cooperative Task Scheduling II

• Cooperative: tasks are not preempted, but do yield the processor
  – Can use stacks and make calls, but execution is still interleaved
  – Yield points are not atomic: limits what you can do in an atomic section
  – Better with compiler help: is a call a yield point or not?
  – Hard to support multiprocessors

Cooperative Task Scheduling III

• Serial: tasks run to completion
  – Basic event handlers, which are atomic
  – Not allowed to block
  – What if they run too long? (not much to do about that; could kill them; implies this might be better for friendly systems)
  – Hard to support multiprocessors
• Note: preemption is OK if it can’t affect the current atomic section
  – An easy way to achieve this is data partitioning! Only threads that access the shared state are a problem
  – Can preempt for system routines
  – Can preempt to switch to a different process (with its own set of threads), but this assumes processes don’t share state

Implementing Split-Phase Actions – Threads

• Threads: not too bad – just block until the action completes (synchronous)
• Assumes other threads run in the meantime
• Ties up considerable memory (a full stack)
• Easy memory management: stack allocation/deallocation matches the natural lifetime

Implementing Split-Phase Actions – Events

• Events: hard
• Must store live state in a continuation (usually on the heap)
  – Handler lifetime is too short, so you need to explicitly allocate it and deallocate it later
  – Scoping is bad too: you need a multi-handler scope, which usually implies global scope
• Rips the function into two functions: before and after
• Debugging is hard
• Evolution is hard:
  – Adding a yielding call implies more ripping to do
  – Converting a non-yielding call into a yielding call is worse – every call site needs to be ripped, and those sites may become yielding, which cascades the problem
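The “ripping” is easiest to see side by side; a sketch with hypothetical conn/db APIs (the thread version keeps state on the stack across the block, the event version splits into two handlers with an explicit continuation):

```python
# Thread style: local state survives the blocking call on the stack.
def handle_request_threaded(conn, db):
    user = conn.read_request()
    record = db.lookup(user)            # blocking call; the thread sleeps here
    conn.send(record)

# Event style: the same logic ripped into "before" and "after" handlers.
# The live state (conn, user) must be packaged into a heap-allocated continuation.
def handle_request_before(conn, db):
    user = conn.read_request()
    continuation = {"conn": conn, "user": user}
    db.lookup_async(user, token=continuation)    # returns immediately

def handle_lookup_done(record, continuation):    # runs when the lookup-done event arrives
    continuation["conn"].send(record)
```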

Atomic Split-Phase Actions (Really Hard)

• Threads – pessimistic: acquire the lock and then block
• Threads – optimistic: read state, block, try the write, retry if it fails (and re-block!)
• Events – pessimistic: acquire the lock and store state in a continuation; the later reply completes the action and releases the lock
  – Seems hard to debug: what if the event never comes? or comes more than once?
• Events – optimistic: read state, store it in a continuation; write, retry if it fails
• Basic problem: exclusive access can last a long time – hard to make progress
• General question: when can we move the lock to one side or the other?

One Strategy

• Structure as a sequence of actions that may or may not block (like cache reads)
• Acquire the lock and walk through the sequence; if an action blocks, release the lock and start over
• If you get all the way through, then the action was short (and atomic)
• This seems hard to automate!
  – The compiler would need to know that some actions mostly won’t block, or won’t block the second time... and then also know that something can be retried without multiple side effects...
• Main conclusion (for me): compilers are the key to concurrency in the future
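A rough sketch of that strategy (the would_block/do/prepare interface is invented for illustration, and it assumes the actions can safely be re-run, which is exactly the caveat above):

```python
def run_atomically(lock, actions):
    """Hold the lock only while every step stays on the fast (non-blocking) path;
    if a step would block, back out and retry the whole sequence later."""
    while True:
        blocked = None
        with lock:
            for action in actions:
                if action.would_block():       # e.g., the data it needs is not cached yet
                    blocked = action
                    break
                action.do()                    # fast path: short, so the sequence stays atomic
            if blocked is None:
                return                         # walked the whole sequence under the lock
        blocked.prepare()                      # do the slow part without the lock, then retry
```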


SEDA

• Problems to solve:
  1. Need a better way to achieve concurrency than just threads
  2. Need to provide graceful degradation
  3. Need to enable feedback loops that adapt to changing conditions
• The Internet:
  – Makes the concurrency higher
  – Requires high availability
  – Ensures that load will exceed the target range

Graceful Degradation

• Want throughput to increase linearly and then stay flat as you exceed capacity
• Want response time to be low until saturation and then increase linearly
• Want fairness in the presence of overload
• Almost no systems provide this
• Key is to drop work early or to queue it for later
  – (threads have implicit queues on locks, sockets, etc.)
• Virtualization makes it harder to know where you stand!

Problems with Threads I (claimed)

• Thread limits are too small in practice (about 100)
  – Some of this is due to linear searches in internal data structures, or limits on kernel memory allocation
• Claims about locks, overhead, TLB and cache misses are harder to understand – they don’t seem to be fundamental relative to events
  – Do events use less memory? probably some, but not 50% less
  – Do events have fewer misses? only if the working set is smaller
  – Is it bad to waste VM on stacks? only with tons of threads

Problems with Threads II (claimed)

• Is there a fragmentation issue with stacks (lots of partially full pages)? probably to some degree (each stack needs at least one page)
  – If so, we can move to a model with non-contiguous stack frames (leads to a different kind of fragmentation)
  – If so, we could allocate subpages (like FFS did for fragments) and thread a stack through subpages (but this needs compiler support and recompiling all libraries)
• Queues are implicit, which makes it hard to control or even identify the bottlenecks
• Key insight in SEDA:
  – No user-allocated threads: the programmer defines what can be concurrent and SEDA manages the threads
  – Otherwise there is no way to control the overall number or distribution of threads


Problems with Event-based Approach

• Debugging
• Legacy code
• Stack ripping

SEDA

• Use threads within a stage, events between stages
• Stages have explicit queues and explicit concurrency
• Threads (in a stage) can block, just not too often
• SEDA will add and remove threads from a stage as needed
• Simplifies modularity: queues decouple stages in terms of performance, at some cost in latency
• Threads never cross stages, but events can be passed by value or by reference
• Stage scheduling affects locality – better to run one stage for a while than to follow an event through multiple stages
  – This should make up for the extra latency of crossing stages
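A toy sketch of a stage along those lines (illustrative Python; SEDA itself is a Java framework whose queues also carry admission controllers):

```python
import queue
import threading
import time

class Stage:
    """An explicit event queue, a private thread pool, and a handler;
    output events go onto the next stage's queue."""
    def __init__(self, name, handler, num_threads=2, next_stage=None):
        self.name, self.handler, self.next_stage = name, handler, next_stage
        self.events = queue.Queue()
        for _ in range(num_threads):               # a controller would grow/shrink this pool
            threading.Thread(target=self._worker, daemon=True).start()

    def enqueue(self, event):
        self.events.put(event)                     # admission control could reject here

    def _worker(self):
        while True:
            event = self.events.get()
            result = self.handler(event)           # may block briefly; that's allowed
            if self.next_stage is not None and result is not None:
                self.next_stage.enqueue(result)    # events cross stages, threads never do

# e.g., a three-stage pipeline: parse -> handle -> respond
respond = Stage("respond", lambda r: print("sent:", r))
handle  = Stage("handle",  lambda req: "page for " + req, next_stage=respond)
parse   = Stage("parse",   lambda raw: raw.strip(),       next_stage=handle)
parse.enqueue("GET /index.html\r\n")
time.sleep(0.2)                                    # let the daemon workers drain the queues
```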

Feedback Loops

• Works for any measurable property that has smooth behavior (usually continuous as well)
  – The property typically needs to be monotonic in the control area (else you get lost in local minima/maxima)
• Within a stage: the batching controller decides how many events to process at one time
  – Balance the high throughput of large batches against the lower latency of small batches – look for the point where throughput drops off
• Thread pool controller: find the minimum number of threads that keeps the queue length low
• Global thread allocation based on priorities or queue lengths...
• Performance is very good, degrades more gracefully, and is more fair!
  – But huge numbers of requests are dropped to maintain the response-time goal
  – However, you can’t really do any better than this...

Events versus Threads Revisited

• How does SEDA do split-phase actions?
• Intra-stage:
  – Threads can just block
  – Multiple threads within a stage, so shared state must be protected – the common case is that each event is mostly independent (think HTTP requests)
• Inter-stage:
  – Rip the action into two stages
  – Usually one-way: no return (equivalent to tail recursion) – this means that the continuation is just the contents of the event for the next stage
  – Loops in stages are harder: you have to manually pass around the state
  – Atomicity is tricky too: how do you hold locks across multiple stages? generally try to avoid it, but otherwise you need one stage to lock and a later one to unlock
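A cartoon of the thread pool controller (the thresholds and stage interface are made up; the real controller is sampled periodically and also trims idle threads, and the batching controller adjusts batch size in a similar feedback style):

```python
def adjust_thread_pool(stage, queue_threshold=100, idle_seconds=5, max_threads=20):
    """One iteration of the feedback loop: add a thread while the stage's queue
    stays long, remove one when threads sit idle."""
    if stage.queue_length() > queue_threshold and stage.num_threads() < max_threads:
        stage.add_thread()
    elif stage.idle_time() > idle_seconds and stage.num_threads() > 1:
        stage.remove_thread()
```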

Is this a good paper?

• What were the authors’ goals?
• What about the evaluation/metrics?
• Did they convince you that this was a good system/approach?
• Were there any red flags?
• What mistakes did they make?
• Does the system/approach meet the “Test of Time” challenge?
• How would you review this paper today?
