ULE: A Modern Scheduler for FreeBSD

Chadd Williams

February 19, 2004 Motivation

• Why write a new scheduler? • Excellent interactive performance! – With small loads – Old algorithms are O(n) • Improve SMP support – no support for processor affinity or binding – Symmetric Multi-Threading support History

• 4.3BSD – No SMP support • No processor affinity • Multi processor systems can be much slower – No scheduling classes – Priority derived by estimation of recent CPU usage and nice: estcpu • decay function primary cause of poor high load performance • must iterate over every in the system – No fixed time slices • every 100ms a different, equal priority, is run Old FreeBSD Scheduler

• Extended 4.3BSD scheduler – Basic SMP support – Added scheduling classes • real-time and idle classes added • interrupt class added with SMP support – same as real-time with lower priorities (lower = better) – Tuned parameters for good interactive performance under heavy load • user base of mainly servers – nice refined • processes with values 20 higher than the least nice will not run Priority Calculation • based on number of ticks while running • decayed when processes sleep/wakeup • decayed once a second – iterates over every process in the system! O(n) – based on current load average of system – without regular decay processes may starve • sleeping process’s priority would otherwise never decay • decay improves priority New Scheduler • O(1) scheduling algorithms • CPU affinity • per CPU run queues • Two priority queues to achieve fairness – run lower priority processes after high priority processes exhaust their time slices • Dynamic slices – larger slices given to higher priority processes • probably interactive processes • negative nice positively affect allocated CPU time New FreeBSD Scheduler

• Event driven – no periodic timer • Several queues • Two CPU load-balancing algorithms • Interactivity scorer • Slice calculator • Priority calculator Queues • Each CPU has one kse queue struct – kse: kernel schedulable entity – three arrays of run queues • indexed by bucketed priorities • two used for interrupt, real-time, timesharing classes – current – next • one for idle class – load statistics – nice window Queues: Non-Idle • Threads run from the current queue until empty – pick threads in priority order to run – switch current and next queues – each thread runs for a time slice every two q switches – ensures fairness • Interactive threads assigned to current queue – low latency response – also runtime and interrupt threads • Thread assigned to a queue until it sleeps or until its slice expires – recalculate base priority, slice size, interactivity score each time a slice expires Interactivity Scoring • Determined via voluntary sleep/run time • Interactive threads have high sleep times – waiting for user input • Voluntary sleep time – number of ticks between sleep() and wakeup() – does not grow unbounded • Quickly change status • Use threshold to assign interactive or not Priority Calculation • Priority used only for run ordering – not fairness • Timesharing class has calculated priority – others assigned statically • Interactivity score used • nice value used – can positively affect CPU time Nice Impact

• Must be able to prevent threads from running • kseq tracks minimum nice value – only threads within 20 of this value get a slice value greater than zero – immediately put in next queue • Threads given slices inversely proportional to the difference of their nice and min nice Slices

• Minimum slice value of 10ms • Maximum slice value of 140ms • Nice interaction – differences of less than 3 don’t matter – 40 nice values/14 slice sizes • Interactive tasks: 10ms slice – quickly react to becoming non-interactive SMP • Important goal of ULE is better SMP usage • CPU affinity – schedule a thread on the last CPU it was run on – improve cache usage • Two load balancing schemes – Pull Migration • get threads when the CPU is idle – Push Migration • put thread to less loaded CPU from heavily loaded CPU Pull Migration • Prevent any CPU from idling • Lock run queue of other processors and look for runnable threads – highest priority thread in most loaded kseq stolen • Less expensive than idling for small numbers of processors • Effective for short running, high turnover Push Migration

• Move thread from loaded kseq to less loaded kseq – twice a second • Only affects two kseqs total – load is irregular – dual processor is the common case – balancing 4+ processors takes slightly longer • Too much balancing may ruin cache SMT Support • Symmetric Multi-Threading – OS sees many logical CPUs on a physical CPU – logical CPUs share cache, etc. • No migration penalty between logical CPUs on same physical CPU • Must be careful not to overload a physical CPU • Load balancing algorithms work on groups of CPUs Yeah, but does it work?

• Benchmarks run to test responsiveness of the system • Single CPU system • Compared to – 4BSD (old FreeBSD) – Solaris – Linux Priority Priority

• Shows priority for a process over time with constant runtime • Cyclic priority – Solaris – 4BSD • Increasing priority – Linux Nice Nice

• All platforms but Linux can starve a high nice process – ULE line never hits zero because of how the test was run • Older schedulers show bias towards -20 Interactivity Interactivity

• Two rounds of 30 parallel simulated compiles • Shows response time of interactive apps • Each scheduler is ok after a while – takes a while for Linux and 4BSD • Spike at start of second round – Solaris and 4BSD Interactivity Interactivity

• Interactive response during previous nice test • ULE does very well • 4BSD and Linux are pretty bad • “good enough to use a shell to kill an app” Pathological Case for Linux Pathological Case for Linux

• Measure interactive response • Five nice –5 processes – each attempt to use 25% of the processor • Priority in Linux tends toward the minimum • Early ULE implementations had this problem • Interactivity scoring algorithm in ULE solved this problem Performance • ULE development focused on interactivity and nice • Now performance tuning • Parallel compile test – 4X faster on ULE than 4BSD – CPU affinity – test case is memory bound – little kernel interaction • ULE 30% better on apache throughput – dual Xeon – more realistic Conclusions • Interactivity pretty good – interactivity, priority, slice slice are all separate concepts – adjust each independently • nice has less power – livelock under nice load eliminated • Currently using SMT leads to worse performance • Default in FreeBSD-current (5.X)