Process Scheduling 5

Unix Internals Process Scheduling 5

Unix scheduler  policy  implementation context switch  hardware execution context of current process is saved in it’s process control block (PCB) which is traditionally part of the u area of the process  context of the next process is retrieved from it’s PCB, loaded into the hardware registers  architecture-specific tasks o flush data cache after context switch o flush instruction cache after context switch o flush address translation cache after context switch o pipelined architectures (e.g., RISC) flush instruction pipeline prior to context switch hardware clock  fixed time interval interrupt clock tick 10 milliseconds  major tick n clock ticks n depends upon specific Unix variant  clock frequency o number of ticks per second o constant HZ defined in param.h file  kernel functions measure time in number of ticks

clock interrupt handler  highly system-dependent  priority second only to power-failure interrupt  updates CPU usage statistics for current process  performs scheduler-related functions o priority recomputation o time-slice expiration handling  sends SIGXCPU signal to current process if it has exceeded its CPU usage quota

C. Robert Putnam Page 1 4/3/2018 025608908da3ab8c3d1cf3661d077766.doc 6:19 AM  updates o time-of-day clock o SVR4 variable lbolt – number of ticks elapsed since system last booted  handles callouts  wakes system processes when required o swapper o pagedaemon  handles alarms

callouts  records a function that the kernel will invoke after a specific number of ticks  int to_ID = timeout(void(*fn)(), caddr_t arg, long delta); o fn() – kernel function to be invoked o arg – argument to be passed to the function o delta – time interval measured in number of ticks  kernel invokes callout in system context o function cannot sleep o function cannot access the process context  void untimeout( int to_ID);  periodic tasks o retransmission of network packets o scheduler functions o memory management functions o monitoring devices to avoid losing interrupts o polling devices  normal kernel operation – do not execute at interrupt priority o clock handler checks if any callouts are due o sets flag indicating that callout handler must run o when system returns from base interrupt priority . checks flag . if flag is set, system invokes callout handler . handler invokes each callout that is due to run o once due, callouts run, but only after all pending interrupts have been serviced

pending callout list  checking time o every CPU tick o high interrupt priority o optimize checking time  insertion time o low priority o much less frequently than one per tick

C. Robert Putnam Page 2 4/3/2018 025608908da3ab8c3d1cf3661d077766.doc 6:19 AM  implementation o 4.3BSD sort list in order of “time to fire” . stores differential time of expiration for each entry difference between its time to fire and that of the previous callout . decrements time difference in first entry at each clock o sort list in order of “time to fire” . stores absolute time of expiration for each entry . compares current absolute time with time in first entry o timing wheel . fixed-size circular array of callout queues . at each tick, clock interrupt handler advances current time pointer to the next element in the array, wrapping around at the end of the array . if any callouts are in the queue, their expiration time is checked . new callouts are inserted into the queue which is N elements away from current queue, where N is the time to fire measured in ticks . timing wheel hashes callouts based on the expiration time . within each queue, callouts may be sorted or unsorted . sorting callouts  reduces time required to process non-empty queues but  increases insertion time

alarms  real time alarm o measures actual elapsed time o notifies process via SIGALRM signal  profiling alarm o measures amount of time process has been executing o notifies process via SIGPROF signal  virtual-time alarm  monitors time spent in user mode  notifies process via SIGVALRM signal

BSD Unix setitimer() system call  allows process to request any type of alarm  specify time interval in microseconds

SVRx alarm() system call  requests real time alarm  specify time interval in whole number of seconds

C. Robert Putnam Page 3 4/3/2018 025608908da3ab8c3d1cf3661d077766.doc 6:19 AM SVR4 hrtsys() system call  allows time to be specified in microseconds

high resolution of real time alarms does not imply high accuracy

kernel delivers SIGALRM signal to process, process will not respond to signal until it is next scheduled to run

high-resolution timers used by high priority processes are less likely to have scheduling delays but if current process is executing in kernel mode and does not reach an preemption point

profiling & virtual alarms

clock interrupt handler charges whole tick to process even though it uses only a portion of it time measured by these alarms reflect number of clock interrupts that have occurred while process was running long run – average out good indicator of time used for any single alarm, significant inaccuracy

scheduler goals

interactive process reduce average time and variance between user action and application response sufficiently so that users cannot readily detect the delay

batch process measure of scheduling efficiency is the task completion time in the presence of other activity

real time process -- time critical predictable scheduling behavior with guaranteed bounds on response times

kernel functions – paging, interrupt handling, process management execute promptly when required

mix of processes

all application processes must continue to progress no application should prevent another from progressing unless user has explicitly permitted it system should always be able to receive and process interactive user input

C. Robert Putnam Page 4 4/3/2018 025608908da3ab8c3d1cf3661d077766.doc 6:19 AM Traditional Unix Scheduling

SVR3, 4.3BSD

 time-sharing, interactive environments  multiple users  several batch & foreground processes simultaneously  improve response times of interactive users  ensuring that low-priority, background jobs do not starve

 priority-based scheduling  each process has a scheduling priority that changes with time  scheduler selects highest-priority runnable process  uses preemptive time-slicing to schedule processes of equal priority  dynamically varies process priority based on their CPU usage patterns  kernel is nonpreemptible  running process o may voluntarily relinquish CPU when blocking on a resource o can be preempted when it returns to user mode

process priorities kernel priority numbers 0 – 49 user mode priority numbers 50 – 127

proc structure p_pri current scheduling priority p_usrpri user-mode priority p_cpu measure of recent CPU usage p_nice user-controllable nice factor

 scheduler uses p_pri to decide which process to schedule  process in user mode  p_pri == p_usrpri  process wakes up after blocking in a system call  scheduler uses p_pri to store the temporary kernel priority

sleep priority  terminal input 28  disk I/O 20  process wakes up after blocking o kernel sets p_pri  sleep priority of the event or resource o process scheduled ahead of other user processes  process completes system call, about to return to user mode  kernel sets scheduling priority p_pri  p_usrpri

C. Robert Putnam Page 5 4/3/2018 025608908da3ab8c3d1cf3661d077766.doc 6:19 AM user-mode priority p_usrpri depends upon  recent CPU usage p_cpu  user-controllable nice factor p_nice range 0 – 39 , default 20 o user mode: Δ↑ p_nice  Δ↓ priority o superuser: Δ↓ p_nice  Δ↑ priority

process priority algorithm  process creation  p_cpu = 0  clock tick  clock handler increments p_cpu for current process (max 127)  every second  kernel invokes schedcpu() -- scheduled by callout o reduces p_cpu of each process by decay factor o SVR3  fixed decay factor ½  undesirable side effect – system load Δ↑  Δ↑ priorities o 4.3BSD computes  decay = (2 * load_average) / (2 * load_average + 1);  load_average == average number of runnable processes over last second  p_usrpri = PUSER + ( p_cpu / 4 ) + ( 2 * p_nice );  PUSER is the baseline user priority of 50

C. Robert Putnam Page 6 4/3/2018 025608908da3ab8c3d1cf3661d077766.doc 6:19 AM scheduler implementation

whichgs (global variable – bitmask 32 cells – one bit for each queue) 0 0 0 1 0 1 0 … qs (32 rows) 0 – 3 proc structures 4 -- 7 8 -- 11 P P P 12--15 P P 16--19 runnable processes 20--23

…

context switch - 124-127 swtch() function  examines whichgs to find index of the first set bit  bit identifies scheduler queue containing highest priority runnable process  removes process from head of queue & switches context to it  proc structure -- p_addr field points to page table entries of the u area  process control block is a part of the u area  VAX-11 o special instructions – 32 bit fields . FFS – find First Set . FFC – Find First Clear o special instructions – double linked lists . INSQHI – insert elements into doubly linked lists . REMQHI– remove elements from doubly linked lists o special instructions – process context . LDPCTX – load process context . SVPCTX – save process context

C. Robert Putnam Page 7 4/3/2018 025608908da3ab8c3d1cf3661d077766.doc 6:19 AM run queue manipulation  highest priority process always runs unless current process is executing in kernel mode  fixed time quantum -- 4.3BSD 100 milliseconds  every 100 ms, kernel invokes roundrobin() function to schedule next process from same queue  if higher priority process were runnable, preferentially scheduled without waiting for roundrobin() function  if other runnable processes are on a lower priority queue, the current process continues to run even though its quantum has expired  every second, schedcpu() recomputes the priority of each process o removes the process from the queue o recomputed process priority o inserts process into a (possibly) different run queue  every four ticks, clock interrupt handler computes priority of the current process  context switch o current process (voluntary context switch) . blocks on a resource . exits o priority recomputation  priority of another process > current priority o current process or interrupt handler wakes up a higher priority process

 voluntary context switch o kernel directly calls swtch() from sleep() or exit() functions

 involuntary switches o system is in kernel mode, cannot directly preempt the process o kernel sets runrun flag o when process is about to return to user mode, kernel checks the runrun flag o if runrun flag is set, kernel transfers control to swtch() routine

limitations  large number of processes  inefficient to recompute priorities every second  no way to guarantee CPU resources to specific group of processes  no guarantee of response time  little control over priorities by applications  nonpreemptive kernel -- priority inversion -- higher-priority process may have to wait significant amount of time for lower-priority process to complete

C. Robert Putnam Page 8 4/3/2018 025608908da3ab8c3d1cf3661d077766.doc 6:19 AM SVR4 Scheduler

scheduling class defines scheduling policy for set of processes

SVR4 scheduling classes  time-sharing  real-time

class-independent routines -- implement common services  context switching  run queue manipulation  preemption  procedural interface

class-dependent functions – pure virtual functions  priority recomputation o real-time class – fixed priorities o time-sharing class -- varies process priorities dynamically  inheritance

class independent layer

dqactmap (global variable – bitmask 160 cells – one bit for each queue) 0 0 1 0 0 1 0 … dispq (160 rows) 160 proc structures 159 158 P P P 157 P P 156 runnable processes 155 0 – 59 time-sharing class … 60 – 99 system priorities setfrontdq() – insert process100 at front– 159 of real-time queue, class e.g., 0 process preempted before time quantum had expired setbackdq() – insert process at back of queue, e.g., newly runnable process dispdeq() – remove process from queue

dispatch latency delay between the

C. Robert Putnam Page 9 4/3/2018 025608908da3ab8c3d1cf3661d077766.doc 6:19 AM time that process becomes runnable and time that process actually begins running

SVR4 kernel preemption points

preemption point -- places in kernel code where  all kernel data structures are in stable state  kernel is about to embark on a lengthy computation

at preemption point  kernel macro PREEMPT() checks kprunrun flag  if kprunrun flag is set, o real-time process is requesting access to run o PREEMPT() macro calls preempt() kernel routine to preempt process  bounds the amount of time that real-time process must wait before being scheduled

preemption points  pathname parsing routine lookuppn() before beginning to parse each individual pathname component  open() system call before creating a file if it does not exist  memory subsystem before freeing process pages

 preempt() kernel routine o invokes CL_PREEMPT operation to perform class-dependent processing then o calls swtch() to initiate context switch

C. Robert Putnam Page 10 4/3/2018 025608908da3ab8c3d1cf3661d077766.doc 6:19 AM  swtch() function o calls pswtch() to perform machine-independent part of context switch o pswtch() . clears runrun and kprunrun flags, . selects highest priority runnable process . removes highest priority runnable process from the dispatch queue . updates dqactmap . sets state of process to SONPROC (running on processor) . updates memory management registers to map the u area . updates virtual address translation registers of the new process o invokes assembly language code to . manipulate register content . flush translation buffers . etc

C. Robert Putnam Page 11 4/3/2018 025608908da3ab8c3d1cf3661d077766.doc 6:19 AM interface to scheduling classes generic interface provides virtual functions that are implemented by each scheduling class

rt_classfuncs

rt_init real-time sys_classfunc sys_init system ts_classfunc ts_init time-sharing

proc structures proc structures

p_cid p_cid p_cid p_cid

p_clfuncs p_clfuncs p_clfuncs p_clfuncs

p_clproc p_clproc p_clproc p_clproc

class- class- class- class- dependent dependent dependent dependent data data data data

classfuncs structure vector of pointers to the functions that implement the class-dependent interface for any class

global class table contains one entry for each class  class name  pointer to an initialization function  pointer to the classfuncs vector for that class

p_cid index into global class table class ID p_clfuncs pointer to the classfuncs vector for the process class p_clproc pointer to class-dependent private data structure

C. Robert Putnam Page 12 4/3/2018 025608908da3ab8c3d1cf3661d077766.doc 6:19 AM accessing class-dependent functions  set of macros are used to o resolve generic interface function calls and to o invoke the correct class-dependent function

 #define CL_SLEEP(procp, clproc, … ) \ (*(procp)->p_clfuncs->cl_sleep)(clprocp, … )

scheduling classes determine the  policies for priority computation o range of priorities for the processes o if and under what conditions the priority can change o size of the time slice each time a process runs o whether time slice size will depend upon the priority class e.g., infinite time quantum for real-time tasks  scheduling of the processes in the class

entry points of the class-dependent interface

CL_TICK clock interrupt handler initiates call monitors time slice, recomputes priorities, handles time quantum expiration, etc. CL_FORK CL_FORKRET fork() initiates call CL_FORK initializes child-specific data structure CL_FORKRET may set runrun, allowing child process to run before parent CL_ENTERCLASS CL_EXITCLASS call is initiated when a process enters or exits a scheduling class; responsible for allocating & deallocating the class-dependent data structures

CL_SLEEP sleep() initiates call; may recompute process priority

CL_WAKEUP wakeprocs() initiates call; places the process on the appropriate run queue; may set runrun or kprunrun scheduling class determines the specific actions to be accomplished by a particular function, e.g., clock interrupt handler calls CL_TICK routine for the class to which the process belongs; the routine determines the exact actions to be performed

C. Robert Putnam Page 13 4/3/2018 025608908da3ab8c3d1cf3661d077766.doc 6:19 AM Time-Sharing Class  default class  dynamic priorities  round-robin scheduling is used for processes with equal priorities  static dispatcher parameter table controls process priorities and time slices  time slice depends upon scheduling priority  ↓Δ priority  ↑Δ time slice

index globpri quantum tqexp slpret maxwait lwait 0 0 100 0 10 5 10 1 1 100 0 11 5 11 … … … … … … … 15 15 80 7 25 5 25 … … … … … … … 40 40 20 30 50 5 50 … … … … … … … 59 59 10 49 59 5 59

ts_globpri global priority ts_quantum time quantum ts_tqexp ts_cpupri value when time quantum expires before using ts_maxwait seconds ts_slpret ts_cpupri value when returning to user mode after sleeping ts_maxwait seconds to wait for time quantum expiry ts_lwait ts_cpupri value when time quantum does not expire before using ts_maxwait seconds

 event-driven scheduling ,i.e., process priority changed in response to specific event related to process  scheduler o reduces priority each time it uses its entire time slice o boosts priority if . process blocks on event or resource . takes a long time to use its time slice

struct tsproc class-dependent data ts_timeleft time remaining in quantum 0 ≤ ts_umdpri ≤ 59 ts_cpupri system part of priority ts_upri user part of priority (nice value), range [ -20, +19 ], default 0

C. Robert Putnam Page 14 4/3/2018 025608908da3ab8c3d1cf3661d077766.doc 6:19 AM ts_umdpri user mode priority == min { max { ts_cpupri + ts_upri, 0 } 59 } ts_diswait clock time (seconds) since start of quantum

process returns from sleeping in kernel  priority is kernel priority determined by sleep condition  return to user mode  priority restored from ts_umdpri  ts_umdpri may be changed by priocntl(), but only a superuser increase it  ts_cpupri is adjusted according to dispatcher parameter table

dispatcher parameter table access  current ts_cpupri  ts_tqexp, to_slpret, ts_lwait  new ts_cpupri  ts__umdpri  ts_globpri, ts_quantum, ts_maxwait

ts_upri = 14, ts_cpupri = 1, ts_globpri = 15, ts_umdpri = 15

time quantum expires prior to ts_maxwait time  ts_cpupri = 0, ts_umdpri = 14

time quantum does not expire prior to ts_maxwait time  ts_cpupri = 11, ts_umdpri = 25

process makes a system call and blocks on resource; resumes and returns to user mode  ts_cpupri = 11 ( ts_cpupri  ts_slpret ), ts_umdpri = 25

C. Robert Putnam Page 15 4/3/2018 025608908da3ab8c3d1cf3661d077766.doc 6:19 AM Real-Time Class

 priority range [100 – 159]  fixed priority  fixed quantum  bounded dispatch latency  bounded response time  real-time processes are scheduled before any kernel services

process executing in kernel mode; real-time process becomes runnable; real-time process must wait until current process is about to return to user mode or until it reaches a kernel preemption point

process explicitly makes a priocntl call to change priority or time quantum only super processes can enter real-time class; priocntl – specify priority and time quantum

default parameter table ↓Δ priority  ↑Δ time slice

struct rtproc  current time quantum  time remaining in quantum  current priority

dispatch latency time between  process becomes runnable  process begins to run

response time time between  occurrence of event  response to event time required by  interrupt handler to process the event  dispatch latency  real-time process to respond to the event

C. Robert Putnam Page 16 4/3/2018 025608908da3ab8c3d1cf3661d077766.doc 6:19 AM preemption points divide kernel algorithms into small bounded work units

real-time process becomes runnable 

rt_wakeup() [ class-dependent wakeup processing routine]  sets kernel flag kprunrun

current process executing kernel code  reaches preemption point  checks flag  initiates context switch to real-time process

latency bounds guaranteed only when real-time process is highest-priority runnable process on system

 real-time process is running on system  higher-priority real-time process becomes available  higher-priority real-time process is preferentially scheduled  latency calculation of lower-priority process is recalculated after process regains CPU

priocntl system call -- single process  change priority class of process  set ts_upri for time-sharing process  reset priority and quantum for real-time process  obtaining current values of scheduling parameters

priocntlset system call -- set of related process  change priority class of processes  set ts_upri for time-sharing processes  reset priority and quantum for real-time processes  obtaining current values of scheduling parameters

applies to all processes  in system  in process group or session  in scheduling class  owned by particular user  having same parent

C. Robert Putnam Page 17 4/3/2018 025608908da3ab8c3d1cf3661d077766.doc 6:19 AM Analysis  allows addition of scheduling classes to system  time-sharing class changes priorities based on events related to process  fast and scalable dynamic priority calculation  event-driven scheduling favors I/O bound and interactive jobs over CPU-bound jobs  interactive job with large computations  system will be unresponsive  total system load and job mix characteristics determines system response  system administer can alter system behavior by changing dispatcher table settings and rebuilding kernel; such retuning may be required to keep system efficient and responsive

 adding a scheduling class does not require access to kernel source code o provide implementation of each class-dependent scheduling function o initialize classfuncs vector to point to these functions o provide initialization function to perform setup tasks, e.g., allocating internal data structures o add entry for this class in the class table in a master configuration file, typically located in the master.d directory of the kernel build directory o entry contains pointers to the initialization function and classfuncs vector o rebuild kernel

 priocntl system call restricted to superuser  no good way for a time-sharing class process to switch to a different class  no provision for deadline-driven scheduling  code path between preemption points is too long for many time-critical processes  hard real-time systems require a fully preemptible kernel  extremely difficult to tune system for a mixed set of applications  system load varies constantly hence it is not reasonable to require careful manual tuning each time the load changes

C. Robert Putnam Page 18 4/3/2018 025608908da3ab8c3d1cf3661d077766.doc 6:19 AM Solaris 2.x Scheduling

multithreaded, symmetric multiprocessing operating system

preemptive kernel

fully preemptive kernel

global kernel data structures are protected by synchronization objects  mutual exclusion locks (mutexes)  semaphores

implementation of interrupts using special kernel threads use the standard synchronization primitives of the kernel

 rarely needs to raise the interrupt level to protect critical regions  few nonpreemptible code segments  higher-priority process is scheduled as soon as it is runnable

interrupt threads run at the highest priority scheduling classes can be dynamically loaded priorities of the interrupt threads are recomputed to ensure that they remain at the highest possible level blocked thread can only be restarted on the same processor

multiprocessor support

single dispatch queue for all processors slelected threads can be restricted to a single, specific processor processors communicate by sending cross-processor interrupts

per-processor data structure -- scheduling variables cpu_thread thread currently running on this processor cpu_dispthread thread last selected to run on this processor cpu_idle idle thread for this processor cpu_runrun preemption flag used for time-sharing threads cpu_kprunrun kernel preemption flag set by real-time threads cpu_chosen_level priority of thread that is going to preempt the current thread on this processor

C. Robert Putnam Page 19 4/3/2018 025608908da3ab8c3d1cf3661d077766.doc 6:19 AM Thread T1, priority 120 Processor P1 Thread T2, priority 130 Processor P2 Thread T3, priority 100 Processor P3 Thread T4, priority 132 Processor P4 Thread T5, priority 135 Processor P5

Thread T6, priority 130 Blocked

Thread T7, priority 115 Blocked

 event on processor P1 makes thread T6 runnable  kernel o places T6 on dispatch queue o calls cpu_choose() to find processor with the lowest priority thread

cpu_choose() . marks selected processor for preemption . sets cpu_chosen_level to priority T6 value130 . send cross-processor interrupt

 event on processor P2 makes thread T7 runnable  kernel o places T7 on dispatch queue o calls cpu_choose() to find processor with the lowest priority thread

cpu_choose() . compares cpu_chosen_level to priority T7 value115 . does not mark selected processor for preemption

 processor P3 o handles the interrupt o preempts thread T3

hidden scheduling

C. Robert Putnam Page 20 4/3/2018 025608908da3ab8c3d1cf3661d077766.doc 6:19 AM priority inversion

implementation of priority inheritance

limitations of priority inheritance

turnstiles

analysis

Mach Scheduling

C. Robert Putnam Page 21 4/3/2018 025608908da3ab8c3d1cf3661d077766.doc 6:19 AM C. Robert Putnam Page 22 4/3/2018 025608908da3ab8c3d1cf3661d077766.doc 6:19 AM C. Robert Putnam Page 23 4/3/2018 025608908da3ab8c3d1cf3661d077766.doc 6:19 AM C. Robert Putnam Page 24 4/3/2018 025608908da3ab8c3d1cf3661d077766.doc 6:19 AM