CARNEGIE MELLON UNIVERSITY - 15418: PARALLEL COMPUTER ARCHITECTURE AND PROGRAMMING

sdgOS: Rotating Between Algorithms

Valerie Choung and Samuel Damashek

Abstract—

We extended the SouperDamGoodOS (sdgOS) from 15-410 to support symmetric multiprocessing (SMP). Additionally, we created a mechanism for a process to select which scheduling mode it would prefer to run under. The two scheduling modes currently supported are a normal round-robin scheduling algorithm with thread-level granularity and an approximation of gang scheduling. In this paper, we discuss the implementation details of our operating system and analyze the performance of some sample programs under both scheduling algorithms.

Keywords—Scheduling, Gang scheduling, OS, Autogroup.

1 Introduction

A very simple scheduling algorithm (which we call the "normal scheduling mode", or NSM) may treat all threads equally and rotate through threads in a round-robin fashion. However, this introduces process/task-level fairness issues: if one process creates many threads, threads in that process would take up more processor time slices in a given time frame when compared against other, less thread-intensive processes.

Gang scheduling is one way to ensure that a process experiences a reasonable number of clock ticks before the process is switched away from, to be run at a later time. A pure gang scheduler will keep track of a currently active process to be run, and at any given moment, all processors will be executing threads that belong to the currently active process (or they will be idle, if no additional threads are available). This also means that a thread in any process can assume that other processors will be working exclusively on related threads.

Also related is coscheduling. Coscheduling is very similar to gang scheduling, except that if a processor cannot find more work from the currently active process, it will steal work from another process's work queue.

Our project builds on the SouperDamGoodOS (sdgOS), which runs on an x86 machine supporting a PS/2 keyboard. This environment can be simulated on Simics. Our study primarily has three parts to it:

1) Implement symmetric multiprocessing (SMP).
2) Support scheduling algorithm rotations.
3) Implement a syscall to switch a process from one scheduling mode to another (process scheduling mode switching).

Currently, our project supports process scheduling mode switching from normal scheduling mode (NSM) to gang scheduling mode (GSM). The other direction (GSM to NSM) is unimplemented due to time constraints, but the logic is almost symmetrical.

From this project, we observe that the performance of programs under GSM is greatly influenced by the length of the timeslices between scheduler mode rotations. Additionally, each rotation incurs a non-negligible time cost due to inter-processor barriers.
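To make the two policies concrete, here is a minimal sketch of each selection rule. It is purely illustrative: the types and function names are hypothetical inventions for this sketch, not the actual sdgOS interfaces.

    #include <stddef.h>

    /* Hypothetical types for illustration; not the actual sdgOS structures. */
    typedef struct thread {
        struct thread *next;
        struct process *proc;       /* owning process */
    } thread_t;

    typedef struct process {
        thread_t *thread_runq;      /* this process's runnable threads */
    } process_t;

    /* NSM: one global round-robin queue in which all threads are equal,
     * so a process that spawns many threads receives more time slices. */
    thread_t *nsm_pick_next(thread_t **global_runq) {
        thread_t *t = *global_runq;
        if (t != NULL)
            *global_runq = t->next; /* the outgoing thread is re-enqueued
                                       at the tail when it is preempted */
        return t;
    }

    /* Pure gang scheduling: a processor may only pick a thread belonging
     * to the currently active process; otherwise it idles. (If it stole
     * work from another process instead, that would be coscheduling.) */
    thread_t *gsm_pick_next(process_t *active, thread_t *idle_thread) {
        thread_t *t = active->thread_runq;
        if (t == NULL)
            return idle_thread;
        active->thread_runq = t->next;
        return t;
    }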

1.1 Secrecy

The 15-410 course staff is notoriously secretive about the nature of many design decisions students commonly encounter during the kernel project. This is driven by a desire to cause students to find and come up with solutions to these decisions on their own. To preserve this secrecy, we occasionally censor design decisions that we made in our original kernel and in our SMP extension in our online report; censored passages appear as "[...]". However, nothing is censored in the final version that we uploaded to Gradescope.¹

¹ This section is paraphrased from Ben Blum's PhD dissertation [2].

2 Design and Implementation

2.1 Mutexes

Our original, single-processor kernel mutexes were [...]. [...] This makes sense on a single-processor system, since [...].² Thus, we adapted our kernel mutexes to SMP by [...].

² This assumption holds because we use mutexes only for short, bounded-length critical sections.

2.1.1 GSM adaptation

We designed a further extension of our kernel mutexes which considered gang scheduling, but did not end up implementing it for reasons explained below. Most notably, we realized it may go against the philosophy of gang scheduling if we were running under GSM and were to [...]. Unfortunately, solving this problem would have required a significant redesign of our concurrency primitives. It doesn't suffice to simply [...]; we believe a robust solution would be to [...]. We decided that this level of code redesign is not worth it for the scale of our project, though it would be necessary in a production system.
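Because the mutex design itself is censored above, the following is only a generic baseline of the kind of primitive an SMP port needs: a spin mutex built on x86's atomic XCHG. Spinning is defensible here because, per the footnote above, our mutexes guard only short, bounded-length critical sections. None of this code is taken from sdgOS.

    /* 0 = free, 1 = held. */
    typedef struct {
        volatile int locked;
    } spin_mutex_t;

    /* Atomically swap val with *addr and return the old value. x86's
     * XCHG with a memory operand is implicitly LOCKed. */
    static inline int xchg(volatile int *addr, int val) {
        asm volatile("xchgl %0, %1"
                     : "+r"(val), "+m"(*addr)
                     :
                     : "memory");   /* also a compiler barrier */
        return val;
    }

    void spin_lock(spin_mutex_t *m) {
        while (xchg(&m->locked, 1) != 0)
            ;                       /* spin: critical sections are short */
    }

    void spin_unlock(spin_mutex_t *m) {
        xchg(&m->locked, 0);        /* release with a full fence */
    }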

2.2 Scheduler lock and Context Switching

[...]

2.2.1 Abstraction vs. Implementation

As a programmer who is not implementing scheduler locks, one would only need to know that after a call to sched_lock, there needs to be a corresponding sched_unlock. This is a little deceiving, though. Consider a single processor that uses the scheduler lock: [...] This is actually fine, since after completing a context switch, the destination thread will call sched_unlock. sched_unlock will [...] on behalf of the source thread, [...]. In summary, after a context switch, the call to sched_unlock actually corresponds to the sched_lock performed by the thread that previously ran on the same processor.

2.2.2 Partial locks

Since the scheduler lock isn't initially held by a newly-created thread, a [...] to a new thread could be a problem. This is because [...]. This motivates partial_lock and partial_unlock, which will [...]. (It also turns out that partial locks can be used in conjunction with [...] barriers, which is nice.)

2.3 Context Switching to New Thread

[...] So, after we context switch to a new thread, the code that sets up the new thread for entering user space will perform a partial_unlock. To avoid the race condition where a timer interrupt could trigger a context switch before the partial unlock is performed, we flag a thread if it is new. If the flag is set, then the context switcher itself will force interrupts to be disabled. Later, the code that sets up the new thread for entering user space will resolve the interrupt flag (through iret).
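A sketch of that hand-off follows. Only partial_unlock is named in the text; the other identifiers (new_thread_setup, enter_user_space, the global sched_lock object) are hypothetical stand-ins.

    /* Hypothetical declarations for illustration. */
    struct sched_lock;
    extern struct sched_lock sched_lock;
    void partial_unlock(struct sched_lock *);        /* see Section 2.2.2 */
    void enter_user_space(void) __attribute__((noreturn));

    /* First kernel code run by a newly created thread. The context
     * switcher saw this thread's "new" flag and left interrupts disabled,
     * so no timer interrupt can trigger a context switch before the
     * partial unlock below is performed. */
    void new_thread_setup(void) {
        partial_unlock(&sched_lock);  /* release our share of the scheduler lock */
        enter_user_space();           /* the final iret restores EFLAGS.IF,
                                         re-enabling interrupts atomically
                                         with the drop to user mode */
    }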
2.3.1 A Failed Idea

Ultimately, we decided to keep our general context-switching logic roughly the same as in the original NSM-only sdgOS. At one point, we tried a form of "optimistic" context switching: essentially, a thread could be placed in the work queue while simultaneously running. The context switcher would then check whether the target thread was currently running on a different processor; if so, it would roll back and find a different thread to run. This mechanism did work, but we decided that this form of context switching was more detrimental to the development process than helpful: since threads could be both running and runnable, debugging became much harder, and the context switch now contained a potentially O(n) operation. While this is not dissimilar to spinning in an idle thread, it is objectively more wasteful. Furthermore, that O(n) operation can become very expensive when there are many runnable or running threads. This does not scale, so we eventually discarded the idea.

2.3.2 Timer

Initially, the bootstrapping processor (BSP, aka cpu0) would propagate timer interrupts to all the application processors (APs). This caused all the processors to attempt to context switch at the same time, increasing contention for the scheduler lock. Additionally, when running many short threads (something like slaughter print_basic 5 5 0)³, there would be a lot of contention for the mutexes used to kill threads. We reduce contention by offsetting the APs' timers. Contention for locks is still a bottleneck in some applications, but we did not explore optimizations for this.

³ slaughter print_basic 5 5 0 basically forkbombs with threads that print "Hello World".

2.4 Work Queues

Most multi-processor systems appear to use work-stealing schedulers, where each processor has its own work queue. While this would be possible to incorporate into our design, we leave it as future work.

Our design consists of two queues: one for NSM and one for GSM. The NSM queue is a basic round-robin queue. The GSM queue is a multi-tiered queue, where the high tier corresponds to processes; each process then contains a queue of its own threads. Figure 1 in the Appendix illustrates how the runlists work together.

The size of the timeslices between NSM and GSM can be configured easily. Within each scheduling mode, every timer tick triggers a context switch to a thread within the same scheduling mode. For GSM, the currently active process is rotated once each time we switch away from GSM to NSM. Rotating the currently active process when switching from NSM to GSM would work just as well, and it makes no practical difference, as long as the choice is consistent.
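The following sketch shows one plausible shape for these two runlists; the type and field names are illustrative inventions, not the actual sdgOS definitions.

    /* Illustrative shapes for the two runlists. */
    typedef struct thread_node {
        struct thread_node *next;
    } thread_node_t;

    typedef struct gsm_process {
        struct gsm_process *next;   /* high tier: ring of gang processes */
        thread_node_t *threads;     /* low tier: this process's thread queue */
    } gsm_process_t;

    typedef struct {
        thread_node_t *nsm_runq;    /* flat round-robin queue (NSM) */
        gsm_process_t *gsm_active;  /* current head of the GSM high tier */
    } runlists_t;

    /* A timer tick within GSM rotates inside gsm_active->threads only;
     * advancing the high tier (gsm_active = gsm_active->next) happens
     * once per switch away from GSM to NSM, as described above. */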

2.5 set_sched_mode Syscall

The set_sched_mode syscall suggests to the scheduler what scheduling mode the calling process would like to run under. Currently, we allow processes under the normal scheduling mode to switch themselves over to the gang scheduling mode. The other direction (gang to normal mode) is unimplemented, but the concept is symmetric.

Converting a process from one scheduling mode to another is surprisingly nuanced. Below we list some considerations:

1) Should we guarantee that all threads in the process are running under the new scheduling mode immediately after the syscall exits?
2) Should gang processes be allowed to fork?
3) When should the next barrier-and-context-switch be?

After answering these questions, we decided that GSM would probably be simpler to implement as an approximation of gang scheduling, rather than pure gang scheduling.

The syscall itself works roughly as follows, as sketched after this section: We check if the requested scheduling mode is the same as the current process's scheduling mode. If so, we do nothing. If the process's scheduling mode is NSM and it would like to run under GSM, then the syscall scours the NSM runlist to find all runnable threads corresponding to the process. It then puts those threads into the process's runlist, then puts the process into the GSM runlist.

While unimplemented, the reverse direction would be very similar: it would move all threads in the process's runlist into the NSM runlist, and remove the process from the GSM runlist.

With this implementation, we do not guarantee that all threads in the process are running under the new scheduling mode immediately after the syscall exits. Processes running under GSM are not allowed to fork; doing so will raise an error. Likewise, threads running under GSM are not allowed to create new threads. Practically speaking, this means that programs must know how many threads they will need before switching from NSM to GSM.

The syscall does not delay or trigger the next context switch (aside from missed timer interrupts due to the locked scheduler lock).
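Here is a sketch of the NSM-to-GSM path just described. The runlist helpers and the error convention are hypothetical; only the overall sequence comes from the text.

    #include <stddef.h>

    enum sched_mode { SCHED_NSM, SCHED_GSM };

    /* Hypothetical process type and runlist helpers (cf. Section 2.4). */
    typedef struct process process_t;
    struct process { enum sched_mode sched_mode; };
    void sched_lock(void);
    void sched_unlock(void);
    struct thread *nsm_runlist_remove_thread_of(process_t *p);
    void process_runlist_enqueue(process_t *p, struct thread *t);
    void gsm_runlist_enqueue(process_t *p);

    int set_sched_mode(process_t *p, enum sched_mode mode) {
        if (mode == p->sched_mode)
            return 0;                       /* same mode: do nothing */
        if (mode != SCHED_GSM)
            return -1;                      /* GSM -> NSM is unimplemented */

        sched_lock();
        /* Scour the NSM runlist for this process's runnable threads and
           move them onto the process's own runlist. */
        struct thread *t;
        while ((t = nsm_runlist_remove_thread_of(p)) != NULL)
            process_runlist_enqueue(p, t);
        gsm_runlist_enqueue(p);             /* join the GSM high tier */
        p->sched_mode = SCHED_GSM;
        sched_unlock();
        return 0;
    }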

2.6 Barriers

In our implementation, inter-processor barriers are used to rotate scheduling modes.

2.6.1 Failed Attempts

Initially, we used one barrier to rotate scheduling modes, and another barrier for switching a process's scheduling mode. However, we quickly ran into barrier-barrier deadlock, and so we changed our barrier into a sense-reversal barrier. It does not seem necessary, though, to actually use a barrier when switching a process's scheduling mode; acquiring the scheduler lock is sufficient.

We also attempted to use the MFENCE instruction in our barrier, but when testing on real hardware with a Pentium II processor, we realized that MFENCE was not a valid instruction. We address this problem in Section 4.1.1.

2.6.2 Final barrier design

For our kernel barriers, we ended up using the sense-reversal barrier for switching between scheduler modes. There is no particular reason for our choice of barrier, other than to make additional development on the kernel easier if we need to use a barrier for other purposes in the future.

One downside is that this barrier requires more space than a standard barrier. Additionally, our sense-reversal barrier does not space out the memory used for processor-local senses. This means that if run on a real system with many cores, the barrier would cause several cache misses due to false sharing. This could be fixed if we were to find out the size of an L1 cache line on our machine, at the cost of even more memory usage; we did not do so because of time constraints.
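For reference, a textbook sense-reversal barrier looks roughly like the following; this is the standard construction, not our exact kernel code. Note that the per-processor senses sit in one unpadded array, which is exactly the false-sharing concern mentioned above.

    #define MAX_CPUS 8

    typedef struct {
        volatile int count;                 /* processors arrived so far */
        int total;                          /* processors participating */
        volatile int global_sense;
        volatile int local_sense[MAX_CPUS]; /* unpadded: false-sharing risk */
    } sr_barrier_t;

    void sr_barrier_wait(sr_barrier_t *b, int cpu) {
        int sense = !b->local_sense[cpu];   /* flip this cpu's sense */
        b->local_sense[cpu] = sense;
        if (__sync_add_and_fetch(&b->count, 1) == b->total) {
            b->count = 0;                   /* last arriver resets the count */
            b->global_sense = sense;        /* ...and releases everyone */
        } else {
            while (b->global_sense != sense)
                ;                           /* spin until released */
        }
    }

Because each episode flips the sense, the barrier can be reused immediately, which is what avoids the barrier-barrier deadlock of the naive counter-reset design.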

2.7 Gang Scheduling Approximation

Our GSM mode is only an approximation of gang scheduling. We do not immediately trigger context switches after a call to set_sched_mode, nor do we immediately rotate the currently running scheduling algorithm. Instead, we allow the threads to continue running until the next timer tick. What this means is that right after a scheduler mode rotation, there will be a brief period (around 2 timer ticks) where some threads will still be running in the other scheduling mode. This just makes our GSM non-pure gang scheduling.

2.7.1 The Yield

One complication is that after a call to sched_yield (possibly through kernel scheduler logic or through the yield syscall), [...], though it is an imperfect solution. [...] It would be difficult to force threads to switch to a thread that is running in a different mode. Allowing an impure form of gang scheduling lets us ignore this complication.

2.8 Comparison With Linux Autogroup

Linux Autogroup is a semi-experimental scheduling feature that works with the Completely Fair Scheduler (CFS) and allows targeted processes to be approximately gang-scheduled as well (though it does not call this gang scheduling). It depends on a task's "niceness" factor to determine how to prioritize tasks within the gang scheduler. Our implementation of GSM is different because it is agnostic of the actual task being performed.

3 Other Considerations and Misc

We got a large amount of work done, considering the hurdles we ran into (see Section 4.2, Debugging the Kernel). However, due to time constraints, we skimped on a few aspects of the operating system that we considered nice-to-haves. In particular, we do not implement TLB shootdowns, because they are orthogonal to the question of scheduling. This primarily affects remove_pages correctness, but that is mostly irrelevant to our project.

3.0.1 Dedicated Thread

[...]

4 Testing

4.1 Testing Infrastructure

For development, we used Wind River Simics to simulate an x86 machine with 2 to 4 processors.

One problem we ran into was that Simics does not accurately generate MP tables: it underestimates the number of processors and maps them incorrectly into the table. Although Simics claims to support up to 16 processors, we found that it only supports 1, 4, and 8 processors. Moreover, when 4 processors are simulated, only processors 0 and 2 are mapped into the MP table (as cpu numbers 0 and 1), while processors 1 and 3 are completely lost. Likewise, when 8 processors are specified, processors 1, 3, 5, and 7 are lost.

Additionally, Simics in its default configuration does not simulate a cache (instead, any memory access occurs as fast as possible), so latency due to cache misses is not considered when testing on Simics. We did not optimize for this when running on the crash machine, however.

4.1.1 Running on the Crash Machine

Because many performance characteristics are unrealistic in Simics, we ran our final benchmarks on a real multi-processor system, the 15-410 "crash machine". This machine contains two 400 MHz Pentium II Xeon processors, 512 megabytes of RAM, a floppy drive, and a CD-ROM. In order to get our kernel to run properly on this machine, we had to make a couple of modifications, which required significant debugging (see Section 4.2.3).

We realized that we had been using the MFENCE instruction to ensure instruction ordering in our locking code, but this instruction was only introduced in Intel's Pentium 4 family. The code worked in Simics, but on the crash machine it likely resulted in an Invalid Opcode fault. Because of this, we changed our code to use the CPUID instruction as a memory fence. CPUID is both supported on Pentium II processors and guaranteed to be a serializing instruction by the Intel processor specification.
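The replacement fence is short to express; the following is a sketch of the usual construction rather than our exact kernel code. CPUID clobbers EAX, EBX, ECX, and EDX, and the "memory" clobber also stops the compiler from reordering memory accesses across the fence.

    /* CPUID-based memory fence for pre-Pentium 4 x86 (sketch). */
    static inline void cpuid_fence(void) {
        unsigned int eax = 0, ebx, ecx, edx;
        asm volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                     :
                     : "memory");   /* compiler-level barrier as well */
    }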
It's also relevant that, by default, instruction re-ordering does not occur in Simics for technical reasons (execution is sequentially consistent). However, instructions may be re-ordered on the crash box, which further complicates testing and requires debugging tricky race conditions on the crash machine instead of in Simics (where they may not occur).

Indeed, we ran into what we believe to be a tricky race condition on the crash machine which we did not have time to diagnose. While running our test programs, the programs would sometimes deadlock. The programs would not crash, and the kernel would continue to accept keyboard and timer interrupts and respond appropriately. We believe this to be an issue with our user-space concurrency primitives that, again, we did not have time to address. When this happened, we would discard that result, restart the machine, and restart the test.

4.2 Debugging the Kernel

4.2.1 Simics Fun

Debugging the kernel was quite an adventure. Simics documentation shows that backtrace is supported on multi-processor simulations; however, we found that it is only ever able to identify the latest stack frame, so backtrace simply does not work. Some of our tests run for upwards of 30 minutes, and adding debug statements makes this even slower. As such, printf debugging was not useful for any of the longer traces. We debugged the kernel primarily by stepping through assembly lines and manually inspecting the code. This is certainly one of the main reasons we did not progress as far in our project as we had hoped.

To run our kernel in Simics:

1) Navigate to the root folder of our project.
2) Run make.
3) Set SIMICS_CPU_COUNT to 1 if you want to run on a uniprocessor, 4 if you want to run with 2 processors, or 8 if you want to run with 4 processors (this odd mapping follows from the MP table behavior described in Section 4.1.1).
4) Run simics64.
5) Once Simics is ready, enter r to boot the kernel.

4.2.2 Simics Debugging Extension

We had a Simics debugging extension (damgood_tcbs) that allowed us to quickly figure out what threads were running on which processor, as well as the states they were in. This was useful for identifying whether or not our kernel was in an error state.

4.2.3 Debugging on a Real System

While the Simics debugging extension and the use of Simics debug prints were nice for development, we had to add some other debug utilities to our kernel so we could get our kernel working on a real x86 machine (see Section 4.1.1). Initially, we used "POST codes", which are displayed on a small display on the front of the crash machine. We inserted unique codes at important points in our kernel so we could tell which threads were being context-switched to, and when different parts of the kernel were initialized.

This was not scalable to more detailed debugging information, so we added a function in our console library which could print debugging information to the display without using any locks. We suspected there might be an issue in our locking code during debugging, so if our debug prints also required locks, then the debug prints might have been broken as well. Additionally, we called our debug print code in interrupt code, and acquiring mutexes in that code could have resulted in deadlock. For debugging, we printed the number of times any [...] occurred, the number of times any mutex lock occurred, and the number of times any context switch occurred.

4.3 How We Tested User Programs

We tested user programs by inserting calls to get_ticks() to retrieve the number of CPU ticks elapsed since boot. This provided us with rough benchmarks of the programs. One of the basic correctness tests we ran is agility_drill_gang. This test is adapted from the 15-410 test agility_drill, which is intended to stress-test user-mode mutexes after spawning several threads. We modified agility_drill to enter GSM after spawning its threads.
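A minimal sketch of this measurement pattern follows, assuming get_ticks() is exposed to user programs as described; the harness around it is hypothetical.

    #include <stdio.h>

    extern unsigned int get_ticks(void);  /* CPU ticks since boot (syscall) */

    /* Time a workload in ticks by bracketing it with get_ticks() calls. */
    static unsigned int time_workload(void (*workload)(void)) {
        unsigned int start = get_ticks();
        workload();
        return get_ticks() - start;
    }

    void report(const char *name, void (*workload)(void)) {
        printf("%s: %u ticks\n", name, time_workload(workload));
    }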
5 Analysis

5.1 Fine-grained Locking

We tested the performance of user programs with fine-grained locking under GSM and NSM. Our test program was mandelbrot, which was provided to us by 15-410. We based our GSM test on mandelbrot, modifying the original test by inserting a call to set_sched_mode after all the threads were spawned if gang is passed as an argument to the test.

On our physical test machine (see Section 4.1.1), we tested the performance of this mandelbrot fractal generator by first launching the largetest program, and then launching the mandelbrot program. We eventually encountered what seemed to be some boot failures, meaning that we did not acquire as many data points as we would have liked (see Section 4.2.3). This makes it difficult to draw conclusions about what our data suggests. For what it's worth, the average number of console writes performed by mandelbrot in 10,000 cycles was 1055.5 when running under GSM and 988 when running under NSM.

Using an unpaired t-test, we determined that the difference in the number of console writes between GSM and NSM is not statistically significant. Though we ran into the issue of insufficient data points, there are a few things we could have done to improve our testing. One idea is to run another process in GSM mode while mandelbrot is already running in GSM. This would have provided a fairer comparison against the case where mandelbrot was running under NSM alongside largetest. With the way our testing was actually performed, we also ended up indirectly measuring speedup due to priority scheduling (see Section 5.1.1), since the speedup due to GSM was not isolated.

5.1.1 Speedup due to Priority Scheduling

Since we realized that testing priority scheduling was fairly simple given our existing scheduling infrastructure, we decided to do so using the Simics platform. Again, we ran tests on mandelbrot under GSM and NSM, but without loading largetest beforehand. Without other processes running concurrently, GSM and NSM should theoretically run in exactly the same way. However, we could configure GSM to run for longer amounts of time between scheduler mode rotations. This configurable length of time is functionally equivalent to a priority in priority scheduling.

While we do not have concrete numbers for this test, we observed that a complete render of mandelbrot under high priority took significantly less time than a render of mandelbrot under low priority. Increasing the GSM time interval by 10× gave us roughly 25× speedup, and the speedup with respect to GSM time-interval increases grew somewhat linearly (by semi-qualitative examination). This suggests that a lot of the time costs are incurred during scheduling mode rotations.
5.2 Overhead Costs

Some timer interrupts become very expensive, since whenever a scheduling mode rotation is necessary, all processors must enter a barrier. We tried to mitigate this issue by only performing a scheduling mode rotation every few timer ticks (a configurable constant), so that not every single timer tick triggers a rotation.

6 Future Work

There are several things we would have liked to do, but did not have the time for. We list them below:

1) A dynamic (multi-tiered) work queue with work stealing. Each processor would have its own associated queue, to avoid contention for the scheduler lock.
2) Support for process scheduling mode switching from GSM to NSM, not only NSM to GSM.
3) Support for more scheduling algorithms.
4) Mutex support for gang scheduling.
5) Exploring the use of mutexes instead of spinning in our barrier implementation.
6) TLB shootdowns. This is less relevant to our analysis of scheduling algorithms, but would be nice to have as a correctness guarantee.

7 Conclusion

We successfully added support for symmetric multiprocessing in our 15-410 kernel. This support was confirmed through testing both in the Simics emulator and on real x86 hardware. As an extension of this work, we expanded our scheduler to add support for "gang execution" of processes, where a process has complete ownership of the processor for a certain time interval.

This also added a de facto priority system, where processes executing in GSM are not significantly impacted by a large number of threads running in NSM. We confirmed this empirically in Simics, noting significant speedup when a test was run in GSM versus NSM while another large workload ran in NSM. We did not reach an empirical conclusion on the impact of gang scheduling itself on program speed, but we have some ideas for future work which could reach such a conclusion.

8 List of Student Tasks

8.1 Valerie Choung

Much of GSM, including:

1) Context switching logic
2) Scheduler logic
3) Writing correctness tests
4) Barrier implementation
5) Syscall logic
6) A lot of this design document

8.2 Samuel Damashek

Much of SMP, including:

1) [...]
2) Simics debugging extension
3) Mutex lock implementation
4) Scheduler lock implementation
5) Timer/AP configurations
6) Benchmarking / real hardware

8.3 Total Distribution

Approximately 50/50.

9 Appendix

Figure 1: Selection of the next GSM thread.

10 Resources

We consulted with Professor Todd Mowry on our project idea and progress. We also consulted with Professor Dave Eckhardt on SMP design, existing gang scheduling algorithms, and debugging. We would like to thank both professors for their assistance throughout our project.

References

[1] SMP4: Symmetric Multiprocessing for Pebbles.
[2] Ben Blum. Practical Concurrency Testing. PhD dissertation, 2018.