OPERATING SYSTEMS – ASSIGNMENT 3

SYNCHRONIZATION AND MEMORY MANAGEMENT

Introduction

The main goal of this assignment is to teach you about XV6 synchronization mechanisms and memory management. First, you will learn and use the Compare And Swap (CAS) instruction by adding support for the use of the atomic CAS instruction. Next, you will replace XV6's process management synchronization mechanism with your own lock-free implementation (using CAS). This will establish an environment that supports multiple processors/cores. Finally, you will learn how xv6 handles memory and extend it by implementing a paging infrastructure which will allow xv6 to store parts of the process' memory in a secondary storage.

You can download the xv6 sources for this assignment by executing the following command:

git clone http://www.cs.bgu.ac.il/~osce171/git/Assignment3

Task 1: CAS as a synchronization mechanism

Compare And Swap (CAS) is an atomic operation which receives 3 arguments, CAS(addr, expected, value). It compares the content of addr to expected; only if they are equal, it sets the content of addr to value and returns true (otherwise it leaves addr unchanged and returns false). This is done as a single atomic operation. CAS is used by many modern synchronization algorithms. At first glance, it is not trivial to see why CAS can be used for synchronization purposes. We hope it is easy to understand through the following simple example:

A multi-threaded shared counter is a simple data structure with a single operation:

int increase();

The increase operation can be called from many threads concurrently and must return the number of times that increase was called. The following naive implementation does not work:

// Shared global variable that will hold the number
// of times that 'increase' was called.
int counter;

int increase() {
  return counter++;
}

This implementation does not work when it is called from multiple threads (or by multiple CPUs which have access to the same memory space). This is due to the fact that the counter++ operation is actually a composition of three operations:

1. Fetch the value of the global counter variable from memory and put it in a CPU-local register

2. Increment the value of the register by one

3. Store the value of the register back to the memory address referred to by the counter variable

Let us assume that the value of the counter at some point in time is C and that two threads attempt to perform counter++ at the same time. Both threads may fetch the current value of the counter (C) into their own local register, increment it, and then store the value C+1 back to the counter address (a race condition). However, since there were two calls to the function increase, we missed one of them (i.e., we should have ended with counter=C+2).

One can solve the shared counter problem using spin locks (or any other locks) in the following way:

int counter;
spinlock lock;

int increase() {
  int result;
  acquire(lock);
  result = counter++;
  release(lock);
  return result;
}

This solution will work. One drawback of this approach is that while one thread holds the lock, all other threads that call the increase function must wait before advancing into it. In the new XV6 code attached to this assignment, look at the function allocpid in proc.c, which is called from allocproc in order to obtain a new pid for a newly created process. This function is in fact a shared counter implementation.

Another possible solution is to use CAS to solve this problem:

int counter = 0;

int increase() {
  int old;
  do {
    old = counter;
  } while (!CAS(&counter, old, old+1));
  return old+1;
}

In this approach multiple threads can be inside the increase function simultaneously; in each round of contention, exactly one thread's CAS will succeed and exit the loop, while the others retry updating the value of the counter. In many cases this approach results in better performance than the lock-based approach - as you will see in the next part of this assignment. So to summarize, when one uses spinlocks, one must follow the pattern:

1. Lock using some variable or wait if already locked

2. Enter critical section and perform operations

3. Unlock the variable

If the critical section is long - threads/CPUs are going to wait a long time. Using CAS you can follow a different design pattern (one that is used in many high-performance systems):

1. Copy shared resource locally (to the local thread/cpu stack)

2. Change the locally copied resource as needed

3. Check the shared resource - if it is the same as the copy you made, commit your changes to the shared resource; otherwise retry the whole process.

This method effectively reduces the "critical section" to a single step (step 3), because only step 3 affects the shared resource. This single step is performed atomically using CAS, and therefore no locking is needed.

1.1. Implementing CAS in XV6:

In order to implement CAS as an atomic operation we will use the cmpxchg x86 assembly instruction with the lock prefix (which forces cmpxchg to be atomic). Since this instruction is only available from assembly code, in this part of the assignment you are required to learn to use GCC inline assembly. You can find many examples of inline assembly inside the XV6 code itself, especially in the file x86.h.

Create a function called cas inside the x86.h file which has the following signature:

static inline int cas(volatile int *addr, int expected, int newval);

Use inline assembly in order to implement cas using the cmpxchg instruction (note that to access the ZF (zero flag) you can use the pushfl assembly instruction, which pushes the FLAGS register onto the stack).
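For reference, here is a minimal sketch of one possible implementation. It reads ZF with the sete instruction rather than pushfl, so treat it as one option under that assumption and not as the required solution:

static inline int
cas(volatile int *addr, int expected, int newval)
{
  unsigned char result;
  // lock cmpxchgl: atomically compare *addr with %eax (expected); if equal,
  // store newval into *addr and set ZF. sete then copies ZF into result.
  asm volatile("lock; cmpxchgl %3, %0; sete %1"
               : "+m" (*addr), "=q" (result), "+a" (expected)
               : "r" (newval)
               : "memory", "cc");
  return result;
}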

To check that everything works, change the implementation of the allocpid function (inside proc.c) to use the cas-based shared counter implementation instead of the existing spinlock-based one.
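For instance, assuming nextpid remains the global counter that xv6 already uses, allocpid could look roughly like this sketch:

int
allocpid(void)
{
  int pid;
  do {
    pid = nextpid;                        // read the current value
  } while (!cas(&nextpid, pid, pid + 1)); // retry if another CPU raced us
  return pid;
}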

1.2. Enhancing XV6’s process management synchronization:

XV6 process management is built around the manipulation of a single process table (implemented as an array of struct proc and referred to as ptable). In a multi-processor/multi-core environment XV6 must make sure that no two processors manipulate (either execute or modify) the same process or its table entry - it does so by using a spinlock which is referred to as ptable.lock. A spinlock is a lock implementation which internally uses busy waiting. The usage of busy waiting may seem strange at first, especially in performance-critical code like the OS kernel, but at the kernel level, if a processor finds out it has to wait in order to schedule a process, there is nothing else it can do but re-attempt to schedule the process. At the user level, on the other hand, busy waiting is wasteful since the processor could actually run another process instead of waiting in a loop.

Busy waiting means that a processor loops while checking an exit condition again and again. Only when the exit condition is satisfied may the processor proceed (while other processors keep blocking on the same loop). In this part you are requested to change the current implementation to use the atomic CAS operation in order to maintain the consistency of the ptable.

When talking about process management, the main purpose of the synchronization mechanism (ptable.lock) is to protect the transitions between different process states – in other words, to ensure that the transitions between different process states are atomic. For example, consider the transition from the UNUSED state to the EMBRYO state. Earlier you saw that such a transition happens when the kernel wants to allocate a new process structure. The ptable.lock solved the race condition that can happen if two processes perform the fork system call concurrently, enter the allocproc function, and choose the same proc structure as their new proc. An additional example is the transition from the SLEEPING state to the RUNNABLE state. In this case, a synchronization mechanism is needed in order to prevent the "loss" of a "wakeup" event. At the time the process changes its state to SLEEPING the spinlock ptable.lock is held, and the same spinlock is needed in order to wake the process. This prevents missing the event that wakes up the process.

One downside of using ptable.lock as it was used in XV6 comes from the fact that one CPU reading or writing any information in the ptable prevents every other CPU from doing the same (even when their operations may not conflict or create a race condition). This reduces the efficiency of the system.

Consider the case where one process p1 executes the fork system call; this system call must allocate a new process structure. Choosing a new process structure is considered a critical section for the reasons listed above, and therefore it was protected by ptable.lock. At the same time, another process p2 may perform the wait system call, which also contains a critical section protected by ptable.lock. In such a case, p1 may need to wait until p2 releases ptable.lock. The main issue here is that both processes could actually execute their code simultaneously, since they cannot affect one another in this specific case.

Here you are requested to use CAS in order to solve the above problems. This will require you to use CAS in order to make the transitions among the different process states atomic. Before attempting to do so, there are several issues you need to consider and which we will cover briefly.

When the kernel does its work in kernel space it can receive interrupts, which will cause it to change the code it is executing. Therefore, when attempting to make the process state transitions atomic you must consider the following cases:

• Atomic for the CPU that performs the transition, i.e., the CPU's code path cannot be changed externally while transitioning states.

• Atomic for CPUs other than the one performing the transition, i.e., other CPUs cannot interfere with the state transition.

The solution for the first case is straightforward – if one disables interrupts during the transitions between process states, then handling each transition becomes atomic for the CPU which performs the transition. The solution for the second case is not so straightforward. In order to understand the difficulties that can arise due to the second case let us consider the following problematic transitions:

• Transition from RUNNABLE state to RUNNING state – for such a transition one must ensure that only a single CPU executes the code of a selected process at a given time. Otherwise, two CPUs may execute the same process.

• Transition from RUNNING state to RUNNABLE state – at first glance there are no problems with the atomicity of this transition, since only the CPU that is currently running the process will perform such a transition. But if you examine this transition more carefully, you will find that its first stage is to change the state of the current process from RUNNING to RUNNABLE and only then to switch to the scheduler. Once the process state is changed to RUNNABLE, a different CPU can immediately choose it for execution. This is problematic because the process has not fully transitioned yet; for example, its registers may not have been saved yet, which means that the process context is not ready for execution. This means that the state of the process should not be changed from RUNNING to RUNNABLE at this stage, but only after it is actually ready to be executed. But what is the state of a process that is between RUNNING and RUNNABLE? It seems that we need more process states. We will use the state −X (X can be any of the existing states) to express: "this process is transitioning to X". Therefore, a RUNNING process will not immediately transition to RUNNABLE, but instead to −RUNNABLE. It will only be changed to RUNNABLE when the transition is completed (i.e., after the context switch).

• Transition from RUNNING state to SLEEPING – this transition is similar to the previous one and therefore can be solved using the new transitional states (−X).

To summarize, in this part you should only change the files proc.c and proc.h, the tasks in this part are:

1. Remove ptable.lock completely from proc.c.

2. Since acquiring ptable.lock also disables interrupts and releasing it re-enables them, you need to replace each of these calls with corresponding pushcli and popcli calls.

3. Examine each state transition and decide where it starts and ends. When the transition from X to Y starts, atomically switch the process state to −Y, and when it ends, atomically switch −Y to Y (a sketch of this idea appears after the notes below).

4. The different functions in proc.c may check the different process states and decide on their actions according to the state. As a rule of thumb, most such checks should be replaced with a check of the absolute value of the state (e.g., if a function originally checked whether the state of a process p is RUNNING, it should now check whether the state of p is RUNNING or −RUNNING). This is not true for all instances.

5. Test your implementation: make sure that usertests passes.

• Pay attention that a context switch can only be started from two functions: (1) scheduler and (2) sched. A context switch that starts at sched will end in scheduler, and a context switch that starts at scheduler will end in one of two functions: (1) sched or (2) forkret – make sure you understand why.

• Special protection is needed (for multiple CPUs) for transitions involving the following states: RUNNABLE, SLEEPING and ZOMBIE.
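To make the transitional-state idea concrete, here is a minimal sketch of the RUNNING to RUNNABLE hand-off. The negative-value encoding is an assumption (it requires that no state be encoded as 0, otherwise −UNUSED would collide with UNUSED), and the exact function and variable names depend on your xv6 revision:

// In the yield()/sched() path, before the context switch:
if (!cas((int*)&p->state, RUNNING, -RUNNABLE))
  panic("yield: unexpected state");
swtch(&p->context, cpu->scheduler);  // hand control to the scheduler

// In scheduler(), after swtch() returns control to it, the transition
// is completed on the process' behalf:
cas((int*)&p->state, -RUNNABLE, RUNNABLE);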

Task 2: Paging framework

Memory management is one of the most important features of any operating system. In this part of the assignment we will examine how xv6 handles memory and attempt to extend it by implementing a paging infrastructure which will allow xv6 to store parts of the process' memory in a 'virtual' secondary storage.

2.1. Xv6 memory overview:

Memory in xv6 is managed in pages (and frames) where each such page is 4096 (= 2^12) bytes long. Each process has its own page table which is used to translate virtual to physical addresses. The virtual address space can be significantly larger than the physical memory: in xv6, the process address space is 2^32 bytes long while the physical memory is currently limited to 224 MB only (see PHYSTOP in memlayout.h). When a process attempts to access some address in its memory (i.e., provides a 32-bit virtual address), the relevant page must first be located in the physical memory. Xv6 uses the first (leftmost) 20 bits to locate the relevant Page Table Entry (PTE) in the page table. The PTE contains the physical location of the frame – a 20-bit frame address which points to the relevant frame within the physical memory. To locate the exact address within the frame, the 12 least significant bits of the virtual address, which represent the in-frame offset, are concatenated to the 20 bits retrieved from the PTE. [Figure omitted: a rough illustration of this translation process.]

Maintaining a page table may itself require a significant amount of memory, so a two-level page table is used. [Figure omitted: how a virtual address translates into a physical one via the two-level page table.]

Each process has a pointer to its page directory (line 59, proc.h). This is a single page-sized (4096 bytes) directory which contains the page addresses and flags of the second-level table(s). The second-level table is spread across multiple pages, each of which is structured much like the page directory.

When translating an address, the first 10 bits are used to locate the correct entry within the page directory (using the macro PDX(va)). The physical frame address can then be found at the correct index of the second-level table (accessible via the macro PTX(va)). As explained earlier, the exact address is found with the aid of the 12 least significant bits (the offset).
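As a quick illustration, this is how the relevant macros from mmu.h decompose a 32-bit virtual address (the OFF macro below is added here for illustration only and is not part of xv6):

// From mmu.h (PDXSHIFT = 22, PTXSHIFT = 12):
#define PDX(va)  (((uint)(va) >> PDXSHIFT) & 0x3FF)  // page directory index
#define PTX(va)  (((uint)(va) >> PTXSHIFT) & 0x3FF)  // page table index
// Illustrative helper (not in xv6): the 12-bit in-page offset
#define OFF(va)  ((uint)(va) & 0xFFF)
// physical address = (frame base stored in the PTE for va) | OFF(va)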

• At this point you should go over the xv6 documentation on this subject: https://pdos.csail.mit.edu/6.828/2016/xv6/book-rev9.pdf (chapter 2)

• Before proceeding further we strongly recommend that you go over the code again. Now, attempt to answer questions such as:

o How does the kernel know which physical pages are used and unused?
o What data structures are used to answer this question?
o Where do these reside?
o Does xv6's memory mechanism limit the number of user processes?
o What is the smallest number of processes xv6 can 'have' at the same time?

• A very good and detailed account of memory management in the Linux kernel can be found here: http://www.kernel.org/doc/gorman/pdf/understand.pdf

• Another link of interest: "What every programmer should know about memory" http://lwn.net/Articles/250967/

2.2. Developing a paging framework:

An important feature lacking in xv6 is the ability to swap out pages to a backing store. That is, at each moment in time all processes are held entirely within the main (physical) memory. In this assignment, you are to implement a paging framework for xv6 which is capable of swapping pages out to secondary storage and retrieving them back into memory on demand.

We start by developing the process-paging framework. In our framework, each process is responsible for paging its own pages in and out (as opposed to managing this globally for all processes). To keep things simple, we will use part of the physical memory (which will represent secondary storage) to store swapped-out memory pages. Such an approach can be seen as memory compression, which is part of the Windows 10 memory management.

2.3. Swapping out:

We next detail some restrictions on a process' memory. At any given time a process should have no more than MAX_PSYC_PAGES (= 5) pages in physical memory. In addition, a process will not be larger than MAX_TOTAL_PAGES (= 100) pages. Whenever a process exceeds the MAX_PSYC_PAGES limitation, it has to select pages and move them to secondary storage in order to meet the memory restrictions.

To know which pages were swapped out and where they are located, you should maintain a data structure (i.e., paging meta-data). We leave the exact design of the required data structure to you; one possible shape is sketched after the notes below.

• You may want to enrich the process structure with the paging meta-data.

• Be sure to swap out only the process' private memory (if you don't know what this means, then you haven't read the documentation properly. Go over it again: https://pdos.csail.mit.edu/6.828/2016/xv6/book-rev9.pdf).

• There are several functions already implemented in xv6 that can assist you – reuse existing code.

• Don't forget to free the page's physical memory.
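For example, the per-process meta-data could look roughly like this (purely a sketch; every name and field here is illustrative, not required):

// Per-page record kept by the process (illustrative):
struct page_md {
  uint va;         // page-aligned virtual address of the page
  int in_memory;   // 1 if resident in physical memory, 0 if swapped out
  int swap_slot;   // slot index in the secondary storage, -1 if resident
};

// Added to struct proc (illustrative):
//   struct page_md pages[MAX_TOTAL_PAGES];
//   int resident_count;   // must stay <= MAX_PSYC_PAGES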

Whenever a page is moved to secondary storage, the process' page table entry should be marked to indicate that the page is not present. This is done by clearing the present flag (PTE_P). However, a cleared present flag does not by itself imply that the page was swapped out (there could be other reasons). To resolve this ambiguity we use one of the available flag bits in the page table entry to indicate that the page was indeed paged out. Add the following line to mmu.h:

#define PTE_PG 0x200 // Paged out to secondary storage

Now, whenever you move a page to secondary storage, set this flag as well.
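In code, marking a swapped-out page could look roughly like the following sketch (walkpgdir is declared static in vm.c, so you may need to expose it or replicate its logic):

pte_t *pte = walkpgdir(p->pgdir, (void*)va, 0);
*pte &= ~PTE_P;        // the page is no longer present in physical memory
*pte |= PTE_PG;        // ...because it was paged out, not unmapped
lcr3(V2P(p->pgdir));   // flush the TLB so stale translations are dropped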

2.4. Retrieving pages on demand:

While executing, a process may require paged-out data. Whenever the processor fails to access the required page, it generates a trap (trap 14, T_PGFLT). After verifying that the trap is indeed a page fault, use the %cr2 register to determine the faulting address and identify the page. Allocate a new physical page, copy its data from the secondary storage, and map it back into the page table. After returning from the trap to user space, the process retries the faulting instruction (which should no longer generate a page fault). A sketch of this path appears after the note below.

• Don't forget to check whether you passed MAX_PSYC_PAGES; if so, another page should be paged out first.
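Roughly, the page-fault path in trap.c might look like this sketch (swap_in, swap_out, choose_victim and resident_count are hypothetical names from the meta-data sketch above; use proc or myproc() according to your xv6 revision):

case T_PGFLT: {
  uint va = PGROUNDDOWN(rcr2());      // faulting address, rounded down to its page
  pte_t *pte = walkpgdir(myproc()->pgdir, (void*)va, 0);
  if (pte && (*pte & PTE_PG)) {       // the page was swapped out by our framework
    if (myproc()->resident_count >= MAX_PSYC_PAGES)
      swap_out(choose_victim(myproc()));  // make room first
    swap_in(myproc(), va);            // allocate a page, copy its data back, map it
    break;                            // the faulting instruction is retried
  }
  // otherwise this is a genuine fault; handle it as xv6 normally would
}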

2.5. Comprehensive changes in existing parts:

These changes also affect other parts of xv6, and you should modify existing code to properly handle the new framework. Specifically, make sure that xv6 can still support the fork system call: the forked process should have its own swapped-out memory which is identical to the parent's. Upon termination, the kernel should free the swapped-out memory as well as the process' private pages. A sketch of the fork side appears after the notes below.

• Make sure you have no memory leakage!

• Make sure that you can prove that no memory leakage exists!
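For instance, the fork-related change could look roughly like this (a sketch built on the illustrative meta-data above; copy_swap_storage is a hypothetical helper that duplicates the parent's swapped-out page contents):

// Inside fork(), after copyuvm() has duplicated the resident address space:
int i;
for (i = 0; i < MAX_TOTAL_PAGES; i++)
  np->pages[i] = curproc->pages[i];   // duplicate the paging meta-data
np->resident_count = curproc->resident_count;
copy_swap_storage(curproc, np);       // duplicate the swapped-out contents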

2.6. Page replacement schemes:

Now that you have a paging framework, there is an important question which needs to be answered: Which page should be swapped out?

As seen in class, there are numerous alternatives for selecting which page should be swapped out. Controlling which policy is compiled in is done with the aid of a makefile macro. Add policies by using the C preprocessor's abilities.

• You should read about #ifdef macros. These can be set during compilation by gcc (see http://gcc.gnu.org/onlinedocs/cpp/Ifdef.html).

For the purpose of this assignment, we will limit ourselves to only a few simple selection algorithms:

a) FIFO: according to the order in which the pages were created [SELECTION=FIFO].

b) Second chance FIFO: according to the order in which the pages were created and the status of the PTE_A (accessed/reference bit) flag of the page table entry [SELECTION=SCFIFO].

c) NFU: we will approximate LRU using the PTE_A (accessed/reference bit) flag of the page table entry. Implement the "aging" version of the algorithm which you saw in class [SELECTION=NFU]. Your NFU implementation should use the timer interrupts to update its data structure; a sketch of the aging update appears after this list.

d) NONE: the paging framework is disabled – no paging will be done and behavior should stay as in the original xv6 [SELECTION=NONE].
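To illustrate the aging update for NFU, the per-tick bookkeeping might look roughly like the following (the age field and the loop over resident pages are assumptions based on the illustrative meta-data above):

// Run from the timer-interrupt path, for each resident page pg of process p:
pte_t *pte = walkpgdir(p->pgdir, (void*)pg->va, 0);
pg->age >>= 1;                        // exponentially decay the usage history
if (*pte & PTE_A) {                   // was the page referenced since the last tick?
  pg->age |= (1 << 31);               // record the reference in the MSB
  *pte &= ~PTE_A;                     // clear the hardware reference bit
}
// On a swap-out, the resident page with the smallest age is the NFU victim.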

Modify the Makefile to support 'SELECTION' – a macro for quick compilation of the appropriate page replacement scheme. For example, the following line builds xv6 with the NFU page replacement scheme:

make qemu SELECTION=NFU

If the SELECTION macro is omitted, NFU should be used as the default. Again, it is up to you to decide which data structures are required for each of the algorithms.
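One possible way to wire this up is to have the Makefile default SELECTION to NFU and pass it to the compiler (e.g., CFLAGS += -D$(SELECTION)), and then dispatch on it in the paging code; the helper names below are hypothetical:

// Compile-time policy dispatch (sketch):
static struct page_md *
choose_victim(struct proc *p)
{
#if defined(FIFO)
  return fifo_choose(p);      // oldest resident page first
#elif defined(SCFIFO)
  return scfifo_choose(p);    // oldest resident page whose PTE_A bit is clear
#elif defined(NFU)
  return nfu_choose(p);       // resident page with the smallest aging counter
#else  // NONE: paging disabled, no victim is ever chosen
  return 0;
#endif
}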

2.7. Enhanced process details viewer:

Now that you have implemented a paging framework which supports configurable page replacement schemes, it is time to test it. Prior to actually writing the tests (see the next task), we will develop a tool to assist us in testing.

In this task you will enhance the capabilities of xv6's process details viewer. Start by running xv6 and press Ctrl+P (^P) from within the shell. You should see a detailed account of the currently running processes. Try running any program of your liking (e.g., stressfs) and press ^P again (during and after its execution). The ^P command provides important information (what information is that?) on each of the processes. However, it reveals no information regarding the current memory state of each process.

Extend the function handling ^P (see procdump in proc.c) so that the number of memory pages allocated to each process is also printed, together with the number of pages which are currently paged out. In addition, print the number of page faults the process has had and the total number of times its pages were paged out. The new ^P output should include a line for each existing process; a sketch of such a line appears below.
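For example, the extended per-process line could be printed roughly like this (the counter names are illustrative and depend on your own meta-data design):

// Inside procdump()'s per-process loop, after the existing fields:
cprintf("%d %s %s %d %d %d %d\n", p->pid, state, p->name,
        p->allocated_pages, p->paged_out_count,
        p->page_fault_count, p->total_paged_out);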

The ^P feature is invoked upon user demand, thus allowing the user to see the current process information. However, the final process information is important as well, so print the same information for each user process upon termination. Since in most cases this information is verbose, enable the final printing only if the quick compilation macro VERBOSE_PRINT is equal to TRUE (similar to the SELECTION flag). If the VERBOSE_PRINT macro is omitted, FALSE is used as the default value.

2.8. Sanity tests:

Write a simple user-space program to test the paging framework you have created. Your program should allocate some memory and then use it. Check your program with each of the page replacement algorithms you implemented, and analyze the program's memory usage using the enhanced ^P viewer you built. This testing program should be named mmt.

• Be prepared to explain the fine details of your memory sanity test. Can you detect major performance differences between the page replacement algorithms with it?

• Make sure your test covers all aspects of your new framework, including for example the fork changes you made.

Submission Guidelines

Make sure that your Makefile is properly updated and that your code compiles with no warnings whatsoever. We strongly recommend documenting your code changes with comments – these are often handy when discussing your code with the graders.

Due to our constrained resources, assignments are only allowed in pairs. Please take note of this and try to match up with a partner as soon as possible.

Submissions are only allowed through the submission system. To avoid submitting a large number of xv6 builds, you are required to submit a patch (i.e., a file which patches the original xv6 and applies all your changes). You may use the following instructions to guide you through the process:

Back-up your work before proceeding!

Before creating the patch, review the change list and make sure it contains all the changes that you applied and nothing more. Modified files are automatically detected by git, but new files must be added explicitly with the 'git add' command:

> git add -Av .; git commit -m "commit message"

At this point you may examine the differences (the patch):

> git diff origin

Once you are ready to create a patch simply make sure the output is redirected to the patch file:

> git diff origin > ID1_ID2.patch

• Tip: Although the graders will only apply your latest patch, the submission system supports multiple uploads. Use this feature often and make sure you upload patches of your current work even if you haven't completed the assignment.

Finally, you should note that the graders are instructed to examine your code on lab computers only!

We advise you to test your code on lab computers prior to submission and, after submitting, to download your assignment, apply the patch, compile it, and make sure everything runs and works.