Operating Systems − they will be added as we go • you can also use slides from previous years Petr Ročkai − they are already in study materials − but: not everything is covered in those These are draft (work in progress) lecture notes for PB152. It is expected to be gradually illed in, hopefully before the semester You are reading the lecture notes for the course PB152 Operating ends. For now, it is mostly a printable version of the slides. Systems. This is a supplementary resource based on the lecture Organisation slides, but with additional details that would not it into the slide format. These lecture notes should be self-contained in the sense • lectures, with an optional seminar that they only rely on knowledge you have from other courses, • written exam at the end like PB150 Computer Systems (or PB151) or PB071 Principles of − multiple choice Low-Level programming. Likewise, being familiar with the topics − free-form questions covered in these lecture notes is suficient to pass the exam. • 1 online test mid-term, 1 before exam − mainly training for the exam proper Books The written exam will consist of two parts, a multiple-choice test • there are a few good OS books (with exactly one correct answer on each question) and a free- • you are encouraged to get and read them form question. Additionally, you will be required to pass a mid- term test with the same pool of multiple-choice questions. The • A. Tanenbaum: Modern Operating Systems question pool is known in advance: both the entire mid-term and • A. Silberschatz et al.: Concepts the multiple-choice part of the inal exam will contain questions • L. Skočovský: Principy a problémy OS UNIX that are listed in these notes (on the last slide in each lecture). • W. Stallings: Operating Systems, Internals and Design The possible answers will be, however, not known in advance, • many others, feel free to explore and will be randomized for each instance of the test. Do not try to learn the correct answer by its wording: you need to understand The books mentioned here usually cover a lot more ground that the questions and the correct answers to pass the test. is possible to include in a single-semester course. The study of operating systems is, however, very important in many sub-ields Seminars of computer science, and also in most programming disciplines. • a separate, optional course (code PB152cv) Spending extra time on this topic will likely be well worth your time. • covers operating systems from a practical perspective • get your hands on the things we’ll talk about here • offers additional practice with C programming Topics 1. Anatomy of an OS The seminar is a good way to gain some practical experience with 2. System Libraries and operating systems. It will mainly focus on interaction of programs 3. The Kernel with OS services, but will also cover user-level issues like virtual- 4. File Systems isation, OS installation and scripting. 5. Basic Resources and Multiplexing Mid-Term and End-Term Tests 6. Concurrency and Locking • 24 hours to complete, 2 attempts possible In the irst half of the semester, we will deal with the basic com- • 10 questions, picked from review questions ponents and abstractions used in general-purpose operating sys- − mid-term irst 24, end-term second 24 tems. The irst lecture will simply give an overview of the entire • you need to pass either mid-term or end-term OS and will attempt to give you an idea how those it together. 
In • 7 out of 10 required for mid-term, 8 of 10 for end-term the second lecture, we will cover the basic programming inter- • preliminary mid-term date: 11th of April, 6pm faces provided by the OS, provided mainly by system libraries. Even though passing the mid-term/end-term online tests is easy Topics (cont’d) (and it is easy for you to collaborate), I strongly suggest that you 7. Device Drivers use the opportunity to evaluate your knowledge for yourself, hon- 8. Network Stack estly. The same questions will appear on the exam, and this time, it won’t be possible to consult the slides, the internet or your 9. Command Interpreters & User Interfaces friends. Also, you will be only allowed 1 mistake (out of 10 ques- 10. Users and Permissions tions). Use the few opportunities to practice for the exam wisely. 11. Virtualisation & Containers 12. Special-Purpose Operating Systems Study Materials The second half of the semester will start with device drivers, • this course is undergoing a major update which form an important part of operating systems in general, • lecture slides will be in the IS since they mediate communication between application software and hardware peripherals connected to the computer. In a simi- In the third lecture, we will focus on the kernel, arguably the most lar fashion, the network stack allows programs to communicate important (and often the most complicated) part of an operating with other computers (and software running on those other com- system. We will start from the beginning, with the boot : puters) that are attached to a . how the kernel is loaded into memory, initialises the hardware and starts the user-space components (that is, everything that is Related Courses not the kernel) of the operating system. • PB150/PB151 Computer Systems We will then talk about boundary enforcement: how the kernel polices user processes so they cannot interfere with each other, • PB153 Operating Systems and their Interfaces or with the underlying hardware devices. We will touch on how • PA150 Advanced OS Concepts this enforcement makes it possible to allow multiple users to share • PV062 File Structures a single computer without infringing on each other (or at least • PB071 Principles of Low-level programming limiting any such infringement). • PB173 Domain-speciic Development in C/C++ Another topic of considerable interest will be how kernels are designed and what is and what isn’t part of the kernel proper. There is a number of courses that overlap, provide prerequisite We will explore some of the trade-offs involved in the various de- knowledge or extend what you will learn here. The list above is signs, especially with regards to security and correctness vs per- incomplete. The course PB153 is an alternative to this course. formance. Most students are expected to take PB071 in parallel with this Finally, we will look at the mechanism, which is how course, even though knowledge of C won’t be required for the the- the user-space communicates with the kernel, and requests var- ory we cover. However, C basics will be needed for the optional ious low-level operating system services. seminar (PB152cv). 4. 
File Systems Organisation of the Semester • why and how • generally, one lecture = one topic • abstraction over shared block storage • there will be most likely 12 lectures • directory hierarchy • a 50-minute review in the last lecture • everything is a ile revisited • online mid-term in April • i-nodes, directories, hard & soft links Semester Overview Next up are ile systems, which are a very widely used abstrac- This section gives a high-level overview of the topics that will be tion on top of persistent block storage, which is what hardware covered in individual lectures. storage devices provide. We will ask ourselves, irst of all, why ilesystems are important and why they are so pervasively imple- 2. System Libraries and APIs mented in operating systems, and then we will look at how they • POSIX: Portable Operating System Interface work on the inside. In particular, we will explore the traditional • UNIX: (almost) everything is a ile UNIX ilesystem, which offers important insights about the archi- • the least common denominator of programs: C tecture of the operating system as a whole, and about important • user view: objects, archives, shared libraries aspects of the POSIX ile semantics. • compiler, linker 5. Basic Resources and Multiplexing System libraries and their APIs provide the most direct access to • , processes operating system services. In the second lecture, we will explore • sharing CPUs & scheduling how programs access those services and how the system libraries • processes vs threads tie into the C programming language. We will also deal with basic • , clocks artifacts that make up programs: object iles, archive iles, shared libraries and how those come about: how we go from a C source One of the basic roles of the operating system is management of ile all the way to an executable ile through compilation and link- various resources, starting with the most basic: the CPU cores ing. and the RAM. Since those resources are very important to every Throughout this lecture, we will use POSIX as our go-to source of process or program, we will spend the entire lecture on them. examples, since it is the operating system interface that is most In particular, we will look at the basic units of resource assign- widely implemented. Moreover, there is abundance of documen- ment: threads for the CPU and processes for memory. We will tation and resources both online and ofline. also look at the mechanisms used by the kernel to implement as- signment and protection of those resources, namely the virtual 3. The Kernel memory subsystem and the scheduler. • privileged CPU mode • the boot process 6. Concurrency and Locking • boundary enforcement • inter-process communication • kernel designs: micro, mono, exo, … • accessing shared resources • system calls • mutual exclusion • deadlocks and deadlock prevention Throughout most of the course, we will have talked about general- purpose operating systems: those that run on personal comput- Scheduling and slicing of CPU time is closely related to another ers and servers. The last lecture will be dedicated to more spe- important topic that pervades operating system design: concur- cialised systems: those that run in washing machines, on satel- rency. We will take a high-level, introductory look at this topic, lites or on the Mars rovers. 
We will also briely cover high-assurance since the details are often complicated, architecture-speciic and systems, which focus on extremely high reliability and/or secu- require deep understanding of both hardware (SMP, cache hier- rity. archies) and of kernels. Part 1: Anatomy of an OS 7. Device Drivers • user vs kernel drivers In the irst lecture, we will irst pose the question “what is an op- • interrupts &c. erating system” and give some short, but largely unsatisfactory • GPU answers. Since an operating system is a complex system, it is built from a number of components. Each of those components is de- • PCI &c. scribed more easily than the entire operating system, and for this • block storage reason, we will attempt to understand an operating system as the • network devices, wii sum of its parts. • USB • bluetooth Lecture Overview 8. Network Stack 1. Components 2. Interfaces • TCP/IP 3. Classiication • name resolution • socket APIs After talking about what is an operating system, we will give more • irewalls and packet ilters details about its components and afterwards move on to the in- • network ile systems terfaces between those components. Finally, we will look at clas- sifying operating systems: this is another angle that could help 9. Command Interpreters & User Interfaces us pin down what an operating system is. • interactive systems What is an OS? • history: consoles and terminals • the software that makes the hardware tick • text-based terminals, RS-232 • and makes other software easier to write • bash and other Bourne-style shells, POSIX • graphical: X11, Wayland, OS X, Windows, Android, iOS Also • catch-all phrase for low-level software 10. Users and Permissions • an abstraction layer over the machine • multi-user systems • but the boundaries are not always clear • isolation, ownership • ile system permissions Our irst (very approximate) attempt at deining an OS is via its • capabilities responsibilities towards hardware. Since it sits between hard- ware and the rest of software, in some sense, it is what makes 11. Virtualisation & Containers the hardware work. Modern hardware alone is rarely capable of achieving anything useful on its own. It needs to be programmed, • resource multiplexing redux and the basic layer of programming is provided by the operating • isolation redux system. • multiple kernels on a single system • type 1 and type 2 What is not (part of) an OS? • virtio • irmware: (very) low level software − much more hardware-speciic than an OS 12. Special-Purpose Operating Systems − often executes on auxiliary processors • general-purpose vs special-purpose • application software • embedded systems − runs on top of an operating system • real-time systems − this is what you got the computer for • high-assurance systems (seL4) − eg. games, spreadsheets, photo editing, …

One approach to understand what is an operating system could be to look at things that are related, but are not an operating sys- tem, nor a part of one. There is one additional software-ish layer below the operating system, usually known as irmware. In a typ- smallest, most special-purpose systems). The kernel is the most ical computer, many pieces of irmware are present, but most of fundamental and lowest layer of an operating system, while sys- them execute on auxiliary processors – e.g. those in a WiFi card, tem libraries sit on top and use the services of the kernel. They or in the graphics subsystem, in the hard drive and so on. In con- also broker the services of the kernel to user-level programs and trast, the operating system runs on the main processor. There provide additional services (which do not need to be part of the is one piece of irmware that typically runs on the main CPU: on kernel itself). older systems, it’s known as BIOS, on modern systems it is simply The remaining layers are mostly made of programs in the usual known as “the irmware”. sense: other than being a part of the operating systems, there In the other direction, on top of an operating system, there is a isn’t much to distinguish them from user programs. The irst cat- whole bunch of application software. While some software of this egory of such programs are system daemons or system services, type might be bundled with an operating system, it is not, strictly which are typically long-running programs which react to requests speaking, a part of it. Application software is the programs that or perform maintenance tasks. you use to get things done, like text editors, word processors, but The user interface is slightly more complicated, in the sense that also programming IDEs (integrated development environment), it consists of multiple sub-components that align with other parts computer games or web applications (do I say Facebook?). And here. The bullet point here summarises those parts of the user so on and so forth. interface that are more or less standard programs, like the com- mand interpreter. What does an OS do? The Kernel • interact with the user • manage and multiplex hardware • lowest level of an operating system • manage other software • executes in privileged mode • organises and manages data • manages all the other software • provides services for other programs − including other OS components • enforces security • enforces isolation and security • provides low-level services to programs The tasks and duties that the operating system performs are rather varied. On one side, it takes care of the basic interaction with The kernel is the lowest and arguably the most important part of the user: a command interpreter, a or an operating system. Its main distinction is that it executes in a batch-mode job processing system with input provided as punch special processor mode (often known as privileged, monitor or su- cards. Then there is the hardware, which needs to be managed pervisor mode). The main tasks of the kernel are management of and shared between individual programs and users. Installation basic hardware resources (processor, memory) and speciically of additional (application) software is another of the responsibil- providing those resources to other software running on the com- ities of an operating system. puter. This includes the rest of the operating system. Organisation and management of data is a major task as well: this Another crucial task is enforcement of isolation and security. 
The is what ile systems do. This again includes access control and hardware typically provides means to isolate individual programs sharing among users of the underlying hardware which stores the from each other, but it is up to the software (OS kernel) to set up actual bits and bytes. those hardware facilities correctly and effectively. Finally, there is the third side that the operating system inter- Finaly, the kernel often provides the lowest level of various ser- faces with: the application software. In addition to the user and vices to the upper layers. Those are provided mainly in the form the hardware, application programs need operating services to of system calls, and mainly relate (directly or indirectly) to hard- be able to perform their function. Among other things, they need ware access. to interact with users and use hardware resources, after all. It is the operating system that is in charge of both. System Libraries • form a layer above the OS kernel Part 1.1: Components • provide higher-level services − use kernel services behind the scenes What is an OS made of? − easier to use than the kernel interface • typical example: libc • the kernel − provides C functions like printf • system libraries − also known as msvcrt on Windows • system daemons / services • user interface System Daemons • system utilities • programs that run in the background Basically every OS has those. • they either directly provide services − but daemons are different from libraries Operating systems are made of a number of components, some − we will learn more in later lectures more fundamental than others. Basically all of the above are present, • or perform maintenance or periodic tasks in some form, in any operating system (excluding perhaps the • or perform tasks requested by the kernel • but many do not − scheduling algorithms User Interface − memory allocation • mediates user-computer interaction − all sorts of management • the main shell is typically part of the OS • porting: changing a program to run in a new environ- − command line on UNIX or DOS ment − graphical interfaces with a desktop and windows − for an OS, typically new hardware − but also buttons on your microwave oven • also building blocks for application UI Hardware Platform − buttons, tabs, text rendering, OpenGL… • CPU instruction set (ISA) − provided by system libraries and/or daemons • busses, IO controllers − PCI, USB, Ethernet, … System Utilities • irmware, power management • small programs required for OS-related tasks • e.g. system coniguration Examples − things like the registry editor on Windows • x86 (ISA) – PC (platform) − or simple text editors • ARM – Snapdragon, i.MX 6, … • ilesystem maintenance, daemon management, … • m68k – Amiga, Atari, … − programs like ls/dir or newfs or fdisk • also bigger programs, like ile managers Platform & Architecture Portability • an OS typically supports many platforms Optional Components − Android on many different ARM SoC’s • bundled application software • quite often also different CPU ISAs − web browser, media player, … − long tradition in UNIX-style systems • (3rd-party) software management − NetBSD runs on 15 different ISAs • a programming environment − many of them comprise 6+ different platforms − eg. a C compiler & linker • special-purpose systems are usually less portable − C header iles &c. 
• source code Code Re-Use • it makes a lot of sense to re-use code Part 1.2: Interfaces • majority of OS code is HW-independent • this was not always the case Programming Interface − pioneered by UNIX, which was written in C • kernel provides system calls − typical OS of the time was in machine language − ABI: Application Binary Interface − porting was basically “writing again” − deined in terms of machine instructions • system libraries provide APIs Application Portability − Application Programming Interface • applications care more about the OS than about HW − symbolic / high-level interfaces − apps are written in high-level languages − typically deined in terms of C functions − and use system libraries extensively − system calls also available as an API • it is enough to port the OS to new/different HW − most applications can be simply recompiled Message Passing • still a major hurdle (cf. Itanium) • APIs do not always come as C functions • message-passing interfaces are possible Application Portability (2) − based on inter-process communication • same application can often run on many OSes − possible even across networks • especially within the POSIX family • form of API often provided by system daemons • but same app can run on Windows, macOS, UNIX, … − may be also wrapped by C APIs − Java, Qt (C++) − web applications (HTML, JavaScript) Portability • many systems provide the same set of services • some OS tasks require close HW cooperation − differences are mostly in programming interfaces − virtual memory and CPU setup − high-level libraries and languages can hide those − platform-speciic device drivers Abstraction • FreeBSD, OpenBSD • instruction sets abstract over CPU details • MINIX • compilers abstract over instruction sets • many, many others • operating systems abstract over hardware There is a whole bunch of operating systems, even of general- • portable runtimes abstract over operating systems purpose operating systems. While running the OS itself is not • applications sit on top of the abstractions the primary reason for getting a computer (application software is), it does form an important part of user experience. Of course, Abstraction Costs it also interfaces with computer hardware and with application • more complexity programs, and not all systems run on all computers and not all • less eficiency applications run on all operating systems. • leaky abstractions Special-Purpose Operating Systems Abstraction Beneits • embedded devices • easier to write and port software − limited budget • fewer constraints on HW evolution − small, slow, power-constrained Abstraction Trade-Offs − hard or impossible to update • real-time systems • powerful hardware allows more abstraction − must react to real-world events • embedded or real-time systems not so much − often safety-critical − the OS is smaller & less portable − robots, autonomous cars, space probes, … − same for applications − more eficient use of resources We have mentioned earlier, that general-purpose operating sys- tems are usually large and complex. The smallest complete op- Part 1.3: Classiication erating systems (if they are not merely educational toys) start around 100 thousand lines of code, but millions of lines is more General-Purpose Operating Systems typical. It is not unheard of that an operating system contains more than 10 million lines of code. 
These amounts clearly repre- • suitable for use in most situations sent thousands of man-years of work – writing your own operat- • lexible but complex and big ing system, solo, is not very realistic. • run on both servers and clients That said, special-purpose systems are often much smaller. They • cut down versions run on smartphones usually only support far fewer hardware devices and they provide • support variety of hardware simpler and less varied services to the ‘application’ software.

The most important and interesting category is ‘general-purpose Size and Complexity operating systems’. This is the one that we will mostly talk about in this course. The systems in this category are usually quite lex- • operating systems are usually large and complex ible (so they can cover everything that people usually use com- • typically 100K and more lines of code puters for) but, for the same reason, also quite complex. Often • 10+ million is quite possible the same operating system will be able to run on both so-called • many thousand man-years of work ‘server’ computers (those mainly sitting in data centres provid- • special-purpose systems are much smaller ing services to other computers remotely) and ‘client’ computers – those that interact with users directly. Let’s recall that the kernel runs in privileged CPU mode. Any soft- Likewise, the same operating system can, perhaps in a slimmed ware running in this mode is pretty much all-powerful and can down version, run on a smartphone, or a similar size- and power- easily circumvent any access restrictions or security protections. constrained device. All current major smartphone operating sys- It is a well-known fact that the more code you have, the more bugs tems are of this type. Historically, there were a few more spe- there are. Since bugs in the kernel can have far-fetching and cat- cialised phone operating systems, mainly because at that time, astroic consequences, it is imperative that there are as few as phone hardware was considerably more constrained than it is possible. Even more importantly, device drivers often need hard- today. Nonetheless, an OS like Symbian, for instance, could con- ware access and the easiest (and sometimes only) way to achieve ceivably be used on personal computers assuming its hardware that is by executing in kernel (privileged) mode. support was extended. As you may also know, device drivers are often of rather ques- tionable quality: hardware vendors often consider those an after- Operating Systems: Examples thought and don’t pay too much attention to their software teams. • Microsoft Windows If those drivers then execute in kernel mode, this is a serious prob- lem. Different OS vendors employ different strategies to mitigate • Apple macOS & iOS this issue. • Google Android Accordingly, we would like to make kernels small and banish as • many drivers from the kernel as we could. It is, however, not an easy (or even obviously right) thing to do. There are two main In this section, we will study the programming interfaces of op- design schools when it comes to kernel ‘size‘: erating systems, irst in some generality, without a speciic sys- tem in mind. We will then go on to deal speciically with the C- Kernel Revisited language interface of POSIX systems. • bugs in the kernel are very bad − system crashes, data loss Programming Interfaces − critical security problems • kernel system call interface • bigger kernel means more bugs • → system libraries / APIs ← • third-party drivers inside the kernel? • inter-process protocols • command-line utilities (scripting) Monolithic Kernels In most operating systems, the lowest-level interface accessible • lot of code in the kernel to application programs is the system call interface. It is, typically, • less abstraction, less isolation speciied in terms of a machine-language-level protocol (that is, • faster and more eficient an ABI), but usually also provided as a C API. This is the case for POSIX-mandated system calls, but also on e.g. 
Windows NT sys- tems. • move as much as possible out of kernel • more abstraction, more isolation Lecture Overview • slower and less eficient 1. The C Programming Language The monolithic kernel is an older and in some sense simpler de- 2. System Libraries sign. A lot of code ends up in the kernel, which is not really a − what is a library? problem until bugs happen. There is less abstraction involved in − header iles & libraries this design, fewer interfaces and in general, fewer moving parts 3. Compiler & Linker for the same amount of functionality. Those traits then translate − object iles, executables to faster execution and more eficient resource use. Such kernels 4. File-based APIs are called monolithic because everything that a traditional kernel does is performed by a single (monolithic) piece of software. In this lecture, we will start by reviewing (or perhaps introduc- The opposite of monolithic kernels are microkernels. The kernel ing) the C programming language. Then we will move on to the proper in such a system is the smallest possible subset of code subject of libraries in general and system libraries in particular. that must run in privileged mode. Everything that can be ban- We will look at how libraries enter the program compilation process ished into user mode (of the processor) is. This design provides and what other ingredients there are. Finally, we will have a closer a lot more isolation and requires more abstraction. The inter- look at a speciic set of ile-based programming interfaces. faces within different parts of the low-level OS services are more complicated. However, subsystems are well isolated from each Sidenote: UNIX and POSIX other and faults do not propagate nearly as easily. However, op- • we will mostly use those terms interchangeably erating systems which use this kernel type run more slowly and • it is a family of operating systems use resources less eficiently. − started in late 60s / early 70s Paradox? • POSIX is a speciication − a document describing what the OS should provide • real-time & embedded systems often use microkernels − including programming interfaces • isolation is good for reliability • eficiency also depends on the workload We will assume POSIX unless noted otherwise − throughput vs latency Before we begin, it should be noted that throughout this course, • real-time does not necessarily mean fast we will use POSIX and UNIX systems as examples. If a speciic function or interface is mentioned without further qualiication, Review Questions it is assumed to be speciied by POSIX and implemented by UNIX- 1. What are the roles of an operating system? like systems. 2. What are the basic components of an OS? 3. What is an operating system kernel? Part 2.1: The C Programming Language 4. What is an Application Programming Interface? The C programming language is one of the most commonly used Part 2: System Libraries and APIs languages in operating system implementations. It is also the subject of PB071, and at this point, you should be already familiar with its basic syntax. Likewise, you are expected to understand the concept of a function and other basic building blocks of pro- Part 2.2: System Libraries grams. Even if you don’t know the speciic C syntax, the idea is very similar to any other programming language you might know. 
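As a quick reminder (this example is not part of the original notes), here is the smallest useful C program; even something this simple already leans on a system library – printf comes from libc, and libc will eventually ask the operating system to perform the actual output:

    #include <stdio.h>

    int main( void )
    {
        /* printf is a libc function: internally, it will eventually hand
           the text over to the kernel (via the write system call covered
           later in this lecture) to get it onto the screen */
        printf( "hello, world\n" );
        return 0;
    }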
Programming Languages
• there are many different languages
  − C, C++, Java, C#, …
  − Python, Perl, Ruby, …
  − ML, Haskell, Agda, …
• but C has a special place in most OSes

Different programming languages have different use-cases in mind, and exist at different levels of abstraction. Most languages other than C that you will meet, both at the university and in practice, are so-called high-level languages. There are quite a few language families, and there is a number of higher-level languages derived from C, like C++, Java or C#.
For the purposes of this course, we will mostly deal with plain C, and with POSIX (Bourne-style) shell, which can also be thought of as a programming language.

C: The Least Common Denominator
• except for assembly, C is the “bare minimum”
• you can almost think of C as portable assembly
• it is very easy to call C functions
• and to use C data structures
• you can use C libraries in almost every language

You could think of C as a ‘portable assembler’, with a few minor bells and whistles in the form of the standard library. Apart from this library of basic and widely useful subroutines, C provides: abstraction from machine opcodes (with human-friendly infix operator syntax), structured control flow and automatic local variables as its main advantages over assembly.
In particular, the abstraction over the target processor and its instruction set proved to be instrumental in early operating systems, and helped establish the idea that an operating system is an entity separate from the hardware.
On top of that, C is also popular as a systems programming language because almost any program, regardless of what language it is written in, can quite easily call C functions and use C data structures.

The Language of Operating Systems
• many (most) kernels are written in C
• this usually extends to system libraries
• and sometimes to almost the entire OS
• non-C operating systems provide C APIs

Consequently, C has essentially become a ‘language of operating systems’: most kernels and even the bulk of most operating systems are written in C. Each operating system (apart from perhaps a few exceptions) provides a C standard library in some form and can execute programs written in C (and more importantly, provide them with essential services).

(System) Libraries
• mainly C functions and data types
• interfaces defined in header files
• definitions provided in libraries
  − static libraries (archives): libc.a
  − shared (dynamic) libraries: libc.so
• on Windows: msvcrt.lib and msvcrt.dll
• there are (many) more besides libc / msvcrt

In this course, when we talk about libraries, we will mean C libraries specifically. Not Python or Haskell modules, which are quite different. That said, a typical C library has basically two parts: one is header files, which provide a description of the interface (the API), the other is the compiled library code (an archive or a shared library).
The interface (as described in header files) consists of functions (for which the types of arguments and the type of the return value are given in a header file) and of data structures. The bodies of the functions (their implementation) are what makes up the compiled library code. To illustrate:

Declaration: what but not how

    int sum( int a, int b );

Definition: how is the operation done?

    int sum( int a, int b )
    {
        return a + b;
    }

The first example on this slide is a declaration: it tells us the name of a function, its inputs and its output. The second example is called a definition (or sometimes a body) of the function and contains the operations to be performed when the function is called.

Library Files
• /usr/lib on most Unices
  − may be mixed with application libraries
  − especially on Linux-derived systems
  − also /usr/local/lib for user/app libraries
• on Windows: C:\Windows\System32
  − user libraries often bundled with programs

The machine code that makes up the library (i.e. the code that was generated from function definitions) resides in files. Those files are what we usually call ‘libraries’ and they usually live in a specific filesystem location. On most UNIX systems, those locations are /usr/lib and possibly /lib for system libraries and /usr/local/lib for user or application libraries. On certain systems (especially Linux-based), user libraries are mixed with system libraries and they are all stored in /usr/lib.
On Windows, the situation is similar in that both system and application libraries are installed in a common location. Additionally, on Windows (and on macOS), shared libraries are often installed alongside the application.

Static Libraries
• stored in libfile.a, or file.lib (Windows)
• only needed for compiling (linking) programs
• the code is copied into the executable
• the resulting executable is also called static
  − and is easier to work with for the OS
  − but also more wasteful

Static libraries are only used when building executables and are not required for normal operation of the system. Therefore, many operating systems do not install them by default – they have to be installed separately as part of the developer kit. When a static library is linked into a program, this basically entails copying the machine code from the library into the final executable.
In this scenario, after linking is performed, the library is no longer needed since the executable contains all the code required for its execution. For system libraries, this means that the code that comes from the library is present on the system in many copies, once in each program that uses the library. This is somewhat alleviated by linkers only copying the parts of the library that are actually needed by the program, but there is still substantial duplication.
The duplication arising this way does not only affect the file system, but also memory (RAM) when those programs are loaded – multiple copies of the same function will be loaded into memory when such programs are executed.

Header Files
• on UNIX: /usr/include
• contains prototypes of C functions
• and definitions of C data structures
• required to compile C and C++ programs

Like static libraries, header files are only required when building programs, but not when using them. Header files are fragments of C source code, and on UNIX systems are traditionally stored in /usr/include. User-installed header files (i.e. not those provided by system libraries) live under /usr/local/include (though again, on Linux-based systems user and system headers are often intermixed in /usr/include).

Header Example 1 (from unistd.h)

    int execv(char *, char **);
    pid_t fork(void);
    int pipe(int *);
    ssize_t read(int, void *, size_t);

(and many more prototypes)

This is an excerpt from an actual system header file, and declares a few of the functions that comprise the POSIX C API.

Header Example 2 (from sys/time.h)

    struct timeval
    {
        time_t tv_sec;
        long tv_usec;
    };

    /* ... */

    int gettimeofday(timeval *, timezone *);
    int settimeofday(timeval *, timezone *);

Shared (Dynamic) Libraries
• required for running programs
• linking is done at execution time
• less code duplication
• can be upgraded separately
• but: dependency problems
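To tie headers and libraries together, here is a small sketch (not from the original notes): the declaration of sqrt comes from a system header, while its definition comes from the math library libm. On many UNIX systems such a program is built with something like cc main.c -lm, where -lm asks the linker to pull in libm; the exact command and the static-versus-shared choice differ between systems.

    #include <math.h>    /* header: declares double sqrt( double ); */
    #include <stdio.h>

    int main( void )
    {
        /* the machine code of sqrt() is not in this file: the linker finds
           it in libm (libm.a when linking statically, libm.so when shared) */
        printf( "%f\n", sqrt( 2 ) );
        return 0;
    }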

The other approach to libraries is dynamic, or shared libraries. In The POSIX C Library this case, the library is required to actually run the program: the • libc – the C runtime library linker does not copy the machine code from the library into the • contains ISO C functions executable. Instead, it only notes that the library must be loaded − printf, fopen, fread alongside with the program when the latter is executed.. • and a number of POSIX functions This reduces code duplication, both on disk and in memory. It − open, read, gethostbyname, … also means that the library can be updated separately from the application. This often makes updates easier, especially in case − C wrappers for system calls a library is used by many programs and is, for example, found to contain a security problem. In a static library, this would mean System Calls: Numbers that each program that uses the library needs to be updated. A • system calls are performed at machine level shared library can be replaced and the ixed code will be loaded • which syscall to perform is decided by a number alongside programs as usual. − e.g. SYS_write is 4 on OpenBSD The downside is that it is dificult to maintain binary compatibil- − numbers deined by sys/syscall.h ity – to ensure that programs that were built against one version − different for each OS of the library also work with a later version. When this is vio- lated, as often happens, people run into dependency problems System Calls: the syscall function (also known as DLL hell on Windows). • there is a C function called syscall − prototype: int syscall( int number, ... ) • you can learn a lot from those sources • this implements the low-level syscall sequence • it takes a syscall number and syscall parameters Part 2.3: Compiler & Linker − this is a bit like printf − irst parameter decides what are the other parame- C Compiler ters • many POSIX systems ship with a C compiler • (more about how syscall() works next week) • the compiler takes a C source ile as input − a text ile with a .c sufix System Calls: Wrappers • and produces an object ile as its output • using syscall() directly is inconvenient − binary ile with machine code in it • libc has a function for each system call − but cannot be directly executed − SYS_write → int write( int, char *, size_t ) Object Files − SYS_open → int open( char *, int ) • contain native machine (executable) code − and so on and so forth • along with static data • those wrappers use syscall() internally − e.g. 
string literals used in the program • possibly split into a number of sections Portability − .text, .rodata, .data and so on • libraries provide an abstraction layer over OS internals • and metadata • they are responsible for application portability − list of symbols (function names) and their addresses − along with standardised ilesystem locations − and user-space utilities to some degree Object File Formats • higher-level languages rely on system libraries • a.out – earliest UNIX object format • COFF – Common Object File Format NeXT and Objective C − adds support for sections over a.out • the NeXT OS was built around Objective C • PE – Portable Executable (MS Windows) • system libraries had ObjC APIs • Mach-O – Mach Executable (macOS) • in API terms, ObjC is very different from C • ELF – Executable and Linkable Format (all modern Unices) − also very different from C++ − traditional OOP features (like Smalltalk) Archives (Static Libraries) • this has been partly inherited into macOS • static libraries on UNIX are called archives − evolving into Swift • this is why they get the .a sufix • they are like a zip ile full of object iles System Libraries: UNIX • plus a table of symbols (function names) • the math library libm − implements math functions like sin and exp Linker • library libpthread • object iles are incomplete • terminal access: libcurses • they can refer to symbols that they do not deine • cryptography: libcrypto (OpenSSL) − the deinitions can be in libraries • the C++ standard library libstdc++ or libc++ − or in other object iles • a linker puts multiple object iles together System Libraries: Windows − to produce a single executable • msvcrt.dll – the ISO C functions − or maybe a shared library • kernel32.dll – basic OS APIs • gdi32.dll – Graphics Device Interface Symbols vs Addresses • user32.dll – standard GUI elements • we use symbolic names to call functions &c. • but the call machine instruction needs an address Documentation • manual pages on UNIX • the executable will eventually live in memory − try e.g. man 2 write on aisa.fi.muni.cz • data and instructions need to be given addresses − section 2: system calls • what a linker does is assign those addresses − section 3: library functions (man 3 printf) • MSDN for Windows Resolving Symbols − https://msdn.microsoft.com • the linker processes one object ile at a time • it maintains a symbol table What is a Filesystem? − mapping symbols (names) to addresses • a set of iles and directories − dynamically updated as more objects are processed • usually lives on a single block device • objects can only use symbols already in the table − but may also be virtual • resolving symbols = inding their addresses • directories and iles form a tree − directories are internal nodes Executable − iles are leaf nodes • inished image of a program to be executed • usually in the same format as object iles File Paths • but already complete, with symbols resolved • ilesystems use paths to point at iles − but: may use shared libraries • a string with / as a directory delimiter − in that case, some symbols remain unresolved − the delimiter is \ on Windows • a leading / indicates the ilesystem root Shared Libraries • e.g. 
/usr/include • each shared library only needs to be in memory once • shared libraries use symbolic names (like object iles) The File Hierarchy • there is a “mini linker” in the OS to resolve those names − usually known as a runtime linker / − resolving = inding the addresses • shared libraries can use other shared libraries home var usr − they can form a DAG (Directed Acyclic Graph)
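The runtime linker can also be driven explicitly from a program. The following sketch (not part of the original notes) uses the POSIX dlopen / dlsym interface to load the math library by hand and look up one of its symbols; the library file name, and whether an extra linker flag such as -ldl is needed, differs between systems.

    #include <dlfcn.h>    /* dlopen, dlsym, dlclose */
    #include <stdio.h>

    int main( void )
    {
        /* ask the runtime linker to load a shared library by name */
        void *lib = dlopen( "libm.so", RTLD_NOW );
        if ( !lib )
            return 1;

        /* resolve a symbol (find its address) and call the function */
        double (*sqrt_fn)( double ) = ( double (*)( double ) ) dlsym( lib, "sqrt" );
        if ( sqrt_fn )
            printf( "%f\n", sqrt_fn( 2 ) );

        dlclose( lib );
        return 0;
    }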

Addresses Revisited
• when you run a program, it is loaded into memory
• parts of the program refer to other parts of the program
  − this means they need to know where it will be loaded
  − this is a responsibility of the linker
• shared libraries use position-independent code
  − works regardless of the base address it is loaded at
  − we won’t go into detail on how this is achieved

Compiler, Linker &c.
• the C compiler is usually called cc
• the linker is known as ld
• the archive (static library) manager is ar
• the runtime linker is often known as ld.so

(The File Hierarchy figure above continues here: under / there are home, var and usr; /home contains e.g. xrockai, and /usr contains include and lib, which hold files such as stdio.h, unistd.h, libc.a and libm.a.)

Part 2.4: File-Based APIs

Everything is a File
• part of the UNIX design philosophy
• directories are files
• devices are files
• pipes are files
• network connections are (almost) files

Why is Everything a File
• re-use the comprehensive file system API
• re-use existing file-based command-line tools
• bugs are bad, simplicity is good
• want to print? cat file.txt > /dev/ulpt0
  − (reality is a little more complex)

The Role of Files and Filesystems
• very central in Plan9
• central in most UNIX systems
  − cf. Linux pseudo-filesystems
  − /proc provides info about all processes
  − /sys gives info about the kernel and devices
• somewhat reduced in Windows
• quite suppressed in Android (and more on iOS)

The Filesystem API
• you open a file (using the open() syscall)
• you can read() and write() data
• you close() the file when you are done
• you can rename() and unlink() files
• you can use mkdir() to create directories

File Descriptors
• the kernel keeps a table of open files
• the file descriptor is an index into this table
• you do everything using file descriptors
• non-Unix systems have similar concepts
  − descriptors are called handles on Windows

Regular files
• these contain sequential data (bytes)
• may have inner structure but the OS does not care
• there is metadata attached to files
  − like when were they last modified
  − who can and who cannot access the file
• you read() and write() files

Directories
• a list of files and other directories
  − internal nodes of the filesystem tree
  − directories give names to files
• can be opened just like files
  − but read() and write() is not allowed
  − files are created with open() or creat()
  − directories with mkdir()
  − directory listing with opendir() and readdir()
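The following is a small sketch (not part of the original slides) of the calls just listed, using the POSIX file descriptor and directory interfaces; the paths /etc/hostname and / are only illustrative.

    #include <fcntl.h>     /* open */
    #include <unistd.h>    /* read, write, close */
    #include <dirent.h>    /* opendir, readdir, closedir */
    #include <stdio.h>

    int main( void )
    {
        /* read a few bytes from a regular file through a file descriptor */
        int fd = open( "/etc/hostname", O_RDONLY );
        if ( fd >= 0 )
        {
            char buf[ 64 ];
            ssize_t n = read( fd, buf, sizeof buf );
            if ( n > 0 )
                write( 1, buf, n );   /* file descriptor 1 is standard output */
            close( fd );
        }

        /* list the entries of a directory */
        DIR *dir = opendir( "/" );
        if ( dir )
        {
            struct dirent *ent;
            while ( ( ent = readdir( dir ) ) )
                printf( "%s\n", ent->d_name );
            closedir( dir );
        }
        return 0;
    }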
Mounts
• UNIX joins all file systems into a single hierarchy
• the root of one filesystem becomes a directory in another
  − this is called a mount point
• Windows uses drive letters instead (C:, D: &c.)

Devices
• block and character devices are (special) files
• block devices are accessed one block at a time
  − a typical block device would be a disk
  − includes USB mass storage, flash storage, etc
  − you can create a file system on a block device
• character devices are more like normal files
  − terminals, tapes, serial ports, audio devices

Pipes
• pipes are a simple communication device
• one program can write() data to the pipe
• another program can read() that same data
• each end of the pipe gets a file descriptor
• a pipe can live in the filesystem (named pipe)

Sockets
• the socket API comes from early BSD Unix
• socket represents a (possible) network connection
• sockets are more complicated than normal files
  − establishing connections is hard
  − messages get lost much more often than file data
• you get a file descriptor for an open socket
• you can read() and write() to sockets

Review Questions
5. What is a shared (dynamic) library?
6. What does a linker do?
7. What is a symbol in an object file?
8. What is a file descriptor?

Part 3: The Kernel

This lecture is about the kernel, the lowest layer of an operating system. It will be in 5 parts:

Lecture Overview
1. privileged mode
2. boot process
3. kernel architecture
4. system calls
5. kernel-provided services

First, we will look at processor modes and how they mesh with the layering of the operating system. We will move on to the boot process, because it somewhat illustrates the relationship between the kernel and other components of the operating system, and also between the firmware and the kernel. We will look in more detail at kernel architecture: things that we already hinted at in previous lectures, and will also look at exokernels and unikernels, in addition to the architectures we already know (micro and monolithic kernels).
The fourth part will focus on system calls and their binary interface – i.e. how system calls are actually implemented at the machine level. This is closely related to the first part of the lecture about processor modes, and builds on the knowledge we gained last week about how system calls look at the C level.
Finally, we will look at what are the services that kernels provide to the rest of the operating system, what their responsibilities are, and we will also look more closely at how microkernel and hybrid operating systems work.

Reminder: Software Layering
• → the kernel ←
• system libraries
• system services / daemons
• utilities
• application software

Part 3.1: Privileged Mode

Socket Types CPU Modes • CPUs provide a privileged (supervisor) and a user mode • sockets can be internet or unix domain • this is the case with all modern general-purpose CPUs − internet sockets connect to other computers − not necessarily with micro-controllers − Unix sockets live in the ilesystem • x86 provides 4 distinct privilege levels • sockets can be stream or datagram − most systems only use ring 0 and ring 3 − stream sockets are like iles − Xen paravirtualisation uses ring 1 for guest kernels − you can write a continuous stream of data − datagram sockets can send individual messages There is a number of operations that only programs running in • physical memory is split into frames supervisor mode can perform. This allows kernels to enforce bound- • virtual memory is split into pages aries between user programs. Sometimes, there are intermediate • pages and frames have the same size (usually 4KiB) privilege levels, which allow iner-grained layering of the operat- • frames are places, pages are the content ing system, for instance, drivers can run in a less privileged level • page tables map between pages and frames than the ‘core’ of the kernel, providing a level of protection for the kernel from its own device drivers. You might remember that Before we get to virtual addresses, let’s have a look at the other device drivers are the most problematic part of any kernel. major use for the address translation mechanism, and that is pag- In addition to device drivers, multi-layer privilege systems in CPUs ing. We do so, because it perhaps better illustrates how the MMU can be used in certain virtualisation systems. More about this to- works. In this viewpoint, we split physical memory (the physi- wards the end of the semester. cal address space) into frames, which are storage areas: places where we can put data and retrieve it later. Think of them as Privileged Mode shelves in a bookcase. The virtual address space is then split into pages: actual pieces • many operations are restricted in user mode of data of some ixed size. Pages do not physically exist, they just − this is how user programs are executed represent some bits that the program needs stored. You could − also most of the operating system think of a page as a really big integer. Or you can think of pages • software running in privileged mode can do ~anything as a bunch of books that its into a single shelf. − most importantly it can program the MMU The , then, is an catalog, or an address book of sorts. − the kernel runs in this mode Programs attach names to pages – the books – but the hardware can only locate shelves. The job of the MMU is to take a name The kernel executes in privileged mode of the processor. In this of the book and ind the physical shelf where the book is stored. mode, the software is allowed to do anything that’s possible. In Clearly, the operating system is free to move the books around, as particular,it can (re)program the unit (MMU, long as it keeps the page table – the catalog – up to date. Remain- see next slide). Since MMU is how program separation is imple- ing software won’t know the difference. mented, code executing in privileged mode is allowed to change the memory of any program running on the computer. This ex- Swapping Pages plains why we want to reduce the amount of code running in su- pervisor (privileged) mode to a minimum. 
• RAM used to be a scarce resource The way most operating systems operate, the kernel is the only • paging allows the OS to move pages out of RAM piece of software that is allowed to run in this mode. The code in − a page (content) can be written to disk system libraries, daemons and so on, including application soft- − and the frame can be used for another page ware, is restricted to the user mode of the processor. In this mode, • not as important with contemporary hardware the MMU cannot be programmed, and the software can only do • useful for memory mapping iles (cf. next lecture) what the MMU allows based on the instructions it got from the kernel. If we are short on shelf space, we may want to move some books into storage. Then we can use the shelf we freed up for some Memory Management Unit other books. However, hardware can only retrieve information from shelves and therefore if a program asks for a book that is • is a subsystem of the processor currently in storage, the operating system must arrange things • takes care of address translation so that it is moved from storage to a shelf before the program is − user software uses virtual addresses allowed to continue. − the MMU translates them to physical addresses This process is called swapping: the OS, when pressed for mem- • the mappings can be managed by the OS kernel ory, will evict pages from RAM onto disk or some other high-capacity (but slow) medium. It will only page them back in when they are Let’s have a closer look at the MMU. Its primary role is address required. In contemporary computers, memory is not very scarce translation. Addresses that programs refer to are virtual – they and this use-case is not very important. do not correspond to ixed physical locations in memory chips. However, it allows another neat trick: instead of opening a ile Whenever you look at, say, a pointer in C code, that pointer’s nu- and reading it using open and read system calls, we can use so- meric value is an address in some virtual address space. The job called memory mapped iles. This basically provides the content of the MMU is to translate that virtual address into a physical one of the ile as a chunk of memory that can be read or even writ- – which has a ixed relationship with some physical capacitor or ten to, and the changes are sent back to the ilesystem. We will other electronic device that remembers information. discuss this in more detail next week. How those addresses are mapped is programmable: the kernel can tell the MMU how the translation goes, by providing it with Look Ahead: Processes translation tables. We will discuss how page tables work in a short while; what is important now is that it is the job of the ker- • process is primarily deined by its address space nel to build them and send them to the MMU. − address space meaning the valid virtual addresses • this is implemented via the MMU Paging • when changing processes, a different page table is loaded − this is called a • the page table deines what the process can see User-mode Initialisation

We will deal with processes later in the course, but let me just • init mounts the remaining ile systems quickly introduce the concept now so we can appreciate how im- • the init process starts up user-mode system services portant the MMU is for a modern operating system. • then it starts application services • and inally the login process Memory Maps After Log-In • different view of the same principles • the login process initiates the user session • the OS maps physical memory into the process • loads desktop modules and application software • multiple processes can have the same RAM area mapped • drops the user in a (text or graphical) shell − this is called shared memory • now you can start using the computer • often, a piece of RAM is only mapped in a single process CPU Init Page Tables • this depends on both architecture and platform • the MMU is programmed using translation tables • on x86, the CPU starts in 16-bit mode − those tables are stored in RAM • on legacy systems, BIOS & bootloader stay in this mode − they are usually called page tables • the kernel then switches to during its • and they are fully in the management of the kernel boot • the kernel can ask the MMU to replace the page table − this is how processes are isolated from each other Bootloader Kernel Protection • historically limited to tens of kilobytes of code • the bootloader locates the kernel on disk • kernel memory is usually mapped into all processes − it often allows the operator to choose different ker- − this substantially improves performance on many nels CPUs − limited understanding of ile systems − well, until Meltdown hit us, anyway • then it loads the kernel image into RAM • kernel pages have a special ‘supervisor’ lag set • and hands off control to the kernel − code executing in user mode cannot touch them − else, user code could tamper with kernel memory Modern Booting on x86 • on modern system, the bootloader runs in protected Part 3.2: Booting mode − or even the long mode on 64-bit CPUs Starting the OS • the irmware understands the FAT ilesystem • upon power on the system is in a default state − it can load iles from there into memory − mainly because RAM is volatile − this vastly simpliies the boot process • the entire platform needs to be initialised − this is irst and foremost the CPU Booting ARM − and the console hardware (keyboard, monitor, …) • on ARM boards, there is no uniied irmware interface − then the rest of the devices • U-boot is as close as one gets to uniication • the bootloader needs low-level hardware knowledge Boot Process • this makes writing bootloaders for ARM quite tedious • the process starts with a built-in hardware init • current U-boot can use the EFI protocol from PCs • when ready, the hardware hands off to the irmware − this was BIOS on 16 and 32 bit systems Part 3.3: Kernel Architecture − replaced with EFI on current amd64 platforms Architecture Types • the irmware then loads a bootloader • the bootloader loads the kernel • monolithic kernels (Linux, *BSD) • microkernels (Mach, L4, QNX, NT, …) Boot Process (cont’d) • hybrid kernels (macOS) • the kernel then initialises device drivers • type 1 hypervisors (Xen) • and the root ilesystem • exokernels, rump kernels • then it hands off to the init process Microkernel • at this point, the takes over • handles • (hardware) interrupts − makes little sense on real hardware • task / process scheduling − but can be very useful on a • message passing • bundle applications as virtual machines − without the overhead of a 
general-purpose OS • everything else is separate Exo vs Uni Monolithic kernels • an runs multiple applications • all that a microkernel does − includes process-based isolation • plus device drivers − but abstractions are very bare-bones • ile systems, volume management • only runs a single application • a network stack − provides more-or-less standard services • data encryption, … − e.g. standard hierarchical ile system Microkernel Redux − socket-based network stack / API • we need a lot more than a microkernel provides Part 3.4: System Calls • in a “true” microkernel OS, there are many modules • each runs in a separate process Reminder: Kernel Protection • the same for ile systems and networking • those modules / processes are called servers • kernel executes in privileged mode of the CPU • kernel memory is protected from user code Hybrid Kernels But: Kernel Services • based around a microkernel • user code needs to ask kernel for services • and a gutted monolithic kernel • how do we switch the CPU into privileged mode? • the monolithic kernel is a big server • cannot be done arbitrarily (security) − takes care of stuff not handled by the microkernel − easier to implement than true microkernel OS System Calls − strikes middle ground on performance • hand off execution to a kernel routine • pass arguments into the kernel Micro vs Mono • obtain return value from the kernel • microkernels are more robust • all of this must be done safely • monolithic kernels are more eficient − less context switching Trapping into the Kernel • what is easier to implement is debatable • there are a few possible mechanisms − in the short view, monolithic wins • details are very architecture-speciic • hybrid kernels are a compromise • in general, the kernel sets a ixed entry address − an instruction can change the CPU into privileged Exokernels mode • smaller than a microkernel − while at the same time jumping to this address • much fewer abstractions − applications only get block storage Trap Example: x86 − networking is much reduced • there is an int instruction on those CPUs • only research systems exist • this is called a software − interrupts are normally a hardware thing Type 1 Hypervisors − interrupt handlers run in privileged mode • also known as bare metal or native hypervisors • it is also synchronous • they resemble microkernel operating systems • the handler is set in IDT (interrupt descriptor table) − or exokernels, depending on the viewpoint • the “applications” for a hypervisor are operating sys- Software Interrupts tems • those are available on a range of CPUs − hypervisor can use coarser abstractions than an OS • generally not very eficient for system calls − entire storage devices instead of a ilesystem • extra level of indirection Unikernels − the handler address is retrieved from memory − a lot of CPU state needs to be saved • kernels for running a single application Aside: SW Interrupts on PCs Part 3.5: Kernel Services • those are used even in real mode − legacy 16-bit mode of 80x86 CPUs What Does a Kernel Do? 
− BIOS (irmware) routines via int 0x10 & 0x13 • memory & process management − MS-DOS API via int 0x21 • task (thread) scheduling • and on older CPUs in 32-bit protected mode • device drivers − Windows NT uses int 0x2e − SSDs, GPUs, USB, bluetooth, HID, audio, … − Linux uses int 0x80 • ile systems • networking Trap Example: amd64 / x86_64 • sysenter and syscall instructions Additional Services − and corresponding sysexit / sysret • inter-process communication • the entry point is stored in a machine state register • timers and time keeping • there is only one entry point • process tracing, proiling − unlike with software interrupts • security, sandboxing • quite a bit faster than interrupts • cryptography
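To make the user-space side of this mechanism concrete, here is a minimal sketch (not part of the original slides) that performs the same write twice: once through the usual libc wrapper and once through the generic syscall() interface, passing the system call number explicitly. The syscall() function and the SYS_write constant are assumptions about a Linux-style POSIX system.

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>   /* SYS_write on Linux-style systems */

int main( void )
{
    const char msg[] = "hello\n";

    /* the usual way: the libc wrapper traps into the kernel for us */
    write( 1, msg, sizeof( msg ) - 1 );

    /* the same operation via the generic interface: the first
       argument is the system call number, the rest are its arguments */
    syscall( SYS_write, 1, msg, sizeof( msg ) - 1 );

    return 0;
}

Either call ends up in the same kernel-side handler; the only difference is whether libc or the programmer supplies the system call number.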

Which System Call? Reminder: Microkernel Systems • often there are many system calls • the kernel proper is very small − there are more than 300 on 64-bit Linux • it is accompanied by servers − about 400 on 32-bit Windows NT • in “true” microkernel systems, there are many servers • but there is only a handful of interrupts − each device, ilesystem, etc. is separate − and only one sysenter address • in hybrid systems, there is one, or a few − a “superserver” that resembles a monolithic kernel Reminder: System Call Numbers Kernel Services • each system call is assigned a number • available as SYS_write &c. on POSIX systems • we usually don’t care which server provides what • for the “universal” int syscall( int sys, ... ) − each system is different • this number is passed in a CPU register − for services, we take a monolithic view • the services are used through system librares System Call Sequence − they abstract away many of the details − e.g. whether a service is a system call or an IPC call • irst, libc prepares the system call arguments • and puts the system call number in the correct register User-Space Drivers in Monolithic Systems • then the CPU is switched into privileged mode • this also transfers control to the syscall handler • not all device drivers are part of the kernel • case in point: printer drivers System Call Handler • also some USB devices (not the USB bus though) • part of the GPU/graphics stack • the handler irst picks up the system call number − memory and output management in kernel • and decides where to continue − most of OpenGL in user space • you can imagine this as a giant switch statement switch ( sysnum ) Review Questions { 9. What CPU modes are there and how are they used? case SYS_write: return syscall_write(); 10. What is the memory management unit? case SYS_read: return syscall_read(); 11. What is a microkernel? /* many more */ 12. What is a system call? } Part 4: File Systems System Call Arguments • each system call has different arguments Lecture Overview • how they are passed to the kernel is CPU-dependent 1. Filesystem Basics • on 32-bit x86, most of them are passed in memory 2. The Block Layer • on amd64 Linux, all arguments go into registers 3. Virtual Filesystem Switch − 6 registers available for arguments 4. The UNIX Filesystem 5. Advanced Features Special Files • many things look somewhat like iles Part 4.1: Filesystem Basics • let’s exploit that and unify them with iles • recall part 2 on APIs: “everything is a ile” What is a ? − the API is the same for special and regular iles • a collection of iles and directories − not the implementation though • (mostly) hierarchical • usually exposed to the user File System Types • usually persistent (across reboots) • fat16, fat32, vfat, exfat (DOS, lash media) • ile managers, command line, etc. • ISO 9660 (CD-ROMs) • UDF (DVD-ROM) What is a (Regular) File? • NTFS (Windows NT) • a sequence of bytes • HFS+ (macOS) • and some basic metadata • ext2, ext3, ext4 (Linux) − owner, group, timestamp • ufs, ffs (BSD) • the OS does not care about the content − text, images, video, source code are all the same Multi-User Systems − executables are somewhat special • ile ownership • ile permissions What is a Directory? 
• disk quotas • a list of name ile mappings • an associative container if you will Ownership & Permissions − semantically, the value types are not homogeneous • we assume a discretionary model − syntactically, they are just i-nodes • whoever creates a ile is its owner • one directory = one component of a path • ownership can be transferred − /usr/local/bin • the owner decides about permissions What is an i-node? − basically read, write, execute • an anonymous, ile-like object Disk Quotas • could be a regular ile • disks are big but not ininite − or a directory • bad things happen when the ile system ills up − or a special ile − denial of service − or a symlink − programs may fail and even corrupt data Files are Anonymous • quotas limits the amount of space per user • this is the case with UNIX Part 4.2: The Block Layer − not all ile systems work like this • there are pros and cons to this approach Disk-Like Devices − e.g. open iles can be unlinked • names are assigned via directory entries • disk drives provide block-level access • read and write data in 512-byte chunks What Else is a Byte Sequence? − or also 4K on big modern drives • characters coming from a keyboard • a big numbered array of blocks • bytes stored on a magnetic tape Aside: Disk Addressing Schemes • audio data coming from a microphone • pixels coming from a webcam • CHS: Cylinder, Head, Sector • data coming on a TCP connection − structured adressing used in (very) old drives − exposes information about relative seek times Writing Byte Sequences − useless with variable-length cylinders • sending data to a printer − 10:4:6 CHS = 1024 cylinders, 16 heads, 63 sectors • playing back audio • LBA: Logical Block Addessing • writing text to a terminal (emulator) − linear, unstructured address space • sending data over a TCP stream − started as 22, later 28, … now 48 bit Block-Level Access Block-Level Encryption • disk drivers only expose linear addressing • symmetric & length-preserving • one block (sector) is the minimum read/write size • encryption key is derived from a passphrase • many sectors can be written “at once” • also known as “full disk encryption” − sequential access is faster than random • incurs a small performance penalty − maximum throughput vs IOPS • very important for security / privacy
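The two disk addressing schemes mentioned above are related by a simple formula; the sketch below uses a hypothetical geometry matching the 16-head, 63-sector layout from the slide (the constants are illustrative, not a property of any particular drive).

/* hypothetical drive geometry: heads per cylinder, sectors per track */
enum { HEADS = 16, SECTORS = 63 };

/* classic CHS -> LBA conversion; note that sector numbers start at 1 */
unsigned long chs_to_lba( unsigned c, unsigned h, unsigned s )
{
    return ( (unsigned long) c * HEADS + h ) * SECTORS + ( s - 1 );
}

/* e.g. chs_to_lba( 0, 0, 1 ) == 0 -- the very first block of the drive */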

Aside: Access Times
• block devices are slow (compared to RAM)
− RAM is slow (compared to CPU)
• we cannot treat drives as an extension of RAM
− not even the fastest modern flash storage
− latency: HDD 3–12 ms, SSD 0.1 ms, RAM 70 ns

Storing Data in Blocks
• splitting data into fixed-size chunks is unnatural
• there is no permission system for individual blocks
− this is unlike virtual (paged) memory
− it'd be really inconvenient for users
• processes are not persistent, but block storage is
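To illustrate what block-level access looks like from a program, the following sketch reads a single 512-byte sector from a raw disk device. The /dev/sda path is a Linux naming convention and opening it normally requires elevated privileges; error handling is minimal.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main( void )
{
    char sector[ 512 ];
    off_t lba = 0;                         /* block number to read */

    int fd = open( "/dev/sda", O_RDONLY ); /* raw block device (Linux name) */
    if ( fd < 0 )
        return 1;

    /* read exactly one block; the byte offset is lba * block size */
    if ( pread( fd, sector, sizeof( sector ), lba * 512 ) != sizeof( sector ) )
        return 1;

    printf( "first byte of block %lld: 0x%02x\n",
            (long long) lba, (unsigned char) sector[ 0 ] );
    close( fd );
    return 0;
}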

Block Access Cache
• caching is used to hide latency
− same principle as between CPU and RAM
• files recently accessed are kept in RAM
− many cache management policies exist
• implemented entirely in the OS
− many devices implement their own caching
− but the amount of fast memory is usually limited

Write Buffers
• the write equivalent of the block cache
• data is kept in RAM until it can be processed
• must synchronise with caching
− other users may be reading the file

Filesystem as Resource Sharing
• usually only 1 or few disks per computer
• many programs want to store persistent data
• the file system allocates space for the data
− which blocks belong to which file
• different programs can write to different files
− no risk of trying to use the same block

Filesystem as Abstraction
• allows the data to be organised into files
• enables the user to manage and review data
• files have arbitrary & dynamic size
− blocks are transparently allocated & recycled
• structured data instead of a flat block array

I/O Scheduler (Elevator) Part 4.3: Virtual Filesystem Switch • reads and writes are requested by users Layer • access ordering is crucial on a mechanical drive • many different ilesystems − not as important on an SSD • the OS wants to treat them all alike − but sequential access is still much preferred • VFS provides an internal, in-kernel API • requests are queued (recall, disks are slow) • ilesystem syscalls are hooked up to VFS − but they are not processed in FIFO order VFS in OOP terms RAID • VFS provides an abstract class, filesystem • hard drives are also unreliable • each ilesystem implementation derives filesystem − backups help, but take a long time to restore − e.g. class iso9660 : public filesystem • RAID = Redundant Array of Inexpensive Disks • each actual ile system gets an instance − live-replicate same data across multiple drives − /home, /usr, /mnt/usbflash each one − many different conigurations − the kernel uses the abstract interface to talk to them • the system stays online despite disk failures The filesystem Class RAID Performance • RAID affects the performance of the block layer struct handle { /* ... */ }; • often improved reading throughput struct filesystem − data is recombined from multiple channels { • write performance is more mixed virtual int open( const char * path ) = 0; − may require a fair amount of computation virtual int read( handle file, ... ) = 0; − more data needs to be written for redundancy /* ... */ } Filesystem-Speciic Operations − this includes permissions & ownership − type and properties of the special ile • open: look up the ile for access • they are just different kind of an i-node • read, write – self-explanatory • open, read, write, etc. bypass the ilesystem • seek: move the read/write pointer • : lush data to disk File Locking • mmap: memory-mapped IO • select: IO readiness notiication • multiple programs writing the same ile is bad − operations will come in randomly Standard IO − the resulting ile will be a mess • the usual way to use iles • ile locks ix this problem • open the ile − multiple APIs: fcntl vs flock − operations to read and write bytes − differences on networked ilesystems • data has to be buffered in user space − and then copied to/from kernel space Mount Points • not very eficient • recall that there is only a single directory tree • but there are multiple disks and ilesystems Memory-mapped IO • ile systems can be joined at directories • uses virtual memory (cf. last lecture) • root of one becomes a subdirectory of another • treat a ile as if it was swap space • the ile is mapped into process memory Part 4.4: The UNIX Filesystem − page faults indicate that data needs to be read − dirty pages cause writes Superblock • available as the mmap system call • holds toplevel information about the ilesystem • locations of i-node tables Sync-ing Data • locations of i-node and free space bitmaps • recall that the disk is very slow • block size, ilesystem size • waiting for each write to hit disk is ineficient • but if data is held in RAM, what if power is cut? 
I-Nodes − the sync operation ensures the data has hit disk • recall that i-node is an anonymous ile − often used in database implementations − or a directory, or a special • i-nodes only have numbers Filesystem-Agnostic Operations • directories tie names to i-nodes • handling executables • fcntl handling I-Node Allocation • special iles • often a ixed number of i-nodes • management of ile descriptors • i-nodes are either used or free • ile locks • free i-nodes may be stored in a bitmap Executables • alternatives: B-trees • memory mapped (like mmap) I-Node Content • may be paged in lazily • exact content of an i-node depends on its type • executables must be immutable while running • regular ile i-nodes contain a list of data blocks • but can be still unlinked from the directory − both direct and indirect (via a data block) The fcntl Syscall • symbolic links contain the target path • special devices describe what device they represent • mostly operations relating to ile descriptors − synchronous vs asynchronous access Attaching Data to I-Nodes − blocking vs non-blocking • a few direct block addresses in the i-node − close on exec: more on this in a later lecture − eg. 10 refs, 4K blocks, max. 40 kilobytes • also one of the several locking APIs • indirect data blocks Special Files − a block full of addresses of other blocks − one indirect block approx. 2 MiB of data • device nodes, pipes, sockets, … • extents: a contiguous range of blocks • only metadata for special iles lives on disk Fragmentation Soft Links (Symlinks) • internal – not all blocks are fully used • they exist to lift the one-device limitation − iles are of variable size, blocks are ixed • soft links to directories are OK − a 4100 byte ile needs 2 4 KiB blocks − this can cause loops in the ilesystem • external – free space is non-contiguous • the soft link i-node contains a path − happens when many iles try to grow at once − the meaning can change when paths change − this means new iles are also fragmented • dangling link: points to a non-existent path

Fragmentation Problems Free Space • performance: can’t use fast sequential IO • similar problem to i-node allocation − programs often read iles sequentially − but regards data blocks − fragmention random IO on the device • goal: quickly locate data blocks to use • metadata size: can’t use long extents − also: keep data of a single ile close together • internal: waste of disk space − also: minimise external fragmentation • usually bitmaps or B-trees Directories • uses data blocks (like regular iles) File System Consistency • but the blocks hold name i-node maps • what happens if power is cut? • modern ile systems use hashes or trees • data buffered in RAM is lost • the format of directory data is ilesystem-speciic • the IO scheduler can re-order disk writes • the ile system can become corrupt File Name Lookup Journalling • we often need to ind a ile based on a path • each component means a directory search • also known as an intent log • directories can have many thousands entries • write down what was going to happen synchronously • ix the actual metadata based on the journal Old-Style Directories • has a performance penalty at run-time • unsorted sequential list of entries − reduces downtime by making consistency checks fast • new entries are simply appended at the end − may also prevent data loss • unlinking can create holes • lookup in large directories is very ineficient Part 4.5: Advanced Features

Hash-Based Directories What Else Can Filesystems Do? • only need one block read on average • transparent ile compression • often the most eficient option • ile encryption • extendible hashing • block de-duplication − directories can grow over time • snapshots − gradually allocates more blocks • checksums • redundant storage Tree-Based Directories File Compression • self-balancing search trees • use one of the standard compression algorithms • optimised for block-level access − must be fairly general-purpose (i.e. not JPEG) • B trees, B+ trees, B* trees − and of course lossless • logarithmic number of reads − e.g. LZ77, LZW, Huffman Coding, … − this is worst case, unlike hashing • quite challenging to implement Hard Links − the length of the ile changes (unpredictably) − eficient random access inside the ile • multiple names can refer to the same i-node − names are given by directory entries File Encryption − we call such multiple-named iles hard links • use symmetric encryption for individual iles − it’s usually forbidden to hard-link directories − must be transparent to upper layers (applications) • hard links cannot cross device boundaries − symmetric crypto is length-preserving − i-node numbers are only unique within a ilesystem − encrypted directories, inheritance, &c. • a new set of challenges • programs were scheduled ahead of time − key and passphrase management • we are talking punch cards &c. • and computers that took an entire room Block De-duplication • sometimes the same data block appears many times History: Time Sharing − virtual machine images are a common example • “mini” computers could run programs interactively − also containers and so on • teletype terminals, screens, keyboards • some ilesystems will identify those cases • multiple users at the same time − internally point many iles to the same block • hence, multiple programs at the same time − copy on write to preserve illusion of separate iles Processes: Early View Snapshots • process is an executing program • it is convenient to be able to copy entire ilesystems • there can be multiple processes − but this is also expensive • various resources belong to a process − snapshots provide an eficient means for this • each process belongs to a particular user • snapshot is a frozen image of the ilesystem − cheap, because snapshots share storage Process Resources − easier than de-duplication • memory (address space) − again implemented as copy-on-write • processor time Checksums • open iles (descriptors) − also working directory • hardware is unreliable − also network connections − individual bytes or sectors may get corrupted − this may happen without the hardware noticing Process Memory Segments • the ilesystem may store checksums along with meta- • program text: contains instructions data • data: static and dynamic data − and possibly also ile content − with a separate read-only section − this protects the integrity of the ilesystem • stack memory: execution stack • beware: not cryptographically secure − return addresses Redundant Storage − automatic variables • like ilesystem-level RAID Process Memory • data and metadata blocks are replicated − may be between multiple local block devices • each process has its own address space − but also across a cluster / many computers • this means processes are isolated from each other • drastically improves fault tolerance • requires that the CPU has an MMU • implemented via paging (page tables) Review Questions 13. What is a block device? Process Switching 14. 
What is an IO scheduler? • switching processes means switching page tables 15. What does memory-mapped IO mean? • physical addresses do not change 16. What is an i-node? • but the mapping of virtual addresses does • large part of physical memory is not mapped Part 5: Basic Resources & Multiplexing − could be completely unallocated (unused) − or belong to other processes Lecture Overview 1. processes and virtual and TLB 2. thread scheduling • address translation is slow 3. interrupts and clocks • recently-used pages are stored in a TLB − short for Translation Look-aside Buffer Part 5.1: Processes and Virtual Memory − very fast hardware cache • the TLB needs to be lushed on process switch Prehistory: Batch Systems − this is fairly expensive (microseconds) • irst computers ran one program at a time Processor Time Sharing • fault when either process tries to write • CPU time is sliced into time shares − remember the memory is marked as read-only • time shares (slices) are like memory frames • the OS checks if the memory is supposed to be writable • process computation is like memory pages − if yes, it makes a copy and allows the write • processes are allocated into time shares Init Multiple CPUs • on UNIX, fork is the only way to make a process • execution of a program is sequential • but fork splits existing processes into 2 • instructions depend on results of previous instructions • the irst process is special • one CPU = one instruction sequence • it is directly spawned by the kernel on boot • physical limits on CPU speed multiple cores Process Identiier Threads • processes are assigned numeric identiiers • how to use multiple cores in one process? • also known as PID (Process ID) • threads: a new unit of CPU scheduling • those are used in process management • each thread runs sequentially • used calls like kill or setpriority • one process can have multiple threads Process vs Executable What is a Thread? • process is a dynamic entity • thread is a sequence of instructions • executable is a static ile • different threads run different instructions • an executable contains an initial memory image − as opposed to SIMD or many-core units (GPUs) − this sets up memory layout • each thread has its own stack − and content of the text and data segments • multiple threads can share an address space Exec Modern View of a Process • on UNIX, processes are created via fork • in a modern view, process is an address space • how do we run programs though? • threads are the right scheduling abstraction • exec: load a new executable into a process − this completely overwrites process memory • process is a unit of memory management − execution starts from the entry point • thread is a unit of computation • running programs: fork + exec • old view: one process = one thread Part 5.2: Thread Scheduling Memory Segment Redux • one (shared) text segment What is a Scheduler? • a shared read-write data segment • scheduler has two related tasks • a read-only data segment − plan when to run which thread • one stack for each thread − actually switch threads and processes • usually part of the kernel Fork − even in micro-kernel operating systems • how do we create new processes? 
• by fork-ing existing processes Switching Threads • fork creates an identical copy of a process • threads of the same process share an address space • execution continues in both processes − a partial context switch is needed − each of them gets a different return value − only register state has to be saved and restored • no TLB lushing – lower overhead Lazy Fork • paging can make fork quite eficient Fixed vs Dynamic Schedule • we start by copying the page tables • ixed schedule = all processes known in advance • initially, all pages are marked read-only − only useful in special / embedded systems • the processes start out sharing memory − can conserve resources − planning is not part of the OS Lazy Fork: Faults • most systems use dynamic scheduling • the shared memory becomes copy on write − what to run next is decided periodically Preemptive Scheduling Scheduling Strategies • tasks (threads) just run as if they owned the CPU • irst in, irst served (batch systems) • the OS forcibly takes the CPU away from them • earliest deadline irst (realtime) − this is called • round robin • pro: a faulty program cannot block the system • ixed priority preemptive • somewhat less eficient than cooperative • fair share scheduling (multi-user)
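Returning briefly to process creation: the fork + exec pattern described above can be sketched as follows. This is a minimal illustration, not a prescribed recipe; the /bin/echo path is just an example and error handling is left out.

#include <sys/wait.h>
#include <unistd.h>

int main( void )
{
    pid_t pid = fork();          /* duplicate the current process */

    if ( pid == 0 )
    {
        /* child: replace the process image with a new executable */
        execl( "/bin/echo", "echo", "hello from the child", (char *) 0 );
        _exit( 1 );              /* only reached if exec failed */
    }

    /* parent: fork returned the child's PID; wait for it to finish */
    waitpid( pid, NULL, 0 );
    return 0;
}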

Cooperative Scheduling Interactivity • threads (tasks) cooperate to share the CPU • throughput vs latency • each thread has to explicitly yield • latency is more important for interactive workloads • this can be very eficient if designed well − think phone or desktop systems • but a bad program can easily block the system − but also web servers • throughput is more important for batch systems Scheduling in Practice − think render farms, compute grids, simulation • cooperative on Windows 3.x for everything Reducing Latency • cooperative for threads on classic Mac OS − but preemptive for processes • shorter time slices • preemptive on pretty much every modern OS • more willingness to switch tasks (more preemption) − including real-time and embedded systems • dynamic priorities • priority boost for foreground processes Waiting and Yielding Maximising Throughput • threads often need to wait for resources or events • longer time slices − they could also use software timers • reduce context switches to minimum • a waiting thread should not consume CPU time • • such a thread will yield the CPU • it is put on a list and later woken up by the kernel Multi-Core Schedulers Run Queues • traditionally one CPU, many threads • nowadays: many threads, many CPUs (cores) • runnable (not waiting) threads are queued • more complicated algorithms • could be priority, round-robin or other queue types • more complicated & concurrent-safe data structures • scheduler picks threads from the run queue • preempted threads are put back Scheduling and Caches • threads can move between CPU cores Priorities − important when a different core is idle • what share of the CPU should a thread get? − and a runnable thread is waiting for CPU • priorities are static and dynamic • but there is a price to pay • dynamic priority is adjusted as the thread runs − thread / process data is extensively cached − this is done by the system / scheduler − caches are typically not shared by all cores • a static priority is assigned by the user Core Afinity Fairness • modern schedulers try to avoid moving threads • equal (or priority-based) share per thread • threads are said to have an afinity to a core • what if one process has many more threads? • an extreme case is pinning • what if one user has many more processes? − this altogether prevents the thread to be migrated • what if one user group has many more active users? 
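As a brief aside on core affinity: pinning can also be requested explicitly by a program. The sketch below uses the Linux-specific sched_setaffinity call (other systems expose different interfaces) to restrict the calling process to core 0.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main( void )
{
    cpu_set_t set;
    CPU_ZERO( &set );
    CPU_SET( 0, &set );          /* allow only CPU (core) number 0 */

    /* a pid of 0 means "the calling process" */
    if ( sched_setaffinity( 0, sizeof( set ), &set ) != 0 )
        perror( "sched_setaffinity" );
    else
        printf( "pinned to core 0\n" );

    return 0;
}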
• practically, this practice improves throughput − even if nominal core utilisation may be lower Fair Share Scheduling • we can use a multi-level scheduling scheme NUMA Systems • CPU is sliced fairly irst among user groups • non-uniform memory architecture • then among users − different memory is attached to different CPUs • then among processes − each symmetric block within a NUMA is called a node • and inally among threads • migrating a process to a different node is expensive − thread vs node ping-pong can kill performance − threads of one process should live on one node Tickless Kernels • the timer interrupt wakes up the CPU Part 5.3: Interrupts and Clocks • this can be ineficient if the system is idle • alternative: use one-off timers Interrupt − allows the CPU to sleep longer • a way for hardware to request attention − this improves power eficiency on light loads • CPU mechanism to divert execution • partial (CPU state only) context switch Tickless Scheduling • switch to privileged (kernel) CPU mode • slice length (quantum) becomes part of the planning Hardware Interrupts • if a core is idle, wake up on next software timer − synchronisation of software timers • asynchronous, unlike software interrupts • other interrupts are delivered as normal • triggered via bus signals to the CPU − network or disk activity • IRQ = interrupt request − keyboard, mice, … − just a different name for hardware interrupts • PIC = programmable interrupt controller Other Interrupts Interrupt Controllers • serial port − data is available on the port • PIC: simple circuit, typically with 8 input lines • network hardware − peripherals connect to the PIC with wires − data is available in a packet queue − PIC delivers prioritised signals to the CPU • keyboards, mice • APIC: advanced programmable interrupt controller − user pressed a key, moved the mouse − split into a shared IO APIC and per-core local APIC • USB devices in general − typically 24 incoming IRQ lines • OpenPIC, MPIC: similar to APIC, used by e.g. Freescale Interrupt Routing Timekeeping • not all CPU cores need to see all interrupts • APIC can be told how to deliver IRQs • PIT: programmable interval timer − the OS can route IRQs to CPU cores − crystal oscillator + divider • multi-core systems: IRQ load balancing − IRQ line to the CPU − useful to spread out IRQ overhead • local APIC timer: built-in, per-core clock − especially useful with high-speed networks • HPET: high-precision event timer • RTC: real-time clock Review Questions Timer Interrupt 17. What is a thread and a process? • generated by the PIT or the local APIC 18. What is a (thread, process) scheduler? • the OS can set the frequency 19. What do fork and exec do? • a hardware interrupt happens on each tick 20. What is an interrupt? • this creates an opportunity for bookkeeping • and for preemptive scheduling Part 6: Concurrency and Locking

Timer Interrupt and Scheduling This lecture will deal with the issues that arise from running mul- tiple threads and processes at the same time, both using time- • measure how much time the current thread took sharing of a single processor and by executing on multiple phys- • if it ran out of its slice, preempt it ical CPU cores. − pick a new thread to execute − perform a context switch Lecture Overview • those checks are done on each tick 1. Inter-Process Communication − rescheduling is usually less frequent 2. Synchronisation Timer Interrupt Frequency 3. Deadlocks • typical is 100 Hz In the irst part, we will explore the basic why’s and how’s of • this means a 10 ms scheduling slice (quantum) inter-process and inter-thread communication. This will natu- • 1 kHz is also possible rally lead to questions about shared resources, and to the topic − harms throughput but improves latency of thread synchronisation, mutual exclusion and so on. Finally, we will deal with waiting and deadlocks, which arise whenever − this is analogous to a conversation multiple threads can wait for each other. • but unidirectional communication also makes sense − e.g. sending commands to a child process What is Concurrency? − do acknowledgments count as communication? • events that can happen at the same time • it is not important if it does, only that it can Communication Example • events can be given a happens-before partial order • network services are a typical example • they are concurrent if unordered by happens-before • take a web server and a web browser • the browser sends a request for a web page Why Concurrency? • the server responds by sending data • problem decomposition − different tasks can be largely independent Files • relecting external concurrency • it is possible to communicate through iles − serving multiple clients at once • multiple processes can open the same ile • performance and hardware limitations • one can write data and another can process it − higher throughput on multicore computers − the original program picks up the results Parallel Hardware − typical when using programs as modules • hardware is inherently parallel A File-Based IPC Example • software is inherently sequential • iles are used e.g. when you run cc file.c • something has to give − it irst runs a preprocessor: cpp -o file.i file.c − hint: it’s not going to be hardware − then the compiler proper: cc1 -o file.o file.i Part 6.1: Inter-Process Communication − and inally a linker: ld file.o crt.o -lc • the intermediate iles may be hidden in /tmp Communication is an important part of all but the most trivial − and deleted when the task is completed software. While the mechanisms described in this section are tra- ditionally known as inter-process communication (IPC), we will Directories also consider cases where threads of a single process use those • communication by placing iles or links mechanisms (and will not use a special name, even the commu- • typical use: a spool directory nication is, in fact, intra-process in those cases). − clients drop iles into the directory for processing Reminder: What is a Thread − a server periodically picks up iles in there • used for e.g. 
printing and email • thread is a sequence of instructions • each instruction happens-before the next Pipes − or: happens-before is a total order on the thread • a device for moving bytes in a stream • basic unit of scheduling − note the difference from messages Reminder: What is a Process • one process writes, the other reads • the reader blocks if the pipe is empty • the basic unit of resource ownership • the writer blocks if the pipe buffer is full − primarily memory, but also open iles &c. • may contain one or more threads UNIX and Pipes • processes are isolated from each other − IPC creates gaps in that isolation • pipes are used extensively in UNIX • pipelines built via the shell’s | operator I/O vs Communication • e.g. ls | grep hello.c • most useful for processing data in stages • take standard input and output − imagine process A writes a ile Especially in the case of user-setup pipes (via shell pipelines), the − later, process B reads that ile boundary between IPC and “standard IO” is rather blurry. Pro- • communication happens in real time grams in UNIX are often written in a style, where they read input, − between two running threads / processes process it and write the result as their output. This makes them − automatic: without user intervention amenable for use in pipelines. While pipes are arguably “more automatic” than dropping the output in a ile and running another Direction command to process the ile, there is also a degree of manual in- • bidirectional communication is typical tervention. Another way to look at the difference may be that a pipe is primarily an IPC mechanism, while a ile is primarily a a₀ ← load x | b₀ ← load x storage mechanism. a₁ ← a₀ + 1 | b₁ ← b₀ + 1 store a₁ x | store b₁ x Sockets • similar to, but more capable than pipes Critical Section • allows one server to talk to many clients • any section of code that must not be interrupted • each connection acts like a bidirectional pipe • the statement x = x + 1 could be a critical section • could be local but also connected via a network • what is a critical section is domain-dependent − another example could be a bank transaction Shared Memory − or an insertion of an element into a linked list • memory is shared when multiple threads can access it − happens naturally for threads of a single process Race Condition: Deinition − the primary means of inter-thread communication • (anomalous) behaviour that depends on timing • many processes can map same piece of physical mem- • typically among multiple threads or processes ory • an unexpected sequence of events happens − this is the more traditional setting • recall that ordering is not guaranteed − hence also allows inter-process communication Races in a Filesystem Message Passing • the ile system is also a shared resource • communication using discrete messages • and as such, prone to race conditions • we may or may not care about delivery order • e.g. two threads both try to create the same ile • we can decide to tolerate message loss − what happens if they both succeed? • often used across a network − if both write data, the result will be garbled Part 6.2: Synchronisation Mutual Exclusion • only one thread can access a resource at once Shared Variables • ensured by a mutual exclusion device (a.k.a mutex) • a mutex has 2 operations: lock and unlock • structured view of shared memory • lock may need to wait until another thread unlocks • typical in multi-threaded programs • e.g. 
any global variable in a program Semaphore • but may also live in memory from malloc • somewhat more general than a mutex Shared Heap Variable • allows multiple interchangeable instances of a resource − that many threads can enter the critical section void *thread( int *x ){ *x = 7;} • basically an atomic counter int main() { Monitors pthread_t id; • a programming language device (not OS-provided) int *x = malloc( sizeof( int ) ); • internally uses standard mutual exclusion pthread_create( &id, NULL, thread, x ); • data of the monitor is only accessible to its methods } • only one thread can enter the monitor at any given time
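A minimal sketch of mutual exclusion in POSIX terms (pthreads assumed): two threads update the same counter, but the lock/unlock pair around the read-modify-write step keeps them from interleaving inside the critical section.

#include <pthread.h>
#include <stdio.h>

int counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker( void *arg )
{
    for ( int i = 0; i < 100000; ++i )
    {
        pthread_mutex_lock( &lock );    /* enter the critical section */
        counter = counter + 1;
        pthread_mutex_unlock( &lock );  /* leave the critical section */
    }
    return NULL;
}

int main( void )
{
    pthread_t a, b;
    pthread_create( &a, NULL, worker, NULL );
    pthread_create( &b, NULL, worker, NULL );
    pthread_join( a, NULL );
    pthread_join( b, NULL );
    printf( "%d\n", counter );          /* always 200000 with the mutex */
    return 0;
}

Without the mutex, the final value could be anything up to 200000, for the reasons discussed in the race condition example below.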

Race Condition: Example
• consider a shared counter, i
• and the following two threads

int i = 0;
void thread1() { i = i + 1; }
void thread2() { i = i - 1; }

Condition Variables
• what if the monitor needs to wait for something?
• imagine a bounded queue implemented as a monitor
− what happens if it becomes full?
− the writer must be suspended
• condition variables have wait and signal operations
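The wait and signal operations can be sketched with POSIX condition variables (pthreads assumed): a consumer sleeps until a producer signals that an item is available. A real bounded queue would also wait in the producer when the queue is full; this sketch only shows the consumer side of that pattern.

#include <pthread.h>

pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;
int items = 0;

void consume( void )
{
    pthread_mutex_lock( &qlock );
    while ( items == 0 )                     /* re-check after every wake-up */
        pthread_cond_wait( &qcond, &qlock ); /* releases the lock while sleeping */
    items = items - 1;                       /* take one item */
    pthread_mutex_unlock( &qlock );
}

void produce( void )
{
    pthread_mutex_lock( &qlock );
    items = items + 1;                       /* add one item */
    pthread_cond_signal( &qcond );           /* wake up one waiting consumer */
    pthread_mutex_unlock( &qlock );
}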

What is the value of i after both finish?

Race on a Variable
• memory access is not atomic
• take x = x + 1

Spinlocks
• a spinlock is the simplest form of a mutex
• the lock method repeatedly tries to acquire the lock
− this means it is taking up processor time
− also known as busy waiting
• spinlocks between threads on the same CPU are very bad
− but can be very efficient between CPUs

Suspending Mutexes
• these need cooperation from the OS scheduler
• when lock acquisition fails, the thread sleeps
− it is put on a waiting queue in the scheduler
• unlocking the mutex will wake up the waiting thread
• needs a system call → slow compared to a spinlock

Condition Variables Revisited
• same principle as a suspending mutex
• the waiting thread goes into a wait queue
• the signal method moves the thread back to a run queue
• the busy-wait version is known as polling

Barrier
• sometimes, parallel computation proceeds in phases
− all threads must finish phase 1
− before any can start phase 2
• this is achieved with a barrier
− blocks all threads until the last one arrives
− waiting threads are usually suspended

Dining Philosophers

Shared Resources
• hardware comes in a limited number of instances
• many devices can only do one thing at a time
• think printers, DVD writers, tape drives, …
• we want to use the devices efficiently → sharing

Network-based Sharing
• sharing is not limited to processes on one computer
• printers and scanners can be network-attached
• all computers on a network may need to coordinate access
− this could lead to multi-computer deadlocks

Locks as Resources
• we explored locks in the previous section
• locks (mutexes) are also a form of resource
− a mutex can be acquired (locked) and released
− a locked mutex belongs to a particular thread
• locks are proxy (stand-in) resources

Preemptable Resources
• sometimes, held resources can be taken away
• this is the case with e.g. physical memory
− a process can be swapped to disk if need be
• preemptability may also depend on context
− maybe paging is not available

Non-preemptable Resources • those resources cannot be (easily) taken away • think photo printer in the middle of a page • or a DVD burner in the middle of writing • non-preemptable resources can cause deadlocks

Resource Acquisition • a process needs to request access to a resource • this is called an acquisition • when the request is granted, it can use the device Readers and Writers • after it is done, it must release the device − this makes it available for other processes • imagine a shared database • many threads can read the database at once Waiting • but if one is writing, no other can read nor write • what if there are always some readers? • what to do if we wish to acquire a busy resource? • unless we don’t really need it, we have to wait Read-Copy-Update • this is the same as waiting for a mutex • the thread is moved to a wait queue • the fastest lock is no lock • RCU allows readers to work while updates are done Resource Deadlock − make a copy and update the copy − point new readers to the updated copy • two resources, A and B • when is it safe to reclaim memory? • two processes, P and Q • P acquires A, Q acquires B Part 6.3: Deadlocks • P tries to acquire B but has to wait for Q • Q tries to acquire A but has to wait for P Deadlock Conditions • if we can remove one of them, deadlocks are prevented 1. mutual exclusion Prevention via Spooling 2. hold and wait condition 3. non-preemtability • this attacks the mutual exclusion property 4. circular wait • multiple programs could write to a printer • the data is collected by a spooling daemon Deadlock is only possible if all 4 are present. • which then sends the jobs to the printer in sequence
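Returning to the readers and writers pattern described above: it maps directly onto a POSIX read-write lock, where any number of readers may hold the lock at once but a writer gets exclusive access. A sketch assuming pthreads, with a single integer standing in for the shared database:

#include <pthread.h>

pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
int database = 0;

int reader( void )
{
    pthread_rwlock_rdlock( &rw );   /* shared: many readers at once */
    int value = database;
    pthread_rwlock_unlock( &rw );
    return value;
}

void writer( int value )
{
    pthread_rwlock_wrlock( &rw );   /* exclusive: no readers, no writers */
    database = value;
    pthread_rwlock_unlock( &rw );
}

Note that a simple rwlock does not by itself solve writer starvation (the "what if there are always some readers?" problem).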

This can trade a deadlock on the printer for a deadlock on disk space. However, disk space is much more likely to be preemptable in this scenario, since a job blocked by a full disk can be canceled (and erased from disk) and later retried.

Non-Resource Deadlocks
• not all deadlocks are due to resource contention
• imagine a message-passing system
• process A is waiting for a message
• process B sends a message to A and waits for reply
• the message is lost in transit

Example: Pipe Deadlock
• recall that both the reader and writer can block
• what if we create a pipe in each direction?
• process A writes data and tries to read a reply
− it blocks because the opposite pipe is empty
• process B reads the data but waits for more → deadlock

Prevention via Reservation
• we can also try removing hold-and-wait
• for instance, we can only allow batch acquisition
− the process must request everything at once
− this is usually impractical
• alternative: release and re-acquire

Prevention via Ordering
• this approach eliminates circular waits
• we impose a global order on resources
• a process can only acquire resources in this order
− must release + re-acquire if the order is wrong
• it is impossible to form a cycle this way

Deadlocks: Do We Care?
• deadlocks can be very hard to debug
• they can also be exceedingly rare
• we may find the risk of a deadlock acceptable
• just reboot everything if we hit a deadlock
− also known as the ostrich algorithm

Deadlock Detection
• we can at least try to detect deadlocks
• usually by checking the circular wait condition
• keep a graph of who owns what and who waits for what
• if there is a loop in the graph → deadlock

Livelock
• in a deadlock, no progress can be made
• but it's not much better if processes go back and forth
− for instance releasing and re-acquiring resources
− they make no useful progress
− they additionally consume resources
• this is known as a livelock and is just as bad as a deadlock
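The resource deadlock described earlier (P holds A and wants B, Q holds B and wants A) is easy to reproduce with two mutexes. The sketch below (pthreads assumed) shows the problematic acquisition order; the comment at the end points out the prevention-via-ordering fix.

#include <pthread.h>

pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t B = PTHREAD_MUTEX_INITIALIZER;

void *thread_p( void *arg )           /* P acquires A, then wants B */
{
    pthread_mutex_lock( &A );
    pthread_mutex_lock( &B );         /* may wait for Q forever */
    pthread_mutex_unlock( &B );
    pthread_mutex_unlock( &A );
    return NULL;
}

void *thread_q( void *arg )           /* Q acquires B, then wants A */
{
    pthread_mutex_lock( &B );         /* circular wait: deadlock possible */
    pthread_mutex_lock( &A );
    pthread_mutex_unlock( &A );
    pthread_mutex_unlock( &B );
    return NULL;
}

/* prevention via ordering: if both threads always locked A before B,
   no cycle could form and this deadlock would be impossible */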

Deadlock Recovery Starvation • if a preemptable resource is involved, reassign it • starvation happens when a process can’t make progress • otherwise, it may be possible to do a rollback • generalisation of both deadlock and livelock − this needs elaborate checkpointing mechanisms • for instance, unfair scheduling on a busy system • all else failing, kill some of the processes • also recall the readers and writers problem − the devices may need to be re-initialised Review Questions Deadlock Avoidance 21. What is a mutex? • we can possibly deny acquisitions to avoid deadlocks 22. What is a deadlock? • we need to know the maximum resources for each process 23. What are the conditions for a deadlock to form? • avoidance relies on safe states 24. What is a race condition? − worst case all processes ask for maximum resources − safe means we can avoid a deadlock in the worst Part 7: Device Drivers case Lecture Overview Deadlock Prevention 1. Drivers, IO and Interrupts • deadlock avoidance is typically impractical 2. System and Expansion Busses • there are 4 conditions for deadlocks to exist 3. Graphics • we can try attacking those conditions 4. Persistent Storage 5. Networking and Wireless Port-Mapped IO • early CPUs had very limited address space Part 7.1: Drivers, IO and Interrupts − 16-bit addresses mean 64KB of memory • peripherals got a separate address space Input and Output • special instructions for using those addresses • we will mostly think in terms of IO − e.g. in and out on x86 processors • peripherals produce and consume data • input – reading data produced by a device Memory-mapped IO • output – sending data to a device • devices share address space with memory • more common in contemporary systems What is a Driver? • IO uses the same instructions as memory access • piece of software that talks to a device − load and store on RISC, mov on x86 • usually quite speciic / unportable • allows selective user-level access (via the MMU) − tied to the particular device − and also to the operating system Programmed IO • often part of the kernel • input or output is driven by the CPU • the CPU must wait until the device is ready Kernel-mode Drivers • would usually run at bus speed • they are part of the kernel − 8 MHz for ISA (and hence ATA-1) • running with full kernel privileges • PIO would talk to a buffer on the device − including unrestricted hardware access • no or minimal context switching overhead Interrupt-driven IO − fast but dangerous • peripherals are much slower than the CPU − polling the device is expensive Microkernels • the peripheral can signal data availability • drivers are excluded from microkernels − and also readiness to accept more data • but the driver still needs hardware access • this frees up CPU to do other work in the meantime − this could be a special memory region Interrupt Handlers − it may need to react to interrupts • in principle, everything can be done indirectly • also known as irst-level interrupt handler − but this may be quite expensive, too • they must run in privileged mode − they are part of the kernel by deinition User-mode Drivers • the low-level interrupt handler must inish quickly • many drivers can run completely in user space − it will mask its own interrupt to avoid re-entering • this improves robustness and security − and schedule any long-running jobs for later (SLIH) − driver bugs can’t bring the entire system down Second-level Handler − nor can they compromise system security • possibly at some cost to performance • does any expensive 
interrupt-related processing • can be executed by a kernel thread Drivers in Processes − but also by a user-mode driver • usually not time critical (unlike irst-level handler) • user-mode drivers typically run in their own process − can use standard locking mechanisms • this means context switches − every time the device demands attention (interrupt) − every time another process wants to use the device • the driver needs system calls to talk to the device • allows the device to directly read/write memory − this incurs even more overhead • this is a huge improvement over programmed IO • interrupts only indicate buffer full/empty In-Process Drivers • the device can read and write arbitrary physical mem- ory • what if a (large portion of) a driver could be a library − opens up security / reliability problems • best of both worlds − no context switch overhead for requests IO-MMU − bugs and security problems remain isolated • like the MMU, but for DMA transfers • often used for GPU-accelerated 3D graphics • allows the OS to limit memory access per device • very useful in virtualisation • MCA pioneered software-conigured devices • only recently found its way into consumer computers • PCI further improved on MCA with “Plug and Play” − each PCI device has an ID it can tell the system Part 7.2: System and Expansion Busses − allows for enumeration and automatic coniguration
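Since each PCI device can identify itself, the enumerated IDs are also visible from user space. On Linux they appear under sysfs; the sketch below (Linux-specific path layout, minimal error handling) prints the vendor/device pair of every enumerated PCI device, i.e. the same numbers the kernel matches against its drivers.

#include <stdio.h>
#include <dirent.h>

int main( void )
{
    /* Linux-specific: every enumerated PCI device has a directory here */
    DIR *d = opendir( "/sys/bus/pci/devices" );
    struct dirent *e;

    while ( d && ( e = readdir( d ) ) )
    {
        if ( e->d_name[ 0 ] == '.' )
            continue;

        char path[ 512 ], vendor[ 16 ] = "", device[ 16 ] = "";

        snprintf( path, sizeof( path ),
                  "/sys/bus/pci/devices/%s/vendor", e->d_name );
        FILE *f = fopen( path, "r" );
        if ( f ) { fscanf( f, "%15s", vendor ); fclose( f ); }

        snprintf( path, sizeof( path ),
                  "/sys/bus/pci/devices/%s/device", e->d_name );
        f = fopen( path, "r" );
        if ( f ) { fscanf( f, "%15s", device ); fclose( f ); }

        printf( "%s: vendor %s device %s\n", e->d_name, vendor, device );
    }

    if ( d ) closedir( d );
    return 0;
}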

History: ISA (Industry Standard Architecture) PCI IDs and Drivers • 16-bit system expansion bus on IBM PC/AT • PCI allows for device enumeration • programmed IO and interrupts (but no DMA) • device identiiers can be paired to device drivers • a ixed number of hardware-conigured interrupt lines • this allows the OS to load and conigure its drivers − likewise for I/O port ranges − or even download / install drivers from a vendor − the HW settings then need to be typed back for SW • parallel data and address transmission AGP: Accelerated Graphics Port • PCI eventually became too slow for GPUs MCA, EISA − AGP is based on PCI and only improves performance • MCA: Micro Channel Architecture − enumeration and coniguration stays the same − proprietary to IBM, patent-encumbered • adds a dedicated point-to-point connection − 32-bit, software-driven device coniguration • multiple transfers per clock (up to 8, for 2 GB/s) − expensive and ultimately a market failure • EISA: Enhanced ISA PCI Express − a 32-bit extension of ISA • the current high-speed peripheral bus for PC − mostly created to avoid MCA licensing costs • builds on / extends conventional PCI − short-lived and replaced by PCI • point-to-point, serial data interconnect • much improved throughput (up to ~30GB/s) VESA Local Bus • memory mapped IO & DMA on otherwise ISA systems USB: Universal Serial Bus • tied to the 80486 line of CPUs (and AMD clones) • primarily for external peripherals • primarily for graphics cards − keyboards, mice, printers, … − but also used with hard drives − replaced a host of legacy ports • quickly fell out of use with the arrival of PCI • later revisions allow high-speed transfers − suitable for storage devices, cameras &c. PCI: Peripheral Component Interconnect • device enumeration, capability negotiation • a 32-bit successor to ISA − 33 MHz (compared to 8 MHz for ISA) USB Classes − later revisions at 66 MHz, PCI-X at 133 MHz • a set of vendor-neutral protocols − added support for bus-mastering and DMA • HID = human-interface device • still a shared, parallel bus • mass storage = disk-like devices − all devices share the same set of wires • audio equipment Bus Mastering • printing • normally, the CPU is the bus master Other USB Uses − which means it initiates communication • ethernet adapters • it’s possible to have multiple masters • usb-serial adapters − they need to agree on a conlict resolution protocol • wii adapters (dongles) • usually used for accessing the memory − there isn’t a universal protocol DMA (Direct Memory Access) − each USB WiFi adapter needs • bluetooth • the most common form of bus mastering • the CPU tells the device what and where to write ARM Busses • the device then sends data directly to RAM • ARM is typically used in System-on-a-Chip designs − the CPU can work on other things in the meantime • those use a proprietary bus to connect peripherals − completion is signaled via an interrupt • there is less need for enumeration Plug and Play − the entire system is baked into a single chip • the peripherals can be pre-conigured • the ISA system for IRQ coniguration was messy USB and PCIe on ARM • originally a purpose-built hardware renderer • USB nor PCIe are exclusive to the PC platform − based on polygonal meshes and Z buffering • most ARM SoC’s support USB devices • increasingly more lexible and programmable − for slow and medium-speed off-SoC devices • on-board RAM, high-speed connection to system RAM − e.g. 
used for ethernet on RPi 1 GPU Drivers • some ARM SoC’s support PCI Express − this allows for high-speed off-SoC peripherals • split into a number of components • graphics output / frame buffer access PCMCIA & PC Card • memory management is often done in kernel • People Can’t Memorize Computer Industry Acronyms • geometry, textures &c. are prepared in-process − PC = Personal Computer, MC = Memory Card • front end API: OpenGL, Direct3D, Vulkan, … • hotplug-capable notebook expansion bus Shaders • used for memory cards, network adapters, modems • comes with its own set of drivers (cardbus) • current GPUs are computation devices • the GPU has its own machine code for shaders ExpressCard • the GPU driver contains a shader compiler • an expansion card standard like PCMCIA / PC Card − either all the way from a high level language (HLSL) • based on PCIe and USB − or starting with an intermediate code (SPIR) − can mostly re-use drivers for those standards Mode Setting • not in wide use anymore − last update was in 2009, introducing USB 3 support • this part deals with screen coniguration and resolu- − the industry association disbanded the same year tion • including support for e.g. multiple displays miniPCIe, mSATA, M.2 • usually also supports primitive (SW-only) framebuffer • those are physical interfaces, not special busses • often done by a kernel with minimum user-level sup- • they provide some mix of PCIe, SATA and USB port − also other protocols like I²C, SMBus, … • used mainly for compact SSDs and wireless Graphics Servers − also GPS, NFC, bluetooth, … • multiple apps cannot all drive the graphics card − the graphics hardware needs to be shared Part 7.3: Graphics and GPUs − one option is a graphics server • provides an IPC-based drawing and/or windowing API Graphics Cards • performs painting on behalf of the applications • initially just a device to drive displays Compositors • reads pixels from memory and provides display signal − basically a DAC with a clock • a more direct way to share graphics cards − the memory can be part of the graphics card • each application gets its own buffer to paint into • evolved acceleration capabilities • painting is mostly done by a (context-switched) GPU • the individual buffers are then composed onto screen Graphics Accelerator − composition is also hardware-accelerated • allows common operations to be done in hardware • like drawing lines or illed polygons GP-GPU • the pixels are computed directly in video RAM • general-purpose GPU (CUDA, OpenCL, …) • this can save considerable CPU time • used for computation instead of just graphics • basically a return of vector processors 3D Graphics • close to CPUs but not part of normal OS scheduling • rendering 3D scenes is computationally intensive • CPU-based, software-only rendering is possible Part 7.4: Persistent Storage − texture-less in early light simulators − bitmap textures since ’95 / ’96 (Descent, Quake) Drivers • CAD workstation had 3D accelerators (OpenGL ’92) • split into adapter, bus and device drivers • often a single driver per device type GPU (Graphical Processing Unit) − at least for disk drives and CD-ROMs • a term coined by nVidia near the end of ’90s • bus enumeration and coniguration • data addressing and data transfers • also allows CD-ROM, tapes, scanners (!)

IDE / ATA SCSI Drivers • Integrated Drive Electronics • split into a host bus adapter (HBA) driver − disk controller becomes part of the disk • a generic SCSI bus and command component − standardised as ATA-1 (AT Attachment …) − often re-used in both ATAPI and USB storage • based on the ISA bus, but with cables • and per-device or per-class drivers • later adapted for non-disk use via ATAPI − optical drives, tapes, CD/DVD-ROM − standard disk and SSD drives ATA Enumeration • each ATA interface can attach only 2 drives iSCSI − the drives are HW-conigured as master/slave • basically SCSI over TCP/IP − this makes enumeration quite simple • entirely software-based • multiple ATA interfaces were standard • allows standard computers to serve as block storage • no need for speciic HDD drivers • takes advantage of fast cheap ethernet • re-uses most of the SCSI driver stack PIO vs DMA • original IDE could only use programmed IO NVMe: Non-Volatile Memory Express • this eventually became a serious bottleneck • a fairly simple protocol for PCIe-attached storage • later ATA revisions include DMA modes • optimised for SSD-based devices − up to 160MB/s with highest DMA modes − much bigger and more command queues than AHCI − compare 1900MB/s for SATA 3.2 − better / faster interrupt handling • stresses concurrency in the kernel block layer SATA • serial, point-to-point replacement for ATA USB Mass Storage • hardware-level incompatible to (parallel) ATA • an USB device class (vendor-neutral protocol) − but SATA inherited the ATA command set − one driver for the entire class − legacy mode allows PATA drivers to talk to SATA dri- • typically USB lash drives, but also external disks ves • USB 2 is not suitable for high-speed storage • hot-swap capable – replace drives in a running system − USB 3 introduced UAS = USB-Attached SCSI

AHCI (Advanced Host Controller Interface)
• vendor-neutral interface to SATA controllers
− in theory only a single ‘AHCI’ driver is needed
• an alternative to ‘legacy mode’
• NCQ = Native Command Queuing
− allows the drive to re-order requests
− another layer of IO scheduling

Tape Drives
• unlike disk drives, only allow sequential access
• needs support for media ejection, rewinding
• can be attached with SCSI, SATA, USB
• parts of the driver will be bus-neutral
• mainly for data backup, capacities 6-15TB

The PATA-compatible mode hides most features. Optical Drives • mainly used as a read-only distribution medium ATA and SATA Drivers • laser-facilitated reading of a rotating disc • the host controller (adapter) is mostly vendor-neutral • can be again attached to SCSI, SATA or USB • the bus driver will expose the ATA command set • conceived for audio playback very slow seek − including support for command queuing • device driver uses the bus driver to talk to devices Optical Disk Writers (Burners) • partially re-uses SCSI drivers for ATAPI &c. • behaves more like a printer for optical disks • drivers are often done in user space SCSI (Small Computer System Interface) • attached by one of the standard disk busses • originated with minicomputers in the 80’s • special programs required to burn disks • more complicated and capable than ATA − alternative: packet-writing drivers − ATAPI basically encapsulates SCSI over ATA • device enumeration, including aggregates Part 7.5: Networking and Wireless − e.g. entire enclosures with many drives Networking • a very complex protocol (relative to hardware standards) • networks allow multiple computers to exchange data − assisted by irmware running on the adapter − this could be iles, streams or messages Bluetooth • there are wired and wireless networks • we will only deal with the lowest layers for now • a wireless alternative to USB • NIC = Network Interface Card • allows short-distance radio links with peripherals − input (keyboard, mice, game controllers) Ethernet − audio (headsets, speakers) • speciies the physical media − data transmission (e.g. smartphone sync) • on-wire format and collision resolution − gadgets (watches, heartrate monitoring, GPS, …) • in modern setups, mostly point-to-point links − using active packet switching devices Part 8: Network Stack • transmits data in frames (low-level packets) Lecture Overview Addressing 1. Networking Intro • at this level, only local addressing 2. The TCP/IP Stack − at most a single LAN segment 3. Using Networks • uses baked-in MAC addresses 4. Network File Systems − MAC = Media Access Control • addresses belong to interfaces, not computers Part 8.1: Networking Intro

Transmit Queue
• packets are picked up from memory
• the OS prepares packets into the transmit queue
• the device picks them up asynchronously
• similar to how SATA queues commands and data

Host and Domain Names
• hostname = human readable computer name
• hierarchical system, big-endian: www.fi.muni.cz
• FQDN = fully-qualified domain name
• the local suffix may be omitted (ping aisa)

Receive Queue Network Addresses • data is also queued in the other direction • address = machine-friendly and numeric • the NIC copies packets into a receive queue • IPv4 address: 4 octets (bytes): 192.168.1.1 • it invokes an interrupt to tell the OS about new items • IPv6 address: 16 octets − the NIC may batch multiple packets per interrupt • Ethernet (MAC): 6 octets, c8:5b:76:bd:6e:0b • if the queue is not cleared quickly packet loss Network Types Multi-Queue Adapters • LAN = Local Area Network • fast adapters can saturate a CPU − Ethernet: wired, up to 10Gb/s − e.g. 10GbE cards, or multi-port GbE − WiFi (802.11): wireless, up to 1Gb/s • these NICs can manage multiple RX and TX queues • WAN = Wide Area Network (the Internet) − each queue gets its own interrupt − PSTN, xDSL, PPPoE − different queues can be handled by different CPU − GSM, 2G (GPRS, EDGE), 3G (UMTS), 4G (LTE) cores − also LAN technologies – Ethernet, WiFi Checksum and TCP Ofloading Networking Layers • more advanced adapters can ofload certain features 2. Link (Ethernet, WiFi) • commonly computation of mandatory packet checksums 3. Network (IP) • but also TCP-related features 4. Transport (TCP, UDP, …) • this needs both driver support and TCP/IP stack sup- port 7. Application (HTTP, SMTP, …) WiFi Networking and Operating Systems • wireless network interface – “wireless ethernet” • a network stack is a standard part of an OS • shared medium – electromagnetic waves in air • large part of the stack lives in the kernel • (almost) mandatory encryption − although this only applies to monolithic kernels − otherwise easy to eavesdrop or even actively attack − microkernels use user-space networking • another chunk is in system libraries & utilities − DNS for hostname vs IP address mapping − ARP for IP vs MAC address mapping Kernel-Side Networking • device drivers for networking hardware ARP (Address Resolution Protocol) • network and transport protocol layers • inds the MAC that corresponds to an IP • routing and packet iltering (irewalls) • required to allow packet delivery • networking-related system calls (sockets) − IP uses the link layer to deliver its packets • network ile systems (SMB, NFS) − the link layer must be given a MAC address • the OS builds a map of IP MAC translations System Libraries • the socket and related APIs Ethernet • host name resolution (a DNS client) • link-level communication protocol • encryption and data authentication (SSL, TLS) • largely implemented in hardware • certiicate handling and validation • the OS uses a well-deined interface − packed receive and submit System Utilities − using MAC addresses (ARP is part of the OS) • network coniguration (ifconfig) • diagnostics (ping, traceroute) Packet Switching • packet logging and inspection (tcpdump) • shared media are ineficient due to collisions • route management (route, bgpd) • ethernet is typically packet switched − a switch is usually a hardware device Networking Aspects − but also in software (usually for virtualisation) • packet format − physical connections form a star topology − what are the units of communication • addressing Bridging − how are the sender and recipient named • bridges operate at the link layer (layer 2) • packet delivery • a bridge is a two-port device − how a message is delivered − each port is connected to a different LAN − the bridge joins the LANs by forwarding frames Protocol Nesting • can be done in hardware or software • protocols run on top of each other − brctl on Linux, ifconfig on OpenBSD • this is why it is called 
a network stack • higher levels make use of the lower levels Tunneling − HTTP uses abstractions provided by TCP • tunnels are virtual layer 2 or 3 devices − TCP uses abstractions provided by IP • they encapsulate trafic using a higher-level protocol Packet Nesting • tunneling is used to implement Virtual Private Networks − a software bridge can operate over an UDP tunnel • higher-level packets are just data to the lower level − the tunnel is usually encrypted • an Ethernet frame can carry an IP packet in it • the IP packet can carry a TCP packet PPP (Point-to-Point Protocol) • the TCP packet can carry an HTTP request • a link-layer protocol for 2-node networks Stacked Delivery • available over many physical connections − phone lines, cellular connections, DSL, Ethernet • delivery is, in the abstract, point-to-point − often used to connect endpoints to the ISP − routing is mostly hidden from upper layers • supported by most operating systems − the upper layer requests delivery to an address − split between the kernel and system utilities • lower-layer protocols are usually packet-oriented − packet size mismatches can cause fragmentation Wireless • a packet can pass through different low-level domains • WiFi is mostly like (slow, unreliable) Ethernet Layers vs Addressing • needs encryption since anyone can listen • also authentication to prevent rogue connections • not as straightforward as packet nesting − PSK (pre-shared key), EAP / 802.11x − address relationships are tricky • encryption needs key management • special protocols exist to translate addresses Part 8.2: The TCP/IP Stack • these numbers are used to re-assemble the stream − IP packets can arrive out of order IP (Internet Protocol) • they are also used to acknowledge reception − and subsequently to manage re-transmission • uses 4 byte (v4) or 16 byte (v6) addresses − split into network and host parts Packet Loss and Re-transmission • it is a packet-based protocol • is a best-effort protocol • packets can get lost for a variety of reasons − packets may get lost, reordered or corrupted − a link goes down for an extended period of time − buffer overruns on routing equipment IP Networks • TCP sends acknowledgments for received packets • IP networks roughly correspond to LANs − the ACKs use sequence numbers to identify packets − hosts on the same network are located with ARP UDP: User (Unreliable) Datagram Protocol − remote networks are reached via routers • a netmask splits the address into network/host parts • TCP comes with non-trivial overhead • IP typically runs on top of Ethernet or PPP − and its guarantees are not always required • UDP is a much simpler protocol Routing − a very thin wrapper around IP • routers forward packets between networks − with minimal overhead on top of IP • somewhat like bridges but layer 3 • routers act as normal LAN endpoints Name Resolution − but represent entire remote IP networks • users do not want to remember numeric addresses − or even the entire Internet − phone numbers are bad enough • host names are used instead Services and TCP/UDP Port Numbers • can be stored in a ile, e.g. 
/etc/hosts • networks are generally used to provide services − not very practical for more than 3 computers − each computer can host multiple − but there are millions of computers on the Internet • different services can run on different ports • port is a 16-bit number and some ar given names DNS: Domain Name Service − port 25 is SMTP, port 80 is HTTP, … • hierarchical protocol for name resolution − runs on top of TCP or UDP ICMP: Internet Control Message Protocol • domain names are split into parts using dots • control messages (packets) − each domain knows whom to ask for the next bit − destination host/network unreachable − the name database is effectively distributed − time to live exceeded − fragmentation required DNS Recursion • diagnostic packets, e.g. the ping command • take www.fi.muni.cz. as an example domain − echo request and echo reply • resolution starts from the right at root servers − combine with TTL for traceroute − the root servers refer us to the cz. servers − the cz. servers refer us to muni.cz TCP: Transmission Control Protocol − inally muni.cz. tells us about fi.muni.cz • a stream-oriented protocol on top of IP • works like a pipe (transfers a byte sequence) DNS Recursion Example − must respect delivery order − and also re-transmit lost packets $ dig www.fi.muni.cz. A +trace • must establish connections . IN NS j.root-servers.net. cz. IN NS b.ns.nic.cz. TCP Connections muni.cz. IN NS ns.muni.cz. • the endpoints must establish a connection irst fi.muni.cz. IN NS aisa.fi.muni.cz. • each connection serves as a separate data stream www.fi.muni.cz. IN A 147.251.48.1 • a connection is bidirectional • TCP uses a 3-way handshake: SYN, SYN/ACK, ACK DNS Record Types • A is for (IP) Address Sequence Numbers • AAAA is for an IPv6 Address • TCP packets carry sequence numbers • CNAME is for an alias • MX is for mail servers − individual connections are established with accept() • and many more • or into a client using connect()
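To make the connect() side of this concrete, here is a minimal sketch of a TCP client in C. The address (127.0.0.1) and port (8080) are made-up placeholders; a real client would usually resolve a host name first (see the resolver sketch further on). The three-way handshake happens inside connect(); once it returns, the descriptor behaves like an ordinary file.

    #include <netinet/in.h>  /* sockaddr_in */
    #include <arpa/inet.h>   /* inet_pton, htons */
    #include <sys/socket.h>  /* socket, connect */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>      /* read, write, close */

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);    /* stream socket = TCP */
        if (fd == -1)
            return 1;

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_port = htons(8080);                 /* port in network byte order */
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

        if (connect(fd, (struct sockaddr *)&addr, sizeof addr) == -1)
            return 1;                                /* SYN, SYN/ACK, ACK happen here */

        write(fd, "hello\n", 6);                     /* sockets behave like files */
        char buf[256];
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0)
            fwrite(buf, 1, n, stdout);
        close(fd);
        return 0;
    }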

Firewalls Resolver API • the name comes from building construction • libc contains a resolver − a ire-proof barrier between parts of a building − available as gethostbyname (and gethostbyname2) • the idea is to separate networks from each other − also gethostbyaddr for reverse lookups − making attacks harder from the outside • can look in many different places − limiting damage in case of compromise − most systems support at least /etc/hosts − and DNS-based lookups Packet Filtering • packet iltering is how irewalls are usually implemented Network Services • can be done on a router or at an endpoint • servers listen on a socket for incoming connections • dedicated routers + packet ilters are more secure − a client actively establishes a connection to a server − a single such irewall protects the entire network • the network simply transfers data between them − less opportunity for mis-coniguration • interpretation of the data is a layer 7 issue − could be commands, ile transfers, … Packet Filter Operation • packet ilters operate on a set of rules Network Service Examples − the rules are generally operator-provided • (secure) remote shell – sshd • each incoming packet is classiied using the rules • the internet email suite • and then dispatched accordingly − MTA = Mail Transfer Agent, speaks SMTP − may be forwarded, dropped, rejected or edited − SMTP = Simple Mail-Transfer Protocol • the world wide web Packet Flter Examples − web servers provide content (iles) • packet ilters are often part of the kernel − clients and servers speak HTTP and HTTPS • the rule parser is a system utility − it loads rules from a coniguration ile Client Software − and sets up the kernel-side ilter • the ssh command talks uses the SSH protocol • there are multiple implementations − a very useful system utility on virtually all UNIXes − iptables, nftables in Linux • web browser is the client for world wide web − pf in OpenBSD, ipfw in FreeBSD − browsers are complex application programs − some of them bigger than even operating systems Part 8.3: Using Networks • email client is also known as a MUA (Mail User Agent) Sockets Reminder Part 8.4: Network File Systems • the socket API comes from early BSD Unix • socket represents a (possible) network connection Why Network Filesystems? • you get a ile descriptor for an open socket • copying iles back and forth is impractical • you can read() and write() to sockets − and also error-prone (which is the latest version?) − but also sendmsg() and recvmsg() • how about storing data in a central location • and sharing it with all the computers on the LAN Socket Types • sockets can be internet or unix domain NAS (Network-Attached Storage) − internet sockets work across networks • a (small) computer dedicated to storing iles • stream sockets are like iles • usually running a cut down operating system − you can write a continuous stream of data − often based on Linux or FreeBSD − usually implemented using TCP • provides ile access to the network • datagram sockets send individual messages • sometimes additional app-level services − usually implemented using UDP − e.g. photo management, media streaming, …
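The resolver API mentioned above can be sketched as follows: the program looks up a host name with gethostbyname and prints the first IPv4 address found. The host name is only an example; note also that gethostbyname is obsolescent and new code would normally use getaddrinfo, which handles IPv6 as well.

    #include <netdb.h>       /* gethostbyname, struct hostent */
    #include <netinet/in.h>  /* INET_ADDRSTRLEN */
    #include <arpa/inet.h>   /* inet_ntop */
    #include <stdio.h>

    int main(void)
    {
        /* the resolver consults /etc/hosts, DNS, … depending on system configuration */
        struct hostent *he = gethostbyname("www.fi.muni.cz");
        if (!he || he->h_addrtype != AF_INET)
            return 1;

        char buf[INET_ADDRSTRLEN];
        /* h_addr_list is an array of addresses in network byte order */
        inet_ntop(AF_INET, he->h_addr_list[0], buf, sizeof buf);
        printf("%s\n", buf);
        return 0;
    }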

Creating Sockets NFS (Network File System) • a socket is created using the socket() function • the traditional UNIX networked ilesystem • it can be turned into a server using listen() • hooked quite deep into the kernel − assumes generally reliable network (LAN) Lecture Overview • ilesystems are exported for use over NFS 1. Command Interpreters • the client side mounts the NFS-exported volume 2. The Command Line 3. Graphical Interfaces NFS History • originated in Sun Microsystems in the 80s Part 9.1: Command Interpreters • v2 implemented in System V, DOS, … • v3 appeared in ’95 and is still in use Shell • v4 arrives in 2000, improving security • programming language centered on OS interaction VFS Reminder • rudimentary control low • untyped, text-centered variables • implementation mechanism for multiple FS types • dubious error handling • an object-oriented approach − open: look up the ile for access Interactive Shells − read, write – self-explanatory • almost all shells have an interactive mode − rename: rename a ile or directory • the user inputs a single statement on keyboard RPC (Remote Procedure Call) • when conirmed, it is immediately executed • this forms the basis of command-line interfaces • any protocol for calling functions on remote hosts − ONC-RPC = Open Network Computing RPC Shell Scripts − NFS is based on ONC-RPC (also known as Sun RPC) • a shell script is an (executable) ile • NFS basically runs VFS operations using RPC • in simplest form, it is a sequence of commands − this makes it easy to implement on UNIX-like sys- − each command goes on a separate line tems − executing a script is about the same as typing it • but can use structured programming constructs Port Mapper • ONC-RPC is executed over TCP or UDP Shell Upsides − but it is more dynamic wrt. available services • very easy to write simple scripts • TCP/UDP port numbers are assigned on demand • irst choice for simple automation • portmap translates from RPC services to port numbers • often useful to save repetitive typing − the port mapper itself listens on port 111 • deinitely not good for big programs
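Returning to socket creation at the start of this part, a hedged sketch of the server side of the API – socket(), bind(), listen() and accept() – using a made-up port number. Each accepted connection yields a fresh file descriptor, which the server then uses like an ordinary file (here it just echoes the data back).

    #include <netinet/in.h>  /* sockaddr_in, htons, htonl, INADDR_ANY */
    #include <sys/socket.h>  /* socket, bind, listen, accept */
    #include <string.h>
    #include <unistd.h>      /* read, write, close */

    int main(void)
    {
        int srv = socket(AF_INET, SOCK_STREAM, 0);

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);  /* listen on all interfaces */
        addr.sin_port = htons(8080);

        if (bind(srv, (struct sockaddr *)&addr, sizeof addr) == -1)
            return 1;
        listen(srv, 8);                            /* srv is now a listening socket */

        for (;;)
        {
            int client = accept(srv, NULL, NULL);  /* one descriptor per connection */
            if (client == -1)
                continue;
            char buf[256];
            ssize_t n = read(client, buf, sizeof buf);
            if (n > 0)
                write(client, buf, n);             /* echo the data back */
            close(client);
        }
    }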

The NFS Daemon Bourne Shell • also known as nfsd • a speciic language in the “shell” family • provides NFS access to a local ile system • the irst shell with consistent programming support • can run as a system service − available since 1976 • or it can be part of the kernel • still widely used today − this is more typical for performance reasons − best known implementation is bash − /bin/sh is mandated by POSIX SMB (Server Message Block) • a network ile system from Microsoft The name bash stands for Bourne Again Shell (bad puns should • available in Windows since version 3.1 (1992) be a human right). − originally ran on top of NetBIOS C Shell − later versions used TCP/IP • SMB1 accumulated a lot of cruft and complexity • also known as csh, irst released in 1978 • more C-like syntax than sh (Bourne Shell) SMB 2.0 − but not really very C-like at all • simpler than SMB1 due to fewer retroits and compat • improved interactive mode (over sh from ’76) • better performance and security • also still used today (tcsh) • support for symbolic links Korn Shell • available since (2006) • also known as ksh, released in 1983 Part 9: Shells & User Interfaces • middle ground between sh and csh • basis of the POSIX.2 requirements • a number of implementations exists in effect looking for two separate iles. The latter, ls "$foo"", will be executed with argv[1] = "hello world"’. Commands • typically a name of an executable Special Variables − may also be control low or a built-in • $? is the result of last command • the executable is looked up in the ilesystem • $$ is the PID of the current shell • the shell doas a fork + exec • $1 through $9 are positional parameters − this means new process for each command − $# is the number of parameters − process creation is fairly expensive • $0 is the name of the shell (argv[0])
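The fork + exec sequence described under Commands can be sketched in a few lines of C; this is roughly what a shell does for every external command, ignoring job control, redirection and error reporting. The exit status collected by waitpid() is what the shell later exposes as $?.

    #include <sys/wait.h>  /* waitpid, WIFEXITED, WEXITSTATUS */
    #include <unistd.h>    /* fork, execvp */
    #include <stdio.h>

    int main(void)
    {
        char *argv[] = { "ls", "-l", NULL };   /* the command line, already split */

        pid_t pid = fork();                    /* create a copy of the shell process */
        if (pid == 0)
        {
            execvp(argv[0], argv);             /* replace the child with the command */
            perror("execvp");                  /* only reached if exec failed */
            _exit(127);
        }

        int status;
        waitpid(pid, &status, 0);              /* the parent waits; $? comes from here */
        return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
    }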

Built-in Commands Environment • cd change the working directory • is like shell variables but not the same • export for setting up environment • the environment is passed to all executed programs • echo print a message − but a child cannot modify environment of its parent • exec replace the shell process (no fork) • variables are moved into the environment by export • environment variables often act as settings Variables Important Environment Variables • variable names are made of letters and digits • $PATH tells the system where to ind programs • using variables is indicated with $ • $HOME is the home directory of the current user • setting variables does not use the $ • all variables are global (except subshells) • $EDITOR and $VISUAL set which text editor to use • $EMAIL is the email address of the current user VARIABLE="some text" • $PWD is the current working directory echo $VARIABLE Globbing Variable Substitution • patterns for quickly listing multiple iles • variables are substituted as text • e.g. ls *.c shows all iles ending in .c • $foo is simply replaced with the content of foo • * matches any number of characters • arithmetic is not well supported in most shells • ? matches one arbitrary character − or any expression syntax, e.g. relational operators • works on entire paths (ls src/*/*.c) − consider z=$(($x + $y)) for addition in bash Conditionals Command Substitution • allows conditional execution of commands • basically like variable substitution • if cond; then cmd1; else cmd2; fi • written as `command`or $(command) • also elif cond2; then cmd3; fi − irst executes the command • cond is also a command (the exit code is used) − and captures its standard output − then replaces $(command) with the output test (evaluating boolean expressions) • originally an external program, also known as [ Quoting − nowadays built-in in most shells • whitespace is an argument separator in shell − works around lack of expressions in shell • multi-word arguments must be quoted • evaluates its arguments and returns true or false • quotes can be double quotes "" or single '' − can be used with if and while constructs − double quotes allow variable substitution test Examples Quoting and Substitution • test file1 -nt file2 ‘nt’ = newer than • whitespace from variable substitution must be quoted • test 32 -gt 14 ‘gt’ = greater than − `foo=“hello world”`` • test foo = bar string equality − ls $foo is different than ls "$foo" • combines with variable substitution (test $y = x) • bad quoting is a very common source of bugs • consider also ilenames with spaces in them Loops • while cond; do cmd; done The irst command, ls $foo will expand into ls hello world − cond is a command, like in if and executed with argv[1] = "hello" and argv[2] = "world", • for i in 1 2 3 4; do cmd; done − allows globs: for f in *.c; do cmd; done • usually shows at least your username and the hostname − also command substitution • also: working directory, battery status, time, weather, − for f in `seq 1 10`; do cmd; done …
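How exported variables reach child processes (see Environment above) can be illustrated with a short C sketch: the parent places a variable into its environment with setenv() – roughly what export does in a shell – and the executed program, printenv here purely as an example, sees a copy of it. Nothing the child does can flow back to the parent.

    #include <stdlib.h>    /* setenv */
    #include <unistd.h>    /* fork, execlp */
    #include <sys/wait.h>  /* wait */

    int main(void)
    {
        /* roughly what `export GREETING="hello world"` does in a shell */
        setenv("GREETING", "hello world", 1);

        if (fork() == 0)
        {
            /* the environment is copied into the new program's image */
            execlp("printenv", "printenv", "GREETING", (char *)NULL);
            _exit(127);
        }
        wait(NULL);
        return 0;
    }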

Case Analysis
• selects a command based on pattern matching
• case $x in *.c) cc $x;; *) ls $x;; esac
− yes, case really uses unbalanced parens
− the ;; indicates end of a case

Job Control
• only one program can run in the foreground (terminal)
• but a running program can be suspended (C-z)
• and resumed in background (bg) or in foreground (fg)
• use & to run a command in background: ./spambot &

Command Chaining
• ; (semicolon): run two commands in sequence
• && run the second command if the first succeeded
• || run the second command if the first failed
• e.g. compile and run: cc file.c && ./a.out

Pipes
• shells can run pipelines of commands
• cmd1 | cmd2 | cmd3
− all commands are run in parallel
− output of cmd1 becomes input of cmd2
− output of cmd2 is processed by cmd3

echo hello world | sed -e s,hello,goodbye,

Functions
• you can also define functions in shell
• mostly a light-weight alternative to scripts
− no need to export variables
− but cannot be invoked by non-shell programs
• functions can also set variables

Recall that the environment is only passed down, never back up. This means that a shell script setting a variable will not affect the parent shell. In functions (and when scripts are invoked using .), variables can be set as normal.

Part 9.2: The Command Line

Interactive Shell
• the shell displays a prompt and waits
• the user types in a command and hits enter
• the command is executed immediately
• output is printed to the terminal

Terminal
• can print text and read text from a keyboard
• normally everything is printed on the last line
• the text could contain escape (control) sequences
− for printing colourful text or clearing the screen
− also for printing text at a specific coordinate

Older text scrolls upwards: this is the mode used in normal shell usage. This scrollback behaviour is automatic in the terminal. Full-screen terminal applications (which use coordinate-based printing) will not use the capability.

Terminal (emulator) is typically a program these days, but used to be a dedicated piece of hardware.

Full-Screen Terminal Apps
• applications can use the entire terminal screen
• a library abstracts away the low-level control sequences
− the library is called ncurses for new curses
− different terminals use different control sequences
• special characters exist to draw frames and separators

UNIX Text Editors
• sed – stream editor, non-interactive
• ed – line oriented, interactive
• vi – visual, screen oriented
• ex – line-oriented mode of vi

TUI: Text User Interface
• the program draws a 2D interface on a terminal
• these types of interfaces can be quite comfortable
• they are often easier to program than GUIs
• very low bandwidth requirements for remote use

Part 9.3: Graphical Interfaces

Command Completion Windowing Systems • most shells let you use TAB to auto-complete • each application runs in its own window − works at least for command names and ile names − or possibly multiple windows − but “smart completion” is common • multiple applications can be shown on screen • interactive history: hit “up” to recall a command • windows can be moved around, resized &c. − also interactive history search, e.g. C-r in bash − facilitated by frames around window content − generally known as window management Prompt • the string printed when shell expects a command Window-less Systems • controlled by the PS1 environment variable • especially popular on smaller screens • applications take the entire screen X11 (X Window System) − give or take status or control widgets • a traditional UNIX windowing system • task switching via a dedicated screen • provides a C API (xlib) • built-in network transparency (socket-based) A GUI Stack • core protocol version 11 from 1987 • graphics card driver, mode setting • drawing/painting (usually hardware-accelerated) X11 Architecture • multiplexing (e.g. using windows) • X server provides graphics and input • widgets: buttons, labels, lists, … • X client is an application that uses X • layout: what goes where on the screen • a window manager is a (special) client • a compositor is another special client Well-known GUI Stacks • Windows Remote Displays • macOS, iOS • application is running on computer A • X11 • the display is not the console of A • Wayland − could be a dedicated graphical terminal • Android − could be another computer on a LAN − or even across the internet Portability • GUI “toolkits” make portability easy Remote Display Protocols − Qt, GTK, Swing, HTML5+CSS, … • one approach is pushing pixels − many of them run on all major platforms − VNC (Virtual Network Computing) • code portability is not the only issue • X11 uses a custom drawing protocol − GUIs come with look and feel guidelines • others use high-level abstractions − portable applications may fail to it − NeWS (PostScript-based) − HTML5 + JavaScript Text Rendering • a surprisingly complex task VNC (Virtual Network Computing) • unlike terminals, GUIs use variable pitch fonts • sends compressed pixel data over the wire − brings up issues like kerning − can leverage regularities in pixel data − hard to predict pixel width of a line − can send incremental updates • bad interaction with printing (cf. WYSIWIG) • and input events in the other direction • no support for peripherals or ile sync Bitmap Fonts Basically the only virtue of VNC is simplicity. Security is an after- • characters are represented as pixel arrays thought and not super-compatible across implementations. It is − usually just black and white mainly designed for low-bandwidth, high-latency networks (i.e. the • traditionally pixel-drawn by hand Internet). − very time consuming (many letters, sizes, variants) • the result is sharp but jagged (not smooth) RDP (Remote Desktop Protocol) • more sophisticated than VNC (but proprietary) Outline Fonts • can also send drawing commands over the wire • Type1, TrueType – based on splines − like X11, but using DirectX drawing • they can be scaled to arbitrary pixel sizes − also allows remote OpenGL • same font can be used for screen and for print • support for audio, remote USB &c. 
• rasterisation is usually done in software

Hinting, Anti-Aliasing
• screens are low resolution devices
− typical HD displays have DPI around 100
− laser printers have DPI of 300 or more
• hinting: deform outlines to better fit a pixel grid
• anti-aliasing: smooth outlines using grayscale

RDP is primarily based on the pixel-pushing paradigm, but there is a number of extensions that allow sending high-level rendering commands for local, hardware-accelerated processing. In some setups, this includes remote accelerated OpenGL and/or Direct3D.

SPICE (Simple Protocol for Indep. Computing Env.)
• open protocol somewhere between VNC and RDP
• can send OpenGL (but only over a local socket)
• two-way audio, USB, clipboard integration
• still mainly based on pushing (compressed) pixels

Remote Desktop Security
• the user needs to be authenticated over network
− passwords are easy, biometric data less so
• the data stream should be encrypted
− not part of the X11 or NeWS protocols
− or even HTTP by default (used for HTML5/JS)

Access Control Models
• owners usually decide who can access their objects
− this is known as discretionary access control
• in high-security environments, this is not allowed
− known as mandatory access control
− a central authority decides the policy

(Virtual) System Users

For instance, RDP in Windows 10 does not support ingerprint • users are an useful ownership abstraction logins (it was supported on earlier versions, but was disabled due • various system services get their own “fake” users to security laws). • this allows them to own iles and processes • and also limit their access to the rest of the OS Part 10: Access Control Principle of Least Privilege Lecture Overview • entities should have minimum privilege required − applies to software components 1. Multi-User Systems − but also to human users of the system 2. File Systems • this limits the scope of mistakes 3. Sub-user Granularity − and also of security compromises Part 10.1: Multi-User Systems Privilege Separation Users • different parts of a system need different privilege • least privilege dictates splitting the system • originally a proxy for people − components are isolated from each other • currently a more general abstraction − they are given only the rights they need • user is the unit of ownership • components communicate using the simplest feasible • many permissions are user-centered IPC Computer Sharing Process Separation • computer is a (often costly) resource • recall that each process runs in its own address space • eficiency of use is a concern − but shared memory can be requested − a single user rarely exploits a computer fully • each user has a view of the ilesystem • data sharing makes access control a necessity − a lot more is shared by default in the ilesystem − especially the namespace (directory hierarchy) Ownership • various objects in an OS can be owned Access Control Policy − primarily iles and processes • there are 3 pieces of information • the owner is typically whoever created the object − the subject (user) − ownership can be transferred − the verb (what is to be done) − usually at the impetus of the original owner − the object (the ile or other resource) • there are many ways to encode this information Process Ownership • each process belongs to some user Access Rights Subjects • the process acts on behalf of the user • in a typical OS those are (possibly virtual) users − the process gets the same privilege as its owner − sub-user units are possible (e.g. programs) − this both constrains and empowers the process − roles and groups could also be subjects • processes are active participants • the subject must be named (names, identiiers) − easy on a single system, hard in a network File Ownership • each ile also belongs to some user Access Rights Verbs • this gives rights to the user (or rather their processes) • the available “verbs” (actions) depend on object type − they can read and write the ile • a typical object would be a ile − they can change permissions or ownership − iles can be read, written, executed • iles are passive participants − directories can be searched or listed or changed • network connections can be established &c. • passwords are easiest, but not easy − encryption is needed to safely transmit passwords Access Rights Objects − along with computer authentication • anything that can be manipulated by programs • 2-factor authentication is a popular improvement − although not everything is subject to access control • could be iles, directories, sockets, shared memory, … Computer Authentication • object names depend on their type • how to ensure we send the password to the right party? 
− ile paths, i-node numbers, IP addresses, … − an attacker could impersonate our remote computer • usually via asymmetric cryptography Subjects in POSIX − a private key can be used to sign messages • there are 2 types of subjects: users and groups − the server will sign a message establishing its iden- • each user can belong to multiple groups tity • users are split into normal users and root − root is also known as the super-user 2-factor Authentication • 2 different types of authentication User Management − harder to spoof both at the same time • the system needs a database of users • there are a few factors to pick from • in a network, user identities often need to be shared − something the user knows (password) • could be as simple as a text ile − something the user has (keys) − /etc/passwd and /etc/group on UNIX systems − what the user is (biometric) • or as complex as a distributed database Enforcement: Hardware LDAP and Active Directory are popular choices for centralised • all enforcement begins with the hardware network-level user databases. − the CPU provides a privileged mode for the kernel − DMA memory and IO instructions are protected User and Group Identiiers • the MMU allows the kernel to isolate processes • users and groups are represented as numbers − and protect its own integrity − this improves eficiency of many operations − the numbers are called uid and gid Enforcement: Kernel • those numbers are valid on a single computer • kernel uses hardware facilities to implement security − or at most, a local network − it stands between resources and processes − access is mediated through system calls Changing Identities • ile systems are part of the kernel • each process belongs to a particular user • user and group abstractions are part of the kernel • ownership is inherited across fork() • super-user processes can use setuid() Enforcement: System Calls • exec() can sometimes change a process owner • the kernel acts as an arbitrator • a process is trapped in its own address space Login • processes use system calls to access resources • a super-user process manages user logins − kernel can decide what to allow • the user types their name and provides credentials − based on its access control model and policy − upon successful authentication, login calls fork() − the child calls setuid() to the user Enforcement: Service APIs − and uses exec() to start a shell for the user • userland processes can enforce access control − usually system services which provide IPC API User Authentication • e.g. via the getpeereid() system call • the user needs to authenticate themselves − tells the caller which user is connected to a socket • passwords are the most commonly used method − user-level access control is rooted in kernel facilities − the system needs to know the right password − user should be able to change their password Part 10.2: File Systems • biometric methods are also quite popular File Access Rights Remote Login • ile systems are a case study in access control • authentication over network is more complicated • all modern ile systems maintain permissions − the only extant exception is FAT (USB sticks) • often used for granting extra privileges • different systems adopt different representation − e.g. the mount command runs as the super-user
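A simplified sketch of the login sequence described above: a root-owned process forks, the child drops privileges with setgid()/setuid() and then execs the user's shell. The uid/gid (1000) and the shell path are placeholders; a real login reads them from the user database and authenticates the user first.

    #include <sys/wait.h>  /* waitpid */
    #include <unistd.h>    /* fork, setuid, setgid, execl */

    int main(void)
    {
        pid_t pid = fork();
        if (pid == 0)
        {
            /* order matters: drop the group first, then the user id */
            if (setgid(1000) == -1 || setuid(1000) == -1)
                _exit(1);                           /* never continue as root here */
            execl("/bin/sh", "-sh", (char *)NULL);  /* leading '-' marks a login shell */
            _exit(127);
        }
        waitpid(pid, NULL, 0);                      /* login waits for the session to end */
        return 0;
    }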

Representation
• file systems are usually object-centric
− permissions are attached to individual objects
− easily answers “who can access this file”?
• there is a fixed set of verbs
− those may be different for files and directories
− different systems allow different verbs

Sticky Directories
• file creation and deletion is a directory permission
− this is problematic for shared directories
− in particular the system /tmp directory
• in a sticky directory, different rules apply
− new files can be created as usual
− only the owner can unlink a file from the directory
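The sticky bit can be inspected programmatically; a small sketch using stat() checks whether /tmp has S_ISVTX set (on most systems it does, the usual mode being 1777).

    #include <sys/stat.h>  /* stat, S_ISVTX */
    #include <stdio.h>

    int main(void)
    {
        struct stat st;
        if (stat("/tmp", &st) == -1)
            return 1;

        /* the permission bits live in st_mode next to the file type */
        printf("/tmp mode: %o, sticky: %s\n",
               st.st_mode & 07777,
               (st.st_mode & S_ISVTX) ? "yes" : "no");
        return 0;
    }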

The UNIX Model Access Control Lists • each ile and directory has a single owner • ACL is a list of ACE’s (access control elements) • plus a single owning group − each ACE is a subject + verb pair − not limited to those the owner belongs to − it can name an arbitrary user • ownership and permissions are attached to i-nodes • ACL is attached to an object (ile, directory) • more lexible than the traditional UNIX system Access vs Ownership • POSIX ties ownership and access rights ACLs and POSIX • only 3 subjects can be named on a ile • part of POSIX.1e (security extensions) − the owner (user) • most POSIX systems implement ACLs − the owning group − this does not supersede UNIX permission bits − anyone else − instead, they are interpreted as part of the ACL • ile system support is not universal (but widespread) Access Verbs in POSIX File Systems • read: read a ile, list a directory Device Files • write: write a ile, link/unlink i-nodes to a directory • UNIX represents devices as special i-nodes • execute: exec a program, enter the directory − this makes them subject to normal access control • execute as owner (group): setuid/setgid • the particular device is described in the i-node − only a super-user can create device nodes Permission Bits − users could otherwise gain access to any device • basic UNIX permissions can be encoded in 9 bits • 3 bits per 3 subject designations Sockets and Pipes − irst comes the owner, then group, then others • named sockets and pipes are just i-nodes − written as e.g. rwxr-x--- or 0750 − also subject to standard ile permissions • plus two numbers for the owner/group identiiers • especially useful with sockets Changing File Ownership − a service sets up a named socket in the ile system − ile permissions decide who can talk to the service • the owner and root can change ile owners • chown and chgrp system utilities Special Attributes • or via the C API • lags that allow additional restrictions on ile use − chown(), fchown(), fchownat(), lchown() − e.g. immutable iles (cannot be changed by anyone) − same set for chgrp − append-only iles (for logile integrity protection) Changing File Permissions − compression, copy-on-write controls • non-standard (Linux chattr, BSD chflags) • again available to the owner and to root • chmod is the user space utility Network File System − either numeric argument: chmod 644 file.txt • NFS 3.0 simply transmits numeric uid and gid − or symbolic: chmod +x script.sh − the numbering needs to be synchronised • and the corresponding system call (numeric-only) − can be done via a central user database setuid and setgid • NFS 4.0 uses per-user authentication − the user authenticates to the server directly • special permissions on executable iles − ilesystem uid and gid values are mapped • they allow exec to also change the process owner File System Quotas Mandatory Access Control • storage space is limited, shared by users • delegates permission control to a central authority − iles take up storage space • often coupled with security labels − ile ownership is also a liability − classiies subjects (users, processes) • quotas set up limits space use by users − and also objects (iles, sockets, programs) − exhausted quota can lead to denial of access • the owner cannot change object permissions
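The permission and ownership calls named above have straightforward C counterparts; a sketch follows. The file name and the uid/gid values are placeholders, and since changing owners is restricted, the chown call is expected to fail for ordinary users.

    #include <sys/stat.h>  /* chmod */
    #include <unistd.h>    /* chown */
    #include <stdio.h>

    int main(void)
    {
        const char *path = "script.sh";        /* placeholder file name */

        /* 0750 = rwxr-x---: owner may do anything, group may read and execute */
        if (chmod(path, 0750) == -1)
            perror("chmod");

        /* give the file to uid 1000 / gid 1000 – normally only root may do this */
        if (chown(path, 1000, 1000) == -1)
            perror("chown");

        return 0;
    }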

Security labels are a generalisation of user groups.

Removable Media
• access control at file system level makes no sense
− other computers may choose to ignore permissions
− user names or id’s would not make sense anyway
• option 1: encryption (for denying reads)
• option 2: hardware-level controls
− usually read-only vs read-write on the entire medium

Capabilities
• not all verbs (actions) need to take objects
• e.g. shutting down the computer (there is only one)
• mounting file systems (they can’t be always named)
• listening on ports with number less than 1024

Dismantling the root User
• the traditional root user is all-powerful
− “all or nothing” is often unsatisfactory
− violates the principle of least privilege
• many special properties of root are capabilities
− root then becomes the user with all capabilities
− other users can get selective privileges

The chroot System Call
• each process in UNIX has its own root directory
− for most, this coincides with the system root
• the root directory can be changed using chroot()
• can be useful to limit file system access
− e.g. in privilege separation scenarios

Uses of chroot
• chroot alone is not a security mechanism
− a super-user process can get out easily
− but not easy for a normal user process
• also useful for diagnostic purposes
• and as lightweight alternative to virtualisation

Security and Execution
• security hinges on what is allowed to execute
• arbitrary code execution bugs are the worst exploits
− this allows unauthorized execution of code
− same effect as impersonating the user
− almost as bad as stolen credentials
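A privilege-separation sketch combining the two ideas above: a root-owned process first confines its file system view with chroot() and then drops to an unprivileged uid, so that a later code-execution exploit has much less to work with. The directory (/var/empty) and the uid (65534, conventionally “nobody”) are only illustrative.

    #include <unistd.h>  /* chroot, chdir, setuid, setgid */
    #include <stdio.h>

    int main(void)
    {
        /* confine the file system view first (requires root) */
        if (chroot("/var/empty") == -1 || chdir("/") == -1)
        {
            perror("chroot");
            return 1;
        }

        /* then drop privileges – after this, escaping the chroot is hard */
        if (setgid(65534) == -1 || setuid(65534) == -1)
            return 1;

        /* ... continue processing untrusted input with reduced privileges ... */
        return 0;
    }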

Part 10.3: Sub-User Granularity Untrusted Input • programs often process data from dubious sources Users are Not Enough − think image viewers, audio & video players • users are not always the right abstraction − archive extraction, font rendering, … − creating users is relatively expensive • bugs in programs can be exploited − only a super-user can create new users − the program can be tricked into executing data • you may want to include programs as subjects − or rather, the combination user + program Process as a Subject • some privileges can be tied to a particular process Naming Programs − those only apply during the lifetime of the process • users have user names, but how about programs? − often restrictions rather than privileges • option 1: cryptographic signatures − this is how privilege dropping is done − portable across computers but complex • processes are identiied using their numeric pid − establishes identity based on the program itself − restrictions are inherited across fork() • option 2: i-node of the executable − simple, local, identity based on location Sandboxing • tries to limit damage from code execution exploits Program as a Subject • the program drops all privileges it can • program: passive (ile) vs active (processes) − this is done before it touches any of the input − only a process can be a subject − the attacker is stuck with the reduced privileges − but program identity is attached to the ile − this can often prevent a successful attack • rights of a process depend on its program Untrusted Code − exec() will change privileges • traditionally, you would only execute trusted code − usually based on reputation or other external fac- − usually need kernel support tors − this does not scale to a large number of vendors Type 1 (Bare Metal) • it is common to execute untrusted, even dubious code • IBM z/VM − this can be okay with suficient sandboxing • (Citrix) Xen • Microsoft Hyper-V API-Level Access Control • VMWare ESX • capability system for user-level resources − things like contact lists, calendars, bookmarks Type 2 (Hosted) − objects not provided directly by the kernel • VMWare (Workstation, Player) • enforcement e.g. via a virtual machine • Oracle VirtualBox − not applicable to execution of native code • Linux KVM − alternative: an IPC-based API • FreeBSD bhyve • OpenBSD vmm Android/iOS Permissions • applications from a store are semi-trusted History • typically single-user computers/devices • started with mainframe computers • permissions are attached to apps instead of users • IBM CP/CMS: 1968 • partially virtual users, partially API-level • IBM VM/370: 1972 • IBM z/VM: 2000 Part 11: Virtualisation & Containers Desktop Virtualisation Lecture Overview • x86 hardware lacks virtual supervisor mode 1. Hypervisors • software-only solutions viable since late 90s 2. Containers − Bochs: 1994 3. 
Management − VMWare Workstation: 1999 − QEMU: 2003 Part 11.1: Hypervisors Paravirtualisation What is a Hypervisor • introduced as VMI in 2005 by VMWare • also known as a Virtual Machine Monitor • alternative approach in Xen in 2006 • allows execution of multiple operating systems • relies on modiication of the guest OS • like a kernel that runs kernels • near-native speed without HW support • improves hardware utilisation The Virtual x86 Revolution Motivation • 2005: virtualisation extensions on x86 • OS-level sharing is tricky • 2008: MMU virtualisation − user isolation is often insuficient • unmodiied guest at near-native speed − only root can install software • most software-only solutions became obsolete • the hypervisor/OS interface is simple − compared to OS-application interfaces Paravirtual Devices Virtualisation in General • special drivers for virtualised devices − block storage, network, console • many resources are “virtualised” − random number generator − physical memory by the MMU • faster than software emulation − peripherals by the OS − orthogonal to CPU/MMU virtualisation • makes resource management easier • enables isolation of components Virtual Computers Hypervisor Types • usually known as Virtual Machines • everything in the computer is virtual • type 1: bare metal − either via hardware (VT-x, EPT) − standalone, microkernel-like − or software (QEMU, virtio, …) • type 2: hosted • much easier to manage than actual hardware − runs on top of normal OS Essential Resources PCI Passthrough • the CPU and RAM • an anti-virtualisation technology • persistent (block) storage • based on an IO-MMU (VT-D, AMD-Vi) • network connection • a virtual OS can touch real hardware • a console device − only one OS at a time, of course

CPU Sharing
• same principle as normal processes
• there is a scheduler in the hypervisor
− simpler, with different trade-offs
• privileged instructions are trapped

GPUs and Virtualisation
• can be assigned (via VT-d) to a single OS
• or time-shared using native drivers (GVT-g)
• paravirtualised
• shared by other means (X11, SPICE, RDP)

RAM Sharing Peripherals • very similar to standard paging • useful either via passthrough • software (shadow paging) − audio, webcams, … • or hardware (second-level translation) • or standard sharing technology • ixed amount of RAM for each VM − network printers & scanners − networked audio servers Shadow Page Tables Peripheral Passthrough • the guest system cannot access the MMU • set up shadow table, invisible to the guest • virtual PCI, USB or SATA bus • guest page tables are sync’d to the sPT by VMM • forwarding to a real device • the gPT can be made read-only to cause traps − e.g. a single USB stick − or a single SATA drive The trap can then synchronise the gPT with the sPT, which are translated versions of each other. The ‘physical’ addresses stored Suspend & Resume in the gPT are virtual addresses of the hypervisor. The sPT stores • the VM can be quite easily stopped real physical addresses, since it is used by the real MMU. • the RAM of a stopped VM can be copied − e.g. to a ile in the host ilesystem Second-Level Translation − along with registers and other state • hardware-assisted MMU virtualisation • and also later loaded and resumed • adds guest-physical to host-physical layer • greatly simpliies the VMM Migration Basics • also much faster than shadow page tables • the stored state can be sent over network • and resumed on a different host Network Sharing • as long as the virtual environment is same • usually a paravirtualised NIC • this is known as paused migration − transports frames between guest and host − usually connected to a SW bridge in the host Live Migration − alternatives: routing, NAT • uses asynchronous memory snapshots • a single physical NIC is used by everyone • host copies pages and marks them read-only • the snapshot is sent as it is constructed Virtual Block Devices • changed pages are sent at the end • usually also paravirtualised • often backed by normal iles Live Migration Handoff − maybe in a special format • the VM is then paused − e.g. based on copy-on-write • registers and last few pages are sent • but can be a real block device • the VM is resumed at the remote end • usually within a few milliseconds Special Resources • mainly useful in desktop systems Memory Ballooning • GPU / graphics hardware • how to deallocate “physical” memory? • audio equipment − i. e. return it to the hypervisor • printers, scanners, … • this is often desirable in virtualisation • needs a special host/guest interface Namespaces • visibility compartments in the Linux kernel Part 11.2: Containers • virtualizes common resources − the ilesystem hierarchy (including mounts) What are Containers? − process tables • OS-level virtualisation − networking (IP address) − e.g. virtualised network stack − or restricted ile system access cgroups • not a complete virtual computer • controls resource allocation in Linux • turbocharged processes • a CPU group is a fair scheduling unit Why Containers • a memory group sets limits on memory use • mostly orthogonal to namespaces • virtual machines take a while to boot • each VM needs its own kernel LXC − this adds up if you need many VMs • mainline Linux way to do containers • easier to share memory eficiently • based on namespaces and cgroups • easier to cut down the OS image • relative newcomer (2008, 7 years after vserver) Kernel Sharing • feature set similar to VServer, OpenVZ &c. 
• multiple containers share a single kernel User-Mode Linux • but not user tables, process tables, … • halfway between a container and a virtual machine • the kernel must explicitly support this • an early fully paravirtualised system • another level of isolation (process, user, container) • a Linux kernel runs as a process on another Linux Boot Time • integrated in Linux 2.6 in 2003 • a light virtual machine takes a second or two DragonFlyBSD Virtual Kernels • a container can take under 50ms • but VMs can be suspended and resumed • very similar to User-Mode Linux • but dormant VMs take up a lot more space • part of DFlyBSD since 2007 • uses standard libc, unlike UML chroot • paravirtual ethernet, storage and console • the mother of all container systems User Mode Kernels • not very sophisticated or secure • but allows multiple OS images under 1 kernel • easier to retroit securely • everything else is shared − uses existing security mechanisms − for the host, mostly a standard process chroot-based Containers • the kernel needs to be ported though • process tables, network, etc. are shared − analogous to a new hardware platform • the superuser must also be shared Migration • containers have their own view of the ilesystem − including system libraries and utilities • not widely supported, unlike in hypervisors • process state is much harder to serialise BSD Jails − ile descriptors, network connections &c. • an evolution of the chroot container • somewhat mitigated by fast shutdown/boot time • adds user and process table separation • and a virtualised network stack Part 11.3: Management − each jail can get its own IP address • root in the jail has limited power Disk Images • disk image is the embodiment of the VM Linux VServer • the virtual OS needs to be installed • like BSD jails but on Linux • the image can be a simple ile − FreeBSD jail 2000, VServer 2001 • or a dedicated block device on the host • not part of the mainline kernel • jailed root user is partially isolated Snapshots • making a copy of the image = snapshot • blocks coming from a common ancestor are shared • can be done more eficiently: copy on write − but updating images means loss of sharing • alternative to OS installation − make copies of the freshly installed image Container vs VM Updates − and run updates after cloning the image • de-duplication may be easier in containers − shared ile system – e.g. 
link farming Duplication • kernel updates: containers and type 2 hypervisors • each image will have a copy of the system − can be mitigated by live migration • copy-on-write snapshots can help • type 1 hypervisors need less downtime − most of the base system will not change − regression as images are updated separately Docker • block-level de-duplication is expensive • automated container image management • mainly a service deployment tool File Systems • containers share a single Linux kernel • disk images contain entire ile systems − the kernel itself can run in a VM • the virtual disk is of (apparently) ixed size • rides on a wave of bundling resurgence • sparse images: unwritten area is not stored • initially only ilesystem metadata is allocated The Cloud Overcommit • public virtualisation infrastructure • “someone else’s computer” • the host can allocate more resources than it has • the guests are not secure against the host • this works as long as not many VMs reach limits − entire memory is exposed, including secret keys • enabled by sparse images and CoW snapshots − host compromise is fatal • also applies to available RAM • the host is mostly secure from the guests Thin Provisioning Part 12: Review • the act of obtaining resources on demand • the host system can be extended as needed What is an OS made of? − to keep pace with growing guest demands • the kernel • alternatively, VMs can be migrated out • system libraries • improves resource utilisation • system daemons / services Coniguration • user interface • system utilities • each OS has its own coniguration iles • same methods apply as for physical networks Basically every OS has those. − software coniguration management • bundled services are deployed to VMs The Kernel • lowest level of an operating system Bundling vs Sharing • executes in privileged mode • bundling makes deployment easier • manages all the other software • the bundled components have known behaviour − including other OS components • but updates are much trickier • enforces isolation and security • this also prevents resource sharing • provides low-level services to programs
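Those low-level services are reached through system calls, usually via thin libc wrappers. A Linux-specific sketch: the same write to standard output, once through the libc wrapper and once through the raw syscall() interface (the syscall numbers and the syscall() function are Linux conventions, not portable POSIX).

    #define _GNU_SOURCE
    #include <sys/syscall.h>  /* SYS_write */
    #include <unistd.h>       /* write, syscall */

    int main(void)
    {
        /* the usual way: the libc wrapper sets up arguments and traps into the kernel */
        write(1, "via libc\n", 9);

        /* the same kernel service invoked directly by number */
        syscall(SYS_write, 1, "via syscall()\n", 14);
        return 0;
    }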

Security System Libraries • hypervisors have a decent track record • form a layer above the OS kernel − security here means protection of host from guest • provide higher-level services − breaking out is still possible sometimes − use kernel services behind the scenes • containers are more of a mixed bag − easier to use than the kernel interface − many hooks are needed into the kernel • typical example: libc − provides C functions like printf Updates − also known as msvcrt on Windows • each system needs to be updated separately − this also applies to containers Programming Interfaces • kernel system call interface • software running in privileged mode can do ~anything • system libraries / APIs − most importantly it can program the MMU • inter-process protocols − the kernel runs in this mode • command-line utilities (scripting) Memory Management Unit (System) Libraries • is a subsystem of the processor • mainly C functions and data types • takes care of address translation • interfaces deined in header iles − user software uses virtual addresses • deinitions provided in libraries − the MMU translates them to physical addresses − static libraries (archives): libc.a • the mappings can be managed by the OS kernel − shared (dynamic) libraries: libc.so • on Windows: msvcrt.lib and msvcrt.dll What does a Kernel Do? • there are (many) more besides libc / msvcrt • memory & process management • task (thread) scheduling Shared (Dynamic) Libraries • device drivers • required for running programs − SSDs, GPUs, USB, bluetooth, HID, audio, … • linking is done at execution time • ile systems • less code duplication • networking • can be upgraded separately • but: dependency problems Kernel Architecture Types Why is Everything a File • monolithic kernels (Linux, *BSD) • microkernels (Mach, L4, QNX, NT, …) • re-use the comprehensive ile system API • hybrid kernels (macOS) • re-use existing ile-based command-line tools • type 1 hypervisors (Xen) • bugs are bad simplicity is good • exokernels, rump kernels • want to print? cat file.txt > /dev/ulpt0 − (reality is a little more complex) System Call Sequence What is a Filesystem? • irst, libc prepares the system call arguments • and puts the system call number in the correct register • a set of iles and directories • then the CPU is switched into privileged mode • usually lives on a single block device • this also transfers control to the syscall handler − but may also be virtual • directories and iles form a tree What is an i-node? 
What is an i-node?
• an anonymous, file-like object
• could be a regular file
− or a directory
− or a special file
− or a symlink

File Descriptors
• the kernel keeps a table of open files
• the file descriptor is an index into this table
• you do everything using file descriptors
• non-Unix systems have similar concepts

Regular files
• these contain sequential data (bytes)
• may have inner structure but the OS does not care
• there is metadata attached to files
− like when were they last modified
− who can and who cannot access the file
• you read() and write() files

Disk-Like Devices
• disk drives provide block-level access
• read and write data in 512-byte chunks
− or also 4K on big modern drives
• a big numbered array of blocks

I/O Scheduler (Elevator)
• reads and writes are requested by users
• access ordering is crucial on a mechanical drive
− not as important on an SSD
− but sequential access is still much preferred
• requests are queued (recall, disks are slow)
− but they are not processed in FIFO order

Filesystem as Resource Sharing
• usually only 1 or few disks per computer
• many programs want to store persistent data
• file system allocates space for the data
− which blocks belong to which file
• different programs can write to different files
− no risk of trying to use the same block

Filesystem as Abstraction
• allows the data to be organised into files
• enables the user to manage and review data
• files have arbitrary & dynamic size
− blocks are transparently allocated & recycled
• structured data instead of a flat block array

Fragmentation
• internal – not all blocks are fully used
− files are of variable size, blocks are fixed
− a 4100 byte file needs two 4 KiB blocks
• external – free space is non-contiguous
− happens when many files try to grow at once
− this means new files are also fragmented

Hard Links
• multiple names can refer to the same i-node
− names are given by directory entries
− we call such multiple-named files hard links
− it’s usually forbidden to hard-link directories
• hard links cannot cross device boundaries
− i-node numbers are only unique within a filesystem

Memory-mapped IO
• uses virtual memory (cf. last lecture)
• treat a file as if it was swap space
• the file is mapped into process memory
− page faults indicate that data needs to be read
− dirty pages cause writes
• available as the mmap system call

Privileged CPU Mode
• many operations are restricted in user mode
− this is how user programs are executed
− also most of the operating system
• software running in privileged mode can do ~anything
− most importantly it can program the MMU
− the kernel runs in this mode

Process vs Executable
• process is a dynamic entity
• executable is a static file
• an executable contains an initial memory image
− this sets up memory layout
− and content of the text and data segments

What is a Thread?
• thread is a sequence of instructions
• different threads run different instructions
− as opposed to SIMD or many-core units (GPUs)
• each thread has its own stack
• multiple threads can share an address space

Fork
• how do we create new processes?
• by fork-ing existing processes
• fork creates an identical copy of a process
• execution continues in both processes
− each of them gets a different return value

Exec
• on UNIX, processes are created via fork
• how do we run programs though?
• exec: load a new executable into a process
− this completely overwrites process memory
− execution starts from the entry point
• running programs: fork + exec
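The fork and exec items above are easiest to see in code. The following POSIX sketch (error handling mostly omitted) creates a child process and replaces its memory image with /bin/echo, while the parent waits for it to finish; the program and the message are arbitrary examples.

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();          /* both processes continue from here */

        if (pid == 0)                /* the child gets 0 from fork */
        {
            /* replace the process image; on success, execl never returns */
            execl("/bin/echo", "echo", "hello from the child", (char *) 0);
            _exit(1);                /* only reached if the exec failed */
        }

        /* the parent gets the child's pid from fork */
        waitpid(pid, NULL, 0);
        printf("child %d has finished\n", (int) pid);
        return 0;
    }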
What is a Scheduler?
• scheduler has two related tasks
− plan when to run which thread
− actually switch threads and processes
• usually part of the kernel
− even in micro-kernel operating systems

Process Resources
• memory (address space)
• processor time
• open files (descriptors)
− also working directory
− also network connections

Process Memory
• each process has its own address space
• this means processes are isolated from each other
• requires that the CPU has an MMU
• implemented via paging (page tables)

Process Switching
• switching processes means switching page tables
• physical addresses do not change
• but the mapping of virtual addresses does
• large part of physical memory is not mapped
− could be completely unallocated (unused)
− or belong to other processes

Interrupt
• a way for hardware to request attention
• CPU mechanism to divert execution
• partial (CPU state only) context switch
• switch to privileged (kernel) CPU mode

Timer Interrupt
• generated by the PIT or the local APIC
• the OS can set the frequency
• a hardware interrupt happens on each tick
• this creates an opportunity for bookkeeping
• and for preemptive scheduling

Drivers and Microkernels
• drivers are excluded from microkernels
• but the driver still needs hardware access
− this could be a special memory region
− it may need to react to interrupts
• in principle, everything can be done indirectly
− but this may be quite expensive, too

Interrupt-driven IO
• peripherals are much slower than the CPU
− polling the device is expensive
• the peripheral can signal data availability
− and also readiness to accept more data
• this frees up CPU to do other work in the meantime

Memory-mapped IO
• devices share address space with memory
• more common in contemporary systems
• IO uses the same instructions as memory access
− load and store on RISC, mov on x86
• allows selective user-level access (via the MMU)

Direct Memory Access
• allows the device to directly read/write memory
• this is a huge improvement over programmed IO
• interrupts only indicate buffer full/empty
• the device can read and write arbitrary physical memory
− opens up security / reliability problems

GPU Drivers
• split into a number of components
• graphics output / frame buffer access
• memory management is often done in kernel
• geometry, textures &c. are prepared in-process
• front end API: OpenGL, Direct3D, Vulkan, …

What is Concurrency?
• events that can happen at the same time
• it is not important if it does, only that it can
• events can be given a happens-before partial order
• they are concurrent if unordered by happens-before

Why Concurrency?
• problem decomposition
− different tasks can be largely independent
• reflecting external concurrency
− serving multiple clients at once
• performance and hardware limitations
− higher throughput on multicore computers

Race Condition: Definition
• (anomalous) behaviour that depends on timing
• typically among multiple threads or processes
• an unexpected sequence of events happens
• recall that ordering is not guaranteed

Critical Section
• any section of code that must not be interrupted
• the statement x = x + 1 could be a critical section
• what is a critical section is domain-dependent
− another example could be a bank transaction
− or an insertion of an element into a linked list

Mutual Exclusion
• only one thread can access a resource at once
• ensured by a mutual exclusion device (a.k.a. mutex)
• a mutex has 2 operations: lock and unlock
• lock may need to wait until another thread unlocks

Deadlock Conditions
1. mutual exclusion
2. hold and wait condition
3. non-preemptability
4. circular wait

Deadlock is only possible if all 4 are present.
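The following POSIX-threads sketch (build with a -pthread flag) shows the x = x + 1 critical section from above protected by a mutex. Without the lock/unlock pair around the update, the two threads would race and some increments could be lost; the iteration count is an arbitrary example.

    #include <pthread.h>
    #include <stdio.h>

    static int x = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        for (int i = 0; i < 100000; ++i)
        {
            pthread_mutex_lock(&lock);   /* wait until nobody else holds it */
            x = x + 1;                   /* the critical section */
            pthread_mutex_unlock(&lock);
        }
        return arg;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("%d\n", x);               /* always 200000 with the mutex */
        return 0;
    }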
Starvation
• starvation happens when a process can’t make progress
• generalisation of both deadlock and livelock
• for instance, unfair scheduling on a busy system
• also recall the readers and writers problem

What is a Driver?
• piece of software that talks to a device
• usually quite specific / unportable
− tied to the particular device
− and also to the operating system
• often part of the kernel

Storage Drivers
• split into adapter, bus and device drivers
• often a single driver per device type
− at least for disk drives and CD-ROMs
• bus enumeration and configuration
• data addressing and data transfers

Networking Layers
2. Link (Ethernet, WiFi)
3. Network (IP)
4. Transport (TCP, UDP, …)
7. Application (HTTP, SMTP, …)

Networking and Operating Systems
• a network stack is a standard part of an OS
• large part of the stack lives in the kernel
− although this only applies to monolithic kernels
− microkernels use user-space networking
• another chunk is in system libraries & utilities

Shell Scripts
• a shell script is an (executable) file
• in simplest form, it is a sequence of commands
− each command goes on a separate line
− executing a script is about the same as typing it
• but can use structured programming constructs

Kernel-Side Networking
• device drivers for networking hardware
• network and transport protocol layers
• routing and packet filtering (firewalls)
• networking-related system calls (sockets)
• network file systems (SMB, NFS)
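The socket system calls mentioned above are the user-space entry point into the kernel network stack, and the descriptor they return behaves like any other file descriptor. A minimal sketch (IPv4, error handling omitted) that hands one UDP datagram to the kernel follows; the address 127.0.0.1 and port 9999 are arbitrary choices for illustration.

    #include <arpa/inet.h>   /* htons, inet_pton */
    #include <netinet/in.h>  /* struct sockaddr_in */
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        /* a socket is just another file descriptor, created by a system call */
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        struct sockaddr_in to;
        memset(&to, 0, sizeof to);
        to.sin_family = AF_INET;
        to.sin_port = htons(9999);                 /* example port only */
        inet_pton(AF_INET, "127.0.0.1", &to.sin_addr);

        /* the kernel wraps the data in UDP and IP headers and sends it off */
        sendto(fd, "hello\n", 6, 0, (struct sockaddr *) &to, sizeof to);

        close(fd);
        return 0;
    }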

IP (Internet Protocol)
• uses 4 byte (v4) or 16 byte (v6) addresses
− split into network and host parts
• it is a packet-based protocol
• is a best-effort protocol
− packets may get lost, reordered or corrupted

Terminal
• can print text and read text from a keyboard
• normally everything is printed on the last line
• the text could contain escape (control) sequences
− for printing colourful text or clearing the screen
− also for printing text at a specific coordinate
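A small sketch of the escape sequences mentioned above, using the common ANSI sequences (understood by essentially all current terminal emulators, though the exact set of supported codes varies): one clears the screen, one moves the cursor to a specific coordinate, one switches colours.

    #include <stdio.h>

    int main(void)
    {
        printf("\033[2J");                    /* clear the screen */
        printf("\033[5;10H");                 /* move cursor to row 5, column 10 */
        printf("\033[31mred text\033[0m\n");  /* switch to red, then reset */
        return 0;
    }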

A GUI Stack
• graphics card driver, mode setting
• drawing/painting (usually hardware-accelerated)
• multiplexing (e.g. using windows)
• widgets: buttons, labels, lists, …
• layout: what goes where on the screen

TCP: Transmission Control Protocol
• a stream-oriented protocol on top of IP
• works like a pipe (transfers a byte sequence)
− must respect delivery order
− and also re-transmit lost packets
• must establish connections
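To illustrate the pipe-like nature of a TCP connection, here is a minimal client sketch (POSIX sockets, IPv4, all error handling omitted). The host example.com and port 80 are placeholders; once connect() has established the connection, the socket is read and written like any other byte stream.

    #include <netdb.h>       /* getaddrinfo */
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        struct addrinfo hints, *res;
        memset(&hints, 0, sizeof hints);
        hints.ai_family = AF_INET;
        hints.ai_socktype = SOCK_STREAM;          /* a TCP (stream) socket */

        /* resolve the name, then connect -- this performs the handshake */
        getaddrinfo("example.com", "80", &hints, &res);
        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        connect(fd, res->ai_addr, res->ai_addrlen);

        /* once connected, the socket behaves like a bidirectional pipe */
        const char req[] = "GET / HTTP/1.0\r\nHost: example.com\r\n\r\n";
        write(fd, req, sizeof req - 1);

        char buf[512];
        read(fd, buf, sizeof buf);                /* first chunk of the reply */

        freeaddrinfo(res);
        close(fd);
        return 0;
    }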

UDP: User (Unreliable) Datagram Protocol
• TCP comes with non-trivial overhead
− and its guarantees are not always required
• UDP is a much simpler protocol
− a very thin wrapper around IP
− with minimal overhead on top of IP

DNS: Domain Name Service
• hierarchical protocol for name resolution
− runs on top of TCP or UDP
• domain names are split into parts using dots
− each domain knows whom to ask for the next bit
− the name database is effectively distributed

NFS (Network File System)
• the traditional UNIX networked filesystem
• hooked quite deep into the kernel
− assumes generally reliable network (LAN)
• filesystems are exported for use over NFS
• the client side mounts the NFS-exported volume

X11 (X Window System)
• a traditional UNIX windowing system
• provides a C API (xlib)
• built-in network transparency (socket-based)
• core protocol version 11 from 1987

Shell
• programming language centered on OS interaction
• rudimentary control flow
• untyped, text-centered variables
• dubious error handling

Interactive Shells
• almost all shells have an interactive mode
• the user inputs a single statement on keyboard
• when confirmed, it is immediately executed
• this forms the basis of command-line interfaces

Users
• originally a proxy for people
• currently a more general abstraction
• user is the unit of ownership
• many permissions are user-centered

User Management
• the system needs a database of users
• in a network, user identities often need to be shared
• could be as simple as a text file
− /etc/passwd and /etc/group on UNIX systems
• or as complex as a distributed database

User Authentication
• the user needs to authenticate themselves
• passwords are the most commonly used method
− the system needs to know the right password
− user should be able to change their password
• biometric methods are also quite popular

Ownership
• various objects in an OS can be owned
− primarily files and processes
• the owner is typically whoever created the object
− ownership can be transferred
− usually at the impetus of the original owner
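The ownership and user-database items above fit together in a few lines of C: the kernel stores the owner of a file as a numeric user id in the i-node, and the user database (on a classic UNIX system, /etc/passwd) maps that number back to a name. The sketch below asks about /etc/passwd itself, but any existing path would do.

    #include <pwd.h>        /* getpwuid -- user database lookup */
    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
        struct stat st;
        if (stat("/etc/passwd", &st) != 0)   /* example path only */
            return 1;

        /* st_uid is the numeric owner; the passwd entry gives the name */
        struct passwd *pw = getpwuid(st.st_uid);
        printf("owned by uid %d (%s)\n", (int) st.st_uid,
               pw ? pw->pw_name : "unknown");
        return 0;
    }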

Access Control Policy
• there are 3 pieces of information
− the subject (user)
− the verb (what is to be done)
− the object (the file or other resource)
• there are many ways to encode this information

Sandboxing
• tries to limit damage from code execution exploits
• the program drops all privileges it can
− this is done before it touches any of the input
− the attacker is stuck with the reduced privileges
− this can often prevent a successful attack

A minimal privilege-dropping sketch is shown below, after the exam information.

The End

Actually…
• a 2-part, written final exam
• test: 9/10 required
− pool of 44 questions (in the slides)
• free-form text
− one of the 11 lecture topics
− 1 page A4: be concise but comprehensive
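To make the sandboxing item above a little more concrete, this is the classic privilege-dropping pattern for a program that starts with elevated rights (for instance a set-uid root binary): it gives up the extra privileges before it reads any untrusted input. This is only a sketch of the idea; a real program must check every return value and may need to drop more (supplementary groups, capabilities, and so on).

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* ... do the one thing that actually needed the extra privileges ... */

        /* drop the group id first: once the user id is gone, the process
         * would no longer be allowed to change its group id */
        if (setgid(getgid()) != 0 || setuid(getuid()) != 0)
            return 1;

        /* from here on the process runs with the invoking user's rights
         * and can start processing untrusted input */
        puts("privileges dropped");
        return 0;
    }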

What is a Hypervisor
• also known as a Virtual Machine Monitor
• allows execution of multiple operating systems
• like a kernel that runs kernels
• isolation and resource sharing

Hypervisor Types
• type 1: bare metal
− standalone, microkernel-like
• type 2: hosted
− runs on top of normal OS
− usually need kernel support

Paravirtual Devices
• special drivers for virtualised devices
− block storage, network, console
− random number generator
• faster than software emulation
− orthogonal to CPU/MMU virtualisation

VM Suspend & Resume
• the VM can be quite easily stopped
• the RAM of a stopped VM can be copied
− e.g. to a file in the host filesystem
− along with registers and other state
• and also later loaded and resumed

What are Containers?
• OS-level virtualisation
− e.g. virtualised network stack
− or restricted file system access
• not a complete virtual computer
• turbocharged processes
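As a tiny illustration of OS-level virtualisation, the Linux-specific sketch below (it needs the appropriate privileges, e.g. root or a user namespace) unshares the UTS namespace, so that changing the hostname affects only this process and its children, not the rest of the system. The name "container" is an arbitrary example; real container runtimes combine many such namespaces with restricted filesystem views and resource limits.

    #define _GNU_SOURCE
    #include <sched.h>      /* unshare, CLONE_NEWUTS */
    #include <stdio.h>
    #include <unistd.h>     /* sethostname, gethostname */

    int main(void)
    {
        /* give this process its own copy of one kernel resource:
         * the hostname (the UTS namespace) */
        if (unshare(CLONE_NEWUTS) != 0)
        {
            perror("unshare");
            return 1;
        }

        /* visible only inside the new namespace */
        sethostname("container", 9);

        char buf[64];
        gethostname(buf, sizeof buf);
        printf("hostname here: %s\n", buf);
        return 0;
    }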

Bundling vs Sharing
• bundling makes deployment easier