Distributed Systems: Lecture 3

Box Leangsuksun SWECO Endowed Professor, Computer Science Louisiana Tech University [email protected]

CTO, PB Tech International Inc. [email protected]

Slides for Chapter 7: Operating System support

From Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design, Edition 5, © Addison-Wesley 2012

Outline

• Background
• OS support

Background

• With user/customer requirements
• System modeling concepts
  – Physical
  – Architecture
  – Fundamental (interaction/performance, failure and security)
• What-if analysis & design

4/1/14 Towards survivable architecture

Client-server architecture: processes & communication

[Figure: two client processes make invocations on a server process, which may in turn invoke another server; results flow back to the clients. Key: process, computer.]

Instructor’s Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012

Peer-to-peer architecture

Sockets and ports interaction model

[Figure: a client process uses a socket bound to any port at Internet address 138.37.94.248 to send a message to a server's socket bound to an agreed port at Internet address 138.37.88.249; the server's other ports remain available.]
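The interaction model above can be sketched in code. This is a minimal, hedged example, not the book's: a server thread binds an "agreed" port on the loopback interface, a client connects from an arbitrary port, sends a message, and the server echoes it back. The port number 9090 and the message text are arbitrary choices for the demo.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define AGREED_PORT 9090   /* the "agreed port" known to both sides */

static void *echo_server(void *arg) {
    (void)arg;
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(listener, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(AGREED_PORT);
    bind(listener, (struct sockaddr *)&addr, sizeof addr);
    listen(listener, 1);

    int conn = accept(listener, NULL, NULL);   /* serve one client */
    char buf[64];
    ssize_t n = read(conn, buf, sizeof buf);   /* receive message   */
    if (n > 0) write(conn, buf, (size_t)n);    /* echo it back      */
    close(conn);
    close(listener);
    return NULL;
}

/* Returns 0 if the echoed message matches what the client sent. */
int socket_demo(void) {
    pthread_t srv;
    pthread_create(&srv, NULL, echo_server, NULL);
    usleep(100000);                            /* let the server bind */

    int c = socket(AF_INET, SOCK_STREAM, 0);   /* client: "any port"  */
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(AGREED_PORT);
    connect(c, (struct sockaddr *)&addr, sizeof addr);

    const char *msg = "hello";
    write(c, msg, strlen(msg));
    char buf[64] = {0};
    read(c, buf, sizeof buf);
    close(c);
    pthread_join(srv, NULL);
    return strcmp(buf, msg);
}
```

Compile with `-pthread` and call `socket_demo()` from `main`; the short sleep stands in for real synchronization between server start-up and client connect.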

Figure 7.1 System layers

Applications, services

Middleware

OS: kernel, libraries & servers (OS1, OS2: processes, threads, communication, ...)

Platform: computer & network hardware (Node 1, Node 2)

OS Support

• Important roles of the system kernel/OS
• The advantages and disadvantages of splitting functionality between protection domains (kernel and user-level code)
• The relation between the operating system layer and the middleware layer
• How well the operating system meets the requirements of middleware:
  – Efficient and robust access to physical resources
  – The flexibility to implement a variety of resource-management policies

Instructor’s Guide for Coulouris, Dollimore and Kindberg, Distributed Systems: Concepts and Design Edn. 3 © Addison-Wesley Publishers 2000

OS Support

• The task of any operating system is to provide problem-oriented abstractions of the underlying physical resources (for example, sockets rather than raw network access):
  – processors
  – memory
  – communications
  – storage media
• The system call interface takes over the physical resources on a single node and manages them to present these resource abstractions

OS Support

• Network operating systems
  – They have a network capability built into them and so can be used to access remote resources. Access is network-transparent for some – not all – types of resource.
  – Multiple system images: the nodes running a network operating system retain autonomy in managing their own processing resources
• An operating system that produces a single system image for all the resources in a distributed system is called a distributed operating system

OS Support

• Under a network operating system, nodes retain autonomy in managing their own processing resources
• Single system image – an OS in which users are never concerned with where their programs run or with the location of any resources; the operating system has control over all the nodes in the system

Middleware and network operating systems

• There are no distributed operating systems in general use, only network operating systems
  – Users focus more on their applications, which often meet their current problem-solving needs
  – Users tend to prefer a degree of autonomy for their machines, even in a closely knit organization
• The combination of middleware and network operating systems provides an acceptable balance between the requirement for autonomy and network-transparent access to resources

System layers (Figure 7.1, repeated)

What are the core distinguishing features of OS support?

• Enabling distributed applications
• Network-enabled features

OS support for the challenges in distributed systems

• Heterogeneity
• Openness
• Security
• Scalability
• Fault handling
• Concurrency
• Transparency

Core OS functionality

[Figure: the supervisor handles dispatching of interrupts, system call traps and other exceptions. Above it sit the process manager (creation of and operations upon processes), the thread manager (thread creation, synchronization and scheduling), the communication manager (communication between threads attached to different processes on the same computer) and the memory manager (management of physical and virtual memory).]

Kernel and Protection

• A system program that always runs with complete access privileges for the physical resources on its host computer
• Executes in supervisor (privileged) mode; the kernel arranges that other processes execute in user (unprivileged) mode
• Sets up address spaces to protect itself and other processes and to provide processes with their required virtual memory layout
• A process can safely transfer from a user-level address space to the kernel's address space via an exception such as an interrupt or a system call trap

Processes and threads

• A thread is the operating system abstraction of an activity (the term derives from the phrase "thread of execution")
• An execution environment is the unit of resource management: a collection of local kernel-managed resources to which its threads have access
• An execution environment primarily consists of:
  – An address space
  – Thread synchronization and communication resources such as semaphores and communication interfaces
  – Higher-level resources such as open files and windows

Single and Multithreaded Processes: Benefits

• Responsiveness
• Resource sharing
• Utilization of multiprocessor & multicore architectures

A comparison of processes and threads

• Creating a new thread within an existing process is cheaper than creating a process.
• More importantly, switching to a different thread within the same process is cheaper than switching between threads belonging to different processes.
• Threads within a process may share data and other resources conveniently and efficiently compared with separate processes.
• But, by the same token, threads within a process are not protected from one another.

A comparison of processes and threads (2)

• The overheads associated with creating a process are in general considerably greater than those of creating a new thread.
  – A new execution environment must first be created, including address space tables
• The second performance advantage of threads concerns switching between threads – that is, running one thread instead of another within a given process

Types

• User-level thread

• Kernel-level thread

User Threads

• Thread management is done by a user-level threads library
• Three primary thread libraries:
  – POSIX Pthreads
  – Win32 threads
  – Java threads

Threading Issues

• Semantics of fork() and exec() system calls
• Thread cancellation
• Signal handling
• Thread pools
• Thread-specific data
• Scheduler activations

Semantics of fork() and exec()
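The fork() question on this slide can be made concrete. Under POSIX, fork() in a multithreaded program duplicates only the calling thread, and exec() replaces the whole address space. A minimal sketch (the exit code 42 is an arbitrary value used to check the round trip):

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with a known code; return that code as
   observed by the parent via waitpid(). */
int fork_demo(void) {
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: if the parent were multithreaded, only the calling
           thread exists here. A real program might call execvp()
           at this point to replace the address space. */
        _exit(42);
    }
    int status = 0;
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```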

• Does fork() duplicate only the calling thread or all threads?

Thread Cancellation
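Of the two cancellation approaches described on this slide, deferred cancellation can be sketched with Pthreads, where `pthread_testcancel()` is an explicit cancellation point. A hedged, minimal example:

```c
#include <pthread.h>
#include <unistd.h>

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        /* Deferred cancellation: the thread is cancelled only at
           cancellation points such as pthread_testcancel(). */
        pthread_testcancel();
        usleep(1000);   /* stand-in for a unit of real work */
    }
    return NULL;
}

/* Returns 1 if the worker terminated via cancellation. */
int cancel_demo(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    usleep(10000);                 /* let the worker start looping */
    pthread_cancel(t);             /* request cancellation          */
    void *res = NULL;
    pthread_join(t, &res);
    return res == PTHREAD_CANCELED;
}
```

Compile with `-pthread`; asynchronous cancellation would instead use `pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, ...)`.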

• Terminating a thread before it has finished
• Two general approaches:
  – Asynchronous cancellation terminates the target thread immediately
  – Deferred cancellation allows the target thread to periodically check whether it should be cancelled

Signal Handling

• Signals are used in UNIX systems to notify a process that a particular event has occurred
• A signal handler is used to process signals:
  1. Signal is generated by a particular event
  2. Signal is delivered to a process
  3. Signal is handled
• Options:
  – Deliver the signal to the thread to which the signal applies
  – Deliver the signal to every thread in the process
  – Deliver the signal to certain threads in the process
  – Assign a specific thread to receive all signals for the process

Thread Programming Paradigms

• On-demand – create a thread whenever you need one
  – Easy to program
  – More overhead
• Thread pool – create a pool of threads, then assign tasks to them
  – More efficient
  – Harder to program, because you must manage the threads in your code

Thread Pools
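The pool paradigm can be sketched with Pthreads. This is a deliberately minimal version (the pool size of 4 and the 100 trivial "tasks" are invented for the demo): workers pull task indices from a shared counter under a mutex and accumulate results.

```c
#include <pthread.h>

#define POOL_SIZE 4
#define NUM_TASKS 100

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int next_task = 0;
static long total = 0;

/* Each pool thread repeatedly takes the next task index and adds
   its "result" (here, just the index) into a shared total. */
static void *pool_worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        if (next_task >= NUM_TASKS) {       /* no work left */
            pthread_mutex_unlock(&lock);
            break;
        }
        int task = next_task++;
        pthread_mutex_unlock(&lock);

        long result = task;                 /* stand-in for real work */

        pthread_mutex_lock(&lock);
        total += result;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Run NUM_TASKS tasks on POOL_SIZE threads; returns the combined
   result (0 + 1 + ... + 99 = 4950). */
long pool_demo(void) {
    next_task = 0;
    total = 0;
    pthread_t threads[POOL_SIZE];
    for (int i = 0; i < POOL_SIZE; i++)
        pthread_create(&threads[i], NULL, pool_worker, NULL);
    for (int i = 0; i < POOL_SIZE; i++)
        pthread_join(threads[i], NULL);
    return total;
}
```

A production pool would block workers on a condition variable instead of exiting when the queue drains; this sketch shows only the shared-queue structure.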

• Create a number of threads in a pool, where they await work
• Advantages (over the thread-on-demand approach):
  – Usually slightly faster to service a request with an existing thread than to create a new thread
  – Allows the number of threads in the application(s) to be bounded by the size of the pool

Thread-Specific Data
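In Pthreads, thread-specific data is provided by keys. A minimal sketch (the integer values 1 and 2 are arbitrary): each thread stores its own value under the same key and later retrieves its own copy, without interference from other threads.

```c
#include <pthread.h>
#include <stdint.h>

static pthread_key_t key;
static pthread_once_t once = PTHREAD_ONCE_INIT;

static void make_key(void) { pthread_key_create(&key, NULL); }

/* Each thread stores its own value under the shared key, then reads
   it back; the value it sees is private to the thread. */
static void *tsd_worker(void *arg) {
    pthread_once(&once, make_key);
    pthread_setspecific(key, arg);       /* this thread's copy */
    /* ... arbitrary other code runs here ... */
    return pthread_getspecific(key);     /* still this thread's copy */
}

/* Returns 1 if both threads saw their own (distinct) values. */
int tsd_demo(void) {
    pthread_t a, b;
    void *ra, *rb;
    pthread_create(&a, NULL, tsd_worker, (void *)(intptr_t)1);
    pthread_create(&b, NULL, tsd_worker, (void *)(intptr_t)2);
    pthread_join(a, &ra);
    pthread_join(b, &rb);
    return (intptr_t)ra == 1 && (intptr_t)rb == 2;
}
```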

• Allows each thread to have its own copy of data
• Useful when you do not have control over the thread creation process (i.e., when using a thread pool)

Scheduler Activations

• Both the M:M and two-level models require communication to maintain the appropriate number of kernel threads allocated to the application
• Scheduler activations provide upcalls – a communication mechanism from the kernel to the thread library
• This communication allows an application to maintain the correct number of kernel threads

Address spaces

• Regions, separated by inaccessible areas of virtual memory
• Regions do not overlap
• Each region is specified by the following properties:
  – Its extent (lowest virtual address and size)
  – Read/write/execute permissions for the process's threads
  – Whether it can be grown upwards or downwards

Address space

[Figure: a process address space running from address 0 to 2^N; from bottom to top: text, heap (growing upwards), stack (growing downwards), and auxiliary regions.]

Address spaces (2)

• A mapped file is one that is accessed as an array of bytes in memory. The virtual memory system ensures that accesses made in memory are reflected in the underlying file storage
• A shared memory region is one that is backed by the same physical memory as one or more regions belonging to other address spaces
• Uses of shared regions include:
  – Libraries
  – Kernel
  – Data sharing and communication

Process Creation

• Supported by the operating system – for example, the UNIX fork system call.
• For a distributed system, the design of the process-creation mechanism has to take account of the utilization of multiple computers
• The creation of a new process can be separated into two independent aspects:
  – The choice of a target host
  – The creation of an execution environment

Choice of process host

• The choice of the node at which the new process will reside – the process allocation decision – is a matter of policy
• Transfer policy
  – Determines whether to situate a new process locally or remotely, for example depending on whether the local node is lightly or heavily loaded
• Location policy
  – Determines which node should host a new process selected for transfer. This decision may depend on the relative loads of nodes, on their machine architectures and on any specialized resources they may possess

Choice of process host (2)

• Process location policies may be:
  – Static
  – Adaptive: a load manager collects information about the nodes and uses it to allocate new processes to nodes
• Load-sharing systems may be:
  – Centralized: one load manager component
  – Hierarchical: several load managers organized in a tree structure
  – Decentralized: nodes exchange information with one another directly to make allocation decisions

Choice of process host (3)

• In sender-initiated load-sharing algorithms, the node that requires a new process to be created is responsible for initiating the transfer decision
• In receiver-initiated algorithms, a node whose load is below a given threshold advertises its existence to other nodes so that relatively loaded nodes will transfer work to it
• Migratory load-sharing systems can shift load at any time, not just when a new process is created. They use a mechanism called process migration

Creation of a new execution environment

• There are two approaches to defining and initializing the address space of a newly created process
  – The address space may be statically defined
    • For example, it could contain just a program text region, heap region and stack region
    • Address space regions are initialized from an executable file or filled with zeroes as appropriate

Creation of a new execution environment (2)

  – The address space can be defined with respect to an existing execution environment
    • For example, the newly created child process physically shares the parent's text region, and has heap and stack regions that are copies of the parent's in extent (as well as in initial contents)
    • When parent and child share a region, the page frames belonging to the parent's region are mapped simultaneously into the corresponding child region

New ideas: cloud computing to deal with multiple apps or processes & runtime issues

• Virtualization
• System image (OS or VM)
• Applications run on a VM
• Create, pause, ship, resume
• This approach eases runtime-environment requirements by shipping both the runtime and the application to the target host

Copy-on-write

[Figure: process A's address space contains region RA; process B's region RB is copied from RA. (a) Before write: the pages are initially write-protected at the hardware level, and both A's and B's page tables map a shared frame in the kernel. (b) After write: a page fault occurs, and the fault handler allocates a new frame for process B and copies the original frame's data into it byte by byte.]
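The copy-on-write behaviour in the figure can be observed from user space via fork(): parent and child initially share page frames, and the child's write triggers a fault that makes the kernel copy the frame, leaving the parent's copy untouched. A minimal sketch (the values 10 and 99 are arbitrary):

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_value = 10;

/* After fork(), parent and child logically have separate copies of
   shared_value, even though the kernel initially shares the frame.
   Returns the parent's value after the child has written its copy. */
int cow_demo(void) {
    pid_t pid = fork();
    if (pid == 0) {
        shared_value = 99;   /* write fault: kernel copies the frame */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return shared_value;     /* still 10 in the parent */
}
```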

Client and server with threads

Worker pool

[Figure: in the client, thread 1 generates results and thread 2 makes requests to the server; in the server, requests pass through receipt & input-output queuing to a pool of N worker threads.]

A disadvantage of this architecture is its inflexibility. Another disadvantage is the high level of switching between the I/O and worker threads as they manipulate the shared queue.

Alternative server threading architectures

Thread-per-connection associates a thread with each connection; thread-per-object associates a thread with each object.

[Figure: a. thread-per-request – worker threads serve remote objects, fed by an I/O thread; b. thread-per-connection – per-connection threads serve remote objects; c. thread-per-object – per-object threads serve remote objects, fed by an I/O thread.]

Advantage of thread-per-request: the threads do not contend for a shared queue, and throughput is potentially maximized. Its disadvantage is the overhead of the thread creation and destruction operations. The benefit of the last two architectures is lowered thread-management overheads compared with the thread-per-request architecture. Their disadvantage is that clients may be delayed while a worker thread has several outstanding requests but another thread has no work to perform.

Process & Context Switch

[Slides: Process Control Block (PCB); Diagram of Process State; Ready Queue and Various I/O Device Queues; CPU Switch from Process to Process]

User program & kernel interface

Note: this picture is excerpted from "Write a Linux Hardware Device Driver", Andrew O'Shaughnessy, UnixWorld

IPC Communication Models

Two processes & socket communication

Describe a detailed cost function or model based on this architecture model

[Figure: as before, Process A (client, any port, Internet address 138.37.94.248) sends a message to Process B (server, agreed port, Internet address 138.37.88.249).]

Cost Model

• Process A reads data
• Process A: system call to write data to a socket
• Context switch on Process A
• Data transmitted to Process B on 138.37.88.249
• Process B: system call to read data from a socket
• Context switch on Process B
• Process B gets data from the socket

• Assume there are 100,000 write/reads between A & B
• Reading data from a variable costs 1 unit of time
• A context switch costs 50 units of time

Cost Model

• Assume there are 100,000 write/reads between A & B
• Writing/reading data from a variable costs 1 unit of time
• A context switch costs 50 units of time
• Data transmission costs 100 units
• What is the estimated total cost?

• What are possible solutions to reduce cost?
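One way to work the estimate is to parameterize it. The per-exchange accounting below is an assumption about how the slide intends the units to be counted (one read on each side, one context switch on each side, one transmission), so treat the breakdown as illustrative rather than the official answer:

```c
/* Estimated total cost of the A-to-B exchanges in the slide's units.
   Assumed accounting per exchange: read on A (1) + context switch
   on A (50) + transmission (100) + context switch on B (50) +
   read on B (1). */
long long total_cost(long long exchanges, long long rw_cost,
                     long long ctx_switch, long long transmit) {
    long long per_exchange = rw_cost       /* A reads data         */
                           + ctx_switch    /* context switch on A  */
                           + transmit      /* transmission to B    */
                           + ctx_switch    /* context switch on B  */
                           + rw_cost;      /* B reads from socket  */
    return exchanges * per_exchange;
}
```

Under these assumptions, `total_cost(100000, 1, 50, 100)` is 100,000 × 202 = 20,200,000 units. The dominant terms are the context switches and transmissions, which is why batching many logical messages into one send, or replacing the socket with shared memory, are the usual cost reducers.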

Client and server

[Figure: as before, in the client, thread 1 generates results and thread 2 makes requests; in the server, receipt & input-output queuing feeds N threads.]

Alternative server threading architectures

[Figure repeated: a. thread-per-request (workers); b. thread-per-connection (per-connection threads); c. thread-per-object (per-object threads), each serving remote objects, with I/O threads in (a) and (c).]

State associated with execution environments and threads

Execution environment: address space tables; communication interfaces, open files; semaphores, other synchronization objects; list of thread identifiers; pages of address space resident in memory; hardware cache entries
Thread: saved processor registers; priority and execution state (such as BLOCKED); software interrupt handling information; execution environment identifier

Java thread constructor and management methods

Thread(ThreadGroup group, Runnable target, String name) – Creates a new thread in the SUSPENDED state, which will belong to group and be identified as name; the thread will execute the run() method of target.
setPriority(int newPriority), getPriority() – Set and return the thread's priority.
run() – A thread executes the run() method of its target object, if it has one, and otherwise its own run() method (Thread implements Runnable).
start() – Changes the state of the thread from SUSPENDED to RUNNABLE.
sleep(int millisecs) – Causes the thread to enter the SUSPENDED state for the specified time.
yield() – Causes the thread to enter the READY state and invoke the scheduler.
destroy() – Destroys the thread.

Java thread synchronization calls

thread.join(int millisecs) – Blocks the calling thread for up to the specified time until thread has terminated.
thread.interrupt() – Interrupts thread: causes it to return from a blocking method call such as sleep().
object.wait(long millisecs, int nanosecs) – Blocks the calling thread until a call made to notify() or notifyAll() on object wakes the thread, or the thread is interrupted, or the specified time has elapsed.
object.notify(), object.notifyAll() – Wakes, respectively, one or all of any threads that have called wait() on object.
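The wait/notify calls above map closely onto POSIX condition variables, the mechanism the C-level thread libraries in this lecture would use. A hedged sketch of the same pattern (the `ready` flag is the demo's stand-in for a real condition):

```c
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int ready = 0;

/* Producer: set the condition and wake one waiter
   (the analogue of object.notify()). */
static void *producer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m);
    ready = 1;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&m);
    return NULL;
}

/* Consumer: wait until the condition holds (the analogue of
   object.wait()); returns the flag value it observed. */
int wait_notify_demo(void) {
    ready = 0;
    pthread_t p;
    pthread_create(&p, NULL, producer, NULL);
    pthread_mutex_lock(&m);
    while (!ready)                  /* guards against spurious wakeups */
        pthread_cond_wait(&cv, &m); /* atomically releases m and waits */
    pthread_mutex_unlock(&m);
    pthread_join(p, NULL);
    return ready;
}
```

As in Java, the wait must sit inside a loop that rechecks the condition, since wakeups may be spurious.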

Scheduler activations

[Figure: (A) the kernel assigns virtual processors to processes A and B; (B) events pass between the user-level scheduler and the kernel: P added, P idle, P needed, SA preempted, SA blocked, SA unblocked. Key: P = processor; SA = scheduler activation.]

The four types of event that the kernel notifies to the user-level scheduler

• Virtual processor allocated
  – The kernel has assigned a new virtual processor to the process, and this is the first timeslice upon it; the scheduler can load the SA with the context of a READY thread, which can thus recommence execution
• SA blocked
  – An SA has blocked in the kernel, and the kernel is using a fresh SA to notify the scheduler; the scheduler sets the state of the corresponding thread to BLOCKED and can allocate a READY thread to the notifying SA
• SA unblocked
  – An SA that was blocked in the kernel has become unblocked and is ready to execute at user level again; the scheduler can now return the corresponding thread to the READY list. To create the notifying SA, the kernel either allocates a new virtual processor to the process or preempts another SA in the same process. In the latter case, it also communicates the preemption event to the scheduler, which can re-evaluate its allocation of threads to SAs
• SA preempted
  – The kernel has taken away the specified SA from the process (although it may do this to allocate a processor to a fresh SA in the same process); the scheduler places the preempted thread in the READY list and re-evaluates the thread allocation

Invocation performance

• Invocation performance is a critical factor in distributed system design
• Network technologies continue to improve, but invocation times have not decreased in proportion with increases in network bandwidth

Invocations between address spaces

[Figure: (a) a system call – a thread transfers control from user to kernel via a trap instruction, crossing one protection domain boundary; (b) RPC/RMI within one computer – thread 1 in user space 1 transfers control through the kernel to thread 2 in user space 2 via privileged instructions; (c) RPC/RMI between computers – thread 1 (user 1, kernel 1) communicates over the network with thread 2 (user 2, kernel 2).]

RPC delay against parameter size

[Figure: client RPC delay plotted against requested data size (0–2000 bytes). The delay is roughly proportional to the size until the size reaches a threshold at about the network packet size.]

The following are the main components accounting for remote invocation delay, besides network transmission times

– Marshalling: marshalling and unmarshalling, which involve copying and converting data, become a significant overhead as the amount of data grows
– Data copying: potentially, even after marshalling, message data is copied several times in the course of an RPC:
  1. Across the user-kernel boundary, between the client or server address space and kernel buffers
  2. Across each protocol layer (for example, RPC/UDP/IP/Ethernet)
  3. Between the network interface and kernel buffers
– Packet initialization: this involves initializing protocol headers and trailers, including checksums; the cost is therefore proportional, in part, to the amount of data sent
– Thread scheduling and context switching: several system calls (that is, context switches) are made during an RPC as stubs invoke the kernel's communication operations; one or more server threads is scheduled; and if the operating system employs a separate network manager process, each Send involves a context switch to one of its threads
– Waiting for acknowledgements: the choice of RPC protocol may influence delay, particularly when large amounts of data are sent

A lightweight remote procedure call

• The LRPC design is based on optimizations concerning data copying and thread scheduling
• Client and server are able to pass arguments and values directly via an A stack; the same stack is used by the client and server stubs
• In LRPC, arguments are copied once: when they are marshalled onto the A stack. In an equivalent RPC, they are copied four times

A lightweight remote procedure call

[Figure: 1. the client stub copies args onto the A stack shared between client and server; 2. trap to the kernel; 3. upcall into the server stub; 4. the server executes the procedure and copies results; 5. return (trap).]

Asynchronous operation

• A common technique to defeat high latencies is asynchronous operation, which arises in two programming models:
  – concurrent invocations
  – asynchronous invocations
• An asynchronous invocation is one that is performed asynchronously with respect to the caller; that is, it is made with a non-blocking call, which returns as soon as the invocation request message has been created and is ready for dispatch

Times for serialized and concurrent invocations

[Figure: with serialized invocations, the client performs process args / marshal / Send, then waits to Receive / unmarshal / process results for each request before issuing the next; the server unmarshals, executes the request, marshals and Sends each reply. With concurrent invocations, the client marshals and Sends several requests before their replies arrive, overlapping transmission with server execution, so the total elapsed time is shorter.]

Monolithic kernel and microkernel

[Figure: in a monolithic kernel, servers S1–S4 run inside the kernel code and data; in a microkernel, S1–S3 run as dynamically loaded server programs outside a small kernel, with S4 above them. Key: server; kernel code and data; dynamically loaded server program.]

The role of the microkernel

Middleware

Language support subsystems, OS emulation subsystem, and other subsystems

Microkernel

Hardware

The microkernel supports middleware via subsystems

Comparison

• The chief advantage of a microkernel-based operating system is its extensibility
• A relatively small kernel is more likely to be free of bugs than one that is larger and more complex
• The advantage of a monolithic design is the relative efficiency with which operations can be invoked

Virtualization

Introduction to Virtualization

• System virtualization has been studied since the 1970s (Goldberg, Popek)
• Fundamentals:
  – Run multiple virtual machines (OSes) simultaneously
  – Isolation between virtual machines
  – Controlled resource sharing between VMs
  – Increased resource utilization
• One of the hottest technologies in 2006

Virtualization: Key concepts

• Virtual Machine (VM), guest OS: complete operating system running in a virtual environment

• Host OS: the operating system running on top of the hardware; the interface between the user and the VMM and VMs

• Virtual Machine Monitor (VMM): manages VMs (scheduling, hardware access)

Virtualization: Usage

• Server consolidation
• Software testing
• Security, isolation
• Lower cost of ownership of servers
• Increased manageability
• Enhanced server reliability

Major Fields of Virtualization

• Storage Virtualization

• Network Virtualization

• Server Virtualization

Architecture & Interfaces

• Architecture: formal specification of a system’s interface and the logical behavior of its visible resources.

[Figure: applications call libraries through the API; libraries invoke the operating system through system calls at the ABI; the OS and applications execute on the hardware through the ISA (system ISA and user ISA).]

Hardware

• API – application programming interface
• ABI – application binary interface
• ISA – instruction set architecture

Credit: CS 5204 – Operating Systems, Virginia Tech

VMM Types

• System VMM
  – Provides an ABI interface
  – Efficient execution
  – Can add OS-independent services (e.g., migration, intrusion detection)

• Process VMM
  – Provides an API interface
  – Easier installation
  – Leverages OS services (e.g., device drivers)
  – Execution overhead (possibly mitigated by just-in-time compilation)

System-level Design Approaches

• Full virtualization (direct execution)
  – Exact hardware exposed to the OS
  – Efficient execution
  – OS runs unchanged
  – Requires a "virtualizable" architecture
  – Example: VMware

• Paravirtualization
  – OS modified to execute under a VMM
  – Requires porting OS code
  – Execution overhead
  – Necessary for some (popular) architectures (e.g., x86)
  – Examples: Xen, Denali

Design Space (level vs. ISA)

[Figure: the design space of virtualization techniques, arranged by interface level (API vs. ABI) against ISA.]

• A variety of techniques and approaches are available
• The critical technology space is highlighted

System VMMs

• Structure
  – Type 1: runs directly on the host hardware
  – Type 2: runs on a host OS
• Primary goals
  – Type 1: high performance
  – Type 2: ease of construction/installation/acceptability
• Examples
  – Type 1: VMware ESX Server, Xen, OS/370
  – Type 2: User-mode Linux

Hosted VMMs

• Structure
  – Hybrid between Type 1 and Type 2
  – Core VMM executes directly on hardware
  – I/O services provided by code running on the host OS

• Goals
  – Improve overall performance
  – Leverage I/O device support of the host OS

• Disadvantages
  – Incurs overhead on I/O operations
  – Lacks performance isolation and performance guarantees

• Example: VMware Workstation

Whole-system VMMs

• Challenge: the guest OS ISA differs from the host OS ISA
• Requires full emulation of the guest OS and its applications
• Example: VirtualPC

Strategies

• De-privileging
  – The VMM emulates the effect on system/hardware resources of privileged instructions whose execution traps into the VMM – aka trap-and-emulate
  – Typically achieved by running the guest OS at a lower hardware priority level than the VMM
  – Problematic on some architectures where privileged instructions do not trap when executed at deprivileged priority
• Primary/shadow structures
  – The VMM maintains "shadow" copies of critical structures whose "primary" versions are manipulated by the guest OS, e.g., page tables
  – Primary copies are needed to ensure that the correct environment is visible to the guest OS
• Memory traces
  – Controlling access to memory so that the shadow and primary structures remain coherent
  – Common strategy: write-protect the primary copies so that update operations cause page faults which can be caught, interpreted, and emulated

Different Virtualization Concepts

• Full virtualization: a full virtual machine, from the boot sequence to the virtualized hardware
• Para-virtualization: the guest OS has to be modified for performance optimization
• Emulation: the guest OS architecture is different from the architecture of the host OS (translation on the fly). Ex: a PPC VM on top of an x86 host OS.

Classification

• Two kinds of system virtualization
  – Type I: the virtual machine monitor and the virtual machine run directly on top of the hardware
  – Type II: the virtual machine monitor and the virtual machine run on top of the host OS

[Diagram: Type I (bare-metal) virtualization – VMs on a VMM that runs directly on the hardware (VMware ESX, Microsoft Hyper-V, Xen); Type II (hosted) virtualization – VMs on a VMM that runs on top of a host OS (VMware Workstation, Microsoft Virtual PC, Sun VirtualBox, QEMU, KVM)]

Bare-metal or hosted?

• Bare-metal
  – Has complete control over hardware
  – Doesn't have to "fight" an OS
• Hosted
  – Avoids code duplication: no need to code a process scheduler or a memory management system – the OS already does that
  – Can run native processes alongside VMs
  – Familiar environment – how much CPU and memory does a VM take? Use top! How big is the virtual disk? Use ls -l!
  – Easy management – stop a VM? Sure, just kill it!
• A combination
  – Mostly hosted, but some parts are inside the OS kernel for performance reasons
  – E.g., KVM

Available Solutions

• Example virtualization projects
  – Type I: Xen, L4, VMware ESX, Microsoft Hyper-V
  – Type II: VMware Workstation, Microsoft Virtual PC, Sun VirtualBox, QEMU, KVM
• Different benefits
  – Type I: performance
    • direct access to the hardware, simple to implement
    • para-virtualization possible
  – Type II: development
    • no limitation of para-virtualization
    • emulation possible

How to run a VM? Emulate!

• Do whatever the CPU does, but in software
  – Fetch the next instruction
  – Decode – is it an ADD, a XOR, a MOV?
  – Execute – using the emulated registers and memory

Example: addl %ebx, %eax is emulated as:

    enum {EAX=0, EBX=1, ECX=2, EDX=3, …};
    unsigned long regs[8];
    regs[EAX] += regs[EBX];
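The fetch-decode-execute loop around that snippet can be sketched as below. This is a self-contained toy: the `struct insn` encoding (opcode plus two register operands) is invented for illustration – real x86 decoding works on variable-length byte streams and is far more involved.

```c
#include <assert.h>

/* Register file, as in the slide's example. */
enum { EAX = 0, EBX, ECX, EDX, NREGS = 8 };
unsigned long regs[NREGS];

/* Hypothetical, simplified instruction encoding. */
enum { OP_ADD, OP_XOR, OP_MOV, OP_HALT };
struct insn { int op, dst, src; };

/* Fetch-decode-execute loop over a program held in "emulated memory". */
void emulate(const struct insn *prog) {
    for (int pc = 0; prog[pc].op != OP_HALT; pc++) {       /* fetch */
        const struct insn *i = &prog[pc];
        switch (i->op) {                                   /* decode */
        case OP_ADD: regs[i->dst] += regs[i->src]; break;  /* execute */
        case OP_XOR: regs[i->dst] ^= regs[i->src]; break;
        case OP_MOV: regs[i->dst]  = regs[i->src]; break;
        }
    }
}
```

Every guest instruction costs many host instructions here, which is exactly why pure emulation is slow.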

How to run a VM? Emulate!

• Pro: – Simple!

• Con: – Slooooooooow

• Example hypervisor:

How to run a VM? Trap and emulate!

• Run the VM directly on the CPU – no emulation!
• Most of the code can execute just fine
  – E.g., addl %ebx, %eax
• Some code needs hypervisor intervention
  – int $0x80
  – movl something, %cr3
  – I/O
• Trap and emulate it!
  – E.g., if the guest runs int $0x80, trap it and execute the guest's interrupt 0x80 handler

How to run a VM? Trap and emulate!
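The trap-and-emulate dispatch described above can be sketched as follows. This is a hypothetical illustration: the `struct guest` fields, the trap codes, and the shadow-CR3 handling are invented stand-ins for what a real VMM tracks when a deprivileged guest's privileged instruction traps.

```c
#include <assert.h>

/* Minimal guest state the hypervisor tracks (hypothetical). */
struct guest {
    int interrupts_enabled;
    unsigned long cr3;        /* the guest's own page-table base value */
    unsigned long shadow_cr3; /* what the hardware would actually use */
};

/* Trap reasons delivered to the VMM when the deprivileged guest
 * executes a privileged instruction (invented for the example). */
enum trap { TRAP_CLI, TRAP_STI, TRAP_MOV_CR3 };

/* Trap-and-emulate: the instruction's effect is applied to the guest's
 * virtual state rather than to the real hardware. */
void vmm_handle_trap(struct guest *g, enum trap t, unsigned long arg) {
    switch (t) {
    case TRAP_CLI: g->interrupts_enabled = 0; break;
    case TRAP_STI: g->interrupts_enabled = 1; break;
    case TRAP_MOV_CR3:
        g->cr3 = arg;             /* record what the guest thinks it set */
        g->shadow_cr3 = arg | 1;  /* placeholder: install a shadow table */
        break;
    }
}
```

After the handler runs, the VMM resumes the guest at the next instruction, so the guest never notices the instruction was emulated.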

• Pro: – Performance!

• Cons:
  – Harder to implement
  – Needs hardware support
    • Not all "sensitive" instructions cause a trap when executed in user mode
    • E.g., POPF, which may be used to clear IF
    • This instruction does not trap, but the value of IF does not change!

– This hardware support is called VMX (Intel) or SVM (AMD) – Exists in modern CPUs

• Example hypervisor: KVM

How to run a VM? Dynamic (binary) translation!

• Take a block of binary VM code that is about to be executed • Translate it on the fly to “safe” code (like JIT – just in time compilation) • Execute the new “safe” code directly on the CPU

• Translation rules?
  – Most code translates identically (e.g., movl %eax, %ebx translates to itself)
  – "Sensitive" operations are translated into hypercalls
    • Hypercall – a call into the hypervisor to ask for a service
    • Implemented as trapping instructions (unlike POPF)
    • Similar to a syscall – a call into the OS to request a service

How to run a VM? Dynamic (binary) translation!
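The rule "most code translates identically, sensitive code is rewritten into trapping code" can be sketched with one-byte opcodes. This is a deliberately simplified illustration: a real translator decodes variable-length x86 instructions and chains translated blocks, whereas here we only rewrite the sensitive CLI opcode (0xFA, non-trapping in user mode) into INT3 (0xCC, which always traps).

```c
#include <assert.h>
#include <stddef.h>

/* One-byte opcodes used for illustration. */
#define OP_NOP  0x90  /* harmless: translates to itself */
#define OP_CLI  0xFA  /* "sensitive": silently ignored in user mode */
#define OP_INT3 0xCC  /* always traps into the supervisor */

/* Hypothetical translator: copy a basic block, rewriting each sensitive
 * instruction into a trapping one so the hypervisor regains control. */
void translate_block(const unsigned char *src, unsigned char *dst, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = (src[i] == OP_CLI) ? OP_INT3 : src[i];  /* identical otherwise */
}
```

When the rewritten block runs and hits the trapping byte, the hypervisor's trap handler performs the CLI's effect on the guest's virtual state – effectively a hypercall inserted by the translator rather than by the guest's author.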

• Pros: – No hardware support required – Performance – better than emulation

• Cons:
  – Performance – worse than trap and emulate
  – Hard to implement – the hypervisor needs an on-the-fly x86-to-x86 binary compiler

• Examples: VMware, QEMU

How to run a VM? Paravirtualization!

• Does not run unmodified guest OSes • Requires guest OS to “know” it is running on top of a hypervisor

• E.g., instead of doing cli to turn off interrupts, guest OS should do hypercall(DISABLE_INTERRUPTS)
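The cli-to-hypercall replacement can be sketched as below. The hypercall numbers and the `guest_cli`/`guest_sti` names are invented for the example, and the hypercall itself is stubbed as a direct function call – in a real paravirtualized guest it would be a trapping instruction (or a call into a hypercall page) that enters the hypervisor.

```c
#include <assert.h>

/* Hypothetical hypercall numbers used by the paravirtualized guest. */
enum { DISABLE_INTERRUPTS, ENABLE_INTERRUPTS };

/* Virtual-CPU interrupt flag maintained on the "hypervisor" side. */
static int vcpu_if = 1;

/* Stub for the trap into the hypervisor. */
void hypercall(int nr) {
    if (nr == DISABLE_INTERRUPTS) vcpu_if = 0;
    else if (nr == ENABLE_INTERRUPTS) vcpu_if = 1;
}

/* The paravirtualized guest's replacements for cli/sti: no privileged
 * instruction is executed at all. */
void guest_cli(void) { hypercall(DISABLE_INTERRUPTS); }
void guest_sti(void) { hypercall(ENABLE_INTERRUPTS); }

int vcpu_interrupts_enabled(void) { return vcpu_if; }
```

Because the guest asks explicitly instead of executing a sensitive instruction, nothing has to be trapped or translated – which is where paravirtualization gets its speed.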

How to run a VM? Paravirtualization!

• Pros: – No hardware support required – Performance – better than emulation

• Cons:
  – Requires a specifically modified guest
  – The same guest OS cannot run both in a VM and on bare metal

• Example hypervisor: Xen

Industry trends

• Trap and emulate
• With hardware support
  – VMX (Intel), SVM (AMD)

Linux-related virtualization projects

  Project        Type                                    License
  Bochs          Emulation                               LGPL
  QEMU           Emulation                               LGPL/GPL
  VMware         Full virtualization                     Proprietary
  z/VM           Full virtualization                     Proprietary
  Xen            Paravirtualization                      GPL
  UML            Paravirtualization                      GPL
  Linux-VServer  Operating system-level virtualization   GPL
  OpenVZ         Operating system-level virtualization   GPL

Hardware support for full virtualization and paravirtualization

• Recall that the IA-32 (x86) architecture creates some issues when it comes to virtualization. Certain privileged-mode instructions do not trap, and can return different results based upon the mode. For example, the x86 STR instruction retrieves the security state, but the value returned is based upon the particular requester's privilege level. This is problematic when attempting to virtualize different operating systems at different levels.
• For example, the x86 supports four rings of protection, where level 0 (the highest privilege) typically runs the operating system, levels 1 and 2 support operating system services, and level 3 (the lowest level) supports applications.
• Hardware vendors have recognized this shortcoming (and others), and have produced new designs that support and accelerate virtualization.

Hardware support for full virtualization and paravirtualization

• Intel is producing new virtualization technology that will support hypervisors for both the x86 (VT-x) and Itanium® (VT-i) architectures.
• VT-x supports two new forms of operation:
  – one for the VMM (root)
  – one for guest operating systems (non-root)
• The root form is fully privileged, while the non-root form is deprivileged (even for ring 0).
• The architecture also supports flexibility in defining the instructions that cause a VM (guest operating system) to exit to the VMM and store off processor state. Other capabilities have been added.

Hardware support for full virtualization and paravirtualization

• AMD is also producing hardware-assisted virtualization technology, under the name Pacifica.
• Among other things, Pacifica maintains a control block for guest operating systems that is saved on execution of special instructions.
• The VMRUN instruction allows a virtual machine (and its associated guest operating system) to run until the VMM regains control (which is also configurable). The configurability allows the VMM to customize the privileges for each of the guests.
• Pacifica also amends address translation with host and guest memory management unit (MMU) tables.

I/O Virtualization

• Typical methods to virtualize the CPU • A computer is more than a CPU • Also need I/O!

• Types of I/O:
  – Block (e.g., hard disk)
  – Network
  – Input (e.g., keyboard, mouse)
  – Sound
  – Video
• Most performance critical (for servers):
  – Network
  – Block

Xen Overview

• Para-virtualization possible
  – full virtualization is possible with virtualization support at the hardware level (Intel VT technology, AMD-V/Pacifica technology)
  – XenoLinux: port of the Linux kernel to the Xen Hypervisor
• Hypervisor based on a micro-kernel
• Open source, Linux based
• Create and manage VMs via the command line
• Restricted hardware access through an API
• Host's kernel needs to be patched.

VMware Overview

• Commercial virtualization applications
• Full virtualization
• Highly portable
• Simulates BIOS, PXE boot
• Simulates virtual hardware for every VM
• Supports bridged, NAT, and host-only networks
• Runs a wide range of unmodified guest OSes, such as Windows, Linux, Solaris, BSD, NetWare, and DOS

VMware Overview

Source: http://www.vmware.com

VMware vs. Xen

Relative performance on native Linux (L), Xen/Linux (X), VMware Workstation 3.2 (V), and User Mode Linux (U).

Source: "Xen and the Art of Virtualization", Ian Pratt, University of Cambridge, XenSource Inc. http://www.cl.cam.ac.uk/netos/papers/2005-xen-may.ppt

VMware vs. Xen (TCP results)

[Bar chart: normalized TCP bandwidth (0.0–1.1) for Tx and Rx at MTU 1500 and MTU 500 (Mbps), comparing L, X, V, and U]

TCP bandwidth on Linux (L), Xen (X), VMWare Workstation (V), and UML (U)

Source: "Xen and the Art of Virtualization", Ian Pratt, University of Cambridge, XenSource Inc. http://www.cl.cam.ac.uk/netos/papers/2005-xen-may.ppt

QEMU

• Emulation solution
• Direct access to the hardware is possible if the host OS and the guest OS have the same architecture

[Diagram: several guests – Linux, Windows, Linux, Mac OS X, and Solaris, each with its own user space and drivers – running on QEMU instances (QEMU x86, QEMU x86, QEMU PPC, QEMU PPC, QEMU Sparc) on top of a host OS (Linux, Mac OS X, or Windows) and the hardware (processor, memory, disk, network, etc.)]

From http://fr.wikipedia.org/wiki/Qemu

Xen Overview

• Para-virtualization possible
  – full virtualization is possible with virtualization support at the hardware level (Intel VT technology, AMD-V/Pacifica technology)
  – XenoLinux: port of the Linux kernel to the Xen Hypervisor
• Hypervisor based on a micro-kernel
• Efficient virtualization: HPC possible

Xen Overview

• Open source, Linux based
• High performance
• Supports bridged and routed networks
• Create and manage VMs via the command line
• Restricted hardware access through an API
• Host's kernel needs to be patched.

Xen's Ring Model

[Diagram: Standard x86 architecture – Ring 0: operating system, Rings 1 and 2: not used, Ring 3: user applications. Xen on x86 architecture – Ring 0: Xen's hypervisor, Ring 1: the VMs' operating systems, Ring 2: not used, Ring 3: user applications]

The architecture of Xen

Instructor’s Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012

Use of rings of privilege

Virtualization of memory management

Split device drivers

I/O rings
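The I/O-rings figure (not reproduced here) shows the shared-memory request/response rings Xen uses between split frontend and backend drivers. A toy single-producer/single-consumer ring in that spirit is sketched below; the names, the ring size, and the single shared queue are all simplifications – real Xen rings pair request and response queues and signal via event channels.

```c
#include <assert.h>

/* Power-of-two size so free-running indices can wrap with a mask. */
#define RING_SIZE 8

struct ring {
    unsigned prod, cons;   /* free-running producer/consumer indices */
    int req[RING_SIZE];    /* request slots shared between domains */
};

/* Frontend driver enqueues a request; returns -1 if the ring is full. */
int ring_put(struct ring *r, int v) {
    if (r->prod - r->cons == RING_SIZE) return -1;
    r->req[r->prod++ & (RING_SIZE - 1)] = v;
    return 0;
}

/* Backend driver dequeues a request; returns -1 if the ring is empty. */
int ring_get(struct ring *r, int *v) {
    if (r->prod == r->cons) return -1;
    *v = r->req[r->cons++ & (RING_SIZE - 1)];
    return 0;
}
```

Because both domains only read the other side's index and write their own, requests flow through shared memory without copying, and notifications are needed only when a ring transitions between empty and non-empty.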

The XenoServer Open Platform Architecture

Virtualization Examples

• Server consolidation - Virtual machines are used to consolidate many physical servers into fewer servers, which in turn host virtual machines. Each physical server is reflected as a virtual machine "guest" residing on a virtual machine host system. This is also known as Physical-to-Virtual or 'P2V' transformation. Virtualization Examples

• Disaster recovery - Virtual machines can be used as "hot standby" environments for physical production servers. This changes the classical "backup-and-restore" philosophy, by providing backup images that can "boot" into live virtual machines, capable of taking over workload for a production server experiencing an outage. Virtualization Examples

• Testing and training - Hardware virtualization can give root access to a virtual machine. This can be very useful such as in kernel development and operating system courses. Virtualization Examples

• Portable applications - The Microsoft Windows platform has a well-known issue involving the creation of portable applications, needed (for example) when running an application from a removable drive, without installing it on the system's main disk drive. This is a particular issue with USB drives. Virtualization can be used to encapsulate the application with a redirection layer that stores temporary files, Windows Registry entries, and other state information in the application's installation directory – and not within the system's permanent file system. See portable applications for further details. It is unclear whether such implementations are currently available. Virtualization Examples

• Portable workspaces - Recent technologies have used virtualization to create portable workspaces on devices like iPods and USB memory sticks. These products include:
  – Application level – Thinstall – a driver-less solution for running "Thinstalled" applications directly from removable storage without system changes or needing Admin rights
  – OS level – MojoPac, Ceedo, and U3 – which allow end users to install some applications onto a storage device for use on another PC
  – Machine level – moka5 and LivePC – which deliver an operating system with a full software suite, including isolation and security protections

Virtualization Tips

• In the VMware space, VirtualCenter is the management tool of choice for ESX Server. • Other products, like Hewlett-Packard's Virtual Machine Management or IBM's Director modules, are adding functionality to deal with virtual machine [VM] environments. • The problem is that most of these tools that are snap-ins lack much of the simple functionality you get in VirtualCenter. • Most companies will end up buying both VirtualCenter and the vendor's tool and use both depending on what they are doing. Virtualization Tips

• Shy away from large amounts of processing when doing consolidation. • If you are doing virtualization for other reasons, like workload management, then you can get nearly anything to run virtualized if you are willing to change some of the things you do. • However, if you are looking for maximum consolidation ratios and high ROIs, stay away from the quad boxes that are already running at 50%. Security Tips

• Some standard minimum security at least: – Disable remote root access – use sudo when needed – configure the AD PAM modules for Windows shops. Security Tips

• Some organizations use too much surrounding security and end up making their environment slower, more difficult and expensive to manage. • When dealing with the VMs, all of the standard procedures should be followed. • The host systems themselves should often be considered appliances, and organizations should limit the amount of customized agents and security hacks performed on these systems. Security Tips

• One should not go overboard with ESX hosts, since they are basically appliances serving up computing resources and should be treated as such. Nevertheless, taking a common-sense approach to security on the servers is the best bet.
• The most common mistakes made with virtual security stem from ignorance: lack of knowledge of the Linux console, failure to understand how the virtual switch architecture works, and failure to realize that the host does not directly see the data inside the VM disk files.

• The same practices that are performed to secure a physical environment can, and should, be used in a virtual environment as well. • Everything from proper VLAN/firewall organization to host-based intrusion detection should be leveraged to keep the environment secure. Scalability Tips

• Simplicity. The more complicated the design and infrastructure, the less scalable it will be.
  – For example, a common mistake in large organizations is to assume they cannot create a simple solution because they are big. One can argue that they should make the solution or design for VMware as simple as possible, to make it scalable for the size of the organization and its largest client base.
• Don't design the entire solution around the one-offs.

• When designing a virtual infrastructure, one should never look at the environment and try to plan one large infrastructure for the entire virtualization project. It won't work.
• Instead, organize the overall environment into smaller groupings of servers and address each group individually.
• When approached this way, at the end of the project a very scalable deployment methodology will be in place, one that uses the same principles with a manageable number of servers in the various phases of the project.