
Diplomarbeit

A Secure Architecture for Untrusted Plugins

Achim Weimert

9 March 2011

Technische Universität Berlin, Fakultät IV, Institut für Softwaretechnik und Theoretische Informatik, Professur Security in Telecommunications

Supervising professor: Prof. Dr. Jean-Pierre Seifert
Supervising research assistant: Dipl.-Inf. Matthias Lange

Declaration

I affirm in lieu of an oath that I produced this thesis independently and by my own hand.

Berlin,

Achim Weimert

Contents

1 Introduction 1

2 Background 5
2.1 Sandboxing 5
2.2 Dynamic Content 6
2.2.1 Client-Side Programming Support 6
2.3 Plugins 8
2.3.1 Plugin Interfaces 8
2.3.1.1 Netscape Plugin API 9
2.3.1.2 Pepper Plugin API 11
2.3.1.3 MozPlugger 11
2.3.2 Important Plugins 12
2.3.2.1 Adobe Flash Player 12
2.3.2.2 Java Applets 12
2.3.2.3 Microsoft Silverlight 13
2.4 Content Handling in Browsers 13
2.5 Browser Architecture 14
2.5.1 Browser Extensions 15
2.5.2 Multi-process Architectures 15
2.6 Threat Analysis 17
2.6.1 Attacking Web Browser Availability 17
2.6.2 Unrestricted File System Access 17
2.6.3 Unrestricted Network Access 18
2.6.4 User Interface Forgery 19
2.6.5 Attacking Web Browser Integrity 20
2.6.6 Summary 21
2.6.7 Threat Model and Assumptions 22

3 Design 25
3.1 Designing a Secure Plugin API 25
3.2 Execution Model 26
3.2.1 Threading Model 27
3.2.2 Event Model Using Callbacks 27
3.2.3 Virtual CPU Model 27
3.3 Information Flow 28
3.3.1 vCPU System Calls 28
3.3.2 Event Handling 29


3.3.3 Handling of Pending Events 30
3.3.4 Handling of Data Events 31
3.3.5 Host Main Loop 32
3.4 Scheduling 32
3.5 Synchronization 34
3.5.1 Thread-Synchronization with Upcalls 34
3.5.2 Thread-Synchronization with other Threads 34
3.5.3 Thread-Synchronization with Upcalls and other Threads 34

4 Implementation 37
4.1 vCPU 37
4.1.1 Sandboxing Using ptrace 37
4.1.2 Setup of Host-Client Interaction 38
4.1.3 Event Handling 39
4.1.4 Host Main Loop 39
4.1.5 vCPU System Calls 40
4.1.5.1 vCPU System Call Execution 40
4.1.5.2 vCPU System Call Types 40
4.1.6 Race Conditions Between Events and System Calls 41
4.1.7 User Level Resume 42
4.2 Threading Library 45
4.2.1 Scheduling 45
4.2.2 Synchronization 46
4.2.2.1 Mutex 46
4.2.2.2 Condition Variable 47
4.2.2.3 Semaphore 48
4.2.3 Client Prioritization of Data Events 49
4.2.4 Dynamic Memory Allocation 49
4.3 Example Execution 50
4.4 Video Playback Using Plugin 51

5 Evaluation 55
5.1 System Call Roundtrip 55
5.2 Context Switch 55
5.3 Comparing User-level Resume with syscall_resume 56
5.4 Event Latency 57
5.5 Data Event Latency 58
5.5.1 Influence of Thread Priorities on Latency 59
5.5.2 Influence of Event Buffer Size on Latency 60
5.6 Computation Overhead 61
5.7 Data Throughput 62
5.8 Video Playback Using Plugin 63

6 Related Work 67
6.1 Browsers 67


6.1.1 Transparently Securing Plugins in 67
6.1.2 Sandbox 67
6.1.3 Native Client 68
6.1.4 OP Web Browser 69
6.1.5 Microsoft Gazelle 69
6.2 Operating Systems 70
6.2.1 Chromium OS 70
6.2.2 Illinois Browser Operating System 71
6.2.3 Capsicum 71

7 Conclusion 73
7.1 Outlook 73
7.2 Future Work 73

A Summary (German) 77

Glossary 79

Bibliography 81


List of Figures

2.1 A NPAPI plugin’s lifecycle 10
2.2 Architecture and dependencies of modern web browsers 15
2.3 Screenshot of a plugin reading the local file system 18
2.4 Screenshot of a plugin reading the file /etc/passwd 19
2.5 Screenshot taken before detouring the function gettimeofday 20
2.6 Screenshot taken after detouring the function gettimeofday 21
2.7 Threat Model 22

4.1 vCPU host and client interaction 42
4.2 vCPU execution example 52

5.1 Event latency with computational threads 58
5.2 Event latency with data handling threads 59
5.3 Latency distribution 60
5.4 Data event latency with computational threads 61
5.5 Data event latency with data handling threads 62
5.6 Latency dependence on buffer size 63
5.7 Delivery time dependence on buffer size 64
5.8 Data throughput 65


List of Tables

5.1 Time consumption of system calls 55
5.2 Time consumption of context switches 56
5.3 Time consumption of user-level resume and syscall_resume 57
5.4 Event latency 57
5.5 Computation overhead 61
5.6 Data throughput 62
5.7 Video decoding performance 64


1 Introduction

The Internet has changed the way people access information and communicate. At its beginning, it was mainly used for connecting remote mainframes. For example, a computer with high computing capabilities could provide its processing capacity to remote researchers. Other usages included the transmission of larger amounts of data (e.g. using FTP software) as well as communication within specific groups based on email or within larger groups based on Usenet. A structured way of retrieving information was available in the form of Gopher, which was designed to present text documents in a hierarchical manner. In the 1990s, the world wide web emerged as interlinked web pages delivered over the Internet. At that time, the first web browsers and web pages were developed to solve the problem of referencing, retrieving and displaying documents. Web pages mainly consisted of formatted text and hyperlinks as defined by the markup language HTML. Describing a web page, HTML was rendered and displayed by the browser. Using hyperlinks, different web pages could reference each other, giving the user a means to easily navigate from one web page to the next. This new way of accessing documents quickly became popular, and information was increasingly being published using web pages.

At that time, user interaction could only be reflected in a web page by navigating to a new web page. The user experience would have benefited from a more dynamic interaction mode, but the static nature of early HTML was not suited to it. First attempts were being made to replace traditional applications with websites, and developing websites featuring rich user interfaces and complex user interactions required programming support. For example, due to the missing support for programming inside web pages, dynamic web applications like email or chat clients were not possible. Consequently, limited programming support for interpreted scripting languages was added to browsers in the form of JavaScript and VBScript. Using those scripting languages, web pages and their elements could be modified dynamically. Compared with refetching a web page, scripting support greatly improved the possibilities for dynamic web pages. Being supported by many browsers, JavaScript gradually pushed back VBScript and became widely used on web pages.

Browsers at that time were missing support for multimedia content like video playback as well as rich document formats (e.g. the Portable Document Format, PDF) and advanced graphics formats (e.g. Scalable Vector Graphics, SVG). Whenever the user navigated to such a resource, the browser offered to download the file. After the download was completed, the user could open the file using a locally installed application. As the user experience would improve by dispensing with the additional steps and displaying commonly used content types inside the browser, the inclusion of complex documents evolved as a new requirement. For different reasons, support for those documents could not be added using the available scripting languages.

For example, as an interpreted language, JavaScript did not offer the performance needed for the calculations involved in video decoding. Additionally, as JavaScript is executed in a limited runtime environment, hardware acceleration for rendering complex graphics was not available. Support for compiled programming languages could have been added: the browser would download the source code, compile it and execute it. However, this would require a compilation infrastructure in the browser, which is not feasible given the diversity of client installations. In contrast, by supporting precompiled pieces of code which adhere to a specification and run natively, browsers could allow third-party developers to extend them with so-called plugins. This technique not only facilitates developing new features independently from the browser manufacturer but also reduces the size of the browser. Additionally, due to the separation of browser and plugin source code, software with a different license can be used as a plugin in combination with the browser. Especially for producers of closed-source media handlers, plugins provide a method to integrate their software into popular open-source browsers. Running natively, plugins offer performance suitable for intensive calculations. Due to these advantages, an interface coined the Netscape Plugin API (NPAPI) was added to many browsers. It allows third-party code in the form of precompiled plugins to be used by the browser to display vendor-specific content seamlessly in web pages. Since plugins are native code, they are mostly platform dependent. Additionally, as the interface was not completely specified, plugins need to use platform-specific features, for example for rendering. In general, the platform dependency of plugins is not a problem, as common plugins are available for most platforms. With the popularity of websites relying on plugins, the majority of systems with a browser set up also include plugins for rendering calculation-intensive multimedia or other active content.

Websites - and web browsers as a website’s runtime environment - nowadays often process sensitive data. For example, online banking websites give complete access to the user’s banking account, and commonly used social networks handle detailed personal information about the user. Those websites rely on the browser to protect their data from unrelated web pages displayed at the same time. Not only do websites rely on the browser for protection; users also trust the browser to limit the information flow between webpages and local data. As an example, file uploads to webpages normally require confirmation by the user. Browser plugins, which by design receive the same level of trust as the browser itself, thus evade the control mechanisms for untrusted code in webpages. Consequently, various concerns arise as to the trustworthiness of browsers, websites and plugins. To deal with these security concerns, current browsers need to separate web applications to mitigate the threat of malicious websites. To that end, browsers apply rules that restrict the access of JavaScript code shipped with a webpage to that webpage. This rule is known as the same origin policy. While the operating system (OS) cannot protect data that was deliberately entered into the browser by the user (e.g. passwords), the OS should protect the confidentiality and integrity of local user data from vulnerable browsers.
As popular operating systems protect data at user granularity and browsers run in the context of a user, the OS cannot directly protect local user data. With the OS mechanisms being insufficient in providing fine-grained means of separation to browsers, manufacturers started to add sandboxing features to

their browsers. Sandboxing was implemented to prevent a subverted component in a browser from immediately resulting in a compromise of local user data. Due to the complexity of enforcing separation on the browser level, researchers regularly find implementation errors in this area.

Additional problems can be observed in current browsers’ plugin handling. Once loaded, current plugins are tightly integrated into the browser and run unconstrained in the user’s context. Implementing the plugin interface in this way is not required by the API, but rather evolved as a result of the loose specification and of being straightforward to implement on the browser side. This integration is also convenient for plugin developers, as they can directly use operating system and browser functions. For example, plugins can take advantage of the operating system’s support for multithreading. For the user, this requires complete trust in plugins and their manufacturers concerning security and privacy matters. Due to the missing support for fine-grained separation in operating systems and browsers, plugins need to be encapsulated to adequately protect user data. Placing the plugin in a separate protection domain allows the design of a new interface between operating system, browser and plugin. If the interface is not as permissive as the one provided by the operating system, and if all interaction of the plugin is subject to approval by the browser, then the browser can enforce a specific security policy. The plugin interface should offer all functionality needed by plugins in a way that enables the browser to easily detect, check and block potentially unsafe operations. For example, instead of mapping the file system into the encapsulated plugin, only the content of files chosen by the user may be readable to the plugin.

The goal of my thesis is to propose a new plugin architecture. To improve upon the deficiencies of current architectures, I introduce a secure architecture for untrusted web-browser plugins. This thesis is organized as follows. In Chapter 2, I describe current technologies which form the basis of my work and outline their features and deficiencies. Chapter 3 explains possibilities to improve plugin architectures and introduces my design of a new plugin interface. This interface defines all interactions between plugin and browser and is designed to make sure the interactions are explicit and checkable. It provides a mechanism similar to system calls to allow the plugin to call browser functions, e.g. to request a network resource. Supporting the opposite direction, data from the browser can be sent to the plugin in the form of events or using shared memory. To encapsulate browser plugins from the operating system and the browser, I propose executing plugins in a separate process sandboxed using the system call ptrace. As plugins running in this limited execution environment cannot access operating system functions any more, means like threading and synchronization need to be provided inside the sandbox. Therefore, I designed a threading library in user space on which an encapsulated plugin can be built. I base my architecture on the concept of a virtual CPU, which efficiently supports the implementation of the designed interface while facilitating sandboxing. Details of the implementation on architectures running a Linux operating system are presented in Chapter 4. Chapter 5 provides an evaluation of the implemented architecture.
Relevant measurements concerning computational overhead, latency and data throughput are presented. Results are put into context by comparing them with native execution. In Chapter 6, I present research related to my work. To conclude the thesis, Chapter 7 summarizes the results and suggests future work.


2 Background

This chapter starts by describing the concept of sandboxing, which serves as a basis for the following chapters. I then give an overview of techniques used to display web pages. In this context, I explain how traditional browsers handle static as well as dynamic content and describe the role of plugins. Afterwards, I outline current developments in browser architectures as well as security considerations of browser manufacturers concerning modern browsers and plugins. Next, I analyze the current plugin architecture used in browsers like Firefox. Finally, I derive a threat model as the basis for my research.

2.1 Sandboxing

For reasons such as distrust in a piece of software or requirements for fault containment, programs may need to put different modules into separate protection domains. For example, a host program that runs computational tasks on behalf of others generally needs to prevent the executed code from having an impact on the host itself. Other examples include programs that can be extended by third-party software. To prevent faults in external components from influencing the main application, external components can be encapsulated. In general, a software sandbox is a mechanism that allows execution of arbitrary code within an isolated environment. As all interactions with other software or the operating system are denied, the effect of code running inside the sandbox is limited to other code inside the same sandbox. Different methods have been proposed to allow sandboxing of software. This section describes use cases for sandboxing and outlines three important concepts: process virtual machines, virtual address spaces and software fault isolation.

Encapsulation can be achieved using a process virtual machine (process VM). The process VM introduces an abstraction of the underlying hardware and operating system. A program can then be executed in a platform-independent programming environment. Most implementations of process VMs are based on an interpreter that enforces security policies while parsing, checking and executing programs.

Apart from using process virtual machines, process isolation can also be provided for native code. Using virtual address spaces, a feature available in modern operating systems, every process is placed in its own address space. This isolates processes, as direct access to the memory of other processes is not available. Consequently, collaboration needs to rely on the operating system to set up inter-process communication (IPC), which allows it to enforce security policies.

As argued in [WLAG93], address space isolation of modules can lead to severe performance degradation. With isolated address spaces, remote procedure calls commonly result in multiple context switches, where the operating system has to save and restore

the state of the CPU. If encapsulated modules are tightly coupled and therefore have to use IPC intensively, a substantial overhead is introduced. For that matter, the authors propose a method called sandboxing to provide software fault isolation in programs using a single address space. To this end, applications need to be separated into independent modules. A distrusted module may then be put into its own fault domain. Each fault domain receives a contiguous region of memory out of the application’s address space. After being loaded, the object code of such distrusted modules is analyzed and modified if required. To enforce encapsulation, all write and jump instructions need to be limited to addresses of the module’s fault domain. Store or jump instructions which rely on a register, and consequently cannot be checked statically, are modified to use a dedicated register. The verifier then makes sure that this register contains an address pointing inside the fault domain at all times. This proposal relies on RISC architectures, where instructions start at fixed positions, which facilitates the detection of potentially dangerous instructions. On CISC architectures like x86, instructions are of variable length and can start at any byte. In subsequent work, the authors of [MM06] extended the sandboxing approach for use on CISC architectures. Here, artificial padding is used to align instructions. Additional optimizations were implemented, e.g. mapping guard regions before and after the address space of the fault domain to remove the need to check relative addressing with small offsets. According to their research, this kind of sandboxing shows a relatively low performance overhead while providing great improvements in security.
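The core idea can be illustrated with a small address-masking sketch (my illustration, not code from [WLAG93] or [MM06]; the fault-domain base and mask are hypothetical values). Before every register-based store, the effective address is forced into the fault domain’s contiguous memory region:

    #include <stdint.h>

    /* Hypothetical fault-domain layout: a contiguous, aligned region
     * whose base and size are fixed when the distrusted module loads. */
    #define DOMAIN_BASE 0x20000000u /* upper bits identify the domain */
    #define DOMAIN_MASK 0x000fffffu /* lower bits select a byte in it */

    /* Every store through a register is rewritten so that the target
     * address is clamped into the fault domain before the write. */
    static inline void sandboxed_store(uint32_t addr, uint32_t value)
    {
        addr = DOMAIN_BASE | (addr & DOMAIN_MASK); /* cannot escape */
        *(volatile uint32_t *)addr = value;
    }

Jump instructions are treated analogously, so control flow can never leave the fault domain except through designated entry points.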

2.2 Dynamic Content

At the beginning of the world wide web, a web browser’s purpose was to retrieve and display static web pages. Websites mostly consisted of one or more documents in the format of the markup language HTML with embedded images. At that time, only limited user interactions were possible. If a website’s dimensions are larger than the size of the browser window, scrollbars are added with which the user can instruct the browser to render currently hidden parts of the website on the screen. The key interaction feature of HTML, however, is clickable hyperlinks that redirect the browser to a given resource. This allows for simple navigation between web pages that are linked together. If the user navigates to a resource that cannot be handled by the browser, e.g. a document that is not in the format of HTML like a PDF file, the browser will offer to download the resource to the local file system. From there, the user can open it like any other file using installed programs.

2.2.1 Client-Side Programming Support

Whenever the content of a displayed web page was to change, a new web page had to be loaded. As hyperlinks only allow for simple interactions, mechanisms for manipulating web pages at the client side were considered. Setting up a build system inside the browser that downloads, compiles and executes code would require a complex tool chain at the client side. Due to the diversity in client setups, this approach turned out to be impractical. In contrast, an interpreted scripting language requires less infrastructure to be integrated. Consequently, client-side code execution was added to browsers in the form of

6 2.2 Dynamic Content

JavaScript and VBScript. JavaScript is derived from ECMAScript, a standard developed by Ecma International. It depends on the availability of a JavaScript engine. The engine parses embedded or referenced JavaScript code and executes it in a runtime environment. Due to its wide adoption in major browsers, JavaScript soon took the lead and is now largely used. As JavaScript is a client-side scripting language, it can manipulate objects in its host environment as well as receive notifications on events. Examples of such events include onload, which is triggered when the browser has loaded a webpage, or onclick, which is triggered when the user clicks on an element. Most commonly, JavaScript is used to interact with the Document Object Model (DOM) of a web page. The DOM is an interface to an object-oriented representation of a web page. Functions can be registered for specific events, like the loading of a web page or the mouse moving over a specific element of the DOM. By accessing objects of the DOM, JavaScript code may change the appearance of a set of objects or the complete web page. For example, JavaScript can be used to replace images included in a web page based on user interaction or to develop applications running inside the browser.

Concerning performance, JavaScript has long been left behind. With the growing importance of web-based JavaScript applications, all major browser manufacturers recently improved the performance of their JavaScript implementations. Techniques like compiling JavaScript to native machine code on a per-method basis at runtime or as a whole before execution, improved garbage collection and tracing resulted in huge performance improvements. Experiments show that modern browsers allow simple implementations of a Flash player in JavaScript [Sch], and that JavaScript can come close to 50% of the performance of Google Native Client.¹

JavaScript and its interpreters are designed to only allow interactions with objects provided by the host environment. Assuming a faultless implementation, JavaScript consequently cannot be used to arbitrarily interact with other programs or the operating system itself. This sandbox-like mechanism limits the range of possible applications.

Additional security policies further restrict the capabilities of JavaScript. A key principle in the browser to provide separation between web pages retrieved from different sources is the same origin policy (SOP). The origin of a resource is defined by the triplet of domain name, protocol and port. For example, a script loaded from http://example.com:80/a may access http://example.com:80/b, but not https://example.com (different protocol) or http://other.com (different domain). Based on the origin, a script’s access is limited to resources of the same origin. This allows content of the same origin to interact with each other and the website while preventing content on concurrently opened websites from accessing the first website. With the recent development of websites relying on the combination of resources with different origins to create new services, so-called mash-ups, exceptions to the SOP were needed. Instead of relying on a proxy that maps resources from other origins into the website’s origin, message-based communication between web pages was implemented. By allowing both sender and receiver to check each other’s origin, confidentiality and authenticity of data can be ensured.

¹ The implementation achieves 7 fps using JavaScript vs. 15 fps using Native Client [Met].


In addition to the SOP, browsers also apply different mechanisms for protecting a website’s resources. For example, cookies representing persistent state are not bound to the SOP but to domain and path. JavaScript, which is only limited by the SOP and consequently can access cookie resources as long as domain, protocol and port match, can be used to circumvent the path limitation. Such differing mechanisms for protecting website resources lead to “incoherencies in web browser access control policies” as outlined in [SMWL10]. These incoherencies make it difficult for browser developers to provide protected domains for untrusted code. Even worse, they present an additional burden to website developers trying to create secure web applications. Accordingly, special care has to be taken when processing user input in web applications.

2.3 Plugins

The first web browsers could only render simple HTML web pages featuring limited user interaction. If such a browser comes across a resource in an unknown format, it can only offer to save it locally. From there on, a user can open the resource as any other file. Those steps involved in opening an unknown resource type may be adequate for content that does not belong to the web page and is normally saved permanently by the user. In contrast, for other resource types like complex document or graphic formats as well as videos, it may be desirable for users and web page developers to display this kind of content embedded into a webpage inside the browser. For different reasons, the available scripting languages were not suitable for implementing resource handlers. First of all, the performance of the scripting languages was not sufficient for computation-intensive tasks. Second, the runtime environment was too limited, e.g. due to missing support for hardware acceleration. Finally, closed-source handlers could not be distributed in JavaScript, as that would reveal their source code.

As a solution, browser manufacturers could start integrating handlers for all sorts of content types into their browsers. However, this would greatly increase the code base of the browser and directly couple the development of the handling code to the browser development cycle. Moreover, proprietary handlers would need to be disclosed by their manufacturers to the browser developers. As file formats are numerous and diverse, supporting all of them is an infeasible task. In contrast, by introducing support for external software components, a browser can be extended by third-party developers. This allows new features to be added independently of browser development. Consequently, the size of the browser does not increase with each supported file format. Additionally, source code that is developed under an incompatible license can be kept separate from the browser. Having a standardized way of integrating those software components allows a software component to be used by different browsers without the need for adjustments.

2.3.1 Plugin Interfaces

To allow third-party software components to handle resources, plugin interfaces were designed and added to browsers. This section introduces the Netscape Plugin API (NPAPI), an interface used by popular browsers on different platforms. I describe the

8 2.3 Plugins functionality defined by the interface and outline current efforts to extend it. To support hardware in a platform-independent manner, the Pepper Plugin API was proposed. The section concludes with the description of the meta-plugin MozPlugger that extends the described interface for usage by native applications. A major competing technology for plugins that is used by the web browser Internet Explorer (and other Windows-applications) is ActiveX. Due to the focus on Linux based platforms and its limitation to Windows, ActiveX is not further discussed in this thesis.

2.3.1.1 Netscape Plugin API

One standardized interface for developing browser plugins is provided by the Netscape Plugin Application Programming Interface (NPAPI). Using the NPAPI, the handling of unknown resource types can be delegated to external plugins. Once installed, those plugins can seamlessly integrate into websites. Popular browsers like Chrome [Good], Firefox [Mozc], Opera [Ope] and Safari [App] as well as older versions of Internet Explorer [Mic] (up to version 5.5) support this kind of plugin.

To be loaded by the browser, a NPAPI plugin needs to declare a specific set of functions. Those functions are called by the browser to enumerate the plugin’s capabilities, to notify the plugin about new conditions like the completion of URL requests or changes of the window, to influence the plugin’s lifecycle and to configure a plugin instance. On the other hand, the browser provides an array of functions to the plugin that can be used to interact with the browser. For example, the plugin can instruct the browser to load a resource or allocate memory for the plugin, as well as force the browser to repaint the area of a web page belonging to the plugin.

The basic lifecycle of a plugin as shown in Figure 2.1 starts with the browser detecting an unknown resource type. The browser then searches for a suitable plugin to handle the content. The plugin is loaded and initialized. After the plugin has been initialized, the browser creates a new instance of the plugin. When the user leaves or closes a web page, the plugin instance is deleted. When all instances have been deleted, the browser destroys the plugin.

Furthermore, plugins and websites can communicate with each other using a JavaScript interface. On the one hand, plugins can retrieve a handle to the website that caused the plugin to be instantiated, enumerate included objects and call methods on them. On the other hand, the website can call native methods or access properties of the plugin using JavaScript. For example, a video player plugin can be controlled by buttons on a website, whereas the plugin can enable and disable those buttons depending on the state of the video player.

NPAPI plugins can choose to display data inside a website using one of two different modes: windowed or windowless. Windowed plugins receive a platform-dependent handle to an operating system native window. The plugin’s drawing operations commonly use this given window while heavily relying on the operating system’s graphics API. Additionally, a windowed plugin has to deal with the events sent to its window, or otherwise the browser may freeze. Windowless plugins on the other hand get to render their content as part of the browser’s rendering process. Again, plugins need to use the specific painting API of the current platform.


Figure 2.1: A NPAPI plugin’s lifecycle.

Irrespective of the window mode a plugin chooses, it can always directly interact with the operating system to open additional windows or to paint to the screen. In general, the NPAPI makes no assumptions about limitations of the functionality available to a plugin but rather implies that plugins become part of the browser as soon as they have been loaded. Consequently, until recently, most browser implementations handled plugin code inside the browser’s main loop. In this case, browser and plugin become tightly coupled and a fault in either of them may bring down both. Additionally, plugins inherit the browser’s permissions because they are running in the same protection domain. Further information on this topic is part of my threat analysis in Section 2.6.

With NPAPI, security depends on the plugin’s good behavior and most security considerations are up to the plugin developer. For example, plugin instances can load resources from different origins on request of a web page. To prevent information from leaking to unrelated websites, plugins are expected to respect and implement the SOP. This assures that the author of a web page cannot circumvent the browser’s SOP using a plugin.
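To make the declared set of functions concrete, the following minimal sketch shows the exported entry points of a NPAPI plugin on Unix-like systems. The function names are part of the NPAPI; the empty bodies are placeholders, so this is an illustration rather than a complete plugin:

    #include "npapi.h"
    #include "npfunctions.h"

    /* Function table supplied by the browser; the plugin calls NPN_*
     * functions through it, e.g. to request URLs or allocate memory. */
    static NPNetscapeFuncs *browser_funcs;

    NPError NPP_New(NPMIMEType type, NPP instance, uint16_t mode,
                    int16_t argc, char *argn[], char *argv[],
                    NPSavedData *saved)
    {
        return NPERR_NO_ERROR; /* set up per-instance state here */
    }

    NPError NPP_Destroy(NPP instance, NPSavedData **save)
    {
        return NPERR_NO_ERROR; /* tear down per-instance state here */
    }

    NPError NPP_SetWindow(NPP instance, NPWindow *window)
    {
        return NPERR_NO_ERROR; /* receive the native drawing target */
    }

    /* Called once when the browser loads the plugin library. */
    NPError NP_Initialize(NPNetscapeFuncs *bFuncs, NPPluginFuncs *pFuncs)
    {
        browser_funcs = bFuncs;
        pFuncs->newp = NPP_New;            /* instance creation */
        pFuncs->destroy = NPP_Destroy;     /* instance deletion */
        pFuncs->setwindow = NPP_SetWindow; /* window handling */
        return NPERR_NO_ERROR;
    }

    /* Called when the last instance is gone and the library unloads. */
    NPError NP_Shutdown(void)
    {
        return NPERR_NO_ERROR;
    }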


2.3.1.2 Pepper Plugin API

In an attempt to rethink the plugin architecture in the context of platform independence and - to some extent - security, Adobe, Google and Mozilla joined forces to create a new plugin API called Pepper [Mozb]. An important goal was to minimize the changes required to the legacy NPAPI, while adding platform-independent access to hardware resources like webcams, microphones, bar code scanners and card readers. A revised version coined Pepper2 or Pepper Plugin API (PPAPI) [Mozb] followed, reflecting the need for more substantial changes in the way plugins are integrated. For example, of the two window modes of plugins, only the windowless mode is supported. Additionally, drawing operations will use a 2D API that is based on an offscreen buffer to decouple the painting of plugin content from the rendering of the page. Although more development can be expected on the interface definitions, experimental support for PPAPI is currently being included in Chromium [Gooe].

2.3.1.3 MozPlugger

MozPlugger [BL] is a plugin that allows arbitrary applications to be displayed in the browser as media handlers. Although MozPlugger is implemented as a plugin, it does not handle any specific content type itself. Being a meta-plugin, it facilitates content handling by communicating with native applications. It is available for the Unix version of Mozilla and other Unix browsers supporting the NPAPI. The implementation of version 1.14.2 comprises around 6,000 physical source lines of code (SLOC).

Using MozPlugger, a stand-alone media player application, for example, can become a browser’s default handler for embedded video files by adding an entry to MozPlugger’s configuration file. The entry includes the media type, the path to the application that can decode the media type and options specifying how to interact with the application. One of the options specifies the title of the external application’s window. Using the title, MozPlugger can identify the window, hide it and redirect its content to the calling website. As soon as the browser encounters a website including an embedded media type which can be handled by MozPlugger, it creates a new instance of MozPlugger. The instance of the plugin then starts the external media player application as defined in the configuration file and hides the application’s native user interface. As long as the browser displays the website, MozPlugger intercepts any drawing operations to the application’s window (whose title was specified in the configuration file) and redirects them to a specific region within the website. Using this technique, the stand-alone media player becomes a browser plugin without any modifications and typically even without being aware of it. MozPlugger thus introduces an indirection layer between the browser and the media handler.

Looking at the system from a security point of view, on the one hand, MozPlugger can help to separate plugin execution from the browser. For example, as the event loops of the plugin and the browser are decoupled and run in different processes, a delay in the plugin’s start-up routine will not affect the performance of the browser’s user interface. On the other hand, the question

of how to monitor, validate and restrict a plugin’s access to resources like the network, the file system and the X server remains unanswered.
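To illustrate the configuration format described above, a hypothetical mozpluggerrc entry might look roughly as follows (the exact syntax differs between MozPlugger versions; the media type, window title and player command are example values):

    video/mpeg: mpg,mpeg: MPEG video
        swallow(MPlayer): mplayer "$file"

The swallow option names the title of the window that MozPlugger captures and redirects into the web page.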

2.3.2 Important Plugins

Recently, the number of plugins that provide websites with the functionality to execute code in the browser has been increasing. On the one hand, those plugins provide a way to add support for new media types to the browser. On the other hand, the media types - and consequently the third-party plugins - may include functionality to load and execute additional content. As the content is created and provided by an unknown and potentially untrusted third party, security concerns similar to those of loading a website apply. Current browsers have no influence on how the security measures in the plugin are implemented and consequently need to rely on the manufacturer of the plugin. This section gives an overview of widely used plugins.

2.3.2.1 Adobe Flash Player

The Adobe Flash Player is a runtime environment for third-party animations and applications inside the browser. With its support for audio and video playback, vector graphics, hardware like webcams and microphones as well as hardware acceleration and a scripting language, it is commonly used to create multimedia-centric applications. The focus on multimedia is also reflected in the available authoring tools. Most offer functionality and templates to create animations, movies or games. As described in [Ado08], Flash applications come in the form of byte code and are executed in the ActionScript Virtual Machine. The virtual machine ensures sandboxing of unrelated applications by interpreting and verifying the byte code before translating it to machine code. The API provided by Flash Player is designed to restrict the access of Flash applications to the operating system and user data. Consequently, critical calls are checked at runtime and denied if they do not conform to the security policy. With popular websites like YouTube relying on Flash, the Adobe Flash plugin became widely installed on desktop computers. Adobe itself states that

Adobe® Flash® Player is [...] reaching 99% of Internet-enabled desktops in mature markets as well as a wide range of devices [Ado].

Due to its popularity and a number of vulnerabilities recently found in the software, the Adobe Flash Player has increasingly been a gateway for attacks on desktop computers, as described in [FTJ+10].

2.3.2.2 Java Applets

Java Applets provide a way to run third-party code in a browser using the Java Virtual Machine (JVM). Java Applets consist of compiled byte code, which is executed under the control of a JVM. Applets may either be untrusted, which limits their capabilities, or trusted, which increases their access permissions and allows them to access user data or hardware.


Java provides 3D hardware acceleration, which allows for complex rendering. In contrast to Flash, the focus of Java Applets is not on multimedia applications: Authoring tools and integrated components for developing multimedia applets are missing. Currently, Java Applets are mostly used for scientific rendering, having been replaced by Flash for animations. Nonetheless, the plugin for running Java Applets is still installed in many browsers according to [Ado].

2.3.2.3 Microsoft Silverlight

Similar to Adobe Flash, Microsoft Silverlight is a platform for Internet applications focusing on multimedia. The runtime environment includes the Microsoft .NET Framework (a software platform partly standardized by Ecma International and the International Organization for Standardization) and is readily provided for systems running Windows or MacOS, as well as some mobile phones using Symbian or Windows Phone 7. An open-source implementation called Moonlight, developed by Novell, can be used on Linux platforms. Moonlight is based on Mono, which provides a runtime environment compatible with the .NET Framework. Silverlight includes support for multimedia applications, e.g. hardware acceleration for rendering or video playback as well as webcam and microphone support.

As with Flash, Silverlight applications are executed in a sandboxed mode, limiting the functionality developers can access from within an application [KM10]. Sandboxing is provided by the .NET Framework. Additionally, similar to Java Applets, Silverlight applications can be granted higher privileges by the user, allowing them to arbitrarily access user files and to communicate with other processes using the Component Object Model (COM) supported by the .NET Framework.

2.4 Content Handling in Browsers

To display a web page, the browser downloads an HTML resource, parses the resource’s markup language and decodes and loads all referenced resources. When downloading a resource which is neither HTML nor a supported media format, the browser cannot interpret or display its content. In that case, it will offer to save the resource to a local file chosen by the user. As soon as all necessary data is available, the browser tries to render the website according to the markup. If a content type is detected for which a plugin is registered, the browser will instantiate the plugin as described in Section 2.3.1.1 and allow it to render the resource as part of the web page. When all rendering is completed, the generated visual representation of the web page is displayed to the user. Depending on the web page’s content, rendering may be performed many times per second, e.g. to display videos and smooth animations or due to events triggered by executed scripts.


2.5 Browser Architecture

Until recently, most browsers applied an architecture based on a monolithic design. In those architectures, all subsystems are tightly coupled and often lack clear boundaries. The architecture aims at offering high performance while keeping resource usage low.

Despite the monolithic design, [GG06] identified eight subsystems that can generally be found in all modern browsers: user interface, browser engine, rendering engine, networking, JavaScript interpreter, XML parser, display backend and data persistence subsystem. The user interface handles the interaction between the user and other browser subsystems. It is responsible for dealing with user-generated events such as mouse clicks, managing views like tabs, preferences and scroll-bars, and relating to other applications or the window manager. The browser engine provides an abstraction of the rendering engine as needed by the user interface. It facilitates loading of a given resource and integrates basic navigation like forward, backward and reloading of web pages. On the next level, the rendering engine parses content formatted in the markup language HTML, including style information and images. From the result, the rendering engine derives a visual representation of the web page. As a basis for the rendering engine, the networking subsystem retrieves resources using protocols like HTTP and FTP. It is responsible for detecting the media type of a resource and may cache files. The JavaScript interpreter parses embedded or referenced JavaScript source code and executes the script. The XML parser subsystem generates a representation of XML documents using the Document Object Model (DOM). The display backend normally interacts directly with the operating system. It offers methods for drawing on the screen as well as reusable user interface components such as windows and widgets. Finally, the data persistence subsystem is used to save browser data like preferences and bookmarks as well as session information, or to cache resources. The described architecture can be seen in Figure 2.2.

At runtime, all components are instantiated and executed in the same protection domain. Consequently, a fault in any of them can compromise all the others. Although different subsystems serve different purposes, a successful attack on the JavaScript interpreter, for example, can take advantage of the data persistence subsystem and access available information. The same applies to plugins, where any contained fault eventually jeopardizes the security of the browser (see Section 2.6 for a detailed analysis).

Additionally, to provide common browser functionality on current operating systems, monolithic browsers require access not only to system resources but also to user data. For example, retrieval of network resources over different protocols is normally based on direct network access. Due to the missing separation of components at runtime, the browser as a whole has to be granted unlimited network access. Furthermore, uploading user files requires access to the file system in the same manner as the current user. As a consequence, browsers typically run in the user context and are provided with the same permissions as granted to the current user.


Figure 2.2: Architecture and dependencies of modern web browsers as identified by [GG06].

2.5.1 Browser Extensions

Aside from plugins, many browser architectures allow the installation of components in the form of extensions. Extensions give third-party developers the possibility to alter or add to the functionality of the browser. In contrast to plugins, extensions are generally not meant to handle a special resource type but add features to the browser that interact with different parts of the browser and the web pages. For example, an extension may block content from a web page or help the user to download specific elements of a web page. Extensions for the web browser Firefox are mostly written using XUL (a language to describe user interfaces) and JavaScript, but may also use components written in C++. As extensions become part of the browser once installed, they also receive the same level of trust. Using their manifold possibilities, such extensions may hide themselves from the extension manager, open arbitrary network connections or intercept network traffic and user input. Support for a new kind of extension is currently being developed in the form of the Jetpack API [Moza]. It supports extensions that use HTML and JavaScript and allows developers to restrict their extension’s capabilities in a way that is visible to the user. Extensions for the web browser Chrome are likewise based on HTML and JavaScript. This limits an extension’s possibilities and allows the browser to easily enforce permission-based security policies.

2.5.2 Multi-process Architectures

As mentioned before, most browsers were designed to render all pages in the same process. With the increasing functionality and complexity of browsers, for example due to the increasing usage and complexity of JavaScript and formatting information

in web pages, the need to improve the architecture of current browsers became prevalent [GTK08], [RG09]. Researchers therefore proposed, as a first step, to split the browser into separate components. In a second step, every component should then be isolated in its own protection domain. Depending on the granularity of the components and the kind of protection domain, the browser’s overall reliability and security can be improved. By containing a fault in the component that caused it, higher reliability can be achieved. On a related note, by executing components in different processes instead of a single, shared thread, browsers can take advantage of the operating system’s scheduling and – if available – multi-core processors. Consequently, by taking better advantage of the system’s resources, performance gains are possible. Finally, a process-level separation of components simplifies the use of methods provided by the operating system to isolate components, leading to improved security.

Browser manufacturers recently started to base browsers on multi-process architectures. To give a brief overview of current developments, I outline techniques used in popular browsers. Chromium [Gooa], the open-source project behind Google Chrome, follows a modular design and emphasizes the principle of least privilege. The rendering engine in place is WebKit, which can be used in one of three different modes of operation [BJRT08]: monolithic, per browsing instance or per site instance. In the per-browsing-instance mode, Chromium groups related websites into a browsing instance, which forms one protection domain. For example, if a website opens a popup window, both the popup and the website belong to the same browsing instance, share common resources and are allowed to communicate. Activating the per-site-instance mode applies the strictest policies to websites, where each website runs in its own protection domain. Although those modes of operation are a step in the right direction, there remains the need to include exceptions to make the browser compatible with current websites [RBP09]. For example, if a website includes an object from a different domain, this object may still be rendered within the website’s protection domain. Additionally, some resources like browser cookies also need to be accessible from any protection domain, so that a request for an object from a resource referenced by a cookie can include the cookie itself.

While the layout engine WebKit (used by Google Chrome and Apple’s Safari) itself does not provide special means to increase plugin security, WebKit2 follows the Chromium approach by offering a reworked framework that integrates a split process model. Starting with WebKit2 (announced in April 2010), parts of WebKit are executed in the user interface process and parts in a web process. Furthermore, WebKit2’s API allows developers using the framework to benefit from the split process model. On the one hand, this can lead to improved performance and better security of the browser, as for example systems with multicore CPUs can easily execute code in parallel. Additionally, memory corruptions are contained in the causing process. On the other hand, although WebKit2 has the potential to put its separated subsystems into separate protection domains, no mechanisms for sandboxing are currently provided. Nevertheless, plugins may benefit from the split process model due to fault containment.
Until recently, all versions of Mozilla Firefox applied a monolithic architecture where the user interface, web content and plugins run in a single process. As of version 3.6.4, support for so-called out-of-process plugins (OOPP) was introduced. OOPP

decouples plugin execution from user interface and rendering to prevent the browser from crashing in case a plugin misbehaves. OOPP is the first step of Mozilla’s project Electrolysis [Mozd], whose purpose is to implement a split process model for the browser user interface, web content and plugins. Similar to WebKit2, this may ease the isolation of browser components, plugins and websites in a future phase. To reduce the risk of known plugin vulnerabilities being exploited by attackers, Firefox received a feature to detect outdated plugins [NM]. When a new version of an installed plugin is detected, the user is notified and suggestions to update the plugin are displayed.

Originally, Microsoft Internet Explorer used a monolithic architecture. Starting with version 8, the browser uses different processes for displaying its main window, opened tabs and ActiveX components. Additionally, when running on Windows Vista, the browser can put processes into a protected mode. This mode prevents a process from writing to objects of high integrity like the user’s home folder or the Windows registry while allowing modifications to objects of low integrity like the folder of temporary Internet files. Consequently, these techniques make it more difficult for malicious code to permanently install itself on the system.

2.6 Threat Analysis

During my research I conducted various experiments to assess the current plugin interface and derive a suitable threat model. Before introducing the detailed threat model, I present my experiments and their results. Hence, this section includes a description of the current architecture’s characteristics concerning plugin security as revealed during my research. To analyze the NPAPI’s capabilities and limitations, I created different plugins that attack the browser’s availability, integrity and confidentiality. For that purpose, the plugins use and misuse the features provided by the API. The test environment consists of a host machine running Firefox 3.5.8 on top of Ubuntu 9.10 64 bit.

2.6.1 Attacking Web Browser Availability

As a first step, I analyzed the life-cycle of a plugin and the involved interactions with the browser. This analysis exposed that the browser creates and initializes plugins in the browser’s event loop. Consequently, any delay in the initialization will make the browser’s UI freeze. The plugin sect-freeze demonstrates this behavior.
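A minimal sketch of the idea behind sect-freeze (my illustration, not the plugin’s actual source): since the instance-creation entry point NPP_New is executed on the browser’s event loop, simply blocking inside it stalls the entire browser UI.

    #include <unistd.h>
    #include "npapi.h"

    NPError NPP_New(NPMIMEType type, NPP instance, uint16_t mode,
                    int16_t argc, char *argn[], char *argv[],
                    NPSavedData *saved)
    {
        sleep(30); /* the browser UI is frozen for 30 seconds */
        return NPERR_NO_ERROR;
    }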

2.6.2 Unrestricted File System Access

In the next experiment, I assessed to what extent a plugin can access the file system. It turns out that there are no limitations or validations in place and plugins can freely open resources accessible to the current user. For example, the plugin sect-file shows the listing of the user’s home directory, whose Linux file permissions limit read/write/execute permissions to this user only.

Another example of possible file access is shown by the plugin sect-device-file, which displays the content of /proc/cpuinfo. This file typically includes detailed

17 2 Background mation about the system’s processors, e.g. its vendor, model and frequency. See Fig- ure 2.3 for a screenshot taken after the plugin has loaded. The plugin sect-file-passwd reads out /etc/passwd which includes information about the users available on the sys- tem. The result is shown in Figure 2.4.

Figure 2.3: Screenshot of a plugin reading the file /proc/cpuinfo. This resource maps information about the system’s CPU to the local file system.
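The core of such plugins is nothing more than ordinary file I/O, since no policy restricts them. A minimal sketch (my illustration, not the actual plugin source):

    #include <stdio.h>

    /* From inside a NPAPI plugin, nothing prevents ordinary file I/O
     * with the permissions of the user running the browser. */
    static void dump_file(const char *path)
    {
        char line[256];
        FILE *f = fopen(path, "r"); /* e.g. "/etc/passwd" */
        if (!f)
            return;
        while (fgets(line, sizeof(line), f))
            fputs(line, stdout); /* a real plugin would render this */
        fclose(f);
    }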

2.6.3 Unrestricted Network Access

To analyze a plugin’s permissions for network communication, I created the plugin sect-socket. This plugin opens a network socket, connects to a remote host, uses the HTTP protocol to send and receive data and displays the received file to the user. As there are no limitations on the connection, e.g. which host it can connect to, which protocol it can use or which data it can send, the socket allows the plugin to exchange arbitrary data with a host of its choice. Consequently, all data that can be collected (for example by one of the plugins described before) is not bound to the host and can be transferred to remote servers.
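A sketch of the plugin’s core (my illustration; the host name passed in is a placeholder):

    #include <netdb.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Open a TCP connection and issue a plain HTTP request; nothing
     * restricts the destination, the protocol or the payload. */
    static int http_get(const char *host)
    {
        char req[128], buf[4096];
        struct addrinfo hints = { .ai_socktype = SOCK_STREAM }, *res;
        if (getaddrinfo(host, "80", &hints, &res) != 0)
            return -1;
        int s = socket(res->ai_family, res->ai_socktype,
                       res->ai_protocol);
        if (s >= 0 && connect(s, res->ai_addr, res->ai_addrlen) == 0) {
            snprintf(req, sizeof(req),
                     "GET / HTTP/1.0\r\nHost: %s\r\n\r\n", host);
            write(s, req, strlen(req));
            while (read(s, buf, sizeof(buf)) > 0)
                ; /* collected data could be exfiltrated the same way */
        }
        freeaddrinfo(res);
        return s >= 0 ? close(s) : -1;
    }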


Figure 2.4: Screenshot of a plugin reading the file /etc/passwd.

2.6.4 User Interface Forgery

By opening a maximized top-level window above the browser, a plugin can completely control the user’s visual experience. On loading of the plugin sect-window, a window outside the browser is opened. It is maximized and set to be the top-level window. As it is transparent except for the area above the browser’s address bar, the plugin can now display content of its own choice to the user. This behavior could be used by a phishing website to feign authenticity.

Not only can plugins draw to any region on the screen, they also have direct access to the user’s X server session. By design, communication with the X server is validated by the use of an authentication token called a magic cookie. As the token can be read by Firefox, and no further limitations apply to plugins, they can take advantage of all the functionality provided by the X server. For demonstration purposes, the plugin sect-X11-events registers itself at the X server to receive all input events of all opened

windows - code that could be used by a malicious developer to write a plugin-based keylogger.
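A sketch of the underlying Xlib calls (my illustration; a complete keylogger would additionally decode key codes and track newly created windows):

    #include <X11/Xlib.h>

    /* Ask the X server for key events on every existing top-level
     * window of the user's session. */
    static void snoop_keys(Display *dpy)
    {
        Window root = DefaultRootWindow(dpy), parent, *children;
        unsigned int i, n;
        XQueryTree(dpy, root, &root, &parent, &children, &n);
        for (i = 0; i < n; i++)
            XSelectInput(dpy, children[i], KeyPressMask);
        if (children)
            XFree(children);
        for (;;) {
            XEvent ev;
            XNextEvent(dpy, &ev); /* ev.xkey.keycode holds the key */
        }
    }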

2.6.5 Attacking Web Browser Integrity

Continuing on a lower level, the plugin sect-replace-function shows a plugin’s capability to detour functions by replacing machine instructions in loaded library code at runtime, as described in [HB98]. To this end, the beginning of the loaded function’s code is overwritten with an unconditional jump instruction which redirects control to a user-defined function. This function can then call a backup of the original function with modified arguments, tamper with the function’s result or execute completely different code. As an example, the function gettimeofday() gets intercepted and modified to return a fixed date in the past as soon as the plugin is loaded. The result was not only visible when using JavaScript time functions, but also triggered warning messages about invalid certificates when connecting to hosts using the Secure Sockets Layer (SSL). SSL is a cryptographic protocol that relies on certificates signed by a trusted authority. The validity of certificates is commonly limited to a specific time interval, e.g. two years. If the modified function gettimeofday() returns a date that is outside of a certificate’s validity period, the browser warns the user that the connection may not be secure. Screenshots displaying the current time using JavaScript were taken before and after detouring the function and are included in Figure 2.5 and Figure 2.6.
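The detour itself can be sketched as follows (my illustration of the technique from [HB98] for x86-64, matching the 64-bit test system; a complete implementation would first save the overwritten bytes to build the backup function):

    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Overwrite the first bytes of target with
     * "mov rax, hook; jmp rax". */
    static int detour(void *target, void *hook)
    {
        long pagesize = sysconf(_SC_PAGESIZE);
        uintptr_t page = (uintptr_t)target & ~(uintptr_t)(pagesize - 1);
        /* library code is mapped read-only; make it writable first */
        if (mprotect((void *)page, 2 * pagesize,
                     PROT_READ | PROT_WRITE | PROT_EXEC) != 0)
            return -1;
        unsigned char stub[12] = { 0x48, 0xb8, 0, 0, 0, 0, 0, 0, 0, 0,
                                   0xff, 0xe0 };
        memcpy(stub + 2, &hook, sizeof(hook)); /* imm64 operand: hook */
        memcpy(target, stub, sizeof(stub));    /* patch the prologue */
        return 0;
    }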

Figure 2.5: Screenshot taken before detouring the function gettimeofday.

Finally, complete control over the browser can be gained. Using the system call ptrace, a parent process can monitor another process, manipulate its data and interfere with


its execution. Such a plugin could, for example, intercept and change all the browser's system calls as well as modify the internal data structures of the browser. As a process cannot easily call ptrace on itself or its parent, some additional steps have to be taken. First of all, a child process is forked from within the plugin. This child process itself forks another process. By ending the intermediate process, the second child process gets re-parented to the system's init process. The re-parented child process is now unrelated to the browser process. Using this setup, the child can easily call ptrace on the browser, which the plugin sect-ptrace uses to log the browser's system calls.

Figure 2.6: Screenshot taken after detouring the function gettimeofday.
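The re-parenting trick looks roughly as follows. This is a sketch of the double fork described above, with the tracing reduced to a single attach/detach; attach_to_browser is an illustrative name and error handling is omitted.

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Called from plugin code running inside the browser process. */
static void attach_to_browser(void)
{
    pid_t browser = getpid();       /* the plugin lives in the browser */

    pid_t child = fork();
    if (child != 0) {
        waitpid(child, NULL, 0);    /* reap the intermediate process */
        return;                     /* browser continues unaware */
    }

    /* Intermediate process: fork the tracer, then exit immediately so
     * the tracer is re-parented to init and is no longer related to
     * the browser. */
    if (fork() != 0)
        _exit(0);

    /* Grandchild: attach to the browser; after PTRACE_ATTACH the tracer
     * may use waitpid on the tracee even though it is not its child. */
    ptrace(PTRACE_ATTACH, browser, NULL, NULL);
    waitpid(browser, NULL, 0);      /* browser is stopped, under control */
    /* ... log system calls via PTRACE_SYSCALL, read or modify memory ... */
    ptrace(PTRACE_DETACH, browser, NULL, NULL);
    _exit(0);
}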

2.6.6 Summary

Each of the cases shown in the previous sections, as well as their combination, can lead to a variety of attacks on the current user's data. Additionally, not only a specially crafted, malicious plugin has those capabilities. Any plugin provided by a trustworthy third party may be misused by an attacker. As soon as it includes an exploitable software bug, an attacker can set up a website that uses the flaw to inject code. This code can then take advantage of the extensive possibilities of a plugin. The Symantec Global Internet Security Threat Report shows that this is not only a theoretical problem but also a widely used attack vector:

Among the vulnerabilities discovered in 2009, a vulnerability affecting both Adobe Reader and Flash Player was the second most attacked vulnerability. [FTJ+10]


2.6.7 Threat Model and Assumptions

The goal of my architecture is to provide an interface for browser-plugin interaction and enforce its exclusive use. Accordingly, the potential effects of compromised plugins are minimized (see Figure 2.7). In this threat model, I assume that the browser is trusted and free of flaws. Additionally, the user of the browser is trusted to decide privacy-related questions posed by the browser, e.g. whether a website may access the webcam or which file may be uploaded to a network resource, without being influenced by the attacker through other channels.

Figure 2.7: Possible influences of an attacker in the threat model. An attacker has complete control over a website. Taking advantage of browser functionality, like redirects or the network component, he can send arbitrary data to an installed plugin. If the plugin is vulnerable, the attacker can gain complete control over it in this way (illustrated by the dashed line).

The attacker, on the other hand, has complete control over a website. This scenario is reasonable, as content distributed over an advertisement network can force a browser to load resources specified by an attacker. A plugin provided by a third party in binary format is untrusted. It may request and execute additional resources over the network under the control of the same-origin policy. Although the plugin may be provided

by a trusted third party, it may include vulnerabilities known to the attacker. Those vulnerabilities could be due to the fact that the focus of the plugin provider is media handling rather than security. Naturally, the architecture depends on the layers below it, such as system libraries, the underlying operating system and the hardware. As they provide the runtime environment for my implementation, those lower layers are assumed to be secure. For example, system calls used to provide sandboxing are trusted to be implemented as described in their documentation and without flaws.


3 Design

Based on my analysis of current technologies in Section 2.2 and the threat model derived in Section 2.6, the design of my architecture for untrusted web browser plugins pursues the following goals.

• The browser can enforce that information flow from and to plugins conforms to a security policy chosen by the user.

• The architecture can be used to execute multimedia plugins. To this end, data throughput should be maximized while computational overhead and event latency should be minimized.

• The interface provided by the architecture is extensible. This allows new APIs to be introduced that provide more functionality to plugins in a controlled manner.

From here on, the host, which provides the runtime environment to plugins, is commonly realized as the web browser. This plugin host may either be integrated into other browser modules or be implemented as an independent module in the browser. Client consequently refers to the plugin running as a client in the runtime environment provided by the host.

3.1 Designing a Secure Plugin API

The interface between the browser and plugins is one of the key points for providing a secure runtime environment. On the one hand, it has to give the plugin access to resources and operating system functionality. On the other hand, it should only include functionality that can be easily checked for compliance with a security policy. Applying this design, the browser can control plugins and deny potentially dangerous operations. As opposed to current plugin architectures, this would first of all allow security-aware plugin developers to rely on the browser for security measures, e.g. to enforce the same-origin policy (SOP). Second, the browser would not need to trust the third-party plugins. Finally, if a plugin contains a vulnerability, the possible damage is limited to standard plugin functionality. This standard plugin interface may provide functionality such as access to data sources, interaction with the user interface, temporary local storage as well as specially approved permanent storage and hardware devices. In contrast, unrestricted access to all user files would not be provided. During my review of currently used plugins, I identified the following key requirements for an API that supports multimedia plugins.

Connectivity and data sources Plugins require connectivity and access to different data sources. For example, to support loading of additional resources, they need to


request data from remote network locations. Advanced media handling may require access to devices like webcams, audio devices, graphics cards (for hardware acceleration), GPS receivers or printers.

Local interaction and transient storage Support for user interactions provided by a graphical user interface (GUI) as well as transient storage are needed by many plugins. By its nature, a multimedia plugin generally depends on a GUI to display its visual output. To allow temporary files, e.g. to cache data for later processing or re-use, transient storage needs to be available to plugins.

Permanent storage Two kinds of permanent storage are required by plugins. First of all, plugins may need to save user settings and therefore require storage that is available over multiple sessions but without compromising SOP. Second, plugins may need to read or write to local files, e.g. to save some media to disk on user request. Therefore, they depend on access to files, which must be specially approved by the user.

Additionally, the plugin interface should support and facilitate means traditionally provided to plugins by the operating system:

Ease of programming Implementing plugins often requires programming mechanisms like threading and memory management. These techniques must either be provided by the plugin interface or the runtime environment must facilitate their implementation within plugins.

The plugin interface outlined in this section requires all communication to be explicit. Consequently, the host can block potentially dangerous operations. If the interface is enforced on the plugin, i.e. it cannot use other means to interact with instances outside of its protection domain, it needs to be trusted less by the operating system and the browser, and its potential damage is limited. A positive side effect would be that plugin developers can concentrate on the functionality of the plugin instead of enforcing security measures like SOP.

3.2 Execution Model

In general, plugins need to deal with different event sources like the graphical user interface, timers or hardware devices at the same time. Additionally, some plugins require larger amounts of multimedia data from a local or remote location which may not be available immediately. As this data is delivered in smaller segments, the architecture needs to provide some means to efficiently deliver this out-of-band data to the plugin. Furthermore, current plugins often have to fulfill different complex tasks in parallel. For example, a video player plugin may need to constantly download a video stream from a network location while at the same time decoding the most recently received frames, drawing the decoded frames to the screen and reacting to user interactions. Moreover, a video player plugin requires fine-grained control over the tasks involved to guarantee a continuous display of the video. Consequently, a model supporting the execution of

different threads which need to communicate and support priorities is crucial to the implementation. The following sections outline different execution models, describe their properties and explain the reasons for selecting one of them for implementation in my plugin architecture.

3.2.1 Threading Model

The threading model is a widely-used execution model, included for example in the POSIX API. Implementations of this model allow the user to run a number of threads seemingly or genuinely in parallel. Preemption and a scheduling algorithm combined with thread priorities ensure regular execution of all threads. Threading is limited to the threading facilities of the underlying operating system. As operating systems traditionally favored synchronous interfaces over asynchronous ones, blocking read/write operations were common while means like signals were hardly available. Therefore, implementations based on events require complex locking to ensure data structures are in a consistent state at all times. As locking introduces a computational overhead and threads need to wait for other threads to finish, the overall performance may degrade depending on factors like the number and frequency of events or the granularity of locking.

3.2.2 Event Model Using Callbacks

In an event-driven execution model, implemented functions are designated to be executed upon specific events or conditions. Event-driven systems apply run-to-completion semantics: once an event handler has been started, it runs to completion. Assuming that event handlers do not run for long periods, implementations are commonly based on a threading model where the same thread is used for driving the GUI and executing callbacks on events. As any long-running code sequence in a callback would freeze the system, the sequence has to be cut into smaller parts. The execution of smaller parts allows other callbacks to be executed in between, i.e. to quickly react to user events and refresh the GUI. Consequently, preemption of tasks must be implemented at application level, which complicates the design. Additionally, this approach requires a complex model to handle input arguments and return values of functions, as their start, execution and termination may take place in different callbacks.

3.2.3 Virtual CPU Model

Execution on computer systems is normally asynchronous; for example, an interrupt may arrive at any time, causing the operating system to execute an interrupt handler. Consequently, threads commonly need to respond to asynchronous events. Nevertheless, traditional thread models apply a synchronous scheme. In that case, a thread either performs computational tasks or waits for events. With waiting being an operation that blocks the thread, it is not possible to simultaneously do both. Supporting the extension of the synchronous thread model, the virtual CPU (vCPU) model is an execution abstraction that strongly resembles physical CPUs [LWP10]. It models upcalls and virtual interrupts, features that are commonly seen in systems on

physical CPUs. Consequently, this abstraction allows for synchronous execution while supporting control flow diversions on events. Using vCPU interfaces, threading can be implemented at user level. A user-level implementation allows for light-weight threads and customization, e.g. to support particular needs like thread priorities or special scheduling requirements. As the vCPU model continues to support the traditional fully sequential model, and existing applications for the most part rely on this sequential model, their execution on vCPUs involves low development effort as they require few modifications.

A vCPU generally relies on the following fields. The state indicator is a flag to enable and disable event delivery. It can be set and cleared by the client, and delivery of an event disables delivery of subsequent events. Upon event delivery, the state of the client is written to the state save area, from where it can be read by the client. This is necessary since event delivery asynchronously transfers control in the client, which requires overwriting the CPU's instruction pointer and other registers. Using the state save area, the client can save the state to internal data structures and restore it at a later point. The state save area is to some extent similar to IA32's kernel stack, which receives the execution context upon kernel entry. Finally, the entry point vector consists of a stack pointer and an entry point, to which control flow is diverted during event delivery. Again, there are some correspondences to IA32, where the new instruction pointer is provided by the interrupt description table and the stack pointer may be available in the task state segment.

The API of a vCPU comprises two functions, one to enable and one to disable the asynchronous notifications. The client may disable notifications at any time. When notifications are disabled, the vCPU client can execute without being interrupted by the vCPU host. This may be necessary during manipulations of threading data structures or to provide means of user-level synchronization. After being interrupted due to an event, the vCPU can be resumed by using an upcall. As explained in Section 3.2, threading support for a client that interacts with multiple event sources is an important requirement of the architecture. As the vCPU model compares favorably with both the traditional threading model and an event model using callbacks, I base my architecture on a vCPU execution model.
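As a rough sketch, the shared vCPU fields described above could be laid out as follows; the structure names and the exact register set are illustrative assumptions, not the layout used by the thesis implementation.

#include <stdint.h>

/* Execution context written to the state save area on event delivery;
 * the field set mirrors an x86-64 register file. */
struct vcpu_state {
    uint64_t rip, rsp, rflags;
    uint64_t gpr[16];               /* rax .. r15 */
};

/* Fields shared between host and client on one memory page. */
struct vcpu_shared {
    volatile uint64_t flags;        /* state indicator: event delivery
                                       enabled/disabled, pending bits */
    struct vcpu_state save_area;    /* client state at last interruption */
    struct {
        void (*entry)(void);        /* control is diverted here on events */
        void *stack;                /* stack used while handling them */
    } entry_point_vector;
};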

3.3 Information Flow

The communication between host and client requires a carefully designed protocol that ensures the client’s privileges can only be elevated with the user’s consent. For example, a client may only access a user’s file after special approval by the user.

3.3.1 vCPU System Calls

To allow a client to transfer data to the host, a system call mechanism can be designed into the vCPU. It is used by the client to request the execution of instructions with higher privileges, e.g. to read a user's file or display data on the screen. For this purpose, the design requires the client to disable event delivery, save information about the call to the host at a specific, shared memory location and cause a change in

the client's state which stops the child and is detectable by the host. The host then reads and verifies the arguments from shared memory. In case the arguments are valid, the parent may execute code of higher privileges to handle the request and continue the child at its entry point vector. Any immediate return value to the client can be provided using shared memory. This method can be used for synchronous as well as asynchronous calls to the vCPU host. In the case of synchronous system calls, the result can be written to a shared memory area before the client is resumed. For example, if the client currently has no work to do and requests to be suspended until the next event is available, the call to this sleep function appears synchronous to the client. For asynchronous system calls, the client is resumed in its entry point vector before the system call has completed and the result is ready. When the result finally becomes available, it is delivered to the client using one or more events. As an example, when the client requests a large network resource, the data may arrive slowly and split into smaller parts. In this case, the requesting call returns immediately, but the data is delivered as it becomes available, in chunks, using events.

3.3.2 Event Handling

Different approaches can be used to deliver data in the form of events to the vCPU client. Below, I present four solutions and briefly describe their advantages and disadvantages. Special cases and optimizations in the form of pending events and data events are described in the following Sections 3.3.3 and 3.3.4.

Polling A simple implementation of event delivery can rely on polling. By regularly inquiring of the host whether data has arrived, using a vCPU system call, the client controls at which point of execution an event is delivered. This way, the client not only gets to decide how often it interrupts its normal execution to handle events, but can also limit the rate of events. Consequently, this method is commonly implemented either using busy waiting, where the client's only purpose is to check for data, or by checking for data at defined points in between other calculations. The drawbacks of this approach manifest themselves in latency and overhead. Repeatedly switching into the vCPU to check for or deliver events may create performance overheads, with busy waiting inducing a high load on the system. By checking for events in between other calculations, undesirable latencies may be introduced, as the client has no way of handling events as soon as they become available.

Event Notifications A second approach takes advantage of event notifications followed by the delivery of the event upon request of the client. It solves the latency problem by making the host interrupt the client to push an event notification to the client as soon as an event becomes available. The client may retrieve the data promptly or at a later point in time using a vCPU system call. This way, the client still has control over when to react to an event and can handle it immediately. Consequently, the execution of the client may be interrupted at any point and will be interrupted for every single event.


Event Notifications with 1-Element Buffer A third approach which applies a buffer for a single event in addition to event notifications requires a shared memory area and a protocol that specifies how a client and a host process access this memory area in a synchronized manner. The shared data includes the following elements: A state indicator specifying if events can be delivered, an event flag denoting if at least one event has been made available by the host and finally a buffer to save the actual event. To allow different types of events, instead of having only one buffer, a buffer for each event type can be introduced. Additionally, every event type needs a flag indicating if the host wrote an event into the corresponding buffer. Event delivery can then be implemented as follows. The host checks if the event can be saved in the buffer corresponding to the type of the event. If the event can be saved, it is written to the buffer, the buffer is marked as being in use and the flag for the event type is set. In the second case, when the buffer is already in use, the delivery of the event needs to be retried later or the client needs to use a vCPU system call to poll the event after the buffer has been emptied. For a more elegant solution that copes with this situation see Section 3.3.3. In either case, the host afterwards sets the event flag while simultaneously making sure interruptions are disabled. If this fails, the host needs to stop the client, then disable interruptions and set the event flag. If interruptions have been enabled before the client was stopped, the client’s current state gets saved to the state save area while resuming the child’s execution at the entry point vector. If interruptions were still disabled, the client is resumed without modifying its state. As the event flag is set, the client will realize that events were delivered in the meantime and will need to handle them before enabling event delivery.

Event Notifications with n-Element Buffer Extending the previous approach, a buffer for multiple events can be used in combination with event notifications. As opposed to using only buffers that hold a single event, this fourth approach applies a ring buffer data structure to allow the host to deliver multiple events before events need to be queued inside the host. Consequently, this technique minimizes the number of times events need to be polled by the client. By choosing a ring buffer that is sufficiently large, the host may only fail to deliver an event in exceptional cases, therefore reducing the number of times a system call needs to be used for event polling and improving performance. Due to these performance advantages, I base the implementation of event handling on this combination of event buffers and event notifications.

3.3.3 Handling of Pending Events

Occasionally, events come in faster than the client can handle them and may even fill up the buffer for multiple events as described in the previous Section 3.3.2. To handle those conditions, a pending event flag may be introduced globally or a separate one for each event type. It is set by the host if there are events available in the host but there is no empty position in the appropriate buffers. As is the case with normal event delivery, the host needs to set the event flag and make sure interruptions are disabled. From now on, the host will not deliver any events of this type until the client resets the flag and requests delivery using the corresponding vCPU system call. After removing at

least one event from the buffer, the client may reset the pending events flag and request the delivery of the queued events from the host. The introduction of this pending events flag offers improvements in efficiency. First of all, repeated unsuccessful delivery of events due to the lack of buffer space induces an overhead on the host. With the host pausing event delivery as soon as the first event could not be delivered, this overhead is reduced. With events bursting in, this optimization becomes especially useful. Second, delays between the client being ready to process new events and the host detecting this condition after it paused event delivery are minimized.

3.3.4 Handling of Data Events

Plugins are often used to handle multimedia data. For instance, video players depend on receiving data events in a timely manner: during the playback of a video, lagging should be minimized and sound and video should stay synchronized, while at the same time network packets need to be decoded. Consequently, many plugins require means to handle data events of different priorities. In this context, data events may either include the actual data or a reference to a shared memory location from where the data can be read. Especially for larger amounts of data, the second approach minimizes unnecessary copy operations. Based on the event handling as described in Section 3.3.2, data events require a finer-grained approach. To support simultaneous delivery and handling of different data events with different priorities, e.g. network data and file data, I use a two-level architecture. This allows events of higher priority to have a lower latency. Two levels are required to decouple the handling of the event notification from the actual handling of the event's data. The first level comprises the handling of the event notification in the entry point vector. This time, instead of reading the data, a waiting handler thread for the event type is removed from the waiting queue and added to the queue of runnable threads. As soon as the event handler thread gets scheduled by the threading library, it reads the events from shared memory and marks the buffer as being empty. At the discretion of the user-defined event handler thread, the data may now be made available to other user-level threads. By delegating the handling of data events to a user-level thread, it can be executed with interruptions enabled while becoming subject to user-level scheduling. Consequently, if the priority of a newly woken event handler thread is sufficiently high, it may cause the current thread to be preempted. This results in a clean interface. First of all, as the entry point vector is executed while interruptions are disabled, the handling of the notification is correctly synchronized with the host and data event handlers. Second, the time spent in the entry point vector to handle a data event is minimized and only comprises scheduling of the appropriate event handling thread. Third, the event handling thread can be assigned a high priority which causes the scheduling algorithm to favor its execution, leading to lower latencies. Finally, actual work on the delivered data can be done in user-level threads which may synchronize with the event handler thread using user-level synchronization mechanisms such as mutexes or semaphores. Details on the design of user-level synchronization mechanisms are included in Section 3.5.
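In compressed form, the two levels might look as follows. The thread-library primitives and names are hypothetical stand-ins for illustration, not the interface of the thesis implementation.

struct thread;                         /* thread control block (elsewhere) */
extern struct thread *net_handler;     /* high-priority handler thread */
extern void make_runnable(struct thread *t);
extern int  ring_buffer_remove(void *rb, void *ev_out);
extern void *net_event_buffer;         /* shared with the host */

/* Level 1: runs at the entry point vector with event delivery disabled.
 * It only wakes the handler thread; no event data is touched here. */
void on_data_event_notification(void)
{
    make_runnable(net_handler);  /* scheduler may preempt current thread */
}

/* Level 2: an ordinary user-level thread, subject to priority scheduling. */
void net_handler_main(void)
{
    struct { unsigned char *data; unsigned len; } ev;
    while (ring_buffer_remove(net_event_buffer, &ev)) {
        /* The data now belongs to user code; hand it to consumer threads,
         * synchronizing with mutexes or semaphores as needed. */
    }
}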


3.3.5 Host Main Loop

The purpose of the host main loop is twofold. On the one hand, it delivers events to the client; on the other hand, it reacts to status changes in the client which usually are caused by vCPU system calls. Therefore, the main loop first checks the interval since it delivered the last timer event to the client. If a specified interval has passed, a timer event is delivered to the client. Afterwards, available data events are delivered if the client is currently able to receive them. Then the host requests a timer and waits for a status change of the client. If the host continues due to the timer running out, the loop resumes with event delivery as before. Otherwise, the host cancels the timer and analyzes the status of the client in the following order:

• If the plugin has exited, the plugin’s resources are released.

• If the plugin was terminated by a signal, the plugin’s resources are released and the user is notified.

• If the plugin is currently stopped due to executing an illegal instruction, due to receiving a trap signal or an abort signal, the plugin is terminated, its resources are released and the user is notified.

• If the plugin is currently stopped due to receiving a trap signal which was caused by the client trying to execute a system call, the plugin is terminated, its resources are released and the user is notified.

• If the plugin is currently stopped due to accessing an invalid memory location, the host needs to check if the client requests a vCPU system call and handle it as described in Section 3.3.1. If the plugin did not stop to perform a vCPU system call, the plugin is terminated, its resources are released and the user is notified.

• If the plugin is currently stopped due to a stop signal, host and client resume without further action. This condition may occur after the client got stopped for the delivery of an event but had already stopped in the meantime, e.g. for a vCPU system call.

Consequently, the host is either (1) delivering data to the client, (2) executing privileged calls or (3) waiting for the client to change its status. The time spent in (1) and (2) is minimized by delivering large amounts of data using shared memory as well as providing asynchronous notifications about the result of system calls. Therefore, the main loop’s waiting time for status changes (3) is maximized which guarantees low latency.

3.4 Scheduling

The threads running inside the client can have different priorities as argued previously (see Section 3.3.4). Therefore, the threading library includes data structures for each priority level pointing to threads, which are ready to run. When a scheduling decision is needed, the thread with the highest priority and the longest waiting time gets scheduled.


Threads of the same priority are handled in a round-robin manner and regularly receive time slices of equal size. In detail, to allow different preemptable threads to be executed in a defined manner, the following method is applied. For each priority, the library includes a possibly empty queue of runnable threads. When a scheduling decision has to be taken, the library searches the queues of runnable threads for the non-empty queue of highest priority. In case all queues are empty, the library puts the client process into sleep mode using the corresponding vCPU system call. Otherwise, the library removes the first thread from the queue, marks it as currently running and calculates the tick count at which the thread should be preempted. Afterwards, it restores the thread's execution state and resumes it. As soon as the threading library receives a timer event, it checks how much of the thread's time slice is remaining. If the time slice is used up, the next thread is scheduled as described before. If not, the current thread is resumed once more. Preempted threads are added to the end of their priority's run queue, their state attribute is set accordingly and their execution state is saved to the thread's data structure. As scheduling decisions are commonly initiated by timer events, a thread may exceed its time slice by n nanoseconds, where 0 ≤ n < TimerTickInterval. To limit the effect of threads exceeding their time slice, the interval between two timer ticks should be a fraction of the execution time per time slice.

Providing threads with functionality to voluntarily suspend execution for a specific amount of time, becoming a so-called sleeping thread, can be designed as follows. First, the thread calls a library function that stops the thread's execution and saves its state. Then, the library can add the thread to an internal list of sleeping threads while associating it with its wake-up time. Upon reception of a timer tick, the library needs to inspect the sleeping threads' wake-up times. If a thread's sleeping interval has passed, the thread is moved to the ready queue corresponding to its priority. As checking for sleeping threads occurs often, a straightforward optimization keeps sleeping threads in a queue that is sorted by increasing wake-up timestamps. Consequently, only the first sleeping thread's time interval needs to be inspected to decide if there is a thread that needs to be resumed. The woken thread's priority may be higher than that of the current thread. Consequently, the current thread may get preempted before it has depleted its time slice. A similar condition is met when a new thread with higher priority is created by a thread with lower priority, again leading to early preemptions. As the scheduling does not take into account whether a thread was previously able to deplete its time slice, it may not be completely fair under some circumstances. Improvements may be subject of future work as described in Section 7.2.

Modifications to the thread queues are only allowed when running in uninterruptible mode. Consequently, before adding a thread to or removing a thread from a queue, and before changing a thread's attributes, interruptions are disabled. This guarantees that the thread-related data structures are in a consistent state when switching between different threads.


3.5 Synchronization

In the described vCPU setup, the host may interrupt the client due to an event and resume it using an upcall. As this results in a context switch in the client, special care has to be taken when manipulating shared data structures. For example, scheduling of new threads modifies shared data structures and may leave the client in an undefined state if interrupted. Moreover, threads are subject to the scheduling of the user-level threading library. Consequently, threads may get preempted at any point in their execution, again posing the risk of inconsistent data. In summary, the architecture needs to provide or facilitate means of synchronization.

3.5.1 Thread-Synchronization with Upcalls

To protect a critical section in user code against upcalls, a thread can take advantage of the functions to enable and disable notifications provided by the vCPU interface. Before entering a critical section, the client disables event notifications. From then on, the client's control flow will not be interrupted by the host and the client can safely modify its global data structures. Naturally, the underlying operating system may still transparently interrupt and resume the client. To exit the critical section, the client enables event notifications (which may immediately require the client to handle events as described in Section 3.3.2). This method can not only be used by user-level threads but is also used in an upcall to prevent additional upcalls while an upcall is being processed. Therefore, the host disables notifications when diverting the client's control flow to the entry point.
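The resulting pattern is short; in sketch form, with illustrative names for the two vCPU API functions:

struct list;  struct node;                       /* defined elsewhere */
extern void list_insert(struct list *l, struct node *n);
extern void vcpu_disable_events(void);
extern void vcpu_enable_events(void);  /* may branch to the entry point
                                          vector if events were queued */

void list_insert_protected(struct list *l, struct node *n)
{
    vcpu_disable_events();   /* the host queues events from here on */
    list_insert(l, n);       /* safe: no upcall can interleave */
    vcpu_enable_events();    /* queued events are handled on return */
}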

3.5.2 Thread-Synchronization with other Threads

Running different threads within the client leads to the need for means of communication and synchronization between them. For example, one thread may be processing data provided by another thread. Building on scheduling as described in Section 3.4, methods for user-level thread synchronization can be designed. By introducing a blocked thread state, threads can be preempted and added to specific lists of blocked threads. This can be used to provide means for mutually exclusive access. A thread that has to wait enters the blocked state. When a specific condition is met, e.g. during an upcall or due to a different thread's execution, a blocked thread can be moved from the list of blocked to the list of ready threads. Checking if exclusive access can be granted, as well as modifications to the shared list of blocked threads, again needs to be protected against upcalls as described before. This way, mechanisms similar to the ones defined in the POSIX standard can be implemented: mutex, condition variable and semaphore. As blocked threads do not consume computational power, the presented method provides an efficient way of synchronizing threads.

3.5.3 Thread-Synchronization with Upcalls and other Threads

Under some conditions, a user-level thread may need to synchronize with upcalls as well as other threads. In that case, it can use both synchronization methods as described in

the previous sections: synchronize with upcalls by enabling/disabling notifications and with other threads by means of user-level synchronization.


4 Implementation

As described in Section 3.2.3, the vCPU concept provides features needed by a restricted user-level plugin architecture. Therefore, I chose it to be the base of my implementation and built the plugin architecture on top of it. I implemented the architecture on 64-bit Ubuntu Linux 9.10 featuring a Linux kernel of version 2.6.31 and glibc of version 2.10.1.

4.1 vCPU

Implementations of the vCPU model need to share the following information between host and client: a state indicator, a state save area, a pending events flag and an entry point vector. First of all, a state indicator showing if the client can currently be interrupted for event delivery is needed. Second, after resuming the client with an upcall, its previous state at the time of the interruption has to be provided in the state save area. Third, a flag is required to indicate if pending events need to be handled by the client. Finally, the entry point vector specifies the address at which the client's execution should be resumed after an event was delivered by the host.

4.1.1 Sandboxing Using ptrace

Sandboxing of the client is implemented in two stages. First of all, the host that includes the vCPU internals is protected from the client by executing the client in a separate address space. This ensures that the client cannot read or write the host's data structures. Second, implementing a vCPU in user space requires a means to monitor and manipulate a child process while being able to prevent interaction between it and the operating system. On Linux platforms, the ptrace system call allows a parent process to trace another process. The parent can specify under which conditions the child shall be stopped. When being traced, the child is stopped whenever a signal is delivered to it. Additionally, the parent can request the kernel to stop the child after every instruction (single-step mode) or on entry to and exit from system calls. After the child has been stopped, the parent can use ptrace to analyze the child's memory, registers and the reason for the interruption, modify the child's data and state, and finally resume the child's execution at the parent's discretion. To read the child's status, the parent may take advantage of the wait system call, which normally returns after the child stops or terminates, but can also be set to return immediately. In this implementation, the parent process is attached to a waiting child process. From this point on, the parent process periodically examines the state of the child. As ptrace makes sure the child is stopped on entry to and exit from a system call and whenever a signal is delivered to it, the parent has sufficient control over the child. First of all, this is used to block Linux system calls as well as to detect segmentation

faults. Section 4.1.5 explains how vCPU system calls are implemented on top of this. Second, the parent can deliberately stop a running child to inject events and initiate event handling as described in Section 4.1.3. The child's execution state is retrieved using ptrace and saved to shared memory. Third, any misbehavior on the part of the child leading to signals like SIGILL or SIGABRT, an unexpected SIGSTOP or a premature exit of the child is detected by the parent in the same way. The first design goal (refer to Section 3) states that information flow from and to plugins conforms to a security policy chosen by the user. In line with this design goal, I evaluated the consequences of executing potentially dangerous calls in the vCPU client. To this end, I implemented a user-level program that calls various Linux system calls. On execution, the host successfully detected and blocked the Linux system calls and terminated the client process. As access to Linux system calls is denied, the plugin process is truly sandboxed and can only communicate using the provided API.

4.1.2 Setup of Host-Client Interaction

Host-client interaction is defined by the vCPU as outlined in Section 3.2.3. To set up host-client interaction, the vCPU host needs four arguments:

• pointer to a function that prepares the execution of the child (prepare_client function)

• pointer to a function that executes the client (run_client function)

• pointer to a function that releases any resources after the client exited (destroy_client function)

• an integer specifying the time span between two timer events (msec_per_tick)

The process to execute the client is created using the fork system call, which creates the child process by duplicating the calling process. The child receives copies of data available in the parent, including file descriptors. Consequently, by mapping a shared memory region into the host (using mmap) before calling fork, client and host both possess a handle to the same memory area. Therefore, as a first step, the host requests shared memory from the operating system to share an interrupted client's state, flags and events as well as a system call id and its arguments between the host and the client process. Then the given prepare_client function is executed, which may set up objects needed by the client. Afterwards, the client process is created using the system call fork. As the previously allocated memory was marked as shared, changes to it will be visible in both processes. The newly forked client process runs with the same permissions as the parent and needs to be put under the control of ptrace. Therefore, the client's first instructions are to request being traced by its parent and to subsequently raise a trap to synchronize with the host. For that reason, the parent waits for the child's first status change after the fork. When this call returns, the client's execution is monitored by the host and communication using the vCPU protocol for system calls and event delivery on shared memory is initialized.
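The setup reduces to a few calls; the sketch below follows the steps just described, with error handling omitted and the page size as a placeholder.

#include <signal.h>
#include <sys/mman.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork the client under ptrace with one page of shared memory. */
pid_t start_client(void (*run_client)(void *), void **shared_out)
{
    /* Map the shared page before fork so both processes reference it. */
    void *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL); /* request tracing */
        raise(SIGTRAP);                        /* stop: sync with the host */
        run_client(shared);                    /* reached once resumed */
        _exit(0);
    }

    waitpid(child, NULL, 0);  /* returns at the child's initial trap */
    *shared_out = shared;
    return child;             /* the host now enters its main loop */
}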


From this point on, the host enters its main loop which is responsible for different actions. First of all, it monitors the client. Second, it executes privileged calls on behalf of the client. Third, it transfers data from the client to the host using system calls and from the host to the client using events.

4.1.3 Event Handling

I described different approaches to event handling in Section 3.3.2. As outlined there, I implemented a ring buffer of fixed size for each event type on a shared page. The ring buffer data structure includes a head pointer, a fill counter, a pending flag and memory for a fixed number of elements. A set pending flag indicates that there are additional events queued inside the host. To make sure the head pointer and fill counter as well as the pending event flag are in a consistent state at all times, they are located in the same uint64_t memory location, which allows for atomic modifications. The host uses an add operation that writes an element into the next empty position in the buffer ((head + fill) mod SIZE) and increases the fill counter by one. The client uses a remove operation which marks the position the head pointer currently references as empty by advancing the head pointer (which wraps around) and decreasing the fill count. To read the next element, the client needs to make sure the fill counter indicates available elements before reading the element at the position referenced by head. After reading all events from the buffer, the client checks the pending events flag. If it is set, the client needs to request the delivery of any queued events using a vCPU system call as described in Section 4.1.5, or new events of this type will not be delivered. Using this setup, the ring buffer can be configured as a buffer for a single element or a buffer for multiple elements by setting its SIZE. Experiments comparing the different modes can be found in Section 5.5.2.
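A sketch of the host-side add operation is shown below. The bit layout of the combined control word and the structure names are illustrative assumptions; the point is that head, fill and pending flag change under a single compare-and-swap.

#include <stdint.h>

struct event { uint64_t type; uint64_t data; };

#define RB_SIZE 16  /* capacity; 1 yields the single-element buffer mode */

struct ring_buffer {
    volatile uint64_t ctrl;      /* [pending:1 | fill:16 | head:16] */
    struct event slot[RB_SIZE];
};

#define MAKE_CTRL(h, f, p) \
    (((uint64_t)(p) << 32) | ((uint64_t)(f) << 16) | (uint64_t)(h))
#define HEAD(c)    ((unsigned)((c) & 0xffff))
#define FILL(c)    ((unsigned)(((c) >> 16) & 0xffff))
#define PENDING(c) ((unsigned)(((c) >> 32) & 1))

/* Host side: append an event, or set the pending flag if the buffer is
 * full so the client later polls using syscall_deliver_event. */
int rb_add(struct ring_buffer *rb, const struct event *ev)
{
    for (;;) {
        uint64_t c = rb->ctrl;
        if (FILL(c) == RB_SIZE) {
            if (__sync_bool_compare_and_swap(&rb->ctrl, c,
                    MAKE_CTRL(HEAD(c), FILL(c), 1)))
                return 0;    /* event stays queued inside the host */
        } else {
            rb->slot[(HEAD(c) + FILL(c)) % RB_SIZE] = *ev;
            if (__sync_bool_compare_and_swap(&rb->ctrl, c,
                    MAKE_CTRL(HEAD(c), FILL(c) + 1, PENDING(c))))
                return 1;    /* delivered to shared memory */
        }
    }
}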

4.1.4 Host Main Loop

As described in Section 3.3.5, the host needs to be able to wait for a status change in the client. For this purpose, I use the system call waitpid in my implementation. This system call does not return until the client's status changes, i.e. the client is stopped or terminated. Using predefined macros (see the manual of waitpid for details), the reason why the client stopped can be identified. As the host not only needs to react to client-side status changes but, for example, also has to deliver events to the client, it sets an interval timer that interrupts the call to waitpid. Therefore, the host uses setitimer with a timeout of 100 µs. The corresponding signal handler is set to a function with an empty body. When waitpid returns, return and error values are inspected. If the return value is −1 and errno == EINTR, the host detects an interruption due to the signal handler. In other cases, the alarm is canceled and the state of the client is inspected. For event delivery by the host main loop, atomic setting of flags is required. Therefore, a partly hierarchical structure of flags is saved in a uint64_t in shared memory. It comprises the state indicator, a flag specifying if unhandled events are available and a local flag for each event type specifying if unhandled events of this type are available.
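The waiting part of the loop might look as follows. The helper functions are hypothetical stand-ins for the host logic described in Section 3.3.5; note that the alarm handler is installed without SA_RESTART, so waitpid returns with EINTR instead of being restarted.

#include <errno.h>
#include <signal.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>

extern void deliver_timer_event_if_due(void);   /* hypothetical helpers */
extern void deliver_pending_data_events(void);
extern void handle_client_status(pid_t client, int status);

static void on_alarm(int sig) { (void)sig; }    /* empty: interrupts waitpid */

void host_main_loop(pid_t client)
{
    struct sigaction sa = { .sa_handler = on_alarm }; /* no SA_RESTART */
    sigaction(SIGALRM, &sa, NULL);

    for (;;) {
        deliver_timer_event_if_due();
        deliver_pending_data_events();

        /* Arm a 100 us one-shot timer so waitpid cannot block forever. */
        struct itimerval t = { .it_value = { .tv_sec = 0, .tv_usec = 100 } };
        setitimer(ITIMER_REAL, &t, NULL);

        int status;
        if (waitpid(client, &status, 0) == -1 && errno == EINTR)
            continue;                         /* timer fired: deliver again */

        struct itimerval off = { 0 };
        setitimer(ITIMER_REAL, &off, NULL);   /* cancel the alarm */
        handle_client_status(client, status); /* exit / signal / syscall */
    }
}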


Using data structures in shared memory as described in Section 4.1.3, the host delivers events as follows. If an empty space is available in the shared memory and there are no events queued in the host, the event is saved to shared memory. Otherwise, the pending event flag of the buffer for the specified event type is set. In both cases, the local flag is set. Afterwards, the notification may need to be pushed to the client. If upcalls are currently disabled by the state indicator, the host can immediately set the global event flag. Otherwise, the client needs to be interrupted by sending it the signal SIGSTOP. Then upcalls are disabled and the global event flag is set. In case upcalls were previously enabled, the client is resumed at its entry point vector. In case upcalls were disabled, the client is resumed without interfering with its execution context. For scheduling and other time-dependent activities in the client, the host periodically provides the client with time information. Therefore, the host main loop updates a variable in shared memory with the amount of ticks since the start of the vCPU. It is measured in milliseconds using gettimeofday and converted to ticks by dividing it by msec_per_tick. Then it sends a timer event to the client. At the entry point vector, the client reads the value from shared memory and, for example, the threading library (refer to Section 4.2) preempts the active user thread and initiates scheduling. Before each call to waitpid, the host main loop delivers a timer event if a specified interval has passed.

4.1.5 vCPU System Calls

Another crucial feature of the architecture is vCPU system calls, which need to provide an efficient and secure way for the client to initiate calls of higher privileges. Therefore, a protocol constraining the communication between client and host was defined and implemented as explained in this section.

4.1.5.1 vCPU System Call Execution

In a first step, the client disables interruptions by setting the corresponding flag on the shared page. Then the client writes the id of the system call as well as the required arguments into a shared memory region. Afterwards, the client triggers a segmentation fault at a specific memory address by writing the value 0 to it. This memory address must be known to both the host and the client. The segmentation fault causes the host's call of ptrace to indicate a state change in the client, which the host uses to analyze the segmentation fault. If it can detect a valid system call, it will trigger the system call on behalf of the client. Depending on the call, there are different paths from here: a result may be written to a defined shared memory region, the execution of the client may be resumed at its entry point vector or in a specified state, or the client may remain stopped or even be killed. If the client is resumed but the result of the system call is not yet available, the client's thread library may queue the calling thread and schedule a different one until the result is available.
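On the client side, the whole mechanism fits into a small stub. The structure layout, flag bit and trigger address below are illustrative choices; the only requirements are that host and client agree on them and that the trigger page is unmapped.

#include <stdint.h>

/* System call area on the shared page (names are illustrative). */
struct syscall_area {
    volatile uint64_t id;
    volatile uint64_t arg[4];
};
extern struct syscall_area *sysarea;    /* mapped before fork */
extern volatile uint64_t *vcpu_flags;   /* state indicator, shared page */
#define INTERRUPTIONS_ENABLED 1ull

/* Unmapped address known to both sides; writing to it faults. */
#define SYSCALL_TRIGGER ((volatile uint64_t *)0xdead000)

static void vcpu_syscall(uint64_t id, uint64_t a0, uint64_t a1)
{
    *vcpu_flags &= ~INTERRUPTIONS_ENABLED;  /* 1. disable interruptions */
    sysarea->id = id;                       /* 2. publish id and arguments */
    sysarea->arg[0] = a0;
    sysarea->arg[1] = a1;
    *SYSCALL_TRIGGER = 0;                   /* 3. fault: the traced client
                                                  stops, the host inspects
                                                  the shared page */
    /* Execution continues here only after the host resumed the client;
     * for synchronous calls the result is already in shared memory. */
}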

4.1.5.2 vCPU System Call Types

Implemented system calls include:

syscall_exit Causes client and host to exit.

syscall_sleep Suspends the client until an event, for example a timer tick, is signaled.

syscall_resume Causes the client to be resumed in a specified state while enabling interruptions. If interruptions could not be enabled, the client is resumed at its entry point vector to handle the indicated events first. The state to resume resides in the state save area on a shared memory page. As described in Section 4.1.7, the functionality of this system call can be provided by user-level functions.

syscall_resume_disabled Causes the client to be resumed in a specified state with interruptions remaining disabled. Consequently, the client will not be resumed at its entry point vector if the event flag is set. The state to resume resides in the state save area on a shared memory page. As described in Section 4.1.7, the functionality of this system call can be provided by user-level functions.

syscall_deliver_event Instructs the host to deliver any queued events of a specified event type. After delivering the last queued event, the host resets the pending event flag. From this point on, events of the specified type will again be delivered to the client, which was temporarily disabled while the pending events flag was set.

syscall_get_data Causes the host to deliver a specified data resource to the client. The data is delivered asynchronously using data events as described in Section 4.2.3.

syscall_display This system call transfers width, height and format of a picture (in the AVPicture format of FFmpeg [FFm]) that was saved to shared memory to the host. The host then caches the picture from shared memory and displays it in a window using the multimedia library Simple DirectMedia Layer (SDL) [Lan].

An additional system call implements a synchronous print for debugging purposes. This system call writes information provided by the client on the shared page to the standard output. System calls with an id of 100 or higher are considered to be user-defined breaks, which the host ignores but which cause the client to be resumed at its entry point vector to react to the system call. To measure latencies, a null system call and a system call to request the delivery of a special event were implemented as described in Section 5.

4.1.6 Race Conditions Between Events and System Calls

The operating system can interrupt the host and client processes at any point and transfer control between them and other running processes. Consequently, special care has to be taken to prevent race conditions when the host tries to deliver an event to the client while the client tries to issue a vCPU system call. Assume the following execution, as visualized in Figure 4.1:

1. Client is running in interruptible mode

2. Host saves an event to the buffer on the shared page

3. Host detects that client is interruptible


4. Client calls into the vCPU:

a) Client disables interruptions

b) Client stops itself by causing a segmentation fault

5. Host signals client to stop

6. Host waits for the client to stop

7. Host continues as the client’s state changed due to the segmentation fault

Figure 4.1: Diagram visualizing vCPU host and client interactions in a runtime environment with scheduling. The described flow of control is detected to prevent a race condition between the host and the client.

At this point, the host may modify the shared page and set the flags corresponding to the delivered event. To make sure the child's system call is not lost, the host is required to detect the reason for the child's interruption. In case it did not stop due to the stop signal, the client's state is analyzed and handled as usual (described in Section 4.1.4). If the client is resumed at a later point, it will receive the stop signal sent by the host and the process gets interrupted. When the host detects the stop signal, it can safely ignore this signal and resume the client process, as it requires no further processing.

4.1.7 User Level Resume

vCPU system calls involve switching to the host, which leads to a computational overhead. Consequently, replacing unnecessary system calls with user-level functions may improve the overall performance of the implementation. Architectures supporting the

x86 instruction set provide the ret instruction to modify two registers at the same time. Taking advantage of this instruction, a client may implement a user-level resume function that can be used instead of the resume system call as described below. The experiments presented in Section 5.3 show that the user-level function provides an efficient replacement for the system call. Pseudo-code of such a function is included in Listing 4.1.

The client resume function requires three arguments: a pointer to the interruptions flag, a pointer to the pending events flag and a pointer to the state that shall be resumed. Additionally, it assumes that interruptions are currently disabled and execution uses the entry point's stack. In this implementation, the flags for interruptions and pending events reside in the same machine word. Using an atomic compare-and-swap operation, the client then enables interruptions while making sure that there are no pending events set. If this fails, the client is resumed at the entry point vector to handle the pending events. Therefore, the stack of the entry point vector is loaded before continuing execution at the entry point vector.

If interruptions could be enabled, the thread state is restored by the following sequence of instructions implemented using inline assembler. The pointer to the state is loaded into the first register, e.g. rax, and is used to address the attributes comprising the thread state. Then, all other registers including the stack pointer rsp, rbx and r8 to r15 are restored, except for the instruction pointer rip. As the stack pointer already points to the thread's stack, special care has to be taken when modifying the stack. According to the x86 AMD64 calling conventions used by 64-bit Linux programs, a block of 128 bytes below the stack pointer is reserved for temporary data and shall not be modified. To make sure values saved in this so-called red zone remain intact, the stack pointer is decreased by 128 before the resume function writes data to the thread's stack. Then, the instruction pointer and the flags are pushed onto the current stack. Following that, the flags are restored from the stack. Now the pointer to the thread state (residing in rax) is only needed to restore the original value of the register it is saved in. That register is then restored. As a last step, the execution of the assembler instruction ret pops the instruction pointer from the stack, loads it and releases a specified number of bytes from the stack. Supplying it with an operand of 128 makes sure the stack pointer references the original position before the previously skipped red zone. From here on, the client resumes execution in the specified state.

As the client can get interrupted at any moment after interruptions have been enabled, some subtleties have to be taken into account. After an interruption, control is transferred to the entry point vector, which normally saves the data stored in the state save area into the interrupted thread's data structure. In case of an incomplete resume, the thread's data structure still contains the valid, original state of the thread, which is needed to resume the thread later on. Therefore, in this case the data in the state save area is ignored to prevent unnecessary resumptions of an interrupted call to the resume function (which in turn could again be resuming an interrupted call to the resume function and so forth, resulting in performance degradation and eventually a stack overflow).
Furthermore, by completely ignoring the state of a partly resumed client thread, there is no need to analyze which registers were already restored and which still need to be restored. To detect an interrupted call to the resume function at the entry point vector, the resume function

procedure UserResume(ptr_interruptions_flag: Pointer;
                     ptr_pending_events_flag: Pointer;
                     ptr_state: Pointer);
var
  enabled: Boolean;
  expected_interruptions_flag: Boolean;
  updated_interruptions_flag: Boolean;
  expected_pending_events_flag: Boolean;
  updated_pending_events_flag: Boolean;
begin
  expected_interruptions_flag  := false;  (* interruptions *)
  updated_interruptions_flag   := true;   (* get enabled *)
  expected_pending_events_flag := false;  (* no pending events *)
  updated_pending_events_flag  := false;  (* no pending events *)
  enabled := AtomicCompareAndSwap(
      expected_interruptions_flag, expected_pending_events_flag,
      updated_interruptions_flag, updated_pending_events_flag,
      ptr_interruptions_flag, ptr_pending_events_flag);

  if not enabled then
    (* could not enable interruptions due to pending events *)
    RSP := initial_entry_point_stack;
    JumpTo(entry_point_vector);
  end;

  RAX := ptr_state;       (* use RAX for addressing *)
  RBX := RAX→rbx;
  (* ... restore all general purpose registers except RAX ... *)
  RSP := RAX→rsp;         (* restore stack pointer *)
  RSP := RSP - 128;       (* skip red zone *)
  Push(RAX→rip);          (* push instruction pointer *)
  Push(RAX→flags);        (* and flags to stack *)
  PopAndRestoreFlags();
  RAX := RAX→rax;         (* restore RAX *)
  Return(128);            (* pop and load RIP and increase RSP by 128 *)
end;

Listing 4.1: Pseudo-code on how to resume a thread without the use of a system call. RAX→rbx denotes addressing of the variable rbx, which is a member of a data structure referenced by the register RAX.

Comparing the address of the interrupted thread's instruction pointer to those two labels allows the entry point vector to safely decide whether the interrupted thread state should be written to the thread's data structure. It is important that no instruction inside the resume function results in the instruction pointer temporarily referencing a position outside the two labels, e.g. during a call to a subroutine. Consequently, if other functions are required, they have to be inlined between the two labels. After all events have been handled, the client can try again to restore the thread's state using the resume function. To sum up, calling the user-level resume function results in one of the following outcomes:

• While interruptions are still disabled, the client detects pending events and continues execution at the entry point vector.

• An event occurred after enabling interruptions. The host saved the execution state to the state save area and resumed the client at its entry point vector.

• Resumption completed without events, and a user-level thread is executed.

To resume a thread without enabling interruptions, for example if the client code needs to continue uninterrupted after a system call, I also implemented a simplified resume function. This function only restores the thread state as described before and resumes its execution. As interruptions are not enabled in this case, it is neither required to handle events nor to deal with interruptions during the resume.
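To illustrate the label check described above, the following C sketch shows how the entry point vector could decide whether an interruption hit a partially completed resume. The symbol names and the state_save_area layout are assumptions for illustration; the actual implementation places such labels around the resume function in inline assembler.

#include <stdint.h>

/* labels assumed to be placed (in inline assembler) at the boundaries of
 * the user-level resume function */
extern const char user_resume_begin[];
extern const char user_resume_end[];

struct state_save_area {
    uint64_t rip;                      /* interrupted instruction pointer */
    /* ... remaining saved registers ... */
};

/* Returns nonzero if the interruption hit a partially completed resume.
 * In that case the saved registers are ignored: the thread's data
 * structure still holds the valid state from before the resume attempt. */
static int interrupted_inside_resume(const struct state_save_area *save)
{
    uintptr_t rip = (uintptr_t)save->rip;
    return rip >= (uintptr_t)user_resume_begin &&
           rip <  (uintptr_t)user_resume_end;
}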

4.2 Threading Library

Using the vCPU setup, a client no longer needs to choose between executing and waiting for events, as events are asynchronously injected into the client. Additionally, a threading library is implemented inside the client. Those threads are light-weight and only require memory; depending on the architecture's instruction set, they allow thread switching without calling into the host context (refer to Section 4.1.7 for implementation details). The data structures required by host and client reside on a shared page for efficient communication.

4.2.1 Scheduling

Scheduling is implemented as designed in Section 3.4. Upon receiving a timer event or a thread becoming ready, the threading library initiates scheduling. Threads of the same priority are stored in queues, as queues naturally support round-robin scheduling within each priority. Thread priorities are implemented as integers ranging from the highest priority 0 to a configurable lowest priority (currently 5). This allows priorities to be used as indices into an array of thread queues.
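The following C sketch illustrates this layout and the scheduling decision; all names are hypothetical, and the real library keeps its data structures on the shared page.

#include <stddef.h>

#define PRIO_HIGHEST 0
#define PRIO_LOWEST  5                 /* configurable lowest priority */

struct thread {
    struct thread *next;               /* intrusive FIFO link */
    /* ... saved state, stack, remaining time slice ... */
};

struct queue { struct thread *head, *tail; };

/* one FIFO per priority, indexed by the priority value */
static struct queue runnable[PRIO_LOWEST + 1];

/* Scan from the highest priority down and dequeue the first thread found.
 * Round-robin within one priority follows from FIFO order: a preempted
 * thread is re-appended at the tail of its priority's queue. */
static struct thread *pick_next(void)
{
    for (int prio = PRIO_HIGHEST; prio <= PRIO_LOWEST; prio++) {
        struct thread *t = runnable[prio].head;
        if (t) {
            runnable[prio].head = t->next;
            if (runnable[prio].head == NULL)
                runnable[prio].tail = NULL;
            return t;
        }
    }
    return NULL;                       /* no runnable thread */
}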


4.2.2 Synchronization

To provide means of synchronization between threads, the threading library implements mutexes, condition variables and semaphores. As described in Section 3.5, synchronization can be implemented using the functions of the vCPU interface to enable and disable event notifications.

4.2.2.1 Mutex

The mutex module provides a simple mechanism for granting exclusive access to a resource using the two operations lock and unlock. For this purpose, a mutex maintains a queue of waiting threads. Initially, the mutex is unlocked and the waiting queue is empty. The first thread to lock the mutex gets exclusive access to the resource. Threads calling lock while the mutex is already locked are suspended and added to the mutex's waiting queue until the threads ahead of them have called unlock. Each time the mutex is unlocked, the next queued thread is removed from the waiting queue and added to the queue of runnable threads, which ensures that it will eventually be scheduled. As lock and unlock may modify the library's queue of runnable threads or require an atomic execution path, they both temporarily disable interruptions. A pseudo-code implementation of the described mutex is included in Listing 4.2.

procedure Mutex.Lock();
begin
    MakeSureEventsAreDisabled();

    if this.IsLocked() then
        this.AddCurrentThreadToQueue();
        Wait();
    else
        this.SetLocked();
    end

    EnableEventsIfPreviouslyEnabled();
end;

procedure Mutex.Unlock();
var
    thread: Thread;
begin
    MakeSureEventsAreDisabled();

    if this.CountWaitingThreads() > 0 then
        thread := this.PopThreadFromQueue();
        MakeThreadRunnable(thread);
    else
        this.SetUnlocked();
    end

    EnableEventsIfPreviouslyEnabled();
end;

Listing 4.2: Pseudo-code of a mutex's lock and unlock functions.

4.2.2.2 Condition Variable

Building on a mutex, a condition variable provides an efficient way of synchronizing access to a resource while making sure that a specific condition is met. A thread locks the mutex, checks the condition and calls wait if the condition is not satisfied. wait unlocks the mutex, adds the current thread to a waiting queue and suspends it. Now a different thread may lock the mutex and modify the data structures. If those modifications change values which are part of the condition, the thread should call either signal or broadcast on the condition variable to wake up potentially queued threads. Resumed threads then wait at the mutex for exclusive access, upon which the condition needs to be checked as before. As the operations modify queues of waiting threads or require atomic execution of multiple instructions, calls to those functions are internally protected by temporarily disabling interruptions. A pseudo-code implementation of the described condition variable is included in Listing 4.3.

procedure ConditionVariable.Initialize(mutex: Mutex);
begin
    this.mutex := mutex;
end;

procedure ConditionVariable.Wait();
begin
    MakeSureEventsAreDisabled();

    this.mutex.Unlock();
    this.AddCurrentThreadToQueue();
    Wait();

    EnableEventsIfPreviouslyEnabled();
end;

procedure ConditionVariable.Signal();
var
    thread: Thread;
begin
    MakeSureEventsAreDisabled();

    if this.CountWaitingThreads() > 0 then
        thread := this.PopWaitingThread();
        this.mutex.AddThreadToQueue(thread);
        if not this.mutex.IsLocked() then
            thread := this.mutex.PopThreadFromQueue();
            MakeThreadRunnable(thread);
        end
    end

    EnableEventsIfPreviouslyEnabled();
end;

procedure ConditionVariable.Broadcast();
var
    thread: Thread;
    threads: ListOfThreads;
begin
    MakeSureEventsAreDisabled();

    if this.CountWaitingThreads() > 0 then
        threads := this.PopAllWaitingThreads();
        this.mutex.AddThreadsToQueue(threads);
        if not this.mutex.IsLocked() then
            thread := this.mutex.PopThreadFromQueue();
            MakeThreadRunnable(thread);
        end
    end

    EnableEventsIfPreviouslyEnabled();
end;

Listing 4.3: Pseudo-code of the implemented condition variable.

4.2.2.3 Semaphore

Based on a mutex and a condition variable as shown in Listing 4.4, semaphores implement a high-level module to synchronize parallel access to a resource by multiple threads. The P operation to acquire a resource uses the mutex-protected condition variable to wait for the resource to become available, while the V operation to release a resource uses the condition variable to broadcast the availability of the resource to waiting threads.

procedure Semaphore.Initialize(mutex: Mutex, condition: ConditionVariable);
begin
    this.mutex := mutex;
    this.condition := condition;
end;

procedure Semaphore.V(value: Integer);
begin
    this.mutex.Lock();
    this.resources := this.resources + value;
    this.condition.Broadcast();
    this.mutex.Unlock();
end;

procedure Semaphore.P(value: Integer);
begin
    this.mutex.Lock();
    while this.resources < value begin
        this.condition.Wait();
    end;
    this.resources := this.resources - value;
    this.mutex.Unlock();
end;

Listing 4.4: Pseudo-code of a semaphore's V and P functions.

4.2.3 Client Prioritization of Data Events

As an extension to the event handling described in Section 4.1.3, data events receive special treatment, which allows plugins to receive and handle data events of different priorities. The design described in Section 3.3.4 is implemented as follows. A user-defined event handler thread calls the library function wait_for_event, passing the event type as an argument. This queues and suspends the thread until a data event of the specified type is signaled. When the event handler thread gets scheduled, i.e. its call to the wait function returns, it may read events, provide data to other threads and finally repeat the whole process by calling the wait function again.
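A hedged sketch of such a handler thread follows. Only wait_for_event is the library function named above; the event type, read_event and hand_over_to_consumers are illustrative assumptions.

#include <stddef.h>

enum event_type { EVENT_VIDEO_DATA = 1 };        /* example event type */

struct data_event {
    size_t length;
    const char *payload;
};

void wait_for_event(enum event_type type);       /* queue + suspend    */
int  read_event(enum event_type type,
                struct data_event *out);         /* 0 if buffer empty  */
void hand_over_to_consumers(const struct data_event *ev);

static void video_data_handler(void)
{
    struct data_event ev;
    for (;;) {
        /* suspend until a data event of this type is signaled */
        wait_for_event(EVENT_VIDEO_DATA);

        /* drain the buffered events and provide data to other threads */
        while (read_event(EVENT_VIDEO_DATA, &ev))
            hand_over_to_consumers(&ev);
    }
}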

4.2.4 Dynamic Memory Allocation

Typical programs rely on dynamic memory allocation to efficiently manage a limited amount of memory at runtime. In common programming languages, dynamic memory is provided by standard libraries and enables programs to request memory of arbitrary size and release it at a later point [WJNB95].

To support those programs, a module that dynamically manages a pre-allocated, designated memory area was implemented in the client. The interface comprises three functions: allocate, free and re-allocate. The latter is used to shrink or grow previously allocated memory objects.

Information about free memory is saved within the free memory itself as a list of free blocks. The list is sorted in ascending order of memory addresses; therefore, the memory manager only needs to know the address of the beginning of the first free block, i.e. the head of the list. At this address, the size of the block and the address of the next free block are saved. For this to work, a free block must be large enough to hold at least those two values. The last free block, and thereby the end of the list, is identified by a next pointer with the value NULL. To allow freeing an allocated block without specifying its size (similar to the usage of the free function provided by glibc), the size of a used block needs to be cached; therefore, allocated blocks are preceded by a fixed-length value denoting the size of the following memory block.

Allocating memory (alloc) To allocate a block of memory of a specific size, the memory manager traverses the list of free blocks to find the first one which is large enough to hold the size field as well as the requested memory. If the free block is larger than required and the remaining memory is big enough to hold a free block entry, the remainder is added to the list of free blocks. In any case, the previously free block is removed from the list of free blocks. The size of the memory block is written to the beginning of the block and a pointer to the subsequent memory address is returned. If the traversal of the list does not yield a block of adequate size, alloc returns NULL.

Freeing memory (free) To free a memory block, an address returned by the allocation function is passed to the memory manager's free function. The memory manager retrieves the size of the block and inserts the block into the list of free blocks. As the list is sorted, the newly added block can easily be joined with preceding or following blocks if their boundaries align. By joining neighboring free blocks, fragmentation of memory is reduced.

Re-allocating memory (realloc) Re-allocating memory uses both alloc and free. To expand memory, realloc allocates memory of the new size by calling alloc. Afterwards, the existing data is copied to the new location and the old memory block is freed. If alloc fails to allocate a new memory block, realloc returns NULL; otherwise, the location of the new memory block is returned. For performance reasons and simplicity, requests to shrink memory have no effect and return the original memory address.
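A minimal C sketch of the free-list layout and the first-fit allocation described above follows. All names are illustrative; alignment handling and the free and re-allocate operations are omitted for brevity.

#include <stddef.h>

struct free_block {
    size_t size;                 /* size of this free block in bytes      */
    struct free_block *next;     /* next free block in ascending address  *
                                  * order; NULL terminates the list       */
};

static struct free_block *free_list;  /* head: first (lowest) free block  */

static void *alloc(size_t request)
{
    /* an allocated block is preceded by its cached size */
    size_t needed = request + sizeof(size_t);

    for (struct free_block **prev = &free_list; *prev;
         prev = &(*prev)->next) {
        struct free_block *blk = *prev;
        if (blk->size < needed)
            continue;                          /* first fit: keep looking */

        if (blk->size - needed >= sizeof(struct free_block)) {
            /* split: the remainder stays in the list at the same spot */
            struct free_block *rest =
                (struct free_block *)((char *)blk + needed);
            rest->size = blk->size - needed;
            rest->next = blk->next;
            *prev = rest;
        } else {
            needed = blk->size;                /* too small to split      */
            *prev = blk->next;
        }

        *(size_t *)blk = needed;               /* cache size for free()   */
        return (char *)blk + sizeof(size_t);   /* payload after the size  */
    }
    return NULL;                               /* no block of adequate size */
}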

4.3 Example Execution

This section gives a detailed overview of the steps involved when running two threads concurrently in the vCPU. Figure 4.2 shows a schematic representation of the involved steps and the flow of control.


At the beginning, the user-implemented main method creates the vCPU host. To complete the setup, the host needs to know the addresses of some user-implemented methods: a method to initialize the client (prepare_client), a method to run the client (run_client), a method to free any resources after the client completed execution (destroy_client), the pointer to the entry point vector and the interval (in milliseconds) after which a timer event will be sent to the client. The host uses this information to initialize its data structures. Then the host's run method needs to be called, which sets up the shared page and calls prepare_client. This creates a global instance of the threading library, provides it with a pointer to the shared page and the number of vCPU timer ticks per time slice, and forks a Linux process to execute the client. Afterwards, when the client process has been set under the host's control and host and client are synchronized, the client process calls run_client and the parent process enters its main loop.

In the user-implemented run_client, the threading library initializes two threads to perform some calculations, followed by a call to the threading library's run method. The run method schedules the first available thread, then loads and resumes that thread's initial state. When the main loop detects that the specified interval for a tick event has passed, it delivers a timer tick event to the client: the host stops the client, saves the client's state into the state save area and resumes the client at the entry point vector. There, the client detects the timer event and checks whether the interrupted thread has already depleted its time slice. In this example, the time slice is used up, so the client copies the content of the state save area into the current thread's data structure and queues the thread in the list of runnable threads. The second thread, which is now first in the queue, is scheduled, loaded and resumed.

After several switches between the threads, the second thread initiates a synchronous print system call to write a debug message to the console. It disables events, copies the message to the shared page, sets the id of the system call and causes a segmentation fault. The host thread, which has been waiting for changes in the client, resumes. It detects the system call, copies the message, prints it to the console and resumes the client at the entry point vector. The threading library tries to enable events and to resume the thread that caused the system call. As a timer tick was delivered in the meantime, the client needs to handle the event first. This causes the thread's state to be written to the thread's data structure and the first thread to be scheduled.

When one of the threads finishes its execution, it notifies the threading library to destroy the thread. When both threads have been destroyed, the threading library detects that there are no more threads available and shuts the client down by calling the exit system call. The host reacts to this system call by leaving its main loop and returning from its run method. At this point, the host can be destroyed, which calls destroy_client to clean up any resources the client acquired.
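The setup step described at the beginning of this walkthrough could be sketched in C as follows. The host-side names (vcpu_host_create, vcpu_host_run, vcpu_host_destroy) are hypothetical; the callbacks and the timer interval are the parameters named in the text.

void prepare_client(void);      /* creates the threading library instance  */
void run_client(void);          /* client entry: spawns the worker threads */
void destroy_client(void);      /* frees client resources after execution  */
void entry_point_vector(void);  /* upcall target for events                */

struct vcpu_host;               /* opaque host handle */
struct vcpu_host *vcpu_host_create(void (*prepare)(void),
                                   void (*run)(void),
                                   void (*destroy)(void),
                                   void (*entry)(void),
                                   unsigned timer_interval_ms);
int  vcpu_host_run(struct vcpu_host *host);    /* forks client, runs main loop */
void vcpu_host_destroy(struct vcpu_host *host);

int main(void)
{
    struct vcpu_host *host =
        vcpu_host_create(prepare_client, run_client, destroy_client,
                         entry_point_vector, 10 /* ms between timer ticks */);
    vcpu_host_run(host);        /* returns after the client's exit system call */
    vcpu_host_destroy(host);    /* invokes destroy_client */
    return 0;
}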

4.4 Video Playback Using Plugin

Based on the implemented architecture, I created a plugin for video playback as one example of a multimedia plugin. The plugin is set up to handle a specific file type for the Firefox browser.

Figure 4.2: Diagram visualizing an exemplary execution of vCPU host and client, showing the flow of control between the host and the client's threading library, Thread1 and Thread2. The client runs two computation threads subject to user-level scheduling. Simultaneous actions are handled transparently by operating system scheduling.

When instantiated, the plugin requests a video resource from the host, upon which the host lets the user choose a file and transfers its content to the plugin using data events. The plugin then decodes the video frame by frame. Each frame is immediately transferred to the host, which displays it on a web page inside the browser.

The plugin is implemented as follows. On startup, it requests a data resource using the syscall_get_data system call. The host detects the system call and asks the user to choose a file. This is realized by calling the program zenity [Cru], which displays native dialogs on Ubuntu. The chosen file is read into memory and transferred to the plugin using data events. The plugin reads the data events using a dedicated event handler thread.

When the video is available, the main plugin thread is resumed and starts decoding the video. The plugin uses the open source project FFmpeg [FFm] for this purpose. Its libraries libavformat and libavcodec provide the functionality necessary to generate a packet- and frame-based representation of an encoded video resource. After initialization of FFmpeg, a handler that allows reading a video stream from memory (instead of a file) is registered. Then a decoder capable of reading the video format is set up. The decoder is called repeatedly to handle the video frame by frame. Each frame is saved to shared memory and reported to the host using the syscall_display system call. This system call copies the decoded frame into a data structure of the host and displays it to the user: a window is opened and the frame is rendered by means of SDL.

A major caveat during the implementation of the video decoding plugin was memory management. FFmpeg's libraries make heavy use of dynamically allocated memory managed by the functions malloc, realloc and free. As those functions are implemented on top of the Linux system calls brk and mmap, which are blocked by the sandboxing mechanism, a different solution had to be found. Conveniently, FFmpeg's memory management is wrapped by a defined set of functions that can be overridden by a user program. Consequently, by providing alternate implementations of av_malloc, av_realloc, av_free, av_freep, av_mallocz and av_strdup (as found in FFmpeg's default memory allocator in ffmpeg/libavutil/mem.c), FFmpeg's dynamic memory could be handled by the client-side memory management described in Section 4.2.4.
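As a hedged sketch, the replacement functions could look as follows. client_alloc, client_realloc and client_free are assumed names for the client-side manager's interface, and the av_* signatures should be checked against the FFmpeg version in use.

#include <stddef.h>
#include <string.h>

void *client_alloc(size_t size);
void *client_realloc(void *ptr, size_t size);
void  client_free(void *ptr);

void *av_malloc(size_t size)              { return client_alloc(size); }
void *av_realloc(void *ptr, size_t size)  { return client_realloc(ptr, size); }
void  av_free(void *ptr)                  { client_free(ptr); }

void av_freep(void *arg)                  /* frees *arg and sets it to NULL */
{
    void **ptr = (void **)arg;
    av_free(*ptr);
    *ptr = NULL;
}

void *av_mallocz(size_t size)             /* zero-initialized allocation */
{
    void *ptr = av_malloc(size);
    if (ptr)
        memset(ptr, 0, size);
    return ptr;
}

char *av_strdup(const char *s)
{
    if (!s)
        return NULL;
    size_t len = strlen(s) + 1;
    char *copy = av_malloc(len);
    if (copy)
        memcpy(copy, s, len);
    return copy;
}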

Browser Integration To integrate the architecture into the browser, I use MozPlugger as introduced in Section 2.3.1.3. I added an entry to the MozPlugger configuration file ~/.mozilla/mozpluggerrc. It defines the MIME type application/sect-plugin and the file extension mzp-plugin and adds the description SECT plugin. The actual command that handles the content (the path to the compiled architecture) is specified after the option swallow(threads). This option instructs MozPlugger to search for a window with the title threads and move it inside the browser (see the manual of MozPlugger for details). The complete entry can be found in Listing 4.5.

After setting up MozPlugger and thereby preparing Firefox to load the custom plugin handler, an HTML web page referencing the video player plugin was created.



It includes an object with a link to a mzp-plugin file, as shown in Listing 4.6. Upon navigating to that web page, the plugin is instantiated and executed.

application/sect-plugin: mzp-plugin: SECT plugin
    swallow(threads): /path/to/plugin

Listing 4.5: MozPlugger configuration entry

<html>
  <head><title>Plugin Test</title></head>
  <body>
    <!-- the referenced file name is illustrative; the extension and
         MIME type match Listing 4.5 -->
    <object data="example.mzp-plugin" type="application/sect-plugin">
    </object>
  </body>
</html>

Listing 4.6: HTML to display the plugin

5 Evaluation

This section presents the evaluation of the implementation. All experiments were conducted on an AMD Athlon 64 X2 Dual Core Processor 5200+ with 4 GB RAM running Ubuntu 9.10. I present the average time consumption of core functionality like vCPU system calls and process switches, as well as computation overhead and event latency. To put the numbers into context, a comparison with native execution is included where suitable.

5.1 System Call Roundtrip

To measure the time required for executing a synchronous system call, a special vCPU system call syscall_null was introduced. When syscall_null is called, control is transferred to the host, which does nothing more than resume the child's execution at its entry point vector. The duration of this system call was measured by reading out the CPU's tick counter directly before and after the system call. The effect of outliers is minimized by averaging the experiment over 10,000 runs. For comparison, the same measurements were conducted for the Linux system call getpid. I chose getpid as its overhead is small. To make sure the result is not cached in user space by glibc, the system call is executed using syscall(SYS_getpid). The results are shown in Table 5.1.

Table 5.1: Time consumption of system calls

                        clock cycles per call    time per call    relation
vCPU (syscall_null)     37,702 ticks             ≈ 35.671 µs      100%
native (getpid)         248 ticks                ≈ 0.234 µs       1%
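A minimal sketch of the native measurement loop, assuming an x86-64 rdtsc-based tick counter, could look as follows; the averaging mirrors the 10,000 runs described above.

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    enum { RUNS = 10000 };
    uint64_t total = 0;

    for (int i = 0; i < RUNS; i++) {
        uint64_t before = rdtsc();
        syscall(SYS_getpid);          /* bypasses glibc's getpid caching */
        total += rdtsc() - before;
    }
    printf("average: %llu ticks per call\n",
           (unsigned long long)(total / RUNS));
    return 0;
}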

The experiments show that native system calls are about 100 times faster than vCPU system calls. The reason is the vCPU system call's dependence on several native system calls: two calls to ptrace are used to prepare the child to continue after a system call, and actually continuing it requires another call to ptrace. In contrast to syscall_null, which additionally requires switching between the child and the parent process, getpid does not lead to a context switch.

5.2 Context Switch

One design goal was the support for efficient execution of different threads in the vCPU. Therefore, I measured the costs of a switch between different streams of execution in

the implemented architecture. In the vCPU client, two threads wait for each other using a semaphore. On native Linux, two processes are created, once sharing the same address space (using clone with the flag CLONE_VM) and once with separate address spaces (using the system calls fork and clone). The processes wait for each other by writing and reading an integer value over a pair of pipes. To minimize the effect of outliers, the measurements were taken for 10,000,000 switches from the first process to the second process and back to the first process. Table 5.2 presents the measured timings.

Table 5.2: Time consumption for 10,000,000 switches from one context to the other and back.

    configuration                              total time    time per switch    relation
1   vCPU user level resume                     19 s          ≈ 1.0 µs           100%
2   native, shared address space (clone)       45 s          ≈ 2.3 µs           237%
3   native, separate address space (fork)      57 s          ≈ 2.9 µs           300%
4   native, separate address space (clone)     57 s          ≈ 2.9 µs           300%
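The native ping-pong for experiments 2 to 4 could be sketched as follows, here using fork; the clone variants differ only in how the second process is created. The loop itself is timed externally, and each iteration corresponds to one round trip.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ROUNDTRIPS 10000000   /* each iteration: first -> second -> first */

int main(void)
{
    int ping[2], pong[2], token = 0;
    if (pipe(ping) || pipe(pong)) { perror("pipe"); exit(1); }

    if (fork() == 0) {                        /* second process: echo back */
        for (int i = 0; i < ROUNDTRIPS; i++) {
            read(ping[0], &token, sizeof(token));
            write(pong[1], &token, sizeof(token));
        }
        _exit(0);
    }

    /* first process: time this loop externally, e.g. with time(1) */
    for (int i = 0; i < ROUNDTRIPS; i++) {
        write(ping[1], &token, sizeof(token));
        read(pong[0], &token, sizeof(token));
    }
    return 0;
}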

The native processes in different address spaces exhibit the lowest performance (experiments 3 and 4). The reason lies in the way process switches are executed in this case: as each process has its own address space, a context switch requires loading a different address space and consequently flushing the translation lookaside buffer (TLB). The performance of clone and fork in experiments 3 and 4 is equal because both system calls are implemented using the same functions in glibc (refer to the manual of fork). Switches between native processes that share an address space are faster, as a context switch in that case does not require loading a new address space. Switches in the vCPU are fastest, as the vCPU threads share the address space and most of the time do not require a system call to switch between them or resume execution.

5.3 Comparing User-level Resume with syscall_resume

Resuming a client thread using the resume system call and using the user-level resume function were compared in experiments similar to the ones described in the previous section. In the first experiment, the vCPU client uses the user-level resume function whenever possible. In the second experiment, only the vCPU system call is used. Table 5.3 shows the results. Using the system call to resume a client thread is about 17 times slower than using the user-level resume function. This is due to the fact that as long as no events are pending, a resume does not require switching to the host. As resuming using the user-level resume function then does not require any system calls, it provides much better performance.



Table 5.3: Time consumption for 10,000,000 switches from one vCPU client thread to the other and back.

configuration        total time    time per switch    relation
user level resume    19 s          ≈ 1.0 µs           100%
syscall resume       329 s         ≈ 16.5 µs          1,732%

5.4 Event Latency

The latency of event delivery from the browser to the plugin provides insights into the architecture's usefulness for multimedia applications. For example, games running inside a plugin may need to react to mouse or network events in a timely manner. To measure the latency of events, the vCPU system call syscall_latency_event was implemented. When called, it reads out the current CPU time stamp counter (TSC), writes it to a shared memory area and delivers a designated latency event to the client. Upon reception of this event, the client reads the new CPU time stamp and logs the difference between the first and second time stamp as the latency. The event handler for latency events runs with upcalls disabled, rather than in a dedicated event handler thread; thus, latency events are comparable to timer tick events. Using this setup, the client can measure the latency of events under different conditions. Table 5.4 presents the latency in a setup without load. Figures 5.1 and 5.2 show the dependence of latency on load. In the first experiment (Figure 5.1), load was created by an increasing number of computational threads running in parallel. In the second experiment (Figure 5.2), load was created by an increasing number of data handling threads running in parallel.

Table 5.4: Average latency between the sending of an event by the host and its reception by the client, measured without load and under load.

configuration                     event latency in clock cycles    time        relation
vCPU without load                 8,801 ticks                      ≈ 8.3 µs    100%
vCPU with load (computational)    9,458 ticks                      ≈ 8.9 µs    107%
vCPU with load (data handling)    8,808 ticks                      ≈ 8.3 µs    100%

The results show that the event latency measured with an increasing number of computation threads and with an increasing number of event handling threads remains constant. Furthermore, the results are about the same as measured in the experiments without load (Table 5.4), with latencies increasing slightly (7%) if one or more calculation-intensive threads are running in parallel. The latency remains constant because these kinds of events are handled by the client directly after receiving a notification. The slight increase under heavy computational load can be explained by the fact that heavy load on the physical CPU increases latencies in the system as a whole.

The measurements that form the basis for the diagrams were found to include a few outliers of significantly higher value. Figure 5.3 shows the cumulative latency distribution of events recorded while 10 computational threads were running in parallel.



Figure 5.1: Latency of event delivery while running an increasing number of computational threads in parallel.

Almost 80% of the measured latencies are below 20,000 ticks, while a few high outliers of almost 1,000,000 ticks occur. Those outliers arise when the operating system's scheduler interrupts event delivery in favor of another process's execution. As there is no feasible way to prevent this from happening, those values need to be handled specially. To avoid distorting the results, values significantly higher than the average (calculated for the remaining values) were filtered out, which led to the removal of around 10% of the measured values.

5.5 Data Event Latency

Figure 5.4 presents the latency of data events measured while running an increasing number of calculation threads in parallel. To retrieve adequate measurements, all data events received an extra field to save the CPU tick count at the moment of their delivery. The data event handler threads in the client can then read the CPU tick counter upon reception and calculate the latency of the event. Experiments were conducted in two series. In the first series, the computational threads are of low priority. In the second series, the computational threads are of the same priority as the event handling thread.



Figure 5.2: Latency of event delivery while running an increasing number of data event handling threads in parallel.

Due to the large differences between the two series, the y-axis of Figure 5.4 uses a logarithmic scale. As shown in the diagram, high-priority data event handling executes in constant time, while same-priority event handling increases with the number of parallel computation threads.

5.5.1 Influence of Thread Priorities on Latency

Figure 5.5 presents the latency of data events when processing different types of data events at the same time. To this end, different experiments were conducted. In the first series (1), the data event handler thread of the data event type under consideration received the highest priority. In the second series (2), all data event handler threads received the same priority. The graph shows that events handled with a higher priority than all other events (1) have a latency independent of the number of parallel event types. In contrast, in the other case the latency increases linearly with the number of parallel event types (2). The reason for this lies in the way thread scheduling is designed. Upon detection of a data event, the corresponding event handling thread is made runnable. If it is of higher priority than the currently active thread, the current thread gets preempted and the new event handling thread is scheduled. Consequently, high-priority threads can preempt all other threads (1).


Figure 5.3: Cumulative latency distribution of events when running 10 computational threads in parallel. 80% of the events were delivered in under 19,000 CPU cycles.

5.5.2 Influence of Event Buffer Size on Latency

To assess the influence of the event buffer size on latency and execution time, I conducted the following experiments. Figure 5.6 visualizes the latency of an event in relation to the buffer size. For this experiment, events were extended to include a time stamp which is set by the host upon delivery of the event. Upon reception, the client reads the current time stamp and compares it to the one set by the host. This way, the client can calculate the latency of every event it receives. In the experiment, the host delivers events at a higher frequency than the client can deal with. As seen in the figure, event latency increases linearly with the size of the buffer: events arrive faster than the client can handle them, so the host saves them to the event buffer. As the client can only read one event at a time, the latency of the last event includes the latency of all previously processed events, resulting in the graph shown in the figure.

Figure 5.7 presents the total time consumption when sending 100,000 events of size 1024 bytes to the client. As the buffer size increases, the time consumption decreases. After saving an event, setting the event flag requires briefly interrupting the client. With a buffer for multiple events, the overhead induced by interrupting the client can be shared among those events, leading to the decrease in execution time.



Figure 5.4: Latency of data event delivery while running an increasing number of computation threads in parallel, for high-priority (1) and equal-priority (2) event handling.

5.6 Computation Overhead

As stated in the design goals in Section 3, the overhead of calculations executed in the plugin architecture should be minimized. To analyze the performance of the architecture in this respect, I implemented and executed the time-consuming computation of the n-th Fibonacci number both on native Linux and inside the architecture. Table 5.5 presents the result.

Table 5.5: Time consumption for 10,000 calculations of the 25th Fibonacci number.

          time         relation
vCPU      13,733 ms    100.0%
native    13,643 ms    99.3%
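The workload can be sketched in C as follows, assuming the usual naive recursive variant of the Fibonacci computation; it is run once natively and once as a vCPU client thread.

/* naive recursive Fibonacci, deliberately time-consuming */
static unsigned long fib(unsigned int n)
{
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

/* workload of Table 5.5: for (int i = 0; i < 10000; i++) fib(25); */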

The results differ only slightly, showing that the overhead of running calculations inside the plugin architecture is negligible for multimedia applications.



Figure 5.5: Latency of event delivery while running an increasing number of data event handling threads in parallel, for high-priority (1) and equal-priority (2) event handling.

5.7 Data Throughput

To measure the amount of data that can be delivered per time using data events, the following experiment was conducted. In the client, an event handling thread as well as a control thread were created. The control thread saves the current time and requests the delivery of data provided by an unlimited data source. The event handling thread then receives the events, checks that they contain correct values and increments a data counter. After a fixed amount of time, the control thread reads the number of events delivered and shuts the client down. The data events are of size 100 KB. A similar experiment was used to measure native throughput: one process sends data over a pipe to a second process, which again validates the data while the execution time is measured. The results are presented in Table 5.6.

Table 5.6: Data throughput

          speed           relation
vCPU      136.1 MB/sec    100%
native    159.6 MB/sec    117%
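A hedged sketch of the native baseline follows: one process streams 100 KB blocks into a pipe while the other validates them, and the byte count over a fixed window gives the throughput. The test pattern and the window length are illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLOCK (100 * 1024)     /* matches the 100 KB data events */

int main(void)
{
    int fd[2];
    static char block[BLOCK];
    if (pipe(fd)) { perror("pipe"); exit(1); }

    if (fork() == 0) {                          /* sender */
        memset(block, 0xAB, sizeof(block));     /* known test pattern */
        for (;;)                                /* terminated via SIGPIPE  *
                                                 * when the reader exits   */
            write(fd[1], block, sizeof(block));
    }

    long long bytes = 0;
    time_t end = time(NULL) + 10;               /* fixed measurement window */
    while (time(NULL) < end) {
        ssize_t n = read(fd[0], block, sizeof(block));
        for (ssize_t i = 0; i < n; i++)         /* validate the data */
            if (block[i] != (char)0xAB) { fputs("corrupt\n", stderr); exit(1); }
        bytes += n;
    }
    printf("%.1f MB/sec\n", bytes / 10.0 / (1024 * 1024));
    return 0;
}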



Figure 5.6: Average latency of events when using event buffers of different size (smaller is better).

The data throughput of native processes is about 17% higher than that of the vCPU implementation. As both experiments require process switches (native: sending process to receiving process; vCPU: host process to client process), the higher performance of pipes can be explained by fewer copy operations or more suitable buffering. In Figure 5.8, I present the amount of data delivered per time depending on the size of an event. To illustrate the influence of event sizes, the size of events is increased in every step.

5.8 Video Playback Using Plugin

The experiments presented so far concentrate on specific details of the implementation. To show the usability of the architecture for more complete use cases, I implemented a video decoding plugin as described in Section 4.4. The rudimentary video plugin displays the frames as soon as they are decoded. To play a video at its original speed, a timeout was introduced: after calling syscall_display, the decoding thread sleeps for a specific amount of time, resulting in smooth video playback. The performance of video decoding in the plugin was assessed after removing the timeout. Then the time to decode and display a video at maximum speed was measured for two different videos. The resulting numbers in Table 5.7 were averaged over multiple runs.



Figure 5.7: Time expenditure to deliver a fixed number of events when using event buffers of different size (smaller is better).

Table 5.7: Video decoding performance

frames    frame size (pixels)    bitrate      total decoding time    time/frame
175       1280 x 720             1409 kb/s    1432 ms                8.1 ms/frame
92        352 x 240              1240 kb/s    420 ms                 4.6 ms/frame

The experiment shows that the plugin achieves an average frame rate almost 10 times higher than the frame rate of current videos. This shows that the architecture easily supports plugins handling multimedia files.



Figure 5.8: Data throughput when delivering events of different size (higher is better).


6 Related Work

This section reviews research that is related to this thesis. First, I outline proposals that improve the security of browsers and plugins from inside the browser. Then I present projects that analyze current implementations of web browsers on an operating system level, as well as other technologies related to plugin security in browsers.

6.1 Browsers

With the growing importance of web browsers and their plugins from a security point of view, different approaches have been studied to design and validate plugin interfaces in browsers.

6.1.1 Transparently Securing Plugins in Internet Explorer

One approach to restricting plugins was proposed in [LZYC10], where an in-process sandbox monitors communication between a plugin on one side and the browser as well as the operating system on the other. By design, their implementation is only aware of calls using defined APIs like COM interfaces or the Windows API and does not consider code that directly calls kernel routines or system calls. This threat model is justified on the grounds that code not using the predefined APIs is uncommon and easily crashes. The results presented in the paper show that their implementation on Windows XP in Internet Explorer 8 successfully makes sure that a plugin complies with the specified security policy. The induced execution overhead is around 4.5%.

6.1.2 Chromium Sandbox

As described in [BJRT08], the Chromium sandbox on Linux tries to solve a two-fold problem. On the one hand, all interactions of a process running in the sandbox with outside processes must be restricted to a defined set of operations and arguments which do not allow it to elevate its rights. On the other hand, the sandbox needs to be compatible with current third-party code as included in websites or plugins, and therefore has to provide the functionality that is generally available when the code is executed outside the sandbox. To solve the first part, Chromium uses Linux's seccomp feature, which limits the system calls available to a process to read, write, sigreturn and exit. Calling any other system call results in the termination of the process by the kernel. This behavior ultimately creates a sandbox for the process. The second part requires a more sophisticated implementation. To provide the expected functionality in a safe and checkable way, the loaded code is analyzed before its first execution.

A simple disassembler tries to find all references to blocked system calls and rewrites them so that they are directed to a trusted helper thread inside the same process. The helper thread itself is carefully written in assembler and does not rely on any memory (heap or stack) but uses CPU registers only. The memory region containing the trusted helper thread's code remains immutable as long as the untrusted thread cannot deliberately use memory mapping system calls (mmap, munmap, mprotect). As those system calls are validated by the helper thread, the helper thread's behavior cannot be influenced by malicious or defective third-party code. The helper thread then tries to verify the system call or delegates the task to a trusted helper process, which is also responsible for executing the approved calls. Results are returned through a read-only shared memory page.

The steps taken by Chromium are promising, but cannot be applied to current plugins like the Adobe Flash Player, as they are not designed to accommodate restricted protection domains. To that end, Adobe and Google are currently working together to facilitate execution of the Flash Player in Chromium's sandbox [SPG]. Additionally, Google integrated a PDF reader into Chromium [PG]. This internal PDF reader executes inside the sandbox.

6.1.3 Google Native Client

Native Client (NaCl) is another concept to provide a portable sandboxing mechanism for native code, introduced in [YSD+09]. The goal of NaCl is to give web applications the possibility to directly run native x86 code for the best possible performance while maintaining a security level comparable to that of current web applications implemented in JavaScript. To this end, the paper argues that an application that uses only a restricted set of machine instructions (e.g. excluding system calls and instructions that modify segment state) and x86 segmented memory may safely be executed. To ensure that the untrusted code only includes allowed instructions, a disassembler first analyzes the code. If the disassembler cannot guarantee to cover all execution paths, for example in the case of self-modifying code, execution is denied. Otherwise, a static analysis verifies that the code only includes allowed machine instructions. As the code effectively runs in a sandbox, additional mechanisms providing features like memory management and threading are added to the implementation. Furthermore, communication between instances and the browser is facilitated by a simple remote procedure call (SRPC) mechanism and an interface to NPAPI. Consequently, loading additional resources and interacting with the website or parts of it essentially relies on JavaScript. As the browser already applies a security model to JavaScript, Native Client does not need to further control those calls. Benchmarks included in the paper show little overhead for computation-intensive code running under the control of Native Client compared to executing it natively.


6.1.4 OP Web Browser

In an effort to overcome deficiencies in the way web browsers were implemented, the authors of [GTK08] proposed and implemented a browser called OP Web Browser. Instead of having one level of access for the browser as a whole, it was designed to consist of different, isolated components that can be executed in separate, restricted domains. All communication between the components uses means provided by a small core component. The core, coined the “browser kernel”, verifies that the messages adhere to the access control policy. To prevent additional interactions between the browser components or other, unrelated code, the OP Web Browser relies on the operating system; in the paper, the authors explain how to leverage Security-Enhanced Linux (SELinux) to provide this kind of functionality. The distinct browser components are “the web page subsystem, a network component, a storage component, a user-interface (UI) component, and a browser kernel”. This separation allows the kernel to assure security, for example the same origin policy, by blocking all communication that violates a policy. Furthermore, the browser kernel can selectively save messages which may give information about attempted or accomplished security breaches. To handle plugins, the OP Web Browser introduces a “provider domain policy” stating that a plugin is executed in the context of the domain that provided the content instead of the domain that included it. For example, as given in the paper, if a website uiuc.edu includes an object from youtube.com and the object is handled by a plugin, the plugin instance is executed in the context of youtube.com. Consequently, the instance can access information like cookies of youtube.com but is denied access to this kind of information of uiuc.edu. Furthermore, a “plugin freedom policy” allows plugins to open unlimited outgoing network connections and to access a local plugin storage, but disallows any access to website and user information. As a consequence, the authors mention that this mechanism “prevents some plugin content from functioning properly”, i.e. it breaks websites that rely on JavaScript interaction with a plugin.

6.1.5 Microsoft Gazelle

The Gazelle Web Browser [WGM+09] is a prototype implementation of a new web browser architecture. It builds on the research of [GTK08] and has a Browser Kernel at its basis that, similarly to an operating system, exclusively handles the protection of resources for its clients. With a strong focus on the same origin policy (SOP), web site principals are sandboxed into separate protection domains as operating system processes. As required by the SOP, resources loaded using the same protocol, domain name and port are considered to have the same origin and can access each other's properties and methods. In contrast, the permissions of resources from other origins to access these methods and properties are severely restricted. The resources from one origin then form a web site principal. All communication of web site principals with each other and with the operating system uses means provided by the Browser Kernel; other interactions are blocked by the operating system. This enables the Browser Kernel to impose access control rules on the website principals. For example, network traffic, cross-origin interactions or drawing to the screen can be limited by the Browser Kernel.

All rendering is based on bitmap objects which the Browser Kernel manages and assembles into the user interface.

In contrast to the OP Web Browser [GTK08], Gazelle does not put different browser components like the rendering engine or the JavaScript engine into separate protection domains, but rather executes those components in a single process per website principal. The authors argue that if one component fails, the website principal will be unusable in any case. Consequently, according to the paper, the need for IPC between the components of a website principal can be omitted without compromising security by having only one protection domain per principal. The only elements of a website that are not part of the single principal process are the plugins included in the website. As the number of vulnerabilities found in plugins “is about one magnitude higher than that of browser software”, plugins are considered less trustworthy. Additionally, plugin processes cannot request cross-origin resources (like scripts or style sheets), but need to rely on the corresponding website principal process for that purpose. Due to the fact that interactions with the other components require the use of system calls, current plugins are not compatible with Gazelle. The prototype implemented by the authors and used for assessing the architecture runs on Windows Vista and uses the .NET framework. The implementation of the Browser Kernel required about 5,000 lines of C# code; the components forming a browser instance mostly rely on Internet Explorer 7.

6.2 Operating Systems

Further research dealt with securing browsers and their plugin architectures on an operating system level. By introducing new concepts to operating systems, potential separations between browser and plugins are described.

6.2.1 Chromium OS

Chromium OS is an operating system design with a focus on web-centered usage. As outlined in preliminary design documents [Goob], Chromium OS applies a three-level architecture: firmware; a system kernel with drivers and services; and a browser based on Chromium (see Section 2.5.2 for details) together with a window manager. At the lowest level, a customized firmware is responsible for verifying the integrity of the kernel and booting it. The kernel, which is based on the Linux kernel, verifies the root file system, initializes the system and starts user-land services. On top of this, the window manager handles multiple client windows and user interactions; by applying offscreen rendering of client windows, it assembles the final image displayed to the user. The Chromium browser, in turn, provides a runtime environment for websites and web applications. Other types of applications are not meant to be installed on the system. In addition to the modularization and sandboxing designed into the Chromium browser, the operating system itself provides means to minimize the privileges of components [Gooc]. First, a namespacing concept is used to separate processes from each other. Second, using coarse-grained capabilities, privileges are managed based on extended runtime and file system attributes.

Third, taking advantage of Linux Security Modules (LSM), the kernel can dictate mandatory access controls to guarantee restrictions on processes.

6.2.2 Illinois Browser Operating System

In the paper “Trust and Protection in the Illinois Browser Operating System” [TMK10] (IBOS), the authors propose an operating system to facilitate secure web browser implementations. The proposal is based on a microkernel, called the IBOS kernel, which manages the hardware and enables IPC using messages. On top of the microkernel, user-mode drivers are used to access hardware devices like network interface cards, and an implementation of a browser API manager abstraction provides functionality to access the drivers and the browser abstraction. Provided components are responsible for HTTP requests, handling of cookies, persistent local storage and displaying browser tabs. To support traditional applications and plugins, an additional layer provides a UNIX-like abstraction. As the IBOS kernel facilitates messaging, it is used to implement security policies: it inspects messages as well as system calls and infers labels for processes. Based on these labels, it decides whether a resource may be accessed or a message may be sent by a process. IBOS basically supports two types of processes, which can communicate using local storage: web page instances and traditional processes. Web page instances are the abstraction used for each website the user loads. In contrast, traditional processes can run instructions of their choice. On request of a browser instance, plugins are created and executed as traditional processes while being able to interact with the browser instance using the NPAPI. Consequently, vulnerable browser plugins can compromise their browser instance (including local storage, tab, cookies and related network resources), but cannot interfere with other websites.

6.2.3 Capsicum

The authors of [WALK10] point out that current applications have to deal with different kinds of independent data. For example, a web browser has to decode videos loaded from an untrusted origin while at the same time having full access to user data to perform file uploads. As these two tasks can be implemented and executed separately, e.g. as in the Chromium web browser (see Section 2.5.2), security can be improved by applying the principle of least privilege. Therefore, means to sandbox different parts of an application are required. According to the authors, current techniques using discretionary access control (DAC) and mandatory access control (MAC) are not designed to restrict parts of an application and are therefore unsuitable. Furthermore, mechanisms for sandboxing differ greatly between current platforms and require much effort to implement or use, which often leads to vulnerabilities due to bugs or wrong usage. Regarding those deficiencies, the authors propose Capsicum, an extension to the POSIX API on UNIX platforms for capabilities and sandboxing. If a process runs in capability mode, it cannot access global namespaces like the file system or system management interfaces, and the number of allowed system calls is limited.

Modifications to the kernel assure compliance with the capabilities given to the process. If an application is to take advantage of Capsicum, it first has to be separated into logically independent processes, with communication between those components implemented using message passing. The application can then use the API proposed in the paper to apply capabilities to and enforce capabilities on its components. As an example, the authors adapted the Chromium web browser with its multi-process architecture to use Capsicum on a FreeBSD implementation of their API. As Chromium was already prepared for sandboxing, only code that used shared memory had to be modified, and an additional 100 lines of code had to be added to introduce Capsicum support. Compared to Chromium's userspace sandboxing using seccomp on Linux, this approach is considerably simpler from an application developer's point of view.

7 Conclusion

The goal of this thesis was the design of an architecture for untrusted web browser plugins. In Section 3, I described an architecture that can be implemented in user space. It is designed around a defined interface at which all interactions of the plugin with the system can be validated; other interactions are prevented by a sandboxing mechanism. Inside the sandbox, plugins may take advantage of threading to implement complex multimedia handlers. Using thread priorities, specific tasks and events can be dealt with in a timely manner.

In Section 4, I described the details of my implementation. I explained how to enable communication between plugins and the browser using system calls and events, outlined potential pitfalls and described the implemented solutions. Furthermore, I detailed how the ptrace system call can be leveraged to build a sandbox for processes on Linux. Using assembler instructions and system calls to the browser, I showed how to implement a threading library in a sandboxed process that provides means for synchronization.

The experiments in Section 5 show the feasibility of the implementation and its usefulness for multimedia applications.

7.1 Outlook

Responding to the growing needs of web applications, the successor to the current standard for websites, HTML5, will include new elements and APIs. HTML5 introduces and defines functionality which is currently only provided by third-party plugins or browser-dependent implementations; for example, it standardizes video playback and technologies like the rendering of 2D shapes. On the one hand, implementations of HTML5 help to reduce the browser's need for external programs to handle commonly used technologies; on the other hand, they result in a bigger code base for the browser. As a standard will hardly ever capture all possible use cases, the need for a plugin interface remains.

7.2 Future Work

The results presented in this thesis open up numerous possibilities to build upon this work.

Extend API Functionality The API could be extended with further functionality. As my implementation is meant to show the feasibility of the architecture, the included API still lacks features required by real plugins.


For example, system calls to request specific network resources or user files could be added. Additionally, many event types, e.g. GUI-related mouse and keyboard events, or browser and system events like changes in network availability, could be implemented.

Improve Client Scheduling The scheduling algorithm implemented in the threading library is rather coarse-grained. It does not take into account the extent to which a thread was able to use its time slice before being preempted. For example, if a thread of higher priority gets scheduled before the current thread has completely used up its time slice, the current thread is added to the end of the queue for threads of the same priority; when it is scheduled the next time, it receives an equally sized time slice. Furthermore, when a thread is resumed after a call to the vCPU, it may exceed its time slice up to the next timer event. By applying a finer-grained model that takes previous usage of time slices into account, the fairness of scheduling could be improved.

Optimize Implementation for Speed The implementation derived during this work provides a suitable platform for measurements, as shown by the experiments. Nevertheless, further analysis of performance bottlenecks and suboptimal implementations could lead to higher performance. One starting point could be the usage of the ptrace system call to read and write the client's CPU register contents: instead of reading all values (including ones that are not modified), modifying a small subset and writing all of them back, a more efficient mechanism may be suitable. As the ptrace system call is frequently used this way in the implementation, e.g. to resume a client after it caused a segmentation fault for a system call, the latency of API calls may decrease while the overall performance of the system increases.

Support External Code Currently, the compiled code of a plugin is linked with the architecture into a single executable. By introducing support for plugins that come in the form of precompiled code, e.g. in the Executable and Linkable Format (ELF), plugins could be loaded dynamically at runtime.
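A first step in this direction could use the standard dynamic loader, as sketched below; the file name and the entry point name plugin_main are hypothetical. In the sandboxed setting, the load would have to happen before the sandbox is engaged, or be replaced by a custom ELF loader under the host's control.

/* Sketch: loading a precompiled plugin at runtime.
 * Build with: gcc loader.c -ldl */
#include <dlfcn.h>
#include <stdio.h>

typedef int (*plugin_main_fn)(void);

int main(void)
{
    void *handle = dlopen("./plugin.so", RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }
    /* Resolve and run the plugin's entry point. */
    plugin_main_fn entry = (plugin_main_fn)dlsym(handle, "plugin_main");
    if (entry == NULL) {
        fprintf(stderr, "dlsym: %s\n", dlerror());
        return 1;
    }
    int result = entry();
    dlclose(handle);
    return result;
}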

Integrate Architecture into Firefox To make the architecture available for plugin development in browsers, an integration into Firefox independent of MozPlugger could be devised. The following approaches seem suitable. The architecture could be integrated into the Firefox source code. This would offer high performance for the interaction between the browser and the plugin, but requires an in-depth analysis of how to use Firefox's functions to implement the API, and it tightly couples browser and architecture. Another approach would be to create an NPAPI plugin that encapsulates the implementation of the new plugin interface. Taking advantage of the current plugin mechanism, this approach introduces another layer of indirection but may reduce the amount of work necessary for the integration.
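A skeleton for the second approach is sketched below, using the standard Unix NPAPI entry points (the headers are those shipped with the Mozilla SDK). Everything inside the functions is a placeholder for the actual wrapper logic, and the MIME type is made up.

/* Skeletal sketch of an NPAPI wrapper plugin (Unix entry points). */
#include <npapi.h>
#include <npfunctions.h>

static NPNetscapeFuncs *browser;  /* functions provided by the browser */

NPError NP_Initialize(NPNetscapeFuncs *bFuncs, NPPluginFuncs *pFuncs)
{
    browser = bFuncs;
    /* Here the wrapper would fill pFuncs with its NPP_* functions and
     * set up the sandboxed client process for the actual plugin. */
    return NPERR_NO_ERROR;
}

char *NP_GetMIMEDescription(void)
{
    /* MIME type the wrapper claims to handle; illustrative only. */
    return (char *)"application/x-sandboxed-plugin::Sandboxed plugin wrapper";
}

NPError NP_Shutdown(void)
{
    /* Tear down the sandboxed client process. */
    return NPERR_NO_ERROR;
}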

Analyze Alternatives to ptrace on Other Platforms This work focused on x86 platforms running a Linux-based operating system. An analysis of other platforms and operating systems could show how to create more efficient implementations on those systems. For example, systems featuring fine-grained means for sandboxing could provide performance improvements.


Appendix A Summary (German): Eine sichere Architektur für nicht vertrauenswürdige Webbrowser-Plugins

Over the last 20 years, the Internet underwent some profound changes. While it was initially used mainly to connect remote mainframes, the introduction of the World Wide Web (WWW) at the beginning of the 1990s created the possibility of publishing information over the Internet as pages linked with one another. For this purpose, the file format HTML established itself, with whose help texts and images could be structured by means of a markup language and displayed with various kinds of formatting. In addition, so-called hyperlinks allowed otherwise independent web pages to be connected. To load and display web pages, special programs, called web browsers, were developed.

Web pages and browsers at that time allowed only limited interactions. Every change to the presentation of a web page had to be realized by navigating to a new page. Further interactions after a page had been loaded, for example the dynamic adaptation of page content in reaction to a mouse click, became possible through the support of scripting languages. For this, runtime environments for JavaScript and VBScript were integrated into browsers. Since JavaScript was supported by a large number of browsers, it developed into the predominant scripting language for web pages.

Despite these new possibilities, it was not possible to embed media in complex formats; videos or vector graphics, for example, could not be displayed within web pages. To account for this, browsers were extended with interfaces for loading external components. Such a component, called a plugin, is responsible for handling certain media types and thus allows developers to retrofit functions for playing media largely independently of the browser vendor. Plugins are provided as compiled binary code and are usually executed by the browser and the operating system in the context of the user. A widespread plugin interface for browsers is the Netscape Plugin Application Programming Interface (NPAPI). NPAPI defines some functions for browser and plugin, but does not address further functionality such as file and hardware access or concurrency, which is why plugins have to access the operating system directly for these purposes.


Nowadays, web pages and web browsers process more and more confidential information. For example, e-mail but also banking transactions are increasingly handled through the online portals of the respective service providers. Since web pages often contain content of the most diverse origins (for example, advertisements from a third-party advertising network are often displayed next to newspaper articles, and users also visit several pages in parallel), the importance of the browser as a secure environment for the display of foreign content grows. Web page content is therefore subject to various security policies whose observance is supervised by the browser. One of them is the so-called Same Origin Policy (SOP), which regulates access by JavaScript to objects and methods based on the origin of the data. Although plugins are provided by third parties in a form that can hardly be inspected, they nevertheless become part of the browser at runtime. With the installation of a plugin, it is thus classified as trustworthy and executed with the privileges of the browser, although for the display of media files often only a small part of these privileges is needed; consequently, plugins are not subject to the security policies the browser enforces for web pages.

To bring the execution of plugins into a controllable setting, in this diploma thesis I present concepts for the creation of a new plugin interface. With it, all interactions that involve a plugin are handled through the browser. The interface is defined in a way that allows the browser to check and enforce security policies. As a basis for this, I design an architecture that enables the execution of plugins in a restricted runtime environment and thereby ensures that plugins can only use the defined interface. In this diploma thesis I present the implementation of the architecture for 64-bit Linux operating systems, in which plugins are executed in a separate process by means of the ptrace system call and placed under the control of a superordinate (browser) process. In various experiments I show the suitability of the architecture for the execution of multimedia plugins, which require low latencies as well as high data throughput and fast execution of computations.

Glossary

API Application Programming Interface

CISC Complex Instruction Set Computer

COM Microsoft’s Component Object Model

DOM Document Object Model

FTP File Transfer Protocol

GUI Graphical User Interface

HTML HyperText Markup Language

HTTP Hypertext Transfer Protocol

IA-32 Intel Architecture 32-bit

JVM Java Virtual Machine

NaCl Native Client

NPAPI Netscape Plugin Application Programming Interface

OOPP Out-of-Process Plugins

OS Operating System

PDF Portable Document Format

PPAPI Pepper Plugin Application Programming Interface

RISC Reduced Instruction Set Computer

SLOC Source Lines Of Code

SOP Same Origin Policy

SVG Scalable Vector Graphics

SSL Secure Sockets Layer

TLB Translation Lookaside Buffer: a CPU cache to speed up handling of virtual addresses

TSC Time Stamp Counter


URL Uniform Resource Locator

vCPU Virtual CPU

VM Virtual Machine

WWW World Wide Web
