
DISS. ETH NO. 24063

Database/Operating Co-Design

A thesis submitted to attain the degree of

DOCTOR OF SCIENCES of ETH ZURICH

(Dr. sc. ETH Zürich)

presented by

JANA GIČEVA

MSc in Computer Science, ETH Zürich

born on 14.12.1987

citizen of Macedonia

accepted on the recommendation of

Prof. Dr. Gustavo Alonso (ETH Zürich), examiner
Prof. Dr. Timothy Roscoe (ETH Zürich), co-examiner
Dr. Timothy L. Harris (Oracle Labs, Cambridge, UK), co-examiner
Dr. Kimberly K. Keeton (Hewlett Packard Laboratories), co-examiner

2016

Abstract

For decades, database engines have found the generic interfaces offered by conventional operating systems at odds with the need for efficient utilization of hardware resources. This is partly due to the big semantic gap between the two layers. The rigid DB/OS interface does not allow information to flow between them, and as a result: (1) the operating system is unaware of the database's requirements and provides a set of general-purpose policies and mechanisms for all applications running on it; (2) the database can, at best, duplicate a lot of the OS functionality internally, at the cost of absorbing a significant portion of additional complexity in order to use the underlying hardware efficiently – an approach that does not scale with the current pace of hardware developments.

In this dissertation, I approach the problem from two perspectives. First, I reduce the knowledge gap between the database and the operating system by introducing an OS policy engine and a declarative interface between the two layers. I show how such extensions allow easier deployment on different machines, robust execution in noisy environments, and close to optimal resource allocation without sacrificing performance or tail latencies.

Second, I propose using an OS architecture that allows dynamic splitting of the machine's resources into a control and a compute plane. I show how a compute plane kernel can be tailored to the needs of data processing applications by integrating a kernel-based runtime for efficient execution of concurrent parallel analytical jobs. I also address a modern challenge of database optimizers regarding the balance of concurrency and parallelism, and the influence that modern multicore machines have on the problem.

I conclude by discussing future directions which arise from the work presented in this dissertation, and by highlighting the potential of cross-layer optimizations on the system stack in the light of increasingly heterogeneous hardware platforms and modern workload requirements.


Zusammenfassung

For decades, database systems have found the generic interfaces provided by conventional operating systems in conflict with the need to use hardware resources efficiently. Part of the problem stems from the large semantic gap between the two layers. The rigid DB/OS interface does not allow any information flow between these layers. As a consequence, (1) the operating system has no knowledge of the requirements of the database and therefore implements only generic policies and mechanisms for all applications running on it, and (2) the database, in the best case, internally duplicates a lot of OS functionality, at the cost of a significant amount of additional complexity, just to be able to use the available hardware efficiently – an approach that does not scale with the rapid pace at which hardware changes today.

In this dissertation I tackle the problem from two sides. First, I reduce the knowledge gap between the database and the operating system with the help of an OS policy engine and a declarative interface between the two layers. I show how these extensions enable easier deployment on different machines, more robust execution in noisy systems, and close to optimal resource allocation, without sacrificing execution speed or tail latencies.

Furthermore, I propose an OS architecture that allows a machine to be dynamically split into a control and a compute plane. I show how a kernel in the compute plane can be specialized for the requirements of a data processing application by integrating a kernel-based runtime for the efficient execution of concurrent data analysis jobs. In addition, I address the current challenge that database optimizers face regarding the balance between concurrency and parallelism, and the influence that modern multicore machines have on this problem.

I conclude with a discussion of future research directions that arise from the presented work, and by highlighting the potential of cross-layer optimizations of the system stack in the light of increasingly heterogeneous hardware platforms and the requirements of modern workloads.

Acknowledgments

This has been an amazing journey. From the early days I have been captivated by the thrill of approaching and exploring even the most challenging of problems. And it is thanks to my advisers, collaborators, family, and friends that I have loved every bit of it.

First, I would like to express my gratitude to my adviser Gustavo Alonso for all his support, advice, guidance, and patience; for helping me grow as a scholar and for teaching me to love what it takes to do great research. I want to thank my co-adviser Timothy Roscoe for being supportive and always ready to give insightful feedback on my work, and for helping me improve myself as a researcher. I also extend my gratitude to my mentor from Oracle Labs, Tim Harris, for many great discussions, his feedback, and guidance. I have greatly enjoyed our collaboration, which I hope we continue in the future.

I would like to thank Kim Keeton for agreeing to be part of my PhD committee and for her feedback that significantly improved the quality of my dissertation; John Wilkes for being a supportive mentor for my PhD fellowship and for always challenging me to clearly define what I do; Eric Sedlar and Nipun Agrawal for giving me the opportunity to work on Project RAPID, an experience in the early days of my PhD that was very rewarding; and Donald Kossmann, Frank McSherry, Derek Murray, Michael Isard, Onur Mutlu, and many others from ETH, Oracle Labs, and Microsoft Research SVC for your guidance, all of our discussions, and for allowing me to learn so much from all of it.

I had the pleasure and luck to work with many great students and would like to thank all my collaborators: Tudor, Adrian, Ionut, Kaan, Claude, Darko, Gerd, Pratanu, Zaheer, and Simon P. It was such a rewarding experience working with all of you.

Throughout the years, the friendship in the Systems Group has been one of the greatest highlights. Therefore, a big thank you goes to Anja, Besa, Pravin, Stefan, Gerd, Lukas, Zsolt, Pratanu, Akhi, Georgios, Tudor and Desi. I would also like to use this opportunity to thank my closest allies and friends for many years: Tijana, Sanja, Kiki, Alen, Gogi, Ozan, Kaveh, Sukriti, Kaan, Josip, Irena.

For almost everything I have achieved so far, I have to thank my parents Gjorgji and Ljubica. They have been supportive like no other. Both have been my role models for many years and have strongly encouraged me to follow my dreams. I would also like to thank my sister Mila for always cheering me up and supporting me wherever I go.

And finally, to my best friend, biggest supporter and critic – Darko. Thank you for putting up with all the different versions of me and for always being there. Thank you for teaching me the value of a balanced life, and for showing me the positive side of every situation. Much of this success is thanks to you.

Contents

1 Introduction
  1.1 Background
  1.2 Motivation and challenges
    1.2.1 Hardware trends
    1.2.2 Deployment trends
  1.3 Problem statement
  1.4 Contributions
    1.4.1 Policies and information flow
    1.4.2 Customized OS support for data processing
  1.5 Thesis outline
  1.6 Related publications

2 OS policy engine and adaptive DB storage engine
  2.1 System Overview
  2.2 DB Storage engine
    2.2.1 The architecture of the storage engine
    2.2.2 Working unit and its properties
    2.2.3 Properties of the CSCS storage engine
    2.2.4 Embedding into COD
  2.3 The OS Policy engine
    2.3.1 Architecture
    2.3.2 Implementation
    2.3.3 Discussion
  2.4 Interface
    2.4.1 Scope
    2.4.2 Semantics
    2.4.3 Syntax and implementation
    2.4.4 Evaluation
  2.5 Experiments
    2.5.1 Experimental Setup
    2.5.2 Deployment on different machines
    2.5.3 Deployment in a noisy system
    2.5.4 Adaptability to changes
  2.6 Related work
    2.6.1 Interacting with operating systems
    2.6.2 Means of obtaining application's requirements
    2.6.3 based on applications' requirements
  2.7 Summary

3 Execution engine: Efficient deployment of query plans
  3.1 Motivation
  3.2 Background
    3.2.1 Complex Query Plans
    3.2.2 Scheduling of Shared Systems
    3.2.3 Problem Statement
    3.2.4 Sketch of the Solution
    3.2.5 In the context of COD
  3.3 Resource Activity Vectors
    3.3.1 RAV Definition
    3.3.2 RAV Implementation
    3.3.3 Capturing CPU Utilization
    3.3.4 Capturing Memory Utilization
    3.3.5 Parallel Operators
  3.4 Deployment Algorithm
    3.4.1 Operator Collapsing
    3.4.2 Minimizing Computational Requirements
    3.4.3 Minimizing Bandwidth Requirements
    3.4.4 Deployment Mapping
    3.4.5 Discussion and Possible Extensions
  3.5 Evaluation
    3.5.1 Experiment Setup
    3.5.2 Resource Activity Vectors (RAVs)
    3.5.3 Performance Comparison: Baseline vs Compressed deployment
    3.5.4 Analysis of the Deployment Algorithm
    3.5.5 Discussion
  3.6 Generalizing the approach
    3.6.1 Parallel Operators
    3.6.2 Dynamic Workload
    3.6.3 Non-shared (Traditional) Database Systems
  3.7 Related work
    3.7.1 General scheduling for multicore systems
    3.7.2 Scheduling for
    3.7.3 Resource allocation for data-oriented systems
    3.7.4 Deriving application's requirements
  3.8 Summary

4 Optimizer: Concurrency vs. parallelism in NUMA systems
  4.1 Background and Motivation
  4.2 Problem statement
    4.2.1 Factors influencing concurrent execution
    4.2.2 Scheduling approaches
    4.2.3 Evaluation Metrics
  4.3 Methodology
    4.3.1 Parallel data-processing
    4.3.2 Hardware architectures
  4.4 Algorithms in isolation
    4.4.1 Relational operators
    4.4.2 Graph processing algorithms
  4.5 Concurrent WL execution
    4.5.1 Interference in concurrent workloads
    4.5.2 Scheduling approaches – experimental setup
    4.5.3 Scheduling concurrent DB operators
    4.5.4 Scheduling concurrent graph algorithms
    4.5.5 Effect of underlying architecture
    4.5.6 Heterogeneous workload
  4.6 Discussion
  4.7 Related Work
    4.7.1 and data placement
    4.7.2 Scheduling concurrent parallel workloads
    4.7.3 Contention-aware scheduling
    4.7.4 Constructive resource sharing
    4.7.5 Impact of resource sharing and performance isolation
  4.8 Summary

5 Scheduler: Kernel-integrated runtime for parallel data analytics
  5.1 Motivating use-case
  5.2 Foundations
    5.2.1 Expanding the Application/OS model
    5.2.2 The OS architecture
  5.3 Architecture overview
    5.3.1 Control plane
    5.3.2 Compute plane
    5.3.3 Customized compute-node kernels
  5.4 Customizing a compute-plane kernel
    5.4.1 The need for better OS interface
    5.4.2 The need for run-to-completion execution
    5.4.3 The need for co-scheduling
    5.4.4 The need for spatial isolation
    5.4.5 The need for data aware placement
  5.5 Implementation
    5.5.1 Control and Compute plane
    5.5.2 Compute plane kernel – Basslet
    5.5.3 Basslet runtime
    5.5.4 Code size
  5.6 Evaluation
    5.6.1 Interference between a pair of parallel jobs
    5.6.2 System throughput scale-out
    5.6.3 Comparing standalone runtime: vs. Basslet
    5.6.4 Overhead of Badis enqueuing
    5.6.5 Evaluating the adaptive feature of Badis
  5.7 Related work
    5.7.1 Scheduling parallel workloads
    5.7.2 Scheduling of and within Runtime systems
    5.7.3 Specialized kernels
    5.7.4 OS mechanisms for scheduling and performance isolation
    5.7.5 Linux containers and
  5.8 Integration with COD's policy engine
  5.9 Summary

6 Future work
  6.1 Follow up work
    6.1.1 Managing different types of resources
    6.1.2 Supporting workloads beyond traditional analytics
  6.2 Beyond DB/OS co-design

7 Conclusion

1 Introduction

The design and implementation of today's system software (e.g., databases, operating systems) has been influenced by the increasing complexity of modern machines, the necessity of efficient resource utilization, and the need to meet the growing demands of complex data processing workloads. Unfortunately, most of the research focuses primarily on individual layers of the system stack. This is often based on making specific assumptions about the rest of the system stack and discarding the information and knowledge available in the other layers. This, however, often results in redundancy in the services and mechanisms implemented and in mismatched policies for managing the hardware resources as required by modern application workloads. My claim is that in order to efficiently address such challenges, system developers need to revisit the problem of cross-layer optimizations and re-evaluate the benefits of system co-design. In this dissertation I explore the interaction between database engines and operating systems in the light of recent hardware advancements and current trends in workload requirements. The focus of this work is on data processing systems executing modern analytical workloads on multisocket multicore machines.


1.1 Background

Database management systems and operating systems have a decades-long conflict when it comes to resource management and scheduling [Gra77, Sto81]. Even though they originally started with the same philosophy – providing access to manage data in files – they took a different approach when implementing it. For many years the two systems targeted different types of applications and machines (e.g., DBMS systems were running on dedicated, complex, and expensive database machines [Hsi79, BD83]), which shaped the role of conventional monolithic databases and operating systems as we know them today. However, the efficiency and economic advantage of off-the-shelf hardware and the need to reduce the costs of application development have led to today's situation, where a database typically runs on top of a conventional operating system.

The operating system is primarily responsible for scheduling resources across multiple applications and for providing isolation and protection. In many cases the OS also provides hardware abstractions to enable easier portability of applications across different hardware platforms. As its primary purpose is to serve multiple applications, the OS often sees the DB engine as just another program and offers generic mechanisms and policies to all of its applications.

Generally, today's operating systems multiplex applications with little to no knowledge about their requirements. They migrate, preempt, and schedule threads on various cores, trying to optimize certain system-wide global objectives (e.g., load-balancing the work queues on individual cores and across the NUMA nodes [LLF+16]). Furthermore, the operating system has no notion of how its decisions affect the performance of its applications, primarily due to the limited information flow between the two layers.

As a result, performance-sensitive applications like DBMSs often find the generic mechanisms and policies provided by the operating system ill-suited for their requirements [Sto81]. Many data processing engines running on conventional operating systems suffer from a number of performance-related problems and inefficient resource usage. Therefore, they try to take control over the management of the machine's resources. In order to do that, databases have two alternatives. First, they can try to understand how the underlying operating system operates, what its policies are, and how the available mechanisms work, and adapt their implementation accordingly; or second, they can completely ignore the policies of the operating system and find means to override them.


The problem with the first alternative is that it is tied to the current understanding of the OS policies and mechanisms (and without the OS being aware of it). As soon as the default kernel policies change, all the effort in optimizing the resource allocation is lost and performance problems become more difficult to debug and understand. Therefore, applications use the second approach and leverage libraries to override the default policies of the operating system. One example is the libnuma [Kle05] library, which allows applications to have more control over thread and data placement. Similar techniques are also used for buffer management and pinning [Gra77], page coloring for cache partitioning [CJ06, LDC+09], setting the granularity of memory allocation (e.g., using huge pages), etc. The benefits of using this alternative come primarily from the internal knowledge that applications like database systems have about their workload properties: the algorithmic complexity of their operations, data access patterns, resource requirements, and data distributions.

The idea of giving the application control over the resource management policies has also been explored by research operating systems (e.g., Exokernel [EKO95], fos [WA09], Tessellation [CEH+13], Barrelfish [BBD+09], etc.). In this dissertation we argue that these interfaces, which open up the OS functionality, can improve performance but require prohibitive effort by the developer, due to the increasing complexity of the hardware platforms and the diversity of machines on the market. Moreover, even if a single application can benefit from such interfaces, the situation for several concurrently executing applications remains unresolved: improving overall resource efficiency requires a global view of the system state and resource utilization, and knowing the set of running applications. We discuss these factors in more detail in the next section.
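As a concrete illustration of this second alternative, the following minimal sketch (my own example, not taken from the thesis; it assumes Linux with libnuma installed and linking with -lnuma) binds the calling thread to one NUMA node and allocates its data partition from that node's local memory, thereby overriding the kernel's default placement policies. The node id and partition size are arbitrary example values.

/*
 * Minimal sketch of overriding the OS defaults with libnuma (Linux, link with
 * -lnuma). The node id and partition size are arbitrary example values.
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this machine\n");
        return EXIT_FAILURE;
    }

    int node = 0;                         /* place thread and data on node 0 */
    size_t partition_size = 64UL << 20;   /* 64 MiB data partition */

    /* Restrict the calling thread to the cores of the chosen NUMA node,
     * instead of letting the kernel migrate it freely. */
    if (numa_run_on_node(node) != 0) {
        perror("numa_run_on_node");
        return EXIT_FAILURE;
    }

    /* Allocate the data partition from the node's local memory, instead of
     * relying on the default first-touch policy. */
    void *partition = numa_alloc_onnode(partition_size, node);
    if (partition == NULL) {
        perror("numa_alloc_onnode");
        return EXIT_FAILURE;
    }
    memset(partition, 0, partition_size); /* touch the memory locally */

    /* ... scan/process the partition with guaranteed local accesses ... */

    numa_free(partition, partition_size);
    return EXIT_SUCCESS;
}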

1.2 Motivation and challenges

Modern data processing systems face challenges from several recent trends, which are briefly described in this section.

1.2.1 Hardware trends

The last years have seen profound changes in the hardware available for running database systems. As hardware vendors reached the power wall, there was no longer a speed-up to be gained by simply increasing the CPU frequency. Computer architects have tried to use the available transistors by introducing multiple cores, heterogeneous computational resources (some of which are tailored to a particular application), accelerators, etc. With the rise of the memory wall and the gap between DRAM and CPU frequencies, hardware vendors introduced more complex cache hierarchies, non-uniform cache-coherent memory, etc. Consequently, the system software today has to adapt and embrace the hardware changes as an opportunity to rethink its architecture model and design principles, and find means to address the underlying hardware complexity.

Database engines today perform their own memory allocation and thread scheduling, store data on raw disk partitions, and implement complex strategies to optimize I/O and avoid synchronization problems. Extensive research has been invested over more than three decades to optimize such engine decisions, but some of these optimizations may no longer apply in the face of the changing interfaces and properties of new storage (e.g., SSDs or NVRAM instead of HDDs). Unfortunately, and unless something changes, a great deal of additional complexity will need to be added to database systems to cope with the increasing heterogeneity within and across hardware architectures. For example, in order to improve performance, the implementation of certain relational algorithms has shifted towards hardware awareness and fine-tuning to the new features and topologies of modern machines [Bal14, Mue16, WS11]. Optimal use of resources now requires detailed knowledge of the underlying hardware (memory affinities, cache hierarchies, interconnect distances, CPU/core/die layouts, etc.). Absorbing such hardware complexity has now become a burden for the programmer, and the problem gets further aggravated by the increasing diversity of microarchitectures. While this is a viable approach in the short run, it does not scale in the long term with the pace at which hardware and workload complexity are evolving.

1.2.2 Deployment trends

On the deployment side, in an age of virtualization and consolidation, databases can no longer assume they have a complete physical machine to themselves. Databases are increasingly deployed on hardware shared with other applications: in virtual machines, multi-tenant hosting scenarios, cloud platforms, etc. As a result, the carefully constructed internal model of machine resources a DBMS uses to plan its execution has become highly dependent on the runtime state of the whole machine. This state, however, is unknown to the database and is currently only available to the operating system.


As we will show with our experiments, even a single task can impact the performance of an otherwise scalable database engine, mainly because the DBMS is unaware of it. Hence, good performance in the presence of other applications requires the database to have an accurate picture of the runtime state of the whole machine.

1.3 Problem statement

This dissertation presents a fresh look at the interface and co-design of data-processing engines and operating systems in the light of new hardware architectures, and at the design of novel database and operating system architectures. Even though the main focus of this thesis is on scheduling and management of CPU resources for analytical workloads, we believe that it establishes the basis for more general cross-layer optimizations that could be applied to a wide range of resources and application properties. Some of the research questions we address are:

1. Given a data processing system and an OS, both designed to fully exploit multicore hardware, what is the best interface and knowledge distribution between them? Who knows what, and where should the knowledge reside?

2. How can the operating system help database engines deal with machine diversity and internal resource heterogeneity?

3. How can the OS policies be improved using knowledge from the data processing layer to deliver efficient resource utilization?

4. What mechanisms should an operating system provide in order to meet the requirements of modern workloads?

5. What are the recommended changes to data-processing systems so that they benefit from the richer interface with the OS? Which components can immediately leverage the newly introduced mechanisms and policies in the operating system?

The rest of the dissertation is structured such that the first part targets the knowledge distribution, its impact on the OS policies, and the information exchange over the DB/OS interface. The second part focuses on the need for efficient scheduling of modern analytical workloads, and revisits the OS architecture and process model to find a better match for dynamic workload requirements.


1.4 Contributions

In order to address the research questions and the challenges listed above, we have made the following contributions. On a conceptual level, our changes to the operating system layer can be grouped into: (i) contributions to the internal reasoning in the OS and its computation of the resource allocation policies, as well as enabling a richer information flow with the database layer; and (ii) revisiting the services and mechanisms offered by the OS and adapting them to better suit the requirements of modern data processing workloads.

1.4.1 Policies and information flow

The first part of the thesis addresses the big semantic gap between the knowledge available to the operating system and the DBMS.

Policy engine

The OS policy engine is designed such that the system itself, and the applications running on top of it, can better handle the complexity, address the challenges, and reason about the properties of the internal hardware resources and the diversity of machines on the market. In particular, it consists of a knowledge base that contains information about (i) machine-specific facts (e.g., the topology of the machine, the number of cores and the amount of memory per NUMA node, details about the cache hierarchy), (ii) application-related facts (e.g., information about whether an application is compute- or memory-bound, or sensitive to sharing caches), and (iii) information about the current system state (e.g., the number of active applications and their resource usage).

Figure 1.1: Overview of affected or new OS components

This knowledge is used both by the knowledge base to build a detailed model of the multicore machine and its resources, and by a set of algorithms and solvers that reason about it to compute resource allocation schedules. The resource manager is the active component of the policy engine, responsible for communicating with the applications, triggering resource allocation computations in the knowledge base, and implementing the output policies by invoking various OS mechanisms. Finally, the policy engine relies on a resource profiler to capture resource capacities (e.g., by measuring the maximum attainable bandwidth on the interconnect links, or estimating the local DRAM bandwidth per NUMA node), to monitor the current utilization of resources, and to enable applications to learn their resource requirements.

Declarative DB/OS interface

The proposed interface is declarative and allows for a richer two-way information exchange between the DBMS and the OS policy engine. For example, it allows applications to (i) push part of their logic (expressed as cost models or stored procedures) down to the OS, together with (ii) information about their properties. That way both layers can reason over all the information present in the knowledge base. For example, the OS can do a better job when deploying the application's threads onto a range of different machines (as we show in Section 2.5.2), and provide efficient resource allocation without affecting the application's performance or predictability (discussed in Section 3.5); and the database can adapt itself based on the current system state. The interface also allows applications to (iii) query the OS policy engine about hardware properties, the machine model, or the current system state, and (iv) subscribe for notifications in case of changes in the global resource allocation.

1.4.2 Customized OS support for data processing

In the second part I show the benefits of customizing the operating system for the needs of data processing workloads.


Revisiting the process model

Recent developments in operating systems [BBD+09, WA09] enable us to configure and specialize the operating system stack (i.e., apply changes in both kernel- and user-space) for particular application classes. Such customizations across the system stack have been shown to bring significant performance and security benefits (although at the moment limited to a single address space [MMR+13]). In this part of the dissertation, I explore how to customize the underlying OS process model to better fit the requirements of concurrent parallel analytical workloads. More specifically, I propose extending the thread-based process model consisting of OS processes, threads, and user-level threads to also support the OS task and ptask (parallel task) as program execution units. This way the application can explicitly specify that a certain job needs to be executed to completion without interruption – an OS task – and that a certain pool of user-level threads executes a common parallel job and should be co-scheduled until completion – an OS ptask. The proposed changes are implemented as a kernel-based runtime (Basslet), which can execute parallel analytical jobs on behalf of existing applications. I also show how using Basslet can resolve many resource interference problems across different workloads and improve the system throughput.
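To make the proposed execution units more tangible, the sketch below shows what submitting a parallel job as a ptask could look like from the application's perspective. The ptask_* names and signatures are my own illustrative assumptions, not the actual Basslet API; to keep the example self-contained and runnable, the compute plane is mocked with plain pthreads, whereas the real kernel-based runtime would additionally pin the workers to one NUMA node, co-schedule them, and run them to completion without preemption.

/*
 * Illustrative sketch only: the ptask_* names and signatures are assumptions,
 * not the thesis' actual Basslet API. The "compute plane" is mocked here with
 * plain pthreads so the example is self-contained (compile with -lpthread).
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef void (*ptask_fn)(void *arg, int worker_id, int nworkers);

typedef struct {
    ptask_fn   fn;
    void      *arg;
    int        nworkers;
    pthread_t *workers;
} ptask_t;

typedef struct { ptask_t *t; int id; } worker_ctx_t;

static void *worker_main(void *p)
{
    worker_ctx_t *ctx = p;
    ctx->t->fn(ctx->t->arg, ctx->id, ctx->t->nworkers);
    free(ctx);
    return NULL;
}

/* Describe a parallel job: one function executed by a pool of nworkers. */
ptask_t *ptask_create(ptask_fn fn, void *arg, int nworkers)
{
    ptask_t *t = malloc(sizeof(*t));
    t->fn = fn; t->arg = arg; t->nworkers = nworkers;
    t->workers = malloc(sizeof(pthread_t) * nworkers);
    return t;
}

/* Submit the job; the caller stays on the "control plane". */
int ptask_submit(ptask_t *t)
{
    for (int i = 0; i < t->nworkers; i++) {
        worker_ctx_t *ctx = malloc(sizeof(*ctx));
        ctx->t = t; ctx->id = i;
        if (pthread_create(&t->workers[i], NULL, worker_main, ctx) != 0)
            return -1;
    }
    return 0;
}

/* Wait until the whole parallel job has completed. */
int ptask_wait(ptask_t *t)
{
    for (int i = 0; i < t->nworkers; i++)
        pthread_join(t->workers[i], NULL);
    free(t->workers);
    free(t);
    return 0;
}

/* Example use: a parallel scan where each worker handles one partition. */
static void scan_partition(void *arg, int worker_id, int nworkers)
{
    (void)arg; (void)nworkers;
    printf("worker %d scanning its partition\n", worker_id);
}

int main(void)
{
    ptask_t *job = ptask_create(scan_partition, NULL, 4);
    if (ptask_submit(job) != 0)
        return 1;
    ptask_wait(job);           /* returns once all workers are done */
    return 0;
}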

Adaptive OS architecture

The novel Badis OS architecture splits the machine's resources into a control and a compute plane. The control plane runs the full-weight OS stack (FWK), while the compute plane consists of customized light-weight kernels (LWKs). The compute plane kernels provide selected OS services tailored to a particular workload and a noise-free environment for executing parallel jobs on behalf of the applications running on the control plane's FWK. For the purpose of this dissertation, I used a customized OS stack for executing analytical workloads – Basslet. It uses the newly introduced OS program execution units (task and ptask) for task-based co-scheduling of parallel jobs. The Basslet kernels run on the compute plane, alongside the FWK on the control plane, which offers traditional thread-based scheduling of application threads. The boundary between the two planes, as well as the set of compute plane kernels, can be changed at runtime depending on the requirements of the workload mix. Such a dynamic architecture makes the system stack suitable for scheduling hybrid workloads (e.g., operational analytics), where different kernels can co-exist at the same time, each one customized for a particular workload.


1.5 Thesis outline

The rest of the thesis describes the benefits of the OS changes in the context of, and as used by, several components of a parallel data processing system.

Chapter 2: Storage Engine
In this chapter we present COD, a system that combines a database storage engine, which is highly customized for executing analytical queries and updates with predictable runtime guarantees, with the OS policy engine, which computes and suggests deployment configurations for the database storage engine in different execution environments. COD's enhanced DB/OS interface enables the storage engine to offload to the OS policy engine the challenge of dealing with diverse hardware architectures and of finding suitable resource allocations in noisy and dynamic environments.

Chapter 3: Execution Engine
This chapter of the dissertation describes how the execution engine of a modern database system can benefit from information exchange with the OS policy engine for efficient deployment of complex query plans on multicore machines. By combining knowledge from both systems, the policy engine proposes a thread-to-core placement that delivers maximum performance and predictability, minimizes the resource utilization, and is robust across different server architectures.

Chapter 4: Optimizer
The material presented in this chapter focuses on an important problem today's database optimizers face when executing modern analytical workloads: determining a suitable degree of parallelism in concurrent scenarios for data-intensive jobs when executed on modern multi-socket multi-core machines. In particular, with an empirical study we explore the complex interplay between the chosen (i) degree of multiprogramming (concurrency), (ii) degree of parallelism of the individual jobs, as well as the (iii) corresponding placement of threads onto cores, so that we maximize the system-wide throughput and minimize the per-job variance when executed in noisy environments.


Chapter 5: Scheduler
In Chapter 5 we show the benefits of using a customized OS stack as part of the Badis OS architecture when executing multiple parallel jobs belonging to the same or different analytical applications. We discuss the limitations of the existing OS interfaces and process model for executing parallel data-intensive jobs, and propose introducing support for the task and ptask (parallel task) program execution units in the new compute plane kernel. Additionally, we discuss the design principles of the kernel-integrated runtime scheduler, and demonstrate their effects when running graph kernels on top of a popular OpenMP-based graph analytics framework.

Finally, in Chapter 6 we outline a few opportunities for future work before concluding the dissertation with a short summary in Chapter 7.

1.6 Related publications

Part of the work in the thesis has already been covered in the following publications:

[GSS+13] Jana Giceva, Tudor-Ioan Salomie, Adrian Schuepbach, Gustavo Alonso, and Timothy Roscoe. “COD: Database/Operating System Co-Design”. In 6th Biennial Conference on Innovative Data Systems Research (CIDR ’13). Asilomar, CA, USA, January, 2013 (Online Proceedings).

[GARH14] Jana Giceva, Gustavo Alonso, Timothy Roscoe, and Tim Harris. “Deployment of Query Plans on Multicores”. In Proceedings of the VLDB Endowment (PVLDB), November 2014, Vol. 8, no. 3, pp. 233–244.

[GZAR16] Jana Giceva, Gerd Zellweger, Gustavo Alonso, and Timothy Roscoe. “Customized OS support for data-processing”. In Proceedings of the 12th International Workshop on Data Management on New Hardware (DaMoN ’16). San Francisco, CA, USA, June 2016, 2:1–2:6.

2 OS policy engine and adaptive DB storage engine

The work in this chapter explores the following research questions:

• How can the operating system help data-processing applications deal with machine diversity and internal resource heterogeneity?

• How can we extend the database/operating system interface so that they exchange more meaningful information?

• How can the OS understand database-specific properties and requirements?

In particular, we show how the interaction between data processing applications and operating systems can be improved by allowing a richer (declarative) interface for mutual information exchange. The goal is to integrate the extensive internal knowledge of the database on its resource requirements (including cost models) into the operating system. That way the OS can reason about the database in addition to its system-wide and runtime view of the available hardware configuration and application mix, to provide better management of the shared resources, and act as a homogeneous hardware driver across architectural differences.


Our research prototype, COD (Co-design of an Operating system and a Database engine), combines a database storage engine designed to operate well on multicore machines with an OS policy engine in charge of making suggestions and decisions regarding the deployment aspects of the database.

The content of this chapter has been published at the 6th Biennial Conference on Innovative Data Systems Research (CIDR) in 2013 [GSS+13]. The work was done in collaboration with Tudor Salomie and Adrian Schuepbach. Salomie developed the database storage engine that was used in our prototype (CSCS [AKSS12]), while Schuepbach's work on the system knowledge base of Barrelfish (SKB) is the basis for the policy engine [SPB+08].

2.1 System Overview

This work makes several contributions vertically crossing the system stack in order to address the research questions listed above. The key contribution is the interface between the database and the operating system. Using it for knowledge exchange with the OS policy engine, the storage engine of the database can make efficient use of the available system resources even in a dynamic, noisy environment where it shares the machine with other applications. More concretely, the architecture of the system is shown in Figure 2.1. In order to leverage the benefits of the interface, both sides of the system stack need to be modified. Our prototype system, COD, is built using two experimental engines for which we had access to the source code. However, the design ideas and principles discussed in the rest of the chapter can be easily generalized and applied to other types of systems.

The storage engine we modified is a main-memory, column-oriented, shared-scan engine (marked (1) in Figure 2.1). Its main design goal is achieving robust performance and in particular predictable latencies even for unpredictable workloads [UGA+09]. As a result, its performance can be precisely controlled with a few parameters. Section 2.2 presents its design and main characteristics, before discussing how we modify it for richer communication with the operating system and better adaptivity based on the notifications received from the OS policy engine.

The second building block is the OS policy engine, a service provided by the operating system, marked (2) in Figure 2.1. Its main purpose is to unify the knowledge of available hardware resources (e.g., cores and NUMA domains) known to the operating system with information provided by the applications running on top.

Figure 2.1: COD's architecture. Shaded blocks denote the main components.

As a result, the operating system can better orchestrate the resource allocation among all running tasks/applications, by being aware of how its decisions affect their objectives. Additionally, it also serves as an OS service that applications can:

• Query for details on hardware specifications or the current system state;

• Rely on for absorbing the hardware complexity and diversity, and for providing suitable deployment suggestions; and

• Use to translate the application's internal characteristics into resource requirements, i.e., terms that the OS can reason about.

We discuss its internal building blocks in Section 2.3. Finally, between the two systems is a rich query-based interface. Even though its primary purpose is to allow the exchange of system-level information, the interface also enables the creation of application-specific stored procedures. By pushing such functions to the OS policy engine an application such as our database storage engine would be able to exploit both the Boolean Satisfiability Problem (SAT) solver and optimizer as well as

13 Chapter 2. OS policy engine and adaptive DB storage engine the overall knowledge available on the OS side. Section 2.4 describes the properties of the interface in more detail.
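As a rough illustration of the style of interaction this interface enables, the sketch below shows how a storage engine might (i) push part of its logic down as a stored procedure, (ii) submit properties as hints, and (iii) query for a deployment suggestion. The pe_* functions are hypothetical stand-ins, not the actual COD API, and are stubbed out so the example is self-contained; the pushed cost model uses the coefficients that are derived later in Section 2.2.4.

/*
 * Hypothetical illustration of the declarative DB/OS interface; the pe_*
 * names are NOT the actual COD API. The policy engine is mocked with stubs so
 * the example is self-contained; in COD the calls would reach the OS-side SKB.
 */
#include <stdio.h>
#include <string.h>

static int pe_set_property(const char *key, const char *value)
{
    printf("property: %s = %s\n", key, value);       /* stub */
    return 0;
}

static int pe_push_function(const char *name, const char *body)
{
    printf("stored procedure %s: %s\n", name, body); /* stub */
    return 0;
}

static int pe_query(const char *query, char *result, size_t len)
{
    printf("query: %s\n", query);                    /* stub */
    strncpy(result, "cores([4,5,6,7])", len - 1);    /* canned answer */
    result[len - 1] = '\0';
    return 0;
}

int main(void)
{
    /* (ii) system-level properties the OS can reason about directly */
    pe_set_property("app", "cscs-storage-engine");
    pe_set_property("cpu_bound", "true");
    pe_set_property("numa_sensitive", "true");

    /* (i) application logic pushed down as a stored procedure: the response
     * time cost model of Section 2.2.4, evaluated inside the policy engine */
    pe_push_function("cscs_rt",
        "rt(Cores, Tuples, Requests, RT) :- "
        "RT is (Tuples / Cores) * (0.85 * Requests + 601) / 3750000");

    /* (iii) ask for a deployment suggestion computed from global knowledge */
    char answer[64];
    pe_query("min_cores(cscs_rt, slo(350))", answer, sizeof(answer));
    printf("suggested core allocation: %s\n", answer);
    return 0;
}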

2.2 DB Storage engine

The most important properties that a database component needs to have in order to be able to fully benefit from the interaction with the operating system are the following:

1. A working unit (e.g., a thread) which is typically mapped to a hardware context, especially for multithreaded operations, enables flexible and elastic use of resources. This is important as it allows the application to adjust itself based on recommendations from the operating system. It could be extended to other types of resources (e.g., network channels, memory, etc.), but such an analysis is out of the scope of this work.

2. Information about the application's properties and resource requirements. For example, a cost model that predicts the impact of resource allocation on the performance of each application's task. This is important as it informs the operating system about the needs and sensitivity of the database jobs. Consequently, the OS policy engine can use this information when computing the global allocation of resources among different application mixes.

This section presents the DB storage engine that was used in our prototype. It was originally designed and implemented by Tudor Salomie, and the details are provided in the technical report for Clock-Scan Column Store (CSCS) [AKSS12]. After covering its basic properties and why it satisfies the required features, we describe how we adjusted it for better interaction with the operating system policy engine.

2.2.1 The architecture of the storage engine

Traditionally, database storage managers use pages and memory blocks as the unit of exchange with the rest of the system stack. Their primary function is to manage the buffer pool, which is used by all transaction threads when accessing data from memory. The storage manager enforces the corresponding mechanisms to ensure transactional durability (e.g., flushing dirty pages to disk when needed) and isolation (e.g., enforcing locking), in addition to prefetching and replacing pages to minimize the number and cost of page misses. Very often the performance of the DBMS heavily depends on the performance of its storage engine, as it controls the policies for internal memory allocation and page replacement.

Storage managers are an important component of DBMSs and they need to arbitrate among many concurrent transactions that compete for various resources. In NUMA multicore architectures with large main memories, the dominant aspects of execution time are no longer the size of memory chunks and the management of pages, but rather the thread-to-core placement and the memory affinities used when allocating data on the available NUMA nodes. As a result, these challenges are currently a hot topic in both the research and industrial communities.

In particular, CSCS was built as a storage engine that exposes a SQL interface instead of pure memory blocks and pages. Its design was motivated by the Crescando storage engine [UGA+09], which was tailored for the airline industry. Like many state-of-the-art systems (e.g., MonetDB [BKM08], C-Store [SAB+05], etc.), CSCS is a main memory engine that leverages columnar storage for more efficient processing of analytical workloads due to improved data locality. For example, with a column store, if a query involves only a few columns of a table, only a fraction of the data has to be brought into the processor's cache.

The CSCS engine achieves good throughput and predictable response times as a result of having the following characteristics:

1. Batching incoming requests. Instead of executing each transaction alone (by its own thread), the CSCS engine batches the incoming requests and processes them as a group – an idea increasingly applied in operations like scans (Crescando [UGA+09], IBM Blink [RSQ+08, HKL+08]), which have been extensively studied [ZHNB07, SBZ12], joins (CJOIN [CPV09, CPV11], MQ-Join [MGAK16]), or complete query processing engines (SharedDB [GAK12], DataPath [ADJ+10]).

2. Shared execution. This is an example of a multi-query optimization technique [Sel88]. The idea is to simultaneously process a group of queries on the same table. Similar to IBM Blink and Crescando, CSCS avoids static indexes on the data. As a result, these systems do not pay the performance penalty of updating many index data structures during insert/update requests. Instead, by only scanning the data for answering the queries and applying the updates in-place, these systems can offer an upper bound on the response time of each transaction – equivalent to the cost of a full scan over the data. With modern hardware a single scan thread can answer thousands of requests at a time. The shared scan is CPU bound, as its performance is limited by the time to perform predicate evaluation (typically very CPU intensive due to many string-compare operations) on the subset of records already in the processor's cache, rather than by the DRAM latency to bring the tuples in and out of the cache.

Figure 2.2: CSCS architecture

Similar to Crescando, SharedDB, and DataPath, the CSCS storage engine maps individual operators (e.g., the scan threads) to cores and well-defined memory regions. Therefore, by design there is no interaction between the operator threads beyond the necessary data flow. This makes the system more flexible in terms of resource allocation.

Figure 2.2 shows the CSCS architecture. Incoming requests are enqueued in the input queue, where they are batched and indexed based on their predicates. The scan threads perform full data scans (similar to the ClockScan of Crescando), each one responsible for its own (horizontal) partition of the dataset. After completing a scan and merging the results, the scan threads coordinate before they read a new batch of requests and begin a new data scan phase. As soon as the resulting tuples are processed by the merging and aggregation thread, they are pushed into the output queue. More details on the system can be found in the corresponding technical report by Salomie et al. [AKSS12].
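The following sketch is my own simplification of the shared-scan idea (not the actual CSCS code, which probes a predicate index rather than iterating over all requests): one pass over a thread's partition evaluates the entire batch of requests, which is what lets a single scan amortize its cost across the whole batch.

/*
 * Simplified sketch of a shared scan over one horizontal partition; an
 * illustration of the idea only, not the actual CSCS implementation.
 */
#include <stdio.h>
#include <stddef.h>

typedef struct { int key; int payload; } tuple_t;      /* toy record          */
typedef struct { int id; int key_pred; } request_t;    /* point query: key == */

/* One scan pass: every batched request is evaluated against every tuple of
 * this thread's partition, so one pass over the data answers the whole batch. */
static void shared_scan(const tuple_t *partition, size_t ntuples,
                        const request_t *batch, size_t nrequests)
{
    for (size_t i = 0; i < ntuples; i++)
        for (size_t r = 0; r < nrequests; r++)
            if (batch[r].key_pred == partition[i].key)
                printf("request %d matched tuple with key %d\n",
                       batch[r].id, partition[i].key);
}

int main(void)
{
    tuple_t partition[] = { {1, 10}, {2, 20}, {3, 30}, {4, 40} };
    request_t batch[]   = { {0, 3}, {1, 1} };   /* two point queries */
    shared_scan(partition, 4, batch, 2);
    return 0;
}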

2.2.2 Working unit and its properties

As mentioned in Section 2.2, it is important to identify the working unit of a database component in order to allow for elastic and adaptable deployment based on recommendations from the operating system. In the case of the CSCS storage engine, such a working unit is the scan thread. To ensure good data locality, each scan thread is pinned to a core in the NUMA region where its data partition resides. The columnar storage of the CSCS engine ensures good data locality during the scan by minimizing L1 data cache misses. Therefore, the CSCS scan threads are CPU bound, but also NUMA-sensitive. Furthermore, as all requests within a batch are processed in the same scan, the response time for all of them is bounded by the time it takes to perform a scan over all data partitions and match each tuple against the indexed request predicates before merging the results.

2.2.3 Properties of the CSCS storage engine

The number of scan threads assigned to the CSCS storage engine is determined by the amount of data in the system, expected peak throughput of incoming client requests, and the bounds for response time for processing a request (i.e., the service-level objectives (SLOs) for response time). Given these parameters, the number of scan threads can be derived, making the performance of the CSCS engine both scalable and predictable.

1. The scalability property of the CSCS engine is a consequence of its ability to achieve an almost linear decrease in the response time when increasing the number of scan threads for a constant data size. Since the scan threads are synchronization free, except when coordinating before fetching a new batch of requests, the number of scan threads in the system has a negligible effect on the communication costs.


2. The predictability property is due to the CSCS engine's ability to guarantee an upper bound on the response time of any request that it receives. There are two major reasons for this. First, for each request (query or update) it does a full table scan, and does not optimize the latency of particular requests by building indexes on the data. As a result, workloads with a large number of update requests become a lot cheaper, as there is no longer a need to maintain up-to-date indexes. This makes the latency of individual requests more expensive, but predictable: we always know how long it takes to scan the whole dataset. Second, by operating on a whole batch of requests at a time, all of them can be handled with a single full scan over the data. Therefore, at almost the same cost we can significantly increase the throughput and compensate for the increased latency of each individual query. More details on this trade-off are discussed in Crescando [UGA+09].

2.2.4 Embedding into COD

So far I have discussed the properties of the CSCS engine, and shown that it satisfies the first requirement of having a working unit that is easily deployable, which also makes the storage engine scalable. This section describes the second requirement: providing information about the application's properties to the operating system. In the case of the CSCS engine and its scan threads, we provide a cost model that can be used to determine the type and amount of resources needed such that the storage engine can meet its response time SLOs. From now on we will assume, without loss of generality, that the system's SLO is an upper bound on the response time of all requests.

While determining the appropriate cost model for a particular engine is a problem orthogonal to the primary focus of this section, we would like to note that deriving one is fairly common practice in the design and deployment of any database (especially for the optimizer). Therefore, I do not consider it unreasonable to expect such a cost model as part of a system co-design with the operating system.

For the purpose of our prototype (COD), we focus on a workload comprising both update and read-only queries. It is inspired by the operational business intelligence (BI) workload from the travel industry, which contains a high number of both queries and updates [UGA+09]. In particular, for our experiments we use the same workload from Amadeus as in Crescando. It consists of highly selective requests (i.e., point queries and updates matching a few records). The dataset is stored in a single table, where each record is approximately 315 bytes in size (fixed) and contains 47 attributes.

Deriving the cost model for CSCS

The cost model for the CSCS scan can be derived as follows. For each scanned tuple (record), the scan thread probes the predicate index of the batched requests. If successful, it checks whether the other predicates of the query/update are also satisfied. Therefore, the cost per tuple can be computed as y · #requests + z, where y and z denote coefficients that can be derived empirically and depend on the workload and machine characteristics. The number of tuples processed by each scan thread corresponds to the size of its horizontal partition, #tuples/#cores, where #cores denotes the number of scan threads and #tuples represents the total number of tuples in the dataset. Therefore, the cost model that describes the response time (RT) of the CSCS engine as a function of the number of scan threads (#cores), the dataset size (#tuples), and the batch size per thread (#requests) is:

RT = x · (#tuples / #cores) · (y · #requests + z)    (2.1)

For the AMD MagnyCours machine (introduced in Section 2.5.1) and this particular workload (3.75 million records, and 4:1 ratio of queries vs. updates), the coefficients are:

x = 1/3750000, y = 0.85, and z = 601

Constants y and z are machine and workload dependent, while x is used to normalize the total number of tuples as used in the calibration experiment. Using this model, the CSCS engine can delegate the decision for determining the minimum number of cores needed for its scan threads’ deployment to the operating system. Given the cost model, the OS has a better idea of what the storage engine needs and can find the best possible match to meet those requirements given the characteristics of the underlying hardware and the current system state.
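Assuming the calibrated coefficients above, the following minimal sketch (compile with -lm) shows how either side could use the model: evaluating Equation 2.1 for a given allocation, or inverting it to obtain the smallest number of scan threads that still meets a response-time SLO. The SLO and batch size in main() are arbitrary example values, expressed in the same time units as the calibration.

/*
 * Minimal sketch of using the cost model of Equation 2.1; the coefficients
 * are the ones calibrated above, the SLO and batch size are example values.
 */
#include <math.h>
#include <stdio.h>

static const double X = 1.0 / 3750000.0;   /* normalization coefficient */
static const double Y = 0.85;              /* per-request cost          */
static const double Z = 601.0;             /* per-tuple base cost       */

/* Predicted response time for a given number of scan threads (Equation 2.1). */
static double response_time(double ntuples, int ncores, double nrequests)
{
    return X * (ntuples / ncores) * (Y * nrequests + Z);
}

/* Smallest number of scan threads whose predicted RT stays within the SLO. */
static int min_cores_for_slo(double ntuples, double nrequests, double slo)
{
    return (int)ceil(X * ntuples * (Y * nrequests + Z) / slo);
}

int main(void)
{
    double ntuples = 3750000;   /* dataset size used in the calibration     */
    double nrequests = 512;     /* example batch size per scan thread       */
    double slo = 350.0;         /* example response-time bound (same units) */

    int cores = min_cores_for_slo(ntuples, nrequests, slo);
    printf("need %d scan threads; predicted RT = %.1f\n",
           cores, response_time(ntuples, cores, nrequests));
    return 0;
}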


For more concrete examples on how to derive cost models for complex database operators for hierarchical memory systems, I refer the reader to the work by Manegold et al. [MBK02].

Making the storage engine elastic

Finally, the CSCS storage engine was extended to be both proactive and reactive, i.e., in addition to stating its requirements, it can also receive notifications from the OS about any changes in the resource allocation. Therefore, before starting a new scanning phase, the parent scan thread checks for updates from the operating system on whether it needs to re-configure its deployment. If so, it re-organizes the data partitions, spawns or kills scan threads as required, or migrates scan threads and their associated data partitions across the allocated cores. In order to make a good decision, it can query the operating system policy engine for further instructions or information. We discuss this in more detail with a concrete use case in Section 2.5.4. After the re-organization is done, the main thread resumes the execution of all other scan threads, which now operate on the new batch of requests.
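The control flow described above might look roughly as follows; all helper functions are hypothetical stubs (not the CSCS/COD implementation), kept trivial so that the sketch is self-contained and runnable.

/*
 * Illustrative sketch of the elastic scan loop; all helpers are hypothetical
 * stubs, not the actual CSCS/COD implementation.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct { int ncores; } allocation_t;

static int phase = 0;

/* Stub: pretend the policy engine grants two extra cores before phase 3. */
static bool check_os_notification(allocation_t *a)
{
    if (phase == 3) { a->ncores = 6; return true; }
    return false;
}

static void reorganize_partitions(const allocation_t *a)
{
    printf("re-partitioning data for %d scan threads\n", a->ncores);
}

static void adjust_scan_threads(const allocation_t *a)
{
    printf("spawning/killing/migrating scan threads (target: %d)\n", a->ncores);
}

static void run_scan_phase(void)
{
    printf("scan phase %d: processing one batch of requests\n", phase);
}

int main(void)
{
    allocation_t new_alloc;

    for (phase = 0; phase < 5; phase++) {
        /* Before each scan phase, check whether the OS pushed a new
         * resource allocation (e.g., more or fewer cores). */
        if (check_os_notification(&new_alloc)) {
            reorganize_partitions(&new_alloc);
            adjust_scan_threads(&new_alloc);
        }
        run_scan_phase();  /* all scan threads process the next batch */
    }
    return 0;
}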

2.3 The OS Policy engine

The main idea of our work is to extend traditional operating systems with new functionality that allows for a richer interface between the OS and the applications running on top. One important challenge that we address is the distribution of knowledge in the system. For instance, the operating system has information about the state of the system resources, while applications such as the CSCS storage engine have detailed knowledge about their workloads and algorithms. An important observation is that in order for the operating system to be able to make optimal resource allocation decisions, and for the CSCS engine to be able to make optimal use of the available resources, both need access to all of this knowledge. The challenge is where to place it. On the one hand, centralizing it in the operating system requires that the OS take "on trust" information from the application and perform certain optimizations on behalf of the application itself. On the other hand, placing it in the application (in our case the CSCS engine) prevents the operating system from making global resource decisions. We address this problem as follows:


1. The operating system maintains detailed information about the underlying hardware (cores, caches, NUMA interconnect, performance trade-offs, etc.) and its own state (resource allocations, load, applications running, etc.) in a rich framework that enables reasoning about this information to make medium-term (i.e., in the order of a few seconds) policy decisions, such as the spatial placement of OS and application tasks onto cores.

2. The OS also offers this functionality to the applications via the new declarative interface (Section 2.4). The operating system can then perform complex application-specific calculations on that state on behalf of the CSCS engine or other applications, allowing them to optimize the usage of their own resources. It also enables applications like the CSCS engine to reason about the complete system state, without the cost of collecting and maintaining the same state in the application itself. This becomes especially useful with the facility for explicit notifications to the application when the allocation has changed.

3. Finally, the operating system also allows the applications to submit hints about their properties and requirements, which are stored along with the system state.

In our prototype COD, we refer to this part of the system as the Policy Engine and view it as part of the operating system.

2.3.1 Architecture

The new OS module is structured as shown in Figure 2.3 and consists of two components: (1) the Resource Manager (RM) and (2) the System Knowledge Base (SKB). We borrow the design of the SKB from the Barrelfish OS, and more details about it and its widespread use can be found in the PhD thesis by Adrian Schuepbach [Sch12]. In short, the SKB stores information in the form of free-form predicates in a Constraint Logic Programming (CLP) engine. This allows reasoning over this information by issuing logical queries, extended with facilities for constraint solving and optimization.

Figure 2.3: Functionality that was added to the OS policy engine. The Resource Manager (RM) is the active component that triggers computation in the SKB and communicates with applications. The System Knowledge Base (SKB) stores information about the system and now also understands application-specific properties.

The knowledge stored in the SKB can be put into two categories:

1. System-level facts. The SKB is populated with information from the operating system about (i) the underlying hardware platform, and (ii) the system state. It

21 Chapter 2. OS policy engine and adaptive DB storage engine

CSCS engine

OS Policy engine Constraint solver and optimizer 2 SKB

1 Resource Manager System-level properties CSCS-specific (RM) Hardware architecture properties properties System-level properties of the CSCSengine

Hardware

Figure 2.3: Functionality that was added to the OS policy engine. The Resource Manager (RM) is the active component that triggers computation in the SKB and communicates with applications. The System Knowledge Base (SKB) stores information about the sys- tem and now also understands application specific properties.

It obtains the former at startup by resource discovery and online micro-benchmarks. Example data includes the hardware topology, memory hierarchy, core to NUMA node affinities, etc. The system state information is kept by bookkeeping the set of running tasks and their spatial assignments to cores. The SKB can also store other named constraints and inference rules. We refer to these collectively as system-level facts. Applications like the CSCS engine can then issue queries to the SKB to retrieve information about the current resource allocation. More importantly, they can also submit more complex queries that allow both the SKB and the application itself to optimize the resource allocation and execution based on this information.

2. Application-specific facts. The SKB can also be populated with information that is specific to its applications. In particular, programs like the CSCS engine can submit system-level properties (as "hints") for resource allocation to the SKB in the form of additional constraints, which the OS can take into account when allocating resources (e.g., CPU-bound, number of cores).


Note that this is not intended to override OS policies, even though it may influence them, as it provides the operating system with additional application-specific information. As a result, the OS policy engine can select the best option from a set of several policy-compliant alternatives. The application-level facts, however, can be part of an application's domain knowledge, which may often not be understood by the operating system (e.g., the number of tuples). In such cases, these facts can be used by application-specific stored procedures (explained later) to compute the desired system-level properties. We discuss concrete use-cases in Section 2.5.

The SKB is the reactive component of the policy engine. It can be seen as a repository and calculation engine for the system knowledge. It is complemented by the resource manager (RM), which is the active component. It implements the allocation policies as computed by the SKB. On each registered change in the system environment (e.g., a new application registers, a task terminates, an application's properties change) the resource manager triggers a re-computation of the global allocation within the SKB. As soon as the re-computation is done, the resource manager invokes the required OS mechanisms and implements the desired OS policy. It is also responsible for notifying the registered applications about the new resource allocation decisions. As the reader has probably already realized, the OS policy engine and its computations are not intended to be on the critical path, neither in the operating system nor in the application, i.e., they should not delay regular operations in either system. The main purpose is to provide the means to calculate and reason about medium-term policies, such as thread placement or data partitioning, based on global system knowledge.
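To make the interplay between the two components more concrete, the following C sketch outlines the event-driven control flow just described: every registered change triggers a re-computation in the SKB, after which the resource manager enacts the new policy and notifies the affected applications. The event names and helper functions (skb_recompute_global_allocation, apply_os_mechanisms, notify_registered_applications) are hypothetical stand-ins introduced only for this illustration and do not correspond to actual COD symbols.

#include <stdio.h>

/* Hypothetical event types the resource manager reacts to. */
typedef enum {
    EVT_APP_REGISTERED,
    EVT_TASK_TERMINATED,
    EVT_PROPERTIES_CHANGED
} rm_event_t;

/* Stubs standing in for the SKB computation and the OS mechanisms. */
static void skb_recompute_global_allocation(void) {
    puts("SKB: recomputing the global allocation plan");
}
static void apply_os_mechanisms(void) {
    puts("RM: pinning threads / enacting the new policy");
}
static void notify_registered_applications(void) {
    puts("RM: sending upcalls with the new core assignments");
}

/* The RM is the active component: every registered change in the system
 * environment follows the same path: recompute in the SKB (off the
 * critical path), apply the result, and notify the applications. */
static void rm_handle_event(rm_event_t evt) {
    (void)evt;
    skb_recompute_global_allocation();
    apply_os_mechanisms();
    notify_registered_applications();
}

int main(void) {
    rm_handle_event(EVT_APP_REGISTERED);
    return 0;
}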

2.3.2 Implementation

The OS policy engine is implemented on the Linux kernel, version 2.6.32, for 64-bit machines. Note that there is almost nothing that prevents us from porting it to an operating system with more explicit allocation of resources, such as the Barrelfish operating system. The resource manager populates the SKB with the required information about the hardware architecture, and triggers periodic re-computations of the system state and resource allocation based on the current environment and the applications running on top. The resource manager primarily implements the spatial scheduling policies by means of thread pinning. In our prototype, the RM mediates between the SKB and the CSCS engine, and informs the CSCS of changes in the resource allocation by means of upcalls.
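Since the spatial policies are enforced through thread pinning, it may help to recall what that mechanism looks like on Linux. The snippet below pins the calling thread to a given core using the standard pthread_setaffinity_np call; it is a generic illustration of the mechanism, not code taken from the COD prototype.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Pin the calling thread to a single core; returns 0 on success. */
static int pin_self_to_core(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void) {
    int err = pin_self_to_core(0);
    if (err != 0) {
        fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));
        return 1;
    }
    printf("now running on core %d\n", sched_getcpu());
    return 0;
}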


The SKB side of the policy engine is implemented using the ECLiPSe CLP engine [AW07], and runs as a system service. Even though ECLiPSe is expressive, convenient and easy to use, executing complex queries can sometimes be (prohibitively) slow.

2.3.3 Discussion

Even though the computation of a global allocation plan is done periodically and is off the critical path, it is still important to complete the computation in a reasonable time. Unfortunately, the current implementation has not been optimized to restrict the ECLiPSe solver in any way when searching for valid resource allocations. By allowing it such freedom, the solver considers all possible solutions, which comes at the risk of high and often unpredictable computation times. There are several options to address this. First, we can restrict the solver's search space in a way that it still finds suitable allocations. For instance, if an application needs four cores, the current implementation will choose them from a full permutation of all available cores, which is excessive. Alternatively, we could shift to using a modern Satisfiability Modulo Theories (SMT) solver like Z3 [DMB08]. In Chapter 3 we discuss how we extend it with an approximate capacity-constrained bin-packing algorithm, which trades off the optimality of the deployment solution for a significantly lower computation time.
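To give a flavor of how much cheaper a restricted search can be, the sketch below uses a first-fit-decreasing heuristic to assign tasks with given core demands to NUMA nodes with fixed core capacities. This is not the algorithm developed in Chapter 3; it merely illustrates the general idea of a capacity-constrained bin-packing heuristic that avoids enumerating all core permutations. All task demands and capacities are made up for the example.

#include <stdio.h>
#include <stdlib.h>

#define NUM_NODES 4
#define NUM_TASKS 5

/* Remaining core capacity per NUMA node and core demand per task (example values). */
static int capacity[NUM_NODES] = {6, 6, 6, 6};
static int demand[NUM_TASKS]   = {4, 3, 3, 2, 1};
static int placement[NUM_TASKS];          /* chosen node per task, or -1 */

/* Sort task indices by decreasing demand. */
static int cmp_desc(const void *a, const void *b) {
    return demand[*(const int *)b] - demand[*(const int *)a];
}

int main(void) {
    int order[NUM_TASKS];
    for (int i = 0; i < NUM_TASKS; i++) order[i] = i;
    qsort(order, NUM_TASKS, sizeof(int), cmp_desc);

    /* First-fit decreasing: place the largest demand on the first node that fits. */
    for (int k = 0; k < NUM_TASKS; k++) {
        int t = order[k];
        placement[t] = -1;
        for (int n = 0; n < NUM_NODES; n++) {
            if (capacity[n] >= demand[t]) {
                capacity[n] -= demand[t];
                placement[t] = n;
                break;
            }
        }
        if (placement[t] < 0)
            printf("task %d (%d cores) cannot be placed\n", t, demand[t]);
        else
            printf("task %d (%d cores) -> NUMA node %d\n", t, demand[t], placement[t]);
    }
    return 0;
}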

2.4 Interface

The most important component of our design is the interface that joins the OS policy engine with the CSCS storage engine. Before discussing concrete examples on how it can be used for a variety of use-cases in Section 2.5, we provide an overview of the current scope covered by the interface. We also briefly describe the different classes of supported functions and their intended use.

2.4.1 Scope

The interface between the operating system and an application (e.g., a DBMS) can cover a range of aspects but in this section we primarily focus on the interaction and information exchange between the CSCS storage engine and the OS policy engine.

24 2.4. Interface

Table 2.1: Message types supported by COD’s interface

Message Type                      Example
Core function                     rsmgr_client_connect()
Add facts                         add_fact(var_name(value))
Add stored procedures             add_query("f_name(f_vars):-f_content")
(De-)Register for notification    rsmgr_register_fn(, , handler)
Query system-specific facts       execute_query("get_nr_cores()")
Query stored procedures           execute_query("f_name(f_param,f_result)")

The interface provides support for actions such as:

1. Retrieving information about the underlying architecture and the available resources;

2. Pushing down application-specific properties/facts and cost models, so that the OS policy engine can reason about them;

3. Adding stored procedures which compute application-specific logic using system state information; and

4. Allowing for continuous information flow between the two layers during runtime in the form of notifications and updates.

The list of features and supported message types is neither exhaustive nor exclusive. Its primary purpose is to show the type of interactions that can be implemented in a co-design architecture. We discuss further possibilities and extensions in Chapter 3.

2.4.2 Semantics

For a better overview of the current possibilities, we have grouped the supported message types into several categories and summarized them in Table 2.1. COD's interface currently supports the following types of messages:

1. The core functions provide support for: (1) Initializing the communication between the CSCS and the resource manager of the OS policy engine;


(2) Registering the CSCS as a running task, which informs the RM of the policy engine that a new task has arrived, and sets up state so the RM can forward notifications to it; and (3) Requesting resource allocation suggestions from the RM, which will invoke the execution of the global allocation code in the SKB, and eventually forward the decision to the affected applications/tasks.

2. Add facts messages enable the CSCS engine to load information about its own properties, as facts, into the SKB. As described earlier, the policy engine distinguishes between system-level properties and application-specific facts. The interface allows for both of these to be modified and/or removed at any time during the execution of the application.

3. Add stored procedures messages enable the CSCS engine to add application-specific functions to the SKB that are tailored to its own needs. An example is the deployment cost function passed on as a stored procedure (a code sample is provided in Listing 2.3). These procedures can use all application-related facts as well as all the system-level properties belonging to the application.

4. (De-)Register for notification messages allow the CSCS engine to be informed about changes occurring in the system, and to filter unrelated events. The resource manager by default will notify all affected applications via upcalls when the global system optimizer of the SKB changes the resource allocation. This enables the CSCS and other affected applications to adapt their execution plans and internal resource management accordingly, and be able to react to the changes in the system state that affect the resources they operate with.

5. Query system-specific facts messages enable the CSCS engine to issue queries that retrieve system-specific information from the SKB.

6. Query stored procedures messages allow the CSCS engine to also query its own application-specific functions, previously added to the SKB, and to receive concrete information based on the current system state.

As we mentioned before, this list of message types is neither exhaustive nor exclusive. It is rather intended to illustrate the possibilities offered by the currently supported interface and is subject to alterations and changes in the future.


2.4.3 Syntax and implementation

Currently, the implementation of the interface is tightly coupled with the semantics of the OS policy engine. In particular, it depends on the support provided by the OS policy engine, and on the syntax that the SKB module understands. Since the OS policy engine is implemented as a user-space library, written in C, and uses the ECLiPSe CLP language, most of the message types supported by the interface resemble the Prolog format. As such, they contain a string with a Prolog command as an argument, followed by the input parameters for the function called, as well as pointers to the output variables. Parsing the response obtained from the policy engine on the client side (i.e., the application side) is implemented in a similar fashion. An exception is the core function, which enables the CSCS engine to connect to the resource manager of the OS policy engine, and to register itself as an application entering the system. For both message types, we provide a more detailed explanation and examples in Section 2.5.
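The sketch below illustrates this calling convention from the client side: a query is shipped as a plain Prolog string, the answer comes back as a string that the application parses, and allocation changes arrive through a registered upcall handler. The function names follow Table 2.1, but their signatures are simplified assumptions and the bodies are stubs, so the example is self-contained rather than an excerpt of the real COD client library.

#include <stdio.h>
#include <string.h>

/* Simplified, hypothetical client-side prototypes modelled on Table 2.1;
 * the real library takes additional parameters that are elided here. */
typedef void (*rsmgr_handler_t)(const char *new_allocation);
static rsmgr_handler_t g_handler;

static int rsmgr_client_connect(void) { return 0; }
static int rsmgr_register_fn(rsmgr_handler_t h) { g_handler = h; return 0; }
static int skb_execute_query(const char *query, char *out, size_t len) {
    /* A real call would ship the Prolog string to the SKB and copy the
     * solver's textual answer back; here we fabricate a reply. */
    (void)query;
    snprintf(out, len, "nr_cores(4), part_size(2147483648)");
    return 0;
}

/* Upcall invoked by the resource manager when the global allocation changes. */
static void on_allocation_change(const char *new_allocation) {
    printf("RM upcall: new allocation %s\n", new_allocation);
}

int main(void) {
    char answer[256];

    rsmgr_client_connect();
    rsmgr_register_fn(on_allocation_change);

    skb_execute_query("dbos_cost_function(S, NrCores, PartSize).",
                      answer, sizeof(answer));
    if (strstr(answer, "nr_cores") != NULL)   /* parse the Prolog-style reply */
        printf("SKB answered: %s\n", answer);

    /* Simulate the RM recomputing the global plan and issuing an upcall. */
    if (g_handler) g_handler("cores([2,3,4,5])");
    return 0;
}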

2.4.4 Evaluation

The message types and the interface as a whole are evaluated from two different aspects:

1. Applicability, or how the interface and its messages can be used by the applications. We discuss concrete scenarios and use-cases in Section 2.5.

2. Overhead, or the cost of making each call. Such an analysis is presented after each use-case, as part of the discussion of the overall results obtained from that experiment.

2.5 Experiments

This section presents in more detail the interaction between the CSCS engine and the OS policy engine through use-cases, presenting both the advantages of this approach, supported by experiments, and the overhead imposed by the communication. Furthermore, we conclude each scenario with ideas for extensions and possible future work. Finally, we discuss the limitations of the current approach.


More concretely, in this section we show that COD can deploy efficiently on a variety of different machine configurations without prior knowledge of hardware, and react to dynamic workloads to preserve performance.

2.5.1 Experimental Setup

The dataset and workload used in these experiments is generated from traces of the Amadeus on-line flight booking system. It is characterized by a large number of concurrent point queries, frequent peak loads, many updates, and strong latency requirements. The main dataset represents records of flight bookings: one record for every person on a plane. A record contains 47 attributes and is approximately 315 bytes in size (fixed). Many of its attributes are highly selective (e.g., seat class, vegetarian). A travel booking can contain millions of such records (in our experiments we use between 3.75 million and 180 million tuples, corresponding to 8 and 53 GB of data). The same workload was also used by Unterbrunner et al. [UGA+09]. We used Linux kernel version 2.6.32 on four different hardware platforms:

1. AMD Shanghai: Four quad-core 2.5 GHz AMD Opteron 8380 processors, and a total of 16 GB RAM, arranged as 4 GB per NUMA node.

2. AMD Barcelona: Eight quad-core AMD Opteron 8350 processors running at 2 GHz, a total of 16 GB RAM spread across eight NUMA nodes.

3. Intel Nehalem-EX: Four 8-core 1.87 GHz Intel Xeon L7555 processors and a total of 128 GB RAM with a NUMA node size of 32 GB. Hyperthreads are disabled.

4. AMD MagnyCours: Four 2.2 GHz AMD Opteron 6174 processors. Each one with two 6-core dies and a NUMA node size of 16 GB.

2.5.2 Deployment on different machines

Our first scenario shows how COD can adapt the deployment of the CSCS engine to different hardware platforms using the OS policy engine.


Use-case description

The goal of the first use case is to show how COD assists the CSCS engine in determining the most suitable deployment strategy on a given machine, so that it meets its response time SLO. The output that the CSCS engine needs is: (i) the number of scan threads to be spawned, and (ii) the correct size and placement of its horizontal data partitions. This task is trivial when one knows both the properties of the underlying architecture and the cost model that characterizes the storage engine's scan operation (recall Equation 2.1). Taking this information into account, with this scenario we confirm the importance of deriving such cost models on the application's side (e.g., the cost model for the CSCS scan) and matching them against the available hardware resources.

Listing 2.1: Using the interface

1  rsmgr_client_connect(use_skb);
2  rsmgr_register_function();
3
4  skb_client_connect();
5
6  skb_system_fact(maxCores, MAX_CORES);
7  skb_system_fact(bound, CPU);
8  skb_system_fact(sensitive, cache);
9  skb_system_fact(sensitive, NUMA);
10
11 skb_add_fact("db_ntuple(3750000).");
12 skb_add_fact("db_tsize(315).");
13 skb_add_fact("db_nquery(2048).");
14 skb_add_fact("db_nupdate(256).");
15 skb_add_fact("db_rtime(3000).");
16
17 skb_add_fn("db_cost_fn(X,Y,Z,NrCores):-
18     NrCores is (X*((0.85*Y)+601))/(3750000*Z)");
19
20 skb_add_fn(...query: see Listing 2.3...);
21 skb_execute_query(...query...);


Listing 2.2: Example for retrieving system-level facts

1 get_list_free_cpus(avail_cores):-
2     findall(_,cpu_affinity(_,_,_),avail_cores).
3 get_list_numa_sizes(numa_sizes):-
4     findall(N, memory_affinity(_, N, _), numa_sizes).

Implementation details

Here we provide concrete details about the interaction between the CSCS and the OS policy engines, and how the information exchange comes into play. We refer the reader to Listing 2.1 for the code samples which we use in the description of this use-case. When spawned, the CSCS engine connects to the OS policy engine by registering with both the RM and the SKB (lines 1-4). It then submits its system-level properties to the SKB (lines 6-9). In this example it states to the OS that it could use all available cores on the machine, and that its scan threads are CPU-bound tasks, which are both highly cache- and NUMA-sensitive. It then populates the SKB with its application-specific facts such as: the size of its dataset (in #tuples), the size of a tuple (in bytes), the batch size of requests it needs to process, and the response time SLO requirement (lines 11-15). Once the SKB is introduced to the "domain specific terms" of the database, the CSCS engine adds the scan's cost model as an application-specific function (lines 17-18). Finally, it registers a stored procedure (line 20), which it uses to derive the results needed (line 21): the number of scan threads to use, and the corresponding size of each partition. More details about the stored procedure are provided in Listing 2.3. Listing 2.2 contains Prolog code that retrieves system-level facts: the list of all available cores (lines 1-2), and the list of sizes of all NUMA nodes (lines 3-4). Both of these functions are used in the stored procedure in Listing 2.3. The CSCS engine bases its initial deployment of data partitions and the corresponding scan threads on the outcome of invoking a stored procedure registered with the OS policy engine. The implementation details of the stored procedure are given in Listing 2.3. The function operates by first retrieving the CSCS-specific properties (lines 4-6), and then the necessary system-level facts using the example functions given in Listing 2.2 (lines 8-10). It then continues by calculating the total size of the dataset (line 12), before computing the minimum number of cores required by the cost model to meet the SLO (line 14).


Listing 2.3: CSCS’ Initial deployment stored procedure

1  % status, nr_cores, part_size are output values.
2  dbos_cost_function(status, nr_cores, part_size):-
3
4      db_tsize(tsize), db_ntuple(ntuple),
5      db_nquery(nquery), db_nupdate(nupdate),
6      db_rtime(rtime),
7
8      get_free_memory(avail_memory),
9      get_list_free_cpus(avail_cores),
10     get_list_numa_sizes(numa_sizes),
11
12     memory is (ntuple*tsize),
13
14     db_cost_fn(ntuple,(nquery+nupdate),rtime,sla_nr_cores),
15
16     min(numa_sizes, min_numa_size),
17     numa_nr_cores is (memory/min_numa_size),
18     max([numa_nr_cores, sla_nr_cores], nr_cores),
19
20     part_size is (memory/nr_cores),
21
22     ( nr_cores > length(avail_cores) -> status = 1;
23       memory > avail_memory -> status = 2;
24       status = 0
25     ).

Since the scan threads of the CSCS engine are cache- and NUMA-sensitive, the dataset needs to be partitioned and distributed across the available NUMA nodes. Furthermore, at least one core per NUMA node should be used in order to guarantee data access locality. As a result, no partition size should exceed the size of a NUMA node. Therefore, the stored procedure computes the minimum number of cores (i.e., partitions) so that each partition fits in the smallest of all NUMA nodes (lines 16-17). The final number of cores/partitions needed is the maximum of both requirements (line 18). Once determined, the final number of cores is used to calculate the size of the partitions to be used (line 20).

Finally, the stored procedure checks whether the total size of the dataset fits into main memory, and whether the total number of cores required is in fact available in the machine. If either of these is false, the query fails, notifying the CSCS that this machine cannot meet the desired constraints (lines 22-24); otherwise the CSCS is notified of the obtained results and can partition its data accordingly. During runtime, some of the application-specific properties may change: for example, the size of the batch of requests that needs to be handled. In that case the CSCS simply modifies these values in the OS policy engine, and triggers a re-computation of the stored procedure. If the outcome of an application-specific stored procedure results in new system-level properties for the application, then they are added to the SKB. By design, whenever some system-specific properties change or are added/removed, the resource manager invokes the global allocation function in the SKB to re-compute the resource allocation among all registered applications. In this particular use case, the stored procedure of the CSCS engine computes the minimum requirement for the core count and the size of a horizontal data partition, both of which are added as system-level properties in the SKB. After the global allocation plan completes its computation, the resource manager sends the CSCS engine an upcall containing the concrete core IDs to use. Based on that, the CSCS pins its scan threads to the given cores and allocates memory from the corresponding NUMA nodes.
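For readers less familiar with CLP syntax, the arithmetic performed by the stored procedure in Listing 2.3 can be restated in a few lines of C. The sketch below mirrors that logic (cost-model cores, the NUMA-driven minimum number of partitions, and the two feasibility checks); the cost-function constants are taken from Listing 2.1, the available-resource values in main are made-up examples, and rounding up to whole cores is an assumption made here for clarity.

#include <math.h>
#include <stdio.h>

/* C restatement of the logic in Listing 2.3 (illustrative only).
 * Returns 0 on success, 1 if not enough cores are available,
 * 2 if the dataset does not fit into the available memory. */
static int initial_deployment(double ntuple, double tsize,
                              double nquery, double nupdate, double rtime,
                              double avail_memory, int avail_cores,
                              double min_numa_size,
                              int *nr_cores, double *part_size) {
    double memory     = ntuple * tsize;                       /* total dataset size */
    double sla_cores  = (ntuple * (0.85 * (nquery + nupdate) + 601.0))
                        / (3750000.0 * rtime);                /* cost fn from Listing 2.1 */
    double numa_cores = memory / min_numa_size;               /* one partition per NUMA node */

    *nr_cores = (int)ceil(fmax(sla_cores, numa_cores));       /* max of both requirements */
    if (*nr_cores < 1) *nr_cores = 1;
    *part_size = memory / *nr_cores;

    if (*nr_cores > avail_cores) return 1;
    if (memory > avail_memory)   return 2;
    return 0;
}

int main(void) {
    int cores; double shard;
    /* Facts from Listing 2.1; 64 GB RAM, 48 cores, 16 GB NUMA nodes are example values. */
    int status = initial_deployment(3750000, 315, 2048, 256, 3000,
                                    64e9, 48, 16e9, &cores, &shard);
    printf("status=%d cores=%d shard=%.0f bytes\n", status, cores, shard);
    return 0;
}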

Experiment evaluation

We deployed COD on all machines described in Section 2.5.1, and now present the resulting allocation of scan threads to cores, as suggested by the OS policy engine. The CSCS engine pushed the following application-specific facts to the SKB: 30 · 10^6 tuples, each of size 315 bytes (total dataset size of 8 GB), and a batch size of requests containing 2048 queries and 512 updates. We varied the SLA response time constraint between 2 and 8 seconds. Table 2.2 shows the results of the calculation performed at the SKB and illustrates how the suggested configuration varies considerably for different SLA requests and hardware platforms. The final column of the table shows the results of the actual runs with the proposed configuration. The experiment values confirm that in every case but one, COD does meet the SLA as predicted1.

1The case where the SLA is not met is due to the use of the same cost model for all machines rather than parameterizing it to each one of them.


Table 2.2: Derived deployments for different SLAs and hardware platforms

Hardware platform   RT [s] SLA   # cores (cost fn.)   # cores (NUMA size)   Total # cores   Size of shard   Measured RT [s]
Intel Nehalem EX         2              8                     1                   8              1 GB             1.66
                         4              4                     1                   4              2 GB             3.27
                         8              2                     1                   2              4 GB             6.54
AMD Barcelona            2              8                     5                   8              1 GB             2.18
                         4              4                     5                   5            1.6 GB             3.55
                         8              2                     5                   5            1.6 GB             3.55
AMD Shanghai             2              8                     3                   8              1 GB             1.68
                         4              4                     3                   4              2 GB             3.25
                         8              2                     3                   3            2.7 GB             4.33
AMD MagnyCours           2              8                     1                   8              1 GB             1.87
                         4              4                     1                   4              2 GB             3.71
                         8              2                     1                   2              4 GB             7.37

One important observation from the experiment results is that the deployment of scan threads on different machines can be a non-trivial task. For example, when the SLA for response time is set to 8 seconds, the total number of cores on the four different machines has three different values: 2 on Intel Nehalem EX and AMD MagnyCours, 3 on AMD Shanghai, and 5 on AMD Barcelona. Even though the primary factor that influences this discrepancy (at least in this example) is the size of the NUMA nodes, the example confirms that an application (like the CSCS engine) needs to account for many properties of both the underlying hardware platform and the workload in question, in order to make a good deployment decision. In theory, this calculation could be performed entirely inside the database based on information requested by the CSCS engine from the operating system, regarding details about the underlying architecture and available resources. However, submitting a query to the OS policy engine means that the CSCS storage manager does not need to understand each machine's hardware configuration. More importantly, since the CSCS engine has now delegated useful knowledge to the operating system about its own resource requirements (including how it can trade off cores for memory), the OS is in a position to do more intelligent resource reallocation, and automatically compute the CSCS' deployment in response. Furthermore, in the next scenarios we will see cases where this deployment decision also depends on the system's runtime state and the resource utilization of other applications present in the system. In this case, the OS is the only place where all this information is available and it makes little sense to pass it on to the database.

Communication and computation overhead

We now discuss the communication and computation overhead that the co-design and tighter integration of the CSCS storage manager and the OS policy engine bring to the execution time of the system as a whole. As presented in the implementation details of this scenario, the communication overhead can be calculated as the number of messages that are exchanged between the resource manager and the CSCS engine. In this particular scenario (see Listing 2.1), the initialization phase requires three function calls: one for connecting to the SKB, and two for connecting and registering with the resource manager. It then needs four calls to set up the system-level properties, and seven calls to place the application-specific facts, including the cost model and the stored procedure (note that we count one function call per fact). Lastly, we need one call that triggers the computation of the stored procedure and returns a callback containing the results of the calculation. The final call is the upcall notification from the global allocation plan. In total, that means that we have introduced sixteen function calls for the initialization phase at deployment time, out of which four are fixed and the rest depend on our application's properties. One immediate and obvious optimization is to decrease the communication overhead by batching the redundant calls (e.g., adding application properties to the SKB) into a single function call. The computation overhead for this phase solely depends on the time it takes for the SKB to calculate the output of the CSCS' stored procedure. We measured this overhead during our last experiment to be 0.18 ms (see also Table 2.4). Since it is invoked quite infrequently, i.e., only at deployment time and when one of the application-specific facts has changed, we can conclude that this overhead does not affect the overall performance of the system.
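A minimal sketch of the batching optimization mentioned above: instead of one round-trip per fact, the client concatenates the facts and ships them in a single call. The skb_add_facts_batch function is invented for this illustration (COD's current interface adds facts one at a time) and is stubbed so the example runs on its own.

#include <stdio.h>
#include <string.h>

/* Hypothetical batched variant of skb_add_fact(): one message carrying
 * several facts instead of one call per fact. Stubbed for illustration. */
static int skb_add_facts_batch(const char *facts) {
    printf("SKB <- %s\n", facts);
    return 0;
}

int main(void) {
    const char *facts[] = {
        "db_ntuple(3750000).", "db_tsize(315).",
        "db_nquery(2048).", "db_nupdate(256).", "db_rtime(3000)."
    };
    char batch[256] = "";
    for (size_t i = 0; i < sizeof(facts) / sizeof(facts[0]); i++) {
        strncat(batch, facts[i], sizeof(batch) - strlen(batch) - 1);
        strncat(batch, " ", sizeof(batch) - strlen(batch) - 1);
    }
    return skb_add_facts_batch(batch);   /* five facts, one round-trip */
}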


Discussion

The results confirmed that even a rather simplistic cost model for the scan operation can result in a good deployment when the underlying architecture is known. Ideally, the cost model should also take into consideration other processor properties (like CPU frequency) and the cache layout, so that we get more accurate results when deploying on different machines. This scenario can easily be extended to other DBMS operations, apart from the full table scan, as long as we develop the corresponding cost model that best describes the operation's dependencies on the available system resources.

2.5.3 Deployment in a noisy system

In the second scenario we show the benefits of information exchange between the CSCS and the OS policy engine when deploying the scan threads in a noisy system.

Use-case description

In this use-case we show that just knowing the architecture is not enough to do smart deployment on a machine that is being shared with other tasks. In such circumstances one also has to take into account the current state of the system and act accordingly, otherwise the deployment decision can result in a significant drop in performance. More concretely, we evaluate the performance of the system when the assignment of tasks (such as the CSCS’ scan threads) to cores is done in the presence of other tasks sharing the resources of the machine with the storage engine.

Implementation details

As before, the allocation of cores and NUMA nodes to the registered tasks in the system is performed by the constraint satisfaction solver in the SKB. This part of the OS policy engine was developed by Adrian Schuepbach and is described in more detail in his thesis [Sch12]. For completeness, in this subsection we provide a brief overview of how the SKB does the global resource allocation. The basis for the allocation is a matrix of free variables annotated with constraints derived from the system-level and application-specific facts. The structure of the matrix is itself based on the particular hardware configuration at hand. An example is shown in Figure 2.4, where the number of tasks assigned to each core is constrained to be zero or one2.

Core        0     1     2     3     4     5     6     7
Task 1      X=1   X=0   X=0   X=0   X=0   X=0   X=0   X=0
Task 2      X=0   X=1   X=0   X=0   X=0   X=0   X=0   X=0
Task 3      X=0   X=0   X=1   X=1   X=0   X=0   X=0   X=0
Task 4      X=0   X=0   X=0   X=0   X=1   X=1   X=1   X=1
Shared L3   Cache 0 (cores 0-1)   Cache 1 (cores 2-3)   Cache 2 (cores 4-5)   Cache 3 (cores 6-7)
NUMA        Node 0 (cores 0-3)    Node 1 (cores 4-7)

Figure 2.4: Matrix of core to task allocation, including NUMA, cache and core affinity.

The chosen policy constraints are essentially equivalent to the space-time partitioning scheme proposed for the Tessellation OS [LKB+09], though COD's technique is rather more general: it subsumes shared caches and NUMA nodes, as well as sharing cores between appropriate tasks. Given that the matrix initially contains unconstrained free variables, we are not restricted to spatial placement of tasks. With time, the solver obtains concrete values for these variables to indicate which core on which NUMA node is allocated to which task. To derive a concrete core allocation, additional requirements on the number of required cores and the amount of memory consumption may be registered by the CSCS or other applications. The most common constraints used in the SKB for an application's assignment are the following:

• MaxCores defines how many threads (cores) an application can leverage at most. Not all applications require a high number of cores in every phase of their execution. Some phases may be single threaded, and some algorithms may have scaling limitations. Such information is valuable to the operating system, especially when it indicates the maximum number of cores an application needs. This constraint is implemented by making sure that the row sum of the task's variables is smaller than or equal to MaxCores. In Figure 2.4, tasks 1 and 2 have set their MaxCores to one, while task 3 set it to two. Task 4 did not provide any restrictions.

2Please note that Figure 2.4 is a sample illustration of a machine with eight cores, used for simplicity, and does not match the actual machine used in the experiment.

36 2.5. Experiments

• MinCores defines the minimum number of cores that need to be allocated to an application. It is also implemented as a row sum constraint, but additionally includes an admission control check. To avoid infeasible allocations, the policy code checks in advance that the sum of all requested MinCores values does not exceed the number of available cores.

• WorkingSetSize defines the required memory for each task. This property is particularly important when allocating the cores in a NUMA-aware fashion. If two cores share a 4 GB NUMA node, and the application needs to process a working set size of 4 GB per thread, then the cores must be allocated on different NUMA nodes to accommodate the requested memory capacity.

In our example use case, the CSCS engine declares MinCores to be the value computed by the application-specific stored procedure in the previous experiment, and sets the WorkingSetSize to be equal to the corresponding partition's size. Since the scan can be parallelized easily, MaxCores is set to the maximum number of cores available on the machine. With this information, the SKB component of the OS policy engine can compute a task-to-core deployment configuration for the application, keeping in mind the NUMA-awareness requirement and meeting the SLA requirements for response time.
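To make the constraints more tangible, the sketch below checks a candidate allocation matrix (the one from Figure 2.4) against MaxCores, MinCores, and the one-task-per-core rule. It only validates a given matrix; the SKB instead lets the CLP solver search for an assignment that satisfies these constraints, and the WorkingSetSize/NUMA capacity check is omitted here to keep the example short. The MinCores/MaxCores values are chosen to match the figure.

#include <stdbool.h>
#include <stdio.h>

#define NTASKS 4
#define NCORES 8

/* Candidate allocation from Figure 2.4: alloc[t][c] == 1 iff task t runs on core c. */
static const int alloc[NTASKS][NCORES] = {
    {1, 0, 0, 0, 0, 0, 0, 0},
    {0, 1, 0, 0, 0, 0, 0, 0},
    {0, 0, 1, 1, 0, 0, 0, 0},
    {0, 0, 0, 0, 1, 1, 1, 1},
};
static const int min_cores[NTASKS] = {1, 1, 2, 1};
static const int max_cores[NTASKS] = {1, 1, 2, 8};

static bool allocation_is_valid(void) {
    /* Row-sum constraints: MinCores <= #allocated cores <= MaxCores. */
    for (int t = 0; t < NTASKS; t++) {
        int sum = 0;
        for (int c = 0; c < NCORES; c++) sum += alloc[t][c];
        if (sum < min_cores[t] || sum > max_cores[t]) return false;
    }
    /* Column-sum constraint: at most one task per core. */
    for (int c = 0; c < NCORES; c++) {
        int sum = 0;
        for (int t = 0; t < NTASKS; t++) sum += alloc[t][c];
        if (sum > 1) return false;
    }
    return true;
}

int main(void) {
    printf("allocation %s the constraints\n",
           allocation_is_valid() ? "satisfies" : "violates");
    return 0;
}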

Experimental evaluation

For this experiment we used the AMD MagnyCours machine. We evaluated the performance when deploying the CSCS engine with a dataset of size 53 GB, consisting of 180 · 10^6 tuples of 315 B each. The number of requests processed in a batch was varied from 512 queries and 128 updates to 4096 queries and 1024 updates. We differentiate between three deployment alternatives: (1) the CSCS engine is the only application running in the system, and can therefore use all 48 cores; (2) the CSCS engine runs alongside another CPU-intensive task, which has pinned its thread on core 0, but the CSCS is unaware of it and still uses all 48 cores; and (3) the same set-up as (2), but this time the CSCS engine relies on the policy engine for its deployment configuration. Figure 2.5 shows the measured system throughput for the three different configurations of the experiment. The results indicate that when the CSCS engine has collocated one of its scan threads with another compute-intensive task in the system, its throughput is significantly affected – by almost fifty percent. The outcome is logical, since the scan thread that shares a core becomes the straggler for each scan processing a batch of requests and slows everyone down. The problem can be avoided only if the CSCS is aware of the presence of other tasks (applications) in the system and their utilization of the shared resources. The CSCS engine, as part of COD, delivers superior performance compared to the naïve CSCS because it relies on the OS policy engine to compute a global allocation of the resources, resulting in 47 cores being assigned to the CSCS and avoiding the already occupied core. The measured throughput of the CSCS engine is then almost as good as the maximum throughput that can be achieved in isolation, i.e., when the CSCS engine uses all 48 cores.

Figure 2.5: CSCS performance when deployed in a noisy system. Throughput (queries/second) is plotted against the number of queries in a batch for three configurations: isolated (48 cores), noisy (48 cores), and COD (47 cores).

Communication and computation overhead

The communication cost when deploying an application, after it has expressed its properties and requirements, can be compressed to one function call (which triggers the global allocation function in the SKB), and a notification response given by the OS policy engine to the application itself.


Table 2.3: Computation overhead of the OS policy engine

Cost of executing global allocation plan for    time [ms]
only CSCS in the system                              5.64
CSCS + 1 application in the system                  13.28

On the computation side, the overhead is significantly higher, in particular when the SKB computes the global resource allocation for all registered applications. Table 2.3 summarizes the computation time of the global allocation plan for this experiment: (1) when only the CSCS is running, and (2) when executing the CSCS engine in addition to one more application in the system. As we can see from the results, the computation cost is quite small for this setup, i.e., in the range of milliseconds. It does, however, grow as more applications enter the system, in particular when every new application registers itself with a specific set of system-level and application-specific facts. Adding additional constraints only increases the complexity of the problem that the optimizer needs to solve, and the current solution does not easily scale in such cases3.

Conclusion and possibilities for extensions

Naturally, the effects shown in this use case highlight the sensitivity of the storage manager to sharing CPU cores. This sensitivity is in part due to the CSCS' design of synchronization among the scan threads, and relatively poor skew- and load-balancing. We briefly discuss how these limitations can be addressed in the next experiment. Nevertheless, similar problems can occur for applications that are sensitive to sharing other types of resources: e.g., bandwidth in the network, local DRAM and interconnect, caches, etc. Therefore, it is important to capture not only the capacity of these resources, but also their current utilization as well as the sensitivity of the applications using them. We discuss these in more detail in Chapter 3. As described earlier, the decision for proper deployment of an application in a noisy system requires detailed information about the underlying hardware, the system state, and

3Note that we cannot simply compute the computation overhead as a function of the number of applications in the system, as the cost primarily depends on the number of constraints.

the internal properties of the CSCS engine, which our prototype system COD collocates in the SKB. We would like to emphasize that in a traditional database plus OS deployment, there is no part of the system which has access to all this information.

2.5.4 Adaptability to changes

The CSCS engine needs to be aware of other running applications and tasks not only at deployment time, but also during runtime. It is essential that the data processing system can adapt to the dynamic system state, and is somewhat robust to the constantly changing noise in the machine.

Use-case description

The following experiment shows how COD can guarantee good performance and maintain a predictable runtime even in a dynamic environment where other applications join the system and begin sharing resources. We compare it to an execution of a naïve CSCS storage manager executing independently (i.e., without the assistance of the OS policy engine) and, hence, unaware of the ongoing changes in the system state.

Implementation details

As per design, whenever a new task registers with the resource manager of the OS policy engine, the RM triggers a re-computation of the global resource allocation plan. This can often result in a decision to remove one of the cores (or other resources) previously allocated to the CSCS engine or other applications, as long as they can still meet their SLA constraints. Consequently, after receiving the notification of the new resource assignment, the CSCS engine has to adapt its execution. In the case of losing a core, the CSCS engine needs to decide how to distribute the affected portion of the tuples, i.e., which scan threads should take over their processing. In order to achieve good load balance, the current implementation of the CSCS engine invokes a second application-specific stored procedure, which was also registered with the SKB. At a high level, the stored procedure first checks for memory availability on the corresponding NUMA nodes, and tries to maximize the number of sibling threads that will share the new load. Eventually, it responds with a list of core IDs to which the CSCS will have to move the tuples, as well as the corresponding number of tuples to be moved to each of the cores. As soon as the data is redistributed, the CSCS engine kills the scan thread on the core it just lost, and resumes the scan operation on the next batch of requests.

Figure 2.6: COD's adaptability to changes in the system. The plot shows the response time (in seconds) of COD and the naïve CSCS engine over the duration of the experiment (in minutes), together with the SLA agreement.

Experiment evaluation

This experiment was also conducted on the AMD MagnyCours machine, using a dataset of size 53 GB (180 · 10^6 tuples of size 315 B each), and a batch size consisting of 2048 queries and 512 updates. Initially, the CSCS engine is the only application running in the system and uses all forty-eight cores. The SLA response time is set to 3 seconds. The overall duration of the experiment is eighteen minutes, and about every four to five minutes we spawn another compute-intensive task. Similar to the previous experiment, we compare the performance of the CSCS storage manager (this time its response time in seconds) when it runs standalone (i.e., without the assistance of the OS policy engine – naïve) versus when executed as part of COD. In the naïve run, the external CPU-intensive tasks were always scheduled on core #0.


Table 2.4: Policy engine computation cost for the stored procedures

CSCS-specific function    time [ms]
Initial deployment             0.18
Tuple re-distribution          0.27

At the same time, the naïve CSCS is unaware of their entrance and thus does not react. Consequently, its runtime is significantly affected and it unfortunately no longer meets the required latency SLAs. In COD, the new incoming tasks register with the resource manager of the policy engine and are placed on separate cores, as suggested by the SKB's global allocation plan. Since the CSCS engine is notified every time it needs to release a core, it can act accordingly and redistribute the tuples among the remaining scan threads, as suggested by the second stored procedure. The results of this experiment are presented in Figure 2.6. It shows how the response time of the CSCS storage manager (measured in seconds) changes over the course of the experiment. On the one side, we can see the performance of the naïve CSCS run and how its response time increases with each new task entering the system. The results are unsurprising, as we have already seen the effects of sharing a single physical context (core) with another CPU-intensive task. The sharing scan thread becomes a straggler and slows down the performance of the whole CSCS engine. On the other side, we observe the performance of the CSCS engine when integrated in COD. Its response time remains relatively steady even in the presence of other applications, with spikes observed at the time when a new task enters the system. We explain the sudden spikes in response time as a result of the CSCS engine redistributing the tuples to the other cores, as suggested by the stored procedure. Nevertheless, even when losing a scan thread (core), the CSCS engine can easily resume executing the scans with a latency well within the required SLA.

Communication and computation overhead

The communication overhead specific to this scenario is (1) the upcall from the resource manager that there was a change in the global allocation of resources, and (2) the corresponding reaction from the CSCS engine invoking the tuple redistribution stored procedure. Overall that amounts to a total of two function calls.


The computation overhead on the SKB side is the re-computation of the global allocation plan and, consequently, the cost of calculating the redistribution of tuples to a specific subset of CSCS-owned cores (i.e., the second stored procedure of the CSCS). Table 2.4 displays the measured values for the stored procedures. It shows that the computation overhead for the stored procedure is almost negligible, especially when compared to the time needed to do the actual re-distribution of tuples, which takes around 1.2 seconds4.

Conclusion and possibilities for extensions

Enabling the DB storage engine to react and adapt its execution as a result of receiving an upcall from the OS is especially important when sharing the machine with other tasks executed in parallel. This makes the CSCS storage manager adaptable to the dynamic system state and, as a result, it can provide stronger guarantees about predictable and stable response times within the agreed SLA. One could extend this feature by enabling other types of notifications from the OS and the policy engine, e.g., as a result of monitoring certain events and changes in the utilization of sensitive resources. More importantly, we would like to note that by cleanly separating the policy from the mechanisms and the application logic, it is now easy to change the policies depending on the underlying platform or based on recent proposals in the state of the art. For instance, we can replace the logic of the second stored procedure with the state-of-the-art NUMA-aware load balancing among scan threads proposed by Psaroudakis et al. [PSM+15].

2.6 Related work

The design of COD is motivated by how current trends in both hardware and workloads affect the design and implementation of both databases and operating systems. As a result, its building components – the CSCS storage manager and the OS policy engine – combine several ideas already proposed in the OS and DBMS communities. A major influence is attributed to a recent school of thought dedicated to re-designing system software for multicore architectures [BBD+09, WGB+10, LKB+09, Kim15, JPH+09]. Many of these efforts aim at reducing inter-core memory sharing and synchronization by carefully structuring the whole system. For example, there is considerable interest

4Note that the time it takes to re-distribute the tuples depends on the amount of data to be moved.

in removing scalability bottlenecks in OS designs (e.g., [BWCM+10, BBD+09, WGB+10, LKB+09, GKAS99]), much of which is relevant to database/OS co-design. More important, however, is to position our work, in terms of the posed research questions, with respect to existing systems and related work.

2.6.1 Interacting with operating systems

The first aspect we cover is how existing operating systems interact with the applications running on them. More concretely, we discuss to what extent applications are able (or encouraged) to give hints to the OS about their requirements, or to provide feedback on the current scheduling policy. Similarly, we briefly cover systems which allow applications to query for information about the hardware, the system state, etc.

Commodity operating systems

Abstraction of resources and the associated encapsulation of system state has long been regarded as a core operating system function, and this has tended to go hand-in-hand with the OS determining resource allocation policy. As a result, the abstractions (virtual processors, uniform virtual memory, etc.) provided by conventional OSes have often turned out to be a poor match for the requirements of relational databases. This is in particular the case for commodity operating systems based on UNIX, which use the POSIX interface [Gro13], which abstracts away the underlying hardware differences (e.g., the memory hierarchy) as well as how the operating system internally manages the resources (e.g., the policies in use). Currently, if an application wants to get information about the hardware on Linux, it uses the sys and proc file systems. Windows exports the hardware knowledge in its registry, which is a key-value store that can be queried (and updated) by other OS services and the applications running atop [RS09]. Windows also provides a richer API [Mic10], which allows applications to provide hints to the OS about some resource allocations like memory or scheduling (e.g., user-mode scheduling [Mic16]), or to retrieve information about the system state. Solaris 2.6 introduced the preemption control mechanism, which allows an application thread to notify the kernel's scheduler that a preemption at that stage is undesirable (e.g., in cases where the application has recently acquired a lock and makes progress in a critical

section) [Sun02]. In the same release, Solaris also added support for scheduler activations [ABLL91] as a means to improve user-level scheduling. The virtual memory system in Apple's iOS relies on the cooperation of applications to remove strong references to objects placed in pages which are about to be swapped out. Therefore, the OS issues memory warning messages to the applications to ask them to remove strong references (e.g., data caches) [App13].

Research operating systems

The importance of the separation of mechanism and policy in an OS has a long history, going back at least to the system described in [WDA+08]. One thread of OS research has always sought to get better performance by exposing more information to applications in a controlled way. For example, Appel and Li [AL91] proposed a better interface to virtual memory that still located much of the paging policy in the kernel, but nevertheless allowed applications (specifically, garbage-collected runtimes) to do a better job of managing their own memory. Architecturally, one way to allow applications to gain more control is to remove abstractions from the kernel, and instead implement as much OS functionality (and consequently, policy) as possible in user-space libraries linked into the application. Such an approach was used in Exokernel [EKO95] and Nemesis [Han99]. While such an approach enables the creation of application-specific policies, by itself it does not solve the problem of how each application can map its requirements onto the available resources. Additionally, it pushes the complexity of dealing with the hardware up to the application, which in turn limits portability. An alternative approach was investigated by SPIN [BSP+95] and VINO [SESS96], which are extensible OS kernels that allow applications to inject policy code into the operating system, where both the mechanism and the system state required to make decisions are located. However, even on uniprocessor systems, the benefits of such approaches are debatable [DPZ97]. InfoKernel [ADADB+03] adopted a different approach to overcoming abstraction barriers and the semantic gap between OS state and application-level resource management, by exposing considerable information about the state of a conventional OS to applications. More concretely, it exported abstractions that describe the internal state of the kernel to the user-level applications, which can then reason about it and base their application-specific policies on that additional knowledge.


The Barrelfish OS extends ideas from Exokernel and InfoKernel by combining a small and relatively policy-neutral kernel per core with a novel system facility, the "System Knowledge Base" (SKB), which we used in our implementation of the OS policy engine. In the Barrelfish OS, this knowledge is used to compute optimal communication patterns over the hardware [BBD+09] and to configure hardware devices [SPB+08]. In addition, as we have already discussed in this chapter, applications can use a rich and declarative interface to add and query knowledge present in the SKB. The Barrelfish scheduler also receives hints from the applications in the form of scheduler manifests, which contain a specification of long-term requirements over all the cores, expressed as real-time scheduling parameters (e.g., worst-case execution time, deadline, scheduling period). A similar approach was also leveraged by the Helios [NHM+09] operating system, which used such manifests to express applications' preferences to execute on heterogeneous kernels on heterogeneous cores. Unfortunately, we are aware of surprisingly little such work in the operating systems community that targets databases. This is despite the fact that the detailed resource calculations that databases perform would seem to make them an ideal use case for better OS designs and abstractions.

2.6.2 Means of obtaining application’s requirements

The second aspect we discuss is how the operating system obtains information about the applications' properties and requirements. Specifically, in COD we leverage the fact that the DBMS is already in possession of a lot of knowledge regarding its algorithms, data layout and access patterns, properties of the workloads, and the SLO goals it needs to meet. Therefore, the interface allows the CSCS storage engine to push information such as cost functions down to the OS policy engine in a declarative manner. Other systems have also adopted the same method. For instance, in the networking domain, Rhizoma reasons in a declarative way about the deployment of applications on virtual nodes in the cloud. Both the underlying network (e.g., network links and node resource capacities) and the application requirements are expressed declaratively [YSC+09]. Existing systems have proposed and used other means of gathering the required information.


One alternative is to get it via an explicit request from the programmer. For instance, Poli-C [And12] extends the C language and runtime to expose novel OS resource management features to application developers. The goal is to enable the developers to tailor the resource management to their own needs, by expressing the requirements for a set of resources using a new C statement. Another option is to obtain the information using program analysis. Systems like Shoal [KARH15] use the knowledge available to a domain-specific language (DSL) to extract relevant information about data access patterns. The runtime then uses this information to make better data placement decisions on modern manycore machines. Finally, some information can also be obtained by treating the application as a black box and leveraging the online monitoring facilities offered by the hardware performance counters. We discuss this in more detail in Chapter 3.

2.6.3 Scheduling based on applications’ requirements

This is a more extensive research area, which we discuss in more detail in Chapter 3. Nevertheless, for completeness we mention here the work most immediately related to the spatio-temporal scheduling used by the SKB global allocation solver. Spatio-temporal scheduling on individual cores for longer-term assignment of applications' tasks to processors has been explored by the Tessellation operating system [LKB+09] and Barrelfish [PSB+10]. Both provide means for efficient user-level scheduling. In particular, the Lithe scheduler [PHA09] shows how complex applications comprising multiple parallel libraries can cooperatively use the allocated resources.

2.7 Summary

The interaction between operating systems and database engines has been a difficult system problem for decades. Both try to control and manage the same resources but have very different goals. The strategy of ignoring each other followed in the last decades has worked because the homogeneity of the hardware has allowed databases to optimize against a reduced set of architectural specifications, and over-provisioning of resources (i.e., running a database alone on a single server) was not seen as a problem.


With the advent of multicore and virtualization, these premises have changed. Databases will often no longer run alone in a server and the underlying hardware is becoming significantly more complex and heterogeneous. In fact, and because of these changes, both databases and operating systems are revisiting their internal architectures to accommodate large scale parallelism. Using COD as an example, we argue that the redesign effort on both sides must include the interface between the database and the OS. COD is a proof of concept built out of several prototypes and experimental systems. Yet, it illustrates very well what needs to be changed in both databases and operating systems, as well as the interface between them, to achieve a better integration.

3 Execution engine: Efficient deployment of query plans

In addition to assisting data processing applications with the deployment of their threads on diverse hardware platforms without having them absorb the hardware complexity, it is also important to do this in an efficient manner, i.e., without overprovisioning the available resources and without incurring undesired performance interaction as a result of resource interference among the application and workload mix. Even though COD lays out a promising architecture design, it has a few limitations which need to be addressed before the system can efficiently handle the problem outlined above. For instance, there is no immediate support for resource monitoring, and the algorithms in use are not tailored for the efficient consolidation of application threads. Therefore, the goal of the work presented in this chapter is to augment the available support by addressing the following research questions:

• What information is needed for efficient deployment of complex dataflows on multicore machines? From the data processing side? From the system's side?

• How can a deployment algorithm exploit that information to compute a suitable spatial placement of tasks to resources without hurting the application's performance? Can it be done in reasonable time?


As the chapter title suggests, the context of our work is to assist the execution engine in scheduling complex query plans on multicore machines. In particular, we explore how to deliver maximum performance and predictability, while minimizing resource utilization, when deploying query plans on modern hardware. To provide a concrete context and a platform for experimentation, we work on global query plans such as those used in shared-work systems (explained in Section 3.2.1) and evaluate the proposed ideas on top of SharedDB [GAK12]. Shared-work systems are good examples of the increasing complexity of query plans [ADJ+10, HA05, GAK12]. Even though most of these systems are motivated by multicores, to our knowledge no work has been done on mapping their complex plans onto modern architectures. We propose (1) the use of resource activity vectors to characterize the behavior of individual relational operators, and (2) a deployment algorithm that optimally assigns relational operators to physical cores. Experiments show that such an approach can significantly reduce the resource requirements while preserving performance across different server architectures. The material used in this chapter has been published in the Proceedings of the VLDB Endowment, Volume 8, Issue 3, in 2014 [GARH14]. The paper was done in collaboration with Tim Harris and my advisors Gustavo Alonso and Timothy Roscoe.
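As an informal preview of the first idea, a resource activity vector can be thought of as a small per-operator profile of how intensively the operator exercises different resources. The struct and the collocation rule below are only a possible illustration written for this summary; the actual definition and the way the deployment algorithm uses it are presented later in this chapter.

#include <stdio.h>

/* Illustrative per-operator profile: normalized activity in [0, 1]
 * for two resource dimensions (not the exact definition used later). */
typedef struct {
    const char *operator_name;
    double cpu;            /* compute activity          */
    double mem_bandwidth;  /* memory bandwidth pressure */
} activity_vector;

/* A simple illustrative rule: collocate two operators only if their
 * combined activity stays below capacity in every dimension. */
static int can_collocate(activity_vector a, activity_vector b) {
    return (a.cpu + b.cpu) <= 1.0 &&
           (a.mem_bandwidth + b.mem_bandwidth) <= 1.0;
}

int main(void) {
    activity_vector scan = {"shared scan", 0.9, 0.7};
    activity_vector agg  = {"aggregation", 0.3, 0.2};
    printf("collocate scan + aggregation: %s\n",
           can_collocate(scan, agg) ? "yes" : "no");
    return 0;
}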

3.1 Motivation

Increasing data volumes, complex analytical workloads, and advances in multicores pose various challenges to database systems. On the one hand, most database systems experience performance degradation with increasing data volume and workload concurrency. The loss in performance and stability is due to contention and load interaction among concurrently executing queries [JPH+09, SSGA11]. These effects will become worse with more complex workloads: for instance, with the increasing demand for operational analytics [Pla09, PWM+15]. On the other hand, a generous allocation of resources that can guarantee performance and stability leads to overprovisioning. Overprovisioning results in lower efficiency and prevents the system from leveraging the full potential of the underlying architecture. This problem is being addressed in the context of virtualization and multi-tenancy but it also exists on multicore machines [MCM13, DNLS13].


In this chapter we address the question of efficient deployment of query plans on multicore machines. The focus is, in particular, on the following two problems:

1. Consolidation of relational operators according to temporal (when operators are active) and spatial requirements (memory bandwidth, CPU demand), and

2. Specific resource allocation to the operators of a given plan, i.e., deciding which threads should be placed on which cores, rather than treating the cores as a homogeneous array of processors.

The goal is finding a deployment that minimizes the resources used, while the system maintains its performance and predictability guarantees (i.e., does not increase tail latency).

3.2 Background

The following section briefly covers the necessary background on operator-centric systems and the state of the art in scheduling shared systems, before outlining a more concrete problem statement. I then sketch the outline of the solution and provide an insight into how it all fits in the system design of COD and hence the thesis itself.

3.2.1 Complex Query Plans

Several recent systems suggest sharing of computation and data as a means to overcome resource contention and provide good and predictable performance. Such shared-work (SW) systems were first introduced at the storage engine level in the form of shared (cooperative) scans implemented in systems such as IBM Blink [RSQ+08], Crescando [UGA+09] and in MonetDB/X100 [ZHNB07]. Psaroudakis et al. [PAA13] describe the main approaches to work and data sharing:

• Simultaneous pipelining (SP) – originally introduced in QPipe [HSA05], and

• Global query plans (GQP) – introduced for joins in CJOIN [CPV09, CPV11] and then extended to support more complex relational operators in DataPath [ADJ+10] and SharedDB [GAK12].



Figure 3.1: Examples for query-centric and operator-centric execution models. Three different query types (Q1, Q2, and Q3) need to be executed. The query-centric model generates three different query plans, while the operator-centric uses data and work sharing among operators and handles the three queries in a single plan.

All these systems abandon the traditional query-at-a-time execution model (query-centric) and implement an operator-centric query execution model. Figure 3.1 provides an example illustration of both when handling three queries (Q1, Q2, and Q3). It has been motivated by the classification presented in the work by Psaroudakis et al. [PAA13]. Operator-centric systems try to maximize the sharing of both computation and data among concurrent queries using shared operators. By executing more queries in one go they achieve higher throughput, and in some cases, also more stable performance. We distinguish three common properties of these systems:

1. Operators are deployed as a pipeline (QPipe) or a dataflow graph (DataPath, SharedDB). This is particularly important because the information about dataflow relationships can help in achieving better data-locality when deploying an operator.

2. Plans are composed of shared and always-on relational operators (CJOIN, Blink, SharedDB, QPipe): the operators are active throughout the whole execution of the workload and are shared among the concurrently executing queries. Different systems leverage different techniques in order to maximize sharing of computation and data (e.g., batching, detecting common sub-plans or sub-expressions, etc.).

3. Operators can be characterized as either blocking or non-blocking, depending on whether they need the full input before starting to work or can start processing as data arrives.

We have implemented and evaluated our ideas on SharedDB. It compiles the entire workload into a single global query plan that serves hundreds of concurrent queries and updates, and can be reused for a long period of time. Sharing data and work can be easily exploited in a scalable and generic way as a result of SharedDB's batching of incoming queries and updates. SharedDB's query optimizer can automatically generate the global query plan, and currently assigns each operator thread to a different core [GMAK14].

3.2.2 Scheduling of Shared Systems

Scheduling parallel query plans on multisocket multicore machines is a challenging, multifaceted problem. Not only do modern machines add many more factors to consider (shared caches, shared processing units, NUMA regions, processor interconnects, etc.), they are also far more heterogeneous, even among processors of the same vendor. Although motivated by and designed for multicores, operator-centric database engines have not yet addressed the problem of efficient resource utilization and deployment on multicore machines. In this subsection we provide a short overview of the current approaches. Existing shared work systems have different approaches for assigning processing threads to their relational operators/operators' work units1. For instance, the QPipe system uses a micro-engine's thread pool [HSA05], SharedDB leverages per-operator threads [GAK12], while DataPath uses fixed worker threads executing specific work-units [ADJ+10]. All of these approaches, however, fix the number of threads assigned to the operators. Each thread is pinned to a particular core and there is only one thread assigned to a core. Such an implementation provides (i) predictable performance, because there is no thread migration [GAK12], and (ii) guaranteed progress [HSA05]. The cost for this performant deployment is system-wide resource overprovisioning. This problem is further aggravated by the rigidity of assigning the same amount of resources to all operators, which does not account for their individual resource footprints and requirements. Moreover, none of the approaches currently employed in operator-centric systems takes into consideration the architecture and properties of the multicore machine or the data-dependency of the shared query plan.

1The term work unit is used in the DataPath system to denote a data chunk together with relevant state information, code and meta data needed to process it.



Figure 3.2: Layout of a typical four-socket AMD Bulldozer system. On the right is the interconnect topology of the eight NUMA nodes. On the left is the layout of cores on NUMA node 0. Note that the topology and core IDs are taken from the output of the numactl --hardware command.

3.2.3 Problem Statement

Our goal is to determine how to deploy a query plan minimizing the amount of resources used while maintaining the required performance and predictability guarantees. There are two different aspects of the problem: (i) temporal and (ii) spatial scheduling.

Temporal scheduling aims at deciding which operators are suitable candidates to time-share a CPU, i.e., which can be deployed to run on the same core. The challenging part is to avoid co-locating operators that will interfere with each other, so that we maintain the required system stability and performance. An example of a suitable pair of operators that could time-share a core is a pair of pipelined blocking operators, where it is certain that the downstream operator is only active after its predecessor has finished processing the result-set. Additionally, such a deployment may also benefit from data locality.

Spatial scheduling, on the other hand, aims at determining which cores should be used for the deployment of the operators. This has two versions: collocating operators to run concurrently on the same core if one is CPU-bound and the other memory-bound, and placing operators that communicate with each other in ways that optimize the data traffic on the processor interconnect network. This is a difficult problem that is architecture-dependent and must be addressed accordingly. In order to illustrate the complexity of the problem and the need for appropriate solutions, we present a sample multicore architecture in Figure 3.2. By looking at the inter-NUMA node links one can easily grasp the difference in communication cost and redundant data traffic on the interconnect network when one places two communicating operators on cores 0 and 4, which share a last level cache (LLC) and a local NUMA node, as opposed to using core 0 on socket 0 and core 1, which is on a remote socket (one or two NUMA-hops away).
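To reason about "one or two NUMA-hops away", the deployment layer needs some measure of distance over the interconnect graph. The following is a minimal sketch (in Python; the eight-node adjacency list is purely hypothetical and does not reproduce the exact links of the machine in Figure 3.2) that derives hop counts by breadth-first search:

from collections import deque

def numa_hops(adjacency, src, dst):
    # Breadth-first search over the NUMA interconnect graph; returns the
    # minimum number of hops between two NUMA nodes.
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == dst:
            return dist
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return float("inf")

# Hypothetical eight-node topology (not the machine of Figure 3.2).
adjacency = {0: [1, 2, 4], 1: [0, 3, 5], 2: [0, 3, 6], 3: [1, 2, 7],
             4: [0, 5, 6], 5: [1, 4, 7], 6: [2, 4, 7], 7: [3, 5, 6]}
print(numa_hops(adjacency, 0, 4))  # 1: directly linked nodes
print(numa_hops(adjacency, 0, 3))  # 2: one intermediate node on the path

In practice, comparable distance information can also be read off the node-distance matrix that numactl --hardware reports.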

3.2.4 Sketch of the Solution

Optimized resource management and scheduling must take into consideration (Figure 3.3):

1. The data-flow dependency graph in the query plan, which is used to determine temporal dependencies;

2. The resource footprint and characteristics of the individual operators, which are used to determine spatial placement; and

3. The properties of the multicore platform, which determine the available resources and constraints for operator placement and inter-operator communication.

The next two sections describe in detail the two main contributions: the resource activity vectors (RAVs) as a means to characterize the resource requirements of the database operators (Section 3.3, addressing points 2 and 3), and a novel algorithm (Section 3.4) mapping operators to cores (under the constraints of point 1) while maintaining the system's performance and stability. But prior to that, we briefly discuss how all this fits in the context of COD and its components.

3.2.5 In the context of COD

Figure 3.4 shows how the contributions presented in this chapter fit in the overall architecture of the co-design system COD. In addition to the OS policy engine, the system can rely on the Resource Profiler unit (RP), which uses hardware performance counters to monitor specific events which are of interest to the system.


Figure 3.3: Sketch of the solution – overview of the information flow in the deployment algorithm.

Using these measurements, the system can derive system-wide properties of the underlying resources (e.g., the DRAM bandwidth capacities of the local memory controllers, or the interconnect bandwidth between NUMA nodes), but also monitor the current utilization of certain resources (e.g., how efficiently a certain resource is being used by the scheduled jobs). Furthermore, the services of the resource profiler could also be made available to the applications running atop. Therefore, the database engine can use it to derive the application-specific resource activity vectors for its threads (e.g., the RAVs for the operator threads of the DB execution engine). Moreover, the OS policy engine can be extended to understand additional information both from the resource profiler and from the DB execution engine, such as:

1. A more complex machine model covering details about the interconnect topology and cache hierarchy, resource capacities based on the data gathered by the RP, etc.;

2. Data dependency graph (dataflow DAG) based on the query plan passed by the application (e.g., the DB execution engine); and

3. The resource activity vectors for the application threads as derived by the RP (more about it in Section 3.3).



Figure 3.4: Extensions to the COD system architecture

Finally, in addition to the CLP solver and the optimizer already present in the SKB, the system could now use the proposed approximation algorithms to compute efficient application deployments, which trade off optimality of the resource allocation solution for faster computation times. We discuss the deployment algorithm in Section 3.4.

3.3 Resource Activity Vectors

Good resource management and relational operator deployment requires awareness of the thread's resource requirements. In the last decade, we have witnessed significant progress in tuning DBMS performance to the underlying hardware [ADHW99, MBK00, LPM+13, AKN12, ZHNB07, SKC+10, BTAO13, KKL+09, ZR04, BZN05]. As a result, databases and their performance have become more sensitive to the resources they have at hand, and poor scheduling can lead to performance degradation [LDC+09, GSS+13]. In order to capture the relevant characteristics for application threads, in this chapter we introduce the concept of resource activity vectors (RAVs). At the moment, these vectors concentrate on the most important resources, CPU and memory bandwidth utilization, but could be extended to other resources if needed (e.g., network I/O utilization, cache sensitivity, etc.). This approach of characterization is inspired by the notion of activity vectors, initially introduced for energy-efficient scheduling on multicores [MSB10].

3.3.1 RAV Definition

We use RAVs to characterize the resource footprints of relational operators. They summarize the amount of resources needed such that each thread delivers its best performance. In the current implementation we consider two dimensions:

1. CPU utilization [%] represents how efficiently an operator thread uses the CPU throughout the workload execution. This is highly relevant for the deployment algorithm as it identifies the threads that are either rarely active, or when active make poor use of the CPU time. These types of threads are usually good candidates for sharing the core with another task. At the other extreme, operator threads with high CPU utilization are both active for a very long time and use the CPU efficiently, and should thus be left to run in isolation. This way their performance will not be hurt by time-sharing the CPU. We elaborate more in Section 3.3.3.

2. Memory bandwidth utilization [%] identifies both the interconnect and DRAM bandwidth consumption of the operator threads throughout the workload execution. This information is relevant for deployment because, for instance, if several bandwidth-thirsty operators are placed on the same NUMA node, one can easily hit the bandwidth limit and affect performance. Similar to the arguments used for the CPU utilization, it is not enough to look at the memory bandwidth utilization as an average over the active time only; it must be normalized over the total duration of the workload execution. Only then will we be able to avoid over-provisioning of the machine resources by reducing the significance of short-running tasks with heavy memory access requirements. We elaborate more in Section 3.3.4.

3.3.2 RAV Implementation

One approach to obtain the RAV values is to model the relational operators and use the model to deduce their behavior on a set of resources. We used one instance of this approach in Chapter 2, with the use of cost functions that model the behavior of the CSCS scan threads.


Table 3.1: Terminology for the performance counter events used to derive the RAVs

Event name              Description
cycles                  CPU clocks unhalted, i.e., the number of clock cycles when the CPU was not halted
retired_instructions    number of useful instructions the CPU executes on behalf of the program
DRAM_accesses           number of DRAM accesses by the program as measured by the memory controller(s)
system_read             number of system reads by coherency state
system_write            number of octwords written to the system
LLC_misses              number of last level cache misses

An alternative is to treat the operator threads as 'black boxes' and use available hardware instrumentation tools (e.g., the system's Resource Profiler) to derive their resource requirements. We chose the second option because we believe it offers better scalability and could also handle future hardware extensions. The operator threads were instrumented on two different architectures (introduced in Section 3.5.1). The Performance Measuring Units (PMUs) on both architectures differ in their structure, organization and supported events [Int13, Int08, DC08, Adv13]. For simplicity of argumentation, in this chapter we are only going to use generic names, as summarized in Table 3.1. The instrumentation is performed while running a representative sample of the workload in the following way: in the initial system deployment all operator threads are pinned to different cores with nothing else scheduled at the time. Every PMU has a number of registers (typically four per hardware thread) that can be used for gathering performance event counts. Because there is a limited number of such registers, we execute a number of runs to gather all data required for the RAVs. Upon completion of data gathering, we postprocess the measurements to derive the final values of the two RAV components. When the number of cores is smaller than the total number of operators, one can use the 'separate-thread' option in the profiling tool. This way the post-processing can distinguish events that occurred due to another thread's activity2.

2Note that for certain events related to any of the shared resources, the accuracy of the measured counters will be affected by the noise created by other tasks.
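Since each PMU exposes only a handful of counter registers (typically four per hardware thread), the events of Table 3.1 have to be spread over several profiling runs. A minimal sketch of such a grouping step (Python; the helper is a hypothetical illustration, not the tooling used in the thesis):

def plan_profiling_runs(events, registers_per_run=4):
    # Split the list of performance events into consecutive runs that each
    # fit into the available counter registers.
    return [events[i:i + registers_per_run]
            for i in range(0, len(events), registers_per_run)]

events = ["cycles", "retired_instructions", "DRAM_accesses",
          "system_read", "system_write", "LLC_misses"]
print(plan_profiling_runs(events))
# [['cycles', 'retired_instructions', 'DRAM_accesses', 'system_read'],
#  ['system_write', 'LLC_misses']]

With the six events of Table 3.1 this yields two runs, which matches the setup reported in Section 3.5.2.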


3.3.3 Capturing CPU Utilization

The CPU utilization needs to capture both the efficient utilization of the CPU cycles when in possession of the core, as well as the active time of the operator thread as a fraction of the total duration of the workload-sample execution. In order to derive the efficient utilization of the CPU, we calculate the IPC (instructions per cycle) value for each operator. The IPC can be calculated using the retired_instructions and cycles events. Although we work with 'always-on' operators, they are not active at the same time, for instance, because of a data-dependency. Furthermore, we expect to see discrepancies among operators when comparing their active runtime. As mentioned earlier, the deployment algorithm should consider the substantial difference between long- and short-running threads and handle each class accordingly. The active runtime of a thread is derived as the ratio of the total number of unhalted CPU cycles (cycles) measured on that core to the total duration of the sample workload execution. The formula used to calculate the CPU utilization for the RAVs normalizes the thread's IPC value with its active runtime, as presented in Equation (3.1). Please note that after simplifying the formula, the CPU utilization is only dependent on the measured number of retired_instructions events and the total_cycles (duration of the workload sample). Also note that the latter is constant for all operators in the same query plan.

\[
CPU_{util} = IPC \times \text{active time}
           = \frac{\text{retired\_instructions}}{\text{cycles}} \times \frac{\text{cycles}}{\text{total\_cycles}}
           = \frac{\text{retired\_instructions}}{\text{total\_cycles}} \tag{3.1}
\]
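For illustration, Equation (3.1) can be computed directly from the gathered counters. The snippet below (Python, with hypothetical variable names) mirrors the derivation:

def cpu_util(retired_instructions, cycles, total_cycles):
    # CPU utilization [%] of an operator thread (Equation 3.1).
    ipc = retired_instructions / cycles    # efficiency while holding the core
    active_time = cycles / total_cycles    # fraction of the sample the thread was unhalted
    return 100.0 * ipc * active_time       # == 100 * retired_instructions / total_cycles

As in the text, the cycles terms cancel, so only the retired instructions and the sample duration matter.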

3.3.4 Capturing Memory Utilization

The values for memory bandwidth utilization can be calculated in a similar fashion. The first factor is the measured data bandwidth transfer (in bytes) for each operator thread. We gather information for both the interconnect and the DRAM bandwidth utilization. The events used for data gathering and the exact formulas used for deriving these values are highly architecture-dependent. While on AMD the core PMUs can gather specific events that capture both DRAM and system accesses [DC08], on Intel one is limited to using approximation formulas based on the last level cache (LLC) [Int08]. The second factor, just as in the CPU utilization case, is the active runtime of the operator threads. The total duration of the workload sample is used to normalize the memory bandwidth requirements of the threads of interest. Equation (3.2) shows the derivation for memory bandwidth utilization:

\[
MEM_{util} = \text{bandwidth} \times \text{active time}
           = \frac{\text{bytes}}{\text{cycles}} \times \frac{\text{cycles}}{\text{total\_cycles}}
           = \frac{\text{bytes}}{\text{total\_cycles}} \tag{3.2}
\]

The bandwidth utilization of an operator is only dependent on the amount of data transferred over the duration of the workload sample (expressed as total_cycles). For simplicity, the total number of bytes transferred is the sum of the interconnect and DRAM bandwidth consumption (events DRAM_accesses, system_read, system_write). Note that with this particular profiling setup ('operator-per-core' deployment, and distributing all operators across the cores), the memory and in particular the interconnect bandwidth utilization for the operators is overestimated. If the operators are placed on the same NUMA region, it is likely that most of the data-transfer would occur via the shared last-level cache.
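Analogously, Equation (3.2) reduces to a sum of byte counts over the sample duration. A sketch (Python; expressing the result as a percentage of a node's peak bandwidth, e.g., as measured with a STREAM-like microbenchmark, is an additional assumption not spelled out in the equation itself):

def mem_util(dram_bytes, system_read_bytes, system_write_bytes,
             total_cycles, peak_bytes_per_cycle):
    # Memory bandwidth dimension of a RAV (Equation 3.2): total traffic
    # (DRAM plus interconnect reads/writes) normalized by the sample duration,
    # then expressed relative to the node's measured peak bandwidth.
    bytes_per_cycle = (dram_bytes + system_read_bytes + system_write_bytes) / total_cycles
    return 100.0 * bytes_per_cycle / peak_bytes_per_cycle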

3.3.5 Parallel Operators

Until now, we have discussed how to capture the resource footprints and requirements of single-threaded operators. We assume that the degree of parallelism assigned to an operator is decided by the optimizer. Multithreaded operators are then supported in a similar manner – we just consider the individual operator threads as separate entities to be scheduled and hence RAV-annotated. In fact, in the context of SharedDB, we already do this for the scans. The performance of a scan operator, like Crescando (used in SharedDB), can be improved by increasing the number of scan threads working on horizontal partitions. In our experiments we RAV-annotate and schedule each scan thread separately.



Figure 3.5: Overview of the deployment algorithm

3.4 Deployment algorithm

The deployment algorithm we propose aims at delivering a deployment plan of the operators that:

• Minimizes the computational and bandwidth requirements of the query plan;

• Provides NUMA-aware deployment of the relational operators; and

• Enhances data-locality.

As presented in Figure 3.5, the algorithm consists of four phases: (1) operator graph collapsing, (2) bin-packing of relational operators into clusters based on the CPU utilization dimension of the RAVs, (3) bin-packing of the operator-clusters (the output of (2)) onto a number of NUMA nodes based on the RAVs' memory bandwidth utilization dimension as well as the capacity of the NUMA nodes, and (4) deployment mapping of the computed number of NUMA nodes onto a given multicore machine.


The first two phases compute the required number of cores, which corresponds to the temporal scheduling subproblem, the third phase approximates the minimum number of required NUMA nodes, and the fourth computes the final placement of the cores on the machine such that it minimizes bandwidth usage – the spatial scheduling subproblem.

3.4.1 Operator Graph Collapsing

The first phase of the algorithm takes as input an abstract representation of the complex query plan – a dataflow graph of database operators. It iterates over the operators in the dataflow graph and compacts each operator-pipeline into one so-called compound operator. An operator-pipeline is characterized by non-branching dataflow between the operators belonging to the pipeline. An example of such a pipeline is presented in Figure 3.6. This phase of the algorithm targets operator-pipelines with blocking operators. Here, by design, there is a guarantee that the involved operators will never overlap each other's execution and, as a result, can be temporally ordered.


Figure 3.6: The highlighted rectangle illustrates: (left) an example of an operator pipeline that can be collapsed into one compound operator, (right) the resulting compound operator. For readability, only the operator-pipeline and the corresponding compound operator are RAV-annotated.

Since there is a temporal ordering between the blocking operators belonging to the same operator-pipeline, one can easily think of them as an atomic scheduling unit (i.e., they can be grouped into one compound operator). This way the scheduler can safely place all such operators to run on the same set of resources, one after another. Additionally, the new compound operator is expected to have better data locality. Scheduling the component operators sequentially on the same core will leverage the warm data caches and reduce unnecessary memory movement. The newly composed compound operator inherits the RAV characteristics of its components as presented in equations (3.3) and (3.4).

\[
C.\text{cpu\_util} = \sum_{i \in P} i.\text{cpu\_util} \tag{3.3}
\]

\[
C.\text{mem\_util} = \sum_{i \in P} i.\text{mem\_util} \tag{3.4}
\]

where C denotes the compound operator, and P denotes the set of all operators belonging to the operator-pipeline. Both dimensions of the compound operator's RAV are computed as the sum of the values of the corresponding dimensions of its components. The formulas are a direct consequence of the definitions of CPU and memory bandwidth utilization in equations (3.1) and (3.2). Intuitively, the compound operator now has to execute the cumulative number of instructions of its components in the same amount of time (total_cycles). The same reasoning applies for the total number of bytes transferred.
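A small sketch of the collapsing step under Equations (3.3) and (3.4) (Python; the RAV container is illustrative):

from dataclasses import dataclass

@dataclass
class RAV:
    cpu_util: float   # CPU utilization dimension, Equation (3.1)
    mem_util: float   # memory bandwidth dimension, Equation (3.2)

def collapse_pipeline(pipeline):
    # Fold a non-branching pipeline of blocking operators into one compound
    # operator by summing both RAV dimensions (Equations 3.3 and 3.4).
    return RAV(cpu_util=sum(op.cpu_util for op in pipeline),
               mem_util=sum(op.mem_util for op in pipeline))

# Consistent with the annotations in Figure 3.6: components <15,05> and <30,10>
# combine into a compound operator with RAV <45,15>.
print(collapse_pipeline([RAV(15, 5), RAV(30, 10)]))   # RAV(cpu_util=45, mem_util=15)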

3.4.2 Minimizing Computational Requirements

Once the original set of operators has been compressed by collapsing the operator-pipelines into compound operators, the subsequent phases of the deployment algorithm operate on a smaller set of operators. Please note that due to the 'always-on' nature of the operators of the shared-work systems, the operators of the new set can no longer be temporally ordered, i.e., we have to assume that they could run concurrently. The second phase iterates over this new set of operators in search of suitable clusters of operators that can be safely placed on the same CPU core. This clustering is based on the values of the CPU utilization component of the operators' RAVs. The goal is to determine the minimum number of CPU cores needed to accommodate all relational operators. Note that the second dimension of the RAVs does not play a role here. This is because the operator clustering solely determines which operators can be placed safely on the same CPU core. The operators belonging to the same cluster technically only share CPU time. Hence, they will never be executed concurrently and memory bandwidth is not going to be shared. This phase of the algorithm is an instance of the bin packing problem, defined as follows:

Definition. Items of different sizes must be packed into a finite number of bins, each of capacity X, in a way that minimizes the number of bins used.

Even though it is an NP-hard problem, there are many approximation algorithms that give nearly optimal solutions in polynomial time [CGJ97]. Since we know the items (operators) and their properties in advance, we can use offline bin packing solutions. Therefore, in our algorithm, this phase is implemented using an adapted version of one of the simplest heuristics – the first-fit decreasing (FFD) algorithm.

Data: List of items
Result: Number of bins, contents of bins

 1  SortDecreasing(items);
 2  BinPacking(items)
 3  for items i = 1, 2, ..., n do
 4      for bins j = 1, ... do
 5          if item i fits in bin j and i's sibling is not in bin j then
 6              pack item i in bin j
 7              break the loop, and pack the next item
 8          end
 9      end
10      if item i was not placed in any bin then
11          create new bin k
12          pack item i in bin k
13          add bin k to bins
14      end
15  end
16  return number of bins, and bins' contents

Algorithm 1: FFD bin-packing algorithm


Algorithm 1 provides a pseudo-code implementation of the bin-packing algorithm. As a first step, the algorithm sorts the input items in decreasing order based on their CPU utilization RAV dimension (line 1). It then invokes the BinPacking procedure, which iterates over all the items and checks whether an item 'fits' in one of the existing bins. The original algorithm only checks if the item fits, but for our use-case we added an additional check related to our support for parallel operators, i.e., the algorithm now also checks whether another thread of the same operator (aka a sibling item) was previously placed in the same bin, and avoids the bin if so (lines 5-7). Finally, if an item was not placed in any of the existing bins, the algorithm creates a new bin and packs the item in that bin (lines 10-14). The FFD algorithm has been proven to have a tight approximation bound of 11/9 · OPT + 6/9, where OPT is the optimal number of bins [D´os07]. Eventually, the algorithm returns an approximation of the minimum number of cores that can fit all operator threads such that the computational requirements are met.
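To make the adapted heuristic concrete, the following is a self-contained sketch of Algorithm 1 (Python; the item and bin representations are illustrative, not the thesis prototype's code):

def ffd_pack(items, capacity=100.0):
    # Adapted first-fit decreasing bin packing (Algorithm 1).
    # items: (operator_id, sibling_group, cpu_util) tuples; threads of the same
    # parallel operator share a sibling_group and must not share a bin (core).
    bins, loads = [], []
    for item in sorted(items, key=lambda it: it[2], reverse=True):   # line 1
        _, group, util = item
        for j, content in enumerate(bins):                           # lines 3-4
            fits = loads[j] + util <= capacity
            no_sibling = all(other[1] != group for other in content)
            if fits and no_sibling:                                  # lines 5-7
                content.append(item)
                loads[j] += util
                break
        else:                                                        # lines 10-14
            bins.append([item])
            loads.append(util)
    return bins                                                      # line 16

# Example: two scan threads of the same operator are kept on separate cores.
ops = [("scan.0", "scan", 60), ("scan.1", "scan", 55),
       ("join", "join", 30), ("group-by", "gb", 8), ("sort", "sort", 5)]
for b in ffd_pack(ops):
    print([op_id for op_id, _, _ in b])
# ['scan.0', 'join', 'group-by']
# ['scan.1', 'sort']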

3.4.3 Minimizing Bandwidth Requirements

The third phase of the algorithm operates on the following input:

• Model of the internal NUMA-node properties: (1) number of cores and (2) local bandwidth capacity.

• Memory bandwidth requirements of the operator clusters (bins), which were computed during the previous phase of the algorithm. One can compute the resource requirements of these operator clusters in a similar manner as for the compound operators (equations (3.3) and (3.4)).

The goal is to compute the minimum number of NUMA nodes required to accommodate all operator-clusters and their bandwidth requirements given the node’s capacity constraints. This can be formalized as another instantiation of the bin packing problem: the items to be packed are the operator-clusters and the bins are the NUMA nodes. The capacity of the bins is determined by the maximum attainable local DRAM bandwidth (determined by microbenchmarks). Furthermore, the bins are also constrained on the cardinality of items they can accommodate, i.e., the number of CPU cores on the corresponding node.


As such, it is an instantiation of the cardinality-constrained offline two-dimensional bin packing problem. We can, thus, use the same FFD algorithm to compute an approximate solution. The only modifications needed are when evaluating whether an item can fit in a certain bin, and when updating the corresponding data structures (lines 5-6 of Algorithm 1).
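The modified fit check of lines 5-6 then has to test two dimensions plus the cardinality bound. A minimal sketch (Python; per-node capacities are assumed to come from offline measurements such as a STREAM-like microbenchmark):

def fits_numa_node(node_bw_load, node_used_cores, cluster_bw,
                   node_bw_capacity, node_core_count):
    # Cardinality-constrained two-dimensional fit check: a core-level operator
    # cluster fits on a NUMA node only if enough local memory bandwidth and a
    # free core remain.
    return (node_bw_load + cluster_bw <= node_bw_capacity and
            node_used_cores + 1 <= node_core_count)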

3.4.4 Deployment Mapping

The last phase of the deployment algorithm computes the final placement of the operator- clusters onto actual CPU cores of the multicore machine. It uses the output of the previous phase, which computed the minimum number of NUMA nodes required to accommodate the operator-clusters. If one NUMA node is sufficient, the output is trivial – any subset of the cores belonging to the same NUMA node will do, provided it is of cardinality k, where k denotes the required number of CPU cores (output of phase (2)). Here we explain the steps needed should the number of NUMA nodes be larger than one. In order to determine the optimal mapping, this phase first models the multicore’s NUMA interconnect topology as a graph.

Definition. Let G(V, E) represent an undirected graph, where the set of vertices V corresponds to the set of all NUMA nodes and the set of edges E is composed of all direct links between the NUMA nodes. Furthermore, G allows at most one edge (link) between a pair of NUMA nodes.

As initially presented in Section 3.2.2 (Figure 3.2), two communicating operators should ideally be placed close to each other so that we reduce data-access latency and interconnect bandwidth usage. Thus, if it is not possible to accommodate them on a single NUMA node, then priority should be given to the neighboring nodes. As an example, we present the following problem: it has been determined that the deployment of a certain query plan on a given machine needs four NUMA nodes. To demonstrate the generality of the approach, let us assume that the interconnect topology of the machine is not symmetric and looks like the one shown in Figure 3.7. We have denoted with D1 and D2 two of the many possible deployments (subgraphs) encapsulating four NUMA nodes. Ideally, the deployment algorithm should return deployment D1 as a preference, because that way all operators will be at the shortest possible distance from each other (1 hop), as opposed to the other alternative where the average distance between the NUMA nodes is 1.33 hops.



Figure 3.7: Example of two possible deployments (D1,D2) of 4 NUMA nodes within an eight-node AMD Bulldozer machine. The asymmetric topology means deployment D1 is preferable to D2.

Therefore, the algorithm needs to also quantify how close the nodes of a certain subgraph are. In order to do that it leverages the concept of graph density.

Definition. Graph density ($d_G$) of a graph $G(V_G, E_G)$ is defined as the number of edges divided by the number of vertices, or more formally:

\[
d_G = \frac{|E_G|}{|V_G|} \tag{3.5}
\]

Using this metric we can formalize the problem that needs to be solved in this phase as an instantiation of the densest k-subgraph problem. Khuller and Saha [KS09] proved that the problem is NP-hard, but given the small size of our graph this is still within acceptable boundaries. Given a multicore machine, a prephase of our naïve implementation iterates over all subgraphs of size k and computes the density of each. The desired deployment is then obtained by querying for the subgraph with the highest density. The final operator-to-core deployment mapping is chosen accordingly.
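For the eight-node topologies considered here, exhaustive enumeration is cheap (there are only C(8,4) = 70 candidate subsets). A naive sketch of this prephase (Python; the edge list is a hypothetical topology, not the asymmetric machine of Figure 3.7):

from itertools import combinations

def densest_k_subgraph(nodes, edges, k):
    # Exhaustively pick the k NUMA nodes whose induced subgraph has the
    # highest density |E_G| / |V_G| (Equation 3.5).
    best, best_density = None, -1.0
    for subset in combinations(nodes, k):
        chosen = set(subset)
        internal = sum(1 for u, v in edges if u in chosen and v in chosen)
        if internal / k > best_density:
            best, best_density = subset, internal / k
    return best, best_density

edges = [(0, 1), (0, 2), (0, 4), (1, 3), (1, 5), (2, 3), (2, 6),
         (3, 7), (4, 5), (4, 6), (5, 7), (6, 7)]
print(densest_k_subgraph(range(8), edges, k=4))
# ((0, 1, 2, 3), 1.0): a tightly connected four-node neighbourhood is preferred.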


3.4.5 Discussion and Possible Extensions

With the deployment algorithm presented in this section we leverage the dataflow information from the database query plan, the RAV properties of the operators, and the NUMA model of the multicore machine. Using this input the algorithm is able to significantly reduce the total number of resources required by a query plan and at the same time avoid contention for the most critical resources: CPU cycles and memory bandwidth. Note that ideally the RAVs should be extended to also capture the actual memory requirements of the relational operators. Then we can extend (1) the bin packing algorithm in the third stage to also account for the DRAM capacity on the NUMA nodes, and (2) the deployment mapping to also account for available memory on the chosen NUMA nodes. For other workloads and/or hardware architectures, it may also be required to extend the RAVs to capture other resource requirements (e.g., when deploying threads on machines with multithreaded cores where even the private caches are shared). Another immediately applicable enhancement is the following: the chosen implementation of the bin-packing algorithm (sorted first fit) can provide a good approximation of the minimum number of cores/NUMA nodes required, but does not guarantee that all the bins will be equally balanced. Consequently, one can extend it with an additional epilogue phase that performs load-balancing of the content in the bins. As we said, the proposed ideas, concepts and algorithms can easily be integrated as part of COD. Using the OS's system-wide knowledge of the current resource utilization among all running tasks, the deployment algorithm can suggest efficient resource allocation even in noisy system environments.

3.5 Evaluation

In this section we show that the performance, stability and predictability of the query plan remain unaffected despite the heavy reduction in the allocated resources by the deployment algorithm. We evaluate the performance of the deployment of a TPC-W query plan on different dataset sizes on two different multicore architectures. We demonstrate the accuracy of the RAV characterization, and conclude the section with an analysis of the different phases of the deployment algorithm.


3.5.1 Experiment Setup

Infrastructure: For our experiments we used two multicore machines from different vendors in order to compare the influence of their architectures both on the operators’ RAVs and the outcome of the deployment algorithm. The machines used are:

1. AMD MagnyCours: Dell 06JC9T board with four 2.2 GHz AMD Opteron 6174 processors. Each processor has two 6-core dies. Each die has its own 5 MB LLC (L3 cache) and a NUMA node of size 16 GB. The operating system used was Ubuntu 12.04 amd64 with kernel 3.2.0-23-generic.

2. Intel Nehalem-EX: Supermicro X8Q86 board with four 8-core 1.87 GHz Intel Xeon L7555 processors with hyperthreads. The hyperthreads were not used for the experiments. Each processor has its own 20 MB LLC (L3 cache) and a NUMA node of size 32 GB. The operating system used was Linux with kernel 3.9.4 (64-bit).

The clients were running on four machines with two 2.26 GHz Intel Xeon L5520 quadcore processors and a total of 24 GB RAM.

Workload: In order to evaluate the deployment algorithm for a query plan we used SharedDB's global query plan for the TPC-W benchmark [GG03]. The TPC-W workload consists of 11 web-interactions, each consisting of several prepared statements, which are issued based on the frequencies defined by the TPC-W browsing mix. The query parameters were also generated as defined in the TPC-W specification. SharedDB's TPC-W query plan has 44 operators, as presented in Figure 3.8. In the figure we marked all 44 operators with an ID which corresponds to the core-ID assigned in the initial deployment. Please note that the range of the IDs is 0-47, but some core IDs (4 to be exact) are skipped3. The storage engine operators work with data local to their threads, and the internal logic operators get their input from their predecessors and generate data on their local NUMA node. We used two dataset sizes: 5 GB (1.2k emulated browsers (EBs), 100k items) and 20 GB (5k EBs, 100k items).

Setup: Every experiment was run for 10 minutes, plus two minutes dedicated to warm-up and cool-down phases. The record logs obtained in the warm-up and cool-down phases were not taken into account in the final results presented here. The profiling on both machines was done with Oprofile (operf).

3Due to difference in the core-ID mapping between the application and the OS


Figure 3.8: TPC-W shared query plan – as generated for SharedDB


Metrics: Throughput is reported in Web Interactions Per Second (WIPS), and all reported latency values are in seconds (s). The maximum attainable bandwidth is not taken from the machine specifications but measured with the STREAM benchmark [McC95]. In the figures it is expressed in gigabytes per second (GB/s).

3.5.2 Resource Activity Vectors (RAVs)

The RAVs are derived from statistics gathered by profiling the operator threads. Here we present a breakdown of the factors that constitute the values for the CPU and memory bandwidth utilization dimensions. We measured cycles and retired_instructions to derive the IPC values. Moreover, in order to calculate the memory bandwidth consumed we used the formulas as presented in [DC08] and collected the following events: system_read, system_write and DRAM_accesses. The last one included measurements of two different DRAM channels (DCT0 and DCT1). The per-core PMUs on AMD MagnyCours can collect up to four events per run, so in total we needed two runs to derive the RAV properties for the operators. The results presented in Figure 3.9 illustrate the measured events for the operator threads (with IDs 0-47 on the x-axis) from an experiment run on the AMD MagnyCours on the 20 GB dataset. We show the values for IPC (3.9a) and memory bandwidth (3.9b) (consisting of DRAM, system-read, and system-write bandwidth). From these two graphs, one can notice the variety in the distribution of resource consumption among the different operators. As discussed in Section 3.3, just looking at the raw performance metrics (presented in Figure 3.9a and Figure 3.9b) is not enough to make an effective decision on the threads' resource requirements. One must also consider the total active time for each operator thread, and normalize the derived values for both RAV dimensions accordingly. Figure 3.9c presents the active runtime of the operators with respect to the total duration of the experiment. It shows that only a few operator-threads are actively using the CPU time. This is an important observation, as it emphasizes the large number of idle threads and the opportunity for resource consolidation. The final values for CPU and memory bandwidth are presented in Figure 3.9d. Please note that the memory bandwidth utilization no longer contains the bandwidth breakdown of the individual measurements, but rather considers their sum.


Figure 3.9: Understanding the derivation of RAVs, AMD MagnyCours, 20 GB dataset. The four panels plot, per operator ID: (a) Instructions per Cycle (IPC), (b) memory bandwidth [GB/s] (DRAM, system read, and system write), (c) the active runtime distribution (CPU activity [%]), and (d) the resulting Resource Activity Vectors (CPU and memory utilization [%]).


Figure 3.10: Analyzing the impact of varying dataset size on the CPU utilization dimension of the RAVs for different operators (marked by their Operator IDs) shown on the x-axis.

Figure 3.9d shows that both the CPU and memory bandwidth utilization values look significantly different from the raw performance/resource metrics in Figures 3.9a and 3.9b. This confirms that the resources of the machine are overprovisioned and that there is room for improvement. In the rest of the evaluation section we focus on the CPU utilization dimension of the RAVs. The same observations also hold for memory bandwidth utilization.

Impact of Dataset Size on RAVs

This subsection analyzes the effect that dataset size has on the operators' RAVs. The experiments were done on the AMD MagnyCours using datasets of size 5 GB and 20 GB. A summary of the output of the experiment is presented in Figure 3.10. It displays the derived CPU utilization values for the two experimental configurations. For readability, in the figure we only present the values (in decreasing order) for the operators with CPU utilization higher than 5 percent (in this case the top 19). Each row on the x-axis denotes the operator-IDs for the corresponding experiment run. As shown in the figure, the distribution of the CPU utilization varies with the changes in the dataset size. Intuitively, the larger dataset puts more strain on some of the scan operators (operator IDs 0 and 1), while the CPU-heaviest join (ID 16) is busier in the smaller dataset. The difference in CPU utilization for the other operators in both datasets is almost negligible.



Figure 3.11: Analyzing the impact that workload execution on a different HW architecture may have on the operators' CPU utilization. Experiment done on the 20 GB dataset.

The difference in both distribution and absolute values of the CPU utilization influences the output of the deployment algorithm. In this case their effects canceled each other out and consequently the second phase of our algorithm derived the same number of bins (cores) for both configurations – six.

Influence of Architecture on RAVs: Intel vs AMD

Another important factor that influences the values of the RAVs is the underlying multicore architecture. In order to show its impact on the RAVs we executed the same workload with dataset size of 20 GB on the two different machines introduced in Section 3.5.1 (AMD MagnyCours and Intel Nehalem-EX). In Figure 3.11 we present the results of the CPU utilization of the operators in decreasing order. We focus on operators with CPU utilization higher than 5 percent (i.e., the top 19). Once again, there are two x-axes denoting the operator-IDs on the corresponding architectures. On Intel we observe that the heavy scan operators (IDs 0,1,13,14) have higher utilization of the CPU, but also that fewer operators have CPU utilization higher than 5 percent (only the first 8). The observed difference in both distribution and absolute values confirms the benefits of using the RAVs as a means to detect architecture-specific sensitivity. For this setup the deployment algorithm allocated five cores for the run on Intel Nehalem-EX and six for AMD MagnyCours.


Table 3.2: Performance of default vs. compressed deployment

#   Architecture     #Cores   Throughput [WIPS]     Response Time [s]
    (exp. config)             Average (stdev)       Average (stdev)     50th    90th    99th
1   AMD (20 GB)      6        428.07 (+/- 32.80)    14.62 (+/- 0.76)    15.36   23.73   36.13
2                    44       425.86 (+/- 54.34)    14.69 (+/- 0.85)    14.59   22.93   36.08
3   OS baseline      48       317.30 (+/- 31.11)    20.81 (+/- 2.55)     8.22   72.43   82.03
4   AMD (5 GB)       6        645.71 (+/- 38.24)     8.41 (+/- 0.46)     7.00   16.44   19.69
5                    44       703.51 (+/- 55.66)     7.38 (+/- 0.55)     5.65   14.81   17.87
6   Intel (20 GB)    5        362.62 (+/- 62.16)    18.05 (+/- 2.77)    18.35   31.73   43.94
7                    32       386.97 (+/- 59.34)    16.70 (+/- 2.47)    16.95   28.03   41.93

3.5.3 Performance Comparison: Baseline vs Compressed deployment

The following experiment compares the performance of the query plan when deployed on a compressed set of resources (based on the output of our algorithm) versus the approach using an operator-per-core deployment. As a baseline we use the performance when the deployment of operators is handled by the scheduler of the operating system. The results show that both the performance and latency of the query plan are unaffected by the heavy reduction of resources allocated by the deployment algorithm. This can be observed for a range of workload configurations and architectures (see Table 3.2). The table summarizes the performance expressed both in throughput (WIPS) and latency (s). In order to capture the stability of the system, alongside the aggregated average we also present the standard deviation in parentheses. Additionally, for the latency measurements we present the 50th, 90th and 99th percentile of the requests' response time. The presented values for throughput and latency confirm that the resulting system performance was not compromised by the significant reduction in allocated resources, which is important for databases and their SLOs. The performance of the query plan when the OS scheduler was in charge of deployment (displayed in row 3 in Table 3.2) is poorer, both in terms of absolute values and stability, than the other two approaches, including the operator-per-core deployment (row 2). The latter suggests that performance can be affected as a result of thread migration, as this is the only configuration which does not pin threads to cores.


Table 3.3: Performance/Resources efficiency savings

Machine            Dataset size   Savings factor
AMD MagnyCours     5 GB           x6.73
AMD MagnyCours     20 GB          x7.37
Intel Nehalem-EX   20 GB          x5.99

Performance/Resource Savings Ratio

In order to better highlight the significance of the gain in deployment efficiency we introduce a new metric: the performance/resource efficiency savings factor. It is calculated by normalizing the measured throughput (Tput) based on the amount of allocated resources (Res). More specifically, we use the following formula:

\[
\text{savings factor} = \frac{Tput_c}{Res_c} \times \frac{Res_b}{Tput_b} \tag{3.6}
\]

where the subscript c denotes the compressed, and b the operator-per-core deployment. Table 3.3 illustrates the savings factor obtained by the compressed deployment for the different workload setups. The gain in performance/resource efficiency (calculated from the throughput values in Table 3.2) is usually in the range of 6-7x compared with the operator-per-core deployment. In other words, with our deployment algorithm the query plan can achieve the same performance by using only 14% of the resources, or even less when compared to the baseline OS operator scheduling.
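As a worked example, plugging the AMD MagnyCours 20 GB values from rows 1 and 2 of Table 3.2 into Equation (3.6) reproduces the factor reported in Table 3.3:

\[
\text{savings factor} = \frac{428.07\ \text{WIPS}}{6\ \text{cores}} \times \frac{44\ \text{cores}}{425.86\ \text{WIPS}} \approx 7.37
\]

That is, the compressed deployment delivers roughly 7.4 times more throughput per allocated core than the operator-per-core deployment.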

3.5.4 Analysis of the Deployment Algorithm

The deployment algorithm computes an approximation of the minimum number of cores needed by the query plan, and a mapping of how to optimally choose cores on the given multicore architecture in order to reduce bandwidth consumption. As presented in Figure 3.5, it consists of four phases, and each phase of the algorithm contributes to the final result in a different way:

Phase (1) reduces the total number of operators by collapsing the operator-pipelines to compound operators. In the case of the TPC-W global query plan, this phase decreases the number of operators from the original 44 down to 32. A second contribution is the constructive cache sharing between the operators in the original operator-pipelines. These are now scheduled on the same core and benefit from data-locality.


Table 3.4: Evaluating the design choices of algorithm phases

Phase    Deployment layout    #Cores   Throughput [WIPS]
         same NUMA            6        428.07 (+/- 32.80)
(2)      same NUMA            5        366.08 (+/- 51.01)
(3,4)    different NUMA       6        401.84 (+/- 33.21)

Phase (2) is responsible for a more aggressive reduction in the allocation of computational resources. It further decreases the total number of required cores from 32 down to 6 or 5, depending on the architecture. The optimality of the algorithm was briefly addressed in Section 3.4. In order to evaluate the accuracy of the output of this phase, we did another experiment on the AMD MagnyCours (20 GB) where we took the content of the sixth bin and evenly distributed it across all other bins (i.e., we tested a deployment on a smaller number of cores). The results are presented in the row dedicated to Phase (2) in Table 3.4. The first row shows the system's performance based on the output of the algorithm, and we use it as a baseline for comparison. Comparing the first two rows confirms that both the absolute performance and the stability of the system decrease when reducing the number of cores allocated from six down to five.

Phases (3) and (4) compute the final placement of operators to cores. In order to evaluate the heuristic and the importance of the actual core placement on the machine, we compare the output of our algorithm (the heuristic being: if there is no contention for memory bandwidth, place the bins as close as possible) to the other extreme where every bin is placed on a different NUMA node. The results are presented in the row dedicated to Phase (3,4) in Table 3.4. The difference in performance favors the heuristic approach, although the results are still within the error bars. We expect that it will have a higher impact when dealing with more memory-bandwidth-intensive workloads.


3.5.5 Discussion

The results confirmed that RAVs successfully characterize the properties of relational operators, and can accurately capture the changes in resource requirements for different dataset sizes. Furthermore, the evaluation on the two machines (Intel Nehalem-EX and AMD MagnyCours) points out that both the operators' resource requirements and consequently the optimal resource allocation and query plan deployment are architecture-dependent. Most importantly, the performance of SharedDB on the compressed set of resources (compressed deployment) is as good as the performance on the operator-per-core deployment, both in terms of throughput and latency and in their stability, and it is better when compared to the OS operator scheduling baseline. Finally, we emphasized the significance of efficient resource utilization by delivering a performance/resource savings factor of 6-7x compared to the baseline.

3.6 Generalizing the approach

3.6.1 Parallel Operators

In SharedDB and its original operator-per-core deployment policy, because of the 'always-on' nature of the operators, even though it was conceptually possible, in practice one was not able to parallelize all operators at the same time, simply because there were not enough contexts available on current multicores. Therefore, one had to choose to parallelize only a few heavy operators (in our case the scans) that could improve the performance. One of the benefits of the work presented in this chapter is that it allows the system to improve its performance by adding more resources to the existing operators, as a result of the reduction of the amount of resources allocated to the original query plan. We have already shown how multithreaded operators are supported with our deployment algorithm and RAV-annotations. Here, in order to demonstrate the immediate benefits of the smart deployment, without having to parallelize the internal 'logic' operators, we increased the number of threads of all storage (scan) operators and replicated the internal 'logic' operators. Note that the KV store operators were neither parallelized nor replicated. As the original plan deployment fit on one NUMA node, we deployed every new replica on a new NUMA node.



Figure 3.12: Evaluating how the system throughput scales when replicating the inner operators of SharedDB’s TPC-W global query plan. The graph shows values from two experiment runs with 5 GB and 20 GB dataset sizes.

Figure 3.12 summarizes the performance of the system in terms of throughput as we increase the number of replicas in the system. The experiments were executed on the AMD MagnyCours machine, on two datasets of 5 GB and 20 GB. The results indicate that, with the simple technique of plan replication and parallelization of the scan operators, the system scales almost linearly up to four replicas, and achieves a scale-up of almost six when using eight replicas. This observation is in line with the performance/efficiency factors we measured in the previous section (also in the range of six to seven, depending on the underlying hardware platform and working set size). We would also like to point out that this performance improvement was achieved without any system fine-tuning.

3.6.2 Dynamic Workload

To discuss how our approach can address dynamic workloads, we first distinguish three different types of dynamism in a workload:


1. Known-in-advance queries vs. query-types, where in the latter case the known part is implemented in the form of JDBC-like prepared statements and the dynamic part are the changing parameters used in the statement;

2. Changes in the workload distribution, which refers to the percentage of query types being present in the workload mix; and

3. Ad-hoc queries and their arrival rates and distribution.

In this chapter we presented a solution that handles the first type of workload dynamism, which follows directly from the design and implementation of the SharedDB system. Nevertheless, given the reduction of the amount of resources allocated to the global query plan, it is immediately possible to handle incoming ad-hoc queries by running them on the side, using some of the extra, now available, resources. The only dynamism that is not supported by the current implementation of the global query plan and its static deployment of operators is the second one: changes in the distribution of query types in the workload mix, which will affect the hot paths/spots in the query plan. We believe, however, that this type of dynamism cannot be addressed solely from the resource allocation perspective, but must also involve the optimizer. Therefore, we think that adding support for continuous monitoring of the operators' activity and computing the RAVs at regular intervals can assist the DBMS and its optimizer by providing additional information about potential bottleneck operators and hot paths in the system. This way the optimizer can adapt the global query plan and use the same deployment algorithm to re-deploy its operators.

3.6.3 Non-shared (Traditional) Database Systems

Finally, we address how some of these ideas can be used in more traditional (non-shared) data processing systems. First, to our knowledge, there is very little work focusing on the deployment and scheduling of query plans on multicore machines. While in this chapter we focused on shared-work systems, the approach of using RAVs and basing the deployment on temporal and spatial considerations can also be used in other, conventional systems. RAVs can be obtained by instrumenting and observing the ongoing execution of queries (similar to how, e.g., selectivity estimates, result caching, hints for indexing, and data statistics are collected today). Once the RAVs are known, the query optimizer can be extended to

consider the CPU and memory requirements of the operators and use the algorithm proposed in this work to identify how many cores should be allocated to a plan and how the operators of the plan should be deployed among the allocated cores. Obviously, the more complex the plan in terms of overall costs, data movement, and number of operators, the more gains are to be expected from using an approach such as the one outlined in this chapter. Although our prototype has been evaluated using SharedDB, we believe that the presented techniques generalize beyond this kind of shared-work system. In fact, most of the placement decisions apply to conventional query plans. For instance, blocking operators within the same query plan can be placed in the same bin (core). Operators streaming to each other can be placed on adjacent cores. Operators that will not be active at the same time can be placed on the same core. Operators active at the same time but complementary in terms of resource usage (CPU vs. memory bound) can be placed on the same core, etc. The spatial scheduling in these systems would additionally have to take into account the physical data placement and data access patterns of the operators (similar to the approach proposed in ATraPos by Porobic et al. [PLTA14]) in order to decide the most suitable cores/NUMA regions (spatial scheduling subproblem) but, overall, the same concepts as those used here would apply.
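To make these co-placement rules concrete, the following sketch (Python) checks whether two conventional plan operators may share a core. The Operator fields and the capacity threshold are hypothetical RAV-style annotations, not part of SharedDB or of any existing optimizer; this is only a minimal illustration of the temporal and resource-complementarity rules, under the assumption that demands are normalized to a single core's capacity.

# Sketch of the co-placement rules for conventional query plans. The Operator
# fields (cpu, mem_bw, active) are hypothetical RAV-style annotations with
# demands normalized to a single core's capacity.
from dataclasses import dataclass

@dataclass
class Operator:
    name: str
    cpu: float          # normalized CPU demand in [0, 1]
    mem_bw: float       # normalized memory-bandwidth demand in [0, 1]
    active: tuple       # (start, end): interval in which the operator is active

def overlap(a, b):
    """True if two activity intervals overlap in time."""
    return a[0] < b[1] and b[0] < a[1]

def can_share_core(op1, op2):
    """Operators that are never active at the same time (e.g., a blocking sort
    and the operator consuming its output) can always share a core; operators
    active at the same time may share one only if their combined demands are
    complementary and fit within the core's capacity."""
    if not overlap(op1.active, op2.active):
        return True
    return op1.cpu + op2.cpu <= 1.0 and op1.mem_bw + op2.mem_bw <= 1.0

sort_op = Operator("sort", cpu=0.9, mem_bw=0.4, active=(0, 10))
probe_op = Operator("probe", cpu=0.8, mem_bw=0.3, active=(10, 20))
print(can_share_core(sort_op, probe_op))   # True: they never run concurrently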

3.7 Related work

Scheduling and resource management have inspired a lot of research targeting different systems, which we cover here as part of related work. The section is split into several categories. We first discuss general scheduling on multicore systems, with a focus on contention-aware thread deployment, before outlining recent proposals for hardware-conscious scheduling for database systems. Finally, we position our work in the context of datacenter-level scheduling and discuss possible ways to generalize our approach using existing techniques from different contexts.

3.7.1 General scheduling for multicore systems

There is a considerable amount of research aimed at contention-aware scheduling on multicore machines (Zhuravlev et al. [ZSB+12] provide a comprehensive survey). Some of

these efforts, for example, achieve higher resource efficiency by classifying applications' behaviour: cache light/heavy threads by Knauerhase et al. [KBH+08], turtles/devils by Zhuravlev et al. [ZBF10], or high- and low-pressure threads by McGregor et al. [MAN05], representing light and heavy users of the memory hierarchy. Using that classification, they schedule threads to cores so that contention, typically in the memory subsystem, is reduced. Similarly, the memory bandwidth dimension of the RAVs is used in our deployment algorithm to make sure that conflicting threads are not placed in each other's vicinity, overcommitting the memory subsystem. Additionally, there are several methods for characterizing and modeling thread performance on multicores [WWP09, MSB10], and the possible interference when they share resources [CGKS05, BPA08]. Some of these models are used to aid comprehension and optimization of code performance, while others are used to minimize overall resource contention and performance degradation. Although these methods provide valuable insight into the performance bottlenecks of multicore systems as well as techniques to identify and alleviate contention, they are application-agnostic and consequently unable to provide any performance guarantees to the executing application. Furthermore, none of these approaches optimizes the overall resource allocation so as to maximize its utilization efficiency.

3.7.2 Scheduling for databases

Databases, traditionally, have dealt pessimistically with the challenges imposed by modern hardware by exclusively allocating (or assuming exclusive access to) hardware resources such as cores and memory. This practice leads to hardware resources being overprovisioned and underutilized. However, scheduling has become a topic of increasing interest in databases. For example, cache-aware scheduling (e.g., the MCC-DB work by Lee et al. [LDC+09]) concentrates on minimizing cache conflicts and benefiting from constructive cache sharing via cache-aware scheduling on multicore machines. The work by Chen et al. [CGK+07] demonstrates the advantage of parallel depth-first (PDF) greedy scheduling over work-stealing to enhance performance by constructive cache sharing. This is similar to the first step of our deployment algorithm, where we collapsed operator pipelines to improve data locality and benefit from constructive cache sharing.


Existing work by Porobic et al. [PPB+12] also argues that the topology of modern hardware favors coupling communicating threads, and deploys them onto cores on the same 'hardware island' (NUMA node, or CPU socket) in order to minimize cross-node communication. Follow-up work enhances the system's data locality and reduces redundant bandwidth traffic by providing means for suitable data placement and adaptive re-partitioning techniques as the workload changes [PLTA14]. Leis et al. [LBKN14] take it a step further and propose a novel morsel-driven query execution model which integrates both NUMA-awareness and fine-grained task-based parallelism. This allows for maximizing the utilization of the CPU resources and provides elasticity with respect to load-balancing the resource allocation for dynamic query workloads. Furthermore, they also advocate the deployment of operator pipelines and assign hard CPU affinities to threads in order to maintain locality and stability. These examples provide highly valuable techniques, mechanisms and execution models, but none uses the knowledge at hand to solve the problem we address, which is how to use operator characteristics and inter-thread relationships to minimize the total number of resources allocated to a query plan without affecting performance and predictability. Nevertheless, we corroborate previous works' observations that leveraging knowledge of the underlying hardware for operator deployment is essential for minimizing bandwidth traffic and maximizing data locality [PPB+12, LBKN14].

3.7.3 Resource allocation for data-oriented systems

This brings us to the discussion of how our work compares to existing systems and approaches on a more general scale (i.e., resource consolidation for data processing systems in a cloud environment). The most closely related approaches also rely on:

1. A model that captures the properties of the available resources, and

2. Resource requests from the applications/jobs to be scheduled.

The approaches differ in how they obtain and represent the values for (1) and (2), and in the algorithm they use to compute the resource allocation. The resource allocation algorithm can be mapped to several more traditional problems:


• Multi-dimensional (cardinality-constrained) bin packing [KM77, MCT77], which was used in our deployment algorithm, but also by others, e.g., [ABK+14, BCF+13, GZH+11];

• Balanced k-way min-cut graph partitioning, to which the problem was originally mapped by Efe [Efe82], and which was then used in the allocation approaches by Kiefer [Kie16] and Curino et al. [CJZM10];

• A non-linear optimization program aimed at minimizing the amount of resources and balancing the load while achieving near-zero performance degradation (e.g., as used by Curino et al. [CJMB11]).

In this work, we have used the FFD approximation of the first option [CGJ97] to show the immediate benefits of the approach. We believe that the other two approaches can augment the current solution, especially when composing application threads with non-linear resource demands [Kie16, CJMB11], or when deploying on heterogeneous hardware platforms where the resource capacity constraints can differ and the simple machine model used in our approach cannot capture the diversity.
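To illustrate the first option, here is a minimal sketch of the first-fit-decreasing (FFD) heuristic over two-dimensional demand vectors (a CPU share and a memory-bandwidth share), in the spirit of the deployment algorithm's bin-packing phase. The demand values and the per-core capacity are made up for illustration; the real algorithm additionally respects the cardinality and topology constraints discussed earlier in the chapter.

# Sketch of a first-fit-decreasing (FFD) bin-packing pass over two-dimensional
# operator demands (CPU share, memory-bandwidth share). Demands and the
# per-core capacity are illustrative values, not measured RAVs.

def ffd_pack(demands, capacity=(1.0, 1.0)):
    """Pack items (cpu, bw) into as few bins (cores) as possible.
    Items are considered in order of their dominant dimension, largest first."""
    bins = []  # each bin is [used_cpu, used_bw, [item indices]]
    order = sorted(range(len(demands)),
                   key=lambda i: max(demands[i]), reverse=True)
    for i in order:
        cpu, bw = demands[i]
        for b in bins:
            if b[0] + cpu <= capacity[0] and b[1] + bw <= capacity[1]:
                b[0] += cpu; b[1] += bw; b[2].append(i)
                break
        else:  # no existing bin fits: open a new one
            bins.append([cpu, bw, [i]])
    return [b[2] for b in bins]

# Example: six operator threads packed into a minimal number of cores.
demands = [(0.7, 0.2), (0.5, 0.5), (0.3, 0.1), (0.2, 0.6), (0.6, 0.3), (0.1, 0.1)]
print(ffd_pack(demands))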

3.7.4 Deriving application’s requirements

The final part of the discussion is related to alternative approaches for determining the specific resource requirements of an application's threads. One approach is to rely on models to derive the resource demand for each operation (e.g., Garofalakis and Ioannidis [GI96] and Kiefer [Kie16]). Some researchers additionally differentiate between bounded and unbounded resources [Kie16], or time-shared (preemptable) and space-shared [GI97] resources. This helps them model the underlying hardware platform more accurately, so that the resource allocation algorithm can reason better about the trade-offs among alternative solutions. While using models to capture resource requirements is a viable approach, and one could use generic cost models for database operations [MBK02], a good estimate of the exact resource requirements for various resource types can be non-trivial to derive, especially given the complexity of modern microarchitectures. Alternatively, Curino et al. [CJMB11] and Merkel et al. [MSB10] derive the resource requirements of the workload tasks by using a system profiler and the resource monitoring

tools offered by operating systems. As we motivated this approach before (when introducing the RAVs), such a technique allows us to treat the jobs as black boxes and easily scale across different hardware platforms. Finally, one can also rely on the application itself or the developer to issue a demand vector of resources (e.g., as we did in the previous chapter, but as has also been done by many cluster-level schedulers [ABK+14, HKZ+11, SKAEMW13]). The main problem with this approach is that the scheduling algorithm needs to be robust to overprovisioned demands from the applications, and to be strategy-proof (against applications exploiting its allocation policies [GZH+11]). Some recent work suggests augmenting the knowledge obtained from these interfaces with online resource monitoring in order to provide better and more efficient consolidation (e.g., Quasar [DK14]).

3.8 Summary

The focus of this chapter was to demonstrate the benefits of the COD architecture in the context of resource scheduling for the database execution engine. More concretely, we addressed the problem of minimizing resource utilization for complex query plans on modern multicore architectures without affecting performance or sacrificing desired performance properties such as stability and predictability for both throughput and latency. The solution and our proof-of-concept system build upon two main contributions:

• Resource activity vectors (RAVs), an abstraction used to characterize the performance profile of each relational operator that can be derived from hardware performance measurements.

• A deployment algorithm that, based on the dataflow DAG of the query plan, a model of the machine's memory hierarchy (e.g., size of NUMA nodes (memory and cores), interconnect topology, etc.), and the RAVs, computes the minimum amount of resources needed and an optimal assignment of the operator threads to specific processor cores.

In the evaluation we confirmed that RAVs can be used to accurately characterize the resource requirements of database operators across different hardware platforms, without having to build detailed cost models. Additionally, we showed that our deployment algorithm can significantly reduce the computational and memory resource requirements,

while leaving performance unaffected. To demonstrate the benefits of this approach, we also proposed several ways in which the newly freed resources can be used to improve system throughput, but also to support more dynamic workloads. As part of future work, we consider online monitoring of the resource utilization of the operator threads, which could be fed to the database query optimizer and the deployment algorithm to derive an operator-to-core mapping that exploits runtime information about the machine topology and overall system state.


4 Optimizer: Concurrency vs. parallelism in NUMA systems

Before addressing how modern data processing engines should execute parallel jobs, we try to understand the main challenges when scheduling multiple such jobs and characterize their needs. Until now we have either worked with the storage manager, where reasoning about operator parallelism was easier (Chapter 2), or assumed that the degree of parallelism for the operators in the query plan was pre-determined by the optimizer (Chapter 3). In this chapter we extend the analysis beyond traditional database workloads, and also investigate the behaviour and requirements of parallel graph processing algorithms. More concretely, we focus on a problem that modern optimizers face: determining the trade-off between concurrency and parallelism (i.e., inter-operator and intra-operator parallelism) when executing on modern NUMA systems. If there are enough parallel jobs in the system:

• How many should be executed at the same time (concurrency)?

• What degree of parallelism should be assigned to each job?

• How should specific resources be distributed?


In our study, we experimentally explore how to schedule a workload of modern parallel operators/algorithms on a multicore machine. Our analysis focuses on two evaluation metrics:

1. The per-job runtime slowdown in a concurrent workload, i.e., what the user typically cares about: How much slower will my job run because others are also running their jobs?

2. The system throughput, or the number of jobs executed over a given time frame, i.e., what the data processing engine optimizes for: How many jobs can be completed per unit of time?

More formally, the problem statement is to determine scheduling strategies that maximize the system throughput while guaranteeing that, given the same amount of resources, a parallel job does not suffer any significant slowdown when running alongside concurrent workloads compared to when running alone on the machine. The work in this chapter was done in collaboration with Gustavo Alonso and Tim Harris.

4.1 Background and Motivation

Data processing has become not only a goal in itself but also a first step in more complex analysis and computational tasks. Given the evolution of modern hardware, balancing the degree of parallelism (DOP) necessary for operators to perform well on increasing amounts of data against the degree of concurrency¹ in a workload needed to support as high a throughput as possible is an important research problem. In fact, it is an old database topic that looks into the effective scheduling of system resources among concurrent relational operators, which was addressed decades ago in the context of parallel databases [GI97, GHK92]. This pioneering work, however, considered neither the intricacies of modern multicore machines, nor the need to support operators beyond those of the relational model (e.g., for graph processing). The reasons for that are two-fold. First, the degree of parallelism available to an operator was then quite limited compared to the possibility of having, e.g., 40 or 80 parallel threads today. Second, the amount of data and the nature of the operators was

¹ Also known in the literature as the level of multiprogramming.

also different (e.g., conventional relational operators on row stores vs. column stores and graph processing algorithms). Several modern DBMSs provide support for executing parallel queries (e.g., Oracle¹, IBM DB2², PostgreSQL³). In many cases it is encouraged to use the adaptive multi-user scheduler that assigns the DOP based on the current load in the system [CI12]⁴. For instance, as the number of queries in the system increases, the DOP for each query is decreased proportionally until either the min_dop (minimum degree of parallelism) or the maximum concurrency has been reached. The latter typically depends on the availability of resources (e.g., memory for in-memory processing jobs). As a result of the high computational cost for optimizers to determine the optimal degrees of concurrency and parallelism, many DBMS vendors expose various knobs to database administrators (e.g., min/max degree of parallelism for an index or table)⁵. This transfers the complexity from the database optimizer to the user, and with it also the responsibility: many guidelines now emphasize both the benefits of enabling parallel query execution and the dangers of reduced performance when mis-configuring it in multi-user environments. And while this has already been analyzed and addressed in commercial database systems, in the graph processing space most research so far has focused on single-job runtime [GSC+15], and has not looked into concurrent workloads and throughput performance. In reality, and as for parallel relational operators and queries, the constraints are likely to come from response-time SLOs [UGA+09, RSQ+08, ZHNB07]. Therefore, our experimental analysis attempts to optimize the whole workload execution in terms of throughput, while maintaining predictable latencies for parallel algorithms in comparison with their runtime when executed in isolation.

¹ https://docs.oracle.com/cd/E11882_01/server.112/e25523/parallel002.htm
² http://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com..db2.luw.admin. .doc/doc/c0005287.html
³ https://www.postgresql.org/docs/9.6/static/release-9-6.html
⁴ https://docs.oracle.com/cd/B28359_01/server.111/b28320/initparams168.htm
⁵ http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc00743.1502/html/queryprocessing/queryprocessing108.htm


4.2 Problem statement

Before formally defining the problem we address, we make the following assumptions:

1. In-memory execution of the algorithms and programs (no disk or network I/O);

2. No synchronization across different jobs for data accesses – analytical processing is done on different input data;

3. Input data is partitioned across the machine in order to leverage higher aggregate memory throughput;

4. Parallel operators are independent, i.e., we do not consider pipelined parallelism as part of this analysis and leave it for future work.

Let $T_i^{SP}(m)$ denote the execution time of job $J_i(m)$ with degree of parallelism $m$ in single-program execution mode ($SP$), i.e., in isolation. And let $T_i^{MP}(m)$ be the execution time of the same job in a multi-program execution ($MP$), i.e., noisy, environment.

Problem statement. Given a workload $WL_p$, which consists of $p$ parallel jobs $WL_p = \{J_1, J_2, ..., J_p\}$, determine how to schedule the jobs on a machine $M(c, n)$ with $c$ cores organized in $n$ NUMA nodes. The goal is to find a resource allocation that maximizes the overall system throughput, while guaranteeing that for a parallel job $J_i(m)$ the ratio in execution time between $MP$ and $SP$ execution mode, $T_i^{MP}(m)/T_i^{SP}(m)$, stays below a given bound. In other words, given the same resources, the execution time of a parallel job will be predictable regardless of the noise in the system.

To illustrate this better, let us take the following concrete example of the problem.

Example. Assume the underlying hardware (a large NUMA many-core machine) has 64 hardware contexts (cores) and 8 NUMA nodes. The concurrent data-processing workload generates 100 parallel operators to be executed, i.e., all are in the run queue at the same time. How should the optimizer and query execution engine allocate the machine's resources to execute the parallel jobs such that they achieve the best throughput and minimize the slowdown perceived by each job in the workload?


Our approach is to understand the properties of parallel operators in isolation and when co-scheduled with other multi-threaded data processing tasks on modern machines. As such, this is primarily an empirical study of several scheduling strategies for concurrent workloads, which we use to devise performance models that capture the observed behavior.

4.2.1 Factors influencing concurrent execution

There are several factors that can influence the execution time of a concurrent workload of parallel jobs on modern hardware:

1. The scalability of each of the parallel jobs as we increase the amount of resources. We evaluate each of the parallel jobs when running in isolation (Section 4.4);

2. The slowdown of each job when executed in a concurrent environment as a result of resource interference (Section 4.5);

One can derive $J_i$'s runtime for various degrees of parallelism when executing in isolation, $T_i^{SP}(m)$, using the following equation based on a power-regression model:

$$T_i^{SP}(m) = T_i^{SP}(1) \cdot \left( \frac{\alpha_i (1 - \delta_i)}{m^{\beta_i}} + \delta_i \right) \tag{4.1}$$

where $T_i^{SP}(1)$ denotes the job's runtime when executing on a single core, and $\delta_i$ denotes the serial fraction of the job, which is negligible for most operations we consider. The formula states, for instance, that the parallel job $J_i$ has linear scalability when $\delta_i$ is zero and both $\alpha_i$ and $\beta_i$ are equal to one. As the reader has probably noticed, this is just an approximation, which does not take into account how the scalability of the algorithm implementation depends on other resources in the system (e.g., if it is bound by memory, DRAM or I/O bandwidth). We discuss this further when explaining the properties of the concrete algorithms/operators we use in our experiments.

Similarly, we derive the execution time of job $J_i$ with degree of parallelism $m$ in multi-program execution mode, $T_i^{MP}(m)$, using equation 4.2, where $\gamma_i$ denotes the observed slowdown as a result of resource interference.

$$T_i^{MP}(m) = T_i^{SP}(m) \cdot \gamma_i \tag{4.2}$$


If there is no interference, then $T_i^{SP}(m) = T_i^{MP}(m)$ for all values of $m$. As we show in the experiments, the amount of interference depends on (i) the type of resource the jobs are sharing, (ii) the characteristics of the jobs that constitute the noise in the system, and (iii) the sensitivity and intensity with which $J_i$ uses the resources when executed in isolation. As such, it is an important factor that often causes the undesired unpredictability in response times when scheduling complex interacting workloads.
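As a minimal sketch, the two models can be written down directly in a few lines of Python; the coefficient values in the usage example are illustrative and not fitted to any of the operators evaluated later.

# Minimal sketch of the scaling model (eq. 4.1) and the interference model
# (eq. 4.2). The coefficient values used below are illustrative, not fitted.

def t_sp(m, t1, alpha, beta, delta=0.0):
    """Single-program runtime of a job at degree of parallelism m (eq. 4.1)."""
    return t1 * (alpha * (1.0 - delta) / m**beta + delta)

def t_mp(m, t1, alpha, beta, gamma, delta=0.0):
    """Multi-program runtime: the isolated runtime scaled by the slowdown gamma (eq. 4.2)."""
    return t_sp(m, t1, alpha, beta, delta) * gamma

# A perfectly scalable job (alpha = beta = 1, delta = 0) with no interference
# (gamma = 1) halves its runtime every time the core count doubles:
for m in (1, 2, 4, 8):
    print(m, t_sp(m, t1=100.0, alpha=1.0, beta=1.0))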

4.2.2 Scheduling approaches

This work evaluates three different scheduling approaches for concurrent workloads, which primarily differ in the granularity of their unit of scheduling. With that we explore the trade-off between the chosen degrees of parallelism and concurrency. In particular we evaluate: (i) core-based, (ii) NUMA-based, and (iii) serial workload execution. We illustrate each of the scheduling approaches in Figure 4.1. The example is based on a homogeneous workload consisting of eight identical parallel programs on a machine with four NUMA nodes and eight cores on each NUMA node. The horizontal dimension represents the spatial deployment of a program's threads, while the vertical dimension shows the elapsed time and how the temporal placement changes (if at all). Each parallel job (denoted as op in the figure legend) is depicted as a set of boxes with the same texture pattern. The number of boxes denotes the degree of parallelism. Finally, the red dashed lines represent the total time for executing the whole workload mix.

Core-based scheduling

In the core-based scheduling approach (Figure 4.1a), the unit of resource allocation is a physical hardware context, i.e., a core. For simplicity, here we assume that a core is not multi-threaded. First, we need to determine the maximum concurrency that the workload can sustain on the given resources of the machine (k). This is particularly important for a bounded resource, which cannot be over-committed without incurring significant performance penalties. In our case, the available DRAM memory (or the size of the buffer cache in a database) is such a resource: if the concurrent jobs have large working set sizes that cannot fit in the allocated memory, then they will need to swap to disk, which can result in very poor performance. As a result, for core-based scheduling the maximum degree


(a) Core-based scheduling    (b) NUMA node-based scheduling    (c) Serial scheduling

Figure 4.1: Three scheduling approaches for homogeneous workload that are evaluated in the chapter. The figure depicts scheduling a workload consisting of eight identical operators on a four NUMA node machine, with eight cores per NUMA.

of concurrency is determined by the memory footprint of each operator/algorithm and the available memory on all NUMA nodes. A second step is determining the maximum degree of parallelism for each job. It depends on the total number of available cores (c) and the number of parallel jobs in the concurrent workload (k), which we determined previously. If k > c, then c jobs get one core each and the others are put in the scheduling queue. If, however, the number of jobs in the concurrent workload is smaller than c, then each job receives ⌊c/k⌋ cores. The final part is determining the thread-to-core mapping, which can use one of the various heuristics explored by prior work. Some are noise-independent and optimize for (1) data-access locality, placing threads on the cores belonging to the corresponding NUMA node, or (2) improving performance by maximizing the utilization of the available bandwidth. Such policies have been adopted by several recent papers proposing NUMA-aware algorithms for relational operators [AKN12, LPM+13, DMR+10]. Alternative approaches, however, take into account the noise in the system by monitoring the LLC miss rate for all the jobs in the system, and propose appropriate thread and data placement and migration to mitigate the congestion on the DRAM and interconnect bandwidth [ZBF10].
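The allocation arithmetic of the core-based approach can be summarized in a few lines. The sketch below uses hypothetical machine and job parameters, approximates the memory bound with the largest working set, and deliberately ignores the thread-to-core mapping heuristics mentioned above.

# Sketch of the core-based allocation step: bound the concurrency by the
# available memory, then split the cores evenly among the admitted jobs.
# Machine and job parameters are hypothetical.

def core_based_allocation(num_cores, mem_per_node_gb, num_nodes, job_mem_gb):
    total_mem = mem_per_node_gb * num_nodes
    # Maximum concurrency k: how many working sets fit in DRAM at once
    # (approximated here by the largest working set in the workload).
    k = min(len(job_mem_gb), int(total_mem // max(job_mem_gb)))
    if k >= num_cores:
        # More runnable jobs than cores: one core per job, the rest wait in a queue.
        return num_cores, 1
    # Otherwise every admitted job receives floor(c / k) cores.
    return k, num_cores // k

# Example: 32 cores, 4 NUMA nodes with 64 GB each, six jobs of 20 GB each.
admitted, dop = core_based_allocation(32, 64, 4, [20] * 6)
print(admitted, dop)   # 6 jobs admitted, 5 cores each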

NUMA-based scheduling

In contrast to the previous scheduling approach, the NUMA-based one uses a coarser granularity to allocate resources to parallel jobs. As the name implies, it uses the whole NUMA node (with all of its cores) as the unit of scheduling. Consequently, the number of operators concurrently executing (i.e., the degree of multiprogramming or concurrency) is limited to the number of NUMA nodes (n), provided that it is smaller than the maximum number of jobs that can run given the constraint of the unbounded resources (k). Similarly, the granularity of parallelism for each job is now fixed to the number of cores in a NUMA region (c/n). Using this scheduling approach, as soon as a job finishes its execution, the NUMA node is assigned to the next parallel job in the scheduling queue. An example of this scheduling strategy is presented in Figure 4.1b. The advantage of this approach is that it provides an opportunity for constructive resource sharing among the threads belonging to the same parallel job, because they can now benefit from data locality in the last level cache (LLC) and do most of the coordination and communication via the LLC rather than over the interconnect. Furthermore, by assigning all the physical contexts of the NUMA node to threads of the same job, we reduce the

danger of destructive resource sharing. In particular, it helps to avoid cache pollution, as well as contention for the local DRAM bandwidth among threads from different jobs. One immediate disadvantage is that it further limits the concurrency in the system, as some parallel jobs may need to wait in a queue for execution. Another limitation is that using such a coarse granularity of assigning resources does not allow a resource-hungry job to expand its resource consumption (e.g., of the last level caches) unless it is explicitly scheduled on more than one NUMA node. Similarly, if the input data is spread out across the machine, scheduling all the threads of a parallel job to a single NUMA node will disrupt data locality, introduce additional interconnect traffic, and potentially also fail to use more aggregate DRAM bandwidth.

Serial scheduling

The serial scheduling strategy does not do any spatial sharing of the machine's resources among the runnable jobs. Instead, it allocates all resources to a single job and schedules the rest to run serially one after another. It is illustrated in Figure 4.1c. This is the scheduling approach most often used when evaluating algorithms for existing graph processing systems (e.g., [GSC+15, GXD+14, HLP+13], etc.), as well as for hardware-tuned relational operators (e.g., [BTAO13, PR14, BLP+14], etc.). It is also the scheduling approach used in many high performance computing systems, where the researchers want to execute their jobs in a noise-free environment [HSL10]. One can argue favorably for this policy based on several observations. First, there is increasing effort in implementing parallel algorithms that scale well with the number of cores. And second, the execution time is predictable since there is no noise in the system: only one parallel job runs on all the cores and hence there is no load interaction with threads from other jobs. The disadvantages are that the sequential execution has queuing effects, which can potentially be devastating for workloads with a mix of short- and long-running jobs, or of batch and latency-critical jobs. The restrictive nature of the scheduling approach does not allow flexibility unless time-sharing among jobs is enforced; in such a set-up this scheduling approach resembles traditional gang scheduling [PBAR11]. Furthermore, we should also point out that this form of serial scheduling is ill-suited for workloads with jobs that have poor scalability or different resource requirements, which can significantly reduce resource usage efficiency.


4.2.3 Evaluation Metrics

Throughout the analysis we use two commonly used metrics to evaluate the performance of the scheduling approaches [EE08]:

1. A system-wide metric that measures the overall performance when executing the whole workload (i.e., the system throughput (STP)), using the Weighted Speedup as proposed by Snavely and Tullsen [ST00]; and

2. A user-perceived metric which measures the performance slowdown as observed by the user that issued the job in a multiprogramming execution mode as opposed to its run in isolation. More concretely, we use the Hmean Speedup which balances throughput and fairness, as proposed by Luo et al. [LGJ01].

The rest of the section covers the derivation of the metrics. The total turnaround time (TTT) for a workload execution is defined as the total time required to execute all jobs in the workload. It quantifies the time from the start of the first job of the workload until the completion of the last one.

For example, using the same notation defined earlier, for a workload $WL_p$ executed on machine $M(c, n)$ using the core-based scheduling approach, the total turnaround time is derived from the runtime of the longest-running job, i.e.,

$$TTT^{c}_{WL} = \max_{i=1}^{p} \; T_i^{MP}\!\left(\left\lfloor \frac{c}{p} \right\rfloor\right) \tag{4.3}$$

Similarly, for the same workload executing its operators serially, the total turnaround time is derived as the sum of the execution times of its jobs (eq. 4.4).

$$TTT^{s}_{WL} = \sum_{i=1}^{p} T_i^{MP}(c) \tag{4.4}$$

Finally, for a scheduling approach that uses a deployment granularity of a NUMA node, one should use one of the approximation algorithms to derive the close-to-optimal parallel schedule that minimizes the makespan (e.g., the one proposed by Hochbaum et al. [HS87]). Here we compute an upper-bound approximation of the TTT based on the number of stages ($s$) that are needed to execute $p$ jobs in parallel ($n$ at a time), i.e., $s = p/n$.

$$TTT^{n}_{WL} = \sum_{j=1}^{s} \; \max_{i=1}^{n} \; T_i^{MP}\!\left(\frac{c}{n}\right) \tag{4.5}$$

The system throughput ($STP$) can then be derived as the total number of jobs executed within the total turnaround time for the whole workload ($TTT_{WL}$):

$$STP = \frac{p}{TTT_{WL}} \tag{4.6}$$

For the user-perceived metric (Hmean Speedup), i.e., the observed slowdown with respect to the job's execution time when run in isolation using the same set of resources, we compute the normalized turnaround time ($NTT_i$) for a job $J_i$ (eq. 4.7).

$$NTT_i = \frac{T_i^{MP}(m)}{T_i^{SP}(m)} = \gamma_i \tag{4.7}$$

In order to summarize the overall slowdown experienced by all the jobs within the workload ($WL_p$), we also define the average normalized turnaround time ($ANTT_{WL}$) as follows:

$$ANTT_{WL} = \frac{1}{p} \sum_{i=1}^{p} NTT_i = \frac{1}{p} \sum_{i=1}^{p} \frac{T_i^{MP}(m)}{T_i^{SP}(m)} \tag{4.8}$$

The reader may already notice that by computing the $NTT$ and $ANTT_{WL}$ we also obtain an empirical value for the slowdown ($\gamma$) in MP execution mode. Also note that if we had linear scalability ($\alpha$ and $\beta$ equal to 1, and $\delta$ equal to 0 in eq. 4.1) and no interference ($\gamma$ equal to 1 in eq. 4.2), then $TTT^{c}_{WL} = TTT^{s}_{WL} = TTT^{n}_{WL}$, i.e., the total turnaround time for the whole workload would be the same regardless of the scheduling approach. The rest of the chapter explores, in a series of experiments, the values of $\alpha$, $\beta$, and $\gamma$ for a set of DB and graph analytics algorithms.
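Before moving on, a small sketch of how these metrics are computed from per-job runtimes may be useful. The t_sp and t_mp callables stand in for measured (or modeled) runtimes and are purely synthetic here; only the arithmetic of eqs. 4.3-4.8 is meant to be illustrated.

# Sketch: computing the evaluation metrics (eqs. 4.3-4.8) from per-job runtime
# functions. t_mp(i, m) and t_sp(i, m) are assumed to return the measured (or
# modeled) runtime of job i at degree of parallelism m; here they are stubs.
import math

def ttt_core_based(p, c, t_mp):
    dop = max(c // p, 1)                                   # floor(c / p), at least one core
    return max(t_mp(i, dop) for i in range(p))             # eq. 4.3

def ttt_serial(p, c, t_mp):
    return sum(t_mp(i, c) for i in range(p))               # eq. 4.4

def ttt_numa_based(p, c, n, t_mp):
    stages = math.ceil(p / n)                              # jobs run n at a time
    total, jobs = 0.0, list(range(p))
    for s in range(stages):                                # eq. 4.5 (upper bound)
        batch = jobs[s * n:(s + 1) * n]
        total += max(t_mp(i, c // n) for i in batch)
    return total

def stp(p, ttt):
    return p / ttt                                         # eq. 4.6

def antt(p, m, t_mp, t_sp):
    return sum(t_mp(i, m) / t_sp(i, m) for i in range(p)) / p   # eqs. 4.7-4.8

# Toy usage: identical, perfectly scalable jobs with a 20% interference slowdown.
t_sp = lambda i, m: 100.0 / m
t_mp = lambda i, m: 1.2 * t_sp(i, m)
p, c, n = 8, 32, 4
print(stp(p, ttt_numa_based(p, c, n, t_mp)), antt(p, c // n, t_mp, t_sp))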

4.3 Methodology

This section explains the experimental setup for our analysis. We cover the necessary background for the parallel data-processing jobs used in the concurrent workload, as well as the properties of the underlying hardware architectures.


4.3.1 Parallel data-processing algorithms

We focus our analysis on parallel data-processing algorithms such as the ones used in parallel databases and graph processing. The parallel relational operators that we use are optimized for multicore machines and are the basic building blocks of more complex analytical queries: radix join, sort-merge join and aggregation. There are many hardware-conscious implementations of these operators (e.g., for hash and sort-merge joins [KKL+09, AKN12, BATO13, BLP11, BLP+14], sort [SKC+10, WS11, PR14], or aggregation [YRV11, CR07]). We use the open-source implementations of some of the most performant algorithms reported to date [BTAO13, YRV11, BATO13]. Similarly, there are many optimized systems for graph processing, most of which are focused on distributed execution [CSCC15, GSC+15, MMI+13, GXD+14] or on disk-based single-node systems [HLP+13, KBG12]. However, Green-Marl [HCSO12] and Ringo [PSB+15] successfully demonstrate that a single multicore machine is a suitable platform for interactive graph analytics. We use an open-source implementation² of the parallel graph algorithms for multicore systems as generated by the Green-Marl suite [HCSO12].

Relational operators

Hash Join: The algorithm we use is by Balkesen et al. [BTAO13] and optimizes its implementation for the cache hierarchy by introducing a partitioning phase (which can itself have several passes in order to reduce TLB misses, as proposed by Manegold et al. [MBK00]). This way it creates partitions that fit in the caches. The source code for the hash join is publicly available³. The experimental setup used throughout the evaluation is as follows: one level of partitioning with a fanout of 2^14, the input data is uniformly sharded across the NUMA nodes, and the two relations are equi-sized with 32-bit keys and 32-bit values.

Group-by Aggregation: The algorithm we use is by Ye et al. [YRV11] and has an aggregation phase whose performance has also been optimized for better cache behavior by introducing a partitioning phase. The difference to hash join is that during the partitioning phase the algorithm performs some partial local aggregation of the input data.

² https://github.com/stanford-ppl/Green-Marl
³ http://www.systems.ethz.ch/sites/default/files/multicore-hashjoins-0 1 .gz


For our experiments, the optimal fanout for partitioning was determined to be 2^8. The input data is synthetic with a uniform distribution and a random alphabet type. For large cardinality (LC) the group cardinality is set to 2^20, and for small cardinality (SC) it is set to 2^10. As in the original paper, the input relations consist of a 64-bit group-by key and 64-bit values.

Sort-merge Join: The algorithm we use is by Balkesen et al. [BATO13] and uses the AVX registers to speed up the building blocks of a traditional merge-sort: the run generation and the merging of pre-sorted runs. Furthermore, the memory bandwidth usage is optimized by introducing a multi-way merge phase that reuses hot data in the last level cache (LLC). The authors also optimize the data movement by using NUMA-aware data transfer [LPM+13]. The source code for this algorithm is also publicly available⁴. The experimental setup we used is as follows. As recommended by the authors, we set the partitioning fanout to 2^7 and the multi-way buffer size to the size of the LLC. Furthermore, the NUMA shuffling strategy is set to 'ring-based', as defined by Li et al. [LPM+13], and the input data is synthetic with a uniform distribution. The two relations are equi-sized and have 32-bit keys and 32-bit values. In our experiments, for all relational operators we used input relations with 128 M, 1024 M and 2048 M tuples.

Graph processing algorithms

As part of the Green-Marl graph algorithm suite we use the following kernels:

1. PageRank (PR) iteratively calculates the PageRank of each node in the graph;

2. Single-Source Shortest Path (SSSP) computes the shortest distance to every node in the graph from a single node;

3. Hop-dist (HD) computes the distance of every node from a given root node using the Bellman-Ford algorithm;

4. Triangle Counting (TC) computes the number of closed triangles; and

5. Strongly Connected Components (SCC) finds the strongly connected components of a given graph using Kosaraju's algorithm.

⁴ http://www.systems.ethz.ch/sites/default/files/file/sort-merge-joins-1 4 tar.gz




Figure 4.2: Architecture of the Intel SandyBridge: the left side depicts the QPI interconnect topology between the four NUMA nodes. The middle illustrates the internal layout of a NUMA node, and the right side shows the layout of one of the multi-threaded cores, which consists of two hardware threads that share resources such as the L1 and L2 caches.

We evaluate their performance on the Twitter graph [KLPM10] with 41 M nodes and 1.5 B edges (5.6 GB of binary data), the LiveJournal graph [BHKL06] from the SNAP datasets⁵ with 5 M nodes and 68 M edges (300 MB of binary data), and a random graph generated with the Green-Marl uniform synthetic graph generator with 128 M nodes and 2 B edges (8 GB of binary data). The graph data structure uses 32-bit node and edge IDs. The initial graph data is uniformly sharded across all NUMA nodes on the machine. That way we get the maximum aggregate memory bandwidth of the machine. As done in prior work, our measurements do not include the time it takes to load the graph into memory or prepare any other relevant input data.

4.3.2 Hardware architectures

To illustrate the complexity of modern hardware and the many options to consider when scheduling, we use as an example one of the machines used in our experiments – the Intel SandyBridge, whose architecture is shown in Figure 4.2. It has four Intel Xeon E5-4640 processors, each containing one NUMA node. The left side of Figure 4.2 illustrates the topology of the 8 GT/s QPI interconnect between the four NUMA regions. Zooming in to one of the NUMA nodes, its internal layout is shown in the middle section of Figure 4.2. There are eight multi-threaded cores connected via a bi-directional internal bus to the shared 20 MiB L3 cache, the memory controller with four memory channels (constituting the access to one NUMA node) running at 1600 MHz, as well as the QPI interconnect for interfacing with the other NUMA nodes. The opportunities for resource sharing increase further as we zoom into one of the multi-threaded cores (e.g., core 13). It has two hardware threads running at 2.4 GHz, which share both the 32 KiB L1 cache and the 256 KiB L2 cache (as depicted in the rightmost section of Figure 4.2).

To emphasize the differences across hardware platforms, we also present the internal architecture of another machine that was used in our experiments – the AMD Bulldozer (Figure 4.3). It has four AMD Opteron 6378 processors, each with two dies, resulting in a total of eight NUMA nodes (64 GiB each). The complexity of the 6.4 GT/s HyperTransport (HT 3.0) interconnect is displayed on the left side of Figure 4.3. We would like to point out the asymmetry in the topology, which was not present on the Intel platform. Each NUMA node has eight cores, each with a single hardware thread. Each core runs at 2.4 GHz and has a private 64 KiB L1 cache. Two sibling cores (belonging to the same module block) share a 128 KiB L2 cache (right side of Figure 4.3). Within a NUMA node, all eight cores share a common 6 MiB L3 cache and access to the local DRAM controller, which is shown in the middle section of Figure 4.3.

Figure 4.3: Architecture of the AMD Bulldozer: the left side depicts the HyperTransport interconnect topology among the eight NUMA nodes. The middle illustrates the internal layout of each die (corresponding to a NUMA node), and the right side shows the layout of one of the module blocks, which consists of two cores that share resources such as the L2 cache.

⁵ http://snap.stanford.edu/data


4.4 Algorithms in isolation

We begin the analysis by characterizing each of the workload's algorithms and their properties when running in isolation on the Intel SandyBridge machine. The characterization consists of two steps:

1. Scalability. We evaluate the performance of each algorithm as we increase the number of cores. The measured scalability is also compared against a projected linear scalability (from the algorithms’ performance on a single core). We use this to determine the α and β coefficients from eq. 4.1, and how their values are affected by the type of algorithm and machine’s properties.

2. Instrumentation. We try to understand each algorithm's behavior using instrumentation data, i.e., by gathering performance counter events and deriving properties that describe the resource usage: executed cycles per instruction (CPI), L2 and L3 cache hit ratios, the total amount of data read from and written to the memory controllers, and the data transfer rate (bytes/cycle). The tool used is the Intel Performance Counter Monitor (PCM)⁶; a small sketch of how these properties are derived from raw counter values follows after this list. For these experiments each algorithm was scheduled on all 32 cores (64 threads). Prior work in the systems community by Zhuravlev et al. [ZSB+12] has suggested that this data can be used to understand how a thread will behave in a concurrent workload setup. For instance, it can often provide hints about the sensitivity of the algorithm to sharing resources.
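The derived properties in item 2 are simple ratios over raw counter readings. The sketch below shows the arithmetic with made-up counter values; the actual events and their names come from the Intel PCM tool and may differ from what is shown here.

# Sketch: deriving the per-run properties used in the tables below from raw
# counter readings. The numbers are made-up examples; in our setup the real
# values are reported by the Intel PCM tool.
def derive_metrics(cycles, instructions, l2_hits, l2_misses,
                   l3_hits, l3_misses, bytes_read, bytes_written):
    return {
        "CPI": cycles / instructions,
        "L2 hit ratio (%)": 100.0 * l2_hits / (l2_hits + l2_misses),
        "L3 hit ratio (%)": 100.0 * l3_hits / (l3_hits + l3_misses),
        "Read from MC (GB)": bytes_read / 1e9,
        "Written to MC (GB)": bytes_written / 1e9,
        "Data rate (byte/cycle)": (bytes_read + bytes_written) / cycles,
    }

print(derive_metrics(cycles=2.0e12, instructions=4.0e11,
                     l2_hits=4.3e9, l2_misses=5.7e9,
                     l3_hits=4.5e9, l3_misses=1.2e9,
                     bytes_read=7.3e10, bytes_written=4.3e10))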

4.4.1 Relational operators

We first analyze the parallel relational operators. Figure 4.4 depicts on a log-log scale the scalability of the algorithms as we increase the number of assigned cores from 1 to 64, where from 32 to 64 we move from assigning one hardware thread per physical core to using hyperthreads. The results show that both the hash join (HJ) and the aggregation (AGG LC) scale linearly with the number of cores within a NUMA node (eight). There is a slight slowdown in the scalability when assigning more than eight cores, and performance flattens when moving to hyperthreads. This observation suggests that the operator implementation is primarily CPU bound. If it were memory-bound, the use of HT would have further reduced the runtime. The limited scalability when moving to hyperthreads implies either that the algorithm has reached another bottleneck (within a NUMA node) or that the sharing between the two sibling threads is destructive and does not bring any improvement in performance. In comparison with them, the sort-merge join (SMJ) scales sub-linearly as we add more cores, even within one NUMA node. This is also implied by the nature of the algorithm, which is not as well optimized for the memory subsystem, especially when it comes to data movement between the DRAM controllers and the cores. We come back to this analysis when discussing the results from the instrumentation. Going back to equation 4.1, these experiments indicate that the scalability of a parallel relational operator depends not only on the degree of parallelism (i.e., the number of cores), but also on the NUMA and HyperThreading properties of the underlying hardware. Let us take HJ as an example. Based on the implementation of HJ, we take $\delta$ to be 0, and the rest of the equation then looks as shown in eq. 4.9. Note how the values of $\alpha_{HJ}$ and $\beta_{HJ}$ decrease with every step up the resource hierarchy.

$$T_i^{SP}(m) =
\begin{cases}
T_i^{SP}(1) \cdot \dfrac{1}{m^{0.93}}, & \text{if } 1 \le m \le 8 \\[4pt]
T_i^{SP}(1) \cdot \dfrac{0.65}{m^{0.65}}, & \text{if } 8 \le m \le 32 \\[4pt]
T_i^{SP}(1) \cdot \dfrac{0.14}{m^{0.3}}, & \text{if } 32 \le m \le 64
\end{cases} \tag{4.9}$$

Figure 4.4: Scalability of DB operators on tables with 1024 M tuples (latency in seconds vs. number of cores, log-log scale).

⁶ http://www.intel.com/software/pcm


Table 4.1: Relational workload characterized by instrumentation

Workload (Dataset): Database Operators (1024 M tuple relations)

Metric                     Hash Join   Aggregation (LC)   Sort-Merge Join
CPI                        4.9         1.9                1.0
L2 hit ratio (%)           43          48                 61
L3 hit ratio (%)           79          76                 22
Read from MC (GB)          73          47                 152
Written to MC (GB)         43          27                 110
Data rate (byte/cycle)     0.4         0.6                0.8

Even though the instrumentation results (Table 4.1) hide the details of the separate phases in the DB operators' implementations, the high L2 and L3 cache hit ratios confirm the importance of the partitioning phase for the good cache behaviour of the subsequent phases (e.g., for HJ the L3 hit ratio is almost 95% for the build and probe phases). Furthermore, all operators exhibit a high data-transfer rate between the memory controllers and the CPUs. In particular, the SMJ reaches 0.79 bytes/cycle and a total of 262 GB transferred for relations of 1024 M tuples. For comparison, the HJ transfers only half that amount for the same dataset size. These results confirm that HJ and AGG may be more sensitive to sharing the caches, as they have high cache locality. One example of this was clearly seen when we also scheduled work on the hyperthreads, which share even the private caches. The results also show that the SMJ uses the DRAM bandwidth more intensively, which not only limits its scalability with the number of cores, but also makes it a parallel job that can easily contend with other tasks for shared resources such as the LLC (via cache pollution) and memory bandwidth (both the local DRAM and the interconnect links).

4.4.2 Graph processing algorithms

We continue the analysis with the algorithms in the graph workload. Figure 4.5 shows on a log-log scale the scalability of the Green-Marl graph kernels as we increase the number of assigned cores. The results indicate that PR, HD, SSSP and TC⁷ have linear scalability up to eight cores (i.e., within a single NUMA node), but also that there is a significant performance slowdown when moving to larger core counts (i.e., when the computation spans multiple NUMA nodes). The performance eventually flattens when moving to hyperthreads. PR has slightly better scalability than the other algorithms up to 16 cores. The sublinear scalability can be attributed to the higher number of random data accesses (both to local and remote NUMA nodes) as a result of increasing the number of worker threads, but also to the need for synchronization after each iteration. The synchronization cost is especially sensitive to the increasing cost of communication among the worker threads when it happens across the interconnect.

Revisiting equation 4.1 for the PR kernel gives us eq. 4.10. Note that the serial fraction $\delta$ is again set to 0, and that the decrease in the values of $\alpha_{PR}$ and $\beta_{PR}$ with increasing core count is even more pronounced than in the case of the HJ algorithm.

$$T_i^{SP}(m) =
\begin{cases}
T_i^{SP}(1) \cdot \dfrac{0.95}{m^{0.91}}, & \text{if } 1 \le m \le 8 \\[4pt]
T_i^{SP}(1) \cdot \dfrac{0.43}{m^{0.51}}, & \text{if } 8 \le m \le 32 \\[4pt]
T_i^{SP}(1) \cdot \dfrac{0.12}{m^{0.13}}, & \text{if } 32 \le m \le 64
\end{cases} \tag{4.10}$$

Figure 4.5: Scalability of Graph algorithms on Twitter data (latency in seconds vs. number of cores, log-log scale).

⁷ Note that the TC kernel could not finish its execution on the Twitter graph. For completeness we include results for its run on the LiveJournal (LJ) graph.


As we can see in Figure 4.5, in contrast to the other graph kernels, the SCC kernel using Kosaraju's algorithm is inherently non-parallelizable. The main reason is that it relies on depth-first traversal of the graph; hence $\delta$ is almost 1. Additionally, the algorithm's runtime increases both in absolute numbers and in variability as we increase the number of worker threads. This is especially visible in the measured standard deviation among five repetitions of the experiment, which grows from less than 1% of the runtime for the smaller core counts up to 22% and 33% of the runtime for 32 and 64 threads, respectively. Later, we will also see how this impacts the parallelism vs. concurrency argument when scheduling concurrent workloads.

Table 4.2: Graph workload characterized by instrumentation
Workload (Dataset): Graph Analytics (Twitter)

Metric                     PR     HD    SSSP   TC (LJ)   SCC
CPI                        36.9   10.7  15.0   1.2       1.7
L2 hit ratio (%)           10     0     0      60        40
L3 hit ratio (%)           30     30    20     100       70
Read from MC (GB)          1519   74    225    2         59
Written to MC (GB)         444    30    100    1         19
Data rate (byte/cycle)     0.2    0.3   0.3    0.1       0.3

Continuing with the instrumentation results, Table 4.2 confirms that the implementation of the graph kernels is not as well optimized for the cache hierarchy as the DB operators. This can be immediately observed from the poor cache hit ratios for almost all graph kernels, unless the set of vertices fits in the L3 cache. This is clear when comparing the behaviour of the TC algorithm, which is executed on the smaller LiveJournal (LJ) social-network graph (100% L3 hit ratio), while the other algorithms process the larger Twitter dataset. This results in CPIs as high as 36 for PR, and in poor L2 (0% for HD and SSSP) and L3 cache hit ratios (30%). The latter results in large data transfers between the DRAM memory controllers and the CPUs, the most significant being the 2 TB of data moved during the PR execution. These results imply that the graph kernels' behaviour depends on the size and properties of the input graph. For smaller graph sizes, the impact of sharing the caches (e.g., the LLC) will be higher, while for larger input graphs the penalty would come from sharing the DRAM bandwidth. Hence, determining the $\gamma_i$ coefficient is context-dependent, as we will see in the experiments in Section 4.5.


Discussion

The results indicate that for many parallel data-processing algorithms the optimal degree of parallelism in concurrent workloads (i.e., when there are enough jobs in the system) is the number of cores up to which they achieve almost linear speed-up in performance. Simply allocating a degree of parallelism higher than that can waste resources. For example, these cores can be assigned to another parallel job, which can use them more efficiently and hence achieve a better overall system throughput (STP). Based on the results presented in this section, most algorithms only scale well (i.e., have a $\beta_i$ coefficient close to 1) when executing with a degree of parallelism smaller than or equal to the number of cores within a NUMA node.
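One way to act on this observation is to grant a job the largest degree of parallelism for which doubling the core count still pays off. The sketch below uses a runtime function that mimics the piecewise behaviour of eq. 4.9 with illustrative coefficients and an arbitrary efficiency threshold; a real optimizer would plug in measured or fitted runtimes instead.

# Sketch: choosing a degree of parallelism for a job as the largest core count
# at which doubling the cores still yields a worthwhile speedup. The runtime
# function mimics the piecewise model of eq. 4.9 with illustrative coefficients.

def hj_like_runtime(m, t1=100.0):
    if m <= 8:
        return t1 * 1.0 / m**0.93          # near-linear within a NUMA node
    if m <= 32:
        return t1 * 0.65 / m**0.65         # flatter across NUMA nodes
    return t1 * 0.14 / m**0.3              # almost flat with hyperthreads

def choose_dop(runtime, max_cores, min_marginal_speedup=1.8):
    """Keep doubling the core count while each doubling still improves the
    runtime by at least min_marginal_speedup (out of an ideal 2x)."""
    m = 1
    while 2 * m <= max_cores and runtime(m) / runtime(2 * m) >= min_marginal_speedup:
        m *= 2
    return m

print(choose_dop(hj_like_runtime, max_cores=64))   # stops at 8 (one NUMA node)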

4.5 Concurrent WL execution

Until now, we have focused on characterizing the behaviour of the parallel analytical algorithms when running in isolation. This section evaluates the same set of algorithms when executed in concurrent workloads in various setups with the goal of understanding how the resource interaction among the parallel jobs changes the performance of the whole workload mix when using the three proposed scheduling approaches. In all experiments the input data for both the relational operators and the graph kernels is uniformly distributed across the NUMA nodes.

4.5.1 Interference in concurrent workloads

Recent work in both the database [LDC+09, SSGA11] and systems [ZBF10, TMV+11, CHCF15] communities identified that resource interference in the memory system can hurt performance. Our experiments confirm that there can be significant performance degradation as a result of sharing of the caches or memory bandwidth (Figure 4.6). The experiment set-up is the following. In each experiment we concurrently run two different algorithms. The chosen pairs are composed of either DB operators (Figure 4.6a) or Green-Marl graph algorithms (Figure 4.6b). Each job is started with four threads and the thread placement is set so that we evaluate the effects of sharing specific resources: a hardware thread, a multi-threaded core (on Intel SandyBridge), an L2 cache shared

with a sibling core (on AMD Bulldozer), or the LLC/NUMA bandwidth. The results are expressed as the observed performance slowdown with respect to the job's runtime in isolation, executed using the same DOP. The histogram clusters represent the co-located algorithm pairs. For each algorithm type (HJ, SMJ, AGG, PR, HD, and SSSP), we show the normalized runtime when executed together with a given partner algorithm. The DB operators process 1024 M tuple relations, and the graph kernels work on the Twitter graph.

Figure 4.6: Effect of resource sharing on a co-scheduled pair of jobs (normalized runtime): (a) co-located DB operators on the Intel SandyBridge machine, sharing a hardware thread, a physical core, or the LLC/NUMA bandwidth; (b) co-located Green-Marl algorithms on the AMD Bulldozer machine, sharing a core, an L2 cache, or the LLC/NUMA bandwidth.


It is not surprising that all parallel jobs experience significant slowdown when sharing a hardware thread or a core with another operator, regardless of its type. However, it is important to observe how much the performance of both HJ and AGG is affected when sharing the LLC/NUMA bandwidth with the SMJ operator (a normalized runtime of 1.60 for HJ, and 1.26 for AGG). Similarly, the performance slowdown of the graph algorithms when sharing the LLC/NUMA bandwidth with each other can be quite substantial (for HD it ranges from 1.12 when co-located with SSSP up to 1.53 when paired with PR). This is, however, not the case for the HJ/AGG pair, where we see no slowdown for either operator when sharing the LLC and local DRAM bandwidth. The explanation can be found in the workload characterization in Tables 4.1 and 4.2. For example, as the SMJ operator uses significantly more bandwidth than any other algorithm, it interferes more intensively with its partners. Going back to equation 4.2, these results highlight the difficulty of determining the γ coefficient, and show that it depends on several factors:

• The sensitivity to resource sharing of the parallel job;

• The characteristics of the partner job and in particular the intensity with which it uses the shared resources; as well as

• The hardware properties of the multicore machine (e.g., the types of shared resources), and the available HW/OS mechanisms for QoS guarantees and performance isolation.

Note that these effects get more convoluted when executing multiple different parallel jobs.
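One way to make these factors actionable is to measure them directly: run each algorithm in isolation and co-located with every potential partner on the shared resource of interest, and record the resulting slowdowns. The sketch below builds such a pairwise slowdown matrix, following the normalization used in Figure 4.6; the fixed job count and names are illustrative only.

/* gamma[i][j]: runtime of job i when co-located with job j, divided by
 * job i's runtime in isolation (same DOP). Values > 1 indicate
 * destructive sharing; values close to 1 indicate the pair can safely
 * share the resource. */
#define NJOBS 3   /* e.g. HJ, SMJ, AGG; illustrative */

static void slowdown_matrix(const double iso[NJOBS],
                            const double coloc[NJOBS][NJOBS],
                            double gamma[NJOBS][NJOBS])
{
    for (int i = 0; i < NJOBS; i++)
        for (int j = 0; j < NJOBS; j++)
            gamma[i][j] = coloc[i][j] / iso[i];
}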

Discussion

The main takeaway from this experiment is the importance of spatial isolation. If the parallel jobs were scheduled on isolated HW islands (i.e., almost no resources are shared, with the exception of the interconnect used to access the input dataset (relations or graph)), none of them would have experienced a slowdown in its runtime despite the noise in the system. Therefore, the system's optimizer and scheduler would either have to work with a per-algorithm cost model that captures the above characteristics (sensitivity and intensity) on different hardware architectures, or make sure the jobs are scheduled on spatially isolated HW islands to ensure the runtime of each job is stable.


4.5.2 Scheduling approaches – experimental setup

For evaluating the three scheduling approaches we use the two performance metrics: STP (Weighted Speedup) and ANTT (Hmean Speedup). We use two types of concurrent workloads for our experimental analysis: homogeneous and heterogeneous workload mixes. The homogeneous workload consists of eight identical jobs which are issued to the system at the same time. An illustration of the resource allocation used for the three scheduling approaches was previously shown in Figure 4.1. For the core-based scheduling approach we used a thread-to-core mapping that is based on a data-locality heuristic. It knows the location of the input data partition to be accessed by each thread, and places the thread on a core belonging to the corresponding NUMA region. Such thread and data placement (and migration) mechanisms have been discussed by Li et al. [LPM+13] and Diener et al. [DMR+10]. We found that for most algorithms, such a policy gives better results than pinning all threads onto cores on the same NUMA node (see Table 4.3). With the exception of HD and SSSP (footnote 8), all other algorithms exhibited better performance when their threads were placed on cores across the NUMA nodes of the machine.
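For reference, both metrics can be computed from per-job runtimes as sketched below, following the standard definitions of average normalized turnaround time and weighted speedup over n jobs. The thesis additionally reports STP as completed jobs per minute, a throughput figure derived from the same runtimes and the wall-clock length of the experiment.

/* t_iso[i]: runtime of job i in isolation (same DOP);
 * t_con[i]: runtime of job i in the concurrent workload mix. */
static double antt(const double *t_iso, const double *t_con, int n)
{
    double sum = 0.0;                       /* average per-job slowdown */
    for (int i = 0; i < n; i++)
        sum += t_con[i] / t_iso[i];
    return sum / n;
}

static double weighted_speedup(const double *t_iso, const double *t_con, int n)
{
    double sum = 0.0;                       /* sum of per-job speedups (STP) */
    for (int i = 0; i < n; i++)
        sum += t_iso[i] / t_con[i];
    return sum;
}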

Footnote 8: The SSSP kernel is based on the Bellman-Ford algorithm, which makes heavy use of synchronization. One explanation for its poor performance is the expensive synchronization across the interconnect.

Table 4.3: Effect of thread to core placement on the execution time of various parallel algorithms on the Intel SandyBridge machine. All algorithms are executed with a degree of parallelism set to eight (i.e., equal to the number of cores per NUMA node). We compare the performance when the eight threads are placed on the same NUMA node or spread across the machine on different NUMA regions.

                                        Execution time (sec)
Algorithm                              same NUMA    across NUMA
Hash Join (HJ)                              7.49           6.46
Sort Merge Join (SMJ)                      19.04          15.20
Aggregation (AGG)                           6.72           6.77
PageRank (PR)                             128.26         113.20
Hop Distance (HD)                           5.47           8.36
Single Source Shortest Path (SSSP)         15.47          26.28
Triangle Counting (TC)                     13.81          12.13
Strongly Connected Components (SCC)       116.17         116.62
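The two placements compared in Table 4.3 boil down to which cores a worker thread's affinity mask contains. The sketch below pins the calling thread to all cores of a chosen NUMA node using the Linux affinity API; it assumes a simple contiguous core numbering (node n owns cores n*CORES_PER_NODE to n*CORES_PER_NODE+CORES_PER_NODE-1), whereas on a real machine the core-to-node map should be read from libnuma or hwloc.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

#define CORES_PER_NODE 8   /* SandyBridge machine used here: 4 nodes x 8 cores */

/* "same NUMA": every worker calls pin_to_node(fixed_node).
 * "across NUMA": worker i calls pin_to_node(node_of_its_input_partition). */
static int pin_to_node(int node)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int c = 0; c < CORES_PER_NODE; c++)
        CPU_SET(node * CORES_PER_NODE + c, &mask);
    return pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
}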


Figure 4.7: Scheduling approaches for the heterogeneous workload mix (HJ1, HJ2, AGG1, AGG2, SMJ1, SMJ2) on a four NUMA node machine with eight cores per NUMA node: (a) core-based scheduling; (b) NUMA node-based scheduling; (c) serial scheduling.

There are two possible explanations for this behaviour. First, the algorithms can leverage more aggregate bandwidth when their threads are spread across multiple DRAM controllers. This is particularly important for well-optimized implementations, where the intermediate data structures are allocated local to each thread. Second, the algorithms can benefit from better data locality with respect to the input dataset (e.g., the input graph).

The heterogeneous workload mix of relational operators consists of six parallel jobs: two hash joins, two sort-merge joins, and two aggregations with large cardinality. Jobs of the same operator type have the same setup (input data size and amount of allocated resources). The degree of parallelism assigned within the operators is a power of two, which is not uncommon for parallel operators. Figure 4.7 depicts the resource allocation to each of the parallel DB operators for the three scheduling approaches. In the core-based scheduling, all jobs are executed concurrently: the HJ and AGG operators are assigned a degree of parallelism of four, and the SMJ operators have eight threads. For the NUMA-based scheduling policy, the degree of parallelism for HJ and AGG is set to eight (equal to the number of cores in a NUMA node), and for SMJ it is set to sixteen. The parallel operators are then executed in two stages, as depicted in Figure 4.7b. Finally, for the serial execution approach, as before, all parallel operators are assigned a degree of parallelism equal to the total number of cores and are executed sequentially one after another.

Similarly, the heterogeneous workload mix from the set of graph algorithms consists of eight concurrently runnable jobs, i.e., two identical instances of each algorithm (PR, SSSP, HD, and TC). The scheduling setup for the different scheduling strategies is identical to the one depicted for the homogeneous workload with eight parallel jobs in Figure 4.1.
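The resource assignments behind Figures 4.1 and 4.7 can be summarized in a few lines. The sketch below computes, for a homogeneous mix of identical parallel jobs, the per-job DOP, the number of jobs running concurrently, and the number of execution stages under each of the three policies. The heterogeneous mix above additionally doubles the SMJ's DOP relative to HJ and AGG, which this simplified sketch does not model.

struct plan { int dop; int concurrent; int stages; };

/* Core-based: all jobs run at once, splitting the cores among them. */
static struct plan core_based(int nodes, int cores_per_node, int jobs)
{
    struct plan p = { (nodes * cores_per_node) / jobs, jobs, 1 };
    return p;
}

/* NUMA-based: one job per NUMA node, executed in waves if needed. */
static struct plan numa_based(int nodes, int cores_per_node, int jobs)
{
    struct plan p = { cores_per_node, nodes, (jobs + nodes - 1) / nodes };
    return p;
}

/* Serial: one job at a time, each using the whole machine. */
static struct plan serial(int nodes, int cores_per_node, int jobs)
{
    struct plan p = { nodes * cores_per_node, 1, jobs };
    return p;
}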

4.5.3 Scheduling concurrent DB operators

We first evaluate the interplay of both algorithm scalability (i.e., the values of α and β) and interference (γ) for the different data-processing algorithms in concurrent workloads. As presented earlier, the homogeneous workload mix we use to study the three scheduling policies consists of eight identical algorithms. A summary of the results for the relational operators on a variety of input data sizes is provided in Figure 4.8.


Figure 4.8: Comparing scheduling approaches (serial, NUMA, core) for relational DB operators (HJ, SMJ, AGG LC/SC): (a) 128M relations, ANTT; (b) 128M relations, STP (operators/min); (c) 1024M relations, ANTT; (d) 1024M relations, STP; (e) 2048M relations, ANTT; (f) 2048M relations, STP. For the 2048M relations, some runs did not finish.

Per job performance slowdown (Hmean speedup)

The graphs on the left side of the page (Fig. 4.8a, Fig. 4.8c, and Fig. 4.8e) show the average normalized turnaround time (ANTT) for each workload mix. The histogram clusters present the values for a particular scheduling policy (core, NUMA, and serial) and

the bars represent the different operators. In terms of per-job runtime predictability (i.e., the best balance between throughput and fairness), the spatial isolation policies perform best for all operators. For example, when using the NUMA node as a deployment unit for parallel jobs, the system can achieve an ANTT of up to 1.07. Similarly, the serial scheduling policy also delivers good results, especially when the algorithm processes larger input relations. In contrast, the core-based allocation policy often slows down the parallel algorithms, which interfere with each other on the shared resources. This policy results in higher penalties especially for smaller input relations for both the HJ and Aggregation (SC) operators (e.g., the ANTT for AGG SC equals 1.83 for the 1024 M relations). We would also like to note how the input data size changes the behaviour of the AGG LC operator: its ANTT values grow from 1.21 for the 128 M and 1024 M relations up to 1.96 for the 2048 M relations.

System throughput (Weighted speedup)

The plots displayed on the right side (Fig. 4.8b, Fig. 4.8d, and Fig. 4.8f) show the achieved throughput (STP) expressed in number of parallel DB operators executed per minute (ops/min). The histogram clusters are grouped per DB operator type (AGG shows the LC case; see footnote 9), and the bars present the scheduling policies. The results for all input relations show that the NUMA-based policy delivers the best throughput for all operator types. Consistently, the serial scheduling policy achieves the lowest throughput. As we discussed before, this is due to the sublinear scalability of the operators at higher core counts. In general, for the SMJ there is no big difference among the throughputs achieved with the three scheduling policies. One reason is that the SMJ experienced sublinear scalability already at smaller core counts (i.e., within a NUMA node), and as a result the NUMA-based scheduling is only slightly better than the serial execution policy. Furthermore, the poor scalability also explains why the performance-isolation benefits of the spatial isolation strategies do not translate into an advantage over the core-based strategy, despite the slowdown the latter experiences. In contrast, for the HJ case with 1024 M relations the NUMA policy achieves a throughput of 36 ops/min, compared to 26 ops/min in the serial execution mode and 31 ops/min for core-based scheduling. In the case of both HJ and AGG, the core-based scheduling suffers from destructive cache sharing among the different operators' threads, which consequently results in more data traffic between DRAM and the CPUs.

Footnote 9: The AGG SC values for STP are about an order of magnitude higher than for the other algorithms and are thus omitted for readability.

We would also like to point out that the SMJ run with eight parallel jobs working on the 2048 M tuple relations (~15 GB per table) ran out of memory even though the machine has 512 GB of main memory. This confirms the importance of checking the working set (memory) requirements of the jobs against the capacity of the bounded resource when determining the maximum concurrency that can be supported for a given workload mix. The observation also highlights the importance of designing algorithm implementations that are not only highly performant but also modest in their resource usage, especially for throughput-oriented systems (e.g., memory-optimized algorithms like McJoin [BHC12]) and when executing in multi-user (i.e., noisy) environments (e.g., memory-bandwidth-optimized implementations).
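A minimal admission check of this kind could look as follows; the memory capacity and per-job working-set estimates are inputs that the optimizer would have to supply, and the 512 GB figure from the evaluation machine is used only as an example.

/* Limit the degree of concurrency by the bounded resource: admit only as
 * many jobs as fit into main memory given their estimated working sets. */
static int max_admissible_jobs(double mem_capacity_gb,
                               double working_set_per_job_gb,
                               int requested_jobs)
{
    int fit = (int)(mem_capacity_gb / working_set_per_job_gb);
    return (fit < requested_jobs) ? fit : requested_jobs;
}

/* Example: max_admissible_jobs(512.0, 80.0, 8) would admit only 6 jobs. */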

4.5.4 Scheduling concurrent graph algorithms

The results of the experiments comparing the three scheduling strategies for the GreenMarl graph kernels are presented in Figure 4.9. Once again, the homogeneous workload consists of eight identical parallel jobs executed concurrently.

Per job performance slowdown (Hmean speedup)

The plots on the left side of the figure (Fig. 4.9a, Fig. 4.9c, and Fig. 4.9e) show the ANTT measured for the individual jobs, and the histogram clusters represent the values for each of the scheduling policies (core, NUMA, and serial). Fig. 4.9a shows the numbers for experiments executed on the LiveJournal graph, Fig. 4.9c displays the performance for the Twitter graph, and Fig. 4.9e shows the numbers for the Random graph. The TC algorithm did not finish executing on the Twitter dataset, and its results are omitted from the result graph. At a high level, the scheduling approaches behave similarly to the relational operators. Both the serial and NUMA scheduling approaches deliver predictable runtimes for all parallel jobs. The smaller social-network LiveJournal graph shows bigger fluctuations in the ANTT for serial scheduling than the other graphs. The NUMA policy, however, remains stable. In contrast, the core-based scheduling approach results in significant slowdown for all algorithms on all input graphs.


Figure 4.9: Comparing scheduling approaches (serial, NUMA, core) for GreenMarl algorithms (PR, HD, SSSP, TC, SCC): (a) LiveJournal, ANTT; (b) LiveJournal, STP (jobs/min); (c) Twitter, ANTT; (d) Twitter, STP; (e) Random, ANTT; (f) Random, STP. TC did not finish on the Twitter dataset.

The SCC algorithm is in general the least affected, but still has an ANTT of 1.56 on the Twitter dataset. All other algorithms have an ANTT above 3 for the same input graph. The PR algorithm is more sensitive to interference and experiences a big per-job slowdown under the core-based scheduling policy for all graphs. Its ANTT value is 3 for Twitter, more than 5 for LiveJournal, and almost 8 for the Random graph. This is primarily attributed to the nature of the algorithm, which performs several iterations over both of the graph's input data structures (following the edge list of each vertex) when updating the page rank of the vertices. This results in many random accesses over the data, in particular when accessing the current page rank values of the neighboring vertices. Therefore, the algorithm is very sensitive to pollution in the LLC – more random accesses will miss the LLC and result in expensive memory accesses (often also across the interconnect). Naturally, in many hardware architectures the bandwidth capacity for random accesses is smaller than for sequential accesses, and with a higher number of jobs in the system the bandwidth bottleneck significantly reduces the performance of the whole system.

In contrast, under the NUMA-scheduling policy the number of communicating threads computing the pagerank is smaller (because the DOP is limited to the total number of cores in a NUMA node), and all the intermediate working-set data structures (for instance, the current and future pagerank values of the vertices) are allocated on the same NUMA node. Additionally, all the communication between the worker threads happens via the LLC and does not cross the interconnect.

System throughput (Weighted speedup)

The plots on the right side of the figure (Fig. 4.9b, Fig. 4.9d, and Fig. 4.9f) show the achieved system-wide throughput (STP), measured in number of parallel jobs executed per minute. The histogram clusters denote the measured values for each graph algorithm, and the bars represent the different scheduling approaches. The results indicate that, with the exception of the SCC algorithm (which did not scale with the number of cores), the NUMA-based scheduling policy achieves the highest throughput for all other graph kernels. This is most pronounced for the HD and SSSP kernels, where it outperforms the other approaches by more than a factor of two. The reason for this behaviour is twofold. First, the graph processing algorithms did not scale as well as the DB operators (recall Fig. 4.5 and eq. 4.10), so the poor performance of the serial execution policy is not surprising. Second, the core-based policy results in heavy performance interaction among the parallel jobs (see the plots on the left side), which negatively affects the per-job runtime and hence the overall system throughput. Finally, since SCC is inherently non-parallelizable, it does not benefit from being assigned additional cores. Therefore, the NUMA and serial scheduling approaches are simply wasting


resources. So, even though the SCC algorithms are slower under the core-based policy, the overall system throughput is slightly higher.

Figure 4.10: AMD Bulldozer results for the Twitter dataset: (a) ANTT; (b) STP (jobs/min). TC did not finish.

4.5.5 Effect of underlying architecture

Although we base the main evaluation of the different scheduling approaches on the Intel SandyBridge machine, we wanted to check whether the same observations are also valid


on a different architecture. For this purpose we use the AMD Bulldozer machine. The setup is similar to the previous experiments: eight identical algorithms are executed, and we evaluate the Hmean speedup (ANTT) and the weighted speedup (STP) for the three scheduling policies. For these experiments we use the Green-Marl graph processing algorithms on the Twitter dataset, as they are more easily portable across different microarchitectures. The evaluation results are presented in Figure 4.10.

The spatial isolation policy using the NUMA node as a unit of allocation once again delivers the lowest ANTT for all jobs. The serial execution policy also has a low ANTT, with the exception of the large variation for the SCC algorithm. This behaviour is, however, a result of SCC's high variance in response time when spawned with a high degree of parallelism (recall the discussion of Figure 4.5). The per-job slowdown of the core-based policy is higher than with the other scheduling approaches (roughly 1.5, with the exception of PR). In general, the ANTT values are lower than the ones we observed on the Intel SandyBridge. When comparing the STP numbers, the NUMA node scheduling strategy once again delivers superior throughput to the other approaches.

Table 4.4: Scheduling policies on heterogeneous WL

                        Total Turnaround Time (sec)
Workload mix            core    NUMA    sequential
6 x DB operators          24      18            20
8 x GM algorithms        253     197           238

4.5.6 Heterogeneous workload

The next experiment uses the two heterogeneous workload mixes presented in Section 4.5.2. For both, the evaluation metric is the total turnaround time (TTT), measured in seconds. The results are presented in Table 4.4. Similar to the observations made earlier, the NUMA-based policy delivers the best overall turnaround time. As before, even though the serial scheduling policy removes the danger of performance interference, it suffers from the sublinear scalability of the operators at high core counts.


Table 4.5: Time breakdown for heterogeneous workload mix

                        TTT – Total Turnaround Time (sec)
Scheduling policy    WL    PR1   PR2   SSSP1  SSSP2  HD1  HD2  TC1  TC2
Core                 253   253   253     74     72    29   30   70   70
NUMA                 197   197   197     35     34    11   11   36   37
Sequential           238    82    82     13     13     5    4   20   20

When using the core as a unit of allocation, the TTT is the highest as a result of resource interference among the different operators. In order to quantify the slowdown, we show the breakdown of the execution time per job for the graph workload in Table 4.5. The fine-grained scheduling policy that uses the cores as a unit of allocation gives the scheduler flexibility to choose the thread-to-core mapping based on different heuristics. The biggest problem, however, arises when the data processing engine needs to provide predictability guarantees for the runtime of each individual parallel job, regardless of the other jobs in the system. For the same core-based allocation strategy, one could end up with thread-to-core placements similar to (1) the NUMA-based scheduling approach, (2) the data-locality based policy which we used throughout the chapter, or (3) something in between. Unlike the NUMA-based scheduling approach, where the slowdown due to resource interference (γ) is almost always equal to 1.00, the core-based scheduling policy could result in high variance and overall less predictable behaviour. As a reference, the execution of the same graph algorithm (working on the same dataset, with the same amount of resources, etc.) could be up to several times slower depending on the thread-to-core placement. One example is the HD algorithm, which, under the same system noise, runs in 11 seconds with a thread-to-core placement similar to the NUMA-based scheduling strategy, but takes almost 30 seconds when its threads are placed local to the input data.

4.6 Discussion

The outcome of our analysis indicates that a NUMA node should be used as the unit of scheduling for parallel relational and graph processing algorithms on multi-socket multicore machines. Here we discuss its applicability to existing systems and the trade-offs it entails for different types of workloads.


What does it mean for other forms of parallelism?

In our initial assumptions we simplified the discussion to partitioned parallelism (DOP) and independent parallelism (concurrency). However, following up on the conclusions from the previous chapter regarding operator pipelines, which were also leveraged in morsel-driven parallelism [LBKN14] and Oracle's producer-consumer scheduling [Pap14], we believe that allocating NUMA nodes to operator pipelines can positively affect the performance of the system. This way both non-blocking and blocking operator pipelines can benefit from constructive resource sharing, improved data locality, and efficient synchronization. We leave the verification of this hypothesis for future work.

What does this mean for the optimizer?

We now return to the original research question: how can a database optimizer leverage our findings? First, optimizers can use their detailed knowledge of the resource requirements of the parallel operators to request and allocate resources, rather than just cores and working memory. Given the current lack of hardware mechanisms for isolating DRAM bandwidth, we argue that optimizers should allocate resources at NUMA node granularity when optimizing for both Hmean and Weighted speedup for bandwidth-intensive jobs. Second, using NUMA nodes as the unit of allocation also simplifies the optimizer's computation: for each parallel operator it can set the degree of parallelism equal to the number of cores in a NUMA node, and the degree of concurrency equal to the number of NUMA nodes in the system.
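As a sketch, an optimizer could derive these two degrees directly from the machine topology at start-up using libnuma (link with -lnuma); error handling is minimal and the division assumes NUMA nodes of equal size, as on the machines used in this chapter.

#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0)
        return 1;                               /* no NUMA support */
    int nodes = numa_num_configured_nodes();
    int cpus  = numa_num_configured_cpus();
    int dop         = cpus / nodes;   /* DOP per parallel operator         */
    int concurrency = nodes;          /* operators scheduled concurrently  */
    printf("DOP = %d, concurrency = %d\n", dop, concurrency);
    return 0;
}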

How can it be applied to existing systems?

As an example, let us take Oracle's support for parallel queries [Pap14]. Currently, Oracle assigns so-called parallel execution servers (i.e., a dedicated thread pool) to each parallel job based on the pre-computed DOP for the query. In such a system design, applying the suggested scheduling approach is fairly straightforward: assign the threads of each parallel execution server to the cores belonging to the same NUMA node. In cases where pipeline parallelism is needed (e.g., a producer-consumer pipeline pair), the system can either split the NUMA node resources between the producer/consumer operator pair, or assign them to neighboring NUMA nodes.


What is the trade-off?

For the particular workload type and the evaluation metrics we considered, scheduling parallel analytical jobs at NUMA node granularity results in the best performance. Naturally, there is a trade-off. By using a coarser unit of resource allocation, we lose flexibility for job deployment. In fact, system-wide throughput can be improved, if the most expensive job is bandwidth bound, by spawning it across multiple NUMA nodes. That will improve its performance and shorten the makespan of the whole workload, even though it will likely create an unfair advantage over the other jobs. Furthermore, the NUMA-based scheduling alternative is more conservative in terms of resource allocation and may not result in the most resource-efficient deployment. We believe that it is a trade-off between providing a good balance of fairness and throughput and maximizing resource utilization. In cases where the workload consists of short-running latency-sensitive jobs and long-running best-effort jobs, or when the jobs have different priorities, it is highly probable that scheduling approaches that optimize for different objectives will propose a more suitable deployment.

How can it impact future multicore system design?

By identifying the NUMA region as the most suitable unit of scheduling, we confirmed the two major factors contributing to the performance of concurrent operators: (i) constructive and destructive sharing in the cache hierarchy; and (ii) sharing of and contention for the local DRAM bandwidth. One possible direction for future hardware platforms could be to introduce systems with more, rather than larger, NUMA regions. Another direction may be creating an additional cache hierarchy within a NUMA region, such as the one implemented in the new SPARC M7 [Phi14]. There, each chip has eight sets of core clusters (containing four cores that share an L2 cache) and a 64 MB shared on-chip L3 cache. The L3 cache is, however, partitioned into eight 8 MB chunks, one for each core cluster. Finally, if hardware designers enable QoS control over some of the shared resources (e.g., Intel's support for cache allocation or partitioning; see footnote 10), and in particular over the shared DRAM bandwidth, then the challenge will shift to identifying and enforcing a suitable and fair resource allocation.

Footnote 10: http://tinyurl.com/zej88ke


4.7 Related Work

4.7.1 Thread and data placement

Most recent work on parallel data processing in databases and graph processing systems on multisocket multicore machines has focused on careful data placement and partitioning across the machine, and on data-local thread deployment. The classical approach of many data-processing engines is to partition the data across the machine's NUMA nodes to take advantage of higher aggregate memory bandwidth. The thread (worker) placement is then based on the local data chunks. One example is the work done as part of SAP HANA [FML+12], which compares different strategies for data and task placement for main-memory store scans [PSM+15]. The morsel-driven parallelism in Hyper [LBKN14] argues for NUMA-awareness, and additionally makes sure that the data morsels used for fine-grained task-based parallelism are kept warm in the caches when processed by highly tuned operator pipelines.

Careful data and memory placement has also received attention from the systems community. Dashti et al. [DFF+13] consider the placement of memory within NUMA systems, and the sometimes-conflicting goals of providing low latency (local data access) versus spreading memory across the machine to achieve high aggregate bandwidth by reducing interconnect traffic congestion. Similarly, in Shoal [KARH15] the memory allocation for the chosen array implementations is based on the extracted data access patterns of graph analytics programs and the hardware specifications, such that the load on all memory controllers is balanced and the threads access local memory to reduce the pressure on the interconnects. Shoal, however, does not consider concurrent workloads.

As we mentioned earlier, hardware awareness has been an important aspect of many hardware-tuned implementations of relational operators [AKN12, LPM+13, BATO13], but none of these efforts (like most of the work on graph analytics engines for multicore systems) looked into how they perform, or how they should be scheduled, in concurrent workloads.

4.7.2 Scheduling concurrent parallel workloads

In the 90s there was a lot of work done in the context of scheduling and resource management for parallel databases, and on determining suitable parallel schedules for mixed workloads (e.g., [GI97, GHK92]). Authors argued for the need to characterize the resource

requirements of the operators in terms of time-shared and space-shared resources, and developed algorithms that generate close-to-optimal parallel schedules that minimize the makespan of the workload. However, these approaches make certain assumptions that are not applicable in today's systems. First, determining the granularity of parallel execution used to be a matter of finding the right balance between computation and communication overhead; hence, their models of the scalability of a parallel operator do not consider the effects of the microarchitectural features of modern hardware, as we have shown earlier. Second, the parallel schedules do not take into account the non-uniformity of today's cores (computational sites), and their shared resources in the memory sub-system (caches, DRAM bandwidth, etc.). Nevertheless, most of these ideas can be applied to a scheduler that uses, for instance, the NUMA node as a unit of scheduling. It can model its resource capacities, and leverage the valuable information about the resource requirements of the parallel operators to compute the optimal parallel schedule.

4.7.3 Contention-aware scheduling

On the other end of the spectrum are the black-box scheduling approaches from the systems community that aim for contention-aware scheduling. They use the hardware support for instrumentation to capture the characteristics of the application threads and adaptively perform thread and data migration to reduce the contention on the shared resources. Zhuravlev et al. provide an extensive survey [ZSB+12] of work on different scheduling techniques for addressing the contention on shared resources in multicore systems. Some of the techniques base their scheduling decisions on profiling the given workloads and choosing an appropriate subset of tasks that could be co-scheduled. The profiling can be done prior to scheduling the applications [BZFK10, ZBF10, LLD+08, KBH+08, MAN05, DWDS13, BM10, GARH14], or be based on detecting bad scheduling decisions (as a result of a high LLC miss rate, low IPC, or application-level performance indications) and re-scheduling the tasks accordingly [NKG10, MVHS10, BPA08, CHCF15]. While these approaches do not help with determining the degrees of concurrency and parallelism, they still propose valuable mechanisms for characterizing the properties of the operator threads, and strategies for online balancing of the load imposed on the memory

subsystem. Unfortunately, some of the simple heuristics that initially provided a successful characterization of the demand for DRAM bandwidth (e.g., only considering the LLC miss rate) fail to capture the properties of the finely tuned implementations of modern analytics algorithms. Examples include the extensive use of (shared) software pre-fetchers and non-temporal read/write intrinsics (as suggested by Wassenberg et al. [WS11]).

4.7.4 Constructive resource sharing

The benefits of constructive resource sharing within a NUMA node have already been investigated in the context of transactional workloads by Porobic et al. [PPB+12, PLTA14]. Their focus is on optimizing the synchronization cost among communicating threads by collocating them on HW islands to reduce the use of interconnect bandwidth. David et al. [DGT13] provide an extensive analysis of synchronization on modern hardware, and conclude that crossing sockets is harmful for performance and should thus be avoided. Their observations confirm our claim that the allocation of whole NUMA regions has advantages also from the point of view of synchronization. The benefits of NUMA-based allocation were also identified by Zhang and Re [ZR14] in the context of statistical analytics. They conclude that NUMA-based model replication, which enables communication among parallel computations over the LLC, is significantly faster than any other model that requires communication over the interconnect, and can improve performance by up to an order of magnitude. However, as in the graph processing space [GSC+15], they do not consider concurrent workloads and how their implementation and observations would be affected by noisy execution environments.

4.7.5 Impact of resource sharing and performance isolation

MCC-DB [LDC+09] addresses the problem of destructive cache sharing in databases on multicores. The authors classify queries as cache-sensitive or cache-insensitive and propose co-scheduling them to minimize LLC conflicts. They suggest software-based partitioning of the shared caches in order to limit the performance degradation caused by cache pollution. Our results lead to a stronger statement regarding the sharing of resources: rather than working on minimizing cache pollution, operators must be scheduled at the NUMA node level to also eliminate the danger of sharing the local NUMA bandwidth.


Tang et al. [TMV+11] analyze the impact of sharing the memory subsystem on Google's datacenter applications. Their observations are in line with ours: the performance impact of co-scheduling decisions is much larger for such data-intensive workloads than for the benchmarks of [Bie11]. Their scheduling decisions are based on heuristics that predict good thread-to-core deployments from statistics gathered when running alone in the system. Priority is given to latency-sensitive applications, which are later co-scheduled with batch programs. Lo et al. [LCG+15] also address the problem of collocating multiple parallel applications in a shared environment so that they improve resource utilization without violating the SLOs of latency-critical jobs. Similar to our observations, they identified that sharing of the LLC and the DRAM bandwidth can result in significant performance penalties. Instead of isolating the jobs on separate NUMA nodes, they use the recently introduced Intel Cache Allocation Technology (CAT) for way-partitioning of the shared LLC. As there are no commercially available hardware mechanisms for DRAM bandwidth isolation, they monitor the resource utilization of the batch jobs and scale down the allocated cores in order to limit the bandwidth usage.

4.8 Summary

In this chapter we revisited the problem of finding the right balance between the degree of parallelism and multi-programming in the context of modern analytical algorithms running on multicore machines. The goal was to find a thread-to-core deployment that maximizes throughput while minimizing the per-job slowdown in runtime in concurrent workloads. We show that the problem is not trivial, as the performance of individual parallel jobs, and of the workload mix, is highly sensitive to load interactions, the chosen degree of parallelism, and the spatio-temporal resource allocation on different hardware architectures. With a series of experiments, we have shown that the spatial isolation scheduling strategies (e.g., NUMA-based or serial scheduling) consistently help to achieve the best balance between throughput and fairness (Hmean speedup) when executing in noisy environments. The results corroborate observations from prior work that contention in the memory sub-system (e.g., pollution of shared caches, or contention on the DRAM and interconnect bandwidth) can result in a substantial slowdown in the runtime of parallel operators. Consequently, if a system adopts a scheduling strategy that uses finer units of resource

allocation (e.g., physical cores), it will struggle to provide guarantees for the predictable runtime of individual jobs. As we have shown, a poor thread-to-core mapping can result in response time penalties of up to several factors (i.e., even with the same DOP a job's runtime can be subject to large variations due to resource interference). However, despite providing complete performance isolation, the serial scheduling approach delivers lower throughput in almost all experiments. We attribute this to the sublinear scalability of parallel jobs. Therefore, when optimizing for higher system-wide throughput we recommend scheduling jobs with a finer granularity of resources. This is an important insight for the development of parallel algorithms (both in the database and graph processing communities), which so far has primarily focused on a single job's performance. Instead, more attention should be devoted to optimizing the resource utilization footprint, i.e., the intensity with which an implementation uses the shared resources. Finally, we argue that for concurrent workloads with parallel operators, using the NUMA node as the unit of allocation and isolation island often achieves the desired optimization goal of maximizing throughput while at the same time guaranteeing a job's runtime for a given set of resources.


5 Scheduler: Kernel-integrated runtime for parallel data analytics

The research questions we investigate in this chapter are:

• What are the suitable OS mechanisms for scheduling parallel workload mixes on multicore machines?

• How can we improve the interface between the OS and the applications, so that more information is passed regarding parallel operations and their requirements?

• How good is the currently offered OS process model for modern data processing workloads?

Our work primarily targets modern data appliances which address the challenges of executing hybrid (i.e., HTAP, operational analytics) workloads, like SAP Hana [FML+12], or federated data processing engines like BigDAWG [EDS+15] or Oracle's GoldenGate [Ora15]. Our main observation is that one size does not fit all, also when it comes to OS mechanisms and policies for scheduling such diverse workloads. Therefore, we propose an OS kernel which is customized for the needs of data processing systems. The solution we advocate relies on recent advancements in operating systems that enable systems to run multiple different kernels concurrently on a subset of the resources of a given machine [BBD+09],

and adapt the kernels and their resource allocation dynamically [ZGKR14]. We also address some of the limitations that current OS interfaces impose on the information flow regarding the scheduling requirements of different workloads. On a conceptual level, after analyzing the needs of modern analytical engines, we revisit the traditional process model which DBMS systems typically rely on (the UNIX process model), and propose different OS abstractions for using and sharing hardware contexts among jobs from different processes/applications. We implement them in our prototype Basslet, which is a kernel-integrated runtime offered as a service for the predictable execution of parallel jobs on behalf of analytical client applications. This work was done in collaboration with Gerd Zellweger, and our advisors Gustavo Alonso and Timothy Roscoe. Part of the material presented in this chapter, which discusses the capabilities enabled by the proposed system design, was published at the 12th International Workshop on Data Management on New Hardware (DaMoN '16) as "Customized OS support for data-processing".

5.1 Motivating use-case

To make things more concrete, in the rest of the chapter we show how, by using an adaptive OS architecture and specialized kernels, we can solve an immediate problem of scheduling concurrent parallel analytical workloads. More specifically, for this illustrative use case we let the default system stack (i.e., Linux + OpenMP) schedule some of the most common graph processing algorithms (e.g., PageRank) on a multicore machine. Figure 5.1 illustrates how the default and naïve execution of such concurrent workloads can result in poor scaling, both for the client (slower runtime) and for the data processing engine (lower throughput). The experiment measures the overall throughput obtained when concurrent clients each submit a sequence of pagerank (PR) OpenMP jobs over the LiveJournal dataset to a Linux server. The machine (AMD MagnyCours) has 8 NUMA nodes with 6 cores each. We add one socket (6 cores) with every additional client, and, as is default practice, let OpenMP choose the level of parallelism for each job. Ideally, the throughput should increase linearly with the number of clients (the "Ideal" line). Instead, the throughput per client rapidly decreases because the response time for each job increases, even though in principle each client has the same amount of resources as a single client (1 NUMA node, 6 cores).


Figure 5.1: System throughput for executing concurrent pagerank jobs. The plot shows throughput (PR jobs/min) as a function of the number of clients for Linux+OpenMP, Basslet+Badis, and the ideal linear scaling.

There are many factors contributing to this bad scaling:

• Inappropriate selection of the degree of parallelism by the OpenMP runtime;

• Poor co-location of threads within a single parallel task (ptask);

• Migration of threads by the operating system;

• Cache pollution due to context switching between threads of different clients; and

• Memory contention due to poor NUMA placement of data relative to threads.

In Section 5.4 we discuss these factors in more detail and show how we address them with a novel kernel runtime scheduler (Basslet) to reach the system scalability shown in Figure 5.1. Before that, we describe the Badis OS architecture (Section 5.3), which enables running multiple different kernels in a single system and using a customized scheduler to solve the mentioned issues.


5.2 Foundations

We first discuss the limitations of the current process model and how we propose extending it for better execution of parallel analytical workloads. We also discuss the multikernel OS architecture model, which we used as our basis for integrating the required support for task-based program execution units.

5.2.1 Expanding the Application/OS Process model

One of the most important design decisions for every multi-user facing application is its process model, i.e., how it handles the requests of multiple concurrent clients and how it maps its worker threads to OS execution units [HSH07]. As such, the process model of each DBMS depends on the functionality and mechanisms offered by the underlying operating system. Typically the assumption is that of the UNIX-based OS process model. For the purpose of this discussion we take the definitions provided in the book "Architecture of a Database System" by Hellerstein, Stonebraker and Hamilton [HSH07]:

1. An OS process combines an OS program execution unit with a unique and private address space. The single unit of program execution is scheduled by the OS kernel.

2. An OS thread is also an OS program execution unit, but without additional private address space. Instead the program address space is shared among all OS threads within the same OS process. Thread execution is also scheduled by the OS kernel.

3. A user-level thread is an application construct that allows multiple threads to be multiplexed within a single OS thread/process without the involvement or knowledge of the OS kernel.

User-level threads have been deployed in many systems, as they provide inexpensive task switching, relatively easy portability, and more application-level control over the thread scheduling policies. However, as we discussed earlier in the thesis, their usage means replicating a good deal of the OS logic in the application's space (e.g., task switching, thread state management, scheduling, etc.) [Sto81], which not only makes the programming model significantly more difficult, but is also ineffective in consolidation scenarios (i.e.,

when sharing the machine with other programs, and hence being unaware of the current system state and resource usage). The problem with the alternative of mapping DBMS workers to OS threads is that the operating system cannot differentiate long-running services in an application (e.g., a transaction manager) from short-running jobs, often linked to the execution of a query. Furthermore, the operating system cannot distinguish threads working on one parallel job from threads working on another. This is because traditional OS interfaces offer limited opportunity for parallel applications to express the intricacies of their algorithms. In this chapter, we propose extending the traditional OS process model to also support the OS task and ptask (parallel task) as OS program execution units. Similar to OS threads, they do not have their own private address space. They are implemented as part of a kernel-integrated task-based runtime, which can execute parallel jobs on behalf of different programs (applications) by switching to the corresponding process address space.

5.2.2 The multikernel OS architecture

Having support for the extended OS process model requires an OS architecture design that can implement the required functionality. In this case it means that the hardware contexts that are used to execute the OS tasks and ptasks should not be allowed to run kernel threads simultaneously. The multikernel model [BBD+09] proposes to design the OS as a distributed system to better match modern multi-socket multicore hardware, which already resembles networked systems. It has already been implemented by several research operating systems (e.g., fos [WA09], Hive [CRD+95], Barrelfish [bar16]). For example, in the Barrelfish OS the shared OS state (e.g., the Linux scheduler queues) is replaced with a globally partitioned and replicated OS state. More concretely, Barrelfish runs a kernel on each core in the system, and the operating system is built as a set of cooperating processes running on those kernels, which communicate via message passing and share no memory. Note that, as the state on each core is (relatively) decoupled from the rest of the system, a multikernel can run multiple different kernels (or versions of kernels) on different cores [ZGKR14]. In the next section, we describe how we leverage this feature of multikernels to specialize kernels for different types of workloads, and in particular to design a customized light-weight kernel responsible for running the task-based OS execution units.


5.3 Architecture overview

Badis is an OS architecture that dynamically partitions the machine's resources (e.g., CPUs, memory controllers, etc.) into a control plane, running a full-weight operating system stack, and a compute plane, consisting of our specialized lightweight OS stack, which is customized for running parallel data analytics using the newly introduced program execution units (tasks and ptasks). The lightweight OS stack consists of a kernel, a runtime, and selected library OS services, which we describe in more detail in the rest of the chapter. Figure 5.2 shows the system architecture of Badis, which is based on the multikernel model. The key underlying idea is that at any point in time the resources of a given multicore machine (e.g., multi-threaded cores, DRAM controllers, etc.) are partitioned into a control and a compute plane, but the allocation of resources between the two planes is dynamic and can be adapted at runtime based on the workload requirements. Furthermore, the figure illustrates that the Badis architecture is also suitable for addressing hardware heterogeneity and for customizing the OS support for different computational resources. The rest of this section describes the main system components and their properties.

Figure 5.2: Illustrating Badis – an adaptive OS architecture based on the multikernel model. The cores on NUMA node 1 each execute a separate kernel instance of the full-weight kernel (FWK). The cores on NUMA node 2 execute a common instance of the specialized light weight kernel (LWK A), while the computational units on the HW accelerator (e.g., a XeonPhi) execute another version of the light weight kernel (LWK B) that is optimized for the particular hardware platform.


5.3.1 Control plane

The control plane cores run the full-weight kernel (FWK) and various services and drivers, which form the core of the existing operating system. As the full-weight OS retains the standard functionality and services, it can continue hosting the main threads of all applications. The individual applications can then decide whether they want to execute all of their code on the control plane (e.g., using the standard thread-based scheduler) or offload some of their tasks (e.g., parallel analytical jobs) to be executed on the customized kernels on the compute plane cores. Furthermore, the control plane is in full control of managing the compute plane and the kernel instances it hosts. That entails spawning light-weight kernels, managing the dynamic resource allocation both between the two planes and among the compute plane kernels, and providing the mechanisms that allow applications to submit jobs to be executed on the compute plane kernels. Finally, it also provides an interface through which the compute plane kernels can request additional operating system services that were not integrated into them, so as to avoid unnecessary overhead and undesired interference.

5.3.2 Compute plane

The compute’s plane primary purpose is to provide a clean, noise-free (i.e., isolated) and customized environment solely used for executing data computations on behalf of the applications. It can consist of several (potentially different) customized OS stack(s): specialized light-weight kernels, and a selection of library services. The goal is to enable the system to specialize various aspects of the LWKs either to meet specific requirements of the application workloads, or to customize it for better resource management and task-execution on heterogeneous hardware platforms. Another important property of the compute plane is that it is dynamic. Each core that is allocated to the compute plane can be rebooted at runtime with a customized light-weight kernel (LWK). Even though it was not evaluated in this chapter, this feature enables Badis to swiftly adapt to workload requirements. From an application’s side interacting with the Badis compute plane resembles the use of interfaces offered by conventional task-parallel runtimes or batch systems. Figure 5.3


provides an illustration of the interaction and of the role of the different system components. Using the interface offered by the control plane, applications can simply enqueue (parallel) jobs to a given set of queues (labeled as (1)). Each queue has a designated kernel type, and only active instances of the corresponding kernel type in the compute plane can dequeue jobs (ptasks) from that queue (marked as (2)). As soon as one of the kernel instances dequeues a ptask, it extracts the tasks and distributes them to the computational units (e.g., CPUs) belonging to the same instance (labeled as (3)). Then the task-based execution units (recall the extensions we introduced to the OS process model) running on each of the cores switch to the address space of the corresponding application and execute the task on behalf of the application itself (marked as (4)).

Figure 5.3: Interacting with the compute plane: Applications submit parallel tasks to the ptask queue(s). The Badis compute plane dequeues ptasks and dispatches the individual tasks to the hardware contexts owned by the compute plane instance.
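To make the flow in Figure 5.3 concrete, the sketch below shows how a client application might construct and submit a parallel task. Only badis_ptask_enqueue appears in the figure; the remaining type and function names (badis_ptask_t, badis_ptask_create, badis_ptask_add, badis_ptask_wait) are hypothetical stand-ins, and the signatures are assumptions rather than the actual Basslet/Badis interface.

/* Hypothetical client-side view of the compute-plane interface. */
typedef void (*task_fn)(void *arg);
typedef struct badis_ptask badis_ptask_t;            /* opaque parallel task */

badis_ptask_t *badis_ptask_create(void);             /* assumed API */
int badis_ptask_add(badis_ptask_t *pt, task_fn fn, void *arg);
int badis_ptask_enqueue(badis_ptask_t *pt);          /* step (1) in Fig. 5.3 */
int badis_ptask_wait(badis_ptask_t *pt);             /* returns when all tasks ran */

static void pagerank_partition(void *arg)
{
    (void)arg;  /* process one graph partition on behalf of the client */
}

int submit_pagerank(void **partitions, int n)
{
    badis_ptask_t *pt = badis_ptask_create();
    for (int i = 0; i < n; i++)                       /* one task per partition */
        badis_ptask_add(pt, pagerank_partition, partitions[i]);
    badis_ptask_enqueue(pt);                          /* steps (2)-(4) happen on
                                                         the compute plane */
    return badis_ptask_wait(pt);
}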

5.3.3 Customized compute-node kernels

As mentioned before, the flexible design of the compute plane allows kernel customization. One can use it to specialize the kernel scheduling policy or the particular mechanisms used for specific workload requirements. For instance, transactional queries that are typically short-running can be executed by a pool of threads scheduled at very fine time intervals [TTH16] and spatially placed to benefit from constructive LLC sharing [PPB+12]; HPC-like workloads can be scheduled using gang scheduling [Ous82]; a mix of synchronization-heavy algorithms can use Callisto's scheduling mechanism [HMM14]. There are many other customizations that can be applied to the offered OS mechanisms, policies, and interfaces for various types of resources, not only CPUs. In the future work section of this chapter we briefly discuss some of our ideas regarding memory management, the interaction with I/O devices, etc.

A light-weight OS kernel can also be specialized for the heterogeneous computational resources present in modern and future machines. For example, the kernel customization can target power-efficient and specialized hardware architectures as proposed by the research community [AFK+09, WLP+14, LFC+12], as well as co-processors such as Intel's XeonPhi or GPGPUs, which are readily available today. Given their properties, the computational resources on such hardware platforms should not be engaged in system-wide decision making, maintaining the system state, or executing heavy OS services. Instead, they should only run a thin layer of the operating system, optimized and used solely for job execution.

5.4 Customizing a compute-plane kernel

Going back to the scheduling problem presented in Figure 5.1, the following section analyses the main factors leading to performance degradation in such workloads, and discusses the requirements for the runtime scheduler and the need for the task-based execution unit.

5.4.1 The need for a better OS interface

A key problem is that traditional OS interfaces offer limited to no opportunity for parallel applications to express the properties of their algorithms. One example is that the OS does not distinguish between threads working on one concurrent task over another. In Linux, in the absence of more expressive OS interfaces, system developers have for many years abused the cgroups mechanism to group threads, as opposed to processes, in order to let the OS know that they ought to be scheduled differently [Jon15]. However, cgroups v2.0 disallows this [Ram16], and the Linux community is still negotiating what the right mechanisms and interfaces should look like [Jon15]. Therefore, we argue that the operating system should provide parallel applications with an API that allows such information (e.g., a group of threads working on the same parallel job) to be passed down to the OS.


The above-mentioned extension to the OS interface is complementary to COD's interface that enables knowledge exchange to help the OS policy engine compute better policies when managing the available resources. Allowing the DB optimizer to push more information about the workload properties, in the form of cost models (Chapter ??) that suggest the scalability limitations of a particular algorithm, will enable the system to compute a more suitable degree of parallelism for concurrent workloads in noisy environments and avoid the problem of overprovisioning. We discuss the integration of Badis and COD in more detail in Section 5.8.

5.4.2 The need for run-to-completion execution

Runtime libraries (e.g., OpenMP, Cilk, the JVM, the .NET CLR) are widely used, as they capture more information about the program's parallelism and allow easier maintenance and portability across different platforms at the cost of performance. However, they lack the necessary global view and system-state runtime information. In the example shown in Figure 5.1, in the default mode (i.e., without relying on the user's input) every OpenMP runtime assumes it has all the machine's cores to itself. As a result, every additional PageRank client oversubscribes the CPU resources. The Linux kernel must then preempt and time-share the clients, i.e., multiplex multiple runtime threads over the hardware contexts.

High cost of a context switch

There are two scenarios where a context switch can significantly impact the execution time of the preempted programs:

1. In synchronization-heavy applications, preemption can lead to the well-known convoy effect [Bla79], especially when a thread is context-switched while holding a lock or just before reaching a critical barrier [HMM14, OWZS13].

2. When data processing applications are cache-sensitive, preemption can result in cache pollution and consequently in expensive DRAM accesses.

To evaluate the impact of context switching on a data-sensitive application, we use a micro-benchmark by Li et al. [LDS07].


[Figure 5.4: Indirect cost of context switch vs. working set size. The plot shows the cost of a context switch [ms] against the working set size (MB) for three machines: MagnyCours (5 MB LLC), Haswell (8 MB LLC), and IvyBridge (25 MB LLC).]

The benchmark uses a synthetic program that performs accesses on an array of floating point numbers to mimic random accesses to a hash table or a set of graph vertices, both of which are common operations in data processing workloads. Two identical instances of the program are executed repeatedly, and a context switch is made from one to the other instance after the array has been fully read. The cost of the context switch is then calculated by subtracting the time it took for one of the program instances to read the array without being preempted. Figure 5.4 shows the results from running this benchmark on three machines (AMD MagnyCours, Intel IvyBridge, and Intel Haswell) with different last-level cache (LLC) sizes (between 5 and 25 MiB). The experiment was run with different array (i.e., working set) sizes, varied between 1 and 32 MB. The results show that the cost of a context switch can increase from 0.7 microseconds (for smaller working set arrays on all machines) to more than 6 milliseconds (for arrays of size 25 MB on the Intel IvyBridge machine), or by almost four orders of magnitude. In fact, depending on the machine type and the working set size of the application, a context switch may take longer than the standard preemption latency of modern Linux schedulers (about 6 milliseconds [Tor16]). We further note that the cost of a context switch is highest when

the working set is roughly equal to the size of the LLC. However, this is a common case: in order to optimize execution on multicore architectures, modern data processing operators are highly tuned to use a working set of exactly the cache size [BATO13, CR07]. Thus, each preemption has the potential to destroy the carefully arranged data locality and hurt performance. Therefore, we argue that short-running latency-critical jobs, such as queries and data processing algorithms, should be executed without preemption, i.e., as run-to-completion tasks. Note that this can be best achieved within a specialized kernel, which will ensure that no other job is scheduled on the same core.
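For illustration, the following minimal sketch captures the structure of such an indirect-cost measurement (it is not the original code by Li et al.; the array sizes, the sequential access pattern, and the use of a second buffer to stand in for the preempting program instance are simplifying assumptions):

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sketch in the spirit of the micro-benchmark by Li et al. [LDS07]:
 * time a pass over a working set while it is cache-resident, evict it by
 * streaming through a second large buffer (emulating the preempting
 * instance), and time the pass again. The difference approximates the
 * indirect cost of a context switch. */

static volatile double sink;

static double timed_pass_ms(volatile float *a, size_t n) {
    struct timespec t0, t1;
    double sum = 0.0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = sum;                          /* keep the loop from being elided */
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

int main(void) {
    size_t ws_bytes = 8u * 1024 * 1024;              /* working set, e.g. 8 MB */
    size_t n = ws_bytes / sizeof(float);
    volatile float *a = malloc(ws_bytes);
    for (size_t i = 0; i < n; i++) a[i] = (float)i;

    double warm = timed_pass_ms(a, n);               /* working set in the LLC */

    size_t evict_n = (32u * 1024 * 1024) / sizeof(float);
    volatile float *evict = malloc(evict_n * sizeof(float));
    for (size_t i = 0; i < evict_n; i++) evict[i] = 1.0f;  /* pollute the LLC  */

    double cold = timed_pass_ms(a, n);               /* after the "preemption" */
    printf("indirect context-switch cost estimate: %.3f ms\n", cold - warm);
    return 0;
}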

5.4.3 The need for co-scheduling

Unfortunately, running a single thread to completion is not enough for parallel programs: a group of threads working on the same operation should ideally run simultaneously. Otherwise, an out-of-sync worker will become a straggler and slow down the whole parallel job. Even though one prominent use-case for co-scheduling is the workloads from the HPC community [HSL10], data-processing workloads also have synchronization steps where a straggler can impact performance [HMM14, OWZS13]. For example, in Chapter 2 we showed the effects of slowing down one of the scan threads on the performance of the whole CSCS storage engine, because the scan threads synchronize after finishing a full table scan for a given batch of requests before starting the next one. However, unlike gang-scheduling, where preemption is allowed, we propose co-scheduling where all tasks of a parallel job start at the same time and execute until completion. We believe that this is more favorable for two reasons: first, the jobs are not long-running and hence executing them to completion does not take a long time; second, parallel analytical jobs do not have many synchronization steps or other blocking operations (e.g., waiting for disk or network I/O).

5.4.4 The need for spatial isolation

The internal architectural details of modern multicore machines offer many opportunities for resource sharing and, hence, for load interference. For example, Figure 5.5 shows all the resources where hardware interference may occur for the Intel Sandy Bridge machine, which was used for some of the experiments presented in this thesis.


Figure 5.5: The architecture of an Intel Sandy Bridge processor with labeled opportunities for resource sharing: (1) the shared last level cache (LLC), (2) the local DRAM controller and channels – hence the DRAM bandwidth, (3) the QPI interconnect, (4) the private L1 and L2 caches among sibling HyperThreads, and (5) the HyperThread itself.

In fact, one of the principal factors in the observed interference in Figure 5.1 is contention on the memory subsystem. As we have shown in Chapter 4, the sharing of the LLC and the DRAM bandwidth in particular can result in significant drops in performance. Instrumentation shows that a single PageRank client running alone on one NUMA node uses about 18 GB/s of DRAM bandwidth. When executing alone on two NUMA nodes it uses 25 GB/s (this contributes to its runtime decreasing from 6.9 seconds to 4.9 seconds). However, when sharing the two NUMA nodes with another PageRank client, their individual DRAM bandwidth share drops to about 12 GB/s, which is less than what each one gets when running alone on one NUMA node. This is an important observation, especially given the objective of modern schedulers, which is achieving a good load balance across the cores of the multicore machine [LLF+16]. In this scenario, even though the load across the cores is well balanced, neither the overall system achieves its optimal throughput, nor do the individual PageRank jobs achieve their lowest runtime. The Linux scheduler nevertheless achieved its objective of a well-balanced work distribution among the cores of the system, and both PageRank jobs received a thread-to-core deployment which, had they been running in isolation, would have resulted in a better aggregate DRAM bandwidth and hence a lower runtime.
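To spell out the bandwidth arithmetic behind this observation (all numbers taken from the measurements above):

two clients sharing two NUMA nodes:   2 × 12 GB/s = 24 GB/s aggregate
two clients, one NUMA node each:      2 × 18 GB/s = 36 GB/s aggregate

The load-balanced spread over both nodes therefore sacrifices roughly a third of the achievable aggregate DRAM bandwidth compared to giving each client its own node.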


Therefore, we conclude that there is more to allocation than just cores, and one should also account for other resources such as shared caches, DRAM bandwidth, etc. In some cases, as in the example above and in our evaluation in Chapter 4, sharing these resources can lead to performance interference. In other cases, threads can benefit from constructive resource sharing to lower the cost of synchronization and communication. Given the properties of modern multicore machines, one such hardware island [PPB+12] is a NUMA node. Having its own cache hierarchy, several DRAM channels, and a memory bus, it is suitable for co-locating communicating [DGT13] and data-sharing tasks [PSM+15, LBKN14]. Moreover, the same properties make it ideal for achieving performance isolation (e.g., our discussion in Chapter 4, or the static allocation in Callisto [HMM14]), as it can restrict destructive resource sharing caused by cache pollution [LDC+09] and bandwidth contention [TMV+11]. Arguably, in contexts like data processing, for efficient performance isolation the unit of allocation should be entire NUMA regions rather than cores.

5.4.5 The need for data-aware task placement

Finally, data processing applications are sensitive to the location of the data they access, in particular when executing on large NUMA systems. A common algorithmic pattern in data processing is to access the data in multiple stages, thereby creating a data-flow path between parallel sub-jobs [RTG14, LBKN14]. In such cases, it is important to preserve data locality and reduce the traffic between the communicating jobs across the machine's interconnect. Prior studies have shown the effect that data and thread placement can have on performance [PSM+15]. Therefore, there is a need to enable such applications to:

• Specify their preference for the NUMA node on which they want to execute their job, and hence move the computation to the data; or

• Define that a sequence of parallel operations is to be executed in a pipeline and should thus execute on the same set of resources. That way the system can avoid unwanted migrations or inappropriate placement decisions by the OS (e.g., the operator pipelines in Chapter 3, or morsel-driven parallelism by Leis et al. [LBKN14]).


The design objectives of the Badis compute kernel optimized for task-based execution (which we refer to as Basslet) and of its runtime scheduler, explained in the next section (§ 5.5.2), are based on the needs and requirements discussed here.

5.5 Implementation

We chose the Barrelfish OS [bar16] (git revision 1500a71) as a basis for implementing Badis and Basslet primarily because it allows running a separate and different kernel on each core. Barrelfish is a freely available OS based on the multi-kernel design [BBD+09], and provides an infrastructure for dynamic core booting [ZGKR14]. This enables us to exchange or update kernels on a set of cores. In Badis, we make use of this functionality to realize the adaptive control/compute plane separation and the dynamic nature of the compute plane. Our current implementation targets the x86-64 architecture.

5.5.1 Control and Compute plane

Control plane The control plane cores run the full-weight OS, which in our case is based on Barrelfish. Apart from executing applications' threads, the control plane is also responsible for setting up the compute plane. The initial partitioning of resources between the two planes happens at boot time, when the system boots a separate kernel on all available CPUs. The control plane kernel runs a modified version of the boot manager from Barrelfish, which first spawns the Barrelfish OS only on the CPUs allocated to the control plane, and then interacts with the CPU boot manager to spawn the customized kernels on the cores dedicated to the compute plane. For the threads running on the control plane, the compute plane cores are invisible, i.e., these threads can only run on the control plane cores.

Compute plane The customized kernels, which run on the compute plane, are deliberately kept simple to avoid taking part in any globally coordinated OS routine, and to remove any interference coming from the OS services. Such simplicity makes the kernels easily scalable and allows them to be specialized for both the underlying hardware platform and the workload requirements. But it also comes at a price, because certain operations are no longer possible. For

instance, a job that is submitted for execution on the compute plane can access the address space of the corresponding process (which runs on the control plane) from the compute plane, but it cannot modify it. In Barrelfish, the kernels on the compute plane run on a per-core basis and can be spawned at roughly the cost of a context switch [ZGKR14].

Executing a job on Badis' compute plane In user-space, the applications using the control/compute plane architecture link to a library that provides a set of functions for submitting jobs to be executed on a (set of) dedicated kernel(s) on the compute plane (see Figure 5.6). Badis provides a set of queues implemented as shared data structures. When an application submits a job for the compute plane, the Barrelfish kernel will add the job (provided as a task) to the corresponding queue, and signal the related compute plane kernel (using inter-processor interrupts (IPIs)) that work is available. In addition, applications can use a messaging API for communication between threads executed on the control plane and tasks executed on the compute plane. On the compute plane side, dequeuing a job from the queue essentially means executing it as the new task/ptask program execution unit. In some sense this is similar to starting a thread: if the task is dispatched for the first time, the kernel switches to the address space of the owning process; otherwise it initializes the register set with the correct instruction and stack pointers and switches back to user mode. The rest of the chapter explains in more detail the implementation of our concrete compute plane kernel (Basslet), which was customized for executing parallel analytical jobs based on our analysis in Section 5.4.
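To make the submission path concrete, the following minimal sketch follows the calling convention used later in Listings 5.1 and 5.2 (the task body, the two-task sizing, and the type of the ptid handle are illustrative assumptions; Figure 5.6 additionally allows a locality hint on enqueue):

#include <stdint.h>
#include <stddef.h>
/* assumes the Badis user-space library header providing struct ptask
 * and the badis_ptask_* functions of Figure 5.6 */

/* Illustrative task body; tasks run to completion on a compute-plane core
 * within the submitting process' address space. */
void example_task(void *arg) {
    /* ... data-parallel work ... */
}

void submit_and_wait(void) {
    uint64_t ptid;                              /* handle type is an assumption */

    /* Build a ptask with two tasks (cf. Listings 5.1 and 5.2). */
    struct ptask *pt = badis_ptask_create(2);
    for (int i = 0; i < 2; i++) {
        pt->tasks[i].task = example_task;
        pt->tasks[i].data = NULL;
    }

    /* Enqueue to a compute-plane queue; the control plane signals the
     * corresponding Basslet instance via an IPI. */
    badis_ptask_enqueue(pt, &ptid);

    /* Block until all tasks of the ptask have run to completion. */
    badis_ptask_wait(ptid);
    badis_ptask_free(pt);
}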

5.5.2 Compute plane kernel – Basslet

We implemented the Basslet compute plane kernel by forking the existing Barrelfish kernel. A Barrelfish kernel in its basic form has support for multi-tasking, implements a capability system, and has a set of drivers for core-local devices such as interrupt controllers, timers, or MMUs. For Basslet, we kept the initialization routines of the existing Barrelfish kernel, but adapted the remaining system as follows:

• First, we replaced the thread-based scheduling mechanism of Barrelfish with a task-based scheduler that executes tasks to completion;

• Second, we replaced the system call interface with one used by Basslet tasks; and

• Finally, we set the interrupt controller to only accept interrupts specific to Basslet.

Badis Data Structures
struct task  { task_fn fun, void* arg, ... }
struct ptask { struct task* tasks, size_t count, ... }

Application API
badis_ptask_create(tasks) → struct ptask*
badis_ptask_enqueue(ptask, locality, ...)
badis_ptask_free(ptask)
badis_ptask_wait(ptask)
badis_ptask_abort(ptask)

Message API
badis_send(dest, msg, arg1, ...)
badis_wait_for(msg) → [data]

Figure 5.6: The API exposed by the control plane of Badis, which can be used by applications to enqueue and communicate with jobs executed on the compute plane. Each job is constructed using the task and ptask structures offered as new program execution units by Badis' process model.

In Badis, the compute plane cores can either run a separate, different kernel each, or several cores can be grouped into so-called instances of the same kernel. The instances are typically used for executing parallel jobs (ptasks). In our current prototype, the Basslet kernels are grouped into Basslet instances. These instances are currently sized to span an entire NUMA node, in order to provide support for spatial isolation and minimize resource interference. Note that this is just a decision based on the properties of current multicore systems; if desired, a Basslet instance can also be spawned on a smaller or larger scale, e.g., on core-groups in SPARC M7 [Phi14]. Within an instance, typically one of the kernels acts as a master and is responsible for coordinating the ptask execution over the entire instance (i.e., managing the jobs executed on the other worker kernels belonging to the same instance). More concretely, the master is notified by the control plane once a new ptask is available. It then tries to acquire a lock

on its queue to fetch the ptask. If successful, it notifies the workers in the instance (i.e., the other cores) that a new ptask is available. It then runs the task scheduler, which starts dispatching tasks until all the tasks of the ptask are executed. The workers immediately start executing tasks after they are woken up by the master and go back to sleep once the ptask is fully executed. The scheduler guarantees the run-to-completion of tasks, while the grouping of master and worker kernels into Basslet instances running on a dedicated NUMA node ensures co-scheduling and spatial isolation. We would like to note, however, that with co-scheduling there is always a danger of one of the tasks misbehaving or blocking indefinitely due to program bugs. Such a scenario could essentially stall an entire Basslet instance. As a solution, we use watchdogs, programming the local APIC timer to interrupt the task in case the execution takes too long. If the timeout is reached, the entire ptask is aborted and a failure is reported back to the application, which can then react accordingly (for example by restarting the ptask).
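The coordination just described can be summarized by the following structural sketch of the master kernel's dispatch loop. Every identifier below is a placeholder for Basslet-internal state and mechanisms mentioned in the text, not an actual Barrelfish symbol:

struct task;
struct ptask;

extern struct ptask *try_dequeue_ptask(void);   /* lock own queue, fetch ptask    */
extern void wake_workers(void);                 /* notify the instance's cores    */
extern struct task *next_task(struct ptask *p); /* NULL once all tasks are taken  */
extern void run_to_completion(struct task *t);  /* switch address space and run   */
extern void arm_watchdog(void);                 /* program the local APIC timer   */
extern int  watchdog_fired(void);
extern void abort_ptask(struct ptask *p);       /* report failure to application  */
extern void wait_for_ipi(void);                 /* sleep until the control plane
                                                   signals new work               */

void basslet_master_loop(void) {
    for (;;) {
        wait_for_ipi();
        struct ptask *pt = try_dequeue_ptask();
        if (pt == NULL)
            continue;                            /* lost the race for the queue   */
        wake_workers();
        arm_watchdog();
        struct task *t;
        while ((t = next_task(pt)) != NULL) {
            run_to_completion(t);                /* tasks are never preempted     */
            if (watchdog_fired()) {              /* a task ran or blocked too long */
                abort_ptask(pt);
                break;
            }
        }
        /* workers go back to sleep once the ptask is fully executed */
    }
}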

5.5.3 Basslet runtime libraries

Tasks running on a Basslet compute plane kernel can use the API provided by the Basslet library, as shown in Figure 5.7. The API currently supports a set of synchronization functions (similar to the ones exposed by pthreads) and dynamic memory allocation

Basslet Application API (synchronization)
bas_mutex_lock(bas_mutex)
bas_mutex_unlock(bas_mutex)
bas_cond_release(bas_cond)
bas_cond_wait(bas_cond, bas_mutex)
bas_cond_signal(bas_cond)
bas_cond_broadcast(bas_cond)
bas_barrier_wait(bas_barrier)

Basslet Application API (memory)
malloc(size)
free(ptr)

Figure 5.7: Basslet runtime API

using malloc and free. The synchronization primitives are implemented using x86 atomic instructions with exponential back-off. In practice, and as others have reported [DGT13], we found that this works well for our set-up, as tasks are never preempted and communicate with other tasks via the last-level cache. A Basslet kernel cannot allocate or map new memory into an address space. Therefore, if malloc ever runs out of the internal buffer space while called from within a task, it forwards the allocation request to the control plane using message passing. Currently, Basslet provides support for algorithms parallelized either using the OpenMP runtime or standard POSIX threads. In the rest of the section, we first sketch how a program originally parallelized using pthreads can be adapted to Badis tasks, and then discuss how we ported the OpenMP runtime on top of Basslet.
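A simplified sketch of this malloc fallback path follows. The buffer bookkeeping, the destination and message identifiers, and the return type of badis_wait_for are assumptions; only badis_send and badis_wait_for are taken from the messaging API of Figure 5.6:

#include <stddef.h>
#include <stdint.h>

extern void *local_buffer_alloc(size_t size);   /* placeholder: allocator over the
                                                   pre-mapped per-instance buffer  */
extern void badis_send(int dest, int msg, size_t arg);
extern uintptr_t badis_wait_for(int msg);       /* return type is an assumption    */

#define CONTROL_PLANE   0                       /* placeholder destination id      */
#define MSG_ALLOC       1                       /* placeholder message identifiers */
#define MSG_ALLOC_REPLY 2

void *malloc(size_t size) {
    /* Fast path: serve the request from the pre-allocated internal buffer. */
    void *p = local_buffer_alloc(size);
    if (p != NULL)
        return p;

    /* The Basslet kernel cannot map new memory itself, so forward the request
     * to the control plane and block until it replies with a usable pointer. */
    badis_send(CONTROL_PLANE, MSG_ALLOC, size);
    return (void *)badis_wait_for(MSG_ALLOC_REPLY);
}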

POSIX threads

The conversion from pthreads to Badis tasks is fairly straightforward. We show an example of how to translate spawning and joining threads into Badis ptasks in Listing 5.1. With the Basslet runtime API (Figure 5.7), the tasks are able to allocate memory and use various synchronization primitives. While such a restricted set of functions is relatively minimal, we found it sufficient for porting a set of widely used relational operators running inside a database (e.g., a radix-based hash join, or aggregation). In addition, the programmer can always use the messaging API to send requests back to the control plane. However, this incurs additional latency and may stall tasks unnecessarily, so it should only be used in exceptional circumstances. Such expensive calls (e.g., memory allocation) are typically done outside of the critical path of latency-sensitive operations, so we do not expect increasing their cost to be a problem.


Listing 5.1: Example of porting the POSIX thread creation using Badis’ API

/* Create a ptask with n tasks. */
struct ptask *pt = rt_ptask_create(n);

/* Instead of spawning threads, set up the individual tasks. */
for (int i = 0; i < n; i++) {
    set_thread_args(args, ...);
    pt->tasks[i].task = fn;
    pt->tasks[i].data = &args[i];
    // pthread_create(&tid[i], &attr, fn, (void*)&args[i]);
}

/* Once all tasks are set, enqueue the ptask to Basslet. */
rt_ptask_enqueue(pt, &ptid);

/* Instead of joining the threads, wait for the ptask. */
// for (int i = 0; i < n; i++) {
//     pthread_join(tid[i], NULL);
// }
rt_wait(ptid);

The OpenMP runtime

In our prototype we also ported an OpenMP [Ope15] runtime on top of the Basslet API. OpenMP programs consist of C/C++ code with annotations added by the programmer or by a DSL to automatically parallelize loops or other constructs. The annotations are read by the OpenMP compiler, which generates intermediate functions to compartmentalize loops and transforms the annotations into a series of calls to an OpenMP runtime library for execution on different cores. The runtime library is then in charge of distributing the work across the different cores. As an example, a simple implementation in Badis of the #pragma omp parallel annotation is given in Listing 5.2. In our implementation we also added a few optimizations. First, OpenMP typically assumes that thread 0, which initially invokes the GOMP_parallel_start function on the control plane, also takes part in the execution of the parallel section. Hence the nthreads-1

in Listing 5.2. In our implementation, however, we deliberately avoid that and only allow tasks (on the compute plane) to execute the parallel parts.

Listing 5.2: Example of OMP runtime using Badis’ API for #pragma omp parallel

GOMP_parallel_start(void* fn, void* data, int nthreads) {
    struct ptask *pt = badis_ptask_create(nthreads-1);
    for (int i = 0; i < nthreads-1; i++) {
        pt->tasks[i].task = fn;
        pt->tasks[i].data = data;
    }
    badis_ptask_enqueue(pt, &ptid);
}

GOMP_parallel_end(void) {
    badis_ptask_wait(ptid);
    badis_ptask_free(pt);
}

Second, very often the #pragma omp parallel statements are enclosed by a loop that iterates over some data until reaching convergence, i.e., for each iteration we enqueue a separate ptask. In that case, it is desirable to run the consecutive #pragma omp parallel constructs (ptasks) on the same Basslet instance and in sequence, in order to benefit from data locality and cache reuse. There are several ways to implement this second optimization in Badis. The simplest is to use a queue dedicated to a particular Basslet instance and enqueue all the ptasks of such a loop or data-flow pipeline to the same instance queue. An alternative is to spawn a single ptask, and then let the Basslet runtime use the message passing interface to hand off work units to the different Basslet workers executing the ptask. We currently only support a subset of OpenMP, namely the parallel pragma as well as dynamic and static for loops, which is sufficient for running many of the GreenMarl graph kernels used in our evaluation.
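As an illustration of this pattern (the PageRank-style kernel below is schematic and not taken from the GreenMarl sources; the helper functions and rank arrays are illustrative), each iteration of the outer loop produces one #pragma omp parallel region and hence one ptask, so routing all of these ptasks to the same Basslet instance keeps the graph data resident in that instance's NUMA node across iterations:

extern int converged(const double *old_rank, const double *new_rank, int n);
extern double compute_rank(int vertex, const double *old_rank);

void pagerank_like(double *old_rank, double *new_rank, int num_vertices) {
    while (!converged(old_rank, new_rank, num_vertices)) {
        /* Each iteration's parallel region becomes one ptask. */
        #pragma omp parallel for
        for (int v = 0; v < num_vertices; v++)
            new_rank[v] = compute_rank(v, old_rank);

        double *tmp = old_rank;          /* swap rank arrays for the next round */
        old_rank = new_rank;
        new_rank = tmp;
    }
}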

5.5.4 Code size

The Badis user-space library is currently about 1.5k lines of code. However, it does rely on several libraries from the existing Barrelfish source (e.g., for memory allocation,

existing data-structures, and messaging). The Basslet user-space library adds another 1.5k lines, of which the OpenMP runtime implementation accounts for about 1k lines of code. The changes to the existing Barrelfish kernel to adapt it into a Basslet kernel were relatively small. Overall, we changed 20 files in the tree, added 844 lines of code, and removed 122 lines. This includes some changes to the build scripts, which exclude some files entirely (such as the code to manipulate the capability data-structure, which is not required on the data plane). The Basslet kernel itself currently supports the x86-64 architecture. One important hardware requirement is a fast notification mechanism between the compute and control plane, which can be either interrupt- or memory-based (for example by using something similar to the monitor and mwait instructions on Intel).

5.6 Evaluation

Our experiments are run on the AMD MagnyCours machine with the following properties: Dell 06JC9T board with four 2.2 GHz AMD Opteron 6174 processors and 128 GB RAM. Each processor has two 6-core dies, each forming a NUMA node with 16 GB of memory and a 5 MiB LLC. The main advantage is that the machine has more NUMA nodes (eight in total), which allows us to scale the compute plane up to seven instances. Parallel data processing applications are the focus of our evaluation. From the GreenMarl [HCSO12] graph application suite (git revision 4c0d62e) we execute the following algorithms: (1) PageRank (PR), (2) Single-Source Shortest Path (SSSP), and (3) HopDistance (HD). We evaluate the performance of the three algorithms on the LiveJournal graph, which is the largest available social graph from the SNAP dataset [LK14]. It has 4.8 M 32-bit nodes and 68 M 32-bit edges (about 300 MB of raw binary data). The presented measurements do not include the time for loading the graph into memory. For a relational DB workload we use the hashjoin (HJ) operator [BTAO13], as described in Section 4.3.1. The first set of experiments evaluates the efficiency of the Basslet compute plane kernel when scheduling concurrent parallel workloads. We compare its performance with the same workloads executed using the default OpenMP and Linux schedulers.


[Figure 5.8 consists of four heatmaps showing pairwise slowdowns for PR, HD, and SSSP: (a) Linux+OpenMP 12/24, (b) Basslet + Badis 12/24, (c) Linux+OpenMP 6/12, (d) Basslet + Badis 6/12.]

Figure 5.8: Slow-down of PR, HD and SSSP algorithms when co-executed with a partner algorithm. Comparing their normalized runtime when running on Linux + OpenMP to their execution on Basslet + Badis.

5.6.1 Interference between a pair of parallel jobs

The first experiment measures the effects of interference when different pairs of parallel jobs execute concurrently. The experimental setup is the following: we execute the three GreenMarl graph algorithms on separate OpenMP runtime instances. We run them concurrently either using (1) the default Linux scheduler, with OpenMP choosing the degree of parallelism, or (2) the Basslet compute kernel runtime. We tested with


Basslet instances spawned on one or two NUMA nodes each. The reported numbers are the execution times normalized to a baseline experiment, which measured each algorithm's runtime when run in isolation on 6 and 12 cores (on the respective 1 and 2 NUMA nodes). When executing a pair of algorithms concurrently, for both setups we doubled the allocated resources, i.e., we assigned 12 and 24 cores (on the respective 2 and 4 NUMA nodes).¹ Ideally the normalized time should be 1, as for twice the load we allocated double the resources. The results from the run on Linux are presented in Figure 5.8a and Figure 5.8c. The heatmaps show the slowdown of the noisy execution relative to the runtime in isolation. The first observation is that there is significant performance degradation for all combinations of algorithm pairs. In some cases, the slowdown can reach up to 2.7x, despite having enough resources for both jobs to execute well. The second observation is that the degradation and interference get worse with a higher degree of parallelism, i.e., with the number of NUMA nodes used. This is an important insight, especially because NUMA nodes on more recent machines are becoming bigger (i.e., have more cores), so the effects of internal resource sharing are going to be exacerbated. In contrast, when the same combinations of algorithm pairs are executed on Basslet + Badis, the normalized runtimes are as expected (Figures 5.8b and 5.8d), i.e., the jobs' runtimes are almost unaffected compared to their runs in isolation. These results confirm the benefits of Basslet's runtime design decisions, in particular (1) the non-preemptive co-scheduling of the tasks belonging to the same ptask; (2) the spatial isolation of ptasks on complete NUMA nodes – which also limits the degree of parallelism for each job; and (3) the data-aware task placement, which was particularly important for the tight pipeline of ptasks within a loop, as generated by the GreenMarl compiler. As a result, Basslet's scheduler delivers the desired performance isolation, even in noisy environments.
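Put formally (notation mine, not the thesis's), each heatmap entry of Figure 5.8 reports

    slowdown(A | B) = T_A(co-run with B on 2n cores) / T_A(alone on n cores),   n ∈ {6, 12},

so a value of 1.0 corresponds to perfect performance isolation.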

5.6.2 System throughput scale-out

We revisit the problem statement experiment presented in Section 5.1, measuring how well different scheduling approaches do when increasing both the number of clients in the system and the allocated resources.

¹We switched off the remaining cores and let the OpenMP runtime choose the degree of parallelism and schedule the threads on top of the default Linux scheduler.


[Figure 5.9 plots the per-client throughput [PR/min/client] against the number of clients (1–8) for four machines: Intel SandyBridge, Intel IvyBridge, AMD MagnyCours, and AMD Bulldozer.]

Figure 5.9: Expanding the problem statement experiment (recall Figure 5.1) on four different machines.

First, we repeated the same experiment on three additional machines, to verify that the effects we observed on the AMD MagnyCours machine are not a special case:

1. Intel Sandy Bridge with four Intel Xeon E5-4640 processors, each containing one NUMA node with a 20 MiB LLC and eight hyper-threaded CPUs running at 2.4 GHz;

2. Intel IvyBridge with four Intel Xeon E5-4650 v2 processors, each containing one NUMA node with 24 MiB LLC and ten hyper-threaded CPUs running at 2.4 GHz;

3. AMD Bulldozer with four 2.4GHz AMD Opteron 6378 processors with two dies (a total of 8 NUMA nodes with 64 GiB each). Each NUMA node has a total of eight cores with a single hardware thread, and a shared 6 MiB LLC.

Hyper-threading is disabled for the Intel machines. Also for this experiment we run the PageRank algorithm on the LiveJournal graph. The baseline run uses the cores on the first NUMA node.² As a result, please note that the

²We explicitly disabled the rest of the cores on the machine by switching them off.

degree of parallelism used is machine-dependent. The results, presented in Figure 5.9, confirm that the problem of per-client throughput scaling in concurrent workloads is exhibited on all four multicore machines. Second, we measure the performance of three different scheduling approaches on the AMD MagnyCours machine:

1. First, we enable nested parallelism and set the concurrency level internally within the OpenMP program. We execute it on the default Linux scheduler.

2. Second, we evaluate a setting where multiple OpenMP runtimes execute on top of the default Linux scheduler, one OpenMP program for each PageRank client.³

3. Finally, we evaluate the system behaviour when the same parallel workload is executed using the Basslet runtime.

As a client in the workload, we again use the PageRank algorithm running on the LiveJournal social graph. Once again the baseline data point is taken with a single client running a job on 6 cores. For every subsequent client (another instance of the PageRank algorithm) we allow the system to use an additional 6 cores. The rest of the cores remain disabled. The reported system throughput is the inverse of the total time needed to execute all PageRanks, i.e., the throughput as perceived per client. The results are presented in Figure 5.10. They show that the performance interference among multiple clients increases as we add more clients, both when OpenMP alone and when OpenMP+Linux schedule the resources, despite there being sufficient resources. The machine has forty-eight cores, so there are enough resources to execute eight concurrent PageRanks within one OpenMP runtime, or as separate OpenMP runtimes on top of Linux. In contrast, Basslet achieves almost perfect per-client throughput scale-out up to seven clients. The final six cores, belonging to the first NUMA node, are dedicated to the control plane, which limits the scalability to seven Basslet instances. We would like to note, however, that if the multiple OpenMP runtimes were aware of each other and used a static resource partitioning in a noise-free environment, by limiting the degree of parallelism (e.g., by setting OMP_NUM_THREADS) and by thread pinning (e.g., by using libnuma or the GOMP_CPU_AFFINITY flag) on NUMA hardware islands, we would achieve almost the same scalability as with Badis+Basslet.

³This is the setup used in the previous experiment.


[Figure 5.10 plots the per-client throughput [PR/min/client] against the number of clients (1–8) for Basslet, Linux+OpenMP, OpenMP, and the ideal scale-out.]

Figure 5.10: Throughput scale-out when executing multiple PRs using internal OpenMP parallelism (option 1) vs. Linux+OpenMP scheduler (option 2) vs. Basslet (option 3).

5.6.3 Comparing standalone runtime: Linux vs. Basslet

The goal of this experiment is to compare the absolute runtimes of the algorithms when executed on Basslet + Badis versus Linux. All algorithms are executed in isolation and with a degree of parallelism equal to the number of cores in a single NUMA node. For both systems, the algorithms were executed on cores belonging to the same NUMA node. The results, shown in Table 5.1, indicate that the algorithms' runtimes on Basslet are comparable to the ones measured on Linux. We would also like to point out that the compute plane could be further customized to improve the performance of such workloads.

5.6.4 Overhead of Badis enqueuing

This experiment measures the overhead (additional time/cost) of enqueuing parallel jobs for execution on the compute plane. It measures the time of issuing the enqueue system call in two situations: (1) when the compute plane still has enough resources to execute the new job concurrently, and (2) when it is saturated and the new job needs to queue.


Table 5.1: Runtime of parallel algorithms executing on Linux versus Basslet.

Algorithm (input data)        Execution time (ms)
                              Linux      Basslet
Hash join (128M x 128M)       4787       3316
PageRank (LiveJournal)        6712       6509
Hop-Dist (LiveJournal)        515        542
SSSP (LiveJournal)            3390       3491

For this experiment, as a parallel job we use the parallel hashjoin operator and the compute plane runs Basslet instances. We measure the execution time of the hashjoin (HJ) before invoking ptask_enqueue and after returning from ptask_wait, as well as the execution time of the algorithm within the Basslet instance. In order to generate enough load to saturate the compute plane (i.e., spawn more hashjoin clients), we dedicated the cores on two NUMA nodes for the control plane, and use the cores on the remaining six NUMA nodes for the compute plane with six Basslet instances. The performance of the hashjoin is typically evaluated in number of cycles per output

[Figure 5.11 plots the latency [cycles/tuple/client] against the number of clients (0–12), showing both the total time (incl. queuing) and the runtime within a Basslet instance.]

Figure 5.11: Overhead of an enqueue syscall and the queuing effects in Badis using HJ.

tuple [BTAO13]. The join in this experiment is executed on input relations with 32 million 64-bit tuples. The results, shown in Figure 5.11, indicate that the cost of enqueuing a ptask is quite low. They also show that as soon as there are not enough Basslet instances in the compute plane to take over the enqueued ptasks, the wait time increases. We also note that the runtime within a Basslet instance remains stable for all jobs, despite the noise in the system.
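For reference, the per-client latency metric of Figure 5.11 can be read as (notation mine):

    latency [cycles/tuple] = (t_after_wait − t_before_enqueue) · f_core / N_output_tuples,

where f_core is the 2.2 GHz clock frequency of the MagnyCours machine and N_output_tuples is the number of result tuples produced by the join.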

5.6.5 Evaluating the adaptive feature of Badis

Finally, we put into perspective the adaptive functionality of the architecture for swapping a kernel by comparing it to other common operations:

• Performing a context switch,

• Resizing a Linux container – used for example in [Mer14], and

• Resizing a virtual machine (VM).

The numbers, taken from prior work, are presented in Table 5.2. It shows that kernel swapping in Badis can be on-par with a heavy context switch, i.e., it is in the range of a millisecond. That makes it several orders of magnitude faster than adjusting the resource allocation to an application running in a Docker container or on top of a Virtual Machine. It also only accounts for a fraction of the execution time of an analytical job, so any adjustments are unlikely to be noticed as delays.

Table 5.2: Time to adjust the resource allocation for different mechanisms/systems.

Mechanism/System                  Range for adjustment time [us]
context switch [Sec. 5.4.2]       0.6 × 10^0 – 6 × 10^3
Badis (Barrelfish) [ZGKR14]       (0.5 – 1.2) × 10^3
Linux containers [Pan15]          (1.7 – 7.8) × 10^6
Virtual Machines [Pan15]          9 × 10^8


5.7 Related work

5.7.1 Scheduling parallel workloads

Several systems have addressed the challenges of scheduling parallel applications on modern multicore systems. Tessellation [LKB+09] used space-time partitioning to factor the machine's resources into separate units and virtualize them for user-level runtimes. In contrast to Badis, Tessellation does not specialize the OS kernels for task- or thread-based workloads. Instead it exposes “harts” (1:1 abstractions of physical hardware threads) and relies on user-level schedulers [CEH+13] for the fine-grained thread and memory management. Such two-level scheduling has also been adopted in other operating systems [MSLM91, ABLL91, WA09]. The high performance computing (HPC) community has also explored scheduling techniques such as gang-scheduling [Ous82] or flexible co-scheduling [FFPF05]. In Basslet, ptasks run coordinated and with temporal locality. In contrast to gang-scheduling, our system does not require complex scheduling logic across the whole machine [Pet12]. In Chapter 4 we already outlined the challenges of scheduling parallel analytical algorithms on modern multicore machines. However, as workloads become increasingly complex and diverse (e.g., not only HTAP or operational analytics but also graph processing, R, etc., on top of traditional query processing engines), these systems must efficiently schedule heterogeneous workloads with sometimes conflicting requirements. Cloud schedulers [OWZS13, SKAEMW13, DK16] face similar problems for a heterogeneous mix of interactive and batch workloads. Teabe et al. [TTH16] identified five main application types with different requirements for the length of the scheduler quanta, and scheduled the vCPUs on pools of physical CPUs with the desired quantum length. However, this analysis still falls short of addressing the conflicting scheduling requirements of HTAP systems experiencing severe performance interactions [PWM+15].

5.7.2 Scheduling of and within Runtime systems

Programming models as provided by Cilk [FLR98], X10 [CGS+05], or OpenMP [Ope15] offer a convenient way to express parallel algorithms that is simple both for the programmer to use and for the runtime to parallelize. However, they typically have little knowledge about the overall system state or the jobs' requirements. For instance, libgomp


(from OpenMP) uses the average load over the last 15 minutes to estimate the suitable degree of parallelism in dynamic mode [Inc16]. And as we have seen in our experiments, such miscalculations of the degree of parallelism can easily result in oversubscription in the presence of other applications or multiple parallel runtimes. Harris et al. [HMM14] also identified the interference problems among co-existing parallel runtime systems. The authors propose Callisto, a shared library that helps parallel runtimes coordinate their execution within a single machine, to avoid over-subscription of CPU resources, but also to enable reuse of allocated but unused cores. Lithe [PHA09] addresses the problem of composability of such parallel runtimes within a single application by building a hierarchical assignment of “harts” (1:1 abstractions of physical hardware threads) in order to avoid over-subscription. Both Callisto and Lithe have the shortcoming of solving the problem in user space without an overview of the system state and of the other services and applications currently executing in the system. Nevertheless, we believe that both approaches bring valuable ideas that can also be explored in the new dynamic and flexible OS architecture that allows kernel specialization and supports task-based co-scheduling of parallel jobs.

5.7.3 Specialized kernels

High performance computing (HPC) systems are very sensitive to so-called OS “noise”. Such noise, aggravated by the scale at which these systems run, results in severe performance problems [HSL10]. Thus, supercomputing systems initially proposed the idea of using customized lightweight kernels [RMG+15] (e.g., Catamount [KB05], CNK [GGIW10]), a technique we also use for the Badis compute plane. Our approach uses separate kernel programs to specialize the kernel on a compute plane instance, while other approaches rely on exchanging the kernel code on a per-process basis [CCS+15]. HPC processes are synchronization and coordination heavy, and these kernels optimize for such effects. Additionally, they do not have to worry about high workload concurrency and time-sharing of the machine's resources. Basslet, in contrast, provides a task-based runtime execution for general parallel data-processing workloads, for performance isolation on commodity servers. In parallel, however, the HPC community has also been exploring the design space of a multikernel by having light-weight kernels run alongside full-weight kernels like Linux inside the same system. Example systems include FusedOS [PVHH+12],

mOS [WIK+14], and [GTI+15]. Shimosawa et al. [SGT+14] provide a framework (IHK) which allows easy booting of the LWKs by providing the basic functionality for CPU initialization, inter-kernel communication, and resource partitioning (e.g., of CPU cores and physical memory). The idea of offloading system calls from the compute plane kernels to the control plane FWK is inspired by FlexSC [SS10].

5.7.4 OS mechanisms for scheduling and performance isolation

Decades of work in systems research and software development have contributed many different mechanisms and scheduling policies. Here we only briefly discuss a few of the ones that are closely related to our proposed solution. Solaris [Ora09] uses locality groups to describe NUMA machines and to set thread and memory affinities for an entire group of threads. Linux, on the other hand, uses cgroups [Heo15] to divide tasks into hierarchical groups and perform fine-grained scheduling for every group. As we discussed before, even though the mechanism was primarily designed for managing OS processes, practitioners were abusing it to group threads in order to notify the OS that they should be scheduled differently, which is no longer possible. Windows uses fibers as a light-weight alternative to threads for co-operative multitasking. This has been used by the Microsoft SQL Server in some high-scale transactional processing benchmarks [HSH07]. However, in comparison to Badis, fibers are multiplexed on a set of OS threads. Rossbach et al. [RCS+11] introduce a PTask abstraction for describing parallel work units on GPUs, which is very closely related to Basslet's ptask execution unit. Other researchers from the database community have also explored the benefits of scheduling database operations using task-based or morsel-driven parallelism [LBKN14, PSMA13]. Fos [WA09] runs OS services in user space, similar to a micro-kernel. It does not preempt them during execution, but rather uses a cooperative model where a service yields at specific locations in the program. Badis also adopts a cooperative execution model for the Basslet instances. But in contrast to Fos, the Barrelfish user-space OS services remain scheduled as threads on the control plane.


5.7.5 Linux containers and virtualization

Industry solutions like Docker containers [Mer14] rely on cgroups to solve dependency conflicts and provide independence. In contrast to Badis, the resource provisioning of a Docker container is done manually. The solution provided by Badis is orthogonal and relies on customized kernels for the compute plane. However, Docker could benefit from a system like Badis to compartmentalize further and run different Linux kernel versions on certain cores. Hypervisors like Xen [BDF+03] can run multiple operating systems on the same machine by using virtualization techniques. The idea of specialized execution with multiple virtual machines has been proposed in the past [BDSK+08]. However, in our current work hypervisors are orthogonal to Basslet, which aims to provide better support for scheduling a mix of data-processing applications, working on the same data, within a single OS. Library operating systems follow the idea of reducing abstraction in the OS kernel, and instead only mediate safe access to the underlying hardware [EKO95]. Applications then use libraries to abstract the access to the low-level hardware. Such a design allows applications to specialize for various workloads, or enables sand-boxing and lightweight virtualization [PBWH+11]. Both Barrelfish and Badis are implemented as library OSes. The unikernel approach [MMR+13] pushes the library OS concept to the extreme by running only a single address-space machine, which a cloud provider may run on a hypervisor (to avoid overhead due to layering) or directly on bare-metal hardware. Their evaluation shows the performance improvements of customizing the system stack for specific applications, and using Badis' compute plane kernels we hope to be able to achieve similar benefits in a less constrained setup.

5.8 Integration with COD’s policy engine

In this section we briefly discuss how the Badis OS architecture can be integrated with COD's policy engine and its rich declarative DB/OS interface. The OS policy engine is an integral component of the control plane OS. If Badis can choose among several customized compute plane kernels available for different classes of workloads (e.g., with customized schedulers), then it is the OS policy engine in the control

plane that decides how to allocate resources to the compute plane kernels and how to link the applications and their jobs to the corresponding compute plane kernels. COD's declarative interface allows the database to pass information about the properties and preferences of its components down to the OS policy engine. This information can then be used to compute the assignment of the application's working units (e.g., user-level threads) onto hardware threads (and other resources). Therefore, a database can push different preferences for its components depending on their properties. For example, a thread pool that serves short-running, latency-critical, and synchronization-heavy queries will have requirements for a thread-based scheduler with short quanta and optimized storage and I/O accesses. These jobs can then be dispatched for execution to a compute plane kernel which is customized accordingly. Similarly, the database can use the new system call interface to invoke ptask-based scheduling for its parallel analytical jobs on NUMA hardware islands.

5.9 Summary

In this chapter we presented Badis, our OS architecture that allows dynamic adjustments of the OS kernels and services based on workload requirements. Badis' control/compute plane separation enables complex applications to execute efficiently on modern machines, by providing adaptive and customizable OS stacks. Furthermore, we showed how a compute kernel can be specialized for efficient scheduling of parallel analytical algorithms by implementing Basslet, a kernel-integrated runtime scheduler for ptask program execution units. In our experiments, we achieved almost linear throughput scale-out and predictable runtimes for heavy analytical workload mixes. While our prototype is implemented over a multikernel, it is reasonable to ask whether similar benefits could be obtained by modifying a monolithic kernel like Linux. A radically new scheduler (similar to Basslet), leveraging recent proposals for fast core reconfigurability in Linux [PSK15], might be able to achieve similar results for certain workloads but may introduce a penalty for others. Currently the Basslet kernel does not directly address I/O issues, but the Badis architecture makes it easy to integrate the control/data plane design proposed by systems like Arrakis [PLZ+14]. The use of different kernels in the two planes raises the issue of providing stronger isolation between them. Using the techniques proposed by nested kernels [DKD+15] or Mondrix [WRA05] remains to be explored as part of future work.

6 Future work

In this thesis we explored two aspects of the co-design of databases and operating systems. Our focus was primarily on the allocation of (homogeneous) CPU resources for analytical workloads on a single multicore machine. First, in Chapters 2 and 3, we revisited how to improve the policy decision-making process by using application knowledge for resource management on modern multicore machines. Second, in Chapter 4 we revisited the concurrency vs. parallelism argument in the context of modern analytical workloads on NUMA systems, and in Chapter 5 we proposed an extension to the existing OS process model with OS support for task-based co-scheduling of parallel jobs (ptasks). However, the interaction between a database and the operating system spans many other verticals of the system stack and involves different types of resources and workload requirements, which were not covered in this thesis but can be addressed as part of future work. In the first half of this chapter we cover a few of these opportunities. We conclude with a discussion of how some of the concepts and techniques presented in this thesis can be applied beyond the context of database/operating system co-design.


6.1 Follow up work

6.1.1 Managing different types of resources

Machines have a plethora of resources, and operating systems and applications need to collaborate to make the most of them. As already motivated in the previous chapters of this thesis, we can no longer assume a homogeneous set of resources, and there is more to resource allocation than CPU cores.

Memory management The effects of non-uniform memory accesses (NUMA), local DRAM and interconnect bandwidth, and their implications on synchronization and memory movement have been extensively studied both in the database community (e.g., [PPB+12, PSM+15, LPM+13]) and the systems community (e.g., [ZSB+12, DFF+13, DGT13]). Similar analyses and optimizations were also done for the role of the translation lookaside buffer (TLB) and for reducing the costly overhead of virtual-to-physical memory translation [KKL+09, YRV11]. Therefore, it does not come as a surprise that for certain classes of applications and workloads, the virtual memory system is simply becoming a bottleneck for main-memory data processing. This inspired recent proposals such as direct segments [BGC+13], which allow circumventing the costly virtual-to-physical address translation for large regions of memory. Furthermore, the page swapping policies of conventional operating systems do not take into account application-specific knowledge about the content of pages. It has been shown that this can have a negative impact on the performance of database systems [GVK+14], and such effects are likely to become more pronounced with the introduction of NVRAM [Kim15]. Unfortunately, the conventional operating system APIs for manipulating a process' address space still try to shield the application from all this complexity. This particularly limits the opportunities for an application like a DBMS to directly manage memory and its own page-tables, or to use self-paging to swap its data out to disk or NVRAM. Recent proposals in OS memory systems provide greater flexibility by exposing both physical and virtual memory directly to applications and allowing them to safely construct their own page-tables [GZA+15]. Using such an approach, database applications can take over the

control over memory management and thus avoid the potentially sub-optimal global policies enforced by the operating system. Finally, some database engines have explored the idea of using the OS and hardware support for copy-on-write via sometimes heavy and slow system calls like fork. It would be interesting to explore how newer OS abstractions like SpaceJMP [EHMZ+16], which allow for fast address space switching, can be used in combination with copy-on-write to provide a more lightweight alternative to forking.

More efficient use of and access to I/O devices In this thesis we did not consider workloads that span multiple machines or that work on data which does not reside in the memory of the local machine. There are, however, many applications for which these assumptions do not hold: for instance, transactional workloads or other latency-sensitive data processing applications which communicate with clients, rely on cross-machine replication (e.g., for reliability, performance isolation, or scalability), process data which is distributed across multiple machines (e.g., FaRM [DNCH14, DNN+15], HERD [KKA14], rack-scale join processing [BLAK15], or HyPer's rack-scale query processing [RMKN15]), or build data processing systems whose architecture is separated into layers [Loe15, LLMZ11, LLS+15]. In many of these systems, it is important that the system stack supports efficient processing of data requests over the network – in particular, solutions which completely remove the OS from the data path, but still provide means for controlling the policies and the security aspects among several applications. Due to the lack of support from conventional operating systems, most data processing engines today which optimize their implementations for network operations use Infiniband (i.e., they remove the OS from the critical path and rely on the network card (NIC) scheduler to do the resource multiplexing, following certain preferences, like priorities, given by the applications). However, most I/O hardware devices available today have support for virtualization, which allows virtual machines to have direct control over the entire device or parts of it. Recently, research operating systems such as iX [BPK+14] and Arrakis [PLZ+14] use this functionality to give applications direct access to the exposed hardware (disk or network). The reported performance for client requests on Arrakis to a persistent NoSQL store showed 2x better read latency, 5x better write latency, and 9x better write throughput compared to Linux. It would be interesting to explore how the policies devised by the control plane (i.e., the OS) can be improved given more knowledge of the applications' requirements.


Heterogeneous hardware At the moment, operating systems try to hide the underlying hardware complexity and diversity as much as possible from the applications running above. While this made sense in the past, such an approach today is very restrictive, as many of the resources are left underutilized and applications cannot get the most out of the available hardware. From the perspective of the OS policy engine, some of the interesting challenges are: how to capture and model the diversity in resources (e.g., their capabilities, capacities, and other properties), how to monitor their utilization (given the many performance counters), and whether and how to expose some of that information to the applications running on top. For example, it is increasingly common for machines to have heterogeneous computational devices, e.g., a powerful Intel Xeon processor alongside an FPGA (HARP machine), a highly parallel Xeon Phi, or a small and energy-efficient processor. An interesting question is how the OS can assist data processing applications in mapping their work units onto such a diverse set of platforms; we discuss this further later in the context of the DB optimizer. Furthermore, current trends in hardware architectures suggest that we are going to have “active” everything (e.g., integrated data processing at line rate in the chip (DMX in SPARC M7 [A+15]), smart NICs, active near-memory processing [AHY+15, AYMC15, SMB+15], intelligent HDD/SSD storage [WIA14], etc.). As opposed to treating them as devices with external drivers (as was done with GPGPUs [RCS+11]), we should have the operating system manage their computational capacities (i.e., the control plane) and export the device services directly to the applications (i.e., the data or compute plane). Finally, recent advances in operating systems also allow for fast core booting [ZGKR14], which can be used for more energy-efficient resource utilization, or even for handling the upcoming challenges of dark silicon [EBSA+11]. This is another aspect that we would like to explore as part of future work.
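As a sketch of what exposing such diversity could look like, the following C fragment models heterogeneous compute resources as capability descriptors that a policy engine could match against application requirements. The struct layout, the device list, and the matching rule are illustrative assumptions, not the representation used by the actual OS policy engine.

```c
/* Hypothetical descriptors for heterogeneous compute resources and a naive
 * requirement matcher. A real policy engine would reason over far richer
 * properties (topology, monitored utilization) and over multiple jobs at once. */
#include <stdbool.h>
#include <stdio.h>

enum dev_kind { DEV_CPU, DEV_XEON_PHI, DEV_FPGA, DEV_SMART_NIC };

struct compute_resource {
    enum dev_kind kind;
    int           cores;            /* parallel execution contexts            */
    int           mem_bw_gbps;      /* local memory bandwidth                 */
    bool          line_rate_filter; /* can filter/scan data at line rate      */
};

struct job_requirements {
    int  min_cores;
    bool needs_line_rate_filter;
};

/* return the first resource satisfying the requirements, or NULL */
static const struct compute_resource *
match(const struct compute_resource *res, int n, const struct job_requirements *req)
{
    for (int i = 0; i < n; i++)
        if (res[i].cores >= req->min_cores &&
            (!req->needs_line_rate_filter || res[i].line_rate_filter))
            return &res[i];
    return NULL;
}

int main(void)
{
    const struct compute_resource machine[] = {
        { DEV_CPU,       16, 60,  false },
        { DEV_XEON_PHI,  61, 300, false },
        { DEV_SMART_NIC,  4, 40,  true  },
    };
    struct job_requirements scan = { .min_cores = 1, .needs_line_rate_filter = true };

    const struct compute_resource *r = match(machine, 3, &scan);
    printf("scan placed on device kind %d\n", r ? (int)r->kind : -1);
    return 0;
}
```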

6.1.2 Supporting workloads beyond traditional analytics

In the previous chapter we motivated the need for supporting operational analytics. The proposed OS architecture (Badis) in Chapter 5 already provides the necessary foundation for such OS support – an adaptive control and compute plane separation which can adjust based on the workload requirements. Several research groups have recently been exploring the challenges of co-executing transactional and analytical workload mixes (e.g., HyPer [KN11], SAP HANA [Pla09], MemSQL [Mem16]).


A recent study by Psaroudakis et al. [PWM+15] has shown that some of these systems (i.e., HyPer and SAP HANA) struggle to provide predictable performance, and one important factor is resource interference between the threads operating on the two different workload types. We believe that two problems contribute to that behaviour. The first is the process model employed by the DBMS engine, i.e., how it maps incoming queries to internal worker threads, and how these are then mapped onto hardware contexts (and kernel threads). The second is the scheduling support and QoS guarantees it can receive from the underlying system stack (including the scheduler in the OS kernel, but also the schedulers of the various devices for disk and network I/O). Some of the techniques, mechanisms, and policies that we presented in this thesis can be applied within these systems and address a significant portion of the observed performance degradation due to poor resource management decisions. For example, the idea of spatial isolation when placing the DB worker threads onto cores can shield the transactional workload from the bandwidth-intensive queries of the analytical workload, as sketched below. Similarly, for predictable behaviour, the CPU resources can be allocated on a per-execution-engine basis rather than on a per-query/connection basis. Furthermore, if the engines were to use the adaptive Badis control/compute plane separation, the transactional component could benefit from thread-based scheduling with a short quantum on the control plane, while the analytical queries could use ptask-based scheduling on the Basslet runtime. It would be interesting to explore policies which adjust the control/compute plane separation and evaluate how that affects the performance of data processing engines serving such HTAP workloads.
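As a concrete example of the spatial isolation idea mentioned above, the following C sketch pins transactional and analytical worker threads to different NUMA nodes using libnuma. The worker bodies are placeholders, and the node assignment assumes a machine with at least two NUMA nodes.

```c
/* Minimal sketch of spatial isolation: OLTP workers stay on NUMA node 0,
 * analytical workers on NUMA node 1, so bandwidth-heavy scans do not share
 * sockets or local memory with latency-sensitive transactions.
 * Build with: gcc -pthread isolate.c -lnuma */
#define _GNU_SOURCE
#include <numa.h>
#include <pthread.h>
#include <stdio.h>

static void *oltp_worker(void *arg)
{
    (void)arg;
    numa_run_on_node(0);          /* transactional threads stay on node 0 */
    numa_set_preferred(0);        /* and allocate from node-0 memory      */
    /* ... process short transactions ... */
    return NULL;
}

static void *olap_worker(void *arg)
{
    (void)arg;
    numa_run_on_node(1);          /* analytical threads get node 1        */
    numa_set_preferred(1);
    /* ... run bandwidth-intensive scans and joins ... */
    return NULL;
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }
    pthread_t t[2];
    pthread_create(&t[0], NULL, oltp_worker, NULL);
    pthread_create(&t[1], NULL, olap_worker, NULL);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
    return 0;
}
```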

6.2 Beyond DB/OS co-design

We continue the discussion on how the ideas proposed in this thesis can have a broader impact beyond the interaction of databases and OSes. First, it would be interesting to explore other applications that may benefit from closer integration with the underlying operating system. Many modern machine learning and analytics workloads have similar properties to more traditional database systems. For instance, they often have a detailed understanding of their internal algorithms and data structures, have specific objectives they need to meet, and can benefit from a more efficient implementation on modern hardware platforms.

As such, many machine learning or graph processing applications can benefit from cross-layer optimizations across the system stack, including the runtime and the operating system. It would be interesting to explore how they can benefit from exposing information about their cost models, SLOs, or dataflow graphs to the OS policy engine, or from the Badis architecture when executing on heterogeneous computing platforms (e.g., co-processors like the Xeon Phi or external accelerators such as Google’s Tensor Processing Unit (TPU) [Goo16]). One limitation of the COD architecture is the assumption that the applications (in our case the database) are long-running and internally can have long- or short-running jobs. When an application consists of only a few (or a single) short jobs, this causes a large overhead. It would be interesting to explore how to schedule such applications in a COD-based OS.

Second, bridging the gap between application and system-state knowledge can also be beneficial for schedulers working at a larger scale, such as in data-center or cloud environments. Even though we have primarily focused on scheduling and resource management in a single operating system for multicore systems, some of the proposed techniques can be applied in a distributed environment. For example, allowing applications to specify cost models for their jobs can assist in mapping resource requirements to their service level objectives. Similarly, having information about the data-dependency graph as a DAG and the individual jobs’ properties can already result in more efficient resource scheduling (e.g., in Graphene [KR16]). It would be interesting to explore what additional knowledge can be provided by applications that can be leveraged by system schedulers at such a scale.

Finally, an interesting opportunity would be to leverage the customizability of the Badis compute plane to extend the current possibilities of widely used Linux containers (e.g., Docker [Mer14]). Currently, Linux containers bundle an application’s dependencies and provide performance isolation in a more lightweight manner than running in a VM by using three mechanisms: images for bundling the application with its dependencies, namespaces for software isolation, and Linux mechanisms that can be attached to cgroups (e.g., core-pinning, constraining memory allocation, and/or device I/O provisioning) for hardware resource isolation; a minimal example is sketched below. Current containers, however, cannot solve performance isolation problems related to shared resources (e.g., DRAM bandwidth). Further, as the Docker manager executes on top of a single Linux image, it does not provide customized OS interfaces, mechanisms, or policies for different applications’ requirements. Therefore, it would be interesting to explore the benefits of using the Docker container as a packaging mechanism on top of a customized OS stack.
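To illustrate the cgroup-based hardware isolation mentioned above, the following C sketch creates a group, restricts it to a subset of cores and a memory cap, and moves the calling process into it. It assumes cgroup v2 mounted at /sys/fs/cgroup with the cpuset and memory controllers enabled, root privileges, and a made-up group name; as discussed above, such limits still do not partition shared resources like DRAM bandwidth.

```c
/* Sketch of cgroup v2 based resource isolation for a container-like group.
 * The group name "db-analytics", the core range, and the memory cap are
 * illustrative values, not part of any standard configuration. */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_file(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fputs(val, f);
    return fclose(f);
}

int main(void)
{
    const char *grp = "/sys/fs/cgroup/db-analytics";
    char path[256], pid[32];

    mkdir(grp, 0755);                               /* create the group (needs root) */

    snprintf(path, sizeof path, "%s/cpuset.cpus", grp);
    write_file(path, "8-15");                       /* pin members to cores 8-15     */

    snprintf(path, sizeof path, "%s/memory.max", grp);
    write_file(path, "8G");                         /* cap memory at 8 GiB           */

    snprintf(path, sizeof path, "%s/cgroup.procs", grp);
    snprintf(pid, sizeof pid, "%d", getpid());
    write_file(path, pid);                          /* move this process into it     */

    puts("process now constrained by the db-analytics cgroup");
    return 0;
}
```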

7 Conclusion

The interaction between operating systems and database engines has been a difficult systems problem for decades. Both try to control and manage the same resources but have very different goals. The tactic of ignoring each other, followed over the last decades, has worked because the homogeneity of the hardware has allowed databases to optimize against a reduced set of architectural specifications, and over-provisioning of resources (i.e., running a database on a single server) was not seen as a problem.

With the advent of multicore and virtualization, these premises have changed. Databases will often no longer run alone on a server, and the underlying hardware is becoming significantly more complex and heterogeneous. In fact, because of these changes, both databases and operating systems are revisiting their internal architectures to accommodate large-scale parallelism. Using COD as an example, we argue that the redesign effort on both sides must include the interface between the database and the OS.

This dissertation first addresses the knowledge gap problem with the COD architecture. COD introduces (1) an OS policy engine, which the operating system can use to reason about the underlying machine properties and the applications’ requirements, (2) a declarative interface between the database and the operating system that allows bi-directional knowledge flow, and (3) a resource profiler, which both the OS and the database can use to measure resource utilization.

In Chapter 2 we have shown how the CSCS storage engine can benefit from a close integration with the OS policy engine.

The storage engine pushes down information about its properties, including a cost model that estimates the scan runtime given the computational resources (i.e., cores), latency SLOs, and application-specific stored procedures. Using that information, the OS policy engine can assist in deployments on various server architectures and in noisy environments.

In Chapter 3 we have demonstrated the benefits of the COD architecture in the context of resource scheduling for the database execution engine. More concretely, we showed how to efficiently deploy complex query plans on modern multicore architectures without sacrificing performance or predictability for both throughput and latency. We achieved that by using the resource profiler to estimate the resource requirements of the relational operators, and by allowing the database execution engine to push down information about the data-dependency graph of the complex query plan.

In Chapter 4 we revisited the problem of finding the right balance between the degree of parallelism and multiprogramming in the context of modern analytical algorithms running on multicore machines. We showed that the problem is not trivial, as the performance of individual parallel jobs and of the workload mix is highly sensitive to load interactions, the chosen degree of parallelism, and the spatio-temporal resource allocation on different hardware architectures. With a series of experiments, we have shown that for concurrent workloads with parallel operators, using the NUMA node as the unit of allocation often achieves the desired optimization goal of maximizing throughput while at the same time guaranteeing the runtime of a job given a set of resources.

In Chapter 5 we presented Badis, an OS architecture that allows dynamic adjustment of the OS kernels and services based on the workload requirements. Badis’ control/compute plane separation enables complex applications to execute efficiently on modern machines by providing adaptive and customizable OS stacks. We also showed how to specialize a compute kernel for efficient scheduling of parallel analytical algorithms by implementing the Basslet kernel-integrated runtime scheduler for ptask program execution units. In our experiments, we achieved almost linear throughput scale-out and predictable runtime for heavy analytical workload mixes.

As an overall system design, integrating the policy engine as part of the control plane is a promising way to address the upcoming challenges of managing large (rack-scale, heterogeneous) machines while serving modern data processing workloads. The policy engine can then choose a suitable set of compute kernels and adjust the policies based on the workload requirements.

List of Tables

2.1 Message types supported by COD’s interface...... 25 2.2 Derived deployments for different SLAs and hardware platforms...... 33 2.3 Computation overhead of the OS policy engine...... 39 2.4 Policy engine computation cost for the stored procedures...... 42

3.1 Performance counter events used for deriving the resource activity vectors. 59 3.2 Performance of default vs. compressed deployment...... 76 3.3 Performance/Resources efficiency savings...... 77 3.4 Evaluating the design choices of algorithm phases...... 78

4.1 Relational workload characterized by instrumentation...... 106 4.2 Graph workload characterized by instrumentation...... 108 4.3 Effect of thread deployment on the runtime of parallel algorithms..... 112 4.4 Scheduling policies on heterogeneous WL...... 121 4.5 Time breakdown for heterogeneous workload mix...... 122

5.1 Runtime of parallel algorithms executing on Linux versus Basslet...... 158 5.2 Time to adjust...... 159


List of Figures

1.1 Overview of affected or new OS components...... 6

2.1 COD’s architecture...... 13 2.2 CSCS architecture...... 16 2.3 Architecture of the OS policy engine...... 22 2.4 Matrix of core to task allocation, including NUMA, cache and core affinity. 36 2.5 CSCS performance when deployed in a noisy system...... 38 2.6 COD’s adaptability to changes in the system...... 41

3.1 Query-centric vs. operator-centric execution models...... 52 3.2 Layout of the four-socket AMD Bulldozer system...... 54 3.3 Sketch of the solution – information flow within the deployment algorithm 56 3.4 Extensions to the COD system architecture...... 57 3.5 Overview of the deployment algorithm...... 62 3.6 Collapsing an operator pipeline into a compound operator...... 63 3.7 Thread to core mapping: comparing two deployment alternatives..... 68 3.8 TPC-W shared query plan – as generated for SharedDB...... 71 3.9 Understanding the derivation of RAVs, AMD MagnyCours, 20 GB dataset 73 3.10 RAV – impact of dataset size on CPU utilization...... 74 3.11 RAV – impact of HW architecture on CPU utilization...... 75


3.12 SharedDB’s throughput scalability with plan replication...... 80

4.1 Illustrating the three scheduling approaches for homogeneous workloads.. 95 4.2 Architecture of the Intel SandyBridge machine...... 102 4.3 Architecture of the AMD Bulldozer machine...... 103 4.4 Scalability of DB operators on tables with 1024 M tuples...... 105 4.5 Scalability of Graph algorithms on Twitter data...... 107 4.6 Effect of resource sharing on a co-scheduled pair of jobs...... 110 4.7 Scheduling approaches for heterogeneous workload mix on a four NUMA node machine, with eight cores per NUMA...... 113 4.8 Comparing scheduling approaches for relational DB operators...... 115 4.9 Comparing scheduling approaches for GreenMarl algorithms...... 118 4.10 AMD Bulldozer results...... 120

5.1 System throughput for executing concurrent pagerank jobs...... 133 5.2 Illustrating the Badis OS Architecture...... 136 5.3 Same interaction with the compute plane of Badis...... 138 5.4 Indirect cost of context switch vs. working set size...... 141 5.5 The architecture of an Intel SandyBridge processor with marked opportu- nities for resource sharing...... 143 5.6 Badis control plane APIs...... 147 5.7 Basslet runtime API...... 148 5.8 Measuring the slow-down of graph kernels when co-scheduling with another parallel algorithm...... 153 5.9 Throughput of concurrent PageRank execution on four different machines. 155 5.10 Comparing throughput scale-out using OpenMP vs. Linux+OpenMP vs. Basslet...... 157 5.11 Measuring the overhead of enqueuing a ptask in Badis...... 158

Bibliography

[A+15] K. Aingaran et al. “M7: Oracle’s Next-Generation Sparc Processor.” IEEE Micro, vol. 35, no. 2, 36–45, 2015.

[ABK+14] S. Angel, H. Ballani, T. Karagiannis, G. O’Shea, and E. Thereska. “End- to-end Performance Isolation Through Virtual Datacenters.” In Proceed- ings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI’14, pp. 233–248. 2014.

[ABLL91] T. E. Anderson, B. N. Bershad, E. D. Lazowska, and H. M. Levy. “Sched- uler Activations: Effective Kernel Support for the User-level Management of Parallelism.” In Proceedings of the Thirteenth ACM Symposium on Op- erating Systems Principles, SOSP ’91, pp. 95–109. 1991.

[ADADB+03] A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, N. C. Burnett, T. E. Denehy, T. J. Engle, H. S. Gunawi, J. A. Nugent, and F. I. Popovici. “Transform- ing Policies into Mechanisms with Infokernel.” In ACM Symposium on Operating System Principles, pp. 90–105. 2003.

[ADHW99] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood. “DBMSs on a Modern Processor: Where Does Time Go?” In VLDB ’99, pp. 266–277. 1999.

[ADJ+10] S. Arumugam, A. Dobra, C. M. Jermaine, N. Pansare, and L. L. Perez. “The DataPath system: a data-centric analytic processing engine for large data warehouses.” In SIGMOD Conference, pp. 519–530. 2010.

[Adv13] Advanced Micro Devices, Inc. (AMD). BIOS and Kernel Developer’s Guide (BKDG) For AMD Family 10h Processors, 2013.


[AFK+09] D. G. Andersen, J. Franklin, M. Kaminsky, A. Phanishayee, L. Tan, and V. Vasudevan. “FAWN: A Fast Array of Wimpy Nodes.” In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP ’09, pp. 1–14. 2009.

[AHY+15] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi. “A Scalable Processing- in-memory Accelerator for Parallel Graph Processing.” In ISCA ’15, pp. 105–117. 2015.

[AKN12] M.-C. Albutiu, A. Kemper, and T. Neumann. “Massively parallel sort- merge joins in main memory multi-core database systems.” PVLDB ’12, vol. 5, no. 10, 1064–1075, 2012.

[AKSS12] G. Alonso, D. Kossmann, T. Salomie, and A. Schmidt. “Shared Scans on Main Memory Column Stores.” Tech. Rep. no. 769, Department of Computer Science, ETH Zürich, 2012.

[AL91] A. W. Appel and K. Li. “Virtual memory primitives for user programs.” In Proceedings of the fourth international conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS-IV, pp. 96– 107. 1991.

[And12] Z. R. Anderson. “Efficiently combining parallel software using fine-grained, language-level, hierarchical resource management policies.” In Proceedings of the 27th Annual ACM SIGPLAN Conference on Object-Oriented Pro- gramming, Systems, Languages, and Applications, OOPSLA 2012, part of SPLASH 2012, Tucson, AZ, USA, October 21-25, 2012, pp. 717–736. 2012.

[App13] Apple. “Memory Usage Performance Guidelines.” https:// developer.apple.com/library/prerelease/content/documentation/ Performance/Conceptual/ManagingMemory/Articles/MemoryAlloc. html#//apple_ref/doc/uid/20001881-SW1, 2013.

[AW07] K. R. Apt and M. G. Wallace. Constraint Logic Programming using ECLiPSe. Cambridge University Press, 2007.

[AYMC15] J. Ahn, S. Yoo, O. Mutlu, and K. Choi. “PIM-enabled Instructions: A Low-overhead, Locality-aware Processing-in-memory Architecture.” In ISCA ’15, pp. 336–348. 2015.


[Bal14] C. Balkesen. “In-memory parallel join processing on multi-core processors.” Ph.D. thesis, ETH Zurich, 2014.

[bar16] “Barrelfish Operating System.”, 2016. www.barrelfish.org, accessed 2016-08-12.

[BATO13] C. Balkesen, G. Alonso, J. Teubner, and M. T. Ozsu.¨ “Multi-core, Main- memory Joins: Sort vs. Hash Revisited.” Proc. VLDB Endow., vol. 7, no. 1, 85–96, 2013.

[BBD+09] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. “The multikernel: a new OS architecture for scalable multicore systems.” In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, SOSP ’09, pp. 29–44. 2009.

[BCF+13] A. A. Bhattacharya, D. Culler, E. Friedman, A. Ghodsi, S. Shenker, and I. Stoica. “Hierarchical Scheduling for Diverse Datacenter Workloads.” In Proceedings of the 4th Annual Symposium on Cloud Computing, SOCC ’13, pp. 4:1–4:15. 2013.

[BD83] H. Boral and D. J. DeWitt. Database Machines: An Idea Whose Time has Passed? A Critique of the Future of Database Machines, pp. 166–187. Springer Berlin Heidelberg, Berlin, Heidelberg, 1983.

[BDF+03] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neuge- bauer, I. Pratt, and A. Warfield. “Xen and the Art of Virtualization.” In Proceedings of the 19th ACM Symposium on Operating Systems Principles, SOSP ’03, pp. 164–177. 2003.

[BDSK+08] M. Butrico, D. Da Silva, O. Krieger, M. Ostrowski, B. Rosenburg, D. Tsafrir, E. Van Hensbergen, R. W. Wisniewski, and J. Xenidis. “Spe- cialized Execution Environments.” SIGOPS Oper. Syst. Rev., vol. 42, no. 1, 106–107, 2008.

[BGC+13] A. Basu, J. Gandhi, J. Chang, M. D. Hill, and M. M. Swift. “Efficient Virtual Memory for Big Memory Servers.” ISCA ’13, pp. 237–248. 2013.


[BHC12] S. K. Begley, Z. He, and Y.-P. P. Chen. “MCJoin: A Memory-constrained Join for Column-store Main-memory Databases.” In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD ’12, pp. 121–132. 2012.

[BHKL06] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. “Group Forma- tion in Large Social Networks: Membership, Growth, and Evolution.” In KDD, pp. 44–54. 2006.

[Bie11] C. Bienia. “Benchmarking Modern Multiprocessors.” Ph.D. thesis, Princeton University, 2011.

[BKM08] P. A. Boncz, M. L. Kersten, and S. Manegold. “Breaking the memory wall in MonetDB.” Commun. ACM, vol. 51, 77–85, 2008.

[Bla79] M. Blasgen, J. Gray, M. Mitoma, and T. Price. “The Convoy Phenomenon.” SIGOPS Oper. Syst. Rev., vol. 13, no. 2, 20–25, 1979.

[BLAK15] C. Barthels, S. Loesing, G. Alonso, and D. Kossmann. “Rack-Scale In- Memory Join Processing Using RDMA.” In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, pp. 1463–1475. 2015.

[BLP11] S. Blanas, Y. Li, and J. M. Patel. “Design and Evaluation of Main Mem- ory Hash Join Algorithms for Multi-core CPUs.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD ’11, pp. 37–48. 2011.

[BLP+14] R. Barber, G. Lohman, I. Pandis, V. Raman, R. Sidle, G. Attaluri, N. Chainani, S. Lightstone, and D. Sharpe. “Memory-efficient Hash Joins.” Proc. VLDB Endow., vol. 8, no. 4, 353–364, 2014.

[BM10] M. Bhadauria and S. A. McKee. “An approach to resource-aware co- scheduling for CMPs.” In ICS ’10, pp. 189–199. 2010.

[BPA08] M. Banikazemi, D. Poff, and B. Abali. “PAM: a novel performance/power aware meta-scheduler for multi-core systems.” In SC ’08, pp. 39:1–39:12. 2008.


[BPK+14] A. Belay, G. Prekas, A. Klimovic, S. Grossman, C. Kozyrakis, and E. Bugnion. “IX: A Protected Dataplane Operating System for High Throughput and Low Latency.” OSDI’14, pp. 49–65. 2014.

[BSP+95] B. N. Bershad, S. Savage, P. Pardyak, E. G. Sirer, M. E. Fiuczynski, D. Becker, C. Chambers, and S. Eggers. “Extensibility safety and perfor- mance in the SPIN operating system.” In Proceedings of the fifteenth ACM Symposium on Operating Systems Principles, SOSP ’95, pp. 267–283. 1995.

[BTAO13] C. Balkesen, J. Teubner, G. Alonso, and M. T. Ozsu. “Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware.” ICDE ’13, vol. 0, 362–373, 2013.

[BWCM+10] S. Boyd-Wickizer, A. T. Clements, Y. Mao, A. Pesterev, M. F. Kaashoek, R. Morris, and N. Zeldovich. “An analysis of Linux scalability to many cores.” In Proceedings of the 9th USENIX conference on Operating Systems Design and Implementation, OSDI’10, pp. 1–8. 2010.

[BZFK10] S. Blagodurov, S. Zhuravlev, A. Fedorova, and A. Kamali. “A Case for NUMA-aware Contention Management on Multicore Systems.” In PACT ’10, pp. 557–558. 2010.

[BZN05] P. A. Boncz, M. Zukowski, and N. Nes. “MonetDB/X100: Hyper- Pipelining Query Execution.” In CIDR ’05, vol. 5, pp. 225–237. 2005.

[CCS+15] O. R. A. Chick, L. Carata, J. Snee, N. Balakrishnan, and R. Sohan. “Shadow Kernels: A General Mechanism For Kernel Specialization in Ex- isting Operating Systems.” In Proceedings of the 6th Asia-Pacific Workshop on Systems, APSys ’15, pp. 1:1–1:7. 2015.

[CEH+13] J. A. Colmenares, G. Eads, S. Hofmeyr, S. Bird, M. Moretó, D. Chou, B. Gluzman, E. Roman, D. B. Bartolini, N. Mor, K. Asanović, and J. D. Kubiatowicz. “Tessellation: Refactoring the OS Around Explicit Resource Containers with Continuous Adaptation.” In Proceedings of the 50th Annual Design Automation Conference, DAC ’13, pp. 76:1–76:10. 2013.

[CGJ97] E. G. Coffman, Jr., M. R. Garey, and D. S. Johnson. “Approximation algorithms for bin packing: a survey.” In D. S. Hochbaum, ed., Approximation Algorithms for NP-Hard Problems, pp. 46–93. 1997.


[CGK+07] S. Chen, P. B. Gibbons, M. Kozuch, V. Liaskovitis, A. Ailamaki, G. E. Blelloch, B. Falsafi, L. Fix, N. Hardavellas, T. C. Mowry, and C. Wilk- erson. “Scheduling threads for constructive cache sharing on CMPs.” In SPAA ’07, pp. 105–115. 2007.

[CGKS05] D. Chandra, F. Guo, S. Kim, and Y. Solihin. “Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture.” In HPCA ’05, pp. 340–351. 2005.

[CGS+05] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. “X10: An Object-oriented Approach to Non- uniform Cluster Computing.” In Proceedings of the 20th Annual ACM SIG- PLAN Conference on Object-oriented Programming, Systems, Languages, and Applications, OOPSLA ’05, pp. 519–538. 2005.

[CHCF15] A. Collins, T. Harris, M. Cole, and C. Fensch. “LIRA: Adaptive Contention-Aware Thread Placement for Parallel Runtime Systems.” In ROSS, pp. 2:1–2:8. 2015.

[CI12] A. Costea and A. Ionescu. “Query Optimization and Execution in Vector- wise MPP.” Master’s thesis, Vrije Universiteit, Amsterdam, 2012.

[CJ06] S. Cho and L. Jin. “Managing Distributed, Shared L2 Caches Through OS-Level Page Allocation.” In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 39, pp. 455–468. 2006.

[CJMB11] C. Curino, E. P. Jones, S. Madden, and H. Balakrishnan. “Workload-aware Database Monitoring and Consolidation.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD ’11, pp. 313–324. 2011.

[CJZM10] C. Curino, E. Jones, Y. Zhang, and S. Madden. “Schism: A Workload- driven Approach to Database Replication and Partitioning.” Proc. VLDB Endow., vol. 3, no. 1-2, 48–57, 2010.

[CPV09] G. Candea, N. Polyzotis, and R. Vingralek. “A Scalable, Predictable Join Operator for Highly Concurrent Data Warehouses.” Proceedings VLDB Endowment, vol. 2, no. 1, 277–288, 2009.


[CPV11] G. Candea, N. Polyzotis, and R. Vingralek. “Predictable performance and high query concurrency for data analytics.” VLDB Journal, vol. 20, no. 2, 227–248, 2011.

[CR07] J. Cieslewicz and K. A. Ross. “Adaptive Aggregation on Chip Multiproces- sors.” In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB ’07, pp. 339–350. VLDB Endowment, 2007.

[CRD+95] J. Chapin, M. Rosenblum, S. Devine, T. Lahiri, D. Teodosiu, and A. Gupta. “Hive: Fault Containment for Shared-memory Multiproces- sors.” In Proceedings of the Fifteenth ACM Symposium on Operating Sys- tems Principles, SOSP ’95, pp. 12–25. 1995.

[CSCC15] R. Chen, J. Shi, Y. Chen, and H. Chen. “PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs.” In EuroSys, pp. 1:1– 1:15. 2015.

[DC08] P. J. Drongowski and B. D. Center. “Basic Performance Measurements for AMD Athlon 64, AMD Opteron and AMD Phenom Processors.” AMD whitepaper, vol. 25, 2008.

[DFF+13] M. Dashti, A. Fedorova, J. Funston, F. Gaud, R. Lachaize, B. Lepers, V. Quema, and M. Roth. “Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems.” In ASPLOS, pp. 381–394. 2013.

[DGT13] T. David, R. Guerraoui, and V. Trigonakis. “Everything you always wanted to know about synchronization but were afraid to ask.” In SOSP, pp. 33–48. 2013.

[DK14] C. Delimitrou and C. Kozyrakis. “Quasar: Resource-efficient and QoS- aware Cluster Management.” In Proceedings of the 19th International Con- ference on Architectural Support for Programming Languages and Operat- ing Systems, ASPLOS ’14, pp. 127–144. 2014.

[DK16] C. Delimitrou and C. Kozyrakis. “HCloud: Resource-Efficient Provision- ing in Shared Cloud Systems.” In Proceedings of the Twenty-First Inter- national Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’16, pp. 473–488. 2016.


[DKD+15] N. Dautenhahn, T. Kasampalis, W. Dietz, J. Criswell, and V. Adve. “Nested Kernel: An Operating System Architecture for Intra-Kernel Privi- lege Separation.” In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Sys- tems, ASPLOS ’15, pp. 191–206. 2015.

[DMB08] L. De Moura and N. Bjørner. “Z3: an efficient SMT solver.” In Proceed- ings of the Theory and Practice of Software, 14th international conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS’08/ETAPS’08, pp. 337–340. 2008.

[DMR+10] M. Diener, F. L. Madruga, E. R. Rodrigues, M. A. Z. Alves, J. Schneider, P. O. A. Navaux, and H.-U. Heiss. “Evaluating thread placement based on memory access patterns for multi-core processors.” In HPCC, pp. 491–496. 2010.

[DNCH14] A. Dragojevic, D. Narayanan, M. Castro, and O. Hodson. “FaRM: Fast Remote Memory.” In NSDI, pp. 401–414. 2014.

[DNLS13] S. Das, V. R. Narasayya, F. Li, and M. Syamala. “CPU Sharing Techniques for Performance Isolation in Multitenant Relational Database-as-a-Service.” PVLDB ’13, vol. 7, no. 1, 37–48, 2013.

[DNN+15] A. Dragojević, D. Narayanan, E. B. Nightingale, M. Renzelmann, A. Shamis, A. Badam, and M. Castro. “No Compromises: Distributed Transactions with Consistency, Availability, and Performance.” In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP ’15, pp. 54–70. 2015.

[Dós07] G. Dósa. “The tight bound of first fit decreasing bin-packing algorithm is FFD(I) ≤ 11/9 OPT(I) + 6/9.” In Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, pp. 1–11. Springer, 2007.

[DPZ97] P. Druschel, V. Pai, and W. Zwaenepoel. “Extensible Kernels are Leading OS Research Astray.” In Proceedings of the 6th Workshop on Hot Topics in Operating Systems (HotOS-VI), HOTOS ’97, pp. 38–. 1997.

[DWDS13] T. Dey, W. Wang, J. W. Davidson, and M. L. Soffa. “ReSense: Map- ping Dynamic Workloads of Colocated Multithreaded Applications Using


Resource Sensitivity.” ACM Trans. Archit. Code Optim., vol. 10, no. 4, 41:1–41:25, 2013.

[EBSA+11] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger. “Dark Silicon and the End of Multicore Scaling.” ISCA ’11, pp. 365–376. 2011.

[EDS+15] A. Elmore, J. Duggan, M. Stonebraker, M. Balazinska, U. Cetintemel, V. Gadepally, J. Heer, B. Howe, J. Kepner, T. Kraska, et al. “A Demon- stration of the BigDAWG Polystore System.” VLDB, vol. 8, no. 12, 2015.

[EE08] S. Eyerman and L. Eeckhout. “System-Level Performance Metrics for Mul- tiprogram Workloads.” IEEE Micro, vol. 28, no. 3, 42–53, 2008.

[Efe82] K. Efe. “Heuristic Models of Task Assignment Scheduling in Distributed Systems.” Computer, vol. 15, no. 6, 50–56, 1982.

[EHMZ+16] I. El Hajj, A. Merritt, G. Zellweger, D. Milojicic, R. Achermann, P. Faraboschi, W.-m. Hwu, T. Roscoe, and K. Schwan. “SpaceJMP: Programming with Multiple Virtual Address Spaces.” ASPLOS ’16. 2016.

[EKO95] D. R. Engler, M. F. Kaashoek, and J. O’Toole, Jr. “Exokernel: an oper- ating system architecture for application-level resource management.” In Proceedings of the fifteenth ACM Symposium on Operating Systems Prin- ciples, SOSP ’95, pp. 251–266. 1995.

[FFPF05] E. Frachtenberg, D. G. Feitelson, F. Petrini, and J. Fernandez. “Adaptive Parallel Job Scheduling with Flexible Coscheduling.” IEEE Trans. Parallel Distrib. Syst., vol. 16, no. 11, 1066–1077, 2005.

[FLR98] M. Frigo, C. E. Leiserson, and K. H. Randall. “The Implementation of the Cilk-5 multithreaded language.” In In Proc. ACM SIGPLAN Conference on Programming Language Design and Implementation. 1998.

[FML+12] F. Färber, N. May, W. Lehner, P. Große, I. Müller, H. Rauhe, and J. Dees. “The SAP HANA Database – An Architecture Overview.” IEEE Data Eng. Bull., vol. 35, no. 1, 28–33, 2012.


[GAK12] G. Giannikis, G. Alonso, and D. Kossmann. “SharedDB: killing one thou- sand queries with one stone.” Proceedings VLDB Endowment, vol. 5, no. 6, 526–537, 2012.

[GARH14] J. Giceva, G. Alonso, T. Roscoe, and T. Harris. “Deployment of Query Plans on Multicores.” PVLDB, vol. 8, no. 3, 233–244, 2014.

[GG03] D. F. Garc´ıaand J. Garc´ıa. “TPC-W E-Commerce Benchmark Evalua- tion.” Computer, pp. 42–48, 2003.

[GGIW10] M. Giampapa, T. Gooding, T. Inglett, and R. W. Wisniewski. “Experi- ences with a Lightweight Kernel: Lessons Learned from Blue Gene’s CNK.” SC ’10, pp. 1–10. 2010.

[GHK92] S. Ganguly, W. Hasan, and R. Krishnamurthy. “Query Optimization for Parallel Execution.” In Proceedings of the 1992 ACM SIGMOD Interna- tional Conference on Management of Data, SIGMOD ’92, pp. 9–18. 1992.

[GI96] M. N. Garofalakis and Y. E. Ioannidis. “Multi-dimensional Resource Scheduling for Parallel Queries.” In Proceedings of the 1996 ACM SIG- MOD International Conference on Management of Data, SIGMOD ’96, pp. 365–376. 1996.

[GI97] M. N. Garofalakis and Y. E. Ioannidis. “Parallel Query Scheduling and Optimization with Time- and Space-Shared Resources.” In Proceedings of the 23rd International Conference on Very Large Data Bases, VLDB ’97, pp. 296–305. 1997.

[GKAS99] B. Gamsa, O. Krieger, J. Appavoo, and M. Stumm. “Tornado: Maximising Locality and Concurrency in a Shared Memory Multiprocessor Operating System.” In USENIX Symposium on Operating Systems Design and Im- plementation, pp. 87–100. 1999.

[GMAK14] G. Giannikis, D. Makreshanski, G. Alonso, and D. Kossmann. “Shared Workload Optimization.” PVLDB ’14, vol. 7, no. 6, 2014.

[Goo16] Google. “Google supercharges machine learning tasks with TPU custom chip.”, 2016.


[Gra77] J. Gray. “Notes on Data Base Operating Systems.” In R. Bayer, R. M. Graham, and G. Seegmüller, eds., Operating Systems: An Advanced Course, pp. 393–481. Springer-Verlag, 1977.

[Gro13] The Open Group. “POSIX.1-2008 Specification, 2013 edition.” http://pubs.opengroup.org/onlinepubs/9699919799/, 2013.

[GSC+15] I. Gog, M. Schwarzkopf, N. Crooks, M. P. Grosvenor, A. Clement, and S. Hand. “Musketeer: All for One, One for All in Data Processing Sys- tems.” In Proceedings of the Tenth European Conference on Computer Systems, EuroSys ’15, pp. 2:1–2:16. 2015.

[GSS+13] J. Giceva, T.-I. Salomie, A. Schüpbach, G. Alonso, and T. Roscoe. “COD: Database/Operating System Co-Design.” In CIDR ’13. 2013.

[GTI+15] B. Gerofi, M. Takagi, Y. Ishikawa, R. Riesen, E. Powers, and R. W. Wisniewski. “Exploring the Design Space of Combining Linux with Lightweight Kernels for Extreme Scale Computing.” ROSS ’15, pp. 5:1–5:8. 2015.

[GVK+14] G. Graefe, H. Volos, H. Kimura, H. Kuno, J. Tucek, M. Lillibridge, and A. Veitch. “In-memory Performance for Big Data.” PVLDB ’14, pp. 37–48, 2014.

[GXD+14] J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. “GraphX: Graph Processing in a Distributed Dataflow Frame- work.” In OSDI, pp. 599–613. 2014.

[GZA+15] S. Gerber, G. Zellweger, R. Achermann, K. Kourtis, T. Roscoe, and D. Milojicic. “Not Your Parents’ Physical Address Space.” In HotOS’15. 2015.

[GZAR16] J. Giceva, G. Zellweger, G. Alonso, and T. Roscoe. “Customized OS Support for Data-processing.” In Proceedings of the 12th International Workshop on Data Management on New Hardware, DaMoN ’16, pp. 2:1–2:6. 2016.

[GZH+11] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Sto- ica. “Dominant Resource Fairness: Fair Allocation of Multiple Resource


Types.” In Proceedings of the 8th USENIX Conference on Networked Sys- tems Design and Implementation, NSDI’11, pp. 323–336. 2011.

[HA05] S. Harizopoulos and A. Ailamaki. “StagedDB: Designing Database Servers for Modern Hardware.” IEEE Data Eng. Bull., vol. 28, no. 2, 11–16, 2005.

[Han99] S. M. Hand. “Self-paging in the Nemesis operating system.” In Proceedings of the third Symposium on Operating Systems Design and Implementation, OSDI ’99, pp. 73–86. 1999.

[HCSO12] S. Hong, H. Chafi, E. Sedlar, and K. Olukotun. “Green-Marl: A DSL for Easy and Efficient Graph Analysis.” In ASPLOS, pp. 349–362. 2012.

[Heo15] T. Heo. “Control Group v2.” https://www.kernel.org/doc/ Documentation/cgroup-v2.txt, 2015.

[HKL+08] W.-S. Han, W. Kwak, J. Lee, G. M. Lohman, and V. Markl. “Parallelizing query optimization.” Proceedings VLDB Endowment, vol. 1, 188–200, 2008.

[HKZ+11] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica. “Mesos: A Platform for Fine-grained Resource Sharing in the Data Center.” In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI’11, pp. 295–308. 2011.

[HLP+13] W.-S. Han, S. Lee, K. Park, J.-H. Lee, M.-S. Kim, J. Kim, and H. Yu. “Tur- boGraph: A Fast Parallel Graph Engine Handling Billion-scale Graphs in a Single PC.” In KDD, pp. 77–85. 2013.

[HMM14] T. Harris, M. Maas, and V. J. Marathe. “Callisto: Co-scheduling Parallel Runtime Systems.” In Proceedings of the Ninth European Conference on Computer Systems, EuroSys ’14, pp. 24:1–24:14. 2014.

[HS87] D. S. Hochbaum and D. B. Shmoys. “Using Dual Approximation Algo- rithms for Scheduling Problems Theoretical and Practical Results.” J. ACM, vol. 34, no. 1, 144–162, 1987.

[HSA05] S. Harizopoulos, V. Shkapenyuk, and A. Ailamaki. “QPipe: a simulta- neously pipelined relational query engine.” In SIGMOD ’05, pp. 383–394. 2005.


[HSH07] J. M. Hellerstein, M. Stonebraker, and J. Hamilton. “Architecture of a Database System.” Found. Trends databases, vol. 1, no. 2, 141–259, 2007.

[Hsi79] D. K. Hsiao. “Data Base Machines Are Coming, Data Base Machines Are Coming!” Computer, vol. 12, no. 3, 7–9, 1979.

[HSL10] T. Hoefler, T. Schneider, and A. Lumsdaine. “Characterizing the Influence of System Noise on Large-Scale Applications by Simulation.” SC ’10, pp. 1–11. 2010.

[Inc16] Free Software Foundation, Inc. “libgomp: proc.c gomp_dynamic_max_threads().” https://github.com/gcc-mirror/gcc/blob/edd716b6b1caa1a5cb320a8cd7f626f30198e098/libgomp/config/ /proc.c#L55, 2016.

[Int08] Intel Corporation. Intel 64 and IA-32 Architectures Optimization Refer- ence Manual, 2008.

[Int13] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes 3A, 3B, and 3C: System Programming Guide, 2013.

[Jon15] J. Corbet. “Thread-level management in control groups.”, 2015. https://lwn.net/Articles/656115/, accessed 2016-08-12.

[JPH+09] R. Johnson, I. Pandis, N. Hardavellas, A. Ailamaki, and B. Falsafi. “Shore- MT: a scalable storage manager for the multicore era.” In EDBT, pp. 24–35. 2009.

[KARH15] S. Kaestle, R. Achermann, T. Roscoe, and T. Harris. “Shoal: Smart Allo- cation and Replication of Memory for Parallel Programs.” In Proceedings of the 2015 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC ’15, pp. 263–276. 2015.

[KB05] S. M. Kelly and R. Brightwell. “ of the light weight kernel, Catamount.” In In Cray User Group, pp. 16–19. 2005.

[KBG12] A. Kyrola, G. Blelloch, and C. Guestrin. “GraphChi: Large-scale Graph Computation on Just a PC.” In OSDI, pp. 31–46. 2012.


[KBH+08] R. Knauerhase, P. Brett, B. Hohlt, T. Li, and S. Hahn. “Using OS Obser- vations to Improve Performance in Multicore Systems.” Micro ’08, vol. 28, no. 3, 54–66, 2008.

[Kie16] T. Kiefer. “Allocation Strategies for Data-Oriented Architectures.” Ph.D. thesis, Dresden University of Technology, 2016.

[Kim15] H. Kimura. “FOEDUS: OLTP Engine for a Thousand Cores and NVRAM.” SIGMOD ’15, pp. 691–706. 2015.

[KKA14] A. Kalia, M. Kaminsky, and D. G. Andersen. “Using RDMA efficiently for key-value services.” In SIGCOMM, pp. 295–306. 2014.

[KKL+09] C. Kim, T. Kaldewey, V. W. Lee, E. Sedlar, A. D. Nguyen, N. Satish, J. Chhugani, A. Di Blas, and P. Dubey. “Sort vs. Hash revisited: fast join implementation on modern multi-core CPUs.” PVLDB ’09, vol. 2, no. 2, 1378–1389, 2009.

[Kle05] A. Kleen. “A NUMA API for LINUX.” Tech. rep., 2005.

[KLPM10] H. Kwak, C. Lee, H. Park, and S. Moon. “What is Twitter, a social network or a media?” In WWW, pp. 591–600. 2010.

[KM77] L. T. Kou and G. Markowsky. “Multidimensional Bin Packing Algorithms.” IBM Journal of Research and Development, vol. 21, no. 5, 443–448, 1977.

[KN11] A. Kemper and T. Neumann. “HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots.” In ICDE, pp. 195–206. 2011.

[KR16] K. Keeton and T. Roscoe, eds. 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2-4, 2016. USENIX Association, 2016.

[KS09] S. Khuller and B. Saha. “On Finding Dense Subgraphs.” In Automata, Languages and Programming, vol. 5555 of Lecture Notes in Computer Sci- ence, pp. 597–608. Springer Berlin Heidelberg, 2009.


[LBKN14] V. Leis, P. Boncz, A. Kemper, and T. Neumann. “Morsel-driven Paral- lelism: A NUMA-aware Query Evaluation Framework for the Many-core Age.” In SIGMOD ’14, pp. 743–754. 2014.

[LCG+15] D. Lo, L. Cheng, R. Govindaraju, P. Ranganathan, and C. Kozyrakis. “Heracles: Improving Resource Efficiency at Scale.” In Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA ’15, pp. 450–462. 2015.

[LDC+09] R. Lee, X. Ding, F. Chen, Q. Lu, and X. Zhang. “MCC-DB: Minimizing Cache Conflicts in Multi-core Processors for Databases.” Proceedings of VLDB Endowment, vol. 2, no. 1, 373–384, 2009.

[LDS07] C. Li, C. Ding, and K. Shen. “Quantifying the Cost of Context Switch.” In Proceedings of the 2007 Workshop on Experimental Computer Science, ExpCS ’07. 2007.

[LFC+12] I. Lebedev, C. Fletcher, S. Cheng, J. Martin, A. Doupnik, D. Burke, M. Lin, and J. Wawrzynek. “Exploring Many-core Design Templates for FPGAs and ASICs.” Int. J. Reconfig. Comput., vol. 2012, 8:8–8:8, 2012.

[LGJ01] K. Luo, J. Gummaraju, and M. Franklin. “Balancing throughput and fairness in SMT processors.” In Proceedings of the International Symposium on Performance Analysis of Systems and Software, ISPASS ’01, pp. 164–171. 2001.

[LK14] J. Leskovec and A. Krevl. “SNAP Datasets: Stanford Large Network Dataset Collection.” http://snap.stanford.edu/data, 2014.

[LKB+09] R. Liu, K. Klues, S. Bird, S. Hofmeyr, K. Asanović, and J. Kubiatowicz. “Tessellation: Space-Time Partitioning in a Manycore Client OS.” In USENIX Workshop on Hot Topics in Parallelism. 2009.

[LLD+08] J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan. “Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems.” In HPCA, pp. 367–378. 2008.


[LLF+16] J. Lozi, B. Lepers, J. R. Funston, F. Gaud, V. Quéma, and A. Fedorova. “The Linux scheduler: a decade of wasted cores.” In EuroSys ’16, p. 1. 2016.

[LLMZ11] J. J. Levandoski, D. Lomet, M. F. Mokbel, and K. K. Zhao. “Deuteronomy: Transaction Support for Cloud Data.” In CIDR. 2011.

[LLS+15] J. Levandoski, D. Lomet, S. Sengupta, R. Stutsman, and R. Wang. “High Performance Transactions in Deuteronomy.” CIDR 2015, 2015.

[Loe15] S. Loesing. “Architectures for elastic database services.” Ph.D. thesis, ETH Zurich, 2015.

[LPM+13] Y. Li, I. Pandis, R. Müller, V. Raman, and G. M. Lohman. “NUMA-aware algorithms: the case of data shuffling.” In CIDR ’13. 2013.

[MAN05] R. L. McGregor, C. D. Antonopoulos, and D. S. Nikolopoulos. “Scheduling Algorithms for Effective Thread Pairing on Hybrid Multiprocessors.” In Proceedings of the 19th IEEE International Parallel and Distributed Pro- cessing Symposium (IPDPS’05) - Papers - Volume 01, IPDPS ’05, pp. 28.1–. 2005.

[MBK00] S. Manegold, P. A. Boncz, and M. L. Kersten. “Optimizing database architecture for the new bottleneck: memory access.” PVLDB ’00, vol. 9, no. 3, 231–246, 2000.

[MBK02] S. Manegold, P. Boncz, and M. L. Kersten. “Generic Database Cost Models for Hierarchical Memory Systems.” In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB ’02, pp. 191–202. 2002.

[McC95] J. D. McCalpin. “Memory Bandwidth and Machine Balance in Current High Performance Computers.” TCCA ’95, pp. 19–25, 1995.

[MCM13] B. Mozafari, C. Curino, and S. Madden. “DBSeer: Resource and Perfor- mance Prediction for Building a Next Generation Database Cloud.” In CIDR ’13. 2013.

[MCT77] K. Maruyama, S. K. Chang, and D. T. Tang. “A general packing algo- rithm for multidimensional resource requirements.” International Journal of Computer & Information Sciences, vol. 6, no. 2, 131–149, 1977.


[Mem16] MemSQL. “MemSQL – distributed In-memory Database.” www.memsql. com, 2016.

[Mer14] D. Merkel. “Docker: Lightweight Linux Containers for Consistent Devel- opment and Deployment.” Linux J., vol. 2014, no. 239, 2014.

[MGAK16] D. Makreshanski, G. Giannikis, G. Alonso, and D. Kossmann. “MQJoin: Efficient Shared Execution of Main-memory Joins.” Proc. VLDB Endow., vol. 9, no. 6, 480–491, 2016.

[Mic10] Microsoft. “Windows API Reference.” https://msdn.microsoft.com/ en-us/library/aa383749(v=vs.85).aspx, 2010.

[Mic16] Microsoft. “User-Mode Scheduling in Windows.” https://msdn. microsoft.com/en-us/library/windows/desktop/dd627187(v=vs.85) .aspx, 2016.

[MMI+13] D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. “Naiad: A Timely Dataflow System.” In SOSP, pp. 439–455. 2013.

[MMR+13] A. Madhavapeddy, R. Mortier, C. Rotsos, D. Scott, B. Singh, T. Gazag- naire, S. Smith, S. Hand, and J. Crowcroft. “Unikernels: Library Operat- ing Systems for the Cloud.” In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Op- erating Systems, ASPLOS ’13, pp. 461–472. 2013.

[MSB10] A. Merkel, J. Stoess, and F. Bellosa. “Resource-conscious scheduling for energy efficiency on multicore processors.” In EuroSys ’10, pp. 153–166. 2010.

[MSLM91] B. D. Marsh, M. L. Scott, T. J. LeBlanc, and E. P. Markatos. “First-class User-level Threads.” In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, SOSP ’91, pp. 110–121. 1991.

[Mue16] I. Mueller. “Engineering Aggregation Operators for Relational In-memory Database Systems.” Ph.D. thesis, Karlsruhe Institute of Technology (KIT), 2016.


[MVHS10] J. Mars, N. Vachharajani, R. Hundt, and M. L. Soffa. “Contention Aware Execution: Online Contention Detection and Response.” In CGO, pp. 257–265. 2010.

[NHM+09] E. B. Nightingale, O. Hodson, R. McIlroy, C. Hawblitzel, and G. Hunt. “Helios: heterogeneous with satellite kernels.” In Proceed- ings of the ACM SIGOPS 22nd symposium on Operating systems princi- ples, SOSP ’09, pp. 221–234. 2009.

[NKG10] R. Nathuji, A. Kansal, and A. Ghaffarkhah. “Q-clouds: Managing Perfor- mance Interference Effects for QoS-aware Clouds.” In EuroSys, pp. 237– 250. 2010.

[Ope15] OpenMP Architecture Review Board. “OpenMP Application Program Interface Version 4.5.”, 2015.

[Ora09] Oracle Inc. Programming Interfaces Guide. 2009.

[Ora15] Oracle White Paper. “Oracle GoldenGate 12c: Real-Time Access to Real- Time Information.” Tech. rep., Oracle, 2015.

[Ous82] J. Ousterhout. “Scheduling Techniques for Concurrent Systems.” IEEE Distributed Computer Systems, 1982.

[OWZS13] K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica. “Sparrow: Dis- tributed, Low Latency Scheduling.” In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, pp. 69–84. 2013.

[PAA13] I. Psaroudakis, M. Athanassoulis, and A. Ailamaki. “Sharing Data and Work Across Concurrent Analytical Queries.” PVLDB ’13, vol. 6, no. 9, 637–648, 2013.

[Pan15] P. Garefalakis. “Bridging the Gap between Serving and Analytics in Scalable Web Applications.” Master’s thesis, Imperial College London, Department of Computing, London, UK, 2015.

[Pap14] Oracle White Paper. “Parallel Execution with Oracle Database 12c Fundamentals.” Tech. rep., Oracle, 2014.


[PBAR11] S. Peter, A. Baumann, Z. Anderson, and T. Roscoe. “Gang scheduling isn’t worth it ... yet.” Tech. Rep. 745, Department of Computer Science, ETH Zurich, 2011.

[PBWH+11] D. E. Porter, S. Boyd-Wickizer, J. Howell, R. Olinsky, and G. C. Hunt. “Rethinking the Library OS from the Top Down.” In Proceedings of the Sixteenth International Conference on Architectural Support for Program- ming Languages and Operating Systems, ASPLOS XVI, pp. 291–304. 2011.

[Pet12] S. Peter. “Resource Management in a Multicore Operating System.” Ph.D. thesis, ETH Zurich, 2012.

[PHA09] H. Pan, B. Hindman, and K. Asanović. “Lithe: Enabling Efficient Composition of Parallel Libraries.” In Proceedings of the First USENIX Conference on Hot Topics in Parallelism, HotPar ’09, pp. 11–11. 2009.

[Phi14] S. Phillips. “M7: Next Generation SPARC.” Presented at Hot Chips (HC 26): A symposium on High Performance Chips, August, 2014.

[Pla09] H. Plattner. “A Common Database Approach for OLTP and OLAP Using an In-memory Column Database.” In SIGMOD, pp. 1–2. 2009.

[PLTA14] D. Porobic, E. Liarou, P. Tozun, and A. Ailamaki. “ATraPos: Adaptive transaction processing on hardware Islands.” In ICDE ’14, pp. 688–699. 2014.

[PLZ+14] S. Peter, J. Li, I. Zhang, D. R. K. Ports, D. Woos, A. Krishnamurthy, T. Anderson, and T. Roscoe. “Arrakis: The Operating System is the Con- trol Plane.” In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI’14, pp. 1–16. 2014.

[PPB+12] D. Porobic, I. Pandis, M. Branco, P. Tozun, and A. Ailamaki. “OLTP on Hardware Islands.” PVLDB ’12, vol. 5, no. 11, 1447–1458, 2012.

[PR14] O. Polychroniou and K. A. Ross. “A Comprehensive Study of Main- memory Partitioning and Its Application to Large-scale Comparison- and Radix-sort.” In Proceedings of the 2014 ACM SIGMOD International Con- ference on Management of Data, SIGMOD ’14, pp. 755–766. 2014.


[PSB+10] S. Peter, A. Schuepbach, P. Barham, A. Baumann, R. Isaacs, T. Harris, and T. Roscoe. “Design Principles for End-to-End Multicore Schedulers.” In Proceedings of the 2nd Usenix Workshop on Hot Topics in Parallelism (HotPar-10). Berkeley, CA, USA, 2010.

[PSB+15] Y. Perez, R. Sosič, A. Banerjee, R. Puttagunta, M. Raison, P. Shah, and J. Leskovec. “Ringo: Interactive Graph Analytics on Big-Memory Machines.” In SIGMOD, pp. 1105–1110. 2015.

[PSK15] S. Panneerselvam, M. Swift, and N. S. Kim. “Bolt: Faster Reconfiguration in Operating Systems.” In 2015 USENIX Annual Technical Conference (USENIX ATC 15), pp. 511–516. USENIX Association, Santa Clara, CA, 2015.

[PSM+15] I. Psaroudakis, T. Scheuer, N. May, A. Sellami, and A. Ailamaki. “Scal- ing Up Concurrent Main-Memory Column-Store Scans: Towards Adaptive NUMA-aware Data and Task Placement.” PVLDB, vol. 8, no. 12, 1442– 1453, 2015.

[PSMA13] I. Psaroudakis, T. Scheuer, N. May, and A. Ailamaki. “Task Scheduling for Highly Concurrent Analytical and Transactional Main-Memory Work- loads.” In ADMS, pp. 36–45. 2013.

[PVHH+12] Y. Park, E. Van Hensbergen, M. Hillenbrand, T. Inglett, B. Rosenburg, K. D. Ryu, and R. W. Wisniewski. “FusedOS: Fusing LWK Performance with FWK Functionality in a Heterogeneous Environment.” In Proceedings of the 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD ’12, pp. 211–218. IEEE Computer Society, Washington, DC, USA, 2012.

[PWM+15] I. Psaroudakis, F. Wolf, N. May, T. Neumann, A. Böhm, A. Ailamaki, and K.-U. Sattler. “Scaling Up Mixed Workloads: A Battle of Data Freshness, Flexibility, and Scheduling.” In R. Nambiar and M. Poess, eds., Performance Characterization and Benchmarking. Traditional to Big Data, vol. 8904 of Lecture Notes in Computer Science, pp. 97–112. Springer International Publishing, 2015.


[Ram16] R. Rosen. “Understanding the new control groups API.”, 2016. https://lwn.net/Articles/679786/, accessed 2016-08-12.

[RCS+11] C. J. Rossbach, J. Currey, M. Silberstein, B. Ray, and E. Witchel. “PTask: Operating System Abstractions to Manage GPUs As Compute Devices.” In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, SOSP ’11, pp. 233–248. 2011.

[RMG+15] R. Riesen, A. B. Maccabe, B. Gerofi, D. N. Lombard, J. J. Lange, K. Pedretti, K. Ferreira, M. Lang, P. Keppel, R. W. Wisniewski, R. Brightwell, T. Inglett, Y. Park, and Y. Ishikawa. “What is a Lightweight Kernel?” ROSS ’15, pp. 9:1–9:8. 2015.

[RMKN15] W. Rödiger, T. Mühlbauer, A. Kemper, and T. Neumann. “High-speed Query Processing over High-speed Networks.” Proc. VLDB Endow., vol. 9, no. 4, 228–239, 2015.

[RS09] M. Russinovich and D. A. Solomon. Windows Internals: Including Windows Server 2008 and Windows Vista, Fifth Edition. Microsoft Press, 5th ed., 2009.

[RSQ+08] V. Raman, G. Swart, L. Qiao, F. Reiss, V. Dialani, D. Kossmann, I. Narang, and R. Sidle. “Constant-Time Query Processing.” In ICDE, pp. 60–69. 2008.

[RTG14] P. Roy, J. Teubner, and R. Gemulla. “Low-latency Handshake Join.” Proc. VLDB Endow., vol. 7, no. 9, 709–720, 2014.

[SAB+05] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Fer- reira, E. Lau, A. Lin, S. Madden, E. O’Neil, P. O’Neil, A. Rasin, N. Tran, and S. Zdonik. “C-store: a column-oriented DBMS.” In Proceedings of the 31st international conference on Very large data bases, VLDB ’05, pp. 553–564. 2005.

[SBZ12] M. Switakowski, P. A. Boncz, and M. Zukowski. “From Cooperative Scans to Predictive Buffer Management.” CoRR, vol. abs/1208.4170, 2012.

[Sch12] A. Schuepbach. “Tackling OS Complexity with Declarative Techniques.” Ph.D. thesis, ETH Zurich, 2012.


[Sel88] T. K. Sellis. “Multiple-query optimization.” ACM Trans. Database Syst., vol. 13, 23–52, 1988.

[SESS96] M. I. Seltzer, Y. Endo, C. Small, and K. A. Smith. “Dealing with disas- ter: surviving misbehaved kernel extensions.” SIGOPS Operating Systems Review, vol. 30, no. SI, 213–227, 1996.

[SGT+14] T. Shimosawa, B. Gerofi, M. Takagi, G. Nakamura, T. Shirasawa, Y. Saeki, M. Shimizu, A. Hori, and Y. Ishikawa. “Interface for heterogeneous ker- nels: A framework to enable hybrid OS designs targeting high performance computing on manycore architectures.” In HiPC’14, pp. 1–10. 2014.

[SKAEMW13] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes. “Omega: Flexible, Scalable Schedulers for Large Compute Clusters.” In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys ’13, pp. 351–364. 2013.

[SKC+10] N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and P. Dubey. “Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort.” In SIGMOD ’10, pp. 351–362. 2010.

[SMB+15] V. Seshadri, T. Mullins, A. Boroumand, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry. “Gather-scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses.” In MICRO’15, pp. 267–280. 2015.

[SPB+08] A. Schüpbach, S. Peter, A. Baumann, T. Roscoe, P. Barham, T. Harris, and R. Isaacs. “Embracing diversity in the Barrelfish manycore operating system.” In Proceedings of the Workshop on Managed Many-Core Systems, MMCS ’08. 2008.

[SS10] L. Soares and M. Stumm. “FlexSC: Flexible System Call Scheduling with Exception-less System Calls.” In Proceedings of the 9th USENIX Con- ference on Operating Systems Design and Implementation, OSDI’10, pp. 33–46. 2010.

[SSGA11] T.-I. Salomie, I. E. Subasu, J. Giceva, and G. Alonso. “Database Engines on Multicores, Why Parallelize when You Can Distribute?” In Proceedings


of the Sixth Conference on Computer Systems, EuroSys ’11, pp. 17–30. 2011.

[ST00] A. Snavely and D. M. Tullsen. “Symbiotic Jobscheduling for a Simultane- ous Multithreaded Processor.” In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Op- erating Systems, ASPLOS IX, pp. 234–244. 2000.

[Sto81] M. Stonebraker. “Operating System Support for Database Management.” Commun. ACM, vol. 24, no. 7, 412–418, 1981.

[Sun02] Sun Microsystems. “Multithreading in the Solaris Operating Environment.” http://home.mit.bme.hu/~meszaros/edu/oprendszerek/segedlet/unix/2_folyamatok_es_utemezes/solaris_multithread. , 2002.

[TMV+11] L. Tang, J. Mars, N. Vachharajani, R. Hundt, and M. L. Soffa. “The Im- pact of Memory Subsystem Resource Sharing on Datacenter Applications.” In ISCA, pp. 283–294. 2011.

[Tor16] L. Torvalds. “Linux 4.6-rc7 scheduler.” https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/fair.c?id=/tags/v4.6-rc7#n50, 2016.

[TTH16] B. Teabe, A. Tchana, and D. Hagimont. “Application-specific Quantum for Multi-core Platform Scheduler.” In EuroSys, pp. 3:1–3:14. 2016.

[UGA+09] P. Unterbrunner, G. Giannikis, G. Alonso, D. Fauser, and D. Kossmann. “Predictable performance for unpredictable workloads.” Proc. VLDB Endow., vol. 2, 706–717, 2009.

[WA09] D. Wentzlaff and A. Agarwal. “Factored operating systems (fos): the case for a scalable operating system for multicores.” SIGOPS Operating Systems Review, vol. 43, no. 2, 76–85, 2009.

[WDA+08] Y. Weinsberg, D. Dolev, T. Anker, M. Ben-Yehuda, and P. Wyckoff. “Tapping into the fountain of CPUs: on operating system support for programmable devices.” In International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 179–188. 2008.

[WGB+10] D. Wentzlaff, C. Gruenwald III, N. Beckmann, K. Modzelewski, A. Belay, L. Youseff, J. Miller, and A. Agarwal. “An Operating System for Multicore and Clouds: Mechanisms and Implementation.” In ACM Symposium on Cloud Computing (SOCC). 2010.

[WIA14] L. Woods, Z. István, and G. Alonso. “Ibex: An Intelligent Storage Engine with Support for Advanced SQL Offloading.” PVLDB, vol. 7, no. 11, 963–974, 2014.

[WIK+14] R. W. Wisniewski, T. Inglett, P. Keppel, R. Murty, and R. Riesen. “mOS: An Architecture for Extreme-scale Operating Systems.” In ROSS ’14, pp. 2:1–2:8. 2014.

[WLP+14] L. Wu, A. Lottarini, T. K. Paine, M. A. Kim, and K. A. Ross. “Q100: The Architecture and Design of a Database Processing Unit.” In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’14, pp. 255–268. 2014.

[WRA05] E. Witchel, J. Rhee, and K. Asanović. “Mondrix: Memory Isolation for Linux Using Mondriaan Memory Protection.” In Proceedings of the Twentieth ACM Symposium on Operating Systems Principles, SOSP ’05, pp. 31–44. 2005.

[WS11] J. Wassenberg and P. Sanders. “Engineering a multi-core radix sort.” In Euro-Par 2011 Parallel Processing, pp. 160–169. Springer, 2011.

[WWP09] S. Williams, A. Waterman, and D. Patterson. “Roofline: an insightful visual performance model for multicore architectures.” Commun. ACM, vol. 52, no. 4, 65–76, 2009.

[YRV11] Y. Ye, K. A. Ross, and N. Vesdapunt. “Scalable Aggregation on Multicore Processors.” In Proceedings of the Seventh International Workshop on Data Management on New Hardware, DaMoN ’11, pp. 1–9. 2011.

[YSC+09] Q. Yin, A. Schüpbach, J. Cappos, A. Baumann, and T. Roscoe. “Rhizoma: A Runtime for Self-deploying, Self-managing Overlays.” In Proceedings of the ACM/IFIP/USENIX 10th International Conference on Middleware, Middleware’09, pp. 184–204. 2009.

[ZBF10] S. Zhuravlev, S. Blagodurov, and A. Fedorova. “Addressing shared resource contention in multicore processors via scheduling.” In ASPLOS XV ’10, pp. 129–142. 2010.

[ZGKR14] G. Zellweger, S. Gerber, K. Kourtis, and T. Roscoe. “Decoupling Cores, Kernels, and Operating Systems.” In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 17–31. Broomfield, CO, 2014.

[ZHNB07] M. Zukowski, S. Héman, N. Nes, and P. Boncz. “Cooperative scans: dynamic bandwidth sharing in a DBMS.” In Proceedings of the 33rd international conference on Very large data bases, VLDB ’07, pp. 723–734. 2007.

[ZR04] J. Zhou and K. A. Ross. “Buffering database operations for enhanced instruction cache performance.” In SIGMOD ’04, pp. 191–202. 2004.

[ZR14] C. Zhang and C. Re. “DimmWitted: A Study of Main-Memory Statistical Analytics.” PVLDB, vol. 7, no. 12, 1283–1294, 2014.

[ZSB+12] S. Zhuravlev, J. C. Saez, S. Blagodurov, A. Fedorova, and M. Prieto. “Survey of scheduling techniques for addressing shared resources in multicore processors.” ACM Comput. Surv., vol. 45, no. 1, 4:1–4:28, 2012.
