Μtune: Auto-Tuned Threading for OLDI Microservices Akshitha Sriraman and Thomas F
Total Page:16
File Type:pdf, Size:1020Kb
Load more
										Recommended publications
									
								- 
												  Accordion: Better Memory Organization for LSM Key-Value StoresAccordion: Better Memory Organization for LSM Key-Value Stores Edward Bortnikov Anastasia Braginsky Eshcar Hillel Yahoo Research Yahoo Research Yahoo Research [email protected] [email protected] [email protected] Idit Keidar Gali Sheffi Technion and Yahoo Research Yahoo Research [email protected] gsheffi@oath.com ABSTRACT of applications for which they are used continuously in- Log-structured merge (LSM) stores have emerged as the tech- creases. A small sample of recently published use cases in- nology of choice for building scalable write-intensive key- cludes massive-scale online analytics (Airbnb/ Airstream [2], value storage systems. An LSM store replaces random I/O Yahoo/Flurry [7]), product search and recommendation (Al- with sequential I/O by accumulating large batches of writes ibaba [13]), graph storage (Facebook/Dragon [5], Pinter- in a memory store prior to flushing them to log-structured est/Zen [19]), and many more. disk storage; the latter is continuously re-organized in the The leading approach for implementing write-intensive background through a compaction process for efficiency of key-value storage is log-structured merge (LSM) stores [31]. reads. Though inherent to the LSM design, frequent com- This technology is ubiquitously used by popular key-value pactions are a major pain point because they slow down storage platforms [9, 14, 16, 22,4,1, 10, 11]. The premise data store operations, primarily writes, and also increase for using LSM stores is the major disk access bottleneck, disk wear. Another performance bottleneck in today's state- exhibited even with today's SSD hardware [14, 33, 34].
- 
												  Thread Management for High Performance Database Systems - Design and ImplementationNr.: FIN-003-2018 Thread Management for High Performance Database Systems - Design and Implementation Robert Jendersie, Johannes Wuensche, Johann Wagner, Marten Wallewein-Eising, Marcus Pinnecke, Gunter Saake Arbeitsgruppe Database and Software Engineering Fakultät für Informatik Otto-von-Guericke-Universität Magdeburg Nr.: FIN-003-2018 Thread Management for High Performance Database Systems - Design and Implementation Robert Jendersie, Johannes Wuensche, Johann Wagner, Marten Wallewein-Eising, Marcus Pinnecke, Gunter Saake Arbeitsgruppe Database and Software Engineering Technical report (Internet) Elektronische Zeitschriftenreihe der Fakultät für Informatik der Otto-von-Guericke-Universität Magdeburg ISSN 1869-5078 Fakultät für Informatik Otto-von-Guericke-Universität Magdeburg Impressum (§ 5 TMG) Herausgeber: Otto-von-Guericke-Universität Magdeburg Fakultät für Informatik Der Dekan Verantwortlich für diese Ausgabe: Otto-von-Guericke-Universität Magdeburg Fakultät für Informatik Marcus Pinnecke Postfach 4120 39016 Magdeburg E-Mail: [email protected] http://www.cs.uni-magdeburg.de/Technical_reports.html Technical report (Internet) ISSN 1869-5078 Redaktionsschluss: 21.08.2018 Bezug: Otto-von-Guericke-Universität Magdeburg Fakultät für Informatik Dekanat Thread Management for High Performance Database Systems - Design and Implementation Technical Report Robert Jendersie1, Johannes Wuensche2, Johann Wagner1, Marten Wallewein-Eising2, Marcus Pinnecke1, and Gunter Saake1 Database and Software Engineering Group, Otto-von-Guericke University
- 
												  Concurrency & Parallel Programming PatternsConcurrency & Parallel programming patterns Evgeny Gavrin Outline 1. Concurrency vs Parallelism 2. Patterns by groups 3. Detailed overview of parallel patterns 4. Summary 5. Proposal for language Concurrency vs Parallelism ● Parallelism is the simultaneous execution of computations “doing lots of things at once” ● Concurrency is the composition of independently execution processes “dealing with lots of thing at once” Patterns by groups Architectural Patterns These patterns define the overall architecture for a program: ● Pipe-and-filter: view the program as filters (pipeline stages) connected by pipes (channels). Data flows through the filters to take input and transform into output. ● Agent and Repository: a collection of autonomous agents update state managed on their behalf in a central repository. ● Process control: the program is structured analogously to a process control pipeline with monitors and actuators moderating feedback loops and a pipeline of processing stages. ● Event based implicit invocation: The program is a collection of agents that post events they watch for and issue events for other agents. The architecture enforces a high level abstraction so invocation of an agent is implicit; i.e. not hardwired to a specific controlling agent. ● Model-view-controller: An architecture with a central model for the state of the program, a controller that manages the state and one or more agents that export views of the model appropriate to different uses of the model. ● Bulk Iterative (AKA bulk synchronous): A program that proceeds iteratively … update state, check against a termination condition, complete coordination, and proceed to the next iteration. ● Map reduce: the program is represented in terms of two classes of functions.
- 
												  Artificial Intelligence for Understanding Large and ComplexArtificial Intelligence for Understanding Large and Complex Datacenters by Pengfei Zheng Department of Computer Science Duke University Date: Approved: Benjamin C. Lee, Advisor Bruce M. Maggs Jeffrey S. Chase Jun Yang Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of Duke University 2020 Abstract Artificial Intelligence for Understanding Large and Complex Datacenters by Pengfei Zheng Department of Computer Science Duke University Date: Approved: Benjamin C. Lee, Advisor Bruce M. Maggs Jeffrey S. Chase Jun Yang An abstract of a dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of Duke University 2020 Copyright © 2020 by Pengfei Zheng All rights reserved except the rights granted by the Creative Commons Attribution-Noncommercial Licence Abstract As the democratization of global-scale web applications and cloud computing, under- standing the performance of a live production datacenter becomes a prerequisite for making strategic decisions related to datacenter design and optimization. Advances in monitoring, tracing, and profiling large, complex systems provide rich datasets and establish a rigorous foundation for performance understanding and reasoning. But the sheer volume and complexity of collected data challenges existing techniques, which rely heavily on human intervention, expert knowledge, and simple statistics. In this dissertation, we address this challenge using artificial intelligence and make the case for two important problems, datacenter performance diagnosis and datacenter workload characterization. The first thrust of this dissertation is the use of statistical causal inference and Bayesian probabilistic model for datacenter straggler diagnosis.
- 
												  Learning Key-Value Store DesignLearning Key-Value Store Design Stratos Idreos, Niv Dayan, Wilson Qin, Mali Akmanalp, Sophie Hilgard, Andrew Ross, James Lennon, Varun Jain, Harshita Gupta, David Li, Zichen Zhu Harvard University ABSTRACT We introduce the concept of design continuums for the data Key-Value Stores layout of key-value stores. A design continuum unifies major Machine Databases K V K V … K V distinct data structure designs under the same model. The Table critical insight and potential long-term impact is that such unifying models 1) render what we consider up to now as Learning Data Structures fundamentally different data structures to be seen as \views" B-Tree Table of the very same overall design space, and 2) allow \seeing" Graph LSM new data structure designs with performance properties that Store Hash are not feasible by existing designs. The core intuition be- hind the construction of design continuums is that all data Performance structures arise from the very same set of fundamental de- Update sign principles, i.e., a small set of data layout design con- Data Trade-offs cepts out of which we can synthesize any design that exists Access Patterns in the literature as well as new ones. We show how to con- Hardware struct, evaluate, and expand, design continuums and we also Cloud costs present the first continuum that unifies major data structure Read Memory designs, i.e., B+tree, Btree, LSM-tree, and LSH-table. Figure 1: From performance trade-offs to data structures, The practical benefit of a design continuum is that it cre- key-value stores and rich applications.
- 
												  Model Checking Multithreaded Programs WithView metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Illinois Digital Environment for Access to Learning and Scholarship Repository Model Checking Multithreaded Programs with Asynchronous Atomic Methods Koushik Sen and Mahesh Viswanathan Department of Computer Science, University of Illinois at Urbana-Champaign. {ksen,vmahesh}@uiuc.edu Abstract. In order to make multithreaded programming manageable, program- mers often follow a design principle where they break the problem into tasks which are then solved asynchronously and concurrently on different threads. This paper investigates the problem of model checking programs that follow this id- iom. We present a programming language SPL that encapsulates this design pat- tern. SPL extends simplified form of sequential Java to which we add the ca- pability of making asynchronous method invocations in addition to the standard synchronous method calls and the ability to execute asynchronous methods in threads atomically and concurrently. Our main result shows that the control state reachability problem for finite SPL programs is decidable. Therefore, such mul- tithreaded programs can be model checked using the counter-example guided abstraction-refinement framework. 1 Introduction Multithreaded programming is often used in software as it leads to reduced latency, improved response times of interactive applications, and more optimal use of process- ing power. Multithreaded programming also allows an application to progress even if one thread is blocked for an I/O operation. However, writing correct programs that use multiple threads is notoriously difficult, especially in the presence of a shared muta- ble memory. Since threads can interleave, there can be unintended interference through concurrent access of shared data and result in software errors due to data race and atom- icity violations.
- 
												  Myrocks in MariadbMyRocks in MariaDB Sergei Petrunia <[email protected]> MariaDB Shenzhen Meetup November 2017 2 What is MyRocks ● #include <Yoshinori’s talk> ● This talk is about MyRocks in MariaDB 3 MyRocks lives in Facebook’s MySQL branch ● github.com/facebook/mysql-5.6 – Will call this “FB/MySQL” ● MyRocks lives there in storage/rocksdb ● FB/MySQL is easy to use if you are Facebook ● Not so easy if you are not :-) 4 FB/mysql-5.6 – user perspective ● No binaries, no packages – Compile yourself from source ● Dependencies, etc. ● No releases – (Is the latest git revision ok?) ● Has extra features – e.g. extra counters “confuse” monitoring tools. 5 FB/mysql-5.6 – dev perspective ● Targets a CentOS-type OS – Compiler, cmake version, etc. – Others may or may not [periodically] work ● MariaDB/Percona file pull requests to fix ● Special command to compile – https://github.com/facebook/mysql-5.6/wiki/Build-Steps ● Special command to run tests – Test suite assumes a big machine ● Some tests even a release build 6 Putting MyRocks in MariaDB ● Goals – Wider adoption – Ease of use – Ease of development – Have MyRocks in MariaDB ● Use it with MariaDB features ● Means – Port MyRocks into MariaDB – Provide binaries and packages 7 Status of MyRocks in MariaDB 8 Status of MyRocks in MariaDB ● MariaDB 10.2 is GA (as of May, 2017) ● It includes an ALPHA version of MyRocks plugin – Working to improve maturity ● It’s a loadable plugin (ha_rocksdb.so) ● Packages – Bintar, deb, rpm, win64 zip + MSI – deb/rpm have MyRocks .so and tools in a separate package. 9 Packaging for MyRocks in MariaDB 10 MyRocks and RocksDB library ● MyRocks is tied RocksDB@revno MariaDB – RocksDB is a github submodule – No compatibility with other versions MyRocks ● RocksDB is always compiled with RocksDB MyRocks S Z n ● l i And linked-in statically a b p ● p Distros have a RocksDB package y – Not using it.
- 
												  Lecture 26: Creational PatternsCreational Patterns CSCI 4448/5448: Object-Oriented Analysis & Design Lecture 26 — 11/29/2012 © Kenneth M. Anderson, 2012 1 Goals of the Lecture • Cover material from Chapters 20-22 of the Textbook • Lessons from Design Patterns: Factories • Singleton Pattern • Object Pool Pattern • Also discuss • Builder Pattern • Lazy Instantiation © Kenneth M. Anderson, 2012 2 Pattern Classification • The Gang of Four classified patterns in three ways • The behavioral patterns are used to manage variation in behaviors (think Strategy pattern) • The structural patterns are useful to integrate existing code into new object-oriented designs (think Bridge) • The creational patterns are used to create objects • Abstract Factory, Builder, Factory Method, Prototype & Singleton © Kenneth M. Anderson, 2012 3 Factories & Their Role in OO Design • It is important to manage the creation of objects • Code that mixes object creation with the use of objects can become quickly non-cohesive • A system may have to deal with a variety of different contexts • with each context requiring a different set of objects • In design patterns, the context determines which concrete implementations need to be present © Kenneth M. Anderson, 2012 4 Factories & Their Role in OO Design • The code to determine the current context, and thus which objects to instantiate, can become complex • with many different conditional statements • If you mix this type of code with the use of the instantiated objects, your code becomes cluttered • often the use scenarios can happen in a few lines of code • if combined with creational code, the operational code gets buried behind the creational code © Kenneth M. Anderson, 2012 5 Factories provide Cohesion • The use of factories can address these issues • The conditional code can be hidden within them • pass in the parameters associated with the current context • and get back the objects you need for the situation • Then use those objects to get your work done • Factories concern themselves just with creation, letting your code focus on other things © Kenneth M.
- 
												  An Enhanced Thread Synchronization Mechanism for JavaSOFTWARE—PRACTICE AND EXPERIENCE Softw. Pract. Exper. 2001; 31:667–695 (DOI: 10.1002/spe.383) An enhanced thread synchronization mechanism for Java Hsin-Ta Chiao and Shyan-Ming Yuan∗,† Department of Computer and Information Science, National Chiao Tung University, 1001 Ta Hsueh Road, Hsinchu 300, Taiwan SUMMARY The thread synchronization mechanism of Java is derived from Hoare’s monitor concept. In the authors’ view, however, it is over simplified and suffers the following four drawbacks. First, it belongs to a category of no-priority monitor, the design of which, as reported in the literature on concurrent programming, is not well rated. Second, it offers only one condition queue. Where more than one long-term synchronization event is required, this restriction both degrades performance and further complicates the ordering problems that a no-priority monitor presents. Third, it lacks the support for building more elaborate scheduling programs. Fourth, during nested monitor invocations, deadlock may occur. In this paper, we first analyze these drawbacks in depth before proceeding to present our own proposal, which is a new monitor-based thread synchronization mechanism that we term EMonitor. This mechanism is implemented solely by Java, thus avoiding the need for any modification to the underlying Java Virtual Machine. A preprocessor is employed to translate the EMonitor syntax into the pure Java codes that invoke the EMonitor class libraries. We conclude with a comparison of the performance of the two monitors and allow the experimental results to demonstrate that, in most cases, replacing the Java version with the EMonitor version for developing concurrent Java objects is perfectly feasible.
- 
												  Designing for Performance: Concurrency and Parallelism COS 518: Computer Systems Fall 2015Designing for Performance: Concurrency and Parallelism COS 518: Computer Systems Fall 2015 Logan Stafman Adapted from slides by Mike Freedman 2 Definitions • Concurrency: – Execution of two or more tasks overlap in time. • Parallelism: – Execution of two or more tasks occurs simultaneous. Concurrency without 3 parallelism? • Parts of tasks interact with other subsystem – Network I/O, Disk I/O, GPU, ... • Other task can be scheduled while first waits on subsystem’s response Concurrency without parrallelism? Source: bjoor.me 5 Scheduling for fairness • On time-sharing system also want to schedule between tasks, even if one not blocking – Otherwise, certain tasks can keep processing – Leads to starvation of other tasks • Preemptive scheduling – Interrupt processing of tasks to process another task (why with tasks and not network packets?) • Many scheduling disciplines – FIFO, Shortest Remaining Time, Strict Priority, Round-Robin Preemptive Scheduling Source: embeddedlinux.org.cn Concurrency with 7 parallelism • Execute code concurrently across CPUs – Clusters – Cores • CPU parallelism different from distributed systems as ready availability to shared memory – Yet to avoid difference between parallelism b/w local and remote cores, many apps just use message passing between both (like HPC’s use of MPI) Symmetric Multiprocessors 8 (SMPs) Non-Uniform Memory Architectures 9 (NUMA) 10 Pros/Cons of NUMA • Pros Applications split between different processors can share memory close to hardware Reduced bus bandwidth usage • Cons Must ensure applications sharing memory are run on processors sharing memory 11 Forms of task parallelism • Processes – Isolated process address space – Higher overhead between switching processes • Threads – Concurrency within process – Shared address space – Three forms • Kernel threads (1:1) : Kernel support, can leverage hardware parallelism • User threads (N:1): Thread library in system runtime, fastest context switching, but cannot benefit from multi- threaded/proc hardware • Hybrid (M:N): Schedule M user threads on N kernel threads.
- 
												  Assessment of Barrier Implementations for Fine-Grain Parallel Regions on Current Multi-Core ArchitecturesAssessment of Barrier Implementations for Fine-Grain Parallel Regions on Current Multi-core Architectures Simon A. Berger and Alexandros Stamatakis The Exelixis Lab Department of Computer Science Technische Universitat¨ Munchen¨ Boltzmannstr. 3, D-85748 Garching b. Munchen,¨ Germany Email: [email protected], [email protected] WWW: http://wwwkramer.in.tum.de/exelixis/ time Abstract—Barrier performance for synchronizing threads master thread on current multi-core systems can be critical for scientific applications that traverse a large number of relatively small parallel regions, that is, that exhibit an unfavorable com- fork putation to synchronization ratio. By means of a synthetic and a real-world benchmark we assess 4 alternative barrier worker threads parallel region implementations on 7 current multi-core systems with 2 up to join 32 cores. We find that, barrier performance is application- and data-specific with respect to cache utilization, but that a rather fork na¨ıve lock-free barrier implementation yields good results across all applications and multi-core systems tested. We also worker threads parallel region assess distinct implementations of reduction operations that join are computed in conjunction with the barriers. The synthetic and real-world benchmarks are made available as open-source code for further testing. Keywords-barriers; multi-cores; threads; RAxML Figure 1. Classic fork-join paradigm. I. INTRODUCTION The performance of barriers for synchronizing threads on modern general-purpose multi-core systems is of vital In addition, we analyze the efficient implementation of importance for the efficiency of scientific codes. Barrier reduction operations (sums over the double values produced performance can become critical, if a scientific code exhibits by each for-loop iteration), that are frequently required a high number of relatively small (with respect to the in conjunction with barriers.
- 
												  Myrocks Deployment at Facebook and RoadmapsMyRocks deployment at Facebook and Roadmaps Yoshinori Matsunobu Production Engineer / MySQL Tech Lead, Facebook Feb/2018, #FOSDEM #mysqldevroom Agenda ▪ MySQL at Facebook ▪ MyRocks overview ▪ Production Deployment ▪ Future Plans MySQL “User Database (UDB)” at Facebook▪ Storing Social Graph ▪ Massively Sharded ▪ Low latency ▪ Automated Operations ▪ Pure Flash Storage (Constrained by space, not by CPU/IOPS) What is MyRocks ▪ MySQL on top of RocksDB (RocksDB storage engine) ▪ Open Source, distributed from MariaDB and Percona as well MySQL Clients SQL/Connector Parser Optimizer Replication etc InnoDB RocksDB MySQL http://myrocks.io/ MyRocks Initial Goal at Facebook InnoDB in main database MyRocks in main database CPU IO Space CPU IO Space Machine limit Machine limit 45% 90% 21% 15% 45% 20% 15% 21% 15% MyRocks features ▪ Clustered Index (same as InnoDB) ▪ Bloom Filter and Column Family ▪ Transactions, including consistency between binlog and RocksDB ▪ Faster data loading, deletes and replication ▪ Dynamic Options ▪ TTL ▪ Online logical and binary backup MyRocks vs InnoDB ▪ MyRocks pros ▪ Much smaller space (half compared to compressed InnoDB) ▪ Gives better cache hit rate ▪ Writes are faster = Faster Replication ▪ Much smaller bytes written (can use more afordable fash storage) ▪ MyRocks cons (improvements in progress) ▪ Lack of several features ▪ No SBR, Gap Lock, Foreign Key, Fulltext Index, Spatial Index support. Need to use case sensitive collation for perf ▪ Reads are slower, especially if your data fts in memory ▪ More dependent on flesystem