Coordination Avoidance in Distributed Databases

UC Berkeley Electronic Theses and Dissertations

Title: Coordination Avoidance in Distributed Databases
Permalink: https://escholarship.org/uc/item/8k8359g2
Author: Bailis, Peter David
Publication Date: 2015
Peer reviewed | Thesis/dissertation

By Peter David Bailis

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the University of California, Berkeley

Committee in charge:
  Professor Joseph M. Hellerstein, Co-Chair
  Professor Ion Stoica, Co-Chair
  Professor Ali Ghodsi
  Professor Tapan Parikh

Fall 2015

Copyright 2015 by Peter David Bailis

Abstract

The rise of Internet-scale geo-replicated services has led to upheaval in the design of modern data management systems. Given the availability, latency, and throughput penalties associated with classic mechanisms such as serializable transactions, a broad class of systems (e.g., “NoSQL”) has sought weaker alternatives that reduce the use of expensive coordination during system operation, often at the cost of application integrity. When can we safely forego the cost of this expensive coordination, and when must we pay the price?

In this thesis, we investigate the potential for coordination avoidance (the use of as little coordination as possible while ensuring application integrity) in several modern data-intensive domains. We demonstrate how to leverage the semantic requirements of applications in data serving, transaction processing, and web services to enable more efficient distributed algorithms and system designs. The resulting prototype systems demonstrate regular order-of-magnitude speedups compared to their traditional, coordinated counterparts on a variety of tasks, including referential integrity and index maintenance, transaction execution under common isolation models, and database constraint enforcement. A range of open source applications and systems exhibit similar results.

To my family

Contents

List of Figures
List of Tables
Acknowledgments
1 Introduction
  1.1 Coordination Avoidance
  1.2 Primary Contributions
  1.3 Outline and Previously Published Work
2 Coordination: Concepts and Costs
  2.1 Coordination and Correctness in Database Systems
  2.2 Understanding the Costs of Coordination
    2.2.1 Latency
    2.2.2 Throughput and Scalability
    2.2.3 Availability and Failures
    2.2.4 Summary: Costs
    2.2.5 Outcome: NoSQL, Historical Context, Safety and Liveness
  2.3 System Model
3 Invariant Confluence and Coordination
  3.1 Invariant Confluence: Criteria Defined
  3.2 Invariant Confluence and Coordination-Free Execution
  3.3 Discussion and Limitations
  3.4 Summary
4 Coordination Avoidance and Weak Isolation
  4.1 ACID in the Wild
  4.2 Invariant Confluence Analysis: Isolation Levels
    4.2.1 Invariant Confluent Isolation Guarantees
    4.2.2 Sticky Availability
    4.2.3 Non-Invariant Confluent Semantics
    4.2.4 Summary
  4.3 Implications: Existing Algorithms and Empirical Impact
    4.3.1 Existing Algorithms
    4.3.2 Empirical Impact: Isolation Guarantees
  4.4 Isolation Models
  4.5 Summary
5 Coordination Avoidance and RAMP Transactions
  5.1 Overview
  5.2 Read Atomic Isolation in the Wild
  5.3 Semantics and System Model
    5.3.1 RA Isolation: Formal Specification
    5.3.2 RA Implications and Limitations
    5.3.3 RA Compared to Other Isolation Models
    5.3.4 RA and Serializability
    5.3.5 System Model and Scalability
  5.4 RAMP Transaction Algorithms
    5.4.1 RAMP-Fast
    5.4.2 RAMP-Small: Trading Metadata for RTTs
    5.4.3 RAMP-Hybrid: An Intermediate Solution
    5.4.4 Summary and Additional Details
    5.4.5 Distribution and Fault Tolerance
    5.4.6 Additional Semantics
    5.4.7 Further Optimizations
  5.5 Experimental Evaluation
    5.5.1 Experimental Setup
    5.5.2 Experimental Results: Comparison
    5.5.3 Experimental Results: CTP Overhead
    5.5.4 Experimental Results: Scalability
  5.6 Applying and Modifying the RAMP Protocols
    5.6.1 Multi-Datacenter RAMP
    5.6.2 Quorum-Replicated RAMP Operation
    5.6.3 RAMP, Transitive Dependencies, and Causal Consistency
  5.7 RSIW Proof
  5.8 RAMP Correctness and Independence
  5.9 Discussion
  5.10 Summary
6 Coordination Avoidance for Database Constraints
  6.1 Invariant Confluence of SQL Constraints
    6.1.1 Invariant Confluence for SQL Relations
    6.1.2 Invariant Confluence for SQL Data Types
    6.1.3 SQL Discussion and Limitations
  6.2 More Formal Invariant Confluence Analysis of SQL Constraints
  6.3 Empirical Impact: SQL-Based Constraints
    6.3.1 TPC-C Invariants and Execution
    6.3.2 Evaluating TPC-C New-Order
    6.3.3 Analyzing Additional Applications
  6.4 Constraints from Open Source Applications
    6.4.1 Background and Context
    6.4.2 Feral Mechanisms in Rails
    6.4.3 Rails Invariant Confluence Analysis
  6.5 Quantifying Integrity Violations in Rails
  6.6 Other Frameworks
  6.7 Implications for Databases
    6.7.1 Summary: Database Shortcomings Today
    6.7.2 Domesticating Feral Mechanisms
  6.8 Detailed Validation Behavior, Experimental Workload
    6.8.1 Uniqueness Validation Behavior
    6.8.2 Association Validation Behavior
    6.8.3 Uniqueness Validation Schema
    6.8.4 Uniqueness Stress Test
    6.8.5 Uniqueness Workload Test
    6.8.6 Association Validation Schema
    6.8.7 Association Stress Test
    6.8.8 Association Workload Test
  6.9 Summary
7 Related Work
8 Conclusions
  8.1 Design Patterns for Coordination Avoidance
  8.2 Limitations
  8.3 Future Work
    8.3.1 Automating Coordination Avoidance
    8.3.2 Comprehending Weak Isolation
    8.3.3 Emerging Application Patterns
    8.3.4 Statistical Coordination Avoidance
  8.4 Closing Thoughts
Bibliography

List of Figures

1.1 An illustration of a distributed, replicated database and its relation to application servers and end users. In modern distributed databases, data is stored on several servers that may be located in geographically distant regions (e.g., Virginia and Oregon, or even different continents) and may be accessed by multiple database clients (e.g., application servers, analytics frameworks, database administrators) simultaneously. The key challenge that we investigate in this thesis is how to minimize the amount of synchronous communication across databases while providing “always on,” scalable, and high performance access to each replica.

1.2 In this thesis, we develop the principle of Invariant Confluence, a necessary and sufficient condition for safe, convergent, coordination-free execution, and apply it to a range of application domains at increasing levels of abstraction: database isolation, database constraints, and safety properties from modern database-backed applications. Each guarantee that is invariant confluent is guaranteed to have at least one coordination-free implementation; we investigate the design of several implementations in this work, which operate at the