Understanding and Improving Database-Backed Applications

Total Page:16

File Type:pdf, Size:1020Kb

Understanding and Improving Database-Backed Applications ©Copyright 2020 Cong Yan Understanding and Improving Database-Backed Applications Cong Yan A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy University of Washington 2020 Reading Committee: Alvin Cheung, Chair Magdalena Balazinska Dan Suciu Program Authorized to Offer Degree: Paul G. Allen School of Computer Science & Engineering University of Washington Abstract Understanding and Improving Database-Backed Applications Cong Yan Chair of the Supervisory Committee: Alvin Cheung Paul G. Allen School of Computer Science & Engineering From online shopping to social media network, modern web applications are used everywhere in our daily life. These applications are often structured with three tiers: the front-end developed with HTML or JavaScript, the application server developed with object-oriented programming language like Java, Python or Ruby, and the back-end database that accepts SQL queries. Latency is critical for these applications. However, our study shows that many open-source web applications suffer from serious performance issues, with many slow pages as well as pages whose generation time scales superlinearly with the data size. Prior work has been focusing on improving the performance of each tier individually, but is often not enough to reduce the latency to meet the developer’s expectation. In this thesis, I present a different optimization strategy that enables much further improvement of database-backed applications. Rather than looking into each tier separately, I focus on optimizing one tier based on how other tiers process and consume the data. In particular, I describe five projects that implement five examples of such optimization. On the database tier, I present 1) CHESTNUT, a data layout designer that generates a customized data layout based on how the application tier consumes the query result; and 2) CONSTROPT, a query rewrite optimizer that performs query optimization by leveraging the data constraint inferred from the application code. On the application tier, I present 3) QURO, a query reorder compiler that changes the query order to reduce the database locking time when processing transactions; and 4) POWERSTATION, an IDE plugin to fix performance anti-patterns in the application code which results in slow or redundant database queries. Besides cross-tier information, there is more we can use to improve the application. Then I propose 5) HYPERLOOP, a framework that leverages not only application knowledge but also the developer’s insights to make performance tradeoffs. In each project, by looking into the interaction with other tiers and even the developers, I show that many optimizations that may not be viable or effective when optimizing each one tier alone can improve the overall application performance significantly. I show the performance gain brought by these optimizations using real-world applications, then discuss future work in this direction and conclude. TABLE OF CONTENTS Page List of Figures . iv Chapter 1: Introduction . 1 1.1 Architecture of Database-Backed Applications . 3 1.2 Real-world Application Performance . 5 1.3 Thesis Contributions . 7 Chapter 2: CHESTNUT: Generating Application-Specific Data Layouts . 11 2.1 Data Representation Mismatch . 11 2.2 CHESTNUT Overview . 13 2.3 Data Layout and Query Plan . 17 2.4 Query Plan Generation . 21 2.5 Handling Write Queries . 28 2.6 Finding Shared Data Structures . 30 2.7 Evaluation . 32 2.8 Summary and Future Work . 44 Chapter 3: CONSTROPT: Rewriting Queries by Leveraging Application Constraints . 45 3.1 CONSTROPT Overview . 46 3.2 Data Constraint Detection . 48 3.3 Leveraging Data Constraints to Optimize Queries . 58 3.4 Automatic Query Rewrite and Verification . 64 3.5 Evaluation . 69 3.6 Summary and Future Work . 78 Chapter 4: QURO: Improving Transaction Performance by Query Reordering . 79 4.1 Query Order and Transaction Performance . 79 i 4.2 QURO Overview . 82 4.3 Preprocessing . 87 4.4 Reordering Statements . 89 4.5 Profiling . 99 4.6 Evaluation . 99 4.7 Summary and Future Work . 115 Chapter 5: POWERSTATION: Understanding and Fixing Performance Anti-patterns . 116 5.1 Understanding Performance Anti-Patterns . 116 5.2 POWERSTATION: IDE Plugin for Fixing Anti-Patterns . 129 5.3 Summary and Future Work . 133 Chapter 6: HYPERLOOP: View-driven Optimizations with Developer’s Insights . 134 6.1 View-driven Optimization . 134 6.2 HyperLoop Overview . 137 6.3 Priority-Driven Optimization . 139 6.4 HyperLoop’s User Interface . 140 6.5 View Designer . 144 6.6 Application-Tier Optimizer . 145 6.7 Layout Generator . 148 6.8 Summary . 150 Chapter 7: Future Work . 151 7.1 More Opportunities of Cross-Tier Optimization . 151 7.2 Improving Application Robustness . 152 7.3 Building an Intelligent Middleware . 153 7.4 Extending Techniques to Other Systems . 154 Chapter 8: Related Work . 155 8.1 Related Database Optimizations . 155 8.2 Performance Studies in ORM Applications . 159 8.3 Leveraging Application Semantics to Place Computation . 159 Chapter 9: Conclusion . 161 ii Bibliography . 164 iii LIST OF FIGURES Figure Number Page 1.1 Architecture of database-backed applications. 2 1.2 A typical architecture of a Rails application. 4 1.3 An example of blogging application built with Rails. 5 2.1 Using CHESTNUT’s data layout for OODAs. 14 2.2 Example of data layouts for Q1. A green box shows a Project object and a blue shaded box shows an Issue object. A triangle shows an index created on the array of objects. 16 2.3 Workflow of CHESTNUT ............................... 17 2.4 Example of how CHESTNUT enumerates data structures and plans for Q1. A green box shows a Project and a blue box an Issue.................... 27 2.5 Query performance comparison. Figures on the left shows the relative time of slow queries compared to the original. The original query time is shown as text on the top. Each bar is divided into the top part showing the data-retrieving time and the shaded bottom part showing the deserialization time. Figures on the right shows the summary of other short read queries and write queries. 37 2.6 Case study of Kandan-Q1. (a) shows the original query. (b) shows the corresponding SQL queries. (c) shows the data layout (left) and the query plan (right) generated by CHESTNUT. .................................... 38 2.7 Case study of Redmine-Q3. 39 2.8 Case study of Redmine-Q8. 39 2.9 Comparison with hand-tuned materialized views on individual slow queries. 40 2.10 Scaling Kandan’s data to 50G. 41 2.11 Evaluation on TPC-H. 42 3.1 CONSTROPT architecture . 48 3.2 Example program flow analysis for different types of target node. 55 3.3 Before and after query plans of replacing disjunctions with unions. The parameter :c (shown in Listing 3.11 is given a concrete value 1 in the query plan. 62 iv 3.4 Illustration of query rewrite verification . 67 3.5 Query time evaluation with PosgreSQL. Each figure shows one type of optimization. Each bar shows the speedup of one query (left y axis) and the dot on line above the bar shows the actual time of the original query in millisecond (right y axis). X-axis shows the number of query. 74 3.6 Query time evaluation with MySQL. 76 3.7 Speed up under different percentage of invalid input . 77 3.8 Evaluation on webpage performance. It shows the speedup of end-to-end webpage load time, the speedup of server time (the time spent on application server and the database), and the absolute time of the end-to-end page loading time. 77 4.1 Comparison of the execution of the original and reordered implementation of the payment transaction. The darker the color, more likely the query is going to access contentious data. 84 4.2 Architecture and workflow of QURO ......................... 85 4.3 Performance comparison: varying data contention . 100 4.4 Performance comparison: varying number of threads for TPC-C benchmark . 101 4.5 Performance comparison: varying number of threads . 104 4.6 Self-relative speedup comparison . 106 4.7 Execution of trade update transactions T1 and T2 before and after reordering. Deadlock window is the time range where if T2 starts within the range, it is likely to deadlock with T1. 109 4.8 TPC-C worst-case . 110 4.9 Comparison with stored procedure. 110 4.10 Throughput comparison by varying buffer pool size . 111 4.11 Performance comparison of TPCC transactions among original implementation under OCC, MVCC, and variants of 2PL. 112 4.12 Execution of transaction T1 and T2 under different concurrency control schemes. 113 5.1 Action Flow Graph (AFG) example . 117 5.2 Performance gain by caching of query results. 119 5.3 Performance gain after applying all optimizations . 124 5.4 Transfer size reduction . 125 5.5 Pagination evaluation. The original page renders 1K to 5K records, as opposed to 40 records per page after pagination. 126 v 5.6 Caching evaluation on paginated webpages. p1 and p2 refer to the current the next pages, respectively. (we observed no significant latency difference among the pages that are reachable from p1). The x-axis shows.
Recommended publications
  • Coordination Avoidance in Database Systems
    COORDINATION Peter Bailis Stanford/MIT/Berkeley AVOIDANCE Alan Fekete University of Sydney IN Mike Franklin Ali Ghodsi DATABASE Joe Hellerstein Ion Stoica SYSTEMS UC Berkeley Slides of Peter Bailis' VLDB’15 talk VLDB 2014 Default Supported? Serializable Actian Ingres NO YES Aerospike NO NO transactions are Persistit NO NON Clustrix NO NON not as widely Greenplum NO YESN IBM DB2 NO YES deployed as you IBM Informix NO YES MySQL NO YES might think… MemSQL NO NO MS SQL Server NO YESN NuoDB NO NO Oracle 11G NO NON Oracle BDB YES YESN Oracle BDB JE YES YES PostgreSQL NO YES SAP Hana NO NO ScaleDB NO NON VoltDB YES YESN VLDB 2014 Default Supported? Serializable Actian Ingres NO YES Aerospike NO NO transactions are Persistit NO NON Clustrix NO NON not as widely Greenplum NO YESN IBM DB2 NO YES deployed as you IBM Informix NO YES MySQL NO YES might think… MemSQL NO NO MS SQL Server NO YESN NuoDB NO NO Oracle 11G NO NON Oracle BDB YES YESN WHY?Oracle BDB JE YES YES PostgreSQL NO YES SAP Hana NO NO ScaleDB NO NON VoltDB YES YESN serializability: equivalence to some serial execution “post on timeline” “accept friend request” very general! serializability: equivalence to some serial execution r(x) w(y←1) r(y) w(x←1) very general! …but restricts concurrency serializability: equivalence to some serial execution CONCURRENT EXECUTION r(x)=0 r(y)=0 w(y←1) IS NOT w(x←1) SERIALIZABLE! verySerializability general! requires Coordination …buttransactions restricts cannot concurrency make progress independently Serializability requires Coordination transactions cannot make progress independently Two-Phase Locking Multi-Version Concurrency Control Blocking Waiting Optimistic Concurrency Control Pre-Scheduling Aborts Costs of Coordination Between Concurrent Transactions 1.
    [Show full text]
  • La Metodología TRIZ E Integración De Software De Licencia Libre Con Módulos Multifuncionales Como Estrategia De
    10º Congreso de Innovación y Desarrollo de Productos Monterrey, N.L., del 18 al 22 de Noviembre del 2015 Instituto Tecnológico de Estudios Superiores de Monterrey, Campus Monterrey. 1 10º Congreso de Innovación y Desarrollo de Productos Monterrey, N.L., del 18 al 22 de Noviembre del 2015 Instituto Tecnológico de Estudios Superiores de Monterrey, Campus Monterrey. La metodología TRIZ e Integración de software de licencia libre con módulos multifuncionales: como estrategia de fortalecimiento y competitividad en empresas emergentes de México. Guillermo Flores Téllez Jaime Garnica González Joselito Medina Marín Elisa Arisbé Millán Rivera José Carlos Cortés Garzón Resumen El sistema productivo de México se constituye en gran proporción, por empresas emergentes que funcionan como negocios familiares, surgidos a través de una idea creativa, para emprender una actividad económica específica, ya sea de la producción de un bien o prestación de servicio. El funcionamiento de este tipo de empresas es limitado, son compañías surgidas del emprendimiento con contribuciones positivas hacia la práctica de la innovación, desarrollo de procesos y generación de empleos principalmente. Sin embargo, estas empresas se encuentran en desventaja para afrontar el entorno tecnológico global, porque no disponen de los recursos económicos, infraestructura, maquinaria o equipo que les brinde la posibilidad de operación estable y competencia internacional. El presente artículo exhibe alternativas viables y sus medios de implementación; como lo son la sinergia de operación de la metodología TRIZ y los módulos de software de gestión de negocios, para apoyar el proceso de innovación y el monitoreo de nuevas ideas llevadas al mercado, por parte de una empresa emergente.
    [Show full text]
  • Khodayari and Giancarlo Pellegrino, CISPA Helmholtz Center for Information Security
    JAW: Studying Client-side CSRF with Hybrid Property Graphs and Declarative Traversals Soheil Khodayari and Giancarlo Pellegrino, CISPA Helmholtz Center for Information Security https://www.usenix.org/conference/usenixsecurity21/presentation/khodayari This paper is included in the Proceedings of the 30th USENIX Security Symposium. August 11–13, 2021 978-1-939133-24-3 Open access to the Proceedings of the 30th USENIX Security Symposium is sponsored by USENIX. JAW: Studying Client-side CSRF with Hybrid Property Graphs and Declarative Traversals Soheil Khodayari Giancarlo Pellegrino CISPA Helmholtz Center CISPA Helmholtz Center for Information Security for Information Security Abstract ior and avoiding the inclusion of HTTP cookies in cross-site Client-side CSRF is a new type of CSRF vulnerability requests (see, e.g., [28, 29]). In the client-side CSRF, the vul- where the adversary can trick the client-side JavaScript pro- nerable component is the JavaScript program instead, which gram to send a forged HTTP request to a vulnerable target site allows an attacker to generate arbitrary requests by modifying by modifying the program’s input parameters. We have little- the input parameters of the JavaScript program. As opposed to-no knowledge of this new vulnerability, and exploratory to the traditional CSRF, existing anti-CSRF countermeasures security evaluations of JavaScript-based web applications are (see, e.g., [28, 29, 34]) are not sufficient to protect web appli- impeded by the scarcity of reliable and scalable testing tech- cations from client-side CSRF attacks. niques. This paper presents JAW, a framework that enables the Client-side CSRF is very new—with the first instance af- analysis of modern web applications against client-side CSRF fecting Facebook in 2018 [24]—and we have little-to-no leveraging declarative traversals on hybrid property graphs, a knowledge of the vulnerable behaviors, the severity of this canonical, hybrid model for JavaScript programs.
    [Show full text]
  • Osvのご紹介 in Iijlab セミナー
    OSvのご紹介 in iijlab セミナー Takuya ASADA <syuu@cloudius-systems> Cloudius Systems 自己紹介 • Software Engineer at Cloudius Systems • FreeBSD developer (bhyve, network stack..) Cloudius Systemsについて • OSvの開発母体(フルタイムデベロッパで開発) • Office:Herzliya, Israel • CTO : Avi Kivity → Linux KVMのパパ • 他の開発者:元RedHat(KVM), Parallels(Virtuozzo, OpenVZ) etc.. • イスラエルの主な人物は元Qumranet(RedHatに買収) • 半数の開発者がイスラエル以外の国からリモート開発で参加 • 18名・9ヶ国(イスラエル在住は9名) OSvの概要 OSvとは? • OSvは単一のアプリケーションをハイパーバイザ・IaaSでLinuxOSな しに実行するための新しい仕組み • より効率よく高い性能で実行 • よりシンプルに管理しやすく • オープンソース(BSDライセンス)、コミュニティでの開発 • http://osv.io/ • https://github.com/cloudius-systems/osv 標準的なIaaSスタック • 単一のアプリケーションを実行するワークロードでは フルサイズのゲストOS+フル仮想化はオーバヘッド コンテナ技術 • 実行環境をシンプルにする事が可能 • パフォーマンスも高い ライブラリOS=OSv • コンテナと比較してisolationが高い 標準的なIaaSスタック: 機能の重複、オーバヘッド • ハイパーバイザ・OSの双方が ハードウェア抽象化、メモリ Your App 保護、リソース管理、セキュ リティ、Isolationなどの機能 Application Server を提供 JVM • OS・ランタイムの双方がメモ Operating System リ保護、セキュリティなどの 機能を提供 Hypervisor • 機能が重複しており無駄が多 Hardware い OSv: 重複部分の削除 • 重複していた多くの機能を排 Your App 除 Application Server • ハイパーバイザに従来のOSの JVM 役割を負ってもらい、OSvは Core その上でプロセスのように単 Hypervisor 一のアプリケーションを実行 Hardware OSvのコンセプト • 1アプリ=1インスタンス→シングルプロセス • メモリ保護や特権モードによるプロテクションは 行わない • 単一メモリ空間、単一モード(カーネルモード • Linux互換 libc APIを提供、バイナリ互換(一部) • REST APIによりネットワーク経由で制御 動作環境 • ハイパーバイザ • IaaS • KVM • Amazon EC2 • Xen • Google Compute • VMware Engine • VirtualBox 対応アーキテクチャ • x86_64(32bit非サポート) • aarch64 対応アプリ (Java) • JRuby(Ruby on Railsなど) • OpenJDK7,8 • Ringo.JS • Tomcat • Jython • Cassandra • Erjang • Jetty • Scala • Solr • Quercus(PHPエンジン、
    [Show full text]
  • Independent Together: Building and Maintaining Values in a Distributed Web Infrastructure
    Independent Together: Building and Maintaining Values in a Distributed Web Infrastructure by Jack Jamieson A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of Information University of Toronto © Copyright 2021 by Jack Jamieson Abstract Independent Together: Building and Maintaining Values in a Distributed Web Infrastructure Jack Jamieson Doctor of Philosophy Graduate Department of Information University of Toronto 2021 This dissertation studies a community of web developers building the IndieWeb, a modular and decen- tralized social web infrastructure through which people can produce and share content and participate in online communities without being dependent on corporate platforms. The purpose of this disser- tation is to investigate how developers' values shape and are shaped by this infrastructure, including how concentrations of power and influence affect individuals' capacity to participate in design-decisions related to values. Individuals' design activities are situated in a sociotechnical system to address influ- ence among individual software artifacts, peers in the community, mechanisms for interoperability, and broader internet infrastructures. Multiple methods are combined to address design activities across individual, community, and in- frastructural scales. I observed discussions and development activities in IndieWeb's online chat and at in-person events, studied source-code and developer decision-making on GitHub, and conducted 15 in-depth interviews with IndieWeb contributors between April 2018 and June 2019. I engaged in crit- ical making to reflect on and document the process of building software for this infrastructure. And I employed computational analyses including social network analysis and topic modelling to study the structure of developers' online activities.
    [Show full text]
  • Database-Backed Web Applications in the Wild: How Well Do They Work?
    Database-Backed Web Applications in the Wild: How Well Do They Work? Cong Yan Shan Lu Alvin Cheung University of Washington University of Chicago ABSTRACT Response time of homepage 1000 Most modern database-backed web applications are built upon Ob- 100 ject Relational Mapping (ORM) frameworks. While ORM frame- sec works ease application development by abstracting persistent data 10 as objects, such convenience often comes with a performance cost. In this paper, we present CADO, a tool that analyzes the applica- 1 tion logic and its interaction with databases using the Ruby on Rails 0.1 ORM framework. CADO includes a static program analyzer, a Response time in profiler and a synthetic data generator to extract and understand ap- 0.01 plication’s performance characteristics. We used CADO to analyze gitlab publify kandan lobsters redmine sugar tracks Range of #tuples for 200-2K 2K-20K 20K-200K 200K-2M the performance problems of 27 real-world open-source Rails ap- each major table: plications, covering domains such as online forums, e-commerce, project management, blogs, etc. Based on the results, we uncov- Figure 1: How response time of homepage (log-scale in Y-axis) ered a number of issues that lead to sub-optimal application perfor- scales with database size (ranging from 1M to 25G). mance, ranging from issuing queries, how result sets are used, and physical design. We suggest possible remedies for each issue, and to handle nowadays big-data challenges. highlight new research opportunities that arise from them. Some problems are well-known in the data management [23], software engineering research communities [32, 33], and devel- 1.
    [Show full text]
  • High Performance Multi-Core Transaction Processing Via Deterministic Execution
    Abstract High Performance Multi-core Transaction Processing via Deterministic Execution Jose Manuel Faleiro 2018 The increasing democratization of server hardware with multi-core CPUs and large main memories has been one of the dominant hardware trends of the last decade. “Bare metal” servers with tens of CPU cores and over 100 gigabytes of main mem- ory have been available for several years now. Recently, this large scale hardware has also been available via the cloud; for instance, Amazon EC2 now provides in- stances with 64 physical CPU cores. Database systems, with their roots in unipro- cessors and paucity of main memory, have unsurprisingly been found wanting on modern hardware. In addition to changes in hardware, database systems have had to contend with changing application requirements and deployment environments. Database sys- tems have long provided applications with an interactive interface, in which an application can communicate with the database over several round-trips in the course of a single request. A large class of applications, however, does not require interactive interfaces, and is unwilling to pay the performance cost associated with overly flexible interfaces. Some of these applications have eschewed database sys- tems altogether in favor of high-performance key-value stores. Finally, modern applications are increasingly deployed at ever increasing scales, often serving hundreds of thousands to millions of simultaneous clients. These large scale deployments are more prone to errors due to consistency issues in their underlying database systems. Ever since their inception, database systems have provided applications to tradeoff consistency for performance, and often nudge applications towards weak consistency.
    [Show full text]
  • Coordination Avoidance in Distributed Databases
    UC Berkeley UC Berkeley Electronic Theses and Dissertations Title Coordination Avoidance in Distributed Databases Permalink https://escholarship.org/uc/item/8k8359g2 Author Bailis, Peter David Publication Date 2015 Peer reviewed|Thesis/dissertation eScholarship.org Powered by the California Digital Library University of California Coordination Avoidance in Distributed Databases By Peter David Bailis A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the University of California, Berkeley Committee in charge: Professor Joseph M. Hellerstein, Co-Chair Professor Ion Stoica, Co-Chair Professor Ali Ghodsi Professor Tapan Parikh Fall 2015 Coordination Avoidance in Distributed Databases Copyright 2015 by Peter David Bailis 1 Abstract Coordination Avoidance in Distributed Databases by Peter David Bailis Doctor of Philosophy in Computer Science University of California, Berkeley Professor Joseph M. Hellerstein, Co-Chair Professor Ion Stoica, Co-Chair The rise of Internet-scale geo-replicated services has led to upheaval in the design of modern data management systems. Given the availability, latency, and throughput penalties asso- ciated with classic mechanisms such as serializable transactions, a broad class of systems (e.g., “NoSQL”) has sought weaker alternatives that reduce the use of expensive coordina- tion during system operation, often at the cost of application integrity. When can we safely forego the cost of this expensive coordination, and when must we pay the price? In this thesis, we investigate the potential for coordination avoidance—the use of as little coordination as possible while ensuring application integrity—in several modern data- intensive domains. We demonstrate how to leverage the semantic requirements of appli- cations in data serving, transaction processing, and web services to enable more efficient distributed algorithms and system designs.
    [Show full text]
  • RELREA-An Analytical Approach Supporting Continuous Release Readiness Evaluation
    University of Calgary PRISM: University of Calgary's Digital Repository Graduate Studies The Vault: Electronic Theses and Dissertations 2014-09-30 RELREA-An Analytical Approach Supporting Continuous Release Readiness Evaluation Shahnewaz, S. M. Shahnewaz, S. M. (2014). RELREA-An Analytical Approach Supporting Continuous Release Readiness Evaluation (Unpublished master's thesis). University of Calgary, Calgary, AB. doi:10.11575/PRISM/27619 http://hdl.handle.net/11023/1867 master thesis University of Calgary graduate students retain copyright ownership and moral rights for their thesis. You may use this material in any way that is permitted by the Copyright Act or through licensing that has been assigned to the document. For uses that are not allowable under copyright legislation or licensing, you are required to seek permission. Downloaded from PRISM: https://prism.ucalgary.ca UNIVERSITY OF CALGARY RELREA-An Analytical Approach Supporting Continuous Release Readiness Evaluation by S. M. Shahnewaz A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE GRADUATE PROGRAM IN ELECTRICAL AND COMPUTER ENGINEERING CALGARY, ALBERTA SEPTEMBER, 2014 © S. M. Shahnewaz 2014 Abstract As part of iterative development, decision about “Is the software product ready to be released at some given release date?” have to be made at the end of each release, sprint or iteration. While this decision is critically important, so far it is largely done either informally or in a simplistic manner relying on a small set of isolated metrics. In addition, continuity in release readiness evaluation is not achieved and any problems related to release issues cannot be addressed proactively.
    [Show full text]
  • Understanding Database Performance Inefficiencies in Real-World Web Applications
    Understanding Database Performance Inefficiencies in Real-world Web Applications Cong Yan Junwen Yang Alvin Cheung Shan Lu University of Washington University of Chicago {congy,akcheung}@cs.washington.edu {junwen,shanlu}@uchicago.edu ABSTRACT of application semantics makes it difficult for the ORM and DBMS Many modern database-backed web applications are built upon to optimize how to manipulate persistent application data. Both Object Relational Mapping (ORM) frameworks. While such frame- aspects make applications built atop ORMs vulnerable to perfor- works ease application development by abstracting persistent data mance problems that impact overall user experience, as we have as objects, such convenience comes with a performance cost. In observed from the issue reports for such ORM applications [9]. this paper, we studied 27 real-world open-source applications built To understand the causes of performance issues in ORM ap- on top of the popular Ruby on Rails ORM framework, with the goal plications, we studied 27 real-world applications built using the to understand the database-related performance inefficiencies in popular Ruby on Rails framework. The applications are all under these applications. We discovered a number of inefficiencies rang- active development and are chosen to cover a wide variety of do- ing from physical design issues to how queries are expressed in the mains: online forums, e-commerce, collaboration platforms, etc. application code. We applied static program analysis to identify and Our goal is to understand database-related inefficiencies in these measure how prevalent these issues are, then suggested techniques applications and their causes. To the best of our knowledge, this is to alleviate these issues and measured the potential performance the first comprehensive study of database-related inefficiencies in gain as a result.
    [Show full text]
  • Coordination Avoidance in Distributed Databases
    Coordination Avoidance in Distributed Databases Peter Bailis Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2015-206 http://www.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-206.html October 30, 2015 Copyright © 2015, by the author(s). All rights reserved. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission. Coordination Avoidance in Distributed Databases By Peter David Bailis A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the University of California, Berkeley Committee in charge: Professor Joseph M. Hellerstein, Co-Chair Professor Ion Stoica, Co-Chair Professor Ali Ghodsi Professor Tapan Parikh Fall 2015 Coordination Avoidance in Distributed Databases Copyright 2015 by Peter David Bailis 1 Abstract Coordination Avoidance in Distributed Databases by Peter David Bailis Doctor of Philosophy in Computer Science University of California, Berkeley Professor Joseph M. Hellerstein, Co-Chair Professor Ion Stoica, Co-Chair The rise of Internet-scale geo-replicated services has led to upheaval in the design of modern data management systems. Given the availability, latency, and throughput penalties asso- ciated with classic mechanisms such as serializable transactions, a broad class of systems (e.g., “NoSQL”) has sought weaker alternatives that reduce the use of expensive coordina- tion during system operation, often at the cost of application integrity.
    [Show full text]
  • Feral Concurrency Control: an Empirical Investigation of Modern Application Integrity
    Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity Peter Bailis, Alan Fekete†, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, Ion Stoica UC Berkeley and †University of Sydney ABSTRACT Rails is interesting for at least two reasons. First, it continues to be a The rise of data-intensive “Web 2.0” Internet services has led to a popular means of developing responsive web application front-end range of popular new programming frameworks that collectively and business logic, with an active open source community and user embody the latest incarnation of the vision of Object-Relational base. Rails recently celebrated its tenth anniversary and enjoys Mapping (ORM) systems, albeit at unprecedented scale. In this considerable commercial interest, both in terms of deployment and work, we empirically investigate modern ORM-backed applica- the availability of hosted “cloud” environments such as Heroku. tions’ use and disuse of database concurrency control mechanisms. Thus, Rails programmers represent a large class of consumers of Specifically, we focus our study on the common use of feral, or database technology. Second, and perhaps more importantly, Rails application-level, mechanisms for maintaining database integrity, is “opinionated software” [41]. That is, Rails embodies the strong which, across a range of ORM systems, often take the form of declar- personal convictions of its developer community, and, in particular, ative correctness criteria, or invariants. We quantitatively analyze David Heinemeier Hansson (known as DHH), its creator. Rails is the use of these mechanisms in a range of open source applications particularly opinionated towards the database systems that it tasks written using the Ruby on Rails ORM and find that feral invariants with data storage.
    [Show full text]