Understanding Database Performance Inefficiencies in Real-World Web Applications

Total Page:16

File Type:pdf, Size:1020Kb

Understanding Database Performance Inefficiencies in Real-World Web Applications Understanding Database Performance Inefficiencies in Real-world Web Applications Cong Yan Junwen Yang Alvin Cheung Shan Lu University of Washington University of Chicago {congy,akcheung}@cs.washington.edu {junwen,shanlu}@uchicago.edu ABSTRACT of application semantics makes it difficult for the ORM and DBMS Many modern database-backed web applications are built upon to optimize how to manipulate persistent application data. Both Object Relational Mapping (ORM) frameworks. While such frame- aspects make applications built atop ORMs vulnerable to perfor- works ease application development by abstracting persistent data mance problems that impact overall user experience, as we have as objects, such convenience comes with a performance cost. In observed from the issue reports for such ORM applications [9]. this paper, we studied 27 real-world open-source applications built To understand the causes of performance issues in ORM ap- on top of the popular Ruby on Rails ORM framework, with the goal plications, we studied 27 real-world applications built using the to understand the database-related performance inefficiencies in popular Ruby on Rails framework. The applications are all under these applications. We discovered a number of inefficiencies rang- active development and are chosen to cover a wide variety of do- ing from physical design issues to how queries are expressed in the mains: online forums, e-commerce, collaboration platforms, etc. application code. We applied static program analysis to identify and Our goal is to understand database-related inefficiencies in these measure how prevalent these issues are, then suggested techniques applications and their causes. To the best of our knowledge, this is to alleviate these issues and measured the potential performance the first comprehensive study of database-related inefficiencies in gain as a result. These techniques significantly reduce database database-backed applications built using ORM frameworks. query time (up to 91%) and the webpage response time (up to 98%). In our study, we found a number of issues across these applica- Our study provides guidance to the design of future database en- tions that cause performance problems, ranging from how data is gines and ORM frameworks to support database application that stored in the DBMS to how queries are expressed in the application are performant yet without sacrificing programmability. code. Most of these issues have not been reported in prior work. Fur- thermore, we implemented a number of static program analysis to systematically quantify how common the issues that we identified 1 INTRODUCTION are across different applications. We proposed a number of possible Object-relational mapping (ORM) frameworks are widely used to optimizations to resolve these issues that require no or little effort construct applications that interact with database management sys- from developers. As case studies, we applied these optimizations tems (DBMSs). While the implementations of such frameworks vary manually to some of these applications and our evaluation showed (e.g., Ruby on Rails [15], Django [5], and Hibernate [10]), the design that these optimizations improved performance by up to 41×. principles and goals remain the same: rather than embedding SQL In summary, this paper makes the following contributions: queries into the application code, ORMs let developers manipulate • We performed the first comprehensive database-performance persistent data as if it is in-memory objects via APIs exposed by the study of real-world applications built upon ORM frameworks. We ORMs [1, 6]. When executed, such API calls are translated by the chose 27 applications, covering a wide variety of domains. Our ORM into queries executed by the DBMS and the query results are results show that many applications share similar performance serialized into objects returned to the application. By raising the inefficiencies, including poor physical database design, coding pat- level of abstraction, this approach allows developers to implement terns that lead to inefficient queries being generated by the ORMs, their entire application in a single programming language, thereby and the lack of caching that results in redundant computation. enhancing code readability and maintainability. • For each performance inefficiency, we implemented an auto- However, the increase in programming productivity comes at mated static analysis to systematically measure how prevalent it is a cost. With the details of query processing hidden, programmers across different applications. often do not understand how their code is translated to queries and • For each inefficiency, we proposed solutions that open up how queries are executed. Furthermore, lacking an understanding new research opportunities. We manually implemented a number of these optimizations and showed that they provided up to 41× Permission to make digital or hard copies of all or part of this work for personal or performance improvement in our cases studies. classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation In the following, we review the design of ORMs and web applica- on the first page. Copyrights for components of this work owned by others than ACM tions built with ORMs in Section 2. Then in Section 3, we describe must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, our study methodology. We discuss our findings and optimizations to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. in Section 4 and Section 5, followed by related work in Section 6. CIKM’17, , November 6–10, 2017, Singapore. © 2017 Association for Computing Machinery. ACM ISBN 978-1-4503-4918-5/17/11...$15.00 https://doi.org/10.1145/3132847.3132954 View Controller Model (backed by) blog_index.erb controller.rb user.rb User <div style="width: 100%;"> class BlogController class User < ActiveRecord <%= blogs.each do |b| %> def index has_many: blogs, class => Blog id name <p> <%= b.excerpt %> </p> u = User.where(name=params[name]) def some_blogs <%= link_to '/blogs/#{b.id}'> blogs = u.some_blogs return self.blogs <%= end %> render ‘blog_index', blogs blog.rb </div> def show Blog b = Blog.where(id=arg[id]) class Blog < ActiveRecord blog_show.erb render: ‘blog_show', b belongs_to: user, class => User id user_id excerpt excerpt <p> <%= b.content %> </p> Routing rules: / —> BlogController.index /blogs/:id —> BlogController.show Application Server DBMS Figure 1: Rails application example abridged from publify [13] user then passed to render to construct a webpage defined by the view ɠ ɧ file blog_index.erb. This webpage displays the excerpt and the URL HTTP request HTTP response link (link_to) of every blog. Clicking the link brings the user to One Other application render Controller applications a separate page, with another action (show) called with the corre- ɥ … objects webpage ɡ ɤ sponding blog identifier to retrieve the corresponding blog details ɦ function objects View call via another query issued by Blog.where. Model Application Server ORM function call objects 2.2 Object relational mapping in Rails query translator Like other ORMs, Rails by default maps each model class hierarchy ɢ ɣ SQL queries query results to a table. Each field in the model class maps to a column in the table. : Data flow : Control flow Hence, for the code shown in Figure 1, User objects are stored in the DBMS User table in the DBMS, and likewise for Blog objects. Developers Figure 2: Architecture of a Rails application define relationships among the model classes, such as belongs_to and has_many [7]. Such relationships are implemented using foreign keys or by building a separate table storing the relationship. For 2 BACKGROUND example, in Figure 1, each Blog object belongs_to one User, hence In this section we provide an overview of applications constructed Rails stores the unique user_id associated with each Blog. As we using ORM frameworks, using Ruby on Rails (“Rails”) as an example. will see, how classes and their relationships are modeled can affect Rails is an ORM framework that is widely used in building web application performance significantly. applications, including many well-known websites (e.g., Airbnb, Hulu, etc.). Applications constructed on top of Rails share similar 3 ANALYSIS METHODOLOGY architecture as those built using other ORMs such as Hibernate [10] We now describe how we analyze applications, our profiling method- and Django [5]. ology, and the application corpus we choose for the study. 2.1 Design of Rails applications 3.1 Static program analysis Figure 2 shows the architecture of a typical Rails application. The application is hosted on a Rails application server that commu- BlogController.index nicates with a DBMS to retrieve and manipulate persistent data. BlogController.show Internally, a Rails application consists of three components: model, u=User.where(name=params[name]) view, and controller [30]. Upon receiving a user’s request, say to b=Blog.where(id=arg[id]) render a web page ( 1 in Figure 2), the Rails server invokes the cor- u.some_blogs b.content responding action that resides in the appropriate controller based on v1=self.blogs the routing rules provided by the application, as shown in Figure 1. blogs=v1 During its execution, the controller interacts with the DBMS by in- c.a : Controller action : Next action edge voking ORM functions provided by Rails 2 , which Rails translates blogs.each do |b| Ins : statement node into SQL queries 3 . The query results 4 are serialized into model b.excerpt link_to “blogs/#{b.id}” QF : query function : Data flow edge objects returned to the controller 5 , and subsequently passed to : Control flow edge the view 6 to construct a webpage 7 to be returned 8 . Figure 3: Action Flow Graph (AFG) example As an illustration, Figure 1 shows the code of a blogging applica- tion. While the controller and the model are written in Ruby, the We build an ORM-aware static analyzer for Rails applications.
Recommended publications
  • Coordination Avoidance in Database Systems
    COORDINATION Peter Bailis Stanford/MIT/Berkeley AVOIDANCE Alan Fekete University of Sydney IN Mike Franklin Ali Ghodsi DATABASE Joe Hellerstein Ion Stoica SYSTEMS UC Berkeley Slides of Peter Bailis' VLDB’15 talk VLDB 2014 Default Supported? Serializable Actian Ingres NO YES Aerospike NO NO transactions are Persistit NO NON Clustrix NO NON not as widely Greenplum NO YESN IBM DB2 NO YES deployed as you IBM Informix NO YES MySQL NO YES might think… MemSQL NO NO MS SQL Server NO YESN NuoDB NO NO Oracle 11G NO NON Oracle BDB YES YESN Oracle BDB JE YES YES PostgreSQL NO YES SAP Hana NO NO ScaleDB NO NON VoltDB YES YESN VLDB 2014 Default Supported? Serializable Actian Ingres NO YES Aerospike NO NO transactions are Persistit NO NON Clustrix NO NON not as widely Greenplum NO YESN IBM DB2 NO YES deployed as you IBM Informix NO YES MySQL NO YES might think… MemSQL NO NO MS SQL Server NO YESN NuoDB NO NO Oracle 11G NO NON Oracle BDB YES YESN WHY?Oracle BDB JE YES YES PostgreSQL NO YES SAP Hana NO NO ScaleDB NO NON VoltDB YES YESN serializability: equivalence to some serial execution “post on timeline” “accept friend request” very general! serializability: equivalence to some serial execution r(x) w(y←1) r(y) w(x←1) very general! …but restricts concurrency serializability: equivalence to some serial execution CONCURRENT EXECUTION r(x)=0 r(y)=0 w(y←1) IS NOT w(x←1) SERIALIZABLE! verySerializability general! requires Coordination …buttransactions restricts cannot concurrency make progress independently Serializability requires Coordination transactions cannot make progress independently Two-Phase Locking Multi-Version Concurrency Control Blocking Waiting Optimistic Concurrency Control Pre-Scheduling Aborts Costs of Coordination Between Concurrent Transactions 1.
    [Show full text]
  • La Metodología TRIZ E Integración De Software De Licencia Libre Con Módulos Multifuncionales Como Estrategia De
    10º Congreso de Innovación y Desarrollo de Productos Monterrey, N.L., del 18 al 22 de Noviembre del 2015 Instituto Tecnológico de Estudios Superiores de Monterrey, Campus Monterrey. 1 10º Congreso de Innovación y Desarrollo de Productos Monterrey, N.L., del 18 al 22 de Noviembre del 2015 Instituto Tecnológico de Estudios Superiores de Monterrey, Campus Monterrey. La metodología TRIZ e Integración de software de licencia libre con módulos multifuncionales: como estrategia de fortalecimiento y competitividad en empresas emergentes de México. Guillermo Flores Téllez Jaime Garnica González Joselito Medina Marín Elisa Arisbé Millán Rivera José Carlos Cortés Garzón Resumen El sistema productivo de México se constituye en gran proporción, por empresas emergentes que funcionan como negocios familiares, surgidos a través de una idea creativa, para emprender una actividad económica específica, ya sea de la producción de un bien o prestación de servicio. El funcionamiento de este tipo de empresas es limitado, son compañías surgidas del emprendimiento con contribuciones positivas hacia la práctica de la innovación, desarrollo de procesos y generación de empleos principalmente. Sin embargo, estas empresas se encuentran en desventaja para afrontar el entorno tecnológico global, porque no disponen de los recursos económicos, infraestructura, maquinaria o equipo que les brinde la posibilidad de operación estable y competencia internacional. El presente artículo exhibe alternativas viables y sus medios de implementación; como lo son la sinergia de operación de la metodología TRIZ y los módulos de software de gestión de negocios, para apoyar el proceso de innovación y el monitoreo de nuevas ideas llevadas al mercado, por parte de una empresa emergente.
    [Show full text]
  • Khodayari and Giancarlo Pellegrino, CISPA Helmholtz Center for Information Security
    JAW: Studying Client-side CSRF with Hybrid Property Graphs and Declarative Traversals Soheil Khodayari and Giancarlo Pellegrino, CISPA Helmholtz Center for Information Security https://www.usenix.org/conference/usenixsecurity21/presentation/khodayari This paper is included in the Proceedings of the 30th USENIX Security Symposium. August 11–13, 2021 978-1-939133-24-3 Open access to the Proceedings of the 30th USENIX Security Symposium is sponsored by USENIX. JAW: Studying Client-side CSRF with Hybrid Property Graphs and Declarative Traversals Soheil Khodayari Giancarlo Pellegrino CISPA Helmholtz Center CISPA Helmholtz Center for Information Security for Information Security Abstract ior and avoiding the inclusion of HTTP cookies in cross-site Client-side CSRF is a new type of CSRF vulnerability requests (see, e.g., [28, 29]). In the client-side CSRF, the vul- where the adversary can trick the client-side JavaScript pro- nerable component is the JavaScript program instead, which gram to send a forged HTTP request to a vulnerable target site allows an attacker to generate arbitrary requests by modifying by modifying the program’s input parameters. We have little- the input parameters of the JavaScript program. As opposed to-no knowledge of this new vulnerability, and exploratory to the traditional CSRF, existing anti-CSRF countermeasures security evaluations of JavaScript-based web applications are (see, e.g., [28, 29, 34]) are not sufficient to protect web appli- impeded by the scarcity of reliable and scalable testing tech- cations from client-side CSRF attacks. niques. This paper presents JAW, a framework that enables the Client-side CSRF is very new—with the first instance af- analysis of modern web applications against client-side CSRF fecting Facebook in 2018 [24]—and we have little-to-no leveraging declarative traversals on hybrid property graphs, a knowledge of the vulnerable behaviors, the severity of this canonical, hybrid model for JavaScript programs.
    [Show full text]
  • Osvのご紹介 in Iijlab セミナー
    OSvのご紹介 in iijlab セミナー Takuya ASADA <syuu@cloudius-systems> Cloudius Systems 自己紹介 • Software Engineer at Cloudius Systems • FreeBSD developer (bhyve, network stack..) Cloudius Systemsについて • OSvの開発母体(フルタイムデベロッパで開発) • Office:Herzliya, Israel • CTO : Avi Kivity → Linux KVMのパパ • 他の開発者:元RedHat(KVM), Parallels(Virtuozzo, OpenVZ) etc.. • イスラエルの主な人物は元Qumranet(RedHatに買収) • 半数の開発者がイスラエル以外の国からリモート開発で参加 • 18名・9ヶ国(イスラエル在住は9名) OSvの概要 OSvとは? • OSvは単一のアプリケーションをハイパーバイザ・IaaSでLinuxOSな しに実行するための新しい仕組み • より効率よく高い性能で実行 • よりシンプルに管理しやすく • オープンソース(BSDライセンス)、コミュニティでの開発 • http://osv.io/ • https://github.com/cloudius-systems/osv 標準的なIaaSスタック • 単一のアプリケーションを実行するワークロードでは フルサイズのゲストOS+フル仮想化はオーバヘッド コンテナ技術 • 実行環境をシンプルにする事が可能 • パフォーマンスも高い ライブラリOS=OSv • コンテナと比較してisolationが高い 標準的なIaaSスタック: 機能の重複、オーバヘッド • ハイパーバイザ・OSの双方が ハードウェア抽象化、メモリ Your App 保護、リソース管理、セキュ リティ、Isolationなどの機能 Application Server を提供 JVM • OS・ランタイムの双方がメモ Operating System リ保護、セキュリティなどの 機能を提供 Hypervisor • 機能が重複しており無駄が多 Hardware い OSv: 重複部分の削除 • 重複していた多くの機能を排 Your App 除 Application Server • ハイパーバイザに従来のOSの JVM 役割を負ってもらい、OSvは Core その上でプロセスのように単 Hypervisor 一のアプリケーションを実行 Hardware OSvのコンセプト • 1アプリ=1インスタンス→シングルプロセス • メモリ保護や特権モードによるプロテクションは 行わない • 単一メモリ空間、単一モード(カーネルモード • Linux互換 libc APIを提供、バイナリ互換(一部) • REST APIによりネットワーク経由で制御 動作環境 • ハイパーバイザ • IaaS • KVM • Amazon EC2 • Xen • Google Compute • VMware Engine • VirtualBox 対応アーキテクチャ • x86_64(32bit非サポート) • aarch64 対応アプリ (Java) • JRuby(Ruby on Railsなど) • OpenJDK7,8 • Ringo.JS • Tomcat • Jython • Cassandra • Erjang • Jetty • Scala • Solr • Quercus(PHPエンジン、
    [Show full text]
  • Independent Together: Building and Maintaining Values in a Distributed Web Infrastructure
    Independent Together: Building and Maintaining Values in a Distributed Web Infrastructure by Jack Jamieson A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of Information University of Toronto © Copyright 2021 by Jack Jamieson Abstract Independent Together: Building and Maintaining Values in a Distributed Web Infrastructure Jack Jamieson Doctor of Philosophy Graduate Department of Information University of Toronto 2021 This dissertation studies a community of web developers building the IndieWeb, a modular and decen- tralized social web infrastructure through which people can produce and share content and participate in online communities without being dependent on corporate platforms. The purpose of this disser- tation is to investigate how developers' values shape and are shaped by this infrastructure, including how concentrations of power and influence affect individuals' capacity to participate in design-decisions related to values. Individuals' design activities are situated in a sociotechnical system to address influ- ence among individual software artifacts, peers in the community, mechanisms for interoperability, and broader internet infrastructures. Multiple methods are combined to address design activities across individual, community, and in- frastructural scales. I observed discussions and development activities in IndieWeb's online chat and at in-person events, studied source-code and developer decision-making on GitHub, and conducted 15 in-depth interviews with IndieWeb contributors between April 2018 and June 2019. I engaged in crit- ical making to reflect on and document the process of building software for this infrastructure. And I employed computational analyses including social network analysis and topic modelling to study the structure of developers' online activities.
    [Show full text]
  • Database-Backed Web Applications in the Wild: How Well Do They Work?
    Database-Backed Web Applications in the Wild: How Well Do They Work? Cong Yan Shan Lu Alvin Cheung University of Washington University of Chicago ABSTRACT Response time of homepage 1000 Most modern database-backed web applications are built upon Ob- 100 ject Relational Mapping (ORM) frameworks. While ORM frame- sec works ease application development by abstracting persistent data 10 as objects, such convenience often comes with a performance cost. In this paper, we present CADO, a tool that analyzes the applica- 1 tion logic and its interaction with databases using the Ruby on Rails 0.1 ORM framework. CADO includes a static program analyzer, a Response time in profiler and a synthetic data generator to extract and understand ap- 0.01 plication’s performance characteristics. We used CADO to analyze gitlab publify kandan lobsters redmine sugar tracks Range of #tuples for 200-2K 2K-20K 20K-200K 200K-2M the performance problems of 27 real-world open-source Rails ap- each major table: plications, covering domains such as online forums, e-commerce, project management, blogs, etc. Based on the results, we uncov- Figure 1: How response time of homepage (log-scale in Y-axis) ered a number of issues that lead to sub-optimal application perfor- scales with database size (ranging from 1M to 25G). mance, ranging from issuing queries, how result sets are used, and physical design. We suggest possible remedies for each issue, and to handle nowadays big-data challenges. highlight new research opportunities that arise from them. Some problems are well-known in the data management [23], software engineering research communities [32, 33], and devel- 1.
    [Show full text]
  • High Performance Multi-Core Transaction Processing Via Deterministic Execution
    Abstract High Performance Multi-core Transaction Processing via Deterministic Execution Jose Manuel Faleiro 2018 The increasing democratization of server hardware with multi-core CPUs and large main memories has been one of the dominant hardware trends of the last decade. “Bare metal” servers with tens of CPU cores and over 100 gigabytes of main mem- ory have been available for several years now. Recently, this large scale hardware has also been available via the cloud; for instance, Amazon EC2 now provides in- stances with 64 physical CPU cores. Database systems, with their roots in unipro- cessors and paucity of main memory, have unsurprisingly been found wanting on modern hardware. In addition to changes in hardware, database systems have had to contend with changing application requirements and deployment environments. Database sys- tems have long provided applications with an interactive interface, in which an application can communicate with the database over several round-trips in the course of a single request. A large class of applications, however, does not require interactive interfaces, and is unwilling to pay the performance cost associated with overly flexible interfaces. Some of these applications have eschewed database sys- tems altogether in favor of high-performance key-value stores. Finally, modern applications are increasingly deployed at ever increasing scales, often serving hundreds of thousands to millions of simultaneous clients. These large scale deployments are more prone to errors due to consistency issues in their underlying database systems. Ever since their inception, database systems have provided applications to tradeoff consistency for performance, and often nudge applications towards weak consistency.
    [Show full text]
  • Coordination Avoidance in Distributed Databases
    UC Berkeley UC Berkeley Electronic Theses and Dissertations Title Coordination Avoidance in Distributed Databases Permalink https://escholarship.org/uc/item/8k8359g2 Author Bailis, Peter David Publication Date 2015 Peer reviewed|Thesis/dissertation eScholarship.org Powered by the California Digital Library University of California Coordination Avoidance in Distributed Databases By Peter David Bailis A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the University of California, Berkeley Committee in charge: Professor Joseph M. Hellerstein, Co-Chair Professor Ion Stoica, Co-Chair Professor Ali Ghodsi Professor Tapan Parikh Fall 2015 Coordination Avoidance in Distributed Databases Copyright 2015 by Peter David Bailis 1 Abstract Coordination Avoidance in Distributed Databases by Peter David Bailis Doctor of Philosophy in Computer Science University of California, Berkeley Professor Joseph M. Hellerstein, Co-Chair Professor Ion Stoica, Co-Chair The rise of Internet-scale geo-replicated services has led to upheaval in the design of modern data management systems. Given the availability, latency, and throughput penalties asso- ciated with classic mechanisms such as serializable transactions, a broad class of systems (e.g., “NoSQL”) has sought weaker alternatives that reduce the use of expensive coordina- tion during system operation, often at the cost of application integrity. When can we safely forego the cost of this expensive coordination, and when must we pay the price? In this thesis, we investigate the potential for coordination avoidance—the use of as little coordination as possible while ensuring application integrity—in several modern data- intensive domains. We demonstrate how to leverage the semantic requirements of appli- cations in data serving, transaction processing, and web services to enable more efficient distributed algorithms and system designs.
    [Show full text]
  • RELREA-An Analytical Approach Supporting Continuous Release Readiness Evaluation
    University of Calgary PRISM: University of Calgary's Digital Repository Graduate Studies The Vault: Electronic Theses and Dissertations 2014-09-30 RELREA-An Analytical Approach Supporting Continuous Release Readiness Evaluation Shahnewaz, S. M. Shahnewaz, S. M. (2014). RELREA-An Analytical Approach Supporting Continuous Release Readiness Evaluation (Unpublished master's thesis). University of Calgary, Calgary, AB. doi:10.11575/PRISM/27619 http://hdl.handle.net/11023/1867 master thesis University of Calgary graduate students retain copyright ownership and moral rights for their thesis. You may use this material in any way that is permitted by the Copyright Act or through licensing that has been assigned to the document. For uses that are not allowable under copyright legislation or licensing, you are required to seek permission. Downloaded from PRISM: https://prism.ucalgary.ca UNIVERSITY OF CALGARY RELREA-An Analytical Approach Supporting Continuous Release Readiness Evaluation by S. M. Shahnewaz A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE GRADUATE PROGRAM IN ELECTRICAL AND COMPUTER ENGINEERING CALGARY, ALBERTA SEPTEMBER, 2014 © S. M. Shahnewaz 2014 Abstract As part of iterative development, decision about “Is the software product ready to be released at some given release date?” have to be made at the end of each release, sprint or iteration. While this decision is critically important, so far it is largely done either informally or in a simplistic manner relying on a small set of isolated metrics. In addition, continuity in release readiness evaluation is not achieved and any problems related to release issues cannot be addressed proactively.
    [Show full text]
  • Understanding and Improving Database-Backed Applications
    ©Copyright 2020 Cong Yan Understanding and Improving Database-Backed Applications Cong Yan A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy University of Washington 2020 Reading Committee: Alvin Cheung, Chair Magdalena Balazinska Dan Suciu Program Authorized to Offer Degree: Paul G. Allen School of Computer Science & Engineering University of Washington Abstract Understanding and Improving Database-Backed Applications Cong Yan Chair of the Supervisory Committee: Alvin Cheung Paul G. Allen School of Computer Science & Engineering From online shopping to social media network, modern web applications are used everywhere in our daily life. These applications are often structured with three tiers: the front-end developed with HTML or JavaScript, the application server developed with object-oriented programming language like Java, Python or Ruby, and the back-end database that accepts SQL queries. Latency is critical for these applications. However, our study shows that many open-source web applications suffer from serious performance issues, with many slow pages as well as pages whose generation time scales superlinearly with the data size. Prior work has been focusing on improving the performance of each tier individually, but is often not enough to reduce the latency to meet the developer’s expectation. In this thesis, I present a different optimization strategy that enables much further improvement of database-backed applications. Rather than looking into each tier separately, I focus on optimizing one tier based on how other tiers process and consume the data. In particular, I describe five projects that implement five examples of such optimization. On the database tier, I present 1) CHESTNUT, a data layout designer that generates a customized data layout based on how the application tier consumes the query result; and 2) CONSTROPT, a query rewrite optimizer that performs query optimization by leveraging the data constraint inferred from the application code.
    [Show full text]
  • Coordination Avoidance in Distributed Databases
    Coordination Avoidance in Distributed Databases Peter Bailis Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2015-206 http://www.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-206.html October 30, 2015 Copyright © 2015, by the author(s). All rights reserved. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission. Coordination Avoidance in Distributed Databases By Peter David Bailis A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the University of California, Berkeley Committee in charge: Professor Joseph M. Hellerstein, Co-Chair Professor Ion Stoica, Co-Chair Professor Ali Ghodsi Professor Tapan Parikh Fall 2015 Coordination Avoidance in Distributed Databases Copyright 2015 by Peter David Bailis 1 Abstract Coordination Avoidance in Distributed Databases by Peter David Bailis Doctor of Philosophy in Computer Science University of California, Berkeley Professor Joseph M. Hellerstein, Co-Chair Professor Ion Stoica, Co-Chair The rise of Internet-scale geo-replicated services has led to upheaval in the design of modern data management systems. Given the availability, latency, and throughput penalties asso- ciated with classic mechanisms such as serializable transactions, a broad class of systems (e.g., “NoSQL”) has sought weaker alternatives that reduce the use of expensive coordina- tion during system operation, often at the cost of application integrity.
    [Show full text]
  • Feral Concurrency Control: an Empirical Investigation of Modern Application Integrity
    Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity Peter Bailis, Alan Fekete†, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, Ion Stoica UC Berkeley and †University of Sydney ABSTRACT Rails is interesting for at least two reasons. First, it continues to be a The rise of data-intensive “Web 2.0” Internet services has led to a popular means of developing responsive web application front-end range of popular new programming frameworks that collectively and business logic, with an active open source community and user embody the latest incarnation of the vision of Object-Relational base. Rails recently celebrated its tenth anniversary and enjoys Mapping (ORM) systems, albeit at unprecedented scale. In this considerable commercial interest, both in terms of deployment and work, we empirically investigate modern ORM-backed applica- the availability of hosted “cloud” environments such as Heroku. tions’ use and disuse of database concurrency control mechanisms. Thus, Rails programmers represent a large class of consumers of Specifically, we focus our study on the common use of feral, or database technology. Second, and perhaps more importantly, Rails application-level, mechanisms for maintaining database integrity, is “opinionated software” [41]. That is, Rails embodies the strong which, across a range of ORM systems, often take the form of declar- personal convictions of its developer community, and, in particular, ative correctness criteria, or invariants. We quantitatively analyze David Heinemeier Hansson (known as DHH), its creator. Rails is the use of these mechanisms in a range of open source applications particularly opinionated towards the database systems that it tasks written using the Ruby on Rails ORM and find that feral invariants with data storage.
    [Show full text]