A Comparative Analysis of Materialized Views Selection and Concurrency Control Mechanisms in Nosql Databases

Total Page:16

File Type:pdf, Size:1020Kb

A Comparative Analysis of Materialized Views Selection and Concurrency Control Mechanisms in Nosql Databases A Comparative Analysis of Materialized Views Selection and Concurrency Control Mechanisms in NoSQL Databases Ashish Tapdiya Yuan Xue Daniel Fabbri Vanderbilt University Vanderbilt University Vanderbilt University Email: [email protected] Email: [email protected] Email: [email protected] Abstract—Increasing resource demands require relational NewSQL architectures enable a database to scale out lin- databases to scale. While relational databases are well suited early while providing ACID transaction guarantees. How- for vertical scaling, specialized hardware can be expensive. ever, their schema design requires careful consideration when Conversely, emerging NewSQL and NoSQL data stores are designed to scale horizontally. NewSQL databases provide ACID choosing partition keys, since joins are restricted to parti- transaction support; however, joins are limited to the partition tion keys only [7], resulting in limited query expressiveness. keys, resulting in restricted query expressiveness. On the other Similarly, NoSQL databases can also scale out linearly, but hand, NoSQL databases are designed to scale out linearly on are limited by slow join performance due to the distribution commodity hardware; however, they are limited by slow join of data across the cluster and data transfer latency, which performance. Hence, we consider if the NoSQL join performance can be improved while ensuring ACID semantics and without has also been identified in previous work [8]. Thus, while drastically sacrificing write performance, disk utilization and NewSQL and NoSQL systems allow data stores to scale, their query expressiveness. designs sacrifice query expressiveness and join performance, This paper presents the Synergy system that leverages schema respectively. More generally, there exists a design space that and workload driven mechanism to identify materialized views makes trade-offs between performance, ACID guarantees, and a specialized concurrency control system on top of a NoSQL database to enable scalable data management with familiar query expressiveness and disk utilization. relational conventions. Synergy trades slight write performance This paper considers if NoSQL join performance can be degradation and increased disk utilization for faster join perfor- improved while ensuring ACID semantics and without drasti- mance (compared to standard NoSQL databases) and improved cally sacrificing write performance, disk utilization and query query expressiveness (compared to NewSQL databases). Exper- expressiveness. One option for improving the performance of imental results using the TPC-W benchmark show that, for a database populated with 1M customers, the Synergy system NoSQL workloads is materialized views (MVs), which pre- exhibits a maximum performance improvement of 80.5% as compute expensive joins [9], [10]. However, deploying MVs compared to other evaluated systems. on top of a NoSQL store does not guarantee consistency as Index Terms—Transaction processing, materialized views, atomic key based operations allow for the MV’s data to be NoSQL databases, performance evaluation stale relative to the base table [6], [11], [12]. Hence, additional concurrency controls such as locking or multi-versioning are I. INTRODUCTION required to ensure data consistency. Application development with relational databases as the Standard concurrency control methods, such as locking or storage backend is prevalent, because relational databases multi-versioning, can provide ACID semantics for NoSQL provide: a formal framework for schema design, a structured stores with materialized views, but induce performance degra- arXiv:1710.01792v1 [cs.DB] 4 Oct 2017 query language (i.e., SQL) and ACID transactions. As appli- dation (i.e., by grabbing many locks or checking multiple cations gain popularity and the database system reaches its versions, respectively) because the concurrency control mech- resource limits, the architecture must scale up to ensure end- anism and MVs selection mechanism are not designed in to-end response time (RT). Relational databases are well suited tandem. Instead, this paper considers a synergistic design for vertical scaling; however, vertical scaling has known limi- space in which the concurrency control mechanism and MV’s tations and requires expensive hardware. On the contrary, the selection mechanism operate together such that only a single new brand of NewSQL and NoSQL databases are known for lock is grabbed per transaction. The proposed system relies on their ability to scale out linearly [1], [2], [3]. Hence, as the data the hierarchical structure of relational data and the workload size and the resource demands increase, application designers to inform the views selection mechanism, which can then be can consider transitioning from their relational database to a leveraged to grab a single lock for MVs and base tables. NewSQL/NoSQL database. Recently Facebook [4] and Netflix In this work, we present the Synergy system that leverages [5] transformed part of their relational databases to HBase [6]. MVs and a light-weight concurrency control on top of a A short version of this paper appeared in IEEE Cluster Conference 2017. NoSQL database to provide for scalable data management Available Online: http://ieeexplore.ieee.org/document/8048951/ with familiar relational conventions and more robust query 1 Slow Join Ensuring RDBMS NoSQL Performance Materialized Consistency Requires Concurrency Databases Views Control Horizontal Selection Mechanisms Scaling Mechanisms Challenges Lock Based MVCC NewSQL Schema Based – Workload Granularity Databases Workload Driven Based Single Lock Many Locks Limited Query Expressiveness Synergy VoltDB System Fig. 1. Design choices and decisions in the Synergy System. ADDRESS A. Relation, Index and Schema Models AID Street City Zip DEPARTMENT DNo DName Relation– A relation R is modeled as a set of attributes. The EMPLOYEE EID EName EHome_AID EOffice_AID E_DNo DEPARTMENT LOCATION primary key of R denoted as PK(R), is a tuple of attributes DL_DNo DLocation that uniquely identify each record in R. The foreign key of R WORKS ON WO_EID WO_PNo Hours PROJECT denoted as FK(R), is a set of attributes that reference another PNo PName P_DNo DEPENDENT relation T. A relation can have multiple foreign keys, hence DP_EID DPName DPHome_AID let F(R) denotes the set of foreign keys of R. Underlined Attributes Represent Primary Key Foreign Key Reference Index– In this work we utilize covered indexes that store the required data in the index itself. An index X on a relation R Fig. 2. Relations in the Company schema. denoted as X(R), is modeled as a set of attributes (s.t. X(R) ⊂ R). Let Xtuple(R) denotes a tuple of attributes that the index is indexed upon (s.t. Xtuple(R) ⊂ X(R)). The key of an index is a expressiveness. Figure 1 presents the design decisions for union of attributes in tuples Xtuple(R) and PK(R), in that order. MVs selection and concurrency control mechanisms in the Since a relation can have multiple indexes, let I(R) denotes the Synergy system. Synergy harnesses databases’ hierarchical set of indexes on R. schemas to generate candidate MVs, and then uses a workload Schema– Using the previous definitions of a relation and driven selection mechanism to select views for materialization. an index, a database schema S is modeled as a set of relations To provide ACID semantics in the presence of views, the and the corresponding index sets, S = fR , I(R ), R , I(R ),..., system implements concurrency controls on top of the NoSQL 1 1 2 2 R , I(R )g, where n represents the number of relations in the database using a hierarchical locking mechanism that only n n schema. We use an example Company database for the purpose requires a single lock to be held per transaction. The Synergy of exposition. Figure 2 depicts the relations in the Company system provides ACID semantics with the read-committed database schema. transaction isolation level. Our contributions in this work are as follows: B. Modeling Workload • We present the design of Synergy system that trades slight write performance degradation and increased disk utiliza- A database workload W = fw1,..., wmg is modeled as a set tion for faster join performance (compared to standard of SQL statements, where m is the number of statements. NoSQL databases) and improved query expressiveness (compared to NewSQL databases). C. HBase Overview • We propose a novel schema based–workload driven ma- We use HBase [6] for the purpose of exposition and terialized views selection mechanism. experimentation in this work. It is a column family-oriented • We implement and evaluate the proposed system on an distributed database modeled after Google’s Bigtable [11]. Amazon EC2 cluster using the TPC-W benchmark. HBase organizes data into tables. A table consists of rows • We compare and contrast the performance of Synergy that are sorted alphabetically by the row key. HBase groups system with four complementary systems. columns in a table into column families such that each column family data are stored in its own file. A column is identified by a column qualifier. Also, a column can have multiple versions II. BACKGROUND of a data sorted by the timestamp. We first review the concepts of a relation, index and The HBase data manipulation API comprises of five primi- schema which are common to both SQL and NoSQL data tive operations: Get, Put, Scan, Delete and Increment. The Get, models. Then, we present a model for the database workload. Put, Delete and Increment
Recommended publications
  • Failures in DBMS
    Chapter 11 Database Recovery 1 Failures in DBMS Two common kinds of failures StSystem filfailure (t)(e.g. power outage) ‒ affects all transactions currently in progress but does not physically damage the data (soft crash) Media failures (e.g. Head crash on the disk) ‒ damagg()e to the database (hard crash) ‒ need backup data Recoveryyp scheme responsible for handling failures and restoring database to consistent state 2 Recovery Recovering the database itself Recovery algorithm has two parts ‒ Actions taken during normal operation to ensure system can recover from failure (e.g., backup, log file) ‒ Actions taken after a failure to restore database to consistent state We will discuss (briefly) ‒ Transactions/Transaction recovery ‒ System Recovery 3 Transactions A database is updated by processing transactions that result in changes to one or more records. A user’s program may carry out many operations on the data retrieved from the database, but the DBMS is only concerned with data read/written from/to the database. The DBMS’s abstract view of a user program is a sequence of transactions (reads and writes). To understand database recovery, we must first understand the concept of transaction integrity. 4 Transactions A transaction is considered a logical unit of work ‒ START Statement: BEGIN TRANSACTION ‒ END Statement: COMMIT ‒ Execution errors: ROLLBACK Assume we want to transfer $100 from one bank (A) account to another (B): UPDATE Account_A SET Balance= Balance -100; UPDATE Account_B SET Balance= Balance +100; We want these two operations to appear as a single atomic action 5 Transactions We want these two operations to appear as a single atomic action ‒ To avoid inconsistent states of the database in-between the two updates ‒ And obviously we cannot allow the first UPDATE to be executed and the second not or vice versa.
    [Show full text]
  • Relational Database Design Chapter 7
    Chapter 7: Relational Database Design Chapter 7: Relational Database Design First Normal Form Pitfalls in Relational Database Design Functional Dependencies Decomposition Boyce-Codd Normal Form Third Normal Form Multivalued Dependencies and Fourth Normal Form Overall Database Design Process Database System Concepts 7.2 ©Silberschatz, Korth and Sudarshan 1 First Normal Form Domain is atomic if its elements are considered to be indivisible units + Examples of non-atomic domains: Set of names, composite attributes Identification numbers like CS101 that can be broken up into parts A relational schema R is in first normal form if the domains of all attributes of R are atomic Non-atomic values complicate storage and encourage redundant (repeated) storage of data + E.g. Set of accounts stored with each customer, and set of owners stored with each account + We assume all relations are in first normal form (revisit this in Chapter 9 on Object Relational Databases) Database System Concepts 7.3 ©Silberschatz, Korth and Sudarshan First Normal Form (Contd.) Atomicity is actually a property of how the elements of the domain are used. + E.g. Strings would normally be considered indivisible + Suppose that students are given roll numbers which are strings of the form CS0012 or EE1127 + If the first two characters are extracted to find the department, the domain of roll numbers is not atomic. + Doing so is a bad idea: leads to encoding of information in application program rather than in the database. Database System Concepts 7.4 ©Silberschatz, Korth and Sudarshan 2 Pitfalls in Relational Database Design Relational database design requires that we find a “good” collection of relation schemas.
    [Show full text]
  • Concurrency Control
    Concurrency Control Instructor: Matei Zaharia cs245.stanford.edu Outline What makes a schedule serializable? Conflict serializability Precedence graphs Enforcing serializability via 2-phase locking » Shared and exclusive locks » Lock tables and multi-level locking Optimistic concurrency with validation Concurrency control + recovery CS 245 2 Lock Modes Beyond S/X Examples: (1) increment lock (2) update lock CS 245 3 Example 1: Increment Lock Atomic addition action: INi(A) {Read(A); A ¬ A+k; Write(A)} INi(A), INj(A) do not conflict, because addition is commutative! CS 245 4 Compatibility Matrix compat S X I S T F F X F F F I F F T CS 245 5 Update Locks A common deadlock problem with upgrades: T1 T2 l-S1(A) l-S2(A) l-X1(A) l-X2(A) --- Deadlock --- CS 245 6 Solution If Ti wants to read A and knows it may later want to write A, it requests an update lock (not shared lock) CS 245 7 Compatibility Matrix New request compat S X U S T F Lock already X F F held in U CS 245 8 Compatibility Matrix New request compat S X U S T F T Lock already X F F F held in U F F F Note: asymmetric table! CS 245 9 How Is Locking Implemented In Practice? Every system is different (e.g., may not even provide conflict serializable schedules) But here is one (simplified) way ... CS 245 10 Sample Locking System 1. Don’t ask transactions to request/release locks: just get the weakest lock for each action they perform 2.
    [Show full text]
  • Sundial: Harmonizing Concurrency Control and Caching in a Distributed OLTP Database Management System
    Sundial: Harmonizing Concurrency Control and Caching in a Distributed OLTP Database Management System Xiangyao Yu, Yu Xia, Andrew Pavlo♠ Daniel Sanchez, Larry Rudolph|, Srinivas Devadas Massachusetts Institute of Technology, ♠Carnegie Mellon University, |Two Sigma Investments, LP [email protected], [email protected], [email protected] [email protected], [email protected], [email protected] ABSTRACT and foremost, long network delays lead to long execution time of Distributed transactions suffer from poor performance due to two distributed transactions, making them slower than single-partition major limiting factors. First, distributed transactions suffer from transactions. Second, long execution time increases the likelihood high latency because each of their accesses to remote data incurs a of contention among transactions, leading to more aborts and per- long network delay. Second, this high latency increases the likeli- formance degradation. hood of contention among distributed transactions, leading to high Recent work on improving distributed concurrency control has abort rates and low performance. focused on protocol-level and hardware-level improvements. Im- We present Sundial, an in-memory distributed optimistic concur- proved protocols can reduce synchronization overhead among trans- rency control protocol that addresses these two limitations. First, to actions [22, 35, 36, 46], but can still limit performance due to high reduce the transaction abort rate, Sundial dynamically determines coordination overhead (e.g., lock blocking). Hardware-level im- the logical order among transactions at runtime, based on their data provements include using special hardware like optimized networks access patterns. Sundial achieves this by applying logical leases to that enable low-latency remote memory accesses [20, 48, 55], which each data element, which allows the database to dynamically calcu- can increase the overall cost of the system.
    [Show full text]
  • Comparing Approaches: Analytic Workspaces and Materialized Views
    Comparing Materialized Views and Analytic Workspaces in Oracle Database 11g An Oracle White Paper March 2008 Comparing Materialized Views and Analytic Workspaces in Oracle Database 11g Introduction ....................................................................................................... 3 Fast Reporting – Comparing The Approaches............................................. 3 Overview of Materialized Views................................................................. 3 Overview of Analytic Workspaces............................................................. 4 Working Together in the Oracle Database ............................................... 6 Explanation of Data Used for this White Paper........................................... 6 Methodology for Designing and Building the Materialized Views........ 7 Manually Creating the Materialized View ............................................. 7 Generating Recommendations from the SQL Access Advisor ........ 9 Methodology for Defining the Analytic Workspace ............................. 10 Tools For Querying ........................................................................................ 15 Refreshing the Data ........................................................................................ 16 Refreshing Materialized Views.................................................................. 16 Refreshing Analytic Workspace Cubes.................................................... 17 Other Issues To Consider.............................................................................
    [Show full text]
  • Concurrency Control and Recovery ACID • Transactions • Recovery Transaction Model Concurency Control Recovery
    Data Management Systems • Transaction Processing • Concurrency control and recovery ACID • Transactions • Recovery Transaction model Concurency Control Recovery Gustavo Alonso Institute of Computing Platforms Department of Computer Science ETH Zürich Transactions-CC&R 1 A bit of theory • Before discussing implementations, we will cover the theoretical underpinning behind concurrency control and recovery • Discussion at an abstract level, without relation to implementations • No consideration of how the concepts map to real elements (tuples, pages, blocks, buffers, etc.) • Theoretical background important to understand variations in implementations and what is considered to be correct • Theoretical background also key to understand how system have evolved over the years Transactions-CC&R 2 Reference Concurrency Control and Recovery in Database Systems Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman • https://www.microsoft.com/en- us/research/people/philbe/book/ Transactions-CC&R 3 ACID Transactions-CC&R 4 Conventional notion of database correctness • ACID: • Atomicity: the notion that an operation or a group of operations must take place in their entirety or not at all • Consistency: operations should take the database from a correct state to another correct state • Isolation: concurrent execution of operations should yield results that are predictable and correct • Durability: the database needs to remember the state it is in at all moments, even when failures occur • Like all acronyms, more effort in making it sound cute than in
    [Show full text]
  • Oracle Database 10G Materialized Views & Query Rewrite
    Oracle Materialized Views & Query Rewrite An Oracle White Paper May 2005 Oracle Materialized Views & Query Rewrite Executive Overview.......................................................................................... 3 Introduction ....................................................................................................... 3 Why use Summary Management..................................................................... 4 Components of Summary Management ........................................................ 5 Schema Requirements.................................................................................. 5 Dimensions ........................................................................................................ 5 Hints on Defining Dimensions .................................................................. 7 Materialized Views ............................................................................................ 8 Creating a Materialized View....................................................................... 8 Using your own pre-built materialized views............................................ 9 Index selection for Materialized Views...................................................... 9 What can this Materialized View Do?...................................................... 10 Tuning a Materialized View....................................................................... 11 Materialized View Invalidation ................................................................. 12 Security Implications
    [Show full text]
  • CAS CS 460/660 Introduction to Database Systems Transactions
    CAS CS 460/660 Introduction to Database Systems Transactions and Concurrency Control 1.1 Recall: Structure of a DBMS Query in: e.g. “Select min(account balance)” Data out: Database app e.g. 2000 Query Optimization and Execution Relational Operators These layers Access Methods must consider concurrency Buffer Management control and recovery Disk Space Management Customer accounts stored on disk1.2 = File System vs. DBMS? ■ Thought Experiment 1: ➹ You and your project partner are editing the same file. ➹ You both save it at the same time. ➹ Whose changes survive? A) Yours B) Partner’s C) Both D) Neither E) ??? • Thought Experiment 2: Q: How do you write programs over a – You’re updating a file. subsystem when it – The power goes out. promises you only “???” ? – Which of your changes survive? A: Very, very carefully!! A) All B) None C) All Since last save D ) ??? 1.3 Concurrent Execution ■ Concurrent execution essential for good performance. ➹ Because disk accesses are frequent, and relatively slow, it is important to keep the CPU humming by working on several user programs concurrently. ➹ Trends are towards lots of cores and lots of disks. § e.g., IBM Watson has 2880 processing cores ■ A program may carry out many operations, but the DBMS is only concerned about what data is read/written from/to the database. 1.4 Key concept: Transaction ■ an atomic sequence of database actions (reads/writes) ■ takes DB from one consistent state to another ■ transaction - DBMS’s abstract view of a user program: ➹ a sequence of reads and writes. transaction
    [Show full text]
  • Apache Hive Overview Date of Publish: 2018-07-12
    Data Access 3 Apache Hive overview Date of Publish: 2018-07-12 http://docs.hortonworks.com Contents What's new in this release: Apache Hive...............................................................3 Apache Hive 3 architectural overview................................................................... 4 Apache Hive 3 upgrade process..............................................................................6 Changes after upgrading to Apache Hive 3.........................................................................................................7 Convert Hive CLI scripts to Beeline................................................................................................................... 9 Hive Semantic and Syntax Changes.................................................................................................................. 10 Creating a table.......................................................................................................................................10 Escaping db.table references.................................................................................................................. 11 Casting timestamps................................................................................................................................. 11 Renaming tables......................................................................................................................................12 Checking compatibility of column changes...........................................................................................12
    [Show full text]
  • Short Type Questions and Answers on DBMS
    BIJU PATNAIK UNIVERSITY OF TECHNOLOGY, ODISHA Short Type Questions and Answers on DBMS Prepared by, Dr. Subhendu Kumar Rath, BPUT, Odisha. DABASE MANAGEMENT SYSTEM SHORT QUESTIONS AND ANSWERS Prepared by Dr.Subhendu Kumar Rath, Dy. Registrar, BPUT. 1. What is database? A database is a logically coherent collection of data with some inherent meaning, representing some aspect of real world and which is designed, built and populated with data for a specific purpose. 2. What is DBMS? It is a collection of programs that enables user to create and maintain a database. In other words it is general-purpose software that provides the users with the processes of defining, constructing and manipulating the database for various applications. 3. What is a Database system? The database and DBMS software together is called as Database system. 4. What are the advantages of DBMS? 1. Redundancy is controlled. 2. Unauthorised access is restricted. 3. Providing multiple user interfaces. 4. Enforcing integrity constraints. 5. Providing backup and recovery. 5. What are the disadvantage in File Processing System? 1. Data redundancy and inconsistency. 2. Difficult in accessing data. 3. Data isolation. 4. Data integrity. 5. Concurrent access is not possible. 6. Security Problems. 6. Describe the three levels of data abstraction? The are three levels of abstraction: 1. Physical level: The lowest level of abstraction describes how data are stored. 2. Logical level: The next higher level of abstraction, describes what data are stored in database and what relationship among those data. 3. View level: The highest level of abstraction describes only part of entire database.
    [Show full text]
  • Automated Selection of Materialized Views and Indexes for SQL Databases
    Automated Selection of Materialized Views and Indexes for SQL Databases Sanjay Agrawal Surajit Chaudhuri Vivek Narasayya Microsoft Research Microsoft Research Microsoft Research [email protected] [email protected] [email protected] Abstract large number of recent papers in this area, most of the Automatically selecting an appropriate set of prior work considers the problems of index selection and materialized views and indexes for SQL materialized view selection in isolation. databases is a non-trivial task. A judicious choice Although indexes and materialized views are similar, must be cost-driven and influenced by the a materialized view is much richer in structure than an workload experienced by the system. Although index since a materialized view may be defined over there has been work in materialized view multiple tables, and can have selections and GROUP BY selection in the context of multidimensional over multiple columns. In fact, an index can logically be (OLAP) databases, no past work has looked at considered as a special case of a single-table, projection the problem of building an industry-strength tool only materialized view. This richness of structure of for automated selection of materialized views materialized views makes the problem of selecting and indexes for SQL workloads. In this paper, materialized views significantly more complex than that we present an end-to-end solution to the problem of index selection. We therefore need innovative of selecting materialized views and indexes. We techniques for dealing with the large space of potentially describe results of extensive experimental interesting materialized views that are possible for a given evaluation that demonstrate the effectiveness of set of SQL queries and updates over a large schema.
    [Show full text]
  • Materialized Views
    MATERIALIZED VIEWS - WISSENSWERTES UND NEUERUNGEN IN 12C AUTHOR: JONAS GASSENMEYER www.syntegris.de © SYNTEGRIS INFORMATION SOLUTIONS GMBH Mein Haus, mein Auto,… [email protected] @gassenmj Stand 42 © SYNTEGRIS INFORMATION SOLUTIONS GMBH AGENDA . Vorstellung (2 Minuten) . Use Case Pharma Konzern (3 Minuten) •Wissenswertes (vor 12c) .Was ist eine MV, (SQL Hinterlegen und Index drauf) (2 Minuten) .Query Rewrite (1 Minuten) .Atomic_refresh (5 Minuten) .Tablespace (1 Minute) .Fast refresh (incrementell) (5 Minuten) .„es kommt drauf an“ (1 Minuten) • Out of place refresh . Prebuilt table (3 Minuten) . Demo (3 Minuten) • On Statement MVs . Erklärung (3 Minuten) . Demo (3 Minuten) • Real time MVs . Erklärung (3 Minuten) . Demo (3 Minuten) Refresh Statistiken .Erklärung (3 Minuten) .Demo (3 Minuten) .Bug? Fazit (3 Minuten) © SYNTEGRIS INFORMATION SOLUTIONS GMBH Use Case - Aggregation Table Table Table complex Tables MView fast select Report aggregation runs when user runs at night requests the report Materialized View pre-computes result sets © SYNTEGRIS INFORMATION SOLUTIONS GMBH Use Case - Replication › Use Case - Replication Database Local DB MView Link Table Remote DB Data Materialized View copies remote data to local DB © SYNTEGRIS INFORMATION SOLUTIONS GMBH Real World Example? © SYNTEGRIS INFORMATION SOLUTIONS GMBH Chemical Substances › Ordering Vials/Tubes/Plates Differ on each Site → Aggregation From ca. 50 Mio aggregated to ca. 6.5 Mio › Metadata about Substances Same on each Site → Replication Ca. 18.5 Mio © SYNTEGRIS INFORMATION
    [Show full text]