A Comparative Analysis of Materialized Views Selection and Concurrency Control Mechanisms in NoSQL Databases
Ashish Tapdiya, Yuan Xue, Daniel Fabbri
Vanderbilt University

arXiv:1710.01792v1 [cs.DB] 4 Oct 2017

A short version of this paper appeared in IEEE Cluster Conference 2017. Available Online: http://ieeexplore.ieee.org/document/8048951/

Abstract—Increasing resource demands require relational databases to scale. While relational databases are well suited for vertical scaling, specialized hardware can be expensive. Conversely, emerging NewSQL and NoSQL data stores are designed to scale horizontally. NewSQL databases provide ACID transaction support; however, joins are limited to the partition keys, resulting in restricted query expressiveness. On the other hand, NoSQL databases are designed to scale out linearly on commodity hardware; however, they are limited by slow join performance. Hence, we consider whether NoSQL join performance can be improved while ensuring ACID semantics and without drastically sacrificing write performance, disk utilization and query expressiveness.

This paper presents the Synergy system, which leverages a schema- and workload-driven mechanism to identify materialized views and a specialized concurrency control system on top of a NoSQL database to enable scalable data management with familiar relational conventions. Synergy trades slight write performance degradation and increased disk utilization for faster join performance (compared to standard NoSQL databases) and improved query expressiveness (compared to NewSQL databases). Experimental results using the TPC-W benchmark show that, for a database populated with 1M customers, the Synergy system exhibits a maximum performance improvement of 80.5% as compared to other evaluated systems.

Index Terms—Transaction processing, materialized views, NoSQL databases, performance evaluation

I. INTRODUCTION

Application development with relational databases as the storage backend is prevalent, because relational databases provide a formal framework for schema design, a structured query language (i.e., SQL) and ACID transactions. As applications gain popularity and the database system reaches its resource limits, the architecture must scale up to ensure end-to-end response time (RT). Relational databases are well suited for vertical scaling; however, vertical scaling has known limitations and requires expensive hardware. In contrast, the new brand of NewSQL and NoSQL databases is known for its ability to scale out linearly [1], [2], [3]. Hence, as the data size and the resource demands increase, application designers can consider transitioning from their relational database to a NewSQL/NoSQL database. Recently Facebook [4] and Netflix [5] moved part of their relational databases to HBase [6].

NewSQL architectures enable a database to scale out linearly while providing ACID transaction guarantees. However, their schema design requires careful consideration when choosing partition keys, since joins are restricted to partition keys only [7], resulting in limited query expressiveness. Similarly, NoSQL databases can also scale out linearly, but are limited by slow join performance due to the distribution of data across the cluster and data transfer latency, which has also been identified in previous work [8]. Thus, while NewSQL and NoSQL systems allow data stores to scale, their designs sacrifice query expressiveness and join performance, respectively. More generally, there exists a design space that makes trade-offs between performance, ACID guarantees, query expressiveness and disk utilization.

This paper considers whether NoSQL join performance can be improved while ensuring ACID semantics and without drastically sacrificing write performance, disk utilization and query expressiveness. One option for improving the performance of NoSQL workloads is materialized views (MVs), which pre-compute expensive joins [9], [10]. However, deploying MVs on top of a NoSQL store does not guarantee consistency, as atomic key-based operations allow the MV's data to be stale relative to the base table [6], [11], [12]. Hence, additional concurrency controls such as locking or multi-versioning are required to ensure data consistency.

Standard concurrency control methods, such as locking or multi-versioning, can provide ACID semantics for NoSQL stores with materialized views, but induce performance degradation (i.e., by grabbing many locks or checking multiple versions, respectively) because the concurrency control mechanism and the MV selection mechanism are not designed in tandem. Instead, this paper considers a synergistic design space in which the concurrency control mechanism and the MV selection mechanism operate together such that only a single lock is grabbed per transaction. The proposed system relies on the hierarchical structure of relational data and the workload to inform the views selection mechanism, which can then be leveraged to grab a single lock for MVs and base tables.

In this work, we present the Synergy system, which leverages MVs and a light-weight concurrency control on top of a NoSQL database to provide scalable data management with familiar relational conventions and more robust query expressiveness. Figure 1 presents the design decisions for MVs selection and concurrency control mechanisms in the Synergy system. Synergy harnesses databases' hierarchical schemas to generate candidate MVs, and then uses a workload-driven selection mechanism to select views for materialization. To provide ACID semantics in the presence of views, the system implements concurrency controls on top of the NoSQL database using a hierarchical locking mechanism that requires only a single lock to be held per transaction. The Synergy system provides ACID semantics with the read-committed transaction isolation level.

Fig. 1. Design choices and decisions in the Synergy System.

Our contributions in this work are as follows:
• We present the design of the Synergy system, which trades slight write performance degradation and increased disk utilization for faster join performance (compared to standard NoSQL databases) and improved query expressiveness (compared to NewSQL databases).
• We propose a novel schema based–workload driven materialized views selection mechanism.
• We implement and evaluate the proposed system on an Amazon EC2 cluster using the TPC-W benchmark.
• We compare and contrast the performance of the Synergy system with four complementary systems.

II. BACKGROUND

We first review the concepts of a relation, index and schema, which are common to both SQL and NoSQL data models. Then, we present a model for the database workload.

A. Relation, Index and Schema Models

Relation– A relation $R$ is modeled as a set of attributes. The primary key of $R$, denoted as $PK(R)$, is a tuple of attributes that uniquely identify each record in $R$. A foreign key of $R$, denoted as $FK(R)$, is a set of attributes that reference another relation $T$. A relation can have multiple foreign keys; hence, let $F(R)$ denote the set of foreign keys of $R$.

Index– In this work we utilize covered indexes that store the required data in the index itself. An index $X$ on a relation $R$, denoted as $X(R)$, is modeled as a set of attributes (s.t. $X(R) \subset R$). Let $X_{tuple}(R)$ denote the tuple of attributes that the index is indexed upon (s.t. $X_{tuple}(R) \subset X(R)$). The key of an index is the union of the attributes in the tuples $X_{tuple}(R)$ and $PK(R)$, in that order. Since a relation can have multiple indexes, let $I(R)$ denote the set of indexes on $R$.

Schema– Using the previous definitions of a relation and an index, a database schema $S$ is modeled as a set of relations and the corresponding index sets, $S = \{R_1, I(R_1), R_2, I(R_2), \ldots, R_n, I(R_n)\}$, where $n$ represents the number of relations in the schema. We use an example Company database for the purpose of exposition. Figure 2 depicts the relations in the Company database schema.

Fig. 2. Relations in the Company schema: ADDRESS(AID, Street, City, Zip); DEPARTMENT(DNo, DName); EMPLOYEE(EID, EName, EHome_AID, EOffice_AID, E_DNo); DEPARTMENT LOCATION(DL_DNo, DLocation); WORKS ON(WO_EID, WO_PNo, Hours); PROJECT(PNo, PName, P_DNo); DEPENDENT(DP_EID, DPName, DPHome_AID). In the figure, underlined attributes represent the primary key and arrows represent foreign key references.

B. Modeling Workload

A database workload $W = \{w_1, \ldots, w_m\}$ is modeled as a set of SQL statements, where $m$ is the number of statements.

C. HBase Overview

We use HBase [6] for the purpose of exposition and experimentation in this work. It is a column family-oriented distributed database modeled after Google's Bigtable [11]. HBase organizes data into tables. A table consists of rows that are sorted lexicographically by the row key. HBase groups the columns in a table into column families such that each column family's data is stored in its own file. A column is identified by a column qualifier. Also, a column can hold multiple versions of a value, sorted by timestamp.

The HBase data manipulation API comprises five primitive operations: Get, Put, Scan, Delete and Increment. The Get, Put, Delete and Increment