Nosql Data Stores the Reference Big Data Stack
Total Page:16
File Type:pdf, Size:1020Kb
Macroarea di Ingegneria Dipar/mento di Ingegneria Civile e Ingegneria Informa/ca NoSQL Data Stores Corso di Sistemi e Architetture per Big Data A.A. 2018/19 Valeria Cardellini Laurea Magistrale in Ingegneria Informatica The reference Big Data stack High-level Interfaces / Integration Support Data Processing Data Storage Resource Management Valeria Cardellini - SABD 2018/19 1 Traditional RDBMSs • Relational DBMSs (RDBMSs) – Traditional technology for storing structured data in web and business applications • SQL is good – Rich language and toolset – Easy to use and integrate – Many vendors • RDBMSs promise ACID guarantees Valeria Cardellini - SABD 2018/19 2 ACID properties • Atomicity – All included statements in a transaction are either executed or the whole transaction is aborted without affecting the database (“all or nothing” principle) • Consistency – A database is in a consistent state before and after a transaction • Isolation – Transactions cannot see uncommitted changes in the database (i.e., the results of incomplete transactions are not visible to other transactions) • Durability – Changes are written to a disk before a database commits a transaction so that committed data cannot be lost through a power failure Valeria Cardellini - SABD 2018/19 3 RDBMS constraints • Domain constraints – Restricts the domain of each attribute or the set of possible values for the attribute • Entity integrity constraint – No primary key value can be null • Referential integrity constraint – To maintain consistency among the tuples in two relations: every value of one attribute of a relation should exist as a value of another attribute in another relation • Foreign key – To cross-reference between multiple relations: it is a key in a relation that matches the primary key of another relation Valeria Cardellini - SABD 2018/19 4 Pros and cons of RDBMS Pros Cons • Well-defined consistency • Performance as major model constraint, scaling is difficult • ACID guarantees • Limited support for complex data structures • Relational integrity maintained through entity • Complete knowledge of DB structure required to create and referential integrity ad hoc queries constraints • Commercial DBMSs are • Well suited for OLTP apps expensive OLTP: OnLine Transaction • Some DBMSs have limits on Processing fields size • Sound theoretical foundation • Data integration from • Stable and standardized multiple RDBMSs can be DBMSs available cumbersome • Well understood Valeria Cardellini - SABD 2018/19 5 RDBMS challenges • Web-based applications caused spikes – Internet-scale data size – High read-write rates – Frequent schema changes • Let’s scale RDBMSs – RDBMS were not designed to be distributed • How to scale RDBMSs? – Replication – Sharding Valeria Cardellini - SABD 2018/19 6 Replication • Primary backup with master/worker architecture • Replication improves read scalability • Write operations? Valeria Cardellini - SABD 2018/19 7 Sharding • Horizontal partitioning of data across many separate servers • Scales read and write operations • Cannot execute transactions across shards (partitions) • Consistent hashing is one form of sharding - Hash both data and nodes using the same hash function in the same ID space Valeria Cardellini - SABD 2018/19 8 Scaling RDBMSs is expensive and inefficient Source: Couchbase technical report Valeria Cardellini - SABD 2018/19 9 NoSQL data stores • NoSQL = Not Only SQL – SQL-style querying is not the crucial objective Valeria Cardellini - SABD 2018/19 10 NoSQL data stores: main features • Support flexible schema – No requirement for fixed rows in a table schema – Well suited for Agile development process • Scale horizontally – Partitioning of data and processing over multiple nodes • Provide high availability – By replicating data in multiple nodes, often geo-distributed • Mainly utilize shared-nothing architecture – With exception of graph-based databases • Avoid unneeded complexity – E.g., elimination of join operations • Multiprocessor support • Support weaker consistency models – BASE rather than ACID: compromising reliability for better performance Valeria Cardellini - SABD 2018/19 11 ACID vs BASE • Two design philosophies at opposite ends of the consistency-availability spectrum - Keep in mind CAP theorem Pick two of Consistency, Availability and Partition tolerance! • ACID: traditional approach for RDBMSs – Pessimistic approach: prevent conflicts from occurring • Usually implemented with write locks managed by the system • Leads to performance degradation and deadlocks (hard to prevent and debug) – Does not scale well when handling petabytes of data (remember of latency!) Valeria Cardellini - SABD 2018/19 12 ACID vs BASE (2) • BASE: Basically Available, Soft state, Eventual consistency – Basically Available: the system is available most of the time and there could exist a subsystem temporarily unavailable – Soft state: data is not durable in the sense that its persistence is in the hand of the user that must take care of refresh it – Eventually consistent: the system eventually converge to a consistent state • Optimistic approach - Lets conflicts occur, but detects them and takes action to sort them out - How? • Conditional updates: test the value just before updating • Save both updates: record that they are in conflict and then merge them Valeria Cardellini - SABD 2018/19 13 NoSQL and consistency • Biggest change from RDBMS – RDBMS: strong consistency – Traditional RDBMS are CA systems (or CP systems, depending on the configuration) • Most NoSQL systems are eventual consistent – i.e., AP systems • But some NoSQL systems provide strong consistency or tunable consistency Valeria Cardellini - SABD 2018/19 14 NoSQL cost and performance Source: Couchbase technical report Valeria Cardellini - SABD 2018/19 15 Pros and cons of NoSQL Pros Cons • Easy to scale-out • No ACID guarantees, less • Higher performance for suitable for OLTP apps massive data scale • No fixed schema, no common data storage model • Allows sharing of data across multiple servers • Lack of standardization (e.g., querying is unique for • Most solutions are either each NoSQL data store) open-source or cheaper • Limited support for • HA and fault tolerance aggregation (sum, avg, provided by data replication count, group by) • Supports complex data • Poor performance for structures and objetcs complex joins • No fixed schema, supportrs • No well defined approach for DB design (different data unstructured data models) • Very fast retrieval of data, • Lack of model can lead to Valeria Cardellini- SABD2018/19 Valeria suitable for real-time apps solution lock-in 16 NoSQL data models • A number of largely diverse data stores not based on the relational data model Valeria Cardellini - SABD 2018/19 17 NoSQL data models • A data model is a set of constructs for representing the information – Relational model: tables, columns and rows • Storage model: how the DBMS stores and manipulates the data internally • A data model is usually independent of the storage model • Data models for NoSQL systems: – Aggregate-oriented models: key-value, document, and column-family – Graph-based models Valeria Cardellini - SABD 2018/19 18 Aggregates • Data as units having a complex structure – More structure than just a set of tuples – E.g., complex record with simple fields, arrays, records nested inside • Aggregate pattern in Domain-Driven Design – A collection of related objects that we treat as a unit – A unit for data manipulation and management of consistency • Advantages of aggregates – Easier for application programmers to work with – Easier for database systems to handle operating on a cluster Valeria Cardellini - SABD 2018/19 See http://thght.works/1XqYKB0 19 Transactions? • Relational databases do have ACID transactions! • Aggregate-oriented databases: – Support atomic transactions, but only within a single aggregate – Don’t have ACID transactions that span multiple aggregates • Update over multiple aggregates: possible inconsistent reads – Part of the consideration for deciding how to aggregate data • Graph databases tend to support ACID transactions Valeria Cardellini - SABD 2018/19 20 Key-value data model • Simple data model: data is represented as a schemaless collection of key-value pairs – Associative array (map or dictionary) as fundamental data model • Strongly aggregate-oriented – Lots of aggregates – Each aggregate has a key • Data model: – Set of <key, value> pairs – Value: aggregate instance • The aggregate is opaque to the data store – Just a big blob of mostly meaningless bit • Access to aggregate: – Lookup based on its key • Richer data models can be implemented on top Valeria Cardellini - SABD 2018/19 21 Key-value data model: example Valeria Cardellini - SABD 2018/19 22 Types of key-value stores • Some data stores support ordering of keys • Some maintain data in memory (RAM), while others employ HDDs or SSDs • Wide range of consistency models Valeria Cardellini - SABD 2018/19 23 Consistency of key-value stores • Consistency models range from eventual consistency to serializability – Serializability: guarantee about transactions, or groups of one or more operations over one or more objects • It guarantees that the execution of a set of transactions (with read and write operations) over multiple items is equivalent to some serial execution (total ordering) of the transactions • Gold standard in database community • Serializability is the traditional I (isolation) in ACID – Examples: • AP: Dynamo, Riak KV, Voldemort • CP: Redis, Berkeley DB, Scalaris Valeria