Macroarea di Ingegneria — Dipartimento di Ingegneria Civile e Ingegneria Informatica (School of Engineering — Department of Civil Engineering and Computer Science Engineering)

NoSQL Data Stores
Course: Sistemi e Architetture per Big Data (SABD — Systems and Architectures for Big Data), A.Y. 2018/19
Valeria Cardellini

Master's Degree (Laurea Magistrale) in Computer Engineering

The reference Big Data stack

Layers, top to bottom:
• High-level Interfaces — Support / Integration
• Data Processing
• Data Storage
• Resource Management

Traditional RDBMSs

• Relational DBMSs (RDBMSs)
  – Traditional technology for storing structured data in web and business applications
• SQL is good
  – Rich language and toolset
  – Easy to use and integrate
  – Many vendors
• RDBMSs promise ACID guarantees


ACID properties

• Atomicity
  – All statements included in a transaction are either executed or the whole transaction is aborted without affecting the database ("all or nothing" principle)
• Consistency
  – A database is in a consistent state before and after a transaction
• Isolation
  – Transactions cannot see uncommitted changes in the database (i.e., the results of incomplete transactions are not visible to other transactions)
• Durability
  – Changes are written to disk before a database commits a transaction, so that committed data cannot be lost through a power failure

RDBMS constraints

• Domain constraint
  – Restricts the domain of each attribute, i.e., the set of possible values for the attribute
• Entity integrity constraint
  – No primary key value can be null
• Referential integrity constraint
  – Maintains consistency among the tuples of two relations: every value of one attribute of a relation should exist as a value of another attribute in another relation
• Foreign key
  – Cross-references multiple relations: a key in a relation that matches the primary key of another relation


Pros and cons of RDBMSs

Pros:
• Well-defined consistency model
• ACID guarantees
• Relational integrity maintained through entity and referential integrity constraints
• Well suited for OLTP apps (OLTP: OnLine Transaction Processing)
• Sound theoretical foundation
• Stable and standardized DBMSs available
• Well understood

Cons:
• Performance as major constraint, scaling is difficult
• Limited support for complex data structures
• Complete knowledge of DB structure required to create ad hoc queries
• Commercial DBMSs are expensive
• Some DBMSs have limits on field size
• Data integration from multiple RDBMSs can be cumbersome

RDBMS challenges

• Web-based applications brought new demands
  – Internet-scale data size
  – High read/write rates
  – Frequent schema changes

• Let's scale RDBMSs
  – But RDBMSs were not designed to be distributed

• How to scale RDBMSs?
  – Replication
  – Sharding


Replication

• Primary-backup replication with a master/worker architecture
• Replication improves read scalability
• What about write operations?

Sharding

• Horizontal partitioning of data across many separate servers
• Scales both read and write operations
• Cannot execute transactions across shards (partitions)
• Consistent hashing is one form of sharding
  – Hash both data and nodes using the same hash function in the same ID space
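A minimal Python sketch (all names illustrative) of why naive hash-mod sharding is not enough: when the number of shards changes, most keys are remapped, which motivates consistent hashing, where only about 1/N of the keys move.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Naive hash-mod sharding: map a key to one of num_shards servers."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_shards

keys = [f"user:{i}" for i in range(10_000)]

# Assignment with 4 shards, then again after adding a 5th server.
before = {k: shard_for(k, 4) for k in keys}
after = {k: shard_for(k, 5) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys move when growing 4 -> 5 shards")
# Roughly 80% of keys change shard, so almost all data would have to be
# shuffled between servers on every resize.
```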


Scaling RDBMSs is expensive and inefficient

Source: Couchbase technical report

NoSQL data stores

• NoSQL = Not Only SQL
  – SQL-style querying is not the crucial objective


NoSQL data stores: main features
• Support flexible schema
  – No requirement for a fixed table schema
  – Well suited for Agile development processes
• Scale horizontally
  – Partitioning of data and processing over multiple nodes
• Provide high availability
  – By replicating data on multiple nodes, often geo-distributed
• Mainly utilize shared-nothing architecture
  – With the exception of graph-based databases
• Avoid unneeded complexity
  – E.g., elimination of join operations
• Multiprocessor support
• Support weaker consistency models
  – BASE rather than ACID: compromising reliability for better performance

ACID vs BASE

• Two design philosophies at opposite ends of the consistency-availability spectrum
  – Keep in mind the CAP theorem: pick two of Consistency, Availability, and Partition tolerance!

• ACID: traditional approach for RDBMSs
  – Pessimistic approach: prevent conflicts from occurring
    • Usually implemented with write locks managed by the system
    • Leads to performance degradation and deadlocks (hard to prevent and debug)
  – Does not scale well when handling petabytes of data (remember latency!)


ACID vs BASE (2)
• BASE: Basically Available, Soft state, Eventual consistency
  – Basically Available: the system is available most of the time, though some subsystem may be temporarily unavailable
  – Soft state: data is not durable, in the sense that its persistence is in the hands of the user, who must take care of refreshing it
  – Eventually consistent: the system eventually converges to a consistent state
• Optimistic approach
  – Lets conflicts occur, but detects them and takes action to sort them out
  – How?
    • Conditional updates: test the value just before updating
    • Save both updates: record that they are in conflict and then merge them
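The "conditional update" strategy above can be sketched as compare-and-set on a version counter; a toy, illustrative store (not any specific product's API):

```python
class OptimisticStore:
    """Toy store with conditional updates: a write succeeds only if the
    caller saw the current version, otherwise a conflict is reported."""
    def __init__(self):
        self._data = {}          # key -> (version, value)

    def get(self, key):
        return self._data.get(key, (0, None))

    def put_if(self, key, expected_version, value):
        version, _ = self._data.get(key, (0, None))
        if version != expected_version:
            return False          # conflict detected: caller must re-read and retry
        self._data[key] = (version + 1, value)
        return True

store = OptimisticStore()
v, _ = store.get("cart:42")
assert store.put_if("cart:42", v, ["book"])       # first writer wins
assert not store.put_if("cart:42", v, ["pen"])    # stale writer detects the conflict
```

The losing writer is not blocked by a lock (as in the pessimistic ACID approach); it simply learns after the fact that its view was stale.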

NoSQL and consistency

• Biggest change from RDBMSs
  – RDBMS: strong consistency
  – Traditional RDBMSs are CA systems (or CP systems, depending on the configuration)
• Most NoSQL systems are eventually consistent
  – I.e., AP systems
• But some NoSQL systems provide strong consistency or tunable consistency


NoSQL cost and performance

Source: Couchbase technical report

Pros and cons of NoSQL

Pros:
• Easy to scale out
• Higher performance for massive data scale
• Allows sharing of data across multiple servers
• Most solutions are either open source or cheaper
• HA and fault tolerance provided by data replication
• Supports complex data structures and objects
• No fixed schema, supports unstructured data
• Very fast retrieval of data, suitable for real-time apps

Cons:
• No ACID guarantees, less suitable for OLTP apps
• No fixed schema, no common data storage model
• Lack of standardization (e.g., querying is unique for each NoSQL data store)
• Limited support for aggregation (sum, avg, count, group by)
• Poor performance for complex joins
• No well-defined approach for DB design (different data models)
• Lack of model can lead to solution lock-in

NoSQL data models

• A number of largely diverse data stores not based on the relational data model

NoSQL data models

• A data model is a set of constructs for representing information
  – Relational model: tables, columns and rows
• Storage model: how the DBMS stores and manipulates the data internally
• A data model is usually independent of the storage model
• Data models for NoSQL systems:
  – Aggregate-oriented models: key-value, document, and column-family
  – Graph-based models


Aggregates
• Data as units having a complex structure
  – More structure than just a set of tuples
  – E.g., a complex record with simple fields, arrays, records nested inside
• Aggregate pattern in Domain-Driven Design
  – A collection of related objects that we treat as a unit
  – A unit for data manipulation and management of consistency
• Advantages of aggregates
  – Easier for application programmers to work with
  – Easier for database systems to handle when operating on a cluster

See http://thght.works/1XqYKB0

Transactions?

• Relational databases do have ACID transactions!
• Aggregate-oriented databases:
  – Support atomic transactions, but only within a single aggregate
  – Don't have ACID transactions that span multiple aggregates
• Update over multiple aggregates: possible inconsistent reads
  – Part of the consideration for deciding how to aggregate data
• Graph databases tend to support ACID transactions


Key-value data model
• Simple data model: data is represented as a schemaless collection of key-value pairs
  – Associative array (map or dictionary) as the fundamental data structure
• Strongly aggregate-oriented
  – Lots of aggregates
  – Each aggregate has a key
• Data model:
  – Set of <key, value> pairs
  – Value: aggregate instance
• The aggregate is opaque to the data store
  – Just a big blob of mostly meaningless bits
• Access to an aggregate:
  – Lookup based on its key

• Richer data models can be implemented on top

Key-value data model: example
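A minimal Python sketch of the model: the store keeps an opaque serialized blob per key, and the only access path is a lookup by key (all names illustrative):

```python
import json

class KVStore:
    """Minimal key-value store: the value is an opaque byte blob;
    the only access path is a lookup by key."""
    def __init__(self):
        self._blobs = {}

    def put(self, key: str, value) -> None:
        # Serialize the whole aggregate; the store never inspects it.
        self._blobs[key] = json.dumps(value).encode()

    def get(self, key: str):
        blob = self._blobs.get(key)
        return json.loads(blob) if blob is not None else None

kv = KVStore()
kv.put("user:1", {"name": "Ada", "langs": ["it", "en"]})
print(kv.get("user:1")["name"])   # the application, not the store, interprets the value
```

Note that any structure inside the value exists only in the application: the store cannot answer "find all users named Ada".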


Types of key-value stores

• Some data stores support ordering of keys

• Some maintain data in memory (RAM), while others employ HDDs or SSDs

• Wide range of consistency models

Consistency of key-value stores
• Consistency models range from eventual consistency to serializability
  – Serializability: a guarantee about transactions, i.e., groups of one or more operations over one or more objects
    • It guarantees that the execution of a set of transactions (with read and write operations) over multiple items is equivalent to some serial execution (total ordering) of the transactions
    • Gold standard in the database community
    • Serializability is the traditional I (isolation) in ACID
  – Examples:
    • AP: Dynamo, Riak KV, Voldemort
    • CP: Redis, Berkeley DB, Scalaris


Query features in key-value data stores

• Only query by the key!
  – There is a key and there is the rest of the data (the value)
• Most data stores provide access operations on groups of related key-value pairs
• It is not possible to query using some attribute of the value
• The key needs to be suitably chosen
  – E.g., session ID for storing session data
• What if we don't know the key?
  – Some systems allow searching inside the value using full-text search (e.g., using Apache Solr)

Suitable use cases for key-value data stores
• Storing session information in web apps
  – Every session is unique and is assigned a unique session ID value
  – Everything about the session can be stored with a single put request and retrieved with a single get
• User profiles and preferences
  – Almost every user has a unique user ID, username, etc., as well as preferences such as language, which products the user has access to, …
  – Put all of them into an object, so getting the preferences of a user takes a single get operation
• Shopping cart data
  – All the shopping information can be put into the value, where the key is the user ID


Examples of key-value stores

• Amazon's Dynamo is the most notable example
  – Riak: open-source implementation
• Other key-value stores include:
  – Amazon DynamoDB
    • Data model and name from Dynamo, but different implementation
  – Berkeley DB (ordered keys)
  – Memcached, Redis, Hazelcast (in-memory data stores)
  – Oracle NoSQL Database
  – Ehcache (Java-based cache)
  – Scalaris
  – Voldemort (used by LinkedIn)
  – upscaledb
  – LevelDB (open source, written by Google fellows)

Document data model

• Strongly aggregate-oriented
  – Lots of aggregates
  – Each aggregate has a key
• Similar to a key-value store (unique key), but provides an API or query/update language based on the internal structure of the document
  – The document content is no longer opaque
• Similar to a column-family store, but values can be complex documents instead of having a fixed format
• Document: encapsulates and encodes data in some standard format or encoding
  – XML, YAML, JSON, BSON, …


Document data model
• Data model:
  – A set of <key, document> pairs
  – Document: an aggregate instance

• Structure of the aggregate is visible – Limits on what we can place in it

• Access to an aggregate – Queries based on the fields in the aggregate

• Flexible schema
  – No strict schema to which documents must conform, which eliminates the need for schema migration efforts
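Because the aggregate's structure is visible, queries can match on fields inside it. A toy in-memory collection sketches the idea (illustrative names; real document stores such as MongoDB expose a comparable `find`-by-content API):

```python
class DocumentStore:
    """Toy document collection: unlike a key-value store, the aggregate's
    structure is visible, so queries can match on fields inside it."""
    def __init__(self):
        self._docs = {}

    def insert(self, doc_id, doc: dict):
        self._docs[doc_id] = doc

    def find(self, **criteria):
        """Return documents whose fields match all the given criteria."""
        return [d for d in self._docs.values()
                if all(d.get(f) == v for f, v in criteria.items())]

orders = DocumentStore()
orders.insert("o1", {"customer": "Ada", "total": 30, "status": "shipped"})
orders.insert("o2", {"customer": "Ada", "total": 12, "status": "open"})
orders.insert("o3", {"customer": "Bob", "total": 55, "status": "open"})

# Query by content, not by key: no key lookup involved.
print(orders.find(customer="Ada", status="open"))
```

Note the flexible schema: nothing forces the three documents to share the same fields.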

Document data model

• The data model resembles a JSON object


Document vs. SQL table

• Document: record that describes the data in the document, as well as the actual data
• Can be as complex as you choose

Document data store API

• Usual CRUD operations (although not standardized)
  – Creation (or insertion)
  – Retrieval (or query, search, find)
    • Not only simple key-to-document lookup
    • API or query language that allows the user to retrieve documents based on content (or metadata)
  – Update (or edit)
    • Replacement of the entire document or of individual structural pieces of the document
  – Deletion (or removal)


Examples of document stores

• MongoDB is the most representative
  – Documents are grouped together to form collections
  – See hands-on lesson
• Other popular document stores include:
  – CouchDB
  – Couchbase
  – RethinkDB
  – RavenDB (for .NET)
• Document stores as PaaS cloud services:
  – Amazon DocumentDB
  – Azure Cosmos DB (multi-model)
  – Google Cloud Datastore
  – Google Cloud Firestore (serverless)

Key-value vs. document stores
• Key-value store
  – A key plus a big blob of mostly meaningless bits
  – We can store whatever we like in the aggregate
  – We can only access an aggregate by lookup based on its key

• Document store
  – A key plus a structured aggregate
  – More flexibility in access
  – We can submit queries based on the fields in the aggregate
  – We can retrieve part of the aggregate rather than the whole aggregate
  – Indexes based on the contents of the aggregate


Suitable use cases for document stores

• Good for storing and managing big collections of semi-structured data with a varying number of fields
  – Text and XML documents, email
  – Conceptual documents, like de-normalized (aggregate) representations of DB entities such as product or customer
  – Sparse data in general, i.e., irregular (semi-structured) data that would require an extensive use of nulls in an RDBMS
    • Nulls being placeholders for missing or nonexistent values
• Examples of use cases:
  – Log data
  – Information for product data management (e.g., product catalogue)
  – User comments, like blog posts

When not to use document stores

• Complex transactions spanning multiple documents
  – Since MongoDB 4.x, multi-document transactions are supported, but at a greater performance cost than single-document writes

• Queries against varying aggregate structure
  – Since the data is saved as an aggregate, if the structure of the aggregate constantly changes, the aggregate must be saved at the lowest level of granularity; in this scenario, document stores may not work well


Column-family data model

• Strongly aggregate-oriented
  – Lots of aggregates
  – Each aggregate has a key
• Similar to a key-value store, but the value can have multiple attributes (columns)
• Data model: a two-level map structure
  – A set of <row key, aggregate> pairs
  – Each aggregate is a group of <column, value> pairs
  – Column: a set of data values of a particular type
• The aggregate structure is visible
• Columns can be organized in families
  – Data usually accessed together
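The two-level map can be sketched as a dictionary of dictionaries, with columns named `family:column` as in Bigtable-style stores (the class and data are illustrative):

```python
from collections import defaultdict

class ColumnFamilyStore:
    """Toy column-family store: a two-level map row_key -> {column -> value},
    with columns grouped into families via 'family:column' names."""
    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        """Whole row, or a single column of it."""
        row = dict(self._rows.get(row_key, {}))
        return row if column is None else row.get(column)

cf = ColumnFamilyStore()
cf.put("1234", "profile:name", "Ada")
cf.put("1234", "profile:city", "Rome")
cf.put("1234", "orders:last", "o2")   # rows may use different columns freely

print(cf.get("1234", "profile:name"))
```

The `get(row_key, column)` call mirrors the `get('1234', 'name')`-style access mentioned later for this data model.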

Column-family data model: example
• Representing customer information in a column-family structure


Column-store vs. row-store
• Row-store systems: store and process data by row
  – DBMSs support indexes to improve the performance of set-wide operations on whole tables
• Column-store systems: store and process data by column
  – Can access the needed data faster, rather than scanning and discarding unwanted data in rows
  – Examples: C-Store (pre-NoSQL) and its commercial fork Vertica https://bit.ly/2pwrBMN
  – HBase and Cassandra: not column stores in the original sense of the term, but column families are stored separately

Properties of column-family stores

• In many analytical database queries, only a few attributes are needed
  – Example: aggregate queries (avg, max, …)
  – A column-store is preferable for performance

• Moreover, both rows and columns are split over multiple nodes using sharding to achieve scalability

• So column-family stores are suitable for read- mostly, read-intensive, large data repositories


Properties of column-family stores
• Operations also allow picking out a particular column
  – See slide 38: get('1234', 'name')
• Each column:
  – Has to be part of a single column family
  – Acts as a unit for access
• You can add any column to any row, and rows can have very different columns
• You can model a list of items by making each item a separate column
• Two ways to look at the data:
  – Row-oriented
    • Each row is an aggregate
    • Column families represent useful chunks of data within that aggregate
  – Column-oriented
    • Each column family defines a record type

    • Row as the join of records in all column families

Suitable use cases for column-family stores

• Queries that involve only a few columns
• Aggregation queries against vast amounts of data
  – E.g., average age of the customers
• Column-wise compression
• Well suited for OLAP-like workloads (e.g., data warehouses), which typically involve highly complex queries over all data (possibly petabytes)


Examples of column-family stores
• Google's Bigtable is the most notable example
• Other popular column-family stores:
  – Apache HBase: open-source implementation of Bigtable on top of HDFS
  – Apache Accumulo: based on the Bigtable design, on top of HDFS and powered by Hadoop, ZooKeeper, and Thrift
    • Different APIs and different nomenclature from HBase, but the same from an operational and architectural standpoint
    • Better security
  – Cassandra

Examples of column-family stores

• Column-family stores as PaaS cloud services
  – Google Cloud Bigtable
  – Amazon Redshift
  – Cassandra and HBase through Amazon EMR
  – HBase through Azure HDInsight


Graph data model
• Uses a graph structure with nodes, edges, and properties to represent stored data
  – Nodes are the entities and have a set of attributes
  – Edges are the relationships between the entities
    • E.g., a user posts a comment
  – Edges can be directed or undirected
  – Nodes and edges also have individual properties consisting of key-value pairs
• Replaces relational tables with structured relational graphs of interconnected key-value pairs
• Powerful data model
  – Unlike other types of NoSQL stores, it concerns itself with relationships
  – Focus on visual representation of information (more human-friendly than other NoSQL stores)
  – Other types of NoSQL stores are poor at handling interconnected data

Graph data model: example

https://neo4j.com/developer/movie-database/


Graph database

• Explicit graph structure
• Each node knows its adjacent nodes
  – As the number of nodes increases, the cost of a local step (or hop) remains the same
• Plus an index for lookups
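A minimal sketch of the idea that "each node knows its adjacent nodes": with an explicit adjacency structure, following an edge is a constant-cost local hop regardless of graph size (nodes, edge labels, and data are illustrative, loosely in the Neo4j property-graph style):

```python
# Nodes with properties, and directed labeled edges between them.
nodes = {
    "ada": {"type": "user", "name": "Ada"},
    "bob": {"type": "user", "name": "Bob"},
    "p1":  {"type": "post", "title": "NoSQL intro"},
}
edges = [
    ("ada", "POSTED",  "p1"),
    ("bob", "LIKES",   "p1"),
    ("bob", "FOLLOWS", "ada"),
]

def neighbors(node, label=None):
    """Follow outgoing edges of a node, optionally filtered by edge label."""
    return [dst for src, lbl, dst in edges
            if src == node and (label is None or lbl == label)]

# Who does Bob follow? One local hop; no join over tables is needed.
print([nodes[n]["name"] for n in neighbors("bob", "FOLLOWS")])
```

In an RDBMS the same question would typically require a join between a users table and a follows table; here the relationship is a first-class stored object.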

• Cons:
  – Sharding: data partitioning is difficult
  – Horizontal scalability
    • When related nodes are stored on different servers, traversing multiple servers is not performance-efficient
  – Requires rewiring your brain

Examples of graph databases

• Popular graph databases:
  – Neo4j
  – OrientDB
  – Blazegraph
  – AllegroGraph
  – InfiniteGraph
• Graph databases as PaaS cloud services:
  – Amazon Neptune
  – Azure Cosmos DB (multi-model)


Suitable use cases for graph databases

• Good for applications where you need to model entities and the relationships between them
  – Social networking applications
  – Pattern recognition
  – Dependency analysis
  – Recommendation systems
  – Solving path-finding problems raised in navigation systems
  – …
• Good for applications in which the focus is on querying for relationships between entities and analyzing relationships
  – Computing relationships and querying related entities is simpler and faster than in an RDBMS

Case studies

• Key-value data stores: Amazon's Dynamo (and Riak KV), Redis
• Document-oriented data stores: MongoDB
• Column-family data stores: Google's Bigtable (and HBase), Cassandra
• Graph databases: Neo4j

In blue: Lab


Case study: Amazon's Dynamo
• Highly available and scalable distributed key-value data store built for Amazon's platform
  – A very diverse set of Amazon applications with different storage requirements
  – Need for storage technologies that are always available on a commodity hardware infrastructure
    • E.g., shopping cart service: "Customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados"
  – Meet stringent Service Level Agreements (SLAs)
    • E.g., "service guaranteeing that it will provide a response within 300ms for 99.9% of its requests for a peak client load of 500 requests per second"

G. DeCandia et al., "Dynamo: Amazon's highly available key-value store", Proc. of ACM SOSP 2007.

Dynamo: Features

• Simple key-value API
  – Simple operations to read (get) and write (put) objects uniquely identified by a key
  – Each operation involves only one object at a time
• Focus on an eventually consistent store
  – Sacrifices consistency for availability
  – BASE rather than ACID
• Efficient usage of resources
• Simple scale-out scheme to manage increasing data sets or request rates
• Internal use of Dynamo
  – Security is not an issue, since the operational environment is assumed to be non-hostile


Dynamo: Design principles
• Sacrifice consistency for availability (CAP theorem)
• Use optimistic replication techniques
• Conflicting changes are possible and must be detected and resolved: when to resolve them, and who resolves them?
  – When: execute conflict resolution during reads rather than writes, i.e., an "always writeable" data store
  – Who: data store or application; if the data store, use a simple policy (e.g., "last write wins")
• Other key principles:
  – Incremental scalability
    • Scale out with minimal impact on the system
  – Symmetry and decentralization
    • P2P techniques
  – Heterogeneity

Dynamo: API

• Each stored object has an associated key
• Simple API including get() and put() operations to read and write objects
• get(key)
  – Returns a single object, or a list of objects with conflicting versions plus a context
  – Conflicts are handled on reads; a write is never rejected
• put(key, context, object)
  – Determines where the replicas of the object should be placed based on the associated key, and writes the replicas to disk
  – Context encodes system metadata, e.g., version number
• Both key and object are treated as an opaque array of bytes
  – Key: 128-bit MD5 hash applied to the client-supplied key


Dynamo: Used techniques

Problem | Technique | Advantage
Partitioning | Consistent hashing | Incremental scalability
High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates
Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantee when some of the replicas are not available
Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background
Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information

Dynamo: Data partitioning

• Consistent hashing: the output range of a hash function is treated as a ring (similar to Chord)
  – MD5(key) -> position on the ring -> node
  – Differently from Chord: zero-hop DHT
• "Virtual nodes"
  – Each physical node can be responsible for more than one virtual node
  – Work distribution proportional to the capabilities of the individual nodes
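A minimal sketch of the ring with virtual nodes (the class and vnode count are illustrative, not Dynamo's actual implementation): each physical node is hashed to several positions, and a key is owned by the first node position found moving clockwise from the key's hash.

```python
import bisect
import hashlib

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Consistent hashing ring with virtual nodes: each physical node is
    hashed to several points, smoothing the key distribution."""
    def __init__(self, nodes, vnodes=64):
        self._ring = sorted((_h(f"{n}#{i}"), n)
                            for n in nodes for i in range(vnodes))
        self._points = [p for p, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash (wrapping around).
        i = bisect.bisect(self._points, _h(key)) % len(self._ring)
        return self._ring[i][1]

ring = ConsistentHashRing(["A", "B", "C"])
print(ring.node_for("user:42"))   # deterministic owner for this key
```

Adding a node only inserts new points on the ring: a key either keeps its owner or moves to the new node, which is the incremental-scalability property in the table above.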


Dynamo: Replication

• Each object is replicated on N nodes
  – N is a parameter configured per-instance by the application
• Preference list: list of nodes responsible for storing a particular key
  – More than N nodes, to account for node failures
  – See figure: the object identified by key K is replicated on nodes B, C and D
    • Node D will store the keys in the ranges (A, B], (B, C], and (C, D]


Dynamo: Data versioning

• A put() call may return to its caller before the update has been applied at all the replicas
• A get() call may return an object that does not have the latest updates
• Version branching can also happen due to node/network failures
• Problem: multiple versions of an object, which the system needs to reconcile
• Solution: use vector clocks to capture the causality among different versions of the same object
  – If causally related: the older version can be forgotten
  – If concurrent: a conflict exists, requiring reconciliation


Dynamo: Sloppy quorum

• R/W: minimum number of nodes that must participate in a successful read/write operation
• Setting R + W > N yields a quorum-like system
  – The latency of a get (or put) operation is dictated by the slowest of the R (or W) replicas
  – R and W are usually configured to be less than N, to provide better latency
  – Typical configuration in Dynamo: (N, R, W) = (3, 2, 2)
    • Balances performance, durability, and availability
• Sloppy quorum
  – Due to partitions, quorums might not exist
  – Sloppy quorum: create transient replicas (called hinted replicas)
    • N healthy nodes from the preference list (may not always be the first N nodes encountered while walking the consistent hashing ring)

Dynamo: Put and get operations
• put operation
  – Coordinator generates a new vector clock and writes the new version locally
  – Sends to N nodes
  – Waits for responses from W nodes

Valeria Cardellini - SABD 2018/19 61 Dynamo: Put and get operations • put operation – Coordinator generates new vector clock and writes the new version locally – Send to N nodes – Wait for response from W nodes

• get operation
  – Coordinator requests existing versions from the N nodes
    • Waits for responses from R nodes
  – If there are multiple versions, returns all versions that are causally unrelated
  – Divergent versions are then reconciled
  – The reconciled version is written back
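Why R + W > N guarantees that a read observes the latest write can be checked exhaustively for the typical (3, 2, 2) configuration: every possible read quorum must intersect every possible write quorum (a pigeonhole argument, verified by brute force below).

```python
from itertools import combinations

N, R, W = 3, 2, 2            # typical Dynamo configuration
replicas = {"A", "B", "C"}   # the N replicas of one key

# With R + W > N, every read quorum intersects every write quorum,
# so a read always contacts at least one replica holding the latest write.
overlap_always = all(set(w) & set(r)
                     for w in combinations(replicas, W)
                     for r in combinations(replicas, R))
print(f"R + W > N: {R + W > N}, every read meets the latest write: {overlap_always}")
```

With R = W = 1 (so R + W <= N) the guarantee disappears: a read may hit only replicas that missed the last write, which is exactly the eventual-consistency trade-off.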


Dynamo: Hinted handoff

• Hinted handoff for transient failures
• Consider N = 3; if A is temporarily down or unreachable, put will use D
• D knows that the replica belongs to A
• Later, D detects that A is alive again
  – Sends the replica to A
  – Removes the replica
• Again, the "always writeable" principle


Dynamo: Membership management

• Administrator explicitly adds and removes nodes
• Gossiping to propagate membership changes
  – Eventually consistent view
  – O(1) hop overlay

Dynamo: Failure detection and management
• Passive failure detection
  – Use pings only for detecting transitions from failed to alive
  – In the absence of client requests, node A doesn't need to know whether node B is alive
• Anti-entropy mechanism to keep replicas synchronized
• Use Merkle trees to rapidly detect inconsistencies and limit the amount of data transferred
  – Merkle tree: every leaf node is labeled with the hash of a data block, and every non-leaf node is labeled with the cryptographic hash of the labels of its child nodes
  – Nodes maintain a Merkle tree for each key range
  – Exchange the root of the Merkle tree to check whether the key ranges are up to date
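The root comparison can be sketched as follows (a simplified tree over a flat list of blocks; the data and helper names are illustrative): two replicas exchange a single root hash, and only if the roots differ do they descend into subtrees to locate the divergent keys.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks):
    """Bottom-up Merkle tree: hash the leaves, then hash pairs up to the root."""
    level = [h(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:                  # duplicate the last hash on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica1 = [b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
replica2 = [b"k1=v1", b"k2=STALE", b"k3=v3", b"k4=v4"]

# Comparing one root hash reveals whether a key range diverges,
# without transferring the data itself.
print(merkle_root(replica1) != merkle_root(replica2))   # True: out of sync
```

Only O(log n) hashes need to be exchanged to pinpoint the stale block, instead of shipping the whole key range.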


Riak KV

• Distributed NoSQL key-value data store inspired by Dynamo
  – Open-source version, https://docs.riak.com/riak/kv/latest/
• Like Dynamo:
  – Employs consistent hashing to partition and replicate data around the ring
  – Makes use of gossiping to propagate membership changes
  – Uses vector clocks to resolve conflicts
  – Nodes can be added to and removed from the Riak cluster as needed
  – Two ways of resolving update conflicts:
    • Last write wins
    • Both values are returned, allowing the client to resolve the conflict
• Can run on Mesos

Case study: Google's Bigtable

• Built on GFS, Chubby, SSTable
  – Data storage organized in tables, whose rows are distributed over GFS
• Available as a cloud service: Google Cloud Bigtable
• Underlies Google Cloud Datastore (document data store)
• Used by a number of Google applications, including web indexing, MapReduce, YouTube, and others
• LevelDB is based on concepts from Bigtable
  – Stores entries lexicographically sorted by keys

Chang et al., "Bigtable: A Distributed Storage System for Structured Data", ACM Trans. Comput. Syst., 2008.


Bigtable: Motivation

• Lots of semi-structured data at Google – URLs, geographical locations, ...

• Big data – Billions of URLs, hundreds of millions of users, 100+TB of satellite image data, …

Bigtable: Main features

• Distributed storage structured as a large table – Distributed multi-dimensional sorted map

• Fault-tolerant

• Scalable and self-managing

• CP system: strong consistency and network partition tolerance


Bigtable: Data model
• Table
  – Distributed, multi-dimensional, sparse and sorted map
  – Indexed by rows
• Rows
  – Sorted in lexicographical order by row key
  – Every read or write of a row is atomic: no concurrent operations on the same row
• Columns
  – The basic unit of data access
  – Sparse table: different rows may use different columns
  – Column family: group of columns
    • Data within a column family is usually of the same type
    • Column families allow for specific optimizations for better access control, storage and data indexing
  – Column naming: column-family:column
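The logical model can be sketched as a sparse map keyed by (row, "family:column", timestamp), following the Bigtable paper's description; the row key, column names, and timestamps below are illustrative:

```python
# Bigtable's logical model: a sparse, sorted, multi-dimensional map
#   (row_key, "family:column", timestamp) -> value
from collections import defaultdict

table = defaultdict(dict)   # row_key -> {(column, timestamp): value}

def put(row, column, value, ts):
    table[row][(column, ts)] = value

def get_latest(row, column):
    """Return the most recent version of a cell, or None if absent."""
    versions = [(ts, v) for (col, ts), v in table[row].items() if col == column]
    return max(versions)[1] if versions else None

put("com.example/index", "anchor:home", "Example", ts=1)
put("com.example/index", "anchor:home", "Example v2", ts=2)     # newer version
put("com.example/index", "contents:html", "<html>...</html>", ts=2)

print(get_latest("com.example/index", "anchor:home"))
```

Keeping multiple timestamped versions per cell is what makes the map "multi-dimensional"; sparsity means a row simply omits the columns it does not use.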

Bigtable: Data model

• Multi-dimensional: rows, column families and columns provide a three-level naming hierarchy in identifying data


Bigtable: Data model • Time-based – Multiple versions in each cell, each one having a timestamp

• Bigtable data model vs. relational data model

Bigtable: Tablet • Tablet: group of consecutive rows of a table stored together – The basic unit for data storing and distribution – Table sorted by row keys: select row keys properly to improve data locality • Auto-sharding: tablets are split by the system when they become too large • Each tablet is served by exactly one tablet server
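Range-based sharding can be illustrated with a toy sketch (the split keys and server names below are invented; Bigtable's real split keys are chosen dynamically as tablets grow):

```python
import bisect

# Toy range partitioning: a table is split into tablets at sorted
# split keys; each tablet is served by exactly one server.
split_keys = ["g", "n", "t"]                   # tablet boundaries
tablet_servers = ["ts0", "ts1", "ts2", "ts3"]  # one server per tablet here

def tablet_for(row_key):
    # bisect_right picks the range [prev_split, next_split) for the key.
    return tablet_servers[bisect.bisect_right(split_keys, row_key)]

print(tablet_for("apple"), tablet_for("monkey"), tablet_for("zebra"))
# -> ts0 ts1 ts3
```

Because partitioning follows the sort order, rows with nearby keys land on the same tablet, which is why choosing row keys well improves data locality.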


Bigtable: API

• Metadata operations – Create/delete tables, column families, change metadata • Write operations: single-row, atomic – Write/delete cells in a row, delete all cells in a row • Read operations: read arbitrary cells in a Bigtable table – Each row read is atomic – One row, all or specific columns, certain timestamps, ...

Bigtable: Writing and reading examples


Bigtable: Architecture • Main components: – Master server – Tablet servers – Client library

Bigtable: Master server

• A single master server • Detects addition/deletion of tablet servers • Assigns tablets to tablet servers • Balances tablet server load • Garbage collection of unneeded files in GFS • Handles schema changes – e.g., table and column family creations and additions


Bigtable: Tablet server

• Many tablet servers • Can be added or removed dynamically • Each tablet server: – Manages a set of tablets (typically 10-1000 tablets/server) – Handles read/write requests to tablets – Splits tablets when too large

Bigtable: Client library

• Library that is linked into every client • Client data do not move through the master • Clients communicate directly with tablet servers for reads/writes


Bigtable: Building blocks

• External building blocks of Bigtable: – Google File System (GFS): raw storage – Chubby: distributed lock service – Cluster scheduler: schedules jobs onto machines

Bigtable: Chubby lock service • Chubby: distributed and highly available lock service used in many Google products – File system {directory/file} for locking – Uses Paxos for consensus to keep replicas consistent – A client leases a session with the service • In Bigtable, Chubby is used to: – Ensure there is only one active master – Store the bootstrap location of Bigtable data – Discover tablet servers – Store Bigtable schema information – Store access control lists (ACL)


Bigtable: Locating rows

• Three-level indexing hierarchy • Root tablet stores the location of all METADATA tablets in a special METADATA tablet • Each METADATA tablet stores the locations of a set of user data tablets • Clients cache tablet locations (efficiency!)
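A toy sketch of the three-level lookup with client-side caching (all tablet and server names below are hypothetical):

```python
# Toy three-level tablet location lookup with client-side caching.
ROOT = {"meta1": "serverA", "meta2": "serverB"}  # root: METADATA tablet locations
METADATA = {"meta1": {"user_t1": "serverC"},     # METADATA: user tablet locations
            "meta2": {"user_t2": "serverD"}}

cache = {}                 # client-side cache of tablet locations
lookups = {"count": 0}     # how many times we walked the hierarchy

def locate(user_tablet):
    if user_tablet in cache:       # cache hit: no lookup round trips
        return cache[user_tablet]
    lookups["count"] += 1          # cache miss: root -> METADATA -> tablet
    for meta_tablet in ROOT:
        loc = METADATA[meta_tablet].get(user_tablet)
        if loc is not None:
            cache[user_tablet] = loc
            return loc
    return None

print(locate("user_t1"))  # -> serverC (walks the hierarchy)
print(locate("user_t1"))  # -> serverC (served from the cache)
```

The cache is what makes this efficient in practice: after the first lookup, a client talks to the tablet server directly without touching the root or METADATA tablets.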

Bigtable: Master startup

• The master executes the following steps at startup – Grabs a unique master lock in Chubby (to prevent concurrent master instantiations) – Scans the tablet servers directory in Chubby to find the live servers – Communicates with each live tablet server to find which tablets are assigned to each server – Scans the METADATA table to learn the full set of tablets and builds the set of unassigned tablets, which are eligible for tablet assignment


Bigtable: Tablet assignment

• Each tablet is assigned to one tablet server at a time • The master uses Chubby to keep track of live tablet servers and unassigned tablets • When a tablet server starts, it creates and acquires an exclusive lock in Chubby • The master detects the lock status of each tablet server by checking Chubby periodically • The master is responsible for detecting when a tablet server is no longer serving its tablets and reassigning those tablets as soon as possible

Bigtable: SSTable • Sorted Strings Table (SSTable) file format used to store Bigtable data durably • SSTable: persistent and immutable key-value map, sorted by keys – Stored as a series of 64KB blocks plus a block index – Block index is used to locate blocks – Index is loaded into memory when the SSTable is opened – Each SSTable is stored in a GFS file
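The block-index lookup can be sketched as follows (a toy in-memory SSTable with 2-entry blocks instead of 64 KB ones; the data is invented):

```python
import bisect

# Toy SSTable: an immutable sorted key-value "file" split into blocks;
# a small in-memory index holds the first key of each block, so a
# lookup reads a single block.
BLOCK_SIZE = 2

def build_sstable(items):
    pairs = sorted(items.items())  # SSTables are sorted by key
    blocks = [pairs[i:i + BLOCK_SIZE] for i in range(0, len(pairs), BLOCK_SIZE)]
    index = [block[0][0] for block in blocks]  # first key of each block
    return index, blocks

def sstable_get(index, blocks, key):
    # Binary search the in-memory index to pick one block to "read".
    i = bisect.bisect_right(index, key) - 1
    if i < 0:
        return None
    return dict(blocks[i]).get(key)

index, blocks = build_sstable({"a": 1, "c": 2, "e": 3, "g": 4, "i": 5})
print(sstable_get(index, blocks, "e"))  # -> 3
```

Only the small index lives in memory; each lookup touches exactly one block, which is the point of the design.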


Bigtable: SSTable • To speed up reads, an in-memory Bloom filter per SSTable is used to test if row data exists before accessing SSTables on disk

• Bloom filter: space and time-efficient probabilistic data structure used to know whether an element is present in a set – Probabilistic: the element either definitely is not in the set or may be in the set (i.e., false positives are possible but false negatives are not)
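A minimal Bloom filter sketch (toy sizes and a salted-SHA-256 hashing scheme chosen for illustration, not the parameters Bigtable actually uses):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch (toy sizes, not production-tuned)."""

    def __init__(self, m=128, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, key):
        # Derive k bit positions from k salted hashes of the key.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        # False = definitely absent; True = maybe present.
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("com.cnn.www")
print(bf.might_contain("com.cnn.www"))  # -> True
```

A negative answer lets the tablet server skip the disk read entirely; a positive answer only means the SSTable must be checked.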

Bigtable: Serving tablets • How to support fast writes with SSTables? – Write in memory and then flush to disk! • Updates are committed to a commit log (write operations are logged) • Recently committed writes are cached in memory in a memtable – Recent updates are kept sorted in memory • Older writes are stored in a series of SSTables • The memtable and the SSTables are merged to serve a read request
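The write and read paths above can be sketched as a toy LSM-style store (illustrative only; keys and values are invented):

```python
# Toy LSM-style write/read path: writes are logged, applied to an
# in-memory memtable, and periodically flushed to immutable SSTables;
# reads merge the memtable with SSTables from newest to oldest.
commit_log = []
memtable = {}
sstables = []  # index 0 = oldest flushed SSTable

def write(key, value):
    commit_log.append((key, value))  # durability first
    memtable[key] = value            # then the in-memory buffer

def flush():
    # Freeze the memtable into a new (newest) immutable SSTable.
    sstables.append(dict(memtable))
    memtable.clear()

def read(key):
    if key in memtable:               # most recent data wins
        return memtable[key]
    for table in reversed(sstables):  # then newest SSTable first
        if key in table:
            return table[key]
    return None

write("row1", "v1")
flush()
write("row1", "v2")
print(read("row1"))  # -> v2 (memtable shadows the older SSTable)
```

Writes are fast because they only append to the log and update memory; the cost is that reads may consult several structures, newest first.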


Bigtable: Loading tablets

• To load a tablet, a tablet server: – Finds the location of the tablet through the METADATA table • Metadata for a tablet includes the list of SSTables and a set of redo points – Reads the SSTable index blocks into memory – Replays the commit log since the redo points to reconstruct the memtable

Bigtable: Consistency and availability

• Strong consistency – CP system – Only one tablet server is responsible for a given piece of data – Replication is handled on the GFS layer

• Tradeoff with availability – If a tablet server fails, its portion of data is temporarily unavailable until a new server is assigned


Comparing Dynamo and Bigtable

                   Dynamo                 Bigtable
Data model         Key-value, row store   Column store
API                Single value           Single value and range
Data partition     Random                 Ordered
Optimized for      Writes                 Reads
Consistency        Eventual               Atomic
Version            Multiple versions      Timestamp
Replication        Quorum                 GFS
Data center aware  Yes                    Yes
Persistency        Local and pluggable    Replicated and distributed file system
Architecture       Decentralized          Hierarchical (master/worker)
Client library     Yes                    Yes

Cloud Bigtable • Bigtable as Cloud service: same concepts! • Sparsely populated table that can scale to billions of rows and thousands of columns – Table: sorted key-value map – Table composed of rows and columns • Each row describes a single entity • Each column contains individual values for each row • Each row/column intersection can contain multiple cells at different timestamps • Row indexed by a single row key • Columns that are related to one another are grouped together into a column family – Column identified by column family and column qualifier – Still beta: multi-region replication


Cloud Bigtable: table example

• Application: social network for United States presidents • Table: tracks who each president is following

Cloud Bigtable: use cases • When to use – To store large amounts of single-keyed data with low latency for apps that need high throughput and scalability for non-structured key-value data • Single value no larger than 10 MB • At least 1 TB of data – Examples • Marketing data (e.g., purchase histories, customer preferences) • Financial data (e.g., transaction histories, stock prices) • IoT data (e.g., usage reports from energy meters and home appliances) • Time-series data (e.g., CPU and memory usage over time for multiple servers)


Cloud Bigtable: architecture

Cloud Bigtable: usage

• Through command-line tools – Using cbt (native CLI in Go) https://cloud.google.com/bigtable/docs/cbt-overview – Using HBase shell

• Through Cloud Client Libraries for the Cloud Bigtable API


Cloud Bigtable: schema design example

• Dataset about New York City buses https://bit.ly/2JDHspc – More than 300 bus routes and 5,800 vehicles following those routes – Timestamp, origin, destination, vehicle id, vehicle latitude and longitude, expected and scheduled arrival times • Keep in mind – Each table has only one index (the row key), no secondary indexes – Rows are automatically sorted lexicographically by row key – Cloud Bigtable allows for queries using point lookups by row key or row-range scans • Try to avoid slow operations (i.e., multiple row lookups or full table scans) – Keep all information for an entity in a single row

Cloud Bigtable: schema design example

• Queries about NYC buses – Get the locations of a specific vehicle over an hour – Get the locations of an entire bus line over an hour – Get the locations of all buses in Manhattan in an hour – Get the most recent locations of all buses in Manhattan in an hour – Get the locations of an entire bus line over the month – Get the locations of an entire bus line with a certain destination over an hour • Focus on location, many different location grains, two time grains

For the code in Java see https://codelabs.developers.google.com/codelabs/cloud-bigtable-intro-java/


Cloud Bigtable: schema design example

• Row key design is central for Bigtable performance • How to design the row key? – Consider how you will use the stored data – Keep the row key reasonably short – Use human-readable values instead of hashing – Include multiple identifiers in the row key

• Common mistake: make time the first value in the row key – Can cause hot spots and result in poor performance: most writes would be pushed onto a single tablet server

Cloud Bigtable: schema design example

• How to design the row key for the NYC buses? [Bus company/Bus line/Timestamp rounded down to the hour/Vehicle ID]
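A sketch of building that row key in Python (the company, line, and vehicle id values are invented examples):

```python
from datetime import datetime, timezone

# Sketch of the row key suggested above:
# [Bus company/Bus line/Timestamp rounded down to the hour/Vehicle ID]
def row_key(company, line, ts, vehicle_id):
    hour = ts.replace(minute=0, second=0, microsecond=0)
    return f"{company}/{line}/{hour.strftime('%Y%m%d%H')}/{vehicle_id}"

ts = datetime(2019, 5, 6, 14, 37, tzinfo=timezone.utc)
key = row_key("MTA", "M86-SBS", ts, "NYCT_5824")
print(key)  # -> MTA/M86-SBS/2019050614/NYCT_5824
```

Because rows are sorted lexicographically, all readings for one line within one hour share a key prefix, so the hour-grain queries above become row-range (prefix) scans instead of full table scans.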


Case study: Cassandra • Initially developed at Facebook • A mixture of Amazon’s Dynamo and Google’s BigTable

From Dynamo: P2P architecture (replication & partitioning), gossip-based discovery and error detection. From BigTable: sparse column-oriented data model, storage architecture (SSTables, memtable, …).

Cassandra • Some large production deployments: – Apple: over 75,000 nodes, over 10 PB of data – Netflix: 2,500 nodes, 420 TB, over 1 trillion requests per day

Cassandra: Features • High availability and incremental scalability • Robust support for systems spanning multiple data centers – Asynchronous master-less replication allowing low latency operations • Data model: structured key-value store where columns are added only to specified rows – Distributed multi-dimensional map indexed by a row key – As in Bigtable: different rows can have a different number of columns and columns are grouped into column families – Emphasizes denormalization instead of normalization and joins – See example at http://bit.ly/2n1Vfua • Write-oriented system – On the contrary, Bigtable is designed for intensive read workloads


Cassandra: Consistency

• Managed in a decentralized fashion – No master node to coordinate reads and writes • Through a quorum-based protocol – If R + W > N (and W > N/2 to avoid conflicting writes) you have strong consistency – Some available consistency levels (see http://bit.ly/2n26EdE) • ONE: only a single replica must respond • QUORUM: a majority of the replicas must respond • ALL: all of the replicas must respond • LOCAL_QUORUM: a majority of the replicas in the local datacenter must respond – Tunable tradeoffs between consistency and latency • Per-query tunable consistency

SELECT points FROM fantasyfootball.playerpoints
USING CONSISTENCY QUORUM
WHERE playername = 'Tom Brady';
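The condition R + W > N guarantees that every read quorum overlaps every write quorum, so a read always contacts at least one replica holding the latest committed write. This can be checked exhaustively for small N (a brute-force illustration, not Cassandra's implementation):

```python
from itertools import combinations

# Brute-force check of the quorum rule: with N replicas, every read
# quorum of size R intersects every write quorum of size W iff R + W > N.
def quorums_always_overlap(n, r, w):
    replicas = range(n)
    return all(set(rq) & set(wq)
               for rq in combinations(replicas, r)
               for wq in combinations(replicas, w))

print(quorums_always_overlap(3, 2, 2))  # QUORUM/QUORUM with N=3 -> True
print(quorums_always_overlap(3, 1, 1))  # ONE/ONE with N=3 -> False
```

This is exactly why ONE reads paired with ONE writes are fast but only eventually consistent: a read quorum can miss the replica that took the latest write.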

Cassandra Query Language (CQL)

• An SQL-like language

CREATE COLUMNFAMILY Customer (
  KEY varchar PRIMARY KEY,
  name varchar,
  city varchar,
  web varchar);

INSERT INTO Customer (KEY, name, city, web)
VALUES ('mfowler',
  'Martin Fowler',
  'Boston',
  'www.martinfowler.com');

SELECT * FROM Customer;
SELECT name, web FROM Customer;
SELECT name, web FROM Customer WHERE city='Boston';


Case study: Neo4j

• Fully ACID-compliant and schema-free • Properties of nodes and relationships captured in the form of multiple attributes (key-value pairs) – Relationships are unidirectional • Nodes tagged with labels – Labels used to represent different roles in the modeled domain

• Neo4j architecture – Data replication via standard master-worker architecture • Single read/write master and multiple read-only workers • No data partitioning on multiple servers (constraint on data size) – Improve data throughput via a multi-level caching scheme

Neo4j: Twitter graph example


Neo4j: Cypher • Cypher: declarative SQL-like query language – See http://bit.ly/2o0kytc – Designed to be human-readable – Nodes are between ( ) – Edges are an arrow -> between two nodes

• Some examples (from movie graph, see slide 46) • Create nodes with labels and properties CREATE (TheMatrixReloaded:Movie {title:'The Matrix Reloaded', released:2003, tagline:'Free your mind'}) CREATE (Keanu:Person {name:'Keanu Reeves', born:1964}) • Create edges with properties CREATE (Keanu)-[:ACTED_IN {roles:['Neo']}]-> (TheMatrixReloaded)

Neo4j: Cypher

• Use MATCH to search for a specified pattern (corresponds to SELECT in SQL) Search for the movies where Tom Hanks acted and the directors of those movies, limiting the resulting rows to 10

MATCH (a:Person {name:'Tom Hanks'})-[:ACTED_IN]->(m)<-[:DIRECTED]-(d)
RETURN a, m, d
LIMIT 10;

– Query result in graphical view on next slide



Neo4j: Cypher

• Use MATCH to search for variable-length relationships Search for all movies related to Helen Hunt by 1 to 3 hops

MATCH (HelenH { name: 'Helen Hunt' })-[:ACTED_IN*1..3]-(movie:Movie)
RETURN movie.title

– Nodes that are a variable number of relationship->node hops away can be found using the syntax: -[:TYPE*minHops..maxHops]->

• See https://neo4j.com/sandbox for the full case study on the movie graph


Neo4j: Cypher

• Native support for shortest path queries • Example: find the shortest path between two airports (SFO and MSO)

– shortestPath returns a single shortest path between two nodes • In the example, the path length is between 0 and 2 – allShortestPaths returns all the shortest paths between two nodes

Neo4j: Cypher

• Shortest path queries (from Twitter graph)

Find all the shortest paths between two Twitter users linked by Follows relationships, with path length between 0 and 3

Find all the shortest paths between two Twitter users, with path length between 0 and 5 • More queries on the Twitter graph at https://neo4j.com/sandbox – You can map your Twitter network and analyze it with Neo4j – Plus other use cases


How to select the right data store

• Bad news: no single type fits all use cases • Main factors to consider: – Data model and access pattern • Access pattern: distribution of reads/writes and random vs. sequential access needs • Take into account also data-related features: volume, complexity, schema flexibility, and durability – Query requirements – Non-functional properties • Performance, (auto-)scalability, consistency, partitioning, replication, load balancing, concurrency mechanisms, CAP tradeoffs, security, license model, support, … • We now focus on performance

Performance comparison of NoSQL data stores

• Bad news: many studies, no clear winner • Unsolved issue: no standard benchmark – Some studies use YCSB – Yahoo Cloud Serving Benchmark (YCSB): open-source workload generation tool https://github.com/brianfrankcooper/YCSB • Focus on performance evaluation – Throughput and latency as metrics • Be careful – Consider the workload scenario: NoSQL data stores mostly divided into read and write optimized – Parameters tuning for performance optimization – In some studies bias in data store setting (e.g., custom hardware and software settings) • See https://bit.ly/2HGuuVV


Performance comparison @VLDB’12 “Solving Big Data Challenges for Enterprise Application Performance Management”, VLDB 2012 http://bit.ly/1vYYihO • Workload R: 95% reads and only 5% writes

[Figures: throughput, read latency and write latency]

• Cassandra: linear scalability in terms of throughput but slow at reading and writing • HBase: suffers in scalability, slow at reading but fast at writing • Redis: being in-memory, the fastest at reading

Performance comparison @VLDB’12

• Workload WR: 50% reads and 50% writes

[Figures: throughput and read latency]

Study conclusions (caution: old versions) - Linear scalability for Cassandra and HBase - Cassandra dominated in throughput, but with high latency - HBase achieved the least throughput, but low write latency at the cost of high read latency


Performance comparison of NoSQL data stores

• Performance is an important factor in choosing the right NoSQL data store solution

• Many performance studies, both academic and industrial – No single “winner takes all” among NoSQL data stores – Results depend on use case, workload, and deployment conditions

Which data model/store to use?

• Different data models and data stores are designed to solve different problems • Using a single data store engine for all of the requirements… – storing transactional data – caching session information – traversing a graph of customers – performing OLAP operations • … usually leads to poorly performing solutions • Also different needs for availability, consistency, backup requirements, …


Multi-model data management

• How to manage the Variety dimension of the Big Data 3V model? 1. Polyglot persistence 2. Multi-model databases

Polyglot persistence • Rather than a single data store, use multiple data storage technologies: polyglot persistence – Choose them based upon the way data are used by applications or components of a single application – Different kinds of data are best handled by different data stores: pick the right tool for the right use case

• See http://bit.ly/2plRJYq


Multi-model databases • Alternative to polyglot persistence: multi-model databases – Second-generation of NoSQL products that support multiple data models • How to realize? – Tightly-integrated polystores – Single-store multi-model database

Multi-model databases

• Pros with respect to polyglot persistence – Decrease in operational complexity and cost – No need to maintain data consistency across separate data stores

• Cons with respect to polyglot persistence – Performance not optimized for specific data model – Increased risk of vendor lock-in


References

• Sadalage and Fowler, “NoSQL Distilled”, Addison-Wesley, 2012. • Grolinger et al., “Data management in cloud environments: NoSQL and NewSQL data stores”, J. Cloud Comp., 2013. http://bit.ly/2oRKA5R • DeCandia et al., “Dynamo: Amazon’s highly available key-value store”, ACM SOSP 2007. http://bit.ly/2pmXzsr • Chang et al., “Bigtable: a distributed storage system for structured data”, OSDI 2006. http://bit.ly/2nywNRg • Lakshman and Malik, “Cassandra - a decentralized structured storage system”, LADIS 2009. http://bit.ly/2nyGSxE
