Macroarea di Ingegneria — Dipartimento di Ingegneria Civile e Ingegneria Informatica (School of Engineering — Department of Civil Engineering and Computer Science Engineering)

NoSQL Data Stores
Course: Sistemi e Architetture per Big Data (SABD — Systems and Architectures for Big Data), A.Y. 2018/19
Valeria Cardellini

Master's Degree (Laurea Magistrale) in Computer Engineering

The reference Big Data stack

Layers, top to bottom:
• High-level Interfaces — Support / Integration
• Data Processing
• Data Storage
• Resource Management

Traditional RDBMSs

• Relational DBMSs (RDBMSs)
  – Traditional technology for storing structured data in web and business applications
• SQL is good
  – Rich language and toolset
  – Easy to use and integrate
  – Many vendors
• RDBMSs promise ACID guarantees


ACID properties

• Atomicity
  – All statements included in a transaction are either executed or the whole transaction is aborted without affecting the database ("all or nothing" principle)
• Consistency
  – A database is in a consistent state before and after a transaction
• Isolation
  – Transactions cannot see uncommitted changes in the database (i.e., the results of incomplete transactions are not visible to other transactions)
• Durability
  – Changes are written to disk before a database commits a transaction, so that committed data cannot be lost through a power failure

RDBMS constraints

• Domain constraint
  – Restricts the domain of each attribute, i.e., the set of possible values for the attribute
• Entity integrity constraint
  – No primary key value can be null
• Referential integrity constraint
  – Maintains consistency among the tuples of two relations: every value of one attribute of a relation should exist as a value of another attribute in another relation
• Foreign key
  – Cross-references multiple relations: a key in a relation that matches the primary key of another relation


Pros and cons of RDBMSs

Pros:
• Well-defined consistency model
• ACID guarantees
• Relational integrity maintained through entity and referential integrity constraints
• Well suited for OLTP apps (OLTP: OnLine Transaction Processing)
• Sound theoretical foundation
• Stable and standardized DBMSs available
• Well understood

Cons:
• Performance as major constraint, scaling is difficult
• Limited support for complex data structures
• Complete knowledge of DB structure required to create ad hoc queries
• Commercial DBMSs are expensive
• Some DBMSs have limits on field size
• Data integration from multiple RDBMSs can be cumbersome

RDBMS challenges

• Web-based applications brought new demands
  – Internet-scale data size
  – High read/write rates
  – Frequent schema changes

• Let's scale RDBMSs
  – But RDBMSs were not designed to be distributed

• How to scale RDBMSs?
  – Replication
  – Sharding


Replication

• Primary-backup replication with a master/worker architecture
• Replication improves read scalability
• What about write operations?

Sharding

• Horizontal partitioning of data across many separate servers
• Scales both read and write operations
• Cannot execute transactions across shards (partitions)
• Consistent hashing is one form of sharding
  – Hash both data and nodes using the same hash function in the same ID space
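A minimal Python sketch (all names illustrative) of why naive hash-mod sharding is not enough: when the number of shards changes, most keys are remapped, which motivates consistent hashing, where only about 1/N of the keys move.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Naive hash-mod sharding: map a key to one of num_shards servers."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_shards

keys = [f"user:{i}" for i in range(10_000)]

# Assignment with 4 shards, then again after adding a 5th server.
before = {k: shard_for(k, 4) for k in keys}
after = {k: shard_for(k, 5) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys move when growing 4 -> 5 shards")
# Roughly 80% of keys change shard, so almost all data would have to be
# shuffled between servers on every resize.
```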


Scaling RDBMSs is expensive and inefficient

Source: Couchbase technical report

NoSQL data stores

• NoSQL = Not Only SQL
  – SQL-style querying is not the crucial objective


NoSQL data stores: main features
• Support flexible schema
  – No requirement for a fixed table schema
  – Well suited for Agile development processes
• Scale horizontally
  – Partitioning of data and processing over multiple nodes
• Provide high availability
  – By replicating data on multiple nodes, often geo-distributed
• Mainly utilize shared-nothing architecture
  – With the exception of graph-based databases
• Avoid unneeded complexity
  – E.g., elimination of join operations
• Multiprocessor support
• Support weaker consistency models
  – BASE rather than ACID: compromising reliability for better performance

ACID vs BASE

• Two design philosophies at opposite ends of the consistency-availability spectrum
  – Keep in mind the CAP theorem: pick two of Consistency, Availability, and Partition tolerance!

• ACID: traditional approach for RDBMSs
  – Pessimistic approach: prevent conflicts from occurring
    • Usually implemented with write locks managed by the system
    • Leads to performance degradation and deadlocks (hard to prevent and debug)
  – Does not scale well when handling petabytes of data (remember latency!)


ACID vs BASE (2)
• BASE: Basically Available, Soft state, Eventual consistency
  – Basically Available: the system is available most of the time, though some subsystem may be temporarily unavailable
  – Soft state: data is not durable, in the sense that its persistence is in the hands of the user, who must take care of refreshing it
  – Eventually consistent: the system eventually converges to a consistent state
• Optimistic approach
  – Lets conflicts occur, but detects them and takes action to sort them out
  – How?
    • Conditional updates: test the value just before updating
    • Save both updates: record that they are in conflict and then merge them
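The "conditional update" strategy above can be sketched as compare-and-set on a version counter; a toy, illustrative store (not any specific product's API):

```python
class OptimisticStore:
    """Toy store with conditional updates: a write succeeds only if the
    caller saw the current version, otherwise a conflict is reported."""
    def __init__(self):
        self._data = {}          # key -> (version, value)

    def get(self, key):
        return self._data.get(key, (0, None))

    def put_if(self, key, expected_version, value):
        version, _ = self._data.get(key, (0, None))
        if version != expected_version:
            return False          # conflict detected: caller must re-read and retry
        self._data[key] = (version + 1, value)
        return True

store = OptimisticStore()
v, _ = store.get("cart:42")
assert store.put_if("cart:42", v, ["book"])       # first writer wins
assert not store.put_if("cart:42", v, ["pen"])    # stale writer detects the conflict
```

The losing writer is not blocked by a lock (as in the pessimistic ACID approach); it simply learns after the fact that its view was stale.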

NoSQL and consistency

• Biggest change from RDBMSs
  – RDBMS: strong consistency
  – Traditional RDBMSs are CA systems (or CP systems, depending on the configuration)
• Most NoSQL systems are eventually consistent
  – I.e., AP systems
• But some NoSQL systems provide strong consistency or tunable consistency


NoSQL cost and performance

Source: Couchbase technical report

Pros and cons of NoSQL

Pros:
• Easy to scale out
• Higher performance for massive data scale
• Allows sharing of data across multiple servers
• Most solutions are either open source or cheaper
• HA and fault tolerance provided by data replication
• Supports complex data structures and objects
• No fixed schema, supports unstructured data
• Very fast retrieval of data, suitable for real-time apps

Cons:
• No ACID guarantees, less suitable for OLTP apps
• No fixed schema, no common data storage model
• Lack of standardization (e.g., querying is unique for each NoSQL data store)
• Limited support for aggregation (sum, avg, count, group by)
• Poor performance for complex joins
• No well-defined approach for DB design (different data models)
• Lack of model can lead to solution lock-in

NoSQL data models

• A number of largely diverse data stores not based on the relational data model

NoSQL data models

• A data model is a set of constructs for representing information
  – Relational model: tables, columns and rows
• Storage model: how the DBMS stores and manipulates the data internally
• A data model is usually independent of the storage model
• Data models for NoSQL systems:
  – Aggregate-oriented models: key-value, document, and column-family
  – Graph-based models


Aggregates
• Data as units having a complex structure
  – More structure than just a set of tuples
  – E.g., a complex record with simple fields, arrays, records nested inside
• Aggregate pattern in Domain-Driven Design
  – A collection of related objects that we treat as a unit
  – A unit for data manipulation and management of consistency
• Advantages of aggregates
  – Easier for application programmers to work with
  – Easier for database systems to handle when operating on a cluster

See http://thght.works/1XqYKB0

Transactions?

• Relational databases do have ACID transactions!
• Aggregate-oriented databases:
  – Support atomic transactions, but only within a single aggregate
  – Don't have ACID transactions that span multiple aggregates
• Update over multiple aggregates: possible inconsistent reads
  – Part of the consideration for deciding how to aggregate data
• Graph databases tend to support ACID transactions


Key-value data model
• Simple data model: data is represented as a schemaless collection of key-value pairs
  – Associative array (map or dictionary) as the fundamental data structure
• Strongly aggregate-oriented
  – Lots of aggregates
  – Each aggregate has a key
• Data model:
  – Set of <key, value> pairs
  – Value: aggregate instance
• The aggregate is opaque to the data store
  – Just a big blob of mostly meaningless bits
• Access to an aggregate:
  – Lookup based on its key

• Richer data models can be implemented on top

Key-value data model: example
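A minimal Python sketch of the model: the store keeps an opaque serialized blob per key, and the only access path is a lookup by key (all names illustrative):

```python
import json

class KVStore:
    """Minimal key-value store: the value is an opaque byte blob;
    the only access path is a lookup by key."""
    def __init__(self):
        self._blobs = {}

    def put(self, key: str, value) -> None:
        # Serialize the whole aggregate; the store never inspects it.
        self._blobs[key] = json.dumps(value).encode()

    def get(self, key: str):
        blob = self._blobs.get(key)
        return json.loads(blob) if blob is not None else None

kv = KVStore()
kv.put("user:1", {"name": "Ada", "langs": ["it", "en"]})
print(kv.get("user:1")["name"])   # the application, not the store, interprets the value
```

Note that any structure inside the value exists only in the application: the store cannot answer "find all users named Ada".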


Types of key-value stores

• Some data stores support ordering of keys

• Some maintain data in memory (RAM), while others employ HDDs or SSDs

• Wide range of consistency models

Consistency of key-value stores
• Consistency models range from eventual consistency to serializability
  – Serializability: a guarantee about transactions, i.e., groups of one or more operations over one or more objects
    • It guarantees that the execution of a set of transactions (with read and write operations) over multiple items is equivalent to some serial execution (total ordering) of the transactions
    • Gold standard in the database community
    • Serializability is the traditional I (isolation) in ACID
  – Examples:
    • AP: Dynamo, Riak KV, Voldemort
    • CP: Redis, Berkeley DB, Scalaris


Query features in key-value data stores

• Only query by the key!
  – There is a key and there is the rest of the data (the value)
• Most data stores provide access operations on groups of related key-value pairs
• It is not possible to query using some attribute of the value
• The key needs to be suitably chosen
  – E.g., session ID for storing session data
• What if we don't know the key?
  – Some systems allow searching inside the value using full-text search (e.g., using Apache Solr)

Suitable use cases for key-value data stores
• Storing session information in web apps
  – Every session is unique and is assigned a unique session ID value
  – Everything about the session can be stored with a single put request and retrieved with a single get
• User profiles and preferences
  – Almost every user has a unique user ID, username, etc., as well as preferences such as language, which products the user has access to, …
  – Put all of them into an object, so getting the preferences of a user takes a single get operation
• Shopping cart data
  – All the shopping information can be put into the value, where the key is the user ID


Examples of key-value stores

• Amazon's Dynamo is the most notable example
  – Riak: open-source implementation
• Other key-value stores include:
  – Amazon DynamoDB
    • Data model and name from Dynamo, but different implementation
  – Berkeley DB (ordered keys)
  – Memcached, Redis, Hazelcast (in-memory data stores)
  – Oracle NoSQL Database
  – Ehcache (Java-based cache)
  – Scalaris
  – Voldemort (used by LinkedIn)
  – upscaledb
  – LevelDB (open source, written by Google fellows)

Document data model

• Strongly aggregate-oriented
  – Lots of aggregates
  – Each aggregate has a key
• Similar to a key-value store (unique key), but provides an API or query/update language based on the internal structure of the document
  – The document content is no longer opaque
• Similar to a column-family store, but values can be complex documents instead of having a fixed format
• Document: encapsulates and encodes data in some standard format or encoding
  – XML, YAML, JSON, BSON, …


Document data model
• Data model:
  – A set of <key, document> pairs
  – Document: an aggregate instance

• Structure of the aggregate is visible – Limits on what we can place in it

• Access to an aggregate – Queries based on the fields in the aggregate

• Flexible schema
  – No strict schema to which documents must conform, which eliminates the need for schema migration efforts
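Because the aggregate's structure is visible, queries can match on fields inside it. A toy in-memory collection sketches the idea (illustrative names; real document stores such as MongoDB expose a comparable `find`-by-content API):

```python
class DocumentStore:
    """Toy document collection: unlike a key-value store, the aggregate's
    structure is visible, so queries can match on fields inside it."""
    def __init__(self):
        self._docs = {}

    def insert(self, doc_id, doc: dict):
        self._docs[doc_id] = doc

    def find(self, **criteria):
        """Return documents whose fields match all the given criteria."""
        return [d for d in self._docs.values()
                if all(d.get(f) == v for f, v in criteria.items())]

orders = DocumentStore()
orders.insert("o1", {"customer": "Ada", "total": 30, "status": "shipped"})
orders.insert("o2", {"customer": "Ada", "total": 12, "status": "open"})
orders.insert("o3", {"customer": "Bob", "total": 55, "status": "open"})

# Query by content, not by key: no key lookup involved.
print(orders.find(customer="Ada", status="open"))
```

Note the flexible schema: nothing forces the three documents to share the same fields.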

Document data model

• The data model resembles a JSON object


Document vs. SQL table

• Document: record that describes the data in the document, as well as the actual data
• Can be as complex as you choose

Document data store API

• Usual CRUD operations (although not standardized)
  – Creation (or insertion)
  – Retrieval (or query, search, find)
    • Not only simple key-to-document lookup
    • API or query language that allows the user to retrieve documents based on content (or metadata)
  – Update (or edit)
    • Replacement of the entire document or of individual structural pieces of the document
  – Deletion (or removal)


Examples of document stores

• MongoDB is the most representative
  – Documents are grouped together to form collections
  – See hands-on lesson
• Other popular document stores include:
  – CouchDB
  – Couchbase
  – RethinkDB
  – RavenDB (for .NET)
• Document stores as PaaS cloud services:
  – Amazon DocumentDB
  – Azure Cosmos DB (multi-model)
  – Google Cloud Datastore
  – Google Cloud Firestore (serverless)

Key-value vs. document stores
• Key-value store
  – A key plus a big blob of mostly meaningless bits
  – We can store whatever we like in the aggregate
  – We can only access an aggregate by lookup based on its key

• Document store
  – A key plus a structured aggregate
  – More flexibility in access
  – We can submit queries based on the fields in the aggregate
  – We can retrieve part of the aggregate rather than the whole aggregate
  – Indexes based on the contents of the aggregate


Suitable use cases for document stores

• Good for storing and managing big collections of semi-structured data with a varying number of fields
  – Text and XML documents, email
  – Conceptual documents, like de-normalized (aggregate) representations of DB entities such as product or customer
  – Sparse data in general, i.e., irregular (semi-structured) data that would require an extensive use of nulls in an RDBMS
    • Nulls being placeholders for missing or nonexistent values
• Examples of use cases:
  – Log data
  – Information for product data management (e.g., product catalogue)
  – User comments, like blog posts

When not to use document stores

• Complex transactions spanning multiple documents
  – Since MongoDB 4.x, multi-document transactions are supported, but at a greater performance cost than single-document writes

• Queries against varying aggregate structure
  – Since the data is saved as an aggregate, if the structure of the aggregate constantly changes, the aggregate must be saved at the lowest level of granularity; in this scenario, document stores may not work well


Column-family data model

• Strongly aggregate-oriented
  – Lots of aggregates
  – Each aggregate has a key
• Similar to a key-value store, but the value can have multiple attributes (columns)
• Data model: a two-level map structure
  – A set of <row key, aggregate> pairs
  – Each aggregate is a group of <column, value> pairs
  – Column: a set of data values of a particular type
• The aggregate structure is visible
• Columns can be organized in families
  – Data usually accessed together
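The two-level map can be sketched as a dictionary of dictionaries, with columns named `family:column` as in Bigtable-style stores (the class and data are illustrative):

```python
from collections import defaultdict

class ColumnFamilyStore:
    """Toy column-family store: a two-level map row_key -> {column -> value},
    with columns grouped into families via 'family:column' names."""
    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        """Whole row, or a single column of it."""
        row = dict(self._rows.get(row_key, {}))
        return row if column is None else row.get(column)

cf = ColumnFamilyStore()
cf.put("1234", "profile:name", "Ada")
cf.put("1234", "profile:city", "Rome")
cf.put("1234", "orders:last", "o2")   # rows may use different columns freely

print(cf.get("1234", "profile:name"))
```

The `get(row_key, column)` call mirrors the `get('1234', 'name')`-style access mentioned later for this data model.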

Column-family data model: example
• Representing customer information in a column-family structure


Column-store vs. row-store
• Row-store systems: store and process data by row
  – DBMSs support indexes to improve the performance of set-wide operations on whole tables
• Column-store systems: store and process data by column
  – Can access the needed data faster, rather than scanning and discarding unwanted data in rows
  – Examples: C-Store (pre-NoSQL) and its commercial fork Vertica https://bit.ly/2pwrBMN
  – HBase and Cassandra: not column stores in the original sense of the term, but column families are stored separately

Properties of column-family stores

• In many analytical database queries, only a few attributes are needed
  – Example: aggregate queries (avg, max, …)
  – A column-store is preferable for performance

• Moreover, both rows and columns are split over multiple nodes using sharding to achieve scalability

• So column-family stores are suitable for read- mostly, read-intensive, large data repositories


Properties of column-family stores
• Operations also allow picking out a particular column
  – See slide 38: get('1234', 'name')
• Each column:
  – Has to be part of a single column family
  – Acts as a unit for access
• You can add any column to any row, and rows can have very different columns
• You can model a list of items by making each item a separate column
• Two ways to look at the data:
  – Row-oriented
    • Each row is an aggregate
    • Column families represent useful chunks of data within that aggregate
  – Column-oriented
    • Each column family defines a record type

    • Row as the join of records in all column families

Suitable use cases for column-family stores

• Queries that involve only a few columns
• Aggregation queries against vast amounts of data
  – E.g., average age of the customers
• Column-wise compression
• Well suited for OLAP-like workloads (e.g., data warehouses), which typically involve highly complex queries over all data (possibly petabytes)


Examples of column-family stores
• Google's Bigtable is the most notable example
• Other popular column-family stores:
  – Apache HBase: open-source implementation of Bigtable on top of HDFS
  – Apache Accumulo: based on the Bigtable design, on top of HDFS and powered by Hadoop, ZooKeeper, and Thrift
    • Different APIs and different nomenclature from HBase, but the same from an operational and architectural standpoint
    • Better security
  – Cassandra

Examples of column-family stores

• Column-family stores as PaaS cloud services
  – Google Cloud Bigtable
  – Amazon Redshift
  – Cassandra and HBase through Amazon EMR
  – HBase through Azure HDInsight


Graph data model
• Uses a graph structure with nodes, edges, and properties to represent stored data
  – Nodes are the entities and have a set of attributes
  – Edges are the relationships between the entities
    • E.g., a user posts a comment
  – Edges can be directed or undirected
  – Nodes and edges also have individual properties consisting of key-value pairs
• Replaces relational tables with structured relational graphs of interconnected key-value pairs
• Powerful data model
  – Unlike other types of NoSQL stores, it concerns itself with relationships
  – Focus on visual representation of information (more human-friendly than other NoSQL stores)
  – Other types of NoSQL stores are poor at handling interconnected data

Graph data model: example

https://neo4j.com/developer/movie-database/


Graph database

• Explicit graph structure
• Each node knows its adjacent nodes
  – As the number of nodes increases, the cost of a local step (or hop) remains the same
• Plus an index for lookups
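A minimal sketch of the idea that "each node knows its adjacent nodes": with an explicit adjacency structure, following an edge is a constant-cost local hop regardless of graph size (nodes, edge labels, and data are illustrative, loosely in the Neo4j property-graph style):

```python
# Nodes with properties, and directed labeled edges between them.
nodes = {
    "ada": {"type": "user", "name": "Ada"},
    "bob": {"type": "user", "name": "Bob"},
    "p1":  {"type": "post", "title": "NoSQL intro"},
}
edges = [
    ("ada", "POSTED",  "p1"),
    ("bob", "LIKES",   "p1"),
    ("bob", "FOLLOWS", "ada"),
]

def neighbors(node, label=None):
    """Follow outgoing edges of a node, optionally filtered by edge label."""
    return [dst for src, lbl, dst in edges
            if src == node and (label is None or lbl == label)]

# Who does Bob follow? One local hop; no join over tables is needed.
print([nodes[n]["name"] for n in neighbors("bob", "FOLLOWS")])
```

In an RDBMS the same question would typically require a join between a users table and a follows table; here the relationship is a first-class stored object.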

• Cons:
  – Sharding: data partitioning is difficult
  – Horizontal scalability
    • When related nodes are stored on different servers, traversing multiple servers is not performance-efficient
  – Requires rewiring your brain

Examples of graph databases

• Popular graph databases:
  – Neo4j
  – OrientDB
  – Blazegraph
  – AllegroGraph
  – InfiniteGraph
• Graph databases as PaaS cloud services:
  – Amazon Neptune
  – Azure Cosmos DB (multi-model)


Suitable use cases for graph databases

• Good for applications where you need to model entities and the relationships between them
  – Social networking applications
  – Pattern recognition
  – Dependency analysis
  – Recommendation systems
  – Solving path-finding problems raised in navigation systems
  – …
• Good for applications in which the focus is on querying for relationships between entities and analyzing relationships
  – Computing relationships and querying related entities is simpler and faster than in an RDBMS

Case studies

• Key-value data stores: Amazon's Dynamo (and Riak KV), Redis
• Document-oriented data stores: MongoDB
• Column-family data stores: Google's Bigtable (and HBase), Cassandra
• Graph databases: Neo4j

In blue: Lab


Case study: Amazon's Dynamo
• Highly available and scalable distributed key-value data store built for Amazon's platform
  – A very diverse set of Amazon applications with different storage requirements
  – Need for storage technologies that are always available on a commodity hardware infrastructure
    • E.g., shopping cart service: "Customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados"
  – Meet stringent Service Level Agreements (SLAs)
    • E.g., "service guaranteeing that it will provide a response within 300ms for 99.9% of its requests for a peak client load of 500 requests per second"

G. DeCandia et al., "Dynamo: Amazon's highly available key-value store", Proc. of ACM SOSP 2007.

Dynamo: Features

• Simple key-value API
  – Simple operations to read (get) and write (put) objects uniquely identified by a key
  – Each operation involves only one object at a time
• Focus on an eventually consistent store
  – Sacrifices consistency for availability
  – BASE rather than ACID
• Efficient usage of resources
• Simple scale-out scheme to manage increasing data sets or request rates
• Internal use of Dynamo
  – Security is not an issue, since the operational environment is assumed to be non-hostile


Dynamo: Design principles
• Sacrifice consistency for availability (CAP theorem)
• Use optimistic replication techniques
• Conflicting changes are possible and must be detected and resolved: when to resolve them, and who resolves them?
  – When: execute conflict resolution during reads rather than writes, i.e., an "always writeable" data store
  – Who: data store or application; if the data store, use a simple policy (e.g., "last write wins")
• Other key principles:
  – Incremental scalability
    • Scale out with minimal impact on the system
  – Symmetry and decentralization
    • P2P techniques
  – Heterogeneity

Dynamo: API

• Each stored object has an associated key
• Simple API including get() and put() operations to read and write objects
• get(key)
  – Returns a single object, or a list of objects with conflicting versions plus a context
  – Conflicts are handled on reads; a write is never rejected
• put(key, context, object)
  – Determines where the replicas of the object should be placed based on the associated key, and writes the replicas to disk
  – Context encodes system metadata, e.g., version number
• Both key and object are treated as an opaque array of bytes
  – Key: 128-bit MD5 hash applied to the client-supplied key


Dynamo: Used techniques

Problem | Technique | Advantage
Partitioning | Consistent hashing | Incremental scalability
High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates
Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantee when some of the replicas are not available
Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background
Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information

Dynamo: Data partitioning

• Consistent hashing: the output range of a hash function is treated as a ring (similar to Chord)
  – MD5(key) -> position on the ring -> node
  – Differently from Chord: zero-hop DHT
• "Virtual nodes"
  – Each physical node can be responsible for more than one virtual node
  – Work distribution proportional to the capabilities of the individual nodes
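A minimal sketch of the ring with virtual nodes (the class and vnode count are illustrative, not Dynamo's actual implementation): each physical node is hashed to several positions, and a key is owned by the first node position found moving clockwise from the key's hash.

```python
import bisect
import hashlib

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Consistent hashing ring with virtual nodes: each physical node is
    hashed to several points, smoothing the key distribution."""
    def __init__(self, nodes, vnodes=64):
        self._ring = sorted((_h(f"{n}#{i}"), n)
                            for n in nodes for i in range(vnodes))
        self._points = [p for p, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash (wrapping around).
        i = bisect.bisect(self._points, _h(key)) % len(self._ring)
        return self._ring[i][1]

ring = ConsistentHashRing(["A", "B", "C"])
print(ring.node_for("user:42"))   # deterministic owner for this key
```

Adding a node only inserts new points on the ring: a key either keeps its owner or moves to the new node, which is the incremental-scalability property in the table above.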


Dynamo: Replication

• Each object is replicated on N nodes
  – N is a parameter configured per-instance by the application
• Preference list: list of nodes responsible for storing a particular key
  – More than N nodes, to account for node failures
  – See figure: the object identified by key K is replicated on nodes B, C and D
    • Node D will store the keys in the ranges (A, B], (B, C], and (C, D]


Dynamo: Data versioning

• A put() call may return to its caller before the update has been applied at all the replicas
• A get() call may return an object that does not have the latest updates
• Version branching can also happen due to node/network failures
• Problem: multiple versions of an object, which the system needs to reconcile
• Solution: use vector clocks to capture the causality among different versions of the same object
  – If causally related: the older version can be forgotten
  – If concurrent: a conflict exists, requiring reconciliation


Dynamo: Sloppy quorum

• R/W: minimum number of nodes that must participate in a successful read/write operation
• Setting R + W > N yields a quorum-like system
  – The latency of a get (or put) operation is dictated by the slowest of the R (or W) replicas
  – R and W are usually configured to be less than N, to provide better latency
  – Typical configuration in Dynamo: (N, R, W) = (3, 2, 2)
    • Balances performance, durability, and availability
• Sloppy quorum
  – Due to partitions, quorums might not exist
  – Sloppy quorum: create transient replicas (called hinted replicas)
    • N healthy nodes from the preference list (may not always be the first N nodes encountered while walking the consistent hashing ring)

Dynamo: Put and get operations
• put operation
  – Coordinator generates a new vector clock and writes the new version locally
  – Sends to N nodes
  – Waits for responses from W nodes

Valeria Cardellini - SABD 2018/19 61 Dynamo: Put and get operations • put operation – Coordinator generates new vector clock and writes the new version locally – Send to N nodes – Wait for response from W nodes

• get operation
  – Coordinator requests existing versions from the N nodes
    • Waits for responses from R nodes
  – If there are multiple versions, returns all versions that are causally unrelated
  – Divergent versions are then reconciled
  – The reconciled version is written back
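Why R + W > N guarantees that a read observes the latest write can be checked exhaustively for the typical (3, 2, 2) configuration: every possible read quorum must intersect every possible write quorum (a pigeonhole argument, verified by brute force below).

```python
from itertools import combinations

N, R, W = 3, 2, 2            # typical Dynamo configuration
replicas = {"A", "B", "C"}   # the N replicas of one key

# With R + W > N, every read quorum intersects every write quorum,
# so a read always contacts at least one replica holding the latest write.
overlap_always = all(set(w) & set(r)
                     for w in combinations(replicas, W)
                     for r in combinations(replicas, R))
print(f"R + W > N: {R + W > N}, every read meets the latest write: {overlap_always}")
```

With R = W = 1 (so R + W <= N) the guarantee disappears: a read may hit only replicas that missed the last write, which is exactly the eventual-consistency trade-off.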


Dynamo: Hinted handoff

• Hinted handoff for transient failures
• Consider N = 3; if A is temporarily down or unreachable, put will use D
• D knows that the replica belongs to A
• Later, D detects that A is alive again
  – Sends the replica to A
  – Removes the replica
• Again, the "always writeable" principle


Dynamo: Membership management

• Administrator explicitly adds and removes nodes
• Gossiping to propagate membership changes
  – Eventually consistent view
  – O(1) hop overlay

Dynamo: Failure detection and management
• Passive failure detection
  – Use pings only for detecting transitions from failed to alive
  – In the absence of client requests, node A doesn't need to know whether node B is alive
• Anti-entropy mechanism to keep replicas synchronized
• Use Merkle trees to rapidly detect inconsistencies and limit the amount of data transferred
  – Merkle tree: every leaf node is labeled with the hash of a data block, and every non-leaf node is labeled with the cryptographic hash of the labels of its child nodes
  – Nodes maintain a Merkle tree for each key range
  – Exchange the root of the Merkle tree to check whether the key ranges are up to date
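The root comparison can be sketched as follows (a simplified tree over a flat list of blocks; the data and helper names are illustrative): two replicas exchange a single root hash, and only if the roots differ do they descend into subtrees to locate the divergent keys.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks):
    """Bottom-up Merkle tree: hash the leaves, then hash pairs up to the root."""
    level = [h(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:                  # duplicate the last hash on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica1 = [b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
replica2 = [b"k1=v1", b"k2=STALE", b"k3=v3", b"k4=v4"]

# Comparing one root hash reveals whether a key range diverges,
# without transferring the data itself.
print(merkle_root(replica1) != merkle_root(replica2))   # True: out of sync
```

Only O(log n) hashes need to be exchanged to pinpoint the stale block, instead of shipping the whole key range.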


Riak KV

• Distributed NoSQL key-value data store inspired by Dynamo
  – Open-source version, https://docs.riak.com/riak/kv/latest/
• Like Dynamo:
  – Employs consistent hashing to partition and replicate data around the ring
  – Makes use of gossiping to propagate membership changes
  – Uses vector clocks to resolve conflicts
  – Nodes can be added to and removed from the Riak cluster as needed
  – Two ways of resolving update conflicts:
    • Last write wins
    • Both values are returned, allowing the client to resolve the conflict
• Can run on Mesos

Case study: Google's Bigtable

• Built on GFS, Chubby, SSTable
  – Data storage organized in tables, whose rows are distributed over GFS
• Available as a cloud service: Google Cloud Bigtable
• Underlies Google Cloud Datastore (document data store)
• Used by a number of Google applications, including web indexing, MapReduce, YouTube, and others
• LevelDB is based on concepts from Bigtable
  – Stores entries lexicographically sorted by keys

Chang et al., "Bigtable: A Distributed Storage System for Structured Data", ACM Trans. Comput. Syst., 2008.


Bigtable: Motivation

• Lots of semi-structured data at Google – URLs, geographical locations, ...

• Big data – Billions of URLs, hundreds of millions of users, 100+TB of satellite image data, …

Bigtable: Main features

• Distributed storage structured as a large table – Distributed multi-dimensional sorted map

• Fault-tolerant

• Scalable and self-managing

• CP system: strong consistency and network partition tolerance


Bigtable: Data model
• Table
  – Distributed, multi-dimensional, sparse and sorted map
  – Indexed by rows
• Rows
  – Sorted in lexicographical order by row key
  – Every read or write of a row is atomic: no concurrent operations on the same row
• Columns
  – The basic unit of data access
  – Sparse table: different rows may use different columns
  – Column family: group of columns
    • Data within a column family is usually of the same type
    • Column families allow for specific optimizations for better access control, storage and data indexing
  – Column naming: column-family:column
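The logical model can be sketched as a sparse map keyed by (row, "family:column", timestamp), following the Bigtable paper's description; the row key, column names, and timestamps below are illustrative:

```python
# Bigtable's logical model: a sparse, sorted, multi-dimensional map
#   (row_key, "family:column", timestamp) -> value
from collections import defaultdict

table = defaultdict(dict)   # row_key -> {(column, timestamp): value}

def put(row, column, value, ts):
    table[row][(column, ts)] = value

def get_latest(row, column):
    """Return the most recent version of a cell, or None if absent."""
    versions = [(ts, v) for (col, ts), v in table[row].items() if col == column]
    return max(versions)[1] if versions else None

put("com.example/index", "anchor:home", "Example", ts=1)
put("com.example/index", "anchor:home", "Example v2", ts=2)     # newer version
put("com.example/index", "contents:html", "<html>...</html>", ts=2)

print(get_latest("com.example/index", "anchor:home"))
```

Keeping multiple timestamped versions per cell is what makes the map "multi-dimensional"; sparsity means a row simply omits the columns it does not use.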

Bigtable: Data model

• Multi-dimensional: rows, column families and columns provide a three-level naming hierarchy in identifying data


Bigtable: Data model • Time-based – Multiple versions in each cell, each one having a timestamp

• Bigtable data model vs. relational data model

Bigtable: Tablet • Tablet: group of consecutive rows of a table stored together – The basic unit for data storing and distribution – Table sorted by row keys: select row keys properly to improve data locality • Auto-sharding: tablets are split by the system when they become too large • Each tablet is served by exactly one tablet server
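Range-based sharding can be illustrated with a toy sketch (the split keys and server names below are invented; Bigtable's real split keys are chosen dynamically as tablets grow):

```python
import bisect

# Toy range partitioning: a table is split into tablets at sorted
# split keys; each tablet is served by exactly one server.
split_keys = ["g", "n", "t"]                   # tablet boundaries
tablet_servers = ["ts0", "ts1", "ts2", "ts3"]  # one server per tablet here

def tablet_for(row_key):
    # bisect_right picks the range [prev_split, next_split) for the key.
    return tablet_servers[bisect.bisect_right(split_keys, row_key)]

print(tablet_for("apple"), tablet_for("monkey"), tablet_for("zebra"))
# -> ts0 ts1 ts3
```

Because partitioning follows the sort order, rows with nearby keys land on the same tablet, which is why choosing row keys well improves data locality.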


Bigtable: API

• Metadata operations – Create/delete tables, column families, change metadata • Write operations: single-row, atomic – Write/delete cells in a row, delete all cells in a row • Read operations: read arbitrary cells in a Bigtable table – Each row read is atomic – One row, all or specific columns, certain timestamps, ...

Bigtable: Writing and reading examples


Bigtable: Architecture • Main components: – Master server – Tablet servers – Client library

Bigtable: Master server

• A single master server • Detects addition/deletion of tablet servers • Assigns tablets to tablet servers • Balances tablet server load • Garbage collection of unneeded files in GFS • Handles schema changes – e.g., table and column family creations and additions


Bigtable: Tablet server

• Many tablet servers • Can be added or removed dynamically • Each tablet server: – Manages a set of tablets (typically 10-1000 tablets/server) – Handles read/write requests to tablets – Splits tablets when too large

Bigtable: Client library

• Library that is linked into every client • Client data do not move through the master • Clients communicate directly with tablet servers for reads/writes


Bigtable: Building blocks

• External building blocks of Bigtable: – Google File System (GFS): raw storage – Chubby: distributed lock service – Cluster scheduler: schedules jobs onto machines

Bigtable: Chubby lock service • Chubby: distributed and highly available lock service used in many Google products – File system {directory/file} for locking – Uses Paxos for consensus to keep replicas consistent – A client leases a session with the service • In Bigtable, Chubby is used to: – Ensure there is only one active master – Store the bootstrap location of Bigtable data – Discover tablet servers – Store Bigtable schema information – Store access control lists (ACL)


Bigtable: Locating rows

• Three-level indexing hierarchy • Root tablet stores the location of all METADATA tablets in a special METADATA tablet • Each METADATA tablet stores the locations of a set of user data tablets • Clients cache tablet locations (efficiency!)
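A toy sketch of the three-level lookup with client-side caching (all tablet and server names below are hypothetical):

```python
# Toy three-level tablet location lookup with client-side caching.
ROOT = {"meta1": "serverA", "meta2": "serverB"}  # root: METADATA tablet locations
METADATA = {"meta1": {"user_t1": "serverC"},     # METADATA: user tablet locations
            "meta2": {"user_t2": "serverD"}}

cache = {}                 # client-side cache of tablet locations
lookups = {"count": 0}     # how many times we walked the hierarchy

def locate(user_tablet):
    if user_tablet in cache:       # cache hit: no lookup round trips
        return cache[user_tablet]
    lookups["count"] += 1          # cache miss: root -> METADATA -> tablet
    for meta_tablet in ROOT:
        loc = METADATA[meta_tablet].get(user_tablet)
        if loc is not None:
            cache[user_tablet] = loc
            return loc
    return None

print(locate("user_t1"))  # -> serverC (walks the hierarchy)
print(locate("user_t1"))  # -> serverC (served from the cache)
```

The cache is what makes this efficient in practice: after the first lookup, a client talks to the tablet server directly without touching the root or METADATA tablets.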

Bigtable: Master startup

• The master executes the following steps at startup – Grabs a unique master lock in Chubby (to prevent concurrent master instantiations) – Scans the tablet servers directory in Chubby to find the live servers – Communicates with each live tablet server to find which tablets are assigned to each server – Scans the METADATA table to learn the full set of tablets and builds the set of unassigned tablets, which are eligible for tablet assignment


Bigtable: Tablet assignment

• Each tablet is assigned to one tablet server at a time • The master uses Chubby to keep track of live tablet servers and unassigned tablets • When a tablet server starts, it creates and acquires an exclusive lock in Chubby • The master detects the lock status of each tablet server by checking Chubby periodically • The master is responsible for detecting when a tablet server is no longer serving its tablets and reassigning those tablets as soon as possible

Bigtable: SSTable • Sorted Strings Table (SSTable) file format used to store Bigtable data durably • SSTable: persistent and immutable key-value map, sorted by keys – Stored as a series of 64KB blocks plus a block index – Block index is used to locate blocks – Index is loaded into memory when the SSTable is opened – Each SSTable is stored in a GFS file
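The block-index lookup can be sketched as follows (a toy in-memory SSTable with 2-entry blocks instead of 64 KB ones; the data is invented):

```python
import bisect

# Toy SSTable: an immutable sorted key-value "file" split into blocks;
# a small in-memory index holds the first key of each block, so a
# lookup reads a single block.
BLOCK_SIZE = 2

def build_sstable(items):
    pairs = sorted(items.items())  # SSTables are sorted by key
    blocks = [pairs[i:i + BLOCK_SIZE] for i in range(0, len(pairs), BLOCK_SIZE)]
    index = [block[0][0] for block in blocks]  # first key of each block
    return index, blocks

def sstable_get(index, blocks, key):
    # Binary search the in-memory index to pick one block to "read".
    i = bisect.bisect_right(index, key) - 1
    if i < 0:
        return None
    return dict(blocks[i]).get(key)

index, blocks = build_sstable({"a": 1, "c": 2, "e": 3, "g": 4, "i": 5})
print(sstable_get(index, blocks, "e"))  # -> 3
```

Only the small index lives in memory; each lookup touches exactly one block, which is the point of the design.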


Bigtable: SSTable • To speed up reads, an in-memory Bloom filter per SSTable is used to test if row data exists before accessing SSTables on disk

• Bloom filter: space and time-efficient probabilistic data structure used to know whether an element is present in a set – Probabilistic: the element either definitely is not in the set or may be in the set (i.e., false positives are possible but false negatives are not)
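A minimal Bloom filter sketch (toy sizes and a salted-SHA-256 hashing scheme chosen for illustration, not the parameters Bigtable actually uses):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch (toy sizes, not production-tuned)."""

    def __init__(self, m=128, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, key):
        # Derive k bit positions from k salted hashes of the key.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        # False = definitely absent; True = maybe present.
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("com.cnn.www")
print(bf.might_contain("com.cnn.www"))  # -> True
```

A negative answer lets the tablet server skip the disk read entirely; a positive answer only means the SSTable must be checked.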

Bigtable: Serving tablets • How to support fast writes with SSTables? – Write in memory and then flush to disk! • Updates are committed to a commit log (write operations are logged) • Recently committed writes are cached in memory in a memtable – Recent updates are kept sorted in memory • Older writes are stored in a series of SSTables • The memtable and the SSTables are merged to serve a read request
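The write and read paths above can be sketched as a toy LSM-style store (illustrative only; keys and values are invented):

```python
# Toy LSM-style write/read path: writes are logged, applied to an
# in-memory memtable, and periodically flushed to immutable SSTables;
# reads merge the memtable with SSTables from newest to oldest.
commit_log = []
memtable = {}
sstables = []  # index 0 = oldest flushed SSTable

def write(key, value):
    commit_log.append((key, value))  # durability first
    memtable[key] = value            # then the in-memory buffer

def flush():
    # Freeze the memtable into a new (newest) immutable SSTable.
    sstables.append(dict(memtable))
    memtable.clear()

def read(key):
    if key in memtable:               # most recent data wins
        return memtable[key]
    for table in reversed(sstables):  # then newest SSTable first
        if key in table:
            return table[key]
    return None

write("row1", "v1")
flush()
write("row1", "v2")
print(read("row1"))  # -> v2 (memtable shadows the older SSTable)
```

Writes are fast because they only append to the log and update memory; the cost is that reads may consult several structures, newest first.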


Bigtable: Loading tablets

• To load a tablet, a tablet server: – Finds the location of the tablet through the METADATA table • Metadata for a tablet includes the list of SSTables and a set of redo points – Reads the SSTable index blocks into memory – Replays the commit log since the redo points to reconstruct the memtable

Bigtable: Consistency and availability

• Strong consistency – CP system – Only one tablet server is responsible for a given piece of data – Replication is handled on the GFS layer

• Tradeoff with availability – If a tablet server fails, its portion of data is temporarily unavailable until a new server is assigned


Comparing Dynamo and Bigtable

                   Dynamo                 Bigtable
Data model         Key-value, row store   Column store
API                Single value           Single value and range
Data partition     Random                 Ordered
Optimized for      Writes                 Reads
Consistency        Eventual               Atomic
Version            Multiple versions      Timestamp
Replication        Quorum                 GFS
Data center aware  Yes                    Yes
Persistency        Local and pluggable    Replicated and distributed file system
Architecture       Decentralized          Hierarchical (master/worker)
Client library     Yes                    Yes

Cloud Bigtable • Bigtable as Cloud service: same concepts! • Sparsely populated table that can scale to billions of rows and thousands of columns – Table: sorted key-value map – Table composed of rows and columns • Each row describes a single entity • Each column contains individual values for each row • Each row/column intersection can contain multiple cells at different timestamps • Row indexed by a single row key • Columns that are related to one another are grouped together into a column family – Column identified by column family and column qualifier – Still beta: multi-region replication


Cloud Bigtable: table example

• Application: social network for United States presidents • Table: tracks who each president is following

Cloud Bigtable: use cases • When to use – To store large amounts of single-keyed data with low latency for apps that need high throughput and scalability for non-structured key-value data • Single value no larger than 10 MB • At least 1 TB of data – Examples • Marketing data (e.g., purchase histories, customer preferences) • Financial data (e.g., transaction histories, stock prices) • IoT data (e.g., usage reports from energy meters and home appliances) • Time-series data (e.g., CPU and memory usage over time for multiple servers)


Cloud Bigtable: architecture

Cloud Bigtable: usage

• Through command-line tools – Using cbt (native CLI in Go) https://cloud.google.com/bigtable/docs/cbt-overview – Using HBase shell

• Through Cloud Client Libraries for the Cloud Bigtable API


Cloud Bigtable: schema design example

• Dataset about New York City buses https://bit.ly/2JDHspc – More than 300 bus routes and 5,800 vehicles following those routes – Timestamp, origin, destination, vehicle id, vehicle latitude and longitude, expected and scheduled arrival times • Keep in mind – Each table has only one index (the row key), no secondary indexes – Rows are automatically sorted lexicographically by row key – Cloud Bigtable allows for queries using point lookups by row key or row-range scans • Try to avoid slow operations (i.e., multiple row lookups or full table scans) – Keep all information for an entity in a single row

Cloud Bigtable: schema design example

• Queries about NYC buses – Get the locations of a specific vehicle over an hour – Get the locations of an entire bus line over an hour – Get the locations of all buses in Manhattan in an hour – Get the most recent locations of all buses in Manhattan in an hour – Get the locations of an entire bus line over the month – Get the locations of an entire bus line with a certain destination over an hour • Focus on location, many different location grains, two time grains

For the code in Java see https://codelabs.developers.google.com/codelabs/cloud-bigtable-intro-java/


Cloud Bigtable: schema design example

• Row key design is central for Bigtable performance • How to design the row key? – Consider how you will use the stored data – Keep the row key reasonably short – Use human-readable values instead of hashing – Include multiple identifiers in the row key

• Common mistake: make time the first value in the row key – Can cause hot spots and result in poor performance: most writes would be pushed onto a single tablet server

Cloud Bigtable: schema design example

• How to design the row key for the NYC buses? [Bus company/Bus line/Timestamp rounded down to the hour/Vehicle ID]
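A sketch of building that row key in Python (the company, line, and vehicle id values are invented examples):

```python
from datetime import datetime, timezone

# Sketch of the row key suggested above:
# [Bus company/Bus line/Timestamp rounded down to the hour/Vehicle ID]
def row_key(company, line, ts, vehicle_id):
    hour = ts.replace(minute=0, second=0, microsecond=0)
    return f"{company}/{line}/{hour.strftime('%Y%m%d%H')}/{vehicle_id}"

ts = datetime(2019, 5, 6, 14, 37, tzinfo=timezone.utc)
key = row_key("MTA", "M86-SBS", ts, "NYCT_5824")
print(key)  # -> MTA/M86-SBS/2019050614/NYCT_5824
```

Because rows are sorted lexicographically, all readings for one line within one hour share a key prefix, so the hour-grain queries above become row-range (prefix) scans instead of full table scans.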


Case study: Cassandra • Initially developed at Facebook • A mixture of Amazon’s Dynamo and Google’s BigTable

From Dynamo: P2P architecture (replication & partitioning), gossip-based discovery and error detection. From BigTable: sparse column-oriented data model, storage architecture (SSTables, memtable, …).

Cassandra • Some large production deployments: – Apple: over 75,000 nodes, over 10 PB of data – Netflix: 2,500 nodes, 420 TB, over 1 trillion requests per day

Cassandra: Features • High availability and incremental scalability • Robust support for systems spanning multiple data centers – Asynchronous master-less replication allowing low latency operations • Data model: structured key-value store where columns are added only to specified rows – Distributed multi-dimensional map indexed by a row key – As in Bigtable: different rows can have a different number of columns and columns are grouped into column families – Emphasizes denormalization instead of normalization and joins – See example at http://bit.ly/2n1Vfua • Write-oriented system – On the contrary, Bigtable is designed for intensive read workloads


Cassandra: Consistency

• Managed in a decentralized fashion – No master node to coordinate reads and writes • Through a quorum-based protocol – If R + W > N (and W > N/2 to avoid conflicting writes) you have strong consistency – Some available consistency levels (see http://bit.ly/2n26EdE) • ONE: only a single replica must respond • QUORUM: a majority of the replicas must respond • ALL: all of the replicas must respond • LOCAL_QUORUM: a majority of the replicas in the local datacenter must respond – Tunable tradeoffs between consistency and latency • Per-query tunable consistency

SELECT points FROM fantasyfootball.playerpoints
USING CONSISTENCY QUORUM
WHERE playername = 'Tom Brady';
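The condition R + W > N guarantees that every read quorum overlaps every write quorum, so a read always contacts at least one replica holding the latest committed write. This can be checked exhaustively for small N (a brute-force illustration, not Cassandra's implementation):

```python
from itertools import combinations

# Brute-force check of the quorum rule: with N replicas, every read
# quorum of size R intersects every write quorum of size W iff R + W > N.
def quorums_always_overlap(n, r, w):
    replicas = range(n)
    return all(set(rq) & set(wq)
               for rq in combinations(replicas, r)
               for wq in combinations(replicas, w))

print(quorums_always_overlap(3, 2, 2))  # QUORUM/QUORUM with N=3 -> True
print(quorums_always_overlap(3, 1, 1))  # ONE/ONE with N=3 -> False
```

This is exactly why ONE reads paired with ONE writes are fast but only eventually consistent: a read quorum can miss the replica that took the latest write.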

Cassandra Query Language (CQL)

• An SQL-like language

CREATE COLUMNFAMILY Customer (
  KEY varchar PRIMARY KEY,
  name varchar,
  city varchar,
  web varchar);

INSERT INTO Customer (KEY, name, city, web)
VALUES ('mfowler',
  'Martin Fowler',
  'Boston',
  'www.martinfowler.com');

SELECT * FROM Customer;
SELECT name, web FROM Customer;
SELECT name, web FROM Customer WHERE city='Boston';


Case study: Neo4j

• Fully ACID-compliant and schema-free • Properties of nodes and relationships captured in the form of multiple attributes (key-value pairs) – Relationships are unidirectional • Nodes tagged with labels – Labels used to represent different roles in the modeled domain

• Neo4j architecture – Data replication via standard master-worker architecture • Single read/write master and multiple read-only workers • No data partitioning on multiple servers (constraint on data size) – Improve data throughput via a multi-level caching scheme

Neo4j: Twitter graph example


Neo4j: Cypher • Cypher: declarative SQL-like query language – See http://bit.ly/2o0kytc – Designed to be human-readable – Nodes are between ( ) – Edges are an arrow -> between two nodes

• Some examples (from movie graph, see slide 46) • Create nodes with labels and properties CREATE (TheMatrixReloaded:Movie {title:'The Matrix Reloaded', released:2003, tagline:'Free your mind'}) CREATE (Keanu:Person {name:'Keanu Reeves', born:1964}) • Create edges with properties CREATE (Keanu)-[:ACTED_IN {roles:['Neo']}]-> (TheMatrixReloaded)

Neo4j: Cypher

• Use MATCH to search for a specified pattern (corresponds to SELECT in SQL) Search for the movies where Tom Hanks acted and the directors of those movies, limiting the resulting rows to 10

MATCH (a:Person {name:'Tom Hanks'})-[:ACTED_IN]->(m)<-[:DIRECTED]-(d)
RETURN a, m, d
LIMIT 10;

– Query result in graphical view on next slide



Neo4j: Cypher

• Use MATCH to search for variable-length relationships Search for all movies related to Helen Hunt by 1 to 3 hops

MATCH (HelenH { name: 'Helen Hunt' })-[:ACTED_IN*1..3]-(movie:Movie)
RETURN movie.title

– Nodes that are a variable number of relationship->node hops away can be found using the syntax: -[:TYPE*minHops..maxHops]->

• See https://neo4j.com/sandbox for the full case study on the movie graph


Neo4j: Cypher

• Native support for shortest path queries • Example: find the shortest path between two airports (SFO and MSO)

– shortestPath returns a single shortest path between two nodes • In the example, the path length is between 0 and 2 – allShortestPaths returns all the shortest paths between two nodes

Neo4j: Cypher

• Shortest path queries (from Twitter graph)

Find all the shortest paths between two Twitter users linked by Follows relationships, with path length between 0 and 3

Find all the shortest paths between two Twitter users, with path length between 0 and 5 • More queries on the Twitter graph at https://neo4j.com/sandbox – You can map your Twitter network and analyze it with Neo4j – Plus other use cases


How to select the right data store

• Bad news: no single type fits all use cases • Main factors to consider: – Data model and access pattern • Access pattern: distribution of reads/writes and random vs. sequential access needs • Take into account also data-related features: volume, complexity, schema flexibility, and durability – Query requirements – Non-functional properties • Performance, (auto-)scalability, consistency, partitioning, replication, load balancing, concurrency mechanisms, CAP tradeoffs, security, license model, support, … • We now focus on performance

Performance comparison of NoSQL data stores

• Bad news: many studies, no clear winner • Unsolved issue: no standard benchmark – Some studies use YCSB – Yahoo Cloud Serving Benchmark (YCSB): open-source workload generation tool https://github.com/brianfrankcooper/YCSB • Focus on performance evaluation – Throughput and latency as metrics • Be careful – Consider the workload scenario: NoSQL data stores mostly divided into read and write optimized – Parameters tuning for performance optimization – In some studies bias in data store setting (e.g., custom hardware and software settings) • See https://bit.ly/2HGuuVV


Performance comparison @VLDB’12 “Solving Big Data Challenges for Enterprise Application Performance Management”, VLDB 2012 http://bit.ly/1vYYihO • Workload R: 95% reads and only 5% writes

[Figures: throughput, read latency and write latency]

• Cassandra: linear scalability in terms of throughput but slow at reading and writing • HBase: suffers in scalability, slow at reading but fast at writing • Redis: being in-memory, the fastest at reading

Performance comparison @VLDB’12

• Workload WR: 50% reads and 50% writes

[Figures: throughput and read latency]

Study conclusions (caution: old versions) - Linear scalability for Cassandra and HBase - Cassandra dominated in throughput, but with high latency - HBase achieved the least throughput, but low write latency at the cost of high read latency


Performance comparison of NoSQL data stores

• Performance is an important factor in choosing the right NoSQL data store solution

• Many performance studies, both academic and industrial – No single “winner takes all” among NoSQL data stores – Results depend on use case, workload, and deployment conditions

Which data model/store to use?

• Different data models and data stores are designed to solve different problems • Using a single data store engine for all of the requirements… – storing transactional data – caching session information – traversing a graph of customers – performing OLAP operations • … usually leads to poorly performing solutions • Also different needs for availability, consistency, backup requirements, …


Multi-model data management

• How to manage the Variety dimension of the Big Data 3V model? 1. Polyglot persistence 2. Multi-model databases

Polyglot persistence • Rather than a single data store, use multiple data storage technologies: polyglot persistence – Choose them based upon the way data are used by applications or components of a single application – Different kinds of data are best handled by different data stores: pick the right tool for the right use case

• See http://bit.ly/2plRJYq


Multi-model databases • Alternative to polyglot persistence: multi-model databases – Second-generation of NoSQL products that support multiple data models • How to realize? – Tightly-integrated polystores – Single-store multi-model database

Multi-model databases

• Pros with respect to polyglot persistence – Decrease in operational complexity and cost – No need to maintain data consistency across separate data stores

• Cons with respect to polyglot persistence – Performance not optimized for specific data model – Increased risk of vendor lock-in


References

• Sadalage and Fowler, “NoSQL Distilled”, Addison-Wesley, 2012. • Grolinger et al., “Data management in cloud environments: NoSQL and NewSQL data stores”, J. Cloud Comp., 2013. http://bit.ly/2oRKA5R • DeCandia et al., “Dynamo: Amazon’s highly available key-value store”, ACM SOSP 2007. http://bit.ly/2pmXzsr • Chang et al., “Bigtable: a distributed storage system for structured data”, OSDI 2006. http://bit.ly/2nywNRg • Lakshman and Malik, “Cassandra - a decentralized structured storage system”, LADIS 2009. http://bit.ly/2nyGSxE
