Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Unit 3: NoSQL, Json, Semistructured Data (3 lectures*) *Slides may change: refresh each lecture Introduction to Data Management CSE 344 Lecture 11: NoSQL CSE 344 - 2017au 2 Announcmenets • HW3 (Azure) due tonight (Friday) • WQ4 (Relational algebra) due Tuesday • WQ5 (Datalog) due next Friday • HW4 (Datalog/Logicblox/Cloud9) is posted CSE 344 - 2017au 3 Class Overview • Unit 1: Intro • Unit 2: Relational Data Models and Query Languages • Unit 3: Non-relational data – NoSQL – Json – SQL++ • Unit 4: RDMBS internals and query optimization • Unit 5: Parallel query processing • Unit 6: DBMS usability, conceptual design • Unit 7: Transactions • Unit 8: Advanced topics (time permitting) 4 Two Classes of Database Applications • OLTP (Online Transaction Processing) – Queries are simple lookups: 0 or 1 join E.g., find customer by ID and their orders – Many updates. E.g., insert order, update payment – Consistency is critical: transactions (more later) • OLAP (Online Analytical Processing) – aka “Decision Support” – Queries have many joins, and group-by’s E.g., sum revenues by store, product, clerk, date – No updates CSE 344 - 2017au 5 NoSQL Motivation • Originally motivated by Web 2.0 applications – E.g. Facebook, Amazon, Instagram, etc – Web startups need to scaleup from 10 to 100000 users very quickly • Needed: very large scale OLTP workloads • Give up on consistency • Give up OLAP 6 What is the Problem? • Single server DBMS are too small for Web data • Solution: scale out to multiple servers • This is hard for the entire functionality of DMBS • NoSQL: reduce functionality for easier scale up – Simpler data model – Very restricted updates DesktopRDBMS Review: Serverless User SQLite: • One data file • One user DBMS • One DBMS application Application • Consistency is easy (SQLite) • But only a limited number of scenarios work with such model File Data file Disk CSE 344 - 2017au 8 RDBMS Review: Client-Server Client Server Machine Applications File 1 Connection (JDBC, ODBC) File 2 File 3 DB Server • One server running the database • Many clients, connecting via the ODBC or JDBC (Java Database Connectivity) protocol 9 RDBMS Review: Client-Server Many users and apps Consistency is harder à Client Server Machine transactions Applications File 1 Connection (JDBC, ODBC) File 2 File 3 DB Server • One server running the database • Many clients, connecting via the ODBC or JDBC (Java Database Connectivity) protocol 10 Client-Server • One server that runs the DBMS (or RDBMS): – Your own desktop, or – Some beefy system, or – A cloud service (SQL Azure) CSE 344 - 2017au 11 Client-Server • One server that runs the DBMS (or RDBMS): – Your own desktop, or – Some beefy system, or – A cloud service (SQL Azure) • Many clients run apps and connect to DBMS – Microsoft’s Management Studio (for SQL Server), or – psql (for postgres) – Some Java program (HW8) or some C++ program CSE 344 - 2017au 12 Client-Server • One server that runs the DBMS (or RDBMS): – Your own desktop, or – Some beefy system, or – A cloud service (SQL Azure) • Many clients run apps and connect to DBMS – Microsoft’s Management Studio (for SQL Server), or – psql (for postgres) – Some Java program (HW8) or some C++ program • Clients “talk” to server using JDBC/ODBC protocol CSE 344 - 2017au 13 Web Apps: 3 Tier Browser File 1 File 2 File 3 DB Server CSE 344 - 2017au 14 Web Apps: 3 Tier Browser File 1 Connection File 2 (e.g., JDBC) HTTP/SSL File 3 DB Server App+Web Server CSE 344 - 2017au 15 Web Apps: 3 Tier Web-based applications Browser File 1 Connection File 2 (e.g., JDBC) HTTP/SSL File 3 DB Server App+Web Server CSE 344 - 2017au 16 Web Apps: 3 Tier Web-based applications File 1 App+Web Server Connection File 2 (e.g., JDBC) HTTP/SSL File 3 App+Web Server DB Server CSE 344 - 2017au App+Web Server 17 Replicate WebApp server Apps: 3 Tier for scaleup Web-based applications File 1 App+Web Server Connection File 2 (e.g., JDBC) HTTP/SSL File 3 App+Web Server DB Server Why not replicate DB server? App+Web Server 18 Replicate WebApp server Apps: 3 Tier for scaleup Web-based applications File 1 App+Web Server Connection File 2 (e.g., JDBC) HTTP/SSL File 3 App+Web Server DB Server Why not replicate DB server? Consistency! App+Web Server 19 Replicating the Database • Two basic approaches: – Scale up through partitioning – Scale up through replication • Consistency is much harder to enforce CSE 344 - 2017au 20 Scale Through Partitioning • Partition the database across many machines in a cluster – Database now fits in main memory – Queries spread across these machines • Can increase throughput • Easy for writes but reads become expensive! Application updates here May also update here Three partitions CSE 344 - 2017au 21 Scale Through Replication • Create multiple copies of each database partition • Spread queries across these replicas • Can increase throughput and lower latency • Can also improve fault-tolerance • Easy for reads but writes become expensive! App 1 App 2 updates updates here only Three replicas here only CSE 344 - 2017au 22 Relational Model à NoSQL • Relational DB: difficult to replicate/partition • Given Supplier(sno,…),Part(pno, …),Supply(sno,pno) – Partition: we may be forced to join across servers – Replication: local copy has inconsistent versions – Consistency is hard in both cases (why?) • NoSQL: simplified data model – Given up on functionality – Application must now handle joins and consistency 23 Data Models Taxonomy based on data models: ☞ • Key-value stores – e.g., Project Voldemort, Memcached • Document stores – e.g., SimpleDB, CouchDB, MongoDB • Extensible Record Stores – e.g., HBase, Cassandra, PNUTS CSE 344 - 2017au 24 Key-Value Stores Features • Data model: (key,value) pairs – Key = string/integer, unique for the entire data – Value = can be anything (very complex object) Key-Value Stores Features • Data model: (key,value) pairs – Key = string/integer, unique for the entire data – Value = can be anything (very complex object) • Operations – get(key), put(key,value) – Operations on value not supported Key-Value Stores Features • Data model: (key,value) pairs – Key = string/integer, unique for the entire data – Value = can be anything (very complex object) • Operations – get(key), put(key,value) – Operations on value not supported • Distribution / Partitioning – w/ hash function – No replication: key k is stored at server h(k) – 3-way replication: key k stored at h1(k),h2(k),h3(k) Key-Value Stores Features • Data model: (key,value) pairs – Key = string/integer, unique for the entire data – Value = can be anything (very complex object) • Operations – get(key), put(key,value) – Operations on value not supported • Distribution / Partitioning – w/ hash function – No replication: key k is stored at server h(k) – 3-way replication: key k stored at h1(k),h2(k),h3(k) How does get(k) work? How does put(k,v) work? Flights(fid, date, carrier, flight_num, origin, dest, ...) Carriers(cid, name) Example • How would you represent the Flights data as key, value pairs? How does query processing work? Flights(fid, date, carrier, flight_num, origin, dest, ...) Carriers(cid, name) Example • How would you represent the Flights data as key, value pairs? • Option 1: key=fid, value=entire flight record How does query processing work? Flights(fid, date, carrier, flight_num, origin, dest, ...) Carriers(cid, name) Example • How would you represent the Flights data as key, value pairs? • Option 1: key=fid, value=entire flight record • Option 2: key=date, value=all flights that day How does query processing work? Flights(fid, date, carrier, flight_num, origin, dest, ...) Carriers(cid, name) Example • How would you represent the Flights data as key, value pairs? • Option 1: key=fid, value=entire flight record • Option 2: key=date, value=all flights that day • Option 3: key=(origin,dest), value=all flights between How does query processing work? Key-Value Stores Internals • Partitioning: – Use a hash function h, and store every (key,value) pair on server h(key) – In class: discuss get(key), and put(key,value) • Replication: – Store each key on (say) three servers – On update, propagate change to the other servers; eventual consistency – Issue: when an app reads one replica, it may be stale • Usually: combine partitioning+replication Data Models Taxonomy based on data models: • Key-value stores – e.g., Project Voldemort, Memcached • Document stores ☞ – e.g., SimpleDB, CouchDB, MongoDB • Extensible Record Stores – e.g., HBase, Cassandra, PNUTS CSE 344 - 2017au 34 Motivation • In Key, Value stores, the Value is often a very complex object – Key = ‘2010/7/1’, Value = [all flights that date] • Better: allow DBMS to understand the value – Represent value as a JSON (or XML...) document – [all flights on that date] = a JSON file – May search for all flights on a given date 35 Document Stores Features • Data model: (key,document) pairs – Key = string/integer, unique for the entire data – Document = JSon, or XML • Operations – Get/put document by key – Query language over JSon • Distribution / Partitioning – Entire documents, as for key/value pairs We will discuss JSon Data Models Taxonomy based on data models: • Key-value stores – e.g., Project Voldemort, Memcached • Document stores – e.g., SimpleDB, CouchDB, MongoDB ☞ • Extensible Record Stores – e.g., HBase, Cassandra, PNUTS CSE 344 - 2017au 37 Extensible Record Stores • Based on Google’s BigTable • Data model is rows and columns • Scalability by splitting rows and columns over nodes – Rows partitioned through sharding

Introduction to Data Management CSE 344

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support