LECTURE #3: DATABASES, DATABASES, DATABASES

E4520: Data Science for Mechanical Systems

Instructor: Josh Browne, PhD
Guest Lecturer: Gilman Callsen
Feb 5, 2020

Gilman Callsen

• “Entrepreneur with a penchant for technology companies.”

• Yale, BA Psychology
  • Started as a physics major, though!

• Databases have been a big part of my entire career.
  • Not a database administrator.

Basic Timeline

Late 90s & Early 00s Websites

Chromic Décor (2006)

MC10 (2008)

Pit Rho (2012)

Rho AI (2016)

I learned the value of databases pretty quickly at this point...

PREP

Pair Up

We will do some thought experiments and actual coding throughout. Make sure:

• You’re next to at least one person you can talk to/brainstorm with
• You have a computer or are near a computer that got through pre-class prep

Let’s Get Those Laptops Out

    cd ~/Documents
    git clone https://github.com/gcallsen/database-class-examples.git
    git clone https://github.com/rhoai/python-dev.git
    cd ~/Documents/python-dev
    docker-compose up -d
    docker run -it --rm --net python-dev_default -v ~/Documents/database-class-examples:/code rhoai/python-dev:v0.1.0

OVERVIEW

Data, Data, Data

Who remembers what Erik Allen (Lecture #1) said about “Data Collection and Preparation”?

Data Collection and Preparation

• Be prepared to spend a lot of time (~80%) on data collection and cleaning
• If you’ve got a data set, be very very grateful!
• Expect to be a partner in the data generation process
• Have a full tool belt of models, to prepare for a paucity of data early in the process

Where Do the Data Live?

With that in mind...where do you think all those collected and prepared data live?

You’re Right! DATABASES!

Cooking Analogy

As we go through this lecture keep cooking in mind

• Grocery Store
• Blue Apron
• Family Cook

Question

What is a Database?

What is a Database?

• An organized collection of data

Data → Database (?) → Do Stuff

What is a Database?

• An organized collection of data

[Diagram: the same picture in cooking terms, with the labels Grocery Store, Blue Apron, Chef, Collection of Ingredients, and The Food]

Where are Databases Used?

1. Name an industry/profession/etc.
2. Brainstorm some “data” they might have

Where are Databases Used?

• Everywhere!
• Databases are the backbone of nearly every digital ‘thing’ we interact with today.
• Examples
  • Software engineer
  • Mechanical engineer
  • Business person
  • Marketing
  • Generic consumer
  • Even Schools...

https://xkcd.com/327/

Data from Mechanical Systems?

1. What sorts of data might mechanical systems produce?

DBs for DS in Mechanical Systems?

• Types of data and application needs will vary wildly
  • Super fast (real-time mechanical systems)
  • Highly accurate
  • Complex connections
  • Research
  • Development
• It is unlikely anyone here will be a DBA, but an understanding of how to think about databases goes a long way
• Rubicon Global (waste management)
  • Automated pickup detection - on board vs cloud
  • Video processing
  • Accelerometer data
  • GPS location + time

Takeaways

1. Nearly every industry in the world produces data
2. Those data need to be stored somewhere to be useful

Types of Databases

• No silver bullet
• An incredibly wide array of databases exists, each with strengths and weaknesses
• Every situation requires considering which mix of DBs to use.
  • Polyglot persistence
• Grocery Store vs Blue Apron vs Cook

Relational vs Non-Relational

Why are there only two ‘types’ here when our analogy has 3 ‘types’?

Relational Overview

• A relational database is a digital database based on the relational model of data, as proposed by E. F. Codd in 1970.
• Virtually all relational database systems use SQL (Structured Query Language) for querying and maintaining the database.

Straight from Wikipedia: https://en.wikipedia.org/wiki/Relational_database

Non-Relational Overview

• A NoSQL (originally referring to "non SQL" or "non relational") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.
• NoSQL databases are increasingly used in big data and real-time web applications.
• NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages, or sit alongside SQL databases in a polyglot persistence architecture.

Straight from Wikipedia: https://en.wikipedia.org/wiki/NoSQL

https://www.alooma.com/blog/types-of-modern-databases

In Practice

Amazon.com

How do you think they store your data? What data do they store, and in what type(s) of database?

https://neo4j.com/blog/neo4j-doc-manager-polyglot-persistence-mongodb/

In Practice

• In most cases, more than one database is used!

Polyglot Persistence

https://neo4j.com/blog/neo4j-doc-manager-polyglot-persistence-mongodb/

RELATIONAL

A man walks into a bar and sees two tables.

He says, “May I join you?”

https://www.alooma.com/blog/types-of-modern-databases

Summary Version

• A database is a means of storing information in such a way that information can be retrieved from it.
• A relational database is one that presents information in tables with rows and columns.
• A table is a collection of objects of the same type (rows).
• Data in a table can be related to data in other tables (typically using ‘keys’).
• The ability to retrieve these related data is what gives us the term ‘relational database’.

Relational Databases

• Core concepts for today
  • Tables
  • Normalization
  • SQL query language

Question

How would you describe the contents of a grocery store in an Excel spreadsheet? What “columns” would it have?

Tables

• Grocery store

Products (schema)

column        type
id            int
name          string
price         float
description   string
department    string

Products (rows)

id   name      price   description           department
1    Apple     1.99    Delicious Apple       Produce
2    Banana    3.49    Bunch of Bananas      Produce
3    Bread     3.99    Loaf of whole wheat   Bakery
4    Cheddar   2.79    Sliced cheddar        Deli
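The single Products table above can be sketched in a few lines of Python. This is a minimal illustration, not the class's `ex_postgres.py` script: it assumes Python's built-in `sqlite3` module as a stand-in for the Postgres container so that it runs anywhere, and it maps the slide's int/string/float types onto SQLite's INTEGER/TEXT/REAL.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the class Postgres instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE products (
        id          INTEGER PRIMARY KEY,
        name        TEXT,
        price       REAL,
        description TEXT,
        department  TEXT
    )
    """
)

# The four rows shown on the slide.
rows = [
    (1, "Apple", 1.99, "Delicious Apple", "Produce"),
    (2, "Banana", 3.49, "Bunch of Bananas", "Produce"),
    (3, "Bread", 3.99, "Loaf of whole wheat", "Bakery"),
    (4, "Cheddar", 2.79, "Sliced cheddar", "Deli"),
]
conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?, ?)", rows)

# "Produce" is typed out once per product: this repetition is the kind of
# redundancy that normalization is meant to remove.
produce = conn.execute(
    "SELECT name FROM products WHERE department = ? ORDER BY id", ("Produce",)
).fetchall()
print(produce)
```

Note that the department is stored as free text in every row, so a typo such as "Prodcue" would silently create a new department.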

Normalization

• Normalization is a technique of organizing the data in the database.
• Basically, you break apart tables to:
  • eliminate data redundancy and
  • reduce malformed data when performing CRUD* operations
• Importantly, this allows the database itself to enforce data integrity.
• Once you’ve done this, you need the concept of ‘joins’.
• To perform a join you need two items:
  • two tables and a join condition
  • the tables contain the rows to be combined, and the join condition contains the instructions for matching rows together

*CRUD: Create, Read, Update, Delete

Things can get crazy...

Question

What column(s) from our previous table are good candidate(s) for normalization?

Products

id name price description department

Basic Normalization

• Grocery store

Departments
  id            int
  name          string
  description   string

Products
  id              int
  department_id   int
  name            string
  price           float
  description     string

Basic Normalization

• Grocery store

Departments

id   name      description
1    Produce   Healthy stuff!
2    Bakery    Breads and goodies
3    Deli      Meats, cheeses, etc

Products

id   name      price   description           department_id
1    Apple     1.99    Delicious Apple       1
2    Banana    3.49    Bunch of Bananas      1
3    Bread     3.99    Loaf of whole wheat   2
4    Cheddar   2.79    Sliced cheddar        3
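To make the normalized layout concrete, here is a minimal sketch that builds both tables, joins them, and shows the database itself refusing bad data. It assumes Python's built-in `sqlite3` as a stand-in for Postgres; the `REFERENCES` constraint and the "Mystery" product are illustrative additions, not part of the class scripts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE departments (
        id          INTEGER PRIMARY KEY,
        name        TEXT,
        description TEXT
    );
    CREATE TABLE products (
        id            INTEGER PRIMARY KEY,
        department_id INTEGER REFERENCES departments(id),
        name          TEXT,
        price         REAL,
        description   TEXT
    );
    INSERT INTO departments VALUES
        (1, 'Produce', 'Healthy stuff!'),
        (2, 'Bakery', 'Breads and goodies'),
        (3, 'Deli', 'Meats, cheeses, etc');
    INSERT INTO products VALUES
        (1, 1, 'Apple', 1.99, 'Delicious Apple'),
        (2, 1, 'Banana', 3.49, 'Bunch of Bananas'),
        (3, 2, 'Bread', 3.99, 'Loaf of whole wheat'),
        (4, 3, 'Cheddar', 2.79, 'Sliced cheddar');
    """
)

# Two tables plus a join condition: match each product to its department row.
joined = conn.execute(
    """
    SELECT p.name, d.name
    FROM products p
    JOIN departments d ON d.id = p.department_id
    ORDER BY p.id
    """
).fetchall()
print(joined)

# Hypothetical bad row: department 99 does not exist, so the insert is rejected.
try:
    conn.execute(
        "INSERT INTO products VALUES (5, 99, 'Mystery', 0.99, 'No such department')"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

The department name now lives in exactly one row, so renaming "Produce" is a single UPDATE rather than one per product.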

SQL

• The fundamentals of most SQL languages are the same
• Variations exist based on the database’s functionality
• https://www.w3schools.com/sql/sql_intro.asp
  • Worth going through that as a primer

SQL Common Commands

• SELECT - extracts data from a database
• UPDATE - updates data in a database
• DELETE - deletes data from a database
• INSERT INTO - inserts new data into a database
• CREATE DATABASE - creates a new database
• ALTER DATABASE - modifies a database
• CREATE TABLE - creates a new table
• ALTER TABLE - modifies a table
• DROP TABLE - deletes a table
• CREATE INDEX - creates an index (search key)
• DROP INDEX - deletes an index
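Most of the commands above can be tried in one short session. The sketch below assumes Python's built-in `sqlite3` rather than the class Postgres container; SQLite has no CREATE DATABASE or ALTER DATABASE (a database is simply a file or an in-memory connection), so those two are omitted. The index name `idx_products_name` is made up for the example.

```python
import sqlite3

# Opening a connection plays the role of CREATE DATABASE in SQLite.
conn = sqlite3.connect(":memory:")

# CREATE TABLE and CREATE INDEX
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.execute("CREATE INDEX idx_products_name ON products (name)")

# INSERT INTO
conn.execute("INSERT INTO products (id, name, price) VALUES (1, 'Apple', 1.99)")
conn.execute("INSERT INTO products (id, name, price) VALUES (2, 'Banana', 3.49)")

# UPDATE
conn.execute("UPDATE products SET price = 2.49 WHERE name = 'Apple'")

# DELETE
conn.execute("DELETE FROM products WHERE name = 'Banana'")

# ALTER TABLE
conn.execute("ALTER TABLE products ADD COLUMN description TEXT")

# SELECT
remaining = conn.execute(
    "SELECT id, name, price, description FROM products"
).fetchall()
print(remaining)

# DROP INDEX and DROP TABLE
conn.execute("DROP INDEX idx_products_name")
conn.execute("DROP TABLE products")
tables_left = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(tables_left)
```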

Let’s Go to the Grocery Store

• Start databases.
  • Go to the `python-dev` folder
  • docker-compose up -d
  • docker ps -a
• Get inside your python-dev docker container
  • docker run -it --rm --net python-dev_default -v ~/Documents/database-class-examples:/code rhoai/python-dev:v0.1.0
• Connect to postgres (password may be “postgres”)
  • psql -h postgres -U postgres -d postgres_db
  • \d → nothing → \q
• Seed tables
  • python ./code/ex_postgres.py

SQL On Our Tables

Command                          Outcome
\d+                              Describe the tables/relations
SELECT * FROM products;          Get all the Products
SELECT * FROM departments;       Get all the Departments
SELECT p.name, d.name, price     Get all of the products; display their name, price,
FROM products p                  and the name of the department they are in.
FULL OUTER JOIN departments d
ON d.id = p.department_id;
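A FULL OUTER JOIN keeps unmatched rows from both sides, which only becomes visible when some rows have no partner. The sketch below is an illustration under several assumptions: it uses Python's built-in `sqlite3` instead of Postgres, it emulates the full outer join with a LEFT JOIN plus a UNION ALL (older SQLite builds lack FULL OUTER JOIN), and the "Frozen" department and "Gift Card" product are hypothetical rows added only to produce unmatched cases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE products (
        id INTEGER PRIMARY KEY, department_id INTEGER, name TEXT, price REAL
    );
    -- 'Frozen' has no products; 'Gift Card' has no department (both hypothetical).
    INSERT INTO departments VALUES (1, 'Produce'), (2, 'Bakery'), (4, 'Frozen');
    INSERT INTO products VALUES
        (1, 1, 'Apple', 1.99),
        (3, 2, 'Bread', 3.99),
        (5, NULL, 'Gift Card', 25.00);
    """
)

# Part 1: every product, with its department if one matches (LEFT JOIN).
# Part 2: every department that no product points at.
full_outer = conn.execute(
    """
    SELECT p.name, d.name, p.price
    FROM products p
    LEFT JOIN departments d ON d.id = p.department_id
    UNION ALL
    SELECT NULL, d.name, NULL
    FROM departments d
    WHERE NOT EXISTS (SELECT 1 FROM products p WHERE p.department_id = d.id)
    """
).fetchall()

for row in full_outer:
    print(row)
```

With the class seed data every product has a department and every department has a product, so the full outer join there returns the same rows as a plain inner join.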

BREAK

A programmer’s wife sends him to the grocery store with the instructions, “get a loaf of bread and, if they have eggs, get a dozen.”

He comes home with a dozen loaves of bread and tells her, “they had eggs.”

NON-RELATIONAL

Database Admins walked into a NoSQL bar.

…a little while later they walked out because they couldn’t find a table.

https://www.alooma.com/blog/types-of-modern-databases

Summary Version

• A database is a means of storing information in such a way that information can be retrieved from it.
• A non-relational database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.
• There are a lot of options. Why?
  • Data do not always fit nicely into columns and rows
  • Wide array of use cases that can have highly optimized solutions (e.g. time-series data)
  • Scalability (horizontal scalability)
  • Often, CAP Theorem is at play

What Doesn’t Fit in a Table?

Think of some examples of data that would be difficult to represent in a relational database (or an Excel spreadsheet).

Non-Relational Databases

• Core concepts for today
  • CAP Theorem
  • Unnormalized Form
  • “Query languages”

CAP Theorem

• High Level:
  • Consistency
    • Every read receives the most recent write or an error
  • Availability
    • Every request receives a (non-error) response, without the guarantee that it contains the most recent write
  • Partition tolerance
    • The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
• No distributed system is safe from network failures
• The CAP theorem says that when network failures happen, you must choose between consistency and availability
• Every use case optimizes for this differently
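The consistency-versus-availability choice can be illustrated with a toy two-node system. This is a deliberately simplified sketch, not how any real database implements replication: the `TinyCluster` class, the "CP"/"AP" mode labels, and the rule that a CP cluster refuses all requests during a partition are assumptions made for the example.

```python
class Replica:
    """One node holding a single value."""

    def __init__(self):
        self.value = None


class TinyCluster:
    """Two replicas; mode is 'CP' (prefer consistency) or 'AP' (prefer availability)."""

    def __init__(self, mode):
        self.mode = mode
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False  # True when A and B cannot talk to each other

    def write(self, value):
        """Writes arrive at replica A and are copied to B when the network allows."""
        if self.partitioned and self.mode == "CP":
            # The write cannot reach B, so refuse it rather than let replicas diverge.
            raise RuntimeError("unavailable during partition")
        self.a.value = value
        if not self.partitioned:
            self.b.value = value

    def read_from_b(self):
        """Reads arrive at replica B."""
        if self.partitioned and self.mode == "CP":
            # B cannot confirm it has the latest write, so it returns an error.
            raise RuntimeError("unavailable during partition")
        return self.b.value  # in AP mode this may be stale


def run(mode):
    """Write v1, cut the network, attempt to write v2, then read from B."""
    cluster = TinyCluster(mode)
    cluster.write("v1")
    cluster.partitioned = True
    try:
        cluster.write("v2")
        return cluster.read_from_b()
    except RuntimeError as err:
        return f"error: {err}"


cp_result = run("CP")
ap_result = run("AP")
print(cp_result)
print(ap_result)
```

In CP mode the caller gets an error instead of a possibly wrong answer; in AP mode the caller always gets an answer, here the stale "v1" even though "v2" was accepted by replica A.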

Unnormalized Form

• Group all those tables back together!
• Unnormalized form is frequently (though certainly not always) used in non-relational databases
• An unnormalized data model has redundancy
  • Fields are grouped together
  • Multiple values and complex structures inside single fields
  • etc.
• Why?
  • Frequently to improve the read performance of a database
    • at the expense of write performance and storage
  • Not all data fit nicely into tables
• NOTE: this does not mean there is ‘no data model’
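The read-versus-write trade-off can be seen with plain Python dictionaries standing in for documents in a document store. This is a sketch under that assumption; the `rename_department` helper and the new name "Fruit & Veg" are hypothetical.

```python
# Unnormalized: every product document embeds a full copy of its department.
products = {
    1: {"name": "Apple", "price": 1.99,
        "department": {"name": "Produce", "description": "Healthy stuff!"}},
    2: {"name": "Banana", "price": 3.49,
        "department": {"name": "Produce", "description": "Healthy stuff!"}},
    3: {"name": "Bread", "price": 3.99,
        "department": {"name": "Bakery", "description": "Breads and goodies"}},
}

# Read: a single lookup returns the product and its department, no join needed.
apple_department = products[1]["department"]["name"]
print(apple_department)


def rename_department(docs, old_name, new_name):
    """Update every embedded copy of a department; return how many documents changed."""
    touched = 0
    for doc in docs.values():
        if doc["department"]["name"] == old_name:
            doc["department"]["name"] = new_name
            touched += 1
    return touched


# Write: one logical change must be applied to every redundant copy.
touched = rename_department(products, "Produce", "Fruit & Veg")
print(touched)
```

There is still a data model here (every document has the same nested shape); it is simply enforced by convention in the application rather than by the database.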

Query Languages

• All over the map
• The majority of them are specific to the database
• Many attempt to have “SQL-like” query languages; you’ll often see “mydbQL”
• To give a sense of variety, some examples of “Non-Relational” database types:
  • Document Store (MongoDB, Riak, Couchbase, RethinkDB)
  • Key/Value (Redis, Aerospike, LevelDB)
  • Wide-Column (more like nested k/v: HBase, Cassandra)
  • Graph (ArangoDB, Neo4j, Titan)
  • Search (storing and searching text: Elasticsearch, Solr, Lucene)
  • Time-Series (InfluxDB, TimescaleDB)
  • Vector Similarity Search (e.g. Faiss, PySparNN)

Let’s Pick Up Our Blue Apron

• Back in your terminal
  • Exit postgres if you haven’t already (\q)
• Connect to Redis
  • redis-cli -h redis
• Any keys?
  • keys *
• Seed values
  • python ./code/ex_redis.py
• Connect back into Redis

Redis

• What’s my ham sandwich recipe?

Keys

key                value
sandwich_ham       ingredients: [‘wheat bread’, ‘cheddar’, ‘ham’]
sandwich_turkey    ingredients: [‘wheat bread’, ‘cheddar’, ‘turkey’]
cheese_platter_1   ingredients: [‘cheddar’, ‘mozzarella’, ‘olives’]
cheese_platter_2   ingredients: [‘pepper jack’, ‘mozzarella’, ‘olives’]

Redis Search?

• Against the keys, yes!
  • Let’s try to find all sandwich recipes
• Against the values...have fun.
  • Try to find all recipes that use cheddar cheese
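Why key search is easy and value search is painful can be sketched without a Redis server. The example below uses a plain Python dict as a stand-in for the key/value store and assumes the values are stored as JSON strings; how `ex_redis.py` actually encodes them is not shown in these slides. The `keys` helper mimics glob-style key matching in the spirit of the Redis `KEYS` command.

```python
import fnmatch
import json

# Assumption: a dict stands in for Redis, and values are opaque JSON strings.
store = {
    "sandwich_ham": json.dumps({"ingredients": ["wheat bread", "cheddar", "ham"]}),
    "sandwich_turkey": json.dumps({"ingredients": ["wheat bread", "cheddar", "turkey"]}),
    "cheese_platter_1": json.dumps({"ingredients": ["cheddar", "mozzarella", "olives"]}),
    "cheese_platter_2": json.dumps({"ingredients": ["pepper jack", "mozzarella", "olives"]}),
}


def keys(pattern):
    """Glob-style match against key names only; the values are never inspected."""
    return sorted(k for k in store if fnmatch.fnmatchcase(k, pattern))


def recipes_using(ingredient):
    """No index on values: fetch and decode every value, then filter in the client."""
    hits = []
    for key in store:
        recipe = json.loads(store[key])
        if ingredient in recipe["ingredients"]:
            hits.append(key)
    return sorted(hits)


sandwiches = keys("sandwich_*")
cheddar_recipes = recipes_using("cheddar")
print(sandwiches)
print(cheddar_recipes)
```

The key search works only because the recipe type was encoded into the key name up front; the cheddar search has to read the whole store, which is fine for four recipes and painful for four million.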

Let’s Ask our Cook!

• Back in your terminal
  • Exit redis if you haven’t already (exit)
• Let’s check out Elasticsearch using Kibana
  • localhost:5601
  • Dev Tools
  • Match all
  • Any Documents?
• Seed values
  • python ./code/ex_elasticsearch.py
• Go back to Kibana

Elasticsearch

• What can I make if I have extra cheddar cheese?

Documents

_id                value
sandwich_ham       ingredients: [‘wheat bread’, ‘cheddar’, ‘ham’]
sandwich_turkey    ingredients: [‘wheat bread’, ‘cheddar’, ‘turkey’]
cheese_platter_1   ingredients: [‘cheddar’, ‘mozzarella’, ‘olives’]
cheese_platter_2   ingredients: [‘pepper jack’, ‘mozzarella’, ‘olives’]

Note: specifying the _id is completely unnecessary; it is added here for consistency.

Elasticsearch...search?

• Let’s hope the name isn’t lying to us…
• Against the keys, yes!
  • Let’s try to find all sandwich recipes
• Against the values...actually have fun!
  • Try to find all recipes that use cheddar cheese
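Searching values is cheap in a search engine because the text is indexed ahead of time. Elasticsearch builds on Lucene's inverted index; the sketch below is a toy version of that idea in plain Python, not the Elasticsearch API. Splitting on whitespace and lowercasing is an assumed, much simpler stand-in for a real analyzer.

```python
from collections import defaultdict

documents = {
    "sandwich_ham": {"ingredients": ["wheat bread", "cheddar", "ham"]},
    "sandwich_turkey": {"ingredients": ["wheat bread", "cheddar", "turkey"]},
    "cheese_platter_1": {"ingredients": ["cheddar", "mozzarella", "olives"]},
    "cheese_platter_2": {"ingredients": ["pepper jack", "mozzarella", "olives"]},
}

# Build the inverted index once, at write time: token -> ids of documents containing it.
index = defaultdict(set)
for doc_id, doc in documents.items():
    for ingredient in doc["ingredients"]:
        for token in ingredient.lower().split():
            index[token].add(doc_id)


def search(term):
    """Look the term up directly; no document is scanned at query time."""
    return sorted(index.get(term.lower(), set()))


cheddar_hits = search("cheddar")
jack_hits = search("jack")
print(cheddar_hits)
print(jack_hits)
```

The cost moves from read time to write time and storage: every new document must be tokenized and added to the index before it becomes searchable.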

Group activity

• Get a head start on your homework!
  • Won’t be the same actual use case but same thought process
• Design your polyglot architecture for the “Amazon” scenario:
  • Define and create at least two tables in Postgres
  • Define and create at least one use for Redis
  • Define and create at least one mapping for Elasticsearch

Homework

• “Mechanical systems” database design
  • Choose any problem you have worked on
  • Describe the problem
  • Describe the data available
  • Describe how you need to use the data
    • Who’s accessing it?
    • Is this a big team, small team, etc.?
    • Are there applications accessing it programmatically?
    • Are you exploring the data to find new features?
  • Define your database design
    • What database(s) will you use?
    • Why? Justify each choice based on core concepts/takeaways.
    • Define how the data fit into each database you’ve chosen
    • Mock out actual tables, database structure, etc.
• Submit HW#2 to courseworks (posted)
• Format options:
  • 1) presentation-style (< 8 slides), or 2) report-style (< 750 words)
  • All submissions must be in PDF form

Real World Example

• Many speakers in this lecture series will use Pit Rho as an example…and so will I!
• Databases - how many do you think we’d need?
• Takeaway: production systems require thoughtfully using tech components. No silver bullet.

Another Real World Example

• Sermos
  • If there’s time...

Reading & Reference

• https://www.red-gate.com/simple-talk/sql/database-administration/five-simple-database-design-errors-you-should-avoid
• https://www.learndatasci.com/tutorials/using-databases-python-postgres-sqlalchemy-and-alembic/
• https://www.w3schools.com/sql/default.asp
• https://dzone.com/articles/23-useful-elasticsearch-example-queries