1 LECTURE #3: DATABASES, DATABASES, DATABASES
E4520: Data Science for Mechanical Systems
Instructor: Josh Browne, PhD
Guest Lecturer: Gilman Callsen
Feb 5, 2020

2 Gilman Callsen
• “Entrepreneur with a penchant for technology companies.”
• Yale, BA Psychology
  • Started as a physics major, though!
• Databases have been a big part of my entire career.
• Not a database administrator.

3 Basic Timeline
Late 90s & Early 00s Websites
Chromic Décor (2006)
MC10 (2008)
Pit Rho (2012)
Rho AI (2016)

4 I learned the value of databases pretty quickly at this point...

5 PREP

6 Pair Up
We will do some thought experiments and actual coding throughout. Make sure:
• You’re next to at least one person you can talk to/brainstorm with
• You have a computer or are near a computer that got through pre-class prep

7 Let’s Get Those Laptops Out
cd ~/Documents
git clone https://github.com/gcallsen/database-class-examples.git
git clone https://github.com/rhoai/python-dev.git
cd ~/Documents/python-dev
docker-compose up -d
docker run -it --rm --net python-dev_default -v ~/Documents/database-class-examples:/code rhoai/python-dev:v0.1.0

8 OVERVIEW

9 Data, Data, Data
Who remembers what Erik Allen (Lecture #1) said about “Data Collection and Preparation”?

10 Data Collection and Preparation
• Be prepared to spend a lot of time (~80%) on data collection and cleaning
• If you’ve got a data set, be very very grateful!
• Expect to be a partner in the data generation process
• Have a full tool belt of models, to prepare for a paucity of data early in the process

11 Where Do the Data Live?
With that in mind...where do you think all those collected and prepared data live?

12 You’re Right!
DATABASES!

13 Cooking Analogy
As we go through this lecture, keep cooking in mind.
[Diagram labels: Grocery Store, Blue Apron, Family Cook]

14 Question
What is a Database?

15 What is a Database?
• An organized collection of data?
[Diagram labels: Data, Database, Do Stuff]

16 What is a Database?
• An organized collection of data?
[Diagram labels: Grocery Store, Blue Apron, Collection of Ingredients, Chef, The Food]

17 Where are Databases Used?
1. Name an industry/profession/etc.
2. Brainstorm some “data” they might have

18 Where are Databases Used?
• Everywhere!
• Databases are the backbone of nearly every digital ‘thing’ we interact with today.
• Examples
  • Software engineer
  • Mechanical engineer
  • Business person
  • Marketing
  • Generic consumer
  • Even Schools...

19 https://xkcd.com/327/

20 Data from Mechanical Systems?
1. What sorts of data might mechanical systems produce?

21 DBs for DS in Mechanical Systems?
• Types of data and application needs will vary wildly
  • Super fast (real-time mechanical systems)
  • Highly accurate
  • Complex connections
  • Research
  • Development
• It is unlikely anyone here will be a DBA, but an understanding of how to think about databases goes a long way
• Rubicon Global (waste management)
  • Automated pickup detection - on board vs cloud
  • Video processing
  • Accelerometer data
  • GPS location + time

22 Takeaways
1. Nearly every industry in the world produces data
2. Those data need to be stored somewhere to be useful

23 Types of Databases
• No silver bullet
• An incredibly wide array of databases exists, all with strengths and weaknesses
• Every situation requires considering what mix of DBs to use.
  • Polyglot persistence
• Grocery Store vs Blue Apron vs Cook
[Diagram labels: Non-Relational, Relational]
Why are there only two ‘types’ here when our analogy has 3 ‘types’?

24 Relational Overview
• A relational database is a digital database based on the relational model of data, as proposed by E. F. Codd in 1970.
• Virtually all relational database systems use SQL (Structured Query Language) for querying and maintaining the database.
Straight from Wikipedia: https://en.wikipedia.org/wiki/Relational_database

25 Non-Relational Overview
• A NoSQL (originally referring to "non SQL" or "non relational") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.
• NoSQL databases are increasingly used in big data and real-time web applications.
• NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages, or sit alongside SQL database in a polyglot persistence architecture.
Straight from Wikipedia: https://en.wikipedia.org/wiki/NoSQL

26 https://www.alooma.com/blog/types-of-modern-databases

27 In Practice
Amazon.com
How do you think they store your data? (What data do they store, what type of database(s))?
https://neo4j.com/blog/neo4j-doc-manager-polyglot-persistence-mongodb/

28 In Practice
• In most cases, more than one database is used!
Polyglot Persistence
https://neo4j.com/blog/neo4j-doc-manager-polyglot-persistence-mongodb/

29 RELATIONAL
A man walks into a bar and sees two tables. He says, “May I join you?”

30 https://www.alooma.com/blog/types-of-modern-databases

31 Summary Version
• A database is a means of storing information in such a way that information can be retrieved from it.
• A relational database is one that presents information in tables with rows and columns.
• A table is a collection of objects of the same type (rows).
• Data in a table can be related to other tables (typically using ‘keys’)
• The ability to retrieve these related data provides us the term relational database.

32 Relational Databases
• Core concepts for today
  • Tables
  • Normalization
  • SQL query language

33 Question
How would you describe the contents of a grocery store in an excel spreadsheet? What “columns” would it have?
34 Tables
• Grocery store

Products (schema)
id           int
name         string
price        float
description  string
department   string

Products (rows)
id  name     price  description          department
1   Apple    1.99   Delicious Apple      Produce
2   Banana   3.49   Bunch of Bananas     Produce
3   Bread    3.99   Loaf of whole wheat  Bakery
4   Cheddar  2.79   Sliced cheddar       Deli

35 Normalization
• Database Normalization is a technique of organizing the data in the database.
• Basically, you break apart tables to:
  • eliminate data redundancy and
  • reduce malformed data when performing CRUD* operations
• Importantly, this allows the database itself to enforce data integrity.
• Once you’ve done this, you now need the concept of ‘joins’.
• To perform a join you need two items:
  • two tables and a join condition
  • the tables contain the rows to be combined, and the join condition contains the instructions to match rows together
*CRUD - Create Read Update Delete

36 Things can get crazy...

37 Question
What column(s) from our previous table are good candidate(s) for normalization?
Products: id, name, price, description, department

38 Basic Normalization
• Grocery store

Departments (schema)
id           int
name         string
description  string

Products (schema)
id             int
department_id  int
name           string
price          float
description    string

39 Basic Normalization
• Grocery store

Departments (rows)
id  name     description
1   Produce  Healthy stuff!
2   Bakery   Breads and goodies
3   Deli     Meats, cheeses, etc

Products (rows)
id  name     price  description          department_id
1   Apple    1.99   Delicious Apple      1
2   Banana   3.49   Bunch of Bananas     1
3   Bread    3.99   Loaf of whole wheat  2
4   Cheddar  2.79   Sliced cheddar       3

40 SQL
• The fundamentals of most SQL languages are the same
• Variations exist based on the database’s functionality
• https://www.w3schools.com/sql/sql_intro.asp
  • Worth going through that as a primer

41 SQL Common Commands
• SELECT - extracts data from a database
• UPDATE - updates data in a database
• DELETE - deletes data from a database
• INSERT INTO - inserts new data into a database
• CREATE DATABASE - creates a new database
• ALTER DATABASE - modifies a database
• CREATE TABLE - creates a new table
• ALTER TABLE - modifies a table
• DROP TABLE - deletes a table
• CREATE INDEX - creates an index (search key)
• DROP INDEX - deletes an index

42 Let’s Go to the Grocery Store
• Start databases.
  • Go to the `python-dev` folder
  • docker-compose up -d
  • docker ps -a
• Inside your python-dev docker container
  • docker run -it --rm --net python-dev_default -v ./database-class-examples:/code rhoai/python-dev:v0.1.0
• Connect to postgres (password may be “postgres”)
  • psql -h postgres -U postgres -d postgres_db
  • \d → nothing → \q
• Seed tables
  • python ./code/ex_postgres.py

43 SQL On Our Tables
Command: \d+
Outcome: Describe the tables/relations

Command: SELECT * FROM products;
Outcome: Get all the Products

Command: SELECT * FROM departments;
Outcome: Get all the Departments

Command:
SELECT p.name, d.name, price
FROM products p
FULL OUTER JOIN departments d ON d.id = p.department_id;
Outcome: Get all of the products, display their name, price, and name of the department they are in.

45 BREAK
A programmer’s wife sends him to the grocery store with the instructions, “get a loaf of bread and, if they have eggs, get a dozen.” He comes home with a dozen loaves of bread and tells her, “they had eggs.”

46 NON-RELATIONAL
Database Admins walked into a NoSQL bar. …a little while later they walked out because they couldn’t find a table.
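
Looking back at the relational section for a moment: the normalized grocery-store tables (slides 38-39) and the join query (slide 43) can be tried without Docker using the sketch below. It is an illustration under stated assumptions, not the contents of `ex_postgres.py`: it uses Python's built-in `sqlite3` with an in-memory database in place of the class's Postgres container, maps `string`/`float` to SQLite's `TEXT`/`REAL`, and uses a plain `JOIN` instead of `FULL OUTER JOIN` because older SQLite versions do not support the latter. The last step shows the database itself rejecting a product that points at a department that does not exist, which is the data-integrity point from slide 35.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the class's Postgres instance.
conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Schemas follow slide 38; the SQL types are SQLite equivalents.
conn.executescript("""
CREATE TABLE departments (
    id          INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    description TEXT
);
CREATE TABLE products (
    id            INTEGER PRIMARY KEY,
    department_id INTEGER NOT NULL REFERENCES departments(id),
    name          TEXT NOT NULL,
    price         REAL,
    description   TEXT
);
""")

# Rows copied from slide 39.
conn.executemany(
    "INSERT INTO departments (id, name, description) VALUES (?, ?, ?)",
    [
        (1, "Produce", "Healthy stuff!"),
        (2, "Bakery", "Breads and goodies"),
        (3, "Deli", "Meats, cheeses, etc"),
    ],
)
conn.executemany(
    "INSERT INTO products (id, department_id, name, price, description) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "Apple", 1.99, "Delicious Apple"),
        (2, 1, "Banana", 3.49, "Bunch of Bananas"),
        (3, 2, "Bread", 3.99, "Loaf of whole wheat"),
        (4, 3, "Cheddar", 2.79, "Sliced cheddar"),
    ],
)
conn.commit()

# Two tables plus a join condition: match each product to its department.
rows = conn.execute("""
    SELECT p.name, d.name, p.price
    FROM products p
    JOIN departments d ON d.id = p.department_id
    ORDER BY p.id
""").fetchall()
for product, department, price in rows:
    print(product, department, price)

# Integrity check: department 99 does not exist, so the insert is refused.
try:
    conn.execute(
        "INSERT INTO products (id, department_id, name, price, description) "
        "VALUES (5, 99, 'Mystery', 0.99, 'No such department')"
    )
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
print("bad insert rejected:", fk_rejected)
```

Note that the department name is stored once in `departments`; renaming "Produce" would be a single-row update rather than an edit to every product row.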
47 https://www.alooma.com/blog/types-of-modern-databases

48 Summary Version
• A database is a means of storing information in such a way that information can be retrieved from it.
• A non-relational database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.
• There are a lot of options. Why?
  • Data do not always fit nicely into columns and rows
  • Wide array of use cases that can have highly optimized solutions (e.g. time-series data)
  • Scalability (horizontal scalability)
  • Often, CAP Theorem is at play

49 What Doesn’t Fit in a Table?
Think of some examples of data that would be difficult to represent in a relational database
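
One possible answer, sketched in Python: records whose shape varies from one to the next. The readings below are invented for illustration (device names, field names, and values are all hypothetical), loosely modeled on the GPS, accelerometer, and video data mentioned on slide 21. Forcing them into a single table means one column per field that any record might have, so most cells end up empty, and list-valued fields still do not fit in a single cell.

```python
# Hypothetical sensor readings; every name and value here is made up.
readings = [
    {"device": "truck-17", "gps": {"lat": 40.81, "lon": -73.96},
     "time": "2020-02-05T10:00:00"},
    {"device": "truck-17", "accel": [0.01, -0.02, 9.81]},
    {"device": "cam-3", "video": {"fps": 30, "frames": 900},
     "tags": ["pickup"]},
]


def flatten(record, prefix=""):
    """Turn nested dicts into dotted column names, e.g. 'gps.lat'.

    Lists are left as-is: they have no natural single-cell form and would
    need their own table in a relational design.
    """
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


flat_rows = [flatten(r) for r in readings]

# A single table needs the union of every column seen in any record.
columns = sorted({col for row in flat_rows for col in row})

# Count the cells that would be NULL in that table.
total_cells = len(columns) * len(flat_rows)
empty_cells = sum(1 for row in flat_rows for col in columns if col not in row)

print("columns:", columns)
print(f"{empty_cells} of {total_cells} cells would be empty")
```

A document-style store would keep each reading in its original nested form instead, which is one reason the non-relational options on slide 48 exist.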