2/20/18

Lecture 11: Finish Web Technologies & Begin SQL

Slides by: Joseph E. Gonzalez

Last Two Lectures
- Last Thursday: String manipulation & Regular Expressions
  - Guest lecture by the amazing Sam Lau
  - Reviewed in section and in future labs & HWs
- Last Tuesday: HTTP, XML, and JSON
  - Pandas web tables support
  - Using the browser developer mode
  - JSON and basics of XML
  - Started HTTP request/response protocol and GET vs POST
  - Didn't finish REST and web-services …

REST – Representational State Transfer

REST APIs
- A way of architecting widely accessible, efficient, and extensible web services (typically using HTTP)
- Client-Server: client and server are able to evolve independently
- Stateless: the server does not store any of the client's session state
- Cacheable: system should clearly define what functionality can be cached (e.g., GET vs POST requests)
- Uniform Interface: provide a consistent interface for getting and updating data in a system

Example:
GET /website/images           Get all images
POST /website/images          Add an image
GET /website/images/{id}      Get an image
PUT /website/images/{id}      Update an image
DELETE /website/images/{id}   Delete an image
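The uniform interface in the example routes can be sketched as a small dispatcher. This is a toy, in-memory illustration rather than a real web server: the `handle` function, the `images` dictionary, and the status codes (common HTTP conventions) are assumptions made for the sketch and do not come from the slides.

```python
import itertools

images = {}                    # hypothetical in-memory store: id -> image data
next_id = itertools.count(1)   # hypothetical id generator


def handle(method, path, body=None):
    """Map an HTTP method and path onto an operation, following the example routes."""
    parts = path.strip("/").split("/")
    if parts[:2] != ["website", "images"] or len(parts) > 3:
        return 404, None
    if len(parts) == 2:                    # collection: /website/images
        if method == "GET":
            return 200, dict(images)       # get all images
        if method == "POST":
            new_id = next(next_id)
            images[new_id] = body          # add an image
            return 201, new_id
        return 405, None
    if not parts[2].isdigit():
        return 404, None
    image_id = int(parts[2])               # single item: /website/images/{id}
    if image_id not in images:
        return 404, None
    if method == "GET":
        return 200, images[image_id]       # get an image
    if method == "PUT":
        images[image_id] = body            # update an image
        return 200, image_id
    if method == "DELETE":
        del images[image_id]               # delete an image
        return 204, None
    return 405, None


status, created_id = handle("POST", "/website/images", "cat.png")
```

Each request carries everything the server needs (method, path, body), so the sketch keeps no per-client session state.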

Scraping Ethics
- Don't violate terms of use for the service or data
- Scraping can result in degraded services for others
  - Many services are optimized for human user access patterns
  - Requests can be parallelized/distributed to saturate server
  - Each query may result in many requests
- How to scrape ethically:
  - Use documented REST APIs – read terms of service
  - Examine robots.txt (e.g., https://en.wikipedia.org/robots.txt)
  - Throttle request rates (sleep)
  - Avoid getting Berkeley (or your organization) blocked from websites & services

Demo: TwitterAPI_REST_Example.ipynb
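A minimal sketch of the robots.txt check and request throttling, using Python's standard `urllib.robotparser`. The robots.txt text, the user agent name, the URLs, and the `polite_fetch` helper are all hypothetical; the fetch function is a stub so the sketch never touches the network.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would read the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 1
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def polite_fetch(urls, fetch, user_agent="ds100-student", min_delay=1.0, sleep=time.sleep):
    """Fetch only the URLs robots.txt allows, pausing between requests."""
    delay = parser.crawl_delay(user_agent) or min_delay
    results = {}
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue                  # skip paths the site asks crawlers to avoid
        results[url] = fetch(url)
        sleep(delay)                  # throttle the request rate
    return results


# Usage with a stub fetch and a recorded (not real) sleep.
pauses = []
pages = polite_fetch(
    ["https://example.com/public/a", "https://example.com/private/b"],
    fetch=lambda url: "ok",
    sleep=pauses.append,
)
```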


Databases and SQL Part 1

Slides by: Joseph E. Gonzalez & Joseph Hellerstein

What is a database?

Defining Databases
- A database is an organized collection of data.
- A database management system (DBMS) is a software system that stores, manages, and facilitates access to one or more databases.

Database Management Systems
- Data storage
  - Provide reliable storage to survive system crashes and disk failures
  - Special data-structures to improve performance
- Data management
  - Configure how data is logically organized and who has access
  - Ensure data consistency properties (e.g., positive bank account values)
- Facilitate access
  - Enable efficient access to the data
  - Supports user defined computation (queries) over data

Is Pandas a Database Management System?
- Data Storage?
  - Pandas doesn't store data, this is managed by the filesystem
- Data Management?
  - Pandas does support changing the organization of data but doesn't manage who can access the data
- Facilitate Access?
  - Pandas does support rich tools for computation over data
- Pandas is not generally considered a database management system but it often interacts with DBMSs

Why should I use a DBMS? Why can't I just have my CSV files?
- DBMSs organize many related sources of information
- DBMSs enforce guarantees on the data
  - Can be used to prevent data anomalies
  - Ensure safe concurrent operations on data
- DBMSs can be scalable
  - Optimized to compute on data that does not fit in memory
  - Parallel computation and optimized data structures
- DBMSs prevent data loss from software/hardware failures


Common DBMS Systems https://db-engines.com/en/ranking

Widely Used DBMS Technologies

Relational database management systems are widely used!

Relational Database Management Systems
- Relational databases are the traditional DBMS technology
- Logically organize data in relations (tables)

Sales relation:

Name   Prod  Price
Sue    iPod  $200.00
Joey   Bike  $333.99
Alice  Car   $999.00

Describes relationship: Name purchased Prod at Price. Each row is a tuple; each column is an attribute.

How is data physically stored?

Relational Data Abstraction
[Figure: a Database Management System exposes Relations (Tables) as an abstraction over Optimized Data Structures (B+Trees) and Optimized Storage (Pages 1 to 6, each with a page header).]

Example relations shown in the figure:

sid  sname   rating  age
28   yuppy   9       35.0
31   lubber  8       55.5
44   guppy   5       35.0
58   rusty   10      35.0

bid  bname      color
101  Interlake  blue
102  Interlake  red
104  Marine     red
103  Clipper    green

Physical Data Independence: Database management systems hide how data is stored from end user applications
→ System can optimize storage and computation without changing applications.

Big Idea in Data Systems & Computer Science

In a time long ago …

It wasn't always like this …


Ted Codd and the

Before the 1970s, databases were not routinely organized as tables. Instead they exposed specialized data structures designed for specific applications.

- [1969] Relational model: a mathematical abstraction of a database as sets
  - Independence of data from the physical properties of storage and representation
- [1972] Relational Algebra & Calculus: a collection of operations and a way of defining logical outcomes for data transformations
  - Algebra: beginning of technologies like Pandas
  - Calculus: the foundation of modern SQL

Edgar F. “Ted” Codd (1923 - 2003) Turing Award 1981

Relational Database Management Systems

- Traditionally DBMS referred to relational databases
- Logically organize data in relations (tables)
- Structured Query Language (SQL) to define, manipulate and compute on data.
  - A common language spoken by many data systems
  - Some variations and deviations from the standard …
  - Describes logical organization of data as well as computation on data.

SQL: What, not How

SQL is a Declarative Language
- Declarative: "Say what you want, not how to get it."
  - Declarative Example: I want a table with columns "x" and "y" constructed from tables "A" and "B" where the values in "y" are greater than 100.00.
  - Imperative Example: For each record in table "A" find the corresponding record in table "B" then drop the records where "y" is less than or equal to 100 then return the "x" and "y" values.
- Advantages of declarative programming
  - Enable the system to find the best way to achieve the result.
  - Often more compact and easier to learn for non-programmers
- Challenges of declarative programming
  - System performance depends heavily on automatic optimization
  - Limited language (not Turing complete)

Review of Relational Terminology
- Database: Set of Relations (i.e., one or more tables)
- Relation (Table):
  - Schema: description of columns, their types, and constraints
  - Instance: data satisfying the schema
- Attribute (Column)
- Tuple (Record, Row)
- Schema of database is set of schemas of its relations
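The declarative and imperative examples above can be put side by side in a short sketch using Python's built-in `sqlite3` module. The slides do not say how records in "A" and "B" correspond, so the shared `id` column and the sample values are assumptions made purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER PRIMARY KEY, x TEXT);   -- hypothetical schema
    CREATE TABLE B (id INTEGER PRIMARY KEY, y REAL);   -- hypothetical schema
    INSERT INTO A VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO B VALUES (1, 50.0), (2, 150.0), (3, 250.0);
""")

# Declarative: describe the result and let the system decide how to compute it.
declarative = conn.execute(
    "SELECT A.x, B.y FROM A JOIN B ON A.id = B.id "
    "WHERE B.y > 100.0 ORDER BY A.id"
).fetchall()

# Imperative: spell out every step by hand.
a_rows = conn.execute("SELECT id, x FROM A ORDER BY id").fetchall()
b_by_id = dict(conn.execute("SELECT id, y FROM B").fetchall())
imperative = []
for record_id, x in a_rows:
    y = b_by_id.get(record_id)          # find the corresponding record in B
    if y is not None and y > 100.0:     # drop records where y <= 100
        imperative.append((x, y))       # keep the x and y values
```

Both versions produce the same rows; only the declarative one leaves the choice of evaluation strategy to the database.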


Two sublanguages of SQL

- DDL – Data Definition Language
  - Define and modify schema
- DML – Data Manipulation Language
  - Queries can be written intuitively.

SELECT * FROM THINGS;

CAPITALIZATION IS optional BUT … DATABASE PEOPLE PREFER TO YELL

Creating Tables & Populating Tables

CREATE TABLE …

CREATE TABLE Sailors (
  sid INTEGER,
  sname CHAR(20),
  rating INTEGER,
  age REAL,
  PRIMARY KEY (sid));

CREATE TABLE Boats (
  bid INTEGER,
  bname CHAR(20),
  color CHAR(10),
  PRIMARY KEY (bid));

CREATE TABLE Reserves (
  sid INTEGER,
  bid INTEGER,
  day DATE,
  PRIMARY KEY (sid, bid, day),
  FOREIGN KEY (sid) REFERENCES Sailors,
  FOREIGN KEY (bid) REFERENCES Boats);

- Columns have names and types
- Specify Primary Key column(s)
- Specify Foreign Key relationships
- Semicolon at end of command

Sailors:
sid  sname  rating  age
1    Fred   7       22
2    Jim    2       39
3    Nancy  8       27

Boats:
bid  bname        color
101  Nina         red
102  Pinta        blue
103  Santa Maria  red

Reserves:
sid  bid  day
1    102  9/12
2    102  9/13

Common SQL Types (there are others...)
- CHAR(size): Fixed number of characters
- TEXT: Arbitrary number of character strings
- INTEGER & BIGINT: Integers of various sizes
- REAL & DOUBLE PRECISION: Floating point numbers
- DATE & DATETIME: Date and Date+Time formats

See documentation for database system (e.g., Postgres)
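The three table definitions can be tried out with Python's built-in `sqlite3` module. This is a sketch under one notable assumption: it uses SQLite rather than Postgres, and SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`. The `rejected` helper and the two failing rows are hypothetical, added to show the primary-key and foreign-key constraints at work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite-specific: enable foreign key checks
conn.executescript("""
    CREATE TABLE Sailors (
        sid INTEGER, sname CHAR(20), rating INTEGER, age REAL,
        PRIMARY KEY (sid));
    CREATE TABLE Boats (
        bid INTEGER, bname CHAR(20), color CHAR(10),
        PRIMARY KEY (bid));
    CREATE TABLE Reserves (
        sid INTEGER, bid INTEGER, day DATE,
        PRIMARY KEY (sid, bid, day),
        FOREIGN KEY (sid) REFERENCES Sailors,
        FOREIGN KEY (bid) REFERENCES Boats);
    INSERT INTO Sailors VALUES (1, 'Fred', 7, 22), (2, 'Jim', 2, 39), (3, 'Nancy', 8, 27);
    INSERT INTO Boats VALUES (101, 'Nina', 'red'), (102, 'Pinta', 'blue'),
                             (103, 'Santa Maria', 'red');
    INSERT INTO Reserves VALUES (1, 102, '9/12'), (2, 102, '9/13');
""")


def rejected(sql):
    """Return True if the database refuses the statement on integrity grounds."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# Hypothetical rows: sailor 99 does not exist, and sid 1 is already taken.
dangling_reservation = rejected("INSERT INTO Reserves VALUES (99, 102, '9/14')")
duplicate_sailor = rejected("INSERT INTO Sailors VALUES (1, 'Fred again', 5, 30)")
reservation_count = conn.execute("SELECT COUNT(*) FROM Reserves").fetchone()[0]
```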

More Creating Tables

CREATE TABLE students(
  name TEXT PRIMARY KEY,
  gpa REAL CHECK (gpa >= 0.0 and gpa <= 4.0),
  age INTEGER,
  dept TEXT,
  gender CHAR);

Imposing Integrity Constraints: useful to ensure data quality…

Inserting Records into a Table

INSERT INTO students (name, gpa, age, dept, gender)  -- column list is optional
VALUES
  ('Sergey Brin', 2.8, 40, 'CS', 'M'),
  ('Danah Boyd', 3.9, 35, 'CS', 'F'),
  ('Bill Gates', 1.0, 60, 'CS', 'M'),
  ('Hillary Mason', 4.0, 35, 'DATASCI', 'F'),
  ('Mike Olson', 3.7, 50, 'CS', 'M'),
  ('Mark Zuckerberg', 4.0, 30, 'CS', 'M'),
  ('Sheryl Sandberg', 4.0, 47, 'BUSINESS', 'F'),
  ('Susan Wojcicki', 4.0, 46, 'BUSINESS', 'F'),
  ('Marissa Meyer', 4.0, 45, 'BUSINESS', 'F');

- Fields must be entered in order (record)
- Comma between records
- Must use the single quote (') for strings.

-- This is a comment.
-- Does the order matter? No
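As a rough illustration of how the integrity constraints guard data quality at insert time, the sketch below loads a few of the records into an in-memory SQLite database and then attempts two bad inserts. SQLite stands in for whichever DBMS the course uses, and the `rejected` helper and the 'Test Student' row are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE students(
        name TEXT PRIMARY KEY,
        gpa REAL CHECK (gpa >= 0.0 AND gpa <= 4.0),
        age INTEGER,
        dept TEXT,
        gender CHAR)
""")
conn.execute("""
    INSERT INTO students (name, gpa, age, dept, gender)
    VALUES
      ('Sergey Brin', 2.8, 40, 'CS', 'M'),
      ('Danah Boyd', 3.9, 35, 'CS', 'F'),
      ('Hillary Mason', 4.0, 35, 'DATASCI', 'F')
""")


def rejected(sql):
    """Return True if the insert violates a constraint and is refused."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# A GPA outside [0.0, 4.0] fails the CHECK constraint.
bad_gpa = rejected("INSERT INTO students VALUES ('Test Student', 4.5, 20, 'CS', 'F')")
# A repeated name fails the PRIMARY KEY constraint.
duplicate_name = rejected("INSERT INTO students VALUES ('Danah Boyd', 3.0, 35, 'CS', 'F')")
row_count = conn.execute("SELECT COUNT(*) FROM students").fetchone()[0]
```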


Deleting and Modifying Records

- Records are deleted by specifying a condition:

DELETE FROM students
WHERE LOWER(name) = 'sergey brin'   -- LOWER is a string function

- Modifying records

UPDATE students
SET gpa = 1.0 + gpa
WHERE dept = 'CS';

- Notice that there is no way to modify records by location

Deleting and Modifying Records

- What is wrong with the following?

UPDATE students
SET gpa = 1.0 + gpa
WHERE dept = 'CS';

CREATE TABLE students(
  name TEXT PRIMARY KEY,
  gpa FLOAT CHECK (gpa >= 0.0 and gpa <= 4.0),
  age INTEGER,
  dept TEXT,
  gender CHAR);

Update would violate Integrity Constraints
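The delete and the failing update can be reproduced in a few lines, again assuming SQLite as a stand-in DBMS and a small subset of the student records. In SQLite a statement that violates a constraint is undone as a whole, which is why no GPA changes even though some rows could have been raised legally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE students(
        name TEXT PRIMARY KEY,
        gpa FLOAT CHECK (gpa >= 0.0 AND gpa <= 4.0),
        age INTEGER,
        dept TEXT,
        gender CHAR)
""")
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?)", [
    ("Sergey Brin", 2.8, 40, "CS", "M"),
    ("Danah Boyd", 3.9, 35, "CS", "F"),
    ("Bill Gates", 1.0, 60, "CS", "M"),
    ("Hillary Mason", 4.0, 35, "DATASCI", "F"),
])

# Delete by condition; LOWER makes the name comparison case-insensitive.
conn.execute("DELETE FROM students WHERE LOWER(name) = 'sergey brin'")
remaining = conn.execute("SELECT COUNT(*) FROM students").fetchone()[0]

# Raising every CS GPA by 1.0 would push Danah Boyd to 4.9, above the CHECK limit.
try:
    conn.execute("UPDATE students SET gpa = 1.0 + gpa WHERE dept = 'CS'")
    update_failed = False
except sqlite3.IntegrityError:
    update_failed = True

gates_gpa = conn.execute(
    "SELECT gpa FROM students WHERE name = 'Bill Gates'"
).fetchone()[0]
```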

Querying Tables

SQL DML: Basic Single-Table Queries

SELECT [DISTINCT] <column expression list>
FROM <single table>
[WHERE <predicate>]
[GROUP BY <column list>
  [HAVING <predicate>] ]
[ORDER BY <column list>];

- Elements of the basic statement
- [Square brackets] are optional expressions.

Basic Single-Table Queries

SELECT [DISTINCT] <column expression list>
FROM <single table>
[WHERE <predicate>]
[GROUP BY <column list>
  [HAVING <predicate>] ]
[ORDER BY <column list>];

- Simplest version is straightforward
  - Produce all tuples in the table that satisfy the predicate
  - Output the expressions in the SELECT list
  - Expression can be a column reference, or an arithmetic expression over column refs

Find the name and GPA for all CS Students

SELECT name, gpa
FROM students
WHERE dept = 'CS';
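A small sketch of the simplest query form, run against an in-memory SQLite database holding a few of the student records. The second query is an assumption added to show an arithmetic expression in the SELECT list; the `next_age` alias is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students(name TEXT PRIMARY KEY, gpa REAL, "
             "age INTEGER, dept TEXT, gender CHAR)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?)", [
    ("Sergey Brin", 2.8, 40, "CS", "M"),
    ("Danah Boyd", 3.9, 35, "CS", "F"),
    ("Hillary Mason", 4.0, 35, "DATASCI", "F"),
    ("Sheryl Sandberg", 4.0, 47, "BUSINESS", "F"),
])

# Keep the tuples that satisfy the predicate, then output the SELECT list.
cs_students = conn.execute(
    "SELECT name, gpa FROM students WHERE dept = 'CS'"
).fetchall()

# A SELECT-list entry can also be an arithmetic expression over columns.
next_ages = conn.execute(
    "SELECT name, age + 1 AS next_age FROM students WHERE dept = 'CS'"
).fetchall()
```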


SELECT DISTINCT

SELECT DISTINCT dept
FROM students;

- DISTINCT flag specifies removal of duplicates before output

ORDER BY

SELECT name, gpa, age
FROM students
WHERE dept = 'CS'
ORDER BY gpa, name;

- ORDER BY clause specifies output to be sorted
  - Lexicographic ordering

ORDER BY

SELECT name, gpa, age
FROM students
WHERE dept = 'CS'
ORDER BY gpa DESC, name ASC;

- Ascending order by default
  - DESC flag for descending, ASC for ascending
  - Can mix and match, lexicographically

Aggregates

SELECT AVG(gpa)
FROM students
WHERE dept = 'CS';

- Before producing output, compute a summary statistic
  - Aggregates include: SUM, COUNT, MAX, MIN, …
- Produces 1 row of output → Still a table
- Note: can use DISTINCT inside the agg function
  - SELECT COUNT(DISTINCT name) …
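The DISTINCT, ORDER BY, and aggregate queries can be checked against the nine student records from the INSERT slide. This sketch assumes SQLite as the DBMS; the `COUNT(DISTINCT dept)` query is a small variation on the slide's `COUNT(DISTINCT name)` added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students(name TEXT PRIMARY KEY, gpa REAL, "
             "age INTEGER, dept TEXT, gender CHAR)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?)", [
    ("Sergey Brin", 2.8, 40, "CS", "M"),
    ("Danah Boyd", 3.9, 35, "CS", "F"),
    ("Bill Gates", 1.0, 60, "CS", "M"),
    ("Hillary Mason", 4.0, 35, "DATASCI", "F"),
    ("Mike Olson", 3.7, 50, "CS", "M"),
    ("Mark Zuckerberg", 4.0, 30, "CS", "M"),
    ("Sheryl Sandberg", 4.0, 47, "BUSINESS", "F"),
    ("Susan Wojcicki", 4.0, 46, "BUSINESS", "F"),
    ("Marissa Meyer", 4.0, 45, "BUSINESS", "F"),
])

# DISTINCT removes duplicate departments before output.
depts = [row[0] for row in conn.execute("SELECT DISTINCT dept FROM students")]

# Sort CS students by GPA (highest first), breaking ties by name.
ranked = [row[0] for row in conn.execute(
    "SELECT name, gpa, age FROM students WHERE dept = 'CS' "
    "ORDER BY gpa DESC, name ASC")]

# An aggregate collapses the selected rows into a single-row table.
avg_rows = conn.execute("SELECT AVG(gpa) FROM students WHERE dept = 'CS'").fetchall()
distinct_depts = conn.execute("SELECT COUNT(DISTINCT dept) FROM students").fetchone()[0]
```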

GROUP BY

SELECT dept, AVG(gpa)
FROM students
GROUP BY dept;

- Partition table into groups with same GROUP BY column values
  - Group By takes a list of columns
- Produce an aggregate result per group

What does the following produce?

SELECT name, AVG(gpa)
FROM students
GROUP BY dept;

- An error! (why?)
  - What name should be used for each group?
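A sketch of both GROUP BY queries on the nine student records, assuming SQLite. One caveat matters here: stricter systems such as PostgreSQL reject the second query with an error, as the slide says, but SQLite is permissive and instead returns one arbitrarily chosen name per group, so the sketch only counts the rows it gets back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students(name TEXT PRIMARY KEY, gpa REAL, "
             "age INTEGER, dept TEXT, gender CHAR)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?)", [
    ("Sergey Brin", 2.8, 40, "CS", "M"),
    ("Danah Boyd", 3.9, 35, "CS", "F"),
    ("Bill Gates", 1.0, 60, "CS", "M"),
    ("Hillary Mason", 4.0, 35, "DATASCI", "F"),
    ("Mike Olson", 3.7, 50, "CS", "M"),
    ("Mark Zuckerberg", 4.0, 30, "CS", "M"),
    ("Sheryl Sandberg", 4.0, 47, "BUSINESS", "F"),
    ("Susan Wojcicki", 4.0, 46, "BUSINESS", "F"),
    ("Marissa Meyer", 4.0, 45, "BUSINESS", "F"),
])

# One output row per department, each with that group's average GPA.
avg_by_dept = dict(conn.execute(
    "SELECT dept, AVG(gpa) FROM students GROUP BY dept").fetchall())

# Ambiguous query: SQLite picks some name from each group instead of failing.
ambiguous = conn.execute(
    "SELECT name, AVG(gpa) FROM students GROUP BY dept").fetchall()
```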


What if we wanted to only consider departments that have greater than two students?

SELECT dept, AVG(gpa)
FROM students
WHERE COUNT(*) > 2
GROUP BY dept;

- Doesn't work …
  - WHERE clause is applied before GROUP BY
  - You cannot have aggregation functions in the WHERE clause

HAVING

SELECT dept, AVG(gpa)
FROM students
GROUP BY dept
HAVING COUNT(*) > 2;

- The HAVING predicate is applied after grouping and aggregation
  - Hence can contain anything that could go in the SELECT list
- HAVING can only be used in aggregate queries

Try Queries Here: http://sqlfiddle.com/#!17/67109/12

Conceptual SQL Evaluation

SELECT [DISTINCT] target-list
FROM relation-list
WHERE qualification
GROUP BY grouping-list
HAVING group-qualification

FROM: One or more tables to use (cross product …)
→ WHERE: Apply selections (eliminate rows)
→ SELECT: Project away columns (just keep those used in SELECT, GBY, HAVING)
→ GROUP BY: Form groups & aggregate
→ HAVING: Eliminate groups
→ [DISTINCT]: Eliminate duplicates
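The difference between filtering rows with WHERE and filtering groups with HAVING can be seen by running both attempts. This sketch assumes SQLite and the nine student records; the exact error text differs between database systems, so only the fact that an error occurs is recorded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students(name TEXT PRIMARY KEY, gpa REAL, "
             "age INTEGER, dept TEXT, gender CHAR)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?)", [
    ("Sergey Brin", 2.8, 40, "CS", "M"),
    ("Danah Boyd", 3.9, 35, "CS", "F"),
    ("Bill Gates", 1.0, 60, "CS", "M"),
    ("Hillary Mason", 4.0, 35, "DATASCI", "F"),
    ("Mike Olson", 3.7, 50, "CS", "M"),
    ("Mark Zuckerberg", 4.0, 30, "CS", "M"),
    ("Sheryl Sandberg", 4.0, 47, "BUSINESS", "F"),
    ("Susan Wojcicki", 4.0, 46, "BUSINESS", "F"),
    ("Marissa Meyer", 4.0, 45, "BUSINESS", "F"),
])

# WHERE runs before grouping, so an aggregate in it is rejected.
try:
    conn.execute("SELECT dept, AVG(gpa) FROM students "
                 "WHERE COUNT(*) > 2 GROUP BY dept")
    where_error = None
except sqlite3.Error as err:
    where_error = str(err)

# HAVING runs after grouping, so it can filter on the group size.
large_depts = conn.execute(
    "SELECT dept, AVG(gpa) FROM students "
    "GROUP BY dept HAVING COUNT(*) > 2").fetchall()
```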

Putting it all together

SELECT dept, AVG(gpa) AS avg_gpa, COUNT(*) AS size
FROM students
WHERE gender = 'F'
GROUP BY dept
HAVING COUNT(*) > 2
ORDER BY avg_gpa DESC

What does this compute?
- The average GPA of female students and number of female students in each department where there are at least 3 female students in that department. The results are ordered by the average GPA.

http://bit.ly/ds100-sp18-sql
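Running the combined query on the nine student records makes the answer concrete. The sketch assumes SQLite; with this data only BUSINESS has at least three female students, so a single row comes back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students(name TEXT PRIMARY KEY, gpa REAL, "
             "age INTEGER, dept TEXT, gender CHAR)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?)", [
    ("Sergey Brin", 2.8, 40, "CS", "M"),
    ("Danah Boyd", 3.9, 35, "CS", "F"),
    ("Bill Gates", 1.0, 60, "CS", "M"),
    ("Hillary Mason", 4.0, 35, "DATASCI", "F"),
    ("Mike Olson", 3.7, 50, "CS", "M"),
    ("Mark Zuckerberg", 4.0, 30, "CS", "M"),
    ("Sheryl Sandberg", 4.0, 47, "BUSINESS", "F"),
    ("Susan Wojcicki", 4.0, 46, "BUSINESS", "F"),
    ("Marissa Meyer", 4.0, 45, "BUSINESS", "F"),
])

# WHERE filters rows, GROUP BY forms groups, HAVING filters groups,
# and ORDER BY sorts what is left.
result = conn.execute("""
    SELECT dept, AVG(gpa) AS avg_gpa, COUNT(*) AS size
    FROM students
    WHERE gender = 'F'
    GROUP BY dept
    HAVING COUNT(*) > 2
    ORDER BY avg_gpa DESC
""").fetchall()
```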


How do you interact with a database?

What is the DBMS?
- Server
- Software
- A library

Answer: It can be all of these.

Interacting with a DBMS
[Figure: a Python analysis sends a query to a DBMS server holding the Cust., Prod., and Sales tables and receives a response table.]

Query: SELECT * FROM sales WHERE price > 100.0

Response:
Date       Purchase ID  Name  Price
9/20/2012  1234         Sue   $200.00
8/21/2012  3453         Joe   $333.99
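The "library" case can be sketched with SQLite, which runs inside the Python process rather than as a separate server. The column names and the third, cheaper purchase are assumptions added so that the filter has something to exclude.

```python
import sqlite3

# The whole DBMS lives in this process; ":memory:" keeps the data in RAM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (date TEXT, purchase_id INTEGER, name TEXT, price REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("9/20/2012", 1234, "Sue", 200.00),
    ("8/21/2012", 3453, "Joe", 333.99),
    ("8/22/2012", 9999, "Ann", 15.00),   # hypothetical row below the price cutoff
])

# The analysis code sends a query and gets a table of rows back as the response.
response = conn.execute("SELECT * FROM sales WHERE price > 100.0").fetchall()
```

With a server-based DBMS the query text would stay the same; only the connection step would change.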

Interacting with a DBMS
[Figure: web servers (over HTTP), web applications, Python analysis, and visualization tools all connected to one DBMS server holding the Cust., Prod., and Sales tables.]

Often many systems will connect to a DBMS concurrently.

Break

Why are databases drawn as "cans"?

Looks like platters on a disk drive.

1956: IBM MODEL 350 RAMAC, First Commercial Disk Drive, 5MB @ 1 ton

Simple Single Table Query Demo
