Introduction to Data Science Lecture 3 Manipulating Tabular Data

Intro. to Data Science Fall 2015 John Canny including notes from Michael Franklin and others Outline for this Evening

• Two views of tables: – SQL/Pandas – OLAP/Numpy/Matlab • SQL, NoSQL – Non-Tabular Structures

Data Science – One Definition The Big Picture

Extract Transform Load

4 Two views of tables First view Key Concept: Structured Data A data model is a collection of concepts for describing data.

A schema is a description of a particular collection of data, using a given data model.

The * • The Relational Model is Ubiquitous: • MySQL, PostgreSQL, Oracle, DB2, SQLServer, … • Foundational work done at • IBM - System R • UC Berkeley - Ingres E. F., “Ted” Codd Turing Award 1981 • Object-oriented concepts have been merged in • Early work: POSTGRES research project at Berkeley • Informix, IBM DB2, Oracle 8i

• Also has support for XML (semi-structured data)

*Codd, E. F. (1970). "A relational model of data for large shared data banks".

Communications of the ACM 13 (6): 37 Relational : Definitions • : a set of relations • Relation: made up of 2 parts: Schema : specifies name of relation, plus name and type of each column Students(sid: string, name: string, login: string, age: integer, gpa: real) Instance : the actual data at a given time • #rows = cardinality • #fields = degree / arity • A relation is a mathematical object (from set theory) which is true for certain arguments. • An instance defines the set of arguments for which the relation is true (it’s a table not a row).

Ex: Instance of Students Relation

sid name login age gpa 5366 6 Jones jones @cs 18 3.4 5368 8 Smith smith@e ecs 18 3.2 5365 0 Smith smith @math 19 3.8

• Cardinality = 3, arity = 5 , all rows distinct

• The relation is true for these tuples and false for others

SQL - A language for Relational DBs*

• SQL = Structured • Data Definition Language (DDL) – create, modify, delete relations – specify constraints – administer users, security, etc. • Data Manipulation Language (DML) – Specify queries to find tuples that satisfy criteria – add, modify, remove tuples • The DBMS is responsible for efficient evaluation.

* Developed at IBM by Donald D. Chamberlin and Raymond F. Boyce in the 1970s. Used to be SEQUEL (Structured English QUEry Language) Creating Relations in SQL

• Create the Students relation. – Note: the type (domain) of each field is specified, and enforced by the DBMS whenever tuples are added or modified.

CREATE TABLE Students (sid CHAR(20), name CHAR(20), login CHAR(10), age INTEGER, gpa FLOAT) Table Creation (continued)

• Another example: the Enrolled table holds information about courses students take.

CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2)) Adding and Deleting Tuples

• Can insert a single tuple using:

INSERT INTO Students (sid, name, login, age, gpa) VALUES ('53688', 'Smith', 'smith@ee', 18, 3.2)

• Can delete all tuples satisfying some condition (e.g., name = Smith):

DELETE FROM Students S WHERE S.name = 'Smith' Queries in SQL

• Single-table queries are straightforward.

• To find all 18 year old students, we can write:

SELECT * FROM Students S WHERE S.age=18

• To find just names and logins, replace the first line: SELECT S.name, S.login

Joins and Inference • Chaining relations together is the basic inference method in relational DBs. It produces new relations (effectively new facts) from the data: SELECT S.name, M.mortality FROM Students S, Mortality M WHERE S.Race=M.Race S M Name Race Race Mortality Socrates Man Man Mortal Thor God God Immortal Barney Dinosaur Dinosaur Mortal Blarney stone Stone Stone Non-living

Joins and Inference • Chaining relations together is the basic inference method in relational DBs. It produces new relations (effectively new facts) from the data: SELECT S.name, M.mortality FROM Students S, Mortality M WHERE S.Race=M.Race

Name Mortality Socrates Mortal Thor Immortal Barney Mortal Blarney stone Non-living

SELECT [DISTINCT] target-list Basic SQL Query FROM relation-list WHERE qualification

• relation-list : A list of relation names • possibly with a range-variable after each name • target-list : A list of attributes of tables in relation-list • qualification : Comparisons combined using AND, OR and NOT. • Comparisons are Attr op const or Attr1 op Attr2, where op is one of =≠<>≤≥ • DISTINCT: optional keyword indicating that the answer should not contain duplicates. • In SQL SELECT, the default is that duplicates are not eliminated! (Result is called a “multiset”)

SQL Inner Joins SELECT S.name, E.classid FROM Students S (INNER) JOIN Enrolled E ON S.sid=E.sid S S.name S.sid E E.sid E.classid Jones 11111 11111 History105 Smith 22222 11111 DataScience194

Brown 33333 22222 French150 44444 English10 S.name E.classid Jones History105 Jones DataScience194 Smith French150

Note the previous version of this query (with no join keyword) is an “Implicit join” SQL Inner Joins SELECT S.name, E.classid FROM Students S (INNER) JOIN Enrolled E ON S.sid=E.sid S S.name S.sid E E.sid E.classid Jones 11111 11111 History105 Smith 22222 11111 DataScience194

Brown 33333 22222 French150 44444 English10 S.name E.classid Jones History105 Unmatched keys Jones DataScience194 Smith French150 What kind of Join is this? SELECT S.name, E.classid FROM Students S ?? Enrolled E ON S.sid=E.sid S S.name S.sid E E.sid E.classid Jones 11111 11111 History105 Smith 22222 11111 DataScience194

Brown 33333 22222 French150 44444 English10 S.name E.classid Jones History105 Jones DataScience194 Smith French150 Brown NULL SQL Joins SELECT S.name, E.classid FROM Students S LEFT OUTER JOIN Enrolled E ON S.sid=E.sid S S.name S.sid E E.sid E.classid Jones 11111 11111 History105 Smith 22222 11111 DataScience194

Brown 33333 22222 French150 44444 English10 S.name E.classid Jones History105 Jones DataScience194 Smith French150 Brown NULL What kind of Join is this? SELECT S.name, E.classid FROM Students S ?? Enrolled E ON S.sid=E.sid S S.name S.sid E E.sid E.classid Jones 11111 11111 History105 Smith 22222 11111 DataScience194

Brown 33333 22222 French150 44444 English10 S.name E.classid Jones History105 Jones DataScience194 Smith French150 NULL English10 SQL Joins SELECT S.name, E.classid FROM Students S RIGHT OUTER JOIN Enrolled E ON S.sid=E.sid S S.name S.sid E E.sid E.classid Jones 11111 11111 History105 Smith 22222 11111 DataScience194

Brown 33333 22222 French150 44444 English10 S.name E.classid Jones History105 Jones DataScience194 Smith French150 NULL English10 SQL Joins SELECT S.name, E.classid FROM Students S ? JOIN Enrolled E ON S.sid=E.sid S S.name S.sid E E.sid E.classid Jones 11111 11111 History105 Smith 22222 11111 DataScience194

Brown 33333 22222 French150 44444 English10 S.name E.classid Jones History105 Jones DataScience194 Smith French150 NULL English10 Brown NULL SQL Joins SELECT S.name, E.classid FROM Students S FULL OUTER JOIN Enrolled E ON S.sid=E.sid S S.name S.sid E E.sid E.classid Jones 11111 11111 History105 Smith 22222 11111 DataScience194

Brown 33333 22222 French150 44444 English10 S.name E.classid Jones History105 Jones DataScience194 Smith French150 NULL English10 Brown NULL What kind of Join is this? SELECT S.name, E.classid FROM Students S ?? Enrolled E ON S.sid=E.sid S S.name S.sid E E.sid E.classid Jones 11111 11111 History105 Smith 22222 11111 DataScience194

Brown 33333 22222 French150 44444 English10 S.name E.classid Jones History105 Smith French150 SQL Joins SELECT S.name, E.classid FROM Students S LEFT SEMI JOIN Enrolled E ON S.sid=E.sid S S.name S.sid E E.sid E.classid Jones 11111 11111 History105 Smith 22222 11111 DataScience194

Brown 33333 22222 French150 44444 English10 S.name E.classid Jones History105 Smith French150 What kind of Join is this? SELECT * FROM Students S ?? Enrolled E S S.name S.sid E E.sid E.classid Jones 11111 11111 History105 Smith 22222 11111 DataScience194 22222 French150

S.name S.sid E.sid E.classid Jones 11111 11111 History105 Jones 11111 11111 DataScience194 Jones 11111 22222 French150 Smith 22222 11111 History105 Smith 22222 11111 DataScience194 Smith 22222 22222 French150 SQL Joins SELECT * FROM Students S CROSS JOIN Enrolled E S S.name S.sid E E.sid E.classid Jones 11111 11111 History105 Smith 22222 11111 DataScience194 22222 French150

S.name S.sid E.sid E.classid Jones 11111 11111 History105 Jones 11111 11111 DataScience194 Jones 11111 22222 French150 Smith 22222 11111 History105 Smith 22222 11111 DataScience194 Smith 22222 22222 French150 What kind of Join is this? SELECT * FROM Students S, Enrolled E WHERE S.sid <= E.sid

S S.name S.sid E E.sid E.classid Jones 11111 11111 History105 Smith 22222 11111 DataScience194 22222 French150

S.name S.sid E.sid E.classid Jones 11111 11111 History105 Jones 11111 11111 DataScience194 Jones 11111 22222 French150 Smith 22222 22222 French150 Theta Joins SELECT * FROM Students S, Enrolled E WHERE S.sid <= E.sid

S S.name S.sid E E.sid E.classid Jones 11111 11111 History105 Smith 22222 11111 DataScience194 22222 French150

S.name S.sid E.sid E.classid Jones 11111 11111 History105 Jones 11111 11111 DataScience194 Jones 11111 22222 French150 Smith 22222 22222 French150 Recall: Tweet JSON Format

The tweet's unique ID. These Text of the tweet. IDs are roughly sorted & Consecutive duplicate tweets developers should treat them are rejected. 140 character as opaque (http://bit.ly/dCkppc). max (http://bit.ly/4ud3he). D {"id"=>12296272736,

E T "text"=>

A

C "An early look at Annotations:

E

R http://groups.google.com/group/twitter-api-announce/browse_thread/thread/fa5da2608865453", Tweet's

P

E "created_at"=>"Fri Apr 16 17:55:46 +0000 2010", creation

D "in_reply_to_user_id"=>nil, The ID of an existing tweet that date. "in_reply_to_screen_name"=>nil, this tweet is in reply to. Won't

s "in_reply_to_status_id"=>nil ' be set unless the author of the r The screen name &

. o "favorited"=>false,

D

h referenced tweet is mentioned.

I

t user ID of replied to

Truncated to 140

r u "truncated"=>false,

e a tweet author. characters. Only

s

e "user"=>

u

h possible from SMS. The author's

T {"id"=>6253282, user name. The author's "screen_name"=>"twitterapi", biography. "name"=>"Twitter API", The author's

. c "description"=> screen name.

n

y

s s "The Real Twitter API. I tweet about API changes, service issues and

i

f

h

o

T happily answer questions about Twitter and our API. Don't get an answer? It's on my website.",

t

.

u The author's t "url"=>"http://apiwiki.twitter.com",

o

e

t URL.

e

e "location"=>"San Francisco, CA",

w The author'sThe "location". tweet's This is a free-formunique text field, ID. and These Text of the tweet.

g

t

"profile_background_color"=>"c1dfee",

n e there are no guarantees on whether it can be geocoded.

a

h

t c "profile_background_image_url"=>

f t IDs are roughly sorted & Consecutive duplicate tweets

o

c

"http://a3.twimg.com/profile_background_images/59931895/twitterapi-background-new.png",

e

r

j Rendering information

o b "profile_background_tile"=>false,

h

o t for the author. Colors

u

d "profile_image_url"=>"http://a3.twimg.com/profile_images/689684365/api_normal.png",developers should treat them are rejected. 140 character

a

e

are encoded in hex

d e "profile_link_color"=>"0000ff",

d h values (RGB).

e T "profile_sidebar_border_color"=>"87bc44", The creation date b as opaque (http://bit.ly/dCkppc). max (http://bit.ly/4ud3he).

m "profile_sidebar_fill_color"=>"e0ff92", for this account.

e D {"id"=>12296272736, "profile_text_color"=>"000000", Whether this account has E "created_at"=>"Wed May 23 06:01:13 +0000 2007", contributors enabled (http://bit.ly/50npuu).

T "contributors_enabled"=>true, Number of

s

t

. "favourites_count"=>1, e "text"=> favorites this

A

s

e

a "statuses_count"=>1628, Number of

w user has.

h

t

C f r "friends_count"=>13, users this user

o e

"An early look at Annotations:

s

r

E is following.

u "time_zone"=>"Pacific Time (US & Canada)", The timezone and offset

e

b

s i "utc_offset"=>-28800, (in seconds) for this user.

m

R

h

t u http://groups.google.com/group/twitter-api-announce/browse_thread/thread/fa5da2608865453","lang"=>"en", The user's selected Tweet's

N P "protected"=>false, language. E "created_at"=>"Fri "followers_count"=>100581, Apr 16 17:55:46 +0000 2010", creation "geo_enabled"=>true,

.

D

)

Whether this user is protected

7 o "notifications"=>false, DEPRECATED

7 e "in_reply_to_user_id"=>nil, or not. If the user is protected, g Y date. "following"=>true, in this context Number of The ID of an existing tweet that

F

s then this tweet is not visible

p a "verified"=>true}, followers for

4

h Whether this user / except to "friends".

r y

l "in_reply_to_screen_name"=>nil, "contributors"=>[3191321], this user.

. e has a verified badge.

t

i this tweet is in reply to. Won't

s

b "geo"=>nil,

u

/

/

:

s i "coordinates"=>nil,

p s DEPRECATED

h t "in_reply_to_status_id"=>nil

'

t

t The contributors' (if any) user be set unless the author of the

r h "place"=>

r

( The screen name &

e The place ID

. IDs (http://bit.ly/50npuu).

h d {"id"=>"2b6ff8c22edd9576",

t

o

e e l "favorited"=>false,

D h b "url"=>"http://api.twitter.com/1/geo/id/2b6ff8c22edd9576.json",

h referenced tweet is mentioned.

a

I

t W user ID of replied to n "name"=>"SoMa", The URL to fetch a detailed Truncated to 140

e

r The printable names of this place u "truncated"=>false, "full_name"=>"SoMa, San Francisco", polygon for this place

e a "place_type"=>"neighborhood", tweet author.

. characters. Only

s ) "country_code"=>"US", The type of this

p

e "user"=>

n

i

u

C place - can be a "country"=>"The United States of America",

t The place associated with this

1

h

e The author's L "neighborhood"possible from SMS. e "bounding_box"=> Tweet (http://bit.ly/b8L1Cp).

8

w

b

T

t / {"id"=>6253282, {"coordinates"=> or "city"

y

l

s

i

.

t [[[-122.42284884, 37.76893497], user name.

i h The author's

t

b

/

/ n [-122.3964, 37.76893497],

: "screen_name"=>"twitterapi",

o The country this place is in

p

t [-122.3964, 37.78752897],

t

g

h biography.

a

t

( The author's

[-122.42284884, 37.78752897]]],

o

N "name"=>"Twitter API",

e "type"=>"Polygon"}},

.

O

g

S

e

c "source"=>"web"}

J screen name.

h o The application The bounding T "description"=>

e

n

G that sent this box for this

y Map of a Twitter Status Object tweet place

s s "The Real Twitter API. I tweet about API changes, service issues and

i

Raffi Krikorian

f h 18 April 2010

o

T happily answer questions about Twitter and our API. Don't get an answer? It's on my website.",

t

.

u The author's t "url"=>"http://apiwiki.twitter.com",

o

e

t URL.

e

e "location"=>"San Francisco, CA",

w The author's "location". This is a free-form text field, and

g

t

"profile_background_color"=>"c1dfee",

n e there are no guarantees on whether it can be geocoded.

a

h

t c "profile_background_image_url"=>

f

t

o

c

"http://a3.twimg.com/profile_background_images/59931895/twitterapi-background-new.png",

e

r

j Rendering information

o b "profile_background_tile"=>false,

h

o t for the author. Colors

u

d "profile_image_url"=>"http://a3.twimg.com/profile_images/689684365/api_normal.png",

a

e

are encoded in hex

d e "profile_link_color"=>"0000ff",

d h values (RGB).

e T "profile_sidebar_border_color"=>"87bc44", The creation date

b

m "profile_sidebar_fill_color"=>"e0ff92", for this account. e "profile_text_color"=>"000000", Whether this account has "created_at"=>"Wed May 23 06:01:13 +0000 2007", contributors enabled (http://bit.ly/50npuu). "contributors_enabled"=>true, Number of

s

t

. "favourites_count"=>1, e favorites this

s

e

a "statuses_count"=>1628, Number of w user has.

h

t

f r "friends_count"=>13, users this user

o e

s r is following.

u "time_zone"=>"Pacific Time (US & Canada)", The timezone and offset

e

b

s i "utc_offset"=>-28800, (in seconds) for this user.

m

h

t u "lang"=>"en", The user's selected

N "protected"=>false, language. "followers_count"=>100581, "geo_enabled"=>true,

.

)

Whether this user is protected

7 o "notifications"=>false, DEPRECATED

7 e or not. If the user is protected,

g Y

"following"=>true, in this context Number of

F s then this tweet is not visible

p a "verified"=>true}, followers for

4

h Whether this user / except to "friends".

r y

l "contributors"=>[3191321], this user.

. e has a verified badge.

t

i

s

b "geo"=>nil,

u

/

/

:

s i "coordinates"=>nil,

p DEPRECATED

h

t

t

t

The contributors' (if any) user

r h "place"=>

(

e The place ID IDs (http://bit.ly/50npuu).

h d {"id"=>"2b6ff8c22edd9576",

t

e

e l

h b "url"=>"http://api.twitter.com/1/geo/id/2b6ff8c22edd9576.json",

a

W n "name"=>"SoMa", The URL to fetch a detailed e "full_name"=>"SoMa, San Francisco", The printable names of this place polygon for this place "place_type"=>"neighborhood",

.

) "country_code"=>"US", The type of this

p

n

i

C place - can be a "country"=>"The United States of America", t The place associated with this

1

e

L "neighborhood" e "bounding_box"=> Tweet (http://bit.ly/b8L1Cp).

8

w

b

t / {"coordinates"=> or "city"

y

l

s

i

.

t [[[-122.42284884, 37.76893497],

i

h

t

b

/

/ n [-122.3964, 37.76893497],

: o The country this place is in

p

t [-122.3964, 37.78752897],

t

g

h

a

t

(

[-122.42284884, 37.78752897]]],

o

N e "type"=>"Polygon"}},

O

g

S

e "source"=>"web"}

J

h o The application The bounding

T

e G that sent this box for this Map of a Twitter Status Object tweet place Raffi Krikorian 18 April 2010 Normalization Raw twitter data storage is very inefficient because, e.g. user records are repeated with every tweet by that user. Normalization is the process of minimizing data redundancy.

Tweet id User id Location id Body 11 111 1111 I need a Jamba juice 22 111 1111 Cal Soccer rules 33 111 2222 Why do we procrastinate? 44 222 3333 Close your eyes and push “go”

User.id Name Attribs… Loc.id Name Attribs… 111 Jones 1111 Berkeley 222 Smith 2222 Oakland 3333 Hayward Normalization Normalized tables include only a foreign key to the information in another table for repeated data. The original table is the result of inner joins between tables.

Tweet id User id Location id Body 11 111 1111 I need a Jamba juice 22 111 1111 Cal Soccer rules 33 111 2222 Why do we procrastinate? 44 222 3333 Close your eyes and push “go”

User.id Name Attribs… Loc.id Name Attribs… 111 Jones 1111 Berkeley 222 Smith 2222 Oakland 3333 Hayward Aggregate Queries Including reference counts in the lookup tables allows you to perform aggregate queries on those tables alone: Average age of users, most popular location,…

Tweet id User id Location id Body 11 111 1111 I need a Jamba Juice 22 111 1111 Cal Soccer rules 33 111 2222 Why do we procrastinate? 44 222 3333 Close your eyes and push “go”

U.id Name Count Attr.. L.id Name Count Attr… 111 Jones 3 1111 Berkeley 2 222 Smith 1 2222 Oakland 1 3333 Hayward 1 SQL Query Semantics Semantics of an SQL query are defined in terms of the following conceptual evaluation strategy: 1. do FROM clause: compute cross-product of tables (e.g., Students and Enrolled). 2. do WHERE clause: Check conditions, discard tuples that fail. (i.e., “selection”). 3. do SELECT clause: Delete unwanted fields. (i.e., “projection”). 4. If DISTINCT specified, eliminate duplicate rows. Probably the least efficient way to compute a query! – An optimizer will find more efficient strategies to get the same answer.

Data Model (Tabular)

• SQLite – Table: fixed number of named columns of specified type – 5 storage classes for columns • NULL • INTEGER • REAL • TEXT • BLOB – Data stored on disk in a single file in row-major order – Operations performed via sqlite3 shell

38 Reductions and GroupBy • One of the most common operations on Data Tables is aggregation or reduction (count, sum, average, min, max,…). • They provide a means to see high-level patterns in the data, to make summaries of it etc. • You need ways of specifying which columns are being aggregated over, which is the role of a GroupBy operator.

SID Name Course Semester Grade GPA 111 Jones Stat 134 F13 A 4.0 111 Jones CS 162 F13 B- 2.7 222 Smith EE 141 S14 B+ 3.3 222 Smith CS162 F14 C+ 2.3 222 Smith CS189 F14 A- 3.7

39 Reductions and GroupBy SID Name Course Semester Grade GPA 111 Jones Stat 134 F13 A 4.0 111 Jones CS 162 F13 B- 2.7 222 Smith EE 141 S14 B+ 3.3 222 Smith CS162 F14 C+ 2.3 222 Smith CS189 F14 A- 3.7

SELECT SID, Name, AVG(GPA) FROM Students GROUP BY SID

40 Reductions and GroupBy SID Name Course Semester Grade GPA 111 Jones Stat 134 F13 A 4.0 111 Jones CS 162 F13 B- 2.7 222 Smith EE 141 S14 B+ 3.3 222 Smith CS162 F14 C+ 2.3 222 Smith CS189 F14 A- 3.7

SELECT SID, Name, AVG(GPA) FROM Students GROUP BY SID SID Name GPA 111 Jones 3.35 222 Smith 3.1

41 Pandas/Python • Series: a named, ordered dictionary – The keys of the dictionary are the indexes – Built on NumPy’s ndarray – Values can be any Numpy data type object

• DataFrame: a table with named columns – Represented as a Dict (col_name -> series) – Each Series object represents a column

42 Operations • map() functions • filter (apply predicate to rows) • sort/group by • aggregate: sum, count, average, max, min • Pivot or reshape • Relational: – union, intersection, difference, cartesian product (CROSS JOIN) – select/filter, project – join: natural join (INNER JOIN), theta join, semi-join, etc. – rename

43 Pandas vs SQL + Pandas is lightweight and fast. + Full SQL expressiveness plus the expressiveness of Python, especially for function evaluation. + Integration with plotting functions like Matplotlib.

- Tables must fit into memory. - No post-load indexing functionality: indices are built when a table is created. - No transactions, journaling, etc. - Large, complex joins probably slower.

44 Jacobs Update

• Room 310 not ready, but other rooms are. For the next few (?) weeks, we will meet: • Mondays in 155 Donner • Wednesdays in 110/120 Jacobs Hall

Starting this Weds. 5 min break

The other view of tables: OLAP • OnLine Analytical Processing • Conceptually like an n-dimensional spreadsheet (Cube) • (Discrete) columns become dimensions • The goal is live interaction with numerical data for business intelligence

The other view of tables: OLAP From a table to a cube:

name classid Semester Grade Units Jones History105 F13 3.3 4.0 Jones DataScience194 S12 4.0 3.0 Jones French150 F14 3.7 4.0 Smith History105 S15 2.3 3.0 Smith DataScience194 F14 2.7 3.0 Smith French150 F13 3.0 4.0 From tables to OLAP cubes From a table to a cube:

name classid Semester Grade Units Jones History105 F13 3.3 4.0 Jones DataScience194 S12 4.0 3.0 Jones French150 F14 3.7 4.0 Smith History105 S15 2.3 3.0 Smith DataScience194 F14 2.7 3.0 Smith French150 F13 3.0 4.0

Variables used as qualifiers Variables we want to measure (In where, GroupBy clauses) Normally numeric Normally discrete Constructing OLAP cubes

name classid Semester Grade Units Jones History105 F13 3.3 4.0 Jones DataScience194 S12 4.0 3.0 … … … … … Cube Cube dimensions Semester values

Cell contents are Grade, Unit values Classid

Name Queries on OLAP cubes • Once the cube is defined, its easy to do aggregate queries by projecting along one or more axes. • E.g. to get student GPAs, we project the Grade field onto the student (Name) axis. • In fact, such aggregates are precomputed and maintained automatically in an OLAP cube, so queries are instantaneous.

Semester

Cell contents are Grade, Unit values Classid

Name OLAP • Slicing: fixing one or more variables

• Dicing: selecting a range of values for one or more variables

OLAP • Drilling Up/Down (change levels of a hierarchically-indexed variable)

• Pivoting: produce a two-axis view for viewing as a spreadsheet. Outline • To support real-time querying, OLAP DBs store aggregates of data values along many dimensions. • This works best if axes can be tree-structured. E.g time can be expressed as a hierarchy hour  day  week  month  year

OLAP tradeoffs • Aggregates increase space and the cost of updates. • On the other hand, since they are projections of data, or tree structures, the storage overhead can be small. • Aggregates are limited, but cover a lot of common cases: avg, stdev, min, max. • Operations (slice, dice, pivot, etc.) are conceptually simpler than SQL, but cover a lot of common cases. • Good integration with clients, e.g. spreadsheets, for visual interaction, although there is an underlying query language (MDX).

Numpy/Matlab and OLAP

• Numpy and Matlab have an efficient implementation of nd- arrays for dense data. • Indices must be integer, but you can implement general indices using dictionaries from indexval->int. • Slicing and dicing are available using index ranges: a[5,1:3,:] etc. • Roll-down/up involve aggregates along dimensions such as sum(a[3,4:6,:],2) • Pivoting involves index permutations (.transpose()) and aggregation over the other indices. • Limitation: MATLAB and Numpy currently only support dense nd-arrays (or sparse 2d arrays). What’s Wrong with Tables?

• Too limited in structure? • Too rigid? • Too old fashioned? What’s Wrong with (RDBMS) Tables?

• Indices: Typical RDBMS table storage is mostly indices – Cant afford this overhead for large datastores • Transactions: – Safe state changes require journals etc., and are slow • Relations: – Checking relations adds further overhead to updates • Sparse Data Support: – RDBMS Tables are very wasteful when data is very sparse – Very sparse data is common in modern data stores – RDBMS tables might have dozens of columns, modern data stores might have many thousands.

RDBMS tables – row based

Table:

sid name login age gpa

53831 Jones jones@cs 18 3.4 53831 Smith smith@ee 18 3.2

Represented as:

53831 Jones jones@cs 18 3.4

53831 Smith smith@ee 18 3.2 Tweet JSON Format

The tweet's unique ID. These Text of the tweet. IDs are roughly sorted & Consecutive duplicate tweets developers should treat them are rejected. 140 character as opaque (http://bit.ly/dCkppc). max (http://bit.ly/4ud3he). D {"id"=>12296272736,

E T "text"=>

A

C "An early look at Annotations:

E

R http://groups.google.com/group/twitter-api-announce/browse_thread/thread/fa5da2608865453", Tweet's

P

E "created_at"=>"Fri Apr 16 17:55:46 +0000 2010", creation

D "in_reply_to_user_id"=>nil, The ID of an existing tweet that date. "in_reply_to_screen_name"=>nil, this tweet is in reply to. Won't

s "in_reply_to_status_id"=>nil ' be set unless the author of the r The screen name &

. o "favorited"=>false,

D

h referenced tweet is mentioned.

I

t user ID of replied to

Truncated to 140

r u "truncated"=>false,

e a tweet author. characters. Only

s

e "user"=>

u

h possible from SMS. The author's

T {"id"=>6253282, user name. The author's "screen_name"=>"twitterapi", biography. "name"=>"Twitter API", The author's

. c "description"=> screen name.

n

y

s s "The Real Twitter API. I tweet about API changes, service issues and

i

f

h

o

T happily answer questions about Twitter and our API. Don't get an answer? It's on my website.",

t

.

u The author's t "url"=>"http://apiwiki.twitter.com",

o

e

t URL.

e

e "location"=>"San Francisco, CA",

w The author'sThe "location". tweet's This is a free-formunique text field, ID. and These Text of the tweet.

g

t

"profile_background_color"=>"c1dfee",

n e there are no guarantees on whether it can be geocoded.

a

h

t c "profile_background_image_url"=>

f t IDs are roughly sorted & Consecutive duplicate tweets

o

c

"http://a3.twimg.com/profile_background_images/59931895/twitterapi-background-new.png",

e

r

j Rendering information

o b "profile_background_tile"=>false,

h

o t for the author. Colors

u

d "profile_image_url"=>"http://a3.twimg.com/profile_images/689684365/api_normal.png",developers should treat them are rejected. 140 character

a

e

are encoded in hex

d e "profile_link_color"=>"0000ff",

d h values (RGB).

e T "profile_sidebar_border_color"=>"87bc44", The creation date b as opaque (http://bit.ly/dCkppc). max (http://bit.ly/4ud3he).

m "profile_sidebar_fill_color"=>"e0ff92", for this account.

e D {"id"=>12296272736, "profile_text_color"=>"000000", Whether this account has E "created_at"=>"Wed May 23 06:01:13 +0000 2007", contributors enabled (http://bit.ly/50npuu).

T "contributors_enabled"=>true, Number of

s

t

. "favourites_count"=>1, e "text"=> favorites this

A

s

e

a "statuses_count"=>1628, Number of

w user has.

h

t

C f r "friends_count"=>13, users this user

o e

"An early look at Annotations:

s

r

E is following.

u "time_zone"=>"Pacific Time (US & Canada)", The timezone and offset

e

b

s i "utc_offset"=>-28800, (in seconds) for this user.

m

R

h

t u http://groups.google.com/group/twitter-api-announce/browse_thread/thread/fa5da2608865453","lang"=>"en", The user's selected Tweet's

N P "protected"=>false, language. E "created_at"=>"Fri "followers_count"=>100581, Apr 16 17:55:46 +0000 2010", creation "geo_enabled"=>true,

.

D

)

Whether this user is protected

7 o "notifications"=>false, DEPRECATED

7 e "in_reply_to_user_id"=>nil, or not. If the user is protected, g Y date. "following"=>true, in this context Number of The ID of an existing tweet that

F

s then this tweet is not visible

p a "verified"=>true}, followers for

4

h Whether this user / except to "friends".

r y

l "in_reply_to_screen_name"=>nil, "contributors"=>[3191321], this user.

. e has a verified badge.

t

i this tweet is in reply to. Won't

s

b "geo"=>nil,

u

/

/

:

s i "coordinates"=>nil,

p s DEPRECATED

h t "in_reply_to_status_id"=>nil

'

t

t The contributors' (if any) user be set unless the author of the

r h "place"=>

r

( The screen name &

e The place ID

. IDs (http://bit.ly/50npuu).

h d {"id"=>"2b6ff8c22edd9576",

t

o

e e l "favorited"=>false,

D h b "url"=>"http://api.twitter.com/1/geo/id/2b6ff8c22edd9576.json",

h referenced tweet is mentioned.

a

I

t W user ID of replied to n "name"=>"SoMa", The URL to fetch a detailed Truncated to 140

e

r The printable names of this place u "truncated"=>false, "full_name"=>"SoMa, San Francisco", polygon for this place

e a "place_type"=>"neighborhood", tweet author.

. characters. Only

s ) "country_code"=>"US", The type of this

p

e "user"=>

n

i

u

C place - can be a "country"=>"The United States of America",

t The place associated with this

1

h

e The author's L "neighborhood"possible from SMS. e "bounding_box"=> Tweet (http://bit.ly/b8L1Cp).

8

w

b

T

t / {"id"=>6253282, {"coordinates"=> or "city"

y

l

s

i

.

t [[[-122.42284884, 37.76893497], user name.

i h The author's

t

b

/

/ n [-122.3964, 37.76893497],

: "screen_name"=>"twitterapi",

o The country this place is in

p

t [-122.3964, 37.78752897],

t

g

h biography.

a

t

( The author's

[-122.42284884, 37.78752897]]],

o

N "name"=>"Twitter API",

e "type"=>"Polygon"}},

.

O

g

S

e

c "source"=>"web"}

J screen name.

h o The application The bounding T "description"=>

e

n

G that sent this box for this

y Map of a Twitter Status Object tweet place

s s "The Real Twitter API. I tweet about API changes, service issues and

i

Raffi Krikorian

f h 18 April 2010

o

T happily answer questions about Twitter and our API. Don't get an answer? It's on my website.",

t

.

u The author's t "url"=>"http://apiwiki.twitter.com",

o

e

t URL.

e

e "location"=>"San Francisco, CA",

w The author's "location". This is a free-form text field, and

g

t

"profile_background_color"=>"c1dfee",

n e there are no guarantees on whether it can be geocoded.

a

h

t c "profile_background_image_url"=>

f

t

o

c

"http://a3.twimg.com/profile_background_images/59931895/twitterapi-background-new.png",

e

r

j Rendering information

o b "profile_background_tile"=>false,

h

o t for the author. Colors

u

d "profile_image_url"=>"http://a3.twimg.com/profile_images/689684365/api_normal.png",

a

e

are encoded in hex

d e "profile_link_color"=>"0000ff",

d h values (RGB).

e T "profile_sidebar_border_color"=>"87bc44", The creation date

b

m "profile_sidebar_fill_color"=>"e0ff92", for this account. e "profile_text_color"=>"000000", Whether this account has "created_at"=>"Wed May 23 06:01:13 +0000 2007", contributors enabled (http://bit.ly/50npuu). "contributors_enabled"=>true, Number of

s

t

. "favourites_count"=>1, e favorites this

s

e

a "statuses_count"=>1628, Number of w user has.

h

t

f r "friends_count"=>13, users this user

o e

s r is following.

u "time_zone"=>"Pacific Time (US & Canada)", The timezone and offset

e

b

s i "utc_offset"=>-28800, (in seconds) for this user.

m

h

t u "lang"=>"en", The user's selected

N "protected"=>false, language. "followers_count"=>100581, "geo_enabled"=>true,

.

)

Whether this user is protected

7 o "notifications"=>false, DEPRECATED

7 e or not. If the user is protected,

g Y

"following"=>true, in this context Number of

F s then this tweet is not visible

p a "verified"=>true}, followers for

4

h Whether this user / except to "friends".

r y

l "contributors"=>[3191321], this user.

. e has a verified badge.

t

i

s

b "geo"=>nil,

u

/

/

:

s i "coordinates"=>nil,

p DEPRECATED

h

t

t

t

The contributors' (if any) user

r h "place"=>

(

e The place ID IDs (http://bit.ly/50npuu).

h d {"id"=>"2b6ff8c22edd9576",

t

e

e l

h b "url"=>"http://api.twitter.com/1/geo/id/2b6ff8c22edd9576.json",

a

W n "name"=>"SoMa", The URL to fetch a detailed e "full_name"=>"SoMa, San Francisco", The printable names of this place polygon for this place "place_type"=>"neighborhood",

.

) "country_code"=>"US", The type of this

p

n

i

C place - can be a "country"=>"The United States of America", t The place associated with this

1

e

L "neighborhood" e "bounding_box"=> Tweet (http://bit.ly/b8L1Cp).

8

w

b

t / {"coordinates"=> or "city"

y

l

s

i

.

t [[[-122.42284884, 37.76893497],

i

h

t

b

/

/ n [-122.3964, 37.76893497],

: o The country this place is in

p

t [-122.3964, 37.78752897],

t

g

h

a

t

(

[-122.42284884, 37.78752897]]],

o

N e "type"=>"Polygon"}},

O

g

S

e "source"=>"web"}

J

h o The application The bounding

T

e G that sent this box for this Map of a Twitter Status Object tweet place Raffi Krikorian 18 April 2010 RDBMS tables – row based Table: ID name login loc locid LAT LONG ALT State 52841 Jones jones@cs NULL NULL NULL NULL NULL NULL 53831 Smith smith@ee NULL NULL NULL NULL NULL NULL 55541 Brown brown@ee NULL NULL NULL NULL NULL NULL

Represented as:

52841 Jones jones@cs NULL NULL NULL NULL NULL NULL

53831 Smith smith@ee NULL NULL NULL NULL NULL NULL

55541 Brown brown@ee NULL NULL NULL NULL NULL NULL Column-based store Table: ID name login loc locid LAT LONG ALT State 52841 Jones jones@cs Albany 2341 38.4 122.7 100 CA 53831 Smith smith@ee NULL NULL NULL NULL NULL NULL 55541 Brown brown@ee NULL NULL NULL NULL NULL NULL

Represented as column (key-value) stores: ID name ID login ID loc ID locid 52841 Jones 52841 jones@cs 52841 Albany 52841 2341 53831 Smith 53831 smith@ee 55541 Brown 55541 brown@ee ID LAT ID LONG 52841 38.4 52841 122.7 … NoSQL Storage Systems

64 Column-Family Stores (Cassandra) A column-family groups data columns together, and is analogous to a table (and similar to Pandas DataFrame) Static column family from Apache Cassandra: Columns fixed

Dynamic Column family (Cassandra): Can add or remove columns from a dynamic column family CouchDB Data Model (JSON) • “With CouchDB, no schema is enforced, so new document types with new meaning can be safely added alongside the old.” • A CouchDB document is an object that consists of named fields. Field values may be: – strings, numbers, dates, – ordered lists, associative maps "Subject": "I like Plankton" "Author": "Rusty" "PostedDate": "5/23/2006" "Tags": ["plankton", "baseball", "decisions"] "Body": "I decided today that I don't like baseball. I like plankton."

66 Key-value stores • A key-value store is an even simpler approach. • It implements storage and retrieval of (key,value) pairs. • i.e. Basic functionality is that of a dictionary age[“john”] = 25. • But some KV-stores also implement sorting and indexing with the keys (e.g. leveldb). • You can build either column-based or row-based DBs on top of such KV-stores to optimize performance (e.g. omitting indices or ACID qualities).

67 Pig

• Started at Yahoo! Research • Features: – Expresses sequences of MapReduce jobs – Data model: nested “bags” of items • Schema is optional – Provides relational (SQL) operators (JOIN, GROUP BY, etc) – Easy to plug in Java functions An Example Problem

Suppose you have user Load Users Load Pages data in one file, website data in another, and you Filter by age need to find the top 5 most visited pages by Join on name users aged 18-25. Group on url

Count clicks

Order by clicks

Take top 5

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt In MapReduce

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt In Pig Latin

Users = load ‘users’ as (name, age); Filtered = filter Users by age >= 18 and age <= 25; Pages = load ‘pages’ as (user, url); Joined = join Filtered by name, Pages by user; Grouped = group Joined by url; Summed = foreach Grouped generate group, count(Joined) as clicks; Sorted = order Summed by clicks desc; Top5 = limit Sorted 5; store Top5 into ‘top5sites’;

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt Hive • Developed at Facebook • Relational database built on Hadoop – Maintains table schemas – SQL-like query language (which can also call Hadoop Streaming scripts) – Supports table partitioning, complex data types, sampling, some • Used for most Facebook jobs – Less than 1% of daily jobs at Facebook use MapReduce directly!!! (SQL – or PIG – wins!) – Note: Google also has several SQL-like systems in use.

Summary

• Two views of tables: – SQL/Pandas – OLAP/Numpy/Matlab • SQL, NoSQL – Non-Tabular Structures

Wednesday come to 110/120 Jacobs Hall for Pandas Lab