New Technologies for Data Management

Chaitan Baru

2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California

SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Why new technologies?
• Big Data Characteristics: Volume, Velocity, Variety
• Began as a Volume problem
  ▫ E.g. Web crawls …
  ▫ 1s to 100s of PB in a single cluster
• Velocity became an issue
  ▫ E.g. Clickstreams at Facebook
  ▫ 100,000 concurrent clients
  ▫ 6 billion messages/day
• Variety is now important
  ▫ Integrate, fuse data from multiple sources
  ▫ Synoptic view; solve complex problems


Varying opinions on Big Data

• “It’s all about velocity. The other issues are old problems.”

• “Variety is the important characteristic.”

• “Big Data is nothing new.”


Expanding Enterprise Systems: From OLTP to OLAP to “Catchall”
• OLTP: OnLine Transaction Processing
  ▫ E.g., point-of-sale terminals, e-commerce transactions, etc.
  ▫ High throughput; high rate of “single shot” read/write operations
  ▫ Large number of concurrent clients
  ▫ Robust, durable processing


Expanding Enterprise Systems: OLAP
• OLAP: Decision Support Systems requiring complex query processing
  ▫ Fast response times for complex SQL queries
  ▫ Queries over historical data
  ▫ Fewer concurrent clients
  ▫ Data aggregated from multiple systems
     Multiple business systems, e.g. Sales, Manufacturing, Financial
     Multiple locations, e.g. branches across the country, world
  ▫ Create Data Warehouses
     Leading to “data marts”


Expanding Enterprise Systems: “Catchment Area”

• Capture “all” data that an enterprise might care about
  ▫ E.g. information about customers from social networks, other contextual data
• “Data catchment” area now an element of enterprise architectures
• “Big data is about late binding”


Example from an industry standard
• TPC: Transaction Processing Performance Council
• TPC-Decision Support (TPC-DS)


Extending TPC-DS for Big Data

• Add semistructured and unstructured data

• Incorporate data mining operations in queries

Another Big Data Use Case: Deep Analytics Pipelines
• Sequence of processing steps:
  ▫ From data ingestion to data cleaning and transformation (ELT, sorting, SQL queries)
  ▫ To Machine Learning and Predictive Analytics
• Feed data from one step to the next

[Figure: pipeline stages: Acquisition/Recording → Extraction/Cleaning/Annotation → Integration/Aggregation/Representation → Analysis/Modeling → Interpretation]


Pipeline Example: User Modeling
• Based on clickstream processing
  ▫ Data Acquisition
     Collect data from Web logs across all servers
  ▫ Sessionization
     Pull together all user data for a single session
  ▫ Feature and Target Generation
     Targets are, say, clicks on ads of interest
     Features are the “prefix” operations which lead to that click
  ▫ Model Training
  ▫ Offline Scoring & Evaluation
  ▫ Batch Scoring & Upload to serving
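The sessionization step can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline described in the talk: the record layout (user, timestamp in seconds, URL), the 30-minute inactivity gap, and the name `sessionize` are all assumptions made for the example.

```python
from collections import defaultdict

# Assumption: a session ends after 30 minutes of inactivity.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user, timestamp, url) click events into per-user sessions.

    A new session starts whenever a user's next event arrives more than
    `gap` seconds after that user's previous event.
    """
    by_user = defaultdict(list)
    for user, ts, url in events:
        by_user[user].append((ts, url))

    sessions = []
    for user, clicks in by_user.items():
        clicks.sort()  # order each user's clicks by time
        current = [clicks[0]]
        for prev, nxt in zip(clicks, clicks[1:]):
            if nxt[0] - prev[0] > gap:
                sessions.append((user, current))
                current = []
            current.append(nxt)
        sessions.append((user, current))
    return sessions


# Hypothetical log records, as if merged from several web servers.
events = [
    ("u1", 0, "/home"),
    ("u2", 10, "/home"),
    ("u1", 60, "/search"),
    ("u1", 5000, "/ad/42"),  # more than 30 minutes later: new session
]
sessions = sessionize(events)
```

In a feature-and-target step, the clicks that precede an ad click within one session would become the features and the ad click itself the target.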


Two Approaches to Representing Big Data
• Data Warehousing
  ▫ Structured data repository
  ▫ With extensions for semistructured and unstructured data
• Pipeline / Catchment
  ▫ Unstructured data repository
  ▫ Data structured according to needs of an application (“late binding”)
[Figure: pipeline stages: Acquisition/Recording → Extraction/Cleaning/Annotation → Integration/Aggregation/Representation → Analysis/Modeling → Interpretation]


Transforming Data

• ETL vs ELT (vs NoETL)
• ETL
  ▫ Extract data from sources
  ▫ Transform data to fit a schema
  ▫ Load data into data management system
• ELT
  ▫ Extract data
  ▫ Load into data management system
  ▫ Transform data as needed by application(s)
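The difference between the two orderings can be made concrete with a small Python sketch. Everything here is illustrative: the JSON records, the field names, and the functions `etl`, `elt_load`, and `elt_transform` are assumptions, and plain Python lists stand in for the data management system.

```python
import json

# Hypothetical raw records; the second one lacks an "amount" field.
raw_lines = [
    '{"agency": "NSF", "amount": "1200", "county": "San Diego"}',
    '{"agency": "NIH", "county": "San Diego"}',
]


def etl(lines):
    """ETL: transform to a fixed schema before loading; misfits are dropped."""
    store = []
    for line in lines:
        rec = json.loads(line)
        if "amount" not in rec:
            continue  # does not fit the schema, so it never reaches the store
        store.append({"agency": rec["agency"], "amount": float(rec["amount"])})
    return store


def elt_load(lines):
    """ELT: load the raw records as they are; no schema is imposed yet."""
    return list(lines)


def elt_transform(store, fields):
    """ELT: each application applies its own schema at read time."""
    return [{f: json.loads(line).get(f) for f in fields} for line in store]


warehouse = etl(raw_lines)                # one record survives
catchment = elt_load(raw_lines)           # both raw records are kept
by_county = elt_transform(catchment, ["agency", "county"])
```

With ETL the record without an amount is lost at load time. With ELT it stays available to a later application that only needs agency and county.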


ELT Example: SciSIP Project

• NSF Science of Science and Innovation Policy project
• Compare trajectory of research productivity among San Diego, Philadelphia, and St. Louis regions over a given period of time
  ▫ Compute research spending, patent production, and other output quantities
  ▫ Perform comparative study
• Data acquisition
  ▫ Download social science data from Federal Government sources, e.g. research spending data from USASpending.gov; patent data from USPTO
• Analysis
  ▫ Analysis to be performed according to MSAs (Metropolitan Statistical Areas), but data provided according to state/county
  ▫ Question: What is San Diego? What is Philadelphia? What is St. Louis?


ELT
• Map (State, County Name) → MSA
  ▫ San Diego County is an MSA
  ▫ An MSA can also span states: some counties in PA, some counties in NJ, some counties in MD, some counties in DE
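The (State, County Name) to MSA mapping is the "transform" applied after loading, and it can be sketched as a lookup followed by an aggregation. The table below is a small illustrative subset, the short MSA labels are shorthand, the dollar amounts are made up, and `spending_by_msa` is a hypothetical helper; none of this is the project's actual code or data.

```python
from collections import defaultdict

# Illustrative lookup table: (state, county name) -> MSA label.
# Only a few counties are listed; a real table would cover every
# county that belongs to each MSA.
COUNTY_TO_MSA = {
    ("CA", "San Diego"): "San Diego",
    ("PA", "Philadelphia"): "Philadelphia",
    ("NJ", "Camden"): "Philadelphia",
    ("DE", "New Castle"): "Philadelphia",
    ("MD", "Cecil"): "Philadelphia",
}


def spending_by_msa(records):
    """Roll county-level (state, county, amount) records up to MSA totals."""
    totals = defaultdict(float)
    for state, county, amount in records:
        msa = COUNTY_TO_MSA.get((state, county))
        if msa is not None:  # counties outside the study regions are skipped
            totals[msa] += amount
    return dict(totals)


# Made-up amounts, for illustration only.
records = [
    ("CA", "San Diego", 100.0),
    ("PA", "Philadelphia", 80.0),
    ("NJ", "Camden", 20.0),
    ("CA", "Los Angeles", 500.0),
]
totals = spending_by_msa(records)
```

Because the raw county-level records are loaded unchanged, the same data could later be regrouped by a different regional definition without another extraction.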


SciSIP: Map institution names

Map names to Burnham Institute


Big Data: Agile Applications
• Data structuring determined by needs of agile applications
  ▫ Need for loose (flexible) schemas, and “late binding” of schemas
  ▫ Extract-Load-Transform (ELT) rather than Extract-Transform-Load (ETL)
• Support for processing pipelines
  ▫ Data runs through multi-step processing pipelines
  ▫ Building and execution of machine learning models using data
• Used for “event detection”
  ▫ User clicks
  ▫ Device failures
  ▫ Hospital re-admissions
  ▫ …


Hadoop Ecosystem

• Designed to deal with large amounts of semi/unstructured data
• Potentially a step backwards
  ▫ Exposes many internals of the system
• Can expect next generation of Hadoop technologies to bring back higher abstractions and performance optimizations


Hadoop Ecosystem Components
• HDFS
  ▫ Distributed File System (across a cluster)
• MapReduce
  ▫ Parallel execution environment, operating atop HDFS
• Pig
  ▫ Workflow-based system; specifies data processing workflows (assembler for data)
• HBase
  ▫ Column-based data management system
• Hive
  ▫ SQL-like (lite) interface to data stored in HDFS (and, through a storage handler, HBase). Not for OLTP. Uses MR and sequential scans with HDFS. Slow.
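The MapReduce programming model can be illustrated with a single-process word count in Python. This sketch only simulates the map, shuffle, and reduce phases in memory; it does not use the Hadoop API, and the function names and sample documents are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


# Hypothetical input records; on a cluster these would be blocks of a file.
documents = ["big data big clusters", "data pipelines"]

pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
```

On a real cluster the map calls run in parallel near the data blocks and the shuffle moves data across the network. Higher-level tools such as Pig and Hive generate this kind of job so that users do not write it by hand.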


Hadoop Ecosystem Components
• Mahout
  ▫ Machine learning libraries on Hadoop
  ▫ Recommendation mining, clustering, classification, frequent itemset mining
• Cassandra
  ▫ Distributed key-value store, with an implementation on Hadoop
• YARN: MR2
• Ambari: Hadoop cluster manager
• Avro: Data serialization system
• Chukwa: Monitoring system
• ZooKeeper: Config services


NoSQL
• “Not Only SQL”
  ▫ Misnomer
  ▫ Should have been named according to what they are, rather than what they are not
• What are they?
  ▫ Data stores designed to operate at large scale
  ▫ By “relaxing” (not incorporating) a number of features of relational databases
  ▫ By incorporating extensible storage mechanisms
• Storage abstractions
  ▫ Key-value stores
  ▫ Column-oriented stores
  ▫ Distributed storage
• Simple SQL operators, e.g. Select, Project, Group By, simple Join


Column Stores

• Column store systems
  ▫ BigTable, HBase, Vertica, MonetDB
• Store data not by rows but by column groups/families
  ▫ Don’t store duplicate values
  ▫ Apply compression
  ▫ Utilize reduction in storage space for data replication
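Why storing by column helps with duplicate values and compression can be seen in a short Python sketch. Run-length encoding is used here as one simple compression scheme; it is an assumption chosen for illustration, not a statement about what any of the systems named above use internally, and the sample rows are made up.

```python
from itertools import groupby

# Hypothetical rows: (site, genus, year).
rows = [
    ("site_a", "Panthera", 2011),
    ("site_a", "Panthera", 2011),
    ("site_a", "Tapirus", 2011),
    ("site_b", "Tapirus", 2012),
]

# Column layout: one list per attribute instead of one tuple per row.
columns = [list(col) for col in zip(*rows)]


def rle_encode(values):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


encoded = [rle_encode(col) for col in columns]
```

Values within one column tend to repeat, so each column collapses into a few runs, whereas the same values interleaved row by row would not. A query that touches only one attribute also reads only that column.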


Column Store Example
• BigTable
  ▫ http://www.slideshare.net/vanjakom/google-bigtable-paper-presentation-5078506, Komadinovic Vanja, Google


HBase: Column-oriented store
Storing data at scale


Vertica vs Postgres
• From the Tropical Ecology Assessment and Monitoring (TEAM) Network
  ▫ Conservation International, Smithsonian Institution, World Wildlife Fund

SELECT COUNT(*) FROM VIEW

 #   View                                          Vertica (sec)   Postgres (sec)
 1   iucn_species_data                             1               102
 2   liana_info_2                                  7               1
 3   netstats_climate_days_by_year                 0.5             14
 4   netstats_climate_records_by_site              1               3
 5   vegbank_tree1ha_taxonomies                    1               3
 6   sampling_unit_observed_time                   1               5
 7   sampling_unit_observed_time_version           3               19
 8   sampling_unit_sampling_time                   0.7             2
 9   site_protocol_block_event_record_number_v8    1               53
10   site_protocol_block_event_record_number_v9    1               57
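The timings in the Vertica vs Postgres comparison are easier to compare as ratios. The Python sketch below computes Postgres time divided by Vertica time for each view, using only the numbers reported above; the variable names are assumptions, and the ratios are descriptive only, since nothing is stated about run conditions.

```python
# (view name, Vertica seconds, Postgres seconds), copied from the comparison.
timings = [
    ("iucn_species_data", 1, 102),
    ("liana_info_2", 7, 1),
    ("netstats_climate_days_by_year", 0.5, 14),
    ("netstats_climate_records_by_site", 1, 3),
    ("vegbank_tree1ha_taxonomies", 1, 3),
    ("sampling_unit_observed_time", 1, 5),
    ("sampling_unit_observed_time_version", 3, 19),
    ("sampling_unit_sampling_time", 0.7, 2),
    ("site_protocol_block_event_record_number_v8", 1, 53),
    ("site_protocol_block_event_record_number_v9", 1, 57),
]

# Ratio > 1 means Vertica answered faster; ratio < 1 means Postgres did.
speedups = {name: postgres / vertica for name, vertica, postgres in timings}
faster_on_postgres = [name for name, ratio in speedups.items() if ratio < 1]
```

Nine of the ten views are faster on Vertica, by factors ranging from under 3 to about 100, while `liana_info_2` is the one view where Postgres is faster.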

1/ VIEW iucn_species_data

    CREATE VIEW iucn_species_data AS
    SELECT DISTINCT
        x.red_list_status_id,
        n.class,
        n.order_team AS "order",
        n.family,
        (((n.genus)::text || ' '::text) || (n.species)::text) AS species,
        a.year,
        a.unit_name AS "camera trap sampling unit id",
        a.latitude,
        a.longitude,
        a.country_name AS "country name",
        a.continent_name AS "continent name",
        a.short_name AS "TEAM site"
    FROM
        (
            (
                (
                    taxonomy_scientific_name n
                    JOIN (
                        SELECT DISTINCT
                            p.genus,
                            p.species,
                            s.short_name,
                            r.country_name,
                            t.continent_name,
                            u.unit_name,
                            u.latitude,
                            u.longitude,
                            date_part('year'::text, h.taken_time) AS year
                        FROM
                            (
                                (
                                    (
                                        (
                                            (
                                                (
                                                    tv_photo_animal p
                                                    JOIN tv_photo h ON ((p.photo_id = h.id))
                                                )
                                                JOIN tv_camera_trap_data c ON ((c.id = h.camera_trap_data_id))
                                            )
                                            JOIN sampling_units u ON ((u.id = c.camera_trap_id))
                                        )
                                        JOIN sites_team s ON ((u.site_id = s.site_id))
                                    )
                                    JOIN countries r ON ((s.country_id = r.country_id))
                                )
                                JOIN continents t ON ((r.continent_id = t.continent_id))
                            )
                    ) a ON ((((n.genus)::text = (a.genus)::text) AND ((n.species)::text = (a.species)::text)))
                )
                JOIN taxonomy_scientific_name_with_association x ON ((x.scientific_name_id = n.id))
            )
            JOIN taxonomy_other_information o ON ((o.id = x.other_information_id))
        )
    ORDER BY a.short_name, n.class, n.order_team, n.family, (((n.genus)::text || ' '::text) || (n.species)::text);


SQL on Hadoop

• Hive
  ▫ Runs on top of Hadoop
  ▫ Supports basic DDL and SQL
     Select, Project, Join, Group By
• Commercial Offerings
  ▫ Cloudera: Supports Apache. Impala
  ▫ Hortonworks Data Platform: Apache. Tez
  ▫ Pivotal: HAWQ
  ▫ MapR: Modified storage layer. Drill.


Apache Impala

• Almost the same as DB2 Parallel
  ▫ From 1995!


Pivotal HAWQ


Hortonworks Tez: Under Apache incubation vote


MapR Drill


SciDB: Array-based DBMS
• Unbounded non-uniform dimensions with holes
• Arbitrary nesting
• New operations such as “regrid” (mapping one array onto another)
• User-defined functions (UDF) (first-class citizens in SciDB)
• Sophisticated storage representations, including overlapping chunks
• AQL, an array query language that is similar to SQL
• AFL, a functional language that provides the same capabilities as AQL but with a functional syntax
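The idea behind a "regrid" operation, mapping a fine array onto a coarser one, can be shown conceptually in plain Python. This is not SciDB's AQL or AFL syntax and not its implementation; the function name, the choice of averaging as the aggregate, and the sample array are assumptions made for illustration.

```python
def regrid(array, block_rows, block_cols):
    """Map a 2-D array onto a coarser grid by averaging each block.

    Every output cell is the mean of one block_rows x block_cols block
    of the input array.
    """
    n_rows, n_cols = len(array), len(array[0])
    if n_rows % block_rows or n_cols % block_cols:
        raise ValueError("array shape must be a multiple of the block shape")

    coarse = []
    for r in range(0, n_rows, block_rows):
        out_row = []
        for c in range(0, n_cols, block_cols):
            block = [
                array[i][j]
                for i in range(r, r + block_rows)
                for j in range(c, c + block_cols)
            ]
            out_row.append(sum(block) / len(block))
        coarse.append(out_row)
    return coarse


# Hypothetical 4x4 input, reduced to a 2x2 array of block averages.
fine = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
coarse = regrid(fine, 2, 2)
```

In an array DBMS this kind of operation is expressed declaratively and executed chunk by chunk, which is one reason overlapping chunks matter for operations that look at neighboring cells.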


MongoDB

• Next talk
