Hmi + Clds = New Sdsc Center
New Technologies for Data Management
Chaitan Baru
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Why new technologies?
• Big Data Characteristics: Volume, Velocity, Variety
• Began as a Volume problem
  ▫ E.g. Web crawls …
  ▫ 1s to 100s of PB in a single cluster
• Velocity became an issue
  ▫ E.g. Clickstreams at Facebook
  ▫ 100,000 concurrent clients
  ▫ 6 billion messages/day
• Variety is now important
  ▫ Integrate, fuse data from multiple sources
  ▫ Synoptic view; solve complex problems

Varying opinions on Big Data
• “It’s all about velocity. The other issues are old problems.”
• “Variety is the important characteristic.”
• “Big Data is nothing new.”

Expanding Enterprise Systems: From OLTP to OLAP to “Catchall”
• OLTP: OnLine Transaction Processing
  ▫ E.g., point-of-sale terminals, e-commerce transactions, etc.
  ▫ High throughput; high rate of “single shot” read/write operations
  ▫ Large number of concurrent clients
  ▫ Robust, durable processing

Expanding Enterprise Systems: OLAP
• OLAP: Decision Support Systems requiring complex query processing
  ▫ Fast response times for complex SQL queries
  ▫ Queries over historical data
  ▫ Fewer concurrent clients
  ▫ Data aggregated from multiple systems
    ▪ Multiple business systems, e.g. Sales, Manufacturing, Financial
    ▪ Multiple locations, e.g. branches across the country, world
  ▫ Create Data Warehouses
    ▪ Leading to “data marts”

Expanding Enterprise Systems: “Catchment Area”
• Capture “all” data that an enterprise might care about
  ▫ E.g. information about customers from social networks, other contextual data
• “Data catchment” area now an element of enterprise architectures
• “Big data is about late binding”

Example from an industry standard
• TPC: Transaction Processing Performance Council
• TPC-Decision Support (TPC-DS)

Extending TPC-DS for Big Data
• Add semistructured and unstructured data
• Incorporate data mining operations in queries

Another Big Data Use Case: Deep Analytics Pipelines
• Sequence of processing steps:
  ▫ From data ingestion to data cleaning and transformation (ELT, sorting, SQL queries)
  ▫ To Machine Learning and Predictive Analytics
• Feed data from one step to the next
Pipeline stages: Acquisition/Recording → Extraction/Cleaning/Annotation → Integration/Aggregation/Representation → Analysis/Modeling → Interpretation

Pipeline Example: User Modeling
• Based on clickstream processing
  ▫ Data Acquisition
    ▪ Collect data from Web logs across all servers
  ▫ Sessionization
    ▪ Pull together all user data for a single session
  ▫ Feature and Target Generation
    ▪ Targets are, say, clicks on ads of interest
    ▪ Features are the “prefix” operations which lead to that click
  ▫ Model Training
  ▫ Offline Scoring & Evaluation
  ▫ Batch Scoring & Upload to serving

Two Approaches to Representing Big Data
• Data Warehousing
  ▫ Structured data repository
  ▫ With extensions for semistructured and unstructured data
• Pipeline / Catchment
  ▫ Unstructured data repository
  ▫ Data structured according to needs of an application (“late binding”)
Pipeline stages: Acquisition/Recording → Extraction/Cleaning/Annotation → Integration/Aggregation/Representation → Analysis/Modeling → Interpretation

Transforming Data
• ETL vs ELT (vs NoETL)
• ETL
  ▫ Extract data from sources
  ▫ Transform data to fit a schema
  ▫ Load data into data management system
• ELT
  ▫ Extract data
  ▫ Load into data management system
  ▫ Transform data as needed by application(s)

ELT Example: SciSIP Project
• NSF Science of Science and Innovation Policy project
• Compare trajectory of research productivity among San Diego, Philadelphia, and St. Louis regions over a given period of time
  ▫ Compute research spending
  ▫ Measure patent production and other output quantities
  ▫ Perform comparative study
• Data acquisition
  ▫ Download social science data from Federal Government sources, e.g. research spending data from USASpending.gov; patent data from USPTO
• Analysis
  ▫ Analysis to be performed according to MSAs (Metropolitan Statistical Areas), but data provided according to state/county
  ▫ Question: What is San Diego? What is Philadelphia? What is St. Louis?
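
The state/county versus MSA mismatch above is a natural fit for ELT: load the records as delivered, then map them to MSAs only when the analysis needs it. The following is a minimal sketch of that idea, not the project's actual code. The record layout, the dollar amounts, the `COUNTY_TO_MSA` lookup, and the function name `spending_by_msa` are all assumptions made for illustration, and the county list is deliberately incomplete.

```python
from collections import defaultdict

# Hypothetical raw records, kept exactly as loaded (state/county granularity).
# The amounts are made-up placeholders, not real spending figures.
raw_spending = [
    {"state": "CA", "county": "San Diego", "amount": 100.0},
    {"state": "PA", "county": "Philadelphia", "amount": 80.0},
    {"state": "NJ", "county": "Camden", "amount": 20.0},
    {"state": "DE", "county": "New Castle", "amount": 10.0},
    {"state": "CA", "county": "Los Angeles", "amount": 500.0},
]

# Illustrative and incomplete (state, county) -> MSA lookup. A real study
# would take this from an official delineation file.
COUNTY_TO_MSA = {
    ("CA", "San Diego"): "San Diego",
    ("PA", "Philadelphia"): "Philadelphia",
    ("NJ", "Camden"): "Philadelphia",
    ("DE", "New Castle"): "Philadelphia",
}


def spending_by_msa(records, county_to_msa):
    """Apply the 'T' step late: aggregate county-level rows per MSA."""
    totals = defaultdict(float)
    for rec in records:
        msa = county_to_msa.get((rec["state"], rec["county"]))
        if msa is None:
            continue  # county is outside the MSAs under study
        totals[msa] += rec["amount"]
    return dict(totals)


totals = spending_by_msa(raw_spending, COUNTY_TO_MSA)
print(totals)
```

Because the raw rows are never rewritten, changing the definition of a region only means changing the lookup table and re-running the transform.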
ELT: Mapping counties to MSAs
• San Diego County is an MSA
• Some counties in PA, NJ, MD, and DE
• ELT
  ▫ Map (State, County Name) → MSA

SciSIP: Map institution names
• Map names to Burnham Institute

Big Data: Agile Applications
• Data structuring determined by needs of agile applications
  ▫ Need for loose (flexible) schemas, and “late binding” of schemas
  ▫ Extract-Load-Transform (ELT) rather than Extract-Transform-Load (ETL)
• Support for processing pipelines
  ▫ Data runs through multi-step processing pipelines
  ▫ Building and execution of machine learning models using data
• Used for “event detection”
  ▫ User clicks
  ▫ Device failures
  ▫ Hospital re-admissions
  ▫ …

Hadoop Ecosystem
• Designed to deal with large amounts of semi/unstructured data
• Potentially a step backwards
  ▫ Exposes many internals of the system
• Can expect next generation of Hadoop technologies to bring back higher abstractions and performance optimizations

Hadoop Ecosystem Components
• HDFS
  ▫ Distributed File System (across a cluster)
• MapReduce
  ▫ Parallel execution environment, operating atop HDFS
• Pig
  ▫ Workflow-based system; specifies data processing workflows (assembler for data)
• HBase
  ▫ Column-based data management system
• Hive
  ▫ SQL-like (lite) interface to HBase. Not for OLTP. Uses MapReduce (MR) and sequential scans with HDFS. Slow.

Hadoop Ecosystem Components
• Mahout
  ▫ Machine learning libraries on Hadoop
  ▫ Recommendation mining, clustering, classification, frequent itemset mining
• Cassandra
  ▫ Distributed key-value store. With an implementation on Hadoop.
• YARN: MR2
• Ambari: Hadoop cluster manager
• Avro: Data serialization system
• Chukwa: Monitoring system
• ZooKeeper: Config services

NoSQL Databases
• “Not Only SQL”
  ▫ Misnomer
  ▫ Should have been named according to what they are, rather than what they are not
• What are they?
  ▫ Data stores designed to operate at large scale
  ▫ By “relaxing” (not incorporating) a number of features of relational databases
  ▫ By incorporating extensible storage mechanisms
• Storage abstractions
  ▫ Key-value stores
  ▫ Column-oriented stores
  ▫ Distributed storage
• Simple SQL operators, e.g. Select, Project, Group By, simple Join

Column Stores
• Column store systems
  ▫ BigTable, HBase, Vertica, MonetDB
• Store data not by rows but by column groups/families
  ▫ Don’t store duplicate values
  ▫ Apply compression
  ▫ Utilize reduction in storage space for data replication
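
To make the "don't store duplicate values" and "apply compression" points concrete, here is a small sketch of two generic techniques often associated with column-oriented layouts: dictionary encoding and run-length encoding. It is an illustration only and does not reflect the on-disk format of BigTable, HBase, Vertica, or MonetDB. The sample rows and the helper names (`to_columns`, `dictionary_encode`, `run_length_encode`, `decode`) are assumptions introduced for this example.

```python
from itertools import groupby

# Hypothetical row-oriented input: (state, county) pairs.
rows = [
    ("CA", "San Diego"),
    ("CA", "San Diego"),
    ("CA", "Los Angeles"),
    ("PA", "Philadelphia"),
    ("PA", "Philadelphia"),
]


def to_columns(rows):
    """Pivot a list of row tuples into one list per column."""
    return [list(col) for col in zip(*rows)]


def dictionary_encode(column):
    """Store each distinct value once; the column becomes integer codes."""
    dictionary, index, codes = [], {}, []
    for value in column:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    return [(code, sum(1 for _ in run)) for code, run in groupby(codes)]


def decode(dictionary, runs):
    """Rebuild the original column from its dictionary and runs."""
    return [dictionary[code] for code, length in runs for _ in range(length)]


state_col, county_col = to_columns(rows)
state_dict, state_codes = dictionary_encode(state_col)
state_runs = run_length_encode(state_codes)
print(state_dict, state_runs)
```

Storing a column contiguously is what makes the repeats adjacent, so the five state values shrink to a two-entry dictionary plus two runs, and `decode` recovers the original column.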