E6895 Advanced Big Data Analytics Lecture 4: Data Store
Total Page:16
File Type:pdf, Size:1020Kb
E6895 Advanced Big Data Analytics Lecture 4: Data Store Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Chief Scientist, Graph Computing, IBM Watson Research Center E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, 2016 Columbia University Reference 2 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Spark SQL 3 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Spark SQL 4 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Apache Hive 5 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Using Hive to Create a Table 6 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Creating, Dropping, and Altering DBs in Apache Hive 7 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Another Hive Example 8 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Hive’s operation modes 9 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Using HiveQL for Spark SQL 10 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Hive Language Manual 11 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Using Spark SQL — Steps and Example 12 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Query testtweet.json Get it from Learning Spark Github ==> https://github.com/databricks/learning-spark/tree/master/files 13 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University SchemaRDD 14 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Row Objects Row objects represent records inside SchemaRDDs, and are simply fixed-length arrays of fields. 15 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Types stored by Schema RDDs 16 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Look at the Schema (not a complete screen shot) 17 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Another way to create SchemaRDD 18 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University JDBC Server Spark SQL provides JDBC connectivity, which is useful for connecting business intelligence tools to a Spark cluster and for sharing a cluster across multiple users. 19 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University User-Defined Functions (UDF) UDFs allow you to register custom functions in Python, Java, and Scala to call within SQL. This is a very popular way to expose advanced functionality to SQL users in an organization, so that these users can call into it without writing code. 20 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Streaming 21 E6893 Big Data Analytics – Lecture 9: Linked Big Data: Graph Computing © 2014 CY Lin, Columbia University Spark Streaming In Spark 1.1, Spark Streaming is available only in Java and Scala. Spark 1.2 has limited Python support. 22 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Spark Streaming architecture 23 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Spark Streaming with Spark’s components 24 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Try these examples 25 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Graph Database 26 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Big Data: “While enterprises struggle to consolidate systems and collapse redundant databases to enable greater operational, analytical, and collaborative consistencies, changing economic conditions have made this job more difficult. E-commerce, in particular, has exploded data management challenges along three dimensions: volumes, velocity and variety. In 2001/02, IT organizations much compile a variety of approaches to have at their disposal for dealing each.” – Doug Laney, Gartner, 2001 2 © 2015 IBM Corporation Graph is a missing pillar in the existing Big Data foundation UI / User App Builder Integration & Governance Graphs Streams Hadoop/Spark Data Explorer Warehouse Linked Big Data Connector Framework CM, RM, DM RDBMS Feeds Web 2.0 Email Web CRM, ERP File Systems Graph Computing is difficult because data cannot be easily partitioned 28 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Graph Database Example 29 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Forrester: Over 25% of enterprise will use Graph DB by 2017 TechRadar: Enterprise DBMS, Q12014 Graph DB is in the significant success trajectory, and with the highest business value in the upcoming DBs. 30 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University GraphDB has the largest Popularity Change among DBMS lately 31 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Graph Database key differentiator — native store In Relational DB, relationships are distributed and stored as tables Native Graph DB stores nodes and relationships directly, It makes retrieval efficient. Retrieving multi-step relationships is a 'graph traversal' problem Technology ==> Top Layer: Graph, Bottom Layer: Graph Cited “Graph Database” O’liey 2013 32 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University A usual example 33 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Query Example – I 34 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Query Examples – II & III Computational intensive 35 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Execution Time in the example of finding extended friends (by Neo4j) 36 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Modeling Order History as a Graph 37 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University A query language on Property Graph – Cypher 38 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Property Graph Example – Shakespeare 39 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Creating the Shakespeare Graph 40 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Query on the Shakespeare Graph 41 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Building Application Example – Collaborative Filtering 42 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University What is IBM System G? http://systemG.research.ibm.com (Internet) or http://systemG.ibm.com (IBM internal site) A Complete Graph Computing Suite — Toolkits, Solutions, & Cloud 43 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Download IBM System G Standard Edition (on-premise) http://systemg.research.ibm.com/download.html or http://www.ibm.com/developerworks/labs/ 44 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University IBM System G Tools — 8 categories • Graph Database: • Reasoning Engine: Reasoning • Native Store • Markovian & Bayesian Networks • GBase • Anomaly Detection Tools Intrinsic senses Intrinsic • Brain Analysis Tool • Scalable Middleware: • Parallel Prog. Lib. • Cognitive Networks: • Power Optimization • Deep Learning • Software Defined Env. • Emotion Analysis • Contextual Analytics: • Spatiotemporal Analytics: • Topological Analysis • Road Network Algorithms • Matching and Search Observations • Spatiotemporal Data Mining • Path and Flow • Spatiotemporal Indexing • Visual Analytics: • Mobile & Sensor Analytics: • Multivariate Graph • Mobile Security Tools • Heterogeneous Graph • Sensor Analytics Tools • Dynamic Graph • Big Graph 1345 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Similarity to the Brain Functions and Evolution Perception Judgement Abstract comprehension Observation Memory 46 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University IBM System G Graph Computing Tools Visualization Huge Network Network Dynamic Network Geo Network Graphical Model Visualization Propagation Visualization Visualization Visualization Analytics Graph Computing Tools APIs and Query Langauge Support Communities Graph Search Bayesian Networks Emotion Analytics Centralities Graph Matching Deep Learning Visual Sentiments Ego Features Shortest Paths SpatioTemporal Anomaly Detection Middleware Multi-Core In-Memory Distributed Memory GPU Graph Multi-Thread Graph RT Library Graph RT Library Computing Driver Graph RT Library Database Spark Streams Enterprise Graph Graph Data Interface (TinkerPop) Interface Interface Database Enhancer System G Assets Other GBase Performance Driven Open Graph Store System G Native Graph Store Source HBase Hardware File System (Linux FS, Hadoop HDFS, etc.) Hardware Server Cluster Mobile Mainframe Super (Linux & OS X) (CPU, CPU+GPU) Cloud ( iOS) (System Z & Power) Computer 47 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Graph Analytical Tools • Network topological analysis tools