E6895 Advanced Big Data Analytics Lecture 4:
Data Store
Ching-Yung Lin, Ph.D.
Adjunct Professor, Dept. of Electrical Engineering and Computer Science
Chief Scientist, Graph Computing, IBM Watson Research Center
E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, 2016 Columbia University Reference
Spark SQL
Apache Hive
Using Hive to Create a Table
Creating, Dropping, and Altering DBs in Apache Hive
Another Hive Example
Hive’s operation modes
Using HiveQL for Spark SQL
Hive Language Manual
Using Spark SQL: Steps and Example
Query testtweet.json
Get it from the Learning Spark GitHub repository: https://github.com/databricks/learning-spark/tree/master/files
SchemaRDD
Row Objects
Row objects represent records inside SchemaRDDs, and are simply fixed-length arrays of fields.
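As a rough illustration of "fixed-length arrays of fields", the sketch below uses Python's standard-library `namedtuple` as a stand-in for a Row. It is an analogy only, not the Spark SQL API, and the field names (`user`, `text`, `retweet_count`) are hypothetical.

```python
from collections import namedtuple

# Hypothetical schema for a tweet record; the field names are assumptions.
TweetRow = namedtuple("TweetRow", ["user", "text", "retweet_count"])

rows = [
    TweetRow("alice", "hello spark", 3),
    TweetRow("bob", "graphs are fun", 7),
]

# Fields can be read by position (like an array) or by name.
first = rows[0]
by_position = first[1]
by_name = first.text

# Every row has the same fixed number of fields.
lengths = {len(r) for r in rows}

# A query-like operation: texts ordered by retweet count, highest first.
top_texts = [
    r.text for r in sorted(rows, key=lambda r: r.retweet_count, reverse=True)
]
```

Positional and named access return the same value, which mirrors the idea that a row is an ordered set of fields described by a schema.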
Types stored by Schema RDDs
Look at the Schema
(not a complete screenshot)
Another way to create SchemaRDD
JDBC Server
Spark SQL provides JDBC connectivity, which is useful for connecting business intelligence tools to a Spark cluster and for sharing a cluster across multiple users.
User-Defined Functions (UDF)
UDFs allow you to register custom functions in Python, Java, and Scala to call within SQL.
This is a very popular way to expose advanced functionality to SQL users in an organization, so that these users can call into it without writing code.
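The registration pattern can be tried without a Spark cluster. The sketch below uses `sqlite3` from Python's standard library instead of Spark SQL, so it only illustrates the general idea of registering a custom function and calling it from SQL. The table, the data, and the function name `strLenPython` are assumptions made for this example.

```python
import sqlite3

def str_len(text):
    """Custom function to expose to SQL; returns the length of a string."""
    return len(text) if text is not None else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (text TEXT)")
conn.executemany("INSERT INTO tweets VALUES (?)", [("hello",), ("big data",)])

# Register the Python function under the SQL name strLenPython (1 argument).
conn.create_function("strLenPython", 1, str_len)

# SQL users can now call the function without writing any Python.
lengths = conn.execute(
    "SELECT text, strLenPython(text) FROM tweets ORDER BY rowid"
).fetchall()
conn.close()
```

Once registered, the function is available to any SQL statement on that connection, which is the same division of labor the slide describes: one person writes the function, many users call it from SQL.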
Streaming
Spark Streaming
In Spark 1.1, Spark Streaming is available only in Java and Scala. Spark 1.2 has limited Python support.
Spark Streaming architecture
Spark Streaming with Spark’s components
Try these examples
Graph Database
Big Data: “While enterprises struggle to consolidate systems and collapse redundant databases to enable greater operational, analytical, and collaborative consistencies, changing economic conditions have made this job more difficult. E-commerce, in particular, has exploded data management challenges along three dimensions: volumes, velocity and variety. In 2001/02, IT organizations must compile a variety of approaches to have at their disposal for dealing with each.” – Doug Laney, Gartner, 2001
Graph is a missing pillar in the existing Big Data foundation (© 2015 IBM Corporation)
Layers, top to bottom:
• UI / User
• App Builder
• Integration & Governance
• Graphs, Streams, Hadoop/Spark, Data Explorer, Warehouse
• Linked Big Data
• Connector Framework
• CM, RM, DM; RDBMS; Feeds; Web 2.0; Email; Web; CRM, ERP; File Systems
Graph Computing is difficult because data cannot be easily partitioned
Graph Database Example
Forrester: Over 25% of enterprises will use Graph DB by 2017
Source: TechRadar: Enterprise DBMS, Q1 2014
Graph DB is on the significant-success trajectory and has the highest business value among the upcoming DBs.
Graph DB has the largest recent popularity change among DBMSs
Graph Database key differentiator: native store
In Relational DB, relationships are distributed and stored as tables
A native graph DB stores nodes and relationships directly, which makes retrieval efficient.
Retrieving multi-step relationships is a 'graph traversal' problem
Technology ==> Top Layer: Graph, Bottom Layer: Graph
Cited from “Graph Databases”, O’Reilly, 2013
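To make the "graph traversal" framing concrete, here is a minimal sketch in plain Python. It assumes a small hypothetical friendship graph kept as adjacency lists, so each hop is a direct neighbor lookup rather than a table join. It does not represent how any particular graph database is implemented.

```python
from collections import deque

# Hypothetical friendship graph stored as adjacency lists.
friends = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol", "erin"],
    "erin": ["dave"],
}

def within_hops(graph, start, max_hops):
    """Return {person: hop distance} for everyone reachable in 1..max_hops steps."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nbr in graph.get(node, []):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    del dist[start]  # the start node is not its own friend
    return dist

friends_of_friends = within_hops(friends, "alice", 2)
```

In a relational layout the same two-step question would typically need a self-join per hop, which is why multi-step relationship queries are the usual motivating case for a native graph store.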
A usual example
Query Example – I
Query Examples – II & III
Computationally intensive
Execution Time in the example of finding extended friends (by Neo4j)
Modeling Order History as a Graph
A query language on Property Graph – Cypher
Property Graph Example – Shakespeare
Creating the Shakespeare Graph
Query on the Shakespeare Graph
Building Application Example – Collaborative Filtering
What is IBM System G?
http://systemG.research.ibm.com (Internet) or http://systemG.ibm.com (IBM internal site)
A Complete Graph Computing Suite — Toolkits, Solutions, & Cloud
Download IBM System G Standard Edition (on-premise)
http://systemg.research.ibm.com/download.html
or http://www.ibm.com/developerworks/labs/
IBM System G Tools: 8 categories
• Graph Database:
  • Native Store
  • GBase
• Reasoning Engine:
  • Markovian & Bayesian Networks
  • Anomaly Detection Tools
  • Brain Analysis Tool
• Scalable Middleware:
  • Parallel Prog. Lib.
  • Power Optimization
  • Software Defined Env.
• Cognitive Networks:
  • Deep Learning
  • Emotion Analysis
• Contextual Analytics:
  • Topological Analysis
  • Matching and Search
  • Path and Flow
• Spatiotemporal Analytics:
  • Road Network Algorithms
  • Spatiotemporal Data Mining
  • Spatiotemporal Indexing
• Visual Analytics:
  • Multivariate Graph
  • Heterogeneous Graph
  • Dynamic Graph
  • Big Graph
• Mobile & Sensor Analytics:
  • Mobile Security Tools
  • Sensor Analytics Tools
Similarity to the Brain Functions and Evolution
Perception, Judgement, Abstract comprehension, Observation, Memory
IBM System G Graph Computing Tools
• Visualization: Huge Network Visualization, Network Propagation, Dynamic Network Visualization, Geo Network Visualization, Graphical Model Visualization
• Analytics (Graph Computing Tools APIs and Query Language Support):
  • Communities, Centralities, Ego Features
  • Graph Search, Graph Matching, Shortest Paths
  • Bayesian Networks, Deep Learning, SpatioTemporal
  • Emotion Analytics, Visual Sentiments, Anomaly Detection
• Middleware: Multi-Core Multi-Thread Graph RT Library, In-Memory Graph RT Library, Distributed Memory Graph RT Library, GPU Graph Computing Driver
• Database: Graph Data Interface (TinkerPop), Spark Interface, Streams Interface, Enterprise Graph Database Enhancer
  • System G Assets: GBase, Performance Driven Graph Store, System G Native Graph Store
  • Other Open Source: HBase
• File System (Linux FS, Hadoop HDFS, etc.)
• Hardware: Server (Linux & OS X), Cluster (CPU, CPU+GPU), Cloud, Mobile (iOS), Mainframe (System Z & Power), Super Computer
Graph Analytical Tools
• Network topological analysis tools
  • Centralities (degree, closeness, betweenness)
  • PageRank
  • Communities (connected component, K-core, triangle count, clustering coefficient)
  • Neighborhood (egonet, K-neighborhood)
• Graph matching and search tools
  • Graph search/filter by label, vertex/edge properties (including geo locations)
  • Graph matching
  • Collaborative filtering
• Graph path and flow tools
  • Shortest paths
  • Top K-shortest paths
• Probabilistic graphical model tools
  • Bayesian network inference
  • Deep learning
[Figure panels: K-core, K-neighborhood, graph matching (query graph vs. data graph), collaborative filtering, bipartite weighted graph matching, top K-shortest paths, Bayesian network inference]
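Two of the topological measures named above, degree centrality and clustering coefficient (via triangle count), can be sketched in a few lines of plain Python. The graph below is hypothetical and the functions are textbook definitions, not the System G implementations.

```python
from itertools import combinations

# Hypothetical undirected graph as adjacency sets.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

def degree_centrality(g):
    """Degree of each vertex divided by the maximum possible degree (n - 1)."""
    n = len(g)
    return {v: len(nbrs) / (n - 1) for v, nbrs in g.items()}

def triangle_count(g, v):
    """Number of triangles that include vertex v."""
    return sum(1 for u, w in combinations(sorted(g[v]), 2) if w in g[u])

def clustering_coefficient(g, v):
    """Fraction of neighbor pairs of v that are themselves connected."""
    k = len(g[v])
    if k < 2:
        return 0.0
    return triangle_count(g, v) / (k * (k - 1) / 2)

centrality = degree_centrality(graph)
```

Here vertex "a" touches every other vertex, so its degree centrality is 1.0, while only one of its three neighbor pairs is connected.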
Performance comparisons
[Figure: bipartite graph linking users and items]
People who bought this also bought that...
Recommendation ==> 2-hop traversal & ranking
For Visualization ==> 4-hop traversal & rankings
IBM KnowledgeView 1-year Access Log: 72.3K users, 82.1K docs, and 1.74 million downloads
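The "2-hop traversal & ranking" pattern for recommendation can be sketched on a toy bipartite graph. The purchase log, item names, and the `also_bought` helper below are all hypothetical; the code only illustrates the traversal idea and does not reproduce the benchmarked systems.

```python
from collections import Counter

# Hypothetical purchase log as (user, item) edges of a bipartite graph.
purchases = [
    ("u1", "book"), ("u1", "lamp"),
    ("u2", "book"), ("u2", "lamp"), ("u2", "pen"),
    ("u3", "book"), ("u3", "pen"),
    ("u4", "mug"),
]

# Build both directions of the bipartite adjacency.
items_of = {}
users_of = {}
for user, item in purchases:
    items_of.setdefault(user, set()).add(item)
    users_of.setdefault(item, set()).add(user)

def also_bought(item, top_k=3):
    """Hop 1: item -> buyers. Hop 2: buyers -> their other items. Rank by count."""
    counts = Counter()
    for user in users_of.get(item, ()):
        for other in items_of[user]:
            if other != item:
                counts[other] += 1
    # Sort by count descending, then by name for a deterministic order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

recommendations = also_bought("book")
```

The 4-hop case used for visualization would repeat the same two hops starting from each recommended item, which is where traversal cost grows quickly.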
[Chart: performance comparison across products, startup, open-source systems, and System G; values shown as TBD]
*All performance numbers are preliminary
IBM System G Visualizer – Graph Data Explorer
[Screenshot regions: Visual Query Panel, Visualization Panel, Visual Mapping Panel, Console Panel]
Panel Introduction
• Visual Query Panel
  • Providing users a friendly UI to create, delete, and query graphs from the System G native store.
• Console Panel
  • Display all the interaction information with System G native store.
  • Execute user defined query.
• Visualization Panel
  • Rendering graph structure on screen for users to visually explore graphs.
• Visual Mapping Panel
  • Customizing rendering effects to show desired graph information.
Visual Query Panel – Creating a graph
1: Click “Create Graph”
2: Prepare the graph data
3: Set the graph name
4: Upload node files
5: Upload edge files and finalize creating the graph
Visual Query Panel – Visual Query Builder
analytics_degree <= 10 and (group == "center" or group == "guard")
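The filter expression built in the query builder reads naturally as a Boolean predicate over node properties. The sketch below evaluates it in plain Python over a hypothetical node list; the property values are made up, and this is not how the Visualizer executes queries.

```python
# Hypothetical nodes carrying the two properties used in the query.
nodes = [
    {"id": "n1", "analytics_degree": 4, "group": "center"},
    {"id": "n2", "analytics_degree": 12, "group": "center"},
    {"id": "n3", "analytics_degree": 9, "group": "guard"},
    {"id": "n4", "analytics_degree": 2, "group": "forward"},
]

def matches(node):
    """analytics_degree <= 10 and (group == "center" or group == "guard")"""
    return node["analytics_degree"] <= 10 and node["group"] in ("center", "guard")

# Keep only the nodes that satisfy the whole expression.
selected = [n["id"] for n in nodes if matches(n)]
```

Node n2 fails the degree bound and n4 fails the group test, so only n1 and n3 are selected.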
Console Panel – User typed query
Query with no graph returned
Query with graph returned
Visual Mapping Panel
Name: Functionality
• Background Color: Change the background color of the canvas.
• Node Default Color: Set a unified color for all nodes.
• Edge Default Color: Set a unified color for all edges.
• Show Nodes: Set the visibility of all nodes.
• Node Color Mapping: Assign color to nodes according to selected property of nodes.
• Node Size Mapping: Assign the radius of nodes according to selected property of nodes.
• Filter Node Label by Node Size: Selectively show the node label according to the threshold. Labels will be shown for the nodes whose size is larger than the threshold.
• Node Label Mapping: Set the label value according to selected property of nodes.
• Node Label Size: Adjust the font size of node labels.
• Show Edges: Set the visibility of all edges.
• Edge Color Mapping: Assign color to edges according to selected property of edges.
• Edge Label Mapping: Set the label value according to selected property of edges.
• Edge Label Size: Adjust the font size of edge labels.
• Edge Thickness Mapping: Assign thickness to edges according to selected property of edges.
• Edge Style: Select the rendering style of edges. For directed graphs, users can also choose whether to show the arrows.
Visualization Panel – Before Customization
Visualization Panel – After Customization
Visualization Panel – Further Customization
Users can further specify colors by clicking the color blocks shown in the legend area
http://systemg.ibm.com/tool/visualizer/
Characteristics of IBM System G Graph Analytics
• Cover a wide range of graph analytics to support many application use cases in different domains, e.g.:
  • Enterprise social network analysis, expertise search, knowledge recommendation
  • Financial/security anomaly/fraud detection
  • Social media monitoring and analysis
  • Cellular network analytics in Telco operation
  • Patient and disease analytics for healthcare
  • Live neural brain network analysis
• Provide efficient in-memory computation as well as on-disk persistence
• Optimal performance enabled by IBM System G graph database technologies that focus on efficient use of available computing resources with architecture-aware design to leverage system/architecture advantages
  • Single-threaded, concurrent (shared memory), and distributed versions
• Multiple deployment options to suit different customer preferences and needs
  • C++ executables in Linux environments (Redhat CentOS, Ubuntu, Mac OS X, Power)
  • TinkerPop (Blueprints) API
  • gShell (a shell-like environment with interactive, batch, and server/client modes to operate multiple graph stores simultaneously)
  • Gremlin console
  • REST API Web service
  • Python wrapper
Compatible with TinkerPop Interface (Apache Incubator)
http://sql2gremlin.com
http://tinkerpop.incubator.apache.org
Write Python Code based on System G
g.analytic_find_path(src="1",sink="2")
g.analytic_find_path(src="1",sink="2",label="b")
Output of the above Python script
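The script's output is not reproduced here. As a rough stand-in for what a path query with an optional edge-label filter might do, the sketch below runs a breadth-first search in plain Python. The `find_path` function and the edge data are hypothetical; the actual semantics of `analytic_find_path` in the System G Python wrapper are not confirmed by this example.

```python
from collections import deque

# Hypothetical directed edges as (source, target, label).
edges = [
    ("1", "3", "a"), ("3", "2", "a"),
    ("1", "4", "b"), ("4", "5", "b"), ("5", "2", "b"),
]

def find_path(edges, src, sink, label=None):
    """Breadth-first search for a fewest-hop path; optionally restrict edge labels."""
    adjacency = {}
    for u, v, lab in edges:
        if label is None or lab == label:
            adjacency.setdefault(u, []).append(v)
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == sink:
            # Walk parent pointers back to the source, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adjacency.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None  # no path under the given label restriction

any_label = find_path(edges, "1", "2")
only_b = find_path(edges, "1", "2", label="b")
```

Without a label the search returns the two-hop route through vertex 3; restricting to label "b" forces the longer route through vertices 4 and 5.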
Highly Scalable Graph Database
http://www.graph500.org
Nov 2015: IBM Research’s software powered all Top 3 winners of the Graph 500 benchmark and 9 out of the Top 10 winners (supercomputers in the US, Japan, France, UK, and Germany; except in China).
[Figure: ranking chart of systems including Sequoia, K computer, TSUBAME 2.5, FX10, TSUBAME-KFC, SGI UV2000, and a 4-way Xeon server, in CPU-only and GPU configurations]
The Nov 2015 winner, the K computer supercomputer with 83K nodes and 663K cores, achieved graph search of up to 38,621,400,000 vertices per second.
Comparison of graph size
[Figure: scatter plot of graph sizes. x-axis: log2(n), no. of nodes; y-axis: log2(m), no. of edges; reference lines at 1 billion and 1 trillion nodes and at 1 billion and 1 trillion edges. Plotted graphs: Human Brain Project, Graph500 (Toy, Mini, Small, Medium, Large, Huge), Symbolic Network, Twitter (tweets/day), USA Road Network (USA-road-d.NY.gr, USA-road-d.LKS.gr, USA-road-d.USA.gr)]