E6895 Advanced Big Data Analytics Lecture 4:

Data Store

Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Chief Scientist, Graph Computing, IBM Watson Research Center

E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, 2016 Columbia University Reference

2 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Spark SQL

3 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Spark SQL

4 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Apache Hive

5 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Using Hive to Create a Table

6 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Creating, Dropping, and Altering DBs in Apache Hive

7 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Another Hive Example

8 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Hive’s operation modes

9 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Using HiveQL for Spark SQL

10 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Hive Language Manual

11 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Using Spark SQL — Steps and Example

12 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Query testtweet.json

Get it from Learning Spark Github ==> https://github.com/databricks/learning-spark/tree/master/files

13 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University SchemaRDD

14 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Row Objects

Row objects represent records inside SchemaRDDs, and are simply fixed-length arrays of fields.

15 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Types stored by Schema RDDs

16 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Look at the Schema

(not a complete screen shot)

17 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Another way to create SchemaRDD

18 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University JDBC Server

Spark SQL provides JDBC connectivity, which is useful for connecting business intelligence tools to a Spark cluster and for sharing a cluster across multiple users.

19 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University User-Defined Functions (UDF)

UDFs allow you to register custom functions in Python, Java, and Scala to call within SQL.

This is a very popular way to expose advanced functionality to SQL users in an organization, so that these users can call into it without writing code.

20 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Streaming

21 E6893 Big Data Analytics – Lecture 9: Linked Big Data: Graph Computing © 2014 CY Lin, Columbia University Spark Streaming

In Spark 1.1, Spark Streaming is available only in Java and Scala. Spark 1.2 has limited Python support.

22 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Spark Streaming architecture

23 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Spark Streaming with Spark’s components

24 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Try these examples

25 E6895 Advanced Big Data Analytics – Lecture 4: Data Store © 2015 CY Lin, Columbia University Graph Database

26 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Big Data: “While enterprises struggle to consolidate systems and collapse redundant databases to enable greater operational, analytical, and collaborative consistencies, changing economic conditions have made this job more difficult. E-commerce, in particular, has exploded data management challenges along three dimensions: volumes, velocity and variety. In 2001/02, IT organizations much compile a variety of approaches to have at their disposal for dealing each.” – Doug Laney, Gartner, 2001

2 © 2015 IBM Corporation Graph is a missing pillar in the existing Big Data foundation

UI / User

App Builder

Integration & Governance

Graphs Streams Hadoop/Spark Data Explorer Warehouse

Linked Big Data

Connector Framework

CM, RM, DM RDBMS Feeds Web 2.0 Email Web CRM, ERP File Systems

Graph Computing is difficult because data cannot be easily partitioned

28 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Graph Database Example

29 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Forrester: Over 25% of enterprise will use Graph DB by 2017 TechRadar: Enterprise DBMS, Q12014

Graph DB is in the significant success trajectory, and with the highest business value in the upcoming DBs. 30 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University GraphDB has the largest Popularity Change among DBMS lately

31 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Graph Database key differentiator — native store

In Relational DB, relationships are distributed and stored as tables

Native Graph DB stores nodes and relationships directly, It makes retrieval efficient.

Retrieving multi-step relationships is a 'graph traversal' problem

Technology ==> Top Layer: Graph, Bottom Layer: Graph Cited “Graph Database” O’liey 2013

32 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University A usual example

33 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Query Example – I

34 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Query Examples – II & III

Computational intensive 35 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Execution Time in the example of finding extended friends (by Neo4j)

36 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Modeling Order History as a Graph

37 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University A query language on Property Graph – Cypher

38 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Property Graph Example – Shakespeare

39 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Creating the Shakespeare Graph

40 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Query on the Shakespeare Graph

41 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Building Application Example – Collaborative Filtering

42 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University What is IBM System G?

http://systemG.research.ibm.com (Internet) or http://systemG.ibm.com (IBM internal site)

A Complete Graph Computing Suite — Toolkits, Solutions, & Cloud

43 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Download IBM System G Standard Edition (on-premise)

http://systemg.research.ibm.com/download.html

or http://www.ibm.com/developerworks/labs/

44 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University IBM System G Tools — 8 categories

• Graph Database: • Reasoning Engine: Reasoning • Native Store • Markovian & Bayesian Networks • GBase • Anomaly Detection Tools Intrinsic senses • Brain Analysis Tool • Scalable Middleware: • Parallel Prog. Lib. • Cognitive Networks: • Power Optimization • Deep Learning • Software Defined Env. • Emotion Analysis

• Contextual Analytics: • Spatiotemporal Analytics: • Topological Analysis • Road Network Algorithms • Matching and Search Observations • Spatiotemporal Data Mining • Path and Flow • Spatiotemporal Indexing

• Visual Analytics: • Mobile & Sensor Analytics: • Multivariate Graph • Mobile Security Tools • Heterogeneous Graph • Sensor Analytics Tools • Dynamic Graph • Big Graph

1345 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Similarity to the Brain Functions and Evolution

Perception Judgement Abstract comprehension Observation Memory

46 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University IBM System G Graph Computing Tools

Visualization Huge Network Network Dynamic Network Geo Network Graphical Model Visualization Propagation Visualization Visualization Visualization

Analytics Graph Computing Tools APIs and Query Langauge Support

Communities Graph Search Bayesian Networks Emotion Analytics

Centralities Graph Matching Deep Learning Visual Sentiments

Ego Features Shortest Paths SpatioTemporal Anomaly Detection

Middleware Multi-Core In-Memory Distributed Memory GPU Graph Multi-Thread Graph RT Library Graph RT Library Computing Driver Graph RT Library

Database Spark Streams Enterprise Graph Graph Data Interface (TinkerPop) Interface Interface Database Enhancer System G Assets Other GBase Performance Driven Open Graph Store System G Native Graph Store Source HBase Hardware File System (Linux FS, Hadoop HDFS, etc.)

Hardware Server Cluster Mobile Mainframe Super (Linux & OS X) (CPU, CPU+GPU) Cloud ( iOS) (System Z & Power) Computer

47 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Graph Analytical Tools

• Network topological analysis tools •Centralities (degree, closeness, betweenness) •PageRank •Communities (connected component, K-core, K-neighborhood triangle count, clustering coefficient) K-core •Neighborhood (egonet, K-neighborhood) query graph 1 1

• Graph matching and search tools 1 2 data graph •Graph search/filter by label, 10 vertex/edge properties Collaborative filtering 1 2 3 Bipartite weighted graph matching 5 (including geo locations) 3 5

•Graph matching 2 20 2 10 •Collaborative filtering 0 Graph matching • Graph path and flow tools •Shortest paths •Top K-shortest paths

• Probabilistic graphical model tools •Bayesian network inference Top k-shortest paths •Deep learning Bayesian network inference

48 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Performance comparisons item user

People who bought this also bought that..

Recommendation ==> 2-hop traversal & ranking For Visualization ==> 4-hop traversal & rankings

IBM KnowledgeView 1-year Access Log: 72.3K users, 82.1K docs, and 1.74 million downloads

TBD

TBD

Products Startup Open Sources System G *All performance numbers are preliminary 49 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University IBM System G Visualizer – Graph Data Explorer

Visual Query Panel Visualization Panel

Visual Mapping Panel Console Panel

50 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Panel Introduction

• Visual Query Panel • Providing users a friendly UI to create, delete, and query graphs from the System G native store.

• Console Panel • Display all the interaction information with System G native store. • Execute user defined query.

• Visualization Panel • Rendering graph structure on screen for users to visually explore graphs.

• Visual Mapping Panel • Customizing rendering effects to show desired graph information.

51 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Visual Query Panel – Creating a graph

1

2

4 5

1: Click “Create Graph”; 2: Prepare the graph data 3: Set the graph name; 4: Upload node files; 3 5: Upload edge files and finalize creating the graph.

52 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Visual Query Panel – Visual Query Builder

“analytics_degree <= 10 and (group == “center” or group == “guard”)

53 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Console Panel – User typed query

Query with no graph returned

Query with graph returned

54 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Visual Mapping Panel

Name Functionality Background Color Change the background color of the canvas. Node Default Color Set a unified color for all nodes. Edge Default Color Set a unified color for all edges. Show Nodes Set the visibility of all nodes. Node Color Mapping Assign color to nodes according to selected property of nodes.

Node Size Mapping Assign the radius of nodes according to selected property of nodes. Selectively show the node label according to the threshold. Labels will be shown for the Filter Node Label by Node Size nodes of which the size is larger than the threshold. Node Label Mapping Set the label value according to selected property of nodes. Node Label Size Adjust the font size of node labels Show Edges Set the visibility of all edges

Edge Color Mapping Assign color to edges according to selected property of edges. Edge Label Mapping Set the label value according to selected property of edges. Edge Label Size Adjust the font size of edge labels

Edge Thickness Mapping Assign thickess to edges according to selected property of edges. Select the rendering style of edges. For directed Edge Style graphs, users also can choose if showing the arrows or not.

55 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Visualization Panel – Before Customization

56 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Visualization Panel – After Customization

57 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Visualization Panel – Further Customization

Users can further specify colors by clicking the color blocks shown in the legend area

58 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University http://systemg.ibm.com/tool/visualizer/

59 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Characteristics of IBM System G Graph Analytics

• Cover a wide range of graph analytics to support many application use cases in different domains, e.g.: • Enterprise social network analysis, expertise search, knowledge recommendation • Financial/security anomaly/fraud detection • Social media monitoring and analysis • Cellular network analytics in Telco operation • Patient and disease analytics for healthcare • Live neural brain network analysis • Provide efficient in-memory computation as well as on-disk persistence • Optimal performance enabled by IBM System G graph database technologies that focus on efficient use of available computing resources with architecture-aware design to leverage system/architecture advantages • Single-threaded, concurrent (shared memory), and distributed versions • Multiple deployment options to suit different customer preferences and needs • C++ executables in Linux environments (Redhat CentOS, Ubuntu, Mac OS X, Power) • TinkerPop (Blueprints) API • gShell (a shell-like environment with interactive, batch, and server/client modes to operate multiple graph stores simultaneously) • console • REST API Web service • Python wrapper

60 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Compatible with TinkerPop Interface ()

http://sql2gremlin.com

http://tinkerpop.incubator.apache.org

61 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Write Python Code based on System G

g.analytic_find_path(src="1",sink="2") g.analytic_find_path(src="1",sink="2",label="b")

Output of the above Python script

62 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Highly Scalable Graph Database http://www.graph500.org Nov 2015: IBM Research’s Software powered all Top 3 winners of Graph 500 benchmark and 9 out of the Top 10 winners (supercomputers in US, Japan, France, UK, and Germany; except in China). Sequoia #1 ×3.25 #4 #4 #4

TSUBAME 2.5 K computer CPU only FX10 #3

#4 TSUBAME-KFC SGI UV2000 #3

GPU 4-way Xeon server CPU only

The Nov 2015 winner, K-computer supercomputer of 83K nodes and 663Kcores, achieved graph search of up to 38, 621,400,000 vertices per second.

63 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University Comparison of graph size No. of edges

Human Brain Project Graph500 (Huge) 45 1 trillion Symbolic edges Network Graph500 (Large) 40 Graph500 (Medium) 1 billion Twitter (tweets/day) 35 edges Graph500 (Small) () 2

log Graph500 (Mini) 30 Graph500 (Toy) USA-road-d.USA.gr 25 USA Road Network USA-road-d.LKS.gr 1 billion 1 trillion 20 USA-road- nodes nodes d.NY.gr 15 20 25 30 35 40 45

log2(n) No. of nodes 64 E6895 Advanced Big Data Analytics — Lecture 4 © CY Lin, Columbia University