Distributed Graph Storage

Veronika Molnár, UZH

Overview

- Graphs and Social Networks
- Criteria for Graph Processing Systems
- Current Systems
  - Storage
  - Computation
  - Large scale systems
- Comparison / Best systems
- Questions

Graphs and Social Networks 1

Graph = collection of nodes + edges connecting nodes to each other

Social Network = collection of individuals and social relations

Social Network is also a Graph! (node = person, edge = relation)

(Figure: Social Network graph. Image source: thenextweb.com)
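The definition above maps directly onto a small data structure. The following is a minimal sketch, assuming an undirected network; the people and the `build_graph` helper are made up for illustration.

```python
from collections import defaultdict

def build_graph(edges):
    """Build an undirected adjacency map: person -> set of connected persons."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)

# Hypothetical relations: each pair is one edge between two people (nodes).
relations = [("Ann", "Bob"), ("Ann", "Cid"), ("Bob", "Cid"), ("Cid", "Dee")]
graph = build_graph(relations)

# Degree = number of connections of each person.
degree = {person: len(friends) for person, friends in graph.items()}
```

An adjacency map like this keeps neighbour lookups cheap, which is what most of the metrics on the following slides rely on.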

Graphs and Social Networks 2

- Social Network graph properties (SNA = Social Network Analysis)

- Limited number of connections at each node (person) e.g. : max 5000

- Distribution not uniform
  - Most people: an average number of connections
  - But: a few people have a lot of connections (Power law distribution)

- Small degree of separation = “Small World” (length of shortest paths)

- Centrality

- Constantly changing, but very large graph! (7 billion people = 7 billion nodes)

Graphs and Social Networks 3

(Figure: Shortest Path and Centrality measures: Betweenness, Closeness, PageRank, Degree; labels VM and BP)

Graphs and Social Networks 4

- Social Network can be…

- Facebook

- Emails

- Mailing lists

- Academic networks

Criteria for Graph Processing Systems 1

- Modes:

- Distributed processing

- Research and industry use

- Interactive and noninteractive modes

- Storage of static and dynamic information

(Figure: E-mail connectivity graph. Image source: research.microsoft.com)

Criteria for Graph Processing Systems 2

- Properties:

- Scalability (social networks are large!)
- Speed

- Features:

- SNA (Social Network Analysis) metrics: PageRank, Centrality, Shortest paths, ...
- Extensibility
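Two of the SNA metrics named above can be sketched in a few lines. This is a single-machine illustration in plain Python, not taken from any of the systems discussed; the `network` data and both function names are hypothetical.

```python
from collections import deque

def shortest_path_length(graph, source, target):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

def degree_centrality(graph):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}

# Hypothetical undirected network as an adjacency map.
network = {
    "Ann": {"Bob", "Cid"},
    "Bob": {"Ann", "Cid"},
    "Cid": {"Ann", "Bob", "Dee"},
    "Dee": {"Cid", "Eve"},
    "Eve": {"Dee"},
}
hops = shortest_path_length(network, "Ann", "Eve")
centrality = degree_centrality(network)
```

The degree of separation between two people is exactly this hop count; a distributed system has to produce the same answer when the adjacency map no longer fits on one machine.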

(Figure: E-mail connectivity graph. Image source: research.microsoft.com)

Current Systems 1

Storage:

- Apache Hive (and Hadoop)
- Titan Graph Database
- Neo4j

Current Systems - Storage 2

Apache Hive (and Hadoop)

Hadoop: Map/Reduce architecture

- Hive: High-level operations on large data sets
- HiveQL (similar to SQL)
  - Converted to MapReduce jobs
- Not graph-specific
- Supports custom data formats
- Can be used as a backend for other systems
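To make "converted to MapReduce jobs" concrete, here is a rough sketch of how a grouped count (the degree of every node in an edge table) breaks down into a map phase, a shuffle, and a reduce phase. It is plain Python running on one machine, not Hive's actual execution plan; the edge data and function names are assumptions for illustration.

```python
from collections import defaultdict

def map_phase(edge):
    """Emit (key, 1) for both endpoints of an undirected edge."""
    src, dst = edge
    yield src, 1
    yield dst, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]  # hypothetical edge table
emitted = [pair for edge in edges for pair in map_phase(edge)]
degrees = dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())
```

Because each map call sees only one edge and each reduce call sees only one key, both phases can be spread across many machines.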

Current Systems - Storage 3

Titan Graph Database

- Store and query large graphs
- Graph schemas
- Edge and vertex labels
- Query language
- Transactional query model
- High-level operations
- Two backends: Cassandra and HBase

Current Systems - Storage 4

Neo4j

- Cost: €12K for startups (more for large companies), free for personal use
- Graph Database Management
- ACID compliant (Atomicity, Consistency, Isolation, Durability)
- Graphs are stored as Edges, Nodes, Attributes
- Focus on finding and querying data
- Graph analytics with igraph or GraphX
- Community!

Neo4j

Current Systems 5

Computation:

- igraph
- Spark GraphX
- GraphLab

Current Systems - Computation 6

igraph

- Network analysis / network research
- Portable and efficient
- Python, R, C, C++
- Built-in, optimized SNA metrics (centrality, diameter, connected components)
- Stand-alone or Grid
- Extensible, 3 layer API

Current Systems - Computation 7

Spark GraphX

- Graphs and parallel graph computations
- User-defined parallel operations
- Stored in-memory for faster processing
- Very good end-to-end performance
- Graphs are immutable; all operations create a new graph
- Prebuilt graph algorithms, e.g. PageRank
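The immutability point is easiest to see in code. The sketch below imitates the idea in plain Python with a frozen dataclass; it is not the GraphX API, and the `Graph` class, its methods, and the sample data are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Graph:
    vertices: tuple  # (vertex_id, attribute) pairs
    edges: tuple     # (src_id, dst_id) pairs

    def map_vertices(self, fn):
        """Return a new graph with fn applied to every vertex attribute."""
        return Graph(tuple((vid, fn(attr)) for vid, attr in self.vertices), self.edges)

    def subgraph(self, keep):
        """Return a new graph with only the vertices for which keep(id, attr) holds."""
        kept = tuple((vid, attr) for vid, attr in self.vertices if keep(vid, attr))
        ids = {vid for vid, _ in kept}
        edges = tuple((s, d) for s, d in self.edges if s in ids and d in ids)
        return Graph(kept, edges)

# Vertex attribute here is an age; the values are made up.
g = Graph(vertices=((1, 30), (2, 16), (3, 45)), edges=((1, 2), (2, 3), (1, 3)))
older = g.map_vertices(lambda age: age + 1)         # g itself is unchanged
adults = older.subgraph(lambda vid, age: age >= 18)  # drops vertex 2 and its edges
```

Since no operation modifies an existing graph, intermediate results can be shared or recomputed safely across parallel workers.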

Current Systems - Computation 8

GraphLab

- Cost: $4,000/machine/year, or free 1 year student subscription
- Graph computations: processing & analytics
- Visualization (GraphLab Canvas)
- Machine learning
- Common graph algorithms + API

GraphLab

Current Systems 9

Used by Facebook/Google:

- Pregel/Pregelix
- Apache Giraph

Current Systems - Large Scale 10

Pregel/Pregelix

- Pregel: Google-only, Pregelix: open-source
- BSP (bulk synchronous parallel) model
- User-defined edge, vertex, message types
- Supersteps
- Extremely large graphs
- In-memory/out-of-core operation models
- Vertex-based API, libraries with graph algorithms
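The BSP model with supersteps and a vertex-based API can be illustrated with a toy engine. This is a single-process sketch in plain Python, not the Pregel or Pregelix API; `run_supersteps`, `propagate_max`, and the sample graph are hypothetical names and data.

```python
def run_supersteps(graph, values, compute, max_supersteps=30):
    """Run a vertex program in bulk-synchronous supersteps until no messages remain."""
    inbox = {v: [] for v in graph}
    active = set(graph)  # every vertex is active in superstep 0
    for superstep in range(max_supersteps):
        outbox = {v: [] for v in graph}
        for v in active:
            values[v], outgoing = compute(superstep, v, values[v], inbox[v], graph[v])
            for target, message in outgoing:
                outbox[target].append(message)
        # Barrier: messages sent in this superstep are delivered in the next one.
        inbox = outbox
        active = {v for v, msgs in inbox.items() if msgs}
        if not active:
            break
    return values

def propagate_max(superstep, vertex, value, messages, neighbours):
    """Adopt the largest value seen; notify neighbours only when the value changed."""
    new_value = max([value] + messages)
    if superstep == 0 or new_value > value:
        return new_value, [(n, new_value) for n in neighbours]
    return new_value, []  # nothing to send: the vertex goes quiet

chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
result = run_supersteps(chain, {"a": 3, "b": 6, "c": 2, "d": 1}, propagate_max)
```

Each vertex only sees its own value, its incoming messages, and its neighbours, which is what lets a real system partition the vertices across machines and synchronise only at the barrier.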

Current Systems - Large Scale 11

Apache Giraph

- BSP model
- Graph-wide metrics via global operations
- Built on Hadoop, 5-26 times faster than Hive
- Highly parallel, keeps all data in memory
- Scales linearly with number of edges, can make efficient use of large clusters
- Used for PageRank, popularity rank, shortest paths
- No built-in graph metrics
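Since there are no built-in metrics, an algorithm such as PageRank has to be written by the user. As a rough idea of what that computation involves, here is a plain iterative PageRank in Python. It runs on one machine and does not use the Giraph API; the damping factor of 0.85, the iteration count, and the `links` data are assumptions for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Iterative PageRank over a dict: node -> list of nodes it links to."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the same baseline share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in out_links.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}  # hypothetical link graph
ranks = pagerank(links)
```

In a BSP system each iteration of the outer loop would correspond to one superstep, with the rank shares sent as messages along the edges.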

Comparison

System     | Focus                   | Scalability | SNA | Extensibility   | Used for
Hive       | parallel computations   | any size    | no  | Java            | generic
Titan      | storage                 | ~100 B      | no  | Python, Java    | graph queries
Neo4j      | transactional DB        | ~1 B        | yes | Java, Python, R | recommender systems
igraph     | efficiency, portability | ~1          | yes | R, Python, C++  | research
GraphX     | parallel computations   | ~1 B        | yes | Java, Python, R | graph processing
GraphLab   | processing, analytics   | ~1 B        | yes | C++             | recommender systems
Giraph     | large scale, BSP        | any size    | no  | Java, Python    | Facebook
Pregel(ix) | large scale, BSP        | any size    | yes | Java            | Google

Which is the best?

Depends on the network and intended use.

- Very large Social Networks:
  - High-performance, customizable systems, such as Pregelix
- Research:
  - igraph and GraphX support R and Python integration
- Analysis and Visualisation of Social Networks:
  - GraphLab with built-in interactive analysis and plotting features
  - Neo4j contains vast amounts of community resources for these tasks
- Custom use cases...
  - Existing systems might not support these
  - Instead: use Hadoop/Hive and write the rest yourself!

Thank You!

aaaaaand

Stay for some questions

Questions 1

Why do we analyse social data?

What are the possible uses of analysing social data?

Questions 2

Can visualisation help to understand graphs? (connections can be viewed, subset of graph can be analysed, …)

Questions 3

Have you ever used such a system? Which one?

Questions 4

What are the advantages and disadvantages of distributed graph processing?

What is the value of graph processing?

Questions 5

How can social metric calculations deal with fake accounts?

The End ...
