Exploring the Hidden Dimension in Graph Processing

Mingxing Zhang, Yongwei Wu, Kang Chen, *Xuehai Qian, Xue Li, and Weimin Zheng

Tsinghua University    *University of Southern California

Graph is Ubiquitous

Web graph: Google indexes more than 1 trillion pages.
Social graph: Facebook has more than 800 million active users.
Information network: 31 billion RDF triples in 2011.
De Bruijn graph: 4^k nodes (k = 20, ..., 40).

Acknowledgement: Arijit Khan, Sameh Elnikety [VLDB '14]

Graph Computing is Even More Ubiquitous

Traditional graph analysis: many graph applications aim at analyzing the properties of graphs.
  Examples: Shortest Path, Triangle Counting, PageRank, Connected Components.

MLDM as graph computing: many MLDM applications can be modeled and computed as graph problems.
  Examples: Collaborative Filtering, SpMM, Neural Networks, Matrix Factorization.

1D Partitioning

1D Partitioning: each vertex is assigned to one worker, with all of its incoming/outgoing edges attached.

[Figure: the original graph (V0-V3) is partitioned onto 4 nodes with 1D partitioning, producing 4 sub-graphs and 6 replicas (replicas denoted by dashed circles).]

1D → Each Vertex is a Task
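To make the scheme concrete, here is a minimal sketch of a hash-based 1D partitioner over an edge list; the owner() function and all names are illustrative, not any particular system's API.

```python
# A minimal sketch of 1D partitioning over an edge list (illustrative).
def partition_1d(edges, num_workers):
    """Each vertex goes to one worker together with its attached edges;
    an endpoint owned by another worker becomes a replica."""
    owner = lambda v: hash(v) % num_workers     # vertex -> worker
    subgraphs = [[] for _ in range(num_workers)]
    replicas = set()
    for src, dst in edges:
        w = owner(dst)                          # the edge follows its destination
        subgraphs[w].append((src, dst))
        if owner(src) != w:
            replicas.add((src, w))              # src is replicated on worker w
    return subgraphs, replicas

subs, reps = partition_1d([(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)], 4)
```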

Vertex Program: each vertex program can read and modify the neighborhood of a vertex.

Communication: by messages (e.g., Pregel) or by shared state (e.g., GraphLab).

Parallelism: by running multiple vertex programs simultaneously. The granularity of task dispatching is a vertex, which is the SAME as the granularity of data dispatching.

Problem: SKEW!!!
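A Pregel-style vertex program for PageRank might look like the following sketch; the class shape and compute() signature are hypothetical stand-ins, not Pregel's real API.

```python
# A hypothetical Pregel-style vertex program (PageRank).
class PageRankVertex:
    def __init__(self, vid, out_neighbors, num_vertices):
        self.vid = vid
        self.out_neighbors = out_neighbors
        self.num_vertices = num_vertices
        self.rank = 1.0 / num_vertices

    def compute(self, superstep, messages):
        # Read the neighborhood: messages carry the in-neighbors' shares.
        if superstep > 0:
            self.rank = 0.15 / self.num_vertices + 0.85 * sum(messages)
        # Modify the neighborhood: send this vertex's share downstream.
        share = self.rank / max(len(self.out_neighbors), 1)
        return [(nbr, share) for nbr in self.out_neighbors]
```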

2D Partitioning

2D Partitioning: each edge is assigned to one worker; certain heuristics can be used for the assignment.

[Figure: the original graph (V0-V3) is partitioned onto 4 nodes with 2D partitioning, producing 4 sub-graphs and 6 replicas (replicas denoted by dashed circles).]

2D → Each Edge is a Task
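One simple assignment heuristic is a grid (2D-hash) placement keyed by both endpoints; a minimal sketch, assuming a square grid of workers (names are illustrative):

```python
# A minimal sketch of grid-based 2D (edge) partitioning (illustrative).
import math

def partition_2d(edges, num_workers):
    rows = int(math.sqrt(num_workers))          # assume a square worker grid
    cols = num_workers // rows
    subgraphs = [[] for _ in range(rows * cols)]
    for src, dst in edges:
        w = (hash(src) % rows) * cols + (hash(dst) % cols)
        subgraphs[w].append((src, dst))         # a vertex whose edges land on
                                                # several workers gets replicas
    return subgraphs
```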

PowerGraph

Gather, Apply, Scatter, each user defined:
  Gather(Y) → Σ, with partial results combined pairwise: Σ1 + Σ2 → Σ3
  Apply(Y, Σ) → Y'
  Scatter(Y') → updates the adjacent edges and activates neighbors

The granularity of task dispatching is an edge, which is also the SAME as the granularity of data dispatching.
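As a sketch, the three user-defined phases for PageRank could look like this; the signatures mirror the slide's shape, not PowerGraph's actual C++ API.

```python
# A minimal sketch of the GAS phases for PageRank (illustrative).
def gather(src_rank, src_out_degree):
    # Gather(Y) -> Σ: one in-edge's contribution.
    return src_rank / src_out_degree

def gather_sum(sigma1, sigma2):
    # Partial gathers combine pairwise: Σ1 + Σ2 -> Σ3.
    return sigma1 + sigma2

def apply(vertex_rank, sigma, num_vertices):
    # Apply(Y, Σ) -> Y': the new vertex value.
    return 0.15 / num_vertices + 0.85 * sigma

def scatter(new_rank, old_rank, tolerance=1e-4):
    # Scatter(Y') -> decide whether neighbors must be re-activated.
    return abs(new_rank - old_rank) > tolerance
```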

Is this the END of task partitioning?

• YES?
  - A vertex can be attached to multiple edges, but an edge is connected to only two vertices: indivisible! ← NOT true for many problems
• NO
  - Each vertex/edge can be further partitioned!

Example: Collaborative Filtering

Matrix View: collaborative filtering estimates missing ratings based on a given incomplete set of (user, item) ratings. With N users, M items, and feature dimension D: R ≈ P * Q^T, i.e., R[u,v] ≈ P[u] · Q[v].

Graph View: each vertex (user u or item v) is attached with a feature vector of size D (P[u] or Q[v]); each known rating R[u,v] is an edge. This is a typical pattern for modeling an MLDM problem as a graph problem.
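A minimal numpy sketch of the matrix view, with toy sizes and random feature vectors (illustrative only):

```python
# R[u, v] is estimated as the dot product of P[u] and Q[v] (R ≈ P * Q^T).
import numpy as np

N, M, D = 100, 50, 8                    # users, items, feature dimension
P = np.random.rand(N, D)                # one length-D vector per user
Q = np.random.rand(M, D)                # one length-D vector per item

def predict(u, v):
    """Estimated rating R[u, v] ≈ P[u] · Q[v]."""
    return P[u] @ Q[v]
```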

3D Partitioning

3D Partitioning: each vertex is split into sub-vertices, and a 2D partitioner is used within each layer.

[Figure: the original graph (V0-V3, each vertex's vector split into halves 1 and 2) is partitioned onto 4 nodes (2 layers x 2 nodes each) with 3D partitioning, producing 2 sub-graphs and 3 replicas in each layer.]
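A minimal sketch of the idea, with a trivial round-robin stand-in for the 2D partitioner P (all names are illustrative):

```python
# A minimal sketch of 3D partitioning: split each vertex's length-Sc
# vector into L slices, then 2D-partition the same edges in every layer.
def partition_2d_any(edges, num_workers):
    # trivial stand-in for any real 2D partitioner P
    return [[e for i, e in enumerate(edges) if i % num_workers == w]
            for w in range(num_workers)]

def partition_3d(edges, sc, num_layers, workers_per_layer):
    slice_len = sc // num_layers
    layers = []
    for layer in range(num_layers):
        layers.append({
            # this layer owns elements [lo, hi) of every vertex's vector
            "slice": (layer * slice_len, (layer + 1) * slice_len),
            "subgraphs": partition_2d_any(edges, workers_per_layer),
        })
    return layers
```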

Motivation

• We found a NEW dimension; shouldn't we partition it?
• Observation
  - These vector properties are usually operated on by element-wise operators
  - Easy to parallelize, with no communication!

Two Kinds of Communications

Intra-layer communication: fewer sub-graphs and fewer replicas per layer → Reduced!

Inter-layer communication: did not exist before → Increased!

[Figure: 4 nodes arranged as 2 layers x 2 nodes; intra-layer traffic flows between nodes of the same layer, inter-layer traffic between layers.]

New System: CUBE

• Existing models do not formalize the inter-layer communication
• CUBE
  - adopts 3D partitioning
  - implements a new programming model, UPPS
  - uses a matrix backend

Data Model of CUBE

• The data graph model is also used in CUBE, but each vertex/edge can be attached with two kinds of data.

• Two classes of data
  - DShare: an indivisible property, represented by a single variable.
  - DColle: a divisible collection of properties, stored as a vector of variables; len(DColle) == the "collection size" Sc.
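In plain Python, the two classes could be sketched as follows; the field names mirror the slide's terms, not CUBE's actual implementation.

```python
# A minimal sketch of CUBE's two data classes (illustrative).
from dataclasses import dataclass, field
from typing import List

@dataclass
class PropertyData:
    dshare: float = 0.0                  # DShare: one indivisible variable
    dcolle: List[float] = field(default_factory=list)
                                         # DColle: divisible vector whose
                                         # length is the collection size Sc

def layer_slice(p: PropertyData, layer: int, num_layers: int) -> List[float]:
    """Each layer stores only its equal share of DColle."""
    n = len(p.dcolle) // num_layers
    return p.dcolle[layer * n:(layer + 1) * n]
```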

3D Partitioner

3D Partitioner (L, P): L is the number of layers, and P is a 2D partitioner.

[Figure: with L = 2, DShare (S) is replicated on every layer, DColle is equally partitioned across the layers, and each layer is partitioned with P.]

Programming Model

• Update: inter-layer communication
  - UpdateVertex: read/write all the data of a vertex
  - UpdateEdge: read/write all the data of an edge

• Push, Pull, Sink: intra-layer communication
  - Variations of GAS; for an edge edg that connects (src, dst):
    - Push: use src and edg to update dst.
    - Pull: use dst and edg to update src.
    - Sink: use src and dst to update edg.
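A minimal sketch of these operators over dict-shaped vertex/edge records; names and fields are illustrative, not CUBE's actual API.

```python
# Illustrative UPPS-style operators (not CUBE's real signatures).
def push(src, edg, dst):
    """Intra-layer: use src and edg to update dst (a GAS variation)."""
    dst["acc"] += edg["weight"] * src["slice"]

def pull(src, edg, dst):
    """Intra-layer: use dst and edg to update src."""
    src["acc"] += edg["weight"] * dst["slice"]

def sink(src, edg, dst):
    """Intra-layer: use src and dst to update edg."""
    edg["weight"] = src["slice"] * dst["slice"]

def update_vertex(vertex):
    """Inter-layer: reads/writes ALL of a vertex's data, so the runtime
    must first gather the vertex's DColle slices from every layer."""
    vertex["dshare"] = sum(vertex["dcolle"])
```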

Matrix Backend

Vertex frontend → MAP → matrix backend.

Vertex frontend: ✓ easy to program, ✓ similar to existing works (simple indexing); ✖ less efficient.
Matrix backend: ✖ hard to program; ✓ high efficiency (Hilbert order).

Mapping from the vertex frontend to the matrix backend can significantly speed up the system, a useful method pioneered by GraphMat.
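The core of that mapping, as pioneered by GraphMat, is that running a pull along every edge is equivalent to one sparse matrix-vector multiply; a minimal numpy sketch (dense here for brevity):

```python
# Vertex view vs. matrix view of the same pull (illustrative).
import numpy as np

def pull_frontend(edges, values, n):
    out = np.zeros(n)
    for src, dst, w in edges:           # vertex view: per-edge programs
        out[dst] += w * values[src]
    return out

def pull_backend(A, values):
    return A @ values                   # matrix view: a single SpMV

# With A[dst, src] = w for every edge (src, dst, w), the two agree.
```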

Evaluation

• Baseline: PowerGraph and PowerLyra
  - the default partitioner (oblivious) is used for PowerGraph
  - every possible partitioner is tested for PowerLyra, and the best is used
  - the P of CUBE is the same as the partitioner used by PowerLyra
• Platform: 8-node cluster connected with 1Gb Ethernet
• Dataset

  Name       |U|      |V|      |E|          Best 2D Partitioner
  Libimseti  135,359  168,791  17,359,346   Hybrid-cut
  Last.fm    359,349  211,067  17,559,530   Bi-cut
  Netflix    17,770   480,189  100,480,507  Bi-cut

Micro Benchmarks: SpMM

SpMM: Q = R^T * P, e.g., Q[v] = R[u,v]*P[u] + R[w,v]*P[w] when only users u and w rated item v.

[Figure: execution time of SpMM vs. # of layers (1-57) for Libimseti (Sc = 256), Lastfm (Sc = 256), and Libimseti (Sc = 1024).]

Communication cost is proportional to the # of replicas, which has a negative relation with the # of layers, since the # of sub-graphs per layer is inversely proportional to the # of layers.
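A minimal numpy sketch of this SpMM kernel with toy sizes:

```python
# Q[v] = sum over users u of R[u, v] * P[u], i.e. Q = R^T @ P.
import numpy as np

N, M, D = 4, 3, 2                       # users, items, feature dimension
R = np.zeros((N, M))
R[0, 1], R[2, 1] = 5.0, 3.0             # a sparse set of ratings
P = np.random.rand(N, D)                # user feature vectors

Q = R.T @ P                             # for item 1: R[0,1]*P[0] + R[2,1]*P[2]
```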

Micro Benchmarks: SumV & SumE

SumV: summing up each vertex's vector. SumE: summing up each edge's vector.

[Figure: execution time of SumV and SumE vs. # of layers (1-57) for Libimseti (Sc = 256), Lastfm (Sc = 256), and Libimseti (Sc = 1024).]

Communication cost is proportional to (L-1)/L, where L is the # of layers.

Compare to PowerLyra & PowerGraph

Execution Time (CUBE's # of layers in parentheses):

                        GD                                 ALS
  Dataset    D    PowerGraph  PowerLyra  CUBE        PowerGraph  PowerLyra  CUBE
  Libimseti  64   6.82        6.89       2.59 (4)    87.0        86.8       28 (64)
  Libimseti  128  11.64       11.62      3.33 (8)    331         331        109 (64)
  Lastfm     64   10.4        9.86       2.48 (4)    158         111        57 (64)
  Lastfm     128  18.6        17.8       3.47 (8)    Failed      Failed     230 (64)
  Netflix    64   18.3        7.42       4.16 (1)    179         66.0       42.5 (8)
  Netflix    128  30.6        11.3       6.55 (2)    Failed      239        118 (8)

✓ Up to 4.7x speedup (vs. PowerLyra)
✓ Up to 7.3x speedup (vs. PowerGraph)

Piecewise Breakdown

Network Reduction on GD:
  ✓ About half of the speedup is from 3D partitioning (up to 2x) for Lastfm and Libimseti.

Network Reduction on ALS:
  ✓ Almost all the speedup is from network reduction.
  ✖ Not useful for Netflix.
    - Computation of ALS is dominated by DSYSV, an O(N^3) computation kernel.
    - Can achieve 2.5x speedup on Netflix if D is set to 2048.

Memory Consumption

[Figure: total memory consumption vs. # of layers (1-61) for running ALS on 64 workers with D = 32, which means Sc = D^2 + D = 1056; partitioning time follows a similar curve to memory consumption.]

• Memory consumption is better than L = 1, i.e., 2D only.
• The best # of layers for time is not always the best for memory consumption.

Conclusion

• We found that, for certain graph problems, vertices and edges are not indivisible.

• We propose CUBE, which

  - adopts 3D partitioning;
  - uses a matrix backend;
  - achieves up to 4.7x speedup.

Thank You!!