Classifying Types of Network Communities Using Motifs
Kun Tu1, Jian Li1, Don Towsley1, Dave Braines2, Liam Turner3
1University of Massachusetts Amherst, 2IBM UK, 3Cardiff University
Machine Learning, Data Analytics and Modeling
September 26, 2018, Thessaloniki, Greece

Network Community Classification
• Detect the type of relations within groups of individuals
• Helps to understand patterns of interaction between people
• Applications in identifying
  • Department ID in email networks
  • User ID from mobile app switching behavior represented as a network

Contribution
• Propose a new network embedding methodology (gl2vec) for classifying directed networks based on each network's motif counts

Problem Formulation
• $\{G_i = (V_i, E_i, y_i)\}_{i=1}^{N}$: set of $N$ (sub)graphs/networks
  • $V_i$: set of nodes
  • $E_i$: set of timestamped edges
  • $y_i \in \{1, \dots, K\}$: class label of $G_i$
• $f: G_i \to \mathbb{R}^m$: graph embedding function from $G_i$ to a $1 \times m$ feature representation vector, using the subgraph ratio profile (SRP) of static motifs
• $g: \mathbb{R}^m \to P$: classifier mapping a feature vector to a categorical distribution $P = [p_1, \dots, p_K]$ over the $K$ labels
• Goal: design an embedding function $f$ and select a machine learning model $g$ to minimize the sum of cross entropy over all graphs

Network Motifs
• Static motifs: triads (three-node motifs)

Subgraph Ratio Profile (SRP)
• SRP for motif $i$
  • $\langle N^{rand}_i \rangle$: average count of motif $i$ in random graphs
  • $N^{obs}_i$: observed count of motif $i$ in the empirical graph
  • $\Delta_i = \frac{N^{obs}_i - \langle N^{rand}_i \rangle}{N^{obs}_i + \langle N^{rand}_i \rangle + \varepsilon}$
  • $\varepsilon$: keeps $\Delta_i$ from becoming too large when motif $i$ is rare
  • $SRP_i = \frac{\Delta_i}{\sqrt{\sum_j \Delta_j^2}}$
• SRPs can be used to compare networks of different sizes because the SRP of each motif is a normalized term

Null Model for Networks
• Null model: a generative model that produces random graphs matching a specific graph in some of its structural features
• Static network null models
  • NE: same number of nodes and edges (fast)
  • MAN: same number of (M)utual, (A)symmetric and (N)ull edges
  • BDS: same in/out degree sequence (slow)

Experiments
• Classification problems, given the network topological structure
  • Network domain classification
    • 15 network domains, including email networks, QA networks, p2p, …
  • Subgraph identification
    • Department ID in email network
    • User ID given mobile app switching behaviors as a network
• Baseline embedding methods
  • Struc2vec
  • Graph kernel with graphlets (GK graphlet)
  • Graph kernel with Weisfeiler-Lehman test of isomorphism (GK WL)
  • Node2vec
  • Graph2vec
  • Sub2vec
  • Motif distribution as feature vector

Real-world Data
• 2,355 real-world networks in 15
network domains (from SNAP, Tymer project):
  • Social network: Google+, Twitter
  • Email network: EmailEU, BBNEmailTraffic, Enron
  • QA network: askubuntu, mathoverflow
  • P2p: Gnutella
  • Citation: HepPh, HepTh (physics papers)
  • Friendship: Slashdot
  • Others: Bitcoin, co-sponsorship, SwitchApp42, Epinions, WikiVote, Advice, terrorist
• More than 10,000 network communities obtained
  • From ground truth information
  • From a community detection algorithm (Newman's modularity)

Experiment Setup
• Machine learning classifiers
  • XGBoost: better than most classifiers for mid-size datasets
  • SVM: good for small datasets
  • Random forest: feature selection to investigate our network embedding
  • AdaBoost: baseline
• Parameter settings
  • Grid search for best performance
  • XGBoost: learning rate 0.1, maximum tree depth 8, minimum child weight 1, subsample ratio of training instances 0.8
  • SVM: regularization weight is 2
  • Random forest: 400 trees; minimum number of samples required to split a tree node is 2

Community Domain
• gl2vec is competitive
• gl2vec with the NE null model is slightly better
• Combining the gl2vec embedding with other methods significantly improves accuracy
• XGBoost performs best

EmailEU Departments
• gl2vec is competitive
• gl2vec with the NE null model is slightly better
• Combining the gl2vec embedding with other methods significantly improves accuracy
• Random forest performs best

BBN Department
• gl2vec is competitive
• gl2vec with the NE null model is slightly better
• Combining the gl2vec embedding with other methods significantly improves accuracy

Mobile Users
• gl2vec is competitive
• gl2vec with the NE null model is slightly better
• Combining the gl2vec embedding with other methods significantly improves accuracy

Conclusion
• gl2vec is competitive with state-of-the-art methods
• Combining gl2vec with state-of-the-art methods yields significant improvement
• Subgraph ratio profiles (SRP) contain important features for network classification
• All three null models have similar prediction accuracy, but NE is recommended because of its lower computational cost

Thank you!
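Supplementary sketch for the Subgraph Ratio Profile slide: the snippet below shows one way the SRP vector could be computed once per-motif counts are available. It is a minimal illustration, not the authors' implementation. The function name `subgraph_ratio_profile`, its arguments, and the default `eps=4.0` are assumptions made here; the slides do not state a value for the smoothing term.

```python
import math


def subgraph_ratio_profile(observed, random_mean, eps=4.0):
    """Return a normalized SRP vector from per-motif counts.

    observed    -- observed count of each motif in the empirical graph
    random_mean -- average count of each motif over null-model graphs
    eps         -- smoothing term (assumed default; not given in the slides)
    """
    if len(observed) != len(random_mean):
        raise ValueError("observed and random_mean must have equal length")

    # Delta_i: relative deviation of the observed count from the null model.
    deltas = [
        (obs - rnd) / (obs + rnd + eps)
        for obs, rnd in zip(observed, random_mean)
    ]

    # Normalize to unit Euclidean length so graphs of different sizes
    # can be compared.
    norm = math.sqrt(sum(d * d for d in deltas))
    if norm == 0.0:
        # Every motif matches the null model exactly; return a zero profile.
        return [0.0] * len(deltas)
    return [d / norm for d in deltas]


# Hypothetical counts for three motifs (illustrative numbers only).
profile = subgraph_ratio_profile([10, 0, 5], [5.0, 2.0, 5.0])
```

The resulting list would play the role of the feature vector passed to a classifier; the motif counting itself (for example a triad census) is left out of this sketch.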
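Supplementary sketch for the Null Model slide: the NE null model keeps only the number of nodes and edges, while the MAN null model keeps the counts of mutual, asymmetric and null node pairs. The code below illustrates both ideas under stated assumptions: nodes are labeled `0..num_nodes-1`, graphs are simple (no self-loops, no repeated edges), and the function names `ne_null_graph` and `dyad_census` are hypothetical helpers introduced here, not part of gl2vec.

```python
import random


def ne_null_graph(num_nodes, num_edges, rng=None):
    """Sample a random directed graph with the given node and edge counts."""
    rng = rng or random.Random()
    # All ordered node pairs without self-loops.
    pairs = [
        (u, v)
        for u in range(num_nodes)
        for v in range(num_nodes)
        if u != v
    ]
    if num_edges > len(pairs):
        raise ValueError("too many edges for a simple directed graph")
    # Sampling without replacement picks num_edges distinct directed edges.
    return set(rng.sample(pairs, num_edges))


def dyad_census(num_nodes, edges):
    """Count (mutual, asymmetric, null) node pairs of a simple directed graph."""
    edge_set = set(edges)
    # A mutual pair has edges in both directions; count each pair once.
    mutual = sum(1 for (u, v) in edge_set if u < v and (v, u) in edge_set)
    # Every remaining edge belongs to a one-directional (asymmetric) pair.
    asymmetric = len(edge_set) - 2 * mutual
    total_pairs = num_nodes * (num_nodes - 1) // 2
    return mutual, asymmetric, total_pairs - mutual - asymmetric


# Example: one random graph with 5 nodes and 7 edges, and its dyad census.
sample = ne_null_graph(5, 7, random.Random(0))
census = dyad_census(5, sample)
```

Averaging motif counts over many such sampled graphs would give the random-graph term used in the SRP. A MAN sampler would additionally have to match the three values returned by `dyad_census`, which this sketch only measures.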