Network Community Detection by SCORE
Total Page:16
File Type:pdf, Size:1020Kb
Community Detection by SCORE with applications to Statisticians' Networks Jiashun Jin Statistics Department Carnegie Mellon University Collaborators: Pengsheng Ji (Univ. of Georgia) Zheng Tracy Ke (Univ. of Chicago) April 6, 2015 Jiashun Jin Community Detection by SCORE Network community detection I n = 1222 web blogs (nodes) I 16714 hyperlinks (edges) 2 I #fedgesg n : adjacency matrix X is very sparse I Two perceivable communities I Goal. Find the (unknown) community labels Political web blogs (Adamic and Glance; 2005) Jiashun Jin Coauthorship and Citation networks for Statisticians Jiashun Jin Community Detection by SCORE Abstraction (undirected) Data: adjacency matrix A of a network N = (V ; E) I V = f1; 2;:::; ng: nodes 1; an edge between nodes i and j A(i; j) = 0; otherwise I K perceivable \communities" V = V (1) [ V (2) ::: [ V (K) Goal. For each node, predict the community label. Diagonals of A are 0 for convenience Jiashun Jin Community Detection by SCORE Signal and noise decomposition Adjacency matrix : A = E[A]+W ; W ≡ (A−E[A]); \signal"+\noise" I W = A − E[A]: generalized Wigner matrix I upper triangles: independent centered-Bernoulli I Question: How to model Ω if we write E[X ] = Ω − diag(Ω) Jiashun Jin Community Detection by SCORE Box's wisdom \All models are wrong, but some are useful" George E.P. Box (1919{2013) Jiashun Jin Community Detection by SCORE Degree Corrected Block Model (DCBM) Ω(i; j) = P(k; `) () Ω = ΘLΘ θ(i) · θ(j) 2 θ(1) 3 a b 6 θ(2) 7 P = ; Θ = 6 . 7 b c 4 .. 5 θ(7) 2 abababa 3 2 aaaabbb 3 6 bcbcbcb 7 6 aaaabbb 7 6 7 6 7 6 abababa 7 6 aaaabbb 7 6 7 permute 6 7 L = 6 bcbcbcb 7 −−−−! 6 aaaabbb 7 6 abababa 7 6 bbbbccc 7 6 7 6 7 4 bcbcbcb 5 4 bbbbccc 5 abababa bbbbccc Karrer and Newman (2010) Jiashun Jin Community Detection by SCORE Tukey's suggestion \Which part of the sample contains the information" Tukey (1965), PNAS John W. Tukey (1915{2000) Jiashun Jin Community Detection by SCORE Where is the information? A = Ω − diag(Ω) + W ≈ Ω 0 SVD : Ω = ΘLΘ = Un;K DK;K (Un;K ) 2 3 s1 t1 2 3 θ(1) 6 s2 t2 7 6 7 6 θ(2) 7 6 s1 t1 7 Un;K = ΘTn;K = 6 . 7 6 . 7 4 .. 5 6 . 7 6 7 θ(n) 4 s1 t1 5 s2 t2 Jiashun Jin Community Detection by SCORE SCORE: algorithm SCORE: Spectral Clustering On Ratios-of-Eigenvectors Input: A and K I Obtain leading eigenvectorsη ^1,η ^2, :::,η ^K I Obtain n × (K − 1) matrix of entry-wise ratios η^ (i) R^(i; k) = k+1 ; 1 ≤ i ≤ n; 1 ≤ k ≤ K − 1 η^1(i) I Apply k-means to R^ (assume ≤ K clusters) Jiashun Jin Community Detection by SCORE Political weblog network (K = 2) x-axis: i = 1; 2;:::; n; y-axis: R^(i); 58 errors (lowest in literature) 1 0.5 0 −0.5 −1 −1.5 −2 −2.5 −3 −3.5 −4 0 200 400 600 800 1000 1200 Methods SCORE PCA normalized PCA NSC BCPL Errors 58 437 600 69 104:5 (SD: 145:4) Newman (2016), Bickel and Chen (2009), Zhao et al (2012) Jiashun Jin Community Detection by SCORE Regularity conditions K X 0 A = Ω − diag(Ω) + W ; Ω = ΘLΘ; L = P(k; `)1k 1` k;`=1 I (a). Eigen-spacing of DPD is ≥ a constant C X D(k; k)2 = θ(i)2=kθk2 i2V (k) 4 I (b). log(n)θmax kθk1=kθk ! 0, so that kW k kΩk; with prob. 1 − o(n−3) 2 3 I (c). log(n)θmax /θmin ≤ kθk3, so matrix-form Bernstein inequality holds (for the sum of random matrices) Jiashun Jin Community Detection by SCORE Consistency of SCORE n 3 n ^ −1 X ^ kθk3 X 1 1 kθk1 2 Hammp(`; `) = n min P `i 6= π(`i ) ; errn = 4 max ; 2 π kθk θ(i) θmin kθk i=1 i=1 Theorem. Consider DCBM where (a)-(c) hold. As n ! 1, if −1 n∗ log(n)errn ! 0, where n∗ is the minimum community size, ^score −1 3 then Hammp(` ; `) ≤ Cn log (n)errn. −1 Proof. Full analysis of Θ (^ηk − ηk ) I Spectral perturbation theory I Classical large deviations inequalities I Matrix-form Bernstein inequality (Tropp, 2012) iid Remark. If we assume θ(i) ∼ F as in Zhao et al (2012), then score −1 3 Hammp(`^ ; `) ≤ Cn log (n) Jiashun Jin Community Detection by SCORE Coauthor/Citation Networks (statisticians) I People most interested: statisticians/friends I We know \inside information" N/A to outsiders Scientific Problem: Dynamics of US-based statisticians in theory & methods of the HDDA era HDDA: High-Dimensional Data Analysis Data: All published research papers in AoS, Biometrika, JASA, and JRSS-B, 2003{2012 Jiashun Jin Community Detection by SCORE Disclaimer I Data and scope of scientific interests: limited I It is not our intention to I rank one author/paper/area over the others I label an author/paper to a certain area I We have to use real names because the networks are for real people (\us") Jiashun Jin Community Detection by SCORE Citation Network, I Large-Scale Multiple Testing by SCORE (359 nodes; 26 shown) 40 John● Rice David L● Donoho David Siegmund● John D● Storey 35 Joseph P● Romano Bradley● Efron Subhashis● Ghosal Daniel Yekutieli● 30 Abba M● Krieger Sanat K● Sarkar Donald ●B Rubin ● Aad van ●der Vaart Jiashun Jin 25 Larry Wasserman● Zhiyi● Chi Yoav Benjamini● D R● Cox 20 Peter ●Muller E L Lehmann● Christian● P Robert James O● Berger Mark ●G Low 15 Christopher● Genovese Paul R Rosenbaum● Felix Abramovich● 10 Iain M Johnstone● 0 5 10 15 20 Jiashun Jin Community Detection by SCORE Citation Network, II Spatial stat./nonparametric stat. by SCORE (1010 nodes; 42 shown) Ming−Hui Chen Paul Fearnhead R Todd Ogden Yi Li Gareth Roberts David RuppertAmy H Herring Omiros Papaspiliopoulos Simon N Wood Gary L Rosner Jeffrey S Morris Joseph G Ibrahim Mohammad Hosseini−Nasab Ciprian M Crainiceanu Steven N MacEachern Ulrich Stadtmuller Athanasios Kottas Raymond J Carroll Alan E Gelfand Naisyin Wang Sudipto BanerjeeMark F J Steel Alan H Welsh Andrew O Finley Anthony OHagan Brian S Caffo Michael L SteinMontserrat Fuentes Yongtao Guan Hao Purdue Zhang Marc G Genton Michael Sherman Huiyan Sang Theo Gasser Tilmann Gneiting Douglas W Nychka Martin Schlather Adrian E Raftery N Reid Jonathan Tawn Laurens de Haan Robin Henderson Jiashun Jin Community Detection by SCORE Citation Network, II (further split, I) Parametric Spatial Statistics by SCORE (304 nodes; 21 shown) Fadoua Balabdaoui● Laurens● de Haan Adrian E● Raftery Cristiano● Varin Jonathan● Tawn Martin Schlather● Anthony● OHagan Marc G● Genton Tilmann● Gneiting Nicolas● Chopin N Reid● Douglas W● Nychka Haavard● Rue Huiyan● Sang Michael● L Stein Paolo ●Vidoni Hao Zhang● (Purdue) Montserrat● Fuentes Leah J● Welty Sudipto ●Banerjee Andrew ●O Finley Jiashun Jin Community Detection by SCORE Citation Network, II (further split, II) Nonparametric Spatial Statistics by SCORE (212 nodes; 21 shown) Herbert ●K H Lee Natesh● Pillai Trivellore E ●Raghunathan Steven N MacEachern● ● Matthew J Beal ● ● Robert B Gramacy Gary L● RosnerAthanasios Kottas David ●M Blei ● Fernando A QuintanaMark F● J Steel Yee Whye● Teh Alan E ●Gelfand Ju−Hyun● Park Gareth ●Roberts Omiros Papaspiliopoulos● Paul Fearnhead● Radford● M Neal Pilar L ●Iglesias Alexandros● Beskos Yi● Li Jiashun Jin Community Detection by SCORE Citation Network, II (further split, III) Non-parametrics/semi-parametrics by SCORE (392 nodes; 24 shown) Brian S● Caffo Mohammad Hosseini−Nasab● Ulrich Stadtmuller● Jeffrey S● Morris Ciprian M Crainiceanu● David Ruppert● Naisyin● Wang Vic Patrangenaru● Thomas ●C M Lee ● Hongtu● Zhu Rui Paulo Ray Carroll● Nilanjan Chatterjee● Joseph G● Ibrahim Rabi Bhattacharya● ● Alan H Welsh Robert A● Rigby Michael A● Benjamin Hua Yun● Ming−HuiChen ● Chen Silvia Shimakura● D Mikis Stasinopoulos● Robin Henderson● Theo Gasser● Jiashun Jin Community Detection by SCORE Citation Network, III Variable Selection by SCORE (1285 nodes; 40 shown) Qiwei Yao Mohsen Pourahmadi Elizaveta Levina Ji Zhu Hans−Georg Muller Peter HallPeter J Bickel Hansheng Wang Robert J Tibshirani Jinchi Lv Cun−Hui Zhang Heng Peng Jianqing Fan Emmanuel J Candes Jianhua ZLixing Huang Zhu Jian Huang Ming Yuan Peter Buhlmann Runze Li Xuming He Terence TaoTrevor J HastieHui Zou Alexandre B Tsybakov Yi Lin Nicolai Meinshausen Joel L Horowitz Hao Helen Zhang R Dennis Cook Michael R Kosorok Dan Yu Lin L J Wei Jiashun Jin Community Detection by SCORE Coauthorship Network, I Objective Bayes by SCORE (64 nodes; 14 shown) John A● Cafeo ● ●● ● ● 11 Fei● Liu ●● Jerry ●Sacks ● Carlos M Carvalho ● ● ● ● ● Steven N MacEachern ● 10 ● ● ● Rui Paulo● ● ● ● ● ● ● 9 ● ● ● ● ● ● ● Alan E Gelfand ● James O● Berger ● Athanasios● Kottas 8 ● ● ● ● ● ● ● R J Parthasarathy● ● ● ● 7 ● ● ● ● Gonzalo● Garcia−Donato● ● ● ● Daniel● Walsh J Palomo● 6 M J Bayarri● ● 0 5 10 15 20 ● Jiashun Jin Community Detection by SCORE Coauthorship Network, II Biostatistics by SCORE (388 nodes; 16 shown) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Heping Zhang Steve Marron● ● ● ● ● ● ● Ji Zhu ● ● ● ● ● ● ● ● ● ● ● ● ● Hongtu● ● Zhu ● ● ● ● ● ● ● ● ● ● ● Louise● Ryan ●Weili● ● Lin ● ● ● ● ● ● ● ● ● ● ● ● ● Yimei●● ● ●Li Helen Zhang ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Joseph Ibrahim ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Eric Feuer Tapabrata● Maiti● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Debajyoti● ● Sinha ● ● Trivellore● Raghunathan ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Zhiliang● ● Ying ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Jun Liu ● ● ● ● ● ● ● ● ● ● ● ● ● L J Wei● ● ● ● ● ● ● David Dunson● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Jiashun Jin Community Detection by SCORE Coauthorship Network, III HDDA by SCORE (1811 nodes, 32 shown) Hans−Georg●