Community Detection by SCORE with applications to Statisticians’ Networks

Jiashun Jin

Statistics Department Carnegie Mellon University

Collaborators: Pengsheng Ji (Univ. of Georgia) Zheng Tracy Ke (Univ. of Chicago)

April 6, 2015

Jiashun Jin Community Detection by SCORE Network community detection

I n = 1222 web blogs (nodes)

I 16714 hyperlinks (edges)

2 I #{edges}  n : adjacency matrix X is very sparse

I Two perceivable communities

I Goal. Find the (unknown) community labels Political web blogs (Adamic and Glance; 2005) Jiashun Jin Coauthorship and Citation networks for Statisticians

Jiashun Jin Community Detection by SCORE Abstraction (undirected)

Data: adjacency matrix A of a network N = (V , E)

I V = {1, 2,..., n}: nodes

 1, an edge between nodes i and j A(i, j) = 0, otherwise

I K perceivable “communities”

V = V (1) ∪ V (2) ... ∪ V (K)

Goal. For each node, predict the community label.

Diagonals of A are 0 for convenience

Jiashun Jin Community Detection by SCORE Signal and noise decomposition

Adjacency matrix : A = E[A]+W , W ≡ (A−E[A]), “signal”+“noise”

I W = A − E[A]: generalized Wigner matrix

I upper triangles: independent centered-Bernoulli

I Question: How to model Ω if we write

E[X ] = Ω − diag(Ω)

Jiashun Jin Community Detection by SCORE Box’s wisdom

“All models are wrong, but some are useful”

George E.P. Box (1919–2013)

Jiashun Jin Community Detection by SCORE Degree Corrected Block Model (DCBM) Ω(i, j) = P(k, `) ⇐⇒ Ω = ΘLΘ θ(i) · θ(j)  θ(1)    a b  θ(2)  P = , Θ =  .  b c  ..  θ(7)  abababa   aaaabbb   bcbcbcb   aaaabbb       abababa   aaaabbb    permute   L =  bcbcbcb  −−−−→  aaaabbb   abababa   bbbbccc       bcbcbcb   bbbbccc  abababa bbbbccc Karrer and Newman (2010)

Jiashun Jin Community Detection by SCORE Tukey’s suggestion

“Which part of the sample contains the information” Tukey (1965), PNAS

John W. Tukey (1915–2000)

Jiashun Jin Community Detection by SCORE Where is the information?

A = Ω − diag(Ω) + W ≈ Ω

0 SVD : Ω = ΘLΘ = Un,K DK,K (Un,K )

  s1 t1   θ(1)  s2 t2     θ(2)   s1 t1  Un,K = ΘTn,K =  .   . .   ..   . .    θ(n)  s1 t1  s2 t2

Jiashun Jin Community Detection by SCORE SCORE: algorithm

SCORE: Spectral Clustering On Ratios-of-Eigenvectors

Input: A and K

I Obtain leading eigenvectorsη ˆ1,η ˆ2, ...,η ˆK I Obtain n × (K − 1) matrix of entry-wise ratios

ηˆ (i) Rˆ(i, k) = k+1 , 1 ≤ i ≤ n, 1 ≤ k ≤ K − 1 ηˆ1(i)

I Apply k-means to Rˆ (assume ≤ K clusters)

Jiashun Jin Community Detection by SCORE Political weblog network (K = 2) x-axis: i = 1, 2,..., n; y-axis: Rˆ(i); 58 errors (lowest in literature) 1

0.5

0

−0.5

−1

−1.5

−2

−2.5

−3

−3.5

−4 0 200 400 600 800 1000 1200

Methods SCORE PCA normalized PCA NSC BCPL Errors 58 437 600 69 104.5 (SD: 145.4)

Newman (2016), Bickel and Chen (2009), Zhao et al (2012)

Jiashun Jin Community Detection by SCORE Regularity conditions

K X 0 A = Ω − diag(Ω) + W , Ω = ΘLΘ, L = P(k, `)1k 1` k,`=1

I (a). Eigen-spacing of DPD is ≥ a constant C X D(k, k)2 =  θ(i)2/kθk2 i∈V (k)

4 I (b). log(n)θmax kθk1/kθk → 0, so that

kW k  kΩk, with prob. 1 − o(n−3)

2 3 I (c). log(n)θmax /θmin ≤ kθk3, so matrix-form Bernstein inequality holds (for the sum of random matrices)

Jiashun Jin Community Detection by SCORE Consistency of SCORE

n 3 n ˆ −1 X ˆ  kθk3 X 1 1 kθk1 2 Hammp(`, `) = n min P `i 6= π(`i ) , errn = 4 max , 2 π kθk θ(i) θmin kθk i=1 i=1 Theorem. Consider DCBM where (a)-(c) hold. As n → ∞, if −1 n∗ log(n)errn → 0, where n∗ is the minimum community size, ˆscore −1 3 then Hammp(` , `) ≤ Cn log (n)errn.

−1 Proof. Full analysis of Θ (ˆηk − ηk )

I Spectral perturbation theory

I Classical large deviations inequalities

I Matrix-form Bernstein inequality (Tropp, 2012)

iid Remark. If we assume θ(i) ∼ F as in Zhao et al (2012), then score −1 3 Hammp(`ˆ , `) ≤ Cn log (n)

Jiashun Jin Community Detection by SCORE Coauthor/Citation Networks (statisticians)

I People most interested: statisticians/friends

I We know “inside information” N/A to outsiders

Scientific Problem: Dynamics of US-based statisticians in theory & methods of the HDDA era HDDA: High-Dimensional Data Analysis Data: All published research papers in AoS, Biometrika, JASA, and JRSS-B, 2003–2012

Jiashun Jin Community Detection by SCORE Disclaimer

I Data and scope of scientific interests: limited

I It is not our intention to

I rank one author/paper/area over the others

I label an author/paper to a certain area

I We have to use real names because the networks are for real people (“us”)

Jiashun Jin Community Detection by SCORE Citation Network, I

Large-Scale Multiple Testing by SCORE (359 nodes; 26 shown) 40 John● Rice David L● Donoho David Siegmund● John D● Storey 35 Joseph P● Romano

Bradley● Efron Subhashis● Ghosal Daniel Yekutieli● 30 Abba M● Krieger Sanat K● Sarkar Donald ●B Rubin

● Aad van ●der Vaart Jiashun Jin 25 Larry Wasserman● Zhiyi● Chi

Yoav Benjamini● D R● Cox 20 Peter ●Muller E L Lehmann●

Christian● P Robert James O● Berger Mark ●G Low 15 Christopher● Genovese Paul R Rosenbaum● Felix Abramovich● 10 Iain M Johnstone● 0 5 10 15 20 Jiashun Jin Community Detection by SCORE Citation Network, II

Spatial stat./nonparametric stat. by SCORE (1010 nodes; 42 shown)

Ming−Hui Chen R Todd Ogden Yi Li Gareth Roberts David RuppertAmy H Herring Omiros Papaspiliopoulos Simon N Wood Gary L Rosner Jeffrey S Morris Joseph G Ibrahim Mohammad Hosseini−Nasab Ciprian M Crainiceanu Steven N MacEachern Ulrich Stadtmuller Athanasios Kottas Raymond J Carroll Alan E Gelfand Naisyin Wang Sudipto BanerjeeMark F J Steel Alan H Welsh Andrew O Finley Anthony OHagan Brian S Caffo Michael L SteinMontserrat Fuentes Yongtao Guan Hao Purdue Zhang Marc G Genton Michael Sherman Huiyan Sang Theo Gasser Tilmann Gneiting Douglas W Nychka Martin Schlather Adrian E Raftery N Reid

Jonathan Tawn Laurens de Haan

Robin Henderson

Jiashun Jin Community Detection by SCORE Citation Network, II (further split, I)

Parametric Spatial by SCORE (304 nodes; 21 shown)

Fadoua Balabdaoui●

Laurens● de Haan

Adrian E● Raftery

Cristiano● Varin Jonathan● Tawn Martin Schlather● Anthony● OHagan Marc G● Genton Tilmann● Gneiting

Nicolas● Chopin N Reid● Douglas W● Nychka Haavard● Rue Huiyan● Sang

Michael● L Stein Paolo ●Vidoni Hao Zhang● (Purdue) Montserrat● Fuentes Leah J● Welty Sudipto ●Banerjee

Andrew ●O Finley

Jiashun Jin Community Detection by SCORE Citation Network, II (further split, II)

Nonparametric Spatial Statistics by SCORE (212 nodes; 21 shown)

Herbert ●K H Lee Natesh● Pillai Trivellore E ●Raghunathan

Steven N MacEachern● ● Matthew J Beal ● ● Robert B Gramacy Gary L● RosnerAthanasios Kottas David ●M Blei

● Fernando A QuintanaMark F● J Steel Yee Whye● Teh Alan E ●Gelfand Ju−Hyun● Park

Gareth ●Roberts Omiros Papaspiliopoulos● Paul Fearnhead● Radford● M Neal

Pilar L ●Iglesias

Alexandros● Beskos

Yi● Li

Jiashun Jin Community Detection by SCORE Citation Network, II (further split, III)

Non-parametrics/semi-parametrics by SCORE (392 nodes; 24 shown)

Brian S● Caffo

Mohammad Hosseini−Nasab●

Ulrich Stadtmuller●

Jeffrey S● Morris Ciprian M Crainiceanu● David Ruppert● Naisyin● Wang Vic Patrangenaru● Thomas ●C M Lee ● Hongtu● Zhu Rui Paulo Ray Carroll● Nilanjan Chatterjee● Joseph G● Ibrahim Rabi Bhattacharya●

● Alan H Welsh Robert A● Rigby Michael A● Benjamin

Hua Yun● Ming−HuiChen ● Chen

Silvia Shimakura● D Mikis Stasinopoulos● Robin Henderson● Theo Gasser●

Jiashun Jin Community Detection by SCORE Citation Network, III

Variable Selection by SCORE (1285 nodes; 40 shown)

Qiwei Yao Mohsen Pourahmadi

Elizaveta Levina Ji Zhu Hans−Georg Muller Peter HallPeter J Bickel Hansheng Wang Robert J Tibshirani Jinchi Lv Cun−Hui Zhang Heng Peng Emmanuel J Candes Jianhua ZLixing Huang Zhu Jian Huang Ming Yuan Peter Buhlmann Runze Li Xuming He Terence TaoTrevor J HastieHui Zou

Alexandre B Tsybakov Yi Lin Nicolai Meinshausen Joel L Horowitz Hao Helen Zhang R Dennis Cook

Michael R Kosorok

Dan Yu Lin L J Wei

Jiashun Jin Community Detection by SCORE Coauthorship Network, I

Objective Bayes by SCORE (64 nodes; 14 shown)

John A● Cafeo ● ●● ● ● 11 Fei● Liu ●● Jerry ●Sacks ● Carlos M Carvalho ● ● ● ● ● Steven N MacEachern ● 10 ● ● ● Rui Paulo● ● ● ● ● ● ● 9 ● ● ● ● ● ● ● Alan E Gelfand ● James O● Berger ● Athanasios● Kottas 8 ● ● ● ● ● ● ● R J Parthasarathy● ● ● ● 7 ● ● ● ● Gonzalo● Garcia−Donato● ● ● ● Daniel● Walsh J Palomo● 6 M J Bayarri● ● 0 5 10 15 20 ●

Jiashun Jin Community Detection by SCORE Coauthorship Network, II

Biostatistics by SCORE (388 nodes; 16 shown)

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Heping Zhang Steve Marron● ● ● ● ● ● ● Ji Zhu ● ● ● ● ● ● ● ● ● ● ● ● ● Hongtu● ● Zhu ● ● ● ● ● ● ● ● ● ● ● Louise● Ryan ●Weili● ● Lin ● ● ● ● ● ● ● ● ● ● ● ● ● Yimei●● ● ●Li Helen Zhang ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Joseph Ibrahim ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Eric Feuer Tapabrata● Maiti● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Debajyoti● ● Sinha ● ● Trivellore● Raghunathan ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Zhiliang● ● Ying ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Jun Liu ● ● ● ● ● ● ● ● ● ● ● ● ● L J Wei● ● ● ● ● ● ● David Dunson● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

Jiashun Jin Community Detection by SCORE Coauthorship Network, III

HDDA by SCORE (1811 nodes, 32 shown)

Hans−Georg● Muller ● Larry ●Brown Marc G● Genton Jane−Ling Wang Xuming● He T Tony● Cai Peter● Hall Lixing● Zhu Yanyuan● Ma Runze● Li Jianqing● Fan Wolfgang● Hardle Hua ●LiangMalay● Ghosh Song Xi● Chen Larry Wasserman● ● James R● Robins Ray CarrollBani Mallick● Enno Mammen● ● Alexandre● Tsybakov Andrea Rotnitzky● Xihong Lin Nilanjan ●Chatterjee Gerda Claeskens● Ciprian Crainiceanu● Trevor● Hastie Robert J Tibshirani● Holger● Dette Giovanni ●Parmigiani Christian● Robert Peter ●Muller

Jiashun Jin Community Detection by SCORE Comparisons, I

Undirected networks:

I Newman’s Spectral Clustering (NSC)

I Bickel and Chen’s Profile Likelihood (BCPL)

I Amini et al’s Pseudo Likelihood (APL) Directed networks: Leicht & Newman’s Spectral Clustering

Jiashun Jin Community Detection by SCORE Comparisons, II

Adjusted Rand Index (ARI); larger means more similar SCORE NSC APL BCPL SCORE 1.00 .55 .19 .00 NSC 1.00 .41 .00 APL 1.00 0.00 BCPL 1.00

Sizes of the 3 communities identified by SCORE, NSC, and APL Objective Bayes Biostat-Coau HDDA-Coau SCORE 64 388 1811 NSC 69 163 2031 APL 20 50 2193 SCORE ∩ NSC 55 162 1807 SCORE ∩ APL 20 50 1811 NSC ∩ APL 20 50 2032 SCORE ∩ NSC ∩ APL 20 50 1807

Jiashun Jin Community Detection by SCORE More on Coauthorship Network, I

“Theo. Statist. Learning” (15 nodes) and “Dim. Reduction” (14 nodes) Guilherme● RochaTao● Shi ● Xiangrong● Xin Yin Chen Karim ●Lounici Bin● Yu Liqiang● Ni ● Marten H Wegkamp ● Liliana Forzani ● Florentina● BuneaAnatoli B● Juditsky R Dennis Cook Alexandre ●B Tsybakov Francesca Chiaromonte● Nicolai Meinshausen● Philippe● Rigollet Lexin● Li Bing● Li Sara van● de Geer Yuexiao● Dong Peter Buhlmann● Liping● Zhu Lukas● Meier Lixing● Zhu Markus● Kalisch Winfried● Stute ● Xia● Cui Marloes H Maathuis Liugen● Xue Piet Groeneboom● Jon A Wellner● Fadoua Balabdaoui●

Jiashun Jin Community Detection by SCORE More on Coauthorship Network, II

“Johns Hopkins”, “Duke”, “Stanford”, “Quant. Reg.”, “Exp. Design”

Barry Rowlingson Carlos M Carvalho Armin Schwartzman Brian S Caffo Gary L Rosner Benjamin Yakir Chong-Zhi Di Gerard Letac David Siegmund Ciprian M Crainiceanu Helene Massam F Gosselin David Ruppert James G Scott John D Storey Dobrin Marchev Jonathan R Stroud Jonathan E Taylor Galin L Jones Maria De Iorio Keith J Worsley James P Hobert Mike West Nancy Ruonan Zhang John P Buonaccorsi Nicholas G Polson Ryan J Tibshirani John Staudenmayer Peter Muller Naresh M Punjabi Peter J Diggle Sheng Luo Hengjian Cui Andrey Pepelyshev Huixia Judy Wang Frank Bretz Jianhua Hu Holger Dette Jianhui Zhou Natalie Neumeyer Valen E Johnson Stanislav Volgushev Wing K Fung Stefanie Biedermann Xuming He Tim Holland-Letz Yijun Zuo Viatcheslav B Melas Zhongyi Zhu

Jiashun Jin Community Detection by SCORE More on Coauthorship Network, III

● ● ● ● ● ● ● ● ● David Dunson● ● ● ● ● ● ● ● ● ● ● ● ● ● Hongtu● ● Zhu● ● ● ● ● ● ● ● ● ● ● ● ● ● ●Joseph G Ibrahim ● ● ● ● ● ● ● ● ● Donglin● Zeng ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● T Tony● Cai ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Jianqing● ●● Fan ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● Raymond● J Carroll ● ● ● ● ● ● ● ● ● ● ● ● ● Hua● Liang ● ● ● ● ● ● ● ● ● ●Peter● Hall● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Hans−Georg● ● Muller● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Jing● ● Qin● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● David Dunson● ● ● ● ● ● ● ● ● ● ● ● ● ● Hongtu● ● Zhu● ● ● ● ● ● ● ● ● ● ● ● ● ● ●Joseph G Ibrahim ● ● ● ● ● ● ● ● ● Donglin● Zeng ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● T Tony● Cai ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Jianqing● ●● Fan ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● Raymond● J Carroll ● ● ● ● ● ● ● ● ● ● ● ● ● Hua● Liang ● ● ● ● ● ● ● ● ● ●Peter● Hall● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Hans−Georg● ● Muller● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Jing● ● Qin● ● ● ● ● ● ● ● ● ●

Jiashun Jin Community Detection by SCORE More on Coauthorship Network, IV

● ● ● ● ● ● ● ● ● David Dunson● ● ● ● ● ● ● ● ● ● ● ● ● ● Hongtu● ● Zhu● ● ● ● ● ● ● ● ● ● ● ● ● ● ●Joseph G Ibrahim ● ● ● ● ● ● ● ● ● Donglin● Zeng ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● T Tony● Cai ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Jianqing● ●● Fan ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● Raymond● J Carroll ● ● ● ● ● ● ● ● ● ● ● ● ● Hua● Liang ● ● ● ● ● ● ● ● ● ●Peter● Hall● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Hans−Georg● ● Muller● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Jing● ● Qin● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● David Dunson● ● ● ● ● ● ● ● ● ● ● ● ● ● Hongtu● ● Zhu● ● ● ● ● ● ● ● ● ● ● ● ● ● ●Joseph G Ibrahim ● ● ● ● ● ● ● ● ● Donglin● Zeng ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● T Tony● Cai ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Jianqing● ●● Fan ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● Raymond● J Carroll ● ● ● ● ● ● ● ● ● ● ● ● ● Hua● Liang ● ● ● ● ● ● ● ● ● ●Peter● Hall● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Hans−Georg● ● Muller● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Jing● ● Qin● ● ● ● ● ● ● ● ● ●

Jiashun Jin Community Detection by SCORE Take home messages

I Proposed a fast, flexible, easy-to-implement, yet effective, method: SCORE

I Successfully applied to Statisticians’ networks and found many meaningful communities

I Data sets: a fertile ground for future research (many results are not reported here)

References: Jin J (2015) Fast network community detection by SCORE. Ann. Statist. 43(1), 57-89. Ji P, Jin J (2014) Coauthorship and Citation networks for statisticians. arXiv.1410.2840.

Jiashun Jin Community Detection by SCORE