Community Detection in Hypergraphs: Optimal Statistical Limit and Efficient Algorithms

I (Eli) Chien (University of Illinois, Urbana Champaign), Chung-Yi Lin (National Taiwan University), I-Hsiang Wang (National Taiwan University)

Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018, Lanzarote, Spain. PMLR: Volume 84. Copyright 2018 by the author(s).

Abstract

In this paper, community detection in hypergraphs is explored. Under a generative hypergraph model called the "d-wise hypergraph stochastic block model" (d-hSBM), which naturally extends the Stochastic Block Model (SBM) from graphs to d-uniform hypergraphs, the fundamental limit on the asymptotic minimax misclassified ratio is characterized. To prove achievability, we propose a two-step polynomial time algorithm that provably achieves the fundamental limit in the sparse hypergraph regime. To prove optimality, the lower bound on the minimax risk is established by finding a smaller parameter space that contains the most dominant error events, inspired by the analysis in the achievability part. It turns out that the minimax risk decays exponentially fast to zero as the number of nodes tends to infinity, and the rate function is a weighted combination of several divergence terms, each of which is the Rényi divergence of order 1/2 between two Bernoulli distributions. The Bernoulli distributions involved in the characterization of the rate function are those governing the random instantiation of hyperedges in d-hSBM. Experimental results on both synthetic and real-world data validate our theoretical finding.

1 INTRODUCTION

Community detection (clustering) has received great attention recently across many applications, including social science, biology, computer science, and machine learning, while it is usually an ill-posed problem due to the lack of ground truth. A prevalent way to circumvent the difficulty is to formulate it as an inverse problem on a graph $G = \{V, E\}$, where each node $i \in V = [n] \triangleq \{1, \ldots, n\}$ is assigned a community (label) $\sigma(i) \in [K] \triangleq \{1, \ldots, K\}$ that serves as the ground truth. The ground-truth community assignment $\sigma : [n] \to [K]$ is hidden while the graph $G$ is revealed. Each edge in the graph models a certain kind of pairwise interaction between the two nodes. The goal of community detection is to determine $\sigma$ from $G$ by leveraging the fact that different combinations of community relations lead to different likelihoods of edge connectivity. A canonical statistical model is the stochastic block model (SBM) (Holland et al., 1983), also known as the planted partition model (Condon and Karp, 2001), which generates randomly connected edges from a set of labeled nodes. The presence of the $\binom{n}{2}$ edges is governed by $\binom{n}{2}$ independent Bernoulli random variables, and the parameter of each of them depends on the community assignments of the two nodes in the corresponding edge.

Through the lens of statistical decision theory, the fundamental statistical limits of community detection provide a way to benchmark various community detection algorithms. Under SBM, the fundamental statistical limits have been characterized recently. One line of work takes a Bayesian perspective, where the unknown labeling $\sigma$ of nodes in $V$ is assumed to be distributed according to a certain prior, and one of the most common assumptions is that the labels are i.i.d. over nodes. Along this line, the fundamental limit for exact recovery is characterized (Abbe et al., 2016) in full generality, while partial recovery remains open in general. See the survey (Abbe, 2017) for more details and references therein. A second line of work takes a minimax perspective, and the goal is to characterize the minimax risk, which is typically the mismatch ratio between the true community assignment and the recovered one. In (Zhang and Zhou, 2016), a tight asymptotic characterization of the minimax risk for community detection in SBM is found. Along with these theoretical results, several algorithms have been proposed to achieve these limits, including degree-profiling comparison (Abbe and Sandon, 2015) for exact recovery, spectral MLE (Yun and Proutiere, 2015) for almost-exact recovery, and a two-step mechanism (Gao et al., 2017) under the minimax framework.

However, graphs can only capture pairwise relational information, while such a dyadic measure may be inadequate in many applications, such as the task of 3-D subspace clustering (Agarwal et al., 2005) and the higher-order graph matching problem in computer vision (Duchenne et al., 2011). Therefore, it is natural to model such beyond-pairwise interaction by a hyperedge in a hypergraph and study the clustering problem in a hypergraph setting. Hypergraph partitioning has been investigated in computer science, and several algorithms have been proposed, including spectral methods based on clique expansion (Agarwal et al., 2006), the hypergraph Laplacian (Zhou et al., 2006), a tensor method (Ghoshdastidar and Dukkipati, 2015), and linear programming (Li et al., 2016), to name a few. Existing approaches, though, mainly focus on optimizing a certain score function entirely based on the connectivity of the observed hypergraph and do not view it as a statistical estimation problem.

In this paper, we investigate the community detection problem in hypergraphs through the lens of statistical decision theory. Our goal is to characterize the fundamental statistical limit and develop computationally feasible algorithms to achieve it. As for the generative model for hypergraphs, one natural extension of the SBM to a hypergraph setting is the hypergraph stochastic block model (hSBM), where the presence of an order-$h$ hyperedge $e \subset V$ (i.e. $|e| = h \le M$, the maximum edge cardinality) is governed by a Bernoulli random variable with parameter $\theta_e$, and the presences of different hyperedges are mutually independent. Despite the success of the aforementioned algorithms on many practical datasets, it remains open how they perform in hSBM, since the fundamental limits have not been characterized and the probabilistic nature of hSBM has not been fully utilized.

The hypergraph stochastic block model was first introduced in (Ghoshdastidar and Dukkipati, 2014) as the planted partition model in random uniform hypergraphs, where each hyperedge has the same cardinality. The uniform assumption was later relaxed in a follow-up work (Ghoshdastidar and Dukkipati, 2017), where a more general hSBM with mixing edge orders is considered. In (Angelini et al., 2015), the authors consider the sparse regime and propose a spectral method based on a generalization of the non-backtracking operator. Besides, a weak consistency condition is derived in (Ghoshdastidar and Dukkipati, 2017) for hSBM by using the hypergraph Laplacian. Departing from SBM, an extension of the censored block model to the hypergraph setting is considered in (Ahn et al., 2016), where an information theoretic limit on the sample complexity for exact recovery is characterized.

As a first step towards characterizing the fundamental limit of community detection in hypergraphs, in this work we focus on the "d-wise hypergraph stochastic block model" (d-hSBM), where all hyperedges generated in the hypergraph stochastic block model are of order $d$. Our main contributions are two-fold. First, we give a tight asymptotic characterization of the optimal minimax risk in d-hSBM for any $d$. Second, we propose a polynomial time algorithm which provably achieves the minimax risk under mild regularity conditions. Throughout the paper, the order $d$ and the number of communities $K$ are both treated as constants, while other parameters (hyperedge connection probability) may be coupled with $n$. The proposed algorithm consists of two steps. The first step is a global estimator that roughly recovers the hidden community assignment to a certain precision level, and the second step refines the estimated assignment based on the underlying probabilistic model. This refine-after-initialize concept has also been used in graph clustering (Abbe and Sandon, 2015; Yun and Proutiere, 2015; Gao et al., 2017) and ranking (Chen and Suh, 2015). The proposed algorithm performs well on both synthetic data and real-world data. The experimental results validate the theoretical finding that not only is the refinement step critical in achieving the optimal statistical limit, but it is also significantly better to use hypergraphs rather than graphs for the community detection problem.

The characterized minimax risk in d-hSBM is an exponential rate, and the error exponent turns out to be a linear combination of Rényi divergences of order 1/2. Each divergence term in the sum corresponds to a pair of community relations that would be confused with one another when there is only one misclassification, and the weighted coefficient associated with it indicates the total number of such confusing patterns. Probabilistically, there may well be two or more misclassifications, with each confusing relation pair pertaining to a Rényi divergence when analyzing the error probability. However, we demonstrate technically that these situations are all dominated by the error event with a single misclassified node, which leaves only the "neighboring" divergence terms in the asymptotic expression. The main technical challenge resolved in this work is attributed to the fact that the community relations become much more complicated as the order $d$ increases, meaning that more error events may arise compared to the much simpler homogeneous graph SBM case. In the proof of achievability, we show that the refinement step is able to achieve the fundamental limit provided that the initialization step satisfies a certain weak consistency condition.

... $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$. Also, $S_n$ is the symmetric group of degree $n$, which contains all the permutations ...
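To make the quantities named in the introduction concrete, the following is a minimal Python sketch and not the authors' algorithm. It assumes a simplified two-parameter d-hSBM in which a hyperedge appears with probability `p_in` when all d nodes share a community and `p_out` otherwise; the model in the paper allows the Bernoulli parameter to depend on the full community relation of the d nodes. The function names `renyi_half_bernoulli`, `sample_d_hsbm`, and `mismatch_ratio` are hypothetical. The divergence uses the standard closed form $D_{1/2}(p \| q) = -2 \log\left(\sqrt{pq} + \sqrt{(1-p)(1-q)}\right)$, and the mismatch ratio is minimized over relabelings of the K communities by brute force, which is only practical for small K.

```python
import itertools
import math
import random


def renyi_half_bernoulli(p, q):
    """Renyi divergence of order 1/2 between Bernoulli(p) and Bernoulli(q)."""
    # Closed form: D_{1/2}(p || q) = -2 * log(sqrt(p q) + sqrt((1 - p)(1 - q))).
    affinity = math.sqrt(p * q) + math.sqrt((1.0 - p) * (1.0 - q))
    if affinity == 0.0:
        # Disjoint supports (e.g. p = 0, q = 1): the divergence is infinite.
        return math.inf
    return -2.0 * math.log(affinity)


def sample_d_hsbm(labels, d, p_in, p_out, rng):
    """Sample a d-uniform hypergraph under a simplified d-hSBM (assumption).

    Each of the C(n, d) candidate hyperedges is drawn independently:
    with probability p_in if all d nodes share one label, else p_out.
    """
    edges = []
    for e in itertools.combinations(range(len(labels)), d):
        same_community = len({labels[i] for i in e}) == 1
        if rng.random() < (p_in if same_community else p_out):
            edges.append(e)
    return edges


def mismatch_ratio(truth, estimate, num_communities):
    """Fraction of misclassified nodes, minimized over label permutations."""
    n = len(truth)
    best = n
    for perm in itertools.permutations(range(num_communities)):
        # perm relabels the estimated communities before comparison.
        errors = sum(1 for t, s in zip(truth, estimate) if t != perm[s])
        best = min(best, errors)
    return best / n


if __name__ == "__main__":
    rng = random.Random(0)
    labels = [0] * 6 + [1] * 6
    hyperedges = sample_d_hsbm(labels, d=3, p_in=0.5, p_out=0.05, rng=rng)
    print(len(hyperedges), "hyperedges sampled")
    print("D_1/2 between the two edge laws:", renyi_half_bernoulli(0.5, 0.05))
```

Enumerating all C(n, d) candidate hyperedges is only feasible for small n and d. The sketch is meant to show the independence structure of the model rather than to scale.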
