arXiv:1808.05185v1 [stat.ME] 15 Aug 2018
Model-based clustering for random hypergraphs

Tin-Lok James Ng¹, Thomas Brendan Murphy¹
School of Mathematics and Statistics, University College Dublin

Abstract

A probabilistic model for random hypergraphs is introduced to represent unary, binary and higher order interactions among objects in real-world problems. This model is an extension of the Latent Class Analysis model, which captures clustering structures among objects. An EM (expectation maximization) algorithm with MM (minorization maximization) steps is developed to perform parameter estimation, while a cross validated likelihood approach is employed to perform model selection. The developed model is applied to three real-world data sets where interesting results are obtained.

Keywords: Hypergraph, Latent Class Analysis, Minorization Maximization

1. Introduction

A large number of random graph models have been proposed [36, 20, 19, 27] to describe complex interactions among objects of interest. Pairwise relationships among objects can be naturally represented as a graph, in which the objects are represented by the vertices, and two vertices are joined by an edge if a certain relationship exists between them. While graphs are capable of representing pairwise interactions between objects, they are inadequate to represent the higher order and unary interactions that are typically observed in many real-world problems. Examples of higher-order and unary relationships include co-authorship on academic papers, co-appearance in movie scenes, and songs performed in a concert.

For example, the study of coauthorship networks of scientists has attracted significant research interest in both the natural and social sciences [34, 35, 33, 32, 1]. Such networks are typically constructed by connecting two scientists if they have coauthored one or more papers together.
However, as we will illustrate below, such a representation inevitably results in a loss of information, whereas a hypergraph representation naturally preserves all information. A hypergraph is a generalization of a graph in which hyperedges are arbitrary sets of vertices and can contain any number of vertices. As a result, hypergraphs are capable of representing relationships of arbitrary order.

¹ This work was supported by the Science Foundation Ireland funded Insight Research Centre (SFI/12/RC/2289). Preprint submitted to Journal of LaTeX Templates, August 16, 2018.

We consider a simple example of a coauthorship network with 7 authors and 4 papers in order to illustrate the benefits of hypergraph modelling. A hypergraph representation of the network is given in Figure 1, where the vertices $v_1, v_2, \ldots, v_7$ represent the authors and the hyperedges $e_1, \ldots, e_4$ represent the papers. For example, the paper $e_1$ is written by the four authors $v_1$, $v_2$, $v_3$ and $v_4$, the paper $e_2$ is written by the two authors $v_2$ and $v_3$, and the paper $e_4$ has the single author $v_4$.

On the other hand, a graph representation of this coauthorship network, with an edge between any two authors who have coauthored at least one paper, results in the edge set $\{(v_1, v_2), (v_1, v_3), (v_1, v_4), (v_2, v_3), (v_2, v_4), (v_3, v_4), (v_3, v_5), (v_3, v_6), (v_5, v_6)\}$. It is evident that much information is lost with this representation. In particular, this representation removes information about the number of authors that co-authored a paper. For example, one can deduce from this edge set only that $v_3$ has co-authored with $v_1$ and $v_2$, but not that the co-authorship was on the same paper. Furthermore, the hyperedge $e_4$, which contains the single vertex $v_4$, is left out of the graph representation.

A number of random hypergraph models have been studied in the probability and combinatorics literature, where theoretical properties such as phase transitions and chromatic numbers were investigated [23, 15, 5, 9, 38].
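The information loss in the coauthorship example above can be checked with a short sketch. The hyperedge $e_3 = \{v_3, v_5, v_6\}$ is not stated in the text and is inferred here from the quoted edge set; the function name `clique_projection` and the alternative set of papers `pairwise_papers` are hypothetical and introduced only for illustration.

```python
from itertools import combinations

# Papers (hyperedges) of the example, with authors labelled 1..7.
# e3 = {3, 5, 6} is an assumption inferred from the edge set in the text.
hyperedges = {
    "e1": {1, 2, 3, 4},
    "e2": {2, 3},
    "e3": {3, 5, 6},
    "e4": {4},
}

def clique_projection(papers):
    """Join two authors by an edge if they share at least one paper."""
    edges = set()
    for authors in papers.values():
        # A paper with k authors contributes all k*(k-1)/2 author pairs;
        # a single-author paper contributes nothing.
        edges.update(combinations(sorted(authors), 2))
    return edges

edges = clique_projection(hyperedges)

# A different collection of papers: one two-author paper per edge.
pairwise_papers = {f"paper{k}": set(pair) for k, pair in enumerate(sorted(edges))}

# Both collections yield the same graph, so the graph cannot tell them apart.
same_graph = clique_projection(pairwise_papers) == edges
print(sorted(edges))
print("same projected graph:", same_graph)
```

Two distinct hypergraphs map to one edge set here, and the single-author paper `e4` leaves no trace in it, which is the loss of information described above.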
A novel parametrization of distributions on hypergraphs based on the geometry of points is proposed in [30], which is used to infer Markov structure for multivariate distributions. On the other hand, statistical modeling with random hypergraphs is less explored. [44] introduced the hypergraph beta model with three variants, which is a natural extension of the beta model for random graphs [21]. In their model, the probability of a hyperedge $e$ appearing in the hypergraph is parameterized by a vector $\beta \in \mathbb{R}^N$, which represents the "attractiveness" of each vertex. However, their model does not capture clustering among objects, which is a typical real-world phenomenon. In addition, the assumption of an upper bound on the size of hyperedges is violated by many real-world data sets.

One may equivalently represent a hypergraph using a bipartite network (also called a two-mode network or an affiliation network). Two-mode networks consist of two different kinds of vertices, and edges can only be observed between vertices of different types, not between vertices of the same type. A hypergraph can be represented as a two-mode network by considering the hyperedges as a second type of vertex. For example, an equivalent bipartite representation of the hypergraph shown in Figure 1 is provided in Figure 2, where the hyperedges $\{e_1, \ldots, e_4\}$ are replaced by the four green vertices.

[Figure 1: A hypergraph representation of a coauthorship network.]

Two-mode networks have been studied in various disciplines including computer science [37], social sciences [10, 40, 24, 12] and physics [29]. A number of approaches have been proposed to analyze and model two-mode network data [2, 40, 8, 26, 46, 43]. In particular, models originally developed for binary networks were extended to two-mode networks. [8] developed a blockmodeling approach for two-mode network data which aims to simultaneously partition the two types of vertices into blocks.
[41] proposed exponential random graph models (ERGMs) for two-mode networks, which model the logit of the probability of an actor belonging to an event as a function of actor- and event-specific effects and other graph statistics. A clustering algorithm for two-mode networks is developed in [11] based on the modelling framework in [41]. Several extensions to the ERGMs for bipartite networks have been proposed in recent years [46, 45]. [43] proposed a methodology for studying the co-evolution of two-mode and one-mode networks. A network autocorrelation model for two-mode networks is introduced in [13].

Representing network observations using two-mode networks has the benefit of modelling vertices of both types jointly. However, in analyzing a two-mode network, one type of vertex may attract most of the interest. For example, in co-authorship networks, the main interest may lie in the collaborations rather than in the co-authored papers. In such scenarios, a hypergraph representation is the most natural one: one type of vertex is converted into hyperedges with no loss of information.

In this paper, we propose the Extended Latent Class Analysis (ELCA) model for random hypergraphs, which is a natural extension of the Latent Class Analysis (LCA) model [28, 16, 3] and includes the LCA model as a special case. The model is applied to two applications, including Star Wars movie scenes and Lady Gaga concerts in 2014.

2. Model and Motivation

2.1. Hypergraph

A hypergraph is represented by a pair $H = (V, E)$, where $V = \{V_1, V_2, \cdots, V_N\}$ is the set of $N$ vertices and $E = \{e_1, e_2, \cdots, e_M\}$ is the set of $M$ hyperedges. A hyperedge $e$ is a subset of $V$, and we allow repetitions in the hyperedge set $E$. Thus, the hypergraph $H$ can alternatively be represented by an $N \times M$ matrix $x = (x_{ij})$, where $x_{ij} = 1$ if vertex $V_i$ appears in hyperedge $e_j$ and $x_{ij} = 0$ otherwise.

[Figure 2: Bipartite graph representation of the hypergraph in Figure 1.]

2.2. Latent Class Analysis Model for Random Hypergraphs

The binary latent class analysis (LCA) model [28, 16] is a commonly used mixture model for high dimensional binary data. It assumes that each observation is a member of one and only one of the $G$ latent classes and that, conditional on the latent class membership, the manifest variables are mutually independent. The LCA model appears to be a natural candidate to model random hypergraphs: hyperedges are partitioned into $G$ latent classes, and the probability that a hyperedge $e \in E$ contains a vertex $v \in V$ depends only on the latent class assignment of $e$.

Let $\pi = (\pi_1, \cdots, \pi_G)$ be the a priori latent class assignment probabilities, where $G$ is the number of latent classes, and define the $N \times G$ matrix $p = (p_{ig})$, where $p_{ig}$ is the probability that vertex $V_i$ is contained in a hyperedge $e$ with latent class label $g$. The likelihood function can be written as

\[
L(x; p, \pi) = \prod_{j=1}^{M} \left[ \sum_{g=1}^{G} \pi_g \prod_{i=1}^{N} p_{ig}^{x_{ij}} (1 - p_{ig})^{1 - x_{ij}} \right].
\]

By introducing the $M \times G$ latent class membership matrix $z^{(1)} = (z^{(1)}_{jg})$, where $z^{(1)}_{jg} = 1$ if hyperedge $e_j$ has latent class label $g$ and $z^{(1)}_{jg} = 0$ otherwise, the complete data likelihood of $x$ and $z^{(1)}$ can be expressed as (1).

\[
L(x, z^{(1)}; p, \pi) = \prod_{j=1}^{M} \prod_{g=1}^{G} \left[ \pi_g \prod_{i=1}^{N} p_{ig}^{x_{ij}} (1 - p_{ig})^{1 - x_{ij}} \right]^{z^{(1)}_{jg}}. \tag{1}
\]

In comparison to the hypergraph beta models introduced in [44], the LCA model is capable of capturing the clustering and heterogeneity of hyperedges. For example, academic papers can be naturally labelled according to subject areas and, conditional on a paper being labelled mathematics, one would expect the probability that a mathematician co-authored the paper to be higher than the probability that a biologist did. The LCA model does not assume an upper bound on the size of hyperedges and can model hyperedges of any size.
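As a minimal sketch of the quantities defined in Sections 2.1 and 2.2, the following builds the incidence matrix $x$ for the Figure 1 example and evaluates the logarithm of the LCA likelihood. It also computes the posterior class probabilities of each hyperedge, which is the standard E-step of an EM algorithm for plain LCA and is not claimed to be the authors' algorithm for the extended model. The parameter values `pi` and `p`, the choice $G = 2$, the hyperedge $e_3 = \{v_3, v_5, v_6\}$ and all function names are assumptions made for illustration only.

```python
import math

def incidence_matrix(n_vertices, hyperedges):
    """N x M 0/1 matrix with x[i][j] = 1 iff vertex i+1 lies in hyperedge j."""
    return [[1 if (i + 1) in e else 0 for e in hyperedges]
            for i in range(n_vertices)]

def class_log_terms(x, p, pi, j):
    """For hyperedge j, log(pi_g * prod_i p_ig^x_ij (1-p_ig)^(1-x_ij)) per class g."""
    terms = []
    for g in range(len(pi)):
        s = math.log(pi[g])
        for i in range(len(x)):
            s += math.log(p[i][g]) if x[i][j] else math.log(1.0 - p[i][g])
        terms.append(s)
    return terms

def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_likelihood(x, p, pi):
    """Log of the observed-data likelihood: sum over hyperedges of log mixture."""
    return sum(log_sum_exp(class_log_terms(x, p, pi, j))
               for j in range(len(x[0])))

def responsibilities(x, p, pi):
    """Posterior probability of each latent class for each hyperedge."""
    out = []
    for j in range(len(x[0])):
        terms = class_log_terms(x, p, pi, j)
        norm = log_sum_exp(terms)
        out.append([math.exp(t - norm) for t in terms])
    return out

# Figure 1 example; e3 is inferred from the edge set quoted in the Introduction.
hyperedges = [{1, 2, 3, 4}, {2, 3}, {3, 5, 6}, {4}]
x = incidence_matrix(7, hyperedges)

# Hypothetical parameters for G = 2 classes (all probabilities strictly in (0, 1)).
pi = [0.5, 0.5]
p = [[0.8, 0.1], [0.8, 0.1], [0.8, 0.4], [0.8, 0.1],
     [0.1, 0.6], [0.1, 0.6], [0.1, 0.1]]

ll = log_likelihood(x, p, pi)
resp = responsibilities(x, p, pi)
print("log-likelihood:", ll)
for j, r in enumerate(resp, start=1):
    print(f"e{j}:", [round(v, 4) for v in r])
```

Working on the log scale with `log_sum_exp` avoids underflow when $N$ is large, since each class term is a product of $N$ probabilities.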