Submodular Hypergraphs: p-Laplacians, Cheeger Inequalities and Spectral Clustering

Pan Li 1 Olgica Milenkovic 1

Abstract

We introduce submodular hypergraphs, a family of hypergraphs that have different submodular weights associated with different cuts of hyperedges. Submodular hypergraphs arise in clustering applications in which higher-order structures carry relevant information. For such hypergraphs, we define the notion of p-Laplacians and derive corresponding nodal domain theorems and k-way Cheeger inequalities. We conclude with the description of algorithms for computing the spectra of 1- and 2-Laplacians that constitute the basis of new spectral hypergraph clustering methods.

1. Introduction

Spectral clustering algorithms are designed to solve a relaxation of the graph cut problem based on graph Laplacians that capture pairwise dependencies between vertices, and produce sets with small conductance that represent clusters. Due to their scalability and provable performance guarantees, spectral methods represent one of the most prevalent graph clustering approaches (Chung, 1997; Ng et al., 2002).

Many relevant problems in clustering, semisupervised learning and MAP inference (Zhou et al., 2007; Hein et al., 2013; Zhang et al., 2017) involve higher-order dependencies that require one to consider hypergraphs instead of graphs. To address spectral hypergraph clustering problems, several approaches have been proposed that typically operate by first projecting the hypergraph onto a graph via clique expansion and then performing spectral clustering on graphs (Zhou et al., 2007). Clique expansion involves transforming a weighted hyperedge into a weighted clique such that the graph cut weights approximately preserve the cut weights of the hyperedge. Almost exclusively, these approximations have been based on the assumption that each hyperedge cut has the same weight, in which case the underlying hypergraph is termed homogeneous.

However, in MAP inference on Markov random fields (Arora et al., 2012; Shanu et al., 2016), network motif studies (Li & Milenkovic, 2017; Benson et al., 2016; Tsourakakis et al., 2017) and rank learning (Li & Milenkovic, 2017), higher-order relations between vertices captured by hypergraphs are typically associated with different cut weights. In (Li & Milenkovic, 2017), Li and Milenkovic generalized the notion of hyperedge cut weights by assuming that different hyperedge cuts have different weights, and that consequently, each hyperedge is associated with a vector of weights rather than a single scalar weight. If the weights of the hyperedge cuts are submodular, then one can use a graph with nonnegative edge weights to efficiently approximate the hypergraph, provided that the largest size of a hyperedge is a relatively small constant. This property of the projected hypergraphs allows one to leverage spectral hypergraph clustering algorithms based on clique expansions with provable performance guarantees. Unfortunately, the clique expansion method in general has two drawbacks: the spectral clustering algorithm for graphs used in the second step is merely quadratically optimal, while the projection step can cause a large distortion.

To address the quadratic optimality issue in graph clustering, Amghibech (Amghibech, 2003) introduced the notion of p-Laplacians of graphs and derived Cheeger-type inequalities for the second smallest eigenvalue of a p-Laplacian, $p > 1$, of a graph. These results motivated Bühler and Hein's work (Bühler & Hein, 2009) on spectral clustering based on p-Laplacians that provided tighter approximations of the Cheeger constant. Szlam and Bresson (Szlam & Bresson, 2010) showed that the 1-Laplacian allows one to exactly compute the Cheeger constant, but at the cost of computational hardness (Chang, 2016). Very little is known about the use of p-Laplacians for hypergraph clustering and their spectral properties.

1 Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, USA. Correspondence to: Pan Li, Olgica Milenkovic.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

To address the clique expansion problem, Hein et al. (Hein et al., 2013) introduced a clustering method for homogeneous hypergraphs that avoids expansions and works directly with the total variation of homogeneous hypergraphs, without investigating the spectral properties of the operator. The only other line of work trying to mitigate the projection problem is due to Louis (Louis, 2015), who used a natural extension of 2-Laplacians for homogeneous hypergraphs, derived quadratically-optimal Cheeger-type inequalities and proposed a semidefinite programming (SDP) based algorithm whose complexity scales with the size of the largest hyperedge in the hypergraph.

Our contributions are threefold. First, we introduce submodular hypergraphs. Submodular hypergraphs allow one to perform hyperedge partitionings that depend on the subsets of elements involved in each part, thereby respecting higher-order and other constraints in graphs (see (Li & Milenkovic, 2017; Arora et al., 2012; Fix et al., 2013) for applications in food network analysis, learning to rank, subspace clustering and image segmentation).

Second, we define p-Laplacians for submodular hypergraphs and generalize the corresponding discrete nodal domain theorems (Tudisco & Hein, 2016; Chang et al., 2017) and higher-order Cheeger inequalities. Even for homogeneous hypergraphs, nodal domain theorems were not known, and only one low-order Cheeger inequality for 2-Laplacians was established by Louis (Louis, 2015). An analytical obstacle in the development of such a theory is the fact that p-Laplacians of hypergraphs are operators that act on vectors and produce sets of values. Consequently, operators and eigenvalues have to be defined in a set-theoretic manner.

Third, based on the newly established spectral hypergraph theory, we propose two spectral clustering methods that learn the second smallest eigenvalues of 2- and 1-Laplacians. The algorithm for 2-Laplacian eigenvalue computation is based on an SDP framework and can provably achieve quadratic optimality with an $O(\sqrt{\zeta(E)})$ approximation constant, where $\zeta(E)$ denotes the size of the largest hyperedge in the hypergraph. The algorithm for 1-Laplacian eigenvalue computation is based on the inverse power method (IPM) (Hein & Bühler, 2010) that only has convergence guarantees. The key novelty of the IPM-based method is that the critical inner-loop optimization problem of the IPM is efficiently solved by algorithms recently developed for decomposable submodular minimization (Jegelka et al., 2013; Ene & Nguyen, 2015; Li & Milenkovic, 2018). Although without performance guarantees, given that the 1-Laplacian provides the tightest approximation guarantees, the IPM-based algorithm, as opposed to the clique expansion method (Li & Milenkovic, 2017), performs very well empirically even when the size of the hyperedges is large. This fact is illustrated on several UC Irvine machine learning datasets available from (Asuncion & Newman, 2007).

The paper is organized as follows. Section 2 contains an overview of graph Laplacians and introduces the notion of submodular hypergraphs. The section also contains a description of hypergraph Laplacians and relevant concepts in submodular function theory. Section 3 presents the fundamental results in the spectral theory of p-Laplacians, while Section 4 introduces two algorithms for evaluating the second smallest eigenvalue of p-Laplacians needed for 2-way clustering. Section 5 presents experimental results. All proofs are relegated to the Supplementary Material.

2. Mathematical Preliminaries

A weighted graph $G = (V, E, w)$ is an ordered pair of two sets, the vertex set $V = [N] = \{1, 2, \ldots, N\}$ and the edge set $E \subseteq V \times V$, equipped with a weight function $w: E \to \mathbb{R}_+$.

A cut $C = (S, \bar{S})$ is a bipartition of the set $V$, while the cut-set (boundary) of the cut $C$ is defined as the set of edges that have one endpoint in $S$ and one in the complement of $S$, $\bar{S}$, i.e., $\partial S = \{(u, v) \in E \mid u \in S, v \in \bar{S}\}$. The weight of the cut induced by $S$ equals $\mathrm{vol}(\partial S) = \sum_{u \in S, v \in \bar{S}} w_{uv}$, while the conductance of the cut is defined as
$$c(S) = \frac{\mathrm{vol}(\partial S)}{\min\{\mathrm{vol}(S), \mathrm{vol}(\bar{S})\}},$$
where $\mathrm{vol}(S) = \sum_{u \in S} \mu_u$ and $\mu_u = \sum_{v \in V} w_{uv}$. Whenever clear from the context, for $e = (uv)$, we write $w_e$ instead of $w_{uv}$. Note that in this setting, the vertex weights $\mu_u$ are determined based on the weights of the edges $w_e$ incident to $u$. Clearly, one can use a different choice for these weights and make them independent from the edge weights, which is a generalization we pursue in the context of submodular hypergraphs. The smallest conductance of any bipartition of a graph $G$ is denoted by $h_2$ and referred to as the Cheeger constant of the graph.

A generalization of the Cheeger constant is the $k$-way Cheeger constant of a graph $G$. Let $P_k$ denote the set of all partitions of $V$ into $k$ disjoint nonempty subsets, i.e., $P_k = \{(S_1, S_2, \ldots, S_k) \mid S_i \subset V, S_i \neq \emptyset, S_i \cap S_j = \emptyset, \forall i, j \in [k], i \neq j\}$. The $k$-way Cheeger constant is defined as
$$h_k = \min_{(S_1, S_2, \ldots, S_k) \in P_k} \max_{i \in [k]} c(S_i).$$

Spectral graph theory provides a means for bounding the Cheeger constant using the (unnormalized or normalized) Laplacian matrix of the graph, defined as $L = D - A$ and $\mathcal{L} = I - D^{-1/2} A D^{-1/2}$, respectively. Here, $A$ stands for the adjacency matrix of the graph, $D$ denotes the diagonal degree matrix, while $I$ stands for the identity matrix. The graph Laplacian is an operator $\triangle_2^{(g)}$ (Chung, 1997) that satisfies
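As a concrete illustration of the definitions above, the following sketch computes the conductance $c(S)$ and the Cheeger constant $h_2$ of a small weighted graph by brute-force enumeration. The graph, weights, and all names are hypothetical, not taken from the paper.

```python
# Hypothetical 4-vertex weighted graph: two heavy pairs joined by a light edge.
from itertools import combinations

V = [0, 1, 2, 3]
w = {(0, 1): 2.0, (2, 3): 2.0, (1, 2): 0.1}

def weight(u, v):
    return w.get((u, v), w.get((v, u), 0.0))

mu = {u: sum(weight(u, v) for v in V) for u in V}  # mu_u = sum_v w_uv

def conductance(S):
    Sbar = [v for v in V if v not in S]
    cut = sum(weight(u, v) for u in S for v in Sbar)  # vol(boundary of S)
    return cut / min(sum(mu[u] for u in S), sum(mu[v] for v in Sbar))

# Cheeger constant h2: minimum conductance over all nonempty proper subsets.
h2 = min(conductance(S) for k in range(1, len(V)) for S in combinations(V, k))
print(h2)  # attained by the balanced cut {0,1} | {2,3}
```

The enumeration is exponential in $|V|$; the point of the spectral machinery discussed below is precisely to avoid it.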

$$\langle x, \triangle_2^{(g)}(x)\rangle = \sum_{(uv) \in E} w_{uv} (x_u - x_v)^2.$$

A generalization of the above operator, termed the p-Laplacian operator of a graph $\triangle_p^{(g)}$, was introduced by Amghibech in (Amghibech, 2003), where
$$\langle x, \triangle_p^{(g)}(x)\rangle = \sum_{(uv) \in E} w_{uv} |x_u - x_v|^p.$$

The well-known Cheeger inequality asserts the following relationship between $h_2$ and $\lambda$, the second smallest eigenvalue of the normalized Laplacian $\triangle_2^{(g)}$ of a graph:
$$h_2 \le \sqrt{2\lambda} \le 2\sqrt{h_2}.$$
It can be shown that the cut $\hat{h}_2$ dictated by the elements of the eigenvector associated with $\lambda$ satisfies $\hat{h}_2 \le \sqrt{2\lambda}$, which implies $\hat{h}_2 \le 2\sqrt{h_2}$. Hence, spectral clustering provides a quadratically optimal graph partitioning.

2.1. Submodular Hypergraphs

A weighted hypergraph $G = (V, E, w)$ is an ordered pair of two sets, the vertex set $V = [N]$ and the hyperedge set $E \subseteq 2^V$, equipped with a weight function $w: E \to \mathbb{R}_+$. The relevant notions of cuts, boundaries and volumes for hypergraphs can be defined in a similar manner as for graphs. If each cut of a hyperedge $e$ has the same weight $w_e$, we refer to the cut as a homogeneous cut and to the corresponding hypergraph as a homogeneous hypergraph.

For a ground set $\Omega$, a set function $f: 2^{\Omega} \to \mathbb{R}$ is termed submodular if for all $S, T \subseteq \Omega$, one has $f(S) + f(T) \ge f(S \cup T) + f(S \cap T)$.

A weighted hypergraph $G = (V, E, \mu, w)$ is termed a submodular hypergraph with vertex set $V$, hyperedge set $E$ and positive vertex weight vector $\mu = \{\mu_v\}_{v \in V}$, if each hyperedge $e \in E$ is associated with a submodular weight function $w_e(\cdot): 2^e \to [0, 1]$. In addition, we require the weight function $w_e(\cdot)$ to be:

1) Normalized, so that $w_e(\emptyset) = 0$, and all cut weights corresponding to a hyperedge $e$ are normalized by $\vartheta_e = \max_{S \subseteq e} w_e(S)$. In this case, $w_e(\cdot) \in [0, 1]$;

2) Symmetric, so that $w_e(S) = w_e(e \setminus S)$ for any $S \subseteq e$.

The submodular hyperedge weight functions are summarized in the vector $w = \{(w_e, \vartheta_e)\}_{e \in E}$. If $w_e(S) = 1$ for all $S \in 2^e \setminus \{\emptyset, e\}$, submodular hypergraphs reduce to homogeneous hypergraphs. We omit the designation homogeneous whenever there is no context ambiguity.

Clearly, a vertex $v$ is in $e$ if and only if $w_e(\{v\}) > 0$: if $w_e(\{v\}) = 0$, the submodularity property implies that $v$ is not incident to $e$, as for any $S \subseteq e \setminus \{v\}$, $|w_e(S \cup \{v\}) - w_e(S)| \le w_e(\{v\}) = 0$.

We define the degree of a vertex $v$ as $d_v = \sum_{e \in E: v \in e} \vartheta_e$, i.e., as the sum of the max weights of edges incident to the vertex $v$. Furthermore, for any vector $y \in \mathbb{R}^N$, we define the projection weight of $y$ onto any subset $S \subseteq V$ as $y(S) = \sum_{v \in S} y_v$. The volume of a subset of vertices $S \subseteq V$ equals $\mathrm{vol}(S) = \sum_{v \in S} \mu_v$.

For any $S \subseteq V$, we generalize the notions of the boundary of $S$ and the volume of the boundary of $S$ according to $\partial S = \{e \in E \mid e \cap S \neq \emptyset, e \cap \bar{S} \neq \emptyset\}$, and
$$\mathrm{vol}(\partial S) = \sum_{e \in \partial S} \vartheta_e w_e(S) = \sum_{e \in E} \vartheta_e w_e(S), \qquad (1)$$
respectively. Then, the normalized cut induced by $S$, the Cheeger constant and the $k$-way Cheeger constant for hypergraphs are defined in an analogous manner as for graphs.

2.2. Laplacian Operators for Hypergraphs

We introduce next p-Laplacians of hypergraphs and a number of relevant notions associated with Laplacian operators. Hein et al. (Hein et al., 2013) connected p-Laplacians $\triangle_p^{(h)}$ for homogeneous hypergraphs with the total variation via
$$\langle x, \triangle_p^{(h)}(x)\rangle = \sum_{e \in E} w_e \max_{u, v \in e} |x_u - x_v|^p,$$
where $w_e$ denotes the weight of a homogeneous hyperedge $e$. They also introduced the inverse power method (IPM) to evaluate the spectrum of the hypergraph 1-Laplacian $\triangle_1^{(h)}$ (Hein et al., 2013), but did not establish any performance guarantees. In an independent line of work, Louis (Louis, 2015) introduced a quadratic variant of a hypergraph Laplacian,
$$\langle x, \triangle_2^{(h)}(x)\rangle = \sum_{e \in E} w_e \max_{u, v \in e} (x_u - x_v)^2.$$
He also derived a Cheeger-type inequality relating the second smallest eigenvalue $\lambda$ of $\triangle_2^{(h)}$ and the Cheeger constant of the hypergraph $h_2$ that reads as $\hat{h}_2 \le O(\sqrt{\log \zeta(E)})\sqrt{\lambda} \le O(\sqrt{\log \zeta(E)})\sqrt{h_2}$. Compared to the result (3.12) for graphs, for homogeneous hypergraphs the factor $\log \zeta(E)$ introduces an additional difficulty in approximating $h_2$. Learning the spectrum of generalizations of hypergraph Laplacians can be an even more challenging task.

2.3. Relevant Background on Submodular Functions

Given an arbitrary set function $F: 2^V \to \mathbb{R}$ satisfying $F(V) = 0$, the Lovász extension (Lovász, 1983) $f: \mathbb{R}^N \to \mathbb{R}$ of $F$ is defined as follows: for any vector $x \in \mathbb{R}^N$, we order its entries in nonincreasing order $x_{i_1} \ge x_{i_2} \ge \cdots \ge x_{i_N}$, breaking ties arbitrarily, and set
$$f(x) = \sum_{j=1}^{N-1} F(S_j)(x_{i_j} - x_{i_{j+1}}), \qquad (2)$$
with $S_j = \{i_1, i_2, \ldots, i_j\}$. For submodular $F$, the Lovász extension is a convex function (Lovász, 1983).
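Equation (2) can be evaluated directly by sorting. The following sketch, using a hypothetical symmetric hyperedge weight $w_e(S) = \min(|S|, |e \setminus S|)$ on $e = \{0, 1, 2\}$ (normalized, symmetric, and submodular; not an example from the paper), implements the extension and checks the identity $F(S) = f(1_S)$ mentioned in the text.

```python
def F(S):
    # Hypothetical submodular cut weight on the hyperedge e = {0, 1, 2}.
    S = set(S)
    return float(min(len(S), 3 - len(S)))

def lovasz(F, x, n):
    """Lovász extension of Eq. (2): sort coordinates nonincreasingly,
    breaking ties arbitrarily, and sum F(S_j)(x_{i_j} - x_{i_{j+1}})."""
    order = sorted(range(n), key=lambda i: -x[i])
    val, prefix = 0.0, []
    for j in range(n - 1):
        prefix.append(order[j])
        val += F(prefix) * (x[order[j]] - x[order[j + 1]])
    return val

# Sanity check: f(1_S) = F(S) for the indicator of S = {0, 2}.
x = [1.0, 0.0, 1.0]
print(lovasz(F, x, 3))  # equals F({0, 2}) = 1.0
```

Note that $F(V) = \min(3, 0) = 0$ here, so the $N - 1$ term form of (2) applies without a correction term.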

Let $1_S \in \mathbb{R}^N$ be the indicator vector of the set $S$. Hence, for any $S \subseteq V$, one has $F(S) = f(1_S)$. For a submodular $F$, we define a convex set termed the base polytope,
$$B = \{y \in \mathbb{R}^N \mid y(S) \le F(S) \text{ for all } S \subseteq V, \text{ and } y(V) = F(V) = 0\}.$$
According to the defining property of submodular functions (Lovász, 1983), we may write $f(x) = \max_{y \in B} \langle y, x \rangle$. The subdifferential $\nabla f(x)$ of $f$ is defined as
$$\{y \in \mathbb{R}^N \mid f(x') - f(x) \ge \langle y, x' - x \rangle, \ \forall x' \in \mathbb{R}^N\}.$$
An important result from (Bach et al., 2013) characterizes the subdifferentials $\nabla f(x)$: if $f(x)$ is the Lovász extension of a submodular function $F$ with base polytope $B$, then
$$\nabla f(x) = \arg\max_{y \in B} \langle y, x \rangle. \qquad (3)$$
Observe that $\nabla f(x)$ is a set and that the right-hand side of the definition represents a set of maximizers of the objective function. If $f(x)$ is the Lovász extension of a submodular function, then $\langle q, x \rangle = f(x)$ for all $q \in \nabla f(x)$.

For each hyperedge $e \in E$ of a submodular hypergraph, following the above notation, we let $B_e$, $E(B_e)$ and $f_e$ denote the base polytope, the set of extreme points of the base polytope, and the Lovász extension of the submodular hyperedge weight function $w_e$, respectively. Note that for any $S \subseteq V$, $w_e(S) = w_e(S \cap e)$. Consequently, for any $y \in B_e$, $y_v = 0$ for $v \notin e$. Since $\nabla f_e \subseteq B_e$, it also holds that $(\nabla f_e)_v = 0$ for $v \notin e$. When using formula (2) to explicitly describe the Lovász extension $f_e$, we can either use a vector $x$ of dimension $N$ or only those of its components that lie in $e$. Furthermore, in the latter case, $|E(B_e)| = |e|!$.

3. p-Laplacians of Submodular Hypergraphs

We start our discussion by defining the notion of a p-Laplacian operator for submodular hypergraphs. We find the following definitions useful for our subsequent exposition. Let $\mathrm{sgn}(\cdot)$ be the sign function defined as $\mathrm{sgn}(a) = 1$ for $a > 0$, $\mathrm{sgn}(a) = -1$ for $a < 0$, and $\mathrm{sgn}(a) = [-1, 1]$ for $a = 0$. For all $v \in V$, define the entries of a vector $\varphi_p$ over $\mathbb{R}^N$ according to $(\varphi_p(x))_v = |x_v|^{p-1}\,\mathrm{sgn}(x_v)$. Furthermore, let $U$ be an $N \times N$ diagonal matrix such that $U_{vv} = \mu_v$ for all $v \in V$.

Let $\|x\|_{\ell_p,\mu} = (\sum_{v \in V} \mu_v |x_v|^p)^{1/p}$ and $S_{p,\mu} = \{x \in \mathbb{R}^N \mid \|x\|_{\ell_p,\mu} = 1\}$. For a function $\Phi$ over $\mathbb{R}^N$, let $\Phi|_{S_{p,\mu}}$ stand for $\Phi$ restricted to $S_{p,\mu}$.

Definition 3.1. The p-Laplacian operator of a submodular hypergraph, denoted by $\triangle_p$ ($p \ge 1$), is defined for all $x \in \mathbb{R}^N$ according to
$$\langle x, \triangle_p(x)\rangle = Q_p(x) = \sum_{e \in E} \vartheta_e f_e(x)^p. \qquad (4)$$
Hence, $\triangle_p(x)$ may also be specified directly as an operator over $\mathbb{R}^N$ that reads as
$$\triangle_p(x) = \begin{cases} \sum_{e \in E} \vartheta_e f_e(x)^{p-1} \nabla f_e(x) & p > 1, \\ \sum_{e \in E} \vartheta_e \nabla f_e(x) & p = 1. \end{cases}$$

Definition 3.2. A pair $(\lambda, x) \in \mathbb{R} \times \mathbb{R}^N \setminus \{0\}$ is called an eigenpair of the p-Laplacian $\triangle_p$ if $\triangle_p(x) \cap \lambda U \varphi_p(x) \neq \emptyset$.

As $f_e(1) = 0$, we have $\triangle_p(1) = 0$, so that $(0, 1)$ is an eigenpair of the operator $\triangle_p$. A p-Laplacian operates on vectors and produces sets. In addition, since for any $t > 0$, $\triangle_p(tx) = t^{p-1}\triangle_p(x)$ and $\varphi_p(tx) = t^{p-1}\varphi_p(x)$, $(tx, \lambda)$ is an eigenpair if and only if $(x, \lambda)$ is an eigenpair. Hence, one only needs to consider normalized eigenpairs: in our setting, we choose eigenpairs that lie in $S_{p,\mu}$ for a suitable choice of the dimension of the space.

For linear operators, the Rayleigh-Ritz method (Gould, 1966) allows for determining approximate solutions to eigenproblems and provides a variational characterization of eigenpairs based on the critical points of functionals. To generalize the method, we introduce two even functions,
$$\tilde{Q}_p(x) = Q_p(x)|_{S_{p,\mu}}, \qquad R_p(x) = \frac{Q_p(x)}{\|x\|^p_{\ell_p,\mu}}.$$

Definition 3.3. A point $x \in S_{p,\mu}$ is termed a critical point of $R_p(x)$ if $0 \in \nabla R_p(x)$. Correspondingly, $R_p(x)$ is termed a critical value of $R_p(x)$. Similarly, $x$ is termed a critical point of $\tilde{Q}_p$ if there exists a $\sigma \in \nabla Q_p(x)$ such that $P(x)\sigma = 0$, where $P(x)\sigma$ stands for the projection of $\sigma$ onto the tangent space of $S_{p,\mu}$ at the point $x$. Correspondingly, $\tilde{Q}_p(x)$ is termed a critical value of $\tilde{Q}_p$.

The relationships between the critical points of $\tilde{Q}_p(x)$ and $R_p(x)$ and the eigenpairs of $\triangle_p$ relevant to our subsequent derivations are listed in Theorem 3.4.

Theorem 3.4. A pair $(\lambda, x)$ ($x \in S_{p,\mu}$) is an eigenpair of the operator $\triangle_p$
1) if and only if $x$ is a critical point of $\tilde{Q}_p$ with critical value $\lambda$, provided that $p \ge 1$;
2) if and only if $x$ is a critical point of $R_p$ with critical value $\lambda$, provided that $p > 1$;
3) if $x$ is a critical point of $R_p$ with critical value $\lambda$, provided that $p = 1$.

The critical points of $\tilde{Q}_p$ bijectively characterize eigenpairs for all choices of $p \ge 1$. However, $R_p$ has the same property only if $p > 1$. This is a consequence of the nonsmoothness of the set $S_{1,\mu}$, which has been observed for graphs as well (see the examples in Section 2.2 of (Chang, 2016)).
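A maximizer in (3) can be obtained by the classical greedy procedure for base polytopes (due to Edmonds): visit coordinates in nonincreasing order of $x$ and assign each the marginal gain $F(S_j) - F(S_{j-1})$. The sketch below illustrates this for a hypothetical symmetric hyperedge weight; it is an illustration of a standard fact, not code from the paper.

```python
def F(S):
    # Hypothetical symmetric cut weight on the hyperedge e = {0, 1, 2}.
    S = set(S)
    return float(min(len(S), 3 - len(S)))

def greedy_subgradient(F, x, n):
    """Edmonds' greedy algorithm: returns y in argmax_{y in B} <y, x>,
    i.e., an element of the subdifferential in Eq. (3)."""
    order = sorted(range(n), key=lambda i: -x[i])  # nonincreasing x
    y, prev, S = [0.0] * n, 0.0, []
    for i in order:
        S.append(i)
        cur = F(S)
        y[i] = cur - prev  # marginal gain F(S_j) - F(S_{j-1})
        prev = cur
    return y

x = [1.0, 0.0, 1.0]
y = greedy_subgradient(F, x, 3)
# The text's identity <q, x> = f(x) for q in grad f(x); here f(x) = 1.0.
print(sum(yi * xi for yi, xi in zip(y, x)))
```

The output also satisfies $y(V) = F(V) = 0$, as required for membership in the base polytope.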

3.1. Discrete Nodal Domain Theorems for p-Laplacians

Nodal domain theorems are essential for understanding the structure of eigenvectors of operators, and they have been the subject of intense study in geometry and graph theory alike (Bıyıkoğlu et al., 2007). The eigenfunctions of a Laplacian operator may take positive and negative values. The signs of the values induce a partition of the vertices in $V$ into maximal connected components on which the sign of the eigenfunction does not change: these components represent the nodal domains of the eigenfunction and approximate the clusters of the graph.

Davies et al. (Davies et al., 2001) derived the first discrete nodal domain theorem for the $\triangle_2^{(g)}$ operator. Chang et al. (Chang et al., 2017) and Tudisco and Hein (Tudisco & Hein, 2016) generalized these theorems to the $\triangle_1^{(g)}$ and $\triangle_p^{(g)}$ ($p > 1$) operators of graphs. In what follows, we prove that the discrete nodal domain theorem applies to $\triangle_p$ of submodular hypergraphs.

As every nodal domain theorem depends on some underlying notion of connectivity, we first define the relevant notion of connectivity for submodular hypergraphs. In a graph or a homogeneous hypergraph, vertices on the same edge or hyperedge are considered to be connected. However, this property does not generalize to submodular hypergraphs, as one can merge two nonoverlapping hyperedges into one without changing the connectivity of the hyperedges. To see why this is the case, consider two hyperedges $e_1$ and $e_2$ that are nonintersecting. One may transform the submodular hypergraph so that it includes a hyperedge $e = e_1 \cup e_2$ with weight $w_e = w_{e_1} + w_{e_2}$. This transformation essentially does not change the submodular hypergraph, but in the newly obtained hypergraph, according to the standard definition of connectivity, the vertices in $e_1$ and $e_2$ are connected. This problem may be avoided by defining connectivity based on the volume of the boundary set.

Definition 3.5. Two distinct vertices $u, v \in V$ are said to be connected if for any $S$ such that $u \in S$ and $v \notin S$, $\mathrm{vol}(\partial S) > 0$. A submodular hypergraph is connected if for any nonempty $S \subset V$, one has $\mathrm{vol}(\partial S) > 0$.

According to the following lemma, it is always possible to transform the weight functions of a submodular hypergraph in such a way as to preserve connectivity.

Lemma 3.6. Any submodular hypergraph $G = (V, E, w, \mu)$ can be reduced to another submodular hypergraph $G' = (V, E', w', \mu)$ without changing $\mathrm{vol}(\partial S)$ for any $S \subseteq V$, while ensuring that for any $e \in E'$ and $u, v \in e$, $u$ and $v$ are connected.

Definition 3.7. Let $x \in \mathbb{R}^N$. A positive (respectively, negative) strong nodal domain is the set of vertices of a maximally connected induced subgraph of $G$ such that $\{v \in V \mid x_v > 0\}$ (respectively, $\{v \in V \mid x_v < 0\}$). A positive (respectively, negative) weak nodal domain is defined in the same manner, except for changing the strict inequalities to $\{v \in V \mid x_v \ge 0\}$ (respectively, $\{v \in V \mid x_v \le 0\}$).

The following lemma establishes that for a connected submodular hypergraph $G$, all nonconstant eigenvectors of the operator $\triangle_p$ correspond to nonzero eigenvalues.

Lemma 3.8. If $G$ is connected, then all eigenvectors associated with the zero eigenvalue have constant entries.

We next state new nodal domain theorems for submodular hypergraph p-Laplacians. The results imply that the bounds on the numbers of nodal domains induced by eigenvectors of the p-Laplacian do not essentially change compared to those for graphs (Tudisco & Hein, 2016). We do not consider the case $p = 1$, although it is possible to adapt the methods for analyzing the $\triangle_1^{(g)}$ operators of graphs to the $\triangle_1$ operators of submodular hypergraphs. Such a generalization requires extensions of the critical-point theory to piecewise linear manifolds (Chang, 2016).

Theorem 3.9. Let $p > 1$ and assume that $G$ is a connected submodular hypergraph. Furthermore, let the eigenvalues of $\triangle_p$ be ordered as $0 = \lambda_1^{(p)} < \lambda_2^{(p)} \le \cdots \le \lambda_{k-1}^{(p)} < \lambda_k^{(p)} = \cdots = \lambda_{k+r-1}^{(p)} < \lambda_{k+r}^{(p)} \le \cdots \le \lambda_n^{(p)}$, with $\lambda_k^{(p)}$ having multiplicity $r$. Let $x$ be an arbitrary eigenvector associated with $\lambda_k^{(p)}$. Then $x$ induces at most $k + r - 1$ strong and at most $k$ weak nodal domains.

Lemma 3.10. Let $G$ be a connected submodular hypergraph. For $p > 1$, any nonconstant eigenvector has at least two weak (strong) nodal domains. Hence, the eigenvectors associated with the second smallest eigenvalue $\lambda_2^{(p)}$ have exactly two weak (strong) nodal domains. For $p = 1$, the eigenvectors associated with the second smallest eigenvalue $\lambda_2^{(1)}$ may have only a single weak (strong) nodal domain.

We next define the following three functions: $\mu_p^+(x) = \sum_{v \in V: x_v > 0} \mu_v |x_v|^{p-1}$, $\mu^0(x) = \sum_{v \in V: x_v = 0} \mu_v$, and $\mu_p^-(x) = \sum_{v \in V: x_v < 0} \mu_v |x_v|^{p-1}$.

Lemma 3.11. Let $G$ be a connected submodular hypergraph. Then, for any nonconstant eigenvector $x$ of $\triangle_p$, one has $\mu_p^+(x) - \mu_p^-(x) = 0$ for $p > 1$, and $|\mu_1^+(x) - \mu_1^-(x)| \le \mu^0(x)$ for $p = 1$. Consequently, $0 \in \arg\min_{c \in \mathbb{R}} \|x - c1\|_{\ell_p,\mu}$ for any $p \ge 1$.

The nodal domain theorem characterizes the structure of the eigenvectors of the operator, and the number of nodal domains determines the approximation guarantees in Cheeger-type inequalities relating the spectra of graphs and hypergraphs and the Cheeger constant. These observations are rigorously formalized in the next section.

3.2. Higher-Order Cheeger Inequalities

In what follows, we analytically characterize the relationship between the Cheeger constants and the eigenvalues $\lambda_k^{(p)}$ of $\triangle_p$ for submodular hypergraphs.

Theorem 3.12. Suppose that $p \ge 1$ and let $(\lambda_k^{(p)}, x_k)$ be the $k$-th eigenpair of the operator $\triangle_p$, with $m_k$ denoting the number of strong nodal domains of $x_k$. Then,
$$\left(\frac{1}{\tau}\right)^{p-1} \left(\frac{h_{m_k}}{p}\right)^p \le \lambda_k^{(p)} \le (\min\{\zeta(E), k\})^{p-1}\, h_k,$$
where $\tau = \max_v d_v / \mu_v$. For homogeneous hypergraphs, a tighter bound holds that reads as
$$\left(\frac{2}{\tau}\right)^{p-1} \left(\frac{h_{m_k}}{p}\right)^p \le \lambda_k^{(p)} \le 2^{p-1}\, h_k.$$

It is straightforward to see that setting $p = 1$ produces the tightest bounds on the eigenvalues, while the case $p = 2$ reduces to the classical Cheeger inequality. This motivates an in-depth study of algorithms for evaluating the spectrum of 1- and 2-Laplacians, described next.

4. Spectral Clustering Algorithms for Submodular Hypergraphs

The Cheeger constant is frequently used as an objective function for (balanced) graph and hypergraph partitioning (Zhou et al., 2007; Bühler & Hein, 2009; Szlam & Bresson, 2010; Hein & Bühler, 2010; Hein et al., 2013; Li & Milenkovic, 2017). Theorem 3.12 implies that $\lambda_k^{(p)}$ is a good approximation for the $k$-way Cheeger constant of submodular hypergraphs. Hence, to perform accurate hypergraph clustering, one has to be able to efficiently learn $\lambda_k^{(p)}$ (Ng et al., 2002; Von Luxburg, 2007). We outline next how to do so for $k = 2$.

In Theorem 4.1, we describe an objective function that allows us to characterize $\lambda_2^{(p)}$ in a computationally tractable manner; the choice of the objective function is related to the objective developed for graphs in (Bühler & Hein, 2009; Szlam & Bresson, 2010). Minimizing the proposed objective function produces a real-valued output vector $x \in \mathbb{R}^N$. Theorem 4.3 describes how to round the vector $x$ and obtain a partition which provably upper bounds $c(S)$. Based on these theorems, we propose two algorithms for evaluating $\lambda_2^{(2)}$ and $\lambda_2^{(1)}$. Since $\lambda_2^{(1)} = h_2$, the corresponding partition corresponds to the tightest approximation of the 2-way Cheeger constant. The eigenvalue $\lambda_2^{(2)}$ can be evaluated in polynomial time with provable performance guarantees. The problem of devising good approximations for the values $\lambda_k^{(p)}$, $k \neq 2$, is still open.

Let $Z_{p,\mu}(x, c) = \|x - c1\|^p_{\ell_p,\mu}$ and $Z_{p,\mu}(x) = \min_{c \in \mathbb{R}} Z_{p,\mu}(x, c)$, and define
$$R_p(x) = \frac{Q_p(x)}{Z_{p,\mu}(x)}. \qquad (5)$$

Theorem 4.1. For $p > 1$, $\lambda_2^{(p)} = \inf_{x \in \mathbb{R}^N} R_p(x)$. Moreover, $\lambda_2^{(1)} = \inf_{x \in \mathbb{R}^N} R_1(x) = h_2$.

Definition 4.2. Given a nonconstant vector $x \in \mathbb{R}^N$ and a threshold $\theta$, set $\Theta(x, \theta) = \{v : x_v > \theta\}$. The optimal conductance obtained from thresholding the vector $x$ equals
$$c(x) = \inf_{\theta \in [x_{\min}, x_{\max})} \frac{\mathrm{vol}(\partial \Theta(x, \theta))}{\min\{\mathrm{vol}(\Theta(x, \theta)), \mathrm{vol}(V \setminus \Theta(x, \theta))\}}.$$

Theorem 4.3. For any $x \in \mathbb{R}^N$ that satisfies $0 \in \arg\min_c Z_{p,\mu}(x, c)$, i.e., such that $Z_{p,\mu}(x, 0) = Z_{p,\mu}(x)$, one has $c(x) \le p\, \tau^{(p-1)/p}\, R_p(x)^{1/p}$, where $\tau = \max_{v \in V} d_v / \mu_v$.

In what follows, we present two algorithms. The first algorithm describes how to minimize $R_2(x)$, and hence provides a polynomial-time solution for submodular hypergraph partitioning with provable approximation guarantees, given that the size of the largest hyperedge is a constant. The result is summarized in Theorem 4.5. The algorithm is based on an SDP, and may be computationally too intensive for practical applications involving large hypergraphs with even moderately large hyperedges. The second algorithm is based on the IPM (Hein & Bühler, 2010) and aims to minimize $R_1(x)$. Although this algorithm does not come with performance guarantees, it provably converges (see Theorem 4.6) and has good heuristic performance. Moreover, the inner loop of the IPM involves solving a version of a proximal-type decomposable submodular minimization problem (see Theorem 4.7), which can be efficiently performed using a number of different algorithms (Kolmogorov, 2012; Jegelka et al., 2013; Nishihara et al., 2014; Ene & Nguyen, 2015; Li & Milenkovic, 2018).
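The rounding of Definition 4.2 is a sweep cut: scan thresholds over the entries of $x$ and keep the level set of smallest conductance (Theorem 4.3 bounds the quality of the best such cut). A brute-force sketch on a hypothetical weighted graph, with all names invented for illustration:

```python
# Hypothetical weighted graph: two heavy pairs joined by a light edge.
V = [0, 1, 2, 3]
w = {(0, 1): 2.0, (2, 3): 2.0, (1, 2): 0.1}

def weight(u, v):
    return w.get((u, v), w.get((v, u), 0.0))

mu = {u: sum(weight(u, v) for v in V) for u in V}

def conductance(S):
    Sbar = [v for v in V if v not in S]
    cut = sum(weight(u, v) for u in S for v in Sbar)
    return cut / min(sum(mu[u] for u in S), sum(mu[v] for v in Sbar))

def sweep_cut(x):
    best = (float("inf"), ())
    for theta in sorted(set(x))[:-1]:            # theta in [x_min, x_max)
        S = tuple(v for v in V if x[v] > theta)  # Theta(x, theta)
        best = min(best, (conductance(S), S))
    return best

# A vector separating {0, 1} from {2, 3} recovers the sparse cut.
c_best, S_best = sweep_cut([0.9, 0.8, -0.7, -0.9])
print(S_best)
```

Only $|V| - 1$ thresholds need to be examined, since the level set $\Theta(x, \theta)$ changes only at entries of $x$.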
An SDP Method for Minimizing R2(x) lows us to characterize λ2 in a computationally tractable manner; the choice of the objective function is related to The R2(x) minimization problem introduced in Equa- the objective developed for graphs in (Buhler¨ & Hein, 2009; tion (5) may be rewritten as Szlam & Bresson, 2010). Minimizing the proposed objec- Q2(x) tive function produces a real-valued output vector x ∈ RN . min , (6) x:Ux⊥1 kxk2 Theorem 4.3 describes how to round the vector x and ob- `2,µ tain a partition which provably upper bounds c(S). Based P 2 where we observe that Q2(x) = e∈E ϑefe (x) = on the theorems, we propose two algorithms for evaluating P 2 (2) (1) (1) e∈E ϑe maxy∈E(Be)hy, xi . This problem is, in turn, λ2 and λ2 . Since λ2 = h2, the corresponding parti- equivalent to the nonconvex optimization problem tion corresponds to the tightest approximation of the 2-way (2)  2 Cheeger constant. The eigenvalue λ2 can be evaluated X min ϑe max hy, xi (7) in polynomial time with provable performance guarantees. N x∈R y∈E(Be) The problem of devising good approximations for values e (p) X 2 X s.t. µvxv = 1, µvxv = 0. λk , k 6= 2, is still open. v∈V v∈V p Let Zp,µ(x, c) , kx − c1k` ,µ and Zp,µ(x) , p Following an approach proposed for homogeneous hyper- min Z (x, c), and define c∈R p,µ graphs (Louis, 2015), one may try to solve an SDP relax- Qp(x) ation of (7) instead. To describe the relaxation, let each Rp(x) , . (5) 0 n Zp,µ(x) vertex v of the graph be associated with a vector xv ∈ R , Submodular Hypergraphs: p-Laplacians, Cheeger Inequalities and Spectral Clustering n ≥ ζ(E). The assigned vectors are collected into a matrix Algorithm 2: IPM-based minimization of R1(x) 0 0 of the form X = (x1, .., xN ). The SDP relaxation reads as Input: A submodular hypergraph G = (V,E, w, µ) Find nonconstant x0 ∈ N s.t. 0 ∈ arg min kx0 − c1k X R c `1,µ 2 ˆ0 0 min ϑeηe (8) initialize λ ← R1(x ), k ← 0 X∈ n×N , η∈ |E| R R e 1: Repeat: 2 2 ( k k s.t. 
‖X y‖₂ ≤ η_e, ∀ y ∈ E(B_e), e ∈ E,
Σ_{v∈V} μ_v ‖x_v‖₂² = 1,  Σ_{v∈V} μ_v x_v = 0.

Note that E(B_e) is of size O(|e|!), and the above problem can be solved efficiently if ζ(E) is small.

Algorithm 1 lists the steps of an SDP-based algorithm for minimizing R₂(x), and it comes with approximation guarantees stated in Lemma 4.4. In contrast to homogeneous hypergraphs (Louis, 2015), for which the approximation factor equals O(log ζ(E)), the guarantees for general submodular hypergraphs are O(ζ(E)). This is due to the fact that the underlying base polytope B_e for a submodular function is significantly more complex than the corresponding polytope for the homogeneous case. We conjecture that this approximation guarantee is optimal for SDP methods.

Algorithm 1: Minimization of R₂(x) using SDP
Input: A submodular hypergraph G = (V, E, w, μ)
1: Solve the SDP (8).
2: Generate a random Gaussian vector g ∼ N(0, I_n), where I_n denotes the identity matrix of order n.
3: Output x = Xᵀ g.

Lemma 4.4. Let x be as in Algorithm 1, and let the optimal value of (8) be SDPopt. Then, with high probability,

R₂(x) ≤ O(ζ(E)) SDPopt ≤ O(ζ(E)) min R₂.

This result immediately leads to the following theorem.

Theorem 4.5. Suppose that x is the output of Algorithm 1. Then, c(x) ≤ O(√(ζ(E) τ)) h₂ with high probability.

We describe next Algorithm 2 for optimizing R₁(x), which has guaranteed convergence properties.

Algorithm 2: IPM-based minimization of R₁(x)
2: For v ∈ V,
   g_v^k ← sgn(x_v^k) μ_v,                          if x_v^k ≠ 0,
   g_v^k ← −((μ⁺(x^k) − μ⁻(x^k)) / μ⁰(x^k)) μ_v,    if x_v^k = 0
3: z^{k+1} ← arg min_{z: ‖z‖ ≤ 1} Q₁(z) − λ̂^k ⟨z, g^k⟩
4: c^{k+1} ← arg min_c ‖z^{k+1} − c 1‖_{ℓ₁,μ}
5: x^{k+1} ← z^{k+1} − c^{k+1} 1
6: λ̂^{k+1} ← R₁(x^{k+1})
7: Until |λ̂^{k+1} − λ̂^k| / λ̂^k < ε
8: Output x^{k+1}

Theorem 4.6. The sequence {x^k} generated by Algorithm 2 satisfies R₁(x^{k+1}) ≤ R₁(x^k).

The computationally demanding part of Algorithm 2 is the optimization procedure in Step 3. The optimization problem is closely related to the problem of submodular function minimization (SFM) due to the defining properties of the Lovász extension. Theorem 4.7 describes different equivalent formulations of the optimization problem in Step 3.

Theorem 4.7. If the norm of the vector z in Step 3 is ‖z‖₂, the underlying optimization problem is the dual of the following ℓ₂ minimization problem

min_{y_e} ‖ Σ_{e∈E} y_e − λ̂^k g^k ‖₂²,  y_e ∈ ϑ_e B_e, ∀ e ∈ E,    (9)

where the primal and dual variables are related according to

z = (λ̂^k g^k − Σ_{e∈E} y_e) / ‖ λ̂^k g^k − Σ_{e∈E} y_e ‖₂.

If the norm of the vector z in Step 3 is ‖z‖_∞, the underlying optimization problem is equivalent to the following SFM problem

min_{S⊆V} Σ_{e∈E} ϑ_e w_e(S) − λ̂^k g^k(S),    (10)

where the primal and dual variables are related according to z_v = 1 if v ∈ S, and z_v = −1 if v ∉ S.

For special forms of submodular weights, different algorithms for the optimization problems in Theorem 4.7 may be used instead. For graphs and homogeneous hypergraphs with hyperedges of small size, the min-cut algorithms of Karger and of Chekuri and Xu (Karger, 1993; Chekuri & Xu, 2017) allow one to efficiently solve the discrete problem (10). Continuous optimization methods such as alternating projections (AP) (Nishihara et al., 2014) and coordinate descent methods (CDM) (Ene & Nguyen, 2015) can be used to solve (9) by "tracking" minimum norm points of the base polytopes corresponding to individual hyperedges; for general submodular weights, Wolfe's algorithm (Wolfe, 1976) can be used. When the submodular weights have special properties, such as depending only on the cardinality of the input, there exist algorithms that operate efficiently even when |e| is extremely large (Jegelka et al., 2013).

In our experimental evaluations, we use a random coordinate descent method (RCDM) (Ene & Nguyen, 2015), which ensures an expected (1 + ε)-approximation by solving an expected number of O(|V|² |E| log(1/ε)) min-norm-point problems. Note that when performing continuous optimization, one does not need to solve the inner-loop optimization problem exactly and is allowed to exit the loop as long as the objective function value decreases. The underlying algorithm, Algorithm 3, is described in the Supplement.
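The discrete formulation (10) is a submodular function minimization over subsets of V. As a sanity check on what such a solver computes, the objective can be minimized by brute-force enumeration on a toy instance; the sketch below is ours (function names and toy data are illustrative, not from the paper), and practical instances should use the min-cut or SFM routines cited above:

```python
from itertools import combinations

def minimize_cut_objective(V, hyperedges, theta, w, lam, g):
    """Brute-force minimizer of F(S) = sum_e theta_e * w_e(S) - lam * g(S),
    mirroring problem (10).  Enumerates all 2^|V| subsets, so it is for
    illustration only; real solvers use min-cut or SFM algorithms."""
    best_S, best_val = set(), 0.0  # objective of the empty set, assuming w_e(∅) = 0
    for r in range(1, len(V) + 1):
        for S in combinations(sorted(V), r):
            S = set(S)
            val = sum(t * w(S & e, e) for t, e in zip(theta, hyperedges))
            val -= lam * sum(g[v] for v in S)
            if val < best_val:
                best_S, best_val = S, val
    return best_S, best_val

# Toy example with homogeneous weights: a hyperedge contributes 1 when cut.
w_hom = lambda S, e: 1.0 if 0 < len(S) < len(e) else 0.0
S, val = minimize_cut_objective({0, 1, 2, 3}, [{0, 1, 2}, {2, 3}],
                                [1.0, 1.0], w_hom, lam=1.0,
                                g={0: 1.0, 1: 1.0, 2: -1.0, 3: -1.0})
# → S = {0, 1}, val = -1.0
```

Here the minimizer {0, 1} cuts one hyperedge but collects all of the positive g-mass, which is exactly the trade-off the λ̂^k g^k(S) term in (10) encodes.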

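Step 3 of Algorithm 2 is tied to SFM through the defining properties of the Lovász extension, so it may help to recall how that extension is evaluated. The sketch below uses the standard sorting (greedy) formula; the function names are ours, and the toy cut function merely illustrates that the extension of a single-edge graph-cut weight recovers the total variation |x_0 − x_1|:

```python
def lovasz_extension(F, x):
    """Evaluate the Lovász extension of a submodular set function F
    (normalized so that F(empty set) = 0) at a vector x, using the
    standard formula: visit coordinates in decreasing order of x_v and
    weight each marginal gain F(S ∪ {v}) - F(S) by x_v."""
    total, S = 0.0, set()
    for v in sorted(x, key=lambda u: -x[u]):
        gain = F(S | {v}) - F(S)
        S.add(v)
        total += x[v] * gain
    return total

# Cut function of the single edge {0, 1}: weight 1 iff exactly one endpoint
# lies in S.  Its Lovász extension is the total variation |x_0 - x_1|.
edge_cut = lambda S: 1.0 if len(S & {0, 1}) == 1 else 0.0
```

For instance, lovasz_extension(edge_cut, {0: 2.0, 1: 0.5}) evaluates to 1.5 = |2.0 − 0.5|, matching the total-variation interpretation used throughout the paper.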
[Figure 1: eight panels plotting the clustering error (%) and the Cheeger constant versus α for the 20Newsgroups (news20), Mushrooms, Covertype45, and Covertype67 datasets, with curves for IPM-S, IPM-H, and CEM.]

Figure 1. Experimental clustering results for four UCI datasets, displayed in pairs of plots depicting the clustering error and the Cheeger constant versus α. Fine-tuning the parameter α may produce significant performance improvements on several datasets; for example, on the Covertype67 dataset, choosing α = 0.028 results in visible drops in both the clustering error and the Cheeger constant. Both the use of 1-Laplacians and of submodular weights may be credited for improving clustering performance.

5. Experiments

In what follows, we compare the algorithms for submodular hypergraph clustering described in the previous section to two methods: the IPM for homogeneous hypergraph clustering (Hein et al., 2013) and the clique expansion method (CEM) for submodular hypergraph clustering (Li & Milenkovic, 2017). We focus on 2-way graph partitioning problems related to the University of California Irvine (UCI) datasets selected for analysis in (Hein et al., 2013), described in Table 1 of the Supplementary Material. The datasets include 20Newsgroups, Mushrooms, and Covertype. In all datasets, ζ(E) was roughly 10³, and each of these datasets describes multiple clusters. Since we are interested in 2-way partitioning, we focused on two pairs of clusters in Covertype, denoted by (4, 5) and (6, 7), and paired the four 20Newsgroups clusters, one of which includes Comp. and Sci., and another one of which includes Rec. and Talk. The Mushrooms and 20Newsgroups datasets contain only categorical features, while Covertype also includes numerical features. We adopt the same approach as the one described in (Hein et al., 2013) to construct hyperedges: each feature corresponds to one hyperedge; hence, each categorical feature is captured by one hyperedge, while numerical features are first quantized into 10 bins of equal size and then mapped to hyperedges. To describe the submodular weights, we fix ϑ_e = 1 for all hyperedges and parametrize w_e using a variable α ∈ (0, 0.5]:

w_e(S; α) = 1/2 + (1/2) min{1, |S|/⌈α|e|⌉, |e \ S|/⌈α|e|⌉},  ∀ S ⊆ e.

The intuitive explanation behind our choice of weights is that it allows one to accommodate categorization errors and outliers: in contrast to the homogeneous case, in which any partition of a hyperedge has weight one, the chosen submodular weights allow a smaller weight to be used when the hyperedge is partitioned into small parts, i.e., when min{|S|, |e \ S|} < ⌈α|e|⌉. In practice, α is chosen to be relatively small; in all experiments, we set α ≤ 0.04, with α close to zero producing homogeneous hyperedge weights.

The results are shown in Figure 1. As may be observed, both in terms of the clustering error (i.e., the total number of erroneously classified vertices) and the values of the Cheeger constant, the IPM-based methods outperform CEM. This is due to the fact that for large hyperedge sizes, CEM incurs a high distortion when approximating the submodular weights (O(ζ(E)) (Li & Milenkovic, 2017)). Moreover, as w_e(S) depends merely on |S|, the submodular hypergraph CEM reduces to the homogeneous hypergraph CEM (Zhou et al., 2007), an issue that the IPM-based method does not face. Comparing the performance of IPM on submodular hypergraphs (IPM-S) with that on homogeneous hypergraphs (IPM-H), we see that IPM-S achieves better clustering performance on both 20Newsgroups and Covertype, and offers the same performance as IPM-H on the Mushrooms dataset. This indicates that it is practically useful to use submodular hyperedge weights for clustering purposes. A somewhat unexpected finding is that in certain cases, when α increases (and thus, when w_e decreases), the corresponding Cheeger constant increases. This may be caused by the fact that the IPM algorithm can get trapped in a local optimum.
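The parametric weight w_e(S; α) used above depends only on |S| and |e|, so it is straightforward to compute; a minimal sketch (function name ours) is:

```python
import math

def submodular_weight(S, e, alpha):
    """Hyperedge cut weight w_e(S; alpha) = 1/2 + 1/2 * min(1, |S|/t, |e\\S|/t)
    with t = ceil(alpha * |e|), following the parametrization in the text."""
    S = set(S) & set(e)            # only membership in e matters
    t = math.ceil(alpha * len(e))
    return 0.5 + 0.5 * min(1.0, len(S) / t, (len(e) - len(S)) / t)

# With |e| = 10 and alpha = 0.2, t = 2: a singleton side of the cut gets
# weight 0.75, while a balanced split saturates at the homogeneous weight 1.
```

This makes the outlier-accommodation effect concrete: cutting off a part smaller than ⌈α|e|⌉ is penalized strictly less than a balanced cut of the same hyperedge.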

References

Amghibech, S. Eigenvalues of the discrete p-Laplacian for graphs. Ars Combinatoria, 67:283–302, 2003.

Arora, Chetan, Banerjee, Subhashis, Kalra, Prem, and Maheshwari, S. N. Generic cuts: An efficient algorithm for optimal inference in higher order MRF-MAP. In Proceedings of the European Conference on Computer Vision, pp. 17–30. Springer, 2012.

Asuncion, Arthur and Newman, David. UCI machine learning repository, 2007.

Bach, Francis et al. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3):145–373, 2013.

Benson, Austin R., Gleich, David F., and Leskovec, Jure. Higher-order organization of complex networks. Science, 353(6295):163–166, 2016.

Bıyıkoğlu, Türker, Leydold, Josef, and Stadler, Peter F. Laplacian eigenvectors of graphs. Lecture Notes in Mathematics, 1915, 2007.

Bühler, Thomas and Hein, Matthias. Spectral clustering based on the graph p-Laplacian. In Proceedings of the International Conference on Machine Learning, pp. 81–88. ACM, 2009.

Chang, Kung Ching. Spectrum of the 1-Laplacian and Cheeger's constant on graphs. Journal of Graph Theory, 81(2):167–207, 2016.

Chang, K. C., Shao, Sihong, and Zhang, Dong. Nodal domains of eigenvectors for 1-Laplacian on graphs. Advances in Mathematics, 308:529–574, 2017.

Chekuri, Chandra and Xu, Chao. Computing minimum cuts in hypergraphs. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 1085–1100. Society for Industrial and Applied Mathematics, 2017.

Chung, Fan R. K. Spectral graph theory. Number 92. American Mathematical Soc., 1997.

Davies, E. Brian, Gladwell, Graham M. L., Leydold, Josef, and Stadler, Peter F. Discrete nodal domain theorems. Linear Algebra and its Applications, 336(1-3):51–60, 2001.

Ene, Alina and Nguyen, Huy. Random coordinate descent methods for minimizing decomposable submodular functions. In Proceedings of the International Conference on Machine Learning, pp. 787–795, 2015.

Fix, Alexander, Joachims, Thorsten, Park, Sung Min, and Zabih, Ramin. Structured learning of sum-of-submodular higher order energy functions. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3104–3111. IEEE, 2013.

Gould, Sydney Henry. Variational methods for eigenvalue problems, volume 22. University of Toronto Press, 1966.

Hein, Matthias and Bühler, Thomas. An inverse power method for nonlinear eigenproblems with applications in 1-spectral clustering and sparse PCA. In Advances in Neural Information Processing Systems, pp. 847–855, 2010.

Hein, Matthias, Setzer, Simon, Jost, Leonardo, and Rangapuram, Syama Sundar. The total variation on hypergraphs - learning on hypergraphs revisited. In Advances in Neural Information Processing Systems, pp. 2427–2435, 2013.

Jegelka, Stefanie, Bach, Francis, and Sra, Suvrit. Reflection methods for user-friendly submodular optimization. In Advances in Neural Information Processing Systems, pp. 1313–1321, 2013.

Karger, David R. Global min-cuts in RNC, and other ramifications of a simple min-cut algorithm. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, volume 93, pp. 21–30, 1993.

Kolmogorov, Vladimir. Minimizing a sum of submodular functions. Discrete Applied Mathematics, 160(15):2246–2258, 2012.

Li, Pan and Milenkovic, Olgica. Inhomogeneous hypergraph clustering with applications. In Advances in Neural Information Processing Systems, pp. 2305–2315, 2017.

Li, Pan and Milenkovic, Olgica. Revisiting decomposable submodular function minimization with incidence relations. arXiv preprint arXiv:1803.03851, 2018.

Louis, Anand. Hypergraph Markov operators, eigenvalues and approximation algorithms. In Proceedings of the ACM Symposium on Theory of Computing, pp. 713–722. ACM, 2015.

Lovász, László. Submodular functions and convexity. In Mathematical Programming: The State of the Art, pp. 235–257. Springer, 1983.

Ng, Andrew Y., Jordan, Michael I., and Weiss, Yair. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pp. 849–856, 2002.

Nishihara, Robert, Jegelka, Stefanie, and Jordan, Michael I. On the convergence rate of decomposable submodular function minimization. In Advances in Neural Information Processing Systems, pp. 640–648, 2014.

Shanu, Ishant, Arora, Chetan, and Singla, Parag. Min norm point algorithm for higher order MRF-MAP inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5365–5374, 2016.

Szlam, Arthur and Bresson, Xavier. Total variation and Cheeger cuts. In Proceedings of the International Conference on Machine Learning, pp. 1039–1046, 2010.

Tsourakakis, Charalampos E., Pachocki, Jakub, and Mitzenmacher, Michael. Scalable motif-aware graph clustering. In Proceedings of the 26th International Conference on World Wide Web, pp. 1451–1460. International World Wide Web Conferences Steering Committee, 2017.

Tudisco, Francesco and Hein, Matthias. A nodal domain theorem and a higher-order Cheeger inequality for the graph p-Laplacian. arXiv preprint arXiv:1602.05567, 2016.

Von Luxburg, Ulrike. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

Wolfe, Philip. Finding the nearest point in a polytope. Mathematical Programming, 11(1):128–149, 1976.

Zhang, Chenzi, Hu, Shuguang, Tang, Zhihao Gavin, and Chan, T-H. Hubert. Re-revisiting learning on hypergraphs: Confidence interval and subgradient method. In Proceedings of the International Conference on Machine Learning, volume 70, pp. 4026–4034, 2017.

Zhou, Denny, Huang, Jiayuan, and Schölkopf, Bernhard. Learning with hypergraphs: Clustering, classification, and embedding. In Advances in Neural Information Processing Systems, pp. 1601–1608, 2007.