SUBMITTED TO IEEE TRANSACTIONS ON , 2007 1 Information Inequalities for Joint Distributions, with Interpretations and Applications Mokshay Madiman, Member, IEEE, and Prasad Tetali, Member, IEEE

Abstract— Upper and lower bounds are obtained for the joint we use this flexibility later, but it is convenient to fix a entropy of a collection of random variables in terms of an default order.) arbitrary collection of subset joint entropies. These inequalities • For any set s ⊂ [n], X stands for the collection of generalize Shannon’s chain rule for entropy as well as inequalities s of Han, Fujishige and Shearer. A duality between the upper random variables (Xi : i ∈ s), with the indices taken and lower bounds for joint entropy is developed. All of these in their increasing order. results are shown to be special cases of general, new results for • For any index i in [n], define the degree of i in C as submodular functions– thus, the inequalities presented constitute r(i) = |{t ∈ C : i ∈ t}|. Let r−(s) = mini∈s r(i) denote a richly structured class of Shannon-type inequalities. The new the minimal degree in s, and r (s) = max r(i) denote inequalities are applied to obtain new results in combinatorics, + i∈s such as bounds on the number of independent sets in an arbitrary the maximal degree in s. graph and the number of zero-error source-channel codes, as First we present a weak form of our main inequality. well as new determinantal inequalities in matrix theory. A new inequality for relative entropies is also developed, along with Proposition I:[WEAKDEGREEFORM] Let X1,...,Xn be interpretations in terms of hypothesis testing. Finally, revealing arbitrary random variables jointly distributed on some discrete connections of the results to literature in economics, computer sets. For any collection C such that each index i has non-zero science, and physics are explored. degree, Index Terms— Entropy inequality; inequality for minors; X H(Xs|Xsc ) X H(Xs) entropy-based counting; submodularity. ≤ H(X[n]) ≤ , (1) r+(s) r−(s) s∈C s∈C

I.INTRODUCTION where r+(s) and r−(s) are the maximal and minimal degrees in s. If C satisfies r (s) = r (s) for each s in C, then (1) ET X ,X ,...,X be a collection of random variables. − + 1 2 n also holds for h in the setting of continuous random variables. L There are the familiar two canonical cases: (a) the random variables are real-valued and possess a probability density function, in which case h represents the differential Proposition I unifies a large number of inequalities in the entropy, or (b) they are discrete, in which case H represents the literature. Indeed, discrete entropy. More generally, if the joint distribution has a 1) Applying to the class C1 of singletons, density f with respect to some reference product measure, the n n joint entropy may be defined by −E[log f(X1,X2,...,Xn)]; X X H(Xi|X[n]\i) ≤ H(X[n]) ≤ H(Xi). (2) with this definition, H corresponds to counting measure and h i=1 i=1 to Lebesgue measure. The only assumption we will implicitly The upper bound represents the subadditivity of entropy make throughout is that the joint entropy is finite, i.e., neither noticed by Shannon. The lower bound may be inter- −∞ nor +∞. arXiv:0901.0044v2 [cs.IT] 3 Aug 2009 preted as the fact that the erasure entropy of a collection We wish to discuss the relationship between the joint of random variables is not greater than their entropy; see entropies of various subsets of the random variables Section VI for further comments. X1,X2,...,Xn. Thus we are motivated to consider an ar- 2) Applying to the class Cn−1 of all sets of n−1 elements, bitrary collection C of subsets of {1, 2, . . . , n}. The following n 1 X conventions are useful: H(X |X ) ≤ H(X ) n − 1 [n]\i i [n] • [n] is the index set {1, 2, . . . , n}. We equip this set with i=1 n (3) its natural (increasing) order, so that 1 < 2 < . . . < n. 1 X ≤ H(X ). (Any other total order would do equally well, and indeed n − 1 [n]\i i=1 Material in this paper was presented at the Information Theory and This is Han’s inequality [23], [10], in its prototypical Applications Workshop, San Diego, CA, January 2007, and at the IEEE Symposium on Information Theory, Nice, France, June 2007. form. Mokshay Madiman is with the Department of Statistics, Yale Uni- 3) Let r+ = mini∈[n] r(i) and r− = maxi∈[n] r(i) be the versity, 24 Hillhouse Avenue, New Haven, CT 06511, USA. Email: minimal and maximal degrees with respect to C. Using [email protected] Prasad Tetali is with the School of Mathematics and College of Com- r− ≤ r−(s) and r+ ≤ r+(s), we have puting, Georgia Institute of Technology, Atlanta, GA 30332, USA. Email: 1 X 1 X [email protected]. Supported in part by NSF grants DMS- H(Xs|Xsc ) ≤ H(X[n]) ≤ H(Xs). r+ r− 0401239 and DMS-0701043. s∈C s∈C SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY, 2007 2

The upper bound is Shearer’s lemma [9], known in the hypothesis testing and concentration of measure are also given combinatorics literature [43]. The lower bound is new. there. The paper is organized as follows. First, in Section II, In Section X, we note that weaker versions of our main the notions of fractional coverings and packings using hyper- inequality for submodular functions follow from results devel- graphs, which provide a useful language for the information oped in various communities (economics, computer science, inequalities we present, are developed. In Section III, we physics); this history and the consequent connections do present the main technical result of this paper, which is a new not seem to be well known or much tapped in information inequality for submodular set functions. Section IV presents theory. Finally in Section XI, we conclude with some final the main entropy inequality of this paper, which strengthens remarks and brief discussion of other applications, including Proposition I, and gives a very simple proof as a corollary to multiuser information theory. of the general result for submodular functions. This entropy inequality is developed in two forms, which we call the strong fractional form and the strong degree form; Proposition I may II.ON HYPERGRAPHSAND RELATED CONCEPTS then be thought of as the weak degree form. A different It is appropriate here to recall some terminology from manifestation of the upper bound in this weak degree form discrete mathematics. A collection C of subsets of [n] is called of the inequality was recently proved (in a more involved a hypergraph, and each set s in C is called a hyperedge. manner) by Friedgut [15]; the relationship with his result is When each hyperedge has cardinality 2, then C can be thought also further discussed in Section IV using the preliminary of as the set of edges of an undirected graph on n labelled concepts developed in Section II. vertices. Thus all the statements made above can be translated While independent sets in graphs have always been of into the language of hypergraphs. In the rest of this paper, combinatorial and graph-theoretical interest, counting inde- we interchangeably use “hypergraph” and “collection” for C, pendent sets in bipartite graphs received renewed attention “hyperedge” and “set” for s in C, and “vertex” and “index” due to Kahn’s entropy approach [26] to Dedekind’s prob- for i in [n]. lem. Dedekind’s problem involves counting the number of We have the following standard definitions. antichains in the Boolean lattice, or equivalently, counting the number of Boolean functions on n variables that can be Definition I: The collection C is said to be r-regular if each constructed using only AND and OR (and no NOT) gates. To index i in [n] has the same degree r, i.e., if each vertex i handle this problem by induction on the number of levels in the appears in exactly r hyperedges of C. lattice, Kahn first derived a tight bound on the (logarithm of the) number of independent sets in a regular bipartite graph. The following definitions extend the familiar notion of pack- In Section V, we build on Kahn’s work to obtain a bound ings, coverings and partitions of sets by allowing fractional on number of independent sets in an arbitrary graph. We counts. The history of these notions is unclear to us, but also generalize this to counting graph homomorphisms, with some references can be found in the book by Scheinerman applications to graph coloring and zero-error source-channel and Ullman [44]. codes. The applications of entropy inequalities to counting typi- Definition II: Given a collection C of subsets of [n], a function cally involves discrete random variables, but the inequalities α : C → R+, is called a fractional covering, if for each P also have applications when applied to continuous random i ∈ [n], we have s∈C:i∈s α(s) ≥ 1. variables. In Section VI, we develop such an application by Given C, a function β : C → R+ is a fractional packing, if proving a new family of determinantal inequalities that provide P for each i ∈ [n], we have s∈C:i∈S β(s) ≤ 1. generalizations of the classical determinantal inequalities of If γ : C → R+ is both a fractional covering and a fractional Hadamard, Szasz and Fischer. packing, we call γ a fractional partition. Having presented two applications of our main inequalities, we move on to studying the structure of the inequalities more Note that the standard definition of a fractional packing closely. In Section VII, we present a duality between our upper of [n] using C (as in [44]), would assign weights βi to the and lower bounds that generalizes a theorem of Fujishige [17]. elements, (rather than sets) i ∈ [n], and require that, for P In particular, we show that the collection of upper bounds on each s ∈ C, we have i∈s βi ≤ 1. Our terminology can H(X[n]) for all collections C is equivalent to the collection be justified, if one considers the “dual hypergraph,” obtained of lower bounds. There we also discuss interpretations of the by interchanging the role of elements and sets – consider the inequality relating to sensor networks and erasure entropy, and 0-1 incidence matrix (with rows indexed by the elements and generalize the monotonicity property for special collections of columns by the sets) of the set system, and simply switch the subsets discovered by Han [23]. roles of the elements and the sets. Section VIII presents some new entropy power inequalities The following simple lemmas are useful. for joint distributions, and points out an intriguing analogy between them and the recent subset sum entropy power Lemma I:[FRACTIONAL ADDITIVITY] Let {ai : i ∈ [n]} inequalities of Madiman and Barron [33]. In Section IX, we be an arbitrary collection of real numbers. For any s ⊂ [n], P prove inequalities for relative entropy between joint distribu- define as = aj. For any fractional partition γ using j∈s P tions. Interpretations of the relative entropy inequality through any hypergraph C, a[n] = s∈C γ(s)as. Furthermore, if each SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY, 2007 3

1 ai ≥ 0, then covering given by α(s) = is called the degree covering, r−(s) X X and the fractional packing given by β(s) = 1 is called the β(s)a ≤ a ≤ α(s)a (4) r+(s) s [n] s degree packing. s∈C s∈C for any fractional packing β and any fractional covering α The following lemma is a trivial consequence of the defini- using C. tions.

Proof: Interchanging sums implies Lemma III: If C is quasiregular, the degree packing and X X X X X degree covering coincide and provide a fractional partition of α(s) ai = ai α(s)1{i∈s} ≥ ai, P [n] using C. In particular, a[n] = s∈C as/r−(s). s∈C i∈s i∈[n] s∈C i∈[n] One may define the weight of a fractional partition as using the definition of a fractional covering. The other state- follows. ments are similarly obvious. We introduce the notion of quasiregular hypergraphs. Definition V: Let γ be a fractional partition (or a fractional covering or packing). Then the weight of γ is w(γ) = P Definition III: The hypergraph C is quasiregular if the degree s∈C γ(s). function r :[n] → Z+ defined by r(i) = |{s ∈ C : s 3 i}| is constant on s, for each s in C. There are natural optimization problems associated with the weight function. The problem of minimizing the weight of α Example: One can construct simple examples of quasiregular over all fractional coverings α is the called the optimal frac- hypergraphs using what are called bi-regular graphs in the tional covering problem, and that of maximizing the weight graph theory literature. Consider a bipartite graph on vertex of β over all fractional packings β is the called the optimal sets V1 and V2 (i.e., all edges go between V1 and V2), such fractional packing problem. These are linear programming that every vertex in V1 has degree r1 and every vertex in V2 relaxations of the integer programs associated with optimal has degree r2. Such a graph always exists if |V1|r1 = |V2|r2. covering and optimal packing, which are of course important Now consider the hypergraph on V1 ∪ V2 with hyperedges in many applications. Much work has been done on these being the neighborhoods of vertices in the bipartite graph. This problems, including studies of the integrality gap (see, e.g., hypergraph is quasiregular (with degrees being r1 and r2), and [44]). it is not regular if r1 is different from r2. One may also define a notion of duality for fractional partitions. There is a sense in which all quasiregular hypergraphs are similar to the example above; specifically, any quasiregular Definition VI: For any hypergraph C, define the complimen- hypergraph has a canonical decomposition as a disjoint union tary hypergraph as C¯ = {sc : s ∈ C}. If α is a fractional of regular subhypergraphs. covering (or packing) using C, the dual fractional packing (respectively, covering) using C¯ is defined by Lemma II: Suppose the hypergraph C on the vertex set [n] is α(s) quasiregular. Then one can partition [n] into disjoint subsets α¯(sc) = . {Vm}, and C into disjoint subhypergraphs {Cm} such that each w(α) − 1 Cm is a regular hypergraph on vertex set Vm. To see that this definition makes sense (say for the case of Proof: Consider the equivalence relation on [n] induced a fractional covering α), note that for each i ∈ [n], by the degree, i.e., i and j are related if r(i) = r(j). This relation decomposes [n] into disjoint equivalence classes X X α(s) α¯(sc) = {Vm}. Since C is quasiregular, all indices in s have the same w(α) − 1 c ¯ c s∈C,i/∈s degree for each set s ∈ C, and hence each s ∈ C is a subset s ∈C,s 3i P α(s) − P α(s) of exactly one equivalence class Vm. Q.E.D. = s∈C s∈C,i∈s w(α) − 1 The notion of quasiregularity is related to what we believe w(α) − 1 ≤ = 1. is an important and natural fractional covering/packing pair. w(α) − 1 As long as there is at least one set s in the hypergraph C that contains i, we have

X 1 X 1{i∈s} X 1{i∈s} = ≥ = 1, III.A NEWINEQUALITYFORSUBMODULARFUNCTIONS r−(s) r−(s) r(i) s∈C,s3i s∈C s∈C The following definitions are necessary in order to state the main technical result of this paper. so that α(s) = 1 provide a fractional covering. Similarly, r−(s) the the numbers β(s) = 1 provide a fractional packing. [n] r+(s) Definition VII: The set function f : 2 → R is submodular if Definition IV: Let C be any hypergraph on [n] such that every index appears in at least one hyperedge. The fractional f(s) + f(t) ≥ f(s ∪ t) + f(s ∩ t) SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY, 2007 4 for every s, t ⊂ [n]. If −f is submodular, we say that f is Thus supermodular. X (a) X X α(s)f(s| < s) = α(s) f(j| < j ∩ s, < s) Definition VIII: For any disjoint subsets s and t of [n], define s∈C s∈C j∈s (b) f(s|t) = f(s ∪ t) − f(t). For a fixed subset t ( [n], the X X [n]\t ≥ α(s) f(j| < j) function ft : 2 → defined by ft(s) = f(s|t) is called f R s∈C j∈s conditional on t. (c) X X = f(j| < j) α(s)1{j∈s} For any s ⊂ [n], denote by < s the set of indices less j∈[n] s∈C than every index in s. Similarly, > s is the set of indices (d) X greater than every index in s. Also, the index i is identified ≥ f(j| < j) with the set {i}; thus, for instance, < i is well-defined. We j∈[n] (a) also write [i : i + k] for {i, i + 1, . . . , i + k − 1, i + k}. Note = f(X ), that [n] = [1 : n]. [n] where (a) follows by the chain rule (6), (b) follows from (5), Lemma IV: Let f : 2[n] → R be any submodular function (c) follows by interchanging sums, and (d) follows by the with f(φ) = 0. definition of a fractional covering. The lower bound may be proved in a similar fashion by a 1) If s, t, u are disjoint sets, chain of inequalities. Indeed, X f(s|t, u) ≤ f(s|t). (5) β(s)f(s|sc\ > s) s∈C (a) X X 2) The following “chain rule” expression holds for f([n]): = β(s) f(j| < j ∩ s, sc\ > s) X s∈C j∈s f([n]) = f(i| < i). (b) X X i∈[n] ≤ β(s) f(j| < j) s∈C j∈s (c) X X Proof: First note that if s, t, u are disjoint sets, then = f(j| < j) 1{j∈s}β(s) submodularity implies j∈[n] s∈C (e) X ≤ f(j| < j) f(s ∪ t ∪ u) + f(t) ≤ f(s ∪ t) + f(t ∪ u), j∈[n] (a) which is equivalent to f(s|t, u) ≤ f(s|t). = f([n]), The “chain rule” expression for f([n]) is obtained by where (a), (b), (c) follow as above, and (e) follows by the induction. Note that f([2]) = f(1)+f(2|1) = f(1|φ)+f(2|1) definition of a fractional partition. since f(φ) = 0. Now assume the chain rule holds for [n], and observe that Remark 1: The key new element in this result is the fact that one can use, for any ordering on the ground set [n], the X f([n + 1]) = f([n]) + f(n + 1|[n]) = f(i| < i), conditional values of f that appear in the upper and lower i∈[n+1] bounds for f([n]). Because of (5), this is an improvement over simply using f. The latter weaker inequality has been implicit where we used the induction hypothesis for the second equal- in the cooperative game theory literature; various historical ity. remarks explicating these connections are given in Section X.

Theorem I: Let f : 2[n] → R be any submodular function with f(φ) = 0. Let γ be any fractional partition with respect Corollary I: Let f : 2[n] → R be any submodular function to any collection C of subsets of [n]. Then with f(φ) = 0, such that f([j]) is non-decreasing in j for j ∈ [n]. Then, for any collection C of subsets of [n], X c X γ(s)f(s|s \ > s) ≤ f([n]) ≤ γ(s)f(s| < s) . X X β(s)f(s|sc\ > s) ≤ f([n]) ≤ α(s)f(s| < s) , s∈C s∈C s∈C s∈C where β is any fractional packing and α is any fractional Proof: The chain rule (actually a slightly extended covering of C. version of it with additional conditioning in all terms that can be proved in exactly the same way) implies Proof: The proof is almost exactly the same as that of Theorem I; the only difference being that the validity there of X f(s| < s) = f(j| < j ∩ s, < s). (6) (d) for fractional coverings and of (e) for fractional packings j∈s is guaranteed by the non-negativity of f(j| < j). SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY, 2007 5

Observe that if f defines a polymatroid (i.e., f is not When the subset entropies are all equal, this is just the problem only submodular but also non-decreasing in the sense that of optimal fractional covering discussed in Section II. f(s) ≤ f(t) if s ⊂ t), then the condition of Corollary I is automatically satisfied. B. Strong Degree Form IV. ENTROPY INEQUALITIES The choice of α as the degree covering and β as the degree packing in Theorem I’ gives the strong degree form of the A. Strong Fractional Form inequality. The main entropy inequality introduced in this work is the following generalization of Shannon’s chain rule. Theorem II:[STRONG DEGREE FORM] Let C be any collec- tion of subsets of [n], such that every index i appears in at Theorem I’:[STRONGFRACTIONALFORM] For any collection least one element of C. Then C of subsets of [n], X H(Xs|Xsc\>s) X H(Xs|Xs) ≤ H(X[n]) ≤ α(s)H(Xs|Xs) ≤ h(X[n]) ≤ γ(s)h(Xs|X s in and γ is any fractional partition of C. the lower bound.

One can give an elementary proof of Theorem I’ as a Remark 4: The collections C for which the results in this refinement of that given by Llewellyn and Radhakrishnan paper hold need not consist of distinct sets. That is, one may for Shearer’s lemma (see [43]). However, instead of giving have multiple copies of a particular s ⊂ [n] contained in C, and the proof in terms of entropy (which one may find in the as long as this is taken into account in counting the degrees conference paper [35]), we have proved in Theorem I a of the indices (or checking that a set of coefficients forms more general result that holds for the rather wide class of a fractional packing or covering), the statements extend. We submodular set functions. To see that Theorem I’ follows from will make use of this feature when developing applications to Theorem I, we need to check that the joint entropy set function combinatorics in Section V. f(s) = H(Xs) is a submodular function with f(φ) = 0. The submodularity of f is a well known result that to our Remark 5: Using the previous remark, one may write down knowledge was first explicitly mentioned by Fujishige [17], Theorem II with arbitrary numbers of repetitions of each set in although he appears to partially attribute the result to a 1960 C. This gives a version of Theorem I’ with rational coefficients, paper of Watanabe that we have been unable to find. It follows following which an approximation argument can be used to from the fact that H(Xs) + H(Xt) − H(Xs∪t) − H(Xs∩t) = obtain Theorem I’. This proof is similar to the one alluded I(Xs\t; Xt\s|Xs∩t) is a conditional (see, to by Friedgut [15] for the version without ordering. Thus e.g., Cover and Thomas [10]), which is guaranteed to be Theorem II is actually equivalent to Theorem I’. non-negative by Jensen’s inequality. To see that the “correct” definition of f(φ) = 0, note that the “unconditional” entropy The strong degree form of the inequality generalizes Shan- H(Xs) should be equal to H(Xs|Xφ), but the latter is non’s chain rule. In order to see this, simply choose the H(Xs)−H(Xφ) by definition, which suggests that H(Xφ) = collection C to be C1, the collection of all singletons. For this 0. collection, Theorem II says n n Again, we would like to stress the freedom given by X X Theorem I’ in terms of choice of ordering. For convenience of H(Xi|X[n]\≥i) ≤ H(X[n]) ≤ H(Xi|X

Proof: Let X be chosen uniformly at random from where R(xP (i)) is the cardinality of the range of Xi given that Hom(G, F ). The random homomorphism X can be repre- XP (i) = xP (i), and we have bounded H(Xi|XP (i) = xP (i)) sented by the values it assigns to each i ∈ V , i.e., X = by log R(xP (i)), and the last inequality follows by Jensen’s (X(1),X(2),...,X(n)) = (X1,X2,...,Xn), where Xi ∈ inequality. Thus VF . By definition, Xi and Xj are connected in F if i and j X 1  X  H(X) ≤ log R (x )p(i) . are connected in G. We aim to bound H(X) from above. d(i) i P (i) i∈V x ∈X Let ≺ denote an ordering on vertices according to the P (i) i decreasing order of their degrees (ties may be broken, for The proof is completed by observing that, for any i ∈ V , instance, by using an underlying lexicographic ordering of V ). X R (x )p(i) ≤ |Hom(K ,F )| . (11) For each i ∈ V , let i P (i) p(i),p(i) xP (i)∈X i

Indeed, first note that every (partial) homomorphism xP (i) of P (i) = {j ∈ V : {i, j} ∈ E and j ≺ i} , P (i) for any graph G (regardless of the ordering ≺) is trivially a valid (partial) homomorphism of one side of Kp(i),p(i), and define p(i) = |P (i)|. Consider the collection C to be the since each side of this bipartite graph has no edges and collection of P (i), and in addition, p(i) copies of singleton |P (i)| = p(i). Furthermore, for a valid xP (i), the number of sets {i}, for each i. Then observe that each i is covered by extensions Ri(xP (i)) to i is the same whether the graph is G d(i) sets in C, i.e., that the degree of i in the collection C or Kp(i),p(i), since it only depends on F . This proves (11). is r(i) = d(i). Indeed, each i appears in d(i) − p(i) sets of Note that the inequality (11) can be strict, since there can be the form, P (j), corresponding to each j such that i ≺ j and partial homomorphisms of one side of Kp(i),p(i) to a given {i, j} ∈ E, and once in each of the p(i) singleton sets {i}. F which are not necessarily valid while considering (partial) G F By the upper bound in Theorem II applied to this collection homomorphisms from to , since the induced graph on P (i) i C, we have , for a given , might have some edges. (This corrects the claim in [21] that (11) holds with equality.)

X 1   Nayak, Tuncel and Rose [42] note that zero-error source- H(X) ≤ H X |X min d(j) P (i) ≺P (i) channel codes are precisely graph homomorphisms from a i∈V j∈P (i) “source confusability graph” G to a “channel characteristic X p(i) U + H(X |X ) graph” G . Thus, Theorem IV may also be interpreted as d(i) i ≺i X i∈V giving a bound on the number of zero-error source channel X 1 p(i)  codes that exist for a given source-channel pair. ≤ H(X ) + H(X |X ) , d(i) P (i) d(i) i P (i) i∈V C. Counting independent sets By choosing appropriate graphs F , various corollaries can by relaxing the conditioning and by the fact that the chosen be obtained. In particular, it is well known that the problem ordering makes j ∈ P (i) imply d(j) ≥ d(i). of counting independent sets in a graph can be cast in the Let qi denote the probability mass function of XP (i), which language of graph homomorphisms. Choose F to be the graph takes its values in X i = {xP (i) : x ∈ Hom(G, F )}. In other on two vertices joined by an edge, and with a self-loop on one words, q(xP (i)) is the probability that XP (i) = xP (i), under of the vertices. Then, by considering the set of vertices of G the uniform distribution on X. Finally, let R(xP (i)) be the that are mapped to the un-looped vertex in F , it is easy to number of values that Xi can take given that XP (i) = xP (i), see that each homomorphism from G to F corresponds to an i.e., the support size of the conditional distribution of Xi given independent set of G. This yields the following corollary. XP (i) = xP (i). Note that this is also the number of possible extensions of the partial homomorphism on P (i) to a partial Corollary II:[INDEPENDENT SETS] Let G = (V,E) be an homomorphism on P (i) ∪ {i}. arbitrary graph on N vertices, and let I(G) denote the set SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY, 2007 8

vertices of G. Thus the above theorem yields a corresponding upper bound on the number of r-colorings of a graph G, by replacing Hom(Kp(v),p(v),F ) in (10) with the number of r- 0 1 colorings of the complete bipartite graph Kp(v),p(v). Independent sets VI.AN APPLICATION TO DETERMINANTAL INEQUALITIES The connection between determinants of positive definite matrices and multivariate normal distributions is classical. 3 For example, Bellman’s text [3] on matrix analysis makes extensive use of an “integral representation” of determinants in terms of an integrand of the form e−, which is 4 essentially the Gaussian density. The classical determinan- 2 tal inequalities of Hadamard and Fischer then follow from the subadditivity of entropy. This approach seems to have been first cast in probabilistic language by Dembo, Cover 5 1 and Thomas [11], who further showed that an inequality of Szasz can be derived (and generalized) using Han’s inequality. 5−colorings Following this well-trodden path, Proposition II yields the Fig. 1. The graphs F relevant for counting independent sets and number of following general determinantal inequality. 5-colorings. Corollary III:[DETERMINANTAL INEQUALITIES] Let K be a positive definite n×n matrix and let C be a hypergraph on [n]. of independent sets of G. Let ≺ denote an ordering on V ac- Let K(s) denote the submatrix corresponding to the rows and cording to decreasing order of degrees of the vertices, breaking columns indexed by elements of s. Then, using |M| denote ties arbitrarily. Let p(v) denote the number of neighbors of v the determinant of M, we have for any fractional partition α∗, which precede v, under the ≺ ordering. Then α∗(s) Y |K|  Y α∗(s) Y (p(v)+1) 1 ≤ |K| ≤ |K(s)| . |I(G)| ≤ 2 d(v) . |K(sc)| s∈C s∈C v∈V The proof follows from Proposition II via the fact that Specializing to the case of d-regular graphs G = (V,E) on any positive definite n × n matrix K can be realized as n vertices, it is clear that the covariance matrix of a multivariate normal distribution Y (p (v)+1) 1 n + n |I(G)| ≤ 2 a d ≤ 2 2 d . N(0,K), whose entropy is v∈V 1  n  H(X[n]) = 2 log (2πe) |K| , where ≺a is an arbitrary total order on V , and pa(v) is the number of vertices preceding v in this order, which are and furthermore, that if X[n] ∼ N(0,K), then Xs ∼ neighbors of v. This recovers Kahn’s unpublished result [27] N(0,K(s)). Note that an alternative approach to proving for d-regular graphs, which generalized his earlier result [25] Corollary III would be to directly apply Theorem I to for the d-regular, bipartite case. Note that we removed the the known fact (called the Koteljanskii or sometimes the assumption of regularity in Kahn’s result by making a choice Hadamard-Fischer inequality) that the set function f(s) = of ordering. log |K(s)| is submodular. There is another way to view this result that is useful in For an r-regular hypergraph C, using the degree partition in computational geometry. Namely, if one considers a region (of, Corollary III implies that say, Euclidean space) and a finite family of subsets F = {Av : r Y v ∈ V } of this region, then one can define the intersection |K| ≤ |K(s)|. s∈C graph GF of this family by connecting i and j in V if and only if Ai ∩ Aj 6= φ. Then the independent sets of GF are in Considering the hypergraphs C1 and Cn−1 then yields the one-to-one correspondence with packings of the region using Hadamard and prototypical Szasz inequality, while the Fischer sets in the family F. Thus Corollary II also gives a bound on inequality follows by considering C = {s, sc}, for an arbitrary the number of packings of a region using a given family of s ⊂ [n]. sets. We remark that one can interpret Corollary III using the all- Another easy corollary of Theorem III is to graph colorings. minors matrix-tree theorem (see, e.g., Chaiken [7] or Lewin Recall that a (proper) r-coloring of the vertices of G is a [32]). This is a generalization of the matrix tree theorem mapping f : V → [r] so that u, v ∈ V and uv ∈ E implies of Kirchhoff [29], which states that the determinant of any that f(u) 6= f(v). Consider the constraint graph be F = Kr, cofactor of the Laplacian matrix of a graph is the total number a complete graph on r vertices, for r ≥ 2. Then Hom(G, Kr) of distinct spanning trees in the graph, and interprets all minors corresponds to the number of (proper) r-colorings of the of this matrix in terms of combinatorial properties of the graph. SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY, 2007 9

VII.DUALITYAND MONOTONICITYOF GAPS The gaps in the inequalities have especially nice structure Consider the weak fractional form of Theorem I, namely when they are considered in the weak degree form, i.e., for the fractional partition using a r-regular hypergraph C, all of X X γ(s)f(s|sc) ≤ f([n]) ≤ γ(s)f(s). whose coefficients are 1/r. The associated gaps are s∈C s∈C 1 X g (f, C) = f([n]) − f(s|sc) We observe that there is a duality between the upper and lower L r s∈C bounds, relating the gaps in this inequality. (15) 1 X and gU (f, C) = f(s) − f([n]). [n] r Theorem IV:[DUALITYOF GAPS] Let f : 2 → R be a s∈C submodular function with f(φ) = 0. Let γ be an arbitrary fractional partition using some hypergraph C on [n]. Define Corollary IV:[DUALITYFOR REGULAR COLLECTIONS] Let the lower and upper gaps by [n] f : 2 → R be a submodular function with f(φ) = 0. For a X c r-regular collection C, GapL(f, C, γ) = f([n]) − γ(s)f(s|s ) s∈C gL(f, C¯) r X (12) = . and GapU (f, C, γ) = γ(s)f(s) − f([n]). gU (f, C) |C| − r s∈C Then Let us now specialize to the entropy set function e(s)– we Gap (f, C, γ) Gap (f, C¯, γ¯) use this to mean either H(X ) (if the random variables X are U = L , (13) s i w(γ) w(¯γ) discrete) or h(Xs) (if the random variables Xi are continuous). The special hypergraphs C , k = 1, 2, . . . , n, consisting of all where w is the weight function and γ¯ is the dual fractional k k-sets or sets of size k, are of particular interest, and a lot partition defined in Section II. is already known about the gaps for these collections. For Proof: This follows easily from the definitions. Indeed, instance, Han’s inequality [23] already implies Proposition I for these hypergraphs, and Corollary IV applied to these X f([n]) − γ¯(sc)f(sc|s) hypergraphs implies that sc∈C¯ gL(e, Cn−k) k X γ(s) = , =f([n]) − f([n]) − f(s) g (e, C ) n − k w(γ) − 1 U k s∈C recovering an observation made by Fujishige [17]. Indeed, P γ(s)f(s)  w(γ)  = s∈C − − 1 f([n]) Theorem IV and Corollary IV generalize what [17] interpreted w(γ) − 1 w(γ) − 1 using the duality of polymatroids, since our assumptions are 1  X  weaker and the assertions broader. Fujishige [17] considered = γ(s)f(s) − f([n]) , w(γ) − 1 these gaps important enough to merit a name: building on s∈C terminology of Han [23], he called the quantity gU (e, Ck) a and “”, and gL(e, Ck) a “dual total correlation”. P X γ(s) w(γ) In two particular cases, the gaps have simple expressions as w(¯γ) = γ¯(sc) = s∈C = . w(γ) − 1 w(γ) − 1 relative entropies (see Section IX for definitions). First, note sc∈C¯ that the lower gap in Han’s inequality (3) is related to the Dividing the first expression by the second yields the result. dependence measure that generalizes the mutual information.

(n − 1)gL(e, Cn−1) = gU (e, C1) X = e({i}) − e([n]) Note that the upper bound for f([n]) with respect to (C, γ) (16) is equivalent to the lower bound for f([n]) with respect to i∈[n] the dual (C¯, γ¯), implying that the collection of upper bounds = D(PX[n] kPX1 × ... × PXn ). for all hypergraphs and all fractional coverings is equivalent to the collection of lower bounds for all hypergraphs and all It is trivial to see that the gap is zero if and only if the random fractional packings. Also, it is clear that under the assumptions variables are independent. of Corollary I, one can state a duality result extending Theorem Second, the lower gap in Proposition I with respect to the IV by replacing γ by any fractional covering α, and γ¯ by the singleton class C1 is related to the upper gap in the prototypical dual fractional packing α¯. form (3) of Han’s inequality.

From Theorem IV, it is clear by symmetry that also gL(e, C1) = (n − 1)gU (e, Cn−1) X Gap (f, C, γ) Gap (f, C¯, γ¯) = D(P kP |P ). (17) L = U . (14) Xi|X[n]\i Xi|X

n Again specializing to the joint entropy function, let X H−(X ) = H(X |X ). [n] i [n]\i (U) 1 X e(s) e = i=1 k n k k s:|s|=k They give several motivations for defining these quantities; denote the joint entropy per element for subsets of size k most significantly, the erasure entropy has an operational averaged over all k-element subsets, and c significance as the number of bits required to reconstruct a (L) 1 X e(s|s ) e = symbol erased by an erasure channel. Theorem 1 in [50] states k n k − k s:|s|=k that H (X[n]) ≤ H(X[n]) with equality if and only if the Xi are independent. The inequality here is simply the lower denote the corresponding average of per bound of Proposition I applied to the singleton class C , and is (U) 1 element. Since gU (e, Ck) = nek − e([n]) and gL(e, Ck) = thus a special case of our results. The difference between the (L) (U) e([n])−nek , Corollary V asserts that ek is decreasing in k, joint entropy of X[n] and its erasure entropy is just gL(e, C1), (L) while ek is increasing in k. Dembo, Cover and Thomas [11] and the characterization of equality in terms of independence give a nice interpretation of this fact, briefly outlined below. follows from the remarks above. It would be interesting to see Suppose we have n sensors collecting data relevant to the if the more general bounds on joint entropy developed here task at hand. For instance, the sensors might be measuring the can also be given an operational meaning using appropriate temperature of the ocean at various points, or they might be erasure-type channels. evaluating the probability that a human face is in a collection Apart from the eponymous duality between the total and of camera images taken along the boundary of a high-security dual total correlations discussed above, these quantities also site, or they might be taking measurements of neurons in a satisfy a monotonicity property, sometimes called Han’s theo- monkey’s brain. Suppose due to experimental conditions, at rem (cf., [23]). Since this complements the duality result, we any time, we only have access to a random subset of m sensor state it below in the more general submodular function setting. measurements out of n. Then Han’s monotonicity theorem implies that, on average, we are getting more information as Corollary V:[MONOTONICITYOF GAPS] Let f : 2[n] → R m increases, etc. be a submodular function with f(φ) = 0, and let gL(f, Ck) and gU (f, Ck) be defined by (15). Then both gL(f, Ck) and VIII.ENTROPY POWER INEQUALITIES gU (f, Ck) are monotonically decreasing in k. Theorem I’ implies similar inequalities for entropy powers. Recall that the entropy power of the random vector Xs is Proof: Proposition I, applied to the collection Ck, imme- 2h(Xs) diately implies that 0 = gU (f, Cn) ≤ gU (f, Ck), for k ∈ [n], N (Xs) = e |s| . on observing that r (s) = r (s) = n−1. To obtain the full − + k−1 This is sometimes standardized by a constant (2πe), which is chain of inequalities, first note that for any s in C , k+1 convenient in the continuous case as it allows for a comparison with a multivariate normal distribution. For discrete random 1 X variables, one can replace h by H in the above definition. f(s) ≤ f(s \ i). k i∈s Corollary VI: Let γ be any fractional partition of [n] using the hypergraph C. Then Thus X N (X[n]) ≤ wsN (Xs), s∈C gU (f, Ck) − gU (f, Ck+1) where w = γ(s)|s| are weights that sum to 1 over s ∈ C. 1 X 1 X s n = n−1 f(s) − n−1 f(s) Proof: k−1 s∈Ck k s∈Ck+1 First note that 1  X 1 X X  X X γ(s) X X 1 X ≥ f(s) − f(s \ i) . w = 1 = γ(s) = 1, n−1 n − k s n i∈s n k−1 s∈Ck s∈Ck+1 i∈s s∈C s∈C i∈s i∈[n] s∈C,s3i SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY, 2007 11 since γ is a fractional partition. Thus IX.AN INEQUALITYFOR RELATIVE ENTROPY, AND INTERPRETATIONS     2h(X[n]) 2 X exp ≤ exp γ(s)h(Xs) Let A be either a countable set, or a Polish (i.e., complete n n s∈C separable metric) space equipped as usual with its Borel   X 2h(Xs) σ-algebra of measurable sets. Let and be probability = exp w P Q s |s| measures on the Polish product space An. For any nonempty s∈C X subset s of [n], write Ps for the marginal probability measure ≤ wsN (Xs), corresponding to the coordinates in s. Recall the definition of s∈C the relative entropy:   where the first inequality follows from Proposition II, and the dPs last inequality follows by Jensen’s inequality. D(PskQs) = EP log ∈ [0, ∞] dQs

Remark 8: Corollary VI generalizes an implication of The- when Ps is absolutely continuous with respect to Qs, and orem 16.5.2 of Cover and Thomas [10], which looks at the D(PskQs) = +∞ otherwise. collections of k-sets. Note that, as in the special case covered One may also define the conditional relative entropy by in [10], Corollary VI continues to hold with the entropy D(Ps|tkQs|t|P) = E t D(Ps|tkQs|t), (19) power N = N2 replaced throughout by any of the quantities P N (X ) = exp{ch(X )/|s|} for any c > 0. As in the c s s where Ps|t is understood to mean the conditional distribution case of entropy, the bounds on the entropy powers associated (under P) of the random variables corresponding to s given with the hypergraphs Cm and the degree covering satisfy a particular values of the random variables corresponding to t; monotonicity property. Indeed, by Theorem 16.5.2 of [10], then EPt denotes the averaging using Pt over the values that are conditioned on. With this definition, it is easy to verify the 1 X n  Nc(Xs) chain rule m s∈Cn−m d(s ∪ t) = D(Ps|tkQs|t|P) + d(t) is a decreasing sequence in m. for disjoint s and t, so that following the terminology devel- More interesting than entropy power inequalities for joint oped in Section III, we have distributions, however, are entropy power inequalities for sums d(s|t) = D( k | ). of independent random variables with densities. Introduced Ps|t Qs|t P by Shannon [45] and Stam [48] in seminal contributions, We have freely used (regular) conditional distributions in these they have proved to be extremely useful and surprisingly definitions; the existence of these is justified by the fact that deep– with connections to functional analysis, central limit we are working with Polish spaces. theorems, and to the determination of capacity and rate regions for problems in information theory. Recently the first author Theorem V: Let Q be a product probability measure on An, showed (building on work by Artstein, Ball, Barthe and Naor where A is a Polish space as above. Suppose P is a probability [2] and Madiman and Barron [33]) the following generalized measure on An such that the set function d : 2[n] → [0, ∞] entropy power inequality. For independent real-valued random given by variables Xi with densities and finite variances, d(s) = D(PskQs)  X  X  X  N Xi ≥ γ(s)N Xi , (18) does not take the value +∞ for any s ⊂ [n]. Then d(s) is i∈[n] s∈C i∈s supermodular. for any fractional partition γ with respect to any hypergraph C Proof: For any nonempty s, t ⊂ [n], we have on [n]. Inequality (18) shares an intriguing similarity of form to the inequalities of this paper, although it is much harder to d(s ∪ t) + d(s ∩ t) − d(s) − d(t) prove. = [d(s ∪ t) − d(t)] − [d(s) − d(s ∩ t)]

The formal similarity between results for joint entropy and = d(s ∪ t \ t t) − d(s \ s ∩ t s ∩ t). for entropy power of sums extends further. For instance, the fact that Since s ∪ t \ t = s \ s ∩ t, it would suffice to prove for disjoint sets s0 and t that 1 X  X  n  N Xi d(s0|t) ≥ d(s0|t0) (20) m s∈Cn−m i∈s for any t0 ⊂ t. is an increasing sequence in m, can be thought of as a formal However observe that, since Q is a product probability dual of Han’s theorem. It is an open question whether upper measure, bounds for entropy power of sums can be obtained that are 0 d(s |t) = E D( 0 k 0 ) = E E D( 0 k 0 ) analogous to the lower bound in Theorem I’. Pt Ps |t Qs Pt0 Pt\t0 Ps |t Qs SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY, 2007 12 and The tensorization property of the entropy functional is 0 0 of enormous utility in functional analysis, and the study of d(s |t ) = E D( 0 0 k 0 ) = E D(E 0 k 0 ), Pt0 Ps |t Qs Pt0 Pt\t0 Ps |t Qs isoperimetry, concentration of measure, and convergence of so that (20) is an immediate consequence of the convexity of Markov processes to stationarity. For instance, see Gross [22], relative entropy (see, e.g., [10]). Bobkov and Ledoux [4], and Kontoyiannis and Madiman [30], where the classical tensorization property is used to prove Based on the supermodularity proved in Theorem V, The- logarithmic Sobolev inequalities for Gaussian, Poisson and orem I applied to −d(s) immediately implies the following compound Poisson distributions respectively. corollary. X.HISTORICAL REMARKS Corollary VII: Under the assumptions of Theorem V, It turns out that the main technical result of this paper, X γ(s)D(Ps|sc\>skQs|P) ≥ D(P[n]kQ[n]) Theorem I, is related to a wide body of work in a number of s∈C fields, including the study of combinatorial optimization of set (21) X functions in computer science, the study of cooperative games ≥ γ(s)D(Ps|

[39] F. Matu´ˇs. Two constructions on limits of entropy functions. IEEE Trans. [46] L. S. Shapley. On balanced sets and cores. Naval Research Logistics Inform. Theory, 53(1):320–330, 2007. Quarterly, 14:453–560, 1967. [40] J. Moulin Ollagnier and D. Pinchon. Filtre moyennant et valeurs [47] L. S. Shapley. Cores of convex games. International Journal of Game moyennes des capacites´ invariantes. Bull. Soc. Math. France, Theory, 1(1):11–26, 1971. 110(3):259–277, 1982. [48] A.J. Stam. Some inequalities satisfied by the quantities of information [41] H. Narayanan. Submodular functions and electrical networks, volume 54 of Fisher and Shannon. Information and Control, 2:101–112, 1959. of Annals of Discrete Mathematics. North-Holland Publishing Co., [49] A. C. D. van Enter, R. Fernandez,´ and A. D. Sokal. Regularity properties Amsterdam, 1997. and pathologies of position-space renormalization-group transforma- [42] J. Nayak, E. Tuncel, and K. Rose. Zero-error source–channel codingwith tions: scope and limitations of Gibbsian theory. J. Statist. Phys., 72(5- side information. IEEE Trans. Inform. Th., 52:4626–4629, 2006. 6):879–1167, 1993. [43] J. Radhakrishnan. Entropy and counting. In Computational Mathemat- [50] S. Verdu´ and T. Weissman. Erasure entropy. Proc. IEEE Intl. Symp. ics, Modelling and Algorithms (ed. J. C. Misra), Narosa, 2003. Inform. Theory, Seattle, 2006. [44] E. R. Scheinerman and D. H. Ullman. Fractional Graph Theory. Wiley, [51] J. Zhang and R.W. Yeung. On characterization of entropy function via 1997. information inequalities. IEEE Trans. Inform. Th., 44:1440–1452, 1998. [45] C.E. Shannon. A mathematical theory of communication. Bell System Tech. J., 27:379–423, 623–656, 1948.