INFORMATION THEORETIC RESULTS ON EXACT DISTRIBUTED SIMULATION AND INFORMATION EXTRACTION

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Gowtham Ramani Kumar August 2014

© 2014 by Gowtham Kumar Ramani Kumar. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution- 3.0 United States License. http://creativecommons.org/licenses/by/3.0/us/

This dissertation is online at: http://purl.stanford.edu/sy415my4619

ii I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Abbas El Gamal, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Ayfer Ozgur Aydin

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Tsachy Weissman

Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

iii Abstract

In the first part of the thesis we will introduce the notion of exact common infor- mation, which is the minimum description length of the common randomness needed for the exact distributed generation of two correlated random variables (X,Y ). We introduce the quantity G(X; Y ) = minX→W →Y H(W ) as a natural bound on the ex- act common information. We then introduce the exact common information rate, which is the minimum description rate of the common randomness for the exact gen- eration of correlated sources (X,Y ). We give a multiletter characterization for it as n n the limit G(X; Y ) = limn→∞(1/n)G(X ; Y ). While in general G¯(X; Y ) is greater than or equal to the Wyner common information, we show that they are equal for the Symmetric Binary Erasure Source. We then discuss the computation of G(X; Y ). In the second part, we will introduce a conjecture regarding the maximum mutual information a Boolean function can reveal about noisy inputs. Specifically, let (X,Y ) be a doubly symmetric binary source with cross-over probability α. For any boolean function b : 0, 1 n 0, 1 , we conjecture that I(b(Xn); Y n) 1 H(α). While the { } →{ } ≤ − conjecture remains open, we provide substantial evidence supporting its validity.

iv Acknowledgements

First of all, I like to thank my parents for all they have done for me. If not for their numerous sacrifices, I won’t be a Stanford student in the first place. I am from a lesser developed city in India. We lacked qualified teachers to help me get in to the Indian Institutes of Technology (IIT), premier engineering colleges in India with an acceptance ratio less than 1%. My parents made for that gap by over-spending on books and creating a personal library of all great science text-books in my home. When I was preparing for the IIT entrance examination, my school teachers in fact discouraged me by stating that they haven’t seen anyone ever clear the exam. In spite of these difficulties, my parents constantly believed in my ability to clear the IIT entrance exam. They were there for financial and moral support throughout my struggle. I would also like to take this opportunity to thank my family. In particular, I like to thank my wife for her moral support and for making a video of my PhD defense. Before I thank my Stanford colleagues, I would love to thank Prof. Andrew Thangaraj from IIT Madras, who served as my undergraduate final year project ad- visor. He believed in me and encourage me to publish early even in my undergraduate years. I would also like to thank all my professors and fellow students at IIT Madras who were instrumental in imparting knowledge through various fundamental courses that would later get me admitted to Stanford university. Next, I would like to give a special thanks to the late Prof. Thomas Cover who was my first PhD advisor at Stanford. I read Tom’s wonderful book on before I even obtained an admit from Stanford. Tom believes that puzzles lead to interesting research problems. When any new student first approaches Tom

v for a PhD position, he is asked to attend the Wednesday puzzle meetings and wait for something interesting to happen. These meetings go from elementary high school puzzles to deep research topics in information theory. It keeps everyone engaged, from novices to experts. In addition to puzzles, Tom loved theory. Tom didn’t care about practical problems as much. His only criterion in evaluating a research problem is if it is an interesting or elegant problem. I like to recall one favorite quote from Tom: “Theory is the first term in the Taylor series for practice”. Finally, I like to comment on Tom’s dedication to teaching. We all remember the unfortunate day when Tom passed away on March 26, 2012. Now that I know he chose to teach until his last moments, even after retirement and even after his health was low, I feel truly gifted to have had him as my advisor and be his 26-th PhD student (if you start counting from 0). Next, I would like to thank my advisor, Prof. . My first in- teraction with Abbas is during my first quarter at Stanford when I took his course on Network Information Theory (NIT). The course taught me the fundamentals of NIT that every researcher needs to know. One only has to read the unreadable in- formation theory papers before 1980 to appreciate the simplicity with which Abbas explains concepts. Later in my PhD career, when my original advisor Tom passed away, I turned to Abbas for support. He then took over as my primary advisor and kept the ball rolling. He helped me through the herculean task of solving my PhD thesis problem with his amazing ability to simplify the questions and ask the most fundamental questions. He took over Tom’s challenging problem and distilled it into the concept of exact common information, one of my primary thesis topics. Abbas is also known for his ability to give great presentations. He spends hours with his students in preparing for presentations and in writing papers. I still remember going through numerous dry runs of our ISIT presentations. We started our rehearsals 2 months before the conference and were doing our 10th rehearsal when the other in- formation theory students haven’t yet made the first draft. We all learnt the value of hard work. No great speaker is born great; great speakers are made. Next, I would like to thank Prof. Tsachy Weissman who served as my associate advisor. He taught me the course on universal schemes in information theory, topics

vi that every researcher in the field must be familiar with. I attended a lot of his research meetings and interacted with his students throughout my career. Next, I would like to thank Prof. Ayfer Ozgur, Prof. Thomas Courtade and Prof. Amir Dembo, who served as committee members of my PhD oral defense and provided valuable insight and feedback on my research. Next, I would like to thank Yeow Khiang Chia, whose discussions resulted in my formulation of the Boolean functions conjecture. I would also like to thank Prof. Chandra Nair and Prof. Thomas Courtade for their help in making progress on the conjecture. Next, I would like to thank all my collaborators and everyone from the informa- tion theory community I interacted with, including Lei Zhao, Paul Cuff, Haim Per- muter, Idoia Ochoa Alvarez, Mikel Hernaez, Kartik Venkat, Vinith Misra, Alexandros Manolakos, Albert No, Bernd Bandemer, Cheuk Ting Li, Hyeji Kim, Jiangtao and Young-Han-Kim who helped me at various stages of my PhD career and added to the richness of my PhD experience. Next I would like to thank every professors who taught me interesting courses at Stanford and created a unique Stanford experience. I would especially like to thank Prof. Stephen Boyd for his course on convex optimization, Prof. Andrew Ng for his course on Machine Learning and Matt Vassar for his course on public speaking. I would also love to thank Stanford for teaching me various unique skills, one of which is a circus art named aerial fabrics. I like to thank my aerial fabrics teachers Elizabeth, Rachel, Erica, Kristin and Graham. Next, I would like to thank the support staff at Stanford, including Denise Murphy, Karin Sligar, Katt Clark, Andrea Kuduk, Douglas Chaffee, Vickie Carrilo and IT expert John DeSilva. Their efficiency and expertise saved a lot of precious time and helped me focus on my research. Finally, I would love to thank almighty for guiding me in a successful path. At this point, I would love to point out an incident that explains a butterfly effect in my life. During my early years, my dad wanted to enroll me in a school where the medium of instruction is our local language Tamil rather than English. If he had proceeded with his plan, my life and career would be completely different from what

vii it is now. Also, the exam that got me in to IIT was a one day exam. A lot could’ve gone wrong during that day. Numerous key events and choices can change our lives completely. Since I cannot attribute a lot of my successes and experiences to skill alone, I would love to thank almighty for ensuring I am in a trajectory for success.

viii Contents

Abstract iv

Acknowledgements v

1 Introduction 1 1.1 EntropyasMeasureofInformation ...... 2 1.1.1 Compression...... 3 1.1.2 Source Simulation ...... 4 1.1.3 RandomnessExtraction ...... 4 1.2 MeasuresofCommonInformation...... 5 1.2.1 Distributed Compression ...... 6 1.2.2 Distributed Source Simulation ...... 8 1.2.3 DistributedSecret-keyGeneration...... 10 1.2.4 Relationship between Common Information Measures . . . . . 11 1.3 EfficiencyofInformationExtraction...... 12 1.4 Results on Exact Distributed Simulation and Information Extraction 14 1.4.1 Exact Distributed Simulation ...... 14 1.4.2 Efficiency of Information Extraction via Boolean Functions . . 14 1.4.3 Organization ...... 15

2 Exact Distributed Simulation 17 2.1 DefinitionsandProperties ...... 18 2.1.1 Properties of G(X; Y ) ...... 20 2.2 ExactCommonInformationRate ...... 22

ix 2.2.1 ExactCommonInformationRatefortheSBES ...... 25 2.2.2 ApproximateCommonInformationRate ...... 28 2.3 ExactCoordinationCapacity ...... 30 2.4 OnComputationofExactCommonInformation ...... 33 2.5 ConcludingRemarks ...... 35

3 Boolean Functions 36 3.1 Main Results and Implications ...... 39 3.1.1 Notation and Definitions ...... 39 3.1.2 RefinementofConjecture1 ...... 40 3.1.3 Conjecture2andedge-isoperimetry ...... 41 3.1.4 A computer-assisted proof of Conjecture 3 for any given α .. 45 3.2 Proofs ...... 49 3.2.1 ProofofTheorem7...... 49 3.2.2 ProofofTheorem9...... 53 3.3 ConcludingRemarks ...... 56

A Exact Distributed Simulation 57 A.1 Computing G(X; Y )forSBES...... 57 A.2 Subadditivity of G(Xn; Y n) ...... 58 A.3 ProofofProposition3 ...... 59 A.4 ProofofProposition5 ...... 59 A.5 ProofofLemma2...... 60 A.6 ProofofProposition7 ...... 63 A.7 ProofofProposition8 ...... 64

B Boolean Functions 67 B.1 Randomized Boolean functions do not improve mutual information . 67 B.2 Further comments on edge-isoperimetry, the Takagi function, and H(b(Xn) Y n) 68 | B.3 Enumerating the functions in ...... 71 Sn B.4 H¨older Continuity of Sα(p)...... 71

x Bibliography 74

xi List of Figures

1.1 Setting for zero-error distributed lossless compression...... 6 1.2 SettingforWynercommoninformation...... 9 1.3 Setting for G´acs–K¨orner–Witsenhausen common information...... 10 1.4 Setting for information extraction...... 13

2.1 Setting for distributed generation of correlated random variables. For exact generation, (X,ˆ Yˆ ) p (x, y)...... 18 ∼ X,Y 2.2 Setting for exact channel simulation...... 31

3.1 Illustration of the compression operator CI. The boolean function n I Ic n b(x )= b(x , x ) is shown in the left and CIb(x ) is shown in the right. (For easier visualization, we use an empty cell to represent b(xn) = 1.) 42 3.2 Successive one-dimensional compressions applied to the boolean func- tion b(xn) defined by b−1(0) = 000, 010, 011, 111 , elements of which { } are represented by balls on the three-dimensional hypercube. (Vertices are only labeled on the leftmost cube for cleaner presentation.) . . . . 43

3.3 A comparison of Sα(p) and H(p)H(α) for α = 0.05. The broken line shows the chords Algorithm 3.1.1 constructs before terminating. . .. 48

A.1 ProofofLemma2...... 61

xii Chapter 1

Introduction

In a distributed information processing system, there can be several sources that are dependent. For example in a sensor network for measuring the temperature in California, the measurements from Los Angeles and San Diego are dependent. In a wireless communication network, the received signals from multiple paths are depen- dent. There are also cases in which one may want to generate correlated sources on purpose to facilitate some sort of distributed coordination between agents. For exam- ple suppose one wishes to generate an agreed upon secret key at different locations. A satellite may broadcast a random signal and the received noisy versions of it are then used to generate the shared secret key to be used for secure communication over a public network. Another example is distributed simulation. Suppose we wish to perform weather simulation over some geographical area using multiple processors. Clearly the weather in the subareas assigned to different processors are correlated, so the processors cannot perform the simulation completely separately. In this case one can envision a central processor sending common information to the processors responsible for the simulation which they can then use together with their own local information to simulate the different areas. A natural question that occurs in all the above scenarios is, “How does one measure the amount of common information shared by a pair of dependent random variables X,Y ?” This question has been posed in various settings in Information Theory, resulting in numerous measures of common information [1, 2, 3]. In this chapter

1 CHAPTER 1. INTRODUCTION 2

we will study common information under three general settings, namely compression, simulation and extraction.

1.1 Entropy as Measure of Information

Before we delve into measures of common information, we will first look at how one can measure the amount of information contained in a single random variable X p (x). This question was first posed and answered convincingly by Shannon in ∼ X his seminal 1948 paper [1]. Shannon introduced entropy by first stating three reason- able assumptions or axioms that any measure H(X) = H pX (x) of the amount of information in X must satisfy. He showed that the only measure H (X) that satisfies those assumptions is given by

H(X)= p (x) log p (x), − X X x∈X X up to a multiplicative factor. Immediately after stating the above result, Shannon remarked, “This result and the assumptions required for its proof are in no way necessary for the present theory. It is given chiefly to lend a certain plausibility to some of our later definitions. The real justification of these definitions, however, will reside in their implications.” What Shannon meant is that entropy arises naturally as a measure of information in several key information processing settings. We will describe three of these settings, namely compression, randomness extraction and source simulation. To focus on the key ideas, we will only consider the simplest possible model of a source which is the Discrete Memoryless Source (DMS) defined by a set of symbols or outcomes and X a probability mass function over it pX (x). The source generates a sequence of iid symbols each of which is selected from the alphabet according to p (x). X X CHAPTER 1. INTRODUCTION 3

1.1.1 Compression

One way to measure the amount of information in a DMS is to compress it and specify the minimum number of bits required to describe it. The classic setup for zero-error compression of a DMS ( ,p (x)) consists of an encoder that compresses X X a source sequence Xn into a string of bits and a decoder that decompresses the string of bits and recovers the original source sequence Xn. This setup is ubiquitous: any email, text or video that we wish to communicate over the internet or other means is first encoded into a string of bits at the source and is then decoded to be reproduced exactly at the destination. The encoder and the decoder should agree on a code : n 0, 1 ∗ which is a one-one mapping associating each source C X → { } sequence xn n with a unique codeword (xn) consisting of a finite length string ∈ X C of bits. Since we allow variable length codewords, the decoder needs to have a way of knowing when a codeword ends so that one can concatenate multiple codewords without confusion. This is achieved by using prefix-free codes in which (xn) is never C 1 a prefix of (xn) for xn = xn. The expected description length of a code is given C 2 1 6 2 C by E ℓ (Xn) , where ℓ(s) denotes the length of a binary string s 0, 1 ∗. It is C ∈ { } reasonableh to measurei the information rate of the DMS ( ,p (x)) by specifying its  X X minimum expected per-symbol description length over all codes and block lengths n, measured in bits per symbol. The following theorem proved by Shannon shows that the minimum expected per-symbol description length of the DMS coincides with the entropy H(X).

Theorem 1. Given a DMS ( ,p (x)), we have X X

1 n inf min E ℓ (X ) = H(X)= pX (x) log pX (x), n n C − x∈X h i X where the minimum is over all prefix free codes : n 0, 1 ∗. C X →{ } Thus entropy arises naturally as a measure of information in a well-motivated setting, rather than through a set of axioms. CHAPTER 1. INTRODUCTION 4

1.1.2 Source Simulation

A second way of measuring the amount of information in a DMS is to specify the minimum number of bits per symbol required to simulate it. The corresponding setup consists of a simulator that has access to a source of iid Bern(1/2) bits (or equivalently, fair coin flips) and attempts to output an iid source sequence Xn n p (x ). The ∼ i=1 X i n number of bits required to generate X can be variable dependingQ on the outcomes. n We thus need a variable length string of bits B = b1b2b3 ... to generate X . Let Ln be the length of B and Rn = E[Ln]/n be the simulation rate, which is the expected per-symbol length of B needed to generate an instance Xn of the DMS. For example, suppose we wish to generate a single RV X Bern(1/3) RV. Here is ∼ one procedure to generate X: if b1 = 0 output X = 0. Otherwise, if b1b2 = 10, output

X = 1. Otherwise, if b1b2 = 11, discard b1, b2 and repeat the procedure starting from b3. For this procedure, the rate or the expected number of bits required to generate

X is R1 = E[L1]/1 = 2 bits. Under this setup, it is reasonable to define the information rate of the DMS ( ,p (x)) as the minimum simulation rate (infimum of rates) over all procedures X X and all n. It gives the minimum number of bits per symbol needed to simulate the DMS. Knuth and Yao [4] showed that the information rate coincides with the entropy rate H(X). Thus once again, entropy arises naturally as a measure of information.

1.1.3 Randomness Extraction

A third way of measuring the amount of information in a DMS is to specify the rate at which one can extract iid Bern(1/2) bits from it. In this setting we wish to extract n iid Bern(1/2) bits or random coin flips Bn from the a DMS ( ,p (x)). The source X X may be samples of thermal noise or radioactive decay and the extractor is a true random number generator. The randomness extractor generates the n iid bits using a variable number of source symbols. Let Ln be the length of the source sequence n required to extract B (Ln is random). The extraction rate Rn is defined as the number of random bits extracted divided by the expected number of source symbols, CHAPTER 1. INTRODUCTION 5

i.e., n Rn = . E[Ln]

For example, suppose that X is a Bern(1/3) source so that pX (1) = 1/3 and we want to generate B Bern(1/2). This corresponds to n = 1 (that is, we wish to generate 1 ∼ a single unbiased coin flip from a sequence of biased coin flips). Then here is a way to

do it. If X1 =0,X2 = 1 the extractor outputs B1 = 0 and if X1 =1,X2 = 0, then the extractor outputs B = 1. If the extractor sees X X 00, 11 , it discards X ,X 1 1 2 ∈ { } 1 2 and repeats the procedure again using X3,X4,X5,... . It is clear that the probability of producing a 1 is equal to that of producing a 0. The average number of X symbols

needed then is 9/2, thus achieving a rate R1 =2/9. Now we can define the information rate of the DMS ( ,p (x)) as the highest X X extraction rate (supremum of rates) Rn over all schemes and all n, which is the maximum number of random bits per symbol that can be extracted from the DMS. It turns out that the information rate is again equal to the entropy H(X). The literature on this problem is very large and goes back to work by von Neumann in the early fifties [5], but this result is a special case of a general result by Elias [6]. To summarize, entropy arises naturally as a measure of information in various settings including compression, simulation and extraction.

1.2 Measures of Common Information

We will now focus on measuring common information between a pair of dependent random variables X,Y . While entropy is unequivocally a natural measure of informa- tion, there are numerous measures of common information depending on the setting. We will once again consider three main settings, namely compression, simulation and extraction and look at the common information between X and Y under each set- ting. For simplicity, we restrict our source to be a 2-DMS ( ,p (x, y)) which X ×Y X,Y generates jointly i.i.d. pairs of random variables Xn,Y n n p (x ,y ). ∼ i=1 X,Y i i One way to measure the common information betweenQX and Y is to measure CHAPTER 1. INTRODUCTION 6

the amount of reduction in information by considering (X,Y ) jointly versus consider- ing them individually. The source ( ,p (x)) corresponds to H(X) bits per symbol X X and the source ( ,p (y)) corresponds to H(Y ) bits per symbol, whereas the 2-DMS Y Y ( , ,p (x, y)) jointly correspond to only H(X,Y ) H(X)+ H(Y ) bits per sym- X Y X,Y ≤ bol. The difference, H(X)+ H(Y ) H(X,Y ), is the well-known mutual information − I(X; Y ) and is the most natural measure of common information in all three settings namely, compression, simulation and extraction, as long as the operations are per- formed jointly with both sources at a single location. However, as we shall soon see, if we consider distributed versions of these three settings, we obtain different measures of common information under each setting.

1.2.1 Distributed Compression

Consider the setting for joint distributed compression of a 2-DMS shown in Fig. 1.1. In this setting, Encoder 1 (Alice) has access to Xn and Encoder 2 (Bob) has access to Y n. Each encoder compresses its source separately via a prefix-free code and sends them to the decoder, which then combines the two descriptions to reconstruct the 2-DMS.

n n X 1(X ) Encoder 1 C Xn,Y n Decoder n n Y 2(Y ) Encoder 2 C

Figure 1.1: Setting for zero-error distributed lossless compression.

We may define the rate used by each encoder as the expected per-symbol length

of the prefix free code used to describe its source sequence. Mathematically, R1 = (1/n) E[ℓ( (Xn))] and R = (1/n) E[ℓ( (Y n))]. C1 2 C2 While compressing Xn and Y n independently will yield rates H(X) and H(Y ) CHAPTER 1. INTRODUCTION 7

respectively, joint compression by the encoders may yield a sum rate R1 + R2 < H(X)+H(Y ). Thus, under this setting, it is reasonable to define common information as the maximum savings in sum rate, R∗ = H(X)+ H(Y ) min(R + R ), where C − 1 2 the savings are obtained by exploiting the dependence between the sources. For example, let the 2-DMS be defined by X = (U, W ) and Y = (V, W ) where U, V, W are independent. Encoder 1 may send (U, W ) at rate H(U, W ) = H(U)+ H(W ) and encoder 2 may send only V at rate H(V ). The savings in rate is given by R∗ = H(X)+ H(Y ) R + R = H(W ), which measures the common information C − 1 2 between X and Y . As another example, consider the Symmetric Binary Erasure Source (SBES(p)), which is a 2-DMS consisting of X Bern(1/2) and Y obtained by passing X through ∼ an Erasure(p) channel. For the SBES(p), we can achieve savings in sum-rate of 1 p − by sending X at rate 1 from Encoder 1 and the location of Erasures 1Y=e at rate H(p) from Encoder 2. ∗ In general, we don’t have a computable expression for RC . However, if we relax the zero-error constraint and allow the decoder to make errors (with vanishingly small probability) in estimating Xn,Y n, we can compute the common information. The corresponding setting is the classic Slepian-Wolf setting for lossless distributed source coding [7]. In this setting, we restrict the codes (Xn) and (Y n) to have fixed C1 C2 lengths of nR1 bits and nR2 bits respectively. We say that the rate-pair R1, R2 is achievable if there exists a sequence of codes such that

P (n) = Pr (Xˆ n, Yˆ n) =(Xn,Y n) 0. e { 6 }→

We measure the common information between X and Y as the savings in sum-rate due to joint distributed compression. The use of H(X) to measure information rate of ( ,p (x)) makes sense even in X X the lossless setting because it can be shown that the source ( ,p (x)) still requires a X X minimum description rate of H(X) bits per symbol even after relaxing the zero-error source coding constraint to a lossless source coding constraint.

Slepian and Wolf [7] characterized the achievable R1, R2 pairs and showed that CHAPTER 1. INTRODUCTION 8

a sum rate of R1 + R2 = H(X,Y ) is achievable. As a consequence, the maximum savings in sum-rate is I(X; Y ) and thus the rate of losslessly compressing the 2-DMS separately is the same as that of compressing it jointly. In some cases the common information rate for the zero-error case is equal to that for the lossless case. For example, for the SBES, the mutual information is 1 p − ∗ ∗ which is the same as RC . However, RC is not always equal to I(X; Y ). K¨orner and Orlitsky showed that R∗ = 0 for any 2-DMS with p (x, y) > 0 x, y [8], whereas C X,Y ∀ I(X; Y ) = 0 if and only if X,Y are independent. For example, consider a Doubly Symmetric Binary Source, written DSBS(α), which consists of X Bern(1/2) and Y ∼ obtained by passing X through a Binary Symmetric Channel (BSC) with cross-over probability α (0, 1/2). For the DSBS, R∗ = 0, but I(X; Y )=1 H(α) > 0. ∈ C −

1.2.2 Distributed Source Simulation

Consider the setting for distributed simulation shown in Fig. 1.2, first considered by Wyner [3]. Alice wishes to generate Xn and Bob wishes to generate Y n such that (Xn,Y n) appears to be generated by a 2-DMS ( ,p (x, y)). For this X ×Y X,Y purpose, Alice and Bob have access to a common message W Unif[1 : 2nR], n ∼ which corresponds to R common bits per symbol. Alice uses a stochastic decoder n n n p ˆ n (x w) to produce an estimate Xˆ from W , and Bob uses p ˆ n (y w) to X |Wn | n Y |Wn | produce Yˆ n. This setting naturally gives rise to a measure of common information between X and Y , namely the minimum number of common bits per symbol R, required to simulate the 2-DMS ( ,p (x, y)). X ×Y X,Y In this section, we discuss approximate source simulation considered by Wyner. Under this setting, a rate R is said to be achievable if there exists a sequence of simulation schemes such that

n −nR n n lim 2 p ˆ n (x w)p ˆ n (y w) pX,Y (xi,yi) =0, (1.1) n→∞ X |Wn | Y |Wn | − w∈[1:2nR] i=1 X Y T V

where p q = p(u) q(u) denotes the total variation distance between | − |T V u∈U | − | two pmf’s p( ) and q( ) over the alphabet . Thus we only require the distribution of · P· U CHAPTER 1. INTRODUCTION 9

W Unif[1 : 2nR] n ∼

Alice Decoder 1 Decoder 2 Bob

Xˆ n Yˆ n Figure 1.2: Setting for Wyner common information.

(Xˆ n, Yˆ n) to be approximately equal to the desired distribution. The Wyner common information J(X; Y ) between X and Y is the infimum of achievable rates R. Wyner [3] showed the following:

Theorem 2. The single-letter characterization for J(X; Y ) is given by

J(X; Y ) = min I(W ; X,Y ) (1.2) X→W →Y

As an example, let the 2-DMS is defined by X = (U, W ) and Y = (V, W ) where U, V, W are independent. For this 2-DMS, it can be shown that J(X; Y ) = H(W ) ∗ which coincides with RC . Another example is the SBES(p), for which it can be shown that

1, if p< 0.5, J(X; Y )=  H(p), if p 0.5.  ≥

∗  In this case, J(X; Y ) > RC . CHAPTER 1. INTRODUCTION 10

1.2.3 Distributed Secret-key Generation

Consider the setting shown in Fig. 1.3 for zero-error generation of a common secret key introduced by G´acs and K¨orner [2].

Xn Y n Alice Bob

Kn Kn Figure 1.3: Setting for G´acs–K¨orner–Witsenhausen common information.

Alice has access to Xn and Bob has access to Y n where (Xn,Y n) is generated by a 2-DMS ( ,p (x, y)). Alice and Bob want to generate a common key X ×Y X,Y n n Kn = Kn(X )= Kn(Y ) from their respective sources. This secret key can later be used for secure communication over a public channel. The G´acs-K¨orner-Witsenhausen common information K(X; Y ) is the maximum rate of common-key generation:

1 K(X; Y ) = sup max H(Kn) n Extractor n

Theorem 3. The single-letter characterization for G´acs-K¨orner-Witsenhausen com- mon information is given by

K(X; Y ) = max H(W ). (1.3) W =f(X)=g(Y )

The optimal W in (1.3) is obtained by permuting the rows and columns of

pX,Y (x, y) to arrange it in block-diagonal form with the largest number of blocks as shown below. CHAPTER 1. INTRODUCTION 11

X Y

W = 1 0 0

0 W = 2 0

0 0 W = k

The optimal W is said to be the “common part” of (X,Y ). Observe that for any 2-DMS with p (x, y) > 0 x, y, K(X; Y ) = 0 since there X,Y ∀ is only one block in such a pX,Y (x, y) matrix. Further, in the above scenario, it can be shown that Alice and Bob cannot agree on even 1 bit of common secret key from (Xn,Y n) for any n (even with vanishingly small probability of disagreement) [9].

1.2.4 Relationship between Common Information Measures

∗ So far, we mentioned 4 measures of common information, namely RC , I(X; Y ), J(X; Y ), K(X; Y ). We summarize the relationship between the various measures as follows:

K(X; Y ) R∗ I(X; Y ) J(X; Y ) min H(X),H(Y ) . ≤ C ≤ ≤ ≤ { }

The largest measure among these is the Wyner common information J(X; Y ) which arises in approximate distributed simulation. The next largest measure of common information is the well known mutual infor- mation. It can be strictly smaller than J(X; Y ), for example, for the SBES. R∗ I(X; Y ) makes sense because zero-error compression requires a higher sum- C ≤ ∗ rate than lossless compression. The inequality may be strict: RC = 0 for DSBS, whereas I(X; Y ) > 0. CHAPTER 1. INTRODUCTION 12

Finally, K(X; Y ) R∗ because the common part between X and Y need to be ≤ C sent at least once to the decoder in zero-error distributed compression.

1.3 Efficiency of Information Extraction

Towards the end of Section 1.2.3, we saw that the G´acs-K¨orner-Witsenhausen com- mon information K(X; Y ) is 0 for a large number of sources in nature. Thus, speci- fying 1 bit of information about Xn will reveal strictly less than 1 bit of information about Y n. Therefore, it makes sense to measure the efficiency of information ex- traction, i.e., the maximum amount information that can be obtained about Y n by extracting a fixed number of bits from Xn. Consider the setting shown in Fig. 1.4 which consists of a 2-DMS ( , ,p (x, y)) X Y X,Y with Xn made available to the encoder. The encoder’s goal is to find a function f : n [1 : 2nR] that maximizes the mutual information I(f(Xn); Y n). We can X → define the quantity

1 ∆(R) = sup max I(f(Xn); Y n) (1.4) n nR n f:X →[1:2 ] n that specifies the maximum per-letter mutual information that a rate R description of source can have with the source . ∆(R) was first defined by Erkip in the context X Y of gambling [10]. If Y n is a sequence of iid horse races and side information Xn is available via a rate-limited channel at rate R, ∆(R) specifies the maximum increase in doubling rate of wealth due to side information. Erkip [10] showed the following.

Theorem 4. The single-letter characterization for ∆(R) of a 2-DMS ( , ,p (x, y)) X Y X,Y is given by ∆(R) = max I(U; Y ), p(u|x):I(U;X)≤R where it suffices to maximize over all U with U X +1. | |≤| | CHAPTER 1. INTRODUCTION 13

Xn 2-DMS Encoder ( ,p (x, y) f : n [1 : 2nR] X ×Y X,Y X →

n Y n f(X )

Figure 1.4: Setting for information extraction.

Note that ∆(R) is a concave function of R. Of particular interest is the quantity

d ∆′(0) = s∗(X; Y )= ∆(R) , (1.5) dR R→0

proposed by Erkip [10]. It denotes the initial extraction efficiency, the initial rate of increase in mutual information of f(Xn) with Y n. This quantity was analyzed in depth by Anantharam et. al. [11] who showed

I(U; Y ) s∗(X; Y ) = sup . (1.6) p(u|x) I(U; X)

One can recast (1.6) as I(U; Y ) s∗(X; Y )I(U; X) for any Markov chain U X ≤ → → Y , thereby obtaining a stronger version of the data-processing inequality. Example: For the DSBS(α), the G´acs-K¨orner common information K(X; Y )=0 and the mutual information I(X; Y )=1 H(α). The initial efficiency of information − extraction s∗(X; Y ) is given by

I(U; Y ) s∗(X; Y ) = sup = (1 2α)2. (1.7) p(u|x) I(U; X) −

Remark: s∗(X; Y ) = 1 iff the G´acs-K¨orner common information K(X; Y ) > 0. CHAPTER 1. INTRODUCTION 14

1.4 Results on Exact Distributed Simulation and Information Extraction

In this thesis, we will consider two problems. One is related to distributed simulation of a 2-DMS and the other is related to the efficiency of information extraction.

1.4.1 Exact Distributed Simulation

In Section 1.2.2, we discussed Wyner Common Information which captures the mini- mum number of bits per symbol of common randomness required to simulate a 2-DMS in a distributed fashion. Wyner required the distribution of (Xˆ n, Yˆ n) to be close in total variation to the required jointly iid product distribution. In our setting, we ˆ n ˆ n n require (X , Y ) to have the exact product distribution i=1 pX,Y (xi,yi). We also relax the requirement of W Unif[1 : 2nR] and instead allow any arbitrary RV W . n ∼ Q n We measure the rate R as the expected per symbol length of a prefix free code for ∗ Wn. We define the corresponding minimum common information rate R as the exact common information rate G(X; Y ). For example, if the 2-DMS is defined by X =(U, W ) and Y =(V, W ) where U, V are conditionally independent given W , the central node will simply describe W n at rate H(W ) to both Alice and Bob. In that case, we can show G(X; Y )= H(W ). We do not know how to compute G(X; Y ) in general. One of our main results that we will show in Chapter 2 is that, for the SBES(p), G(X; Y )= J(X; Y ). While we can show G(X; Y ) J(X; Y ), we do not know if G(X; Y )= J(X; Y ) in general. ≥

1.4.2 Efficiency of Information Extraction via Boolean Func- tions

In Section 1.3, we introduced the quantity s∗(X; Y ) which is a measure of the initial extraction efficiency for extracting information about Y n from Xn. We now provide a simply stated conjecture regarding the maximum mutual information a boolean function can extract about Y n from Xn. CHAPTER 1. INTRODUCTION 15

Conjecture 1. Let (Xn,Y n) be n jointly iid pairs of RVs generated by a DSBS(α). For any boolean function b : 0, 1 n 0, 1 , { } →{ }

I(b(Xn); Y n) 1 H(α). (1.8) ≤ −

n The outer bound in (1.8) is trivially achieved by setting b(x ) = x1 (or xi or its complement). A comparison between (1.8) and (1.4) seems to suggest that Conjecture 1 can be proven by letting R 0 in Erkip’s setting for efficiency of information extraction from → Fig. 1.4. However, s∗(X; Y ) = (1 2α)2 > 1 H(α) for a DSBS with α (0, 1/2), − − ∈ suggesting that one can extract (1 2α)2 bits of mutual information with Y n for each − bit of Xn, directly contradicting Conjecture 1. The discrepancy is due to a subtle exchange of limits involved. While Erkip’s setting allows the number of bits specified about Xn to go to infinity asymptotically as n even at fixed rates R close to → ∞ zero, the statement of Conjecture 1 restricts the encoder to specify at most 1 bit of information for any n. Thus, we only have a loose outer bound

I(b(Xn); Y n) s∗(X; Y )=(1 2α)2. ≤ −

We discuss Conjecture 1 in greater detail in Chapter 3.

1.4.3 Organization

The rest of the thesis is organized as follows. In Chapter 2, we will discuss the notion of exact common information rate G(X; Y ) which is the analogue of Wyner Common Information J(X; Y ) with exact simulation.

We will start by introducing the “common entropy” G(X; Y ) = minX→W →Y H(W ) in Section 2.2. Then we will formally define the exact common information rate G(X; Y ) n n in Section 2.2 and show G(X; Y ) = limn→∞(1/n)G(X ; Y ). We will show a main result that for the SBES, G(X; Y ) = J(X; Y ). In Section 2.3, we will discuss the problem of exact coordination capacity. In Section 2.4, we will discuss computation CHAPTER 1. INTRODUCTION 16

of G(X; Y ) which involves solving a non-convex optimization problem. We will pro- vide cardinality bounds for computing G(X; Y ) and use them to provide an explicit expression for G(X; Y ) in the case of binary alphabets. We devote Chapter 3 to discuss Conjecture 1 in greater depth. Section 3.1 pro- vides a summary of our main results regarding Conjecture 1 and their implications. Primarily, it focuses on a refinement of Conjecture 1 into two “sub-conjectures” and provides substantial evidence supporting their validity. Section 3.2 contains the proofs of the main results, and Section 3.3 delivers concluding remarks. Chapter 2

Exact Distributed Simulation

In Section 1.2.2, we introduced the distributed simulation setup and mentioned Wyner Common Information J(X; Y ) which is the minimum number of bits of common ran- domness needed to generate one symbol pair of the 2-DMS ( ,p (x, y)) in a X ×Y X,Y distributed fashion with asymptotically vanishing total variation. We then posed a question regarding the minimum common randomness rate needed to generate the 2-DMS exactly rather than approximately. In this chapter, we will study the corre- sponding notions of exact common information and exact common information rate.

In Section 2.2, we will introduce the “common-entropy” G(X; Y ) = minX→W →Y H(W ) as a natural bound on the exact common information and study some of its properties. We will then define the exact common information rate for a 2-DMS. We show that n n it is equal to the limit G(X; Y ) = limn→∞(1/n)G(X ; Y ) and that it is in general greater than or equal to the Wyner common information. One of the main results in this thesis is to show that G(X; Y ) = J(X; Y ) for the SBES. A consequence of this result is that the quantity G(Xk; Y k) can be strictly smaller than kG(X; Y ), that is, the per-letter common entropy can be reduced by increasing the dimension. We then introduce a notion of approximate common information rate which uses a vari- able length description for common randomness, but relaxes the condition of exact generation to asymptotically vanishing total variation. We show that the correspond- ing minimum rate coincides with the Wyner common information. Computing the quantity G(X; Y ) involves solving a non-convex optimization problem. In Section 2.4

17 CHAPTER 2. EXACT DISTRIBUTED SIMULATION 18

we present cardinality bounds on W and use them to find an explicit expression for G(X; Y ) when X and Y are binary.

2.1 Definitions and Properties

Consider the distributed generation setup depicted in Figure 2.1. Alice and Bob both have access to common randomness W . Alice uses W and her own local randomness to generate X and Bob uses W and his own local randomness to generate Y such that (X,Y ) p (x, y). We wish to find the limit on the least amount of common ∼ X,Y randomness needed to generate (X,Y ) exactly. W

Alice Decoder 1 Decoder 2 Bob

Xˆ Yˆ Figure 2.1: Setting for distributed generation of correlated random variables. For exact generation, (X,ˆ Yˆ ) p (x, y). ∼ X,Y

More formally, we define a simulation code (W, R) for this setup to consist of

A common random variable W p (w). As a measure of the amount of ◦ ∼ W common randomness, we use the per-letter minimum expected codeword length R over the set of all variable length prefix-free zero-error binary codes 0, 1 ∗ C⊂{ } for W , i.e., R = min E(L), where L is the codeword length of the code for C C W .

A stochastic decoder p ˆ (x w) for Alice and a stochastic decoder p ˆ (y w) ◦ X|W | Y |W | for Bob such that Xˆ and Yˆ are conditionally independent given W . CHAPTER 2. EXACT DISTRIBUTED SIMULATION 19

The random variable pair (X,Y ) is said to be exactly generated by the simulation code (W, R) if pX,ˆ Yˆ (x, y)= pX,Y (x, y). We wish to find the exact common information R∗ between the sources X and Y , which is the infimum over all rates R such that the random variable pair (X,Y ) can be exactly generated. Remark: The exact common information R∗ (and the exact common information rate defined in the next section) can be also defined through a “zero error” version of the Gray-Wyner system [12]. This approach, however, is neither operationally better motivated than the above setup nor yields better insights or results. Hence, we will not pursue this alternative setup any further. Define the following quantity, which can be interpreted as the “common entropy” between X and Y , G(X; Y ) = min H(W ). (2.1) W : X→W →Y Remark: We can use min instead of inf in the definition of G(X; Y ) because the cardinality of W is bounded as we will see in Proposition 5, hence the optimization for computing G(X; Y ) is over a closed set. Following the proof of Shannon’s zero-error compression theorem, we can readily show the following.

Proposition 1. G(X; Y ) R∗

G(X; Y ) = min 1,H(p)+1 p . { − }

Note that the Wyner common information for this source is [13]

1 if p 0.5, J(X; Y )=  ≤ H(p) if p> 0.5.

 CHAPTER 2. EXACT DISTRIBUTED SIMULATION 20

In the following we present some basic properties of G(X; Y ).

2.1.1 Properties of G(X; Y )

1. G(X; Y ) 0 with equality if and only if X and Y are independent. ≥

Proof. First assume G(X; Y ) = 0. Suppose W achieves G(X; Y ). Then X → W Y and H(W ) = 0. Thus W = φ, constant. Hence X,Y are independent. → To show the converse, if X,Y are independent, X φ Y and G(X; Y ) → → ≤ H(φ) = 0. Thus G(X; Y ) = 0.

2. G(X; Y ) J(X; Y ). ≥

Proof.

G(X; Y ) = min H(W ) W :X→W →Y = H(W ∗) I(W ∗; X,Y ) ≥ min I(W ; X,Y ) ≥ W :X→W →Y = J(X; Y ).

3. Data-processing Inequality: If U X Y forms a Markov chain, then → → G(U; Y ) G(X; Y ). ≤

Proof. Let W achieve G(X; Y ). Then U X W Y forms a Markov → → → chain. Hence, G(U; Y ) H(W )= G(X; Y ). ≤

4. Define G(X; Y Z) = p (z)G(X; Y Z = z). Then G(X; Y ) H(Z)+ | z∈Z Z | ≤ G(X; Y Z). | P CHAPTER 2. EXACT DISTRIBUTED SIMULATION 21

Proof. For each Z = z, choose W as the random variable that achieves G(X; Y Z = z | z). Then X (Z, W ) Y forms a Markov chain. Therefore, → Z →

G(X; Y ) H(Z, W ) ≤ Z = H(Z)+ H(W Z) Z | = H(Z)+ pZ(z)H(Wz) z X = H(Z)+ G(X; Y Z). |

5. If there exist functions f(X) and g(Y ) such that Z = f(X) = g(Y ), then G(X; Y )= H(Z)+ G(X; Y Z). |

Proof. First observe that if Z satisfies the condition Z = f(X)= g(Y ) and W satisfies the Markov condition X W Y , then Z is a function of W . To see → → why, note that

p (x, y)= p (w)p (x w)p (y w). X,Y W X|W | Y |W | wX∈W Therefore, if p (x, y) = 0 and p (x w) > 0, then p (y w) = 0. X,Y X|W | Y |W | Now we will show that if p (x w) > 0 and p (x w) > 0, then f(x ) = X|W 1| X|W 2| 1 f(x ). If not, for any y such that g(y) = f(x ), p (x ,y) = 0, therefore 2 6 1 X,Y 1 p (x w) > 0 = p (y w) = 0. Similarly for any y such that g(y) = X|W 1| ⇒ Y |W | 6 f(x ), p (x ,y) = 0, therefore, p (x w) > 0 = p (y w)=0. Asa 2 X,Y 2 X|W 2| ⇒ Y |W | consequence, for any y, p (y w) = 0, a contradiction. Y |W | Thus, for any w the set f(x) : p (x w) > 0 has exactly one element. { X|W | } If we define h(w) as that unique element, then f(X) = h(W ). Therefore, Z = f(X)= g(Y )= h(W ). CHAPTER 2. EXACT DISTRIBUTED SIMULATION 22

To complete the proof, let W achieve G(X; Y ). Then,

G(X; Y )= H(W )= H(W, Z) = H(Z)+ H(W Z) | = H(Z)+ p (z)H(W Z = z) Z | z X H(Z)+ p (z)G(X; Y Z = z) ≥ Z | z X = H(Z)+ G(X; Y Z). |

This, in combination with Property 4, completes the proof.

6. Let T (X) be a sufficient statistic of X with respect to Y ([14], pg. 305). Then G(X; Y ) = G(T (X); Y ). Further, if W achieves G(X; Y ), we have H(W ) H(T (X)). Thus a noisy description of X via W may potentially have ≤ a smaller entropy than the minimal sufficient statistic, which is a deterministic description.

Proof. Observe that both Markov chains T (X) X Y and X T (X) Y → → → → hold. Hence by the data-processing inequality, G(X; Y ) = G(T (X); Y ). Now, since W achieves G(X; Y ) and T (X) T (X) Y , → →

H(W )= G(X; Y )= G(T (X); Y ) H(T (X)). ≤

2.2 Exact Common Information Rate

The distributed generation setup in Figure 2.1 can be readily extended to the n-letter n setting in which Alice wishes to generate X from common randomness Wn and her n local randomness and Bob wishes to generate Y from Wn and his local randomness n n n such that p ˆ n ˆ n (x ,y ) p (x ,y ). We define a simulation code (W ,R,n) X ,Y ∼ i=1 X,Y i i n for this setup in the same mannerQ as for the one-shot case. CHAPTER 2. EXACT DISTRIBUTED SIMULATION 23

We say that Alice and Bob can exactly generate the 2-DMS (X,Y ) at rate R if for some n 1, there exists a (W ,R,n) simulation code that exactly generates ≥ n n n (X ,Y ) (since we assume prefix-free codes for Wn, we can simulate for arbitrarily large lengths via concatenation of successive codewords). We wish to find the exact common information rate R∗ between the sources X and Y , which is the infimum over all rates R such that the 2-DMS (X,Y ) can be exactly generated. Define the “joint common entropy” quantity

n n G(X ; Y ) =n min n H(Wn). (2.2) Wn: X →Wn→Y

n n n n It can be readily shown that limn→∞(1/n)G(X ; Y ) = infn∈N(1/n)G(X ; Y ). Hence, we can define the limiting quantity

1 G(X; Y ) = lim G(Xn; Y n). n→∞ n

We are now are ready to establish the following multiletter characterization for the exact common information rate.

Proposition 2 (Multiletter Characterization of R∗). The exact common information rate between the components X and Y of a 2-DMS (X,Y ) is

R∗ = G(X; Y ).

Proof. Achievability: Suppose R>G(X; Y ). We will show that the rate R is achiev- able. Since G(X; Y ) = lim (1/n)G(Xn; Y n), R (1/n)(G(Xn; Y n)+1) for n large n→∞ ≥ enough. By the achievability part of Shannon’s zero-error source coding theorem, it is possible to exactly generate (Xn,Y n) at rate at most (1/n) G(Xn; Y n)+1 . Hence rate R is achievable and thus R∗ G(X; Y ). ≤  Converse: Now suppose a rate R is achievable. Then there exists a (Wn,R,n)- simulation code that exactly generates (Xn,Y n). Therefore, by the converse for Shan- non’s zero-error source coding theorem, R (1/n)G(Xn; Y n). Since (1/n)G(Xn; Y n) ≥ ≥ G(X; Y ), we conclude that R∗ G(X; Y ). ≥ CHAPTER 2. EXACT DISTRIBUTED SIMULATION 24

As expected the exact common information rate is greater than or equal to the Wyner common information.

Proposition 3. G(X; Y ) J(X; Y ). ≥ The proof of this result is in Appendix A.3. We now present two cases where G(X; Y )= J(X; Y ).

1. Suppose X = (X′, W ) and Y = (Y ′, W ) where X′ and Y ′ are conditionally independent given W . Then G(X; Y )= G(X; Y )= J(X; Y )= H(W ).

Proof. We already know J(X; Y ) = H(W ) [3]. Hence G(X; Y ) J(X; Y ) = ≥ H(W ). Since X W Y forms a Markov chain, we also have G(X; Y ) → → ≤ H(W ). Thus G(X; Y ) = H(W ). Since J(X; Y ) G(X; Y ) G(X; Y ), we ≤ ≤ also have G(X; Y )= H(W ).

2. Suppose J(X; Y ) = H(X) for the DMS (X,Y ). Then G(X; Y ) = G(X; Y ) = J(X; Y )= H(X).

Proof. Since X X Y forms a Markov chain, G(X; Y ) H(X). Further, → → ≤ G(X; Y ) J(X; Y ) = H(X). Thus G(X; Y ) = H(X). Finally observe that ≥ G(X; Y ) is sandwiched between G(X; Y ) and J(X; Y ).

As an example, let (X,Y ) be a uniform distribution on the set (x, y) { ∈ 1, 2, 3 1, 2, 3 : x = y . For this DMS, it can be shown that J(X; Y ) = { }×{ } 6 } H(X)= H(Y ). Hence, G(X; Y )= H(X)= H(Y ).

In the above examples, G(X; Y )= J(X; Y )= G(X; Y ). In the following section, we show that G(X; Y ) = J(X; Y ) < G(X; Y ) for the SBES in Example 1. We do not know if G(X; Y )= J(X; Y ) in general, however. CHAPTER 2. EXACT DISTRIBUTED SIMULATION 25

2.2.1 Exact Common Information Rate for the SBES

We will need the following result regarding computing the Wyner common informa- tion for the SBES.

Lemma 1. To compute J(X; Y ) for the SBES, it suffices to consider W of the form

X w.p. 1 p1, W =  − e w.p. p1,  and 

W w.p. 1 p2, Y =  − e w.p. p2, where p ,p satisfy p + p p p =p, the erasure probability of the SBES. 1 2 1 2 − 1 2  The proof follows by [13], Appendix A. We now present the main result on exact common information rate in this chapter.

Theorem 5. If (X,Y ) is an SBES, then G(X; Y )= J(X; Y ).

Proof. In general G(X; Y ) J(X; Y ). We will now provide an achievability scheme ≥ to show that for SBES, G(X; Y ) J(X; Y ). ≤ Choose a W as defined in Lemma 1 and define

d if W 0, 1 , ˜ W =  ∈{ } e if W = e,

 d if Y 0, 1 , ˜ Y =  ∈{ } e if Y = e.  Note that Y˜ n, denoting the location of the erasures, is i.i.d. Bern(p) (with 1 e, ← 0 d) and independent of Xn. Furthermore, Y n is a function of Xn and Y˜ n. ← Codebook Generation: Generate a codebook consisting of 2n(I(Y˜ ;W˜ )+ǫ) sequences C w˜n(m), m [1 : 2n(I(Y˜ ;W˜ )+ǫ)], that “covers” almost all they ˜n sequences except for ∈ CHAPTER 2. EXACT DISTRIBUTED SIMULATION 26

a subset of small probability δ(ǫ). By the covering lemma ([15], page 62), such a codebook exists for large enough n. This lets us associate every covered sequencey ˜n with a uniquew ˜n =w ˜n(˜yn) ∈ C n n (n) such that (˜y , w˜ ) ǫ . ∈ T Define the random variable

w˜n(˜yn) ify ˜n is covered by , W˜ n = C (2.3)  n n y˜ ify ˜ is not covered.   n Note that W˜ n is a function of Y˜ and that the set of erasure coordinates in W˜ n is a subset of those in Y˜ n. Channel Simulation Scheme:

1. The central node generates W˜ n defined in (2.3) and sends it to both encoders.

n n n 2. Encoder 2 (Bob) generates Y˜ p ˜ n ˜ (˜y w˜ ) ∼ Y |Wn | 3. The central node generates and sends to both encoders a message M comprising n i.i.d. Bern(1/2) bits for only those coordinates i of X where W˜ n(i)= d. Thus H(M) n(1 p + δ(ǫ)). ≤ − 1 4. Encoder 1 (Alice) generates the remaining bits of Xn not conveyed by M using n n local randomness. Then X is independent of W˜ n, Y˜ and is i.i.d. Bern(1/2).

n n n n 5. Encoder 2 generates Y = Y (W˜ n,X ) = Y (W˜ n, M). He only needs the bits

Xi such that Y˜i = d, which are available via M.

To complete the proof, note that Xn (W˜ , M) Y n forms a Markov chain. → n → CHAPTER 2. EXACT DISTRIBUTED SIMULATION 27

Therefore,

G(Xn; Y n) H(W˜ , M)+1 H(W˜ )+ H(M)+1 ≤ n ≤ n (a) H(δ(ǫ))+(1 δ(ǫ))H(W˜ W˜ )+ δ(ǫ)H(W˜ W˜ / )+ n(1 p + δ(ǫ))+1 ≤ − n | n ∈ C n | n ∈ C − 1 (b) H(δ(ǫ))+(1 δ(ǫ)) log + δ(ǫ) log ˜n + n(1 p + δ(ǫ))+1 ≤ − |C| |Y | − 1 = n(I(Y˜ ; W˜ )+1 p + δ(ǫ)) − 1 (c) = n(I(W ; X,Y )+ δ(ǫ)),

where (a) follows by the grouping lemma for entropy, since P W˜ / = P Y˜ n not covered = { n ∈ C} { } δ(ǫ); (b) follows since entropy is upper bounded by log of the alphabet size; and (c) follows from the definition of mutual information and some algebraic manipulations. If we let n , we obtain G(X; Y ) I(W ; X,Y )+ δ(ǫ) for any ǫ > 0. Mini- → ∞ ≤ mizing I(W ; X,Y ) over all W from Lemma 1 completes the proof.

Note that the single letter characterization of the Wyner common information for the 2-DMS (Xk,Y k) k p (x ,y ) is k times that of the 2-DMS (X; Y ), that ∼ i=1 X,Y i i k k is, min I(W ; X ,Y ) = Qk min I(W ; X,Y ). The same property holds for the G´acs– K¨orner–Witsenhuesen common information [2], and for the mutual information. In the following we show that G(Xk; Y k) can be strictly smaller than kG(X; Y ). Hence, it is possible to realize gains in the “common entropy” when we increase the dimension. By the fact that for the SBES(p), G(X; Y ) = H(p) for p > 1/2 and G(X; Y ) = min 1,H(p)+1 p , there exists a p such that G(X; Y )

1/3 1/3 pX,Y = . 1/3 0    Then, by Proposition 8 in Section 2.4, we have G(X; Y ) = H(1/3), where H(p), CHAPTER 2. EXACT DISTRIBUTED SIMULATION 28

0 p 1, is the binary entropy function. Note that we can write ≤ ≤

1/9 1/9 1/9 1/9 1/9 0 1/9 0  pX2,Y 2 = 1/9 1/90 0      1/90 0 0     t  t t t 1/4 1 1 0 0 0 0 0 4 1/4 0 3 0 1/3 1 1 0 1 0 1 = + + + . 9 1/4 0 9 0 1/3 9 0 1 9 1 0                                 1/4 0 0 1/3 0 0 0 0                                 Let W p (w)= 4/9, 3/9, 1/9, 1/9 , then ∼ W h i 2 2 2 2 p 2 2 (x ,y )= p (w)p 2 (x w)p 2 (y w), X ,Y W X |W | Y |W | w X that is, X2 W Y 2 form a Markov chain. Thus, → →

2 2 G(X ; Y )= H(W )

2.2.2 Approximate Common Information Rate

Consider the approximate distributed generation setting in which Alice and Bob wish to generate 2-DMS (X,Y ) with vanishing total variation

n n n lim p ˆ n ˆ n (x ,y ) pX,Y (xi,yi) =0. n→∞ X ,Y − TV i=1 Y

We define a (Wn,R,n)-simulation code for this setting in the same manner as for exact distributed generation. We define the approximate common information rate ∗ RTV between the sources X and Y as the infimum over all rates R such that the 2-DMS (X,Y ) can be approximately generated. We can show that the approximate common information rate is equal to the Wyner CHAPTER 2. EXACT DISTRIBUTED SIMULATION 29

common information. Proposition 4. ∗ RTV = J(X; Y ). Proof. Achievability: Achievability follows from Wyner’s coding scheme [3]. Choose W Unif[1 : 2nR] and associate each w with a codeword of fixed length n ∼ n ∈ Wn ℓ(w )= nR . Decoders 1 (Alice) and 2 (Bob) first decode W and then use Wyner’s n ⌈ ⌉ n coding scheme to generate Xˆ n, Yˆ n, respectively. Any rate R>J(X; Y ) is admissible and will guarantee the existence of a scheme such that (Xˆ n, Yˆ n) is close in total variation to (Xn,Y n). Thus R∗ J(X; Y ). TV ≤ Converse: Suppose that for any ǫ> 0, there exists a (Wn,R,n) simulation code that generates (Xˆ n, Yˆ n) whose pmf differs from that of (Xn,Y n) by at most ǫ in total variation. Then we have

nR H(W ) I(Xˆ n, Yˆ n; W ) ≥ n ≥ n n = I(Xˆ , Yˆ ; W Xˆ q−1, Yˆ q−1) q q | q=1 Xn = I(Xˆ , Yˆ ; W, Xˆ q−1, Yˆ q−1) I(Xˆ , Yˆ ; Xˆ q−1, Yˆ q−1) q q − q q q=1 X (a) n I(Xˆ , Yˆ ; W ) nδ(ǫ) ≥ q q − q=1 X = nI(Xˆ , Yˆ ; W Q) nδ(ǫ) Q Q | − = nI(Xˆ , Yˆ ; W, Q) nI(Xˆ , Yˆ ; Q) nδ(ǫ) Q Q − Q Q − (b) nI(Xˆ , Yˆ ; W, Q) nδ(ǫ) ≥ Q Q − (c) nJ(X; Y ) nδ(ǫ). ≥ −

(a), (b) follow from Lemma 20 and Lemma 21 respectively in [16] since the pmf of (Xˆ n, Yˆ n) differs from that of (Xn,Y n) by at most ǫ in total variation; and (c) follows from the continuity of J(X; Y ) [3].

Remark: Note that if we replace the total variation constraint in Proposition 4 by CHAPTER 2. EXACT DISTRIBUTED SIMULATION 30

the stronger condition

n n n n n n p n n (x ,y )=(1 ǫ)p ˆ n ˆ n (x ,y )+ ǫr(x ,y ) (2.4) X ,Y − X ,Y for some pmf r(xn,yn) over n n, the required approximate common informa- X ×Y ∗ tion rate RSD becomes equal to the exact common information G(X; Y ). To show this, note that R∗ G(X; Y ) is trivial because the exact distributed generation SD ≤ constraint is stronger than (2.4). To show R∗ G(X; Y ), start with any (W ,R,n) simulation code that generates SD ≥ n (Xˆ n, Yˆ n) satisfying (2.4). Let

W w.p. 1 ǫ, ′ n − Wn =  (X¯ n, Y¯ n) r(xn,yn) w.p. ǫ.  ∼  We construct a (W ′ , R′, n) code that generates (Xn,Y n) exactly and satisfies R′ n ≤ ′ R + δ(ǫ). If the decoders receive Wn = Wn, they follow the original achievability ˆ n ˆ n ′ ¯ n ¯ n scheme to generate (X , Y ) satisfying (2.4). If Wn = (X , Y ), then the decoders simply output X¯ n and Y¯ n, respectively. Now,

H(W ′ ) H(ǫ)+(1 ǫ)H(W )+ ǫ log n n n ≤ − n |X| |Y| = H(Wn)+ nδ(ǫ).

Therefore, R′ (1/n)(H(W ′ )+1)= R + δ(ǫ)+1/n = R + δ(ǫ) for n large enough. ≤ n Thus R∗ G(X; Y ). SD ≥

2.3 Exact Coordination Capacity

In this section, we consider exact channel simulation, an extension of channel simu- lation with total variation constraint introduced in [17]. Consider the setup shown in Figure 2.2. Nature generates Xn n p (x ) that is available to the encoder. Both ∼ i=1 X i encoder and decoder have access toQ common randomness Wn. The encoder sends a CHAPTER 2. EXACT DISTRIBUTED SIMULATION 31

message M(Xn, W ) to the decoder. The decoder outputs Yˆ n using the message M,

the common randomness Wn, and local randomness. We wish to characterize the trade-off between the amount of common randomness and the information rate.

Wn

M n Encoder Decoder X Yˆ n

Figure 2.2: Setting for exact channel simulation.

Formally, a (Wn,R,R0, n) channel simulation code for this setup consists of

n a common random variable W p n (w) independent of the source X . Asa • n ∼ W measure of the amount of common randomness, we use the per-letter minimum

expected codeword length R0 over the set of all variable length prefix-free zero- error binary codes 0, 1 ∗ for W , i.e., R = min E(L), where L is the C0 ⊂ { } n 0 C0 codeword length of the code for W , C0 n an encoding function M(Xn, W ) that maps (Xn, W ) into a random variable • n n M. As a measure of the information rate, we use the per-letter minimum expected codeword length R over the set of all variable length prefix-free zero- error binary codes 0, 1 ∗ for M, i.e., R = min E(L′), where L′ is the C ⊂{ } C codeword length of the code for M, and C n n a stochastic decoder p ˆ n (y m, w) that outputs Yˆ . • Y |M,Wn | The channel simulation code is said to simulate the DMC p (y x) exactly if Y |X | n n n p ˆ n n (y x )= p (y x ). Y |X | i=1 Y |X i| i We wish to characterizeQ the set of all achievable rates (R, R0) for which the DMC p (y x) can be simulated exactly. Y |X | CHAPTER 2. EXACT DISTRIBUTED SIMULATION 32

We do not know the rate region for exact simulation of an arbitrary DMC. In the following, we show that for the erasure channel, it is equal to the rate region under total variation in [13].

Theorem 6. When X is a binary symmetric source and p (y x) is a binary erasure Y |X | channel with erasure probability p, the rate region for exact channel simulation is the

set of rate pairs (R, R0) such that

R r, ≥ (2.5) R + R H(p)+ r 1 H((1 p)/r) , 0 ≥ − −  for some 1 p r min 2(1 p), 1 . − ≤ ≤ { − } Proof. The converse holds for total variation constraint and should therefore trivially hold for the more stringent exact channel simulation constraint. The achievability

proof closely resembles that of Theorem 5. The central node still generates W˜ n, but the message M, instead of being generated at the central node, is now generated by

the encoder (Alice). Thus, a rate pair (R, R0) is achievable if

R = (1/n)(H(M)+1)+ δ(ǫ)=1 p + δ(ǫ), − 1 R = (1/n)(H(W˜ )+1)+ δ(ǫ)= I(Y˜ ; W˜ )+ δ(ǫ)= H(p) (1 p )H(p )+ δ(ǫ). 0 n − − 1 2 (2.6) for some p and p such that p = p + p p p . Letting r = 1 p shows that 1 2 1 2 − 1 2 − 1 R = r, R = H(p) rH((1 p)/r) is achievable. Finally note that both the encoder 0 − − (Alice) and the central node can share the responsibility of generating and sending

W˜ n to the decoder (Bob). Thus an arbitrary fraction of R0 can be removed and instead added to R. This shows the equivalence of (2.5) and (2.6). CHAPTER 2. EXACT DISTRIBUTED SIMULATION 33

2.4 On Computation of Exact Common Informa- tion

The optimization problem for determining G(X; Y ) is in general quite difficult, in- volving the minimization of a concave function over a complex Markovity constraint. In this section we provide some results on this optimization problem. We provide two bounds on the cardinality of W , establish two useful extremal lemmas, and use these results to analytically compute G(X; Y ) for binary alphabets. We then briefly discuss a connection to a problem in machine learning. We first establish the following upper bound on cardinality.

Proposition 5. To compute G(X; Y ) for a given pmf pX,Y (x, y), it suffices to con- sider W with cardinality . |W| ≤ |X||Y| The proof of this proposition is in Appendix A.4. We now state an extremal lemma regarding the optimization problem for G(X; Y ) that will naturally lead to another cardinality bound.

Lemma 2. Given p (x, y), let W attain G(X; Y ). Then for w = w , the supports X,Y 1 6 2 of p ( w ) and p ( w ) must be different. Y |W ·| 1 Y |W ·| 2 The proof of this lemma is in Appendix A.5. Lemma 2 yields the following bound on the cardinality of W .

Proposition 6. To compute G(X; Y ) for a given pmf pXY (x, y), it suffices to consider W with cardinality 2min{|X|,|Y|} 1. |W| ≤ − Proof. Suppose > 2|Y| 1. Since there are only 2|Y| 1 non-empty subsets of , |W| − − Y by pigeon-hole principle, there exists w = w such that the supports of p ( w ) 1 6 2 Y |W ·| 1 and p ( w ) are the same. This contradicts Lemma 2. Hence 2|Y| 1. By Y |W ·| 2 |W| ≤ − a symmetric argument, 2|X| 1. |W| ≤ − The following shows that the bound in Proposition 6 is tight.

Example 2 Let (X,Y ) be a SBES with p = 0.1. Since pX,Y (0, 1) = pX,Y (1, 0) = 0, the Markovity constraint X W Y implies that the only W with = 2 is → → |W| CHAPTER 2. EXACT DISTRIBUTED SIMULATION 34

W = X; see [13], Appendix A. Hence, G(X; Y ) H(X) = 1. However, H(Y ) = ≤ H(0.1)+0.1 < 1. Thus, the optimal W ∗ that achieves G(X; Y ) requires ∗ = 3, |W | making the bound in Proposition 6 tight. The following is another extremal property of G(X; Y ).

Proposition 7. Suppose W attains G(X; Y ). Consider a non-empty subset ′ . W ⊆ W Let (X′,Y ′) be defined by the joint pmf

pW (w) ′ ′ pX ,Y (x, y)= ′ pX|W (x w)pY |W (y w). ′ w′∈W′ pW (w ) | | wX∈W P Then H(X′; Y ′)= H(W W ′). | ∈ W The proof of this proposition is in Appendix A.6. We now use the above results to analytically compute G(X; Y ) for binary alpha- bets, i.e., when = = 2. |X| |Y| Proposition 8. Let X Bern(p) and ∼

α β pY |X = α¯ β¯   for some α, β [0, 1], α¯ =1 α, β¯ =1 β. Let W achieve G(X; Y ). Then either ∈ − −

α 1 1 β/¯ α¯ pY |W = , pW |X = , and α¯ 0 0 1 β/¯ α¯ −     W Bern p¯ 1 β/¯ α¯ , ∼ −   or

0 β 1 α/β 0 pY |W = , pW |X = − , and 1 β¯  α/β 1

W Bern p(1 α/β) .  ∼ −  CHAPTER 2. EXACT DISTRIBUTED SIMULATION 35

The proof of this proposition uses Lemma 2 as well as the cardinality bound 3 derived from Proposition 6. It considers all possible cases for W and fi- |W| ≤ nally concludes that = 2 suffices. The detailed arguments can be found in |W| Appendix A.7. Remark (Relationship to machine learning): Computing G(X; Y ) is closely related to positive matrix factorization, which has applications in recommendation systems, e.g., [18]. In that problem, one wishes to factorize a matrix M with positive entries in the form M = AB, where A and B are both matrices with positive entries. Indeed, finding a Markov chain X W Y for a fixed p is akin to factorizing p = → → X,Y Y |X pY |W pW |X and numerical methods such as in [19] can be used. Rather than minimizing the number of factors as is done in positive matrix factorization literature, it |W| may be more meaningful for recommendation systems to minimize the entropy of the factors W . Computing G(X; Y ) for large alphabets appears to be very difficult, however.

2.5 Concluding Remarks

We introduced the notion of exact common information for correlated random vari- ables (X,Y ) and bounded it by the common entropy quantity G(X; Y ). For the exact generation of a 2-DMS, we established a multiletter characterization of the exact common information rate. While this multiletter characterization is in gen- eral greater than or equal to the Wyner common information, we showed that they are equal for the SBES. The main open question is whether the exact common in- formation rate has a single letter characterization in general. Is it always equal to the Wyner common information? Is there an example 2-DMS for which the exact common information rate is strictly larger than the Wyner common information? It would also be interesting to further explore the application to machine learning. Chapter 3

Which Boolean Functions are Most Informative?

In Section 1.2.3, we introduced Conjecture 1 regarding the maximum mutual infor- mation that a boolean function of a source can have with a dependent source. We restate the Conjecture here for readability: If (Xn,Y n) is drawn from a DSBS(α), any boolean function b : 0, 1 n 0, 1 satisfies { } →{ }

I(b(Xn); Y n) 1 H(α). (3.1) ≤ −

At first sight, Conjecture 1 might appear to be no more than a simple exercise. However, we hope to convince the reader that the conjecture is much deeper than it appears. Despite its apparent simplicity, standard information-theoretical manipula- tions appear incapable of establishing (3.1). Conjecture 1 represents the simplest, nontrivial embodiment of Boolean functions in an information-theoretic context. In words, Conjecture 1 asks: “What is the most significant bit that Xn can provide about Y n?” Despite their fundamental roles in computer science and digital computation, Boolean functions have received relatively little attention from the information theory community. The recent work [20] is perhaps most relevant to our Conjecture 1 and provides compelling motivation for its study. In [20], Klotz et al. prove that for n and

36 CHAPTER 3. BOOLEAN FUNCTIONS 37

Pr b(Xn)=0 1/2 fixed, I(b(Xn); X ) is maximized by functions b which satisfy { } ≥ 1 n b(X ) = 0 whenever X1 = 0 (i.e., when b is canalizing in X1). Their motivation for considering this problem comes from computational biology, where Boolean networks are used to model dependencies in various regulatory networks. We encourage the reader to refer to [20, 21] and the references therein for further information. By employing Fourier-analytic techniques, we can obtain the following result1:

Theorem 7. If (Xn,Y n) is drawn from a DSBS(α), any boolean function b : 0, 1 n { } → 0, 1 chosen such that Pr b(Xn)=0 = Pr b(Xn)=1 =1/2 satisfies { } { } { } n I(b(Xn); Y ) 1 H(α). (3.2) i ≤ − i=1 X While Theorem 7 is weaker2 than Conjecture 1, it is striking nonetheless. Indeed, it states that the sum of the n mutual information terms in (3.2) cannot exceed 1 H(α), regardless of how large n is. Unfortunately, this Fourier-analytic approach − appears incapable of establishing Conjecture 1. A known result that is similar in flavor to Conjecture 1 involves the Hirschfeld- Gebelein-R´enyi (HGR) maximal correlation [22]. For any pair of RVs (X,Y ) ∼ pX,Y (x, y), the HGR correlation, denoted ρ(X; Y ), is defined as

ρ(X; Y ) = max E f(X)g(Y ) . E[f(X)]=E[g(Y )]=0, E 2 E 2 [f(X) ]= [g(Y ) ]=1   Witsenhausen [9] showed that if (Xn,Y n) is generated by a DSBS(α), any boolean function b : 0, 1 n 0, 1 satisfies ρ(b(Xn); Y n) ρ(X ; Y ) = (1 2α)2 and is { } → { } ≤ 1 1 − n trivially achieved by setting b(x )= x1. ∗ I(U;Y ) Another similar result involves the quantity s (X; Y ) = supp(u|x) I(U;X) discussed in Section 1.3. Again, we can show s∗(b(Xn); Y n) s∗(X ; Y ) = (1 2α)2 and is ≤ 1 1 − n trivially achieved by setting b(x )= x1.

1A proof of Theorem 7 is postponed until Section 3.2. 2To see that (3.2) is weaker than Conjecture 1, note that (3.1) is equivalent to n i 1 n I Y − ,b(X ); Yi 1 H(α) by independence of the Yi’s. i=1 ≤ − P  CHAPTER 3. BOOLEAN FUNCTIONS 38

Conjecture 1 is also related to the Information Bottleneck Method [23], which attempts to solve the optimization problem

min I(Xn; U) λI(Y n; U). (3.3) p(u|xn) −

For a given λ> 0, the optimizing U intuitively provides the best tradeoff between the accuracy of describing Y n and the descriptive complexity of U. In our setting, b(Xn) plays the role of U, and we constrain the descriptive complexity to be at most one bit. While U is allowed to be a stochastic function of Xn in (3.3), it is relatively easy to show that randomized Boolean functions do not yield a higher mutual information in our setting. Thus, expressing Conjecture 1 in terms of deterministic Boolean functions comes without loss of generality. A more concrete example comes in the context of gambling. To this end, suppose Y n is a simple model for a market of n stock options, where each option doubles in value or goes bankrupt with probability 1/2, independent of all other options. If an oracle has access to side information Xn, and we are allowed to ask one yes/no question of the oracle, which question should we ask to maximize the rate at which our wealth grows? The validity of Conjecture 1 would imply that we should only concern ourselves with the performance of a single stock option, say Y1. This is readily seen as a consequence of known results on gambling with side information [24, Theorem n 6.2.1], since putting b(X )= X1 yields

I(b(Xn); Y n)= I(X ; Y n)= I(X ; Y )=1 H(α), (3.4) 1 1 1 − hence the conjectured upper bound (3.1) is attainable and represents the maximum possible increase in doubling rate. Finally, we point out that (3.1) is related in spirit to the notion of average sensi- tivity of Boolean functions. This topic has received a great deal of attention in the computer science literature (see the survey [25], and references therein). To see the CHAPTER 3. BOOLEAN FUNCTIONS 39

connection to sensitivity, note that (3.1) can be rewritten as

H(b(Xn) Y n) H(b(Xn)) 1+ H(α). (3.5) | ≥ −

For fixed Pr b(Xn)=0 , the right hand side of (3.5) is constant. Hence, our con- { } jecture essentially lower bounds the output uncertainty of Boolean functions with respect to noisy inputs. A longstanding conjecture which is similar in spirit is the entropy/influence conjecture. Though not necessarily information-theoretic in nature, the entropy/influence conjecture postulates that the average sensitivity of a Boolean function can be lower bounded (up to a constant factor) by the entropy of its squared Fourier coefficients. We refer the interested reader to [26] for details and a precise statement.

3.1 Main Results and Implications

3.1.1 Notation and Definitions

Throughout this chapter, we assume (Xn,Y n) is generated by a DSBS(α). Equiva- n lently, X = (X1,X2,...,Xn) is a sequence of n i.i.d. Bernoulli 1/2 random vari- n n ables, and Y = (Y1,Y2 ...Yn) is the result of passing X through  a memoryless Binary Symmetric Channel (BSC) with crossover probability α. For any boolean function b : 0, 1 n 0, 1 , the notation b−1(0) will be used to { } →{ } denote the set xn 0, 1 n : b(xn)=0 . { ∈{ } } For a scalar p [0, 1], H(p) will be used to denote the binary entropy function ∈ p log(p) (1 p) log(1 p). On the other hand, if W is a random variable, we write − − − − H(W ) for the Shannon entropy of W . H(X Y ) will denote the conditional entropy | of X given Y . H(X y) will denote the entropy of X conditioned on the event Y = y. | The different usages of H( ) will be clear from context. All logarithms are assumed · to be base-2 unless stated otherwise.

Definition 1. The lexicographical ordering on 0, 1 k is defined as follows: xk ≺L { } ≺L k x˜ iff xj < x˜j for some j and xi =x ˜i for all i < j. CHAPTER 3. BOOLEAN FUNCTIONS 40

For example, if k = 3, we have 000 001 010 011 100 101 ≺L ≺L ≺L ≺L ≺L ≺L 110 111. ≺L

Definition 2. We define Lk(M) to be the initial segment of size M in the lexico- graphical ordering on 0, 1 k. For example, L (4) = 000, 001, 010, 011 . { } 3 { } We say that a boolean function b : 0, 1 n 0, 1 is lex if b−1(0) = L ( b−1(0) ). { } →{ } n | | In other words, b is lex when it maps an initial segment of the lexicographical ordering to 0, and the complement segment to 1. Mathematically, b is lex if

xn xn = b(xn) b(xn). 1 ≺L 2 ⇒ 1 ≤ 2

3.1.2 Refinement of Conjecture 1

Instead of dealing with Conjecture 1 directly, consider the following two conjectures:

Conjecture 2. For a given n and fixed bias Pr b(Xn)=0 satisfying H(Pr b(Xn)= { } { 0 ) 1 H(α), the conditional entropy H(b(Xn) Y n) is minimized when b is lex. } ≥ − | Conjecture 3. If b : 0, 1 n 0, 1 is lex, then { } →{ }

H(b(Xn) Y n) H(b(Xn))H(α). (3.6) | ≥

Conjecture 1 would follow as a corollary if Conjectures 2 and 3 were valid. Though it might seem counterproductive to try to prove a stronger result by splitting Con- jecture 1 into two different conjectures, this appears to be the only way to gain trac- tion on Conjecture 1. Intuitively, Conjecture 2 concerns the structure of maximally- informative Boolean functions, while Conjecture 3 handles the inequality component of Conjecture 1. This intuition should become more clear to the reader over the course of this chapter. We briefly comment that our reference to Conjecture 3 as a “conjecture” is perhaps too modest. Indeed, for any given α, we have a simple recursive algorithm capable of proving (3.6) for all n. With computer assistance, our algorithm has verified Conjecture 3 for α ranging from 0 to 1/2 in increments of 0.001 (refer to Theorem 10 CHAPTER 3. BOOLEAN FUNCTIONS 41

and Algorithm 3.1.1 in Section 3.1.4 for details). Nevertheless, given the computer- assisted nature of our method for verifying (3.6), we find it appropriate to call it a conjecture. In the following two subsections, we elaborate on Conjectures 2 and 3 and our corresponding results.

3.1.3 Conjecture 2 and edge-isoperimetry

Conjecture 2 is reminiscent of a classical theorem in discrete mathematics originally due to Harper [27] that gives an exact edge-isoperimetric inequality for the hypercube.

To state the theorem, we need a few basic notations. Let Qn be the n-dimensional hypercube, which is an undirected graph with vertex set V (Q ) = 0, 1 n and edge n { } set E(Qn) consisting of pairs of vertices that have a hamming distance of 1. For S V (Q ), the edge boundary ∂(S) is the set of edges one has to delete to disconnect ⊆ n S from any vertex not in S.

Theorem 8. For S V (Q ) with S = k, we have ∂(S) ∂(L (k)) . ⊆ n | | | |≥| n | Thus an initial segment of the lexicographical ordering minimizes the edge bound- ary on Qn. We comment on the relationship between edge-isoperimetry and our conjecture in detail in Appendix B.2. The simplest proofs of Theorem 8 rely on so-called compression operators, pop- ularized by Bollob´as and Leader [28]. These compression operators turn out to be useful in making progress towards Conjecture 2, so we introduce them now. While [28] defines compression on subsets of vertices of the hypercube, we define the equivalent notion of compression on boolean functions. Let be subset of indices [1 : n] = 1, 2,...n . To be concrete, let = I { } I i , i ,...,i , where i < i < < i . Let c denote the set of indices not { 1 2 |I|} 1 2 ··· |I|} I c I in , i.e., = [1 : n] . We will use x to denote (xi1 , xi2 ,...xi|I| ). Note that any I I \I c c boolean function b(xn) can be written as b(xI , xI ). We use the notation b xI ZI I I Ic n to denote the set x : b(x , x )=0 . Note that any boolean function b(x ) is { }c c c completely specified by specifying (xI ) for each xI 0, 1 |I |. ZI ∈{ } CHAPTER 3. BOOLEAN FUNCTIONS 42

❍ Ic ❍ Ic ❍❍ x ❍❍ x I ❍ 00 01 10 11 I ❍ 00 01 10 11 x ❍❍ x ❍❍

00 0 0 CI 00 0 0 0 0 01 0 −→ 01 0 0 10 0 0 0 10 0 0 11 0 0 11

Figure 3.1: Illustration of the compression operator CI . The boolean function n I Ic n b(x ) = b(x , x ) is shown in the left and CIb(x ) is shown in the right. (For easier visualization, we use an empty cell to represent b(xn) = 1.)

n n The -compression of a boolean function b(x ), denoted CIb(x ), is a boolean I c function obtained by replacing xI by an initial segment of the lexicographic ZI c c c ordering, L xI , for each xI  0, 1 |I |. Thus C b(xn) is defined by |I| ZI ∈{ } I   

c c C b(xI )= L xI . ZI I |I| ZI !  

The dimension of an -compression operator is . I |I| We illustrate the compression operator CI in Figure 3.1. We also illustrate succes- sive compressions on the boolean function b defined by b−1(0) = 000, 010, 011, 111 { } in Figure 3.2. Note that -compression preserves the size of the set b−1(0), i.e., I C b−1(0) = b−1(0) . | I | | | A boolean function b(xn) is said to be -compressed if C b(xn)= b(xn) for all xn I I ∈ 0, 1 n. Note that C b(xn) is always -compressed, i.e. C C b(xn)= C b(xn). Note { } I I I I I also that if a boolean function b(xn) is -compressed, then it is also -compressed I J for each . J ⊂ I The following theorem states that when 2, applying an -compression to |I| ≤ I b(xn) does not decrease the mutual information I(b(Xn); Y n). Thus, compression provides a method of modifying functions in a manner that does not adversely affect the mutual information I(b(Xn); Y n).

Theorem 9. Let b : 0, 1 n 0, 1 be any boolean function and let [1 : n] { } → { } I ⊆ CHAPTER 3. BOOLEAN FUNCTIONS 43

101 111

100 110

001 011

000 010 n n n b(x ) C{2}b(x ) C{3}C{2}b(x )

Figure 3.2: Successive one-dimensional compressions applied to the boolean function b(xn) defined by b−1(0) = 000, 010, 011, 111 , elements of which are represented by { } balls on the three-dimensional hypercube. (Vertices are only labeled on the leftmost cube for cleaner presentation.)

satisfy 2. Then |I| ≤ I(C b(Xn); Y n) I(b(Xn); Y n). I ≥ One can repeatedly apply Theorem 9 for different subsets of cardinality 2, I ultimately terminating3 at a function ˆb(xn) which is -compressed for all with I I 2, leading to the following corollary. |I| ≤ Corollary 1. Let be the set of boolean functions b : 0, 1 n 0, 1 that are Sn { } → { } -compressed for all with 2. In maximizing I(b(Xn); Y n), it is sufficient to I I |I| ≤ consider functions b . ∈ Sn The implications of Theorem 9 and its corollary are twofold. First, it allows the verification of Conjecture 2 for modest values of n. Indeed, we have numerically n n validated Conjectures 1 and 2 for n 7 by evaluating I(b(X ); Y ) for b n. To ≤ ∈ S n appreciate the reduction afforded by Corollary 1, define to be the set of all 22 Bn Boolean functions on n inputs. A comparison between and is given in the |Sn| |Bn| table below.

3 −1 To show termination, observe from the illustration in Figure 3.1 that we can obtain CI b (0) from b−1(0) by replacing each element of b−1(0) by an element that is never higher in the lexico- graphic ordering on 0, 1 n. { } CHAPTER 3. BOOLEAN FUNCTIONS 44

n |Sn| |Bn| 2 5 16 3 10 256 4 25 65,536 5 119 4.3 109 × 6 1173 1.8 1019 × 7 44,315 3.4 1038 × A brute-force attempt to validate Conjecture 1 for n 7 would be computationally ≤ intractable, yet Theorem 9 allows us to do so by considering only those functions in . Sn In fact, one could potentially leverage Theorem 9 to exhaustively verify Conjecture 1 for modestly larger values of n (e.g., n =8, 9), though we have not done so. Second, Theorem 9 reinforces the intuition behind Conjecture 2. As we noted −1 above, if CI changes an element of b (0), it moves it lower in the lexicographical ordering on 0, 1 n. Thus, roughly speaking, applying -compression to b( ) yields a { } I · function ˆb which is (i) closer to an initial segment of the lexicographical order, and (ii) for 2 satisfies H(ˆb(Xn) Y n) H(b(Xn) Y n). |I| ≤ | ≤ | Ideally, Theorem 9 should generalize to include -compressions for > 2. In- I |I| deed, if we could take = n, Conjecture 2 would be proved. However, we have |I| found counterexamples where compression increases H(b(Xn) Y n) for > 2 (but | |I| still reduces H(b(Xn) Y n) for = n). One such counterexample is given as follows. | |I| Let n = 6 and suppose b1 and b2 are functions defined by

000000 000000  000001  000001    000010  000010   −1  −1  b1 (0) =  000100 b2 (0) =  000011 (3.7)    001000  001000

 010000  010000    100000  100000         −1 −1 where the respective rows correspond to vectors composing b1 (0) and b2 (0). Note CHAPTER 3. BOOLEAN FUNCTIONS 45

n n n n that b2(x )= C{4,5,6}b1(x ). However, for α =0.1, computation reveals that I(b1(X ); Y )= 0.2186, while I(b (Xn); Y n)=0.2173. Therefore, we see that -compressions do not 2 I always improve mutual information for 3. Nevertheless, letting b be the lex |I| ≥ L function for n = 6 and b−1(0) = 7, we can compute I(b (Xn); Y n)=0.2283 for | L | L α =0.1, supporting Conjecture 2. In fact, as discussed above, Theorem 9 permits us to verify that 0.2283 bits is the maximum mutual information attainable among all functions b with b−1(0) = 7 for n = 6, α =0.1. | | Remark 1. From the above counterexample, it appears that the threshold function4 behaves like a local maximum in terms of mutual information provided about Y n.

Remark 2. A function b : 0, 1 n 0, 1 is said to be monotone if b(xn) b(˜xn) { } → { } ≤ whenever x x˜ for 1 i n. Since monotone Boolean functions are precisely i ≤ i ≤ ≤ those functions which are -compressed for all with = 1, Corollary 1 implies I I |I| that I(b(Xn); Y n) is maximized by a monotone function.

3.1.4 A computer-assisted proof of Conjecture 3 for any given α

Now, we turn toward establishing Conjecture 3. Unless otherwise specified, all boolean functions in this subsection are assumed to be lex. For a dyadic rational p = k/2n, define

S (p)= H(b(Xn) Y n), α |

where b is the unique lex function on n inputs with Pr b(Xn)=0 = p. Note that if k { } n is even, b(X ) is independent of the input bit Xn. Therefore, Sα(p) is well-defined for all dyadic rationals p [0, 1]. It is shown in Appendix B.4 that S ( ) is continuous on ∈ α · the dyadic rationals (in fact, it is H¨older continuous with exponent 1/2). Therefore, S (p) is also well-defined when p [0, 1] is not a dyadic rational, by considering its α ∈ unique continuous extension to [0, 1].

4A threshold function is a Hamming ball (of any radius) centered at the all-zero vector. The radius determines the probability that the threshold function returns 0. CHAPTER 3. BOOLEAN FUNCTIONS 46

Conjecture 3 is equivalent to the statement

S (p) H(p)H(α) (3.8) α ≥

Note that by symmetry, it is enough to show (3.8) for 0 p 1/2. ≤ ≤ We will now state a lemma for Sα(p) that will help in constructing a computer- aided proof for Conjecture 3.

Lemma 3. For k odd and k < 2n,

k 1 k 1 k +1 S S − + S α 2n ≥ 2 α 2n α 2n   "    #

Proof.

k (a) S = H(b(Xn) Y n) α 2n |   1 H(b(Xn) Y n,X )= H(b(Xn−1, 0) Y n−1)+ H(b(Xn−1, 1) Y n−1) ≥ | n 2 | | (b) 1 k 1 k +1  = S − + S , 2 α 2n α 2n "    #

n k where in (a), b is the unique lex with Pr b(X )=0 = n and (b) follows because { } 2 b(xn−1, 0) and b(xn−1, 1) are both lex on n 1 inputs xn−1. − A consequence of Lemma 3 is the following theorem.

Theorem 10. Let α (0, 0.5) be fixed. Let C (p) be the piecewise linear approxima- ∈ n,α k k n tion of Sα(p) obtained by joining the points 2n ,Sα 2n ,k =0, 1, 2,... 2 . If there exists n > 0 such that C (p) H(p)H(α)for 0  p 1, then S (p) H(p)H(α) n,α ≥ ≤ ≤ α ≥ for the fixed α.

Proof. We only need to show S (p) C (p). This is easy because C (p) α ≥ n,α n,α ≥ C (p) for 0 p 1 by Lemma 3 and lim C (p)= S (p). n−1,α ≤ ≤ n→∞ n,α α CHAPTER 3. BOOLEAN FUNCTIONS 47

Based on Theorem 10 and symmetry of Sα(p), we construct Algorithm 3.1.1 such

that if a call to the algorithm with arguments (p−,p+) = (0, 0.5) and a fixed α eventually terminates, then Conjecture 3 is true for the chosen α. The key idea Algorithm 3.1.1 is that it recursively constructs a piecewise linear function on the interval p [0, 1] which simultaneously upper bounds H(p)H(α) and ∈ lower bounds Sα(p). Figure 3.3 illustrates the function constructed by Algorithm 3.1.1 for α =0.05. Using a Matlab implementation of Algorithm 3.1.1, we have validated (3.8) for α ranging from 0 to 1/2 in increments of 0.001. Hence, it is reasonable to believe that Conjecture 3 is true in general.

Algorithm 3.1.1: TestInequality(p−,p+)

main

if CheckChord(p−,p+) < 0 p 1 (p + p ) ← 2 − + then TestInequality(p−,p)  TestInequality(p,p+)   procedure CheckChord(a, b) C(x) is the chord connecting comment: the points (a, Sα(a)) and (b, Sα(b)). C(x) := Sα(b)−Sα(a) (x a)+ S (a) b−a − α ν min C(x) H(x)H(α) ← x∈[a,b] − return (ν)

Remark 3. In the subroutine CheckChord(a, b) of Algorithm 3.1.1, the minimiza- tion can be computed in closed form.

Remark 4. Inherently, the running time of Algorithm 3.1.1 will depend on α, but does not depend on n. The termination of TestInequality(0, 1) guarantees validity of (3.8) for all n =1, 2, 3,... . CHAPTER 3. BOOLEAN FUNCTIONS 48

0.4

0.3

0.2

0.1

0 0 1/4 1/2 3/4 1 p=Pr{b(Xn)=0}

Figure 3.3: A comparison of Sα(p) and H(p)H(α) for α = 0.05. The broken line shows the chords Algorithm 3.1.1 constructs before terminating.

Despite the apparent gap between S (p) and H(p)H(α) for p (0, 1) (e.g., α ∈ Fig. 3.3), the oscillatory behavior of Sα(p) seems to render traditional analysis tech- niques ineffective in establishing (3.8). This was our motivation for pursuing a computer-assisted method for verifying (3.6). To get a sense for the strange behavior of Sα(p), we show in Appendix B.2 that limα→0 Sα(p)/H(α) is equal to the Takagi function, a classical construction of an everywhere-continuous, nowhere-differentiable function closely related to the edge-isoperimetric inequality given in Theorem 8 (cf.

[29, 30]). Even for α = 0.05, the pathological nature of Sα(p) makes itself apparent in Figure 3.3. CHAPTER 3. BOOLEAN FUNCTIONS 49

3.2 Proofs

In this section, we prove our main results.

3.2.1 Proof of Theorem 7

Here, we use Fourier-analytic techniques to prove Theorem 7. We will require the following lemma.

Lemma 4. Let r2 1 and ρ [0, 1]. An optimal solution for the (non-convex) ≤ ∈ program

n 1+ ρx minimize : H i (3.9) 2 i=1   Xn subject to : x2 r2 (3.10) i ≤ i=1 X is given by x1 = r, and xi =0 for i =2,...,n.

Proof. Assume for simplicity that all entropies and logarithms are computed with respect to the natural basis. Note that it suffices to consider the case where n = 2, since the claim then follows by induction. Also observe that, by Schur-concavity of entropy,

1+ ρx 1+ ρx′ H H (3.11) 2 ≤ 2     for x x′ . Thus, we can assume the constraint (3.10) is met with equality. To | |≥| | this end, consider the parameterization

x1(φ)= r cos φ x2(φ)= r sin φ, (3.12)

where we take 0 φ π and r 0 since we may assume without loss of generality ≤ ≤ 4 ≥ that x x 0 by symmetry. 1 ≥ 2 ≥ CHAPTER 3. BOOLEAN FUNCTIONS 50

With foresight, consider the function

1+ z cos φ 1+ z sin φ ψ (z) , (sin φ) log (cos φ) log , (3.13) φ 1 z cos φ − 1 z sin φ  −   −  defined for z [0, 1] and φ [0,π/4] with derivative given by ∈ ∈ d z2 sin(4φ) ψ (z)= 0. (3.14) dz φ 2(1 z2 cos2(φ))(1 z2 sin2(φ)) ≥ − − As a consequence, ψ (z) 0 for φ [0,π/4] since ψ (0) = 0 and z [0, 1]. φ ≥ ∈ φ ∈ Noting that

∂ 2 1+ ρx (φ) 1+ ρr cos φ 1+ ρr sin φ H i = ρr (sin φ) log (cos φ) log , ∂φ 2 1 ρr cos φ − 1 ρr sin φ i=1 " # X    −   −  (3.15)

it is clear that

∂ 2 1+ ρx (φ) sign H i = sign ψ (ρr) 0 (3.16) ∂φ 2  φ ≥ i=1   X     since we have assumed that 0 ρ, r 1 and φ [0,π/4]. Thus, 2 H 1+ρxi(φ) ≤ ≤ ∈ i=1 2 assumes its minimum at φ = 0, which completes the proof. P   We are now in a position to prove Theorem 7.

Proof of Theorem 7. We prove the claim using the Fourier transform. For conve- nience, we will assume Xn and Y n take values in 1, 1 n. Consequently, we con- {− } sider Boolean functions b : 1, 1 n 1, 1 . Any Boolean function b : 1, 1 n {− } → {− } {− } → 1, 1 can be written in terms of its Fourier coefficients as (cf. [25]) {− }

n ˆ n b(x )= b(S)ΠS(x ), S⊆X[1:n] CHAPTER 3. BOOLEAN FUNCTIONS 51

n where ΠS(x ) = i∈S xi are the orthonormal basis functions for the Fourier trans- form, and ˆb(S) are the Fourier coefficients defined by { }SQ⊆[1:n]

ˆ n n b(S)= E b(X )ΠS(X ). (3.17)

If S = , then we define Π (xn) = 1. It is easy to verify the following identities, ∅ S n the first two of which verify that ΠS(x ) = i∈S xi are indeed orthonormal basis functions: Q

1. For all xn, Π (xn)2 = 1 since Π (xn) 1, 1 . Therefore, S S ∈ {− }

n 2 E ΠS(X ) =1. (3.18)

2. If S = T and neither are equal to , then Π (Xn) and Π (Xn) are independent 6 ∅ S T and uniformly distributed on 1, 1 giving the identity {− }

n n E ΠS(X )ΠT (X )=0. (3.19)

On the other hand, if = S = T , then we simply have E Π (Xn)Π (Xn) = ∅ 6 S T n E ΠT (X )=0.

3. As a consequence, Parseval’s theorem holds:

1= E b(Xn)2 = ˆb(S)2, (3.20) SX⊆[1:n] since

E b(Xn)2 = E ˆb(S)Π (Xn) ˆb(T )Π (Xn) (3.21)  S   T  S⊆[1:n] T ⊆[1:n]    X X     n n  = ˆb(S)ˆb(T ) E ΠS(X )ΠT (X ) (3.22) S⊆[1:n] T ⊆[1:n] X X   = ˆb(S)2. (3.23) S⊆X[1:n] CHAPTER 3. BOOLEAN FUNCTIONS 52

By definition of the Fourier coefficient ˆb( ), we have ˆb( )= E b(Xn) = 0 since b(Xn) ∅ ∅ is equiprobable on 1, 1 . Also, {− }   n n 1+ E b(X ) Yi = yi Pr b(X )=1 Yi = yi = | (3.24) { | }  2  ˆ n 1+ E S⊆[1:n] b(S)ΠS(X ) Yi = yi =   (3.25) P 2

ˆ n 1+ S⊆[1:n] b(S)E ΠS(X ) Yi = yi =   (3.26) P 2

Observe that E Π (Xn) Y = y = 0 unless S = or S = i since X are S | i i ∅ { } { j}j∈S\{i} independent from Yi. Hence, we have

1+ ˆb( )+ ρy ˆb( i ) 1+ ρy ˆb( i ) Pr b(Xn)=1 Y = y = ∅ i { } = i { } , (3.27) { | i i} 2 2 where ρ , E[X Y = 1] = 1 2α. By definition of mutual information and our i| i − assumption that Pr b(Xn)=1 = 1 , { } 2 n n n I(b(Xn); Y )= nH(b(Xn)) H(b(Xn) Y )= n H(b(Xn) Y ). (3.28) i − | i − | i i=1 i=1 i=1 X X X Hence, we prove Theorem 7 by attempting to minimize n H(b(Xn) Y ) forPr b(Xn)= i=1 | i { 0 = Pr b(Xn)=1 = 1 . Using (3.27), observe that } { } 2 P n n 1 H(b(Xn) Y )= H Pr b(Xn)=1 Y =1 + H Pr b(Xn)=1 Y = 1 | i 2 { | i } { | i − } i=1 i=1 X X h   i (3.29)

n 1 1+ ρˆb( i ) 1 ρˆb( i ) = H { } + H − { } (3.30) 2  2 2  i=1 ! ! X n  1+ ρˆb( i )  = H { } , (3.31) 2 i=1 ! X CHAPTER 3. BOOLEAN FUNCTIONS 53

ˆ ˆ where the last equality holds since 1 1−ρb({i}) = 1+ρb({i}) . Recalling that − 2 2 Parseval’s identity implies n ˆb( i )2 1, an application  of Lemma 4 implies that i=1 { } ≤ (3.31) is minimized by setting ˆb( 1 ) = 1 and ˆb( i )=0 for i =2,...,n. Thus, P { } { } n H(b(Xn) Y ) H(α)+(n 1). (3.32) | i ≥ − i=1 X Substitution into (3.28) proves the theorem.

3.2.2 Proof of Theorem 9

We begin the proof of Theorem 9 by first proving the following result for 1-dimensional compressions.

Lemma 5. Let b : 0, 1 n 0, 1 and i 1, 2,...,n . If ˆb(xn)= C b(xn), then { } →{ } ∈{ } {i} I(ˆb(Xn); Y n) I(b(Xn); Y n). ≥ Proof. It suffices to consider the case where i = n, as any other case can be handled by first permuting coordinates i and n. Since Pr b(Xn)=0 = Pr ˆb(Xn)=0 , it suffices to show H(b(Xn) yn−1,Y ) { } { } | n ≥ H(ˆb(Xn) yn−1,Y ) for each yn−1 0, 1 n−1. We do this by using concavity of entropy | n ∈{ } as shown below. n n−1 It is convenient to write b(x )= b(x , xn). Define the four sets

E = xn−1 : b(xn−1, 0) = i and b(xn−1, 1) = j , where (i, j) 0, 1 2. ij ∈{ }  n Note that the Eij’s completely determine the boolean function b(x ). Further note ˆ n n n that b(x ) = C{n}b(x ) is obtained from b(x ) by emptying the contents of E10 into

E01. CHAPTER 3. BOOLEAN FUNCTIONS 54

Observe that

Pr b(Xn)=0 yn−1,y = Pr Xn−1 E yn−1 + Pr Xn−1 E yn−1 Pr X =0 y { | n} { ∈ 00 | } { ∈ 01 | } { n | n} + Pr Xn−1 E yn−1 Pr X =1 y { ∈ 10 | } { n | n} = Pr Xn−1 E yn−1 + Pr Xn−1 E yn−1 α¯1−yn αyn { ∈ 00 | } { ∈ 01 | } + Pr Xn−1 E yn−1 α¯yn α1−yn and { ∈ 10 | } Pr ˆb(Xn)=0 yn−1,y = Pr Xn−1 E yn−1 { | n} { ∈ 00 | } + Pr Xn−1 E E yn−1 Pr X =0 y { ∈ 01 ∪ 10 | } { n | n} = Pr Xn−1 E yn−1 + Pr Xn−1 E E yn−1 α¯1−yn αyn . { ∈ 00 | } { ∈ 01 ∪ 10 | }

It is then relatively straightforward to see that

Pr b(Xn)=0 yn−1, 0 = θ Pr ˆb(Xn)=0 yn−1, 0 + (1 θ) Pr ˆb(Xn)=0 yn−1, 1 and { | } { | } − { | } Pr b(Xn)=0 yn−1, 1 = (1 θ) Pr ˆb(Xn)=0 yn−1, 0 + θ Pr ˆb(Xn)=0 yn−1, 1 , { | } − { | } { | } where Pr Xn−1 E yn−1 θ = { ∈ 01| } . Pr Xn−1 E yn−1 + Pr Xn−1 E yn−1 { ∈ 01| } { ∈ 10| } Note that θ does not depend on Yn. Applying concavity of entropy, we get

H(b(Xn) yn−1, 0) θH(ˆb(Xn) yn−1, 0)+(1 θ)H(ˆb(Xn) yn−1, 1), | ≥ | − | H(b(Xn) yn−1, 1) (1 θ)H(ˆb(Xn) yn−1, 0)+ θH(ˆb(Xn) yn−1, 1). | ≥ − | |

Adding both the inequalities and multiplying by 1/2 yields

H(b(Xn) yn−1,Y ) H(ˆb(Xn) yn−1,Y ). | n ≥ | n

The rest of the proof is straight-forward.

We are now in a position to finish the proof of Theorem 9, which is similar to the proof of Lemma 5. CHAPTER 3. BOOLEAN FUNCTIONS 55

Proof of Theorem 9. We assume that = n 1, n , as all other cases follow by a I { − } permutation of coordinates. It suffices to show

H(b(Xn) yn−2,Y ,Y ) H(ˆb(Xn) yn−2,Y ,Y ) | n−1 n ≥ | n−1 n for each yn−2 0, 1 n−2. This follows from concavity of entropy as shown below. ∈{ } c c Recall that we defined b xI = xI : b(xI , xI )=0 in Section 3.1.3. By a ZI { } n repeated application of Lemma 5 using C{n−1} and C{n}, we can assume that b(x ) is n 1 - and n -compressed. Thus, the b xn−2 can only be one of the following: { − } { } ZI , 00 , 00, 01 , 00, 10 , 00, 01, 10 , or 00, 01, 10, 11 . Note that all of these sets ∅ { } { } { } { } {  } are initial segments of the lexicographical order on 0, 1 2 except 00, 10 . Hence, we { } { } aim to transform b(xn) to ˆb(xn) so that ˆb xn−2 = 00, 10 . To this end, define ZI 6 { }  E = xn−2 : b xn−2 = , where 0, 1 2. S ZI S S⊆{ } n  o Note that the E ’s completely determine b xn−2 and therefore b(xn). Further S ZI note that E = φ unless is one of the 6 possible values of b xn−2 for a boolean S S  ZI function b(xn) that is n 1 - and n -compressed. Finally, note that ˆb(xn) is { − } { }  n obtained from b(x ) by emptying the contents of E{00,10} into E{00,01}. It is relatively straightforward to show that

Pr b(Xn)=0 yn−2, 00 = Pr ˆb(Xn)=0 yn−2, 00 , { | } { | } Pr b(Xn)=0 yn−2, 11 = Pr ˆb(Xn)=0 yn−2, 11 , { | } { | } Pr b(Xn)=0 yn−2, 01 = θ Pr ˆb(Xn)=0 yn−2, 01 + (1 θ) Pr ˆb(Xn)=0 yn−2, 10 and { | } { | } − { | } Pr b(Xn)=0 yn−2, 10 = (1 θ) Pr ˆb(Xn)=0 yn−2, 01 + θ Pr ˆb(Xn)=0 yn−2, 10 , { | } − { | } { | }

where Pr Xn−2 E yn−2 θ = { ∈ {00,01}| } . Pr Xn−2 E yn−1 + Pr Xn−2 E yn−1 { ∈ {00,01}| } { ∈ {00,10}| } The rest of the proof follows from concavity of entropy and is similar to the proof of Lemma 5. CHAPTER 3. BOOLEAN FUNCTIONS 56

3.3 Concluding Remarks

Although Conjecture 1 remains open, we have provided substantial evidence in sup- port of its validity. Indeed, our results allow us to exhaustively verify Conjecture 2 for n 7, and we have a computer-assisted method for establishing Conjecture 3 for ≤ any given value of α. Any complete proof of Conjectures 1 or 2 would be of significant interest, since it would likely require new methods which may be applicable in infor- mation theory and elsewhere (e.g., in proving discrete isoperimetric inequalities). An analytical proof of Conjecture 3 which does not require computer assistance would also be interesting, but appears difficult given the pathological nature of H(b(Xn) Y n) | for lex functions. In closing, we propose two weaker forms of Conjecture 1 which could provide insight, but are still open.

1. Does Theorem 7 continue to hold when b(Xn) is not equiprobable? Unfortu- nately, our Fourier-analytic proof of Theorem 7 appears to fail in this setting. Nonetheless, we feel that establishing this generalization of Theorem 7 should be considerably easier than establishing the stronger statement of Conjecture 1.

2. For Boolean functions b, b′, does it hold that I(b(Xn); b′(Y n)) 1 H(α)? ≤ − While this problem appears difficult in general, it is a simple exercise to show this is true when b(Xn) and b′(Y n) are both equiprobable by using the data- processing property of maximal correlation and the fact that the only couplings between b(Xn) and b′(Y n) are symmetric. Intuitively, this should be the case for b, b′ which maximize I(b(Xn); b′(Y n)), but a proof remains elusive.

In the time since the second problem above was first posed by the authors in [31], Anantharam, Gohari, Kamath and Nair have made some tangible progress. In partic- ular, they have leveraged hypercontractivity to numerically verify the second problem above, further supporting the conjectures in this chapter. We refer the reader to the recent paper [32] for details. Separately, Bogdanov and Nair have proven this partic- ular conjecture holds in the restrictive setting of b = b′ [33]. Appendix A

Exact Distributed Simulation

A.1 Computing G(X; Y ) for SBES

As described in [13], Appendix A, for any W satisfying X W Y , each w → → ∈ W must fall in one of the three categories:

1. If pX|W =wpY |W =w is positive for Y = 0, then X = 0 with probability 1.

2. If pX|W =wpY |W =w is positive for Y = 1, then X = 1 with probability 1.

3. The last category is when p p is zero on Y 0, 1 , i.e., p (e w)= X|W =w Y |W =w ∈{ } Y |W | 1.

Following the same reasoning in [13], Appendix A, we conclude that we need only one w in each category. Thus,

(1 p)/2 0 p/2 pX,Y = −  0 (1 p)/2 p/2 − =  p (w)p ( w)p ( w)t W X|W ·| Y |W ·| wX∈W a 0 a 0 0 0 0 0 c = 1 2 + + 1 .  0 0 0  0 b1 b2 0 0 c2       57 APPENDIX A. EXACT DISTRIBUTED SIMULATION 58

Each term above corresponds to a category of w. Thus,

pW = a1 + a2 b1 + b2 c1 + c2 h i and we must minimize H(W ) such that a = b = (1 p)/2, a + c = b + c = p/2, 1 1 − 2 1 2 2 and b , b ,c ,c 0. This results in 1 2 1 2 ≥

G(X; Y ) = min H 1/2 c1, 1/2 c2,c1 + c2 . 0≤c1,c2≤p/2 − −  Since the objective is a concave function of c1 and c2, the minimizing (c1,c2) should be one of the 4 corner points (0, 0), (p/2,p/2), (0,p/2), (p/2, 0). The last two points are symmetric. Thus G(X; Y ) = min 1,H(p)+1 p . { − }

A.2 Subadditivity of G(Xn; Y n)

Suppose W achieves G(Xm; Y m) and W achieves G(Xn; Y n). Then, Xm W m n → m → Y m and Xm+n W Y m+n form Markov chains, and so does Xm+n (W , W ) m+1 → n → m+1 → m n → Y m+n. Therefore,

G(Xm+n; Y m+n) H(W , W ) ≤ m n = H(Wm)+ H(Wn) = G(Xm; Y m)+ G(Xn; Y n).

Thus the sequence G(Xn; Y n) : n N is sub-additive. Hence, an appeal to Feteke’s { ∈ } subadditivity lemma [34] shows that

1 lim G(Xn; Y n) = inf (1/n)G(Xn; Y n). n→∞ n n∈N APPENDIX A. EXACT DISTRIBUTED SIMULATION 59

A.3 Proof of Proposition 3

To show this consider

1 G(X; Y ) = lim min H(W ) n→∞ W :Xn→W →Y n n 1 ∗ = lim H(Wn ) n→∞ n 1 ∗ n n lim I(Wn ; X ,Y ) ≥ n→∞ n n 1 ∗ lim I(Wn ; Xi,Yi) ≥ n→∞ n i=1 X min I(W ; X,Y ) ≥ W :X−W −Y = J(X; Y ).

A.4 Proof of Proposition 5

We use the perturbation method (see Appendix C in [15]). The Markov Chain X → W Y is equivalent to →

p(y x, w)= p(y w), or | | ′ p(x,y,w) x′ p(x ,y,w) ′ = ′ ′ . ′ p(x, y ,w) ′ ′ p(x ,y ,w) y Px ,y P P Further, w p(x,y,w)= pX,Y (w). Let pǫP(x,y,w)= p(x,y,w)(1 + ǫφ(w)) be a perturbed pmf, where ǫ can be either APPENDIX A. EXACT DISTRIBUTED SIMULATION 60

positive or negative. We first observe that

pǫ(x,y,w) pǫ(y x, w)= ′ | y′ pǫ(x, y ,w) Pp(x,y,w) = ′ y′ p(x, y ,w) ′ ′ p x ,y,w P x ( ) = ′ ′ ′ ′ p(x ,y ,w) Px ,y ′ P x′ pǫ(x ,y,w) = ′ ′ ′ ′ p (x ,y ,w) Px ,y ǫ = pPǫ(y w). |

Thus, p (x,y,w) also satisfies the Markov chain X W Y . ǫ → → Now, we also require that pǫ(x, y)= pX,Y (x, y), i.e.,

p(x,y,w)φ(w) = 0 for all (x, y). w X The above equation represents linear constraints in φ(w). Thus if > |X||Y| |W| , we can find a perturbation φ(w) = 0 such that p (x,y,w) satisfies p (x,y,w)= |X||Y| 6 ǫ w ǫ p (x, y) as well as the Markov chain X W Y . X,Y → → P If we choose ǫ small enough, we can ensure that 1 + ǫφ(w) 0 for all w and thus ≥ pǫ(u,v,w) is a valid pmf.

Note that pǫ(w)= p(w)(1+ ǫφ(w)) is a linear function of ǫ. Since the entropy is a

concave function, H(pǫ(w)) is a concave function of ǫ and is minimized by an extremal

ǫ. Such an extremal ǫ makes pǫ(w) = 0 for some w, thus reducing the cardinality by 1 while also reducing H(W ). Thus we conclude for the minimizing H(W ), |W| . |W| ≤ |X||Y|

A.5 Proof of Lemma 2

We will prove by contradiction. Assume that Lemma 2 is false, i.e. w = w such ∃ 1 6 2 that the supports of p ( w ) and p ( w ) are the same for some W that achieves Y |W ·| 1 Y |W ·| 2 APPENDIX A. EXACT DISTRIBUTED SIMULATION 61

H1(X; Y ). Let = 1, 2,..., , = 1, 2,..., , etc. without loss of generality. X { |X|} Y { |Y|} For a given w, p ( w ) is a vector in R|Y| whose y-th element is the conditional Y |W ·| 1 probability p (y w). The line segment joining p ( w ) and p ( w ) can be Y |W | Y |W ·| 1 Y |W ·| 2 extended from both ends without crossing the boundary of the 1 dimensional |Y| − probability simplex as shown in Fig. A.1. Geometrically, the Markov chain X W Y means that the vectors p ( x) : → → { Y |X ·| x lie in the convex hull of the vectors p ( w) : w . ∈X} { Y |W ·| ∈ W}

pY W (y w2) | | pY W (y w1) | | b 1 pY W ′ (y w2) a | |

pY W (y w3) pY W ′ (y w1) | | | |

Figure A.1: Proof of Lemma 2.

We will construct W ′ such that X W W ′ Y forms a Markov chain and → → → H(W ′) < H(W ). Assume w.l.o.g. that the Euclidean distance between p ( w ) Y |W ·| 1 and p ( w ) in R|Y| is 1 unit. Extend the line segment in either direction by a, b Y |W ·| 2 units respectively so that it’s new end-points are p ′ ( w ),p ′ ( w ) (see Fig. Y |W ·| 1 Y |W ·| 2 A.1). Then X W W ′ Y forms a Markov chain, where W ′ is defined as → → → follows: APPENDIX A. EXACT DISTRIBUTED SIMULATION 62

p ′ (y w) if w / w ,w , Y |W | ∈{ 1 2}  b+1 a pY |W (y w)= p ′ (y w )+ p ′ (y w ) if w = w , |  a+b+1 Y |W 1 a+b+1 Y |W 2 1  | |  b a+1 p ′ (y w )+ p ′ (y w ) if w = w . a+b+1 Y |W | 1 a+b+1 Y |W | 2 2   and  ′ 1(w = w) if w / w1,w2 ,  ∈{ }  b+1 if (w′,w)=(w ,w ),  a+b+1 1 1   a ′  if (w ,w)=(w2,w1), ′  a+b+1 pW ′|W (w w)=  |  b ′  a+b+1 if (w ,w)=(w1,w2), a+1 if (w′,w)=(w ,w ),  a+b+1 2 2   0 otherwise.    Hence we have  b +1 b p ′ ( )= p (w )+ p (w ), W · a + b +1 W 1 a + b +1 W 2  a a +1 p (w )+ p (w ),p (3),p (4),... , a + b +1 W 1 a + b +1 W 2 W W  and

b + p H(W ) H(W ′)= P W w ,w H (p) H , (A.1) − { ∈{ 1 2}} − a + b +1 "  #

where p = P W = w W w ,w does not depend on a, b andp ¯ =1 p. In (A.1), { 1| ∈{ 1 2}} − the term (b + p)/(a + b + 1) increases with b and decreases with a. Further, when a = b =0, (b + p)/(a + b +1) = p. Hence, by perturbing either a or b, one can make the RHS of (A.1) positive. In that case H(W ′)

A.6 Proof of Proposition 7

′ ′ pW (w) ′ ′ Define the random variable W with pmf pW (w)= P pW (w′) . Then X ∈ W w′∈W′ → W ′ Y ′ and H(W ′) = H(W W ′). Hence H(X′; Y ′) H(W ′) = H(W W → | ∈ W ≤ | ∈ ′). W Now we show by contradiction that H(X′; Y ′)= H(W W ′). | ∈ W If H(X′; Y ′)

p (x, y)= p (w)p (x w)p (y w) X,Y W X|W | Y |W | wX∈W = pW (w)pX|W (x w)pY |W (y w)+ pW (w)pX|W (x w)pY |W (y w) ′ | | ′ | | wX∈W w∈W\WX ′ = P W pX′,Y ′ (x, y)+ pW (w)pX|W (x w)pY |W (y w) { ∈ W } ′ | | w∈W\WX ′ = P W p ˜ (w)p ′ ˜ (x w)p ′ ˜ (y w) { ∈ W } W X |W | Y |W | ˜ wX∈W + pW (w)pX|W (x w)pY |W (y w). ′ | | w∈W\WX Thus, if we define

′ ′′ W, if W , W =  ∈ W W,˜ if W / ′.  ∈ W Then X W ′′ Y forms a Markov chain. Also → → 

(a) H(W ′′) = H P W ′ + P W ′ H(W ′)+ P W / ′ H(W W / ′) { ∈ W } { ∈ W } { ∈ W } | ∈ W

where (a)and(b) follow since H(X, Θ) = H(Θ)+H(X Θ). Thus we obtain G(X; Y ) < | H(W ), a contradiction. APPENDIX A. EXACT DISTRIBUTED SIMULATION 64

A.7 Proof of Proposition 8

Proposition 6 implies 3. |W| ≤ Geometrically, the Markov chain X W Y means that the vectors p ( x) : → → { Y |X ·| x lie in the convex hull of the vectors p ( w) : w . ∈X} { Y |W ·| ∈ W} When we restrict = 2, an application of Lemma 2 together with the obser- |W| vation that p (w x)=0 p (x w) = 0 and the Markovity X W Y W |X | ⇐⇒ X|W | → → immediately results in the two cases described in Proposition 8. Similarly, when we let = 3, by Lemma 2, the supports of p ( w) are |W| Y |W ·| different for each w . Therefore, we have ∈ W

1 0 x pY |W = , 0 1x ¯   where each column corresponds to a value of W and 0 x 1. ≤ ≤ Similarly pW |X can be one of the 6 cases below. Each row corresponds to a W . Since there are 3 rows, there are 3! = 6 cases.

y z y z 0z ¯ y¯ 0 y¯ 0 0z ¯

pW |X y¯ 0 , 0z ¯ , y z , y z , 0z ¯ , y¯ 0 , ∈   0z ¯ y¯ 0 y¯ 0 0z ¯ y z y z                           for appropriately chosen numbers y, z between 0 and 1.  Assume α β pY |X = α¯ β¯   and X Bern(p). We first consider case 1 for p . In that case, we have the ∼ W |X Markovity pY |W pW |X = pY |X , i.e.,

y z 1 0 x α β y¯ 0 = , 0 1x ¯ α¯ β¯ 0z ¯         APPENDIX A. EXACT DISTRIBUTED SIMULATION 65

which yields y = α and z +¯zx = β or z = β−x , hence x β. Also x¯ ≤

py +¯pz

pW = pW |X pX =  py¯  .  p¯z¯      y = α is fixed, and z decreases monotonically with x. Since pW is a linear function of z, the optimal z that minimizes the concave function H(W ) should lie at one of the end-points of z. The two end-points are z = 0 (corresponding to x = β) and z = β (corresponding to x = 0). When z = 0, H(W ) > H(¯p) = H(X) and may be eliminated because H (X; Y ) H(X). When z = β, we have x = 0, hence 1 ≤ p (y 1) = p (y 2) for all y. Y |W | y|W | In this case, we can replace W by it’s sufficient statistic w.r.t Y , namely

1, if w=0 T (W )=  0, otherwise.  By property 6 from Section 2.1 this operation reduces the cardinality and in- |W| creases the entropy.

Cases 2, 3, 4 for pW |X are very similar to case 1. Now consider case 5. In this case, Markovity implies y¯ 0 1 0 x α β 0z ¯ = 0 1x ¯ α¯ β¯ y z         and py¯

pW =  p¯z¯  py +pz. ¯      APPENDIX A. EXACT DISTRIBUTED SIMULATION 66

Thus we want to solve the following optimization problem:

Minimize H(py,¯ p¯z,py¯ +¯pz) Subject toy ¯ + xy = α, xz = β.

We can eliminate variable x from the above constraints to obtain a constraint

α¯ β + =1. y z

We want to show that for the optimal (y∗, z∗), either y∗ = 1 or z∗ = 1. In that case, the support of W reduces to size 2, thus showing that = 2 suffices. To |W| show this, we observe that (y, z) : α¯ + β = 1 lies in the convex hull of the points { y z } (1,β/α), (¯α/β,¯ 1), (0, 0), (0, 1), (1, 0) . Since H() is a concave function, it’s minimum { } in the convex hull should be one of the boundary points. It is easy to see that the minimum must be one of (1,β/α), (¯α/β,¯ 1) . These 2 points satisfy the original { } α¯ β ∗ ∗ constraint y + z = 1. Hence, either y =1 or z = 1. Case 6 is similar to case 5. Appendix B

Which Boolean Functions Maximize Mutual Information on Noisy Inputs?

B.1 Randomized Boolean functions do not improve mutual information

As mentioned briefly in the introduction, it suffices to consider deterministic functions when upper-bounding the quantity I(b(Xn); Y n). Indeed, any randomized Boolean function b can be written as ˜b(Q, Xn), where ˜b is a deterministic function and Q is a random variable that is independent of Xn. In this case, we have

I(˜b(Q, Xn); Y n) I(˜b(Q, Xn), Q; Y n) (B.1) ≤ = I(Q; Y n)+ I(˜b(Q, Xn); Y n Q) (B.2) | = I(˜b(Q, Xn); Y n Q). (B.3) |

Thus, there must exist some q such that

I(˜b(Q, Xn); Y n Q = q) I(˜b(Q, Xn); Y n), (B.4) | ≥

67 APPENDIX B. BOOLEAN FUNCTIONS 68

which establishes the claim that randomization does not help.

B.2 Further comments on edge-isoperimetry, the Takagi function, and H(b(Xn) Y n) | The Takagi function is a simple construction of an everywhere continuous, nowhere differentiable function [35]. Specifically, the Takagi function T(p) is defined on the unit interval [0, 1] by

∞ s(p2n) T(p)= , (B.5) 2n n=0 X where s(x) = min Z m x (i.e., the distance to the nearest integer). In 2000, Guu m∈ | − | [30] connected the Takagi function with the edge-isoperimetry problem mentioned in Section 3.1.31. We recall his result in the following theorem.

Theorem 11 (From [30]).

∂(S) T k min | n | = n (B.6) S⊆V (Qn):|S|=k 2 2   and the minimum is achieved only if S coincides with an initial segment of the lexi- cographical ordering under an isomorphism of the hypercube Qn.

As we alluded to in Section 3.1.3, H(b(Xn) Y n) is related to the edge-isoperimetric | problem and, consequently, the Takagi function. The relationship becomes most apparent in the limiting case of α 0, which is made precise by the following → theorem.

Theorem 12. For any Boolean function b, associate b−1(0) with the corresponding set of vertices in the n-dimensional hypercube Q . Let ∂(b−1(0)) be the size of the n | | 1Takagi-like functions also appear in various other problems (e.g., [36]). APPENDIX B. BOOLEAN FUNCTIONS 69

−1 edge boundary of b (0) in Qn. Then

H(b(Xn) Y n) ∂(b−1(0)) lim | =2| | (B.7) α→0 H(α) 2n −1 T b (0) 2 | n | (B.8) ≥ 2 ! =2T Pr b(Xn)=0 . (B.9)    Moreover, the lower bound is attained with equality if b is lex.

We can infer from Theorem 12 that Conjecture 2 is valid for sufficiently small α since those functions b which are not equivalent to a lex function (in the same sense as in Theorem 11) will not meet the lower bound (B.9) as α vanishes.

Remark 5. Before continuing to the proof, we remark that information-theoretic techniques have been applied to proving edge-isoperimetric inequalities in the past [37, 38]. However, it appears that those results are related to our problem only insofar as Harper’s edge-isoperimetric inequality is related.

Theorem 12 follows immediately from Theorem 8, Theorem 11, and the following lemma.

Lemma 6.

E n [f(Pr b(Xn)=0 Y n )] ∂(b−1(0)) Y | lim = | n |, (B.10) α→0 H(α) 2 where f(x)= x log(1/x).

Proof. For convenience letα ¯ =1 α. By definition, we know that −

n n n n Pr(b(Xn)=0 yn)= αd(x ,y )α¯n−d(x ,y ), (B.11) | n −1 x ∈Xb (0) where d(xn,yn) is the Hamming distance between vectors xn and yn. Suppose αkα¯n−k is the largest term in the sum (B.11) for a given yn, and put N (yn)= xn b−1(0) : d(xn,yn)= k . k { ∈ }

APPENDIX B. BOOLEAN FUNCTIONS 70

Then, for α near zero and k 1, ≥

f(Pr(b(Xn)=0 yn)) (B.12) | n k n−k k+1 = f Nk(y )α α¯ + O(α ) (B.13)   = N (yn)αkα¯n−k + O(αk+1) (B.14) − k   log N (yn)αkα¯n−k + log 1+ O(α) × k h   i = N (yn)αk + O(αk+1)  (B.15) − k   k log(α) + log(N (yn)) + O(α) × k h i = kN (yn)αk log(α)+ O(αk)+ O αk+1 log(α) . (B.16) − k   Therefore,

f(Pr(b(Xn)=0 yn)) | H(α) n k k k+1 kNk(y )α log(α)+ O(α )+ O α log(α) = − (B.17) α log(α)+ o(α log(α)) −  N (yn) if k =1 α→0 1 (B.18) −→  0 if k > 1. 

On the other hand, if k= 0, then yn b−1(0), and N (yn) = 1. In this case, ∈ 0

f(Pr(b(Xn)=0 yn)) = f(¯αn + O(α)) (B.19) | = f(1 + O(α)) (B.20) = (1 + O(α)) log(1 + O(α)) (B.21) − = O(α) (B.22) α→0 0. (B.23) −→ APPENDIX B. BOOLEAN FUNCTIONS 71

Since Pr Y n = yn =2−n for all yn 0, 1 n, we can conclude that { } ∈{ }

E n [f(Pr b(Xn)=0 Y n )] ∂(b−1(0)) Y | lim = | n | (B.24) α→0 H(α) 2 as claimed.

Remark 6. Using a Taylor series expansion similar to the proof above, it is possible to prove that Conjecture 1 holds as α 1/2. However, since this setting does not exhibit → the nice connection to the edge-isoperimetric inequality that occurs when α 0, we → omit the details.

B.3 Enumerating the functions in Sn Recall from Corollary 1 that denotes the set of Boolean functions on n inputs that Sn are -compressed for all of cardinality 2. Suppose b : 0, 1 n+1 0, 1 , and define I I { } →{ } the functions b : 0, 1 n 0, 1 and b : 0, 1 n 0, 1 by 0 { } →{ } 1 { } →{ }

n n b0(x )= b(0, x ) (B.25) n n b1(x )= b(1, x ) (B.26) for all xn 0, 1 n. By definition, b if, and only if, b , b , and b is ∈{ } ∈ Sn+1 0 ∈ Sn 1 ∈ Sn 1, j -compressed for all j 2, 3,...,n +1 . Therefore, given , we can enumerate { } ∈{ } Sn all functions in in O(n 2n−1 2) time since, for b : 0, 1 n+1 0, 1 , we can Sn+1 |Sn| { } → { } na¨ıvely test whether b is 1, j -compressed in O(2n−1) time. With this observation in { } mind, it is a routine exercise to recursively enumerate the functions in for modest Sn values of n.

B.4 H¨older Continuity of Sα(p)

In Section 3.1.4 we defined S (p) , H(b(Xn) Y n), where b : 0, 1 n 0, 1 is a α | { } → { } lex function satisfying Pr b(Xn)=0 = p. Note that if k is even, then b does not { } depend on its input bit xn. Therefore, by induction, Sα(p) is well-defined for all APPENDIX B. BOOLEAN FUNCTIONS 72

dyadic rationals p [0, 1]. As we show shortly, S (p) is continuous, and therefore has ∈ α a unique continuous extension to the interval [0, 1]. In order to do so, we require the following lemma:

Lemma 7. For x, y [0, 1], ∈

x log(x) y log(y) 2 x y . (B.27) | − | ≤ | − | p Proof. Assume without loss of generality that x 0. Now, note that

∂ x x log(x) (x + δ) log(x + δ) = log < 0, ∂x − x + δ    and hence x log(x) (x+δ) log(x+δ) attains it’s maximum when x =0or x =1 δ. | − | − Thus, we can conclude that

x log(x) (x + δ) log(x + δ) | − | 1 1 max δ log , (1 δ) log 2√δ. (B.28) ≤ δ − 1 δ ≤  − 

Lemma 8. Sα(p) is H¨older continuous. Specifically,

S (x) S (y) 4 x y . (B.29) | α − α | ≤ | − | p Proof. We will prove the claim when x, y are dyadic rationals. Since dyadic rationals are dense, H¨older continuity of the same order carries over immediately to the contin- uous extension of Sα on the interval [0, 1]. To this end, let x, y be dyadic rationals with x>y, and let b and b be lex functions on 0, 1 n which satisfy x = Pr b (Xn)=0 x y { } { x } APPENDIX B. BOOLEAN FUNCTIONS 73

and y = Pr b (Xn)=0 . Then { y }

S (x) S (y) | α − α | = H(b (Xn) Y n) H(b (Xn) Y n) x | − y | (a) 1 n n n n H(bx(X ) y ) H(by (X ) y ) ≤ 2n | − | yn∈{0,1}n X (b) 1 n n n n n 4 Pr bx(X )=0 y Pr by(X )=0 y ≤ n n 2 { | } − { | } y ∈{X0,1} q (c) 4 E n Pr b (Xn)=0 Y n Pr b (Xn)=0 Y n ≤ Y { x | } − { y | } =4√qx y,  − where,

(a) follows from the triangle inequality, • (b) follows from Lemma 7 and the fact that Pr b (Xn)=0 Y n Pr b (Y n)= • { x | } ≥ { y 0 Y n since x>y implies b−1(0) b−1(0), and | } y ⊂ x (c) follows by Jenson’s inequality since √x is concave for x 0. • ≥ Bibliography
