arXiv:2008.06092v1 [cs.IT] 13 Aug 2020

Infinite Divisibility of Information

Cheuk Ting Li
Department of Information Engineering
The Chinese University of Hong Kong
Email: [email protected]

Abstract—We study an information analogue of infinitely divisible probability distributions, where the i.i.d. sum is replaced by the joint distribution of an i.i.d. sequence. A random variable X is called informationally infinitely divisible if, for any n ≥ 1, there exists an i.i.d. sequence of random variables Z1,...,Zn that contains the same information as X, i.e., there exists an injective function f such that X = f(Z1,...,Zn). While an informationally infinitely divisible discrete random variable does not exist, we show that any discrete random variable X has a bounded multiplicative gap to infinite divisibility: if we remove the injectivity requirement on f, then there exist i.i.d. random variables Z1,...,Zn satisfying X = f(Z1,...,Zn) and H(Z1) ≤ 1.59·H(X)/n + 2.43. We also study a new class of discrete probability distributions, called spectral infinitely divisible distributions, where we can remove the multiplicative gap 1.59. Furthermore, we study the case where X = (Y1,...,Ym) is itself an i.i.d. sequence, for which the multiplicative gap 1.59 can be replaced by 1 + 5·√((log m)/m). This means that, as m increases, (Y1,...,Ym) becomes closer to being spectral infinitely divisible in a uniform manner, which can be regarded as an information analogue of Kolmogorov's uniform theorem. Applications of our result include independent component analysis, distributed storage with a secrecy constraint, and distributed random number generation.

Index Terms—Infinitely divisible distributions, information spectrum, independent component analysis, distributed storage, Kolmogorov's uniform theorem.

I. INTRODUCTION

This paper seeks to answer the following question: can a piece of information be divided into arbitrarily many pieces? Is information infinitely divisible, like space and time? To answer this question, we first review the concept of infinite divisibility in probability theory.

In probability theory, it is often of interest to consider the distributions of i.i.d. sums Σ_{i=1}^m Y_i of i.i.d. random variables Y1,...,Ym. Some notable limit theorems regarding i.i.d. sums are the law of large numbers, the central limit theorem (with limit distribution being the Gaussian distribution) and the law of small numbers [1] (with limit distribution being the Poisson distribution). One may also ask the reverse question of whether a given probability distribution can be expressed as such an i.i.d. sum. This is captured by the concept of infinite divisibility: a random variable X is infinitely divisible if, for any m, there exist i.i.d. random variables Y1,...,Ym such that X =d Σ_{i=1}^m Y_i. On the question of whether a given random variable is close to being infinitely divisible, Kolmogorov's uniform theorem [2] states that the distributions of i.i.d. sums are uniformly close to being infinitely divisible, in the sense that for any i.i.d. random variables Y1,...,Ym, there exists a random variable W following an infinitely divisible distribution such that dU(F_{Y1+···+Ym}, F_W) ≤ c·m^{−1/5}, where F_X denotes the cdf of X, dU(F1, F2) := sup_{x ∈ R} |F1(x) − F2(x)| denotes the uniform metric, and c > 0 is a universal constant. See [3], [4], [5] for improvements of this bound. Arak [6], [7] improved the bound to c·m^{−2/3}, which is shown to be tight.

Due to this connection, many results on i.i.d. sums in probability theory can be carried over to information theory. For example, the asymptotic equipartition property [9] (corresponding to the law of large numbers), large deviations analysis [10], [11], [12], and the Gaussian approximation of the information spectrum [13] (corresponding to the central limit theorem) are important tools in information theory (refer to [8] for a discussion on information spectrum).

In information theory, the "sum" of the information of two independent random variables X1 and X2 is captured by the joint random variable (X1, X2), where we write ιX(x) := −log pX(x) for the self information, and the entropy is H(X) = E[ιX(X)]. We have ι(X1,X2)(X1, X2) = ιX1(X1) + ιX2(X2), and hence the information spectrum of (X1, X2) (the distribution of ι(X1,X2)(X1, X2)) is the independent sum of the respective information spectra of X1 and X2.

Since bits can be expressed as an i.i.d. sequence, bits are regarded as the universal unit of information, and a piece of information is often decomposed into bits. Two examples are Huffman coding [14] and the Knuth-Yao method for generating random variables [15], both of them establishing that, loosely speaking, any discrete random variable X can be decomposed into roughly H(X) bits. A caveat of both results is that they concern the variable-length setting, where the number of bits is not fixed and can depend on the value of X (and H(X) is only the approximate average number of bits). We remark that the asymptotic equipartition property implies that the i.i.d. sequence X^n = (X1,...,Xn) can be decomposed into a fixed number of bits (roughly nH(X1)) with low error probability as n → ∞. In comparison, the results in this paper also apply when there is only one random variable X, instead of an i.i.d. sequence.

In this paper, we consider the information analogue of infinite divisibility, where the number of components n cannot depend on the value of X. Given a random variable X, we would determine whether, for all n ≥ 1, there exists an i.i.d. sequence of random variables Z1,...,Zn that contains the same information as X, i.e., H(X | Z1,...,Zn) = H(Z1,...,Zn | X) = 0 (there exists an injective function f such that X = f(Z1,...,Zn), or equivalently, they generate the same σ-algebra). If this is possible, then these Z1,...,Zn are an equipartition of the information in X, and they can be regarded as "X^{1/n}" (the inverse operation of taking n i.i.d. copies X^n = (X1,...,Xn)).

Unfortunately, an informationally infinitely divisible discrete random variable does not exist, revealing a major difference between the "sum" of information and the sum of real random variables.² This is due to the fact that, while any random variable has an information spectrum, not all distributions are the information spectrum of some random variable. Nevertheless, we show that any random variable has a bounded multiplicative gap to infinite divisibility. More precisely, for any discrete random variable X and n ∈ N, there exist i.i.d. discrete random variables Z1,...,Zn such that H(X | Z1,...,Zn) = 0, and

(1/n)·H(X) ≤ H(Z1) ≤ (e/(e−1))·(1/n)·H(X) + 2.43,    (1)

where the upper bound has a multiplicative gap e/(e−1) ≈ 1.582 from the lower bound. The additive gap 2.43 can be replaced with a term that scales like O(n^{−1/2} log n) as n → ∞ for any fixed H(X). This is proved in Theorem 10. As a consequence, the random variable X can indeed be divided into n i.i.d. pieces (where the entropy of each piece tends to 0 as n → ∞), but at the expense of a multiplicative gap ≈ 1.582 for large H(X), and the multiplicative gap becomes worse (and tends to ∞) for small H(X).

We also study a new class of discrete probability distributions, called the spectral infinitely divisible (SID) distributions, defined as the class of distributions where the information spectrum has an infinitely divisible distribution. Examples include the discrete uniform distribution and the geometric distribution. We show that if X follows an SID distribution, then the multiplicative gap e/(e−1) in (1) can be eliminated (though the additive gap 2.43 stays).

Furthermore, we can study the case where X = (Y1,...,Ym) is itself an i.i.d. sequence, m ≥ 2. In this case, we show that the multiplicative gap e/(e−1) in (1) can be replaced by

1 + 4.71·√((log m)/m).    (2)

This is proved in Proposition 14 and Theorem 15. This means that as m increases, (Y1,...,Ym) becomes closer to being SID in a uniform manner (that does not depend on the distribution of Yi). This can be regarded as an information analogue of Kolmogorov's uniform theorem. We conjecture that the optimal multiplicative gap is 1 + O(1/√m), that is, the log m term can be eliminated. Theorem 15 can also be regarded as a one-sided variant of Kolmogorov's uniform theorem, where we require the infinitely divisible estimate to stochastically dominate the original distribution. This may be of independent interest outside of information theory.
The main challenge in proving Theorem 15, compared to Kolmogorov’s uniform theorem, is that the stochastic dominance condition is significantly more stringent than having a small dU (which is a rather weak condition on the tails of the distributions). We list some potential applications of our result.

A. Independent Component Analysis and Blind Source Separation

Independent component analysis (ICA) [16], [17], [18] concerns the problem of recovering independent signals from their observed mixture. While assumptions on the distributions of the individual signals are not usually imposed, the model (under which signals are mixed to form the observation) is usually assumed to take a certain prescribed form (e.g. the signals are summed together linearly). The generalized ICA setting [19], [20] (also see [21], [22], [23], [24], [25], [26], [27] for related problems) removes the assumption on the model. Given the observation X = (Y1,...,Yn), Yi ∈ {1,...,a}, the generalized ICA aims at finding the signals Z1,...,Zn ∈ {1,...,a} and a bijective mapping f : {1,...,a}^n → {1,...,a}^n such that X = f(Z1,...,Zn), and Z1,...,Zn are as mutually independent as possible, which is accomplished by minimizing the total correlation [28]

C(Z1; ...; Zn) := Σ_{i=1}^n H(Zi) − H(Z1,...,Zn).

The generalized ICA does not make any assumption on the mapping f other than that it is bijective. In [29], it is shown that if a = 2 and the distribution of X is generated uniformly over the probability simplex, then the average optimal C(Z1; ...; Zn) can be upper-bounded by a constant.

²We remark that any continuous random variable over R is informationally infinitely divisible due to the existence of a measure-preserving bijection between [0, 1]^n and [0, 1] modulo zero.

The main differences between the generalized ICA and our result (1) are that we require Z1,...,Zn to be exactly (not only approximately) mutually independent and have the same distribution, but f does not need to be bijective, and we do not make any assumption on the size of the alphabet of Zi. Our result can be regarded as a variant of the generalized ICA, which we call independent identically-distributed component analysis (IIDCA). Suppose Z1,...,Zn are i.i.d. following an unknown distribution, and f(z1,...,zn) is an unknown (not necessarily injective) function. We are able to estimate the distribution of X = f(Z1,...,Zn) (e.g. by i.i.d. samples of X). The goal is to learn the distribution of Zi and the function f (since the labeling of the values of Zi is lost, we can only learn the entries of the pmf of Zi, but not the actual values of Zi). In ICA, the components are mixed together via addition, which causes loss of information. In IIDCA, the loss of information is due to the unknown nature of the function (or the loss of the labeling), and also that f may not be injective.

One approach to IIDCA is to solve the following optimization problem: minimize H(Z1) subject to the constraint that Z1,...,Zn are i.i.d., and H(X | Z1,...,Zn) = 0. This would minimize the amount of information lost by the function, i.e., H(Z1,...,Zn) − H(X). By (1), we can always achieve H(Z1) ≤ (e/(e−1))H(X)/n + 2.43. Refer to Section VII for an approximate algorithm.

For a concrete example, consider the compression of a 5 × 5 image, where the color of each pixel Zi (i = 1,...,25) is assumed to be i.i.d. following an unknown distribution pZ. We observe the image through a deterministic but unknown transformation f(z1,...,z25) (which is an arbitrary distortion of the image that does not need to preserve locations and colors of pixels). We observe a number of i.i.d. samples of transformed images (the transformation f is the same for all samples). The goal is to estimate the distribution of colors pZ and the transformation f. Since f is arbitrary, we can only hope for learning the entries of the pmf pZ, but not which color these entries correspond to. The IIDCA provides a method to recover the distribution of Zi (up to permutation) without any assumption on the transformation f.
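As a small illustration of the total correlation objective used in the generalized ICA discussion above, the following Python sketch computes C(Z1; ...; Zn) from a joint pmf given as an n-dimensional array. The helper names are hypothetical and not from the paper.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a pmf given as an array."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def total_correlation(joint):
    """C(Z1;...;Zn) = sum_i H(Zi) - H(Z1,...,Zn) for a joint pmf array."""
    joint = np.asarray(joint, dtype=float)
    h_joint = entropy_bits(joint.ravel())
    h_marginals = 0.0
    for axis in range(joint.ndim):
        # marginal of Z_{axis+1}: sum out all other axes
        marg = joint.sum(axis=tuple(a for a in range(joint.ndim) if a != axis))
        h_marginals += entropy_bits(marg)
    return h_marginals - h_joint

# Example: two fair bits that are always equal (fully dependent)
joint = np.array([[0.5, 0.0],
                  [0.0, 0.5]])
print(total_correlation(joint))  # 1.0 bit: H(Z1)+H(Z2)-H(Z1,Z2) = 1+1-1
```

A total correlation of zero corresponds exactly to mutual independence, which is the constraint our IIDCA formulation enforces exactly rather than approximately.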

B. Distributed Storage with a Secrecy Constraint

Suppose we would divide a piece of data X (a discrete random variable) into n pieces Z1,...,Zn, where each piece is stored in a separate node. There are three requirements:

1) (Recoverability) X is recoverable from the pieces of all n nodes, i.e., H(X | Z1,...,Zn) = 0.
2) (Mutual secrecy) Each node has no information about the pieces stored at other nodes, i.e., Z1,...,Zn are mutually independent.
3) (Identity privacy) Each node has no information about which piece of data it stores, i.e., if J ∼ Unif{1,...,n}, then we require that J is independent of ZJ. This means that if we assign the pieces to the nodes randomly, then a node which stores ZJ (let J be the index of the piece it stores) cannot use ZJ to gain any information about J. This condition is equivalent to that Z1,...,Zn have the same distribution.

Our goal is to minimize the amount of storage needed at each node, i.e., max_{1≤i≤n} H(Zi). These three requirements are trivially achievable if X = (Y1,...,Ym) contains m i.i.d. components, where m is divisible by n, since we can simply let Zi = (Y_{(i−1)m/n+1},...,Y_{im/n}). Nevertheless, this strategy fails when the data does not follow such homogeneous structure. For example, assume X = (Y1,...,Y100) contains the net worth of 100 individuals, and Y1 corresponds to a well-known billionaire (and thus has a much higher expectation than other Yi's). If we simply divide the dataset into 10 pieces (each with 10 data points) to be stored in 10 nodes, then identity privacy is not guaranteed: if a node observes a data point much larger than the rest, it will be able to identify the billionaire and know the billionaire's net worth, which may be considered a privacy breach. Hence, identity privacy is critical to the privacy of the user's data.

The construction in (1) provides a way to compute Z1,...,Zn satisfying the three requirements, and requires an amount of storage max_i H(Zi) ≤ (e/(e−1))H(X)/n + 2.43 at each node, a multiplicative factor e/(e−1) ≈ 1.582 larger than H(X)/n. An advantage of our construction is that it does not require X to follow any structure. If X = (Y1,...,Ym) indeed follows an i.i.d. structure, then we can apply (2) to reduce the storage requirement, even when m < n.

C. Distributed Random Number Generation

Suppose we want to generate a discrete random variable X following the distribution pX, using N identical and independent random number generators. Each generator is capable of generating a random variable following the distribution pZ. We are allowed to design pZ according to pX. Nevertheless, some of the generators may fail, and we want to guarantee that X can be generated whenever n out of the N generators are working.

One strategy is to let pZ be Unif{1,...,2^k} (i.e., k i.i.d. fair bits), and use a discrete distribution generating tree [15] to generate X (traverse the tree according to the bits produced by the first working generator, then the second working generator, and so on). We declare failure if a leaf node is not reached after we exhaust all working generators. This strategy would introduce a bias to the distribution of X, since the less probable values of X (corresponding to leaf nodes of larger depth) are less likely to be reached. Therefore, this strategy is undesirable if we want X to follow pX exactly. Hence, a fixed-length scheme (which declares failure if there are fewer than n working generators, where n is a fixed threshold) is preferable over a variable-length scheme.
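To make the bias concrete, here is a small, self-contained simulation sketch. It is not from the paper: the interval-refinement sampler below is only a stand-in for a generating tree. It draws at most a fixed number of fair bits, performs inverse-transform sampling by refining a dyadic interval, and declares failure if the interval still straddles a quantile boundary when the bits run out; conditional on success, the rare value is under-represented.

```python
import random
from collections import Counter

def generate_with_limited_bits(p, max_bits, rng):
    """Inverse-transform sampling of pmf p (list of probabilities) using at
    most max_bits fair bits; returns a value index, or None on failure."""
    cum = [0.0]
    for q in p:
        cum.append(cum[-1] + q)
    lo, hi = 0.0, 1.0
    for _ in range(max_bits):
        mid = (lo + hi) / 2
        if rng.random() < 0.5:   # one fair bit
            hi = mid
        else:
            lo = mid
        # success if the current interval lies inside a single quantile cell
        for i in range(len(p)):
            if cum[i] <= lo and hi <= cum[i + 1]:
                return i
    return None  # leaf not reached: declare failure

rng = random.Random(0)
p = [0.9, 0.1]                    # value 1 is the "less probable" value
counts = Counter(generate_with_limited_bits(p, 3, rng) for _ in range(100000))
ok = counts[0] + counts[1]
print(counts[None] / 100000)      # failure probability (about 1/8 here)
print(counts[1] / ok)             # well below 0.1: bias against the rare value
```

With only 3 bits available, the rare value is in fact never reached in this toy example, so conditioning on success distorts pX; this is exactly why the paper prefers a fixed-length scheme.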

Using the construction in (1), we can design pZ = pZi such that X can be generated by the output of any n generators. We declare failure if there are fewer than n working generators. We can guarantee that X ∼ pX conditional on the event that there is no failure. The entropy used per generator is bounded by (e/(e−1))H(X)/n + 2.43.

D. Bridge Between One-shot and IID Results

While most results in information theory (e.g. channel coding theorem, lossy source coding theorem) are proved in the i.i.d. asymptotic regime, one-shot results are also widely studied, with Huffman coding [14] (a one-shot lossless source coding result) being a notable example. See [35], [36], [37], [38], [39], [40], [41], [42] for results on one-shot or finite-blocklength lossy source coding, and [43], [44], [45], [11], [46], [47], [48], [49], [50] for results on one-shot channel coding. The results in this paper allow us to discover near-i.i.d. structures in one-shot settings, without explicitly assuming them. For example, in a one-shot source coding setting where we would compress the random variable X, we can apply (1) to divide X into i.i.d. Z1,...,Zn, and then apply standard techniques on i.i.d. sources such as the method of types [9], [51]. Therefore, the results in this paper can provide a bridge between one-shot and i.i.d. results. While (1) suffers from a multiplicative penalty, the penalty may be reduced if X follows certain structure (e.g. (2)). Similar approaches have been used in [21], [29] for large alphabet source coding.

One particular structure that allows a simple i.i.d. representation is the Markov chain. Suppose we would compress X = (Y1,...,Ym), where Y1,...,Ym forms a Markov chain (see [52], [53], [54] for results on source coding for Markov chains). We may represent Yi+1 = f(Yi, Wi) as a function of Yi and Wi, where Wi is a random variable independent of {Yj}_{j≤i} and {Wj}_{j<i}.
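The functional representation Yi+1 = f(Yi, Wi) mentioned above can be realized by conditional inverse-transform sampling with Wi i.i.d. Unif[0, 1). Below is a minimal sketch under the assumptions of a finite state space and a known transition matrix; the helper names are hypothetical.

```python
import numpy as np

def markov_functional_representation(P):
    """Given a row-stochastic transition matrix P, return f such that
    Y_{i+1} = f(Y_i, W_i) has transition law P when W_i ~ Unif[0,1) i.i.d."""
    P = np.asarray(P, dtype=float)
    cum = np.cumsum(P, axis=1)          # conditional cdfs, one row per state

    def f(y, w):
        # conditional inverse transform: w in [cum[y][k-1], cum[y][k]) -> state k
        return int(np.searchsorted(cum[y], w, side="right"))

    return f

# Example: a two-state chain, simulated from i.i.d. uniforms
P = [[0.9, 0.1],
     [0.2, 0.8]]
f = markov_functional_representation(P)
rng = np.random.default_rng(0)
y, path = 0, [0]
for w in rng.random(10):                # W_1,...,W_10 i.i.d. Unif[0,1)
    y = f(y, w)
    path.append(y)
print(path)
```

In this representation the "innovations" Wi are i.i.d. and independent of the past, which is what makes the Markov structure amenable to the i.i.d. techniques discussed in this section.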

E. Other Concepts of Divisibility in Information Theory

We remark that continuous-time memoryless channels, such as the continuous-time additive white Gaussian noise channel [9] and the Poisson channel [62], [63], [64], can be regarded as "infinitely divisible" since one can break the time interval into segments. Nevertheless, the existence of infinitely divisible channels does not imply the existence of informationally infinitely divisible discrete random variables. Information-theoretic analysis on infinitely divisible distributions ( in particular) has been performed in, e.g. [65], [66], [67], [68], [69], [70], though they concerned the classical definition of infinite divisibility, not the "informational infinite divisibility" in this paper. A marked Poisson process [71] can be considered as an informationally infinitely divisible random object. More precisely, let {(Xi, Ti)}_i be a point process, where {Ti}_i is a Poisson process over [0, 1], and Xi ∼ pX i.i.d. The point process {(Xi, Ti)}_i is informationally infinitely divisible since we can divide it into n i.i.d. pieces, where the k-th piece is {(Xi, Ti − (k−1)/n) : (k−1)/n ≤ Ti < k/n}. However, a marked Poisson process is not discrete and has infinite entropy (since the Ti are continuous). Nevertheless, our construction in Theorem 10, loosely speaking, can be regarded as a marked Poisson process where the information about the times Ti is removed. Marked Poisson processes have also been used in information theory to construct coding schemes in [41], [50].

Successive refinement in source coding [72] is sometimes referred to as source divisibility (e.g. [73], [74]), though this division is performed in a heterogeneous manner (each description is intended for a different distortion level), not in the homogeneous i.i.d. manner in this paper.

Notations

Throughout this paper, we assume that the entropy H is in bits, log is to base 2, and ln is to base e. The binary entropy function is Hb(x) := −x log x − (1 − x) log(1 − x). We write N = {1, 2, 3, ...}, [n] = {1,...,n}. The probability mass function (pmf) of a random variable X is denoted as pX. The support of a probability mass function p is denoted as supp(p). For a pmf p over the set 𝒳 and a pmf q over the set 𝒴, the product pmf p × q is a pmf over 𝒳 × 𝒴 with (p × q)(x, y) := p(x)q(y). Write p^{×n} := p × ··· × p (n terms on the right hand side). For two random variables X and Y, X =d Y means that X and Y have the same distribution.

The cumulative distribution function (cdf) of a random variable X is denoted as FX. The space of cdf's over [0, ∞) is denoted as P+. For a cdf F ∈ P+, the inverse cdf is given by F^{−1}(t) = inf{x ≥ 0 : F(x) ≥ t}. For two cdf's F1, F2 ∈ P+, we say F1 stochastically dominates F2, written as F1 ≤ F2, if F1(t) ≤ F2(t) for any t. The convolution of two cdf's F1, F2 ∈ P+ is given by (F1 ∗ F2)(t) := ∫_0^t F1(t − s) dF2(s). Write F^{∗n} := F ∗ ··· ∗ F (n terms on the right hand side). The uniform metric between cdf's is denoted as dU(F1, F2) := sup_x |F1(x) − F2(x)|. The total variation distance is denoted as dTV(F1, F2) := sup_{A ⊆ R measurable} |∫ 1{t ∈ A} dF1(t) − ∫ 1{t ∈ A} dF2(t)|. The mean of the distribution given by the cdf F ∈ P+ is denoted as

E(F) := E_{T∼F}[T] = ∫_0^∞ t dF(t).

The pmf of the Bernoulli distribution is denoted as Bern(x; γ) := 1{x = 0}(1 − γ) + 1{x = 1}γ. The pmf of the geometric distribution over N is denoted as Geom(x; γ) := γ(1 − γ)^{x−1}.
The binomial, negative binomial and Poisson distributions are denoted as Bin(x; n, p), NegBin(x; r, p) and Poi(x; λ) respectively. We use the currying notation Bern(γ) to denote the pmf x ↦ Bern(x; γ) (i.e., Bern(γ) is a function with Bern(γ)(x) = Bern(x; γ)). Similar for Geom and other distributions.
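As a quick sanity check on the cdf operations just defined, here is a minimal sketch with hypothetical helper names; distributions are represented by finite lists of atoms and probabilities.

```python
import numpy as np

def cdf_on_grid(atoms, probs, grid):
    """Evaluate the cdf of a finite discrete distribution on given grid points."""
    atoms, probs = np.asarray(atoms, float), np.asarray(probs, float)
    return np.array([probs[atoms <= t].sum() for t in grid])

def convolve(atoms1, probs1, atoms2, probs2):
    """Distribution of T1 + T2 for independent finite discrete T1, T2 (i.e., F1 * F2)."""
    sums = {}
    for a1, p1 in zip(atoms1, probs1):
        for a2, p2 in zip(atoms2, probs2):
            sums[a1 + a2] = sums.get(a1 + a2, 0.0) + p1 * p2
    atoms = sorted(sums)
    return atoms, [sums[a] for a in atoms]

# In the paper's notation, F1 <= F2 (F1 stochastically dominates F2) means
# F1(t) <= F2(t) for all t, i.e. T1 tends to take larger values than T2.
grid = np.linspace(0, 10, 1001)
a1, p1 = [2.0, 5.0], [0.5, 0.5]       # T1
a2, p2 = [1.0, 4.0], [0.5, 0.5]       # T2 (smaller atoms)
F1, F2 = cdf_on_grid(a1, p1, grid), cdf_on_grid(a2, p2, grid)
print(np.all(F1 <= F2))               # True: F1 dominates F2
print(np.max(np.abs(F1 - F2)))        # d_U(F1, F2) on this grid (= 0.5 here)
a12, p12 = convolve(a1, p1, a2, p2)   # cdf of T1 + T2 is F1 * F2
print(a12, p12)                       # atoms [3, 6, 9] with probs [0.25, 0.5, 0.25]
```

The inequality convention F1 ≤ F2 for stochastic dominance is used throughout the paper, e.g. in Theorem 9 and Theorem 15 below.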

II. INFORMATION SPECTRUM

For a discrete random variable X with pmf pX, its self information is defined as ιX(x) := −log pX(x). Its information spectrum cdf is defined as

FιX(t) := P(ιX(X) ≤ t) = P(−log pX(X) ≤ t).

Note that H(X) = E(FιX) = ∫_0^∞ t dFιX(t) is the mean of ιX(X), and if X is independent of Y, then the joint self information is ι(X,Y)(x, y) = ιX(x) + ιY(y), and hence Fι(X,Y)(t) = (FιX ∗ FιY)(t) = ∫_0^t FιX(t − s) dFιY(s).

Not every cdf over [0, ∞) is the information spectrum cdf of some discrete random variable. For a cdf over [0, ∞) to be the information spectrum cdf, that cdf must correspond to a discrete distribution, where the probability of t is a multiple of 2^{−t}. We define the set of information spectrum cdf's as

Pι := {FιX : X is discrete RV}.

We present the concept of aggregation in [57], [75].
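Before turning to aggregation and majorization, here is a small numerical illustration of the information spectrum cdf just defined (hypothetical helper names, not from the paper): it checks that H(X) = E(FιX) and that the spectrum of an independent pair is the convolution of the individual spectra.

```python
import numpy as np
from itertools import product

def information_spectrum(pmf):
    """Atoms and weights of the distribution of iota_X(X) = -log2 p_X(X)."""
    spec = {}
    for p in pmf:
        if p > 0:
            t = -np.log2(p)
            spec[t] = spec.get(t, 0.0) + p
    atoms = sorted(spec)
    return np.array(atoms), np.array([spec[t] for t in atoms])

pX = [0.5, 0.25, 0.25]
pY = [0.75, 0.25]

# H(X) equals the mean of the information spectrum, E(F_{iota_X})
atoms, w = information_spectrum(pX)
print(np.dot(atoms, w))                     # 1.5 = H(X) in bits

# spectrum of the independent pair (X, Y) = convolution of the two spectra
p_joint = [px * py for px, py in product(pX, pY)]
aJ, wJ = information_spectrum(p_joint)
aX, wX = information_spectrum(pX)
aY, wY = information_spectrum(pY)
conv = {}
for t1, w1 in zip(aX, wX):
    for t2, w2 in zip(aY, wY):
        key = round(t1 + t2, 9)
        conv[key] = conv.get(key, 0.0) + w1 * w2
print(sorted(conv.items()))
print(sorted(zip(np.round(aJ, 9), wJ)))     # same atoms and weights
```

This is the sense in which the joint random variable (X, Y) realizes the "independent sum" of the information of X and Y discussed in the introduction.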

Definition 1. For two pmf’s pX ,pY , we say pY is an aggregation of pX , written as pX pY , if there exists a function g : supp(p ) supp(p ) (called the aggregation map) such that p is the pmf of g(X), where⊑ X p . X → Y Y ∼ X Also recall the concept of majorization over pmf’s (see [76]): Definition 2. For two pmf’s p ,p , we say p is majorized by p , written as p p , if X Y X Y X  Y max pX (A) max pY (B) (3) A supp(p): A k ≤ B supp(q): B k ⊆ | |≤ ⊆ | |≤ N for any k (write pX (A) := x A pX (x)). Equivalently, the sum of the k largest pX (x)’s is not greater than the sum of ∈ ∈ the k largest pY (x)’s. P It is shown in [75] that pX pY implies pX pY . In fact, it is straightforward to check that (refer to (5) and Proposition 4 for a proof) ⊑  p p F F p p . X ⊑ Y ⇒ ιX ≤ ιY ⇒ X  Y For the other direction, it is shown in [61] that p p implies X  Y p Geom(1/2) p , (4) X × ⊑ Y where p Geom(1/2) is the joint pmf of (X,Z) where X p is independent of Z Geom(1/2). X × ∼ X ∼ We now generalize the concept of majorization to arbitrary cdf’s over [0, ). ∞ ι Definition 3. For two cdf’s F1, F2 +, we say that F1 is informationally majorized by F2, written as F1 F2, if G (γ) G (γ) for any γ [0, 1], where∈ P G : [0, 1] [0, ),  F1 ≥ F2 ∈ F → ∞ ∞ G (γ) := 2td min F (t), γ , F ˆ { } −∞ where F (t)=0 for t< 0. ι It is clear that “ ” is a transitive relation. It is straightforward to check that G (γ) is continuous, strictly increasing, convex,  F and G (γ) γ. Write G′ for the left derivative of G (let G′ (0) = 0). Since F ≥ F F F ∞ G (γ)= 2td min F (t), γ F ˆ { } −∞γ −1 = 2F (s)ds, ˆ0 we have 1 F − (γ) = log GF′ (γ). (5) ι As a result, F1 F2 implies F1 F2. We first show≤ that informational majorization is a generalization of majorization. The proof is given in Appendix A. ι Proposition 4. For two discrete random variables X, Y , we have p p if and only if F F . X  Y ιX  ιY ι ∞ The mean E(F ) = 0 tdF (t) is non-increasing with respect to “ ”, as shown in the following proposition. This is an analogue of the fact that´ entropy is Schur-concave. The proof is given in Appendix B. ι Proposition 5. If F F , then E(F ) E(F ). 1  2 1 ≥ 2 ι Convolution and mixture preserves “ ”, as shown in the following proposition. The proof is given in Appendix C.  ι ι ι ι Proposition 6. If F F , F F , then F F F F , and (1 λ)F + λF (1 λ)F + λF for any λ [0, 1]. 1  2 3  4 1 ∗ 3  2 ∗ 4 − 1 3  − 2 4 ∈ Any cdf is closely approximated by an information spectrum cdf, as shown in the following proposition. ι Proposition 7. For any cdf F , there exists a discrete random variable X such that F F and ∈P+ ιX  E(F ) 1 E(F ) H(X) E(F )+ Hb min , ≤ ≤ (s log e 2)! E(F )+1. ≤ 1 1 Proof: Let G = GF . Let G′(γ) be the left derivative of G. Let G− (t) be the inverse function of G (let G− (t)=1 if 1 t supγ G(γ)), and let γk := G− (k) for k 0, and γ 1 := 1. By the convexity of G, γk is a concave sequence. Let ≥ ≥ − − qk := γk γk 1 for k 0 (note that q0 =1). By the concavity of γk, qk is a non-increasing sequence. − − ≥ Let X be a random variable with pX (k)= qk for k 1. Write G˜ = GF . Note that G˜(γk)= G(γk)= k for any k 0, ≥ ιX ≥ and that G˜(γ) is affine over the interval [γk 1, γk]. By the convexity of G, we have G˜(γ) G(γ) for any γ, and hence ι − ≥ F F . The lower bound in the proposition is a consequence of Proposition 5. 
To prove the upper bound, ιX  ∞ E(F )= tdF (t) ˆ0 1 (a) = log G′(γ)dγ ˆ0 (b) 1 1 log dγ ˆ ≥ q1 q1 1 = (1 q1) log − q1 (1 q )2 log e, ≥ − 1 where (a) is due to (5), and (b) is is because G′(γ) is non-decreasing, and 1/q1 is the slope of the chord of G between 0 and q1. Hence, E(F ) q1 1 . ≥ − s log e We have

∞ E(F )= tdF (t) ˆ0 1 1 = F − (γ)dγ ˆ0 1 = log G′(γ)dγ ˆ0 γ ∞ k = log G (γ)dγ ˆ ′ γk−1 Xk=1 (a) ∞ 1 (γk γk 1) log ≥ − − γk 1 γk 2 Xk=1 − − − ∞ 1 = qk log qk 1 Xk=1 − ∞ qk = H(X)+ qk log qk 1 Xk=1 − (b) ∞ ( k∞=2 qk) H(X)+ q1 log q1 + qk log ≥ ! ( k∞=2 qk 1) Xk=2 P − = H(X)+ q log q + (1 q ) log(1 q ) 1 1 − 1 −P1 E(F ) 1 H(X) Hb min , , ≥ − (s log e 2)! where (a) is because G′(γ) is non-decreasing, and 1/(γk 1 γk 2) is the slope of the chord of G between γk 2 and γk 1 − − − − − (for k =1, 1/(γk 1 γk 2)=1 G′(γ)), and (b) is due to the log sum inequality. − − − ≤
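As a numerical companion to the construction in the proof of Proposition 7, the following sketch (hypothetical helper names, restricted to a cdf with finitely many atoms) computes G_F, the points γ_k = G^{-1}(k), and the increments q_k = γ_k − γ_{k−1}, and checks that the resulting pmf (pX(k) = q_k for k ≥ 1) sums to one and has entropy within the stated bounds of E(F).

```python
import numpy as np

def proposition7_pmf(atoms, weights):
    """Construct q_k = gamma_k - gamma_{k-1} (k >= 1) from a finite cdf F on [0, inf),
    where gamma_k = G^{-1}(k) and G(gamma) = int_0^gamma 2^{F^{-1}(s)} ds."""
    atoms, weights = np.asarray(atoms, float), np.asarray(weights, float)
    order = np.argsort(atoms)
    atoms, weights = atoms[order], weights[order]
    c = np.concatenate([[0.0], np.cumsum(weights)])                  # breakpoints of F^{-1}
    G = np.concatenate([[0.0], np.cumsum(weights * 2.0 ** atoms)])   # G at the breakpoints

    def G_inv(k):
        if k >= G[-1]:
            return 1.0
        j = max(int(np.searchsorted(G, k, side="left")), 1)          # G[j-1] < k <= G[j]
        return c[j - 1] + (k - G[j - 1]) / (2.0 ** atoms[j - 1])

    kmax = int(np.ceil(G[-1]))
    gammas = np.array([G_inv(k) for k in range(kmax + 1)])           # gamma_0 = 0, ...
    q = np.diff(gammas)                                              # q_1, ..., q_kmax
    return q[q > 0]

# Example: F is the information spectrum of pX = [0.5, 0.25, 0.125, 0.125]
t = np.array([1.0, 2.0, 3.0, 3.0])
w = np.array([0.5, 0.25, 0.125, 0.125])
EF = float(np.dot(t, w))                         # E(F) = 1.75
q = proposition7_pmf(t, w)
H = float(-np.sum(q * np.log2(q)))
print(q.sum())                   # 1.0: q is a pmf
print(EF, H, EF <= H <= EF + 1)  # E(F) <= H(X) <= E(F) + 1, as in Proposition 7
```

In this dyadic example the construction recovers the original pmf exactly, so H(X) = E(F); for a general cdf the entropy overshoot is controlled by the Hb(·) term in the proposition.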

III. MULTIPLICATIVE GAP TO INFINITE DIVISIBILITY n Define the space of n-divisible cdf’s ∗ and the space of infinitely divisible cdf’s ∗∞ as P+ P+ n n ∗ := F ∗ : F , P+ { ∈P+} n ∗∞ := ∗ . P+ P+ n N \∈ n Similarly, define the space of n-informationally-divisible cdf’s ∗ and the space of informationally infinitely divisible cdf’s Pι ∗∞ as Pι n n ∗ := F ∗ : F Pι { ∈Pι} n ∗∞ := ∗ . Pι Pι n N \∈ n n Note that ∗ is the set of F n for i.i.d. sequences Z = (Z ,...,Z ). A random variable X is informationally infinitely Pι ιZ 1 n divisible (as described in the introduction) if and only if FιX ι∗∞. The following proposition shows that there is no informatio∈Pnally infinitely divisible random variable.

Proposition 8. Pι^{∗∞} = ∅.

Proof: Consider i.i.d. discrete random variables Z^n = (Z1,...,Zn), Zi ∼ pZ, where we assume Z ∈ N, pZ(1) ≥ pZ(2) ≥ ···. Let a be the largest integer such that pZ(a) = pZ(1). Assume Z is not a uniform random variable (i.e., pZ(a + 1) > 0). Consider the pmf pZ^n. Its largest entries have values (pZ(1))^n, and its second largest entries have values (pZ(1))^{n−1} pZ(a + 1). The number of entries z^n ∈ N^n with value pZ^n(z^n) = (pZ(1))^{n−1} pZ(a + 1) is a multiple of n (since it is the number of vectors z^n ∈ N^n such that exactly one zi has pZ(zi) = pZ(a + 1), and every other zi has pZ(zi) = pZ(1), so the number of such vectors is a multiple of n by considering which component has pZ(zi) = pZ(a + 1)). Therefore, either Z is uniform, or the number of second largest entries of pZ^n is a multiple of n.

Assume the contrary that there exists a random variable X such that FιX ∈ Pι^{∗∞}. It is clear that X cannot be uniform (by taking n to be larger than its cardinality). Consider the number of second largest entries of pX, and let n be larger than this number. Since FιX ∈ Pι^{∗n}, there exist i.i.d. discrete random variables Z^n = (Z1,...,Zn) such that FιX = FιZ^n. Since X is not uniform, Zi cannot be uniform, and hence the number of second largest entries of pZ^n is a multiple of n, which gives a contradiction.
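The counting step in the proof above is easy to verify numerically. The following sketch (hypothetical helper names) counts, for a non-uniform pZ and several n, how many entries of the product pmf p_{Z^n} attain its second largest value, and confirms that the count is a multiple of n.

```python
import numpy as np
from itertools import product

def second_largest_multiplicity(pZ, n):
    """Number of entries of the product pmf pZ^{x n} equal to its second largest value."""
    entries = np.array([np.prod(c) for c in product(pZ, repeat=n)])
    values = np.unique(np.round(entries, 12))[::-1]   # distinct values, descending
    second = values[1]
    return int(np.sum(np.isclose(entries, second)))

pZ = [0.5, 0.3, 0.2]          # non-uniform, so the argument in Proposition 8 applies
for n in range(2, 6):
    count = second_largest_multiplicity(pZ, n)
    print(n, count, count % n == 0)   # the count is always a multiple of n
```

Since an information spectrum determines the multiset of pmf entries, this divisibility property is what rules out an informationally infinitely divisible discrete random variable.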

Nevertheless, there is a bounded multiplicative gap between and ∗∞. In fact, there is a bounded multiplicative gap Pι P+ between any cdf and ∗∞, as shown in the following theorem. P+ n Theorem 9. For any cdf F and n N , there exists F˜ ∗ such that F˜ F and ∈P+ ∈ ∪ {∞} ∈P+ ≤ 1 E(F ) E(F˜) E(F ), ≤ ≤ 1 (1 n 1)n − − − 1 n where we assume 1/(1 (1 n− ) )= e/(e 1) when n = . − − − ∞ 1 1 Proof: Fix ζ (0, 1). Let U Unif[ζ, 1], and Z := F − (U) F − (ζ). Let Z1,Z2,... be i.i.d. copies of Z. Let ∈1/n ∼ − 1 N N Bin(n, 1 ζ ) if n < , and N Poi( ln ζ) if n = , independent of Zi , and V := F − (ζ)+ i=1 Zi. Note∼ that when−n = , V is an∞ infinitely divisible∼ − random variable∞ since it follows a{compound} Poisson distribution (with ∞ n 1 1 P P1 a constant offset). We have FV +∗ . Let V2 := F − (ζ)+ N 1 Z1 (note that (N 1) = 1 ζ), V3 := N = 1 ∈ P1 { ≥ } ≥ − { 0 F − (U˜)+ 1 N 1 (Z1 + F − (ζ)), where U˜ Unif[0, ζ]. It is clear that V3 F and V V2 V3, and hence F} F F .{ ≥ } ∼ ∼ ≥ ≥ V ≤ V2 ≤ We then bound E[V ]= E(FV ). We have 1 E[V ]= F − (ζ)+ E[N]E[Z] E 1 1 [N] 1 1 = F − (ζ)+ F − (u) F − (ζ) du 1 ζ ˆ − − ζ E 1  1 [N] 1 = (1 E[N]) F − (ζ)+ F − (u)du − 1 ζ ˆζ E− 1 [N] (1 E[N]) F − (ζ)+ E(F ). ≤ − 1 ζ − 1 n 1 The result follows from substituting ζ = (1 n− ) if n< , and ζ = e− if n = , which makes E[N]=1. − ∞ ∞ As a result, we can approximately divide any discrete random variable into equal parts. The following theorem bounds the n gap between and ∗ . Pι Pι Theorem 10. For any discrete random variable X and n N, there exists i.i.d. random variables Z1,...,Zn and function f d ∈ such that X = f(Z1,...,Zn), and 1 H(X) H(Z ) 1 − 1 (1 n 1)n · n − − − e H(X) 1 1/n 1/n min 2.43, H min , + H (2− )+2(1 2− ) . ≤ b (e 1) log e · n 2 b − ( (s − )! ) As a result, for any fixed X, we can achieve

e H(X) 1/2 H(Z )= + O(n− log n) 1 e 1 · n − 1/2 as n (where the constant in O(n− log n) depends on H(X)). We now prove the theorem. Proof:→ ∞ We assume n 2 (the theorem is trivial for n = 1). Fix any discrete random variable X. By Theorem 9, there ι n ≥ exists F˜ ∗ such that F˜ F (and hence F˜ F ) and ∈P+ ≤ ιX  ιX 1 E(F˜) H(X). ≤ 1 (1 n 1)n − − − ι n ˜ Let F + such that F ∗ = F . By Proposition 7, there exists a discrete random variable Y such that FιY F and ∈ P ι ι  ∞ . By Proposition 6, n n ˜ . Since n , by H(Y ) 0 tdF (t)+ Hb(min E(F )/ log e, 1/2 ) Fι∗Y F ∗ = F FιX Fι∗Y , FιX ι ≤ n { } n   ∈P Proposition´ 4, p× p . By (4) (proved in [61]), p× Geom(1/2) p . Y X p Y X Define a random variable B N as × ⊑ ∈ k 1/n (k 1) 1/n p (k) := (1 2− ) (1 2− − ) . (6) B − − − n If B1,...,Bn are i.i.d. copies of B, then max B1,...,Bn Geom(1/2). Hence pB× Geom(1/2). Let pZ = pY pB. n n n n { }∼ ⊑ × We have p× p× p× p× Geom(1/2) p . It is left to bound H(Z). We have Z ⊑ Y × B ⊑ Y × ⊑ X H(Z)= H(Y )+ H(B) E(F ) 1 E(F )+ Hb min , + H(B) ≤ (s log e 2)! 1 E(F˜) 1 = E(F˜)+ Hb min , + H(B) n  sn log e 2     1 H(X) e H(X) 1  + H min , + H(B). ≤ 1 (1 n 1)n · n b (e 1) log e · n 2 − − − (s − )! Note that e H(X) 1 H min , b (e 1) log e · n 2 (s − )! 1/2 = O(n− log n) as n for any fixed H(X). It can be checked numerically that H(B) < 1.43 for n 2. For another bound on H(B), since→ the ∞ conditional distribution of B given B 2 majorizes Geom(1/2), we have ≥ ≥ H(B) H (p (1)) + (1 p (1))H(B B 2) ≤ b B − B | ≥ 1/n 1/n H (2− )+2(1 2− ). ≤ b − 1/n 1/n 1 Note that H (2− )+2(1 2− )= O(n− log n). b − We remark that the condition “X =d f(Z ,...,Z )” in Theorem 10 is equivalent to “H(X Z ,...,Z )=0” in the 1 n | 1 n introduction, since we can couple Z1,...,Zn together with X such that X = f(Z1,...,Zn).
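The construction underlying Theorem 9 is simple enough to check by simulation. The sketch below (hypothetical helper names; Monte Carlo, so only approximate) builds V = F^{-1}(ζ) + Σ_{i=1}^N Zi with ζ = (1 − 1/n)^n, N ∼ Bin(n, 1 − ζ^{1/n}) and Z = F^{-1}(U) − F^{-1}(ζ), U ∼ Unif[ζ, 1], and checks empirically that F_V ≤ F (V stochastically dominates F) and that E[V] ≤ E(F)/(1 − (1 − 1/n)^n).

```python
import numpy as np

rng = np.random.default_rng(0)

# target distribution F: a finite distribution on [0, inf)
atoms = np.array([0.5, 2.0, 6.0])
probs = np.array([0.5, 0.3, 0.2])
cum = np.cumsum(probs)

def F_inv(u):
    """Inverse cdf of the target distribution."""
    return atoms[np.searchsorted(cum, u, side="left")]

n = 4
zeta = (1 - 1 / n) ** n
base = F_inv(zeta)                                   # F^{-1}(zeta)

def sample_V(size):
    N = rng.binomial(n, 1 - zeta ** (1 / n), size)   # number of summands
    V = np.full(size, base)
    for i in range(size):
        U = rng.uniform(zeta, 1, N[i])               # U ~ Unif[zeta, 1]
        V[i] += np.sum(F_inv(U) - base)              # add N[i] i.i.d. copies of Z
    return V

T = F_inv(rng.uniform(0, 1, 200000))                 # samples from F
V = sample_V(200000)
grid = np.linspace(0, 20, 200)
FV = np.array([(V <= t).mean() for t in grid])
FT = np.array([(T <= t).mean() for t in grid])
print(np.max(FV - FT))           # close to or below 0: F_V <= F up to sampling noise
print(V.mean(), T.mean() / (1 - (1 - 1 / n) ** n))   # E[V] <= E(F)/(1-(1-1/n)^n)
```

With ζ = (1 − 1/n)^n we have E[N] = 1, which is exactly the choice that yields the multiplicative factor 1/(1 − (1 − 1/n)^n) ≤ e/(e − 1) in Theorem 9.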

IV. SPECTRAL INFINITE DIVISIBILITY

Although Pι^{∗∞} = ∅ (i.e., there is no informationally infinitely divisible random variable), we have Pι ∩ P+^{∗∞} ≠ ∅. A discrete random variable X is called spectral infinitely divisible (SID) if FιX ∈ P+^{∗∞}. The distribution of X is called a spectral infinitely divisible distribution. A discrete uniform distribution is SID. Since a geometric distribution is infinitely divisible, it is straightforward to check that it is also SID. We now define a more general class of SID distributions that includes uniform distributions and geometric distributions as special cases.

Definition 11. Define the spectral negative binomial random variable X ∼ SNB(r, p, a, b) (where r, a, b ∈ N, p ∈ (0, 1]) over N as follows: let

s_k := C(k + r − 1, r − 1) · a · b^k

for k ∈ Z_{≥0}, and

pX(x) = SNB(x; r, p, a, b) := (p^r / a) · ((1 − p)/b)^k

for x ∈ N, where k ∈ Z_{≥0} satisfies Σ_{i=0}^{k−1} s_i < x ≤ Σ_{i=0}^{k} s_i.

It is straightforward to check that

ιX(X) =d log(a / p^r) + K log(b / (1 − p)),

where K follows the negative binomial distribution NegBin(r, p) with failure probability p and number of failures until stopping r. Since a negative binomial distribution is infinitely divisible, the spectral negative binomial random variable is SID. The discrete uniform distribution with cardinality a is SNB(1, 1, a, 1). The geometric distribution is SNB(1, p, 1, 1). Another property is that if X1 ∼ SNB(r1, p, a1, b) and X2 ∼ SNB(r2, p, a2, b) are independent, then there exists an injective function f such that

f(X1, X2) ∼ SNB(r1 + r2, p, a1·a2, b).
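To see Definition 11 in action, the sketch below (hypothetical helper names; the pmf is truncated at a finite number of negative binomial "levels") builds the SNB(r, p, a, b) pmf, checks that it sums to one, and checks that its self information takes the values log(a/p^r) + k·log(b/(1 − p)) with (truncated) negative binomial weights.

```python
import numpy as np
from math import comb, log2

def snb_pmf(r, p, a, b, levels):
    """Truncated SNB(r,p,a,b) pmf: level k contributes s_k = C(k+r-1, r-1)*a*b^k
    consecutive symbols, each with probability (p^r / a) * ((1-p)/b)^k."""
    pmf = []
    for k in range(levels):
        s_k = comb(k + r - 1, r - 1) * a * b ** k        # number of symbols at level k
        prob = (p ** r / a) * ((1 - p) / b) ** k         # probability of each symbol
        pmf.extend([prob] * s_k)
    return np.array(pmf)

r, p, a, b = 2, 0.6, 3, 1
pmf = snb_pmf(r, p, a, b, levels=30)
print(pmf.sum())   # close to 1 (equals 1 with infinitely many levels)

# information spectrum: iota takes value log(a/p^r) + k*log(b/(1-p)) with
# probability C(k+r-1, r-1) p^r (1-p)^k, a (shifted, scaled) negative binomial
iota = -np.log2(pmf)
for k in range(3):
    level_value = log2(a / p ** r) + k * log2(b / (1 - p))
    weight = pmf[np.isclose(iota, level_value)].sum()
    print(k, weight, comb(k + r - 1, r - 1) * p ** r * (1 - p) ** k)
```

Since the information spectrum is a shifted and scaled negative binomial, and negative binomial distributions are infinitely divisible, the spectrum lies in P+^{∗∞}, which is exactly the SID property.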

V. RATIO TO INFINITE DIVISIBILITY

More generally, we can measure how close a distribution is to infinite divisibility as follows.

Definition 12. For a cdf F ∈ P+, define its ratio to infinite divisibility as

rID(F) := (1/E(F)) · inf{ E(F̃) : F̃ ∈ P+^{∗∞}, F̃ ≤ F }.

If the mean E(F) = 0, then rID(F) := 1.

We list some properties of rID(F ). Proposition 13. The ratio to infinite divisibility satisfies: (Bound) For any F , • ∈P+ e 1 r (F ) . ≤ ID ≤ e 1 − (Relation to SID) A random variable X is SID if and only if r (F )=1. • ID ιX (Convolution) For any F , F with positive mean, • 1 2 ∈P+ rID(F1)E(F1)+ rID(F2)E(F2) rID(F1 F2) . ∗ ≤ E(F1)+ E(F2) Proof: The bound is a direct consequence of Theorem 9. For the relation to SID, the “only if” part is trivial. For the ˜ N ˜ ˜ “if” part, if rID(FιX )=1, then there exists Fi +∗∞, i such that Fi FιX and limi E(Fi) = E(FιX ). Hence, ∈ P ∈ ≤ →∞ ˜ ∞ W1(FιX , Fi) 0, where W1(F1, F2) := F1(t) F2(t) dt is the 1-Wasserstein metric. Convergence in 1-Wasserstein metric implies→ convergence in Lévy metric,´−∞ which| is− equivalent| to weak convergence. Since the limit of infinitely divisible distributions is infinitely divisible, F is infinitely divisible. The convolution property follows from the fact that if F˜ ∗∞, ιX i ∈P+ F˜ F , i =1, 2, then F˜ F˜ ∗∞ and F˜ F˜ F F . i ≤ i 1 ∗ 2 ∈P+ 1 ∗ 2 ≤ 1 ∗ 2 We can refine Theorem 10 using rID as follows. The proof is similar to Theorem 10 and is omitted.

Proposition 14. Fix any random variable X and n ∈ N. There exist i.i.d. random variables Z1,...,Zn and a function f such that X =d f(Z1,...,Zn), and

H(Z1) − rID(FιX) · H(X)/n ≤ min{ 2.43, Hb(min{ √(e·H(X)/((e−1)·n·log e)), 1/2 }) + Hb(2^{−1/n}) + 2(1 − 2^{−1/n}) }.

VI. APPROXIMATE SPECTRAL INFINITE DIVISIBILITY OF IID SEQUENCES

In this section, we consider the case X = Y^m = (Y1,...,Ym), where Y1,...,Ym are i.i.d. random variables. As an information analogue of Kolmogorov's uniform theorem [2], we show that Y^m tends to being spectral infinitely divisible uniformly, in the sense that rID(FιY^m) → 1 as m → ∞ uniformly. This result can be used together with Proposition 14 to show that Y^m can be divided into n i.i.d. random variables (with n possibly larger than m), with entropy close to the lower bound. This result can also be regarded as a one-sided variant of Kolmogorov's uniform theorem, where we require the infinitely divisible estimate to stochastically dominate the original distribution, which may be of independent interest.

Theorem 15. For any cdf F ∈ P+^{∗m}, m ≥ 2, there exists F̃ ∈ P+^{∗∞} such that F̃ ≤ F and

E(F̃) ≤ (1 + 4.71·√((log m)/m)) · E(F).

As a result, for m ≥ 2,

sup_{F ∈ P+^{∗m}} rID(F) ≤ 1 + 4.71·√((log m)/m).    (7)

Theorem 15 can be stated in the following equivalent way: For any i.i.d. random variables T1,...,Tm 0, m 2, there exists an infinitely divisible random variable S such that S m T almost surely, and ≥ ≥ ≥ i=1 i m P E S T 4.71 m log m E[T ]. − i ≤ · 1 " i=1 # X p Before we prove the Theorem 15, we first prove the following lemma. Lemma 16. If γ 1 and m N satisfy that ≥ 0 ∈ sup rID(F ) γ, (8) F S ∗m ≤ ∈ m≥m0 P+ then for any m> max γ2(m 2), 0 , { 0 − } 2γ 1 sup rID(F ) 1+ 2 . (9) F ∗m ≤ m − γ ∈P+     Proof: Fix ǫ> 0. Let m> max γ2(m 2), 0 , and F¯ . Let { 0 − } ∈P+ 1 λ := m 1 , − γ   γ 1 ζ := − , 2γ 1 − m n˜ := +1, γ2  1 g0 := F¯− (ζ). Let U Unif[ζ, 1], and ∼ 1 Z := F¯− (U) g . − 0 2 n˜ Since m>γ (m 2), we have n˜ m . By (8), let Z˜ 0 be a random variable with F ˜ ∗∞, F ˜ F ∗ , and 0 − ≥ 0 ≥ Z ∈ P+ Z ≤ Z E[Z˜] γn˜E[Z]+ ǫ. Let Z ,Z ,... be i.i.d. copies of Z independent of Z˜. Let N Poi(λ) independent of Z , Z˜. Let ≤ 1 2 ∼ { i} N V := mg0 + Z˜ + Zi. i=1 X We have FV +∗∞. ∈P ¯ m ¯ We now show that FV F ∗ . Let W = Z with probability 1 ζ, and W = 0 with probability ζ. Since Fg0+W F , ≤¯ m − ≤ we have Fmg +Pm W F ∗ , where Wi are i.i.d. copies of W . Let M Bin(m, 1 ζ) independent of Zi . Then 0 i=1 i ≤ { } ∼ − { } M Z =d m W . It will be shown in Appendix D that N +˜n stochastically dominates M (i.e., F F ), and hence i=1 i i=1 i N+˜n ≤ M ¯ m P P F ∗ Fmg +Pm W ≥ 0 i=1 i = F M mg0+=1 Zi (a) Fmg +PN+˜n Z ≥ 0 i=1 i Fmg +Z˜+PN Z ≥ 0 i=1 i = FV , where (a) is because we can couple M together with N such that N +˜n M. ≥ We then bound E[V ]= E(FV ). We have

E[V ]= mg0 + E[N]E[Z]+ E[Z˜] mg + (λ + γn˜) E[Z]+ ǫ ≤ 0 1 m = mg + m 1 + γ +1 E[Z]+ ǫ 0 − γ γ2       mg0 + (m +2γ) E[Z]+ ǫ ≤ 1 1 1 = mg + (m +2γ) F¯− (u) g du + ǫ 0 1 ζ ˆ − 0 − ζ 1  1 1 = 2γg + (m +2γ) F¯− (u)du + ǫ − 0 1 ζ ˆ − ζ 1 ¯ (m +2γ) γ 1 E(F )+ ǫ ≤ 1 2γ− 1 − − 1 1 m = 1+2γm− 2 γ− E(F¯∗ )+ ǫ. − The result follows from letting ǫ 0.   → We now prove Theorem (15). Proof: Fix 1 <α< 2 and let 1 α 2(e(e 1)− ) 2 β := − + (e 1)− . (10) α(α 1) − − For k 0, let ≥ k + e α γ := , k k + e 1  −  m := β (k + e 1)2α . k − Note that j k 2α mk >β (k + e 1) 1 − 1 α− 2(e(e 1)− ) 2 2α − + (e 1)− (e 1) 1 ≥ α(α 1) − − −  − 1 α  2(e(e 1)− ) > − (e 1)2α α(α 1) − − 2eα(e 1)α = − α(α 1) − 0, ≥ and hence m 1. We will prove inductively that for k 0, k ≥ ≥ sup rID(F ) γk. (11) F S ∗m ≤ ∈ m≥mk P+ 1 α 1 When k = 0, γ0 = (e(e 1)− ) e(e 1)− , and hence (11) holds by Theorem 9. Assume (11) holds for k. Fix any m m . We have − ≥ − ≥ k+1 m β (k + e)2α ≥ jβ (k + e)2α k 1 ≥ − k + e 2α m 1 ≥ k + e 1 k −  −  >γ2(m 2). k k − By Lemma 16,

sup rID(F ) F ∗m ∈P+ 2γ 1 1+ k 2 ≤ m − γ  k   k  1 α α 2(e(e 1)− ) k + e 1 1+ − 2α 2 − ≤ β (k + e 1) 1! − k + e − −     1 α α α 2(e(e 1)− ) 2(k + e) (k + e 1) 1+ − 2 − α − ≤ β (k + e 1) 1! (k + e) − − 1 α α α α 2(e(e 1)− ) (k + e + 1) 2(k + e) + (k + e 1) = γk+1 1+ − 2 1 − α − β (k + e 1) 1! − (k + e + 1) − −   (a) 1 α α 2 2(e(e 1)− ) α(α 1)(k + e + 1) − γk+1 1+ − 2 1 − α ≤ β (k + e 1) 1! − (k + e + 1) − −   1 α 2(e(e 1)− ) α(α 1) = γk+1 1+ − 2 1 − 2 β (k + e 1) 1! − (k + e + 1) − −   1 α 2(e(e 1)− ) α(α 1) γk+1 1+ − 2 − 2 ≤ β (k + e 1) 1 − (k + e + 1) ! − − (b) γ , ≤ k+1 where (a) is because (k + e + 1)α 2(k + e)α + (k + e 1)α − − k+e+1 k+e α 1 α 1 = αt − dt αt − dt ˆk+e − ˆk+e 1 − k+e+1/2 α 1 α 1 = α (t +1/2) − (t 1/2) − dt ˆk+e 1/2 − − − k+e+1/2 t+1/2  α 2 = α (α 1)s − dsdt ˆk+e 1/2 ˆt 1/2 − − − α 2 α(α 1)(k + e) − ≥ − α 2 α(α 1)(k + e + 1) − ≥ − α 2 since 1 <α< 2 and s s − is convex, and (b) is because 7→ α(α 1) β (k + e 1)2 1 − 2 − 1 −α (k + e + 1) · 2(e(e 1)− ) − 2 β (k + e 1)− = α(α 1) − − − 2(e(e 1) 1)α − − 1 ≥ by (10). Therefore, (11) holds for all k 0 by induction. ≥ Fix any m mmin := 600. Let ≥ 1 1 α = 1 + (ln m)− 1 + (ln m )− , ≤ min 1 m 2+2/ ln m k = e +1 , ψ ln m − $  % where 1 1+(ln m )−1 2 1 ψ := 2(e(e 1)− ) min + (e 1)− (ln m )− . − − min It can be checked that k 0. We have ≥ 2α mk β (k + e 1) ≤ − 1 α 2(e(e 1)− ) 2 2α = − + (e 1)− (k + e 1) α(α 1) − −  − 1 α 2 1 2(e(e 1)− ) (e 1)− (ln m )− − + − min (k + e 1)2α ≤ α 1 α 1 −  − −  ψ (k + e 1)2α ≤ α 1 − − = ψ(ln m) (k + e 1)2+2/ ln m − m. ≤ By (11),

sup rID(F ) F ∗m ∈P+ γ ≤ k k + e α = k + e 1  −  α k+e α 1 (k + e 1) + k+e 1 αt − dt = − − (k + e´ 1)α α − α 1 (k + e 1) + α(k + e 1/2) − − − ≤ (k + e 1)α −α α 1 k + e 1/2 α(k + e 1/2) − =1+ − − k + e 1 (k + e 1/2)α  −  − e 1/2 α 1 1+ α − ≤ e 1 k + e 1/2  −  − α 1 1 e 1/2 m 2+2/ ln m − 1+ α − 1/2 ≤ e 1 ψ ln m −  −    ! α 1 e 1/2 1 1 − 1+ α − m 2+2/ ln m 1/2 ≤ e 1 √ψ ln m −     − α 1 e 1/2 1 (ln m)2 − =1+ α − exp 1/2 e 1 √ψ ln m 2 ln m +2 −  − α     1 e 1/2 1 ln m ln m − =1+ α − exp 1/2 e 1 √ψ ln m 2 − 2 ln m +2 −  − α    1  e 1/2 1 ln m 1 − 1+ α − exp 1/2 ≤ e 1 √ψ ln m 2 − 2 −       − α 1 e 1/2 1 m − =1+ α − 1/2 e 1 √eψ ln m −  −   r  α 1 mmin 1 e 1/2 √eψ ln mmin 1 m − 1+ α − ≤ e 1 1 qmmin √eψ ln m 1/2 r  −  √eψ ln mmin −   qm α min √ln 2 e 1/2 ln mmin log m =1+ α − · 1q m e 1 min 1/2r m  −  √eψ ln mmin − log m q 1+4.70662 ≤ r m The bound also holds when 2 m 599 by Theorem 9, since 1+4.70662 (log m)/m e/(e 1) in this range. The result follows. ≤ ≤ ≥ − p A slightly curious consequence of Theorem 15 is that if H(X) = 100, then Theorem 10 implies that we can divide X into 100000 two i.i.d. pieces Z1,Z2 with H(Z1) 70. In comparison, by Theorem 15, we can divide X (100000 i.i.d. copies of X) into 200000 i.i.d. pieces, each with entropy≤ 56, smaller than 70. This shows that it is easier to divide X100000 into 200000 i.i.d. pieces, than to divide X into two i.i.d.≤ pieces. If we can eliminate the (1+2γ/m) term in (9) (which comes from rounding errors in bounding the cdf of binomial and Poisson distributions), then we can improve the bound in (7) to 1+ O(1/√m). We conjecture that this is the correct scaling of the gap. Conjecture 17. There exists a universal constant c such that for all m N, ∈ c sup rID(F ) 1+ . F ∗m ≤ √m ∈P+

VII.INDEPENDENT IDENTICALLY-DISTRIBUTED COMPONENT ANALYSIS In this section, we introduce a variant of independent component analysis, called independent identically-distributed compo- nent analysis (IIDCA). Suppose Z1,...,Zn are i.i.d. following an unknown distribution pZ , and f(z1,...,zn) is an unknown function (which may not be injective). We observe the distribution of X = f(Z1,...,Zn), or an estimate of the distribution, e.g. by i.i.d. samples of X. The goal is to learn the distribution pZ and the function f. Since the labeling of the values of Zi is lost, we can only learn the entries of the pmf of Zi, but not the actual values of Zi. Our assumption is that the amount of information lost by the function, i.e., H(Z1,...,Zn) H(X), should be small. This means that f is close to being injective. This assumption is suitable if the space of X is rich enough− to contain all information in Z1,...,Zn. For the image example mentioned in Section I-A, the assumption is suitable if the space of transformed images is rich enough, for example, the transformed image has a larger size or a richer color space. Another reason for this assumption is that it is necessary to infer any information about pZ , since if too much information is lost by f, then the distribution of X can be quite arbitrary and reveals little information about pZ . Since minimizing H(Z ,...,Z ) H(X) is equivalent to minimizing H(Z ), this gives the following optimization problem: 1 n − 1 minimizeH(Z1)

subject toZ1,...,Zn i.i.d., H(X Z ,...,Z )=0. (12) | 1 n By Theorem 10, we can always achieve H(Z ) (e/(e 1))H(X)/n+2.43. Nevertheless, it is difficult to solve (12) directly. 1 ≤ − Therefore, we would consider a relaxation by allowing an arbitrary cdf F in place of the information spectrum FιZ : minimizeE(F ) n subject toF , F ∗ F . (13) ∈P+ ≤ X After finding F , we can apply the procedures in the proof of Theorem 10 to convert it to the desired distribution pZ . The gap between the optimal values of (12) and (13) is bounded by 2.43.

Another relaxation is to only require pZ1,...,Zn to be majorized by pX (see (3)):

minimizeH(pZ ) n subject top× p . (14) Z  X After finding pZ , we can apply the procedures in the proof of Theorem 10 (i.e., taking Z˜ = (Z,B), where B is given in (6)) to convert it to a distribution p , where p is an aggregation of p , and hence we can assume H(X Z˜ ,..., Z˜ )=0. Z˜ X Z˜1,...,Z˜n 1 n The gap between the optimal values of (14) and (13) is bounded by 1.43. | Nevertheless, (14) is a non-convex problem. We propose the following greedy algorithm. Let the size of the support of X be l. We construct a probability vector q = (q ,...,q ), q q in the following recursive manner: for i =1,...,l, take 1 l 1 ≥···≥ l n qi := max t 0 : (q1,...,qi 1,t)× pX , (15) ≥ −  n where “ n” is the n-fold tensor product of the vector (treated as a vector in Ri ), and “ ” follows the same definition as (3) except× that we do not require the left hand side to be a probability vector. The optimal t can be found by binary 1/n search. This procedure continues until i = l or qi = 0. Note that q1 = (maxx pX (x)) , and qi is non-increasing since n i n (q1,...,qi 1,t)× pX is a more stringent condition for larger i. We have qj 1 (since (q1,...,qi)× pX ). If −  j=1 ≤  P i 1 j−=1 qj < 1, then the qi given by (15) satisfies qi > 0, and at least one more equality in the inequalities in the definition of n n (q1,...,qi)× pX is satisfied (compared to (q1,...,qi 1)× pX ). To show this, let k 0 be the largest integer such that P  −  ≥ n

max qaj = max pX (B), A 1,...,i 1 n: A k B: B k ⊆{ − } | |≤ a A j=1 | |≤ { jX}j ∈ Y i 1 i.e., the k-th inequality in (3) is an equality. Assume j−=1 qj < 1 and consider the qi that attains the maximum in (15). There exists k′ such that P n

max qaj = max pX (B), (16) n B: B k′ A 1,...,i : A k′, a A j=1 | |≤ ⊆{ } | |≤ { Xj }j ∈ Y a A, j′ 1,...,n s.t.a ′ = i ∃{ j }j ∈ ∈{ } j i.e., at least one inequality involving qi is an equality, or else we can further increase qi. If k′ k, then there exists aj j n n ≤ { } ∈n ′ 1,...,i where aj = i for some j′, and j=1 qaj is greater than or equal to the k-th largest entry of (q1,...,qi 1)× { } n − (and hence must be equal since (q1,...,qi)× pX ), and thus (16) also holds for k′ = k +1. Therefore we can assume Q  n k′ > k, and at least one more equality in the inequalities in the definition of (q1,...,qi)× pX is satisfied. Therefore, when n  the procedure terminates, either qi = 0 or i = l (all the l inequalities in (q1,...,qi)× pX are equalities), both implying i  j=1 qj =1. Hence, when the procedure terminates, the vector q is a probability vector, and we can take pZ = q. P VIII. ACKNOWLEDGEMENT The author acknowledges support from the Direct Grant for Research, The Chinese University of Hong Kong.

APPENDIX A. Proof of Proposition 4

Without loss of generality, assume X, Y N, pX (1) pX (2) and pY (1) pY (2) . For the “only if” part, ∈ ≥ ≥ · · · kX ≥ ≥ · · · assume pX pY . Fix any γ and let kX Z 0 be the largest integer satisfying pX (x) γ. Define kY similarly. Since  ∈ ≥ x=1 ≤ pX pY , we have kX kY . We have  ≥ P ∞ 2td min F (t), γ ˆ { ιX } −∞ γ kX p (x) = k + − x=1 X X p (k + 1) XP X k +1. ≤ X ∞ t ∞ t Hence, if kX > kY , we have 0 2 d min FιX (t), γ 0 2 d min FιY (t), γ . It is left to consider kX = kY . In this case, ´ { }≥ ´ { } ∞ 2td min F (t), γ ˆ { ιX } −∞ kX γ x=1 pX (x) = kX + − kX +1 p (x) kX p (x) x=1 XP − x=1 X kX P γ x=1 pPY (x) kX + − ≥ kX +1 p (x) kX p (x) x=1 Y P − x=1 Y ∞ = 2Ptd min F (t), γP. ˆ { ιY } −∞ k ∞ t ∞ t Y For the “if” part, assume 2 d min FιX (t), γ 2 d min FιY (t), γ . Fix any kY . Let γ = y=1 pY (y), and let kX be defined as in the “only´−∞ if” part. We{ have }≥ ´−∞ { } P ∞ k = 2td min F (t), γ Y ˆ { ιY } −∞ ∞ 2td min F (t), γ ≤ ˆ { ιX } −∞ γ kX p (x) = k + − x=1 X X p (k + 1) XP X < kX +1. Hence k k , and kY p (x) kX p (x) γ = kY p (y). Y ≤ X x=1 X ≤ x=1 X ≤ y=1 Y B. Proof of PropositionP 5 P P

Write Gi = GFi , and Gi′ (γ) for the left derivative of Gi. By (5),

∞ tdFi(t) ˆ0 1 1 = Fi− (γ)dγ ˆ0 1 = log Gi′ (γ)dγ ˆ0 1 ∞ log e = min Gi′ (γ), ξ dγ 2 dξ. ˆ1 ˆ0 { } · ξ 1 1 Therefore, to prove ∞ tdF (t) ∞ tdF (t), it suffices to prove min G′ (γ), ξ dγ min G′ (γ), ξ dγ for any ξ 1. 0 1 ≥ 0 2 0 { 1 } ≥ 0 { 2 } ≥ Fix any ξ 1, and´ let η := sup γ´ [0, 1] : G′ (γ) ξ . We have´ ´ ≥ { ∈ 1 ≤ } 1 min G1′ (γ), ξ dγ ˆ0 { } η = G1′ (γ)dγ + (1 η)ξ ˆ0 − = G (η)+(1 η)ξ 1 − G2(η)+(1 η)ξ ≥ 1 − 1 1 = ( γ η G2′ (γ)+ γ > η ξ) dγ ˆ0 { ≤ } { } 1 min G2′ (γ), ξ dγ. ≥ ˆ0 { } C. Proof of Proposition 6

We use the following alternative definition of GF :

∞ G (γ)= 2td min F (t), γ F ˆ { } −∞ = inf E[2T Q], P : Q [0,1], E[Q]=γ Q|T ∈ where T F (T has cdf F ), and the supremum is over random variables Q [0, 1] (which can be dependent of T ) with E[Q]= γ∼. It is straightforward to check that the two definitions are equivalent. ∈ ι ι ι To prove Proposition 6, it suffices to prove that if F F˜ , then F F F˜ F and (1 λ)F +λF (1 λ)F˜ +λF . 1  1 1 ∗ 2  1 ∗ 2 − 1 2  − 1 2 Fix any γ 0. Let T1 F1, T2 F2, T˜1 F˜1 mutually independent (hence T1 + T2 F1 F2). Fix any ǫ> 0, γ 0 and any random≥ variable Q∼ [0, 1] with∼ E[Q]=∼ γ. We have ∼ ∗ ≥ ∈ E[2T1+T2 Q] = E 2T2 E[2T1 Q T ] | 2 (a) E 2T2 G (E[Q T ]) ≥ F1 | 2 (b)  T2  E 2 G ˜ (E[Q T2]) ≥ F1 | (c) ˜ E 2T2 (1 ǫ)E[2T1 Q˜ T ] ≥ − | 2 ˜ = (1 h ǫ)E[2T1+T2 Q˜] i − (1 ǫ)GF˜ F (γ), ≥ − 1∗ 2 ι ˜ ˜ where (a) is by the alternative definition of GF1 , and (b) is by F1 F1. For (c), we let Q [0, 1] be a random variable such T˜1  1 ∈ that E[Q˜ T2 = t2]= E[Q T2 = t2] and E[2 Q˜ T2 = t2] (1 ǫ)− G ˜ (E[Q T2 = t2]) for any t2 (this is possible due | | | ≤ − F1 | to the alternative definition of G ˜ (E[Q T2 = t2])). Also note that E[Q˜]= E[E[Q˜ T2]] = E[Q]= γ. Hence, F1 | | E T1+T2 GF1 F2 (γ) = inf [2 Q] ∗ Q [0,1], E[Q]=γ ∈ (1 ǫ)GF˜ F (γ). ≥ − 1∗ 2 ι Letting ǫ 0, we have F1 F2 F˜1 F2. → ∗ ι  ∗ To prove (1 λ)F1 + λF2 (1 λ)F˜1 + λF2, let T := TA and T˜ = 1 A = 1 T˜1 + 1 A = 2 T2, where A = 1 with probability 1 −λ, A =2 with probability− λ. We have { } { } − E[2T Q] = (1 λ)E[2T1 Q A =1]+ λE[2T2 Q A = 2] − | | (1 λ)G (E[Q A = 1]) + λE[2T2 Q A = 2] ≥ − F1 | | T2 (1 λ)G ˜ (E[Q A = 1]) + λE[2 Q A = 2] ≥ − F1 | | (a) ˜ (1 λ)(1 ǫ)E[2T Q˜ A =1]+ λE[2T2 Q A = 2] ≥ − −˜ | | (1 ǫ)E[2T Q˜] ≥ − (1 ǫ)G(1 λ)F˜ +λF , ≥ − − 1 2 where in (a), we let Q˜ [0, 1] be a random variable such that E[Q˜ A = 1] = E[Q A = 1] and E[2T˜Q˜ A = 1] 1 ∈ | | | ≤ (1 ǫ)− G ˜ (E[Q A = 1]), and Q˜ = Q if A =2. Hence − F1 | E T G(1 λ)F1+λF2 = inf [2 Q] − Q [0,1], E[Q]=γ ∈ (1 ǫ)G(1 λ)F˜ +λF . ≥ − − 1 2 The result follows from letting ǫ 0. → D. Proof of the Claim in the Proof of Lemma 16 We prove the following claim: Claim 18. Let γ > 1, 1 γ λ := n 1 , p := . − γ 2γ 1   − 2 Let a Z 0 such that a nγ− . Let M Bin(n,p), N Poi(λ). Then we have N + a +1 stochastically dominates M (i.e., F∈ ≥ (t) F (t)≥for all t). ∼ ∼ N+a+1 ≤ M Proof: We first check that E[N]+ a = λ + a np = E[M]. We have ≥ 1 2 1 γ− + γ− − 1 γ + γ2 = − γ2 γ (2γ 1)(1 γ + γ2) − − ≥ 2γ 1 γ3  −    2γ3 3γ2 +3γ 1 = p − − γ3   (γ 1)3 = p 1+ − γ3   p. ≥ Hence, λ + a 1 n n 1 + ≥ − γ γ2  1 2 = n 1 γ− + γ− − np. ≥  We use the following bound in [77]: k F (k) Φ √2ng M ≥ B n    for k =1, 2,...,n 1, where Φ is the cdf of the standard Gaussian distribution, and − r 1 r g (r) := sign(r p) r ln + (1 r) ln − . B − p − 1 p r − We also use the following bound in [78]: F (k) Φ √2g (k + 1) N ≤ P for k =0, 1, 2,... (although [78] requires k 1, it is straightforward to check that the proof in [78] also works for k =0), where ≥ t gP (t) := sign(t λ) λ t + t ln . − r − λ Define

2 t 1 t 1 2 1 2 t γ− g(t) := sign(t p) t ln + (1 t) ln − sign t 1+ γ− γ− 1 γ + (t γ ) ln − − p − 1 p − − − − − − − e (1 γ 1) r − s − −  t 1 t λ a λ a t a sign(t p) t ln + (1 t) ln − sign t + t ln − n ≤ − p − 1 p − − n − n sn − n e λ r −     n  1 = g (t) g (nt a) . B − √n P −

We now check that FN+a+1(k) FM (k) for k = 0, 1, 2,.... This is obvious for k a since FN+a+1(0) = 0, and also obvious for k n since F (k)=1≤ . Hence, to check F (k) F (k), it suffices≤ to check that ≥ M N+a+1 ≤ M k g (k a) √ng , P − ≤ B n   2 for a +1 k n 1. Therefore, it suffices to check that g(t) 0 for γ−
