
Coding Theorems for Noisy Permutation Channels

Anuran Makur

Abstract—In this paper, we formally define and analyze the class of noisy permutation channels. The noisy permutation channel model constitutes a standard discrete memoryless channel (DMC) followed by an independent random permutation that reorders the output codeword of the DMC. While coding theoretic aspects of this model have been studied extensively, particularly in the context of reliable communication in network settings where packets undergo transpositions, and closely related models of DNA based storage systems have also been analyzed recently, we initiate an information theoretic study of this model by defining an appropriate notion of noisy permutation channel capacity. Specifically, on the achievability front, we prove a lower bound on the noisy permutation channel capacity of any DMC in terms of the rank of the stochastic matrix of the DMC. On the converse front, we establish two upper bounds on the noisy permutation channel capacity of any DMC whose stochastic matrix is strictly positive (entry-wise). Together, these bounds yield coding theorems that characterize the noisy permutation channel capacities of every strictly positive and "full rank" DMC, and our achievability proof yields a conceptually simple, computationally efficient, and capacity achieving coding scheme for such DMCs. Furthermore, we also demonstrate the relation between the well-known output degradation preorder over channels and noisy permutation channel capacity. In fact, the proof of one of our converse bounds exploits a degradation result that constructs a symmetric channel for any DMC such that the DMC is a degraded version of the symmetric channel. Finally, we illustrate some examples such as the special cases of binary symmetric channels and (general) erasure channels. Somewhat surprisingly, our results suggest that noisy permutation channel capacities are generally quite agnostic to the parameters that define the DMCs.

Index Terms—Permutation channel, channel capacity, degradation, second moment method, Doeblin minorization.

This work was presented in part at the 2020 IEEE International Symposium on Information Theory [1], and a very preliminary version of this work was presented in part at the 2018 56th Annual Allerton Conference on Communication, Control, and Computing [2]. A. Makur is with the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA (e-mail: a [email protected]). Copyright (c) 2020 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

CONTENTS

I Introduction
   I-A Related Literature and Motivation
   I-B Noisy Permutation Channel Model
   I-C Additional Notation
   I-D Outline
II Main Results
   II-A Achievability Bound
   II-B Converse Bounds
   II-C Strictly Positive and Full Rank Channels
   II-D Degradation and Noisy Permutation Channel Capacity
III Achievability and Converse Bounds
   III-A Auxiliary Lemmata
   III-B Achievability Bounds for DMCs
   III-C Converse Bounds for Strictly Positive DMCs
IV Noisy Permutation Channel Capacity
   IV-A Unit Rank Channels
   IV-B Permutation Transition Matrices
   IV-C Strictly Positive Channels
   IV-D Erasure Channels and Doeblin Minorization
V Conclusion
Appendix A: Proof of Proposition 1
Appendix B: Proof of Lemma 4
Appendix C: Proof of Lemma 5
Appendix D: Proof of Lemma 6
References

I. INTRODUCTION

In this paper, we initiate an information theoretic study of the problem of reliable communication through noisy permutation channels by defining and analyzing a pertinent notion of information capacity for such channels. Noisy permutation channels refer to discrete memoryless channels (DMCs) followed by independent random permutation transformations that are applied to the entire blocklength of the output codeword. Such channels can be perceived as models of communication links in networks where packets are not delivered in sequence, and hence, the ordering of the packets does not carry any information. Moreover, they also bear a close resemblance to recently introduced models of deoxyribonucleic acid (DNA) based storage systems. The main contributions of this work are the following:

1) We formalize the notion of "noisy permutation channel capacity" of a DMC in Definition 1, which captures, up to first order, the maximum number of messages that can be transmitted through a noisy permutation channel model with vanishing probability of error as the blocklength tends to infinity. (Although our formalism is quite natural, it has not appeared in the literature to our knowledge.)

2) We establish an achievability bound on the noisy permutation channel capacity of any DMC in terms of the rank of the DMC in Theorem 1 by analyzing a conceptually simple and computationally tractable randomized coding scheme. Moreover, we also demonstrate an alternative proof of our achievability bound for DMCs that have rank 2 in Proposition 2 by using the so called second moment method (in Lemmata 4 and 5).
3) We prove two converse bounds on the noisy permutation channel capacity of any DMC that is strictly positive (entry-wise). The first bound, in Theorem 2, is in terms of the output alphabet size of the DMC, and the second bound, in Theorem 3, is in terms of the "effective input alphabet" size of the DMC. (Neither bound is uniformly better than the other.)
4) Using the aforementioned achievability and converse bounds, we exactly characterize the noisy permutation channel capacity of all strictly positive and "full rank" DMCs in Theorem 4. Furthermore, we propound a candidate solution for the noisy permutation channel capacity of general strictly positive DMCs (regardless of their rank) in Conjecture 1.
5) To complement these results and assist in understanding them, we derive an intuitive monotonicity relation between the degradation preorder over channels and noisy permutation channel capacity in Theorem 5 (also see Theorem 6). Furthermore, we also construct symmetric channels that dominate given DMCs in the degradation sense in Proposition 1. This construction is utilized in the proof of Theorem 3.
6) Finally, we present exact characterizations of the noisy permutation channel capacities of several specific families of channels, e.g., binary symmetric channels in Proposition 5 (cf. [2, Theorem 3]), channels with unit rank transition kernels in Proposition 3, and channels with permutation matrices as transition kernels in Proposition 4. Furthermore, we present bounds on the noisy permutation channel capacities of (general) erasure channels in Proposition 6, and also propose related conjectures (see, e.g., Conjecture 2). In particular, although Theorem 1 yields our achievability bound for erasure channels, we show an alternative achievability proof in Proposition 6 by exploiting the classical notion of Doeblin minorization.

The ensuing subsections provide some background literature to motivate our study, a formal description of the noisy permutation channel model, some additional notation that will be used throughout the paper, and an outline of the remainder of the paper.

A. Related Literature and Motivation

The setting of channel coding with transpositions, where the output codeword undergoes some reordering of its symbols, has been widely studied in the coding theory, communication networks, and molecular and biological communications communities. We briefly discuss some relevant literature from these three disciplines, each of which provides a compelling incentive to study noisy permutation channels.

Firstly, in the literature, one earlier line of work concerned the construction of error-correcting codes that achieve capacity of the random deletion channel, cf. [3]–[5]. The random deletion channel operated on the codeword space by deleting each input codeword symbol independently with some probability p ∈ (0, 1), and copying it otherwise. As expounded in [4, Section I], with sufficiently large alphabet size 2^b, where each symbol of the alphabet was construed as a packet with b bits and b = Ω(log(n)) depended on the blocklength n, "embedding sequence numbers into the transmitted symbols [turned] the deletion channel [into a memoryless] erasure channel." Since coding for erasure channels was well-understood, the intriguing question became to construct (nearly) capacity achieving codes for the random deletion channel using sufficiently large packet length b (depending on n), but without embedding sequence numbers (see, e.g., [3], [4], and the references therein).¹ In particular, the author of [4] demonstrated that low density parity check (LDPC) codes with verification-based decoding formed a computationally tractable coding scheme with these properties. Notably, this coding scheme also tolerated transpositions of packets that were not deleted in the process of transmission. Therefore, it was equivalently a coding scheme for a memoryless erasure channel followed by a random permutation block, albeit with an alphabet size that grew polynomially with the blocklength. Broadly speaking, the results of [3], [4] can be perceived as preliminary steps towards analyzing the fundamental limits of reliable communication through noisy permutation channels where the DMCs are erasure channels. Several other coding schemes for erasure permutation channels with sufficiently large alphabet size have also been developed in the literature. We refer readers to [5], which builds upon the key conceptual ideas in [4], and the references therein for other examples of such coding schemes.

¹We also refer readers to the recent work [6], which proves that Reed-Muller codes achieve capacity for erasure channels, and the references therein.

Secondly, this discussion concerning the random deletion channel has a patent counterpart in the (closely related) communication networks literature. Indeed, in the context of the well-known store-and-forward transmission scheme for packet networks, packet losses (or deletions) were typically corrected using Reed-Solomon codes which assumed that each packet carried a header with a sequence number—see, e.g., [7], [8, Section I], and the references therein. Akin to the random deletion channel setting, this simplified the error correction problem since packet losses could be treated as erasures. However, "motivated by networks whose topologies change over time, or where several routes with unequal delays are available to transmit the data," the authors of [8] illustrated that packet errors and losses could also be corrected using binary codes under a channel model where the impaired or lost packets were randomly permuted, and the packets were not indexed with sequence numbers. Such work can also be construed as developing codes for specific kinds of noisy permutation channels.

In general, the noisy permutation channel model in subsection I-B is a simple and useful abstraction for point-to-point communication between a source and a receiver in various network settings. For instance, when information is transmitted using a lower level multipath routed network (see, e.g., Figure 1), the set of all possible packets makes up the channel input alphabet with each packet representing a different symbol (as mentioned earlier), and any context specific packet impairments are represented by the DMC in the model. Furthermore, since the packets (or symbols) can take different paths to the receiver in such a network, they may arrive at the destination out-of-order due to different delay profiles in the different paths. This out-of-order delivery of packets is captured by the random permutation transformation in the model. Several other aspects of noisy permutation channels have also been investigated in the communication networks literature. For example, the authors of [9] established rate-delay tradeoffs for multipath routed networks, although they neglected to account for packet impairments, such as deletions, in their analysis for simplicity.

Fig. 1. Illustration of point-to-point communication between a sender and a receiver through a mobile ad hoc network (MANET).

More recently, inspired by packet networks such as mobile ad hoc networks (where the network topology changes over time)—see Figure 1, and heavily loaded datagram-based networks (where packets are often re-routed for load balancing purposes), the authors of [10]–[12] have considered the general problem of coding in channels where the codeword undergoes a random permutation and is subjected to impairments such as insertions, deletions, substitutions, and erasures. As stated in [10, Section I], the basic strategy to reliably communicate across a channel that applies a transformation to its codewords is to "encode the information in an object that is invariant under the given transformation." In the case of noisy permutation channels, the appropriate codes are the so called multiset codes, where the codewords are characterized by their empirical distribution over the underlying alphabet. The existence of certain perfect multiset codes is established in [11], and several other multiset code constructions based on lattices and Sidon sets are analyzed in [12].

Thirdly, an alternative motivation for analyzing noisy permutation channels stems from research at the intersection of computational biology and information theory on DNA based storage systems, cf. [13]–[17]. For example, the authors of [15] examined the storage capacity of systems where the source is encoded using DNA molecules. In their model, source data was encoded into codeword strings (or DNA molecules) consisting of letters from an alphabet of four nucleobases, and short fragments of these codewords were then cached in an unordered fashion akin to the effect of the random permutation in our noisy permutation channel model. The receiver read the encoded data by shotgun sequencing, or equivalently, by randomly sampling the stored and unordered fragments. While the unordered caching aspect of this model resembles our model, as stated in [12, Section I-B], this storage model also differs from our model since the receiver samples the stored codewords with replacement and without errors.

A very closely related DNA based storage model to [15], known as the noisy shuffling channel, is investigated in [17]. Specifically, in order to represent the corruption of DNA molecules during "synthesis, sequencing, and... storage," the authors of [17] studied the storage capacity of a model where DNA codewords first experienced the deleterious effects of a DMC (e.g., a binary symmetric channel), and were then fragmented, and subsequently, the fragments were randomly permuted. (Unlike [15], the receiver had access to all the permuted fragments in this model for simplicity.) The DNA based storage model in [17] is much closer to our noisy permutation channel model than the model in [15]. However, in contrast to our model, both [15] and [17] assume that the lengths of the codeword fragments (which are permuted) grow logarithmically with the number of fragments. We refer readers to [13] for a broader overview of DNA based storage systems, and to [14], [16], and the references therein for other examples of codes for such systems. Moreover, we also refer readers to the comprehensive bibliography in [12] for further related literature on noisy permutation channels.

Finally, it is worth re-emphasizing that the noisy permutation channel model in subsection I-B may be regarded as a variant or generalization of the models described above. More precisely, the analysis in [3], [4] pertains to erasure permutation channels where the alphabet size grows polynomially with the blocklength, the work in [8], [10]–[12] is concerned with various codes for specific noisy permutation channels, and the focus of [17] is on noisy shuffling channels that randomly permute fragments of codewords whose lengths scale logarithmically with the number of fragments. In comparison, our results on noisy permutation channels in this paper consider much broader classes of DMCs, and assume that alphabet sizes are constant with respect to the blocklength, or alternatively, that fragment lengths are constant with respect to the number of fragments. (Note that this latter assumption ensures that we cannot add sequence numbers to packets in order to transform our problem into one of classical coding.)

Furthermore, as the discussion heretofore reveals, the majority of the literature on noisy permutation channels analyzes its coding theoretic aspects. In contrast, we approach these channels from a purely information theoretic perspective. To our knowledge, such a systematic analysis has not been undertaken until now, and thus, there are no known results on the information capacity of the noisy permutation channel model described in the next subsection. (Indeed, while the aforementioned references [3], [15], and [17] have a more information theoretic focus, they analyze different models to ours.)


Fig. 2. Illustration of a communication system with a DMC followed by a random permutation.

In this paper, we will take some first steps towards a complete understanding of the information capacity of noisy permutation channels. Rather interestingly, our main achievability proof will automatically yield computationally tractable codes for reliable communication through certain noisy permutation channels, thereby rendering the need to develop conceptually sophisticated coding schemes for these channels futile when (theoretically) achieving noisy permutation channel capacity is the sole objective.

B. Noisy Permutation Channel Model

We define the point-to-point noisy permutation channel model in analogy with standard information theoretic definitions, cf. [18, Section 7.5]. Let n ∈ N ≜ {1, 2, 3, . . .} denote a fixed blocklength, M ∈ M ≜ {1, . . . , |M|} be a message that is drawn uniformly from the message set M, f_n : M → X^n be a (possibly randomized) encoder, where X is the finite input alphabet of the channel with |X| ≥ 2, and g_n : Y^n → M ∪ {e} be a (possibly randomized) decoder, where Y is the finite output alphabet of the channel with |Y| ≥ 2 and e denotes an additional "error symbol." The message M is first encoded into a codeword X_1^n = f_n(M), where each X_i ∈ X, and we use the notation X_i^j ≜ (X_i, . . . , X_j) for i < j. This codeword is transmitted through a (stationary) discrete memoryless channel defined by the conditional probability distributions {P_{Z|X}(·|x) ∈ P_Y : x ∈ X} to produce Z_1^n ∈ Y^n, where each Z_i ∈ Y, and P_Y denotes the probability simplex in R^{|Y|} of all probability distributions on Y. Note that in later sections, we will often treat a DMC P_{Z|X} as a row stochastic transition probability matrix P_{Z|X} ∈ R^{|X|×|Y|} whose rows are given by {P_{Z|X}(·|x) ∈ P_Y : x ∈ X}, and vice versa, since the two perspectives are equivalent. (In particular, for every x ∈ X and y ∈ Y, the conditional probability P_{Z|X}(y|x) is also the (x, y)th element of the matrix P_{Z|X}, i.e., P_{Z|X}(y|x) = [P_{Z|X}]_{x,y} using the notation in subsection I-C. Likewise, for every x ∈ X, the conditional distribution P_{Z|X}(·|x) ∈ P_Y forms the xth row of the matrix P_{Z|X}.) The memorylessness property of the DMC implies that

P_{Z_1^n|X_1^n}(z_1^n | x_1^n) = \prod_{i=1}^{n} P_{Z|X}(z_i | x_i)    (1)

for every x_1^n ∈ X^n and every z_1^n ∈ Y^n. The noisy codeword Z_1^n is then passed through an independent random permutation transformation to generate Y_1^n ∈ Y^n. Specifically, the random permutation channel Π ≜ {Π(·|z_1^n) = P_{Y_1^n|Z_1^n}(·|z_1^n) : z_1^n ∈ Y^n} is defined as

Π(y_1^n | z_1^n) = P_{Y_1^n|Z_1^n}(y_1^n | z_1^n) = \frac{1}{n!} \sum_{λ ∈ S_n} 1{∀i ∈ {1, . . . , n}, y_{λ(i)} = z_i}    (2)

for every y_1^n, z_1^n ∈ Y^n, where the sum is over all permutations λ : {1, . . . , n} → {1, . . . , n} in the symmetric group S_n over the set {1, . . . , n}, and 1{·} is the indicator function defined in subsection I-C. Alternatively, we can describe the action of Π in (2) as follows:
1) First, randomly draw a bijection (or permutation) λ : {1, . . . , n} → {1, . . . , n} uniformly, and independently of everything else, from the symmetric group S_n over {1, . . . , n},
2) Then, generate Y_1^n from Z_1^n using the permutation λ so that Y_{λ(i)} = Z_i for all i ∈ {1, . . . , n}.
Throughout this paper, we will refer to random permutation channels on different alphabets, such as the one defined above, as "random permutations" without any further clarification. Finally, the received codeword Y_1^n is decoded to produce an estimate M̂ = g_n(Y_1^n) of M. Figure 2 illustrates this communication system.

Let the average probability of error in this model be

P_error^{(n)} ≜ P(M ≠ M̂) ,    (3)

where we assume that any decoder g_n always makes an error when it outputs the error symbol e.² The "rate" of the encoder-decoder pair (f_n, g_n) is defined as

R ≜ \frac{log(|M|)}{log(n)} ,    (4)

where log(·) is the binary logarithm (with base 2) throughout this paper, and all Shannon entropy H(·), mutual information I(·;·), and Kullback-Leibler (KL) divergence (or relative entropy) D(·||·) terms are measured in bits.³ So, we can also write |M| = n^R. (Strictly speaking, n^R should be an integer, but we will often neglect this detail since it will not affect our results.) We will say that a rate R ≥ 0 is achievable if there exists a sequence of encoder-decoder pairs {(f_n, g_n)}_{n∈N} such that lim_{n→∞} P_error^{(n)} = 0.

²Under an average probability of error criterion, the sequence of encoders f_n that minimize P_error^{(n)} are deterministic, and the corresponding sequence of decoders g_n that minimize P_error^{(n)} are the maximum a posteriori decoders (or maximum likelihood decoders, since M is uniformly distributed), which are also deterministic without loss of generality [19, Section 16.2.1]. In contrast, under a maximal probability of error criterion, randomized encoders and decoders can be useful [19, Section 16.2.1].
³The notion of rate defined in (4) is analogous to the so called third-order coding rate in the finite blocklength analysis literature; see, e.g., [20].
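To make the model concrete, the following minimal Python sketch simulates one use of a noisy permutation channel: the DMC acts componentwise as in (1), and the output block is then shuffled by a uniformly random permutation as in (2). The 3×3 matrix P and all parameter values are assumed purely for illustration and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_permutation_channel(x, P, rng):
    """One use of the noisy permutation channel on codeword x: a DMC with
    row stochastic matrix P, as in (1), followed by a uniformly random
    permutation of the entire block, as in (2)."""
    z = np.array([rng.choice(P.shape[1], p=P[xi]) for xi in x])
    return rng.permutation(z)

# Illustrative strictly positive DMC with |X| = |Y| = 3 (assumed example).
P = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
x = rng.integers(0, 3, size=10)            # a blocklength-10 codeword
print(noisy_permutation_channel(x, P, rng))
```

Since the permutation destroys all ordering, only the empirical distribution of the output carries information, which is precisely why the rate normalization in (4) uses log(n).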

Lastly, we operationally define the noisy permutation channel capacity as follows.

Definition 1 (Noisy Permutation Channel Capacity). For any DMC P_{Z|X}, its noisy permutation channel capacity is given by

C_perm(P_{Z|X}) ≜ sup{R ≥ 0 : R is achievable} .

It is straightforward to verify that the scaling in (4) is indeed log(n) rather than the standard n. As mentioned earlier, due to the independent random permutation in the model, all information embedded in the ordering within codewords is lost. (In fact, canonical fixed composition codes cannot carry more than one message in this setting.) So, the maximum number of decodable messages is (intuitively) upper bounded by the number of possible empirical distributions of Y_1^n, i.e.,

n^R = |M| ≤ \binom{n + |Y| − 1}{|Y| − 1} ≤ (n + 1)^{|Y|−1} ,    (5)

where taking log's and letting n → ∞ yields C_perm(P_{Z|X}) ≤ |Y| − 1 (at least non-rigorously). This justifies that log(n) is the correct scaling in (4), i.e., the maximum number of messages that can be reliably communicated is polynomial in the blocklength (rather than exponential).

C. Additional Notation

In this subsection, we define some additional notation that will be utilized throughout the paper. We begin with some probabilistic notation. The standard expressions P(·), E[·], and VAR(·) represent the probability, expectation, and variance operators, where the underlying probability measures will be clear from context. Moreover, we will write X ∼ P_X when the random variable X has probability law P_X. We let 1{·} denote the indicator function, which equals 1 if its input proposition is true and 0 otherwise. Given any sequence x_1^n ∈ X^n with n ∈ N, we define the empirical distribution (histogram or type) of x_1^n as

\hat{P}_{x_1^n} = (\hat{P}_{x_1^n}(x′) : x′ ∈ X) ∈ P_X ,    (6)

where \hat{P}_{x_1^n} is a probability distribution on X, and for every x′ ∈ X,

\hat{P}_{x_1^n}(x′) ≜ \frac{1}{n} \sum_{i=1}^{n} 1{x_i = x′} .    (7)

For convenience, we will use the notation

\binom{n}{n\hat{P}_{x_1^n}} ≜ \frac{n!}{\prod_{x′∈X} (n\hat{P}_{x_1^n}(x′))!}    (8)

for the multinomial coefficient. Furthermore, for any k ∈ N and p ∈ [0, 1], we let Ber(p) denote a Bernoulli distribution with success probability p, and bin(k, p) denote a binomial distribution with k trials and success probability p.

Next, we introduce some linear algebraic notation. Fix any m, n ∈ N. Given any matrix A ∈ R^{m×n}, we let [A]_{i,j} denote the (i, j)th element of A, ‖A‖_op denote the operator or spectral norm of A (which is the largest singular value of A), σ_min(A) denote the smallest of the min{m, n} singular values of A, rank(A) denote the rank of A, A^T ∈ R^{n×m} denote the transpose or adjoint of A, A^† ∈ R^{n×m} denote the Moore-Penrose pseudoinverse of A, and A^{−1} ∈ R^{n×n} denote the inverse of A when m = n and A is non-singular. Furthermore, when the rows of A are linearly independent, then A^† = A^T (A A^T)^{−1} is a right inverse of A such that A A^† = I, where I is the identity matrix of appropriate dimension. For any row stochastic matrix A ∈ R^{m×n}, we let ext(A) denote the number of extreme points of the convex hull of the rows of A, and it is straightforward to verify that

rank(A) ≤ ext(A) ≤ m .    (9)

In the sequel, we refer to a row stochastic matrix A as full rank⁴ if rank(A) = min{ext(A), n}, and strictly positive if the elements of A are all strictly positive.

⁴This is in contrast to standard usage where A is said to be "full rank" if rank(A) = min{m, n}. Our alternative usage of the phrase "full rank" is motivated by information theoretic contexts, such as in the proof of Theorem 3 in subsection III-C, where the effective number of rows (or input alphabet) of a row stochastic matrix (or channel) A can often be reduced to ext(A) due to the convexity of KL divergence. The resulting sub-matrix, which has ext(A) rows, is full rank in the standard sense when rank(A) = min{ext(A), n}.

Finally, we present some miscellaneous analysis notation. We let the customary ‖·‖_p notation denote the L^p-norm for p ∈ [1, ∞]. We let exp(·) denote the natural exponential function (with base e), and ⌊·⌋ denote the floor function. Throughout this paper, we will use the standard Bachmann-Landau asymptotic notation, e.g., O(·), Θ(·), o(·), and ω(·), with the understanding that the parameter n → ∞ and all other parameters are held constant with respect to n.

D. Outline

In closing section I, we briefly delineate the organization of the rest of this paper. In section II, we present all of our main results, which were described at the outset of section I. In section III, we prove our main achievability and converse bounds using several auxiliary lemmata. Then, we illustrate several examples of noisy permutation channel capacities for different families of channels in section IV. Furthermore, we also establish the connection between the degradation preorder over channels and noisy permutation channel capacity in section IV. Finally, we conclude our discussion and propose future research directions in section V. On a separate note, it is worth mentioning that throughout this paper, theorems, propositions, and lemmata are stated according to the following convention: If the result is known in the literature, we provide references in the header, and if the result is new, we (obviously) do not provide any references.

II. MAIN RESULTS

In this section, we present our main results under the setup of subsection I-B, very briefly mention the important ideas in the corresponding proofs, and discuss any related literature where appropriate.

A. Achievability Bound

Our first main result is a lower bound on the noisy permutation channel capacity of any DMC in terms of the rank of the DMC.

Theorem 1 (Achievability Bound). The noisy permutation channel capacity of a DMC P_{Z|X} is lower bounded by

C_perm(P_{Z|X}) ≥ \frac{rank(P_{Z|X}) − 1}{2} .

Theorem 1 is proved in subsection III-B using a simple (randomized) code which enables a basic concentration of measure inequality based argument. We also present an alternative proof of Theorem 1 for the special case of row stochastic matrices with rank 2 in subsection III-B, which employs the so called second moment method for total variation distance.

B. Converse Bounds

Our second main result is an upper bound on the noisy permutation channel capacity of any strictly positive DMC in terms of the output alphabet size of the DMC.

Theorem 2 (Converse Bound I). The noisy permutation channel capacity of a strictly positive DMC P_{Z|X}, which means that P_{Z|X}(y|x) > 0 for all x ∈ X and y ∈ Y, is upper bounded by

C_perm(P_{Z|X}) ≤ \frac{|Y| − 1}{2} .

Theorem 2 is established in subsection III-C. The proof of Theorem 2 uses a Fano's inequality argument followed by a careful application of a central limit theorem (CLT) based approximation of the entropy of a binomial random variable. Intuitively, we also expect to have a converse bound in terms of the input alphabet size, because when |X| is much smaller than |Y|, there are at most O(n^{|X|−1}) distinguishable empirical distributions (rather than O(n^{|Y|−1}), as suggested by (5)). Our third main result addresses this intuition by providing an alternative upper bound on the noisy permutation channel capacity of any strictly positive DMC in terms of the number of extreme points of the convex hull of the conditional probability distributions defining the DMC.

Theorem 3 (Converse Bound II). The noisy permutation channel capacity of a strictly positive DMC P_{Z|X} is upper bounded by

C_perm(P_{Z|X}) ≤ \frac{ext(P_{Z|X}) − 1}{2} .

Theorem 3 is also proved in subsection III-C. Its proof layers a degradation argument, based on Proposition 1 (which will be presented in due course), over the derivation of Theorem 2. We remark that the quantity ext(P_{Z|X}) can be perceived as an "effective input alphabet" size. Indeed, as elucidated in the proof of Theorem 3, the input alphabet of P_{Z|X} can be reduced to a subset of X corresponding to the extreme points of the convex hull of the rows of P_{Z|X} without loss of generality (due, essentially, to the convexity of KL divergence).

Together, the bounds in Theorems 2 and 3 yield the following corollary that for any strictly positive DMC P_{Z|X},

C_perm(P_{Z|X}) ≤ \frac{min{ext(P_{Z|X}), |Y|} − 1}{2} .    (10)

On the other hand, for a general DMC P_{Z|X}, which may have zero entries, we can show that

C_perm(P_{Z|X}) ≤ min{ext(P_{Z|X}), |Y|} − 1 .    (11)

To see this, note that the bound C_perm(P_{Z|X}) ≤ |Y| − 1 is already intuitively justified by (5), and a rigorous argument follows along the same lines as the converse proof in subsection IV-B. Moreover, the bound C_perm(P_{Z|X}) ≤ ext(P_{Z|X}) − 1 can be established by following the proof of Theorem 3 in subsection III-C. (Indeed, the derivation of (65) in this proof also holds for DMCs P_{Z|X} with zero entries, in which case, P_{Z̃|X̃} is the identity channel. The converse proof in subsection IV-B can then be applied to yield the desired bound.) We omit these proofs for the sake of brevity.

C. Strictly Positive and Full Rank Channels

Theorem 1 and (10) portray that for any strictly positive DMC P_{Z|X}, the noisy permutation channel capacity satisfies the bounds

\frac{rank(P_{Z|X}) − 1}{2} ≤ C_perm(P_{Z|X}) ≤ \frac{min{ext(P_{Z|X}), |Y|} − 1}{2} .    (12)

Based on the inequalities in (12), we now state (perhaps) the most important result of this paper, which characterizes the noisy permutation channel capacity of the family of strictly positive and full rank channels.

Theorem 4 (C_perm of Strictly Positive and Full Rank Channels). The noisy permutation channel capacity of a strictly positive and full rank DMC P_{Z|X} with rank r ≜ rank(P_{Z|X}) = min{ext(P_{Z|X}), |Y|} is given by

C_perm(P_{Z|X}) = \frac{r − 1}{2} .

Proof. Recalling the definition of "full rank" from subsection I-C, this is an immediate corollary of (12) (i.e., of Theorems 1, 2, and 3). ∎

D. Degradation and Noisy Permutation Channel Capacity

To complement the aforementioned results, we next present another main result that relates the notion of noisy permutation channel capacity with the so called (output) degradation preorder over channels, which was defined in information theory to study broadcast channels in [21], [22]. (It is worth mentioning that in this paper, we are concerned with the notion of stochastic degradation as opposed to physical degradation, cf. [23, Section 5.4].)

Definition 2 (Degradation Preorder). For any two DMCs (or row stochastic matrices) P_{Z_1|X} ∈ R^{|X|×|Z_1|} and P_{Z_2|X} ∈ R^{|X|×|Z_2|} with common input alphabet X and output alphabets Z_1 and Z_2, respectively, we say that P_{Z_2|X} is a degraded version of P_{Z_1|X} if P_{Z_2|X} = P_{Z_1|X} P_{Z_2|Z_1} for some channel P_{Z_2|Z_1} ∈ R^{|Z_1|×|Z_2|}.

The degradation preorder has a long and intriguing history that is worth elaborating on. Its study actually originated in the statistics literature [24]–[26], where it is also known as the Blackwell order. Indeed, the channels P_{Z_1|X} and P_{Z_2|X} can be construed as statistical experiments (or observation models) of the parameter space X. In this statistical decision theoretic context, the celebrated Blackwell-Sherman-Stein theorem states that P_{Z_2|X} is a degraded version of P_{Z_1|X} if and only if for every prior distribution P_X ∈ P_X, and every real-valued loss function with domain X × X, the minimum Bayes risk corresponding to P_{Z_1|X} is less than or equal to the minimum Bayes risk corresponding to P_{Z_2|X} [24]–[26] (also see [27] for a simple proof of this result using the separating hyperplane theorem). Furthermore, degradation has beautiful ties with non-Bayesian binary hypothesis testing as well. When |X| = 2, the channels P_{Z_1|X} and P_{Z_2|X} can be construed as dichotomies of likelihoods, and it can be shown that P_{Z_2|X} is a degraded version of P_{Z_1|X} if and only if the Neyman-Pearson function, or receiver operating characteristic curve, of P_{Z_1|X} dominates the Neyman-Pearson function of P_{Z_2|X} pointwise (cf. [28, Theorem 5.3] and [29, Section 9.3], where equivalent characterizations using f-divergences and majorization are also given). Moreover, for the special case where P_{Z_1|X} and P_{Z_2|X} are binary input symmetric channels, other majorization and stochastic domination based characterizations of degradation can be found in [30, Sections 4.1.14–4.1.16]. Finally, we note that degradation is also equivalent to the notion of matrix majorization in [31, Chapter 15, Definition C.8] (also see [32] and [33]). We refer readers to the author's doctoral thesis [34, Section 3.1.1] and [35, Section I-B] for further discussion and references.

The next theorem conveys an intuitive comparison result that if one DMC dominates another DMC in the degradation sense, then the noisy permutation channel capacity of the dominating DMC is larger than the noisy permutation channel capacity of the degraded DMC.

Theorem 5 (Comparison Bound via Degradation). Consider any two DMCs P_{Z_1|X} ∈ R^{|X|×|Z_1|} and P_{Z_2|X} ∈ R^{|X|×|Z_2|}, with common input alphabet X and output alphabets Z_1 and Z_2, respectively. If P_{Z_2|X} is a degraded version of P_{Z_1|X}, then we have

C_perm(P_{Z_2|X}) ≤ C_perm(P_{Z_1|X}) .

Theorem 5 is derived in subsection IV-D. As with the setting of traditional channel capacity, the proof of Theorem 5 proceeds by verifying that a noisy permutation channel capacity achieving coding scheme for P_{Z_2|X} can be used to achieve the same rate and vanishing probability of error when communicating through P_{Z_1|X}. Furthermore, a specialization of Theorem 5 for erasure channels turns out to correspond to the concept of Doeblin minorization, and we use this connection in subsection IV-D to provide an alternative achievability bound on the noisy permutation channel capacity of erasure channels.

Lastly, while we are on the topic of degradation, we present another seemingly disparate result which constructs symmetric channels that dominate given DMCs in the degradation sense. To state this result, we first recall the definition of symmetric channels, cf. [35, Equation (10)].

Definition 3 (q-ary Symmetric Channel). Under the formalism presented in subsection I-B, we define a q-ary symmetric channel with total crossover probability δ ∈ [0, 1], and input and output alphabet X = Y with |X| = q ∈ N\{1}, denoted q-SC(δ), using the doubly stochastic matrix

S_δ ≜ \begin{bmatrix} 1−δ & \frac{δ}{q−1} & \cdots & \frac{δ}{q−1} & \frac{δ}{q−1} \\ \frac{δ}{q−1} & 1−δ & \cdots & \frac{δ}{q−1} & \frac{δ}{q−1} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ \frac{δ}{q−1} & \frac{δ}{q−1} & \cdots & 1−δ & \frac{δ}{q−1} \\ \frac{δ}{q−1} & \frac{δ}{q−1} & \cdots & \frac{δ}{q−1} & 1−δ \end{bmatrix} ∈ R^{q×q}    (13)

which has 1 − δ along its principal diagonal, and δ/(q−1) in all other entries. (The rows and columns of S_δ are both indexed consistently by X.)

We note that in the special case where q = 2, X = Y = {0, 1}, and δ is the probability that the input bit flips, we refer to the 2-SC(δ) as a binary symmetric channel (BSC), denoted BSC(δ).

The ensuing proposition portrays a sufficient condition for degradation by q-ary symmetric channels.

Proposition 1 (Degradation by Symmetric Channels). Suppose we are given a DMC (or row stochastic matrix) P_{Z|X} ∈ R^{|X|×|Y|} with minimum entry

ν = min_{x∈X, y∈Y} P_{Z|X}(y|x) ,

and a q-ary symmetric channel, q-SC(δ), which has a common input alphabet X such that |X| = q. If the total crossover probability parameter satisfies

0 ≤ δ ≤ \frac{ν}{1 − ν + \frac{ν}{q−1}} ,

then P_{Z|X} is a degraded version of q-SC(δ).

Proposition 1 is proved in appendix A. Although it appears to be unrelated to our thrust towards understanding noisy permutation channel capacity, it turns out to be indispensable in the proof of Theorem 3. We state Proposition 1 here as a main result because we believe it can have many applications in information theory and statistics beyond the context of noisy permutation channels. We refer readers to [35], [36] for further insight regarding the value of studying channel domination by symmetric channels. It is also worth making a few remarks about related results in the literature. Indeed, Proposition 1 establishes a result analogous to [35, Theorem 2] (also see [36, Theorem 2]) that holds for general rectangular row stochastic matrices P_{Z|X} (rather than square row stochastic matrices as in [35, Theorem 2]). However, Proposition 1 is weaker than [35, Theorem 2] for square row stochastic matrices (i.e., the upper bound on δ in [35, Theorem 2] is larger than that in Proposition 1 when q > 2), because the proof of [35, Theorem 2] exploits more sophisticated majorization arguments. We also remark that other sufficient conditions for degradation of square row stochastic matrices (or Markov kernels) by q-ary symmetric channels, which either use more information than the minimum entries of the matrices (see [37, Proposition 8.1, Equations (8.1) and (8.2)]), or assume further structure on the matrices such as additive noise over Abelian groups (see [35, Theorem 3, Proposition 10] or [36, Theorem 3]), have been derived in the literature.
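The quantities entering the bounds (12) are easy to compute for a concrete channel. The sketch below evaluates rank(P_{Z|X}) with numpy and ext(P_{Z|X}) by testing, via a small feasibility linear program, whether each row lies in the convex hull of the remaining rows; this LP test is our own illustrative device (assuming distinct rows), not a construction used in the paper.

```python
import numpy as np
from scipy.optimize import linprog

def ext_rows(P):
    """ext(P): number of extreme points of the convex hull of the rows
    of P (assumes the rows of P are distinct)."""
    m = P.shape[0]
    ext = 0
    for i in range(m):
        others = np.delete(P, i, axis=0)
        # Feasibility LP: can row i be written as a convex combination
        # of the other rows?
        A_eq = np.vstack([others.T, np.ones((1, m - 1))])
        b_eq = np.append(P[i], 1.0)
        res = linprog(np.zeros(m - 1), A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0, None)] * (m - 1))
        if not res.success:          # infeasible => row i is extreme
            ext += 1
    return ext

P = np.array([[0.7, 0.2, 0.1],       # assumed strictly positive DMC
              [0.1, 0.7, 0.2],
              [0.2, 0.1, 0.7]])
r, e = np.linalg.matrix_rank(P), ext_rows(P)
print((r - 1) / 2)                            # Theorem 1 lower bound
print((min(e, P.shape[1]) - 1) / 2)           # upper bound (10)
# Here r = e = |Y| = 3, so P is full rank and C_perm = 1 by Theorem 4.
```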


Fig. 3. Illustration of a communication system with a random permutation followed by a DMC.
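In the same computational spirit, Proposition 1 above admits a direct numerical check: for δ ≠ (q−1)/q the matrix S_δ is invertible, so P_{Z|X} is a degraded version of q-SC(δ) exactly when W = S_δ^{−1} P_{Z|X} is entry-wise nonnegative (its rows then automatically sum to 1, so W is a valid channel as in Definition 2). A minimal sketch with an assumed example channel:

```python
import numpy as np

def symmetric_channel(q, delta):
    """The q-ary symmetric channel matrix S_delta from (13)."""
    S = np.full((q, q), delta / (q - 1))
    np.fill_diagonal(S, 1.0 - delta)
    return S

def degraded_from_qsc(P, delta):
    """True iff P = S_delta @ W for some channel W, i.e., iff P is a
    degraded version of q-SC(delta); requires |X| = q rows in P."""
    W = np.linalg.solve(symmetric_channel(P.shape[0], delta), P)
    return bool(np.all(W >= -1e-12))

P = np.array([[0.5, 0.3, 0.2],       # assumed example with nu = 0.2, q = 3
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])
nu = P.min()
delta_max = nu / (1.0 - nu + nu / (P.shape[0] - 1))   # threshold in Prop. 1
print(delta_max, degraded_from_qsc(P, delta_max))     # ~0.2222, True
```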

III. ACHIEVABILITY AND CONVERSE BOUNDS

We prove the achievability result in Theorem 1 and the converse results in Theorems 2 and 3 in this section. We commence by presenting some useful lemmata in subsection III-A, and then proceed to establishing the aforementioned theorems in subsections III-B and III-C, respectively.

A. Auxiliary Lemmata

First, to establish our converse bounds in Theorems 2 and 3, we will present two lemmata. The first lemma we will exploit is the following useful estimate of the entropy of a binomial distribution from the literature.

Lemma 1 (Approximation of Binomial Entropy [38, Equation (7)]). Given a binomial random variable X ∼ bin(n, p) with n ∈ N and p ∈ (0, 1), we have

| H(X) − \frac{1}{2} log(2πenp(1 − p)) | ≤ \frac{c(p)}{n}

for some constant c(p) ≥ 0 (that depends on p).

The second lemma we will utilize illustrates that swapping the order of the DMC and the random permutation block in the communication system in Figure 2 produces the statistically equivalent communication system in Figure 3.

Lemma 2 (Equivalent Model). Consider the channel P_{W_1^n|X_1^n} shown in Figure 3, where the input codeword X_1^n ∈ X^n passes through an independent random permutation to produce V_1^n ∈ X^n, and V_1^n then passes through a DMC P_{W|V} to produce the output codeword W_1^n ∈ Y^n so that (much like (1))

∀v_1^n ∈ X^n, w_1^n ∈ Y^n, P_{W_1^n|V_1^n}(w_1^n | v_1^n) = \prod_{i=1}^{n} P_{W|V}(w_i | v_i) .

If the DMC P_{W|V} is equal to the DMC P_{Z|X} entry-wise, i.e.,

∀x ∈ X, z ∈ Y, P_{W|V}(z|x) = P_{Z|X}(z|x) ,    (14)

then the channel P_{W_1^n|X_1^n} is equivalent to the channel P_{Y_1^n|X_1^n} (described in subsection I-B and Figure 2), i.e.,

∀x_1^n ∈ X^n, y_1^n ∈ Y^n, P_{W_1^n|X_1^n}(y_1^n | x_1^n) = P_{Y_1^n|X_1^n}(y_1^n | x_1^n) .

Proof. This follows from direct calculation. Fix any x_1^n ∈ X^n and y_1^n ∈ Y^n. Observe that

P_{W_1^n|X_1^n}(y_1^n | x_1^n) = \sum_{v_1^n ∈ X^n : \hat{P}_{v_1^n} = \hat{P}_{x_1^n}} P_{W_1^n|V_1^n}(y_1^n | v_1^n) P_{V_1^n|X_1^n}(v_1^n | x_1^n)
= \binom{n}{n\hat{P}_{x_1^n}}^{-1} \sum_{v_1^n ∈ X^n : \hat{P}_{v_1^n} = \hat{P}_{x_1^n}} \prod_{i=1}^{n} P_{W|V}(y_i | v_i)
= \binom{n}{n\hat{P}_{x_1^n}}^{-1} \sum_{v_1^n ∈ X^n : \hat{P}_{v_1^n} = \hat{P}_{x_1^n}} \prod_{i=1}^{n} P_{Z|X}(y_i | v_i)
= \binom{n}{n\hat{P}_{y_1^n}}^{-1} \binom{n}{n\hat{P}_{x_1^n}}^{-1} \sum_{\tilde{y}_1^n ∈ Y^n : \hat{P}_{\tilde{y}_1^n} = \hat{P}_{y_1^n}} \sum_{v_1^n ∈ X^n : \hat{P}_{v_1^n} = \hat{P}_{x_1^n}} \prod_{i=1}^{n} P_{Z|X}(\tilde{y}_i | v_i) ,    (15)

where the first equality uses the Markov property X_1^n → V_1^n → W_1^n, the third equality follows from (14), and the fourth equality holds because

P_{W_1^n|X_1^n}(y_1^n | x_1^n) = P_{W_1^n|X_1^n}(\tilde{y}_1^n | x_1^n)

for every \tilde{y}_1^n ∈ Y^n that is a permutation of y_1^n, which follows from the expression in the third equality. Likewise, we have

P_{Y_1^n|X_1^n}(y_1^n | x_1^n) = \sum_{z_1^n ∈ Y^n : \hat{P}_{z_1^n} = \hat{P}_{y_1^n}} P_{Y_1^n|Z_1^n}(y_1^n | z_1^n) P_{Z_1^n|X_1^n}(z_1^n | x_1^n)
= \binom{n}{n\hat{P}_{y_1^n}}^{-1} \sum_{z_1^n ∈ Y^n : \hat{P}_{z_1^n} = \hat{P}_{y_1^n}} \prod_{i=1}^{n} P_{Z|X}(z_i | x_i)
= \binom{n}{n\hat{P}_{x_1^n}}^{-1} \binom{n}{n\hat{P}_{y_1^n}}^{-1} \sum_{\tilde{x}_1^n ∈ X^n : \hat{P}_{\tilde{x}_1^n} = \hat{P}_{x_1^n}} \sum_{z_1^n ∈ Y^n : \hat{P}_{z_1^n} = \hat{P}_{y_1^n}} \prod_{i=1}^{n} P_{Z|X}(z_i | \tilde{x}_i) ,    (16)

where the first equality uses the Markov property X_1^n → Z_1^n → Y_1^n, and the third equality holds because

P_{Y_1^n|X_1^n}(y_1^n | x_1^n) = P_{Y_1^n|X_1^n}(y_1^n | \tilde{x}_1^n)

for every \tilde{x}_1^n ∈ X^n that is a permutation of x_1^n, which follows from the expression in the second equality. Therefore, using (15) and (16), we have

P_{W_1^n|X_1^n}(y_1^n | x_1^n) = P_{Y_1^n|X_1^n}(y_1^n | x_1^n) ,

which completes the proof. ∎
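Lemma 2 can also be sanity-checked by simulation: permuting before the DMC or after it should induce the same distribution on output sequences. A minimal Monte Carlo comparison (with an assumed binary DMC and a fixed length-3 input) is sketched below.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
P = np.array([[0.9, 0.1],            # assumed DMC, P_{W|V} = P_{Z|X}
              [0.2, 0.8]])
x = np.array([0, 0, 1])              # fixed input codeword, n = 3

def dmc(v):
    return np.array([rng.choice(2, p=P[vi]) for vi in v])

trials = 50000
fig2 = Counter(tuple(rng.permutation(dmc(x))) for _ in range(trials))
fig3 = Counter(tuple(dmc(rng.permutation(x))) for _ in range(trials))
for w in sorted(set(fig2) | set(fig3)):  # empirical P(y|x) under each order
    print(w, fig2[w] / trials, fig3[w] / trials)  # columns agree up to noise
```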

Next, to derive our achievability bound in Theorem 1, we will require the following well-known concentration of measure inequality, which is a specialization of Hoeffding's inequality.

Lemma 3 (Hoeffding's Inequality [39, Theorems 1 and 2]). Suppose X_1, . . . , X_n are independent and identically distributed (i.i.d.) random variables such that |X_1| ≤ σ almost surely for some σ > 0. Then, for every γ ≥ 0,

P( \frac{1}{n} \sum_{i=1}^{n} X_i − E[X_1] ≥ γ ) ≤ exp( − \frac{nγ²}{2σ²} )

and

P( \frac{1}{n} \sum_{i=1}^{n} X_i − E[X_1] ≤ −γ ) ≤ exp( − \frac{nγ²}{2σ²} ) .

While Lemma 3 is used to provide exponentially decaying tail bounds on certain conditional probability of error terms in the proof of Theorem 1 (see (33) and (34) in subsection III-B), we will also show that much weaker tail bounds suffice for proving Theorem 1 for DMCs with rank 2. Indeed, our alternative achievability proof of Proposition 2 in subsection III-B uses the two ensuing lemmata pertaining to the following binary hypothesis testing problem.

Fix any n ∈ N, and two distinct probability distributions P_X, Q_X ∈ P_X (which can depend on n). Consider the hypothesis random variable H ∼ Ber(1/2) (i.e., uniform prior), and likelihoods P_{X|H}(·|0) = P_X(·) and P_{X|H}(·|1) = Q_X(·), such that we observe n samples X_1^n that are drawn conditionally i.i.d. given H from the likelihoods, viz.,

Given H = 0 : X_1^n i.i.d.∼ P_X ,    (17)
Given H = 1 : X_1^n i.i.d.∼ Q_X .

The (classical) objective of binary hypothesis testing is to decode the hypothesis H with minimum probability of error from the observed samples X_1^n. It is well-known that the maximum likelihood (ML) decision rule for H based on X_1^n, Ĥ_ML^n : X^n → {0, 1}, which is defined by

∀x_1^n ∈ X^n, Ĥ_ML^n(x_1^n) = arg max_{h∈{0,1}} \prod_{i=1}^{n} P_{X|H}(x_i | h) ,    (18)

or equivalently,

D(\hat{P}_{x_1^n} || Q_X) \gtrless_{Ĥ_ML^n(x_1^n) = 1}^{Ĥ_ML^n(x_1^n) = 0} D(\hat{P}_{x_1^n} || P_X) ,    (19)

achieves the minimum probability of error

P_ML^{(n)} ≜ P( Ĥ_ML^n(X_1^n) ≠ H ) ,    (20)

where the tie-breaking rule in (19) (when the likelihoods of 0 and 1 are equal) does not affect P_ML^{(n)} (see, e.g., [40, Chapter 2]). Furthermore, Le Cam's relation states that the ML decoding probability of error is completely characterized by the total variation (TV) distance between the two likelihoods, cf. [41, proof of Theorem 2.2(i)]. Recall that the TV distance between two probability measures P_0 and P_1 on a common measurable space (U, F) is defined as

‖P_0 − P_1‖_TV ≜ sup_{A∈F} |P_0(A) − P_1(A)|    (21)
= \frac{‖P_0 − P_1‖_1}{2} ,    (22)

where (22) is well-known (see, e.g., [42, Chapter 4] for a proof in the discrete case). Then, we have

P_ML^{(n)} = \frac{1}{2} ( 1 − ‖P_X^{⊗n} − Q_X^{⊗n}‖_TV ) ,    (23)

where P_X^{⊗n} and Q_X^{⊗n} denote the n-fold product distributions of X_1^n given H = 0 and H = 1, respectively. The next lemma presents a vector generalization of the so called "second moment method for TV distance," cf. [43, Lemma 4.2(iii)], and lower bounds ‖P_X^{⊗n} − Q_X^{⊗n}‖_TV.

Lemma 4 (Second Moment Method). For the binary hypothesis testing problem in (17), we have

‖P_X^{⊗n} − Q_X^{⊗n}‖_TV ≥ \frac{‖P_X − Q_X‖_2^2}{4 \sum_{x∈X} VAR(\hat{P}_{X_1^n}(x))} .

Lemma 4 is proved in appendix B. Moreover, as mentioned in the remark in appendix B, Lemma 4 can also be construed as a variant of the Hammersley-Chapman-Robbins (HCR) bound in statistics [44], [45].

Our final lemma, Lemma 5, establishes an upper bound on P_ML^{(n)} using Lemma 4. It will be used to derive Proposition 2 in subsection III-B—a specialization of Theorem 1 for DMCs with rank 2.

Lemma 5 (Testing between Converging Hypotheses). For the binary hypothesis testing problem in (17), suppose the ℓ²-distance between P_X and Q_X is lower bounded by

‖P_X − Q_X‖_2 ≥ \frac{1}{n^{\frac{1}{2} − ε_n}}    (24)

for some constant ε_n ∈ (0, 1/2) (which may depend on n). Then, we have

P_ML^{(n)} ≤ \frac{|X|}{2|X| + 2n^{2ε_n}} ,

which implies that lim_{n→∞} P_ML^{(n)} = 0 when lim_{n→∞} n^{2ε_n} = +∞.

Lemma 5 is established in appendix C. It illustrates that as long as the Euclidean distance between the likelihoods P_{X|H}(·|0) and P_{X|H}(·|1) vanishes slower than Θ(1/√n), we can decode the hypothesis H with vanishing probability of error as n → ∞. Intuitively, when ‖P_X − Q_X‖_2 = Θ(1/n^{\frac{1}{2}−ε_n}) and we neglect ε_n, Lemma 5 holds because the sum of the variances of the entries of the sufficient statistic T_n = \hat{P}_{X_1^n} − \frac{1}{2}P_X − \frac{1}{2}Q_X (defined in (80) and (85) in the proof of Lemma 4 in appendix B) is O(1/n). So, as long as the Euclidean distance between the two likelihoods is ω(1/√n), it is possible to distinguish between the two hypotheses. We also remark that tighter upper bounds on P_ML^{(n)} can be obtained using standard exponential concentration of measure inequalities. However, the simpler second moment method approach will suffice for our proof of Proposition 2 (while our proof of Theorem 1 will in fact use stronger concentration bounds).
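To see Lemmata 4 and 5 in action, the sketch below implements the ML rule in its KL form (19) and estimates P_ML^{(n)} when ‖P_X − Q_X‖_2 scales like n^{−(1/2−ε)} as in (24); the Bernoulli-type likelihoods and all parameters are assumed for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def ml_decide(x, P, Q):
    """ML rule in the form (19): declare H = 0 iff D(Phat||Q) >= D(Phat||P)."""
    phat = np.bincount(x, minlength=len(P)) / len(x)
    return 0 if kl(phat, Q) >= kl(phat, P) else 1

eps = 0.2
for n in [100, 1000, 10000]:
    gap = 0.5 * n ** (eps - 0.5)     # so ||P - Q||_2 ~ n^{-(1/2 - eps)}
    P = np.array([0.5 + gap, 0.5 - gap])
    Q = np.array([0.5 - gap, 0.5 + gap])
    errs = sum(
        ml_decide(rng.choice(2, size=n, p=P if h == 0 else Q), P, Q) != h
        for h in rng.integers(0, 2, size=500)
    )
    print(n, errs / 500)             # decays with n, consistent with Lemma 5
```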

B. Achievability Bounds for DMCs

In this subsection, we first prove our main achievability result in Theorem 1 and then provide an alternative proof for DMCs with rank 2. Recall the formalism of subsection I-B, which describes the noisy permutation channel model with a DMC P_{Z|X}.

Proof of Theorem 1. Since the lower bound in Theorem 1 trivially holds for the case rank(P_{Z|X}) = 1, we assume without loss of generality that r ≜ rank(P_{Z|X}) ≥ 2. Let X′ ⊆ X denote any (fixed) subset of X such that |X′| = r and the set of conditional distributions {P_{Z|X}(·|x) ∈ P_Y : x ∈ X′} are linearly independent (as vectors in R^{|Y|}),⁵ and let P̃_{Z|X} ∈ R^{r×|Y|} denote the row stochastic matrix whose rows are given by {P_{Z|X}(·|x) ∈ P_Y : x ∈ X′}. Furthermore, define

P_{r,k} ≜ { ( \frac{p_1}{k}, . . . , \frac{p_r}{k} ) : p_1, . . . , p_r ∈ N ∪ {0}, \sum_{i=1}^{r} p_i = k }    (25)

as the intersection of the scaled integer lattice \frac{1}{k}Z^r and the probability simplex in R^r, where k ∈ N is some large constant. Under the setup of subsection I-B, for any ε ∈ (0, 1/2), consider the following message set and encoder-decoder pair:

1) The message set M = P_{r,k} with k = ⌊n^{\frac{1}{2}−ε}⌋ so that the cardinality of M is

|M| = \binom{k + r − 1}{r − 1} = Θ( n^{(\frac{1}{2}−ε)(r−1)} ) ,    (26)

where the elements of M have been re-indexed for convenience.

2) The randomized encoder f_n : P_{r,k} → X^n is given by

∀p = ( \frac{p_1}{k}, . . . , \frac{p_r}{k} ) ∈ P_{r,k}, f_n(p) = X_1^n i.i.d.∼ P_X ,    (27)

where X_1^n are i.i.d. according to a probability distribution P_X ∈ P_X such that

P_X(x) = { \frac{p_x}{k} , for x ∈ X′ ; 0 , for x ∈ X\X′ } ,    (28)

and X′ = {1, . . . , r} without loss of generality.

3) Instead of the ML decoder which achieves minimum probability of error, consider the (sub-optimal) element-wise thresholding decoder g_n:

g_n(y_1^n) = { ( \frac{p̂_1}{k}, . . . , \frac{p̂_r}{k} ) , if ( \frac{p̂_1}{k}, . . . , \frac{p̂_r}{k} ) ∈ P_{r,k} ; e , otherwise }    (29)

for every y_1^n ∈ Y^n, where for each x ∈ X′,

p̂_x = arg min_{j∈{0,...,k}} | \sum_{y∈Y} \hat{P}_{y_1^n}(y) [P̃_{Z|X}^†]_{y,x} − \frac{j}{k} | ,    (30)

where we choose a minimizer randomly when there are several.

⁵This implies that the extreme points of the convex hull of {P_{Z|X}(·|x) ∈ P_Y : x ∈ X′} are precisely {P_{Z|X}(·|x) ∈ P_Y : x ∈ X′}.

This encoder-decoder pair completely specifies the communication system model in subsection I-B. Intuitively, this decoder performs reasonably well because P̃_{Z|X}^† is a valid right inverse of P̃_{Z|X} (since the rows of P̃_{Z|X} are linearly independent). Indeed, conditioned on sending a particular message, \hat{P}_{Y_1^n} is "close" to P_Z (which is the true distribution of the Y_i's as shown below) with high probability when n is large. So, \sum_{y∈Y} \hat{P}_{Y_1^n}(y) [P̃_{Z|X}^†]_{y,x} is "close" to the true P_X(x) for all x ∈ X′ with high probability. We now analyze the average probability of error for this coding scheme.

Let us condition on the event {M = p} for some p = ( \frac{p_1}{k}, . . . , \frac{p_r}{k} ) ∈ P_{r,k}. Then, we have

X_1^n i.i.d.∼ P_X , Z_1^n i.i.d.∼ P_Z , Y_1^n i.i.d.∼ P_Z ,

where P_Z denotes the output distribution when P_X, defined via (28), is "pushed forward" through the channel P_{Z|X}, Z_1^n are i.i.d. because the channel P_{Z|X} is memoryless, and Y_1^n are i.i.d. because they are the output of passing Z_1^n through an independent random permutation. Let P_p represent the underlying probability measure after conditioning on {M = p}. The conditional probability that our element-wise thresholding decoder makes an error is upper bounded by

P_p(M̂ ≠ M) = P_p( g_n(Y_1^n) ≠ p )
= P_p( ∃ x ∈ X′, p̂_x ≠ p_x )
(a)≤ \sum_{x∈X′} P_p( p̂_x ∈ {0, . . . , k}\{p_x} )
(b)= \sum_{x∈X′} ( \sum_{j>p_x} P_p(p̂_x = j) + \sum_{j<p_x} P_p(p̂_x = j) )
(c)≤ \sum_{x∈X′} ( \sum_{j>p_x} P_p( \sum_{y∈Y} \hat{P}_{Y_1^n}(y) [P̃_{Z|X}^†]_{y,x} ≥ \frac{p_x + j}{2k} ) + \sum_{j<p_x} P_p( \sum_{y∈Y} \hat{P}_{Y_1^n}(y) [P̃_{Z|X}^†]_{y,x} ≤ \frac{p_x + j}{2k} ) ) ,    (31)

where (a) follows from the union bound, (b) splits a summation over j ∈ {0, . . . , k}\{p_x} into two summations (and one of these summations is 0 if p_x ∈ {0, k}), and (c) holds because p̂_x = j implies that k \sum_{y∈Y} \hat{P}_{Y_1^n}(y) [P̃_{Z|X}^†]_{y,x} is closer to j than p_x due to (30), and we count the tie case, where k \sum_{y∈Y} \hat{P}_{Y_1^n}(y) [P̃_{Z|X}^†]_{y,x} is equally close to j and p_x, as an error since this gives us an upper bound on the desired conditional probability of error.

To show that this upper bound in (31) vanishes, observe that for any x ∈ X′ and any j > p_x (assuming p_x < k),

P_p( \sum_{y∈Y} \hat{P}_{Y_1^n}(y) [P̃_{Z|X}^†]_{y,x} ≥ \frac{p_x + j}{2k} ) = P_p( \frac{1}{n} \sum_{i=1}^{n} [P̃_{Z|X}^†]_{Y_i,x} − \frac{p_x}{k} ≥ \frac{j − p_x}{2k} ) ,    (32)

which holds because

\sum_{y∈Y} \hat{P}_{Y_1^n}(y) [P̃_{Z|X}^†]_{y,x} = \frac{1}{n} \sum_{i=1}^{n} [P̃_{Z|X}^†]_{Y_i,x}

almost surely. To bound the right hand side of (32), we notice three facts:

1) { [P̃_{Z|X}^†]_{Y_i,x} : i ∈ {1, . . . , n} } are i.i.d. random variables, because [P̃_{Z|X}^†]_{Y_i,x} is a deterministic function of Y_i (since P̃_{Z|X}^† is a known deterministic matrix), and Y_1^n are i.i.d. random variables given M = p.

2) For each i ∈ {1, . . . , n}, [P̃_{Z|X}^†]_{Y_i,x} is bounded almost surely by

| [P̃_{Z|X}^†]_{Y_i,x} | = | e_{Y_i}^T P̃_{Z|X}^† e_x | ≤ max_{u∈R^{|Y|}, v∈R^r : ‖u‖_2=‖v‖_2=1} u^T P̃_{Z|X}^† v = ‖P̃_{Z|X}^†‖_op ≜ σ ,

where e_j denotes the jth standard basis vector with unity at the jth position and zero elsewhere, and the last equality follows from the Courant-Fischer-Weyl min-max theorem, cf. [46, Section 3.1, Problem 6, p.155], [47, Lemma 2].

3) For each i ∈ {1, . . . , n}, [P̃_{Z|X}^†]_{Y_i,x} has expected value

E_p[ [P̃_{Z|X}^†]_{Y_i,x} ] = \sum_{y∈Y} E_p[ \hat{P}_{Y_1^n}(y) ] [P̃_{Z|X}^†]_{y,x} = \sum_{y∈Y} P_Z(y) [P̃_{Z|X}^†]_{y,x} = P_X(x) = \frac{p_x}{k} ,

where E_p[·] represents expectation with respect to the conditional probability distribution of Y_1^n given M = p, the third equality crucially uses the fact that P̃_{Z|X} has a right inverse since its rows are linearly independent,⁶ and the final equality follows from (28).

⁶The existence of a right inverse of P̃_{Z|X} ensures that the input distribution P_X can be uniquely recovered from the output distribution P_Z and P̃_{Z|X}. This elucidates why our achievability bound depends on the rank of P_{Z|X}.

Using these facts, we can apply Lemma 3 to the right hand side of (32) and obtain

P_p( \sum_{y∈Y} \hat{P}_{Y_1^n}(y) [P̃_{Z|X}^†]_{y,x} ≥ \frac{p_x + j}{2k} ) ≤ exp( − \frac{n(j − p_x)²}{8σ²k²} )    (33)

for any x ∈ X′ and any j > p_x (assuming p_x < k). Likewise, for any x ∈ X′ and any j < p_x (assuming p_x > 0), Lemma 3 yields

P_p( \sum_{y∈Y} \hat{P}_{Y_1^n}(y) [P̃_{Z|X}^†]_{y,x} ≤ \frac{p_x + j}{2k} ) ≤ exp( − \frac{n(j − p_x)²}{8σ²k²} ) .    (34)

So, bounding the terms in (31) using (33) and (34) produces

P_p(M̂ ≠ M) ≤ \sum_{x∈X′} \sum_{j∈{0,...,k}\{p_x}} exp( − \frac{n(j − p_x)²}{8σ²k²} )
≤ \sum_{x∈X′} \sum_{j∈{0,...,k}\{p_x}} exp( − \frac{n}{8σ²n^{1−2ε}} )
≤ r n^{\frac{1}{2}−ε} exp( − \frac{n^{2ε}}{8σ²} ) ,    (35)

where the second inequality holds because k = ⌊n^{\frac{1}{2}−ε}⌋ ≤ n^{\frac{1}{2}−ε} and |j − p_x| ≥ 1 for all x ∈ X′ and all j ≠ p_x, and the third inequality holds because |X′| = r and k ≤ n^{\frac{1}{2}−ε}.

Lastly, taking expectations with respect to the law of M in (35) yields

P_error^{(n)} ≤ r n^{\frac{1}{2}−ε} exp( − \frac{n^{2ε}}{8σ²} ) ,    (36)

which implies that lim_{n→∞} P_error^{(n)} = 0. Therefore, using (26), the rate

R = lim_{n→∞} \frac{log \binom{k + r − 1}{r − 1}}{log(n)} = ( \frac{1}{2} − ε )(r − 1)

is achievable for every ε ∈ (0, 1/2), and C_perm(P_{Z|X}) ≥ \frac{r−1}{2}. This completes the proof. ∎

We now make two pertinent remarks regarding Theorem 1. Firstly, the randomized encoder and element-wise thresholding decoder presented in the achievability proof constitute a computationally tractable coding scheme. Indeed, unlike the random coding argument in traditional channel coding, the element-wise thresholding decoder takes O(n) (i.e., linear) time, because constructing \hat{P}_{y_1^n} from y_1^n requires O(n) time and computing (30) requires O(√n) time. Therefore, communication via noisy permutation channels appears to not require the development of conceptually sophisticated coding schemes to theoretically achieve capacity. (Of course, other code constructions could be of utility based on alternative practical considerations.) Furthermore, our achievability proof also implies the existence of a good deterministic code using a simple application of the probabilistic method (see, e.g., [48, Lemma 2.2]).
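To illustrate the tractability just described, here is a compact rendering of the scheme from the proof of Theorem 1: the randomized encoder (27)-(28) and the element-wise thresholding decoder (29)-(30) built from the right inverse P̃_{Z|X}^†. The rank-2 channel and all parameter values are assumed for illustration, and we omit the decoder's membership check from (29) for brevity.

```python
import numpy as np

rng = np.random.default_rng(3)
P = np.array([[0.7, 0.2, 0.1],       # assumed DMC with r = rank(P) = 2,
              [0.1, 0.2, 0.7]])      # so here P coincides with P-tilde
Pinv = np.linalg.pinv(P)             # right inverse: P @ Pinv = I_2

def encode(p, k, n):
    """Randomized encoder (27)-(28): X_i i.i.d. with P_X(x) = p_x / k."""
    return rng.choice(P.shape[0], size=n, p=np.asarray(p) / k)

def decode(y, k):
    """Element-wise thresholding decoder (29)-(30), without the 'e' check."""
    phat = np.bincount(y, minlength=P.shape[1]) / len(y)
    est = phat @ Pinv                # estimate of (p_1/k, ..., p_r/k)
    return tuple(np.rint(np.clip(est, 0, 1) * k).astype(int))

n, eps = 100000, 0.1
k = round(n ** (0.5 - eps))          # floor(n^{1/2 - eps}) = 100 here
msg = (3, k - 3)                     # message p = (3/k, (k-3)/k) in P_{2,k}
x = encode(msg, k, n)
u = rng.random((n, 1))               # vectorized sampling of Z_i ~ P[x_i]
z = (u > P[x].cumsum(axis=1)).sum(axis=1)
y = rng.permutation(z)               # the permutation is immaterial here
print(decode(y, k))                  # equals msg with high probability
```

Note that decoding only touches the type of y, so the permutation step genuinely costs the decoder nothing, in line with the O(n) runtime discussed above.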

Secondly, although we have presented Theorem 1 under an average probability of error criterion, our achievability proof establishes a lower bound on noisy permutation channel capacity under a maximal probability of error criterion as well; see, e.g., (35). More generally, the noisy permutation channel capacity of a DMC remains the same under a maximal probability of error criterion. This follows from a straightforward expurgation argument similar to [19, Theorem 18.3, Corollary 18.1] or [18, Section 7.7, p.204].⁷

For the special case where r = rank(P_{Z|X}) = 2, we next present an alternative proof of Theorem 1 which exploits the second moment method bound in Lemma 5. For convenience, we also state the corresponding achievability result in the ensuing proposition.

Proposition 2 (Achievability Bound for DMCs with Rank 2). The noisy permutation channel capacity of a DMC P_{Z|X} with rank(P_{Z|X}) = 2 is lower bounded by

C_perm(P_{Z|X}) ≥ \frac{1}{2} .

Proof. We commence our proof without imposing the r = 2 constraint. As in the proof of Theorem 1, consider the reduced input alphabet X′ = {1, . . . , r} ⊆ X such that the rows of P̃_{Z|X} are linearly independent, the message set M = P_{r,k} where k = ⌊n^{\frac{1}{2}−ε}⌋ for any ε ∈ (0, 1/2), and the randomized encoder f_n : P_{r,k} → X^n given in (27) and (28). However, on the receiver end, consider the ML decoder g_n : Y^n → P_{r,k} such that

∀y_1^n ∈ Y^n, g_n(y_1^n) = arg max_{p∈P_{r,k}} P_{Y_1^n|M}(y_1^n | p) ,

where the tie-breaking rule (to choose one maximizer when there are several) does not affect P_error^{(n)}. We now analyze the average probability of error for this simple encoding and decoding scheme.

Firstly, as before, we condition on the event {M = p} for some p = ( \frac{p_1}{k}, . . . , \frac{p_r}{k} ) ∈ P_{r,k}, and note that

X_1^n i.i.d.∼ P_X , Z_1^n i.i.d.∼ P_Z , Y_1^n i.i.d.∼ P_Z ,

where P_Z denotes the output distribution when P_X, defined via (28), is "pushed forward" through P_{Z|X}. Moreover, as before, we let P_p represent the underlying probability measure after conditioning on {M = p}. The conditional probability that our ML decoder makes an error is upper bounded by

P_p(M̂ ≠ M) = P_p( g_n(Y_1^n) ≠ p )
(a)≤ P_p( ∃ q ∈ P_{r,k}\{p}, P_{Y_1^n|M}(Y_1^n | q) ≥ P_{Y_1^n|M}(Y_1^n | p) )
(b)≤ \sum_{q∈P_{r,k}\{p}} P_p( P_{Y_1^n|M}(Y_1^n | q) ≥ P_{Y_1^n|M}(Y_1^n | p) ) ,    (37)

where (a) is an upper bound because we regard the ML decoding equality case, P_{Y_1^n|M}(Y_1^n | q) = P_{Y_1^n|M}(Y_1^n | p) for q ≠ p, as an error even though the ML decoder may return the correct message in this scenario, and (b) follows from the union bound.

Then, to prove that this upper bound in (37) vanishes, for any message q = ( \frac{q_1}{k}, . . . , \frac{q_r}{k} ) ∈ P_{r,k}\{p}, consider a binary hypothesis test with likelihoods given by

Given H = 0 : Y_1^n i.i.d.∼ P_Z ,
Given H = 1 : Y_1^n i.i.d.∼ Q_Z ,

where the hypotheses H = 0 and H = 1 correspond to the messages M = p and M = q, respectively, and Q_Z ∈ P_Y is the output distribution when the input distribution Q_X ∈ P_X, defined analogously to (28) as

Q_X(x) = { \frac{q_x}{k} , for x ∈ X′ ; 0 , for x ∈ X\X′ } ,

is "pushed forward" through the channel P_{Z|X}. Notice that the ℓ²-distance between P_Z and Q_Z can be upper and lower bounded using ‖P_X − Q_X‖_2;⁸ indeed,

σ_min(P̃_{Z|X}) ‖P_X − Q_X‖_2 ≤ ‖P_Z − Q_Z‖_2 ≤ ‖P_{Z|X}‖_op ‖P_X − Q_X‖_2 ,    (38)

where σ_min(P̃_{Z|X}) > 0 because the rows of P̃_{Z|X} are linearly independent, the first inequality follows from the Courant-Fischer-Weyl min-max theorem, cf. [49, Theorem 7.3.8], because ‖P_X − Q_X‖_2 = ‖p − q‖_2, and P_Z and Q_Z can be obtained by pushing p and q forward through the channel P̃_{Z|X}, respectively, and the second inequality follows from the definition of operator norm. So, letting

ε_n = ε + \frac{log( σ_min(P̃_{Z|X}) ) + \frac{1}{2} log( \sum_{i=1}^{r} (p_i − q_i)² )}{log(n)}    (39)

such that ε_n ∈ (0, 1/2) for all sufficiently large n (depending on P_{Z|X}), we have

‖P_Z − Q_Z‖_2 ≥ σ_min(P̃_{Z|X}) ‖p − q‖_2 = \frac{σ_min(P̃_{Z|X})}{n^{\frac{1}{2}−ε}} \sqrt{ \sum_{i=1}^{r} (p_i − q_i)² } ≥ \frac{1}{n^{\frac{1}{2}−ε_n}} ,

where p = ( \frac{p_1}{k}, . . . , \frac{p_r}{k} ) ∈ P_{r,k} and q = ( \frac{q_1}{k}, . . . , \frac{q_r}{k} ) ∈ P_{r,k} with k = ⌊n^{\frac{1}{2}−ε}⌋. Using Lemma 5 (which is based on the second moment method in Lemma 4), if H ∼ Ber(1/2), i.e., the hypotheses are equiprobable, then the ML decoding probability of error for our binary hypothesis testing problem, P_ML^{(n)} = P( Ĥ_ML^n(Y_1^n) ≠ H ), satisfies

P_ML^{(n)} = \frac{1}{2} P_p( Ĥ_ML^n(Y_1^n) = 1 ) + \frac{1}{2} P_q( Ĥ_ML^n(Y_1^n) = 0 ) ≤ \frac{|Y|}{2|Y| + 2n^{2ε_n}} .

This implies that the false alarm probability satisfies

P_p( Ĥ_ML^n(Y_1^n) = 1 ) = P_p( P_{Y_1^n|M}(Y_1^n | q) ≥ P_{Y_1^n|M}(Y_1^n | p) ) ≤ \frac{|Y|}{|Y| + n^{2ε_n}} ,    (40)

where the equality follows from breaking ties in ML decoding, i.e., in cases where we get P_{Y_1^n|M}(Y_1^n | q) = P_{Y_1^n|M}(Y_1^n | p), by assigning Ĥ_ML^n(Y_1^n) = 1 (which does not affect the analysis of P_ML^{(n)} in Lemma 5).

7 8 Indeed, Cperm under a maximal probability of error criterion is clearly The ensuing lower bound is where we crucially introduce a dependence upper bounded by Cperm under an average probability of error criterion, and between the noisy permutation channel capacity and rank of a DMC. Fur- expurgating the code used to achieve Cperm under an average probability of thermore, the upper and lower bounds together imply that kP − Q k2 =  Z Z error criterion shows that this bound can be met with equality. Θ kPX − QX k2 . 13
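The sandwich bound (38) is easy to confirm numerically. The following check (our own illustration, with an arbitrary rank-2 channel of our choosing) verifies both inequalities on random message pairs.

```python
import numpy as np

rng = np.random.default_rng(1)
W = np.array([[0.8, 0.15, 0.05],
              [0.2, 0.5 , 0.3 ]])        # \tilde{P}_{Z|X}: r = 2 independent rows
s = np.linalg.svd(W, compute_uv=False)
sigma_min, sigma_max = s[-1], s[0]       # smallest/largest singular values

k = 50
for _ in range(5):
    p = rng.multinomial(k, [0.5, 0.5]) / k   # two messages in P_{2,k}
    q = rng.multinomial(k, [0.5, 0.5]) / k
    lhs = np.linalg.norm((p - q) @ W)        # ||P_Z - Q_Z||_2
    d = np.linalg.norm(p - q)                # ||P_X - Q_X||_2
    assert sigma_min * d - 1e-12 <= lhs <= sigma_max * d + 1e-12
print("sandwich bound (38) verified on random message pairs")
```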

Next, combining (37) and (40) yields
$$\mathbb{P}_p(\hat{M} \neq M) \leq \sum_{q\in\mathcal{P}_{r,k}\backslash\{p\}} \frac{|\mathcal{Y}|}{|\mathcal{Y}| + n^{2\epsilon_n}} \leq \sum_{q\in\mathcal{P}_{r,k}\backslash\{p\}} \frac{|\mathcal{Y}|}{n^{2\epsilon_n}} = \sum_{q\in\mathcal{P}_{r,k}\backslash\{p\}} \frac{|\mathcal{Y}|}{\sigma_{\min}\big(\tilde{P}_{Z|X}\big)^2\, n^{2\epsilon}\sum_{i=1}^r (p_i - q_i)^2}, \tag{41}$$
where the last equality follows from substituting (39). At this point, we use the fact that $r = 2$ to simplify (41) so that
$$\mathbb{P}_p(\hat{M} \neq M) \overset{(a)}{\leq} \frac{|\mathcal{Y}|}{2\,\sigma_{\min}\big(\tilde{P}_{Z|X}\big)^2\, n^{2\epsilon}} \sum_{q\in\mathcal{P}_{r,k}\backslash\{p\}} \frac{1}{(p_1 - q_1)^2} \overset{(b)}{\leq} \frac{|\mathcal{Y}|}{\sigma_{\min}\big(\tilde{P}_{Z|X}\big)^2\, n^{2\epsilon}} \sum_{j=1}^{\infty} \frac{1}{j^2} \overset{(c)}{=} \frac{|\mathcal{Y}|\pi^2}{6\,\sigma_{\min}\big(\tilde{P}_{Z|X}\big)^2\, n^{2\epsilon}}, \tag{42}$$
where (a) holds because $p_1 + p_2 = q_1 + q_2 = k$, (b) holds because $j = p_1 - q_1$ ranges over a subset of all non-zero integers, and (c) utilizes the renowned solution to the Basel problem.

Finally, taking expectations with respect to the law of $M$ in (42) produces
$$P^{(n)}_{\mathrm{error}} \leq \frac{|\mathcal{Y}|\pi^2}{6\,\sigma_{\min}\big(\tilde{P}_{Z|X}\big)^2\, n^{2\epsilon}}, \tag{43}$$
which implies that $\lim_{n\to\infty} P^{(n)}_{\mathrm{error}} = 0$. Therefore, as argued in the proof of Theorem 1, $C_{\mathrm{perm}}(P_{Z|X}) \geq \frac{r-1}{2} = \frac{1}{2}$, which completes the proof. ∎

In view of Proposition 2, some further remarks are in order.

Firstly, the high-level proof strategy to establish Proposition 2 parallels the pairwise error probability analysis technique used in conventional channel coding problems (see, e.g., [50]). However, the details of our hypothesis testing formulation and the bounds we use to execute our analysis are different to such classical approaches.

Secondly, if $r > 2$, then the obvious approach to bounding $P^{(n)}_{\mathrm{error}}$ starting from (41) (and taking expectations with respect to $M$) yields
$$P^{(n)}_{\mathrm{error}} \leq \frac{|\mathcal{Y}|}{\sigma_{\min}\big(\tilde{P}_{Z|X}\big)^2\, n^{2\epsilon}} \sum_{m\in\mathbb{Z}^r\backslash(0,\dots,0)} \frac{1}{\|m\|_2^2}, \tag{44}$$
because we can define $m = (m_1,\dots,m_r) \in \mathbb{Z}^r$ such that $m_i = p_i - q_i$ for every $i \in \{1,\dots,r\}$, and then take the summation over additional sequences $m$ whose sums are not necessarily equal to $0$ (i.e., we take the summation over additional $q_1,\dots,q_r$ whose sums are not necessarily equal to $k$, as is the case when $q \in \mathcal{P}_{r,k}$). It is straightforward to verify that the bound in (44) diverges when $r \geq 2$. Indeed, notice that
$$\sum_{m\in\mathbb{Z}^r\backslash(0,\dots,0)} \frac{1}{\|m\|_2^2} \geq \sum_{m\in\mathbb{Z}^r\backslash(0,\dots,0)} \frac{1}{\|m\|_1^2} \geq \frac{1}{(r-1)!}\sum_{d=1}^{\infty} \frac{1}{d^2}\prod_{j=1}^{r-1}(d+j) \geq \frac{1}{(r-1)!}\sum_{d=1}^{\infty} \frac{1}{d^{3-r}} = +\infty, \tag{45}$$
where the first inequality uses the monotonicity of $\ell^p$-norms in $p \in [1,\infty]$, the second inequality follows from enumerating over all possible $\ell^1$-norms $d$ and noting that there are $\binom{d+r-1}{r-1}$ (entry-wise) non-negative points in the integer lattice $\mathbb{Z}^r$ that have an $\ell^1$-norm of $d$, and the expression in the final inequality is infinity due to the divergent nature of the harmonic series. In the $r = 2$ case, as in the proof of Proposition 2 above, it is possible to tighten (44) and obtain a summation over $\mathbb{Z}$ rather than $\mathbb{Z}^2$. However, such a tightening does not ameliorate the divergent situation for $r > 2$. So, the proof technique of Proposition 2 cannot be used for $r > 2$.

Thirdly, as in the earlier proof of Theorem 1, the randomized encoder and ML decoder presented in the proof of Proposition 2 also constitute a computationally tractable coding scheme. In particular, the ML decoder requires at most $O(\sqrt{n})$ likelihood ratio tests, which means that the decoder operates in polynomial time in $n$.

Fourthly, for any non-trivial DMCs, we intuitively expect the rate of decay of $P^{(n)}_{\mathrm{error}}$ to be dominated by the rate of decay of the probability of error in distinguishing between two "consecutive" messages. Although we do not derive precise error exponents or rates of decay in this paper, (35), (36), Lemma 5, and (43) indicate that this intuition is accurate.

Lastly, when we specialize (43) for a BSC($\delta$) with $\delta \in \big(0,\tfrac{1}{2}\big) \cup \big(\tfrac{1}{2},1\big)$, we get
$$P^{(n)}_{\mathrm{error}} \leq \frac{\pi^2}{3(1-2\delta)^2 n^{2\epsilon}}, \tag{46}$$
because $|\mathcal{Y}| = |\{0,1\}| = 2$ and $\sigma_{\min}\big(\tilde{P}_{Z|X}\big) = |1-2\delta|$. This improves the constant in our corresponding bound in [2, Theorem 3, Equation (19)] by a factor of 3.
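The divergence in (45) is visible numerically. The short brute-force computation below (our own illustration) shows that the partial sums of the lattice series in (44) keep growing for $r = 2$, whereas the zero-sum restriction actually realized by message pairs (a one-dimensional sublattice when $r = 2$) gives a convergent sum, which is exactly why Proposition 2 needs $r = 2$.

```python
import numpy as np
from itertools import product

def partial_sum(r, radius):
    total = 0.0
    for m in product(range(-radius, radius + 1), repeat=r):
        if any(m):
            total += 1.0 / np.dot(m, m)
    return total

for radius in (5, 10, 20, 40):
    print(radius, partial_sum(2, radius))       # grows like log(radius)
print(sum(2.0 / j**2 for j in range(1, 10**6)))  # zero-sum case: ~ pi^2/3
```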
C. Converse Bounds for Strictly Positive DMCs

In this subsection, we first prove Theorem 2. To this end, once again recall the formalism introduced in subsection I-B.

Proof of Theorem 2. Suppose we are given a sequence of encoder-decoder pairs $\{(f_n,g_n)\}_{n\in\mathbb{N}}$ on message sets of size $|\mathcal{M}| = n^R$ such that $\lim_{n\to\infty} P^{(n)}_{\mathrm{error}} = 0$. Consider the Markov chain $M \to X_1^n \to Z_1^n \to Y_1^n \to \hat{P}_{Y_1^n}$. Observe using (2) that for every $y_1^n \in \mathcal{Y}^n$ and $m \in \mathcal{M}$,
$$P_{Y_1^n|M}(y_1^n|m) = \binom{n}{n\hat{P}_{y_1^n}}^{\!-1}\, \mathbb{P}\big(\hat{P}_{Z_1^n} = \hat{P}_{y_1^n}\,\big|\,M = m\big).$$
Since $P_{Y_1^n|M}(y_1^n|m)$ depends on $y_1^n$ through $\hat{P}_{y_1^n}$, the Fisher-Neyman factorization theorem implies that $\hat{P}_{Y_1^n}$ is a sufficient statistic of $Y_1^n$ for $M$ [51, Theorem 3.6].

Then, following the standard argument from [18, Section 7.9], we have
$$R\log(n) \overset{(a)}{=} H(M) \overset{(b)}{=} H(M|\hat{M}) + I(M;\hat{M}) \overset{(c)}{\leq} 1 + P^{(n)}_{\mathrm{error}}R\log(n) + I(M;Y_1^n) \overset{(d)}{=} 1 + P^{(n)}_{\mathrm{error}}R\log(n) + I(M;\hat{P}_{Y_1^n}) \overset{(e)}{\leq} 1 + P^{(n)}_{\mathrm{error}}R\log(n) + I(X_1^n;\hat{P}_{Y_1^n}), \tag{47}$$
where (a) holds because $M$ is uniformly distributed, (b) follows from the definition of mutual information, (c) follows from Fano's inequality and the data processing inequality [18, Theorems 2.10.1 and 2.8.1], (d) holds because $\hat{P}_{Y_1^n}$ is a sufficient statistic, cf. [18, Section 2.9], and (e) also follows from the data processing inequality.

We now upper bound $I(X_1^n;\hat{P}_{Y_1^n})$. Notice that
$$I(X_1^n;\hat{P}_{Y_1^n}) = H(\hat{P}_{Y_1^n}) - H(\hat{P}_{Y_1^n}|X_1^n) \leq (|\mathcal{Y}|-1)\log(n+1) - \sum_{x_1^n\in\mathcal{X}^n} P_{X_1^n}(x_1^n)\,H(\hat{P}_{Y_1^n}|X_1^n = x_1^n), \tag{48}$$
where we use the upper bound on the number of possible empirical distributions given in (5). Given $X_1^n = x_1^n$ for any fixed $x_1^n \in \mathcal{X}^n$, $\{Z_i \sim P_{Z|X}(\cdot|x_i) : i \in \{1,\dots,n\}\}$ are mutually independent and $\hat{P}_{Z_1^n} = \hat{P}_{Y_1^n}$ (almost surely). To lower bound $H(\hat{P}_{Y_1^n}|X_1^n = x_1^n)$, we need some additional definitions. Let $x^* = \operatorname{arg\,max}_{x\in\mathcal{X}} \hat{P}_{x_1^n}(x)$, choosing one maximizer arbitrarily if there are several. For every $y \in \mathcal{Y}$, define the random variable
$$N_y \triangleq \sum_{i=1}^n \mathbb{1}\{x_i = x^*\}\,\mathbb{1}\{Z_i = y\}.$$
Furthermore, define the empirical conditional distribution
$$\hat{P}^n_{Z|X=x}(y) \triangleq \frac{1}{n\hat{P}_{x_1^n}(x)}\sum_{i=1}^n \mathbb{1}\{x_i = x\}\,\mathbb{1}\{Z_i = y\}$$
for every $y \in \mathcal{Y}$ and every $x \in \mathcal{X}$ such that $\hat{P}_{x_1^n}(x) > 0$, where we let $\hat{P}^n_{Z|X=x} = \big(\hat{P}^n_{Z|X=x}(y) : y \in \mathcal{Y}\big) \in \mathcal{P}_{\mathcal{Y}}$ so that
$$\hat{P}_{Z_1^n} = \sum_{x\in\mathcal{X}} \hat{P}_{x_1^n}(x)\,\hat{P}^n_{Z|X=x}. \tag{49}$$
In the sequel, for any $x \in \mathcal{X}$, if $\hat{P}_{x_1^n}(x) = 0$, then we interpret $\hat{P}_{x_1^n}(x)\hat{P}^n_{Z|X=x}$ as the zero vector. Using these definitions, we have
$$H(\hat{P}_{Y_1^n}|X_1^n = x_1^n) \overset{(a)}{=} H(\hat{P}_{Z_1^n}|X_1^n = x_1^n) \overset{(b)}{=} H\Big(\sum_{x\in\mathcal{X}}\hat{P}_{x_1^n}(x)\hat{P}^n_{Z|X=x}\,\Big|\,X_1^n = x_1^n\Big) \overset{(c)}{\geq} H\big(\hat{P}_{x_1^n}(x^*)\hat{P}^n_{Z|X=x^*}\,\big|\,X_1^n = x_1^n\big) \overset{(d)}{=} H(N_1,\dots,N_{|\mathcal{Y}|-1}|X_1^n = x_1^n) \overset{(e)}{=} H(N_1|X_1^n = x_1^n) + \sum_{i=2}^{|\mathcal{Y}|-1} H(N_i|N_1,\dots,N_{i-1},X_1^n = x_1^n) \overset{(f)}{\geq} \sum_{i=1}^{|\mathcal{Y}|-1} H(N_i|\{N_j : j\in\mathcal{Y}\backslash\{i,|\mathcal{Y}|\}\},X_1^n = x_1^n), \tag{50}$$
where (a) holds because $\hat{P}_{Z_1^n} = \hat{P}_{Y_1^n}$ (almost surely), (b) uses (49), (c) follows from [18, Problem 2.14] because $\{\hat{P}^n_{Z|X=x} \in \mathcal{P}_{\mathcal{Y}} : x \in \mathcal{X}\}$ are mutually (conditionally) independent random variables given $X_1^n = x_1^n$, (d) holds because $\hat{P}^n_{Z|X=x^*}$ sums to unity and we let $\mathcal{Y} = \{1,\dots,|\mathcal{Y}|\}$ without loss of generality, (e) follows from the chain rule for Shannon entropy (and the summation is $0$ when $|\mathcal{Y}| = 2$), and (f) holds because conditioning reduces Shannon entropy (and it equals $H(N_1|X_1^n = x_1^n)$ when $|\mathcal{Y}| = 2$).

We next lower bound $H(N_1|N_2,\dots,N_{|\mathcal{Y}|-1},X_1^n = x_1^n)$; the other terms in the sum in (50) can be lower bounded similarly. Let $p \triangleq P_{Z|X}(1|x^*)/\big(P_{Z|X}(1|x^*)+P_{Z|X}(|\mathcal{Y}||x^*)\big)$. Then, the conditional distribution of $N_1$ given $N_2,\dots,N_{|\mathcal{Y}|-1}$ and $X_1^n = x_1^n$ is given by
$$\mathbb{P}\big(N_1 = k_1\,\big|\,N_2 = k_2,\dots,N_{|\mathcal{Y}|-1} = k_{|\mathcal{Y}|-1},X_1^n = x_1^n\big) = \frac{\mathbb{P}\big(N_1 = k_1,\dots,N_{|\mathcal{Y}|-1} = k_{|\mathcal{Y}|-1}\,\big|\,X_1^n = x_1^n\big)}{\mathbb{P}\big(N_2 = k_2,\dots,N_{|\mathcal{Y}|-1} = k_{|\mathcal{Y}|-1}\,\big|\,X_1^n = x_1^n\big)} = \frac{\displaystyle\prod_{j\in\mathcal{Y}}\frac{P_{Z|X}(j|x^*)^{k_j}}{k_j!}}{\displaystyle\frac{\big(P_{Z|X}(1|x^*)+P_{Z|X}(|\mathcal{Y}||x^*)\big)^{k_1+k_{|\mathcal{Y}|}}}{(k_1+k_{|\mathcal{Y}|})!}\prod_{j=2}^{|\mathcal{Y}|-1}\frac{P_{Z|X}(j|x^*)^{k_j}}{k_j!}} = \binom{k_1+k_{|\mathcal{Y}|}}{k_1}p^{k_1}(1-p)^{k_{|\mathcal{Y}|}}$$
for every $k_1,\dots,k_{|\mathcal{Y}|} \in \mathbb{N}\cup\{0\}$ such that $\sum_{j\in\mathcal{Y}} k_j = n\hat{P}_{x_1^n}(x^*)$. Therefore, $N_1 \sim \mathrm{bin}\big(n\hat{P}_{x_1^n}(x^*) - \sum_{j=2}^{|\mathcal{Y}|-1}k_j,\,p\big)$ given $N_2 = k_2,\dots,N_{|\mathcal{Y}|-1} = k_{|\mathcal{Y}|-1}$ and $X_1^n = x_1^n$. (We remark that this calculation also holds for the $|\mathcal{Y}| = 2$ case because $\sum_{j=2}^{|\mathcal{Y}|-1}k_j = 0$.) Now, for some fixed constant $\tau \in \big(1 - P_{Z|X}(1|x^*) - P_{Z|X}(|\mathcal{Y}||x^*),\,1\big)$, let $E \in \{0,1\}$ be a binary random variable defined by
$$E \triangleq \mathbb{1}\Bigg\{\sum_{j=2}^{|\mathcal{Y}|-1} N_j \leq n\hat{P}_{x_1^n}(x^*)\,\tau\Bigg\}.$$
Observe using the Bienaymé-Chebyshev inequality that
$$\mathbb{P}(E = 0|X_1^n = x_1^n) = \mathbb{P}\big(\hat{P}^n_{Z|X=x^*}(1)+\hat{P}^n_{Z|X=x^*}(|\mathcal{Y}|) < 1-\tau\,\big|\,X_1^n = x_1^n\big) \overset{(a)}{\leq} \frac{\mathbb{VAR}\big(\hat{P}^n_{Z|X=x^*}(1)+\hat{P}^n_{Z|X=x^*}(|\mathcal{Y}|)\,\big|\,X_1^n = x_1^n\big)}{\big(\tau + P_{Z|X}(\{1,|\mathcal{Y}|\}|x^*) - 1\big)^2} \overset{(b)}{=} \frac{P_{Z|X}(\{1,|\mathcal{Y}|\}|x^*)\big(1-P_{Z|X}(\{1,|\mathcal{Y}|\}|x^*)\big)}{n\hat{P}_{x_1^n}(x^*)\big(\tau + P_{Z|X}(\{1,|\mathcal{Y}|\}|x^*) - 1\big)^2} \overset{(c)}{\leq} \frac{|\mathcal{X}|\,P_{Z|X}(\{1,|\mathcal{Y}|\}|x^*)\big(1-P_{Z|X}(\{1,|\mathcal{Y}|\}|x^*)\big)}{n\big(\tau + P_{Z|X}(\{1,|\mathcal{Y}|\}|x^*) - 1\big)^2} \tag{51}$$
$$= O\Big(\frac{1}{n}\Big), \tag{52}$$
where (a), (b), and (c) use the notation $P_{Z|X}(\{1,|\mathcal{Y}|\}|x^*) = P_{Z|X}(1|x^*)+P_{Z|X}(|\mathcal{Y}||x^*)$, and (c) also utilizes the fact that $\hat{P}_{x_1^n}(x^*) \geq \frac{1}{|\mathcal{X}|}$ by definition of $x^*$. Then, we can apply Lemma 1 and get
$$H(N_1|N_2,\dots,N_{|\mathcal{Y}|-1},X_1^n = x_1^n) \overset{(a)}{\geq} H(N_1|N_2,\dots,N_{|\mathcal{Y}|-1},E,X_1^n = x_1^n) \overset{(b)}{\geq} \Big(1-O\Big(\frac{1}{n}\Big)\Big)H(N_1|N_2,\dots,N_{|\mathcal{Y}|-1},E = 1,X_1^n = x_1^n) = \Big(1-O\Big(\frac{1}{n}\Big)\Big)\mathbb{E}\Big[H\Big(\mathrm{bin}\Big(n\hat{P}_{x_1^n}(x^*)-\sum_{j=2}^{|\mathcal{Y}|-1}N_j,\,p\Big)\Big)\,\Big|\,E = 1,X_1^n = x_1^n\Big] \overset{(c)}{\geq} \Big(1-O\Big(\frac{1}{n}\Big)\Big)H\Big(\mathrm{bin}\Big(\frac{n(1-\tau)}{|\mathcal{X}|},\,p\Big)\Big) \overset{(d)}{\geq} \Big(1-O\Big(\frac{1}{n}\Big)\Big)\Big(\frac{1}{2}\log\Big(2\pi e\,p(1-p)\,\frac{n(1-\tau)}{|\mathcal{X}|}\Big) - 1\Big) - \Big(1-O\Big(\frac{1}{n}\Big)\Big)\frac{c(p)|\mathcal{X}|}{n(1-\tau)-|\mathcal{X}|}, \tag{53}$$
where (a) holds because conditioning reduces Shannon entropy, (b) follows from (52) and the non-negativity of Shannon entropy, (c) follows from [18, Problem 2.14] and the facts that $E = 1$ and $\hat{P}_{x_1^n}(x^*) \geq \frac{1}{|\mathcal{X}|}$, and (d) employs Lemma 1. Here, to employ Lemma 1 and obtain (53), we implicitly utilize the assumption that all conditional distributions in $\{P_{Z|X}(\cdot|x) \in \mathcal{P}_{\mathcal{Y}} : x \in \mathcal{X}\}$ are strictly positive, which ensures that $p \in (0,1)$.⁹ We also note that when $|\mathcal{Y}| = 2$, the above argument mutatis mutandis yields
$$H(N_1|X_1^n = x_1^n) = \mathbb{E}\big[H\big(\mathrm{bin}\big(n\hat{P}_{x_1^n}(x^*),\,p\big)\big)\big] \geq \frac{1}{2}\log\Big(2\pi e\,p(1-p)\,\frac{n}{|\mathcal{X}|}\Big) - 1 - \frac{c(p)|\mathcal{X}|}{n-|\mathcal{X}|},$$
which is lower bounded by (53). So, the bound in (53) is valid for all $|\mathcal{Y}| \geq 2$.

(Footnote 9: Note that since we do not know a priori which value $x^*$ takes and we have to prove (53) for every term in (50), we have to assume that $P_{Z|X}(y|x) > 0$ for all $x \in \mathcal{X}$ and $y \in \mathcal{Y}$.)

Next, to upper bound $I(X_1^n;\hat{P}_{Y_1^n})$, let $\tau^*$ be any fixed constant such that
$$1 - \min_{\substack{x\in\mathcal{X},\,y,y'\in\mathcal{Y}:\\ y\neq y'}} P_{Z|X}(\{y,y'\}|x) < \tau^* < 1,$$
where $P_{Z|X}(\{y,y'\}|x) = P_{Z|X}(y|x)+P_{Z|X}(y'|x)$ for any $x \in \mathcal{X}$ and $y,y' \in \mathcal{Y}$ such that $y \neq y'$. Notice that when we analyze the other conditional entropy terms $H(N_i|\{N_j : j\in\mathcal{Y}\backslash\{i,|\mathcal{Y}|\}\},X_1^n = x_1^n)$ for $i \in \{1,\dots,|\mathcal{Y}|-1\}$ akin to our analysis of $H(N_1|N_2,\dots,N_{|\mathcal{Y}|-1},X_1^n = x_1^n)$ above with $\tau = \tau^*$, the maximum bound of the form (51) is
$$\max_{\substack{x\in\mathcal{X},\,y,y'\in\mathcal{Y}:\\ y\neq y'}} \frac{|\mathcal{X}|\,P_{Z|X}(\{y,y'\}|x)\big(1-P_{Z|X}(\{y,y'\}|x)\big)}{n\big(\tau^* + P_{Z|X}(\{y,y'\}|x) - 1\big)^2} = O\Big(\frac{1}{n}\Big), \tag{54}$$
which remains $O\big(\frac{1}{n}\big)$. Furthermore, let $p^*$ be the optimal value of
$$p = \frac{P_{Z|X}(y|x)}{P_{Z|X}(\{y,y'\}|x)} > 0$$
that minimizes (53), with $\tau = \tau^*$ and the $O\big(\frac{1}{n}\big)$ term given by (54), over all $x \in \mathcal{X}$ and $y,y' \in \mathcal{Y}$ such that $y \neq y'$. Then, following the derivation of (53), for all $i \in \{1,\dots,|\mathcal{Y}|-1\}$, we obtain the lower bound
$$H(N_i|\{N_j : j\in\mathcal{Y}\backslash\{i,|\mathcal{Y}|\}\},X_1^n = x_1^n) \geq \Big(1-O\Big(\frac{1}{n}\Big)\Big)\Big(\frac{1}{2}\log\Big(2\pi e\,p^*(1-p^*)\,\frac{n(1-\tau^*)}{|\mathcal{X}|}\Big) - 1\Big) - \Big(1-O\Big(\frac{1}{n}\Big)\Big)\frac{c(p^*)|\mathcal{X}|}{n(1-\tau^*)-|\mathcal{X}|}, \tag{55}$$
where the $O\big(\frac{1}{n}\big)$ term is given by (54). Hence, we can combine (48), (50), and (55) to produce
$$I(X_1^n;\hat{P}_{Y_1^n}) \leq (|\mathcal{Y}|-1)\log(n+1) - \Big(1-O\Big(\frac{1}{n}\Big)\Big)(|\mathcal{Y}|-1)\bigg(\frac{1}{2}\log\Big(2\pi e\,p^*(1-p^*)\,\frac{n(1-\tau^*)}{|\mathcal{X}|}\Big) - 1 - \frac{c(p^*)|\mathcal{X}|}{n(1-\tau^*)-|\mathcal{X}|}\bigg). \tag{56}$$
Finally, combining (47) and (56), and dividing by $\log(n)$, yields
$$R \leq \frac{1}{\log(n)} + P^{(n)}_{\mathrm{error}}R + (|\mathcal{Y}|-1)\Bigg(\frac{\log(n+1)}{\log(n)} - \Big(1-O\Big(\frac{1}{n}\Big)\Big)\bigg(\frac{\log\big(2\pi e\,p^*(1-p^*)(1-\tau^*)/|\mathcal{X}|\big)}{2\log(n)} + \frac{\log\big(n-(|\mathcal{X}|/(1-\tau^*))\big)}{2\log(n)} - \frac{1}{\log(n)} - \frac{c(p^*)|\mathcal{X}|}{\big(n(1-\tau^*)-|\mathcal{X}|\big)\log(n)}\bigg)\Bigg),$$
where letting $n \to \infty$ produces
$$R \leq \frac{|\mathcal{Y}|-1}{2}.$$
This completes the proof. ∎

The proofs of Lemmata 4 and 5 in appendices B and C (along with the discussion following Lemmata 4 and 5) and the proof of Proposition 2 portray that the distinguishability between two "consecutive" (encoded) messages can be determined by a careful comparison of the difference between their means and a variance (at least in the rank 2 case). This suggests that the CLT can be used to obtain the correct scaling of $|\mathcal{M}|$ with $n$ in general. We remark that the CLT is in fact implicitly used in the above converse proof when we apply Lemma 1, because estimates for the entropy of a binomial distribution are typically obtained using the CLT.
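The binomial entropy estimate underlying Lemma 1 can be sanity-checked directly. The snippet below (our own illustration; entropies in nats) compares $H(\mathrm{bin}(n,p))$ with the Gaussian approximation $\frac{1}{2}\log(2\pi e\,p(1-p)\,n)$ suggested by the CLT.

```python
import numpy as np
from scipy.stats import binom

p = 0.3
for n in (10, 100, 1000, 10000):
    pmf = binom.pmf(np.arange(n + 1), n, p)
    H = -np.sum(pmf[pmf > 0] * np.log(pmf[pmf > 0]))
    H_clt = 0.5 * np.log(2 * np.pi * np.e * p * (1 - p) * n)
    print(n, round(H, 4), round(H_clt, 4))   # difference vanishes as n grows
```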

We conclude this section by using Theorem 2, Proposition 1, and Lemma 2 to establish the alternative converse bound on the noisy permutation channel capacity of strictly positive DMCs given in Theorem 3.

Proof of Theorem 3. As in the proof of Theorem 2, consider any sequence of encoder-decoder pairs $\{(f_n,g_n)\}_{n\in\mathbb{N}}$ on message sets of size $|\mathcal{M}| = n^R$ such that $\lim_{n\to\infty} P^{(n)}_{\mathrm{error}} = 0$. This defines the Markov chain $M \to X_1^n \to Z_1^n \to Y_1^n$, and the standard argument from [18, Section 7.9], which yielded (47) earlier, easily produces
$$R\log(n) \leq 1 + P^{(n)}_{\mathrm{error}}R\log(n) + I(X_1^n;Y_1^n). \tag{57}$$
We proceed to upper bounding $I(X_1^n;Y_1^n)$ using a degradation argument.

First, we reduce the cardinality of the input alphabet $\mathcal{X}$ of the DMC $P_{Z|X}$. In particular, we let $\mathcal{X}' \subseteq \mathcal{X}$ be any (fixed) subset of $\mathcal{X}$ such that $q \triangleq |\mathcal{X}'| = \mathrm{ext}(P_{Z|X})$ and the set of conditional distributions $\{P_{Z|X}(\cdot|x) \in \mathcal{P}_{\mathcal{Y}} : x \in \mathcal{X}'\}$ (as vectors in $\mathbb{R}^{|\mathcal{Y}|}$) are the extreme points of the convex hull of $\{P_{Z|X}(\cdot|x) \in \mathcal{P}_{\mathcal{Y}} : x \in \mathcal{X}\}$.¹⁰ Moreover, we let $P_{Z|\tilde{X}} \in \mathbb{R}^{q\times|\mathcal{Y}|}$ denote the row stochastic matrix whose rows are given by $\{P_{Z|X}(\cdot|x) \in \mathcal{P}_{\mathcal{Y}} : x \in \mathcal{X}'\}$, where the random variable $\tilde{X} \in \mathcal{X}'$. Since the convex hulls of $\{P_{Z|X}(\cdot|x) \in \mathcal{P}_{\mathcal{Y}} : x \in \mathcal{X}\}$ and $\{P_{Z|X}(\cdot|x) \in \mathcal{P}_{\mathcal{Y}} : x \in \mathcal{X}'\}$ are equivalent, for every $x \in \mathcal{X}$, we have
$$\forall z \in \mathcal{Y},\quad P_{Z|X}(z|x) = \sum_{\tilde{x}\in\mathcal{X}'} Q_x(\tilde{x})\,P_{Z|X}(z|\tilde{x}) \tag{58}$$
for some convex weights $\{Q_x(\tilde{x}) \geq 0 : \tilde{x} \in \mathcal{X}'\}$ such that $\sum_{\tilde{x}\in\mathcal{X}'} Q_x(\tilde{x}) = 1$. Observe that for every probability distribution $P_{X_1^n} \in \mathcal{P}_{\mathcal{X}^n}$, we can construct the probability distribution $P_{\tilde{X}_1^n} \in \mathcal{P}_{(\mathcal{X}')^n}$ given by
$$\forall \tilde{x}_1^n \in (\mathcal{X}')^n,\quad P_{\tilde{X}_1^n}(\tilde{x}_1^n) \triangleq \sum_{x_1^n\in\mathcal{X}^n} P_{X_1^n}(x_1^n)\prod_{i=1}^n Q_{x_i}(\tilde{x}_i), \tag{59}$$
where the random variables $\tilde{X}_1,\dots,\tilde{X}_n \in \mathcal{X}'$. This distribution has the property that it induces the marginal distribution $P_{Z_1^n}$ of $Z_1^n$, namely, for all $z_1^n \in \mathcal{Y}^n$,
$$P_{Z_1^n}(z_1^n) \triangleq \sum_{x_1^n\in\mathcal{X}^n} P_{X_1^n}(x_1^n)\prod_{i=1}^n P_{Z|X}(z_i|x_i) = \sum_{x_1^n\in\mathcal{X}^n} P_{X_1^n}(x_1^n)\prod_{i=1}^n\sum_{\tilde{x}\in\mathcal{X}'}Q_{x_i}(\tilde{x})P_{Z|X}(z_i|\tilde{x}) = \sum_{x_1^n\in\mathcal{X}^n} P_{X_1^n}(x_1^n)\sum_{\tilde{x}_1^n\in(\mathcal{X}')^n}\prod_{i=1}^n Q_{x_i}(\tilde{x}_i)P_{Z|X}(z_i|\tilde{x}_i) = \sum_{\tilde{x}_1^n\in(\mathcal{X}')^n}\Bigg(\sum_{x_1^n\in\mathcal{X}^n}P_{X_1^n}(x_1^n)\prod_{i=1}^n Q_{x_i}(\tilde{x}_i)\Bigg)\prod_{i=1}^n P_{Z|X}(z_i|\tilde{x}_i) = \sum_{\tilde{x}_1^n\in(\mathcal{X}')^n} P_{\tilde{X}_1^n}(\tilde{x}_1^n)\prod_{i=1}^n P_{Z|X}(z_i|\tilde{x}_i), \tag{60}$$
where the first equality defines the distribution $P_{Z_1^n}$ of $Z_1^n$ (and uses the memorylessness of $P_{Z|X}$), the second equality follows from (58), the third equality follows from the distributive property, the fourth equality follows from swapping the order of summations, and the fifth equality follows from (59). Furthermore, notice that
$$I(X_1^n;Y_1^n) = \sum_{x_1^n\in\mathcal{X}^n} P_{X_1^n}(x_1^n)\,D\big(P_{Y_1^n|X_1^n}(\cdot|x_1^n)\big\|P_{Y_1^n}\big) \overset{(a)}{=} \sum_{x_1^n\in\mathcal{X}^n} P_{X_1^n}(x_1^n)\,D\big(P_{Z_1^n|X_1^n}(\cdot|x_1^n)\cdot\Pi\big\|P_{Y_1^n}\big) \overset{(b)}{=} \sum_{x_1^n\in\mathcal{X}^n} P_{X_1^n}(x_1^n)\,D\Bigg(\sum_{\tilde{x}_1^n\in(\mathcal{X}')^n}\prod_{i=1}^n Q_{x_i}(\tilde{x}_i)\,P_{Z_1^n|X_1^n}(\cdot|\tilde{x}_1^n)\cdot\Pi\,\Bigg\|\,P_{Y_1^n}\Bigg) \overset{(c)}{=} \sum_{x_1^n\in\mathcal{X}^n} P_{X_1^n}(x_1^n)\,D\Bigg(\sum_{\tilde{x}_1^n\in(\mathcal{X}')^n}\prod_{i=1}^n Q_{x_i}(\tilde{x}_i)\,P_{Y_1^n|X_1^n}(\cdot|\tilde{x}_1^n)\,\Bigg\|\,P_{Y_1^n}\Bigg) \overset{(d)}{\leq} \sum_{x_1^n\in\mathcal{X}^n} P_{X_1^n}(x_1^n)\sum_{\tilde{x}_1^n\in(\mathcal{X}')^n}\prod_{i=1}^n Q_{x_i}(\tilde{x}_i)\,D\big(P_{Y_1^n|X_1^n}(\cdot|\tilde{x}_1^n)\big\|P_{Y_1^n}\big) \overset{(e)}{=} \sum_{\tilde{x}_1^n\in(\mathcal{X}')^n} D\big(P_{Y_1^n|X_1^n}(\cdot|\tilde{x}_1^n)\big\|P_{Y_1^n}\big)\sum_{x_1^n\in\mathcal{X}^n}P_{X_1^n}(x_1^n)\prod_{i=1}^n Q_{x_i}(\tilde{x}_i) \overset{(f)}{=} \sum_{\tilde{x}_1^n\in(\mathcal{X}')^n} P_{\tilde{X}_1^n}(\tilde{x}_1^n)\,D\big(P_{Y_1^n|X_1^n}(\cdot|\tilde{x}_1^n)\big\|P_{Y_1^n}\big) \overset{(g)}{=} I(\tilde{X}_1^n;Y_1^n), \tag{61}$$
where (a) and (c) hold because the conditional distribution $P_{Y_1^n|X_1^n}(\cdot|x_1^n) = P_{Z_1^n|X_1^n}(\cdot|x_1^n)\cdot\Pi \in \mathcal{P}_{\mathcal{Y}^n}$ is the output of pushing the conditional distribution $P_{Z_1^n|X_1^n}(\cdot|x_1^n) \in \mathcal{P}_{\mathcal{Y}^n}$ through the random permutation channel $\Pi = P_{Y_1^n|Z_1^n}$ (defined in (2)) for every $x_1^n \in \mathcal{X}^n$, (b) follows from (60) after substituting Kronecker delta distributions $P_{X_1^n}$ into (59) (and uses the memorylessness of $P_{Z|X}$), (d) follows from the convexity of KL divergence, (e) follows from swapping the order of summations, (f) follows from (59), and (g) holds because (60) conveys that $P_{Y_1^n} \in \mathcal{P}_{\mathcal{Y}^n}$, which is the marginal distribution of $Y_1^n$ in the original Markov chain $X_1^n \to Z_1^n \to Y_1^n$, is also the marginal distribution of $Y_1^n$ in the Markov chain $\tilde{X}_1^n \to Z_1^n \to Y_1^n$.¹¹

(Footnote 10: We note that when there are multiple copies of an extreme point of the convex hull of $\{P_{Z|X}(\cdot|x) \in \mathcal{P}_{\mathcal{Y}} : x \in \mathcal{X}\}$ in $\{P_{Z|X}(\cdot|x) \in \mathcal{P}_{\mathcal{Y}} : x \in \mathcal{X}\}$, we only add one of these conditional distributions to $\{P_{Z|X}(\cdot|x) \in \mathcal{P}_{\mathcal{Y}} : x \in \mathcal{X}'\}$.)

(Footnote 11: Note that we abuse notation here and use the same random variable labels $Z_1^n$ and $Y_1^n$ for the Markov chains $X_1^n \to Z_1^n \to Y_1^n$ and $\tilde{X}_1^n \to Z_1^n \to Y_1^n$, because the two chains can be coupled so that $Z_1^n$ and $Y_1^n$ are shared random variables.)
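The input-alphabet reduction (58)-(60) is easy to see on a toy example. In the following sketch (our own code; the channel is an assumption of ours), the third row of the channel is a convex combination of the first two, so the reduced alphabet is $\mathcal{X}' = \{0,1\}$ and the weights $Q_x$ re-randomize input 2 onto $\mathcal{X}'$ without changing the output law.

```python
import numpy as np

W = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.1, 0.8],
              [0.45, 0.1, 0.45]])   # row 2 = 0.5*row 0 + 0.5*row 1
Q = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])          # convex weights Q_x over X' = {0, 1}
W_ext = W[:2]                       # extreme rows, i.e., P_{Z|X~}

P_X = np.array([0.2, 0.3, 0.5])     # any input distribution
P_Xt = P_X @ Q                      # reduced input distribution (cf. (59))
assert np.allclose(P_X @ W, P_Xt @ W_ext)   # same output law P_Z (cf. (60))
print("P_Z =", P_X @ W)
```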

Second, we construct an equivalent model of the Markov chain $\tilde{X}_1^n \to Z_1^n \to Y_1^n$, which has reduced input alphabet $\mathcal{X}'$, $\tilde{X}_1^n \sim P_{\tilde{X}_1^n}$, $P_{Z_1^n|\tilde{X}_1^n}$ given by the DMC $P_{Z|\tilde{X}}$, and $P_{Y_1^n|Z_1^n} = \Pi$ given by the random permutation channel in (2). Employing Lemma 2 (also see Figure 3), we can swap the random permutation channel $\Pi$ and the DMC $P_{Z|\tilde{X}}$ to get a Markov chain $\tilde{X}_1^n \to V_1^n \to W_1^n$ such that the channel $P_{W_1^n|\tilde{X}_1^n}$ is equivalent to the channel $P_{Y_1^n|\tilde{X}_1^n}$. In this alternative Markov chain, $V_1^n \in (\mathcal{X}')^n$ is an independent random permutation of $\tilde{X}_1^n \in (\mathcal{X}')^n$, and $W_1^n \in \mathcal{Y}^n$ is the output of passing $V_1^n$ through a DMC $P_{W|V}$, which satisfies $P_{W|V} = P_{Z|\tilde{X}}$ (as shown in (14)). Hence, we have
$$I(\tilde{X}_1^n;Y_1^n) = I(\tilde{X}_1^n;W_1^n), \tag{62}$$
since $\tilde{X}_1^n$ is common to both Markov chains $\tilde{X}_1^n \to Z_1^n \to Y_1^n$ and $\tilde{X}_1^n \to V_1^n \to W_1^n$.

Third, we construct a $q$-ary symmetric channel that dominates the DMC $P_{W|V} = P_{Z|\tilde{X}}$ in the degradation sense. To this end, define the parameter
$$\delta = \frac{\nu}{1-\nu+\frac{\nu}{q-1}} > 0$$
in terms of the minimum entry, $\nu$, of $P_{Z|X}$, viz.,
$$\nu = \min_{x\in\mathcal{X},\,y\in\mathcal{Y}} P_{Z|X}(y|x) = \min_{x\in\mathcal{X}',\,y\in\mathcal{Y}} P_{Z|\tilde{X}}(y|x) > 0,$$
where the second equality follows from (58). (Note that $\delta > 0$ because $\nu > 0$, and $\nu > 0$ since $P_{Z|X}$ is strictly positive.) Then, applying Proposition 1, we get that $P_{W|V} = P_{Z|\tilde{X}}$ is a degraded version of $q$-SC($\delta$) (see Definition 3). Let the input random variable of the $q$-SC($\delta$) be $V \in \mathcal{X}'$, and the output random variable be $\tilde{W} \in \mathcal{X}'$, so that we can write $P_{\tilde{W}|V} = S_\delta$. Now consider the Markov chain $\tilde{X}_1^n \to V_1^n \to \tilde{W}_1^n$, where $V_1^n$ is a random permutation of $\tilde{X}_1^n$ as before, and the channel $P_{\tilde{W}_1^n|V_1^n}$ is given by the DMC $P_{\tilde{W}|V} = S_\delta$. Since the degradation preorder tensorizes, the channel $P_{W|V}^{\otimes n}$ is a degraded version of the channel $P_{\tilde{W}|V}^{\otimes n} = S_\delta^{\otimes n}$, where $A^{\otimes n}$ denotes the $n$-fold Kronecker product (or tensor product) of a row stochastic matrix $A$, which corresponds to $n$ uses of the memoryless channel $A$.¹² Thus, we have a Markov chain $\tilde{X}_1^n \to V_1^n \to \tilde{W}_1^n \to W_1^n$ using Definition 2 (where we neglect the difference between physical and stochastic degradation since it is inconsequential in this context). By the data processing inequality, this implies that
$$I(\tilde{X}_1^n;W_1^n) \leq I(\tilde{X}_1^n;\tilde{W}_1^n). \tag{63}$$

(Footnote 12: The tensorization property of the degradation preorder is well-known in information theory. For a proof, notice that given any three row stochastic matrices $A, B, C$ (with consistent dimensions so that the ensuing products are legal), if $A = BC$, then $A\otimes A = (B\otimes B)(C\otimes C)$ using the mixed-product property, where $\otimes$ denotes the Kronecker product.)

Fourth, we again swap the random permutation and DMC blocks in the Markov chain $\tilde{X}_1^n \to V_1^n \to \tilde{W}_1^n$ using Lemma 2. As we argued earlier, this produces an equivalent Markov chain $\tilde{X}_1^n \to \tilde{Z}_1^n \to \tilde{Y}_1^n$ such that the channel $P_{\tilde{Y}_1^n|\tilde{X}_1^n}$ is equivalent to the channel $P_{\tilde{W}_1^n|\tilde{X}_1^n}$. Moreover, in this alternative Markov chain, $\tilde{X}_1^n \sim P_{\tilde{X}_1^n}$ as before, the product channel $P_{\tilde{Z}_1^n|\tilde{X}_1^n}$ is defined by the DMC $P_{\tilde{Z}|\tilde{X}} = S_\delta$ (i.e., a $q$-SC($\delta$)) with input and output alphabet $\mathcal{X}'$, and the channel $P_{\tilde{Y}_1^n|\tilde{Z}_1^n}$ is defined by a random permutation channel. Hence, we have
$$I(\tilde{X}_1^n;\tilde{W}_1^n) = I(\tilde{X}_1^n;\tilde{Y}_1^n). \tag{64}$$
Finally, combining (61), (62), (63), and (64), we get
$$I(X_1^n;Y_1^n) \leq I(\tilde{X}_1^n;\tilde{Y}_1^n),$$
which implies that the right hand side of (57) can be upper bounded as
$$R\log(n) \leq 1 + P^{(n)}_{\mathrm{error}}R\log(n) + I(\tilde{X}_1^n;\tilde{Y}_1^n).$$
Executing the Fisher-Neyman factorization argument from the outset of the proof of Theorem 2, we obtain that $\hat{P}_{\tilde{Y}_1^n}$ is a sufficient statistic of $\tilde{Y}_1^n$ for $\tilde{X}_1^n$. So, we have
$$R\log(n) \leq 1 + P^{(n)}_{\mathrm{error}}R\log(n) + I(\tilde{X}_1^n;\hat{P}_{\tilde{Y}_1^n}), \tag{65}$$
much like the bound in (47). At this stage, noting that the DMC $P_{\tilde{Z}|\tilde{X}} = S_\delta$ is strictly positive, we can upper bound $I(\tilde{X}_1^n;\hat{P}_{\tilde{Y}_1^n})$ by following the proof of Theorem 2 mutatis mutandis. Indeed, starting from (65) and proceeding with the proof of Theorem 2 yields
$$R \leq \frac{q-1}{2} = \frac{|\mathcal{X}'|-1}{2} = \frac{\mathrm{ext}(P_{Z|X})-1}{2}.$$
This completes the proof. ∎

IV. NOISY PERMUTATION CHANNEL CAPACITY

To complement our main result in Theorem 4, we characterize and bound the noisy permutation channel capacities of several other simple classes of DMCs in this section.

A. Unit Rank Channels

We start with what is perhaps the simplest setting—that of an "independent channel." In this case, the noisy permutation channel capacity is obviously zero, and a standard Fano's inequality argument rigorously justifies this.

Proposition 3 ($C_{\mathrm{perm}}$ of Unit Rank Stochastic Matrices). For a unit rank DMC $P_{Z|X} \in \mathbb{R}^{|\mathcal{X}|\times|\mathcal{Y}|}$ such that all rows of $P_{Z|X}$ are equal, we have
$$C_{\mathrm{perm}}(P_{Z|X}) = 0.$$

Proof. We need only prove a converse bound to establish this. Since all rows of $P_{Z|X}$ are equal, the output $Z$ of the DMC $P_{Z|X}$ is independent of the input $X$. Recalling the setup in subsection I-B, this implies that
$$I(X_1^n;\hat{P}_{Y_1^n}) = 0. \tag{66}$$
Notice that (47) (in the proof of Theorem 2) holds for any DMC. So, dividing both sides of (47) by $\log(n)$ and applying (66) yields
$$R \leq \frac{1}{\log(n)} + P^{(n)}_{\mathrm{error}}R,$$
where letting $n \to \infty$ produces $R \leq 0$. Therefore, we have $C_{\mathrm{perm}}(P_{Z|X}) = 0$. ∎

B. Permutation Transition Matrices

Next, we consider the straightforward dual setting of a "perfect channel." In this case, the DMC is an identity channel without loss of generality, which means that the corresponding noisy permutation channel just permutes its input codewords randomly (see subsection I-B). So, intuitively, the maximum number of decodable messages that can be reliably communicated is equal to the number of possible empirical distributions over the alphabet of the DMC (see (67) below). Thus, the noisy permutation channel capacity is clearly characterized by the alphabet size of the DMC. The formal proof is again straightforward, but we include it here for completeness.

Proposition 4 ($C_{\mathrm{perm}}$ of Permutation Stochastic Matrices). For a DMC $P_{Z|X}$ such that $|\mathcal{X}| = |\mathcal{Y}| = k \geq 2$ and $P_{Z|X} \in \mathbb{R}^{k\times k}$ is a permutation matrix, we have
$$C_{\mathrm{perm}}(P_{Z|X}) = k - 1.$$

Proof.
Achievability: Under the setup of subsection I-B, consider the obvious encoder-decoder pair (a sketch of this scheme is given after this proof):
1) The message set $\mathcal{M} = \big\{m = [m_1\,\cdots\,m_k]^{\mathrm{T}} \in (\mathbb{N}\cup\{0\})^k : m_1+\cdots+m_k = n\big\}$ with cardinality
$$|\mathcal{M}| = \binom{n+k-1}{k-1} = \Theta\big(n^{k-1}\big), \tag{67}$$
where the elements of $\mathcal{M}$ have been re-indexed for convenience.
2) The encoder $f_n : \mathcal{M} \to \mathcal{X}^n$ is given by
$$\forall m \in \mathcal{M},\quad f_n(m) = (\underbrace{1,\dots,1}_{m_1\ \text{1's}},\underbrace{2,\dots,2}_{m_2\ \text{2's}},\dots,\underbrace{k,\dots,k}_{m_k\ k\text{'s}}),$$
where $\mathcal{X} = \{1,\dots,k\}$ without loss of generality.
3) The decoder $g_n : \mathcal{Y}^n \to \mathcal{M}$ is given by
$$\forall y_1^n \in \mathcal{Y}^n,\quad g_n(y_1^n) = n\,P_{Z|X}\big[\hat{P}_{y_1^n}(1)\,\cdots\,\hat{P}_{y_1^n}(k)\big]^{\mathrm{T}},$$
where $\mathcal{Y} = \mathcal{X}$ without loss of generality (and $P_{Z|X} \in \mathbb{R}^{k\times k}$ is a permutation matrix).

Clearly, this encoder-decoder pair achieves $P^{(n)}_{\mathrm{error}} = 0$. Hence, using (67), the rate
$$R = \lim_{n\to\infty} \frac{\log\binom{n+k-1}{k-1}}{\log(n)} = k - 1$$
is achievable, and $C_{\mathrm{perm}}(P_{Z|X}) \geq k - 1$.

Converse: Recall that (47) (in the proof of Theorem 2) holds for any DMC. We bound the mutual information term in (47) with
$$I(X_1^n;\hat{P}_{Y_1^n}) = H(\hat{P}_{Y_1^n}) \leq (k-1)\log(n+1),$$
where the equality holds because $H(\hat{P}_{Y_1^n}|X_1^n) = 0$, and the inequality uses the upper bound on the number of possible empirical distributions given in (5). Then, as before, combining (47) with the above bound on mutual information and dividing by $\log(n)$ yields
$$R \leq \frac{1}{\log(n)} + P^{(n)}_{\mathrm{error}}R + \frac{(k-1)\log(n+1)}{\log(n)},$$
where letting $n \to \infty$ produces $R \leq k - 1$. Therefore, we have $C_{\mathrm{perm}}(P_{Z|X}) \leq k - 1$, which completes the proof. ∎

C. Strictly Positive Channels

Recall that Theorem 4 in section II presents the main contribution of this paper—a closed-form expression for the noisy permutation channel capacity of strictly positive DMCs with full rank. For general strictly positive DMCs, we complement Theorem 4 by proposing the following conjecture.

Conjecture 1 ($C_{\mathrm{perm}}$ of Strictly Positive Channels). For any strictly positive DMC $P_{Z|X}$, we have
$$C_{\mathrm{perm}}(P_{Z|X}) = \liminf_{n\to\infty}\ \sup_{P_{X_1^n}\in\mathcal{P}_{\mathcal{X}^n}} \frac{I(\hat{P}_{X_1^n};\hat{P}_{Y_1^n})}{\log(n)} = \frac{\mathrm{rank}(P_{Z|X})-1}{2},$$
where the supremum in the first equality is over all probability distributions in $\mathcal{P}_{\mathcal{X}^n}$, or equivalently, over all probability distributions of $\hat{P}_{X_1^n}$.

While Definition 1 provides an operational definition of $C_{\mathrm{perm}}(P_{Z|X})$, the first equality in Conjecture 1 can be construed as the corresponding notion of "information capacity" (analogous to the definition of multi-letter information capacity in, e.g., [19, Definition 18.6]), and the second equality in Conjecture 1 is a closed-form expression for the noisy permutation channel capacity. As Conjecture 1 reveals, we believe that our achievability bound in Theorem 1 is most likely tight. To briefly elaborate on this further, the first equality is inspired by the modified Fano's inequality argument in (47), where we also use Lemma 2 to obtain $I(\hat{P}_{X_1^n};\hat{P}_{Y_1^n})$ instead of $I(X_1^n;\hat{P}_{Y_1^n})$, and the second equality is suggested by the first equality via intuition from the multivariate CLT.

Next, as a concrete and canonical illustration of Propositions 3 and 4 and Theorem 4, we present the noisy permutation channel capacity of binary symmetric channels below (see Definition 3 for a definition of $q$-SCs). This result was first proved in [2, Theorem 3]. (Note that in the context of the work in [10], [11], and [12], this BSC setting corresponds to noisy permutation channels with substitution errors.)

Proposition 5 ($C_{\mathrm{perm}}$ of BSCs [2, Theorem 3]).
$$C_{\mathrm{perm}}(\mathrm{BSC}(\delta)) = \begin{cases} 1, & \text{for } \delta \in \{0,1\} \\ \frac{1}{2}, & \text{for } \delta \in \big(0,\tfrac{1}{2}\big)\cup\big(\tfrac{1}{2},1\big) \\ 0, & \text{for } \delta = \tfrac{1}{2} \end{cases}.$$

Proof. The $\delta = \tfrac{1}{2}$ case follows from Proposition 3, the $\delta \in \{0,1\}$ case follows from Proposition 4, and the remaining case follows from Theorem 4. ∎

We remark that Proposition 5 illustrates a few somewhat surprising facts about noisy permutation channel capacity.

While traditional channel capacity is convex as a function of the channel (with fixed dimensions), noisy permutation channel capacity is clearly non-convex and discontinuous as a function of the channel. Moreover, for the most part, the noisy permutation channel capacity of a BSC does not depend on $\delta$. Looking at the proof of Proposition 2 in subsection III-B, this is because the scaling with $n$ of the $\ell^2$-distance between two encoded messages does not change after passing through the memoryless BSC (see, e.g., (38) and footnote 8). However, (36) and (43) suggest that $\delta$ does affect the rate of decay of $P^{(n)}_{\mathrm{error}}$. Finally, we note that in a manner similar to Proposition 5, we can also determine the noisy permutation channel capacity of any $q$-SC($\delta$) for $\delta \in [0,1)$ using Propositions 3 and 4 and Theorem 4.

D. Erasure Channels and Doeblin Minorization

In this subsection, we consider the important class of $q$-ary erasure channels. Indeed, in the context of communication networks, networks where packets can be dropped are typically modeled as noisy permutation channels with possible deletions, or equivalently, erasures, cf. [10], [11], [12, Remark 1]. Since the transition kernels of $q$-ary erasure channels contain zero entries, our converse results in Theorems 2 and 3 do not hold. So, we will present some bounds on their noisy permutation channel capacities. First, let us recall the definition of $q$-ary erasure channels.

Definition 4 ($q$-ary Erasure Channel). Under the formalism presented in subsection I-B, we define a $q$-ary erasure channel $P_{Z|X}$ with erasure probability $\eta \in [0,1]$, input alphabet $\mathcal{X}$ with $|\mathcal{X}| = q \in \mathbb{N}\backslash\{1\}$, and output alphabet $\mathcal{Y} = \mathcal{X}\cup\{\mathrm{E}\}$, where $\mathrm{E}$ denotes the erasure symbol, using the conditional distributions
$$\forall z \in \mathcal{Y},\,\forall x \in \mathcal{X},\quad P_{Z|X}(z|x) = \begin{cases} 1-\eta, & \text{for } z = x \\ \eta, & \text{for } z = \mathrm{E} \\ 0, & \text{otherwise} \end{cases}.$$
Moreover, we represent such a channel $P_{Z|X}$ as $q$-EC($\eta$) for convenience.

We note that in the special case where $q = 2$, $\mathcal{X} = \{0,1\}$, and $\eta$ is the probability that the input bit is erased, we refer to the 2-EC($\eta$) as a binary erasure channel (BEC), denoted BEC($\eta$).

Next, in order to present our bounds on the noisy permutation channel capacity of erasure channels, we introduce a classical concept from the Markov process literature. As we will see, one approach to proving our achievability bound entails using a symmetric channel that is degraded by the erasure channel under consideration. While we have introduced degradation in subsection III-A, the specific setting of degradation by erasure channels has been studied extensively in the Markov process literature under the guise of "Doeblin minorization." We next introduce the concept of Doeblin minorization in an information theoretic light (within the formalism of subsection I-B), cf. [52, Section 3].

Definition 5 (Doeblin Minorization). A row stochastic matrix $P_{Z|X} \in \mathbb{R}^{|\mathcal{X}|\times|\mathcal{Y}|}$ satisfies the Doeblin minorization condition if there exists a probability distribution $Q_Z \in \mathcal{P}_{\mathcal{Y}}$ and a constant $\eta \in (0,1)$ such that
$$\forall z \in \mathcal{Y},\,\forall x \in \mathcal{X},\quad P_{Z|X}(z|x) \geq \eta\,Q_Z(z),$$
and we say that $P_{Z|X}$ satisfies Doeblin($Q_Z,\eta$). Furthermore, we say that $P_{Z|X}$ satisfies Doeblin($Q_Z,0$) when $P_{Z|X}$ does not satisfy the Doeblin minorization condition (since the above condition is trivially true when $\eta = 0$).

Definition 5 of Doeblin minorization is less general than its definition in a finite state space Markov chain context, where one often studies "local minorization" of multi-step Markov transition kernels, cf. [52, Section 4]. On the other hand, our definition applies to more general (rectangular) transition kernels. While the Doeblin minorization condition was originally developed to study the ergodicity of Markov processes,¹³ as we alluded to earlier, it turns out to be equivalent to degradation by an erasure channel. The next lemma depicts this known, but seemingly overlooked, connection.

(Footnote 13: As a historical remark, it is worth mentioning that as stated in [52, Section 3], "two of the most powerful ideas in the modern theory of Markov processes were introduced [by Doeblin in [53] and [54]]; namely minorization and coupling, respectively.")

Lemma 6 (Doeblin Minorization and Degradation [52], [55]). Consider any DMC $P_{Z|X}$ with input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}$ with $|\mathcal{X}| = q$. Then, the following are true:
1) (Equivalence [52, Theorem 3.1]) For any constant $\eta \in (0,1)$, $P_{Z|X}$ satisfies Doeblin($Q_Z,\eta$) for some distribution $Q_Z \in \mathcal{P}_{\mathcal{Y}}$ if and only if $P_{Z|X}$ is a degraded version of the $q$-ary erasure channel $q$-EC($\eta$).
2) (Extremality [55, Lemma 4]) The extremal erasure probability $\eta^* = \eta^*(P_{Z|X})$ such that $P_{Z|X}$ is a degraded version of $q$-EC($\eta^*$) is given by
$$\eta^* \triangleq \max\big\{\eta \in [0,1] : P_{Z|X} \text{ is a degraded version of } q\text{-EC}(\eta)\big\} = \sup\big\{\eta \in [0,1) : P_{Z|X} \text{ satisfies Doeblin}(Q_Z,\eta) \text{ for some distribution } Q_Z \in \mathcal{P}_{\mathcal{Y}}\big\} = \sum_{z\in\mathcal{Y}}\min_{x\in\mathcal{X}} P_{Z|X}(z|x),$$
where the second equality follows from part 1 and the quantity in the final equality is known as Doeblin's coefficient of ergodicity, cf. [56, Definition 5.1].

Although Lemma 6 is known in the literature, we provide a proof of part 1 in appendix D for completeness. Moreover, we note that the equivalent description of Doeblin minorization as degradation by an erasure channel can also be viewed as a specialization of the so called regeneration or Nummelin splitting technique in the theory of Harris chains [57], [58].

In the ensuing theorem, we derive Theorem 5, which uses the notion of degradation to prove a comparison bound for noisy permutation channel capacities, as well as a related bound pertaining to Doeblin minorization, which specializes Theorem 5 for erasure channels (as revealed by our discussion heretofore). As outlined in subsection II-D, this result concurs with the intuition that degraded channels are "more noisy," and therefore, have smaller noisy permutation channel capacity.
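Both parts of Lemma 6 can be illustrated computationally. The sketch below (our own code, with an example channel of our choosing) computes Doeblin's coefficient $\eta^* = \sum_z \min_x P_{Z|X}(z|x)$ and exhibits the explicit factorization through an erasure channel used in the proof of part 1 in appendix D.

```python
import numpy as np

W = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.4, 0.4],
              [0.3, 0.3, 0.4]])        # a strictly positive 3x3 DMC
eta = W.min(axis=0).sum()              # Doeblin coefficient eta*
Q_Z = W.min(axis=0) / eta              # minorizing distribution
assert (W >= eta * Q_Z - 1e-12).all()  # Doeblin(Q_Z, eta) holds

q, ny = W.shape
# Residual channel on inputs X u {E}: rows (W - eta*Q_Z)/(1-eta), plus Q_Z.
W_res = np.vstack([(W - eta * Q_Z) / (1 - eta), Q_Z])
# q-EC(eta): copies x with prob. 1-eta, outputs the erasure symbol with eta.
EC = np.hstack([(1 - eta) * np.eye(q), eta * np.ones((q, 1))])
assert np.allclose(EC @ W_res, W)      # degradation: W = q-EC(eta) . W_res
print("eta* =", eta)
```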

Theorem 6 (Comparison Bounds via Degradation). Consider any two DMCs $P_{Z_1|X} \in \mathbb{R}^{|\mathcal{X}|\times|\mathcal{Z}_1|}$ and $P_{Z_2|X} \in \mathbb{R}^{|\mathcal{X}|\times|\mathcal{Z}_2|}$, with common input alphabet $\mathcal{X}$ and output alphabets $\mathcal{Z}_1$ and $\mathcal{Z}_2$, respectively. Then, the following are true:
1) If $P_{Z_2|X}$ is a degraded version of $P_{Z_1|X}$, then we have
$$C_{\mathrm{perm}}(P_{Z_2|X}) \leq C_{\mathrm{perm}}(P_{Z_1|X}).$$
2) If $P_{Z_2|X}$ satisfies Doeblin($Q_Z,\eta$) for some distribution $Q_Z \in \mathcal{P}_{\mathcal{Y}}$ and some constant $\eta \in (0,1)$, then we have
$$C_{\mathrm{perm}}(P_{Z_2|X}) \leq C_{\mathrm{perm}}(q\text{-EC}(\eta)),$$
where we let $q = |\mathcal{X}|$.

Proof.
Part 1: Recalling the formalism introduced in subsection I-B, fix any (small) $\epsilon > 0$ such that $R \triangleq C_{\mathrm{perm}}(P_{Z_2|X}) - \epsilon \geq 0$ is an achievable rate for the DMC $P_{Z_2|X}$. Then, for the noisy permutation channel model with DMC $P_{Z_2|X}$, consider the Markov chain $M \to f_n(M) = X_1^n \to (Z_2)_1^n \to (Y_2)_1^n \to g_n((Y_2)_1^n)$, defined by a message set $\mathcal{M}$ with cardinality $|\mathcal{M}| = n^R$, a sequence of possibly randomized encoders $\{f_n : \mathcal{M} \to \mathcal{X}^n\}_{n\in\mathbb{N}}$, and a sequence of associated possibly randomized decoders $\{g_n : \mathcal{Z}_2^n \to \mathcal{M}\cup\{\mathrm{e}\}\}_{n\in\mathbb{N}}$, where $(Y_2)_1^n \in \mathcal{Z}_2^n$ denotes a random permutation of the output codeword $(Z_2)_1^n$ of the DMC $P_{Z_2|X}$. Let us define
$$P^{(n)}_{\mathrm{error}}(P_{Z_2|X},f_n,g_n) \triangleq \mathbb{P}\big(M \neq g_n((Y_2)_1^n)\big)$$
as the average probability of error for the noisy permutation channel model corresponding to $P_{Z_2|X}$, $f_n$, and $g_n$. Since $R$ is an achievable rate, we further assume that $f_n$ and $g_n$ are chosen such that $\lim_{n\to\infty} P^{(n)}_{\mathrm{error}}(P_{Z_2|X},f_n,g_n) = 0$. By our assumption in the theorem statement, we know using Definition 2 that there exists some DMC $P_{Z_2|Z_1} \in \mathbb{R}^{|\mathcal{Z}_1|\times|\mathcal{Z}_2|}$ (with input alphabet $\mathcal{Z}_1$ and output alphabet $\mathcal{Z}_2$) such that $P_{Z_2|X} = P_{Z_1|X}P_{Z_2|Z_1}$. We will use this degradation relation to construct a "good" encoder-decoder pair for the DMC $P_{Z_1|X}$.

To this end, for the noisy permutation channel model with DMC $P_{Z_1|X}$, consider the Markov chain $M \to f_n(M) = X_1^n \to (Z_1)_1^n \to (Y_1)_1^n$, where we use the same message set (with cardinality $n^R$) and encoder sequence as before, and $(Y_1)_1^n \in \mathcal{Z}_1^n$ is a random permutation of the output codeword $(Z_1)_1^n$ of the DMC $P_{Z_1|X}$. (In fact, the random variables $M$ and $X_1^n$ are coupled to be equal for the two models.) Now, for every $n \in \mathbb{N}$, construct the decoder $\tilde{g}_n : \mathcal{Z}_1^n \to \mathcal{M}\cup\{\mathrm{e}\}$ so that
$$\forall y_1^n \in \mathcal{Z}_1^n,\quad \tilde{g}_n(y_1^n) \triangleq g_n(\bar{Z}_1^n), \tag{68}$$
where $\bar{Z}_i \sim P_{Z_2|Z_1}(\cdot|y_i)$ for every $i \in \{1,\dots,n\}$, and $\bar{Z}_1^n$ are mutually independent. This produces the Markov chain $M \to X_1^n \to (Z_1)_1^n \to (Y_1)_1^n \to \tilde{g}_n((Y_1)_1^n)$. We note that the decoder in (68) essentially "simulates" the auxiliary DMC $P_{Z_2|Z_1}$ so that its output is statistically equivalent to the output of the decoder $g_n$ with DMC $P_{Z_2|X}$.

We next prove the intuitively straightforward relation
$$P^{(n)}_{\mathrm{error}}(P_{Z_2|X},f_n,g_n) = P^{(n)}_{\mathrm{error}}(P_{Z_1|X},f_n,\tilde{g}_n). \tag{69}$$
To establish this, consider yet another Markov chain, $M \to f_n(M) = X_1^n \to V_1^n \to W_1^n \to U_1^n$, where the message set and $f_n$ are the same as before, $V_1^n \in \mathcal{X}^n$ is a random permutation of $X_1^n$, $W_1^n \in \mathcal{Z}_1^n$ is the output of the DMC $P_{Z_1|X}$ with input $V_1^n$, and $U_1^n \in \mathcal{Z}_2^n$ is the output of the DMC $P_{Z_2|Z_1}$ with input $W_1^n$. Using Lemma 2, notice that the conditional distribution $P_{(Y_1)_1^n|X_1^n}$ is equivalent to the conditional distribution $P_{W_1^n|X_1^n}$. Furthermore, Lemma 2 and the degradation relation $P_{Z_2|X} = P_{Z_1|X}P_{Z_2|Z_1}$ imply that the conditional distribution $P_{(Y_2)_1^n|X_1^n}$ is equivalent to the conditional distribution $P_{U_1^n|X_1^n}$. Thus, the joint distribution of $(M,\tilde{g}_n((Y_1)_1^n))$ is equal to the joint distribution of $(M,g_n((Y_2)_1^n))$, because the conditional distributions of $\tilde{g}_n(W_1^n)$ and $g_n(U_1^n)$ given $M$ are equivalent, where we may perceive $U_1^n$ as the intermediate random variables used by $\tilde{g}_n$ in (68) so that $\tilde{g}_n(W_1^n) = g_n(U_1^n)$. This produces the relation (69).

Lastly, we conclude this proof by realizing that (69) reveals that $R = C_{\mathrm{perm}}(P_{Z_2|X}) - \epsilon$ is an achievable rate for the DMC $P_{Z_1|X}$. Therefore, we have
$$C_{\mathrm{perm}}(P_{Z_2|X}) - \epsilon \leq C_{\mathrm{perm}}(P_{Z_1|X}),$$
and we can let $\epsilon \to 0$ to obtain the desired inequality.

Part 2: This follows immediately from part 1 of this theorem and Lemma 6. ∎

We are now in a position to present bounds on the noisy permutation channel capacity of $q$-ary erasure channels. Although the achievability bound in the ensuing proposition can be obtained as a direct consequence of Theorem 1, we will elucidate an alternative coding scheme that establishes this bound using Doeblin minorization. Similarly, although the converse bound in the ensuing proposition is just the trivial bound given in (11), we will provide an alternative proof for it.

Proposition 6 (Bounds on $C_{\mathrm{perm}}$ of $q$-EC). For a $q$-ary erasure channel $q$-EC($\eta$) with $\eta \in (0,1)$, we have
$$\frac{q-1}{2} \leq C_{\mathrm{perm}}(q\text{-EC}(\eta)) \leq q - 1.$$
Furthermore, the extremal noisy permutation channel capacities are $C_{\mathrm{perm}}(q\text{-EC}(0)) = q - 1$ and $C_{\mathrm{perm}}(q\text{-EC}(1)) = 0$.

Proof.
Achievability for $\eta \in (0,1)$: To derive a lower bound on $C_{\mathrm{perm}}(q\text{-EC}(\eta))$, we will employ a useful representation of $q$-SCs using $q$-ECs. Observe that the row stochastic transition probability matrix of a $q$-SC($\eta(q-1)/q$) can be decomposed as
$$S_{\eta(q-1)/q} = (1-\eta)\,I + \eta\,\frac{1}{q}\mathbf{1}\mathbf{1}^{\mathrm{T}}, \tag{70}$$
where $S_{\eta(q-1)/q}$ is the $q$-ary symmetric channel matrix defined in (13), $I \in \mathbb{R}^{q\times q}$ is the identity matrix representing a channel that exactly copies its input, $\mathbf{1} = [1\,\cdots\,1]^{\mathrm{T}} \in \mathbb{R}^q$ denotes the column vector with all elements equal to unity, and $\frac{1}{q}\mathbf{1}\mathbf{1}^{\mathrm{T}}$ represents a channel whose output is an independent uniform random variable. Hence, a $q$-SC($\eta(q-1)/q$) can be equivalently construed as a channel that either copies its input random variable with probability $1-\eta$, or generates a completely independent output random variable that is uniformly distributed on the input alphabet (where $|\mathcal{X}| = q$) with probability $\eta$.
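The decomposition (70) and the coupling it induces can be checked directly. The demo below (our own code) verifies the matrix identity and confirms by Monte Carlo that a $q$-SC($\eta(q-1)/q$) output has the same law as a $q$-EC($\eta$) output in which every erasure is replaced by an independent uniform symbol.

```python
import numpy as np

q, eta = 4, 0.3
delta = eta * (q - 1) / q
S = np.full((q, q), delta / (q - 1))
np.fill_diagonal(S, 1 - delta)
assert np.allclose(S, (1 - eta) * np.eye(q) + eta * np.ones((q, q)) / q)

rng = np.random.default_rng(3)
n = 200000
x = np.zeros(n, dtype=int)                        # send the fixed symbol 0
erased = rng.random(n) < eta                      # q-EC(eta) erasure pattern
z = np.where(erased, rng.integers(q, size=n), x)  # fill erasures uniformly
emp = np.bincount(z, minlength=q) / n
print(np.round(emp, 3), np.round(S[0], 3))        # empirical law ~ row 0 of S
```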

A consequence of this interpretation, or equivalently, the decomposition (70), is that a $q$-SC($\eta(q-1)/q$) satisfies Doeblin($\mathbf{1}^{\mathrm{T}}/q,\eta$), where $\mathbf{1}^{\mathrm{T}}/q$ is a row vector representing the uniform distribution. Using part 1 of Lemma 6, this means that a $q$-SC($\eta(q-1)/q$) is a degraded version of a $q$-EC($\eta$); in particular, a $q$-SC($\eta(q-1)/q$) is statistically equivalent to a $q$-EC($\eta$) followed by a channel that outputs an independent uniformly distributed random variable for the input erasure symbol $\mathrm{E}$, and copies all other input symbols. Moreover, part 2 of Theorem 6 conveys that
$$C_{\mathrm{perm}}\Big(q\text{-SC}\Big(\frac{\eta(q-1)}{q}\Big)\Big) \leq C_{\mathrm{perm}}(q\text{-EC}(\eta)).$$
Since $\frac{\eta(q-1)}{q} \in \big(0,\frac{q-1}{q}\big)$, it is straightforward to verify that the $q$-ary symmetric channel matrix $S_{\eta(q-1)/q}$ is strictly positive and non-singular. So, we have
$$C_{\mathrm{perm}}(q\text{-EC}(\eta)) \geq C_{\mathrm{perm}}\Big(q\text{-SC}\Big(\frac{\eta(q-1)}{q}\Big)\Big) = \frac{q-1}{2}$$
using Theorem 4, which proves the desired result.

We remark that according to the proof of part 1 of Theorem 6, an appropriately altered coding scheme from the achievability proof of Theorem 1 in subsection III-B, which has:
1) A randomized encoder described by (27),
2) A decoder that first generates independent uniform random output letters to replace every erasure symbol in the output codeword, and then applies the decoder (29), which is characterized by (30) specialized to a $q$-SC($\eta(q-1)/q$),
achieves the lower bound on $C_{\mathrm{perm}}(q\text{-EC}(\eta))$. Alternatively, if we directly use Theorem 1 and the fact that $\mathrm{rank}(q\text{-EC}(\eta)) = q$ to obtain the lower bound on $C_{\mathrm{perm}}(q\text{-EC}(\eta))$ (as mentioned earlier), then this corresponds to using the same encoder (27), but an alternative decoder (29), which is characterized by (30) specialized to a $q$-EC($\eta$).

Converse for $\eta \in (0,1)$: As mentioned earlier, the upper bound in the proposition statement is immediate from (11) and the fact that $\mathrm{ext}(q\text{-EC}(\eta)) = q$. However, as outlined after (11) in subsection II-B, the inequality in terms of $\mathrm{ext}(\cdot)$ in (11) is proved using a degradation argument akin to the proof of Theorem 3. Here, we provide a simpler alternative proof of the (intuitively obvious) upper bound in the proposition statement by exploiting specific properties of $q$-ary erasure channels. Recall that (47) (from the proof of Theorem 2) holds for a $q$-EC($\eta$), and we can bound the mutual information term in (47) via
$$I(X_1^n;\hat{P}_{Y_1^n}) \overset{(a)}{=} I\big(X_1^n;\hat{P}_{Y_1^n}(\mathrm{E}),\hat{P}_{Y_1^n}(1),\dots,\hat{P}_{Y_1^n}(q-1)\big) \overset{(b)}{=} I\big(X_1^n;\hat{P}_{Y_1^n}(1),\dots,\hat{P}_{Y_1^n}(q-1)\,\big|\,\hat{P}_{Y_1^n}(\mathrm{E})\big) + I\big(X_1^n;\hat{P}_{Y_1^n}(\mathrm{E})\big) \overset{(c)}{=} I\big(X_1^n;\hat{P}_{Y_1^n}(1),\dots,\hat{P}_{Y_1^n}(q-1)\,\big|\,\hat{P}_{Y_1^n}(\mathrm{E})\big) = H\big(\hat{P}_{Y_1^n}(1),\dots,\hat{P}_{Y_1^n}(q-1)\,\big|\,\hat{P}_{Y_1^n}(\mathrm{E})\big) - H\big(\hat{P}_{Y_1^n}(1),\dots,\hat{P}_{Y_1^n}(q-1)\,\big|\,X_1^n,\hat{P}_{Y_1^n}(\mathrm{E})\big) \overset{(d)}{\leq} (q-1)\log(n+1) - H\big(\hat{P}_{Y_1^n}(1),\dots,\hat{P}_{Y_1^n}(q-1)\,\big|\,X_1^n,\hat{P}_{Y_1^n}(\mathrm{E})\big) \tag{71}$$
$$\overset{(e)}{\leq} (q-1)\log(n+1), \tag{72}$$
where (a) holds because $\hat{P}_{Y_1^n}$ sums to unity and we let $\mathcal{X} = \{1,\dots,q\}$ without loss of generality, (b) follows from the chain rule, (c) uses the fact that $I(X_1^n;\hat{P}_{Y_1^n}(\mathrm{E})) = 0$ since the codeword $X_1^n$ is independent of the number of erasures $n\hat{P}_{Y_1^n}(\mathrm{E})$, (d) holds because $n\hat{P}_{Y_1^n}(i) \in \{0,\dots,n\}$ for every $i \in \{1,\dots,q-1\}$, and (e) follows from the non-negativity of Shannon entropy. Therefore, as before, substituting (72) into (47), dividing by $\log(n)$, and letting $n \to \infty$, we get that any achievable rate $R \geq 0$ satisfies $R \leq q - 1$. This proves that $C_{\mathrm{perm}}(q\text{-EC}(\eta)) \leq q - 1$.

Case $\eta = 0$: In this case, the $q$-EC($0$) is just the deterministic identity channel. Hence, $C_{\mathrm{perm}}(q\text{-EC}(0)) = q - 1$ using Proposition 4.

Case $\eta = 1$: In this case, the $q$-EC($1$) erases all its input symbols so that we obtain an "independent channel." Hence, $C_{\mathrm{perm}}(q\text{-EC}(1)) = 0$ using Proposition 3. ∎

We finally make several pertinent remarks. Firstly, in the special case of $q = 2$ and $\eta \in (0,1)$, the elegant interpretation of a BSC$\big(\frac{\eta}{2}\big)$ as a BEC($\eta$) which additionally replaces any output erasure symbol $\mathrm{E}$ with an independent $\mathrm{Ber}\big(\frac{1}{2}\big)$ bit, or equivalently, the decomposition (70), is a notion that originates from Fortuin-Kasteleyn random cluster representations of Ising models in the study of percolation, cf. [59]. Moreover, this notion has been exploited in various other discrete probability contexts such as reliable computation using noisy circuits [60, p.570], broadcasting on trees [43, p.412], and broadcasting on directed acyclic graphs [61, Appendix C] (also see [62, p.1634]).

Secondly, in the special case of $q = 2$ and $\eta \in (0,1)$, we propose the following conjecture.

Conjecture 2 ($C_{\mathrm{perm}}$ of BECs). For any erasure probability $\eta \in (0,1)$, the noisy permutation channel capacity of the BEC($\eta$) is given by
$$C_{\mathrm{perm}}(\mathrm{BEC}(\eta)) = \frac{1}{2}.$$

Specifically, we believe that the achievability bound presented in Proposition 6 is tight. Consequently, unlike traditional channel capacity, we believe that the noisy permutation channel capacities of BSCs and BECs are equal in the non-trivial regimes of their parameters. Indeed, the converse bound in Proposition 6 for the case $q = 2$, $C_{\mathrm{perm}}(\mathrm{BEC}(\eta)) \leq 1$, is intuitively trivial as there are only $n+1$ distinct empirical distributions of codewords in $\{0,1\}^n$. So, we anticipate that this bound can be significantly tightened. One approach towards tightening the converse bound would be to consider the mutual information bound in (71), and much like the proof of Theorem 2 in subsection III-C, derive a lower bound on $H\big(\hat{P}_{Y_1^n}(1)\,\big|\,X_1^n,\hat{P}_{Y_1^n}(\mathrm{E})\big)$ of the form
$$H\big(\hat{P}_{Y_1^n}(1)\,\big|\,X_1^n,\hat{P}_{Y_1^n}(\mathrm{E})\big) \geq \frac{1}{2}\log(n) + o(\log(n)). \tag{73}$$
Clearly, combining (47), (71), and (73) would yield the desired bound $C_{\mathrm{perm}}(\mathrm{BEC}(\eta)) \leq \frac{1}{2}$. As explained in [2, Equation (26)], finding a bound of the form (73) corresponds to analyzing the Shannon entropy of hypergeometric distributions in non-trivial regimes of their parameters.
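To get a feel for the $\frac{1}{2}\log(n)$ scaling that a bound like (73) would require, the following exploration (our own code; the input and erasure parameters are assumptions of ours, and entropies are in nats) computes the entropy of hypergeometric laws and compares it to $\frac{1}{2}\log(n)$.

```python
import numpy as np
from scipy.stats import hypergeom

for n in (100, 1000, 10000):
    K, m = n // 2, n // 2        # K ones among n inputs, m unerased outputs
    ks = np.arange(max(0, m - (n - K)), min(K, m) + 1)
    pmf = hypergeom.pmf(ks, n, K, m)
    H = -np.sum(pmf[pmf > 0] * np.log(pmf[pmf > 0]))
    print(n, round(H, 3), round(0.5 * np.log(n), 3))   # gap stays ~ constant
```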

Thirdly, when $q \geq 3$, we also postulate that for any $\eta \in (0,1)$,
$$C_{\mathrm{perm}}(q\text{-EC}(\eta)) \leq \frac{q}{2}. \tag{74}$$
While this upper bound does not exactly determine the noisy permutation channel capacity of $q$-ary erasure channels, it does have the following useful corollaries (if proven to be true):
1) In the limit of asymptotically large input alphabet size (i.e., as $q \to \infty$), the noisy permutation channel capacity of $q$-ary erasure channels is characterized by
$$\forall \eta \in (0,1),\quad \lim_{q\to\infty} \frac{C_{\mathrm{perm}}(q\text{-EC}(\eta))}{q} = \frac{1}{2}. \tag{75}$$
2) Applying part 2 of Theorem 6, if a DMC $P_{Z|X}$ satisfies the Doeblin minorization condition, then we obtain the converse bound
$$C_{\mathrm{perm}}(P_{Z|X}) \leq \frac{|\mathcal{X}|}{2}. \tag{76}$$
This bound is clearly weaker than that in Theorem 3 for strictly positive DMCs. However, it also holds for certain DMCs that have zero entries, and therefore, extends Theorem 3 for such DMCs.

We believe (74) could be true, because we can upper bound the mutual information term in (47) so that
$$I(X_1^n;\hat{P}_{Y_1^n}) = H(\hat{P}_{Y_1^n}) - H(\hat{P}_{Y_1^n}|X_1^n) \overset{(a)}{\leq} q\log(n+1) - H\big(\hat{P}_{Y_1^n}(1),\dots,\hat{P}_{Y_1^n}(q)\,\big|\,X_1^n\big) \overset{(b)}{=} q\log(n+1) - H\big(N_{\mathrm{E}}(1),\dots,N_{\mathrm{E}}(q)\,\big|\,X_1^n\big) \overset{(c)}{=} q\log(n+1) - \sum_{i=1}^q H\big(N_{\mathrm{E}}(i)\,\big|\,X_1^n\big) \overset{(d)}{=} q\log(n+1) - \sum_{i=1}^q \mathbb{E}\big[H\big(\mathrm{bin}\big(n\hat{P}_{X_1^n}(i),\eta\big)\big)\big], \tag{77}$$
where (a) follows from the bound in (5), the fact that $\hat{P}_{Y_1^n}$ sums to unity, and by letting $\mathcal{X} = \{1,\dots,q\}$ without loss of generality, (b) follows from defining the number of erasures that occur on the input symbol $i \in \mathcal{X}$ as the random variable $N_{\mathrm{E}}(i) \triangleq n\hat{P}_{X_1^n}(i) - n\hat{P}_{Y_1^n}(i) \in \{0,\dots,n\hat{P}_{X_1^n}(i)\}$, (c) holds because $\{N_{\mathrm{E}}(i) : i \in \mathcal{X}\}$ are conditionally independent given $X_1^n$, (d) holds because each $N_{\mathrm{E}}(i)$ is a binomial random variable with $n\hat{P}_{X_1^n}(i)$ trials and success probability $\eta$ conditioned on $X_1^n$, and each term $\mathbb{E}\big[H\big(\mathrm{bin}\big(n\hat{P}_{X_1^n}(i),\eta\big)\big)\big]$ represents the expectation of a binomial entropy with respect to the distribution of $X_1^n$. If it can be shown that any encoder, which achieves vanishing probability of error for rates "close to" the noisy permutation channel capacity, must satisfy $\hat{P}_{X_1^n}(i) \geq \alpha$ for all $i \in \mathcal{X}$ for some constant $\alpha \in (0,1)$ with high probability, then (47), (77), and Lemma 1 would yield (74). However, proving such a lower bound on $\hat{P}_{X_1^n}$ appears to be challenging (if at all possible), since we essentially have to develop a "probabilistic pigeonhole principle" to argue that "good" encoders need to utilize all the symbols in $\mathcal{X}$ significantly.

V. CONCLUSION

In closing, we first briefly reiterate our main contributions. Propelled by existing literature in coding theory, communication networks, and molecular and biological communications, we formulated the information theoretic notion of noisy permutation channel capacity for the problem of reliably transmitting information through a noisy permutation channel, i.e., a DMC followed by an independent random permutation transformation. We then derived achievability and converse bounds on noisy permutation channel capacities in Theorems 1, 2, and 3 (as well as in (11)). These results gave rise to an exact characterization of the noisy permutation channel capacity of strictly positive and full rank DMCs in Theorem 4. Furthermore, in our effort to prove these results and acquire a deeper understanding of noisy permutation channel capacity, we elucidated a simple construction of symmetric channels that dominate given DMCs in the degradation sense in Proposition 1, and established an intuitive monotonicity relation between noisy permutation channel capacity and degradation in Theorem 6.

We next propose some directions for future research. Evidently, addressing any of the open problems explicated in Conjectures 1, 2, and (74) is an excellent starting point to furthering this line of work. After these conjectures are resolved, our ultimate objective is to establish the noisy permutation channel capacity of general DMCs (whose row stochastic matrices can have zero entries). We remark that determining the noisy permutation channel capacities of DMCs with zero entries appears to be more intractable than strictly positive DMCs, because zero entries introduce a combinatorial flavor to the problem.¹⁴ So, completely settling the noisy permutation channel capacity question for general DMCs is likely to be quite challenging. Finally, there are several other open questions that parallel aspects of classical information theoretic development such as:
1) Finding tight bounds on the average probability of error (akin to classical error exponent analysis), cf. [65, Chapter 5].
2) Developing strong converse results, cf. [19, Section 22.1], [65, Theorem 5.8.5].
3) Establishing exact asymptotics for the maximum achievable value of $|\mathcal{M}|$ (akin to "finite blocklength analysis"), cf. [66], [67, Chapter II.4], and the references therein.
4) Extending the noisy permutation channel model by replacing DMCs with other kinds of memoryless channels or networks, e.g., additive white Gaussian noise (AWGN) channels or multiple-access channels (MACs), and by using more general or "realistic" algebraic operations that are applied to the output codewords, e.g., random permutations that belong to subgroups of the symmetric group. (For example, when modeling out-of-order delivery of packets in a communication network, all permutations of the packets are not equally likely; indeed, the first two transmitted packets are quite likely to arrive swapped at the receiver, but the first and last transmitted packets are very unlikely to change their relative ordering.)

(Footnote 14: This combinatorial aspect of the problem is similar to (but not exactly the same as) the zero error capacity problem, cf. [63]. It is well-known that calculating the zero error capacity of channels is very challenging, and the best known approaches use semidefinite programming relaxations such as the Lovász $\vartheta$ function, cf. [64].)

Altogether, our main results and these future directions illustrate that the study of noisy permutation channel capacity begets a fairly rich, relevant, and seemingly solvable class of new problems.

APPENDIX A
PROOF OF PROPOSITION 1

Proof. To prove this result, we seek to find $q$-SC($\delta$)'s with $\delta \in \big(0,\frac{q-1}{q}\big)$ such that $P_{Z|X}$ is a degraded version of $q$-SC($\delta$). Indeed, it is straightforward to see that the upper bound on $\delta$ in the proposition statement satisfies
$$\frac{\nu}{1-\nu+\frac{\nu}{q-1}} \leq \frac{q-1}{q}, \tag{78}$$
because (78) is equivalent to
$$\frac{\nu(q-1)}{(q-1)-\nu(q-2)} \leq \frac{q-1}{q} \;\Leftrightarrow\; \nu \leq \frac{1}{2}$$
for any $q \in \mathbb{N}\backslash\{1\}$, and the latter bound is always true since $|\mathcal{Y}| \geq 2$. Furthermore, we have equality in (78) if and only if $\nu = \frac{1}{2}$, which happens precisely when $|\mathcal{Y}| = 2$ and all rows of $P_{Z|X}$ are equal to $\big(\frac{1}{2},\frac{1}{2}\big)$. In this case, $P_{Z|X} = S_\delta P_{Z|X}$ for all $\delta \in [0,1]$, and the sufficient condition for degradation in the proposition statement holds trivially. So, we will assume without loss of generality that $\nu < \frac{1}{2}$ in the rest of the proof.

To construct $q$-SC($\delta$)'s with $\delta \in \big(0,\frac{q-1}{q}\big)$ such that $P_{Z|X}$ is a degraded version of $q$-SC($\delta$), we must ensure that $P_{Z|X} = S_\delta Q$ for some row stochastic matrix $Q \in \mathbb{R}^{q\times|\mathcal{Y}|}$. Equivalently, $S_\delta^{-1}P_{Z|X}$ must be a row stochastic matrix. A direct calculation yields $S_\delta^{-1} = S_\tau$ with (cf. [35, Proposition 4])
$$\tau = \frac{-\delta}{1-\delta-\frac{\delta}{q-1}}, \tag{79}$$
i.e., $S_\delta^{-1} = S_\tau$ has the structure shown in (13) (but with $\tau$ replacing $\delta$). Since the rows of $S_\tau$ sum to unity, the rows of $S_\delta^{-1}P_{Z|X} = S_\tau P_{Z|X}$ also sum to unity. Thus, it suffices to verify that the minimum entry of $S_\tau P_{Z|X}$ is non-negative. For $\delta \in \big(0,\frac{q-1}{q}\big)$, we have $\tau \leq 0$, which means that the principal diagonal elements of $S_\tau$ are at least unity, and the off-diagonal elements of $S_\tau$ are non-positive. Hence, the minimum entry of $S_\tau P_{Z|X}$ is lower bounded by
$$\min_{\substack{i\in\{1,\dots,q\}\\ j\in\{1,\dots,|\mathcal{Y}|\}}} \big[S_\tau P_{Z|X}\big]_{i,j} \geq (1-\tau)\nu + \tau(1-\nu) = \frac{\nu - \delta\big(1-\nu+\frac{\nu}{q-1}\big)}{1-\delta-\frac{\delta}{q-1}},$$
where the inequality uses the fact that the maximum entry of $P_{Z|X}$ is upper bounded by $1-\nu$ (because $P_{Z|X}$ is a stochastic matrix), and the equality follows from substituting (79). So, a sufficient condition that ensures that the minimum entry of $S_\tau P_{Z|X}$ is non-negative is
$$\frac{\nu - \delta\big(1-\nu+\frac{\nu}{q-1}\big)}{1-\delta-\frac{\delta}{q-1}} \geq 0 \;\Leftrightarrow\; \delta \leq \frac{\nu}{1-\nu+\frac{\nu}{q-1}}.$$
This completes the proof. ∎

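The construction in the proof of Proposition 1 is easy to test numerically. In the sketch below (our own code, with an example channel of our choosing), $S_\delta^{-1}P_{Z|X}$ is checked for row stochasticity at and around the threshold $\delta \leq \nu/(1-\nu+\nu/(q-1))$; note that the condition is only sufficient, so the check may still pass slightly above the threshold.

```python
import numpy as np

def q_sc(q, d):
    """q-ary symmetric channel matrix S_d: 1-d on the diagonal, d/(q-1) off."""
    S = np.full((q, q), d / (q - 1))
    np.fill_diagonal(S, 1 - d)
    return S

W = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7]])
q = W.shape[0]
nu = W.min()
delta_max = nu / (1 - nu + nu / (q - 1))

for delta in (0.5 * delta_max, delta_max, 1.2 * delta_max):
    Q = np.linalg.inv(q_sc(q, delta)) @ W     # candidate intermediate channel
    ok = np.allclose(Q.sum(axis=1), 1) and (Q >= -1e-12).all()
    print(round(delta, 4), "degraded version of q-SC(delta):", ok)
    # guaranteed True for the first two; the last may or may not hold
```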
APPENDIX B
PROOF OF LEMMA 4

Proof. For the binary hypothesis testing problem in (17), define the "translated empirical distribution of $X_1^n$" random vector
$$T_n \triangleq \hat{P}_{X_1^n} - C_n \in \mathcal{T}_n, \tag{80}$$
where the constant vector $C_n = (c_1,\dots,c_{|\mathcal{X}|}) \in \mathbb{R}^{|\mathcal{X}|}$ (which can depend on $n$) will be chosen later, and $\mathcal{T}_n = \big\{\big(\frac{k_1}{n}-c_1,\dots,\frac{k_{|\mathcal{X}|}}{n}-c_{|\mathcal{X}|}\big) : k_1,\dots,k_{|\mathcal{X}|} \in \mathbb{N}\cup\{0\},\ k_1+\cdots+k_{|\mathcal{X}|} = n\big\}$. Moreover, for ease of exposition, let $T_n^-$ and $T_n^+$ denote versions of the random variable $T_n$ with probability distributions $P_{T_n^-}$ and $P_{T_n^+}$ induced by $P_X^{\otimes n}$ and $Q_X^{\otimes n}$, respectively, such that
$$P_{T_n^-}(t) \triangleq \mathbb{P}\big(T_n^- = t\big) = \mathbb{P}\big(\hat{P}_{X_1^n} = t + C_n\,\big|\,H = 0\big), \qquad P_{T_n^+}(t) \triangleq \mathbb{P}\big(T_n^+ = t\big) = \mathbb{P}\big(\hat{P}_{X_1^n} = t + C_n\,\big|\,H = 1\big),$$
for all $t \in \mathcal{T}_n$. Then, we clearly have
$$P_{T_n} = \frac{1}{2}P_{T_n^-} + \frac{1}{2}P_{T_n^+}.$$
It is straightforward to verify that $T_n$ is a sufficient statistic of $X_1^n$ for performing inference about $H$. So, the ML decoder of $H$ based on $X_1^n$, $\hat{H}^n_{\mathrm{ML}}(X_1^n)$, is a function of $T_n$ without loss of generality (see (19)), and we denote it as $\hat{H}^n_{\mathrm{ML}} : \mathcal{T}_n \to \{0,1\}$, $\hat{H}^n_{\mathrm{ML}}(T_n)$ with abuse of notation. Thus, we have $P^{(n)}_{\mathrm{ML}} = \mathbb{P}\big(\hat{H}^n_{\mathrm{ML}}(T_n) \neq H\big)$, and (23) implies that
$$\big\|P_X^{\otimes n} - Q_X^{\otimes n}\big\|_{\mathrm{TV}} = \big\|P_{T_n^+} - P_{T_n^-}\big\|_{\mathrm{TV}}. \tag{81}$$
It therefore suffices to lower bound the right hand side. Similar to the proof of [43, Lemma 4.2(iii)], observe that
$$\big\|\mathbb{E}\big[T_n^+\big] - \mathbb{E}\big[T_n^-\big]\big\|_2^2 = \Bigg\|\sum_{t\in\mathcal{T}_n} t\,\big(P_{T_n^+}(t)-P_{T_n^-}(t)\big)\Bigg\|_2^2 \overset{(a)}{=} \sum_{i=1}^{|\mathcal{X}|}\Bigg(\sum_{t\in\mathcal{T}_n} t_i\sqrt{P_{T_n}(t)}\,\frac{P_{T_n^+}(t)-P_{T_n^-}(t)}{\sqrt{P_{T_n}(t)}}\Bigg)^{\!2} \overset{(b)}{\leq} \Bigg(\sum_{t\in\mathcal{T}_n}\sum_{i=1}^{|\mathcal{X}|} t_i^2\,P_{T_n}(t)\Bigg)\underbrace{\Bigg(\sum_{t\in\mathcal{T}_n}\frac{\big(P_{T_n^+}(t)-P_{T_n^-}(t)\big)^2}{P_{T_n}(t)}\Bigg)}_{=\,4\,\mathrm{LC}(P_{T_n^+}||P_{T_n^-})} = 4\,\mathrm{LC}\big(P_{T_n^+}\big\|P_{T_n^-}\big)\,\mathbb{E}\big[\|T_n\|_2^2\big] \tag{82}$$
$$\overset{(c)}{\leq} 4\,\mathbb{E}\big[\|T_n\|_2^2\big]\,\big\|P_{T_n^+} - P_{T_n^-}\big\|_{\mathrm{TV}}, \tag{83}$$
where we let $t = (t_1,\dots,t_{|\mathcal{X}|})$ in (a), (b) follows from the Cauchy-Schwarz-Bunyakovsky inequality, $\mathrm{LC}(\cdot||\cdot)$ denotes the Vincze-Le Cam distance or triangular discrimination between two probability distributions [68], [69], and (c) upper bounds Vincze-Le Cam distance using TV distance (via the observation that $P_{T_n^+}(t) - P_{T_n^-}(t) \leq P_{T_n^+}(t) + P_{T_n^-}(t)$ for all $t \in \mathcal{T}_n$). We note that (82) is precisely a vector version of [43, Lemma 4.2(iii)]. Hence, combining (81) and (83), we get
$$\big\|P_X^{\otimes n} - Q_X^{\otimes n}\big\|_{\mathrm{TV}} \geq \frac{\big\|\mathbb{E}\big[T_n^+\big] - \mathbb{E}\big[T_n^-\big]\big\|_2^2}{4\,\mathbb{E}\big[\|T_n\|_2^2\big]}. \tag{84}$$
We now select the vector $C_n$. Since the numerator of the bound in (84) is invariant to the value of $C_n$, the best bound of the form (84) is obtained by minimizing the second moment $\mathbb{E}\big[\|T_n\|_2^2\big]$. Thus, $C_n$ is given by
$$C_n = \mathbb{E}\big[\hat{P}_{X_1^n}\big] = \frac{1}{2}P_X + \frac{1}{2}Q_X, \tag{85}$$
using the binary hypothesis testing model (17), where we employ the well-known fact that mean-squared error is minimized by the mean (see, e.g., [70, Section 1.7, Example 7.17]). With this choice of $C_n$, notice that
$$\mathbb{E}\big[T_n^-\big] = \frac{1}{2}\big(P_X - Q_X\big), \qquad \mathbb{E}\big[T_n^+\big] = \frac{1}{2}\big(Q_X - P_X\big), \qquad \mathbb{E}\big[\|T_n\|_2^2\big] = \sum_{x\in\mathcal{X}}\mathbb{VAR}\big(\hat{P}_{X_1^n}(x)\big).$$
Using these expressions, we can simplify the second moment method bound in (84) and obtain the bound in the lemma statement. ∎

We remark that with the choice of $C_n$ in (85), (82) can be perceived as a vector version of the HCR bound in statistics [44], [45], where the Vincze-Le Cam distance replaces the usual $\chi^2$-divergence.

APPENDIX C
PROOF OF LEMMA 5

Proof. To upper bound the ML decoding probability of error $P^{(n)}_{\mathrm{ML}}$, we combine (23) and Lemma 4 to get
$$P^{(n)}_{\mathrm{ML}} \leq \frac{1}{2}\Bigg(1 - \frac{\|P_X - Q_X\|_2^2}{4\sum_{x\in\mathcal{X}}\mathbb{VAR}\big(\hat{P}_{X_1^n}(x)\big)}\Bigg). \tag{86}$$
We now compute the right hand side of this bound explicitly. Observe using (85) that for every $x \in \mathcal{X}$,
$$\mathbb{VAR}\big(\hat{P}_{X_1^n}(x)\big) = \mathbb{E}\Bigg[\bigg(\frac{1}{n}\sum_{i=1}^n \mathbb{1}\{X_i = x\}\bigg)^{\!2}\Bigg] - \bigg(\frac{P_X(x)+Q_X(x)}{2}\bigg)^{\!2} = \frac{P_X(x)+Q_X(x)}{2n} - \bigg(\frac{P_X(x)+Q_X(x)}{2}\bigg)^{\!2} + \frac{1}{n^2}\sum_{\substack{1\leq i,j\leq n\\ i\neq j}}\mathbb{E}\big[\mathbb{1}\{X_i = x\}\mathbb{1}\{X_j = x\}\big] = \frac{P_X(x)+Q_X(x)}{2n} - \bigg(\frac{P_X(x)+Q_X(x)}{2}\bigg)^{\!2} + \frac{n-1}{2n}\big(P_X(x)^2 + Q_X(x)^2\big) = \frac{P_X(x)(1-P_X(x))}{2n} + \frac{Q_X(x)(1-Q_X(x))}{2n} + \frac{(P_X(x)-Q_X(x))^2}{4} \leq \frac{1}{4n} + \frac{(P_X(x)-Q_X(x))^2}{4},$$
where the equalities follow from straightforward algebraic manipulations, and the final inequality holds because $t(1-t) \leq \frac{1}{4}$ for all $t \in [0,1]$. Plugging this inequality into (86) yields
$$P^{(n)}_{\mathrm{ML}} \leq \frac{1}{2}\Bigg(1 - \frac{\|P_X - Q_X\|_2^2}{\frac{|\mathcal{X}|}{n} + \|P_X - Q_X\|_2^2}\Bigg) = \frac{|\mathcal{X}|}{2|\mathcal{X}| + 2n\|P_X - Q_X\|_2^2} \leq \frac{|\mathcal{X}|}{2|\mathcal{X}| + 2n^{2\epsilon_n}},$$
where the final inequality follows from applying (24). This completes the proof. ∎
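For small blocklengths, the bound established in the proofs above can be checked exactly by brute force. The snippet below (our own illustration, with an arbitrary binary pair $P_X, Q_X$ of our choosing) compares the exact ML error of testing $P_X^{\otimes n}$ versus $Q_X^{\otimes n}$ against the bound $|\mathcal{X}|/\big(2|\mathcal{X}| + 2n\|P_X - Q_X\|_2^2\big)$.

```python
import numpy as np
from itertools import product

P = np.array([0.6, 0.4])
Q = np.array([0.4, 0.6])
for n in (2, 4, 8):
    p_err = 0.0
    for x in product(range(2), repeat=n):      # enumerate all codewords
        lp = np.prod(P[list(x)])
        lq = np.prod(Q[list(x)])
        p_err += 0.5 * min(lp, lq)             # exact ML error, H ~ Ber(1/2)
    bound = len(P) / (2 * len(P) + 2 * n * np.sum((P - Q) ** 2))
    print(n, round(p_err, 4), "<=", round(bound, 4))
```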

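Since the closed-form variance above carries the proof, a brief Monte Carlo check of the identity is a useful sanity test. In the following sketch (our illustration, with arbitrary distributions, blocklength, and trial count), $X_1^n$ is drawn according to the equiprobable binary hypothesis testing model and the empirical variance of each $\hat{P}_{X_1^n}(x)$ is compared with the closed form.

```python
# Monte Carlo check of VAR(P_hat(x)) = P(1-P)/(2n) + Q(1-Q)/(2n) + (P-Q)^2/4
# under the equiprobable mixture of i.i.d. P_X and i.i.d. Q_X samples.
# All parameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.5])
n, trials = 10, 200_000

emp = np.empty((trials, len(P)))
for t in range(trials):
    dist = P if rng.random() < 0.5 else Q             # pick a hypothesis uniformly
    x = rng.choice(len(dist), size=n, p=dist)         # draw X_1^n i.i.d.
    emp[t] = np.bincount(x, minlength=len(dist)) / n  # empirical distribution

mc_var = emp.var(axis=0)
closed_form = P*(1-P)/(2*n) + Q*(1-Q)/(2*n) + (P - Q)**2 / 4
print(np.round(mc_var, 4))        # should closely match the line below
print(np.round(closed_form, 4))
```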
APPENDIX D
PROOF OF LEMMA 6

Proof.
Part 1: For the convenience of readers unfamiliar with the notion of iterated random maps, we translate the proofs of [52, Theorem 3.1, Proposition 4.1] into information theoretic language. (We also refer readers to [71, Remark III.2], which shows the forward direction.)

Suppose $P_{Z|X}$ satisfies $\mathsf{Doeblin}(Q_Z, \eta)$. Then, construct the channel $P_{Z|X'}$ with input alphabet $\mathcal{X} \cup \{E\}$ and output alphabet $\mathcal{Y}$ such that

$$P_{Z|X'}(z|x) = \begin{cases} \dfrac{P_{Z|X}(z|x) - \eta\,Q_Z(z)}{1 - \eta}\,, & \text{for } x \in \mathcal{X} \\ Q_Z(z)\,, & \text{for } x = E \end{cases}$$

for all $z \in \mathcal{Y}$ and $x \in \mathcal{X} \cup \{E\}$, where $P_{Z|X}(z|x) - \eta\,Q_Z(z) \geq 0$ due to Definition 5, and $\sum_{z \in \mathcal{Y}} P_{Z|X}(z|x) - \eta\,Q_Z(z) = 1 - \eta$. It follows via a direct calculation that $P_{Z|X} = q\text{-EC}(\eta) \cdot P_{Z|X'}$ (i.e., $P_{Z|X}$ is the product of the stochastic matrices $q\text{-EC}(\eta)$ and $P_{Z|X'}$), which means that $P_{Z|X}$ is a degraded version of $q\text{-EC}(\eta)$.

To prove the reverse direction, suppose $P_{Z|X}$ is a degraded version of $q\text{-EC}(\eta)$. Then, using Definition 2, there exists a channel $P_{Z|X'}$ with input alphabet $\mathcal{X} \cup \{E\}$ and output alphabet $\mathcal{Y}$ such that $P_{Z|X} = q\text{-EC}(\eta) \cdot P_{Z|X'}$. Hence, it is straightforward to show that for every $x \in \mathcal{X}$ and $z \in \mathcal{Y}$,

$$P_{Z|X}(z|x) = (1 - \eta)\,P_{Z|X'}(z|x) + \eta\,P_{Z|X'}(z|E) \geq \eta\,P_{Z|X'}(z|E)\,,$$

where the inequality holds because $(1 - \eta)\,P_{Z|X'}(z|x) \geq 0$. Thus, employing Definition 5, this implies that $P_{Z|X}$ satisfies $\mathsf{Doeblin}(P_{Z|X'}(\cdot|E), \eta)$. This completes the proof of part 1.

Part 2: We refer readers to [55, Lemma 4] for a proof of this part. (It is worth juxtaposing $\eta^*(P_{Z|X})$ with [72, Equations (58) and (102)], which state that contraction coefficients of operator convex $f$-divergences characterize the extremal erasure probability $\eta$ such that $P_{Z|X}$ is dominated by a $q\text{-EC}(\eta)$ in the "less noisy" sense; see [72] for details.) ∎
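To make the construction in part 1 concrete, the following sketch builds the augmented channel $P_{Z|X'}$ for an arbitrary strictly positive stochastic matrix and verifies the factorization $P_{Z|X} = q\text{-EC}(\eta) \cdot P_{Z|X'}$. The matrix `W` is a hypothetical example of our choosing, and $Q_Z$ is taken to be the normalized column minima of `W`, which yields the largest feasible minorization constant (in the spirit of $\eta^*(P_{Z|X})$ from part 2).

```python
# Forward direction of part 1: a channel W satisfying the Doeblin minorization
# W(z|x) >= eta * Q(z) factors through a q-ary erasure channel q-EC(eta).
# The 2x3 matrix W is an arbitrary strictly positive example.
import numpy as np

W = np.array([[0.5, 0.3, 0.2],     # rows: inputs x; columns: outputs z
              [0.1, 0.3, 0.6]])
q = W.shape[0]

col_min = W.min(axis=0)            # entrywise minorization of every row
eta = col_min.sum()                # largest feasible Doeblin constant (< 1 here)
Q_Z = col_min / eta                # minorizing output distribution

# Augmented channel P_{Z|X'} on input alphabet X U {E}: the last row is the
# erasure input E, which emits Q_Z.
W_prime = np.vstack([(W - eta * Q_Z) / (1 - eta), Q_Z])

# q-EC(eta) as a stochastic matrix from X to X U {E}: keep the input with
# probability 1 - eta, erase it with probability eta.
EC = np.hstack([(1 - eta) * np.eye(q), eta * np.ones((q, 1))])

assert np.allclose(EC @ W_prime, W)   # W is a degraded version of q-EC(eta)
print("eta =", eta)
```

For a strictly positive `W`, every column minimum is positive, so $\eta > 0$ and the factorization is always nontrivial; this is precisely why strictly positive DMCs admit nontrivial erasure channel dominators in the degradation preorder.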
REFERENCES

[1] A. Makur, "Bounds on permutation channel capacity," in Proceedings of the IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, June 21-26 2020, pp. 1–6.
[2] A. Makur, "Information capacity of BSC and BEC permutation channels," in Proceedings of the 56th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, October 2-5 2018, pp. 1112–1119.
[3] S. N. Diggavi and M. Grossglauser, "On transmission over deletion channels," in Proceedings of the 39th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, October 3-5 2001, pp. 573–582.
[4] M. Mitzenmacher, "Polynomial time low-density parity-check codes with rates very close to the capacity of the q-ary random deletion channel for large q," IEEE Transactions on Information Theory, vol. 52, no. 12, pp. 5496–5501, December 2006.
[5] J. J. Metzner, "Simplification of packet-symbol decoding with errors, deletions, misordering of packets, and no sequence numbers," IEEE Transactions on Information Theory, vol. 55, no. 6, pp. 2626–2639, June 2009.
[6] S. Kudekar, S. Kumar, M. Mondelli, H. D. Pfister, E. Şaşoğlu, and R. L. Urbanke, "Reed–Muller codes achieve capacity on erasure channels," IEEE Transactions on Information Theory, vol. 63, no. 7, pp. 4298–4316, July 2017.
[7] Y. Xu and T. Zhang, "Variable shortened-and-punctured Reed–Solomon codes for packet loss protection," IEEE Transactions on Broadcasting, vol. 48, no. 3, pp. 237–245, September 2002.
[8] M. Gadouleau and A. Goupil, "Binary codes for packet error and packet loss correction in store and forward," in Proceedings of the International ITG Conference on Source and Channel Coding (SCC), no. 25, Siegen, Germany, January 18-21 2010, pp. 1–6.
[9] J. M. Walsh, S. Weber, and C. wa Maina, "Optimal rate-delay tradeoffs and delay mitigating codes for multipath routed and network coded networks," IEEE Transactions on Information Theory, vol. 55, no. 12, pp. 5491–5510, December 2009.
[10] M. Kovačević and D. Vukobratović, "Subset codes for packet networks," IEEE Communications Letters, vol. 17, no. 4, pp. 729–732, April 2013.
[11] M. Kovačević and D. Vukobratović, "Perfect codes in the discrete simplex," Designs, Codes and Cryptography, vol. 75, no. 1, pp. 81–95, April 2015.
[12] M. Kovačević and V. Y. F. Tan, "Codes in the space of multisets–Coding for permutation channels with impairments," IEEE Transactions on Information Theory, vol. 64, no. 7, pp. 5156–5169, July 2018.
[13] S. M. H. T. Yazdi, H. M. Kiah, E. Garcia-Ruiz, J. Ma, H. Zhao, and O. Milenkovic, "DNA-based storage: Trends and methods," IEEE Transactions on Molecular, Biological, and Multi-Scale Communications, vol. 1, no. 3, pp. 230–248, September 2015.
[14] H. M. Kiah, G. J. Puleo, and O. Milenkovic, "Codes for DNA sequence profiles," IEEE Transactions on Information Theory, vol. 62, no. 6, pp. 3125–3146, June 2016.
[15] R. Heckel, I. Shomorony, K. Ramchandran, and D. N. C. Tse, "Fundamental limits of DNA storage systems," in Proceedings of the IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, June 25-30 2017, pp. 3130–3134.
[16] M. Kovačević and V. Y. F. Tan, "Asymptotically optimal codes correcting fixed-length duplication errors in DNA storage systems," IEEE Communications Letters, vol. 22, no. 11, pp. 2194–2197, November 2018.
[17] I. Shomorony and R. Heckel, "Capacity results for the noisy shuffling channel," in Proceedings of the IEEE International Symposium on Information Theory (ISIT), Paris, France, July 7-12 2019, pp. 762–766.
[18] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, NJ, USA: John Wiley & Sons, Inc., 2006.
[19] Y. Polyanskiy and Y. Wu, "Lecture notes on information theory," May 2019, Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA, USA, Lecture Notes 6.441.
[20] O. Kosut and L. Sankar, "New results on third-order coding rate for universal fixed-to-variable source coding," in Proceedings of the IEEE International Symposium on Information Theory (ISIT), Honolulu, HI, USA, June 29-July 4 2014, pp. 2689–2693.
[21] T. M. Cover, "Broadcast channels," IEEE Transactions on Information Theory, vol. IT-18, no. 1, pp. 2–14, January 1972.
[22] P. P. Bergmans, "Random coding theorem for broadcast channels with degraded components," IEEE Transactions on Information Theory, vol. IT-19, no. 2, pp. 197–207, March 1973.
[23] A. El Gamal and Y.-H. Kim, Network Information Theory. New York, NY, USA: Cambridge University Press, 2011.
[24] D. Blackwell, "Comparison of experiments," in Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability (Berkeley, CA, USA, July 31-August 12 1950), J. Neyman, Ed. Berkeley, CA, USA: University of California Press, 1951, pp. 93–102.
[25] S. Sherman, "On a theorem of Hardy, Littlewood, Polya, and Blackwell," Proceedings of the National Academy of Sciences of the United States of America (PNAS), vol. 37, no. 12, pp. 826–831, December 1951.
[26] C. Stein, "Notes on a seminar on theoretical statistics. I. Comparison of experiments," University of Chicago, Tech. Rep., 1951.
[27] M. Leshno and Y. Spector, "An elementary proof of Blackwell's theorem," Mathematical Social Sciences, Elsevier, vol. 25, no. 1, pp. 95–98, December 1992.
[28] E. Torgersen, "Stochastic orders and comparison of experiments," in Stochastic Orders and Decision Under Risk, ser. Lecture Notes-Monograph Series, K. Mosler and M. Scarsini, Eds., vol. 19. Hayward, CA, USA: Institute of Mathematical Statistics, 1991, pp. 334–371.
[29] E. Torgersen, Comparison of Statistical Experiments, ser. Encyclopedia of Mathematics and Its Applications. New York, NY, USA: Cambridge University Press, 1991.
[30] T. Richardson and R. Urbanke, Modern Coding Theory. Cambridge, UK: Cambridge University Press, 2008.
[31] A. W. Marshall, I. Olkin, and B. C. Arnold, Inequalities: Theory of Majorization and Its Applications, 2nd ed., ser. Springer Series in Statistics. New York, NY, USA: Springer, 2011.
[32] G. Dahl, "Matrix majorization," Linear Algebra and its Applications, Elsevier, vol. 288, pp. 53–73, February 1999.
[33] G. Dahl, "Majorization polytopes," Linear Algebra and its Applications, Elsevier, vol. 297, pp. 157–175, August 1999.
[34] A. Makur, "Information contraction and decomposition," Sc.D. Thesis in Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA, May 2019.
[35] A. Makur and Y. Polyanskiy, "Comparison of channels: Criteria for domination by a symmetric channel," IEEE Transactions on Information Theory, vol. 64, no. 8, pp. 5704–5725, August 2018.
[36] A. Makur and Y. Polyanskiy, "Less noisy domination by symmetric channels," in Proceedings of the IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, June 25-30 2017, pp. 2463–2467.
[37] E. Mossel, K. Oleszkiewicz, and A. Sen, "On reverse hypercontractivity," Geometric and Functional Analysis, vol. 23, no. 3, pp. 1062–1097, June 2013.
[38] J. A. Adell, A. Lekuona, and Y. Yu, "Sharp bounds on the entropy of the Poisson law and related quantities," IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2299–2306, May 2010.
[39] W. Hoeffding, "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, vol. 58, no. 301, pp. 13–30, March 1963.
[40] G. W. Wornell, "Inference and information," May 2017, Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA, USA, Lecture Notes 6.437.
[41] A. B. Tsybakov, Introduction to Nonparametric Estimation, ser. Springer Series in Statistics. New York, NY, USA: Springer, 2009.
[42] D. A. Levin, Y. Peres, and E. L. Wilmer, Markov Chains and Mixing Times, 1st ed. Providence, RI, USA: American Mathematical Society, 2009.
[43] W. Evans, C. Kenyon, Y. Peres, and L. J. Schulman, "Broadcasting on trees and the Ising model," The Annals of Applied Probability, vol. 10, no. 2, pp. 410–433, May 2000.
[44] J. M. Hammersley, "On estimating restricted parameters," Journal of the Royal Statistical Society, Series B (Methodological), vol. 12, no. 2, pp. 192–240, 1950.
[45] D. G. Chapman and H. Robbins, "Minimum variance estimation without regularity assumptions," The Annals of Mathematical Statistics, vol. 22, no. 4, pp. 581–586, December 1951.
[46] R. A. Horn and C. R. Johnson, Topics in Matrix Analysis. New York, NY, USA: Cambridge University Press, 1991.
[47] V. Rakočević and H. K. Wimmer, "A variational characterization of canonical angles between subspaces," Journal of Geometry, vol. 78, no. 1, pp. 122–124, 2003.
[48] M. Bloch and J. Barros, Physical-Layer Security: From Information Theory to Security Engineering. New York, NY, USA: Cambridge University Press, 2011.
[49] R. A. Horn and C. R. Johnson, Matrix Analysis, 2nd ed. New York, NY, USA: Cambridge University Press, 2013.
[50] I. Sason and S. Shamai, Performance Analysis of Linear Codes under Maximum-Likelihood Decoding: A Tutorial, ser. Foundations and Trends in Communications and Information Theory, S. Verdú, Ed. Hanover, MA, USA: now Publishers Inc., 2006, vol. 3, no. 1-2.
[51] R. W. Keener, Theoretical Statistics: Topics for a Core Course, ser. Springer Texts in Statistics. New York, NY, USA: Springer, 2010.
[52] R. Bhattacharya and E. C. Waymire, "Iterated random maps and some classes of Markov processes," in Stochastic Processes: Theory and Methods, ser. Handbook of Statistics, D. N. Shanbhag and C. R. Rao, Eds., vol. 19. Amsterdam, Netherlands: North-Holland, Elsevier, 2001, pp. 145–170.
[53] W. Doeblin, "Sur les propriétés asymptotiques de mouvement régis par certains types de chaînes simples," Bulletin Mathématique de la Société Roumaine des Sciences, vol. 39, no. 1, pp. 57–115, 1937, in French.
[54] W. Doeblin, "Exposé de la théorie des chaînes simples constantes de Markov à un nombre fini d'états," Revue Mathématique de l'Union Interbalkanique, vol. 2, pp. 77–105, 1938, in French.
[55] A. Gohari, O. Günlü, and G. Kramer, "Coding for positive rate in the source model key agreement problem," May 2019, arXiv:1709.05174v5 [cs.IT].
[56] J. E. Cohen, Y. Iwasa, G. Rautu, M. B. Ruskai, E. Seneta, and G. Zbăganu, "Relative entropy under mappings by stochastic matrices," Linear Algebra and its Applications, Elsevier, vol. 179, pp. 211–235, January 1993.
[57] K. B. Athreya and P. Ney, "A new approach to the limit theory of recurrent Markov chains," Transactions of the American Mathematical Society, vol. 245, pp. 493–501, November 1978.
[58] E. Nummelin, "A splitting technique for Harris recurrent Markov chains," Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, vol. 43, no. 4, pp. 309–318, December 1978.
[59] G. Grimmett, "Percolation and disordered systems," in Lectures on Probability Theory and Statistics: École d'Été de Probabilités de Saint-Flour XXVI-1996, ser. Lecture Notes in Mathematics, P. Bernard, Ed., vol. 1665. Berlin, Heidelberg, Germany: Springer, 1997, pp. 153–300.
[60] T. Feder, "Reliable computation by networks in the presence of noise," IEEE Transactions on Information Theory, vol. 35, no. 3, pp. 569–571, May 1989.
[61] A. Makur, E. Mossel, and Y. Polyanskiy, "Broadcasting on random directed acyclic graphs," IEEE Transactions on Information Theory, vol. 66, no. 2, pp. 780–812, February 2020.
[62] A. Makur, E. Mossel, and Y. Polyanskiy, "Broadcasting on random networks," in Proceedings of the IEEE International Symposium on Information Theory (ISIT), Paris, France, July 7-12 2019, pp. 1632–1636.
[63] C. E. Shannon, "The zero error capacity of a noisy channel," IRE Transactions on Information Theory, vol. 2, no. 3, pp. 8–19, September 1956.
[64] L. Lovász, "On the Shannon capacity of a graph," IEEE Transactions on Information Theory, vol. IT-25, no. 1, pp. 1–7, January 1979.
[65] R. G. Gallager, Information Theory and Reliable Communication. New York, NY, USA: John Wiley & Sons, Inc., 1968.
[66] Y. Polyanskiy, H. V. Poor, and S. Verdú, "Channel coding rate in the finite blocklength regime," IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2307–2359, May 2010.
[67] V. Y. F. Tan, Asymptotic Estimates in Information Theory with Non-Vanishing Error Probabilities, ser. Foundations and Trends in Communications and Information Theory, S. Verdú, Ed. Hanover, MA, USA: now Publishers Inc., 2014, vol. 11, no. 1-2.
[68] I. Vincze, "On the concept and measure of information contained in an observation," in Contributions to Probability: A Collection of Papers Dedicated to Eugene Lukacs, J. Gani and V. K. Rohatgi, Eds. New York, NY, USA: Academic Press, 1981, pp. 207–214.
[69] L. Le Cam, Asymptotic Methods in Statistical Decision Theory, ser. Springer Series in Statistics. New York, NY, USA: Springer, 1986.
[70] E. L. Lehmann and G. Casella, Theory of Point Estimation, 2nd ed., ser. Springer Texts in Statistics. New York, NY, USA: Springer, 1998.
[71] M. Raginsky, "Strong data processing inequalities and Φ-Sobolev inequalities for discrete channels," IEEE Transactions on Information Theory, vol. 62, no. 6, pp. 3355–3389, June 2016.
[72] A. Makur and L. Zheng, "Comparison of contraction coefficients for f-divergences," Problems of Information Transmission, vol. 56, no. 2, pp. 103–156, April 2020.