IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 44, NO. 6, OCTOBER 1998

The Method of Types

Imre Csiszár, Fellow, IEEE

(Invited Paper)

Abstract—The method of types is one of the key technical tools in Shannon Theory, and this tool is valuable also in other fields. In this paper, some key applications will be presented in sufficient detail enabling an interested nonspecialist to gain a working knowledge of the method, and a wide selection of further applications will be surveyed. These range from hypothesis testing and large deviations theory through error exponents for discrete memoryless channels and capacity of arbitrarily varying channels to multiuser problems. While the method of types is suitable primarily for discrete memoryless models, its extensions to certain models with memory will also be discussed.

Index Terms—Arbitrarily varying channels, choice of decoder, counting approach, error exponents, extended type concepts, hypothesis testing, large deviations, multiuser problems, universal coding.

I. INTRODUCTION

ONE of Shannon's key discoveries was that—for quite general source models—the negative logarithm of the probability of a typical long sequence divided by the number of symbols is close to the source entropy $H$; the total probability of all $n$-length sequences not having this property is arbitrarily small if $n$ is large. Thus "it is possible for most purposes to treat long sequences as though there were just $2^{Hn}$ of them, each with a probability $2^{-Hn}$" [75, p. 24]. Shannon demonstrated the power of this idea also in the context of channels. It should be noted that Shannon [75] used the term "typical sequence" in an intuitive rather than technical sense. Formal definitions of typicality, introduced later, need not concern us here.

At the first stage of development of information theory, the main theoretical issue was to find the best rates of source or channel block codes that, assuming a known probabilistic model, guarantee arbitrarily small probability of error (or tolerable average distortion) when the blocklength $n$ is sufficiently large. For this purpose, covered by the previous quotation from Shannon [75], typical sequences served as a very efficient and intuitive tool, as demonstrated by the book of Wolfowitz [81]. The limitations of this tool became apparent when interest shifted towards the speed of convergence to zero of the error probability as $n \to \infty$. Major achievements of the 1960's were, in the context of discrete memoryless channels (DMC's), the "random coding" upper bound and the "sphere packing" lower bound to the error probability of the best code of a given rate less than capacity (Fano [44], Gallager [46], Shannon, Gallager, and Berlekamp [74]). These bounds exponentially coincide for rates above a "critical rate" and provide the exact error exponent of a DMC for such rates. These results could not be obtained via typical sequences, and their first proofs used analytic techniques that gave little insight.

It turned out in the 1970's that a simple refinement of the typical sequence approach is effective—at least in the discrete memoryless context—also for error exponents, as well as for situations where the probabilistic model is partially unknown. The idea of this refinement, known as the method of types, is to partition the $n$-length sequences into classes according to type (empirical distribution). Then the error event of interest is partitioned into its intersections with the type classes, and the error probability is obtained by summing the probabilities of these intersections. The first key fact is that the number of type classes grows subexponentially with $n$. This implies that the error probability has the same exponential asymptotics as the largest one among the probabilities of the above intersections. The second key fact is that sequences of the same type are equiprobable under a memoryless probabilistic model. Hence to bound the probabilities of intersections as above it suffices to bound their cardinalities, which is often quite easy. This informal description assumes models involving one set of sequences (source coding or hypothesis testing); if two or more sets of sequences are involved (as in channel coding), joint types have to be considered.

In this paper, we will illustrate the working and the power of the method of types via a sample of examples that the author considers typical and both technically and historically interesting. The simple technical background, including convenient notation, will be introduced in Section II. The first key applications, viz. universally attainable exponential error bounds for hypothesis testing and channel block-coding, will be treated in Sections III and IV, complete with proofs. The universally attainable error exponent for source block-coding arises as a special case of the hypothesis testing result. A basic result of large deviations theory is also included in Section III. Section V is devoted to the arbitrarily varying channel (AVC) capacity problem. Here proofs could not be given in full, but the key steps are reproduced in detail showing how the results were actually obtained and how naturally the method of types suggested a good decoder. Other typical applications are reviewed in Section VI, including rate-distortion theory, source-channel error exponents, and multiuser problems.

Manuscript received December 16, 1997; revised April 20, 1998. This work was supported by the Hungarian National Foundation for Scientific Research under Grant T016386. The author is with the Mathematical Institute of the Hungarian Academy of Sciences, H-1364 Budapest, P.O. Box 127, Hungary. Publisher Item Identifier S 0018-9448(98)05285-7.

Although the method of types is tailored to discrete memoryless models, there exist extensions of the type concept suitable for certain models with memory. These will be discussed in Section VII.

The selection of problems and results treated in this paper has been inevitably subjective. To survey all applications of the method of types would have required a paper the size of a book. In particular, several important applications in combinatorics are not covered; in this respect the reader should consult the paper of Körner and Orlitsky in this issue.

While historical aspects were taken seriously, and a rather large list of references has been included, no attempt was made to give a detailed account of the history of the method of types. About its origins let us just make the following brief comments.

The ingredients have been around for a long time. In probability theory, they appear in the basic papers of Sanov [72] and Hoeffding [53] on large deviations, cf. Section III below. A similar counting approach had been used in statistical physics even earlier, dating back to Boltzmann [18]. A remarkable example is the paper of Schrödinger [73] that predates modern large deviations theory but remained unknown outside the physics community until recently. Information theorists have also used ideas now considered pertinent to the method of types. Fano's [44] approach to the DMC error exponent problem was based on "constant composition codes," and Berger's [13] extension of the rate-distortion theorem to sources with partially known and variable statistics relied upon his key lemma about covering a type class. Later, in the 1970's, several authors made substantial use of the concept now called joint type, including Blahut [16], Dobrushin and Stambler [40], and Goppa [48].

While the ideas of the method of types were already around in the 1970's, this author believes that his research group is fairly credited for developing them into a general method, indeed, into a basic tool of the information theory of discrete memoryless systems. The key coworkers were János Körner and Katalin Marton. A systematic development appears in the book of Csiszár and Körner [30]. Were that book written now, both authors would prefer to rely even more extensively on types, rather than typical sequences. Indeed, while "merging nearby types, i.e., the formalism of typical sequences has the advantage of shortening computations" [30, p. 38], that advantage is relatively minor in the discrete memoryless context. On the other hand, the less delicate "typical sequence" approach is more robust: it can be extended also to those models with memory or continuous alphabets for which the type idea apparently fails.

II. TECHNICAL BACKGROUND

The technical background of the method of types is very simple. In the author's information theory classes, the lemmas below are part of the introductory material.

$X$, $Y$, $\dots$ will denote finite sets, unless stated otherwise; the size of $X$ is denoted by $|X|$. The set of all probability distributions (PD's) on $X$ is denoted by $\mathcal{P}(X)$. For PD's $P$ and $Q$, $H(P)$ denotes entropy and $D(P\|Q)$ denotes information divergence, i.e.,

$$H(P) = -\sum_{a \in X} P(a)\log P(a), \qquad D(P\|Q) = \sum_{a \in X} P(a)\log\frac{P(a)}{Q(a)}$$

with the standard conventions that $0\log 0 = 0$, $0\log\frac{0}{0} = 0$, and $D(P\|Q) = \infty$ if $P(a) > 0 = Q(a)$ for some $a$. Here and in the sequel the base of $\exp$ and $\log$ is arbitrary but the same; the usual choices are $2$ or $e$.

The type of a sequence $\mathbf{x} = (x_1,\dots,x_n) \in X^n$ and the joint type of $\mathbf{x}$ and $\mathbf{y} = (y_1,\dots,y_n) \in Y^n$ are the PD's $P_{\mathbf{x}} \in \mathcal{P}(X)$ and $P_{\mathbf{x},\mathbf{y}} \in \mathcal{P}(X \times Y)$ defined by letting $P_{\mathbf{x}}(a)$ and $P_{\mathbf{x},\mathbf{y}}(a,b)$ be the relative frequency of $a$ among $x_1,\dots,x_n$ and of $(a,b)$ among $(x_1,y_1),\dots,(x_n,y_n)$, respectively, for all $a \in X$, $b \in Y$. Joint types of several $n$-length sequences are defined similarly. The subset of $\mathcal{P}(X)$ consisting of the possible types of sequences $\mathbf{x} \in X^n$ is denoted by $\mathcal{P}_n(X)$.

Lemma II.1: $|\mathcal{P}_n(X)| \le (n+1)^{|X|}$.

Proof: Elementary combinatorics.

The probability that $n$ independent drawings from a PD $Q$ give $\mathbf{x} \in X^n$ is denoted by $Q^n(\mathbf{x})$. Similarly, the probability of receiving $\mathbf{y} \in Y^n$ when $\mathbf{x} \in X^n$ is sent over a DMC with matrix $W\colon X \to Y$ is denoted by $W^n(\mathbf{y}|\mathbf{x})$. Clearly, if $\mathbf{x}$ has type $P$ and $(\mathbf{x},\mathbf{y})$ has joint type $P_{\mathbf{x},\mathbf{y}}(a,b) = P(a)V(b|a)$, then

$$Q^n(\mathbf{x}) = \exp\{-n[D(P\|Q) + H(P)]\} \qquad \text{(II.1)}$$

$$W^n(\mathbf{y}|\mathbf{x}) = \exp\{-n[D(V\|W|P) + H(V|P)]\}. \qquad \text{(II.2)}$$

Here $D(V\|W|P)$ is defined for any $P \in \mathcal{P}(X)$ and stochastic matrices $V, W \colon X \to Y$ by

$$D(V\|W|P) = \sum_{a \in X} P(a)\, D(V(\cdot|a)\|W(\cdot|a)) = D(P \times V\|P \times W) \qquad \text{(II.3)}$$

where $P \times V$ denotes the PD on $X \times Y$ with $(P \times V)(a,b) = P(a)V(b|a)$, whose $X$-marginal is $P$.

We will write $P \ll Q$, respectively $V \ll W$ ($P$-a.s.), to denote that $P(a)$ or $V(b|a)$ is $0$ for each $a$ or $(a,b)$ with $Q(a) = 0$ or $W(b|a) = 0$, respectively. The divergences in (II.1) and (II.2) are finite iff $P \ll Q$, respectively, $V \ll W$ ($P$-a.s.).

For $P \in \mathcal{P}_n(X)$, the type class $\{\mathbf{x} \colon P_{\mathbf{x}} = P\}$ will be denoted by $T^n_P$. Similarly, for a stochastic matrix $V$ we write $T^n_V(\mathbf{x}) = \{\mathbf{y} \colon P_{\mathbf{x},\mathbf{y}} = P_{\mathbf{x}} \times V\}$.

Lemma II.2: For any type $P \in \mathcal{P}_n(X)$

$$(n+1)^{-|X|}\exp\{nH(P)\} \le |T^n_P| \le \exp\{nH(P)\} \qquad \text{(II.4)}$$

and for any PD $Q$

$$(n+1)^{-|X|}\exp\{-nD(P\|Q)\} \le Q^n(T^n_P) \le \exp\{-nD(P\|Q)\}. \qquad \text{(II.5)}$$
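Both lemmas are easy to check numerically on a toy alphabet. The following sketch (Python; the alphabet, blocklength, and the PD $Q$ are illustrative choices, not from the paper) enumerates all binary sequences of a given length, groups them by type, and verifies the polynomial type count of Lemma II.1 together with both bounds of Lemma II.2:

```python
# Minimal numerical check of Lemmas II.1 and II.2 for a binary alphabet.
# All names and constants are illustrative, not from the paper.
from itertools import product
from math import log, exp
from collections import Counter, defaultdict

def entropy(p):
    return -sum(t * log(t) for t in p if t > 0)

def divergence(p, q):
    return sum(t * log(t / s) for t, s in zip(p, q) if t > 0)

n, X = 8, (0, 1)
Q = (0.3, 0.7)                           # an arbitrary PD on X

# Group all |X|^n sequences by type (empirical distribution).
classes = defaultdict(list)
for x in product(X, repeat=n):
    cnt = Counter(x)
    classes[tuple(cnt[a] / n for a in X)].append(x)

# Lemma II.1: the number of types is at most (n+1)^|X|.
assert len(classes) <= (n + 1) ** len(X)

for P, cls in classes.items():
    size, H, D = len(cls), entropy(P), divergence(P, Q)
    # Lemma II.2, (II.4): (n+1)^{-|X|} e^{nH(P)} <= |T_P| <= e^{nH(P)}
    assert (n + 1) ** (-len(X)) * exp(n * H) <= size <= exp(n * H)
    # Lemma II.2, (II.5): Q^n(T_P) = |T_P| e^{-n(D+H)} by (II.1)
    prob = size * exp(-n * (D + H))
    assert (n + 1) ** (-len(X)) * exp(-n * D) <= prob + 1e-12
    assert prob <= exp(-n * D) + 1e-12
print(f"{len(classes)} types for n={n}; all bounds of Lemma II.2 hold")
```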

Proof: Equation (II.1) with $Q = P$ gives

$$P^n(\mathbf{x}) = \exp\{-nH(P)\} \qquad \text{for } \mathbf{x} \in T^n_P.$$

Hence (II.4) follows because

$$1 \ge P^n(T^n_P) = |T^n_P|\exp\{-nH(P)\}$$

and

$$P^n(T^n_P) = \max_{P' \in \mathcal{P}_n(X)} P^n(T^n_{P'}) \ge \frac{1}{|\mathcal{P}_n(X)|} \ge (n+1)^{-|X|}$$

where the first equality can be checked by simple algebra. Clearly, (II.1) and (II.4) imply (II.5).

Remark: The bounds (II.4) and (II.5) could be sharpened via Stirling approximation to factorials, but that sharpening is seldom needed.

In the sequel, we will use the convenient notations $\doteq$ and $\mathrel{\dot\le}$ for equality and inequality up to a polynomial factor, i.e., $a_n \mathrel{\dot\le} b_n$ means that $a_n \le p_n b_n$ for all $n$, where $p_n$ is some polynomial of $n$, and $a_n \doteq b_n$ means that both $a_n \mathrel{\dot\le} b_n$ and $b_n \mathrel{\dot\le} a_n$. When $a_n$ and $b_n$ depend not only on $n$ but on other variables as well, it is understood that the polynomial $p_n$ can be chosen independently of those. With this notation, by Lemmas II.1 and II.2, we can write

$$|T^n_P| \doteq \exp\{nH(P)\}, \qquad Q^n(T^n_P) \doteq \exp\{-nD(P\|Q)\}. \qquad \text{(II.6)}$$

Random variables (RV's) with values in $X$, $Y$, etc., will be denoted by $X$, $Y$, etc. (often with indices). Distributions, respectively, joint distributions of RV's are denoted by $P_X$, $P_{XY}$, etc. It is often convenient to represent types, particularly joint types, as (joint) distributions of dummy RV's. For dummy RV's with $P_X = P_{\mathbf{x}}$ or $P_{XY} = P_{\mathbf{x},\mathbf{y}}$, etc., we will write $T^n_X$, $T^n_{XY}$, etc., instead of $T^n_{P_X}$, $T^n_{P_{XY}}$, etc. Also, we will use notations like $T^n_{Y|X}(\mathbf{x}) = \{\mathbf{y} \colon P_{\mathbf{x},\mathbf{y}} = P_{XY}\}$.

Lemma II.3: For $P_{XY}$ representing a joint type, i.e., $P_{XY} \in \mathcal{P}_n(X \times Y)$, and any $\mathbf{x} \in T^n_X$ and channel $W \colon X \to Y$

$$|T^n_{Y|X}(\mathbf{x})| \doteq \exp\{nH(Y|X)\}, \qquad W^n(T^n_{Y|X}(\mathbf{x})\,|\,\mathbf{x}) \doteq \exp\{-nD(P_{Y|X}\|W|P_X)\}. \qquad \text{(II.7)}$$

Proof: As $P^n_{Y|X}(\mathbf{y}|\mathbf{x})$ is constant for $\mathbf{y} \in T^n_{Y|X}(\mathbf{x})$, it equals $P^n_{Y|X}(T^n_{Y|X}(\mathbf{x})\,|\,\mathbf{x})\,/\,|T^n_{Y|X}(\mathbf{x})|$. Thus the first assertion follows from (II.4), since the constant value equals $\exp\{-nH(Y|X)\}$ by (II.2). The second assertion follows from the first one and (II.2).

The representation of types by dummy RV's suggests the introduction of "information measures" for (nonrandom) sequences. Namely, for $\mathbf{x} \in X^n$, $\mathbf{y} \in Y^n$, we define the (nonprobabilistic or empirical) entropy $H(\mathbf{x})$, conditional entropy $H(\mathbf{y}|\mathbf{x})$, and mutual information $I(\mathbf{x} \wedge \mathbf{y})$ as $H(X)$, $H(Y|X)$, $I(X \wedge Y)$ for dummy RV's whose joint distribution $P_{XY}$ equals the joint type $P_{\mathbf{x},\mathbf{y}}$. Of course, nonprobabilistic conditional mutual information like $I(\mathbf{x} \wedge \mathbf{y}|\mathbf{z})$ is defined similarly. Notice that on account of (II.1), for any $\mathbf{x}$ the probability $Q^n(\mathbf{x})$ is maximum if $Q = P_{\mathbf{x}}$. Hence the nonprobabilistic entropy $H(\mathbf{x})$ is actually the maximum-likelihood estimate of the entropy based upon the observed sequence $\mathbf{x}$.

III. SOURCE BLOCK CODING, HYPOTHESIS TESTING, LARGE DEVIATION THEORY

The working of the method of types is particularly simple in problems that involve only one set of sequences, such as source coding and hypothesis testing. Theorems III.1 and III.2 below establish the existence of source block codes and tests of (in general, composite) hypotheses, universally optimal in the sense of error exponents. Theorem III.1 appeared as a first illustration of the method of types in Csiszár and Körner [30, p. 37], cf. also Longo and Sgarro [63]. Formally, as pointed out below, it is a special case of Theorem III.2. The latter is effectively due to Hoeffding [53].

Theorem III.1: Given $R > 0$, the sets

$$A_n = \{\mathbf{x} \colon H(\mathbf{x}) \le R\} \subset X^n, \qquad n = 1, 2, \dots$$

satisfy

$$|A_n| \mathrel{\dot\le} \exp\{nR\} \qquad \text{(III.1)}$$

and for every PD $Q$

$$Q^n(X^n \setminus A_n) \mathrel{\dot\le} \exp\{-n\,e(R,Q)\} \qquad \text{(III.2)}$$

where

$$e(R,Q) = \min_{P \colon H(P) \ge R} D(P\|Q). \qquad \text{(III.3)}$$

Moreover, for any sequence of sets $A_n \subset X^n$ satisfying (III.1), we have for all $Q$

$$\limsup_{n \to \infty} -\frac1n \log Q^n(X^n \setminus A_n) \le e(R,Q). \qquad \text{(III.4)}$$

Interpretation: Encoding $n$-length sequences by assigning distinct codewords to sequences of empirical entropy $H(\mathbf{x}) \le R$, this code is universally optimal among those of (asymptotic) rate $R$, for the class of memoryless sources: for any source distribution $Q$ of entropy less than $R$, the error probability goes to $0$ exponentially, with exponent $e(R,Q)$ that could not be improved even if the distribution $Q$ were known.
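Before turning to the general hypothesis testing result, here is a minimal numerical sketch of Theorem III.1 (Python, binary alphabet; the rate $R$ and the source $Q$ are arbitrary choices). It tabulates all types of length $n$, computes $|A_n|$ and $Q^n(X^n \setminus A_n)$ exactly via (II.1), and compares the measured exponent with $e(R,Q)$ of (III.3) minimized over types:

```python
# Sketch of Theorem III.1 for a binary alphabet; toy constants only.
from math import log, exp, lgamma

def H(p): return -sum(t * log(t) for t in p if t > 0)
def D(p, q): return sum(t * log(t / s) for t, s in zip(p, q) if t > 0)
def C(n, k): return exp(lgamma(n+1) - lgamma(k+1) - lgamma(n-k+1))

R = 0.5                      # code rate in nats; must exceed H(Q)
Q = (0.9, 0.1)               # source distribution, H(Q) ~ 0.325 nats

for n in (8, 16, 32, 64):
    types = [((k / n, (n - k) / n), C(n, k)) for k in range(n + 1)]
    A = sum(sz for P, sz in types if H(P) <= R)          # |A_n|
    err = sum(sz * exp(-n * (D(P, Q) + H(P)))            # Q^n(A_n^c), (II.1)
              for P, sz in types if H(P) > R)
    e_RQ = min(D(P, Q) for P, _ in types if H(P) >= R)   # cf. (III.3)
    print(f"n={n:2d}  (1/n)log|A_n|={log(A)/n:.3f}"
          f"  -(1/n)log err={-log(err)/n:.3f}  e(R,Q)~{e_RQ:.3f}")
```

As $n$ grows, the rate of $A_n$ approaches $R$ and the measured error exponent approaches $e(R,Q)$, as the theorem predicts.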

For a set $\Pi$ of PD's, and $Q \in \mathcal{P}(X)$, write

$$D(\Pi\|Q) = \inf_{P \in \Pi} D(P\|Q). \qquad \text{(III.5)}$$

Further, for $P \in \mathcal{P}(X)$ and $\delta \ge 0$, denote by $B(P,\delta)$ the "divergence ball with center $P$ and radius $\delta$," and for $\Pi \subset \mathcal{P}(X)$, $\delta \ge 0$, denote by $B(\Pi,\delta)$ the "divergence $\delta$-neighborhood" of $\Pi$:

$$B(P,\delta) = \{P' \colon D(P'\|P) \le \delta\}, \qquad B(\Pi,\delta) = \bigcup_{P \in \Pi} B(P,\delta). \qquad \text{(III.6)}$$

Theorem III.2: Given any set $\Pi$ of PD's on $X$ and a sequence $\delta_n > 0$, let $B_n$ be the set of those $\mathbf{x} \in X^n$ whose type is in the complement of $B(\Pi, \delta_n)$ if $\delta_n \to 0$, $n\delta_n \to \infty$, respectively, in the complement of $B(\overline\Pi, \delta)$ if $\delta_n \equiv \delta$, where $\overline\Pi$ denotes the closure of $\Pi$. Then

$$P^n(B_n) \mathrel{\dot\le} \exp\{-n\delta_n\} \qquad \text{for every } P \in \Pi \qquad \text{(III.7)}$$

and for each $Q$

$$Q^n(X^n \setminus B_n) \doteq \begin{cases}\exp\{-nD(\Pi\|Q)\} & \text{if } \delta_n \to 0\\ \exp\{-nD(B(\overline\Pi,\delta)\|Q)\} & \text{if } \delta_n \equiv \delta.\end{cases} \qquad \text{(III.8)}$$

Moreover, for arbitrary $\Pi$ and $Q$, and any sequence of sets $C_n \subset X^n$,

$$P^n(C_n) \le \exp\{-n\delta_n\} \quad \text{for every } P \in \Pi, \text{ where } n\delta_n \to \infty \qquad \text{(III.9)}$$

implies

$$\limsup_{n\to\infty} -\frac1n \log Q^n(X^n \setminus C_n) \le D(\Pi\|Q). \qquad \text{(III.10)}$$

Interpretation: For testing the null-hypothesis that the true distribution belongs to $\Pi$, with type 1 error probability required to decrease with a given exponential rate or just go to zero, a universally rate optimal test is the one with critical region $B_n$ (i.e., the test rejects the null-hypothesis iff $\mathbf{x} \in B_n$). Indeed, by (III.9) and (III.10), no tests meeting the type 1 error requirement (III.7), even if designed against a particular alternative $Q$, can have type 2 error probability decreasing with a better exponential rate than that in (III.8) (when $\delta_n \to 0$, this follows simply because the exponent in (III.8) equals $D(\Pi\|Q)$; in the case $\delta_n \equiv \delta$, one needs the observation that the divergence $\delta$-neighborhood absorbs an additional exponent $\delta$ for every $P \in \overline\Pi$). In particular, for $Q$ such that the exponent in (III.8) is $0$, the type 2 error probability cannot exponentially decrease at all. The notion that the null-hypothesis is rejected whenever the type of the observed sample is outside a "divergence neighborhood" of $\Pi$ is intuitive enough. In addition, notice that by (II.1) the rejection criterion $\min_{P \in \Pi} D(P_{\mathbf{x}}\|P) > \delta_n$ is equivalent to a comparison of maximized likelihoods. Hence the above universally rate optimal tests are what statisticians call likelihood ratio tests.

Remarks:

i) One sees from (III.9) and (III.10) that in the case $\delta_n \to 0$, the type 2 error probability exponent (III.8) could not be improved even if (III.7) were replaced by the weaker requirement that $P^n(B_n) \le \exp\{-n\delta'_n\}$ for some $\delta'_n \to 0$ with $n\delta'_n \to \infty$, for each $P \in \Pi$.

ii) In the case $\delta_n \equiv \delta$, the requirement (III.7) could be relaxed to the same bound holding for each $P \in \overline\Pi$, provided that $D(B(\overline\Pi,\delta)\|Q) = D(B(\Pi,\delta)\|Q)$; the latter always holds if $Q$ has full support but not necessarily otherwise. A particular case of this result is known as Stein's lemma, cf. Chernoff [20]: if a simple null-hypothesis $P$ is to be tested against a simple alternative $Q$, with an arbitrarily fixed upper bound $0 < \varepsilon < 1$ on the type 1 error probability, the type 2 error probability can be made to decrease with exponential rate $D(P\|Q)$ but not better.

iii) Theorem III.2 contains Theorem III.1 as the special case $\Pi = \{P \colon H(P) \ge R\}$, $\delta_n \to 0$, noting that $D(P\|U) = \log|X| - H(P)$ where $U$ is the uniform distribution on $X$.
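Before the proof, here is a small sketch of the universally rate optimal test (Python, Bernoulli distributions; the composite null set, the alternative, and $\delta_n$ are arbitrary choices). It computes both error probabilities exactly by summing whole type classes; at finite $n$ the measured type 2 exponent sits below $D(\Pi\|Q)$ because of the $\delta_n$-neighborhood, and the gap shrinks as $n$ grows:

```python
# Sketch of the test of Theorem III.2: reject the null set Pi iff the
# observed type is outside its divergence delta_n-neighborhood.
from math import log, exp, lgamma

def D(p, q): return sum(t * log(t / s) for t, s in zip(p, q) if t > 0)

def class_prob(n, k, q):        # Q^n(T_P) for the type P = (k/n, 1-k/n)
    return exp(lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(q[0]) + (n - k) * log(q[1]))

Pi   = [(0.4, 0.6), (0.5, 0.5)]   # composite null hypothesis
Qalt = (0.8, 0.2)                 # one alternative, for the type 2 error
n = 200
delta_n = 4 * log(n + 1) / n      # delta_n -> 0 while n*delta_n -> inf

type1 = [0.0] * len(Pi)
type2 = 0.0
for k in range(n + 1):
    P = (k / n, (n - k) / n)
    if min(D(P, P0) for P0 in Pi) > delta_n:   # rejection region B_n
        for i, P0 in enumerate(Pi):
            type1[i] += class_prob(n, k, P0)
    else:
        type2 += class_prob(n, k, Qalt)

print("type 1 errors:", [f"{e:.2e}" for e in type1])
print(f"-(1/n) log type2 = {-log(type2)/n:.3f}, "
      f"D(Pi||Q) = {min(D(P0, Qalt) for P0 in Pi):.3f}")
```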

Proof of Theorem III.2: Suppose first that $\delta_n \to 0$, $n\delta_n \to \infty$. Then $B_n$ is the union of the type classes $T^n_P$ for types $P \in \mathcal{P}_n(X)$ not belonging to $B(\Pi,\delta_n)$. By the definition (III.6) of $B(\Pi,\delta_n)$, for such $P$ we have $D(P\|P') > \delta_n$ for every $P' \in \Pi$ and hence, by Lemma II.2, $P'^n(T^n_P) \le \exp\{-n\delta_n\}$ for each $P' \in \Pi$. This gives by Lemma II.1

$$P'^n(B_n) \le (n+1)^{|X|}\exp\{-n\delta_n\} \qquad \text{for } P' \in \Pi \qquad \text{(III.11)}$$

establishing (III.7) (the $\mathrel{\dot\le}$ notation has been defined in Section II). Further, $X^n \setminus B_n$ is the union of the type classes for types $P$ that belong to $B(\Pi,\delta_n)$, thus satisfy $D(P\|Q) \ge D(B(\Pi,\delta_n)\|Q)$. Hence we get as above

$$Q^n(X^n \setminus B_n) \le (n+1)^{|X|}\exp\{-nD(B(\Pi,\delta_n)\|Q)\} \qquad \text{(III.12)}$$

and this gives, by (III.5) and (III.6),

$$Q^n(X^n \setminus B_n) \mathrel{\dot\le} \exp\{-nD(\Pi\|Q) + o(n)\}. \qquad \text{(III.13)}$$

To complete the proof in the case $\delta_n \to 0$, it suffices to prove that (III.9) implies (III.10); applied to $C_n = B_n$ (admissible by (III.11)), this also supplies the matching lower bound in (III.8). Given any $C_n$ satisfying (III.9), the assumption implies for sufficiently large $n$ that

$$|C_n \cap T^n_P| \le \tfrac12|T^n_P| \qquad \text{for all } P \in \mathcal{P}_n(X) \text{ with } D(P\|P') \le \delta''_n \text{ for some } P' \in \Pi \qquad \text{(III.14)}$$

where $\delta''_n = \delta_n - n^{-1}(|X|\log(n+1) + \log 2)$. Indeed, else $P'^n(C_n) > \tfrac12 P'^n(T^n_P)$ would hold for some such $P$ and $P' \in \Pi$ (since sequences in the same type class are equiprobable, the fraction of $T^n_P$ covered by $C_n$ equals the corresponding fraction of probability); the latter would imply by Lemma II.2 that $P'^n(C_n) > \tfrac12 (n+1)^{-|X|}\exp\{-n\delta''_n\} = \exp\{-n\delta_n\}$, contradicting the assumption in (III.9).

Given any $\varepsilon > 0$, take $P^* \in \Pi$ such that

$$D(P^*\|Q) < D(\Pi\|Q) + \varepsilon$$

and take types $P_n \in \mathcal{P}_n(X)$ such that

$$D(P_n\|P^*) \le \delta''_n, \qquad D(P_n\|Q) \to D(P^*\|Q)$$

(possible for large $n$). Then by (III.14) and (II.6)

$$Q^n(X^n \setminus C_n) \ge \tfrac12 Q^n(T^n_{P_n}) \doteq \exp\{-nD(P_n\|Q)\}. \qquad \text{(III.15)}$$

As $\varepsilon$ was arbitrary, (III.15) establishes (III.10), completing the case $\delta_n \to 0$.

In the remaining case $\delta_n \equiv \delta$, (III.11) will hold with $\delta$ replaced for $\delta_n$; using the assumption that the types in $B_n$ avoid the $\delta$-neighborhood of $\overline\Pi$, this yields (III.7). Also (III.12) will hold with $B(\Pi,\delta_n)$ replaced by $B(\overline\Pi,\delta)$, and it is easy to see that this implies

$$Q^n(X^n \setminus B_n) \mathrel{\dot\le} \exp\{-nD(B(\overline\Pi,\delta)\|Q)\}. \qquad \text{(III.16)}$$

To complete the proof, it suffices to prove the analog of (III.10). Now, (III.9) with $\delta_n \equiv \delta$ implies, for large $n$, that $|C_n \cap T^n_P| \le \tfrac12|T^n_P|$ for some type $P$ arbitrarily close in divergence to the minimizer of $D(\cdot\|Q)$ over $B(\overline\Pi,\delta)$ (III.17); indeed, else the opposite would hold for all such $P$, implying that $P'^n(C_n)$ for a suitable $P' \in \Pi$ exceeds $\exp\{-n\delta\}$, since by Lemmas II.1 and II.2 the $P'^n$-probability of the union of the type classes in question goes to $1$ as $n \to \infty$; for large $n$, this contradicts (III.9). Pick $P$ satisfying (III.17), then

$$Q^n(X^n \setminus C_n) \ge \tfrac12 Q^n(T^n_P) \doteq \exp\{-nD(P\|Q)\}. \qquad \text{(III.18)}$$

Here $D(P\|Q)$ is arbitrarily close to $D(B(\overline\Pi,\delta)\|Q)$ by assumption, and this implies the required converse. Thus (III.18) gives (III.10).

Large Deviations, Gibbs' Conditioning Principle

Large deviation theory is concerned with "small probability events," typically of probability going to zero exponentially. An important example of such events is that the type of an independent and identically distributed (i.i.d.) random sequence $X_1, \dots, X_n$ belongs to a given set $\Pi$ of PD's on $X$ that does not contain the distribution $P$ of the RV's $X_i$. In this context, the type of $X_1 \cdots X_n$ is called the empirical distribution $\hat P_n$. Thus

$$\hat P_n = P_{X_1 \cdots X_n}. \qquad \text{(III.19)}$$

Theorem III.3: For any $\Pi \subset \mathcal{P}(X)$

$$\limsup_{n\to\infty} \frac1n \log \Pr\{\hat P_n \in \Pi\} \le -D(\Pi\|P) \qquad \text{(III.20)}$$

and if $\Pi$ has the property that

$$\min_{P' \in \Pi \cap \mathcal{P}_n(X)} D(P'\|P) \to D(\Pi\|P) \qquad \text{(III.21)}$$

then

$$\lim_{n\to\infty} \frac1n \log \Pr\{\hat P_n \in \Pi\} = -D(\Pi\|P). \qquad \text{(III.22)}$$

Corollary: If $\Pi$ satisfies (III.21) and a unique $P^* \in \overline\Pi$ satisfies $D(P^*\|P) = D(\Pi\|P)$, then for any fixed $k$, the conditional joint distribution of $X_1, \dots, X_k$ on the condition $\hat P_n \in \Pi$ approaches $(P^*)^k$ as $n \to \infty$.

Remarks:

i) The condition (III.21) is trivially satisfied if $\Pi$ is an open subset of $\mathcal{P}(X)$ or, more generally, if each $P' \in \Pi$ is contained in the closure of the set of those $P'' \in \Pi$ for which all PD's sufficiently close to $P''$ belong to $\Pi$. In particular, (III.8) for $\delta_n \to 0$ is an instance of (III.22). Hoeffding [53] considered sets $\Pi$ of PD's such that (III.21) holds with rate of convergence $O(n^{-1/2})$. For such $\Pi$, (III.22) can be sharpened to

$$\Pr\{\hat P_n \in \Pi\} \doteq \exp\{-nD(\Pi\|P)\}. \qquad \text{(III.23)}$$

ii) The empirical distribution can be defined also for RV's taking values in an arbitrary (rather than finite) set $\mathcal{X}$, cf. Section VII, (VII.21). Theorem III.3 and its extensions to arbitrary $\mathcal{X}$ are referred to as Sanov's theorem, the first general result being that of Sanov [72], cf. also Dembo and Zeitouni [39].

iii) The Corollary is an instance of "Gibbs' conditioning principle," cf. [39].

Proof: By (III.19) we have

$$\Pr\{\hat P_n \in \Pi\} = P^n\Big(\bigcup_{P' \in \Pi \cap \mathcal{P}_n(X)} T^n_{P'}\Big). \qquad \text{(III.24)}$$

By Lemma II.2 and (III.5), this gives

$$\Pr\{\hat P_n \in \Pi\} \doteq \exp\Big\{-n \min_{P' \in \Pi \cap \mathcal{P}_n(X)} D(P'\|P)\Big\} \qquad \text{(III.25)}$$

whence Theorem III.3 follows.
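The exact type-class sum (III.24) is easy to evaluate numerically. The sketch below (Python, Bernoulli case; the set $\Pi$ and the true $P$ are arbitrary choices) computes $\Pr\{\hat P_n \in \Pi\}$ exactly and shows the exponent converging to $D(\Pi\|P)$ as in (III.22):

```python
# Numerical illustration of Theorem III.3 (Sanov's theorem), binary case:
# Pi = {P' : P'(1) >= 0.7} while the true P has P(1) = 0.5.
from math import log, exp, lgamma

def D(p, q): return sum(t * log(t / s) for t, s in zip(p, q) if t > 0)

P_true = (0.5, 0.5)
threshold = 0.7                 # Pi = {P' : P'(1) >= threshold}
D_Pi = D((1 - threshold, threshold), P_true)  # minimizer on the boundary

for n in (20, 80, 320):
    prob = 0.0
    for k in range(n + 1):
        if k / n >= threshold:  # empirical distribution falls in Pi
            prob += exp(lgamma(n+1) - lgamma(k+1) - lgamma(n-k+1)
                        + n * log(0.5))
    print(f"n={n:3d}  -(1/n)log P(hat P_n in Pi) = {-log(prob)/n:.4f}"
          f"   D(Pi||P) = {D_Pi:.4f}")
```

The dominant contribution comes from the single type class nearest to $P$ in divergence, in line with Gibbs' conditioning principle.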

To prove the Corollary, notice first that for any type $P' \in \mathcal{P}_n(X)$ the conditional probability of $X_1 = x_1, \dots, X_k = x_k$ on the condition $\hat P_n = P'$ is the same as the probability of the following: given an urn containing $n$ balls marked with symbols $a \in X$, where the number of balls marked by $a$ equals $nP'(a)$, if $k$ balls are drawn without replacement their consecutive marks will be $x_1, \dots, x_k$. Clearly, this probability approaches that for drawing with replacement, uniformly in the sense that the difference is less than any fixed $\varepsilon > 0$ for $n$ sufficiently large depending on $k$ and $\varepsilon$ only. Thus

$$\Big|\Pr\{X_1 = x_1, \dots, X_k = x_k \mid \hat P_n = P'\} - \prod_{i=1}^k P'(x_i)\Big| < \varepsilon \qquad \text{if } n \ge n_0(k,\varepsilon). \qquad \text{(III.26)}$$

Let

$$\Pi_\delta = \Pi \cap B(P^*, \delta) \qquad \text{(III.27)}$$

be a small neighborhood of $P^*$ within $\Pi$. As $\overline\Pi \setminus \Pi_\delta$ is closed, there exists $P^\dagger$ in it with $D(P^\dagger\|P) = D(\overline\Pi \setminus \Pi_\delta\|P)$, and by the assumed uniqueness of $P^*$ this implies

$$D(\Pi \setminus \Pi_\delta\|P) > D(\Pi\|P). \qquad \text{(III.28)}$$

The Corollary follows since

$$\Pr\{X_1 = x_1, \dots, X_k = x_k \mid \hat P_n \in \Pi\} = \sum_{P' \in \Pi \cap \mathcal{P}_n(X)} \Pr\{\hat P_n = P' \mid \hat P_n \in \Pi\}\,\Pr\{X_1 = x_1, \dots, X_k = x_k \mid \hat P_n = P'\}$$

where for sufficiently large $n$ the absolute value term in (III.26) is arbitrarily small if $P' \in \Pi_\delta$ with $\delta$ small, by (III.26) and (III.27), while the total conditional probability of the types $P' \notin \Pi_\delta$ is exponentially small, by (III.28), Lemma II.2, and (III.22).

IV. ERROR EXPONENTS FOR DMC'S

A first major success of the method of types was to gain better insight into the error exponent problem for DMC's. Theorem IV.1, below, due to Csiszár, Körner, and Marton [32], shows that the "random coding" error exponent for constant composition codes is attainable universally, i.e., by the same encoder and decoder, for all DMC's for which it is positive. An important feature of the proof is that, rather than bounding the expectation of the error probability for an ensemble of codes (and concluding that some code in the ensemble meets the obtained bound), the error probability is bounded for a given codeword set and a given decoder. The role of "random coding" reduces to showing the existence of a "good" codeword set. We will also reproduce the simple "method of types" proof of the "sphere packing" bound for constant composition codes (Theorem IV.2, cf. Csiszár and Körner [30, p. 181]) establishing that the previous universally attainable error exponent is often (though not always) the best possible even for a known channel. This optimality holds among codes with a fixed codeword type, while the type yielding the best exponent depends on the actual channel.

Lemma IV.1: For any $R > 0$, $\delta > 0$ and type $P \in \mathcal{P}_n(X)$ with $n$ sufficiently large, there exist $N \ge \exp\{n(R - \delta)\}$ sequences $\mathbf{x}_1, \dots, \mathbf{x}_N$ in $T^n_P$ such that for every joint type $P_{XX'}$ with $P_X = P_{X'} = P$ and every $i$

$$|\{j \colon \mathbf{x}_j \in T^n_{X'|X}(\mathbf{x}_i)\}| \le \exp\{n|R - I(X \wedge X')|^+\}, \qquad |t|^+ = \max(t, 0) \qquad \text{(IV.1)}$$

(cf. Section II for notation).

Remark: Equation (IV.1), applied to the diagonal joint type, implies that the codeword $\mathbf{x}_i$ is not repeated, for each $i$. In particular, $\mathbf{x}_1, \dots, \mathbf{x}_N$ are distinct sequences if $R < H(P)$.

Proof: By a simple random coding argument. For completeness, the proof will be given in the Appendix.

Theorem IV.1: For any DMC $W \colon X \to Y$, a code with codeword set $\{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ as in Lemma IV.1 and decoder $\varphi \colon Y^n \to \{0, 1, \dots, N\}$ defined by

$$\varphi(\mathbf{y}) = \begin{cases} i & \text{if } I(\mathbf{x}_i \wedge \mathbf{y}) > I(\mathbf{x}_j \wedge \mathbf{y}) \text{ for some } i \text{ and all } j \ne i\\ 0 & \text{if no such } i \text{ exists}\end{cases} \qquad \text{(IV.2)}$$

has maximum probability of error satisfying

$$e_{\max} \le \exp\{-n(E_r(R, P, W) - \delta_n)\}, \qquad \delta_n \to 0 \qquad \text{(IV.3)}$$

where

$$E_r(R, P, W) = \min_{V \colon X \to Y}\big[D(V\|W|P) + |I(P, V) - R|^+\big] \qquad \text{(IV.4)}$$

and where

$$I(P, V) = I(X \wedge Y) \qquad \text{for } P_{XY} = P \times V; \qquad \text{(IV.5)}$$

$E_r(R, P, W)$ is the "random coding" exponent for codeword type $P$.

Remarks:

i) Denote by $I(P, W)$ the mutual information $I(X \wedge Y)$ when $P_{XY} = P \times W$. Clearly, $E_r(R, P, W) > 0$ iff $R < I(P, W)$. Thus the channel-independent codes in Theorem IV.1 have exponentially decreasing error probability for every DMC $W$ with $I(P, W) > R$; it is well known that for channels with $I(P, W) < R$ no rate-$R$ codes with codewords in $T^n_P$ have small probability of error. The exponent $E_r(R, P, W)$ is best possible for many channels, namely, for those that satisfy $R \ge R_{cr}(P, W)$, cf. Remark ii) to Theorem IV.2.

ii) It follows by a simple continuity argument that for any DMC $W$

$$E_r(R, W) = \max_P E_r(R, P, W)$$

is also an attainable error exponent with rate-$R$ codes, though no longer channel-independent ones (as the maximizing $P$ depends on $W$). This exponent is positive for every $R$ less than the channel capacity $C = \max_P I(P, W)$. It was first derived by Fano [44], and then in a simple way by Gallager [46], in algebraic forms different from that given above.

iii) The "empirical mutual information decoder" (IV.2) was first suggested by Goppa [48] as one not depending on the channel and still suitable to attain channel capacity. This decoder could be equivalently defined, perhaps more intuitively, by minimizing the "entropy distance" $H(\mathbf{x}_i, \mathbf{y})$ of the codewords from the received $\mathbf{y}$. Lemma IV.1 may be visualized as one asserting the existence of codes with good entropy distance distribution. As Blahut [16] observed, among sequences in a single type class the entropy distance satisfies the axioms of a metric, except that it may vanish for certain pairs of distinct sequences.
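The decoder (IV.2) is simple to simulate. The following sketch (Python; the blocklength, code size, and BSC crossover probability are arbitrary toy choices, not from the paper) builds a random constant composition code and decodes by maximizing the empirical mutual information:

```python
# Toy simulation of the "maximum mutual information" decoder (IV.2)
# on a BSC, with a random constant composition code.
import random
from math import log
from collections import Counter

def I_emp(x, y):                 # empirical mutual information I(x ^ y)
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum(c / n * log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

random.seed(1)
n, N, eps = 100, 8, 0.05         # blocklength, codewords, BSC(eps)
base = [0] * (n // 2) + [1] * (n // 2)   # all codewords of type (1/2,1/2)
code = [random.sample(base, n) for _ in range(N)]

errors, trials = 0, 500
for _ in range(trials):
    i = random.randrange(N)
    y = [b ^ (random.random() < eps) for b in code[i]]
    ihat = max(range(N), key=lambda j: I_emp(code[j], y))
    errors += (ihat != i)
print(f"empirical error rate of the MMI decoder: {errors / trials:.3f}")
```

Note that the decoder nowhere uses the crossover probability: the same code and decoder would work for any DMC with $I(P,W)$ above the code rate, which is the universality claim of Theorem IV.1.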

Proof of Theorem IV.1: As $W^n(\mathbf{y}|\mathbf{x}_i)$ is constant for $\mathbf{y} \in T^n_{Y|X}(\mathbf{x}_i)$, we may write, using (II.7),

$$e_i \doteq \sum_{P_{XY} \colon P_X = P} \frac{|\{\mathbf{y} \in T^n_{Y|X}(\mathbf{x}_i) \colon \varphi(\mathbf{y}) \ne i\}|}{|T^n_{Y|X}(\mathbf{x}_i)|}\,\exp\{-nD(P_{Y|X}\|W|P)\} \qquad \text{(IV.6)}$$

where the summation is for all joint types $P_{XY} \in \mathcal{P}_n(X \times Y)$ with $P_X = P$. This and other sums below will be bounded in the $\doteq$ sense by the largest term, without explicitly saying so.

To bound the cardinality ratios in (IV.6), notice that by the definition (IV.2) of $\varphi$, $\varphi(\mathbf{y}) \ne i$ iff there exists $j \ne i$ such that $I(\mathbf{x}_j \wedge \mathbf{y}) \ge I(\mathbf{x}_i \wedge \mathbf{y})$, i.e., such that the joint type of $(\mathbf{x}_i, \mathbf{x}_j, \mathbf{y})$ is represented by dummy RV's $X, X', Y$ with

$$I(X' \wedge Y) \ge I(X \wedge Y). \qquad \text{(IV.7)}$$

Thus, denoting by $\mathcal{V}_n(P_{XY})$ the set of joint types $P_{XX'Y}$ with given marginal $P_{XY}$, with $P_{X'} = P$, and satisfying (IV.7),

$$|\{\mathbf{y} \in T^n_{Y|X}(\mathbf{x}_i) \colon \varphi(\mathbf{y}) \ne i\}| \le \sum_{P_{XX'Y} \in \mathcal{V}_n(P_{XY})}\ \sum_{j \colon \mathbf{x}_j \in T^n_{X'|X}(\mathbf{x}_i)} |T^n_{Y|XX'}(\mathbf{x}_i, \mathbf{x}_j)|. \qquad \text{(IV.8)}$$

By Lemma II.3

$$|T^n_{Y|XX'}(\mathbf{x}_i, \mathbf{x}_j)| \mathrel{\dot\le} \exp\{nH(Y|XX')\}. \qquad \text{(IV.9)}$$

Since $\mathbf{x}_j \in T^n_{X'|X}(\mathbf{x}_i)$ only when the joint type of $(\mathbf{x}_i, \mathbf{x}_j)$ equals $P_{XX'}$, (IV.1) and (IV.9) imply that

$$\frac{|\{\mathbf{y} \in T^n_{Y|X}(\mathbf{x}_i) \colon \varphi(\mathbf{y}) \ne i\}|}{|T^n_{Y|X}(\mathbf{x}_i)|} \mathrel{\dot\le} \max_{\mathcal{V}_n(P_{XY})} \exp\{-n[I(X' \wedge XY) - |R - I(X \wedge X')|^+]\} \qquad \text{(IV.10)}$$

where $|T^n_{Y|X}(\mathbf{x}_i)| \doteq \exp\{nH(Y|X)\}$ and the identity $H(Y|X) - H(Y|XX') + I(X \wedge X') = I(X' \wedge XY)$ were used. This bound remains valid if the bracketed exponent is replaced by $|I(X' \wedge XY) - R|^+$, since the left-hand side is always $\le 1$. Hence (IV.8) gives

$$\frac{|\{\mathbf{y} \in T^n_{Y|X}(\mathbf{x}_i) \colon \varphi(\mathbf{y}) \ne i\}|}{|T^n_{Y|X}(\mathbf{x}_i)|} \mathrel{\dot\le} \exp\{-n|I(X \wedge Y) - R|^+\} \qquad \text{(IV.11)}$$

where the last inequality holds by (IV.7). Substituting (IV.11) into (IV.6) gives (IV.3), completing the proof.

Theorem IV.2: Given arbitrary $R > 0$, $\delta > 0$, and DMC $W \colon X \to Y$, every code of sufficiently large blocklength $n$ with $N \ge \exp\{n(R + \delta)\}$ codewords, each of the same type $P$, and with arbitrary decoder $\varphi$, has average probability of error satisfying

$$\bar e \ge \exp\{-n(E_{sp}(R, P, W) + \delta)\} \qquad \text{(IV.12)}$$

where

$$E_{sp}(R, P, W) = \min_{V \colon I(P, V) \le R} D(V\|W|P) \qquad \text{(IV.13)}$$

is the "sphere packing" exponent for codeword type $P$.

Remarks:

i) It follows from Theorem IV.2 that even if the codewords are not required to be of the same type, the lower bound (IV.12) to the average probability of error always holds with $E_{sp}(R, P, W)$ replaced by

$$E_{sp}(R, W) = \max_P E_{sp}(R, P, W).$$

The latter is equal to the exponent in the "sphere packing bound" first proved by Shannon, Gallager, and Berlekamp [74]. The first simple proof of the sphere packing bound was given by Haroutunian [52]. The author is indebted to Haroutunian for the information that he had been aware also of the proof reproduced below but published only his other proof because it was not restricted to the discrete case.

ii) Both $E_r(R, P, W)$ and $E_{sp}(R, P, W)$ are convex functions of $R$, positive in the same interval $R < I(P, W)$; they are equal if $R \ge R_{cr}(P, W)$, where $R_{cr}(P, W)$ is the abscissa of the leftmost point where the graph of $E_{sp}(R, P, W)$ as a function of $R$ meets its supporting line of slope $-1$. The same holds for $E_r(R, W)$ and $E_{sp}(R, W)$, which are (positive and) equal in an interval $R_{cr} \le R < C$. For $R$ in this interval, their common value is the exact error exponent for rate-$R$ codes. For smaller rates, the exact error exponent is still unknown.

iii) Dueck and Körner [42] showed that for codes with codeword set $\{\mathbf{x}_1, \dots, \mathbf{x}_N\} \subset T^n_P$ and rate $\frac1n \log N \to R$ with $R > I(P, W)$, the average probability of correct decoding goes to zero exponentially with exponent not smaller than the minimum of $D(V\|W|P) + R - I(P, V)$ for channels $V$ satisfying $I(P, V) \le R$. This result follows from (IV.16) by Lemma II.3. In [42] also its tightness was proved.
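The two exponents are easy to evaluate numerically for a BSC. The sketch below (Python; equiprobable input type, arbitrary crossover probability and rates) restricts the minimizations in (IV.4) and (IV.13) to symmetric auxiliary channels $V$, which suffices for the BSC by symmetry, and shows the exponents coinciding above the critical rate and diverging below it:

```python
# Crude numerical sketch of E_r (IV.4) and E_sp (IV.13) for a BSC(0.1),
# uniform input type, grid search over symmetric auxiliary channels V.
from math import log

def Hb(t):
    return 0.0 if t in (0.0, 1.0) else -t * log(t) - (1 - t) * log(1 - t)

def exponents(R, eps, steps=2000):
    Er = Esp = float("inf")
    for i in range(1, steps):
        v = i / steps                       # V = BSC(v)
        D = v * log(v / eps) + (1 - v) * log((1 - v) / (1 - eps))
        I = log(2) - Hb(v)                  # I(P,V) for uniform P
        Er = min(Er, D + max(I - R, 0.0))   # random coding exponent
        if I <= R:
            Esp = min(Esp, D)               # sphere packing exponent
    return Er, Esp

eps = 0.1
for R in (0.1, 0.2, 0.3, 0.4):              # rates in nats
    Er, Esp = exponents(R, eps)
    print(f"R={R:.1f}  E_r={Er:.4f}  E_sp={Esp:.4f}")
```

At $R = 0.4$, above the capacity $\log 2 - H_b(0.1) \approx 0.368$ nats, both exponents vanish.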

Proof of Theorem IV.2: Given a codeword set $\{\mathbf{x}_1, \dots, \mathbf{x}_N\} \subset T^n_P$ and arbitrary decoder $\varphi$, write

$$B_i = \{\mathbf{y} \colon \varphi(\mathbf{y}) = i\}. \qquad \text{(IV.14)}$$

For every joint type $P_{XY}$ with $P_X = P$, since the sets $B_i$ are disjoint and $T^n_{Y|X}(\mathbf{x}_i) \subset T^n_{P_Y}$,

$$\sum_{i=1}^N |T^n_{Y|X}(\mathbf{x}_i) \cap B_i| \le |T^n_{P_Y}|. \qquad \text{(IV.15)}$$

Hence, supposing $N \ge \exp\{n(R + \delta)\}$, it follows by (II.6) and (II.7) that for at least half of the indices $i$

$$\frac{|T^n_{Y|X}(\mathbf{x}_i) \cap B_i|}{|T^n_{Y|X}(\mathbf{x}_i)|} \mathrel{\dot\le} \frac{2\exp\{nH(Y)\}}{N\exp\{nH(Y|X)\}} \le 2\exp\{-n(R + \delta - I(X \wedge Y))\}. \qquad \text{(IV.16)}$$

In particular, if $I(X \wedge Y) \le R$ and $n$ is sufficiently large, the left-hand side of (IV.16) is less than $\frac12$, say, and hence for such $i$

$$W^n(\{\mathbf{y} \colon \varphi(\mathbf{y}) \ne i\}\,|\,\mathbf{x}_i) \ge \tfrac12 W^n(T^n_{Y|X}(\mathbf{x}_i)\,|\,\mathbf{x}_i). \qquad \text{(IV.17)}$$

On account of (IV.14) and the definition of $\bar e$, it follows from (IV.17) and Lemma II.3 that

$$\bar e \ge \tfrac14 \exp\{-n(D(V\|W|P) + \varepsilon_n)\}, \qquad \varepsilon_n \to 0 \qquad \text{(IV.18)}$$

for joint types $P_{XY} = P \times V$ with $I(X \wedge Y) \le R$. A simple continuity argument shows that for sufficiently large $n$ the minimum of $D(V\|W|P)$ for these joint types is less than the minimum in (IV.13) plus $\delta$, and this completes the proof.

Related Further Results

The proof technique of Theorem IV.1 has led to various further results. Already in Csiszár, Körner, and Marton [32] a stronger result than Theorem IV.1 was proved, with an exponent better for "small rates" than $E_r(R, P, W)$. With the channel-dependent maximum-likelihood decoder, a similar derivation yields an even better exponent for small rates that, when optimized with respect to the codeword type, gives the exponent of Gallager's [46] "expurgated bound" (cf. [30, pp. 185, 193]). In [32] the problem of separately bounding the erasure and undetected error probabilities was also addressed; a decoder $\varphi \colon Y^n \to \{0, 1, \dots, N\}$ yields an erasure if $\varphi(\mathbf{y}) = 0$, while an undetected error occurs if $\varphi(\mathbf{y})$ equals a message index but not the correct one. Using a (still channel-independent) modification of the decoder (IV.2), jointly attainable exponents for erasure and undetected error probabilities were obtained (cf. [30, pp. 174–177]). Csiszár and Körner [28] derived exponential error bounds attainable with (a codeword set as in Lemma IV.1 and) decoders defined similarly to (IV.2) but with a general function $\alpha$ of the joint type instead of $I(\mathbf{x}_i \wedge \mathbf{y})$, for an arbitrary function $\alpha$ on $\mathcal{P}(X \times Y)$, possibly channel-dependent. Recently Telatar and Gallager [77] used channel-dependent decoders of a somewhat more general kind to derive jointly attainable exponents for erasure and undetected error probabilities improving upon those in [32].

A particular class of $\alpha$-decoders received considerable attention recently. They are the $d$-decoders defined by minimizing a "distance"

$$d_n(\mathbf{x}, \mathbf{y}) = \frac1n\sum_{t=1}^n d(x_t, y_t), \qquad d \text{ a given function on } X \times Y \qquad \text{(IV.19)}$$

and setting $\varphi(\mathbf{y}) = i$ if $d_n(\mathbf{x}_i, \mathbf{y}) < d_n(\mathbf{x}_j, \mathbf{y})$ for all $j \ne i$, and $\varphi(\mathbf{y}) = 0$ if no such $i$ exists. Here the term "distance" is used in the widest sense; no restriction on $d$ is implied. In this context, even the capacity problem is open, in general. The $d$-capacity of a DMC $W$ is the supremum of rates of codes with a given $d$-decoder that yield arbitrarily small probability of error. In the special case when $d(x, y) = 0$ or $1$ according as $W(y|x) > 0$ or $W(y|x) = 0$, $d$-capacity provides the "zero undetected error" capacity or "erasures-only" capacity. Shannon's zero-error capacity can also be regarded as a special case of $d$-capacity, and so can the graph-theoretic concept of Sperner capacity, cf. [35]. A lower bound to $d$-capacity follows as a special case of a result in [28]; this bound was obtained also by Hui [55]. Balakirsky [12] proved by delicate "type" arguments the tightness of that bound for channels with binary input alphabet. Csiszár and Narayan [35] showed that the mentioned bound is not tight in general but its positivity is necessary for positive $d$-capacity. Lapidoth [59] showed that $d$-capacity can equal the channel capacity even if the above lower bound is strictly smaller. Other recent works addressing the problem of $d$-capacity or its special case of zero undetected error capacity include Merhav, Kaplan, Lapidoth, and Shamai [67], Ahlswede, Cai, and Zhang [9], as well as Telatar and Gallager [77].

V. CAPACITY OF ARBITRARILY VARYING CHANNELS

AVC's were introduced by Blackwell, Breiman, and Thomasian [15] to model communication situations where the channel statistics ("state") may vary in an unknown and arbitrary manner during the transmission of a codeword, perhaps caused by jamming. Formally, an AVC with (finite) input alphabet $X$, output alphabet $Y$, and set $S$ of possible states is defined by the probabilities $W(y|x, s)$ of receiving $y \in Y$ when $x \in X$ is sent and $s \in S$ is the state. The corresponding probabilities for $n$-length sequences are

$$W^n(\mathbf{y}|\mathbf{x}, \mathbf{s}) = \prod_{t=1}^n W(y_t|x_t, s_t). \qquad \text{(V.1)}$$

The capacity problem for AVC's has many variants according to sender's and receiver's knowledge about the states, the state selector's knowledge about the codeword, degree of randomization in encoding and decoding, etc. Here we concentrate on the situation when no information is available to the sender and receiver about the states, nor to the state selector about the codeword sent, and only deterministic codes are permissible.

For a code with codeword set $\{\mathbf{x}_1, \dots, \mathbf{x}_N\} \subset X^n$ and decoder $\varphi \colon Y^n \to \{0, 1, \dots, N\}$, the maximum and average probability of error are defined as

$$e_{\max} = \max_{\mathbf{s} \in S^n}\max_{1 \le i \le N} e_i(\mathbf{s}), \qquad \bar e = \max_{\mathbf{s} \in S^n}\frac1N\sum_{i=1}^N e_i(\mathbf{s}) \qquad \text{(V.2)}$$

where

$$e_i(\mathbf{s}) = W^n(\{\mathbf{y} \colon \varphi(\mathbf{y}) \ne i\}\,|\,\mathbf{x}_i, \mathbf{s}). \qquad \text{(V.3)}$$

The supremum of rates at which transmission with arbitrarily small maximum or average probability of error is possible is called the $m$-capacity $C_m$ or $a$-capacity $C_a$, respectively. Unlike for a DMC, $C_m < C_a$ is possible, and $C_a$ may be zero when the "random coding capacity" $C_r$ is positive. Here $C_r$ is the supremum of rates for which ensembles of codes exist such that the expectation of $\bar e$ over the ensemble is arbitrarily small for $n$ large and all $\mathbf{s}$. Already Blackwell, Breiman, and Thomasian [15] showed that

$$C_r = \max_P \min_{\xi \in \mathcal{P}(S)} I(P, W_\xi), \qquad \text{where } W_\xi(y|x) = \sum_{s \in S}\xi(s)W(y|x, s) \qquad \text{(V.4)}$$

and gave an example where $C_a < C_r$.

Below we review the presently available best result about $m$-capacity (Theorem V.1; Csiszár and Körner [29]), and the single-letter characterization of $a$-capacity (Theorem V.2; Csiszár and Narayan [33]). Both follow the pattern of Theorem IV.1: for a "good" codeword set and "good" decoder, the error probability is bounded via the method of types. A remarkable feature is that the very error-bounding process naturally suggests a good decoder.

Given an AVC defined by $W = \{W(\cdot|\cdot, s), s \in S\}$ as above, for input symbols $x$ and $x'$ we write $x \approx x'$ if there exist PD's $\sigma$ and $\sigma'$ on $S$ such that

$$\sum_{s \in S}\sigma(s)W(y|x, s) = \sum_{s \in S}\sigma'(s)W(y|x', s) \qquad \text{for all } y. \qquad \text{(V.5)}$$

The AVC is symmetrizable if there exists a channel $U \colon X \to S$ such that

$$\sum_{s \in S} U(s|x')W(y|x, s) = \sum_{s \in S} U(s|x)W(y|x', s) \qquad \text{for all } x, x', y. \qquad \text{(V.6)}$$

It has long been known that $C_m > 0$ iff $x \approx x'$ does not hold for all pairs $x, x'$ [57], and that the right-hand side of (V.9) below is always an upper bound to $C_m$ [10]. Ericson [43] showed that symmetrizability implies $C_a = 0$.

Theorem V.1: For $P \in \mathcal{P}(X)$ write

$$I_{\mathcal{W}}(P) = \min_{\xi \in \mathcal{P}(S)} I(P, W_\xi) \qquad \text{(V.7)}$$

where the maximization below is taken

$$\text{subject to } P(x)P(x') = 0 \text{ whenever } x \ne x',\ x \approx x'. \qquad \text{(V.8)}$$

Then $I_{\mathcal{W}}(P)$ is an achievable rate for the maximum probability of error criterion, for each $P$ satisfying (V.8). Hence

$$C_m = \max_{P \text{ subject to (V.8)}} I_{\mathcal{W}}(P) \qquad \text{(V.9)}$$

if the maximum is attained for some $P$ with $I_{\mathcal{W}}(P) > 0$.

Theorem V.2: For a nonsymmetrizable AVC, $C_a = C_r$.

Remarks: The first strong attack at $m$-capacity was that of Dobrushin and Stambler [40]. They were first to use large deviations arguments (including Chernoff bounding for dependent RV's) as well as "method of types" calculations to show that for randomly selected codes, the probability that $e_i(\mathbf{s})$ is not small for a fixed $\mathbf{s}$ goes to zero doubly exponentially, implying the same also for the maximum over $\mathbf{s}$. Unfortunately, as the method of types was not yet developed at that time, much effort had to be spent on technicalities. This diverted the authors' attention from what later turned out to be a key issue, viz., the choice of a good decoder, causing the results to fall short of a complete solution of the $m$-capacity problem. Not much later, Ahlswede [2] found a shortcut to that problem, proving by a clever trick that $C_a = C_r$ whenever $C_a > 0$; however, a single-letter necessary and sufficient condition for $C_a > 0$ remained elusive. Remarkably, the sufficiency of nonsymmetrizability for $C_a > 0$ does not seem easier to prove than its sufficiency for $C_a = C_r$. The strongest result about $m$-capacity preceding [29] was that of Ahlswede [4] who proved (V.9) for AVC's such that $x \approx x'$ never holds when $x \ne x'$. He used large deviations arguments similar to those of Dobrushin and Stambler [40], and was first to use a sophisticated decoder (but not the one below). A full solution of the $m$-capacity problem appears a long way ahead; one of its special cases is Shannon's celebrated zero-error capacity problem.
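The symmetrizability condition (V.6) is concrete enough to test numerically for tiny alphabets. The sketch below (Python; the channel entries and the grid resolution are made-up illustrative choices) scans channels $U \colon X \to S$ on a grid and reports the smallest violation of (V.6):

```python
# Brute-force check of the symmetrizability condition (V.6) for a toy
# AVC with binary X, Y, S; the channel numbers are made up.
W = {  # W[s][x][y]
    0: ((0.9, 0.1), (0.2, 0.8)),
    1: ((0.6, 0.4), (0.3, 0.7)),
}
X, Y, S = (0, 1), (0, 1), (0, 1)

def violation(U):          # max over x, x', y of |lhs - rhs| in (V.6)
    worst = 0.0
    for x in X:
        for xp in X:
            for y in Y:
                lhs = sum(U[xp][s] * W[s][x][y] for s in S)
                rhs = sum(U[x][s] * W[s][xp][y] for s in S)
                worst = max(worst, abs(lhs - rhs))
    return worst

best, best_U, G = float("inf"), None, 200
for i in range(G + 1):
    for j in range(G + 1):
        U = ((i / G, 1 - i / G), (j / G, 1 - j / G))  # U[x] is a PD on S
        v = violation(U)
        if v < best:
            best, best_U = v, U
print(f"smallest violation of (V.6): {best:.4f} at U = {best_U}")
print("symmetrizable" if best < 1e-9 else "apparently nonsymmetrizable")
```

For this particular toy channel the violation stays bounded away from zero, so by Theorem V.2 its $a$-capacity equals $C_r$ of (V.4).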
Proof of Theorems V.1 and V.2: First one needs an analog of Lemma IV.1 about the existence of "good" codeword sets. This is the following.

Lemma V.1: For any $\varepsilon > 0$, $R > 0$, and sufficiently large $n$, for each type $P \in \mathcal{P}_n(X)$ with $\min_{a \colon P(a) > 0} P(a) \ge \varepsilon$ there exist $N \ge \exp\{n(R - \varepsilon)\}$ sequences $\mathbf{x}_1, \dots, \mathbf{x}_N$ in $T^n_P$ such that, for all joint types and all $\mathbf{s} \in S^n$ and indices $i$, the number of codewords $\mathbf{x}_j$ falling into a given conditional type class given $\mathbf{s}$, respectively given $(\mathbf{x}_i, \mathbf{s})$, obeys counting bounds governed by the corresponding empirical mutual informations (V.10), (V.11); in particular, the fraction of such codewords is exponentially small, for some uniform exponent (V.12), whenever the relevant mutual information exceeds $R$ (V.13).

This lemma can be proved by random selection, although more refined arguments are needed to get (V.10)–(V.12) than those in the proof of Lemma IV.1. One has to show that only with probability going to zero faster than exponentially will the randomly selected codewords violate (V.10)–(V.12), for any fixed joint type, $\mathbf{s}$, and $i$ in (V.10). This can be done by Chernoff bounding. One difficulty is that in case of (V.12) dependent RV's have to be dealt with; this can be overcome using an idea of Dobrushin and Stambler [40]. For details, cf. [29] and [33].

We will use a codeword set as in Lemma V.1, with a decoder whose exact form will be suggested by the very error-bounding process.

Denote by $\mathcal{D}_\eta$ and $\mathcal{D}^*_\eta$ the family of those joint distributions $P_{XSY}$ that satisfy

$$P_X = P, \qquad D(P_{Y|XS}\|W|P_{XS}) \le \eta \qquad \text{(V.14)}$$

respectively, the analogous family adapted to the average probability of error criterion (V.15). Notice that the joint type of $(\mathbf{x}_i, \mathbf{s}, \mathbf{y})$ actually generated by the channel belongs to the relevant family with probability close to one (V.16).

A codeword $\mathbf{x}_i$ and a received sequence $\mathbf{y}$ may be considered "jointly typical" if there exists $\mathbf{s} \in S^n$ such that the joint type of $(\mathbf{x}_i, \mathbf{s}, \mathbf{y})$ belongs to $\mathcal{D}_\eta$ or $\mathcal{D}^*_\eta$. The contribution to the maximum or average probability of error of output sequences not jointly typical in this sense with the codeword sent is negligible. Indeed, this follows by Lemmas II.1 and II.3 (V.17) and, using also (V.16) and (V.11), the analogous bound holds under the average error criterion (V.18).

Motivated by (V.17) and (V.18) we will consider

$$\{i \colon \text{the joint type of } (\mathbf{x}_i, \mathbf{s}, \mathbf{y}) \text{ belongs to } \mathcal{D} \text{ for some } \mathbf{s} \in S^n\} \qquad \text{(V.19)}$$

as the list of candidates for the decoder output $\varphi(\mathbf{y})$, where $\mathcal{D}$ denotes $\mathcal{D}_\eta$ or $\mathcal{D}^*_\eta$ according as the maximum or average probability of error criterion is being considered.

Dobrushin and Stambler [40] used, effectively, a decoder whose output was $i$ if $i$ was the only candidate in the above sense, while otherwise an error was declared. This "joint typicality decoder" has been shown suitable to achieve the $m$-capacity for some but not all AVC's. To obtain a more powerful decoder, when the list (V.19) contains several messages one may reject some of them by a suitable rule. If only one $i$ remains unrejected, that will be the decoder output. We will consider rejection rules corresponding to sets $\mathcal{R}$ of joint distributions $P_{XX'SY}$ as follows: a candidate $i$ is rejected if for every $\mathbf{s}$ with the joint type of $(\mathbf{x}_i, \mathbf{s}, \mathbf{y})$ in $\mathcal{D}$ there exists $j \ne i$ such that the joint type of $(\mathbf{x}_i, \mathbf{x}_j, \mathbf{s}, \mathbf{y})$ belongs to $\mathcal{R}$. To reflect that $j$ is itself a candidate and that the channel acted consistently (as above), we assume that $\mathcal{R}$ consists of such joint distributions whose marginals $P_{XSY}$ and $P_{X'Y}$ satisfy

$$P_{XSY} \in \mathcal{D}, \qquad P_{X'Y} \text{ is the marginal of some } P_{X'S'Y} \in \mathcal{D}. \qquad \text{(V.20)}$$

A set $\mathcal{R}$ of joint distributions with the properties (V.20) will be called permissible if for each $\mathbf{y}$, the rejection rule corresponding to $\mathcal{R}$ leaves at most one $i$ unrejected. Then $\varphi(\mathbf{y})$ is set equal to the unrejected $i$, and to $0$ if no such $i$ exists. For such a decoder $\varphi$, preliminarily with an unspecified permissible $\mathcal{R}$, the maximum or average probability of error can be bounded via standard "method of types" technique. The result will suggest a natural choice of $\mathcal{R}$ that makes the bound exponentially small under the hypotheses of Theorems V.1 and V.2. The calculation is technical but instructive; it will be given in the Appendix.

Further Developments

By the above approach, Csiszár and Narayan [33] also determined the $a$-capacity for AVC's with state constraint $\Lambda$. The latter means that only those $\mathbf{s} \in S^n$ are permissible as state sequences that satisfy $\frac1n\sum_{t=1}^n \ell(s_t) \le \Lambda$ for a given cost function $\ell$ on $S$. For this case, Ahlswede's [2] trick does not work and, indeed, $0 < C_a < C_r$ may happen. Specializing the result to the (noiseless) binary adder AVC ($y = x + s$ for binary $x$ and $s$) subject to $\frac1n\sum_{t=1}^n s_t \le \Lambda$, the intuitive but nontrivial result was obtained that for $\Lambda < \frac12$, the $a$-capacity equals the capacity of the binary symmetric DMC with crossover probability $\Lambda$. Notice that the corresponding $m$-capacity is the maximum rate of codes admitting correction of each pattern of no more than $n\Lambda$ bit errors, whose determination is a central open problem of coding theory. The role of the decoding rule was further studied, and $m$-capacity for some simple cases was explicitly calculated in Csiszár and Narayan [34].

While symmetrizable AVC's have $C_a = 0$, if the decoder output is not required to be a single message index but a list of $L$ candidates, the resulting "list code capacity" may be nonzero already for small $L$. Pinsker conjectured in 1990 that the "list-size-$L$" $a$-capacity is always equal to $C_r$ if $L$ is sufficiently large. Contributions to this problem, establishing Pinsker's conjecture, include Ahlswede and Cai [7], Blinovsky, Narayan, and Pinsker [17], and Hughes [54]. The last paper follows and extends the approach in this section. For AVC's with $C_r > 0$, a number called the symmetrizability is determined, and the list-size-$L$ $a$-capacity is shown to be $0$ for list sizes not exceeding the symmetrizability and equal to $C_r$ for larger ones. A partial analog of this result for list-size-$L$ $m$-capacity is that the limit of the latter as $L \to \infty$ is always given by (V.9), cf. Ahlswede [6].

The approach in this section was extended to multiple-access AVC's by Gubner [50], although the full analog of Theorem V.2 was only conjectured by him. This conjecture was recently proved by Ahlswede and Cai [8]. Previously, Jahn [56] determined the $a$-capacity region of a multiple-access AVC under the condition that it had nonvoid interior, and showed that then the $a$-capacity region coincided with the random coding capacity region.

VI. OTHER TYPICAL APPLICATIONS

A. Rate-Distortion Theory

The usefulness for rate-distortion theory of partitioning $n$-length sequences into type classes was first demonstrated by Berger [13]. He established the following lemma, called the type covering lemma in [30].

Lemma VI.1: Given finite sets $X$, $Y$ and a nonnegative function $d$ on $X \times Y$, for $P \in \mathcal{P}_n(X)$ and $\Delta \ge 0$ let $N(P, \Delta)$ denote the minimum number of "$d$-balls of radius $\Delta$"

$$B(\mathbf{y}, \Delta) = \{\mathbf{x} \colon d_n(\mathbf{x}, \mathbf{y}) \le \Delta\}, \qquad \mathbf{y} \in Y^n \qquad \text{(VI.1)}$$

needed to cover the type class $T^n_P$, where $d_n$ is defined by (IV.19). Then

$$\frac1n\log N(P, \Delta) \to R(P, \Delta) \qquad \text{(VI.2)}$$

where

$$R(P, \Delta) = \min_{P_{XY} \colon P_X = P,\ Ed(X, Y) \le \Delta} I(X \wedge Y) \qquad \text{(VI.3)}$$

is the "rate-distortion function."

This lemma is seen today as a consequence of a simple general result about coverings known as the Johnson–Stein–Lovász theorem, cf. Cohen et al. [21, p. 322]. The latter is useful in information theory also in other contexts, cf. Ahlswede [3].

An immediate consequence of Lemma VI.1 is that the minimum (asymptotic) rate of source block codes admitting the reproduction of each $\mathbf{x} \in X^n$ by some $\mathbf{y} \in Y^n$ with distortion $d_n(\mathbf{x}, \mathbf{y}) \le \Delta$ is equal to $\max_P R(P, \Delta)$. Berger [13] also used this lemma to derive the rate-distortion theorem for arbitrarily varying sources.

As another application of Lemma VI.1, Marton [64] determined the error exponent for the compression of memoryless sources with a fidelity criterion. In fact, her error exponent is attainable universally (with codes depending on the "distortion measure" $d$ but not on the source distribution $Q$ or the distortion threshold $\Delta$). Thus the following extension of Theorem III.1 holds, cf. [30, p. 156]: Given $R > 0$, there exist codeword sets $B_n \subset Y^n$ such that

$$|B_n| \mathrel{\dot\le} \exp\{nR\} \qquad \text{(VI.4)}$$

and for every PD $Q$ and every $\Delta \ge 0$

$$Q^n\big(\{\mathbf{x} \colon \min_{\mathbf{y} \in B_n} d_n(\mathbf{x}, \mathbf{y}) > \Delta\}\big) \mathrel{\dot\le} \exp\{-n\,e(R, \Delta, Q)\} \qquad \text{(VI.5)}$$

where

$$e(R, \Delta, Q) = \min_{P \colon R(P, \Delta) > R} D(P\|Q). \qquad \text{(VI.6)}$$

Moreover, for any sequence of sets $B_n$ satisfying (VI.4), the error probability on the left-hand side of (VI.5) cannot decrease with an exponent better than $e(R, \Delta, Q)$.

Remarkably, the exponent (VI.6) is not necessarily a continuous function of $R$, cf. Ahlswede [5], although as Marton [64] showed, it is continuous when $d$ is the normalized Hamming distance (cf. also [30, p. 158]).

Recently, several papers have been devoted to the redundancy problem in rate-distortion theory, such as Yu and Speed [82], Linder, Lugosi, and Zeger [61], Merhav [65], Zhang, Yang, and Wei [83]. One version of the problem concerns the "rate redundancy" of $\Delta$-semifaithful codes. A $\Delta$-semifaithful code is defined by a mapping $f_n \colon X^n \to Y^n$ such that $d_n(\mathbf{x}, f_n(\mathbf{x})) \le \Delta$ for all $\mathbf{x}$, together with an assignment to each $\mathbf{y}$ in the range of $f_n$ of a binary codeword of length $\ell(\mathbf{y})$, subject to the prefix condition. Yu and Speed [82] showed that for a memoryless source with generic distribution $Q$ there exist $\Delta$-semifaithful codes whose rate redundancy

$$\frac1n E\,\ell(f_n(\mathbf{X})) - R(Q, \Delta) \qquad \text{(VI.7)}$$

is less than a constant times $n^{-1}\log n$; moreover, such codes may be given universally (not depending on $Q$). They also conjectured that the rate redundancy (VI.7) can never be less than a constant times $n^{-1}\log n$, under a technical condition that excludes the case $\Delta = 0$. Zhang, Yang, and Wei [83] proved that conjecture, and also determined the exact asymptotics of the "distortion redundancy" of the best rate-$R$ block codes, again under some technical condition. This result of [83] says that for a memoryless source with generic distribution $Q$, the minimum of

$$E\,d_n(\mathbf{X}, f_n(\mathbf{X})) - \Delta(R, Q) \qquad \text{(VI.8)}$$

for mappings $f_n \colon X^n \to Y^n$ with $\frac1n\log|\mathrm{range}(f_n)| \le R$ is asymptotically equal to a constant times $n^{-1}\log n$, with an explicitly given constant. Here $\Delta(R, Q)$ is the inverse function of $R(Q, \Delta)$ defined by (VI.3), with $Q$ fixed. Both papers [82] and [83] heavily rely on the method of types. The latter represents one of the very few cases where the delicate calculations require more exact bounds on the cardinality and probability of a type class than the crude ones in Lemma II.2.

B. Source-Channel Error Exponent

When a memoryless source with alphabet $U$ and generic distribution $Q$ is transmitted over a DMC $W \colon X \to Y$ using a source-channel block code with encoder $f \colon U^n \to X^n$ and decoder $\varphi \colon Y^n \to U^n$, the probability of error is

$$e_n = \sum_{\mathbf{u} \in U^n} Q^n(\mathbf{u})\,W^n(\{\mathbf{y} \colon \varphi(\mathbf{y}) \ne \mathbf{u}\}\,|\,f(\mathbf{u})). \qquad \text{(VI.9)}$$

Using techniques as in the proof of Theorem IV.1, Csiszár [22] showed that by suitable source-channel codes of blocklength $n$, not depending on $Q$, the error probability (VI.9) can be made exponentially small whenever $H(Q) < C$, with exponent

$$\min_{R > 0}\,[e(R, Q) + E_r(R, W)]$$

(cf. (III.3) and the remarks to Theorems IV.1 and IV.2 for notation). This exponent is best possible if the minimum is attained for some $R \ge R_{cr}(W)$. For further results in this direction, including source-channel transmission with a distortion threshold, cf. Csiszár [24].

C. Multiterminal Source Coding

Historically, the first multiuser problem studied via the method of types was that of the error exponent for the Slepian–Wolf [76] problem, i.e., separate coding of (memoryless) correlated sources. Given a source pair with generic variables $(X, Y)$, the error probability of an $n$-length block code with separate encoders $f_n$, $g_n$ and common decoder $\varphi_n$ is

$$e_n = \Pr\{\varphi_n(f_n(\mathbf{X}), g_n(\mathbf{Y})) \ne (\mathbf{X}, \mathbf{Y})\} \qquad \text{(VI.10)}$$

where $(\mathbf{X}, \mathbf{Y})$ represents $n$ independent repetitions of $(X, Y)$. Csiszár, Körner, and Marton proved in 1977 (published in [27], cf. also [30, pp. 264–266]) that for suitable codes as above, with encoders that map $X^n$ and $Y^n$ into codeword sets of sizes

$$\exp\{nR_1\}, \qquad \exp\{nR_2\} \qquad \text{(VI.11)}$$

the error probability (VI.10) goes to zero exponentially as $n \to \infty$, with exponent

$$\min_{P_{\tilde X\tilde Y}}\Big[D(P_{\tilde X\tilde Y}\|P_{XY}) + \big|\min\big(R_1 - H(\tilde X|\tilde Y),\ R_2 - H(\tilde Y|\tilde X),\ R_1 + R_2 - H(\tilde X\tilde Y)\big)\big|^+\Big] \qquad \text{(VI.12)}$$

whenever $(R_1, R_2)$ is in the interior of the achievable rate region [76]

$$\{(R_1, R_2) \colon R_1 \ge H(X|Y),\ R_2 \ge H(Y|X),\ R_1 + R_2 \ge H(XY)\}. \qquad \text{(VI.13)}$$

This assertion can be proved letting $\varphi_n$ be the "minimum entropy decoder" that outputs for any pair of codewords that pair $(\mathbf{x}, \mathbf{y})$ whose nonprobabilistic entropy $H(\mathbf{x}, \mathbf{y})$ is minimum among those having the given codewords (ties may be broken arbitrarily). Using this $\varphi_n$, the incorrectly decoded pairs belong to one of the following three sets:

$$\{(\mathbf{x}, \mathbf{y}) \colon f_n(\mathbf{x}') = f_n(\mathbf{x}) \text{ for some } \mathbf{x}' \ne \mathbf{x} \text{ with } H(\mathbf{x}', \mathbf{y}) \le H(\mathbf{x}, \mathbf{y})\}$$

$$\{(\mathbf{x}, \mathbf{y}) \colon g_n(\mathbf{y}') = g_n(\mathbf{y}) \text{ for some } \mathbf{y}' \ne \mathbf{y} \text{ with } H(\mathbf{x}, \mathbf{y}') \le H(\mathbf{x}, \mathbf{y})\}$$

$$\{(\mathbf{x}, \mathbf{y}) \colon f_n(\mathbf{x}') = f_n(\mathbf{x}),\ g_n(\mathbf{y}') = g_n(\mathbf{y}) \text{ for some } \mathbf{x}' \ne \mathbf{x},\ \mathbf{y}' \ne \mathbf{y} \text{ with } H(\mathbf{x}', \mathbf{y}') \le H(\mathbf{x}, \mathbf{y})\}.$$

It can be seen by random selection that there exist $f_n$ and $g_n$ satisfying (VI.11) such that for each joint type, the fraction of each type class contained in any of these sets is exponentially small with the required exponent. Hence the assertion follows by Lemmas II.1 and II.2.

The error exponent (VI.12) for the Slepian–Wolf problem is attainable universally, i.e., with codes not depending on the distribution of $(X, Y)$. This result is a counterpart of Theorem IV.1 for DMC's. The counterpart of Theorem IV.2 was also established by Csiszár, Körner, and Marton, loc. cit.: For no source pair can the error probability of codes satisfying (VI.11) decrease faster than with exponent

$$\min_{P_{\tilde X\tilde Y} \colon (R_1, R_2) \notin \mathcal{R}(P_{\tilde X\tilde Y})} D(P_{\tilde X\tilde Y}\|P_{XY}) \qquad \text{(VI.14)}$$

where $\mathcal{R}(P_{\tilde X\tilde Y})$ denotes the region (VI.13) computed for the dummy joint distribution $P_{\tilde X\tilde Y}$. The functions in (VI.12) and (VI.14) are equal if $(R_1, R_2)$ is close to the boundary of (VI.13). For such rate pairs, their common value gives the exact error exponent.

Remark: For the intrinsic relationship of source and channel problems, cf., e.g., Csiszár and Körner [30, Secs. III.1 and III.2], Csiszár and Körner [28], and Ahlswede [3, Part II].
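The random binning plus minimum entropy decoding scheme is easy to exercise on a toy scale. The sketch below (Python; the blocklength, rates, and the correlated source are toy choices, deliberately near the Slepian–Wolf boundary, so some residual errors are expected at this tiny $n$) assigns random bin indices and decodes by minimizing the empirical joint entropy among consistent pairs:

```python
# Toy sketch of universal Slepian-Wolf coding with random binning and
# the minimum entropy decoder; all constants are illustrative only.
import random
from math import log, e
from collections import Counter
from itertools import product

def joint_H(x, y):               # empirical joint entropy H(x, y)
    n = len(x)
    return -sum(c / n * log(c / n) for c in Counter(zip(x, y)).values())

random.seed(2)
n, Rx, Ry = 12, 0.55, 0.55       # rates in nats/symbol
Mx, My = int(round(e ** (n * Rx))), int(round(e ** (n * Ry)))
binx = {x: random.randrange(Mx) for x in product((0, 1), repeat=n)}
biny = {y: random.randrange(My) for y in product((0, 1), repeat=n)}

def source_pair():               # X ~ Bern(1/2), Y = X xor Bern(0.1)
    x = tuple(random.randrange(2) for _ in range(n))
    y = tuple(b ^ (random.random() < 0.1) for b in x)
    return x, y

errs, trials = 0, 30
for _ in range(trials):
    x, y = source_pair()
    i, j = binx[x], biny[y]
    cands = [(joint_H(u, v), u, v)
             for u in product((0, 1), repeat=n) if binx[u] == i
             for v in product((0, 1), repeat=n) if biny[v] == j]
    errs += min(cands)[1:] != (x, y)
print(f"error rate: {errs / trials:.2f}")
```

The decoder uses no knowledge of the source statistics, only the empirical joint entropy, which is the universality at the heart of (VI.12).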

The above results have been extended in various ways. Extensions to more than two correlated sources are straightforward, cf. [27], or [30, pp. 267, 268]. Csiszár and Körner [28] showed that good encoders can be obtained, instead of random selection, also by a graph-theoretic approach. Another contribution of [28], in retrospect the more important one, was to apply the method of types to study the performance of various decoders, and to improve the exponent (VI.12) for "large" rates. Csiszár [23] showed that the exponent (VI.12) is (universally) attainable also with linear codes, i.e., constraining $f_n$ and $g_n$ to be linear maps (to this, $X$ and $Y$ have to be fields, but that can always be assumed, extending the alphabets by dummy symbols of zero probability if necessary). Also in [23], linear codes were shown to give better than the previously known best exponent for certain rate pairs. More recently, Oohama and Han [70] obtained another improvement for certain rate pairs, and Oohama [69] determined the exact exponent for a modified version of the problem. That modification admits partial cooperation of the encoders, which, however, does not affect the achievable rate region (VI.13) nor the upper bound (VI.14) to achievable error exponents. On the other hand, the modification makes the exponent in (VI.14) achievable for all rate pairs in the interior of (VI.13), even universally.

D. Multiterminal Channels

The first application of the method of types to a multiterminal channel coding problem was the paper of Körner and Sgarro [58]. Using the same idea as in Theorem IV.1, they derived an error exponent for the asymmetric broadcast channel, cf. [30, p. 359] for the definition of this channel.

Here let us concentrate on the multiple-access channel (MAC). A MAC with input alphabets $X$, $Y$, and output alphabet $Z$ is formally a DMC $W \colon X \times Y \to Z$, with the understanding that there are two (noncommunicating) senders, one selecting the $X$-component the other the $Y$-component of the input. Thus codes with two codeword sets $\{\mathbf{x}_1, \dots, \mathbf{x}_N\} \subset X^n$ and $\{\mathbf{y}_1, \dots, \mathbf{y}_M\} \subset Y^n$ are considered, the decoder assigns a pair of message indices to each $\mathbf{z} \in Z^n$, and the average probability of error is

$$\bar e = \frac{1}{NM}\sum_{i=1}^N\sum_{j=1}^M W^n(\{\mathbf{z} \colon \varphi(\mathbf{z}) \ne (i, j)\}\,|\,\mathbf{x}_i, \mathbf{y}_j). \qquad \text{(VI.15)}$$

The capacity region, i.e., the closure of the set of those rate pairs $(R_1, R_2)$ to which codes with $\frac1n\log N \to R_1$, $\frac1n\log M \to R_2$, and $\bar e \to 0$ exist, is characterized as follows: $(R_1, R_2)$ belongs to the capacity region iff there exist RV's $Q, X, Y, Z$, with $Q$ taking values in an auxiliary set of bounded size, whose joint distribution is of form

$$P_{QXYZ}(q, x, y, z) = P_Q(q)P_{X|Q}(x|q)P_{Y|Q}(y|q)W(z|x, y) \qquad \text{(VI.16)}$$

and such that

$$R_1 \le I(X \wedge Z|YQ), \qquad R_2 \le I(Y \wedge Z|XQ), \qquad R_1 + R_2 \le I(XY \wedge Z|Q). \qquad \text{(VI.17)}$$

The capacity region was first determined (in a different algebraic form) by Ahlswede [1] and Liao [60]. The maximum probability of error criterion may give a smaller region, cf. Dueck [41]; a single-letter characterization of the latter is not known.

For $(R_1, R_2)$ in the interior of the capacity region, $\bar e$ can be made exponentially small; Gallager [47] gave an attainable exponent everywhere positive in the interior. Pokorny and Wallmeier [71] were first to apply the method of types to this problem. They showed the existence of (universal) codes with codewords $\mathbf{x}_i$ and $\mathbf{y}_j$ whose joint types with a fixed sequence $\mathbf{q}$ are arbitrarily given, and such that the average probability of error is bounded above exponentially, with exponent depending on $R_1$, $R_2$, and the codeword types; that exponent is positive for each $(R_1, R_2)$ with the property that for $P_{QXYZ}$ determined by (VI.16) with the given $P_Q$, $P_{X|Q}$, $P_{Y|Q}$, (VI.17) is satisfied with strict inequalities. Pokorny and Wallmeier [71] used the proof technique of Theorem IV.1 with a decoder maximizing the empirical mutual information $I(\mathbf{x}_i\mathbf{y}_j \wedge \mathbf{z})$. Recently, Liu and Hughes [62] improved upon the exponent of [71], using a similar technique but with a decoder minimizing the empirical conditional entropy $H(\mathbf{x}_i\mathbf{y}_j|\mathbf{z})$. The "maximum mutual information" and "minimum conditional entropy" decoding rules are equivalent for DMC's with codewords of the same type but not in the MAC context; by the result of [62], "minimum conditional entropy" appears the better one.

VII. EXTENSIONS

While the type concept is originally tailored to memoryless models, it has extensions suitable for more complex models, as well. So far, such extensions proved useful mainly in the context of source coding and hypothesis testing.

Abstractly, given any family of source models with alphabet $X$, a partition of $X^n$ into sets $T^{(n)}_1, \dots, T^{(n)}_{m_n}$ can be regarded as a partition into "type classes" if sequences in the same $T^{(n)}_i$ are equiprobable under each model in the family. Of course, a subexponential growth rate of $m_n$ is desirable. This general concept can be applied, e.g., to variable-length universal source coding: assign to each $\mathbf{x} \in T^{(n)}_i$ a binary codeword of length $\lceil\log m_n\rceil + \lceil\log|T^{(n)}_i|\rceil$, the first bits specifying the class index $i$, the last bits identifying $\mathbf{x}$ within $T^{(n)}_i$. Clearly, this codelength will exceed the "ideal codelength" $-\log P(\mathbf{x})$ by less than $\log m_n$ plus a constant, for each source model in the family.

As an example, consider the model family of renewal processes, i.e., binary sources such that the lengths of runs of $0$'s preceding the $1$'s are i.i.d. RV's. Define the renewal type of a sequence $\mathbf{x} \in \{0,1\}^n$ as $(K_0(\mathbf{x}), K_1(\mathbf{x}), \dots)$ where $K_j(\mathbf{x})$ denotes the number of $1$'s in $\mathbf{x}$ which are preceded by exactly $j$ consecutive $0$'s. Sequences of the same renewal type are equiprobable under each model in the family, and renewal types can be used for the model family much in the same way as standard types for memoryless sources. Csiszár and Shields [37] showed that there are $\exp\{O(\sqrt n)\}$ renewal type classes, which implies the possibility of universal coding with redundancy $O(n^{-1/2})$ for the family of renewal processes. It was also shown in [37] that the $O(n^{-1/2})$ redundancy bound is best possible for this family, as opposed to finitely parametrizable model families for which the best attainable redundancy is typically $O(n^{-1}\log n)$.

Below we briefly discuss some more direct extensions of the standard type concept.

A. Second- and Higher Order Types

The type concept appropriate for Markov chains is "second-order type," defined for a sequence $\mathbf{x} = (x_1, \dots, x_n) \in X^n$ as the PD $P^{(2)}_{\mathbf{x}} \in \mathcal{P}(X \times X)$ with

$$P^{(2)}_{\mathbf{x}}(a, b) = \frac{1}{n-1}\big|\{t \colon x_t = a,\ x_{t+1} = b,\ 1 \le t \le n-1\}\big|. \qquad \text{(VII.1)}$$

In other words, $P^{(2)}_{\mathbf{x}}$ is the joint type of $(x_1, \dots, x_{n-1})$ and $(x_2, \dots, x_n)$. Denote by $\mathcal{P}^{(2)}_n(X)$ the set of all possible second-order types of sequences $\mathbf{x} \in X^n$, and for dummy RV's $\tilde X, \tilde X'$ representing such a second-order type (i.e., $P_{\tilde X\tilde X'} = P^{(2)}_{\mathbf{x}}$ for some $\mathbf{x}$) let $T^n_{\tilde X\tilde X'}$ denote the type class $\{\mathbf{x} \colon P^{(2)}_{\mathbf{x}} = P_{\tilde X\tilde X'}\}$.

The analog of (II.1) for a Markov chain $X_1, X_2, \dots$ with stationary transition probabilities given by a matrix $M = \{M(b|a)\}$ is that if $\mathbf{x} \in T^n_{\tilde X\tilde X'}$ then (with $n' = n - 1$)

$$\Pr\{X_1 = x_1, \dots, X_n = x_n\} = \Pr\{X_1 = x_1\}\exp\{-n'[D(P_{\tilde X'|\tilde X}\|M|P_{\tilde X}) + H(\tilde X'|\tilde X)]\}. \qquad \text{(VII.2)}$$

The analog of Lemma II.2 is that

$$|T^n_{\tilde X\tilde X'}| \doteq \exp\{n'H(\tilde X'|\tilde X)\} \qquad \text{(VII.3)}$$

$$\Pr\{X_1 \cdots X_n \in T^n_{\tilde X\tilde X'}\} \doteq \exp\{-n'D(P_{\tilde X'|\tilde X}\|M|P_{\tilde X})\}. \qquad \text{(VII.4)}$$

Of course, (VII.4) is a consequence of (VII.2) and (VII.3). The simple idea in the proof of Lemma II.2 suffices only for the $\mathrel{\dot\le}$ part of (VII.3); the $\mathrel{\dot\ge}$ part is more delicate. One way to get it (Boza [19]) is via the exact formula for $|T^n_{\tilde X\tilde X'}|$ due to Whittle [80], an elementary proof of which has been given by Billingsley [14]. An important property of second-order types is that they have (equal or) asymptotically equal marginals as $n \to \infty$. Indeed, the marginals of $P^{(2)}_{\mathbf{x}}$ differ only at $x_1$ and $x_n$, if $x_1 \ne x_n$, both differences being at most $1/(n-1)$. Moreover, denoting by $\mathcal{P}^{(2)}(X)$ the set of those $\tilde P \in \mathcal{P}(X \times X)$ whose two marginals are equal, each irreducible $\tilde P \in \mathcal{P}^{(2)}(X)$ can be arbitrarily approximated by second-order types if $n$ is sufficiently large ($\tilde P$ is called irreducible if the stationary Markov chain with two-dimensional distribution $\tilde P$ is irreducible).

The above facts permit extensions of the results in Section III to Markov chains, cf. Boza [19], Davisson, Longo, and Sgarro [38], Natarajan [68], Csiszár, Cover, and Choi [26]. In particular, the following analog of Theorem III.3 and its Corollary holds, cf. [26].

Theorem VII.1: Given a Markov chain $X_1, X_2, \dots$ with transition probability matrix $M$, and a set $\Pi$ of PD's on $X \times X$, the second-order type of $X_1 \cdots X_n$ satisfies

$$\lim_{n\to\infty}\frac1n\log\Pr\{P^{(2)}_{X_1\cdots X_n} \in \Pi\} = -\min_{\tilde P \in \overline\Pi \cap \mathcal{P}^{(2)}(X)} D(P_{\tilde X'|\tilde X}\|M|P_{\tilde X}) \qquad \text{(VII.5)}$$

iff there exist second-order types $\tilde P_n \in \Pi \cap \mathcal{P}^{(2)}_n(X)$ such that $D(P_{\tilde X'|\tilde X}\|M|P_{\tilde X})$ evaluated at $\tilde P_n$ approaches the minimum in (VII.5). Further, if the minimum in (VII.5) is attained for a unique $\tilde P^*$, and $\tilde X_1, \tilde X_2, \dots$ denotes a stationary Markov chain with two-dimensional distribution $\tilde P^*$, then for any fixed $k$ we have

$$\Pr\{X_1 = x_1, \dots, X_k = x_k \mid P^{(2)}_{X_1\cdots X_n} \in \Pi\} \to \Pr\{\tilde X_1 = x_1, \dots, \tilde X_k = x_k\} \qquad \text{(VII.6)}$$

whenever (VII.5) holds for $\Pi$.

Remarks:

i) Let $\Pi^0$ denote the set of those irreducible $\tilde P \in \Pi \cap \mathcal{P}^{(2)}(X)$ for which all $\tilde P'$ in a sufficiently small neighborhood of $\tilde P$ belong to $\Pi$. The first assertion of Theorem VII.1 gives that (VII.5) always holds if the closure of $\Pi^0$ equals the closure of $\Pi$.

ii) Theorem VII.1 is of interest even if $X_1, X_2, \dots$ are i.i.d.; the limiting conditional distribution in (VII.6) is Markov rather than i.i.d. also in that case.

As an immediate extension of (VII.1), the $k$th-order type of a sequence $\mathbf{x} = (x_1, \dots, x_n)$ is defined as the PD $P^{(k)}_{\mathbf{x}} \in \mathcal{P}(X^k)$ with

$$P^{(k)}_{\mathbf{x}}(a_1, \dots, a_k) = \frac{1}{n-k+1}\big|\{t \colon (x_t, \dots, x_{t+k-1}) = (a_1, \dots, a_k)\}\big|. \qquad \text{(VII.7)}$$

This is the suitable type concept for order-$(k-1)$ Markov chains, in which conditional distributions given the past depend on the last $k - 1$ symbols. All results about Markov chains and second-order types have immediate extensions to order-$(k-1)$ Markov chains and $k$th-order types.

Since lower order Markov chains are also higher order ones, the Markov extension of the hypothesis-testing result Theorem III.2 can be applied to test the hypothesis that a process known to be Markov of some bounded order is actually Markov of a given smaller order. Performing a multiple test (for each candidate order) amounts to estimating the Markov order. A recent paper analyzing this approach to Markov order estimation is Finesso, Liu, and Narayan [45], cf. also prior works of Gutman [51] and Merhav, Gutman, and Ziv [66].

"Circular" versions of second- and higher order types are also often used, as in [38]. The $k$th-order circular type of $\mathbf{x} = (x_1, \dots, x_n)$ is the same as the $k$th-order type of the cyclically extended sequence $(x_1, \dots, x_n, x_1, \dots, x_{k-1})$, i.e., the joint type of the cyclic shifts of $\mathbf{x}$. A technical advantage of circular types is compatibility: lower order circular types are marginals of the higher order ones. The price is that expressing probabilities in terms of circular types is more awkward.
In other words, is the joint type of and . Denote by the set of all possible (VII.6) second-order types of sequences with , and whenever (VII.5) holds for . for dummy RV’s representing such a second-order type (i.e., ) let denote the type class Remarks: , , . i) Let denote the set of those irreducible The analog of (II.1) for a Markov chain with for which all in a sufficiently small stationary transition probabilities given by a matrix is that neighborhood of belong to . The first assertion of if and (with ) then Theorem VII.1 gives that (VII.5) always holds if the closure of equals . ii) Theorem VII.1 is of interest even if are i.i.d.; the limiting conditional distribution in (VII.6) is Markov rather than i.i.d. also in that case. (VII.2) As an immediate extension of (VII.1), the th-order type of a sequence is defined as the PD with where . The analog of Lemma II.2 is that for (VII.3) (VII.7) (VII.4) This is the suitable type concept for order- Markov chains, in which conditional distributions given the past de- Of course, (VII.4) is a consequence of (VII.2) and (VII.3). pend on the last symbols. All results about Markov The simple idea in the proof of Lemma II.2 suffices only chains and second-order types have immediate extensions to for the part of (VII.3), the part is more delicate. order- Markov chains and th-order types. One way to get it (Boza [19]) is via the exact formula Since order- Markov chains are also order- ones if , for due to Whittle [80], an elementary proof of the analog of the hypothesis-testing result Theorem III.2 can which has been given by Billingsley [14]. An important be applied to test the hypothesis that a process known to be property of second-order types is that they have (equal or) Markov of order is actually Markov of order for a given asymptotically equal marginals as . Indeed, for . Performing a multiple test (for each ) amounts the marginals and of to estimating the Markov order. A recent paper analyzing differ only at and , if , both differences being this approach to Markov order estimation is Finesso, Liu, and . Moreover, denoting by the set of those Narayan [45], cf. also prior works of Gutman [51] and Merhav, whose two marginals are equal, each irreducible Gutman, and Ziv [66]. can be arbitrarily approximated by second-order “Circular” versions of second- and higher order types types with if is sufficiently large are also often used as in [38]. The th-order circular type CSISZAR:´ THE METHOD OF TYPES 2519 of is the same as the th-order type of . The type concept adequate for the source model , i.e., the joint type of the sequences , . (VII.12) A technical advantage of circular types is compatibility: lower order circular types are marginals of the higher order ones. The price is that expressing probabilities in terms of circular is the -type defined as the joint type , where types is more awkward. is determined by as in (VII.12). Of course, for the corresponding -type classes we still have (VII.9) when B. Finite-State Types , and consequently also A sequence of -valued RV’s is called a (VII.13) unifilar finite-state source if there exists a (finite) set of “states,” an initial state , and a mapping Unlike for the finite-state type classes , however, a that specifies the next state as a function of the present state lower bound counterpart of (VII.13) cannot be established in and source output, such that general. 
An early appearance of this -type concept, though not of (VII.8) the term, was in Csiszar´ and Korner¨ [31], applied to DMC’s with feedback. The encoder of a feedback code of blocklength for messages is defined by mappings , where is a stochastic matrix specifying the source output , that specify the input symbols , probabilities given the states. As the state sequence , depending on the previous received symbols is uniquely determined by and the , when message is to be transmitted. Then initial state , so is the joint type . It will be called the received sequence in generated by a generalized the finite state type of , given the mapping and finite-state model as in (VII.12), with alphabet , state set the initial state , cf. Weinberger, Merhav, and Feder [79]. , and . In particular, the probability of receiving Notice that the th-order type (VII.7) of is an equals , cf. equal to the finite state type of for and (VII.9). Hence a decoder will correctly decode message defined by , with probability with . Denote the set of finite-state types , , by , and let denote the class of sequences with . Then for , (VII.8) (VII.14) gives where . Similarly to (IV.16), we have

(VII.9)

Further, for the following analogs of (VII.3) and (VII.4) hold: (VII.15)

A recent combinatorial result of Ahlswede, Yang, and Zhang [11] is also easiest to state in terms of $f$-types. Their "inherently typical subset lemma" says, effectively, that given $X$ and $\delta > 0$, there is a finite set $S$ such that for sufficiently large $n$, to any $A \subseteq X^n$ there exists a mapping $f: X^* \to S$ and an $f$-type $P$ with $\frac{1}{n} \log |A \cap T_P| \ge \frac{1}{n} \log |A| - \delta$ (VII.17). While this lemma is used in [11] to prove (the converse part of) a probabilistic result, it is claimed to also yield the asymptotic solution of the general isoperimetric problem for arbitrary finite alphabets and arbitrary distortion measures.

C. Continuous Alphabets

Extensions of the type concept to continuous alphabets are not known. Still, there are several continuous-alphabet problems whose simplest (or the only) available solution relies upon the method of types, via discrete approximations. For example, the capacity subject to a state constraint of an AVC with general alphabets and states, for deterministic codes and the average probability of error criterion, has been determined in this way, cf. [25]. At present, this approach seems necessary even for the following intuitive result.

Theorem VII.2 ([25]): Consider an AVC whose permissible $n$-length inputs $\mathbf{x} = (x_1, \dots, x_n)$ satisfy $\sum_i x_i^2 \le n\Gamma$, and the output is $\mathbf{y} = \mathbf{x} + \mathbf{s} + \mathbf{z}$, where the deterministic sequence $\mathbf{s}$ and the random sequence $\mathbf{z}$ with independent zero-mean components may be arbitrary subject to the power constraints $\sum_i s_i^2 \le n\Lambda$ and $\sum_i E(z_i^2) \le n\sigma^2$. This AVC has the same $\varepsilon$-capacity as the Gaussian one where the $z_i$'s are i.i.d. Gaussian RV's with variance $\sigma^2$.

For the latter Gaussian case, Csiszár and Narayan [36] had previously shown that

\[ C = \frac{1}{2} \log\Bigl(1 + \frac{\Gamma}{\Lambda + \sigma^2}\Bigr) \qquad \text{if } \Lambda < \Gamma \tag{VII.18} \]

while $C = 0$ if $\Lambda \ge \Gamma$.

Discrete approximations combined with the method of types provide the simplest available proof of a general form of Sanov's theorem, for RV's with values in an arbitrary set $X$ endowed with a $\sigma$-algebra $\mathcal{F}$ (the discrete case has been discussed in Section III).

For probability measures (pm's) $P, Q$ on $X$, the $I$-divergence is defined as

\[ D(P \| Q) = \sup_{\mathcal{A}} \sum_{i=1}^{k} P(A_i) \log \frac{P(A_i)}{Q(A_i)} \tag{VII.19} \]

the supremum taken for partitions $\mathcal{A} = (A_1, \dots, A_k)$ of $X$ into sets $A_i \in \mathcal{F}$. Here $P^{\mathcal{A}}$ denotes the $\mathcal{A}$-quantization of $P$, defined as the distribution on the finite set $\{1, \dots, k\}$ with $P^{\mathcal{A}}(i) = P(A_i)$; thus the sum in (VII.19) equals $D(P^{\mathcal{A}} \| Q^{\mathcal{A}})$.

The $\tau$-topology of pm's on $X$ is the topology in which a pm $P$ belongs to the interior of a set $\Pi$ of pm's iff for some partition $\mathcal{A}$ and $\varepsilon > 0$

\[ \{P' : |P'(A_i) - P(A_i)| < \varepsilon, \ i = 1, \dots, k\} \subseteq \Pi. \tag{VII.20} \]

The empirical distribution of an $n$-tuple $X_1, \dots, X_n$ of $X$-valued RV's is the random pm $\hat{P}_n$ defined by

\[ \hat{P}_n(A) = \frac{1}{n} \, |\{i : X_i \in A\}|, \qquad A \in \mathcal{F}. \tag{VII.21} \]

Theorem VII.3: Let $X_1, X_2, \dots$ be independent $X$-valued RV's with common distribution $Q$. Then

\[ \liminf_{n \to \infty} \frac{1}{n} \log \Pr\{\hat{P}_n \in \Pi\} \ge -\inf_{P \in \mathrm{int}\,\Pi} D(P \| Q) \tag{VII.22} \]

\[ \limsup_{n \to \infty} \frac{1}{n} \log \Pr\{\hat{P}_n \in \Pi\} \le -\inf_{P \in \mathrm{cl}\,\Pi} D(P \| Q) \tag{VII.23} \]

for every set $\Pi$ of pm's on $X$ for which the probabilities are defined. Here $\mathrm{int}\,\Pi$ and $\mathrm{cl}\,\Pi$ denote the interior and closure of $\Pi$ in the $\tau$-topology.

Theorem VII.3 is a general version of Sanov's theorem. In the parlance of large deviations theory (cf. Dembo and Zeitouni [39]) it says that $\hat{P}_n$ satisfies the large deviation principle with good rate function $D(\cdot \| Q)$ ("goodness" means that the sets $\{P : D(P \| Q) \le c\}$ are compact in the $\tau$-topology; the easy proof of this property is omitted).

Proof (Groeneboom, Oosterhoff, and Ruymgaart [49]): Pick any $P \in \mathrm{int}\,\Pi$, and a partition $\mathcal{A} = (A_1, \dots, A_k)$ and $\varepsilon > 0$ satisfying (VII.20). Apply Theorem III.3 to the quantized RV's $X_i^{\mathcal{A}}$ with distribution $Q^{\mathcal{A}}$, where $X_i^{\mathcal{A}} = j$ if $X_i \in A_j$, and to the set of those distributions $\tilde{P}$ on $\{1, \dots, k\}$ for which $|\tilde{P}(j) - P(A_j)| < \varepsilon$, $j = 1, \dots, k$. As the latter is an open set containing $P^{\mathcal{A}}$, it follows that

\[ \liminf_{n \to \infty} \frac{1}{n} \log \Pr\{|\hat{P}_n(A_j) - P(A_j)| < \varepsilon, \ j = 1, \dots, k\} \ge -D(P^{\mathcal{A}} \| Q^{\mathcal{A}}). \tag{VII.24} \]

The left-hand side of (VII.24) is a lower bound to that of (VII.22), by (VII.20). Hence, as $P \in \mathrm{int}\,\Pi$ has been arbitrary, (VII.19) and (VII.24) imply (VII.22).

Notice next that for each partition $\mathcal{A}$, Theorem III.3 applied to the quantized RV's as above and to the quantized image of $\mathrm{cl}\,\Pi$ gives that

\[ \limsup_{n \to \infty} \frac{1}{n} \log \Pr\{\hat{P}_n \in \Pi\} \le -\inf_{P \in \mathrm{cl}\,\Pi} D(P^{\mathcal{A}} \| Q^{\mathcal{A}}). \tag{VII.25} \]

Clearly, (VII.23) follows from (VII.19) and (VII.25) if one shows that

\[ \sup_{\mathcal{A}} \inf_{P \in \mathrm{cl}\,\Pi} D(P^{\mathcal{A}} \| Q^{\mathcal{A}}) \ge \inf_{P \in \mathrm{cl}\,\Pi} D(P \| Q). \tag{VII.26} \]

The nontrivial but not too hard proof of (VII.26) is omitted.

The "discrete approximation plus method of types" approach works also for other problems that cannot be treated here. For extensions of the hypothesis testing results in Section III, cf. Tusnády [78].
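To illustrate the definition (VII.19), the following sketch (the Gaussian example is ours, not from the text; standard library only) quantizes $P = N(1,1)$ and $Q = N(0,1)$ by nested interval partitions and watches $D(P^{\mathcal{A}} \| Q^{\mathcal{A}})$ increase toward the closed-form value $D(P \| Q) = 1/2$ nat:

    from math import erf, log, sqrt

    def Phi(t, mu=0.0):                       # Gaussian CDF, unit variance
        return 0.5 * (1.0 + erf((t - mu) / sqrt(2.0)))

    def quantized_divergence(k, lo=-8.0, hi=8.0):
        # D(P^A || Q^A) for the partition A of the line into the two tails
        # plus k equal intervals covering [lo, hi); P = N(1,1), Q = N(0,1).
        cuts = [lo + (hi - lo) * j / k for j in range(k + 1)]
        edges = [float('-inf')] + cuts + [float('inf')]
        d = 0.0
        for a, b in zip(edges, edges[1:]):
            p, q = Phi(b, 1.0) - Phi(a, 1.0), Phi(b) - Phi(a)
            if p > 0.0:                       # guard against tail underflow
                d += p * log(p / q)
        return d

    for k in (2, 8, 32, 128):                 # nested refinements
        print(k, quantized_divergence(k))     # increases toward 0.5

Because these partitions are nested, the printed values are nondecreasing, in accordance with the supremum in (VII.19).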

VIII. CONCLUSIONS

The method of types has been shown to be a powerful tool of the information theory of discrete memoryless systems. It affords extensions also to certain models with memory, and can be applied to continuous alphabet models via discrete approximations. The close links of the method of types to large deviations theory (primarily to Sanov's theorem) have also been established.

Sometimes it is claimed that "type" arguments, at least for models involving only one set of sequences as in hypothesis testing, could be replaced by referring to general results from large deviations theory. This is true for some applications (although the method of types gives more insight), but in other applications the explicit "type" bounds valid for all $n$ afford stronger conclusions than the asymptotic bounds provided by large deviations theory. It is interesting to note in this respect that for the derivation of (VII.6) even the familiar polynomial-accuracy bound on the size of type classes was not sufficient; rather, the exact formula (of Whittle [80]) for the size of second-order type classes had to be used. Of course, the heavy machinery of large deviations theory (cf. [39]) works for many problems for which type arguments do not. In particular, while that machinery is not needed for Sanov's theorem (Theorem VII.3), it appears necessary to derive the corresponding result for continuous-alphabet Markov chains. Indeed, although the method of types does work for finite-alphabet Markov chains (cf. Theorem VII.1), extension to general alphabets via discrete approximations does not seem feasible, since quantization destroys the Markov property.

APPENDIX

Proof of Lemma IV.1: Pick $M$ sequences $\mathbf{x}_1, \dots, \mathbf{x}_M$ from $T_P$ at random. Then, using Lemma II.2, we have, for any joint type $V$ with marginal $P$ and any fixed $i$, an exponential bound (A.1) on the probability that a randomly chosen $\mathbf{x}_j$ falls into the $V$-shell of $\mathbf{x}_i$. This implies a corresponding bound (A.2) on the expected number of such indices $j$. Writing the quantity of interest in the form (A.3), it follows from (A.2) that its expectation satisfies (A.4). On account of (A.4), the same inequality must hold without the expectation sign for some choice of $\mathbf{x}_1, \dots, \mathbf{x}_M$, and then the latter satisfy (A.5) for at least half of the indices $i$. Assuming without any loss of generality that (A.5) holds for $i = 1, \dots, \lceil M/2 \rceil$, it follows by (A.3) that (A.6) holds for each such $i$ and each joint type $V$ with marginal $P$. As $M$ may be chosen such that $\frac{1}{n} \log M$ is arbitrarily close to the designated rate, this completes the proof.
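The step from (A.4) to (A.5) is the standard averaging ("expurgation") argument; in generic notation (ours), if nonnegative quantities $a_1, \dots, a_M$ attached to the codewords satisfy

\[ \frac{1}{M} \sum_{i=1}^{M} a_i \le t \quad \text{then} \quad \bigl|\{\, i : a_i \le 2t \,\}\bigr| \ge \frac{M}{2} \]

since otherwise more than $M/2$ of the $a_i$ would exceed $2t$ and the average would exceed $t$. Discarding the violating indices keeps at least $M/2$ codewords, reducing the rate $\frac{1}{n} \log M$ by at most $\frac{1}{n} \log 2$, which is negligible for large $n$.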

Proof of Theorems V.1, V.2 (continued): Consider a decoder $\varphi$ as defined in Section V, with a preliminarily unspecified permissible set $\mathcal{D}$. Recall that the error probability to be bounded is the one defined by (V.14) or by (V.15), according as the maximum or the average probability of error criterion is considered, and each codeword has to satisfy (V.20). Clearly, a received sequence $\mathbf{y}$ in a given conditional type class can be decoded incorrectly only if the joint type of the codeword, the state sequence, and $\mathbf{y}$, together with that of some competing codeword, is of a particular kind. Using (II.7) and (V.10), it follows that the fraction of incorrectly decoded sequences $\mathbf{y}$ within the class is bounded, in the exponential sense, by a sum of terms, where the sum and the corresponding maximum are for all joint types with the given marginal.

Except for an exponentially small fraction of the indices $i$, it suffices to take the above sum and maximum for those joint types that satisfy the additional constraint (A.7). Indeed, the fraction of indices $i$ for which a competing codeword exists with joint type not satisfying (A.7) is exponentially small by (V.12).

If the fraction of incorrectly decoded $\mathbf{y}$'s within each class is exponentially small for each $i$, it follows by (V.3) and (V.17) that the maximum error probability is exponentially small. Hence, writing the decisive exponent as in (A.8), the maximum probability of error defined by (V.2) will be exponentially small if a permissible $\mathcal{D}$ exists such that this exponent is bounded away from zero whenever the joint type in question lies outside $\mathcal{D}$. Similarly, if the fraction of incorrectly decoded $\mathbf{y}$'s is exponentially small for each $i$, except perhaps for an exponentially small fraction of the indices $i$, it follows by (V.18) that the average probability of error is exponentially small, supposing, of course, that the rate condition of Theorem V.2 holds. Hence the average probability of error defined by (V.2) will be exponentially small if a $\mathcal{D}$ exists such that the exponent in (A.8) is bounded away from zero whenever (A.7) holds and the joint type lies outside $\mathcal{D}$.

Actually, in both cases a positive lower bound on the exponent suffices, since the corresponding parameter in Lemma V.1 can always be chosen smaller than that bound. To complete the proof, it suffices to find a permissible set $\mathcal{D}$ of joint distributions such that

i) in case of Theorem V.1, the exponent in (A.8) has a positive lower bound subject to the above constraints;

ii) in the case of Theorem V.2, the exponent in (A.8) has a positive lower bound subject to those constraints and (A.7), where the minimum appearing in (A.9) is understood as in (V.4), and the AVC is nonsymmetrizable.

Now, from (A.8), a first lower bound (A.10) on the exponent follows. Moreover, if (A.7) holds, then a second lower bound (A.11) follows as well.

If the joint type in question lies outside $\mathcal{D}$, then its marginal is the marginal of some distribution appearing in (V.20), defined by (V.14) or (V.15) according to the error criterion considered. In the first case, (V.14) implies that this marginal is arbitrarily close to the prescribed one, and hence any number less than the quantity defined by (V.7) is a lower bound to the exponent if the approximation parameter is sufficiently small; then (A.10) shows that the claim under i) always holds in this case. In the second case, (V.15) implies that the marginal is close to the prescribed one, and hence any number less than the minimum in (A.9) is a lower bound to the exponent if the approximation parameter is sufficiently small; then (A.11) shows that the claim under ii) always holds in this case.

So far, the choice of $\mathcal{D}$ played no role. To make the claim under i) hold also in the remaining case, choose $\mathcal{D}$ as the set of joint distributions satisfying a suitable additional constraint beyond (V.20). It can be shown by rather straightforward calculation using (V.8) and (V.13) that this choice is permissible, providing the relevant parameters are sufficiently small; cf. [29] for details.

Concerning the claim under ii) in the remaining case, notice that then (A.8) simplifies, because of (A.7). Hence the claim will hold if $\mathcal{D}$ is chosen as the set of those joint distributions satisfying a suitable additional constraint beyond (V.20). It can be shown that this choice is permissible if the AVC is nonsymmetrizable and the relevant parameters are sufficiently small; cf. [33] for details.

REFERENCES

[1] R. Ahlswede, "Multi-way communication channels," in Proc. 2nd Int. Symp. Inform. Theory (Tsahkadzor, Armenian SSR, 1971). Budapest, Hungary: Akadémiai Kiadó, 1973, pp. 23–52.
[2] ——, "Elimination of correlation in random codes for arbitrarily varying channels," Z. Wahrscheinlichkeitsth. Verw. Gebiete, vol. 44, pp. 159–175, 1978.
[3] ——, "Coloring hypergraphs: A new approach to multi-user source coding I–II," J. Comb. Inform. Syst. Sci., vol. 4, pp. 76–115, 1979, and vol. 5, pp. 220–268, 1980.
[4] ——, "A method of coding and an application to arbitrarily varying channels," J. Comb. Inform. Syst. Sci., vol. 5, pp. 10–35, 1980.
[5] ——, "Extremal properties of rate distortion functions," IEEE Trans. Inform. Theory, vol. 36, pp. 166–171, Jan. 1990.
[6] ——, "The maximal error capacity of arbitrarily varying channels for constant list sizes," IEEE Trans. Inform. Theory, vol. 39, pp. 1416–1417, 1993.
[7] R. Ahlswede and N. Cai, "Two proofs of Pinsker's conjecture concerning arbitrarily varying channels," IEEE Trans. Inform. Theory, vol. 37, pp. 1647–1649, Nov. 1991.
[8] ——, "Arbitrarily varying multiple-access channels, part 1. Ericson's symmetrizability is adequate, Gubner's conjecture is true," Preprint 96-068, Sonderforschungsbereich 343, Universität Bielefeld, Bielefeld, Germany.
[9] R. Ahlswede, N. Cai, and Z. Zhang, "Erasure, list, and detection zero-error capacities for low noise and a relation to identification," IEEE Trans. Inform. Theory, vol. 42, pp. 52–62, Jan. 1996.
[10] R. Ahlswede and J. Wolfowitz, "The capacity of a channel with arbitrarily varying cpf's and binary output alphabets," Z. Wahrscheinlichkeitsth. Verw. Gebiete, vol. 15, pp. 186–194, 1970.
[11] R. Ahlswede, E. Yang, and Z. Zhang, "Identification via compressed data," IEEE Trans. Inform. Theory, vol. 43, pp. 48–70, Jan. 1997.
[12] V. B. Balakirsky, "A converse coding theorem for mismatched decoding at the output of binary-input memoryless channels," IEEE Trans. Inform. Theory, vol. 41, pp. 1889–1902, Nov. 1995.
[13] T. Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression. Englewood Cliffs, NJ: Prentice-Hall, 1971.
[14] P. Billingsley, "Statistical methods in Markov chains," Ann. Math. Statist., vol. 32, pp. 12–40; correction, p. 1343, 1961.
[15] D. Blackwell, L. Breiman, and A. J. Thomasian, "The capacities of certain channel classes under random coding," Ann. Math. Statist., vol. 31, pp. 558–567, 1960.
[16] R. E. Blahut, "Composition bounds for channel block codes," IEEE Trans. Inform. Theory, vol. IT-23, pp. 656–674, 1977.
[17] V. Blinovsky, P. Narayan, and M. Pinsker, "Capacity of the arbitrarily varying channel under list decoding," Probl. Pered. Inform., vol. 31, pp. 99–113, 1995.
[18] L. Boltzmann, "Beziehung zwischen dem zweiten Hauptsatze der mechanischen Wärmetheorie und der Wahrscheinlichkeitsrechnung respektive den Sätzen über das Wärmegleichgewicht," Wien. Ber., vol. 76, pp. 373–435, 1877.
[19] L. B. Boza, "Asymptotically optimal tests for finite Markov chains," Ann. Math. Statist., vol. 42, pp. 1992–2007, 1971.
[20] H. Chernoff, "A measure of asymptotic efficiency for tests of a hypothesis based on a sum of observations," Ann. Math. Statist., vol. 23, pp. 493–507, 1952.
[21] G. Cohen, I. Honkala, S. Litsyn, and A. Lobstein, Covering Codes. Amsterdam, The Netherlands: North-Holland, 1997.
[22] I. Csiszár, "Joint source-channel error exponent," Probl. Contr. Inform. Theory, vol. 9, pp. 315–328, 1980.
[23] ——, "Linear codes for sources and source networks: Error exponents, universal coding," IEEE Trans. Inform. Theory, vol. IT-28, pp. 585–592, July 1982.
[24] ——, "On the error exponent of source-channel transmission with a distortion threshold," IEEE Trans. Inform. Theory, vol. IT-28, pp. 823–828, Nov. 1982.
[25] ——, "Arbitrarily varying channels with general alphabets and states," IEEE Trans. Inform. Theory, vol. 38, pp. 1725–1742, 1992.
[26] I. Csiszár, T. M. Cover, and B. S. Choi, "Conditional limit theorems under Markov conditioning," IEEE Trans. Inform. Theory, vol. IT-33, pp. 788–801, Nov. 1987.
[27] I. Csiszár and J. Körner, "Towards a general theory of source networks," IEEE Trans. Inform. Theory, vol. IT-26, pp. 155–165, Jan. 1980.

[28] ——, "Graph decomposition: A new key to coding theorems," IEEE Trans. Inform. Theory, vol. IT-27, pp. 5–11, Jan. 1981.
[29] ——, "On the capacity of the arbitrarily varying channel for maximum probability of error," Z. Wahrscheinlichkeitsth. Verw. Gebiete, vol. 57, pp. 87–101, 1981.
[30] ——, Information Theory: Coding Theorems for Discrete Memoryless Systems. New York: Academic, 1981.
[31] ——, "Feedback does not affect the reliability function of a DMC at rates above capacity," IEEE Trans. Inform. Theory, vol. IT-28, pp. 92–93, Jan. 1982.
[32] I. Csiszár, J. Körner, and K. Marton, "A new look at the error exponent of discrete memoryless channels," presented at the IEEE Int. Symp. Information Theory (Cornell Univ., Ithaca, NY, 1977), preprint.
[33] I. Csiszár and P. Narayan, "The capacity of the arbitrarily varying channel revisited: Positivity, constraints," IEEE Trans. Inform. Theory, vol. 34, pp. 181–193, Mar. 1988.
[34] ——, "Capacity and decoding rules for arbitrarily varying channels," IEEE Trans. Inform. Theory, vol. 35, pp. 752–769, July 1989.
[35] ——, "Channel capacity for a given decoding metric," IEEE Trans. Inform. Theory, vol. 41, pp. 35–43, Jan. 1995.
[36] ——, "Capacity of the Gaussian arbitrarily varying channel," IEEE Trans. Inform. Theory, vol. 37, pp. 18–26, Jan. 1991.
[37] I. Csiszár and P. C. Shields, "Redundancy rates for renewal and other processes," IEEE Trans. Inform. Theory, vol. 42, pp. 2065–2072, Nov. 1996.
[38] L. Davisson, G. Longo, and A. Sgarro, "The error exponent for the noiseless encoding of finite ergodic Markov sources," IEEE Trans. Inform. Theory, vol. IT-27, pp. 431–438, 1981.
[39] A. Dembo and O. Zeitouni, Large Deviations Techniques and Applications. Boston, MA: Jones and Bartlett, 1993.
[40] R. L. Dobrushin and S. Z. Stambler, "Coding theorems for classes of arbitrarily varying discrete memoryless channels," Probl. Pered. Inform., vol. 11, no. 2, pp. 3–22, 1975, English translation.
[41] G. Dueck, "Maximal error capacity regions are smaller than average error capacity regions for multi-user channels," Probl. Contr. Inform. Theory, vol. 7, pp. 11–19, 1978.
[42] G. Dueck and J. Körner, "Reliability function of a discrete memoryless channel at rates above capacity," IEEE Trans. Inform. Theory, vol. IT-25, pp. 82–85, Jan. 1979.
[43] T. Ericson, "Exponential error bounds for random codes in the arbitrarily varying channel," IEEE Trans. Inform. Theory, vol. IT-31, pp. 42–48, Jan. 1985.
[44] R. M. Fano, Transmission of Information: A Statistical Theory of Communications. New York: Wiley, 1961.
[45] L. Finesso, C. C. Liu, and P. Narayan, "The optimal error exponent for Markov order estimation," IEEE Trans. Inform. Theory, vol. 42, pp. 1488–1497, Sept. 1996.
[46] R. G. Gallager, "A simple derivation of the coding theorem and some applications," IEEE Trans. Inform. Theory, vol. IT-11, pp. 3–18, Jan. 1965.
[47] ——, "A perspective on multiaccess channels," IEEE Trans. Inform. Theory, vol. IT-31, pp. 124–142, Mar. 1985.
[48] V. D. Goppa, "Nonprobabilistic mutual information without memory," Probl. Contr. Inform. Theory, vol. 4, pp. 97–102, 1975.
[49] P. Groeneboom, J. Oosterhoff, and F. H. Ruymgaart, "Large deviation theorems for empirical probability measures," Ann. Probab., vol. 7, pp. 553–586, 1979.
[50] J. A. Gubner, "On the deterministic-code capacity of the multiple-access arbitrarily varying channel," IEEE Trans. Inform. Theory, vol. 36, pp. 262–275, Mar. 1990.
[51] M. Gutman, "Asymptotically optimal classification for multiple tests with empirically observed statistics," IEEE Trans. Inform. Theory, vol. 35, pp. 401–408, Mar. 1989.
[52] E. A. Haroutunian, "Bounds on the error probability exponent for semicontinuous memoryless channels," Probl. Pered. Inform., vol. 4, no. 4, pp. 37–48, 1968, in Russian.
[53] W. Hoeffding, "Asymptotically optimal tests for multinomial distributions," Ann. Math. Statist., vol. 36, pp. 369–401, 1965.
[54] B. L. Hughes, "The smallest list for the arbitrarily varying channel," IEEE Trans. Inform. Theory, vol. 43, pp. 803–815, May 1997.
[55] J. Y. N. Hui, "Fundamental issues of multiple accessing," Ph.D. dissertation, MIT, Cambridge, MA, 1983.
[56] J. H. Jahn, "Coding for arbitrarily varying multiuser channels," IEEE Trans. Inform. Theory, vol. IT-27, pp. 212–226, Mar. 1981.
[57] J. Kiefer and J. Wolfowitz, "Channels with arbitrarily varying channel probability functions," Inform. Contr., vol. 5, pp. 44–54, 1962.
[58] J. Körner and A. Sgarro, "Universally attainable error exponents for broadcast channels with degraded message sets," IEEE Trans. Inform. Theory, vol. IT-26, pp. 670–679, 1980.
[59] A. Lapidoth, "Mismatched decoding and the multiple-access channel," IEEE Trans. Inform. Theory, vol. 42, pp. 1439–1452, Sept. 1996.
[60] H. J. Liao, "Multiple access channels," Ph.D. dissertation, Univ. Hawaii, Honolulu, 1972.
[61] T. Linder, G. Lugosi, and K. Zeger, "Fixed-rate universal lossy source coding and rates of convergence for memoryless sources," IEEE Trans. Inform. Theory, vol. 41, pp. 665–676, May 1995.
[62] Y. S. Liu and B. L. Hughes, "A new universal random coding bound for the multiple-access channel," IEEE Trans. Inform. Theory, vol. 42, pp. 376–386, 1996.
[63] G. Longo and A. Sgarro, "The source coding theorem revisited: A combinatorial approach," IEEE Trans. Inform. Theory, vol. IT-25, pp. 544–548, May 1979.
[64] K. Marton, "Error exponent for source coding with a fidelity criterion," IEEE Trans. Inform. Theory, vol. IT-20, pp. 197–199, Mar. 1974.
[65] N. Merhav, "A comment on 'A rate of convergence result for a universal D-semifaithful code,'" IEEE Trans. Inform. Theory, vol. 41, pp. 1200–1202, July 1995.
[66] N. Merhav, M. Gutman, and J. Ziv, "On the estimation of the order of a Markov chain and universal data compression," IEEE Trans. Inform. Theory, vol. 35, pp. 1014–1019, Sept. 1989.
[67] N. Merhav, G. Kaplan, A. Lapidoth, and S. Shamai, "On information rates for mismatched decoders," IEEE Trans. Inform. Theory, vol. 40, pp. 1953–1967, Nov. 1994.
[68] S. Natarajan, "Large deviations, hypothesis testing, and source coding for finite Markov sources," IEEE Trans. Inform. Theory, vol. IT-31, pp. 360–365, 1985.
[69] Y. Oohama, "Universal coding for correlated sources with linked encoders," IEEE Trans. Inform. Theory, vol. 42, pp. 837–847, May 1996.
[70] Y. Oohama and T. S. Han, "Universal coding for the Slepian–Wolf data compression system and the strong converse theorem," IEEE Trans. Inform. Theory, vol. 40, pp. 1908–1919, 1994.
[71] J. Pokorny and H. M. Wallmeier, "Random coding bound and codes produced by permutations for the multiple-access channel," IEEE Trans. Inform. Theory, vol. IT-31, pp. 741–750, Nov. 1985.
[72] I. N. Sanov, "On the probability of large deviations of random variables," Mat. Sbornik, vol. 42, pp. 11–44, 1957, in Russian; English translation in Select. Transl. Math. Statist. Probab., vol. 1, pp. 213–244, 1961.
[73] E. Schrödinger, "Über die Umkehrung der Naturgesetze," Sitzungsber. Preuss. Akad. Wiss. Berlin, Phys. Math. Kl., vols. 8/9, pp. 144–153, 1931.
[74] C. E. Shannon, R. G. Gallager, and E. R. Berlekamp, "Lower bounds to error probability for coding on discrete memoryless channels," Inform. Contr., vol. 10, pp. 65–104 and 522–552, 1967.
[75] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication. Urbana, IL: Univ. Illinois Press, 1949.
[76] D. Slepian and J. K. Wolf, "Noiseless coding of correlated information sources," IEEE Trans. Inform. Theory, vol. IT-19, pp. 471–480, 1973.
[77] İ. E. Telatar and R. G. Gallager, "New exponential upper bounds to error and erasure probabilities," in Proc. IEEE Int. Symp. Information Theory (Trondheim, Norway, 1994), p. 379.
[78] G. Tusnády, "On asymptotically optimal tests," Ann. Statist., vol. 5, pp. 385–393, 1977.
[79] M. J. Weinberger, N. Merhav, and M. Feder, "Optimal sequential probability assignment for individual sequences," IEEE Trans. Inform. Theory, vol. 40, pp. 384–396, Mar. 1994.
[80] P. Whittle, "Some distributions and moment formulae for the Markov chain," J. Roy. Statist. Soc., ser. B, vol. 17, pp. 235–242, 1955.
[81] J. Wolfowitz, Coding Theorems of Information Theory. Berlin, Germany: Springer, 1961.
[82] B. Yu and T. P. Speed, "A rate of convergence result for a universal D-semifaithful code," IEEE Trans. Inform. Theory, vol. 39, pp. 513–520, May 1993.
[83] Z. Zhang, E. Yang, and V. K. Wei, "The redundancy of source coding with a fidelity criterion—Part one: Known statistics," IEEE Trans. Inform. Theory, vol. 43, pp. 71–91, Jan. 1997.