
Polynomial Singular Value Decompositions of a Family of Source-Channel Models


Citation Makur, Anuran and Lizhong Zheng. "Polynomial Singular Value Decompositions of a Family of Source-Channel Models." IEEE Transactions on Information Theory 63, 12 (December 2017): 7716 - 7728. © 2017 IEEE

As Published http://dx.doi.org/10.1109/tit.2017.2760626

Publisher Institute of Electrical and Electronics Engineers (IEEE)

Version Author's final manuscript

Citable link https://hdl.handle.net/1721.1/131019

Terms of Use Creative Commons Attribution-Noncommercial-Share Alike

Detailed Terms http://creativecommons.org/licenses/by-nc-sa/4.0/

IEEE TRANSACTIONS ON INFORMATION THEORY

Polynomial Singular Value Decompositions of a Family of Source-Channel Models Anuran Makur, Student Member, IEEE, and Lizhong Zheng, Fellow, IEEE

Abstract—In this paper, we show that the conditional expectation operators corresponding to a family of source-channel models, defined by natural exponential families with quadratic variance functions and their conjugate priors, have orthonormal polynomials as singular vectors. These models include the Gaussian channel with Gaussian source, the Poisson channel with gamma source, and the binomial channel with beta source. To derive the singular vectors of these models, we prove and employ the equivalent condition that their conditional moments are strictly degree preserving polynomials.

Index Terms—Singular value decomposition, natural exponential family, conjugate prior, orthogonal polynomials.

I. INTRODUCTION

SPECTRAL and singular value decompositions (SVDs) of conditional expectation operators have many uses in information theory and statistics [1]–[3]. As a result, it is valuable to analytically determine the singular vectors corresponding to some widely studied toy models. In this paper, we illustrate that a certain simple family of source-channel models always has corresponding conditional expectation operators with orthogonal polynomial singular vectors. We commence by presenting this family of models and formally defining conditional expectation operators in the next two subsections.

Manuscript received on December 9, 2015; revised on July 16, 2017; accepted on September 21, 2017. This research was supported by the National Science Foundation Award 1216476 and a Hewlett-Packard Fellowship. This work was presented in part at the 54th Annual Allerton Conference on Communication, Control, and Computing, 2016. A. Makur and L. Zheng are with the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA (e-mail: [email protected]; [email protected]). Copyright (c) 2017 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

A. Natural Exponential Families with Quadratic Variance Functions and their Conjugate Priors

Since we will study source-channel models that have exponential family and conjugate prior structure, we briefly introduce these notions. Exponential families form an important class of distributions in statistics because they are analytically tractable and intimately tied to several theoretical phenomena [4], [5]. For instance, they have sufficient statistics with bounded dimension after i.i.d. sampling (Pitman-Koopman-Darmois theorem) [6], they have conjugate priors [7], they admit efficient estimators that achieve the Cramér-Rao bound under a mean parametrization [5], they are maximum entropy distributions under moment constraints [8], and they are used in tilting arguments in large deviations theory [5]. We are interested in a particular subclass of one-parameter exponential families known as natural exponential families with quadratic variance functions (NEFQVF). So, we define natural exponential families next.

Definition 1 (Natural Exponential Family). Given a measurable space (Y, B(Y)) with a σ-finite measure µ, where Y ⊆ R and B(Y) denotes the Borel σ-algebra on Y, the parametrized family of probability densities {P_Y(·; x) : x ∈ X} with respect to µ that have support Y (independent of x) is called a natural exponential family when each density has the form:

∀x ∈ X, ∀y ∈ Y, P_Y(y; x) = exp(xy − α(x)) P_Y(y; 0)

where P_Y(·; 0) is a base density function, and:

∀x ∈ X, α(x) = log( ∫_Y exp(xy) P_Y(y; 0) dµ(y) )

is known as the log-partition function, which satisfies α(0) = 0 without loss of generality. The parameter x is called the natural parameter, and the natural parameter space X ≜ {x ∈ R : |α(x)| < +∞} ⊆ R is defined as the largest interval where the log-partition function is finite. We usually assume without loss of generality that 0 ∈ X. (Here, and throughout this paper, exp(·) and log(·) refer to the natural exponential and the natural logarithm with the base e, respectively.)

In [9] and [10], Morris specialized Definition 1 further in an effort to justify why certain natural exponential families like the Gaussian, Poisson, and binomial enjoy "many useful mathematical properties" [9]. He asserted that the tractability of these distributions stemmed from their quadratic variance functions. To define this, observe that α(·) is infinitely differentiable on X° (the interior of X) [4], and satisfies:

∀x ∈ X,  α(x) = log E_{P_Y(·;0)}[exp(xY)]   (1)
∀x ∈ X°, α′(x) = E_{P_Y(·;x)}[Y]   (2)
∀x ∈ X°, α″(x) = VAR_{P_Y(·;x)}(Y)   (3)

where Y denotes a random variable taking values in Y, (1) is the cumulant generating function of Y, and (3) is the Fisher information Y carries about x [5]. Following the exposition in [9], we may define the variance function V : image(α′) → R⁺ as the variance of Y written as a function of the mean of Y:

∀γ ∈ image(α′), V(γ) ≜ α″(α′⁻¹(γ))   (4)

where α′(·) is injective because α″(·) is strictly positive in non-degenerate scenarios. We can now define NEFQVFs.
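The chain of identities (1)–(4) can be traced numerically. The following sketch is our own illustration (the Poisson(1) base density, parameter grid, and finite-difference step are assumed choices, not from the text): it recovers the mean and variance from finite differences of α(·) and fits the variance function, which for the Poisson family is the degree-1 polynomial V(γ) = γ.

```python
import numpy as np
from math import lgamma

# Illustrative check of (1)-(4) for the Poisson natural exponential family.
# Base density P_Y(y; 0) = Poisson(1) pmf, so alpha(x) = log E[exp(x Y)] = e^x - 1.
def alpha(x, ymax=200):
    ys = np.arange(ymax)
    log_p0 = -1.0 - np.array([lgamma(y + 1.0) for y in ys])  # log Poisson(1) pmf
    return np.log(np.sum(np.exp(x * ys + log_p0)))

xs = np.linspace(-1.0, 1.0, 41)   # natural parameters (illustrative grid)
h = 1e-4                          # finite-difference step (illustrative)
means = np.array([(alpha(x + h) - alpha(x - h)) / (2 * h) for x in xs])               # (2)
varis = np.array([(alpha(x + h) - 2 * alpha(x) + alpha(x - h)) / h ** 2 for x in xs])  # (3)

# Fit the variance as a polynomial in the mean; a quadratic fit should be exact
# for an NEFQVF. For Poisson, V(gamma) = gamma, so coeffs ~ [0, 1, 0].
coeffs = np.polyfit(means, varis, 2)   # [c2, c1, c0]
residual = np.max(np.abs(np.polyval(coeffs, means) - varis))
print(coeffs, residual)
```

The same experiment with a Gaussian or binomial base pmf yields a constant or a concave quadratic fit, respectively, in line with the classification that follows.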

Definition 2 (NEFQVF). An NEFQVF is a natural exponential family whose variance function V(γ) is a polynomial in γ with degree at most 2.

We will only analyze channel conditional distribution models P_{Y|X}(y|x) = P_Y(y; x) that are NEFQVFs (although we will not explicitly use the NEFQVF parametrization in our calculations). There are six possible NEFQVFs [9]:
1) Gaussian pdfs with mean parameter and fixed variance
2) Poisson pmfs with rate parameter
3) binomial pmfs with success probability parameter and fixed number of Bernoulli trials
4) gamma pdfs with rate parameter and fixed "shape"
5) negative binomial pmfs with success probability parameter and fixed "number of failures"
6) generalized hyperbolic secant pdfs (see [9] for details regarding this family)
and only the first three will lead to non-degenerate situations.

Given a channel P_{Y|X}(y|x) = P_Y(y; x) defined by an NEFQVF, we will only analyze source distributions that belong to the corresponding conjugate prior family. For any natural exponential family, we may define a conjugate prior family as shown next [4], [5].

Definition 3 (Conjugate Prior). Suppose we are given a natural exponential family from Definition 1 such that X is a non-empty open interval defining the measurable space (X, B(X)) with σ-finite measure λ. The corresponding conjugate prior family is the parametrized family of probability densities {P_X(·; z, n) : (z, n) ∈ Ξ} with respect to λ that have support X (independent of (z, n)), and are of the form:

∀x ∈ X, P_X(x; z, n) = exp(zx − nα(x) − τ(z, n))

for any (z, n) ∈ Ξ, where the log-partition function τ : Ξ → R is given by:

∀(z, n) ∈ Ξ, τ(z, n) = log( ∫_X exp(zx − nα(x)) dλ(x) )

and (z, n) are hyper-parameters that belong to the hyper-parameter space Ξ ≜ {(z, n) ∈ R × R : |τ(z, n)| < +∞}.

When channels are given by natural exponential families, if we use a conjugate prior source, then posterior distributions also belong to the conjugate family. This structure allows computationally efficient updating of beliefs in Bayesian inference problems [5]. A comprehensive list of different conjugate prior families can be compiled from [5], [10], [11], and we will present the conjugate prior families for the first three NEFQVFs listed above (without tediously referring back to the aforementioned sources) in section II.

B. Conditional Expectation Operators

We next formally define conditional expectation operators. We fix a probability space (Ω, F, P), and define an input random variable X : Ω → X ⊆ R with source probability density P_X with respect to a σ-finite measure λ on the standard measurable space (X, B(X)). Likewise, we define an output random variable Y : Ω → Y ⊆ R, and channel conditional probability densities {P_{Y|X=x} : x ∈ X} with respect to a σ-finite measure µ on the standard measurable space (Y, B(Y)). We will use the notation P_X and P_Y to denote the marginal probability laws of X and Y, and we will assume that P_X and P_Y have (measure theoretic) supports X̄ and Ȳ (which are the closures of X and Y), respectively. Finally, we note that this source-channel model defines a joint probability density P_{X,Y} on the product measure space (X × Y, B(X) ⊗ B(Y), λ × µ) such that P_{X,Y}(x, y) = P_{Y|X}(y|x) P_X(x) for every x ∈ X and y ∈ Y. In section II, our channels will be NEFQVFs and our sources will be the corresponding conjugate priors.

We next define the Hilbert spaces and linear operators pertinent to our discussion. Corresponding to the measure space (X, B(X), P_X), we define the separable Hilbert space L²(X, P_X) over the field R:

L²(X, P_X) ≜ {f : X → R | E[f²(X)] < +∞}   (5)

which is the space of all Borel measurable and P_X-square integrable functions, with correlation as the inner product:

∀f, g ∈ L²(X, P_X), ⟨f, g⟩_{P_X} ≜ E[f(X) g(X)]   (6)

and induced norm:

∀f ∈ L²(X, P_X), ‖f‖_{P_X} ≜ ⟨f, f⟩_{P_X}^{1/2} = E[f²(X)]^{1/2}.   (7)

Likewise, we define the separable Hilbert space L²(Y, P_Y) corresponding to the measure space (Y, B(Y), P_Y). The conditional expectation operators are maps that are defined between these Hilbert spaces. The "forward" conditional expectation operator C : L²(X, P_X) → L²(Y, P_Y) is defined as:

∀f ∈ L²(X, P_X), (C(f))(y) ≜ E[f(X) | Y = y],   (8)

and the "reverse" conditional expectation operator C* : L²(Y, P_Y) → L²(X, P_X) is defined as:

∀g ∈ L²(Y, P_Y), (C*(g))(x) ≜ E[g(Y) | X = x].   (9)

It is straightforward to verify from (8) and (9) that the codomains of C and C* are indeed Hilbert spaces. The next proposition collects some simple properties of these operators.

Proposition 1 (Conditional Expectation Operators). C and C* are bounded linear operators with operator norms ‖C‖_op = ‖C*‖_op = 1. Moreover, C* is the adjoint operator of C.

Proof. See Appendix A. ∎

Given NEFQVF channels and conjugate prior sources, we will prove that the corresponding operators C and C* have singular vectors that are orthonormal polynomials under the regularity condition that the input and output Hilbert spaces have orthonormal polynomial bases. The ensuing two subsections provide some illustrations from the literature where such SVDs can be useful.

C. Maximal Correlation Functions

In statistics, one utility of singular vectors of conditional expectation operators is that they can be construed as "maximal correlation functions." To explain this, we first recall the Hirschfeld-Gebelein-Rényi maximal correlation, which is a variational generalization of the well-known Pearson correlation coefficient. Given two jointly distributed random variables X ∈ X and Y ∈ Y, the maximal correlation between them is:

ρ(X; Y) ≜ sup_{f, g} E[f(X) g(Y)]   (10)

where the supremum is over all functions f ∈ L²(X, P_X) and g ∈ L²(Y, P_Y) such that E[f(X)] = E[g(Y)] = 0 and E[f²(X)] = E[g²(Y)] = 1 [1]. Furthermore, if X or Y is a constant almost surely, then ρ(X; Y) = 0. ρ(X; Y) was originally introduced as a normalized measure of the statistical dependence between X and Y that satisfies seven "reasonable" axioms [1]. Indeed, 0 ≤ ρ(X; Y) ≤ 1, and ρ(X; Y) = 0 if and only if X and Y are independent random variables.

Maximal correlation turns out to have an elegant spectral characterization. Notice that the everywhere unity functions 1_X ∈ L²(X, P_X) and 1_Y ∈ L²(Y, P_Y) (which are defined in Appendix A) are the right and left singular vectors of the conditional expectation operator C corresponding to its largest singular value of ‖C‖_op = 1:

C(1_X) = 1_Y and C*(1_Y) = 1_X.   (11)

The orthogonal complement of the span of this right singular vector, span(1_X)^⊥ = {f ∈ L²(X, P_X) : E[f(X)] = 0}, is a sub-Hilbert space of L²(X, P_X). Indeed, it is clearly a linear subspace of L²(X, P_X) that inherits the same inner product (6), and its completeness follows from the continuity of the inner product. It is proved in [1, Theorem 1] that maximal correlation can be written as the Courant-Fischer-Weyl variational characterization of the second largest singular value of C:

ρ(X; Y) = sup_{f ∈ span(1_X)^⊥ \ {0}} ‖C(f)‖_{P_Y} / ‖f‖_{P_X}   (12)

where 0 denotes the zero function. If C is a compact operator, the supremum in (12) is actually achieved by some right singular vector f* ∈ span(1_X)^⊥. Furthermore, f* and the corresponding left singular vector g* = C(f*)/‖C(f*)‖_{P_Y} are precisely the maximal correlation functions achieving the supremum in (10).

Maximal correlation functions can also be construed as the solutions to a general version of non-linear regression studied in [2]:

min_{f ∈ F, g ∈ G} E[(f(X) − g(Y))²]   (13)

where F = {f ∈ L²(X, P_X) : E[f(X)] = 0, E[f²(X)] = 1} and G = {g ∈ L²(Y, P_Y) : E[g(Y)] = 0, E[g²(Y)] = 1} are collections of arbitrary (non-linear) Borel measurable functions, and we assume the minimum exists. Note that when real data (that is assumed to be drawn i.i.d. from P_{X,Y}) is given, the idealized problem in (13) can be modified by replacing the theoretical (population) expectations with empirical (sample) expectations. Breiman and Friedman proposed the alternating conditional expectations (ACE) algorithm in [2] to solve (13) and find the optimal f* ∈ F and g* ∈ G that provide the best linear relationship between f*(X) and g*(Y). Moreover, these f* and g* are also the maximal correlation functions that achieve (10), because E[(f(X) − g(Y))²] = E[f²(X)] − 2E[f(X)g(Y)] + E[g²(Y)] = 2 − 2E[f(X)g(Y)]. Hence, the non-linear regression problem (13) studied in [2] is equivalent to the maximal correlation problem (10).

While the singular vectors f* and g* of C have palpable significance in the contexts of regression and maximal correlation, we may impart other singular vectors of C with similar operational interpretations. The pair of singular vectors corresponding to the kth largest singular value of C (for k ∈ {2, 3, 4, ...}) are the functions that are maximally correlated and orthogonal to all previous pairs of singular vectors. Hence, we refer to all such singular vectors as "maximal correlation functions." Maximal correlation functions associated with larger singular values of C can be interpreted as more informative score functions, and are useful in decomposing information into several mutually orthogonal parts. Indeed, such functions are used to perform inference on hidden Markov models in an image processing context in [12], and algorithms based on the ACE algorithm to learn such functions are presented in [13]. These algorithms are essentially power iteration methods to compute singular vectors of C. Our main results in section II provide explicit characterizations of maximal correlation functions for conditional expectation operators defined by NEFQVFs and their conjugate priors.

D. Local Perturbation Arguments in Information Theory

SVDs of conditional expectation operators are also useful when performing perturbation arguments in network information theory. For instance, SVDs of Gaussian conditional expectation operators are used to demonstrate that non-Gaussian codes can achieve higher rates than Gaussian codes for various Gaussian networks in [3] (where in particular, the strong Shamai-Laroia conjecture for the Gaussian ISI channel is disproved). As another example, we briefly delineate the linear information coupling problem studied in [14].

Suppose X and Y are finite sets (and λ and µ are counting measures), P_{X,Y} is a joint pmf such that P_X(x) > 0 for every x ∈ X and P_Y(y) > 0 for every y ∈ Y, and U ∈ U (with |U| < ∞) is an arbitrary random variable that is conditionally independent of Y given X so that U → X → Y is a Markov chain. For any fixed ε ≠ 0, we first consider the extremal problem that maximizes I(U; Y) with the constraint that only a thin layer of information can pass through X:

sup_{P_U, P_{X|U} : U→X→Y, I(U;X) ≤ ε²/2} I(U; Y)   (14)

where the supremum is over all P_U and P_{X|U} such that P_{X,Y} is fixed (or equivalently, over all P_{U|X}). Note that problem (14) and some of its variants have also been considered in the contexts of investment portfolio theory [15], the information bottleneck method [16], and strong data processing inequalities [17]–[19]. Then, we assume each conditional pmf P_{X|U=u} for u ∈ U is a (multiplicative) local perturbation of P_X by φ_u ∈ L²(X, P_X):

∀u ∈ U, P_{X|U=u} = P_X (1_X + ε φ_u)   (15)

where the sums and products in (15) hold pointwise, and for every u ∈ U, E[φ_u(X)] = 0 so that P_{X|U=u} is a valid pmf.
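Since X and Y are finite in this subsection, the operator C in (8) is concretely a |Y| × |X| matrix of posterior probabilities. The following sketch is our own illustration (the doubly symmetric binary source with crossover probability δ is an assumed example; its maximal correlation 1 − 2δ is classical): it checks the adjointness and unit operator norm from Proposition 1, and recovers ρ(X; Y) as the second singular value per (12).

```python
import numpy as np

# Illustrative example: doubly symmetric binary source.
# X ~ Bernoulli(1/2) and Y = X xor Z with Z ~ Bernoulli(delta), independent of X.
delta = 0.1
P_XY = 0.5 * np.array([[1 - delta, delta],
                       [delta, 1 - delta]])   # P_XY[x, y]
P_X, P_Y = P_XY.sum(axis=1), P_XY.sum(axis=0)

# Forward operator (8) as a matrix: (C f)(y) = sum_x P_{X|Y}(x|y) f(x).
C = (P_XY / P_Y[None, :]).T    # C[y, x] = P_{X|Y}(x|y)
# Reverse operator (9): (C* g)(x) = sum_y P_{Y|X}(y|x) g(y).
Cs = P_XY / P_X[:, None]       # Cs[x, y] = P_{Y|X}(y|x)

# Adjointness from Proposition 1: <C f, g>_{P_Y} = <f, C* g>_{P_X}.
rng = np.random.default_rng(0)
f, g = rng.standard_normal(2), rng.standard_normal(2)
lhs = np.sum(P_Y * (C @ f) * g)
rhs = np.sum(P_X * f * (Cs @ g))

# Conjugating by square roots of the marginals turns the weighted-space SVD
# into an ordinary matrix SVD.
B = np.diag(np.sqrt(P_Y)) @ C @ np.diag(1.0 / np.sqrt(P_X))
svals = np.linalg.svd(B, compute_uv=False)
print(abs(lhs - rhs), svals)  # svals[0] = 1 (Prop. 1); svals[1] = rho(X;Y) = 1 - 2*delta
```

The right singular vector of B for the second singular value, divided pointwise by sqrt(P_X), is exactly the optimal perturbation direction φ₁ discussed below (17).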
From the Markov relation U → X → Y, we have:

∀u ∈ U, P_{Y|U=u} = P_Y (1_Y + ε C(φ_u))   (16)

where the conditional expectation operator C : L²(X, P_X) → L²(Y, P_Y) was defined in (8). Using (15) and (16), we can locally approximate the mutual information terms in (14) via the result in Corollary 9 of Appendix B. Neglecting all o(ε²) terms, this produces the linear information coupling problem:

max_{P_U, {φ_u : u ∈ U}} Σ_{u ∈ U} P_U(u) ‖C(φ_u)‖²_{P_Y}   (17)

where we maximize over all P_U and {φ_u ∈ L²(X, P_X) : u ∈ U} subject to the constraints: E[φ_u(X)] = 0 for every u ∈ U, Σ_{u ∈ U} P_U(u) ‖φ_u‖²_{P_X} ≤ 1, and Σ_{u ∈ U} P_U(u) P_X φ_u = 0 (which ensures that the marginal pmf of X is fixed at P_X). It is straightforward to verify that problem (17) can be solved by setting U = {−1, 1} with P_U(−1) = P_U(1) = 1/2 (i.e. U ∼ Rademacher), letting φ_1 be a unit norm right singular vector of C corresponding to its second largest singular value, and setting φ_{−1} = −φ_1. Hence, the SVD of a conditional expectation operator C solves the linear information coupling problem by identifying the optimal perturbations as right singular vectors of C. Moreover, problem (17) is actually a single letter case of a more general multi-letter problem, which can be solved via single letterization using tensorization properties of the SVD [14]. The authors of [14] exploit this tensorization to study questions in network information theory.

E. Outline

Having illustrated the utility of SVDs of conditional expectation operators, we briefly outline the remaining discussion. In section II, we will state the polynomial SVDs of three source-channel models in key theorems. In section III, we will present the proofs of these results via a useful lemma.

II. MAIN RESULTS

In this section, we present our main results. Informally, we show that:

Conditional expectation operators corresponding to every NEFQVF channel and its conjugate prior source, such that all moments of the marginal distributions exist and are finite, have orthonormal polynomial singular vectors.

It is straightforward to verify that the moments of the output marginal distributions corresponding to the gamma, negative binomial, and generalized hyperbolic secant NEFQVFs and their conjugate priors do not always exist, and are sometimes infinite. So, the Hilbert space L²(Y, P_Y) does not have an orthonormal basis of polynomials for these joint distributions, and we cannot hope for the singular vectors of C to be orthonormal polynomials. Hence, we will establish three main results in this paper corresponding to the Poisson, binomial, and Gaussian NEFQVFs. These results are outlined in the ensuing subsections.

A. The Laguerre SVD

Recalling the setup in subsection I-B, let X = (0, ∞) and Y = N ≜ {0, 1, 2, ...}, and let λ be the Lebesgue measure and µ be the counting measure. For our first result, the channel conditional pmfs {P_{Y|X=x} ∼ Poisson(x) : x ∈ (0, ∞)} are the NEFQVF of Poisson distributions:

∀x ∈ (0, ∞), ∀y ∈ N, P_{Y|X}(y|x) = x^y e^{−x} / y!   (18)

where x ∈ (0, ∞) is the rate (or expectation) parameter of the Poisson distribution. We remark that the Poisson channel is a widely used model in optical communications where X represents the intensity of transmitted light and Y represents the number of photons hitting a direct-detection receiver; see [20] and the references therein. The corresponding conjugate prior family consists of gamma distributions, and we assume that the source pdf is P_X ∼ gamma(α, β):

∀x ∈ (0, ∞), P_X(x) = β^α x^{α−1} e^{−βx} / Γ(α)   (19)

where α ∈ (0, ∞) is the shape parameter, β ∈ (0, ∞) is the rate parameter, and the gamma function, Γ : (0, ∞) → R, is:

Γ(z) ≜ ∫_0^∞ x^{z−1} e^{−x} dx.   (20)

Note that when α ∈ Z⁺ ≜ {1, 2, 3, ...}, the gamma distribution specializes to an Erlang distribution. The posterior pdfs {P_{X|Y=y} ∼ gamma(α + y, β + 1) : y ∈ N} are also gamma distributions as we used a conjugate prior. Finally, the output marginal pmf P_Y ∼ negative-binomial(α, p = 1/(β+1)) is a negative binomial distribution:

∀y ∈ N, P_Y(y) = (Γ(α + y) / (Γ(α) y!)) (1/(β+1))^y (β/(β+1))^α   (21)

where p = 1/(β+1) ∈ (0, 1) is the success probability parameter and α ∈ (0, ∞) is the number of failures parameter. When α ∈ Z⁺, the negative binomial random variable is the sum of α independent geometric random variables, and models the number of successes in a Bernoulli process until α failures.

The Hilbert space L²((0, ∞), P_X) has an orthonormal basis of generalized Laguerre polynomials. In particular, the generalized Laguerre polynomial with degree k ∈ N, denoted L_k^{(α,β)} : (0, ∞) → R, is defined by the Rodrigues formula:

L_k^{(α,β)}(x) ≜ sqrt(Γ(α) / (Γ(k + α) k!)) x^{1−α} e^{βx} (d^k/dx^k)(x^{k+α−1} e^{−βx})   (22)

with the parameters α, β ∈ (0, ∞). These polynomials satisfy the orthogonality relation:

∀j, k ∈ N, E[L_j^{(α,β)}(X) L_k^{(α,β)}(X)] = δ_{jk}   (23)

with respect to the gamma pdf, where δ_{jk} is the Kronecker delta function that equals 1 if j = k and equals 0 otherwise.

The Hilbert space L²(N, P_Y) has a unique (up to arbitrary sign changes) orthonormal polynomial basis of Meixner polynomials. The Meixner polynomial with degree k ∈ N, denoted M_k^{(s,p)} : N → R, is parametrized by s ∈ (0, ∞) and p ∈ (0, 1). These polynomials satisfy the orthogonality relation:

Σ_{y=0}^∞ M_j^{(s,p)}(y) M_k^{(s,p)}(y) (Γ(s + y) / (Γ(s) y!)) p^y (1 − p)^s = δ_{jk}   (24)

for every j, k ∈ N, with respect to the negative binomial distribution with parameters s ∈ (0, ∞) and p ∈ (0, 1). All our definitions of orthogonal polynomials are derived from [21]–[23], and we use these sources in subsequent sections without tediously referring back to them.

The next theorem presents the orthogonal polynomial SVD of the conditional expectation operator C corresponding to the gamma source and Poisson channel model.

Theorem 2 (Laguerre SVD). For the Poisson channel with gamma source, as presented in (18) and (19), the conditional expectation operator, C : L²((0, ∞), P_X) → L²(N, P_Y), has SVD:

∀k ∈ N, C(L_k^{(α,β)}) = σ_k M_k^{(α, 1/(β+1))}

where {σ_k ∈ (0, 1] : k ∈ N} are the singular values such that σ_0 = 1 and lim_{k→∞} σ_k = 0.

The α = 1 case of Theorem 2, where P_X is an exponential distribution and P_Y is a geometric distribution, is also presented in [12], and the corresponding singular values are calculated to be:

∀k ∈ N, σ_k = (1/(β+1))^{k/2}.   (25)

Note that when α = 1, the right singular vectors of C are known as Laguerre polynomials. Although the left singular vectors of C are Meixner polynomials, we refer to this result as the "Laguerre SVD" because Meixner polynomials behave like discrete Laguerre polynomials. Indeed, the negative binomial distribution is the discrete analog of the gamma distribution (much like the geometric distribution is the discrete analog of the exponential distribution).
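Theorem 2 can be spot-checked exactly at degree k = 1, since degree-1 orthonormal polynomials are affine and the gamma posterior mean is explicit. The parameter values below are our own illustrative choices (a and b play the roles of α and β):

```python
from math import sqrt

# Illustrative check of Theorem 2 at k = 1 (exact, no sampling involved).
a, b = 2.5, 1.5   # gamma source parameters alpha, beta (assumed example values)

def L1(x):
    # Orthonormal degree-1 generalized Laguerre polynomial under gamma(a, b),
    # from the Rodrigues formula (22): L1(x) = (a - b*x)/sqrt(a).
    return (a - b * x) / sqrt(a)

def M1(y):
    # Orthonormal degree-1 polynomial under the negative binomial output
    # marginal (21), which has mean a/b and variance a*(b + 1)/b**2.
    return (b * y - a) / sqrt(a * (b + 1))

# Since L1 is affine, (C L1)(y) = E[L1(X) | Y = y] = L1(E[X | Y = y]),
# and the gamma(a + y, b + 1) posterior gives E[X | Y = y] = (a + y)/(b + 1).
ratios = [L1((a + y) / (b + 1.0)) / M1(y) for y in range(6)]
print(ratios)  # a constant sequence: C maps L1 to a scalar multiple of M1

# The magnitude of the common ratio is the second singular value; here it
# equals (1/(b + 1))**0.5, matching the k = 1 value in (25) (the sign of an
# orthonormal polynomial is an arbitrary convention).
sigma1 = abs(ratios[0])
```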

B. The Jacobi SVD

For our second result, let X = (0, 1) and Y = [n] ≜ {0, ..., n}, and let λ be the Lebesgue measure and µ be the counting measure in subsection I-B. The channel conditional pmfs {P_{Y|X=x} ∼ binomial(n, x) : x ∈ (0, 1)} are the NEFQVF of binomial distributions:

∀x ∈ (0, 1), ∀y ∈ [n], P_{Y|X}(y|x) = (n choose y) x^y (1 − x)^{n−y}   (26)

where x ∈ (0, 1) is the success probability parameter and n ∈ Z⁺ is the fixed number of Bernoulli trials of the binomial distribution. The capacity of this "biased coin channel" model has been studied in the literature [24]. The corresponding conjugate prior family consists of beta distributions, and we assume that the source pdf is P_X ∼ beta(α, β):

∀x ∈ (0, 1), P_X(x) = x^{α−1} (1 − x)^{β−1} / B(α, β)   (27)

where α, β ∈ (0, ∞) are shape parameters, and the beta function, B : (0, ∞)² → R, is defined as:

B(z₁, z₂) ≜ ∫_0^1 x^{z₁−1} (1 − x)^{z₂−1} dx = Γ(z₁)Γ(z₂) / Γ(z₁ + z₂).   (28)

The posterior pdfs {P_{X|Y=y} ∼ beta(α + y, β + n − y) : y ∈ [n]} are also beta distributions since we used a conjugate prior. Lastly, the output marginal pmf P_Y ∼ beta-binomial(n, α, β) is the beta-binomial distribution:

∀y ∈ [n], P_Y(y) = (n choose y) B(α + y, β + n − y) / B(α, β)   (29)

with parameters n ∈ Z⁺, α ∈ (0, ∞), and β ∈ (0, ∞).

The Hilbert space L²((0, 1), P_X) has an orthonormal basis of Jacobi polynomials. In particular, the Jacobi polynomial with degree k ∈ N, denoted J_k^{(α,β)} : (0, 1) → R, is defined by the Rodrigues formula:

J_k^{(α,β)}(x) ≜ x^{1−α} (1 − x)^{1−β} (d^k/dx^k)(x^{k+α−1} (1 − x)^{k+β−1}) · (−1)^k sqrt((2k + α + β − 1) B(α, β) Γ(k + α + β − 1) / (Γ(k + α) Γ(k + β) k!))   (30)

with the parameters α, β ∈ (0, ∞). These polynomials satisfy the orthogonality relation:

∀j, k ∈ N, E[J_j^{(α,β)}(X) J_k^{(α,β)}(X)] = δ_{jk}   (31)

with respect to the beta distribution. They also generalize several other orthogonal polynomial families such as the Legendre and Chebyshev polynomials.

The Hilbert space L²([n], P_Y) has a unique (up to arbitrary sign changes) orthonormal polynomial basis of Hahn polynomials. The Hahn polynomial with degree k ∈ [n], denoted Q_k^{(α,β)} : [n] → R, is parametrized by α, β ∈ (0, ∞). These polynomials satisfy the orthogonality relation:

∀j, k ∈ [n], E[Q_j^{(α,β)}(Y) Q_k^{(α,β)}(Y)] = δ_{jk}   (32)

with respect to the beta-binomial distribution. The Hahn polynomials also generalize several other families of orthogonal polynomials in the limit, including the Jacobi and Meixner polynomials defined earlier, and the Krawtchouk and Charlier polynomials which are orthogonal with respect to the binomial and Poisson distributions, respectively [21].

The following theorem presents the orthogonal polynomial SVD of the conditional expectation operator C corresponding to the beta source and binomial channel model.

Theorem 3 (Jacobi SVD). For the binomial channel with beta source, as presented in (26) and (27), the conditional expectation operator, C : L²((0, 1), P_X) → L²([n], P_Y), has SVD:

∀k ∈ [n], C(J_k^{(α,β)}) = σ_k Q_k^{(α,β)}
∀k ∈ N \ [n], C(J_k^{(α,β)}) = 0

where {σ_k ∈ (0, 1] : k ∈ [n]} are the singular values such that σ_0 = 1.
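Theorem 3 admits the same kind of exact degree-1 spot-check via the beta posterior and the beta-binomial pmf (29). The parameters below are illustrative, and the closed form sqrt(n/(n + a + b)) for the common ratio is an empirical observation of this sketch rather than a formula from the text:

```python
from math import lgamma, exp, sqrt

# Illustrative check of Theorem 3 at k = 1 (exact, no sampling involved).
n, a, b = 7, 2.3, 3.1   # trials and beta source parameters (assumed example values)

def pY(y):
    # Beta-binomial pmf (29), computed stably via log-gamma.
    log_choose = lgamma(n + 1.0) - lgamma(y + 1.0) - lgamma(n - y + 1.0)
    log_beta_num = lgamma(a + y) + lgamma(b + n - y) - lgamma(a + b + n)
    log_beta_den = lgamma(a) + lgamma(b) - lgamma(a + b)
    return exp(log_choose + log_beta_num - log_beta_den)

probs = [pY(y) for y in range(n + 1)]
EY = sum(y * p for y, p in enumerate(probs))
VarY = sum((y - EY) ** 2 * p for y, p in enumerate(probs))

# Orthonormal degree-1 polynomials for the beta(a, b) input and the output marginal.
mX = a / (a + b)
sX = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
J1 = lambda x: (x - mX) / sX
Q1 = lambda y: (y - EY) / sqrt(VarY)

# J1 is affine, so (C J1)(y) = J1(E[X | Y = y]) with the beta(a + y, b + n - y)
# posterior mean (a + y)/(a + b + n).
ratios = [J1((a + y) / (a + b + n)) / Q1(y) for y in range(n + 1)]
print(ratios)  # constant: C maps J1 to a scalar multiple of Q1
sigma1 = ratios[0]
```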

When α = β = 1, P_X is the uniform pdf and P_Y is the uniform pmf. The corresponding orthonormal polynomials are known as Legendre polynomials and discrete Chebyshev or Gram polynomials respectively, and are analogs of each other. For this reason, and the fact that Jacobi polynomials can be obtained as limits of Hahn polynomials, we refer to the SVD in Theorem 3 as the "Jacobi SVD."

C. The Hermite SVD

For our final result, let X = Y = R, and let λ and µ be the Lebesgue measure in subsection I-B. The channel conditional pdfs {P_{Y|X=x} ∼ N(x, ν) : x ∈ R} are the NEFQVF of Gaussian distributions:

∀x, y ∈ R, P_{Y|X}(y|x) = (1/√(2πν)) exp(−(y − x)²/(2ν))   (33)

where x ∈ R is the expectation parameter of the Gaussian distribution and ν ∈ (0, ∞) is some fixed variance. We can construe (33) as the well-known single letter additive white Gaussian noise (AWGN) channel:

Y = X + W, X ⊥⊥ W ∼ N(0, ν)   (34)

where the input X is independent of the Gaussian noise W. The corresponding conjugate prior family consists of Gaussian distributions, and we assume that the source pdf is P_X ∼ N(r, p):

∀x ∈ R, P_X(x) = (1/√(2πp)) exp(−(x − r)²/(2p))   (35)

where r ∈ R is the expectation parameter, and p ∈ (0, ∞) is the variance parameter. The posterior pdfs {P_{X|Y=y} ∼ N((py + νr)/(p + ν), pν/(p + ν)) : y ∈ R} are also Gaussian distributions as we used a conjugate prior. Finally, the output marginal pdf P_Y ∼ N(r, p + ν) is also a Gaussian distribution.

The Hilbert spaces L²(R, P_X) and L²(R, P_Y) have orthonormal bases of Hermite polynomials. In particular, the Hermite polynomial with degree k ∈ N, denoted H_k^{(r,τ)} : R → R, is defined by the Rodrigues formula:

H_k^{(r,τ)}(x) ≜ (−1)^k sqrt(τ^k / k!) e^{(x−r)²/(2τ)} (d^k/dx^k) e^{−(x−r)²/(2τ)}   (36)

with the parameters r ∈ R and τ ∈ (0, ∞). These polynomials satisfy the orthogonality relation:

∀j, k ∈ N, ∫_{−∞}^{∞} H_j^{(r,τ)}(x) H_k^{(r,τ)}(x) (1/√(2πτ)) e^{−(x−r)²/(2τ)} dx = δ_{jk}   (37)

with respect to the Gaussian distribution N(r, τ).

The ensuing theorem presents the orthogonal polynomial SVD of the conditional expectation operator C corresponding to the Gaussian source-channel model.

Theorem 4 (Hermite SVD). For the Gaussian channel with Gaussian source, as presented in (33) and (35), the conditional expectation operator, C : L²(R, P_X) → L²(R, P_Y), has SVD:

∀k ∈ N, C(H_k^{(r,p)}) = σ_k H_k^{(r,p+ν)}

where {σ_k ∈ (0, 1] : k ∈ N} are the singular values such that σ_0 = 1 and lim_{k→∞} σ_k = 0.

This result is known in the literature. For example, Theorem 4 was derived in [3] for the r = 0 case, where the authors also computed the singular values to be:

∀k ∈ N, σ_k = (p/(p + ν))^{k/2}.   (38)

Furthermore, the authors of [3] also remarked upon the possible relation between Theorem 4 and the classical theory of the Ornstein-Uhlenbeck process. We include Theorem 4 here for completeness, and provide an alternative proof of it. Theorems 2 and 3 generalize Theorem 4 by establishing a "nice" class of source-channel models whose conditional expectation operators have orthogonal polynomial singular vectors.
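As with the previous theorems, Theorem 4 and the singular values (38) can be verified exactly at low degrees from the Gaussian posterior. The following sketch (with illustrative values for r, p, ν) checks degrees k = 1 and k = 2:

```python
from math import sqrt

# Illustrative exact check of Theorem 4 and (38) for k = 1, 2.
r, p, nu = 0.7, 2.0, 3.0   # assumed source mean/variance and noise variance

def H(k, x, tau):
    # Orthonormal Hermite polynomials of degrees 1 and 2 w.r.t. N(r, tau), cf. (36).
    z = x - r
    return z / sqrt(tau) if k == 1 else (z * z - tau) / (tau * sqrt(2.0))

post_var = p * nu / (p + nu)   # posterior variance of X given Y
errs = []
for k in (1, 2):
    sigma_k = (p / (p + nu)) ** (k / 2.0)   # singular values from (38)
    for y in (-1.0, 0.3, 2.5):
        post_mean = (p * y + nu * r) / (p + nu)
        if k == 1:
            lhs = H(1, post_mean, p)   # H1 is affine, so E[H1(X)|Y] = H1(E[X|Y])
        else:
            # E[H2(X)|Y] uses E[(X - r)^2 | Y] = post_var + (post_mean - r)^2.
            lhs = (post_var + (post_mean - r) ** 2 - p) / (p * sqrt(2.0))
        errs.append(abs(lhs - sigma_k * H(k, y, p + nu)))
print(max(errs))  # ~ 0: C maps H_k^{(r,p)} to sigma_k * H_k^{(r,p+nu)}
```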
D. Related Results in the Literature

The general problem of analyzing when the singular vectors (or eigenvectors) of certain linear operators are orthogonal polynomials has been widely studied in mathematics. Comprehensive resources on the general theory of orthogonal polynomials include [21]–[23]. In particular, it is well-known that the classical orthogonal polynomials (Hermite, Laguerre, and Jacobi) arise as eigenfunctions of certain second order (Sturm-Liouville type of) differential operators. Both [23] and [25] meticulously expound various relationships between orthogonal polynomials and differential or integral linear operators.

In the setting of probability theory, there are deep ties between orthogonal polynomials and certain Markov semigroups. Under regularity conditions, the conditional expectation operators of a semigroup are completely characterized by an infinitesimal generator, because they form the unique solution to the heat equation defined by their generator (due to the Hille-Yosida theorem) [26]. When the generator is a diffusion operator (which is a kind of second order differential operator), the orthogonal polynomials with respect to the invariant measure of the semigroup turn out to be eigenfunctions of the generator, or equivalently, the conditional expectation operators. Moreover, there are only three families of orthogonal polynomials (up to scaling and translations) that are eigenfunctions of diffusion operators: the Hermite, Laguerre, and Jacobi polynomials [27]. The three corresponding diffusion operators are precisely the aforementioned second order differential operators with classical orthogonal polynomial eigenfunctions. In particular, the Markov semigroup in the Hermite case is the well-known Ornstein-Uhlenbeck semigroup. We refer readers to [26], [27], and the references therein for detailed expositions of these ideas.

Our results are closer in spirit to a line of work in statistics initiated by Lancaster [28], [29]. Given marginal distributions P_X and P_Y, and sequences of orthonormal functions, {f_j ∈ L²(X, P_X)} and {g_k ∈ L²(Y, P_Y)}, a bivariate distribution P_{X,Y} is called a Lancaster distribution (with respect to P_X, P_Y, {f_j}, and {g_k}) if for every j, k:

E[f_j(X) g_k(Y)] = σ_k δ_{jk}   (39)

for some Lancaster sequence of non-negative correlations {σ_k}. In [28], Lancaster proved that if P_{X,Y} is absolutely continuous with respect to the product distribution P_X × P_Y, and has finite "mean square contingency" (i.e. the χ²-divergence χ²(P_{X,Y} || P_X × P_Y) = E_{P_X×P_Y}[τ²(X, Y)] − 1 is finite), then there exist orthonormal bases, {f_j ∈ L²(X, P_X)} and {g_k ∈ L²(Y, P_Y)}, such that P_{X,Y} is a Lancaster distribution, and the following expansion holds:

τ(x, y) = Σ_k σ_k f_k(x) g_k(y)   (40)

where τ(x, y) denotes the Radon-Nikodym derivative of P_{X,Y} with respect to P_X × P_Y. It is straightforward to see that an expansion of the form (40) captures the SVD structure of the conditional expectation operators associated with P_{X,Y}. Explicit expansions of the form (40) in terms of orthogonal polynomials have since been derived for various bivariate distributions. For instance, orthogonal polynomial expansions for bivariate distributions that are generated additively from three independent NEFQVF random variables were established in [30]. We refer readers to [30], [31], and the references therein for further details on such classical work. More contemporary results on Lancaster distributions are presented in [32], [33], and the references therein. As explained in [33], one direction of research is to find the extremal Lancaster sequences corresponding to the extremal points of the compact, convex set of Lancaster distributions corresponding to certain marginal distributions and their orthogonal polynomial sequences.

In contrast to the aforementioned classical examples of orthogonal polynomial eigenfunctions, the conditional expectation operators that we derive SVDs for are defined by NEFQVF channels and conjugate prior sources. As we mentioned earlier, the Hermite SVD result in Theorem 4 can be related to results on the Ornstein-Uhlenbeck semigroup since a Gaussian NEFQVF has a Gaussian conjugate prior family. However, we emphasize that the Laguerre and Jacobi SVDs in Theorems 2 and 3 are distinct from classical results (in the contexts of differential equations, integral equations, Markov semigroups, or Lancaster distributions). To our knowledge, these classical results do not analyze the setting of NEFQVF channels and conjugate prior sources.

We first derive an auxiliary result that provides simple necessary and sufficient conditions for conditional expectation operators to have orthonormal polynomial singular vectors. We refer readers to [36]–[38] for the relevant functional analysis background.

Recall that we are given the Hilbert space L²(X, P_X) with dimension:

dim(L²(X, P_X)) = |X̄| ∈ Z⁺ ∪ {+∞}   (41)

i.e. L²(X, P_X) is infinite dimensional when |X̄| = +∞, and finite dimensional when |X̄| < +∞. Since L²(X, P_X) is separable, it equivalently has a countable complete orthonormal (Schauder) basis. We will assume that L²(X, P_X) has a unique (up to arbitrary sign changes) orthonormal basis of polynomials {p_k ∈ L²(X, P_X) : k < |X̄|}, where p_k is an orthonormal polynomial with degree k ∈ N. Typically, such orthonormal polynomials can be constructed by applying the Gram-Schmidt algorithm to the monomials {1, x, x², ...}. Note that the finiteness of the moment generating function (MGF) of X on an interval containing zero guarantees the existence of an orthonormal polynomial basis of L²(X, P_X) (since it ensures that all moments of X exist and are finite), and the positive definiteness of ⟨·,·⟩_{P_X} with respect to the subspace of all polynomials guarantees the uniqueness of this basis. Furthermore, this discussion holds mutatis mutandis for the Hilbert space L²(Y, P_Y), and we also assume that it has a unique orthonormal basis of polynomials {q_k ∈ L²(Y, P_Y) : k < |Ȳ|}, where q_k is an orthonormal polynomial with degree k. The next definition presents a pertinent property of bounded linear operators between the Hilbert spaces L²(X, P_X) and L²(Y, P_Y).

Definition 4 (Degree Preservation). A bounded linear operator T : L²(X, P_X) → L²(Y, P_Y) is degree preserving if for any polynomial p ∈ L²(X, P_X) with degree k ∈ N, T(p) ∈ L²(Y, P_Y) is also a polynomial with degree at most k. T is strictly degree preserving if:
• Case |X̄| ≤ |Ȳ|: For any polynomial p ∈ L²(X, P_X)
On the other hand, we would like to with degree k < |X |, T (p) ∈ L (Y, PY ) is also a acknowledge that results similar to ours on spectral decompo- polynomial with degree exactly k. sitions of Markov chains have been independently derived in • Case |X | > |Y| (⇒ |Y| < +∞): For any polynomial p ∈ 2 2 [34] to analyze the convergence rate of . L (X , PX ) with degree k < |Y|, T (p) ∈ L (Y, PY ) is Finally, it is worth mentioning that although we refer to also a polynomial with degree exactly k, and for any 2 Morris’ unified theory of NEFQVFs in [9] and [10] in section polynomial p ∈ L (X , PX ) with degree |Y| ≤ k < |X |, 2 I, the importance of NEFQVFs was recognized much earlier T (p) ∈ L (Y, PY ) is also a polynomial with degree at by Meixner. Indeed, Meixner characterized the orthogonal most |Y| − 1. polynomial families corresponding to NEFQVFs as precisely In Definition4, we use the convention that ∞ ≤ ∞ those that have generating functions with a certain tractable is true, and ∞ < ∞ is false. We also remark that when form in [35]. (Since we only use orthogonal polynomials X = Y, this definition implies that polynomials form an corresponding to and rather than those corresponding PX PY invariant subspace of a degree preserving operator T . The next to NEFQVFs, Meixner’s results are not directly of relevance proposition presents our auxiliary result using this definition. to us.) Proposition 5 (Orthogonal Polynomial SVD). Let T : III.PROOFSOF MAIN RESULTS 2 2 L (X , PX ) → L (Y, PY ) be a compact linear operator, and ∗ 2 2 In this section, we will prove our main results under the T : L (Y, PY ) → L (X , PX ) be its unique adjoint operator. conditions stated in subsection I-B. To this end, we will Then, T and T ∗ are strictly degree preserving if and only if 8 IEEE TRANSACTIONS ON INFORMATION THEORY

T has SVD:

    ∀k < min{|X|, |Y|}, T(pk) = βk qk
    |X| > |Y| ⇒ ∀|Y| ≤ k < |X|, T(pk) = 0

where {βk ∈ (0, ∞) : k < min{|X|, |Y|}} are singular values such that lim_{k→∞} βk = 0 when min{|X|, |Y|} = +∞.

Proof. We first prove the forward direction. Since T is a compact linear operator, its adjoint T* is also compact by Schauder's theorem [36], [37]. Hence, the (Gramian) operator T*T is self-adjoint, positive, and compact since the composition of compact operators is compact. Moreover, since T and T* are strictly degree preserving, T*T is degree preserving; in fact, T*T is strictly degree preserving when |X| ≤ |Y|.

Using the spectral theorem for compact self-adjoint operators [36], T*T has a countable orthonormal eigenbasis {ri ∈ L²(X, PX) : i ∈ N, i < |X|}:

    ∀i < |X|, T*T(ri) = αi ri

where {αi ∈ R⁺ : i < |X|} are the non-negative eigenvalues (since T*T is positive) such that lim_{i→∞} αi = 0 when |X| = +∞. We will prove by strong induction that these eigenfunctions are orthonormal polynomials.

The first eigenfunction of T*T must be the constant function r0 = p0 = 1_X since T*T is degree preserving. Assume that the first k + 1 eigenfunctions are orthonormal polynomials: ri = pi for i ∈ {0, ..., k} (inductive hypothesis). Then, since pk+1 is orthogonal to span(r0, ..., rk) = span(p0, ..., pk), we have:

    pk+1 = Σ_{k+1 ≤ j < |X|} ⟨pk+1, rj⟩_PX rj.

When |X| = +∞, this equality holds in the sense that the partial sums converge to pk+1 in L²(X, PX)-norm. Applying T*T to both sides and using the continuity (or equivalently, the boundedness) of T*T, we get:

    T*T(pk+1) = Σ_{k+1 ≤ j < |X|} αj ⟨pk+1, rj⟩_PX rj

which also holds in the L²(X, PX)-norm sense when |X| = +∞. Hence, T*T(pk+1) is orthogonal to span(p0, ..., pk) using the continuity of the inner product, and it is a polynomial with degree at most k + 1 as T*T is degree preserving. This implies that:

    T*T(pk+1) = αk+1 pk+1

where αk+1 is possibly zero, which means that rk+1 = pk+1 (without loss of generality). By strong induction, {pk ∈ L²(X, PX) : k < |X|} are the eigenfunctions of T*T:

    ∀k < |X|, T*T(pk) = αk pk    (42)

where for all k < min{|X|, |Y|}, αk > 0 because both T and T* do not reduce the degrees of input polynomials with degrees less than min{|X|, |Y|}.

Now observe (by definition of the adjoint operator) that:

    ∀j, k < |X|, ⟨T(pj), T(pk)⟩_PY = ⟨pj, T*T(pk)⟩_PX = αk ⟨pj, pk⟩_PX = αk δjk.

This means that {T(pk) : k < min{|X|, |Y|}} are scaled versions of the orthonormal polynomials in L²(Y, PY) since T is strictly degree preserving:

    ∀k < min{|X|, |Y|}, T(pk) = √αk qk

where the sign of each orthonormal polynomial qk is chosen to keep √αk > 0. On the other hand, if |X| > |Y|, then for any |Y| ≤ k < |X|, T(pk) is a polynomial with degree at most |Y| − 1 and is orthogonal to every qj with j < |Y|. This implies that T(pk) = 0 as {qj ∈ L²(Y, PY) : j < |Y|} is an orthonormal basis of L²(Y, PY). Moreover, αk = 0 for every |Y| ≤ k < |X|. Therefore, we have:

    ∀k < min{|X|, |Y|}, T(pk) = √αk qk
    |X| > |Y| ⇒ ∀|Y| ≤ k < |X|, T(pk) = 0

which is the SVD of T with singular values, βk = √αk > 0 for every k < min{|X|, |Y|}, that satisfy lim_{k→∞} βk = 0 when min{|X|, |Y|} = +∞ (since lim_{k→∞} αk = 0). This completes the proof of the forward direction.

To prove the converse direction, notice that T having SVD:

    ∀k < min{|X|, |Y|}, T(pk) = βk qk
    |X| > |Y| ⇒ ∀|Y| ≤ k < |X|, T(pk) = 0

implies that T* has SVD:

    ∀k < min{|X|, |Y|}, T*(qk) = βk pk
    |Y| > |X| ⇒ ∀|X| ≤ k < |Y|, T*(qk) = 0.

This is an exercise in functional analysis. Since βk > 0 for any k < min{|X|, |Y|}, and any polynomial can be decomposed into a weighted sum of orthonormal polynomials, these SVDs imply that T and T* are strictly degree preserving. This completes the proof. ∎

We briefly make some remarks regarding Proposition 5. Firstly, the result continues to hold if PX or PY are relaxed to be non-probability measures that have unique orthonormal polynomial bases. Secondly, the singular values {βk ∈ (0, ∞) : k < min{|X|, |Y|}} of T must be computed on a case by case basis if desired. Thirdly, as mentioned in the proof, the signs of {pk ∈ L²(X, PX) : k < |X|} and {qk ∈ L²(Y, PY) : k < |Y|} are chosen to ensure that the singular values are non-negative. This convention also applies to Theorems 2, 3, 4, and Lemma 6 below. Fourthly, it is worth considering Proposition 5 when L²(X, PX) and L²(Y, PY) are finite dimensional, and isomorphic to R^|X| and R^|Y|, respectively. In this scenario, T and T* have finite rank, and are trivially compact operators that have SVDs. Moreover, every basis of a Euclidean space R^n (n = |X| or n = |Y|) corresponds to a basis of polynomials, where each polynomial has degree at most n − 1, by the unisolvence theorem. So, the singular vectors of T and T* will always be polynomials. The non-trivial aspect of Proposition 5 in this finite dimensional setting is that T and T* have orthonormal polynomial singular vector bases if and only if T and T* are strictly degree preserving. Lastly, it is worth noting that although the SVD result in Proposition 5 requires strict degree preservation, the spectral decomposition result in (42) only requires degree preservation as the proof illustrates.

The ensuing lemma is a straightforward corollary of Proposition 5, specializing it for conditional expectation operators.

Lemma 6 (Conditional Moment Condition). Suppose the conditional expectation operator C : L²(X, PX) → L²(Y, PY) is compact, and suppose |X| ≥ |Y| without loss of generality. Then, for every n ∈ N with n < |Y|, E[Y^n|X] is a polynomial in X with degree n and E[X^n|Y] is a polynomial in Y with degree n, and for every n ∈ N with |Y| ≤ n < |X|, E[X^n|Y] is a polynomial in Y with degree at most |Y| − 1, if and only if C has SVD:

    ∀k < |Y|, C(pk) = βk qk
    ∀|Y| ≤ k < |X|, C(pk) = 0

where {βk ∈ (0, 1] : k < |Y|} are singular values such that β0 = 1, and lim_{k→∞} βk = 0 when |Y| = +∞.

Proof. The conditional moment conditions of the lemma are equivalent to C and C* being strictly degree preserving. The SVD of C then follows from Proposition 5, where as before, the signs of the orthonormal polynomials are selected to keep the singular values positive. (When |X| = |Y|, no value of k satisfies the second line of the SVD, and it is vacuously true.) Furthermore, βk ≤ 1 for all k < |Y| because ||C||_op = 1 by Proposition 1, and β0 = 1 since C(1_X) = 1_Y. ∎

Lemma 6 provides an easily testable equivalent condition for a conditional expectation operator to have an orthonormal polynomial SVD; it holds with natural modifications when |X| ≤ |Y|. We must also verify that C is compact when using this lemma. A well-known sufficient condition that ensures that C (and C*) are compact is the Hilbert-Schmidt condition:

    ∫_X ∫_Y PX,Y(x, y)² / (PX(x) PY(y)) dµ(y) dλ(x) < +∞    (43)

where PX,Y, PX, and PY are the probability densities defined in subsection I-B. In functional analysis, this condition arises from the compactness of Hilbert-Schmidt operators, which are integral operators with square integrable kernels [36]. In statistics, it corresponds to the finite "mean square contingency" condition mentioned in subsection II-D; for example, it is mentioned after Assumption 5.2 in [2], and in the premise of Theorem 2 in [1]. In our ensuing proofs, we will not explicitly check for compactness of the operators for brevity.

A. Finite Alphabet Examples

Before proving our main results, we briefly provide two basic examples of polynomial SVDs in the finite alphabet case.

Example 1 (Uniform Source and Binary Symmetric Channel). Suppose X ∼ Bernoulli(1/2), and Y ∼ Bernoulli(1/2) is the output of passing X through a binary symmetric channel with crossover probability δ ∈ (0, 1/2). The orthonormal polynomials in L²({0, 1}, Bernoulli(1/2)) are p0 = (1, 1) and p1 = (1, −1), where pk = (pk(0), pk(1)) for k = 0, 1. It is straightforward to directly verify that the SVD of C is:

    C(p0) = p0 and C(p1) = (1 − 2δ) p1.    (44)

So, C and C* are strictly degree preserving by Lemma 6.

Example 2 (Uniform Source and Binary Erasure Channel). Suppose X = {0, 1}, Y = X ∪ {e} (where e is the erasure symbol), X ∼ Bernoulli(1/2), and Y is the output of passing X through a binary erasure channel with erasure probability ε ∈ (0, 1). In this case, the SVD of C is:

    C(p0) = g0 and C(p1) = √(1 − ε) g1    (45)

where the right singular vectors are given in Example 1, and the left singular vectors are the orthonormal vectors g0 = (1, 1, 1) and g1 = (1/√(1 − ε), −1/√(1 − ε), 0) in L²(Y, PY), where gk = (gk(0), gk(1), gk(e)) for k = 0, 1. Note that C has an orthonormal polynomial SVD if and only if g1 is a linear polynomial, which is true if and only if e = 1/2. Therefore, when e ∈ R and e ≠ 1/2, C and C* are not strictly degree preserving by Lemma 6 (although g1 is a non-linear polynomial).

B. Proof of Theorem 2: Laguerre SVD

Proof. First notice that given X = x ∈ (0, ∞), Y is Poisson distributed with rate x as shown in (18). This means that the cumulants of PY|X=x are all equal to x. Since the nth moment E[Y^n|X = x] for n ∈ N is a polynomial in the first n cumulants with degree n [39], E[Y^n|X] is a polynomial in X with degree n for every n ∈ N. (Note that this can also be proved directly by using induction on the derivatives of the MGF of a Poisson random variable.)

Next, we prove that the moments of a gamma distribution PX ∼ gamma(α, β) with α, β ∈ (0, ∞), shown in (19), are polynomials in α with the same degree. The MGF of X is:

    MX(s) ≜ E[e^{sX}] = (β/(β − s))^α for s < β, and +∞ for s ≥ β

and as β > 0, the MGF is finite on an open interval around s = 0. This means the moments of X are given by:

    E[X^n] = d^n/ds^n MX(s) |_{s=0}
           = d^n/ds^n (β/(β − s))^α |_{s=0}
           = [(β/(β − s))^α (1/(β − s)^n) ∏_{i=0}^{n−1} (α + i)] |_{s=0}
           = (1/β^n) ∏_{i=0}^{n−1} (α + i)

for every n ∈ N\{0}. Thus, for every n ∈ N, E[X^n] is a polynomial in α with degree n.

As mentioned earlier in subsection II-A, the posterior pdfs {PX|Y=y ∼ gamma(α + y, β + 1) : y ∈ N} are also gamma pdfs with updated parameters. Hence, for every n ∈ N, E[X^n|Y] is a polynomial in Y with degree n. Applying Lemma 6 completes the proof. ∎
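The first moment fact used in this proof, namely that E[Y^n|X = x] is a degree-n polynomial in the rate x for a Poisson channel, is easy to check numerically. The Python sketch below is our illustration and not part of the paper; the series truncation length and tolerances are arbitrary numerical choices. It compares a truncated series for E[Y^n] against the standard Touchard-polynomial expansion E[Y^n] = Σ_k S(n, k) x^k, where S(n, k) are Stirling numbers of the second kind:

```python
import math

def poisson_moment(n, x, terms=60):
    """Numerically compute E[Y^n] for Y ~ Poisson(x) by truncating the series."""
    return sum((k ** n) * math.exp(-x) * x ** k / math.factorial(k)
               for k in range(terms))

def stirling2(n, k):
    """Stirling numbers of the second kind via the standard recurrence."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def touchard(n, x):
    """Degree-n polynomial in the rate x: E[Y^n] = sum_k S(n, k) x^k."""
    return sum(stirling2(n, k) * x ** k for k in range(n + 1))

# The truncated series and the degree-n polynomial in x agree.
for n in range(5):
    for x in [0.5, 1.0, 3.0]:
        assert abs(poisson_moment(n, x) - touchard(n, x)) < 1e-8
```

The gamma half of the argument can be checked the same way: the closed form E[X^n] = ∏_{i=0}^{n−1}(α + i)/β^n derived above is visibly a degree-n polynomial in α.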

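The orthonormal polynomial bases assumed throughout Section III can, as noted there, be generated by applying Gram-Schmidt to the monomials {1, x, x², ...}. The following Python sketch is our illustration and not part of the paper; the integration grid, the truncation of (0, ∞) at x = 40, and the step size are arbitrary numerical choices. With the exponential density gamma(1, 1) as the weight, this procedure recovers the orthonormal Laguerre polynomials up to sign:

```python
import math

def inner(f, g, weight, xs, dx):
    """Discretized inner product <f, g> under the density 'weight' (midpoint rule)."""
    return sum(f(x) * g(x) * weight(x) for x in xs) * dx

weight = lambda x: math.exp(-x)              # exponential density, i.e. gamma(1, 1)
dx = 5e-3
xs = [dx * (i + 0.5) for i in range(8000)]   # midpoint grid on (0, 40)

# Gram-Schmidt on the monomials 1, x, x^2, x^3.
basis = []
for n in range(4):
    f = lambda x, n=n: x ** n
    for q in basis:
        c = inner(f, q, weight, xs, dx)
        f = lambda x, f=f, q=q, c=c: f(x) - c * q(x)
    norm = math.sqrt(inner(f, f, weight, xs, dx))
    basis.append(lambda x, f=f, norm=norm: f(x) / norm)

# Orthonormality holds by construction with respect to the discretized inner product.
for j in range(4):
    for k in range(4):
        assert abs(inner(basis[j], basis[k], weight, xs, dx) - float(j == k)) < 1e-6
```

For this weight the degree-1 element is ±(x − 1), i.e. the first Laguerre polynomial up to sign, so for instance |basis[1](0)| ≈ 1. The same construction applies to any source or output density whose moments are all finite, which is exactly the MGF condition discussed in Section III.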
C. Proof of Theorem 3: Jacobi SVD

Proof. First observe that given X = x ∈ (0, 1), PY|X=x ∼ binomial(n, x), which means that Y = Z1 + ··· + Zn where Z1, ..., Zn are conditionally i.i.d. Bernoulli(x) random variables (i.e. P(Zi = 1) = x and P(Zi = 0) = 1 − x for i = 1, ..., n). Hence, we have for any m ∈ N:

    E[Y^m|X = x] = E[(Σ_{i=1}^n Zi)^m | X = x]
                 = Σ_{0 ≤ k1,...,kn ≤ m, k1+···+kn = m} (m!/(k1! ··· kn!)) ∏_{i=1}^n E[Zi^{ki} | X = x]
                 = Σ_{0 ≤ k1,...,kn ≤ m, k1+···+kn = m} (m!/(k1! ··· kn!)) x^{N(k1,...,kn)}

where the second equality follows from the multinomial theorem, the third equality follows from the fact that the moments of the Bernoulli random variables are E[Zi^0|X = x] = 1 and for every m ∈ N\{0}, E[Zi^m|X = x] = x, and N(k1, ..., kn) denotes the number of non-zero ki. Since N(k1, ..., kn) ≤ min{m, n} and N(k1, ..., kn) = min{m, n} for at least one of the terms, we have that for every m ∈ [n], E[Y^m|X] is a polynomial in X with degree m.

Next, as mentioned in subsection II-B, we note that the posterior pdfs {PX|Y=y ∼ beta(α + y, β + n − y) : y ∈ [n]} are also beta pdfs with updated parameters. For any fixed Y = y ∈ [n] and any m ∈ N, we have:

    E[X^m|Y = y] = ∫_{(0,1)} x^m (x^{α+y−1} (1 − x)^{β+n−y−1} / B(α + y, β + n − y)) dλ(x)
                 = B(α + y + m, β + n − y) / B(α + y, β + n − y)
                 = Γ(α + y + m) Γ(α + β + n) / (Γ(α + y) Γ(α + β + n + m))
                 = ∏_{k=0}^{m−1} (α + y + k) / ∏_{k=0}^{m−1} (α + β + n + k)    (46)

where the first equality uses (27) with the updated parameters. Therefore, for every m ∈ [n], E[X^m|Y] is a polynomial in Y with degree m, and for every m ∈ N\[n], E[X^m|Y] is a polynomial in Y with degree at most n. The latter deduction seems counter-intuitive in light of (46), which seems to suggest that E[X^m|Y] is always a polynomial in Y with degree m. However, since the function y ↦ E[X^m|Y = y] is supported on a set of size n + 1, it can only be uniquely represented as a polynomial with degree at most n by the unisolvence theorem. Finally, employing Lemma 6 completes the proof. ∎

D. Proof of Theorem 4: Hermite SVD

As mentioned earlier, Theorem 4 was proved in [3] using the Appell sequence recurrence relation of Hermite polynomials. We now provide another proof using Lemma 6 here. Our proof uses the following lemma, cf. the line below equation (51) in [40, Chapter 7].

Lemma 7 (Translation Invariant Kernels). Fix u, v ∈ R\{0}, and a Borel measurable λ-integrable function φ : R → R such that ∫_R φ dλ = 1, where λ is the Lebesgue measure. If T : L²(R, PX) → L²(R, PY) is a bounded integral operator with translation invariant kernel φ:

    ∀f ∈ L²(R, PX), (T(f))(y) = ∫_R φ(uy + vx) f(x) dλ(x),

then T is strictly degree preserving.

Proof. We include a proof of this known result in Appendix C for completeness. ∎

Note that u = 1 and v = −1 corresponds to a difference kernel setting where T represents convolution with the function φ. We also remark that although L²(R, PX) and L²(R, PY) are the Hilbert spaces defined in subsection II-C, Lemma 7 also holds for other (appropriately generalized) Hilbert spaces.

Proof of Theorem 4. Observe that when PX,Y is defined by (33) and (35), both C and C* are integral operators with translation invariant kernels satisfying the conditions of Lemma 7. Indeed, for any f ∈ L²(R, PX) and any g ∈ L²(R, PY):

    (C(f))(y) = ∫_R f(x) PX|Y(x|y) dλ(x)
              = ∫_R (f(x)/√(2π pν/(p + ν))) exp(−(x − (py + νr)/(p + ν))² / (2 pν/(p + ν))) dλ(x),

    (C*(g))(x) = ∫_R g(y) PY|X(y|x) dλ(y)
               = ∫_R (g(y)/√(2πν)) exp(−(y − x)²/(2ν)) dλ(y).

Hence, C and C* are strictly degree preserving by Lemma 7. Finally, applying Lemma 6 completes the proof. ∎

IV. CONCLUSION

In this paper, we first illustrated the utility of SVDs of conditional expectation operators by citing examples from the literature such as maximal correlation functions (which are themselves singular vectors of conditional expectation operators) and linear information coupling problems (which are solved by SVDs of conditional expectation operators). We then proved that conditional expectation operators corresponding to NEFQVF channels and conjugate prior sources, where all marginal moments exist and are finite, have orthonormal polynomial SVDs. In particular, the Gaussian source and Gaussian channel produce Hermite polynomial singular vectors, the gamma source and Poisson channel produce generalized Laguerre and Meixner polynomial singular vectors, and the beta source and binomial channel produce Jacobi and Hahn polynomial singular vectors. To establish these results, we verified that the corresponding conditional expectation operators and their adjoint operators are strictly degree preserving, which we showed is equivalent to having orthonormal polynomial
SVDs. This equivalence between strict degree preservation and orthonormal polynomial SVDs may also be useful in future for deriving orthonormal polynomial SVDs for other source-channel models.

APPENDIX A
PROOF OF PROPOSITION 1

Proof. We prove this for completeness. The maps C and C* are clearly linear since they are defined using expectations. To show that they are bounded, it suffices to prove that they have operator norms equal to unity (and this also verifies that their codomains are indeed Hilbert spaces). We now prove that:

    ||C||_op ≜ sup_{h ∈ L²(X, PX)\{0}} ||C(h)||_PY / ||h||_PX = 1

where 0 denotes the zero function. Let 1_S denote the everywhere unity function on the set S ⊆ R: 1_S : S → R, 1_S(x) = 1. Observe that 1_X ∈ L²(X, PX) (because ||1_X||²_PX = 1), and using (8), C(1_X) = 1_Y ∈ L²(Y, PY) (because ||1_Y||²_PY = 1). Hence, ||C||_op ≥ 1. Now, for any h ∈ L²(X, PX), we have:

    ||C(h)||²_PY = E[E[h(X)|Y]²] ≤ E[E[h²(X)|Y]] = ||h||²_PX

using (8), conditional Jensen's inequality, and the tower property and (7), respectively. Thus, ||C||_op = 1 and ||C*||_op = 1, where the latter follows from an analogous argument.

Since C and C* are bounded, they have unique adjoint operators using the Riesz representation theorem [36], [38]. Note that for every f ∈ L²(X, PX) and every g ∈ L²(Y, PY), we have:

    ⟨C(f), g⟩_PY = E[E[f(X)|Y] g(Y)] = E[f(X) E[g(Y)|X]] = ⟨f, C*(g)⟩_PX

where the second equality follows from applying the tower property to E[f(X) g(Y)]. Hence, C* is the adjoint operator of C (as is implicitly suggested by the notation). ∎

APPENDIX B
LOCAL APPROXIMATIONS OF f-DIVERGENCES

In this appendix, we locally approximate f-divergences satisfying some regularity conditions, and specialize the approximation for Kullback-Leibler (KL) divergence. The f-divergences are a class of divergence measures over distributions that were independently introduced by Csiszár in [41], [42], and by Ali and Silvey in [43]. Let f : (0, ∞) → R be a convex function such that f(1) = 0. Given two probability densities QX and PX with respect to λ on (X, B(X)) (recall the setup in subsection I-B), the f-divergence between QX and PX is defined as:

    Df(QX||PX) ≜ ∫_X PX(x) f(QX(x)/PX(x)) dλ(x)    (47)

where we assume that f(0) = lim_{t→0+} f(t), 0 f(0/0) = 0, and for all r > 0, 0 f(r/0) = lim_{s→0+} s f(r/s) = r lim_{s→0+} s f(1/s), based on continuity arguments. With appropriate choices of the function f : (0, ∞) → R, f-divergences generalize many known divergence measures including KL divergence, total variation distance, χ²-divergence, squared Hellinger distance, Jensen-Shannon divergence, and Jeffreys divergence [44], [45].

Proposition 8 (Local f-Divergence). Let f : (0, ∞) → R be a convex function such that f(1) = 0, f(t) is thrice differentiable over some open interval around t = 1, f'''(t) is locally bounded at t = 1, and f''(1) > 0. Suppose we are given a family of probability densities that are local perturbations of the reference probability density PX > 0 λ-a.e.:

    Qε = PX(1_X + ε φ)

where φ ∈ L^∞(X, PX) is a fixed perturbation function such that E[φ(X)] = 0, and Qε is a valid probability density for all ε ≠ 0 of sufficiently small magnitude. Then, the f-divergence between Qε and PX can be locally approximated as:

    Df(Qε||PX) = (f''(1)/2) ε² E[φ(X)²] + o(ε²)

where o(ε²) is the Bachmann-Landau asymptotic notation denoting a function which satisfies lim_{ε→0} o(ε²)/ε² = 0.

Proof. First observe that using f(1) = 0, the thrice differentiability of f(t) over some open interval around t = 1, and Taylor's theorem about the point t = 1, we have:

    f(t) = f'(1)(t − 1) + (1/2) f''(1)(t − 1)² + (1/6) f'''(s)(t − 1)³

for every t ∈ (0, ∞) sufficiently close to unity, and some corresponding min{1, t} ≤ s ≤ max{1, t}, where we have used the Lagrange form of the remainder. Using this Taylor approximation of f(t), we have for every x ∈ X̂:

    f(Qε(x)/PX(x)) = f'(1) ε φ(x) + (f''(1)/2) ε² φ(x)² + (f'''(s(x))/6) ε³ φ(x)³    (48)

where X̂ ⊆ X is a Borel measurable set such that PX(x) > 0 for all x ∈ X̂ and λ(X\X̂) = 0, and where for every x ∈ X̂, min{1, Qε(x)/PX(x)} ≤ s(x) ≤ max{1, Qε(x)/PX(x)}. Let h(x, ε) denote the Lagrange remainder term:

    h(x, ε) ≜ (f'''(s(x))/6) ε³ φ(x)³.    (49)

Since φ ∈ L^∞(X, PX), ||φ||_∞ ≜ ess sup_{x ∈ X} |φ(x)| < +∞. So, the remainder term is bounded λ-a.e. on X:

    |h(x, ε)| = (|f'''(s(x))|/6) |ε|³ |φ(x)|³ ≤ (B/6) |ε|³ ||φ||³_∞ < +∞

where |f'''(s(x))| ≤ B < +∞ λ-a.e. for sufficiently small ε ≠ 0 because f'''(t) is locally bounded at t = 1, and we have:

    min{1, 1 + ε φ(x)} ≤ s(x) ≤ max{1, 1 + ε φ(x)} λ-a.e.
    ⇒ 1 − |ε| ||φ||_∞ ≤ s(x) ≤ 1 + |ε| ||φ||_∞ λ-a.e.

which means for sufficiently small ε ≠ 0, s(x) will be in the neighborhood of t = 1 around which f'''(t) is bounded. In fact, we also require Qε(x)/PX(x) to be sufficiently close
to unity λ-a.e. for (48) to be valid, and this also holds for sufficiently small ε ≠ 0 because:

    |Qε(x)/PX(x) − 1| = |ε| |φ(x)| ≤ |ε| ||φ||_∞ λ-a.e.

We now take the expectation of both sides of (48) with respect to PX and use E[φ(X)] = 0 to obtain:

    Df(Qε||PX) = (f''(1)/2) ε² E[φ(X)²] + E[h(X, ε)].    (50)

(Note that E[φ(X)²] is finite because φ ∈ L^∞(X, PX) ⊆ L²(X, PX), and the inclusion holds as PX is a finite measure.) Since |h(x, ε)|/ε² ≤ B ||φ||³_∞ |ε|/6 λ-a.e. for sufficiently small ε ≠ 0, we have:

    0 ≤ |E[h(X, ε)]|/ε² ≤ E[|h(X, ε)|/ε²] ≤ (B/6) ||φ||³_∞ |ε|

which implies that E[h(X, ε)] = o(ε²). Therefore, (50) produces the desired approximation. ∎

Proposition 8 asserts that if we consider the stochastic manifold of probability densities on X in the "neighborhood" of PX, the f-divergence between any two densities is a squared weighted L²-norm. This intuition is known in the literature; the discrete case of Proposition 8 is proved in [45] where f-divergences are locally shown to be χ²-divergences. Our proof uses additional regularity conditions to provide a more general version of the result. The specialization of Proposition 8 for KL divergence is provided in the next corollary.

Corollary 9 (Local KL Divergence). Suppose we are given a family of probability densities that are local perturbations of the reference probability density PX > 0 λ-a.e.:

    Qε = PX(1_X + ε φ)

where φ ∈ L^∞(X, PX) is a fixed perturbation function such that E[φ(X)] = 0, and Qε is a valid probability density for all ε ≠ 0 of sufficiently small magnitude. Then, the KL divergence between Qε and PX can be locally approximated as:

    D(Qε||PX) = (1/2) ε² E[φ(X)²] + o(ε²).

Proof. It is straightforward to verify the conditions of Proposition 8 for f(t) = t log(t) (where log(·) denotes the natural logarithm), and this choice of f generates KL divergence. ∎

Corollary 9 is well-known in the literature. The discrete version can be found in [8], and the general case in [46].

APPENDIX C
PROOF OF LEMMA 7

Proof. For any polynomial p : R → R with degree n ∈ N, p(x) = a0 + a1 x + ··· + an x^n such that an ≠ 0, we have:

    ∀y ∈ R, (T(p))(y) = ∫_R φ(uy + vx) p(x) dλ(x)
                      = (1/|v|) ∫_R φ(z) p((z − uy)/v) dλ(z)
                      = (1/|v|) ∫_R φ(z) Σ_{i=0}^n fi(z) y^i dλ(z)
                      = Σ_{i=0}^n y^i (1/|v|) ∫_R φ(z) fi(z) dλ(z)

where the second equality follows from a change of variables z = uy + vx, and the third equality holds because p((z − uy)/v) = a0 + a1 (z − uy)/v + ··· + an ((z − uy)/v)^n is a polynomial in y with degree n, coefficients fi(z) of y^i for i = 0, ..., n (these depend on u, v as well), and leading coefficient fn(z) = an (−u/v)^n ≠ 0. The non-zero leading coefficient of T(p) is:

    (1/|v|) ∫_R φ(z) an (−u/v)^n dλ(z) = (an/|v|) (−u/v)^n

since ∫_R φ dλ = 1. Hence, T(p) is a polynomial with degree n, which implies that T is strictly degree preserving. ∎

REFERENCES

[1] A. Rényi, "On measures of dependence," Acta Mathematica Academiae Scientiarum Hungarica, vol. 10, no. 3-4, pp. 441–451, 1959.
[2] L. Breiman and J. H. Friedman, "Estimating optimal transformations for multiple regression and correlation," Journal of the American Statistical Association, vol. 80, no. 391, pp. 580–598, September 1985.
[3] E. Abbe and L. Zheng, "A coordinate system for Gaussian networks," IEEE Transactions on Information Theory, vol. 58, no. 2, pp. 721–733, February 2012.
[4] R. W. Keener, Theoretical Statistics: Topics for a Core Course, ser. Springer Texts in Statistics. New York: Springer, 2010.
[5] G. W. Wornell, "Inference and information," May 2017, Course 6.437 Lecture Notes, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology.
[6] E. J. G. Pitman, "Sufficient statistics and intrinsic accuracy," Mathematical Proceedings of the Cambridge Philosophical Society, vol. 32, no. 4, pp. 567–579, 1936.
[7] P. Diaconis and D. Ylvisaker, "Conjugate priors for exponential families," The Annals of Statistics, vol. 7, no. 2, pp. 269–281, 1979.
[8] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. New Jersey: John Wiley & Sons, Inc., 2006.
[9] C. N. Morris, "Natural exponential families with quadratic variance functions," The Annals of Statistics, vol. 10, no. 1, pp. 65–80, 1982.
[10] ——, "Natural exponential families with quadratic variance functions: Statistical theory," The Annals of Statistics, vol. 11, no. 2, pp. 515–529, 1983.
[11] D. Fink, "A compendium of conjugate priors," May 1997, Environmental Statistics Group, Department of Biology, Montana State University.
[12] S.-L. Huang, A. Makur, F. Kozynski, and L. Zheng, "Efficient statistics: Extracting information from iid observations," in Proceedings of the 52nd Annual Allerton Conference on Communication, Control, and Computing, Allerton House, UIUC, Illinois, USA, October 1-3 2014, pp. 699–706.
[13] A. Makur, F. Kozynski, S.-L. Huang, and L. Zheng, "An efficient algorithm for information decomposition and extraction," in Proceedings of the 53rd Annual Allerton Conference on Communication, Control, and Computing, Allerton House, UIUC, Illinois, USA, September 29-October 2 2015, pp. 972–979.
[14] S.-L. Huang and L. Zheng, "Linear information coupling problems," in Proceedings of the IEEE International Symposium on Information Theory (ISIT), Cambridge, MA, USA, July 1-6 2012, pp. 1029–1033.
[15] E. Erkip and T. M. Cover, "The efficiency of investment information," IEEE Transactions on Information Theory, vol. 44, no. 3, pp. 1026–1040, May 1998.
[16] N. Tishby, F. C. Pereira, and W. Bialek, "The information bottleneck method," in Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, Allerton House, UIUC, Illinois, USA, September 22-24 1999, pp. 368–377.
[17] V. Anantharam, A. Gohari, S. Kamath, and C. Nair, "On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover," April 2013, arXiv:1304.6133 [cs.IT].
[18] Y. Polyanskiy and Y. Wu, "Dissipation of information in channels with input constraints," IEEE Transactions on Information Theory, vol. 62, no. 1, pp. 35–55, January 2016.
[19] A. Makur and L. Zheng, "Bounds between contraction coefficients," in Proceedings of the 53rd Annual Allerton Conference on Communication, Control, and Computing, Allerton House, UIUC, Illinois, USA, September 29-October 2 2015, pp. 1422–1429.
[20] A. Lapidoth and S. Shamai, "The Poisson multiple-access channel," IEEE Transactions on Information Theory, vol. 44, no. 2, pp. 488–501, March 1998.
[21] G. E. Andrews and R. Askey, "Classical orthogonal polynomials," in Polynômes Orthogonaux et Applications, ser. Lecture Notes in Mathematics, C. Brezinski, A. Draux, A. P. Magnus, P. Maroni, and A. Ronveaux, Eds., vol. 1171. Berlin, Heidelberg: Springer, 1985, pp. 36–62.
[22] T. S. Chihara, An Introduction to Orthogonal Polynomials, Dover reprint ed. New York: Dover Publications, 2011.
[23] G. Szegő, Orthogonal Polynomials, 4th ed., ser. Colloquium Publications. Providence, Rhode Island: American Mathematical Society, 1975, vol. XXIII.
[24] C. Komninakis, L. Vandenberghe, and R. D. Wesel, "Capacity of the binomial channel, or minimax redundancy for memoryless sources," in Proceedings of the IEEE International Symposium on Information Theory (ISIT), Washington, D.C., USA, June 24-29 2001, p. 127.
[25] A. M. Krall, Hilbert Space, Boundary Value Problems and Orthogonal Polynomials, 1st ed., ser. Operator Theory: Advances and Applications. Boston: Birkhäuser Basel, 2002, vol. 133.
[26] D. Bakry, "Functional inequalities for Markov semigroups," in Probability Measures on Groups, 2004, Tata Institute of Fundamental Research, Mumbai. Mumbai, India: ⟨hal-00353724⟩, 2006, pp. 91–147.
[27] O. Mazet, "Classification des semi-groupes de diffusion sur R associés à une famille de polynômes orthogonaux," in Séminaire de Probabilités XXXI, ser. Lecture Notes in Mathematics, J. Azéma, M. Yor, and M. Émery, Eds., vol. 1655. Berlin, Heidelberg: Springer, 1997, pp. 40–53.
[28] H. O. Lancaster, "The structure of bivariate distributions," The Annals of Mathematical Statistics, vol. 29, no. 3, pp. 719–736, 1958.
[29] ——, The Chi-Squared Distribution. New York: John Wiley & Sons Inc., 1969.
[30] G. K. Eagleson, "Polynomial expansions of bivariate distributions," The Annals of Mathematical Statistics, vol. 35, no. 3, pp. 1208–1215, 1964.
[31] R. C.
[39] M. G. Kendall, The Advanced Theory of Statistics, 2nd ed. London: Charles Griffin and Co. Ltd., 1945, vol. 1.
[40] W. Rudin, Principles of Mathematical Analysis, 3rd ed., ser. International Series in Pure and Applied Mathematics. New York: McGraw-Hill, Inc., 1976.
[41] I. Csiszár, "Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten," Publications of the Mathematical Institute of the Hungarian Academy of Sciences, ser. A, vol. 8, pp. 85–108, 1963.
[42] ——, "Information-type measures of difference of probability distributions and indirect observations," Studia Scientiarum Mathematicarum Hungarica, vol. 2, pp. 299–318, 1967.
[43] S. M. Ali and S. D. Silvey, "A general class of coefficients of divergence of one distribution from another," Journal of the Royal Statistical Society, Series B (Methodological), vol. 28, no. 1, pp. 131–142, 1966.
[44] I. Sason, "Tight bounds for symmetric divergence measures and a new inequality relating f-divergences," in Proceedings of the IEEE Information Theory Workshop (ITW), Jerusalem, Israel, April 26-May 1 2015.
[45] I. Csiszár and P. C. Shields, Information Theory and Statistics: A Tutorial, ser. Foundations and Trends in Communications and Information Theory, S. Verdú, Ed. Hanover: now Publishers Inc., 2004, vol. 1, no. 4.
[46] Y. Polyanskiy and Y. Wu, "Lecture notes on information theory," August 2017, Course 6.441 Lecture Notes, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, and Course 664 Lecture Notes, Department of Statistics and Data Science, Yale University.

Anuran Makur (S'16) received a B.S. degree with highest honors (summa cum laude) from the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley (UC Berkeley), USA, in 2013, and an S.M. degree from the Department of Electrical Engineering and Computer Science (EECS) at the Massachusetts Institute of Technology (MIT), Cambridge, USA, in 2015. He is currently pursuing a Ph.D. degree from the Department of EECS at MIT. His research interests include information theory, statistics, and other areas in applied probability. He was a recipient of the Edward Frank Kraft Award from UC Berkeley in 2011, the Leman Rubin Merit Scholarship from UC Berkeley in 2012, the Arthur M. Hopkin Award from UC Berkeley in 2013, the Irwin Mark Jacobs and Joan Klein Jacobs Presidential Fellowship from MIT in 2013, the Hewlett-Packard Fellowship from MIT in 2015, and the Ernst A. Guillemin Master's Thesis Award from MIT in 2015.
Griffiths, “The canonical correlation coefficients of bivariate gamma distributions,” The Annals of Mathematical Statistics, vol. 40, no. 4, pp. 1401–1408, 1969. [32] A. E. Koudou, “Probabilites´ de Lancaster,” Expositiones Mathematicae, vol. 14, no. 3, pp. 247–275, 1996. [33] ——, “Lancaster bivariate probability distributions with Poisson, neg- Lizhong Zheng (S’00–M’02–F’16) received the B.S. and M.S. degrees from ative binomial and gamma margins,” Test, vol. 7, no. 1, pp. 95–110, the Department of Electronic Engineering at Tsinghua University, Beijing, 1998. China, in 1994 and 1997, respectively, and a Ph.D. degree from the Depart- [34] P. Diaconis, K. Khare, and L. Saloff-Coste, “Gibbs sampling, exponen- ment of Electrical Engineering and Computer Sciences at the University of tial families and orthogonal polynomials,” Statisical Science, vol. 23, California, Berkeley (UC Berkeley), USA, in 2002. Since 2002, he has been no. 2, pp. 151–178, 2008. working in the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, Cambridge, USA, where he is [35] J. Meixner, “Orthogonale polynomsysteme mit einer besonderen gestalt currently a Professor of Electrical Engineering. His research interests include der erzeugenden funktion,” Journal of the London Mathematical Society, information theory, statistical inference, and wireless communications and vol. 9, pp. 6–13, 1934. networks. He was a recipient of the Eli Jury Award from UC Berkeley in [36] E. M. Stein and R. Shakarchi, Real Analysis: Measure Theory, Integra- 2002, the IEEE Information Theory Society Paper Award in 2003, the NSF tion, and Hilbert Spaces, ser. Princeton Lectures in Analysis. New CAREER Award in 2004, and the AFOSR Young Investigator Award in 2007. Jersey: Princeton University Press, 2005, vol. 3. He became an IEEE Fellow in 2016. [37] R. E. Megginson, An Introduction to Banach Space Theory, ser. Graduate Texts in Mathematics. 
New York: Springer, October 1998, vol. 183. [38] R. Melrose, “Functional analysis,” May 2017, Course 18.102 Lecture Notes, Department of Mathematics, Massachusetts Institute of Technol- ogy.
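The appendix fragment above rests on two facts: f(t) = t log(t) generates KL divergence as an f-divergence, and a small multiplicative perturbation Q of P_X satisfies the local quadratic approximation D(Q‖P_X) = (ε²/2) E[φ²(X)] + o(ε²) when E[φ(X)] = 0. The following sketch checks both numerically on a finite alphabet; the particular distribution and perturbation direction are invented for illustration and do not come from the paper.

```python
import math

def kl(q, p):
    """KL divergence D(q || p) in nats for finite distributions (all masses > 0)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

p = [0.1, 0.2, 0.3, 0.4]                      # an arbitrary source distribution
phi = [1.0, -1.0, 1.0, -0.25]                 # an arbitrary perturbation direction
mean = sum(pi * fi for pi, fi in zip(p, phi))
phi = [fi - mean for fi in phi]               # center so that E_P[phi] = 0
second_moment = sum(pi * fi ** 2 for pi, fi in zip(p, phi))

for eps in (1e-1, 1e-2, 1e-3):
    q = [pi * (1.0 + eps * fi) for pi, fi in zip(p, phi)]  # valid distribution for small eps
    approx = 0.5 * eps ** 2 * second_moment
    print(eps, kl(q, p) / approx)             # ratio tends to 1 as eps -> 0

# f(t) = t*log(t) generates KL divergence: sum_y P(y) f(Q(y)/P(y)) = D(Q || P).
f = lambda t: t * math.log(t)
f_div = sum(pi * f(qi / pi) for pi, qi in zip(p, q))
assert abs(f_div - kl(q, p)) < 1e-12
```

The printed ratios approach 1 as ε shrinks, which is exactly the o(ε²) statement; the final assertion verifies the f-divergence identity for the last perturbed distribution.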