<<

On Optimization and Convergence of Observation Channels and Quantizers in Stochastic Control

Serdar Y¨uksel and Tam´as Linder

Abstract—This paper studies the optimization of observation and we used the notation channels and quantizers in partially observed stochastic control problems. Continuity properties of the optimal cost in channels Y[0,t] = {Ys, 0 ≤ s ≤ t−1}, U[0,t−1] = {Us, 0 ≤ s ≤ t−1}. and quantizers are explored. Sufficient conditions for sequential compactness under total variation and setwise convergence are The joint distribution of the state, control, and observation presented. Conditions for existence of optimal channels and quan- processes is determined by the above and the following rela- tizers are established. It is shown that quantizers are extreme tionships: points of the space of channels in an appropriate sense.

Pr (X0, Y0) ∈ B = P (dx0)Q(dy0|x0), B ∈B(X×Y), I.INTRODUCTION Z  B In stochastic control, one is often concerned with the fol- where P is the (prior) distribution of the initial state X0, and lowing problem: Given a dynamical system, an observation channel (stochastic kernel), a cost function, and an action set, Pr (Xt, Yt) ∈ B x ,y ,u  [0,t−1] [0,t−1] [0,t−1] when does there exist an optimal policy, and what is an optimal control policy? The theory for such problems is advanced, and = P (dxt|x t−1,ut−1)Q(dyt|xt), B ∈B(X × Y), practically significant, spanning a wide variety of applications ZB in engineering, economics, and natural sciences. where is a stochastic kernel from X U to X. In this paper, we are interested in a dual problem with the P (·|x, u) × following questions to be explored: Given a dynamical system, One way of presenting the problem in a familiar setting is a cost function, an action set, and a set of observation chan- the following: Consider a dynamical system described by the nels, how can one optimize the observation channels? What is discrete-time equations an optimal observation channel subject to constraints on the Xt+1 = f(Xt,Ut, Wt), Yt = g(Xt, Vt) channel? What is the right convergence notion for continuity in such observation channels for optimization purposes? for some measurable functions f,g, with {Wt} being indepen- We start with the probabilistic setup of the problem. Let dent and identically distributed (i.i.d) system noise process and n X ⊂ R , be a Borel set in which elements of a controlled {Vt} an i.i.d. disturbance process, which are independent of Markov process {Xt, t ∈ Z+} live. Here and throughout X0 and each other. Here, the second equation represents the the paper Z+ denotes the set of nonnegative integers and N communication channel Q, as it describes the relation between denotes the set of positive integers. Let Y ⊂ Rm be a Borel the state and observation variables. set, and let an observation channel Q be defined as a stochastic With the above setup, let the objective of the decision maker kernel (regular conditional probability) from X to Y, such that be the minimization of the cost Y Q( · |x) is a probability on the (Borel) σ-algebra B( ) T −1 on Y for every x ∈ X, and Q(A| · ) : X → [0, 1] is a Borel Q,Π J(P,Q, Π) = E c(Xt,Ut) , (1) Y P   for every A ∈B( ). Let a decision maker Xt=0 (DM) be located at the output an observation channel Q, with over the set of all admissible policies , where X U R inputs X and outputs Y . Let U be a Borel subset of some Π c : × → t t is a Borel measurable cost function and Q,Π denotes the Euclidean space. An admissible policy Π is a sequence of EP expectation with initial state probability measure given by control functions {γ , t ∈ Z } such that γ is measurable P t + t under policy and given channel . We adapt the convention with respect to the σ-algebra generated by the information Π Q variables that random variables are denoted by capital letters and lower- case letters denote their realizations. Also, given a probability

It = {Y[0,t],U[0,t−1]}, t ∈ N, I0 = {Y0}. measure µ the notation Z ∼ µ means that Z is a with distribution µ. where Ut = γt(It), t ∈ Z+, are the U-valued control actions In the paper, we will regard quantizers as a particular class of channels. With this interpretation, we will be interested in The authors are with the Department of and Statistics, the following problems: Queen’s University, Kingston, Ontario, Canada, K7L 3N6. Email: (yuk- sel,linder)@mast.queensu.ca. This research was partially supported by the PROBLEM P1. CONTINUITY ON THE SPACE OF CHANNELS Natural Sciences and Engineering Research Council of Canada (NSERC). (STOCHASTIC KERNELS) Suppose {Qn,n ∈ N} is a sequence

1 of communication channels converging in some sense to a We will observe that the optimal cost is concave in the channel Q. When does space of observation channels. This is related to statistics in the context of comparison of experiments as discussed by Q → Q n Blackwell [5]. imply A related area is on the theory of optimal quantization: inf J(P,Qn, Π) → inf J(P,Q, Π)? References [1], [13] are related as these papers study the P P Π∈ Π∈ effects of uncertainties in the input distribution and consider PROBLEM P2. CONCAVITY ON THE SPACE OF CHANNELS Is robustness in the quantizer design. References [17] and [19] the optimization problem study the consistency of optimal quantizers based on empirical

T −1 data for an unknown source. In the context of decentralized Q,Π detection, [21] studied certain topological properties and the J(P,Q) := inf EP c(Xt,Ut) Π   existence of optimal quantizers. Xt=0 The rest of the paper is organized as follows. In the next convex/concave on the space of channels? section, we introduce three relevant topologies on the space PROBLEM P3: EXISTENCEOFOPTIMALCHANNELS Let Q of communication channels. The continuity problem is consid- be a set of communication channels. When do there exist ered in Section III. In Section IV, concavity properties of the minimizing and maximizing channels for the problems optimization problems are discussed. We study the problem T −1 of existence of optimal channels in Section V, followed by Q,Π inf inf EP c(Xt,Ut) applications in quantization in Section VI. Concluding remarks Q∈Q Π∈P   Xt=0 and discussions in Section VIII end the paper. and T −1 II. SOME TOPOLOGIES ON THE SPACE OF COMMUNICATION Q,Π sup inf EP c(Xt,Ut) . CHANNELS Q Π∈P   ∈Q Xt=0 One question that we wish address is the choice of an ap- The answers may help solve problems in application areas propriate notion of convergence for a sequence of observation such as: channels. Toward this end, we first review three notions of • For a partially observed stochastic control problem, some- convergence for probability measures. times we have control over the observation channels by N Let P(R ) denote the family of all probability measure encoding/quantization. When does there exist an optimal N on (X, B(R )) for some N ∈ N. Let {µn, n ∈ N} be a quantizer for such a setup? (Optimal quantization) N sequence in P(R ). Recall that {µn} is said to converge to • Given an uncertainty set for the observation channels, and µ ∈ P(RN ) weakly if c(x)µ (dx) → c(x)µ(dx) tools available for learning the channel, can one identify RN n RN for every continuous andR bounded c : RN R→ R. On the a worst element/best element? (Robust control) N other hand, {µn} is said to converge to µ ∈ P(R ) setwise • When we do not know the channel, but have statistical if c(x)µ (dx) → c(x)µ(dx) for every measurable tools and empirical measurements to learn, typically un- RN n RN andR bounded c : RN →RR. Setwise convergence can also be der mild technical conditions, the empirical distributions defined through pointwise convergence on Borel subsets of RN converge to the actual distribution, in some sense. Does (see, e.g., [16]), that is µ (A) → µ(A), for all A ∈ B(RN ). this imply that we could design the optimal control poli- n since the space of simple functions are dense in the space of cies based on empirical estimates, and does the optimal bounded and measurable functions under the supremum norm. cost converge to the correct limit as the number of mea- For two probability measures µ, ν ∈P(RN ), the total vari- surements grows? (Consistency of empirical controllers) ation metric is given by In the following, we will address problems P1-P3. We will consider the case with T = 1. The multi-stage case is ad- kµ − νkT V := 2 sup |µ(B) − ν(B)| dressed in [26] and [27]. B∈B(RN ) A. Relevant literature = sup f(x)µ(dx) − f(x)ν(dx) , (2) ∞ Z Z f: kfk ≤1 The problems stated are related to three main areas of re- search: Robust control, optimal quantizer design and design where the infimum is over all measurable real f such that of experiments. kfk∞ = supx∈RN |f(x)| ≤ 1. A sequence {µn} is said to N References [6], [7], [8] have considered both optimal control converge to µ ∈P(R ) in total variation if kµn −µkT V → 0. and estimation and the related problem of optimal control Setwise convergence is equivalent to pointwise convergence design when the channel is unknown. Similarly, there are con- on Borel sets whereas total variation requires uniform conver- nections with robust detection, when the source distribution to gence on Borel sets. Thus these three convergence notions are be detected belongs to some set. Recently, [25] considered in increasing order of strength: convergence in total variation continuity and other functional properties of minimum mean implies setwise convergence, which in turn implies weak con- square estimation problems under Gaussian channels. vergence. A. Convergence of information (observation) channels before, in this section Q denotes the set of all channels with X Y Here X = Rn and Y = Rm, and Q denotes the set of all input space and output space . X Let us first look for conditions under which an optimal con- observation channels (stochastic kernels) with input space Q,γ and output space Y. For P ∈ P(X) and Q ∈ Q we let PQ trol policy exists; i.e, when the infimum in infγ EP [c(X,U)] denote the joint distribution induced on (X×Y, B(X×Y)) by is a minimum. channel Q with input distribution P : Theorem 3.1: Suppose assumptions A3 and A4 hold. Then, there exists an optimal control policy for any channel Q. X Y PQ(A)= Q(dy|x)P (dx), A ∈B( × ). A. Weak convergence ZA 1) Absence of continuity under weak convergence: We state Definition 2.1 (Convergence of Channels): the following. The proofs are available in [27]. (i) A sequence of channels {Qn} converges to a channel Q Theorem 3.2: J(P,Q) is not continuous on Q under weak weakly at input P if PQn → PQ weakly. convergence. (ii) A sequence of channels {Qn} converges to a channel 2) Upper semi-continuity under weak convergence: Q setwise at input P if PQn → PQ setwise, i.e., if Theorem 3.3: Suppose assumptions A1 and A5 hold. If PQn(A) → PQ(A) for all Borel sets A ⊂ X × Y. {Qn} is a sequence of channels converging weakly at input (iii) A sequence of channels {Qn} converges to a channel P to a channel Q, then Q in total variation at input P if PQn → PQ in total variation, i.e., if kPQn − PQkT V → 0. lim sup J(P,Qn) ≤ J(P,Q), n→∞ If we introduce the equivalence relation Q ≡ Q′ if and only if PQ = PQ′, Q,Q′ ∈ Q, then the convergence notions in that is, J(P,Q) is upper semi-continuous on Q under weak Definition 2.1 only induce the corresponding topologies (resp. convergence. metrics) on the resulting equivalence classes in Q, instead of B. Continuity properties under setwise convergence Q. Since in most of the development the input distribution P 1) Upper semi-continuity under setwise convergence: is fixed, there should be no confusion when (somewhat incor- Theorem 3.4: Under assumption A2 the optimal cost rectly) we talk about the induced topologies (resp. metrics) on Q. J(P,Q) := inf EQ,γ [c(X,U)] γ P B. Classes of assumptions is sequentially upper semi-continuous on the set of communi- Throughout the paper the following classes of assumptions cation channels Q under setwise convergence. will be adopted for the cost function c and the (Borel) set 2) Absence of continuity under setwise convergence: U ⊂ Rk in different contexts: Theorem 3.5: The optimal cost ASSUMPTIONS. Q,γ J(P,Q) := inf EP [c(X,U)] A1. The function c : X × U → R is non-negative, bounded, γ and continuous on X × U. is not sequentially continuous on the set of communication A2. The function function c : X × U → R is non-negative, channels Q under setwise convergence, even for continuous measurable, and bounded. cost functions and compact X, Y, and U. A3. The function c : X×U → R is non-negative, measurable, bounded, and continuous on U for every x ∈ X. C. Continuity under total variation A4. U is a compact set. Theorem 3.6: Under assumption A2 the optimal cost J(P,Q) A5. U is a convex set. is continuous on the set of communication channels Q under under the topology of total variation. III. PROBLEM P1: CONTINUITYOFTHEOPTIMALCOSTIN CHANNELS IV. PROBLEM P2: CONCAVITY ON THE SPACE OF CHANNELS In this section, we consider continuity properties under total variation, setwise convergence and weak convergence. We con- The following is a result with important consequences in sider the single-stage case, and thus investigate the continuity decentralized stochastic control problems as is elaborated on of the functional in [26] and [27]. Theorem 4.1: The expression Q,Π J(P,Q) = inf EP c(X0,U0) Q,γ Π J(P,Q) = inf E [c(X,U)]   γ P = inf c(x, γ(y))Q(dy|x)P (dx) γ∈G ZX×Y is a concave function of Q. Proof of Theorem 4.1. For α ∈ [0, 1] and Q′,Q′′ ∈Q, let in the channel , where is the collection of all Borel mea- Q G Q = αQ′ + (1 − α)Q′′ ∈Q, i.e., surable functions mapping Y into U. Note that by our previous notation, Π= γ is an admissible first-stage control policy. As Q(A|x)= αQ′(A|x)+(1 − α)Q′′(A|x) ′ N for all A ∈B(Y) and x ∈ X. Noting that PQ = αP Q + (1 − For a probability density function p on R we let Pp de- ′′ α)PQ , we have note the induced probability measure: Pp(A) = A p(x) dx, A ∈B(RN ). The next lemma gives a sufficient conditionR for J(P,Q)= J(P,αQ′ + (1 − α)Q′′) precompactness under total variation. Q,γ ∗ Lemma 5.3: Let be a finite Borel measure on RN and = inf EP [c(X,U)] = inf c(x, γ(y))PQ (dx, dy) µ γ γ∈G Z let F be an equicontinuous and uniformly bounded family of RN = inf α c(x, γ(y))PQ′(dx, dy) probability density functions. Define Ψ ⊂P( ) by γ∈G  Z Ψ= {Pp : Pp ≤ µ, p ∈ F}. +(1 − α) c(x, γ(y))PQ′′(dx, dy) Z  Then Ψ is precompact under total variation. Lemma 5.4: Let Q be a set of channels such that {PQ : ′ ≥ inf α c(x, γ(y))PQ (dx, dy) Q ∈ Q} is a precompact set of probability measures under γ∈G  Z  total variation. Then Q is precompact under total variation at + inf (1 − α) c(x, γ(y))PQ′′(dx, dy) input P . γ∈G  Z  The following theorem, when combined with the preceding ′ ′′ = αJ(P,Q )+(1 − α)J(P,Q ) (3) results, gives sufficient conditions for the existence of best and proving that J(P,Q) is concave in Q. worst channels when the given family of channels Q is closed under the appropriate convergence notion. V. PROBLEM P3: EXISTENCEOFOPTIMALCHANNELS Theorem 5.1: Recall problem P2. Here we study characterizations of compactness (or sequen- (i) There exist a worst channel in Q, that is, a solution for tial compactness) which will be useful in obtaining existence the maximization problem results. Q,γ sup J(P,Q) = sup inf EP E[c(X,U)] The discussion on weak convergence showed us that weak Q∈Q Q∈Q γ convergence does not induce a strong enough topology, i.e., when the set Q is weakly sequentially compact and as- under which useful continuity properties can be obtained. In sumptions A1, A4, and A5 hold. the following, we will obtain conditions for sequential com- (ii) There exist a worst channel in Q when the set Q is pactness for the other two convergence notions, that is, for setwise sequentially compact and assumption A2 holds. setwise convergence and total variation. (iii) There exist best and worst channels in Q, that is, solu- We first discuss setwise convergence. A set of probability tions for the minimization problem inf J(P,Q) and measures M on some measurable space is said to be set- Q∈Q the maximization problem sup J(P,Q) when the set wise sequentially precompact if every sequence in M has a Q∈Q is compact under total variation and assumption A2 subsequence converging setwise to a probability measure (not Q holds. necessarily in M). For two finite measures ν and µ defined on the same measurable space we write ν ≤ µ if ν(A) ≤ µ(A) VI. QUANTIZERSASACLASSOFCHANNELS for all measurable A. Here we consider the problem of convergence and optimiza- The following sufficient condition follows from Dunford tion of quantizers. We start with the definition of a quantizer. and Schwartz ([11], p. 305-306) (see also Hernandez-Lerma Definition 6.1: An M-cell vector quantizer, q, is a (Borel) and Lasserre ([16], p. 7): measurable mapping from X = Rn to the finite set {1, 2,...,M}, Lemma 5.1 (Dunford-Schwartz): Let µ be a finite Borel characterized by a measurable partition {B ,B ,...,B } such T 1 2 M measure on a locally compact . Assume a set that B = {x : q(x)= i} for i =1,...,M. The B are called T i i of Borel probability measures Ψ ⊂P( ) satisfies the cells (or bins) of q. P ≤ µ, for all P ∈ Ψ. REMARKS. • For later convenience we allow for the possibility that Then Ψ is setwise sequentially precompact. some of the cells of the quantizer are empty. As before, PQ ∈ P(X × Y) denotes the joint probability • Traditionally, in source coding theory, a quantizer is a measure induced by input and channel , where X Rn P Q = mapping q : Rn → R with a finite range. Thus q is Y Rm and = . A simple consequence of the preceding ma- defined by a partition and a reconstruction value in Rn jorization criterion is the following. for each cell in the partition. That is, for given cells Lemma 5.2: Let ν be a finite measure on B(X × Y) and {B1,...,BM } and reconstruction values {c1,...,cM }⊂ let P be a probability measure on B(X). Suppose Q is a set n R , we have q(x) = ci if and only if x ∈ Bi. In our of channels such that definition, we do not include the reconstruction values.

PQ ≤ ν, for all Q ∈Q. A quantizer q with cells {B1,...,BM }, however, can also be characterized as a stochastic kernel Q from X to {1,...,M}) Then Q is setwise sequentially precompact at input P in the defined by sense that any sequence in Q has a subsequence {Qn} such that Qn → Q setwise at input P for some channel Q. Q(i|x)=1{x∈Bi}, i =1,...,M M ˆ so that q(x)= i=1 Q(i|x). We denote by QD(M) the space Proof: We will prove the existence of a sequence {Qn} ˆ of all M-cell quantizersP represented in the channel form. In in QF R(M) such that Qn( · |x) → Q( · |x) setwise for all addition, we let Q(M) denote the set of (Borel) stochastic x ∈ X. M kernels from X to {1,...,M}, i.e., Q ∈Q(M) if and only if Let PM = {z ∈ R : z1 + · · · + zM = 1, zi ≥ 0, i = Q( · |x) is probability distribution on {1,...,M} for all x ∈ 1,...,M} denote the probability simplex in RM and note X, and Q(i| · ) is Borel measurable for all i =1,...,M. Note that each Q ∈Q(M) is uniquely represented by the function v that QD(M) ⊂Q(M), and by our definition QD(M − 1) ⊂ Q : X →PM defined by Q (M) for all M ≥ 2. We note that elements of Q(M) are D Qv(x) = (Q(1|x),Q(2|x),...,Q(M|x)). sometimes referred to in the literature as random quantizers. Lemma 6.1: The set of quantizers QD(M) is setwise se- For a positive integer n let PM,n be the collection of probabil- quentially precompact at any input P . ity vectors in PM with rational components having common Proof: Proof follows from Lemma 5.2 and the interpreta- denominator n, i.e., tion above regarding a quantizer as a channel. In particular,a 1 n − 1 P = z ∈P : z ∈{0, ,..., , 1}, i =1,...,M . majorizing finite measure is obtained by the relation: PQ(B × M,n M i n n {i}) ≤ P (B) for every B ∈B(Rm) and i =1,...,M.  Clearly, any z ∈ P can be approximated within error 1/n The following simple lemma provides a useful formula. M in the l∞ sense by a member of PM,n, i.e., Lemma 6.2: A sequence {Qn} in Q(M) converges to a Q in Q(M) setwise at input P if and only if ′ ′ 1 max ′ min kz−z k∞ = max ′ min max |zi−zi|≤ . z∈PM z ∈PM,n z∈PM z ∈PM,n i=1,...,M n X Qn(i|x)P (dx) → Q(i|x)P (dx) ∀A ∈B( ), ∀i Breaking ties in a predetermined manner, we can make the ZA ZA selection of z′ for a given z unique, and thus define a Borel Proof: The lemma follows by noticing that for any Q ∈ measurable mapping q : P → P such that z′ = q (z) Q(M) and measurable B ⊂ X ×{1,...,M}, n M M,n n approximates z in the above sense. Given Q ∈ Q(M), use M this mapping to define Qn ∈Q(M) through the relation PQ(B)= Q(dy|x)P (dx)= Q(i|x)P (dx) Z Z v v B Xi=1 Bi Qn(x)= qn(Q (x)). where Bi = {x ∈ X : (x, i) ∈ B}. (The measurability of Q(i|x) in x follows from the mea- (1) (L(n)) The following counterexample shows that the space of quan- surability of the mapping qn.) Let {z ,...,z } be an tizers QD(M) is not closed under setwise convergence: enumeration of those elements of PM,n for which the sets X Let = [0, 1] and P the uniform distribution on [0, 1]. Let v (j) 2k−2 2k−1 n Sj = {x : Qn(x)= z }, j =1,...,L(n) Lnk = 2n , 2n and let Bn,1 = i=1 Lnk and Bn,2 = M [0, 1]\Bn,1. Define {Qn} as the sequenceS of 2-cell quantizers are not empty (clearly, L(n) ≤ (n + 1) ). Note that the Si given by form an A-measurable partition of X and we have

(1) (2) L(n) L(n) Qn(1|x)=1{x∈Bn,1},Qn(2|x)=1{x∈Bn,2}. u := (z ,z ,...,z ) ∈ PM The proof of Riemann-Lebesgue Lemma can be used to show and  that for all A ∈B([0, 1]), v (j) Qn(x)= z if x ∈ Sj. 1 1 1 RM·L(n) L(n) lim Qn(dy|x)P (dx) = lim fn(t) dt = P (A), Viewed as a subset of , the set PM is com- n→∞ ZA n→∞ Z0 2 2 pact and convex and therefore by the Krein-Milman  theo- and thus, by Lemma 6.2, Qn converges setwise to Q given by rem (see, e.g., [2]) it is the closure of the convex hull of 1 L(n) Q(1|x)= Q(2|x)= 2 for all x ∈ [0, 1]. However, Q is not a its extreme points. The set of extreme points of PM is (deterministic) quantizer. L(n) EM , where EM = {e1,...,eM } is the standard basis for Definition 6.2: The class of finitely randomized quantizers L n RM . In particular, we can find ( ) and is the convex hull of , i.e., u1,...,uN ∈ EM QF R(M) QD(M) Q ∈ QF R(M) N 1 (α1,...,αN ) ∈ PN such that u − αkuk ≤ (k·k if and only if there exist k ∈ N, Q1,...,Qk ∈QD(M), and k=1 n k denotes the standard Euclidean normP in any dimension). Since α1,...,αk ∈ [0, 1] with i=1 αi =1, such that u = (u ,...,u ), where u ∈ E for all k and j, P k k,1 k,L(n) k,j M k we can define the deterministic quantizers Q ∈ Q (M), X n,k D Q(i|x)= αj Qj(i|x), for all i =1,...,M and x ∈ . k =1,...,N, by setting Xj=1 v Qn,k(x)= uk,j if x ∈ Sj . The next result shows that QR(M) is the (sequential) clo- sure of the convex hull of QD(M). Putting things together, we obtain that Theorem 6.1: For any Q ∈Q(M) there exists a sequence N ˆ 1 {Qn} of finitely randomized quantizers in QF R(M) which Qv (x) − α Qv (x) ≤ for all x ∈ X. (4) n k n,k n converges to Q setwise at any input P . Xk=1

ˆ n n n Define Qn ∈Q(M) by where Bi △ B = (Bi \ B) ∪ (B \ Bi ). Then we have

N kPQn − PQkT V M Qˆn(i|x)= αkQn,k(i|x). = sup f(x, i) 1 n − 1 n P (dx) kX=1 {x∈Bi } {x∈Bi } f:kfk∞≤1 ZX Xi=1  v v 1 M Combining (4) with kQ (x) − Qn(x)k∞ ≤ n , we obtain ≤ sup |f(x, i)|P (dx) f:kfk∞≤1 ZBn△B 2 Xi=1 i i |Q(i|x) − Qˆn(i|x)|≤ for all x ∈ X and i =1,...,M M n n ≤ P (Bi △ Bi) → 0 (5) Xi=1 which implies that Qˆn( · |x) → Q( · |x) setwise for all x ∈ X. Since each Qˆn is a convex combination of deterministic and convergence in total variation follows. quantizers in QD(M), the proof is complete. We next consider quantizers with convex codecells and a The preceding theorem has important consequences in that input distribution that is absolutely continuous with respect n it tells us that the space of deterministic quantizers is a “ba- to the Lebesgue measure on R [15]. Assume Q ∈ QD(M) sis” for the space of communication channels between X and with cells B1,...,BM , each of which is a convex subset of {1,...,M} in an appropriate sense. Rn. By the separating hyperplane theorem, there exist pairs of In the following we show that an optimal channel can be complementary closed half spaces {(Hi,j , Hj,i): 1 ≤ i, j ≤ replaced with an optimal quantizer without any loss in perfor- M,i 6= j} such that for all i =1,...,M, mance. B ⊂ H . Proposition 6.1: For any Q ∈ Q(M) there is a Q′ ∈ i i,j ′ j\6=i QD(M) with J(P,Q ) ≤ J(P,Q). If there exists an optimal ¯ channel in Q(M) for problem P3, then there is a quantizer in Each Bi := j6=i Hi,j is a closed convex polytope and by QD(M) that is optimal. the absolute continuityT of P one has P (B¯i \ Bi)=0 for all Proof: Only the first statement needs to be proved. We i = 1,...,M. One can thus obtain a (P –a.s) representation follow an argument common in the source coding literature of Q by the M(M − 1)/2 hyperplanes hi,j = Hi,j ∩ Hj,i. (see, e.g., the Appendix of [24]). Let QC (M) denote the collection of M-cell quantizers with For a policy γ : {1,...,M} → U = X (with finite cost) convex cells and consider a sequence {Qn} in QC (M). It can define for all i, be shown (see the proof of Thm. 1 in [15]) that using an appropriate parametrization of the separating hyperplanes, a subsequence Q can be can be chosen which converges to Bi = x : c(x, γ(i)) ≤ c(x, γ(j)), j =1,...,M . nk nk  a Q ∈ QC (M) in the sense that P (Bi △ Bi) → 0 for all nk i−1 i =1,...,M, where the Bi and the Bi are the cells of Qnk Letting B = B¯ and Bi = B¯i \ Bj , i =2,...,M, we 1 1 j=1 and Q, respectively. In view of (5), we obtain the following. obtain a partition {Bi,...,BM }Sand a corresponding quan- ′ Theorem 6.3: The set Q (M) is compact under total vari- tizer Q′ ∈ Q (M). It is easy to see that EQ ,γ[c(X,U)] ≤ C D P ation at any input measure P that is absolutely continuous with EQ,γ [c(X,U)] for any Q ∈Q(M). P respect to the Lebesgue measure on Rn. The following shows that setwise convergence of quantizers We can now state an existence result for optimal quantiza- implies convergence under total variation. tion (problem P1). Theorem 6.2: Let {Qn} be a sequence of quantizers in Theorem 6.4: Let P be absolutely continuous and suppose QD(M)} which converges to a quantizer Q ∈ QD(M) set- the goal is to find the best quantizer Q with M cells minimiz- wise at P . Then, the convergence is also under total variation Q,γ ing J(P,Q) = infγ EP (X,U) under assumption A2, where at P . Q is restricted to QC(M). Then an optimal quantizer exists. n n Proof: Let B1 ,...,BM be the cells of Qn. Since Qn → Proof: Existence follows from Theorems 5.1 and 6.3. Q setwise at input P , we have PQn(B×{i}) → PQ(B×{i}) In the quantization literature finding an optimal quantizer for any B ∈B(X). Since PQ (B×{i})= 1 n P (dx), n B {x∈Bi } means finding optimal codecells and corresponding reconstruc- we obtain R tion points. Our formulation does not require the existence of optimal reconstruction points (i.e., optimal policy γ). For cost n p n P (B ∩ Bi ) → P (B ∩ Bi), for all i =1,...,M. functions of the form c(x, u) = kx − uk for x, u ∈ R and some p> 0, the cells of “good” quantizers will be convex by If B1,...,BM are the cells of Q, the above implies P (Bj ∩ Lloyd-Max conditions of optimality; see [15] for further re- n Bi ) → P (Bj ∩ Bi) for all i, j ∈ {1,...,M}. Since both sults on convexity of bins for entropy constrained quantization n X {Bi } and {Bn} are partitions of , we obtain problems. We note that [1] also considered such cost functions for existence results on optimal quantizers; Graf and Luschgy n P (Bi △ Bi) → 0 for all i =1,...,M, [12] considered more general norm-based cost functions. The concavity property applies directly to the multi-stage kfkBL ≤ M} for some 0

− Qn(dy|x)c(x, γ(y)) P (dx) =0, control policies F are such that Z 

to be able to have continuity under setwise convergence. Thus, lim sup c(x, γ(y))µn(dx, dy) − c(x, γ(y))µ(dx, dy) n→∞ γ Z Z ∈F one important question of practical interest, is the follow-

= 0, ing: What type of stochastic control problems, cost functions, and allowable policies lead to solutions which admit such then we obtain consistency. a uniform convergence principle under setwise convergence? A class of measurable functions E is called a Glivenko- Some sufficient conditions for uniform setwise convergence Cantelli class [9], if the integrals with respect to the empirical are presented in [22]. measures converge almost surely to the integrals with respect to the true measure uniformly over E. Thus, if B. Robustness in source distribution G = {γ : c(x, γ(y)) ∈ E}, The results in this paper may also be applicable to ro- bustness to source distribution. We considered robustness in where E is a class of Glivenko-Cantelli family of functions, the observation channel. However, some of results also apply then we could establish consistency. One example of a Glivenko- almost without change to robustness in the input distribution. Cantelli family of real functions on RN is the family {f : Ornstein’s d¯-distance [18], [13] is also a meaningful distance when the uncertainty is in the input distribution, but requires [27] S. Y¨uksel and T. Linder, “Optimization and Convergence of Observation the cost function to be specific. Such problems for the specific Channels in Stochastic Control”, submitted to SIAM Journal on Control and Optimization, 2010 (available on arXiv). case of minimum mean square estimation were considered in [25]. The important general problem merits further research.

REFERENCES [1] E. A. Abaya and G. L. Wise, “Convergence of vector quantizers with applications to optimal quantization, SIAM Journal on Applied Mathematics, vol. 44, pp. 183–189, 1984. [2] A. Barvinok, A Course in Convexity, vol. 54 of Graduate Studies in Mathematics. Providence, RI: American Mathematical Society, 2002. [3] P. Billingsley, Convergence of Probability Measures. New York: Wiley 1968. [4] J. von Neumann, “A certain zero-sum two-person game equivalent to the optimal assignment problem,” Contrib. to the Theory of Games, vol. 2, pp. 5–12, Princeton University Press, Princeton, New Jersey, 1953. [5] D. Blackwell, “The comparison of experiments,” in Proc. Second Berkeley Symposium on Mathematical Statistics and Probability, pp. 93–102. Univ. of California Press, Berkeley, 1951. [6] C.D. Charalambous and F. Rezaei, “Stochastic uncertain systems subject to relative entropy constraints: Induced norms and monotonicity prop- erties of minimax games”, IEEE Transactions on Automatic Control, vol. 52, no. 4, pp 647–663, May 2007. [7] F. Rezaei, C.D. Charalambous and N. U. Ahmed, “Optimization of stochastic uncertain systems with variational norm constraints”, in Proc. IEEE Conference on Decision and Control, pp. 2159–2163, New Orleans, LA, USA, Dec. 2007 [8] Y. Socratous, F. Rezaei and C.D. Charalambous, “Nonlinear estimation for a class of systems”, IEEE Transactions on Information Theory, vol. 55, no. 4, pp. 1930-01938, Apr. 2009. [9] R.M. Dudley, E. Gine, and J. Zinn, “Uniform and universal Glivenko- Cantelli classes”, Journal of Theoretical Probability, vol. 4, pp. 485– 510, 1991. [10] R. M. Dudley, Real Analysis and Probability, Cambridge University Press, Cambridge, 2nd ed., 2002. [11] N. Dunford and J. Schwartz, Linear operators I, Interscience, New York, 1958 [12] S. Graf and H. Luschgy, Foundations of Quantization for Probability Distributions. Berlin, Heidelberg: Springer Verlag, 2000. [13] R. M. Gray and L. D. Davisson, “Quantizer mismatch,” IEEE Trans- actions on Communications, vol. 23, pp. 439–443, 1975. [14] A. Gy¨orgy and T. Linder, “Optimal entropy-constrained scalar quanti- zation of a uniform source”, IEEE Transactions on Information Theory, vol. 46, pp. 2704–2711, Nov. 2000. [15] A. Gy¨orgy and T. Linder, “Codecell convexity in optimal entropy- constrained vector quantization, IEEE Transactions on Information Theory, vol. 49, pp. 1821–1828, Jul. 2003. [16] O. Hernandez-Lerma, J.B. Lasserre, Markov Chains and Invariant Probabilities, Birkh¨auserVerlag, Basel, 2003. [17] T. Linder, “On the training distortion of vector quantizers”, IEEE Transactions on Information Theory, vol. 46, pp. 1617–1623, 2000. [18] D. Ornstein, “An application of ergodic theory to probability theory, The Annals of Probability, vol. 1, pp. 43–58, 1973. [19] D. Pollard, “Quantization and the method of k-means, IEEE Transac- tions on Information Theory, vol. 28, pp. 199–205, 1982. [20] W. Rudin, Real and Complex Analysis, New York: McGraw-Hill, 3rd ed., 1987. [21] J. N. Tsitsiklis, “Extremal properties of likelihood-ratio quantizers”, IEEE Transactions on Communications, Vol. 41, pp. 550–558, Apr. 1993. [22] F. Topsoe, “Uniformity in convergence of measures”, Z. Wahrschein- lichkeitsth, vol. 39, pp. 1–30, 1977. [23] V. N. Vapnik The Nature of Statistical Learning Theory Springer, New York, 2nd ed., 2000. [24] H. S. Witsenhausen, “On the structure of real-time source coders,” Bell Syst. Tech. J., 58:1437-1451, July/August 1979. [25] Y. Wu and S. Verd´u, “Functional properties of MMSE”, Proc. IEEE International Symposium on Information Theory, Austin, TX, June 13- 18, 2010. [26] S. Y¨uksel and T. Linder, “Optimization and Convergence of Observation Channels in Stochastic Control”, submitted to IEEE American Control Conference, 2011.