Rational Inattention in Controlled Markov Processes

Ehsan Shafieepoorfard, Maxim Raginsky and Sean P. Meyn

Abstract— The paper poses a general model for optimal Sims considers a model in which a representative agent control subject to information constraints, motivated in part decides about his consumption over subsequent periods of by recent work on information-constrained decision-making by time, while his computational ability to reckon his wealth – economic agents. In the average-cost optimal control frame- work, the general model introduced in this paper reduces the state of the dynamic system – is limited. A special case is to a variant of the linear-programming representation of the considered in which income in one period adds uncertainty average-cost optimal control problem, subject to an additional of wealth in the next period. Other modeling assumptions mutual information constraint on the randomized stationary reduce the model to an LQG control problem. As one policy. The resulting optimization problem is convex and admits justification for introducing the information constraint, Sims a decomposition based on the Bellman error, which is the object of study in approximate dynamic programming. The remarks [7] that “most people are only vaguely aware of their structural results presented in this paper can be used to obtain net worth, are little-influenced in their current behavior by performance bounds, as well as algorithms for computation or the status of their retirement account, and can be induced to approximation of optimal policies. make large changes in savings behavior by minor ‘informa- tional’ changes, like changes in default options on retirement I.INTRODUCTION plans.” Quantitatively, the information constraint is stated in terms of an upper bound on the mutual information in the In typical applications of stochastic dynamic program- sense of Shannon [8] between the state of the system and ming, the controller has access to limited information about the observation available to the agent. the state of the system. Unlike much of the existing literature Extensions of this work include Matejka and McKay [9], on problems with imperfect state information, in this paper who recovered the well-known multinomial logit model for it is assumed that the system designer has to decide not a situation where an individual must choose among discrete only about the control policy, but also about the observation alternatives yielding different values not perfectly known ex- channel based on which the control is derived. (A very partial ante. In the sequel [10], they extend this model to study list of references that touch upon similar themes includes Nash equilibria in a Bertrand oligopoly setting where N [1]–[5].) There are various applications which fit in this firms produce the same commodity with different qualities framework, either in engineering or economics. not perfectly known ex-ante; it is shown that this information In economic decision-making, the amount of information friction leads to increased prices for the commodity. On a required to make a truly optimal decision will typically similar theme, Peng [11] is motivated by the observation that exceed what an agent can handle. In his seminal work “investors have limited time and attention to process infor- [6], Christopher Sims (who has shared the 2011 Nobel mation.” In a factor model for asset returns, he investigates Memorial Prize in Economics with Thomas Sargent) adds the dynamics of prices and mutual information when there an information-processing constraint to a specific kind of is uncertainty in the decision making process. dynamic programming problem which is frequently used Given the appeal and generality of these questions, there in macroeconomic models. The aim, as he expresses, is is ample motivation for the creation of a general theory for to support John Maynard Keynes’ well-known view that optimal control subject to information constraints. The con- real economic behavior is inconsistent with the idea of tribution of the present paper is to initiate the development continuously optimizing agents interacting in continuously of such a theory, which would in turn enable us to address clearing markets. Sims uses the term “rational inattention” to many problems in macroeconomics and engineering in a describe the setting in which information-constrained agents systematic fashion. We focus on the average-cost optimal strive to make the best use of whatever information they are control framework and show that the construction of an op- able to handle. timal information-constrained controller reduces to a variant of the linear-programming representation of the average-cost E. Shafieepoorfard and M. Raginsky are with the Department of Electri- cal and Computer Engineering and the Coordinated Science Laboratory, optimal control problem, subject to an additional mutual University of Illinois at Urbana–Champaign, Urbana, IL 61801, USA; information constraint on the randomized stationary policy. {shafiee1,maxim}@illinois.edu. The resulting optimization problem is convex and admits S.P. Meyn is with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA; a decomposition in terms of the Bellman error, which is [email protected]. the object of study in approximate dynamic programming. Research supported in part by the NSF under CAREER award no. The structure revealed in this paper can be used to obtain CCF-1254041 and grant no. ECCS-1135598, and in part by the UIUC College of Engineering under Strategic Research Initiative on “Cognitive performance bounds, as well as algorithms for computation and Algorithmic Decision Making.” or approximation of optimal policies. Xt II.PRELIMINARIES AND NOTATION Controlled system All spaces are assumed to be standard Borel (i.e., iso- morphic to a Borel subset of a complete separable metric space), and will be equipped with their Borel σ-algebras. If Observation channel

X is such a space, then B(X) will denote its Borel σ-algebra, Ut Zt and P(X) will denote the space of all probability measures Controller on (X, B(X)). We will denote by M(X) the space of all measurable functions f ∶ X → R and by Cb(X) ⊆ M(X) the space of all bounded continuous functions. We will often use Fig. 1. System model. 1 bilinear form notation for expectations: for any f ∈ L (µ),

⟨µ, f⟩ ≜ f(x)µ(dx) = E[f(X)], processing inequality SX satisfies the : if X → Y → Z is a Markov chain, then where in the last expression it is understood that X is an X-valued random object with Law(X) = µ. I(X; Z) ≤ I(X; Y ). (2) Given two spaces X and Y, a mapping K(⋅S⋅) ∶ B(Y)×X → In words, no additional processing can increase information. [0, 1] is a Markov (or stochastic) kernel if K(⋅Sx) ∈ P(Y) for all X and is measurable for every x ∈ x ↦ K(BSx) III.SOMEGENERALCONSIDERATIONS B ∈ B(Y). The space of all such Markov kernels will be Consider a model in which the controller is constrained to denoted by M(YSX). Markov kernels K ∈ M(YSX) act on observe the system being controlled through an information- measurable functions f ∈ M(Y) from the left as limited channel. This is illustrated in the block diagram Kf(x) ≜ f(y)K(dySx), ∀x ∈ X shown in Figure 1, consisting of: SY the (time-invariant) controlled system, specified by an and on probability measures µ ∈ P(X) from the right as ● initial condition µ ∈ P(X) and a stochastic kernel Q ∈ M(XSX × U), where X is the state space and U is the µK(B) ≜ K(BSx)µ(dx), ∀B ∈ B(Y). SX control (or action) space; the observation channel, specified by a sequence W of Moreover, Kf ∈ M(X) for any f ∈ M(Y), and µK ∈ P(Y) Z Xt Zt 1 Ut 1 X relative entropy information ● stochastic kernels Wt ∈ M( S × × ), t = for any µ ∈ P( ). The (or 1 1, 2,..., where Z is some observation space− ; and− ) between any two µ, ν ∈ P(X) [8] is defined as the feedback controller, specified by a sequence Φ of ⎧ dµ stochastic kernels Φ U Z , t 1, 2,.... ⎪cµ, log h , if µ ≺ ν ● t ∈ M( S ) = D(µYν) ≜ ⎨ dν The X-valued state process {Xt}t 1, the Z-valued ob- ⎪+∞, otherwise ⎩ servation process Zt 1, and the ∞U-valued control pro- { }t = X cess Ut 1 are defined∞ on a common probability space Given any probability measure µ ∈ P( ) and any Markov { }t = Y X Ω, , P ∞and have the causal ordering kernel K ∈ M( S ), we can define a probability measure ( F ) = µ ⊗ K on the product space (X × Y, B(X) ⊗ B(Y)) via its X1,Z1,U1,...,Xt,Zt,Ut,... (3) action on the rectangles A × B, A ∈ B(X),B ∈ B(Y): where, P-almost surely, P(X1 ∈ A) = µ(A) for all A ∈ B(X), (µ ⊗ K)(A × B) ≜ K(BSx)µ(dx). and for all 1 2 , Z , U , X we SA t = , ,... B ∈ B( ) C ∈ B( ) D ∈ B( ) have Note that µ ⊗ K(X × B) = µK(B) for all B ∈ B(X). The t t 1 t 1 t t 1 t 1 Shannon mutual information [8] in the pair (µ, K) is P(Zt ∈ BSX ,Z ,U ) = Wt(BSX ,Z ,U ) (4a) U C Xt,Zt−,U t 1− Φ C Z − − (4b) I(µ, K) ≜ D(µ ⊗ KYµ ⊗ µK), (1) P( t ∈ S ) = t( S t) t t − t P(Xt 1 ∈ DSX ,Z ,U ) = Q(DSXt,Ut) (4c) where, for any µ ∈ P(X) and ν ∈ P(Y), µ ⊗ ν denotes the product measure defined via (µ⊗ν)(A×B) ≜ µ(A)ν(B) for This specification+ ensures that, for each t, the next state t 1 t t 1 all A ∈ B(X),B ∈ B(Y). The functional I(µ, K) is concave Xt 1 is conditionally independent of X ,Z ,U given in µ and convex in K. If X,Y is a pair of random objects X ,U (which is the usual case of a controlled− − Markov ( ) t+ t with Law(X,Y ) = Γ = µ ⊗ K, then we will also write process), and that the control Ut is conditionally independent t t 1 t 1 I(X; Y ) or I(Γ) for I(µ, K). of X ,Z ,U given Zt. In other words, at each time t the Finally, given a triple of jointly distributed random objects controller− takes− as input only the most recent observation Zt, (X,Y,Z) with Γ = Law(X,Y,Z), we will say that they which amounts to the assumption that there is a separation form a Markov chain X → Y → Z if there exist Markov 1This observation space can be very general — for example, the case kernels K1 Y X and K2 Z Y so that Γ can be ∈ M( S ) ∈ M( S ) when Z is a Cartesian product of countably many copies of X×U is allowed, disintegrated as Γ = µ ⊗ K1 ⊗ K2 (in words, if X and Z are so we can cover the case of arbitrarily long (or even perfect) controller conditionally independent given Y ). The mutual information memory. structure between the observation channel and the controller. observation that can be regarded as an instance of the This assumption is common in the literature [1], [3], [4]. Shannon–Mori–Zwanzig Markov model [13]: In the spirit of the rational inattention framework, we assume that the amount of information flow that can be Principle of Irrelevant Information. Let Ξ, Θ, Ψ, Υ be four maintained across the observation channel per time step is random variables defined on a common probability space, constrained, and wish to design a suitable channel W and such that Υ is conditionally independent of (Θ, Ξ) given Ψ. a controller Φ to minimize a given performance objective Then there exist four random variables Ξ , Θ , Ψ , Υ defined under the information constraint. For maximum flexibility, on the same spaces as the original tuple,′ such′ ′ that′ Ξ → we grant the system designer the freedom to choose the Θ → Ψ → Υ is a Markov chain, and moreover the bivariate′ observation space Z as well. In other words, the designer marginals′ ′ agree:′ is allowed to choose an optimal representation for the data Law(Ξ, Θ) = Law(Ξ , Θ ) supplied to the controller. Law Θ, Ψ Law Θ′ , Ψ′ As we will demonstrate shortly, the choice Z = P(X) ( ) = ( ) ′ ′ (i.e., letting Z be the space of beliefs on the state space) Law(Ψ, Υ) = Law(Ψ , Υ ) is information-theoretically optimal. For now, let us see how ′ ′ Using this principle, we can prove the following two lemmas much insight we can gain in the case of a fixed Z. Then (see Appendices for the proofs): the problem of optimal control with rational inattention can be formulated as follows. Let c X U be a given ∶ × → R Lemma 1 (Two-Stage Lemma). Suppose T = 2. Then the measurable one-step state-action cost function. Given+ a fixed 2 kernel W2(dz2Sx , z1, u1) can be replaced by another kernel finite horizon T and some R 0, we seek an observation ≥ W2(dz2Sx2), such that the resulting variables (Xt,Zt,Ut ), channel W and a controller Φ in order to t =′1, 2, satisfy ′ ′ ′ T minimize E Q c(Xt,Ut) (5a) E[c(X1,U1) + c(X2,U2)] = E[c(X1,U1) + c(X2,U2)] t 1 ′ ′ ′ ′ and ; ; , 1 2. subject to I(X=t; Zt) ≤ R, t = 1, 2,...,T (5b) I(Xt Zt) = I(Xt Zt) t = , ′ ′ Later we focus on a related average-cost performance crite- Lemma 2 (Three-Stage Lemma). Suppose T = 3, and Z3 rion. is conditionally independent of (Xi,Zi,Ui), i = 1, 2, given 2 A. Key structural result X3. Then the kernel W2(dz2Sx , z1, u1) can be replaced by another kernel W dz x , such that the resulting variables The optimization problem (5) seems formidable: for each 2( 2S 2) (Xi,Zi,Ui ), i = 1′, 2, 3, satisfy time step t = 1,...,T we must design stochastic kernels t t 1 t 1 ′ ′ ′ Wt(dztSx , z , u ) and Φt(dutSzt) for the observation 3 3 channel and the− controller,− and the complexity of the feasible E Q c(Xt,Ut ) = E Q c(Xt,Ut) t 1 ′ ′ t 1 set of Wt’s grows with t. However, the fact that (a) both the = = controlled system and the controller are Markov, and (b) the and I(Xt; Zt) = I(Xt; Zt) for t = 1, 2, 3. cost function at each stage depends only on the current state- ′ ′ action pair, permits a drastic simplification — at each time t, Armed with these two lemmas, we can now prove the theorem by backward induction and grouping of variables. we can limit our search to memoryless channels Wt(dztSxt) without impacting either the average cost in (5a) or the Fix any T . By the Two-Stage-Lemma, we may assume that information flow constraint in (5b): WT is memoryless, i.e., ZT is conditionally independent T 1 T 1 T 1 of X ,Z ,U given XT . Now we apply the Three- Theorem 1 (Memoryless observation channels suffice). Un- Stage Lemma− − to − der the specified information pattern (4), for any controller specification Φ and any channel specification W , there exists T 3 T 3 T 3 U X ,Z ,U ,XT 2,ZT 2 ,UT 2 U another channel specification W consisting of stochastic ´¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¸¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¶− − − ² ² Stage 1 − Stage− 1 Stage− 1 kernels Wt(dztSxt), t = 1, 2,..., such′ that state observation control T T U XT 1 ,ZT 1 ,UT 1 U XT ,ZT ,UT U (6) E Q c(Xt,Ut ) = E Q c(Xt,Ut) ² ² ² ° ° ° Stage− 2 Stage− 2 Stage− 2 Stage 3 Stage 3 Stage 3 t 1 ′ ′ t 1 state observation control state observation control and = = T 1 T 2 T 2 to replace WT 1(dzT 1Sx , z , u ) with I X ; Z I X ; Z , t 1, 2,...,T W dz 1 x 1 without affecting− the− expected− cost or ( t t) = ( t t) = T 1( T S T ) − − ′ ′ the′ mutual information between the state and the observation where X ,U ,Z is the original process with − − − {( t t t)} at time T 1. We proceed inductively by merging the second µ, Q, W , Φ , while X ,U ,Z is the one with − ( ) { t t t)} and the third stages in (6), splitting the first stage in (6) into µ, Q, W , Φ . ′ ′ ′ ( ) two, and then applying the Three-Stage Lemma to replace ′ Proof. To prove the theorem, we follow the approach used the original observation kernel WT 2 with a memoryless by Wistenhausen in [12]. We start with the following simple one. − B. Long-term average cost criterion When T = 1, the problem in (5) becomes Despite the simplification afforded by Theorem 1, the minimize E [c(X,U)] (8a) optimization problem (5) is still difficult even when the observation space Z is fixed, and the only general way of subject to I(X; Z) ≤ R (8b) solving it is via infinite-dimensional dynamic programming. for a given law µ ∈ P(X) of X, where the minimization is Our goal is to gain theoretical insight into the structure of over all observation channels W ∈ M(ZSX) and controllers optimal control policies in the rational inattention framework. Φ ∈ M(USZ). If we denote the optimum value attained in To that end, we make several simplifications: (8) by V (R, Z), then the quantity of interest is 1) We replace the finite-horizon cost criterion in (5a) with V R inf V R, Z , (9) the one based on the long-term average cost. ( ) ≜ Z ( ) 2) We consider only stationary (time-invariant) observa- where the infimum is over all standard Borel observation tion channels and controllers. spaces. In other words, we seek an observation space Z , 3) Instead of separately optimizing over the observation an observation channel Z X , and a controller∗ space, the observation channel, and the controller, we W ∈ M( S ) Φ U Z , such that ∗ ; ∗ and the resulting collapse these decision variables into the choice of a ∈ M( S ) I(X Z ) ≤ R expected∗ cost∗E[c(X,U )] is minimized.∗ Markov randomized stationary (MRS) control law Φ ∈ In order to properly∗ frame the main result of [15], we U X satisfying the information constraint. M( S ) need to make some preliminary observations. If we fix Z Of these, Item 3) requires some explanation; in essence, it and an observation channel W Z X , then the optimal is justified by the simple fact that the mutual information ∈ M( S ) choice of the controller Φ U Z is to let Φ du z I X; U between two random variables X,U can be W ∈ M( S ) W ( S ) ( ) ( ) be supported on the set of∗ all u U, such that ∗ expressed as ∈ ∗ E[c(X, u )SZ = z] = min E[c(X, u)SZ = z]. u U I(X; U) = inf šI(X; Z) ∶ X → Z → UŸ, ∗ In fact, under suitable assumptions∈ on c, there exists a where the infimum is over all standard Borel spaces Z and all deterministic measurable selector ϕ Z U, so that Z-valued random objects Z jointly distributed with X,U , W ∶ → ( ) ∗ such that X Z U forms a Markov chain. Indeed, → → E[c(X, ϕW (z))SZ = z] = min E[c(X, u)SZ = z]. u U for any such triple we have I(X; U) ≤ I(X; Z) by the ∗ data processing inequality; the other direction follows by With this, we would then use the∈ deterministic controller considering the “degenerate” Markov chain X U U. Φ du z δ ∗ du , where δ denotes the Dirac → → W ( S ) = ϕW z ( ) u Consequently, for any choice of µ X , Z, W Z X measure∗ concentrated at u U. Thus, ∈ P( ) ∈ M( S ) ( ) ∈ and Φ ∈ M(USZ), the kernel Ψ = Φ ○ W ∈ M(USX) V (R, Z) = inf E[c(X, ϕW (Z))] will satisfy I(µ, Ψ) ≤ I(µ, W ) (recall notation from (1)). W ∈M(ZSX) I(X;Z)≤R ∗ Conversely, we can always factor any given Ψ ∈ M(USX) through some standard Borel space Z as Ψ = Φ○W with Φ ∈ = inf E min E[c(X, u)SZ] . (10) W ∈M(ZSX) u U M(USZ) and W ∈ M(ZSX), such that I(µ, W ) = I(µ, Ψ). I(X;Z)≤R In view of the above, we focus on the following problem: ∈ We also need to introduce some notions from rate-distortion find an MRS control law Φ ∈ M(USX) to theory [14]. For a given µ ∈ P(X) and a given R ≥ 0, 1 T consider the set minimize lim sup E Q c(Xt,Ut) (7a) T T t 1 Iµ(R) ≜ šK ∈ M(USX) ∶ I(µ, K) ≤ RŸ. subject to lim→ sup∞ I(Xt; =Ut) ≤ R, t = 1, 2,... (7b) t The set Iµ(R) is nonempty for every R ≥ 0. To see this, IV. ONE-STAGE→∞ PROBLEM: SOLUTIONVIA note that any kernel K ∈ M(USX) for which the function RATE-DISTORTION THEORY x K B x is constant (µ-a.e. for any B U ) satisfies ↦ ( S ) ◇ ∈ B( ) I µ, K 0. Before we analyze the infinite-horizon problem (7), let ( ◇ ) = The Shannon distortion-rate function (DRF) of µ is de- us show that the one-stage case can be solved completely ◇ using rate-distortion theory [14] (a branch of information fined as theory that deals with optimal compression of data subject Dµ(R) ≜ inf ⟨µ ⊗ K, c⟩. (11) to information constraints). To the best of our knowledge, K µ R this solution is originally due to Stratonovich [15] (see also We use the more cumbersome∈I ( notation) ; when we [16, Sec. 9.7]). Because our subsequent development builds Dµ(R c) need to specify the dependence of the DRF on the underlying on these ideas, we briefly describe them here. Moreover, this cost function . Starting with the easily proved variational analysis will provide additional justification for eliminating c expression the decision variables Z and W ∈ M(ZSX) in favor of an information-constrained controller Φ ∈ M(USX) directly I(µ, K) = inf D(µ ⊗ KYµ ⊗ ν) connected to the system being controlled. ν U ∈P( ) (where the infimum is achieved uniquely by ν = µK), we shown that the kernel W ∈ M(ZSX) given by the composi- ˇ can introduce the Lagrangian relaxation tion of K and the deterministic mapping u ↦ K (⋅Su) is feasible for∗ the problem (8) with Z = P(X). Moreover,∗ if K, ν, s sD µ K µ ν µ K, c Lµ( ) ≜ ( ⊗ Y ⊗ ) + ⟨ ⊗ ⟩ we choose the controller Φ ∈ M(USZ) in such a way that ˇ for s ≥ 0 and ν ∈ P(U), and establish the following key Φ(duSz) is supported on the set {u ∈ U ∶ K (⋅Su) = z}, results [14], [17]: then the resulting cost will not exceed Dµ(R). [In∗ fact, with the choice Z = P(X), we can write Dµ(R) in the form Proposition 1. The DRF D R is convex and nonde- µ( ) (10).] Again, taking the infimum over Z, we get V (R) ≤ creasing in R. If Dµ R , then a Markov kernel ( ) < ∞ Dµ(R). K ∈ M(USX) attains the infimum in (11) if and only if I(∗µ, K ) = R and the Radon–Nikodym derivative of µ⊗K V. AVERAGE-COSTOPTIMALCONTROLWITHRATIONAL w.r.t. µ∗⊗ µK takes the form ∗ INATTENTION ∗ We now turn to the analysis of the infinite-horizon control d(µ ⊗ K ) 1 c x,u (x, y) = α(x)e s (12) problem (7) with an information constraint. In multi-stage d(µ ⊗ µK∗ ) − ( ) control problems, such as this one, the control law has a ∗ where α ∶ X → R and s ≥ 0 are such that dual effect [20]: it affects both the cost at the current stage + 1 c x,u and the uncertainty about the state at future stages. The α(x)e s µ(dx) ≤ 1, ∀u ∈ U (13) SX − ( ) presence of the mutual information constraint (7b) enhances this dual effect, since it prevents the controller from ever and −s is the slope of a line tangent to the graph of Dµ(R) at R: learning “too much” about the state. This, in turn, limits the controller’s future ability to keep the average cost low. Dµ(R ) + sR ≥ Dµ(R) + sR, ∀R ≥ 0. (14) These considerations suggest that, in order to bring rate- ′ ′ ′ distortion theory to bear on the problem (7), we cannot use Proposition 2. The DRF Dµ(R) can be expressed as the one-stage cost c as the distortion function. Instead, we ⎡ ⎤ ⎢ 1 ⎥ must modify it to account for the effect of the control action Dµ(R) = sup inf s ⎢dµ, log 1 i − R⎥ 0 ν U s c x,u on future costs. As we will see, this modification implies a s ⎢ U e ν(du) ⎥ ⎣ ∫ ⎦ certain stochastic relaxation of the Average Cost Optimality ≥ ∈P( ) − ( ) We are now in a position to state and prove the main result Equation (ACOE) that characterizes optimal performance of this section: achievable by any MRS control law in the absence of Theorem 2 (Stratonovich). For any R ≥ 0 such that information constraints. , we have , and the infimum Dµ(R) < ∞ V (R) = Dµ(R) A. Reduction to single-stage optimization over Z in (9) is attained at Z = P(X). ∗ To construct this modification in a principled manner, Proof (Sketch). One direction, V (R) ≥ Dµ(R), is relatively we first reduce the dynamic optimization problem (7) to straightforward. Fix Z, and suppose that W ∈ M(ZSX) and a suitable static (single-stage) problem. Once this has been Φ ∈ M(USZ) attain the optimal value V (∗R, Z). Consider carried out, we will be able to take advantage of the results of the∗ Markov kernel K ∈ M(USX) obtained by composing Section IV. The reduction is based on the so-called convex- Φ and W : K = Φ ○ W . The joint action of W and analytic approach to controlled Markov processes [21]–[25], Φ∗ can be∗ described∗ by a∗ Markov chain X → Z →∗ U , which we briefly summarize here. Given a Markov control where∗ Law(X) = µ, Law(Z SX = x) = W (⋅∗Sx), and∗ problem with initial state distribution µ and transition kernel Law(U SX = x, Z = z) = Law∗ (USZ = z) ∗= Φ (⋅Sz). Q, the long-run expected average cost of an MRS control ∗ Law ∗ ∗ ; ∗ Moreover, (X,U ) = µ ⊗ K. Since I(X Z ) ≤ R, law Φ ∈ M(USX) is given by we have K ∈ Iµ(R) by∗ the data processing inequality∗ (2). 1 T Consequently, Dµ(R) ≤ ⟨µ ⊗ K, c⟩ = V (R, Z). Taking the Jµ(Φ) = lim sup E Q c(Xt,Ut) . (15) infimum over Z, we get Dµ(R) ≤ V (R). T T t 1 To prove the other direction, let Z X 2 and consider = P( ) We wish to find an MRS→ control∞ law= Φ that would minimize the optimal kernel K ∈ M(USX) that achieves the infimum Jµ(Φ) simultaneously for all µ. Any∗ MRS control law Φ in (11). Using (12) and∗ the Bayes’ rule, we can compute the induces a transition kernel QΦ on the state space X: posterior distribution (belief state)

1 c x,u QΦ A x Q A x, u Φ du x , A X . e s µ dx ( S ) ≜ S ( S ) ( S ) ∀ ∈ B( ) Kˇ dx u ( ) . U ( S ) = − 1(c x,u) s d ∗ ∫X e µ( x) We say that Φ is stable if: − ( ) Using the minimal sufficiency property of the belief state 1) There exists a probability measure πΦ ∈ P(X) which [18], [19] and the fact that K attains the DRF, it can be is invariant under QΦ, i.e., πΦ = πΦQΦ. ∗ 2) The average cost JπΦ (Φ) is finite, and moreover 2Note that because X is standard Borel, the space Z = P(X) is a complete separable metric space w.r.t. any metric compatible with weak convergence JπΦ (Φ) = ⟨ΓΦ, c⟩ = c(x, u)ΓΦ(dx, du), of probability measures, and so Z is standard Borel as well. SX U

× where ΓΦ ≜ πΦ ⊗ Φ. Proof. Define the function Let K ⊂ M(USX) denote the space of all such stable laws. ⎧ ⎪0, Φ ∈ Kπ Theorem 3. Suppose that the following assumptions are ιπ(Φ) ≜ ⎨ ⎪+∞, Φ ∈~ Kπ satisfied: ⎩⎪ (A.1) The one-stage cost function c is nonnegative, Then we can write

● lower semicontinuous, and coercive, i.e., there exist two Jπ (R) = inf aπ ⊗ Φ, cf + ιπ(Φ) . (19) sequences of compact sets Xn ↑ X and Un ↑ U such that Φ R ∗ π lim inf c x, u . ∈I ( ) c c ( ) = +∞ Moreover, n x Xn,u Un

(A.2) The transition→∞ ∈ kernel∈ Q is continuous, i.e., Qf ∈ ιπ(Φ) = sup aπQΦ, hf − aπ, hf (20) h Cb X ● Cb(X × U) for any f ∈ Cb(X). ∈ ( ) Let J ≜ infΦ infµ Jµ(Φ). Then there exists a control law Indeed, if Φ ∈ Kπ, then the right-hand side of (20) is zero. Φ ∈ K∗ , such that On the other hand, suppose that Φ ∈~ Kπ. Since X is standard ∗ Borel, any two probability measures µ, ν ∈ P(X) are equal if JπΦ∗ (Φ ) = J = inf ⟨ΓΦ, c⟩. (16) Φ and only if µ, h ν, h for all h Cb X . Consequently, ∗ ∗ ⟨ ⟩ = ⟨ ⟩ ∈ ( ) there exists some X such that . B. Bellman error minimization via∈K marginal decomposition h0 ∈ Cb( ) ⟨π, h0⟩ ≠ ⟨πQΦ, h0⟩ There is no loss of generality if we assume that πQ , h For our purposes, it is convenient to decompose the ⟨ Φ 0⟩− π, h 0. Then by considering functions hn nh for all infimum over Φ in (16) by first fixing the marginal π X . ⟨ 0⟩ > 0 = 0 ∈ P( ) n 1, 2,... and taking the limit as n , we can make the Consider the set of all stable control laws that leave π = → ∞ right-hand side of (20) grow without bound. Thus, we have invariant, proved (20). Substituting it into (19), we get (18).

Φ π πΦ Kπ ≜ š ∈ K ∶ = Ÿ To analyze the optimization problem (17), let us fix some This set may very well be empty for some π. Regardless, π and consider the dual value assuming that the conditions of Theorem 3 are satisfied, we J ,π(R) ≜ sup inf ⟨π ⊗ Φ, c + Qh − h⟩ (21) can write Φ h Cb X π R ∗ J = inf ⟨ΓΦ, c⟩ = inf inf ⟨π ⊗ Φ, c⟩. ∈ ( ) ∈I ( ) Φ π X Φ π Clearly, J R J ,π R for all π and R. Moreover: ∗ π ( ) ≥ ( ) ∗ We are now in∈K a position∈ toP( introduce) ∈K the information Proposition 4. Suppose∗ that assumption (A.1) above is constraint. Let Kπ(R) ≜ Kπ ∩ Iπ(R). Then we define the satisfied, and that Jπ (R) < ∞. Then the primal value Jπ (R) optimal steady-state value of the information-constrained and the dual value ∗J ,π(R) are equal. ∗ average-cost control problem as 0 ∗ Proof. Let Pπ,c(R) ⊂ P(X × U) be the closure, in the weak J (R) ≜ inf inf ⟨π ⊗ Φ, c⟩. (17) topology, of the set of all probability measures Γ ∈ P(X×U), π X Φ π R ∗ such that Γ(A×U) = π(A), I(Γ) ≤ R, and ⟨Γ, c⟩ < ∞. Since ∈P( ) ∈K ( ) As a first step to understanding solutions to (17), we consider Jπ (R) < ∞ by hypothesis, we can write each candidate invariant distribution π ∈ P(X) separately: ∗ Jπ (R) = inf sup ⟨Γ, c + Qh − h⟩. (22) Γ 0 R Proposition 3. For any π ∈ P(X), let ∗ π,c h Cb X ∈P ( ) ∈ ( ) Jπ (R) ≜ inf ⟨π ⊗ Φ, c⟩. Because c is coercive and nonnegative, the set {Γ ∈ P(X × Φ π R ∗ U) ∶ ⟨Γ, c⟩ < ∞} is tight [26, Proposition 1.4.15], so Then ∈K ( ) its closure is weakly sequentially compact by Prohorov’s theorem. Moreover, because the function Γ ↦ I(Γ) is weakly Jπ (R) = inf sup ⟨π ⊗ Φ, c + Qh − h⟩ (18) Φ R lower semicontinuous [8], the set Γ I Γ R is closed. ∗ π h Cb X { ∶ ( ) ≤ } Therefore, the set 0 is closed and tight, hence weakly ∈I ( ) Pπ,c(R) Remark 1. The function∈ h( )∈ Cb(X) plays the role of a 0 sequentially compact. Moreover, the sets Pπ,c(R) and Cb(X) Lagrange multiplier associated with the constraint Φ ∈ Kπ, are both convex, and the objective function on the right-hand which is what can be expected from the theory of average- side of (22) is affine in Γ and linear in h. Therefore, by Sion’s cost optimal control [25, Ch. 9]. minimax theorem [27] we may interchange the supremum On setting η = ⟨π ⊗ Φ, c⟩, the function c + Qh − h − η and the infimum to conclude that J R J R . is the Bellman error associated with that is central to π ( ) = ,π( ) h ∗ approximate dynamic programming. ∗ We are now in a position to relate the optimal value Jπ (R) Remark 2. Both in (18) and elsewhere, we can extend the to a suitable rate-distortion problem. ∗ supremum over to all 1 without affecting the value h h ∈ L (π) Theorem 4. Consider a probability measure π ∈ P(X) such of J R . π ( ) that Jπ (R) < ∞, and the supremum over h ∈ Cb(X) in (21) ∗ ∗ is attained by some h . Then there exists an MRS control Jπ (R) is equal to the value of the following optimization law Φ ∈ M(USX), such∗ that I(π, Φ ) = R, problem:∗ ∗ ∗ maximize λ Jπ (R) = ⟨π ⊗ Φ , c + Qh − h ⟩, ∗ ∗ ∗ ∗ 1 h log and the Radon–Nikodym derivative of π Φ w.r.t. π πΦ s.t. s dπ, 1 − i ⊗ ⊗ s c x,u Qh x,u U e ν(du) s takes the form ∗ ∗ ∫ ≥ λ−+[sR,( )+∀ν(∈ P)](U) 1 d x,u d π Φ e s λ 0, s 0, h C X ( ⊗ ) x, u , (23) ≥ ≥ ∈ b( ) ∗ ( ) = 1 d−x,u( ) d π πΦ s Φ d ( ⊗ ) ∫U e π ( u) Let us examine what happens as we relax the information ∗ − ( ) ∗ constraint, i.e., let R . From the fact that the DRF is where d(x, u) ≜ c(x, u) + Qh (x, u), and s ≥ 0 satisfies → ∞ ∗ convex and nondecreasing in R, and from the characteriza- Dπ(R ; c + Qh ) + sR ≥ Dπ(R; c + Qh ) + sR (24) tion (24) of −s as the slope of a tangent to the graph of ′ ∗ ′ ∗ the DRF at R, this is equivalent to letting s approach 0 for all R . (with the convention that sR → 0 even as R → ∞). Let us ′ recall Laplace’s principle [28]: for any ν ∈ P(U) and any Proof. Using Proposition 4 and the definition (21) of the F 1 measurable function F ∶ U → R such that e ∈ L (ν), we dual value J ,π R , we can express J R as a pointwise ( ) π ( ) have − supremum of a family of DRF’s: ∗ ∗ 1 F u − lim s log e s ν(du) = ess inf F (u), s 0 SU u U Jπ (R) = sup [Dπ(R; c + Qh) − ⟨π, h⟩] . (25) − ( ) h C X ∗ b where the essential↓ infimum is defined w.r.t.∈ ν. In view of ∈ ( ) this, the limit of Jπ (R) as R → ∞ is the value of Since Jπ (R) < ∞, we can apply Proposition 1 separately for ∗ each h ∗∈ Cb(X). In particular, taking any h ∈ Cb(X) that maximize λ achieves the supremum in (25) and using (12)∗ with s.t. cπ, inf [c(x, u) + Qh(x, u)] − hh ≥ λ 1 u U α x , ( ) = 1 d x,u λ 0, h Cb X s Φ d ≥ ∈ ∈ ( ) ∫U e π ( u) − ( ) ∗ Performing now the minimization over π X as well, we get (23). In the same way, (24) follows from (14). ∈ P( ) we see that the limit of J (R) as R → ∞ is given by the value of the following problem:∗ Theorem 5. The optimal value Jπ (R) admits the following variational representation: ∗ maximize λ s.t. inf [c(x, u) + Qh(x, u)] − h ≥ λ ⎪⎧ u U sup sup inf ⎪ Jπ (R) = ⎨ − ⟨π, h⟩ λ 0, h C X s 0 h C X ν U ⎪ ∈≥ ∈ b( ) ∗ b ⎩⎪ ⎡ ≥ ∈ ( ) ∈P( ) 1 ⎤ ⎪⎫ In particular, under Assumptions (A.1) and (A.2) of Theo- s ⎢ π, log R⎥ ⎪ 0 + ⎢d 1 c x,u Qh x,u i − ⎥ ⎬ rem 3, there exist a function h and a number λ ≥ that ⎢ e s ν du ⎥ ⎪ ⎣ ∫U ( ) ⎦ ⎭⎪ solve the Average Cost Optimality∞ Equation (ACOE)∞ − [ ( )+ ( )] Proof. Immediate from (25) and Proposition 2. h (x) + λ = inf [c(x, u) + Qh (x, u)] , u U ∞ ∞ ∞ To complete the computation of the optimal steady-state and J (R) → λ as R →∈ ∞. When a nontrivial information value J (R) defined in (17), we need to consider all can- constraint∗ is present,∞ i.e., when R ∈ (0, ∞), we may consider didate invariant∗ distributions π ∈ P(X) for which Kπ(R) is instead a stochastic relaxation of the ACOE, given by nonempty, and then choose among them any π that attains ⟨π, h⟩ + λ + sR the smallest value of Jπ (R) (assuming this value is finite). 1 The corresponding control∗ law will then be characterized by s inf π, log , = d 1 c x,u Qh x,u i ν U s d Theorem 4. Of course, even if we can guarantee the existence ∫U e ν( u) of such an optimal π under suitable assumptions on the cost which must be∈P( solved) for h, −s,[ and( λ)+. If( a solution)] triple c and the transition law Q, it will be exceedingly difficult to (h, s, λ) exists, and if ν ∈ P(U) attains the infimum on solve for it explicitly. On the other hand, if J R π ( ) < ∞ the right-hand side, then∗ a control law Φ ∈ M(USX) can be for some π, then Theorem 4 ensures that there∗ exists a defined through suboptimal control law satisfying the information constraint. 1 c x,u Qh x,u e s ν du Φ du x ( ) . ( S ) = − [1 (c x,u)+ Qh( x,u)] ∗ C. Some implications s d ∫U e ν ( u) − [ ( )+ ( )] ∗ We close this section by examining some implications of Moreover, if π = πΦ, then by Theorems 4 and 5, Jπ (R) = λ, the results proved above. Using Theorem 5, we see that and (h, Φ) are an optimizing pair for Jπ (R). ∗ ∗ VI.CONCLUSIONS where the variables (Xt,Zt,Ut ) are obtained from the orig- 2 The main contributions of this paper are to pose a general inal ones by replacing W′ 2(′dz2′Sx , z1, u1) by W2(dz2Sx2). ′ rational inattention model for optimal control with informa- REFERENCES tion constraints, and to reveal structure for the associated [1] P. Varaiya and J. Walrand, “Causal coding and control for Markov optimal control equations. Future work will include the chains,” Systems and Control Letters, vol. 3, pp. 189–192, 1983. construction of reinforcement learning algorithms that make [2] R. Bansal and T. Bas¸ar, “Simultaneous design of measurement and use of this structure to obtain approximately optimal policies control strategies for stochastic systems with feedback,” Automatica, vol. 25, no. 5, pp. 679–694, 1989. based on online measurements of the system to be controlled. [3] S. Tatikonda and S. Mitter, “Control over noisy channels,” IEEE This is particularly natural in applications to finance or Transactions on Automatic Control, vol. 49, no. 7, pp. 1196–2001, economics. July 2004. [4] S. Tatikonda, A. Sahai, and S. Mitter, “Stochastic linear control over APPENDIX I a communication channel,” IEEE Transactions on Automatic Control, vol. 49, no. 9, pp. 1549–1561, September 2004. ROOFOFTHE RINCIPLEOF RRELEVANT NFORMATION P P I I [5]S.Y uksel¨ and T. Linder, “Optimization and convergence of obser- Let M(dυSψ) be the conditional distribution of Υ given vation channels in stochastic control,” SIAM Journal on Control and Optimization, vol. 50, no. 2, pp. 864–887, 2012. Ψ, let Λ(dψSθ, ξ) be the conditional distribution of Ψ given [6] C. Sims, “Implications of rational inattention,” Journal of Monetary (θ, ξ), and disintegrate the joint distribution of Θ, Ξ, Ψ, Υ as Economics, vol. 50, no. 3, pp. 665–690, 2003. [7] ——, “Rational inattention: Beyond the linear-quadratic case,” The P (dθ, dξ, dψ, dυ) = P (dθ)P (dξSθ)Λ(dψSθ, ξ)M(dυSψ). American Economic Review, vol. 96, no. 2, pp. 158–163, 2006. [8] M. S. Pinsker, Information and Information Stability of Random If we define Λ (dψSθ) by Variables and Processes. Holden-Day, 1964. ′ [9] F. Matejka and A. McKay, “Rational Inattention to discrete choices: Λ (⋅Sθ) = S W (⋅Sθ, ξ)P (dξSθ) A new foundation for the multinomial logit model,” 2011, working ′ paper. and let the tuple Θ , Ξ , Ψ , Υ have the joint distribution [10] ——, “Simple market equilibria with rationally inattentive con- ( ) sumers,” The American Economic Review, vol. 102, no. 3, pp. 24–29, ′ ′ ′ ′ P dθ, dξ, dψ, dυ P dθ P dξ θ Λ dψ θ M dυ ψ , 2012. ( ) = ( ) ( S ) ( S ) ( S ) [11] L. Peng, “Learning with information capacity constraints,” Journal of ′ ′ then it is easy to see that it has all of the desired properties. Financial and Quantitative Analysis, vol. 40, no. 2, pp. 307–329, 2005. [12] H. S. Witsenhausen, “On the structure of real-time source coders,” Bell APPENDIX II System Technical Journal, vol. 58, no. 6, pp. 1437–1451, July 1979. [13] S. P. Meyn and G. Mathew, “Shannon meets Bellman: Feature based PROOFOFTHE TWO-STAGE LEMMA Markovian models for detection and optimization,” in Proc. 47th IEEE Note that Z1 only depends on X1, and that only the CDC, 2008, pp. 5558–5564. [14] T. Berger, Rate Distortion Theory, A Mathematical Basis for Data second-stage expected cost is affected by the choice of W2. Compression. Prentice Hall, 1971. We can therefore apply the Principle of Irrelevant Informa- [15] R. L. Stratonovich, “On value of information,” Izvestiya of USSR Academy of Sciences, Technical Cybernetics, vol. 5, pp. 3–12, 1965. tion to Θ = X2, Ξ = (X1,Z1,U1), Ψ = Z2 and Υ = U2. [16] ——, Information Theory. Moscow, USSR: Sovetskoe Radio, 1975. Because both the expected cost E[c(Xt,Ut)] and the mutual [17] I. Csiszar,´ “On an extremum problem of information theory,” Studia information I(Xt; Zt) depend only on the corresponding Scientiarum Mathematicarum Hungarica, vol. 9, pp. 57–71, 1974. bivariate marginals, the lemma is proved. [18] P. R. Halmos and L. J. Savage, “Application of the Radon–Nikodym theorem to the theory of sufficient ,” Annals of Mathematical APPENDIX III Statistics, vol. 20, pp. 225–241, 1949. [19] G. Picci, “Some connections between the theory of sufficient statistics ROOFOFTHE HREE TAGE EMMA P T -S L and the identifiability problem,” SIAM Journal on Applied Mathemat- Again, Z1 only depends on X1, and only the second- and ics, vol. 33, no. 3, pp. 383–398, 1977. the third-stage expected costs are affected by the choice of [20] Y. Bar-Shalom and E. Tse, “Dual effect, certainty equivalence, and separation in stochastic control,” IEEE Transactions on Automatic W2. By the law of iterated expectation, we have Control, vol. 19, pp. 494–500, 1974. [21] A. Manne, “Linear programming and sequential decisions,” Manage- E[c(X3,U3)] = E[E[c(X3,U3)SX2,U2]] = E[h(X2,U2)], ment Science, vol. 6, pp. 257–267, 1960. [22] V. S. Borkar, “A convex analytic approach to Markov decision pro- where h(X2,U2) ≜ E[c(X3,U3)SX2,U2]. Note that the cesses,” Probabillity Theory and Related Fields, vol. 78, no. 583-602, functional form of h does not depend on the choice of W2, 1988. since for any fixed realizations and we [23] O. Hernandez-Lerma´ and J. B. Lasserre, “Linear programming and X2 = x2 U2 = u2 average optimality of Markov control processes on Borel spaces — have, by hypothesis, unbounded costs,” SIAM Journal on Control and Optimization, vol. 32, no. 2, pp. 480–500, March 1994. h(x2, u2) = c(x3, u3)P (dx3, du3Sx2, u2) [24] V. S. Borkar, “Convex analytic methods in Markov decision pro- S cesses,” in Handbook of Markov Decision Processes, E. Feinberg and c x , u Q dx x , u W dz x Φ du dz . A. Shwartz, Eds. Boston, MA: Kluwer, 2001. = S ( 3 3) ( 3S 2 2) 3( 3S 3) 3( 3S 3) [25] S. P. Meyn, Control Techniques for Complex Networks. Cambridge Univ. Press, 2008. Therefore, applying the Principle of Irrelevant Information [26] O. Hernandez-Lerma´ and J. B. Lasserre, Markov Chains and Invariant to Θ = X2, Ξ = (X1,Z1,U1), Ψ = Z2, and Υ = U2, Probabilities. Birkhauser,¨ 2003. [27] M. Sion, “On general minimax theorems,” Pacific Journal of Mathe- E[c(X2,U2) + c(X3,U3)] = E[c(X2,U2) + h(X2,U2)] matics, vol. 8, pp. 171–176, 1958. ′ ′ ′ ′ ′ ′ ′ ′ [28] P. Dupuis and R. S. Ellis, A Weak Convergence Approach to the Theory = E[c(X2,U2) + h(X2,U2)] of Large Deviations. New York: Wiley, 1997. = E[c(X2,U2) + c(X3,U3)],