EE378A Statistical Signal Processing Lecture 2 - 04/06/2017 Lecture 2: Statistical Decision Theory Lecturer: Jiantao Jiao Scribe: Andrew Hilger
In this lecture, we discuss a unified theoretical framework for statistics proposed by Abraham Wald, known as statistical decision theory.¹
1 Goals
1. Evaluation: The theoretical framework should aid fair comparisons between algorithms (e.g., maximum entropy vs. maximum likelihood vs. method of moments).
2. Achievability: The theoretical framework should be able to inspire the construction of statistical algorithms that are (nearly) optimal under the optimality criteria introduced in the framework.
2 Basic Elements of Statistical Decision Theory
1. Statistical Experiment: A family of probability measures P = {Pθ : θ ∈ Θ}, where θ is a parameter and Pθ is a probability distribution indexed by the parameter.
2. Data: X ∼ Pθ, where X is a random variable observed for some parameter value θ.
3. Objective: g(θ), e.g., inference on the entropy of the distribution Pθ.
4. Decision Rule: δ(X). The decision rule need not be deterministic; it may be randomized, with an associated conditional distribution $P_{\delta|X}$.
5. Loss Function: L(θ, δ). The loss function tells us how bad we feel about our decision once we find out the true value of the parameter θ chosen by nature.
Example: $P_\theta(x) = \frac{1}{\sqrt{2\pi}} e^{-(x-\theta)^2/2}$, $g(\theta) = \theta$, and $L(\theta, \delta) = (\theta - \delta)^2$. In other words, X is normally distributed with mean θ and unit variance, $X \sim N(\theta, 1)$, and we are trying to estimate the mean θ. We judge our success (or failure) using mean-square error.
3 Risk Function
Definition 1 (Risk Function).
$$R(\theta, \delta) \triangleq E[L(\theta, \delta(X))] \tag{1}$$
$$= \int L(\theta, \delta(x)) \, P_\theta(dx) \tag{2}$$
$$= \iint L(\theta, \delta) \, P_{\delta|X}(d\delta|x) \, P_\theta(dx). \tag{3}$$
1See Wald, Abraham. ”Statistical decision functions.” In Breakthroughs in Statistics, pp. 342-357. Springer New York, 1992.
Figure 1: Example risk functions computed over a range of parameter values for two different decision rules.
A risk function evaluates a decision rule's success over a large number of experiments with a fixed parameter value θ. By the law of large numbers, if we observe X many times independently, the average empirical loss of the decision rule δ converges to the risk R(θ, δ). Even after determining the risk functions of two decision rules, it may still be unclear which is better. Consider the example of Figure 1: two decision rules δ₁ and δ₂ yield two risk functions R(θ, δ₁) and R(θ, δ₂) as functions of the parameter θ. The rule δ₁ is inferior for low and high values of θ but superior for the middle values. We therefore need new ideas to enable us to compare different decision rules.
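To make the crossing risk curves of Figure 1 concrete, here is a small numerical sketch (not from the lecture): for a single observation $X \sim N(\theta, 1)$ under squared error, the rule $\delta_1(x) = x$ has constant risk 1, while the shrinkage rule $\delta_2(x) = x/2$ has risk $1/4 + \theta^2/4$, so neither dominates the other. The model and rules are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_risk(delta, theta, n_trials=200_000):
    """Monte Carlo estimate of R(theta, delta) = E[(theta - delta(X))^2], X ~ N(theta, 1)."""
    x = rng.normal(theta, 1.0, size=n_trials)
    return float(np.mean((theta - delta(x)) ** 2))

delta1 = lambda x: x          # risk identically 1 for every theta
delta2 = lambda x: x / 2.0    # risk 1/4 + theta^2/4: better near 0, worse far away

risk_at_0 = (empirical_risk(delta1, 0.0), empirical_risk(delta2, 0.0))   # ~ (1.0, 0.25)
risk_at_3 = (empirical_risk(delta1, 3.0), empirical_risk(delta2, 3.0))   # ~ (1.0, 2.5)
```

Averaging the loss over many independent draws of X is exactly the law-of-large-numbers interpretation of the risk given above.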
4 Optimality Criterion of Decision Rules
Given the risk function of various decision rules as a function of the parameter θ, there are various approaches to determining which decision rule is optimal.
4.1 Restrict the Competitors

This is a traditional set of methods that has been overshadowed by the approaches we introduce later. A decision rule δ₀ is eliminated (formally, is inadmissible) if there is another decision rule δ that is strictly better, i.e., R(θ, δ₀) ≥ R(θ, δ) for all θ ∈ Θ, with strict inequality for at least one θ₀ ∈ Θ. However, many decision rules cannot be eliminated in this way, and we still lack a criterion to choose among them. To aid in selection, the rationale of restricting the competitors is to consider only decision rules belonging to a certain class D. The advantage is that sometimes all but one decision rule in D is inadmissible, and we simply use the only admissible one.
1. Example 1: the class of unbiased decision rules $D_0 = \{\delta : E[\delta(X)] = g(\theta), \ \forall \theta \in \Theta\}$.
2. Example 2: the class of invariant estimators.
However, a serious drawback of this approach is that D may be an empty set for various decision theoretic problems.
4.2 Bayesian: Average Risk Optimality

The idea is to use averaging to reduce the risk function R(θ, δ) to a single number for any given δ.

Definition 2 (Average risk under prior Λ(dθ)).
$$r(\Lambda, \delta) = \int R(\theta, \delta) \, \Lambda(d\theta) \tag{4}$$
Here Λ is the prior distribution, a probability measure on Θ. The Bayesians and the frequentists disagree about Λ; namely, the frequentists do not believe in the existence of the prior. However, there exist justifications of the Bayesian approach beyond the interpretation of Λ as prior belief: indeed, the complete class theorem in statistical decision theory asserts that in various decision theoretic problems, all admissible decision rules can be approximated by Bayes estimators.²

Definition 3 (Bayes estimator).
$$\delta_\Lambda = \arg\min_\delta r(\Lambda, \delta) \tag{5}$$
The Bayes estimator $\delta_\Lambda$ can usually be found using the principle of computing posterior distributions. Note that
$$r(\Lambda, \delta) = \iiint L(\theta, \delta) \, P_{\delta|X}(d\delta|x) \, P_\theta(dx) \, \Lambda(d\theta) \tag{6}$$
$$= \int \left[ \iint L(\theta, \delta) \, P_{\delta|X}(d\delta|x) \, P_{\theta|X}(d\theta|x) \right] P_X(dx) \tag{7}$$
where $P_X(dx)$ is the marginal distribution of X and $P_{\theta|X}(d\theta|x)$ is the posterior distribution of θ given X. In Equation (6), $P_\theta(dx)\Lambda(d\theta)$ is the joint distribution of θ and X. In Equation (7), we only have to minimize the portion in brackets to minimize r(Λ, δ), because $P_X(dx)$ does not depend on δ.

Theorem 4.³ Under mild conditions,
$$\delta_\Lambda(x) = \arg\min_\delta E[L(\theta, \delta) \,|\, X = x] \tag{8}$$
$$= \arg\min_{P_{\delta|X}} \iint L(\theta, \delta) \, P_{\delta|X}(d\delta|x) \, P_{\theta|X}(d\theta|x) \tag{9}$$

Lemma 5. If L(θ, δ) is convex in δ, it suffices to consider deterministic rules δ(x).

Proof Jensen's inequality:
$$\iint L(\theta, \delta) \, P_{\delta|X}(d\delta|x) \, P_{\theta|X}(d\theta|x) \ \ge\ \int L\Big(\theta, \int \delta \, P_{\delta|X}(d\delta|x)\Big) \, P_{\theta|X}(d\theta|x). \tag{10}$$
Examples

1. $L(\theta, \delta) = (g(\theta) - \delta)^2 \ \Rightarrow\ \delta_\Lambda(x) = E[g(\theta) \,|\, X = x]$. In other words, the Bayes estimator under squared error loss is the conditional expectation of g(θ) given x.
2. $L(\theta, \delta) = |g(\theta) - \delta| \ \Rightarrow\ \delta_\Lambda(x)$ is any median of the posterior distribution $P_{\theta|X=x}$.
3. $L(\theta, \delta) = \mathbb{1}(g(\theta) \ne \delta) \ \Rightarrow\ \delta_\Lambda(x) = \arg\max_\theta P_{\theta|X}(\theta|x) = \arg\max_\theta P_\theta(x)\Lambda(\theta)$. In other words, an indicator loss function results in the maximum a posteriori (MAP) estimator.⁴
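Given a (discrete) posterior, the three Bayes estimators above can be read off directly. A minimal sketch with a made-up five-point posterior (all numbers are illustrative):

```python
import numpy as np

# A toy discrete posterior P(theta | X = x) over five parameter values.
theta_vals = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
posterior  = np.array([0.05, 0.10, 0.45, 0.30, 0.10])

post_mean = float(np.sum(theta_vals * posterior))      # Bayes estimate under squared error
cdf = np.cumsum(posterior)
post_median = float(theta_vals[np.searchsorted(cdf, 0.5)])  # Bayes estimate under absolute error
post_map = float(theta_vals[np.argmax(posterior)])     # Bayes estimate under 0-1 loss (MAP)
```

Here the mean, median, and mode of the same posterior generally differ; only the loss function decides which is the "right" summary.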
2See Chapter 3 of Liese, Friedrich, and Klaus-J. Miescke. "Statistical Decision Theory: Estimation, Testing, and Selection." (2009).
3See Theorem 1.1, Chapter 4 of Lehmann, E. L., and G. Casella. Theory of Point Estimation. Springer Science & Business Media; 1998.
4Next week, we will cover special cases of Pθ and how to compute the Bayes estimator in a computationally efficient way. In the general case, however, computing the posterior distribution may be difficult.
4.3 Frequentist: Worst-Case Optimality (Minimax)

Definition 6 (Minimax estimator). The decision rule δ* is minimax among all decision rules in D iff
$$\sup_{\theta \in \Theta} R(\theta, \delta^*) = \inf_{\delta \in D} \sup_{\theta \in \Theta} R(\theta, \delta). \tag{11}$$
4.3.1 First observation

Since
$$R(\theta, \delta) = \iint L(\theta, \delta) \, P_{\delta|X}(d\delta|x) \, P_\theta(dx) \tag{12}$$
is linear in $P_{\delta|X}$, the risk is a convex function of the (randomized) decision rule. The supremum of convex functions is convex, so finding the minimax decision rule is a convex optimization problem. However, solving this convex optimization problem may be computationally intractable. For example, it may not even be computationally tractable to compute the supremum of the risk function R(θ, δ) over θ ∈ Θ. Hence, finding the exact minimax estimator is usually hard.
4.3.2 Second observation

Due to the difficulty of finding the exact minimax estimator, we turn to another goal: we wish to find an estimator δ₀ such that
$$\inf_\delta \sup_\theta R(\theta, \delta) \ \le\ \sup_\theta R(\theta, \delta_0) \ \le\ c \cdot \inf_\delta \sup_\theta R(\theta, \delta) \tag{13}$$
where c > 1 is a constant. The left inequality is trivially true. For the right inequality, in practice one can usually choose some specific δ₀ and evaluate an upper bound of $\sup_\theta R(\theta, \delta_0)$ explicitly. However, it remains to find a lower bound of $\inf_\delta \sup_\theta R(\theta, \delta)$. To solve the problem (and save the world), we can use the minimax theorem.

Theorem 7 (Minimax Theorem (Sion–Kakutani)). Let Λ, X be two compact, convex sets in some topological vector spaces, and let $H(\lambda, x): \Lambda \times X \to \mathbb{R}$ be a continuous function such that:
1. H(λ, ·) is convex for any fixed λ ∈ Λ;
2. H(·, x) is concave for any fixed x ∈ X.
Then
1. Strong duality: $\max_\lambda \min_x H(\lambda, x) = \min_x \max_\lambda H(\lambda, x)$;
2. Existence of a saddle point: $\exists (\lambda^*, x^*)$ such that $H(\lambda, x^*) \le H(\lambda^*, x^*) \le H(\lambda^*, x)$ for all $\lambda \in \Lambda,\ x \in X$.
The existence of a saddle point implies strong duality. We note that, unlike strong duality, the following weak duality always holds without any assumptions on H:
$$\sup_\lambda \inf_x H(\lambda, x) \ \le\ \inf_x \sup_\lambda H(\lambda, x) \tag{14}$$
We define the quantity $r_\Lambda \triangleq \inf_\delta r(\Lambda, \delta)$ as the Bayes risk under the prior distribution Λ. We have the following line of argument using weak duality:
$$\inf_\delta \sup_\theta R(\theta, \delta) = \inf_\delta \sup_\Lambda r(\Lambda, \delta) \tag{15}$$
$$\ge \sup_\Lambda \inf_\delta r(\Lambda, \delta) \tag{16}$$
$$= \sup_\Lambda r_\Lambda \tag{17}$$
Equation (17) gives us a strong tool for lower bounding the minimax risk: for any prior distribution Λ, the Bayes risk under Λ is a lower bound on the minimax risk. When the condition of the minimax theorem is satisfied (which may be expected due to the bilinearity of r(Λ, δ) in the pair (Λ, δ)), equation (16) holds with equality, which shows that there exists a sequence of priors whose corresponding Bayes risks converge to the minimax risk. In practice, it suffices to choose an appropriate prior distribution Λ in order to obtain a (nearly) minimax estimator.
5 Decision Theoretic Interpretation of Maximum Entropy
Definition 8 (Logarithmic Loss). For a distribution P over $\mathcal{X}$ and $x \in \mathcal{X}$,
$$L(x, P) \triangleq \log \frac{1}{P(x)}. \tag{18}$$
Note that P(x) is the discrete mass on x in the discrete case, and the density at x in the continuous case. By definition, if P(x) is small, then the loss is large. Now we see how the maximum likelihood estimator coincides with the so-called maximum entropy estimator under the logarithmic loss function. Specifically, suppose we have functions $r_1(x), \cdots, r_k(x)$ and compute the following empirical averages:
$$\beta_i \triangleq \frac{1}{n} \sum_{j=1}^n r_i(x_j) = E_{P_n}[r_i(X)], \qquad i = 1, \cdots, k \tag{19}$$
where $P_n$ denotes the empirical distribution based on the n samples $x_1, \cdots, x_n$. By the law of large numbers, it is expected that $\beta_i$ is close to the true expectation under P, so we may restrict the true distribution P to lie in the following set:
$$\mathcal{P} = \{P : E_P[r_i(X)] = \beta_i, \ 1 \le i \le k\}. \tag{20}$$
Also, denote by $\mathcal{Q}$ the set of all probability distributions which belong to the exponential family with central statistics $r_1(x), \cdots, r_k(x)$, i.e.,
$$\mathcal{Q} = \Big\{Q : Q(x) = \frac{1}{Z(\lambda)} e^{\sum_{i=1}^k \lambda_i r_i(x)}, \ \forall x \in \mathcal{X}\Big\} \tag{21}$$
where Z(λ) is the normalization factor. Now we consider two estimators of P . The first one is the maximum likelihood estimator over Q, i.e.,
$$P_{\mathrm{ML}} = \arg\max_{P \in \mathcal{Q}} \sum_{j=1}^n \log P(x_j). \tag{22}$$
The second one is the maximum entropy estimator over P, i.e.,
$$P_{\mathrm{ME}} = \arg\max_{P \in \mathcal{P}} H(P) \tag{23}$$
where H(P) is the entropy of P, i.e., $H(P) = \sum_{x \in \mathcal{X}} -P(x) \log P(x)$ in the discrete case, and $H(P) = \int_{\mathcal{X}} -P(x) \log P(x) \, dx$ in the continuous case. The key result is as follows:⁵
Theorem 9. In the above setting, we have PML = PME.
5Berger, Adam L., Vincent J. Della Pietra, and Stephen A. Della Pietra. ”A maximum entropy approach to natural language processing.” Computational linguistics 22.1 (1996): 39-71.
Proof We begin with the following lemma, which is the key motivation for the use of the cross-entropy loss in machine learning.

Lemma 10. The true distribution minimizes the expected logarithmic loss:
$$P = \arg\min_Q E_P L(x, Q) \tag{24}$$
and the corresponding minimum $E_P L(x, P)$ is the entropy of P.

Proof The assertion follows from the non-negativity of the KL divergence $D(P \| Q) \ge 0$, where $D(P \| Q) = E_P \ln \frac{dP}{dQ}$.
Equipped with this lemma, and noting that $E_P[\log \frac{1}{Q(x)}]$ is linear in P and convex in Q, by the minimax theorem we have
$$\max_{P \in \mathcal{P}} H(P) = \max_{P \in \mathcal{P}} \min_Q E_P\Big[\log \frac{1}{Q(x)}\Big] = \min_Q \max_{P \in \mathcal{P}} E_P\Big[\log \frac{1}{Q(x)}\Big] \tag{25}$$
In particular, we know that $(P_{\mathrm{ME}}, P_{\mathrm{ME}})$ is a saddle point in $\mathcal{P} \times \mathcal{M}$, where $\mathcal{M}$ denotes the set of all probability measures on $\mathcal{X}$. We first take a look at $P_{\mathrm{ME}}$, which maximizes the LHS. By variational methods,⁶ it is easy to see that the entropy-maximizing distribution $P_{\mathrm{ME}}$ belongs to the exponential family $\mathcal{Q}$, i.e.,
$$P_{\mathrm{ME}}(x) = \frac{1}{Z(\lambda)} e^{\sum_{i=1}^k \lambda_i r_i(x)} \tag{26}$$
for some parameters $\lambda_1, \cdots, \lambda_k$. Now by the definition of saddle points, for any $Q \in \mathcal{Q} \subset \mathcal{M}$ we have
$$E_{P_{\mathrm{ME}}}\Big[\log \frac{1}{P_{\mathrm{ME}}(x)}\Big] \le E_{P_{\mathrm{ME}}}\Big[\log \frac{1}{Q(x)}\Big] \tag{27}$$
while for any $Q \in \mathcal{Q}$ with possibly different coefficients $\lambda'$,
$$E_{P_{\mathrm{ME}}}\Big[\log \frac{1}{Q(x)}\Big] = E_{P_{\mathrm{ME}}}\Big[\log Z(\lambda') - \sum_i \lambda'_i r_i(x)\Big] \quad [\text{since } Q \in \mathcal{Q}] \tag{28}$$
$$= \log Z(\lambda') - \sum_i \lambda'_i \beta_i \quad [\text{since } P_{\mathrm{ME}} \in \mathcal{P}] \tag{29}$$
$$= E_{P_n}\Big[\log Z(\lambda') - \sum_i \lambda'_i r_i(x)\Big] \quad [\text{since } P_n \in \mathcal{P}] \tag{30}$$
$$= E_{P_n}\Big[\log \frac{1}{Q(x)}\Big] \tag{31}$$
$$= \frac{1}{n} \sum_{j=1}^n \log \frac{1}{Q(x_j)}. \tag{32}$$
As a result, the saddle point condition reduces to
$$\frac{1}{n} \sum_{j=1}^n \log \frac{1}{P_{\mathrm{ME}}(x_j)} \ \le\ \frac{1}{n} \sum_{j=1}^n \log \frac{1}{Q(x_j)}, \qquad \forall Q \in \mathcal{Q} \tag{33}$$
establishing the fact that $P_{\mathrm{ME}} = P_{\mathrm{ML}}$.
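Theorem 9 can be checked numerically in a tiny instance. The sketch below uses an assumed alphabet {0, 1, 2, 3} with a single statistic r(x) = x and a made-up sample; it finds the exponential-family member matching the empirical moment by bisection (valid since $E_{Q_\lambda}[X]$ is increasing in λ), and by Theorem 9 this moment-matching distribution is also the MLE within $\mathcal{Q}$. The data and the bisection routine are our own illustrative choices.

```python
import numpy as np

xs = np.arange(4)                            # alphabet {0, 1, 2, 3}, statistic r(x) = x
samples = np.array([0, 1, 1, 2, 3, 3, 3, 2]) # toy data
beta = float(samples.mean())                 # empirical moment E_{P_n}[r(X)]

def expfam(lmbda):
    """Member of Q: Q_lambda(x) = exp(lambda * x) / Z(lambda)."""
    w = np.exp(lmbda * xs)
    return w / w.sum()

def mean_of(lmbda):
    return float(np.dot(xs, expfam(lmbda)))

# Bisect on lambda for the member of Q satisfying the moment constraint (20).
lo, hi = -10.0, 10.0
for _ in range(200):
    mid = (lo + hi) / 2
    if mean_of(mid) < beta:
        lo = mid
    else:
        hi = mid
lam = (lo + hi) / 2
p = expfam(lam)   # the max-entropy member of P, and (by Theorem 9) the MLE in Q
```

With k statistics one would instead solve the k-dimensional concave log-likelihood maximization in λ; the one-dimensional bisection is just the simplest instance.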
6See Chapter 12 of Cover, Thomas M., and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2006.
EE378A Statistical Signal Processing Lecture 3 - 04/11/2017 Lecture 3: State Estimation in Hidden Markov Processes Lecturer: Tsachy Weissman Scribe: Alexandros Anemogiannis, Vighnesh Sachidananda
In this lecture,¹ we define Hidden Markov Processes (HMPs) as well as the forward-backward recursion method that efficiently computes the posteriors in HMPs.
1 Markov Chain
We begin by introducing the Markov triplet, which is useful in the design and analysis of algorithms for state estimation in Hidden Markov Processes. Let X,Y,Z denote discrete random variables. We say that X, Y , and Z form a Markov triplet (which we denote as X − Y − Z) given the following relationships:
$$X - Y - Z \iff p(x, z|y) = p(x|y) \, p(z|y) \tag{1}$$
$$\iff p(z|x, y) = p(z|y) \tag{2}$$
If X,Y,Z form a Markov triplet, their joint distribution enjoys the important property that it can be factored into the product of two functions φ1 and φ2.
Lemma 1. X − Y − Z ⇐⇒ ∃ φ1, φ2 s.t. p(x, y, z) = φ1(x, y)φ2(y, z). Proof (Proof of necessity) Assume that X − Y − Z. Then by definition of the Markov chain, we have
p(x, y, z) = p(x|y, z)p(y, z) = p(x|y)p(y, z).
Now taking $\phi_1(x, y) = p(x|y)$ and $\phi_2(y, z) = p(y, z)$ proves this direction.

(Proof of sufficiency) Now suppose the functions $\phi_1, \phi_2$ exist as mentioned in the statement. Then we have
$$p(x|y, z) = \frac{p(x, y, z)}{p(y, z)} = \frac{p(x, y, z)}{\sum_{\tilde x} p(\tilde x, y, z)} = \frac{\phi_1(x, y)\,\phi_2(y, z)}{\phi_2(y, z) \sum_{\tilde x} \phi_1(\tilde x, y)} = \frac{\phi_1(x, y)}{\sum_{\tilde x} \phi_1(\tilde x, y)}.$$
The final equality shows that p(x|y, z) is not a function of z, and hence X − Y − Z follows by definition.
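Lemma 1 is easy to sanity-check numerically: build a joint from two arbitrary nonnegative factors and verify that p(z | x, y) does not depend on x. A small sketch (alphabet sizes and the random factors are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
phi1 = rng.random((3, 4))                  # arbitrary nonnegative phi1(x, y)
phi2 = rng.random((4, 2))                  # arbitrary nonnegative phi2(y, z)
p = np.einsum('xy,yz->xyz', phi1, phi2)
p /= p.sum()                               # joint p(x, y, z) proportional to phi1 * phi2

# By the lemma, p(z | x, y) should equal p(z | y) for every x.
p_z_given_xy = p / p.sum(axis=2, keepdims=True)
p_yz = p.sum(axis=0)
p_z_given_y = p_yz / p_yz.sum(axis=1, keepdims=True)
```

Comparing `p_z_given_xy[x]` against `p_z_given_y` for each x confirms the conditional independence X − Y − Z.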
2 Hidden Markov Processes
Based on our understanding of the Markov Triplet, we define the Hidden Markov Process (HMP) as follows:
1Reading: Ephraim & Merhav 2002; Lectures 9 and 10 of the 2016 notes.
Definition 2 (Hidden Markov Process). The process $\{(X_n, Y_n)\}_{n \ge 1}$ is a Hidden Markov Process if

• $\{X_n\}_{n \ge 1}$ is a Markov process, i.e.,
$$\forall n \ge 1: \quad p(x^n) = \prod_{t=1}^n p(x_t \,|\, x_{t-1}),$$
where by convention we take $X_0 \equiv 0$ and hence $p(x_1 | x_0) = p(x_1)$;

• $\{Y_n\}_{n \ge 1}$ is determined by a memoryless channel, i.e.,
$$\forall n \ge 1: \quad p(y^n \,|\, x^n) = \prod_{t=1}^n p(y_t \,|\, x_t).$$
The processes $\{X_n\}_{n \ge 1}$ and $\{Y_n\}_{n \ge 1}$ are called the state process and the noisy observations, respectively. In view of the above definition, we can derive the joint distribution of the states and observations as follows:
$$p(x^n, y^n) = p(x^n) \, p(y^n | x^n) = \prod_{t=1}^n p(x_t | x_{t-1}) \, p(y_t | x_t).$$
The main problem in Hidden Markov Models is to compute the posterior probability of the state at any time, given all the observations up to that time, i.e., $p(x_t | y^t)$. The naive approach is to simply use the definition of conditional probability:
$$p(x_t | y^t) = \frac{p(x_t, y^t)}{p(y^t)} = \frac{\sum_{x^{t-1}} p(y^t | x_t, x^{t-1}) \, p(x_t, x^{t-1})}{\sum_{\tilde x_t} \sum_{x^{t-1}} p(y^t | \tilde x_t, x^{t-1}) \, p(\tilde x_t, x^{t-1})}.$$
This approach requires exponentially many computations for both the numerator and the denominator. To avoid this problem, we develop an efficient method for computing the posterior probabilities using forward recursion. Before getting to the algorithm, we establish some conditional independence relations.
3 Conditional Independence Relations
We introduce some conditional independence relations which help us find more efficient ways of calculating the posterior.
1. $(X^{t-1}, Y^t) - X_t - (X_{t+1}^n, Y_{t+1}^n)$:
Proof We have
$$p(x^n, y^n) = \prod_{i=1}^n p(x_i | x_{i-1}) \, p(y_i | x_i) = \underbrace{\Big[\prod_{i=1}^t p(x_i | x_{i-1}) \, p(y_i | x_i)\Big]}_{\phi_1(x_t, \, (x^{t-1}, y^t))} \underbrace{\Big[\prod_{i=t+1}^n p(x_i | x_{i-1}) \, p(y_i | x_i)\Big]}_{\phi_2(x_t, \, (x_{t+1}^n, y_{t+1}^n))}$$
The statement then follows from Lemma 1.
2. $(X^{t-1}, Y^{t-1}) - X_t - (X_{t+1}^n, Y_t^n)$:
Proof Note that if $\{(X_i, Y_i)\}_{i=1}^n$ is a Hidden Markov Process, then the time-reversed process $\{(X_{n-i}, Y_{n-i})\}$ is also a Hidden Markov Process. Applying (a) to this reversed process proves (b).
3. $X^{t-1} - (X_t, Y^{t-1}) - (X_{t+1}^n, Y_t^n)$:
Proof Left as an exercise.
4 Causal Inference via Forward Recursion
We now derive the forward recursion algorithm as an efficient method to sequentially compute the causal posterior probabilities. Note that
$$p(x_t | y^t) = \frac{p(x_t, y^t)}{\sum_{\tilde x_t} p(\tilde x_t, y^t)} = \frac{p(x_t, y^{t-1}) \, p(y_t | x_t, y^{t-1})}{\sum_{\tilde x_t} p(\tilde x_t, y^{t-1}) \, p(y_t | \tilde x_t, y^{t-1})} = \frac{p(y^{t-1}) \, p(x_t | y^{t-1}) \, p(y_t | x_t)}{p(y^{t-1}) \sum_{\tilde x_t} p(\tilde x_t | y^{t-1}) \, p(y_t | \tilde x_t)} \quad \text{(by (b))}$$
$$= \frac{p(x_t | y^{t-1}) \, p(y_t | x_t)}{\sum_{\tilde x_t} p(\tilde x_t | y^{t-1}) \, p(y_t | \tilde x_t)}.$$
Define $\alpha_t(x_t) = p(x_t | y^t)$ and $\beta_t(x_t) = p(x_t | y^{t-1})$; then the above computation can be summarized as
$$\alpha_t(x_t) = \frac{\beta_t(x_t) \, p(y_t | x_t)}{\sum_{\tilde x_t} \beta_t(\tilde x_t) \, p(y_t | \tilde x_t)}. \tag{3}$$
Similarly, we can write β as
$$\beta_{t+1}(x_{t+1}) = p(x_{t+1} | y^t) = \sum_{x_t} p(x_{t+1}, x_t | y^t) = \sum_{x_t} p(x_t | y^t) \, p(x_{t+1} | y^t, x_t) = \sum_{x_t} p(x_t | y^t) \, p(x_{t+1} | x_t) \quad \text{(by (a))},$$
which can be summarized as
$$\beta_{t+1}(x_{t+1}) = \sum_{x_t} \alpha_t(x_t) \, p(x_{t+1} | x_t). \tag{4}$$
Equations (3) and (4) indicate that $\alpha_t$ and $\beta_t$ can be sequentially computed from each other for $t = 1, \cdots, n$ with the initialization $\beta_1(x_1) = p(x_1)$. Hence, with this simple algorithm, causal inference can be done efficiently in terms of both computation and memory. This is called the forward recursion.
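For finite alphabets, the forward recursion is a few lines of code. A sketch, assuming the model is given as NumPy arrays (the function name, array layout, and toy parameters are our own choices):

```python
import numpy as np

def forward(pi, A, B, ys):
    """Forward recursion: returns alpha_t(x) = p(x_t | y^t) for t = 1..n.

    pi: p(x_1), shape (S,)
    A:  A[i, j] = p(x_{t+1} = j | x_t = i), shape (S, S)
    B:  B[i, k] = p(y_t = k | x_t = i), shape (S, K)
    ys: observed symbols y_1, ..., y_n
    """
    beta = pi.copy()                  # beta_1(x) = p(x_1)
    alphas = []
    for y in ys:
        a = beta * B[:, y]            # measurement update, numerator of (3)
        alphas.append(a / a.sum())    # normalize: alpha_t
        beta = alphas[-1] @ A         # time update (4): beta_{t+1}
    return np.array(alphas)

# toy 2-state chain
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
alphas = forward(pi, A, B, [0, 1, 1])   # causal posteriors p(x_t | y^t)
```

Each step costs O(S²), as opposed to the exponentially many terms of the naive computation.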
5 Non-causal Inference via Backward Recursion
Now suppose that we are interested in finding $p(x_t | y^n)$ for $t = 1, 2, \cdots, n$ in a non-causal manner. For this purpose, we can derive a backward recursion algorithm as follows.
$$p(x_t | y^n) = \sum_{x_{t+1}} p(x_t, x_{t+1} | y^n) = \sum_{x_{t+1}} p(x_{t+1} | y^n) \, p(x_t | x_{t+1}, y^t) \quad \text{(by (c))}$$
$$= \sum_{x_{t+1}} p(x_{t+1} | y^n) \, \frac{p(x_t | y^t) \, p(x_{t+1} | x_t)}{p(x_{t+1} | y^t)}. \quad \text{(by (a))}$$
Now let $\gamma_t(x_t) = p(x_t | y^n)$. Then the above equation can be summarized as
$$\gamma_t(x_t) = \sum_{x_{t+1}} \gamma_{t+1}(x_{t+1}) \, \frac{\alpha_t(x_t) \, p(x_{t+1} | x_t)}{\beta_{t+1}(x_{t+1})}. \tag{5}$$
Equation (5) indicates that γt can be recursively computed based on αt’s and βt’s for t = n − 1, n − 2, ··· , 1 with the initialization of γn(xn) = αn(xn). This is called the backward recursion.
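Combining the two recursions gives the full forward-backward smoother. A sketch in the same finite-alphabet setting (array layout, names, and the toy parameters are our own choices):

```python
import numpy as np

def forward_backward(pi, A, B, ys):
    """Returns gamma_t(x) = p(x_t | y^n) via the forward then backward recursions."""
    n, S = len(ys), len(pi)
    alphas = np.zeros((n, S))
    betas = np.zeros((n + 1, S))       # betas[t] = p(x_{t+1} | y^t); betas[0] = p(x_1)
    betas[0] = pi
    for t, y in enumerate(ys):         # forward pass: (3) and (4)
        a = betas[t] * B[:, y]
        alphas[t] = a / a.sum()
        betas[t + 1] = alphas[t] @ A
    gammas = np.zeros((n, S))
    gammas[-1] = alphas[-1]            # initialization: gamma_n = alpha_n
    for t in range(n - 2, -1, -1):     # backward pass: equation (5)
        gammas[t] = alphas[t] * (A @ (gammas[t + 1] / betas[t + 1]))
    return gammas

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
gammas = forward_backward(pi, A, B, [0, 1, 1])   # smoothed posteriors p(x_t | y^3)
```

Note that the backward pass reuses the stored α's and β's, exactly as equation (5) indicates.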
EE378A Statistical Signal Processing Lecture 4 - 04/13/2017 Lecture 4: State Estimation in Hidden Markov Models (cont.) Lecturer: Tsachy Weissman Scribe: David Wugofski
In this lecture we build on previous analysis of state estimation for Hidden Markov Models to create an algorithm for estimating the posterior distribution of the states given the observations.
1 Notation
A quick summary of the notation:
1. Random variables (objects): used more "loosely", i.e., X, Y, U, V.
2. Alphabets: $\mathcal{X}, \mathcal{Y}, \mathcal{U}, \mathcal{V}$.
3. Specific realizations of random variables: x, y, u, v.
4. Vector notation: for a vector $(x_1, \cdots, x_n)$, we denote the vector of values from the i-th component to the j-th component (inclusive) by $x_i^j$, with the convention $x^i \triangleq x_1^i$.
5. Markov chains: we use the standard Markov chain notation X − Y − Z to denote that X, Y, Z form a Markov triplet.
For discrete random variable (object), we write p(x) for the pmf P(X = x), and similarly p(x, y) for pX,Y (x, y) and p(y|x) for pY |X (y|x), etc.
2 Recap of HMP
For a set of unknown states $\{X_n\}_{n \ge 1}$ of a Markov process and noisy observations $\{Y_n\}_{n \ge 1}$ of those states taken through some memoryless channel, we can express the joint distribution of the states and outputs:
$$p(x^n, y^n) = \prod_{t=1}^n p(x_t | x_{t-1}) \prod_{t=1}^n p(y_t | x_t) \tag{1}$$
where $x_0$ is taken to be some fixed arbitrary value (e.g., 0). Additionally, we have the following principles for this setting:
$$(X^{t-1}, Y^t) - X_t - (X_{t+1}^n, Y_{t+1}^n) \tag{a}$$
$$(X^{t-1}, Y^{t-1}) - X_t - (X_{t+1}^n, Y_t^n) \tag{b}$$
$$X^{t-1} - (X_t, Y^{t-1}) - (X_{t+1}^n, Y_t^n) \tag{c}$$
In the last lecture, through the forward-backward recursion, we obtained the following algorithms for computing various posteriors:
$$\alpha_t(x_t) = p_{X_t|Y^t}(x_t | y^t) = \frac{\beta_t(x_t) \, p_{Y_t|X_t}(y_t | x_t)}{\sum_{\tilde x_t} \beta_t(\tilde x_t) \, p_{Y_t|X_t}(y_t | \tilde x_t)} \quad \text{(Measurement Update)}$$
$$\beta_{t+1}(x_{t+1}) = p_{X_{t+1}|Y^t}(x_{t+1} | y^t) = \sum_{x_t} \alpha_t(x_t) \, p_{X_{t+1}|X_t}(x_{t+1} | x_t) \quad \text{(Time Update)}$$
$$\gamma_t(x_t) = p_{X_t|Y^n}(x_t | y^n) = \sum_{x_{t+1}} \gamma_{t+1}(x_{t+1}) \, \frac{\alpha_t(x_t) \, p_{X_{t+1}|X_t}(x_{t+1} | x_t)}{\beta_{t+1}(x_{t+1})} \quad \text{(Backward Recursion)}$$
3 Factor Graphs
Let us derive a graphical representation of the joint distribution expressed in (1) as follows:
1. Create a node for each random variable: $X_1, \ldots, X_n, Y_1, \ldots, Y_n$.
2. For each pair of nodes, create an edge if some factor in (1) contains both variables.
This form of graph is called a factor graph.
Example: Figure 1 shows the factor graph for the hidden Markov process in our setting.
Figure 1: The factor graph for our hidden Markov process
For any partitioning of the graph into three node sets $S_1$, $S_2$, and $S_3$ such that any path from $S_1$ to $S_3$ passes through some node in $S_2$, these sets form a Markov triplet $S_1 - S_2 - S_3$ (to see this, simply group the factors of the joint pmf). Hence, for any $s_i \in S_i$, $i = 1, 2, 3$, we have
$$p(s_1, s_2, s_3) = \phi_1(s_1, s_2) \, \phi_2(s_2, s_3) \tag{2}$$
for some functions $\phi_1, \phi_2$ consisting of the factors used to form the factor graph, so the Markov property follows from Lemma 1 of the previous lecture. Using the factor graph, it is quite easy to establish principles (a)-(c) graphically. For example, principles (a) and (b) can be proved via Figures 2 and 3 shown below.
Figure 2: The factor graph partition for (a)
Figure 3: The factor graph partition for (b)
We can also construct an additional principle from the factor graph partitioning presented in Figure 4:
$$X^{t-1} - (X_t, Y^n) - X_{t+1}^n \tag{d}$$
Figure 4: The graphical proof of (d)
4 General Forward–Backward Recursion: Module Functions
Let us consider a generic discrete random variable $U \sim p_U$ passed through a memoryless channel characterized by $p_{V|U}$ to get a discrete random variable V. By Bayes' rule we know
$$p_{U|V}(u|v) = \frac{p_U(u) \, p_{V|U}(v|u)}{\sum_{\tilde u} p_U(\tilde u) \, p_{V|U}(v|\tilde u)} \tag{3}$$
Collecting the posterior probabilities $p_{U|V}(\cdot|v)$ into a vector, we can define a vector function F such that
$$\{p_{U|V}(u|v)\}_u = \left\{\frac{p_U(u) \, p_{V|U}(v|u)}{\sum_{\tilde u} p_U(\tilde u) \, p_{V|U}(v|\tilde u)}\right\}_u \triangleq F(p_U, p_{V|U}, v) \tag{4}$$
Additionally, overloading the notation of F, we can define the matrix function
$$F(p_U, p_{V|U}) = \{F(p_U, p_{V|U}, v)\}_v \tag{5}$$
This function F acts as a module function for calculating the posterior distribution of a random variable given a prior distribution and a channel distribution. In addition to F, we can write a module function for calculating the marginal distribution of the output given a prior distribution on the input and a channel distribution. We label this vector function G and define it as follows:
$$p_V = \{p_V(v)\}_v = \left\{\sum_{\tilde u} p_U(\tilde u) \, p_{V|U}(v|\tilde u)\right\}_v \triangleq G(p_U, p_{V|U}) \tag{6}$$
With these module functions in mind, we can express our recursions as follows:
αt = F (βt, pYt|Xt , yt) (Measurement Update)
βt+1 = G(αt, pXt+1|Xt ) (Time Update)
γt = G(γt+1,F (αt, pXt+1|Xt )) (Backward Recursion)
5 State Estimation: the Viterbi Algorithm
Now we are ready to derive the joint posterior distribution $p(x^n | y^n)$:
$$p(x^n | y^n) = p(x_n | y^n) \prod_{t=1}^{n-1} p(x_t | x_{t+1}^n, y^n)$$
$$= p(x_n | y^n) \prod_{t=1}^{n-1} p(x_t | x_{t+1}, y^n) \quad \text{(by principle (d))}$$
$$= p(x_n | y^n) \prod_{t=1}^{n-1} p(x_t | x_{t+1}, y^t) \quad \text{(by principle (c))}$$
$$= p(x_n | y^n) \prod_{t=1}^{n-1} \frac{p(x_t | y^t) \, p(x_{t+1} | x_t, y^t)}{p(x_{t+1} | y^t)}$$
$$= \gamma_n(x_n) \prod_{t=1}^{n-1} \frac{\alpha_t(x_t) \, p(x_{t+1} | x_t)}{\beta_{t+1}(x_{t+1})} \quad \text{(by principle (a))}$$
Write
$$\ln p(x^n | y^n) = \sum_{t=1}^n g_t(x_t, x_{t+1})$$
with
$$g_t(x_t, x_{t+1}) \triangleq \begin{cases} \ln \dfrac{\alpha_t(x_t) \, p(x_{t+1}|x_t)}{\beta_{t+1}(x_{t+1})} & t = 1, \cdots, n-1 \\ \ln \gamma_n(x_n) & t = n \end{cases}$$
(for t = n, $g_n$ does not actually depend on $x_{n+1}$).
We can solve for the MAP estimator $\hat x^{\mathrm{MAP}}$ with the help of the following definition.

Definition 1. For $1 \le k \le n$, let
$$M_k(x_k) := \max_{x_{k+1}^n} \sum_{t=k}^n g_t(x_t, x_{t+1}).$$
It is straightforward from the definition that $M_1(\hat x_1^{\mathrm{MAP}}) = \ln p(\hat x^{\mathrm{MAP}} | y^n) = \max_{x^n} \ln p(x^n | y^n)$ and
$$M_k(x_k) = \max_{x_{k+1}} \max_{x_{k+2}^n} \Big( g_k(x_k, x_{k+1}) + \sum_{t=k+1}^n g_t(x_t, x_{t+1}) \Big)$$
$$= \max_{x_{k+1}} \Big( g_k(x_k, x_{k+1}) + \max_{x_{k+2}^n} \sum_{t=k+1}^n g_t(x_t, x_{t+1}) \Big)$$
$$= \max_{x_{k+1}} \big( g_k(x_k, x_{k+1}) + M_{k+1}(x_{k+1}) \big).$$
Since $M_n(x_n) = g_n(x_n, x_{n+1}) = \ln \gamma_n(x_n)$ depends only on the one term $x_n$, we may start from $k = n$ and use the recursive formula above to obtain $M_1(x_1)$. This is called the Viterbi algorithm. The Bellman equations, referred to in the communication setting as the Viterbi algorithm, are an application of a technique called dynamic programming. Using the recursive equations derived in the previous section, we can express the Viterbi algorithm as follows.

1: function Viterbi
2:   $M_n(x_n) \leftarrow \ln \gamma_n(x_n)$ ▷ initialization of the log-likelihood
3:   $\hat x^{\mathrm{MAP}}(x_n) \leftarrow \emptyset$ ▷ initialization of the MAP estimator
4:   for $k = n-1, \cdots, 1$ do
5:     $M_k(x_k) \leftarrow \max_{x_{k+1}} (g_k(x_k, x_{k+1}) + M_{k+1}(x_{k+1}))$ ▷ maximum of the log-likelihood
6:     $\hat x_{k+1}(x_k) \leftarrow \arg\max_{x_{k+1}} (g_k(x_k, x_{k+1}) + M_{k+1}(x_{k+1}))$
7:     $\hat x^{\mathrm{MAP}}(x_k) \leftarrow [\hat x_{k+1}(x_k), \hat x^{\mathrm{MAP}}(\hat x_{k+1}(x_k))]$ ▷ maximizing sequence with leading term $x_k$
8:   end for
9:   $M \leftarrow \max_{x_1} M_1(x_1)$ ▷ maximum of the overall log-likelihood
10:  $\hat x_1 \leftarrow \arg\max_{x_1} M_1(x_1)$
11:  $\hat x^{\mathrm{MAP}} \leftarrow [\hat x_1, \hat x^{\mathrm{MAP}}(\hat x_1)]$ ▷ overall maximizing sequence
12: end function
EE378A Statistical Signal Processing Lecture 5 - 04/18/2017 Lecture 5: Particle Filtering and Introduction to Bayesian Inference Lecturer: Tsachy Weissman Scribe: Charles Hale
In this lecture, we introduce new methods for state estimation in Hidden Markov Processes beyond discrete alphabets. This first requires learning how to estimate pdfs from samples drawn from a different distribution. At the end, we briefly introduce Bayesian inference.
1 Recap: Inference on Hidden Markov Processes (HMPs)

1.1 Setting

A quick recap of the setting of an HMP:
1. {Xn}n≥1 is a Markov process, called “the state process”.
2. {Yn}n≥1 is “the observation process” where Yi is the output of Xi sent through a “memoryless” channel characterized by the distribution PY |X . Alternatively, we showed in HW1 that the above is totally equivalent to the following:
1. {Xn}n≥1 is “the state process” defined by
Xt = ft(Xt−1,Wt)
where $\{W_n\}_{n \ge 1}$ is an independent process, i.e., each $W_t$ is independent of all the others.
2. {Yn}n≥1 is “the observation process” related to {Xn}n≥1 by
Yt = gt(Xt,Nt)
where {Nn}n≥1 is an independent process.
3. The processes {Nn}n≥1 and {Wn}n≥1 are independent of each other.
1.2 Goal
Since the state process, {Xn}n≥1 is “hidden” from us, we only get to observe {Yn}n≥1 and wish to find a way of estimating {Xn}n≥1 based on {Yn}n≥1. To do this, we define the forward recursions for causal estimation:
αt(xt) = F (βt,PYt|Xt , yt) (1)
βt+1(xt+1) = G(αt,PXt+1|Xt ) (2)
where F and G are the operators defined in lecture 4.
1.3 Challenges

For general alphabets $\mathcal{X}, \mathcal{Y}$, computing the forward recursion is a very difficult problem. However, efficient algorithms are known in the following situations:
1. All random variables are discrete with finite alphabets. We can then solve these equations by brute force in time proportional to the alphabet sizes.
2. For $t \ge 1$, $f_t$ and $g_t$ are linear and $N_t, W_t, X_t$ are Gaussian. The solution is the Kalman filter algorithm that we derive in Homework 2.
Beyond these special situations, there are many heuristic algorithms for this computation:
1. We can compute the forward recursions approximately by quantizing the alphabets. This leads to a tradeoff between model accuracy and computation cost.
2. This lecture will focus on particle filtering, which is a way of estimating the forward recursions with an adaptive quantization.
2 Importance Sampling
Before we can learn about Particle Filtering, we first need to discuss importance sampling1.
2.1 Simple Setting
Let $X_1, \cdots, X_N$ be i.i.d. random variables with density f. We wish to approximate f using our data by the (random) discrete measure
$$\hat f_N = \frac{1}{N} \sum_{i=1}^N \delta(x - X_i) \tag{3}$$
where δ(x) is a Dirac delta.
Claim: $\hat f_N$ is a good estimate of f.

Here we define the "goodness" of our estimate by how close
$$E_{X \sim \hat f_N}[g(X)] \tag{4}$$
is to
$$E_{X \sim f}[g(X)] \tag{5}$$
for any function g with $E|g(X)| < \infty$. We show that by this definition, (3) is a good estimate of f.

Proof:
$$E_{X \sim \hat f_N}[g(X)] = \int_{-\infty}^\infty \hat f_N(x) \, g(x) \, dx \tag{6}$$
$$= \int_{-\infty}^\infty \frac{1}{N} \sum_{i=1}^N \delta(x - X_i) \, g(x) \, dx \tag{7}$$
$$= \frac{1}{N} \sum_{i=1}^N \int_{-\infty}^\infty \delta(x - X_i) \, g(x) \, dx \tag{8}$$
$$= \frac{1}{N} \sum_{i=1}^N g(X_i) \tag{9}$$
1See section 2.12.4 of the Spring 2013 lecture notes for further reference.
By the law of large numbers,
$$\frac{1}{N} \sum_{i=1}^N g(X_i) \xrightarrow{a.s.} E_{X \sim f}[g(X)] \tag{10}$$
as N → ∞. Hence,
$$E_{X \sim \hat f_N}[g(X)] \to E_{X \sim f}[g(X)] \tag{11}$$
so $\hat f_N$ is indeed a "good" estimate of f.
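In code, the "goodness" property is just the statement that sample averages approximate expectations. A minimal sketch, taking f to be the standard normal and g(x) = x² (so $E_f[g(X)] = 1$); both choices are our own:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
x = rng.normal(0.0, 1.0, size=N)    # X_i i.i.d. ~ f = N(0, 1)

g = lambda t: t ** 2                # E_f[g(X)] = Var(X) = 1 for the standard normal
estimate = float(np.mean(g(x)))     # E_{f_hat_N}[g(X)] = (1/N) * sum_i g(X_i), per (9)
```

By (10), `estimate` converges almost surely to 1 as N grows.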
2.2 General Setting
Suppose now that $X_1, \cdots, X_N$ are i.i.d. with common density q, but we wish to estimate a different pdf f.
Theorem: The following estimator
$$\hat f_N(x) = \sum_{i=1}^N w_i \, \delta(x - X_i) \tag{12}$$
with weights
$$w_i = \frac{f(X_i)/q(X_i)}{\sum_{j=1}^N f(X_j)/q(X_j)} \tag{13}$$
is a good estimator of f.

Proof: It is easy to see that
$$E_{X \sim \hat f_N}[g(X)] = \sum_{i=1}^N w_i \, g(X_i) \tag{14}$$
$$= \sum_{i=1}^N \frac{f(X_i)/q(X_i)}{\sum_{j=1}^N f(X_j)/q(X_j)} \, g(X_i) \tag{15}$$
$$= \frac{\sum_{i=1}^N \frac{f(X_i)}{q(X_i)} g(X_i)}{\sum_{j=1}^N \frac{f(X_j)}{q(X_j)}} \tag{16}$$
$$= \frac{\frac{1}{N} \sum_{i=1}^N \frac{f(X_i)}{q(X_i)} g(X_i)}{\frac{1}{N} \sum_{j=1}^N \frac{f(X_j)}{q(X_j)}}. \tag{17}$$
As N → ∞, by the law of large numbers the above expression converges to
$$\frac{E\big[\frac{f(X)}{q(X)} g(X)\big]}{E\big[\frac{f(X)}{q(X)}\big]} = \frac{\int_{-\infty}^\infty \frac{f(x)}{q(x)} g(x) \, q(x) \, dx}{\int_{-\infty}^\infty \frac{f(x)}{q(x)} \, q(x) \, dx} \tag{18}$$
$$= \frac{\int_{-\infty}^\infty f(x) g(x) \, dx}{\int_{-\infty}^\infty f(x) \, dx} \tag{19}$$
$$= \int_{-\infty}^\infty f(x) g(x) \, dx \tag{20}$$
$$= E_{X \sim f}[g(X)]. \tag{21}$$
Hence,
$$E_{X \sim \hat f_N}[g(X)] \to E_{X \sim f}[g(X)] \tag{22}$$
as N → ∞.
The above theorem is useful in situations when we cannot draw samples $X_i \sim f$ (or cannot evaluate f directly), but we know that $f(x) = c \cdot h(x)$ for some constant c; in many practical cases it is computationally prohibitive to obtain the normalization constant $c = (\int h(x) \, dx)^{-1}$. Then
$$w_i = \frac{f(X_i)/q(X_i)}{\sum_{j=1}^N f(X_j)/q(X_j)} \tag{23}$$
$$= \frac{c \, h(X_i)/q(X_i)}{\sum_{j=1}^N c \, h(X_j)/q(X_j)} \tag{24}$$
$$= \frac{h(X_i)/q(X_i)}{\sum_{j=1}^N h(X_j)/q(X_j)} \tag{25}$$
Using these wi’s, we see that we can approximate f using samples drawn from a different distribution and h.
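A sketch of this self-normalized scheme, with an assumed unnormalized target $h(x) = e^{-x^2/2}$ on x > 0 (so f is a half-normal with $E_f[X] = \sqrt{2/\pi}$) and proposal q = Exp(1); both choices are ours, and note the normalizer c never appears:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 400_000
x = rng.exponential(1.0, size=N)        # proposal q = Exp(1): q(x) = exp(-x), x > 0

h = lambda t: np.exp(-t ** 2 / 2)       # unnormalized target; f = c * h is half-normal
q = lambda t: np.exp(-t)

w = h(x) / q(x)                         # unnormalized weights h(X_i)/q(X_i)
w /= w.sum()                            # self-normalization, equation (25): c cancels
estimate = float(np.sum(w * x))         # approximates E_f[X] = sqrt(2/pi) ~ 0.798
```

Here the weight function $h(x)/q(x) = e^{x - x^2/2}$ is bounded, which keeps the variance of the estimator well behaved; a poorly matched proposal can make the weights extremely heavy-tailed.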
2.4 Application
Let X ∼ fX and Y be the observation of X through a memoryless channel characterized by fY |X . For MAP reconstruction of X, we must compute
$$f_{X|Y}(x|y) = \frac{f_X(x) \, f_{Y|X}(y|x)}{\int_{-\infty}^\infty f_X(\tilde x) \, f_{Y|X}(y|\tilde x) \, d\tilde x} \tag{26}$$
In general, this computation may be hard due to the integral in the denominator. But the denominator is just a constant. Hence, suppose we have $X_i$ i.i.d. $\sim f_X$ for $1 \le i \le N$. By the previous theorem, we can take
$$\hat f_{X|y} = \sum_{i=1}^N w_i(y) \, \delta(x - X_i) \tag{27}$$
as an approximation, where
$$w_i(y) = \frac{f_{X|y}(X_i)/f_X(X_i)}{\sum_{j=1}^N f_{X|y}(X_j)/f_X(X_j)} \tag{28}$$
$$= \frac{\frac{f_X(X_i) \, f_{Y|X}(y|X_i)}{f_X(X_i)}}{\sum_{j=1}^N \frac{f_X(X_j) \, f_{Y|X}(y|X_j)}{f_X(X_j)}} \tag{29}$$
$$= \frac{f_{Y|X}(y|X_i)}{\sum_{j=1}^N f_{Y|X}(y|X_j)} \tag{30}$$
Hence, using the samples X1,X2, ··· ,XN we can build a good approximation of fX|Y (·|y) for fixed y using the above procedure.
Let us denote this approximation by
$$\hat f_N(X_1, X_2, \ldots, X_N, f_X, f_{Y|X}, y) = \sum_{i=1}^N w_i \, \delta(x - X_i) \tag{31}$$

3 Particle Filtering

Particle filtering approximates the forward recursions of an HMP by using importance sampling to approximate the generated distributions rather than computing the true solutions.

Algorithm 1: Particle Filter
particles ← generate $\{X_1^{(i)}\}_{i=1}^N$ i.i.d. $\sim f_{X_1}$;
for t ≥ 1 do
  $\hat\alpha_t = \hat f_N(\{X_t^{(i)}\}_{i=1}^N, \hat\beta_t, P_{Y_t|X_t}, y_t)$; // estimate $\alpha_t$ using importance sampling as in (31)
  nextParticles ← generate $\{\tilde X_t^{(i)}\}_{i=1}^N$ i.i.d. $\sim \hat\alpha_t$; // resample
  for 1 ≤ i ≤ N do
    generate $X_{t+1}^{(i)} \sim f_{X_{t+1}|X_t}(\cdot \,|\, \tilde X_t^{(i)})$; // propagate through the dynamics
  end
  $\hat\beta_{t+1} = \frac{1}{N} \sum_{i=1}^N \delta(x - X_{t+1}^{(i)})$;
end

This algorithm requires some art in use. It is not well understood how errors propagate through the successive approximations, and in practice the particles need to be periodically reset. Choosing these hyperparameters is an application-specific task.
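A minimal bootstrap particle filter along these lines, for an assumed linear-Gaussian toy model (chosen only so that the densities are easy to write down; a Kalman filter would be exact here, which makes the sketch easy to check). The model, names, and particle count are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter(ys, N=20_000):
    """Bootstrap particle filter sketch for the toy model
    X_t = 0.9 * X_{t-1} + W_t,  Y_t = X_t + N_t,  W_t, N_t ~ N(0, 1), X_1 ~ N(0, 1).
    Returns the estimates of E[X_t | y^t]."""
    particles = rng.normal(0.0, 1.0, size=N)    # draws from f_{X_1} = beta_hat_1
    means = []
    for y in ys:
        # measurement update: importance weights proportional to f_{Y|X}(y | x_i)
        w = np.exp(-0.5 * (y - particles) ** 2)
        w /= w.sum()
        means.append(float(np.sum(w * particles)))   # posterior-mean estimate from alpha_hat_t
        # resample from alpha_hat_t, then propagate through the state dynamics
        idx = rng.choice(N, size=N, p=w)
        particles = 0.9 * particles[idx] + rng.normal(0.0, 1.0, size=N)
    return means

estimates = particle_filter([1.0, 2.0, 1.5, 0.5, 1.0])
```

Because this toy model is linear-Gaussian, the particle estimates can be compared against the exact Kalman filter; the match degrades if N is too small, illustrating the error-propagation issue noted above.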
4 Inference under logarithmic loss
Recall Bayesian decision theory, with the following elements:
• $X \in \mathcal{X}$ is the quantity we want to infer;
• $X \sim p_X$ (assume that X is discrete);
• $\hat x \in \hat{\mathcal{X}}$ is the reconstruction;
• $\Lambda: \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}$ is our loss function.
First we assume that we do not have any observation, and define the Bayes envelope
$$U(p_X) = \min_{\hat x} E\Lambda(X, \hat x) = \min_{\hat x} \sum_x p_X(x) \, \Lambda(x, \hat x) \tag{32}$$
and the Bayes response
$$\hat X_{\mathrm{Bayes}}(p_X) = \arg\min_{\hat x} E\Lambda(X, \hat x). \tag{33}$$
Example: Let \mathcal{X} = \hat{\mathcal{X}} = \mathbb{R}, \Lambda(x, \hat{x}) = (x - \hat{x})^2. Then

U(p_X) = \min_{\hat{x}} E_{X \sim p_X}(X - \hat{x})^2 = \mathrm{Var}(X)    (34)

\hat{X}_{\mathrm{Bayes}}(p_X) = E[X].    (35)
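Equations (34)-(35) are easy to check numerically: scanning candidate reconstructions for a hypothetical three-point pmf shows that the expected squared loss is minimized at the mean, with minimum value Var(X).

```python
# hypothetical discrete example: X takes values 0, 1, 3 with probabilities 0.2, 0.5, 0.3
p = {0: 0.2, 1: 0.5, 3: 0.3}

def expected_sq_loss(xhat):
    return sum(px * (x - xhat) ** 2 for x, px in p.items())

mean = sum(x * px for x, px in p.items())
var = sum(px * (x - mean) ** 2 for x, px in p.items())

# brute-force scan over a fine grid of candidate reconstructions
grid = [i / 1000 for i in range(3001)]
best = min(grid, key=expected_sq_loss)
```

Here `best` lands on the mean (1.4) and `expected_sq_loss(best)` matches Var(X), as (34)-(35) predict.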
Consider now the case where we have some observation (side information) Y, so that our reconstruction \hat{X} can be a function of Y. Then

E\Lambda(X, \hat{X}(Y)) = E[E[\Lambda(X, \hat{X}(y)) | Y = y]]    (36)

= \sum_y E[\Lambda(X, \hat{X}(y)) | Y = y] \, p_Y(y)    (37)

= \sum_y p_Y(y) \sum_x p_{X|Y=y}(x) \Lambda(x, \hat{X}(y)).    (38)

This implies

\min_{\hat{X}(\cdot)} E\Lambda(X, \hat{X}(Y))    (39)

= \min_{\hat{X}(\cdot)} \sum_y p_Y(y) \sum_x p_{X|Y=y}(x) \Lambda(x, \hat{X}(y))    (40)

= \sum_y p_Y(y) U(p_{X|Y=y})    (41)

and thus

\hat{X}_{\mathrm{opt}}(y) = \hat{X}_{\mathrm{Bayes}}(p_{X|Y=y}).    (42)
EE378A Statistical Signal Processing Lecture 6 - 04/20/2017
Lecture 6: Bayesian Decision Theory and Justification of Log Loss
Lecturer: Tsachy Weissman    Scribe: Okan Atalar
In this lecture1, we introduce the topic of Bayesian decision theory and define the concept of a loss function for assessing the accuracy of an estimate. The logarithmic loss function is presented as an important special case.
1 Notation
A quick summary of the notation:
1. Random variables (objects): used more "loosely", i.e., X, Y
2. PMF (PDF): used more "loosely", i.e., P_X, P_Y
3. Alphabets: \mathcal{X}, \hat{\mathcal{X}}
4. Estimate of X: \hat{x}
5. Specific values: x, y
6. Loss function: \Lambda
7. Markov triplet: X - Y - Z notation corresponds to a Markov triplet
8. Logarithm: log denotes log_2, unless otherwise stated

For a discrete random variable (object) U with p.m.f. P_U(u) \triangleq P(U = u), we will often just write p(u). Similarly, p(x, y) for P_{X,Y}(x, y) and p(y|x) for P_{Y|X}(y|x), etc.
2 Bayesian Decision Theory

Let X \in \mathcal{X}, X \sim P_X, \hat{x} \in \hat{\mathcal{X}}, \Lambda : \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}.

Definition 1 (Bayes Envelope).

U(P_X) \triangleq \min_{\hat{x}} E[\Lambda(X, \hat{x})] = \min_{\hat{x}} \sum_x P_X(x) \Lambda(x, \hat{x})    (1)

Definition 2 (Bayes Response).

\hat{X}_{\mathrm{Bayes}}(P_X) \triangleq \arg\min_{\hat{x}} E[\Lambda(X, \hat{x})]    (2)

Essentially, given a prior distribution on X, we try to achieve the best estimate by minimizing the expected value of some "loss function" that we have defined. If, in addition to the prior distribution of X, we are also given an observation Y, the associated minimum expected loss becomes

\min_{\hat{X}(\cdot)} E[\Lambda(X, \hat{X}(Y))] = \sum_y U(P_{X|Y=y}) P_Y(y) = E[U(P_{X|Y})]    (3)

In this case, the Bayes response takes the form

\hat{X}(y) = \hat{X}_{\mathrm{Bayes}}(P_{X|Y=y})    (4)
1Reading: Lecture 3 of the EE378A Spring 2016 notes
3 Bayesian Inference Under Logarithmic Loss
In this section, we use the logarithm as the loss function.

Definition 3 (Logarithmic Loss Function).

\Lambda(x, Q) \triangleq \log \frac{1}{Q(x)}    (5)

where x is the realization of X and Q is an estimate of the pmf of X. Using Definition 3, the expected log loss can be expressed as:
E[\Lambda(X, Q)] = \sum_x P_X(x) \Lambda(x, Q) = \sum_x P_X(x) \log \frac{1}{Q(x)} = \sum_x P_X(x) \log \frac{P_X(x)}{Q(x) P_X(x)}    (6)

= \sum_x P_X(x) \log \frac{1}{P_X(x)} + \sum_x P_X(x) \log \frac{P_X(x)}{Q(x)}    (7)

Definition 4 (Entropy).

H(X) \triangleq \sum_x P_X(x) \log \frac{1}{P_X(x)}    (8)

Definition 5 (Relative Entropy, or the Kullback–Leibler Divergence).

D(P_X \| Q) \triangleq \sum_x P_X(x) \log \frac{P_X(x)}{Q(x)}    (9)

Using Equations (6)-(7) and Definitions 4-5:
E[\Lambda(X, Q)] = H(X) + D(P_X \| Q)    (10)
Claim: D(P_X \| Q) \ge 0, and D(P_X \| Q) = 0 \iff Q = P_X.
Before attempting the proof, we provide Jensen’s inequality:
Jensen’s Inequality: Let Q denote a concave function, and X be any random variable. Then
E[Q(X)] \le Q(E[X])    (11)

Further, if Q is strictly concave, then E[Q(X)] = Q(E[X]) \iff X is a constant. In particular, Q(x) = \log(x) is a concave function; thus for a random variable X > 0,

E[\log X] \le \log E[X].    (12)

Proof of the Claim:
D(P_X \| Q) = E\left[\log \frac{P_X(X)}{Q(X)}\right] = -E\left[\log \frac{Q(X)}{P_X(X)}\right] \ge -\log E\left[\frac{Q(X)}{P_X(X)}\right] = -\log \sum_x P_X(x) \frac{Q(x)}{P_X(x)} = -\log \sum_x Q(x) = 0    (13)

where the inequality is Jensen's inequality (12). The last equality follows from \sum_x Q(x) = 1, since Q is a pmf and thus sums to one.
Hence, it follows from the claim that

\min_Q E[\Lambda(X, Q)] = H(X) = U(P_X)    (14)

where the minimizer is Q = P_X. Therefore:

\hat{X}_{\mathrm{Bayes}}(P_X) = P_X.    (15)
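The decomposition (10) and the claim are easy to verify numerically on a small alphabet (the particular pmfs below are arbitrary illustrative choices):

```python
import math

def H(p):                      # entropy, in bits
    return sum(px * math.log2(1 / px) for px in p.values() if px > 0)

def D(p, q):                   # relative entropy D(p || q), in bits
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

def expected_log_loss(p, q):   # E[Lambda(X, Q)] with Lambda(x, Q) = log 1/Q(x)
    return sum(px * math.log2(1 / q[x]) for x, px in p.items() if px > 0)

p = {'a': 0.5, 'b': 0.25, 'c': 0.25}
q = {'a': 0.25, 'b': 0.25, 'c': 0.5}

# equation (10): expected log loss = entropy + divergence
gap = expected_log_loss(p, q) - (H(p) + D(p, q))
```

The gap vanishes, D(p, q) is strictly positive for p ≠ q, and taking Q = P recovers the entropy, in agreement with (14)-(15).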
Introducing the notation H(X) \iff H(P_X) and making use of Equation (3):

\min_{\hat{X}(\cdot)} E[\Lambda(X, \hat{X}(Y))] = E[U(P_{X|Y})] = \sum_y H(P_{X|Y=y}) P_Y(y)    (16)
Definition 6 (Conditional Entropy).

H(X|Y) \triangleq \sum_y H(P_{X|Y=y}) P_Y(y)    (17)

Definition 7 (Mutual Information).

I(X; Y) \triangleq H(X) - H(X|Y)    (18)

Definition 8 (Conditional Relative Entropy).

D(P_{X|Y} \| Q_{X|Y} | P_Y) \triangleq \sum_y D(P_{X|Y=y} \| Q_{X|Y=y}) P_Y(y)    (19)

Definition 9 (Conditional Mutual Information).

I(X; Y|Z) \triangleq H(X|Z) - H(X|Y, Z)    (20)

Exercises:
(a) H(X_1, X_2) = H(X_1) + H(X_2|X_1)
(b) I(X; Y) = H(Y) - H(Y|X)
(c) I(X_1, X_2; Y) = I(X_1; Y) + I(X_2; Y|X_1)
(d) D(P_{X_1,X_2} \| Q_{X_1,X_2}) = D(P_{X_1} \| Q_{X_1}) + D(P_{X_2|X_1} \| Q_{X_2|X_1} | P_{X_1})
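Exercises (a) and (b) can be checked mechanically on a toy joint pmf (the numbers below are arbitrary):

```python
import math

# toy joint pmf of (X, Y) on {0,1} x {0,1} (illustrative numbers)
pxy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def entropy(p):
    return sum(v * math.log2(1 / v) for v in p.values() if v > 0)

px = {x: pxy[(x, 0)] + pxy[(x, 1)] for x in (0, 1)}
py = {y: pxy[(0, y)] + pxy[(1, y)] for y in (0, 1)}

# H(Y|X) computed directly from its definition: -sum p(x,y) log p(y|x)
HY_given_X = sum(v * math.log2(px[x] / v) for (x, _y), v in pxy.items() if v > 0)

chain_gap = entropy(pxy) - (entropy(px) + HY_given_X)   # exercise (a), for (X, Y)
I_XY = entropy(py) - HY_given_X                          # exercise (b)
```

The chain-rule gap is zero to floating-point precision, and I(X; Y) > 0 since this pmf makes X and Y dependent.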
4 Justification of Log Loss as the Right Loss Function
For a general loss function \Lambda:

Definition 10 (Coherent Dependence Function).

C(\Lambda, P_{X,Y}) \triangleq \inf_{\hat{x}} E[\Lambda(X, \hat{x})] - \inf_{\hat{X}(\cdot)} E[\Lambda(X, \hat{X}(Y))]    (21)
The first term is the minimum expected loss (under the chosen loss function) in estimating X based only on the prior distribution of X. The second term is the minimum expected loss in estimating X while also taking the observation Y into account. Therefore, the coherent dependence function measures how much the estimate improves given the observation Y, compared to estimating X from the prior alone.
Two examples are provided below to make this concept more clear:
(1) \Lambda(x, \hat{x}) = (x - \hat{x})^2 \Rightarrow C(\Lambda, P_{X,Y}) = \mathrm{Var}(X) - E[\mathrm{Var}(X|Y)]
(2) \Lambda(x, Q) = \log \frac{1}{Q(x)} \Rightarrow C(\Lambda, P_{X,Y}) = I(X; Y)
Proving the expressions below is left as an exercise:
(a) C(\Lambda, P_{X,Y}) \ge 0, and C(\Lambda, P_{X,Y}) = 0 \iff X and Y are independent
(b) C(\Lambda, P_{X,Y}) \ge C(\Lambda, P_{X,Z}) when X - Y - Z
Natural Requirement: If T = T(X) is a function of X, then C(\Lambda, P_{T,Y}) \le C(\Lambda, P_{X,Y}).
This requirement basically states that a function of X cannot carry more information about Y than X itself does.
Definition 11 (Sufficient Statistic). T = T (X) is a sufficient statistic of X for Y if X − T − Y .
Modest Requirement: If T is a sufficient statistic of X for Y, then C(\Lambda, P_{T,Y}) \le C(\Lambda, P_{X,Y}).
Theorem: Assume |\mathcal{X}| \ge 3. The only C(\Lambda, P_{X,Y}) satisfying the modest requirement is C(\Lambda, P_{X,Y}) = I(X; Y) (up to a multiplicative factor).

Since it was shown that C(\Lambda, P_{X,Y}) = I(X; Y) if (and only if) the logarithmic loss function \Lambda(x, Q) = \log \frac{1}{Q(x)} is used, it follows that the logarithmic loss is essentially the only loss function which satisfies the modest requirement.
EE378A Statistical Signal Processing Lecture 7 - 04/25/2017
Lecture 7: Prediction
Lecturer: Tsachy Weissman    Scribe: Cem Kalkanli
In this lecture we study prediction in the Bayesian setting, under general and logarithmic loss. We also briefly begin the subject of prediction of individual sequences.
1 Notation
A quick summary of the notation:
1. Random variables (objects): used more "loosely", i.e., X, Y, U, V;
2. Alphabets: \mathcal{X}, \hat{\mathcal{X}};
3. Sequence of random variables: X_1, X_2, \ldots, X_t;
4. Loss function: \Lambda(x, \hat{x});
5. Estimate of X: \hat{x};
6. Set of distributions on alphabet \mathcal{X}: \mathcal{M}(\mathcal{X}).

For a discrete random variable (object) U with p.m.f. P_U(u) \triangleq P(U = u), we will often just write p(u). Similarly, p(x, y) for P_{X,Y}(x, y) and p(y|x) for P_{Y|X}(y|x), etc.
2 Prediction
Consider X1,X2, ··· ,Xn ∈ X .
Definition 1 (Predictor). A predictor F is a sequence of functions F = (F_t)_{t \ge 1} with F_t : \mathcal{X}^{t-1} \to \hat{\mathcal{X}}.

Definition 2 (Prediction Loss). Given a loss function \Lambda : \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}_+, the prediction loss incurred by the predictor F is defined as

L_F(X^n) = \frac{1}{n} \sum_{t=1}^{n} \Lambda(X_t, F_t(X^{t-1})).    (1)
2.1 Prediction under Bayesian Setting
In the Bayesian setting, we assume X_1, X_2, \ldots are governed jointly by some underlying law P. Then we have the following prediction loss.

Definition 3 (Prediction Loss Under Bayesian Setting).

\min_F E_P[L_F(X^n)] = \frac{1}{n} \sum_{t=1}^{n} \min_{F_t} E_P[\Lambda(X_t, F_t(X^{t-1}))] = \frac{1}{n} \sum_{t=1}^{n} E_P[U(P_{X_t|X^{t-1}})]    (2)

where U(\cdot) is the Bayes envelope we defined in the previous lecture.
As we can see from (2), our initial objective was to minimize the expected prediction loss with respect to the whole sequence of functions. However, since we can interchange summation and expectation, we can treat the overall optimal predictor F as the composition of the sequentially optimal predictors F_t. Finally, we note the overall loss is just the average of the expected Bayes envelopes.

Theorem 4 (Mismatch Loss). The additional prediction loss incurred by predicting based on a law Q instead of the true P is upper bounded by some form of relative entropy:

E_P[L_{F^Q}(X^n) - L_{F^P}(X^n)] \le \Lambda_{\max} \sqrt{\frac{2}{n} D(P_{X^n} \| Q_{X^n})},    (3)

where \Lambda_{\max} = \sup_{x, \hat{x}} \Lambda(x, \hat{x}). We first introduce Lemma 5, which will be used to prove Theorem 4.

Lemma 5.

E_P[\Lambda(X, \hat{X}_{\mathrm{Bayes}}(Q)) - \Lambda(X, \hat{X}_{\mathrm{Bayes}}(P))] \le \Lambda_{\max} \|P - Q\|_1 \le \Lambda_{\max} \sqrt{2 D(P \| Q)}    (4)

Proof.

E_P[\Lambda(X, \hat{X}_{\mathrm{Bayes}}(Q)) - \Lambda(X, \hat{X}_{\mathrm{Bayes}}(P))]
\le E_P[\Lambda(X, \hat{X}_{\mathrm{Bayes}}(Q)) - \Lambda(X, \hat{X}_{\mathrm{Bayes}}(P))] + E_Q[\Lambda(X, \hat{X}_{\mathrm{Bayes}}(P)) - \Lambda(X, \hat{X}_{\mathrm{Bayes}}(Q))]    (5)
= \sum_x (P(x) - Q(x))(\Lambda(x, \hat{X}_{\mathrm{Bayes}}(Q)) - \Lambda(x, \hat{X}_{\mathrm{Bayes}}(P)))    (6)
\le \sum_x |P(x) - Q(x)| \Lambda_{\max}    (7)
= \Lambda_{\max} \|P - Q\|_1    (8)
\le \Lambda_{\max} \sqrt{2 D(P \| Q)}    (9)

where step (5) adds a non-negative term (since \hat{X}_{\mathrm{Bayes}}(Q) minimizes the expected loss under Q), and in the last step we have used Pinsker's inequality D(P \| Q) \ge \frac{1}{2} \|P - Q\|_1^2.
Now we prove Theorem 4 as follows:
E_P[L_{F^Q}(X^n) - L_{F^P}(X^n)]    (10)
= E_P\left[\frac{1}{n} \sum_{t=1}^{n} \left(\Lambda(X_t, F_t^Q(X^{t-1})) - \Lambda(X_t, F_t^P(X^{t-1}))\right)\right]    (11)
= \frac{1}{n} \sum_{t=1}^{n} E_P\left[E_P\left[\Lambda(X_t, F_t^Q(X^{t-1})) - \Lambda(X_t, F_t^P(X^{t-1})) \,\middle|\, X^{t-1}\right]\right]    (iterated expectation)    (12)
\le \frac{1}{n} \sum_{t=1}^{n} E_P\left[\Lambda_{\max} \sqrt{2 D(P_{X_t|X^{t-1}} \| Q_{X_t|X^{t-1}})}\right]    (using Lemma 5)    (13)
\le \Lambda_{\max} \sqrt{\frac{2}{n} \sum_{t=1}^{n} E_P[D(P_{X_t|X^{t-1}} \| Q_{X_t|X^{t-1}})]}    (Jensen's inequality)    (14)
= \Lambda_{\max} \sqrt{\frac{2}{n} D(P_{X^n} \| Q_{X^n})}    (chain rule for relative entropy)    (15)

as desired.
3 Prediction Under Logarithmic Loss
In this section, we specialize our setup to the logarithmic loss, changing the predictor's range into a set of distributions. Here \hat{\mathcal{X}} = \mathcal{M}(\mathcal{X}) and \Lambda(x, P) = -\log P(x) is the logarithmic loss. Also note that F_t(X^{t-1}) \in \mathcal{M}(\mathcal{X}). We can then represent our predictor in three different ways:
1. Obviously, we can carry over the setup we worked with previously and view our predictor as a sequence of functions \{F_t\}_{t \ge 1}.
2. Another approach is to view the predictor as a sequence of conditional distributions \{Q_{X_t|X^{t-1}}\}_{t \ge 1}, where F_t(X^{t-1}) = Q_{X_t|X^{t-1}}.
3. Finally, we can view our predictor as a model of the data: Q is then the law we assign to the sequence we observe.
Then we can immediately adapt the previous results to the log-loss case. First, L_F(X^n):

L_F(X^n) = \frac{1}{n} \sum_{t=1}^{n} \log \frac{1}{F_t(X^{t-1})[X_t]}    (16)
= \frac{1}{n} \sum_{t=1}^{n} \log \frac{1}{Q_{X_t|X^{t-1}}(X_t)}    (17)
= \frac{1}{n} \log \frac{1}{Q(X^n)}    (where Q is now our model of the data)    (18)

Now let us look at the logarithmic loss in the Bayesian setting, assuming the sequence X_1, X_2, \ldots is governed by P:

E_P[L_F(X^n)] = E_P\left[\frac{1}{n} \log \frac{1}{Q(X^n)}\right]    (19)
= E_P\left[\frac{1}{n} \log \frac{P(X^n)}{Q(X^n) P(X^n)}\right]    (20)
= \frac{1}{n} \left[H(X^n) + D(P_{X^n} \| Q_{X^n})\right]    (21)
Then we can see that \min_F E_P[L_F(X^n)] = \frac{1}{n} H(X^n), and it is achieved when F_t(X^{t-1}) = P_{X_t|X^{t-1}}.
4 Prediction of Individual Sequences
As the last part, we look at a different set of predictors. In this setup we are given a class of predictors \mathcal{F}, and we seek a predictor F such that

\lim_{n \to \infty} \max_{X^n} \left[L_F(X^n) - \min_{G \in \mathcal{F}} L_G(X^n)\right] = 0.    (22)

4.1 Prediction of Individual Sequences under Log Loss

Now the question arises: given a class of laws \mathcal{F}, can we find P such that (23) is small for any X^n:

\frac{1}{n} \log \frac{1}{P(X^n)} - \min_{Q \in \mathcal{F}} \frac{1}{n} \log \frac{1}{Q(X^n)} = \frac{1}{n} \log \frac{1}{P(X^n)} - \frac{1}{n} \log \frac{1}{\max_{Q \in \mathcal{F}} Q(X^n)}.    (23)

This problem will be solved in the upcoming lecture.
EE378A Statistical Signal Processing Lecture 8 - 04/27/2017
Lecture 8: Prediction of Individual Sequences
Lecturer: Tsachy Weissman    Scribe: Chuan-Zheng Lee
In the last lecture we studied the prediction problem in the Bayesian setting; in this lecture we study the prediction problem under logarithmic loss in the individual-sequence setting.
1 Problem setup
We first recall the problem setup: given a class of probability laws F, we wish to find some law P with a small worst-case regret
WCR_n(P, \mathcal{F}) \triangleq \max_{x^n} \left[\log \frac{1}{P(x^n)} - \min_{Q \in \mathcal{F}} \log \frac{1}{Q(x^n)}\right].    (1)
Then a first natural question is as follows: could we find some P such that
WCR_n(P, \mathcal{F}) \le n \epsilon_n    (2)

with \epsilon_n \to 0 as n \to \infty?
2 First answer when |F| < ∞: uniform mixture
The answer to the previous question turns out to be affirmative when |\mathcal{F}| < \infty. Our construction of P is as follows:

P_{\mathrm{Unif}}(x^n) = \frac{1}{|\mathcal{F}|} \sum_{Q \in \mathcal{F}} Q(x^n).    (3)
The performance of PUnif is summarized in the following lemma:
Lemma 1. The worst-case regret of PUnif satisfies:
WCRn(PUnif, F) ≤ log |F|. (4)
Proof. Just apply Q(x^n) \ge 0 for any Q \in \mathcal{F} and x^n to conclude that

\log \frac{1}{P_{\mathrm{Unif}}(x^n)} - \min_{Q \in \mathcal{F}} \log \frac{1}{Q(x^n)} = \log \frac{\max_{Q \in \mathcal{F}} Q(x^n)}{P_{\mathrm{Unif}}(x^n)}    (5)
= \log |\mathcal{F}| + \log \frac{\max_{Q \in \mathcal{F}} Q(x^n)}{\sum_{\tilde{Q} \in \mathcal{F}} \tilde{Q}(x^n)}    (6)
\le \log |\mathcal{F}|.    (7)

By Lemma 1, we can choose \epsilon_n = \frac{\log |\mathcal{F}|}{n} \to 0 as n \to \infty, which provides an affirmative answer to our question.
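For a small (hypothetical) finite class of i.i.d. Bernoulli laws, Lemma 1 can be verified by brute force over all sequences:

```python
import math
from itertools import product

thetas = [0.1, 0.3, 0.5, 0.7, 0.9]   # a finite class F of i.i.d. Bernoulli laws
n = 8

def Q(xs, theta):
    # probability of the binary sequence xs under Bern(theta), i.i.d.
    return math.prod(theta if x else 1 - theta for x in xs)

def P_unif(xs):
    return sum(Q(xs, t) for t in thetas) / len(thetas)

# worst-case regret of the uniform mixture, by enumerating all of {0,1}^n
wcr = max(math.log2(max(Q(xs, t) for t in thetas) / P_unif(xs))
          for xs in product([0, 1], repeat=n))
```

As Lemma 1 predicts, `wcr` never exceeds log2 |F| ≈ 2.32 bits here.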
3 Second answer: normalized maximum likelihood
In the previous section we saw that \epsilon_n \to 0 is possible when |\mathcal{F}| < \infty, but we are still looking for the law P which gives the minimum \epsilon_n, or equivalently, the minimum WCR_n(P, \mathcal{F}). Complicated as the definition of WCR_n(P, \mathcal{F}) may look, surprisingly its minimum admits a closed-form expression:

Theorem 2 (Normalized Maximum Likelihood). The minimum worst-case regret is given by

\min_P WCR_n(P, \mathcal{F}) = \log \sum_{x^n} \max_{Q \in \mathcal{F}} Q(x^n).    (8)

This bound is attained by the normalized maximum likelihood predictor

P_{\mathrm{NML}}(x^n) = \frac{\max_{Q \in \mathcal{F}} Q(x^n)}{\sum_{\tilde{x}^n} \max_{Q \in \mathcal{F}} Q(\tilde{x}^n)}.    (9)
Proof. Since \frac{P_{\mathrm{NML}}(x^n)}{\max_{Q \in \mathcal{F}} Q(x^n)} is a constant, it is clear that P_{\mathrm{NML}} attains the desired equality. To prove optimality, suppose P' is any other predictor. Since both P_{\mathrm{NML}} and P' are probability distributions, there must exist some x^n such that

P'(x^n) \le P_{\mathrm{NML}}(x^n).    (10)

Considering this sequence x^n gives

WCR_n(P', \mathcal{F}) \ge \log \frac{\max_{Q \in \mathcal{F}} Q(x^n)}{P'(x^n)}    (11)
\ge \log \frac{\max_{Q \in \mathcal{F}} Q(x^n)}{P_{\mathrm{NML}}(x^n)}    (12)
= \log \sum_{x^n} \max_{Q \in \mathcal{F}} Q(x^n)    (13)

as desired.
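For the Bernoulli class F = {Bern(θ)}_{θ∈[0,1]}, the maximized likelihood is max_θ Q_θ(x^n) = (n_1/n)^{n_1} (n_0/n)^{n_0}, so Theorem 2 can be evaluated exactly for small n; the computation below also confirms that P_NML is an equalizer, achieving the same regret on every sequence. (The choice n = 10 is arbitrary.)

```python
import math
from itertools import product

n = 10

def max_likelihood(xs):
    # max over theta of Q_theta(x^n) for the Bernoulli class (with 0^0 = 1)
    n1 = sum(xs)
    n0 = n - n1
    return (n1 / n) ** n1 * (n0 / n) ** n0

# the Shtarkov sum: the right-hand side of (8) before taking the log
shtarkov = sum(max_likelihood(xs) for xs in product([0, 1], repeat=n))
min_wcr = math.log2(shtarkov)

p_nml = {xs: max_likelihood(xs) / shtarkov for xs in product([0, 1], repeat=n)}
regrets = [math.log2(max_likelihood(xs) / p_nml[xs]) for xs in p_nml]
spread = max(regrets) - min(regrets)   # P_NML equalizes the regret across sequences
```

For n = 10 this gives min_wcr ≈ 2.22 bits, and the regret spread across sequences is zero up to floating-point error.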
4 Third answer: Dirichlet mixture
Although the normalized maximum likelihood predictor gives the optimal worst-case regret, it has a severe drawback: it is usually very hard to compute in practice. In contrast, the mixture idea used in the first answer can usually be computed efficiently (as will be shown below). This observation motivates us to come up with a mixture-based predictor P which enjoys a worst-case regret close to that of P_{\mathrm{NML}}. The general mixture idea is as follows: suppose the model class \mathcal{F} = (Q_\theta)_{\theta \in \Theta} can be parametrized by \theta. Consider some prior w(\cdot) on \Theta and choose

P_w(x^n) = \int_\Theta Q_\theta(x^n) w(d\theta)    (14)

as our predictor. Note that in this case,

P_w(x_t | x^{t-1}) = \frac{P_w(x^t)}{P_w(x^{t-1})}    (15)
= \frac{\int_\Theta Q_\theta(x^t) w(d\theta)}{\int_\Theta Q_\theta(x^{t-1}) w(d\theta)}    (16)
= \frac{\int_\Theta Q_\theta(x_t | x^{t-1}) Q_\theta(x^{t-1}) w(d\theta)}{\int_\Theta Q_\theta(x^{t-1}) w(d\theta)}    (17)
= \int_\Theta Q_\theta(x_t | x^{t-1}) \frac{Q_\theta(x^{t-1})}{\int_\Theta Q_{\theta'}(x^{t-1}) w(d\theta')} w(d\theta)    (18)
\triangleq \int_\Theta Q_\theta(x_t | x^{t-1}) w(d\theta | x^{t-1})    (19)

where

w(d\theta | x^{t-1}) = \frac{Q_\theta(x^{t-1}) w(d\theta)}{\int_\Theta Q_{\theta'}(x^{t-1}) w(d\theta')}.    (20)

Then how should we choose the prior w(\cdot)? On one hand, the prior w(\cdot) should assign roughly equal weights to the "representative" sources. On the other hand, it should be chosen so that the sequential probability assignments are computationally tractable. We take a look at some concrete examples.
4.1 Bernoulli case

In the first example, we assume that Q_\theta = \mathrm{Bern}(\theta) and \mathcal{F} = \{Q_\theta\}_{\theta \in [0,1]}. Note that in this case the model class consists of all i.i.d. distributions. We assign the Dirichlet prior

w(d\theta) = \frac{1}{\pi} \cdot \frac{d\theta}{\sqrt{\theta(1 - \theta)}} \sim \mathrm{Beta}\left(\frac{1}{2}, \frac{1}{2}\right).    (21)

The properties of the resulting predictor are left as exercises:
Exercise 3. Denote by n_0 = n_0(x^n), n_1 = n_1(x^n) the number of 0's and 1's in the sequence x^n, respectively. Prove the following:

(a) The conditional probability is given by

P_w(x_t | x^{t-1}) = \frac{n_{x_t}(x^{t-1}) + \frac{1}{2}}{t}.    (22)

This is called the Krichevsky–Trofimov sequential probability assignment.

(b) Show that WCR_n(P_w, \mathcal{F}) = \frac{1}{2} \log n + O(1).

(c) Considering the normalized maximum likelihood predictor, also show that \min_P WCR_n(P, \mathcal{F}) = \frac{1}{2} \log n + O(1). Hence, our Dirichlet mixture P_w attains the optimal leading term of the worst-case regret.

In fact, there is a general theorem on the performance of the mixture-based predictor:
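A sketch of the KT assignment (22), accumulated sequentially; as a sanity check, the product of conditionals is compared against the Beta-integral closed form in terms of gamma functions (stated at the beginning of the next lecture):

```python
import math

def kt_probability(xs):
    # sequential KT assignment: P(x_t = x | x^{t-1}) = (n_x(x^{t-1}) + 1/2) / t
    counts = [0, 0]
    p = 1.0
    for t, x in enumerate(xs, start=1):
        p *= (counts[x] + 0.5) / t
        counts[x] += 1
    return p

def kt_closed_form(n0, n1):
    # Beta(1/2, 1/2) mixture of i.i.d. Bernoulli laws, integrated in closed form
    g = math.gamma
    return g(n0 + 0.5) * g(n1 + 0.5) / (g(0.5) ** 2 * g(n0 + n1 + 1))

xs = [1, 1, 0, 0, 1, 0, 0, 1, 1, 0]
sequential = kt_probability(xs)
closed = kt_closed_form(xs.count(0), xs.count(1))
```

The two computations agree to floating-point precision; note also that the first symbol always receives probability 1/2.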
Theorem 4 (Rissanen, 1984). Let \mathcal{F} = \{Q_\theta\}_{\theta \in \Theta} with \Theta \subset \mathbb{R}^d. Under benign regularity and smoothness conditions, we have

\min_P WCR_n(P, \mathcal{F}) = \frac{d}{2} \log n + o(\log n).    (23)

Moreover, there exists some Dirichlet prior w(\cdot) such that

WCR_n(P_w, \mathcal{F}) = \frac{d}{2} \log n + o(\log n).    (24)

4.2 Markov sources

Given that we can compete with i.i.d. Bernoulli sequences, how can we compete under logarithmic loss with respect to Markov models? Assume \mathcal{X} = \{0, 1\} and denote by \mathcal{M}_k the set of all k-th order Markov models. Note that for Q \in \mathcal{M}_k,
Q(x^n | x_{-k+1}^0) = \prod_{i=1}^{n} Q(x_i | x_{i-k}^{i-1})    (25)
= \prod_{\text{all possible } k\text{-tuples } u^k} \; \prod_{i : x_{i-k}^{i-1} = u^k} Q(x_i | u^k).    (26)

The product \prod_{i : x_{i-k}^{i-1} = u^k} Q(x_i | u^k) gives us the probability of the subsequence \{x_i : x_{i-k}^{i-1} = u^k\} under \mathrm{Bern}(Q(1 | u^k)). Then it makes sense to employ the KT estimator on each of these subsequences. Consider the following P such that

P(x_t | x_{-k+1}^{t-1}) = \frac{n_{x_{t-k}^{t-1} x_t}(x_{-k+1}^{t-1}) + 1/2}{n_{x_{t-k}^{t-1}}(x_{-k+1}^{t-1}) + 1},    (27)

where n_{u^k}(x^n) = \sum_{i=1}^{n} \mathbf{1}(x_{i-k+1}^i = u^k). Then, for this P, we have

WCR_n(P, \mathcal{M}_k) \le \sum_{u^k} \frac{1}{2} \ln(n_{u^k}(x^{n-1})) + o(\ln n)    (28)
\le \frac{2^k}{2} \ln \frac{n}{2^k} + o(\ln n),    (29)

which matches the Rissanen lower bound in the leading term.
EE378A Statistical Signal Processing Lecture 9 - 05/02/2017
Lecture 9: Context Tree Weighting
Lecturer: Tsachy Weissman    Scribe: Chuan-Zheng Lee

In the last lecture we talked about the mixture idea for i.i.d. sources and Markov sources. Specifically, we introduced the Krichevsky–Trofimov probability assignment for a binary vector x^n \in \{0,1\}^n, given by

P^{KT}(x^n) = \int_0^1 \omega(\theta) \theta^{n_1(x^n)} (1 - \theta)^{n_0(x^n)} d\theta = \frac{\Gamma(n_0 + \frac{1}{2}) \Gamma(n_1 + \frac{1}{2})}{(\Gamma(\frac{1}{2}))^2 \Gamma(n + 1)},    (1)

where \omega(\theta) \propto \frac{1}{\sqrt{\theta(1-\theta)}}, n_1 \equiv n_1(x^n) is the number of 1's in x^n, n_0 \equiv n_0(x^n) is the number of 0's (so that n = n_0 + n_1), and \Gamma(\cdot) is the gamma function. Note that \omega(\theta) is a Dirichlet prior with parameters (\frac{1}{2}, \frac{1}{2}), and the integrand is proportional to a Dirichlet density with parameters (n_0 + \frac{1}{2}, n_1 + \frac{1}{2}). The main result is that, for regular models with d degrees of freedom, the Dirichlet mixing idea achieves the optimal worst-case regret \frac{d}{2} \log n + o(\log n) for individual sequences of length n. In this lecture, we will extend this idea to general tree sources.
1 Tree Source
We first begin with an informal definition of a tree source.
Definition 1 (Context of a symbol). A context of a symbol x_t is a suffix of the sequence that precedes x_t, that is, of (\ldots, x_{t-2}, x_{t-1}).

Definition 2 (Tree source). A tree source is a source where a set of suffixes is sufficient to determine the probability distribution of the next symbol in the sequence (regardless of the earlier past). A tree source is hence a generalization of an order-one Markov source, where only the previous element in the sequence determines the next: now, instead of just one element being sufficient, the suffix is sufficient. It is perhaps best elucidated by example.

Example 3. Let S = \{00, 10, 1\} and \Theta = \{\theta_s, s \in S\}. Then

p(X_t = 1 | X_{t-2} = 0, X_{t-1} = 0) = \theta_{00}
p(X_t = 1 | X_{t-2} = 1, X_{t-1} = 0) = \theta_{10}
p(X_t = 1 | X_{t-1} = 1) = \theta_1

We can represent this structure in a binary tree.

[Figure: binary context tree over times t-2, t-1, t, with leaves labeled \theta_1 (context 1), \theta_{00} (context 00), and \theta_{10} (context 10).]
Remark Note that the source in Example 3 is not first-order Markov, but can be cast as a special case of a second-order Markov source.
We denote by P_{S,\theta}(x^n) the associated probability mass function on x^n \in \mathcal{X}^n.

2 Known tree, unknown parameters
Recall that with Markov sources, we could think of the pmf of any sequence as the product of conditional pmfs. For tree sources, we can factorize similarly, but with respect to suffixes. Let P_S^{KT} be the pmf induced by employing the Krichevsky–Trofimov sequential probability assignment (separately) on each subsequence corresponding to every s \in S. We will denote by n_s the number of x_i, 1 \le i \le n, with context s, and we will show in Homework 4 that

\log \frac{1}{P^{KT}(x^n)} - \min_\theta \log \frac{1}{Q_{\mathrm{Bern}(\theta)}(x^n)} \le \frac{1}{2} \log n + 1.
Now, enumerating over the different contexts, the regret on a sequence x^n is upper-bounded by

\log \frac{1}{P_S^{KT}(x^n)} - \min_\theta \log \frac{1}{P_{S,\theta}(x^n)} \le \sum_{s \in S} \left(\frac{1}{2} \log n_s + 1\right)
= |S| \sum_{s \in S} \frac{1}{|S|} \cdot \frac{1}{2} \log n_s + |S|
\overset{(a)}{\le} |S| \cdot \frac{1}{2} \log\left(\sum_{s \in S} \frac{n_s}{|S|}\right) + |S|
= \frac{|S|}{2} \log \frac{n}{|S|} + |S|

where (a) follows from Jensen's inequality. Recall from Lecture 8 that Rissanen showed that the worst-case regret could at best be \frac{d}{2} \log n + o(\log n). This shows that the above strategy of using the Krichevsky–Trofimov assignment per context is optimal up to lower-order terms.
3 Unknown tree
We say a context tree is of depth D if it is a full binary tree of depth D, where each node s stores two integers (n_{s0}, n_{s1}), i.e., the numbers of 0's and 1's in the sequence x^n which occur right after the context s.

Example 4. Consider the sequence
x_{-2} x_{-1} x_0 | x_1 x_2 x_3 x_4 x_5 x_6 x_7
  1      1     0  |  0   1   0   0   1   1   0   ···

(the first three symbols serve as the initial context). We can draw the tree, with each node s labeled with x(s), the set of symbols occurring after context s:
[Figure: depth-3 context tree over times t-3, t-2, t-1, t; each node s is labeled with x(s), the set of symbols x_i whose context has suffix s. For example, x('1') = \{x_3, x_6, x_7\}, x('0') = \{x_1, x_2, x_4, x_5\}, x('001') = \{x_3, x_6\}, and the root is labeled \{x_1, \ldots, x_7\}.]
For example, x_3 is preceded by '001', so it is included in x('001') (at t-3), x('01') (at t-2), and x('1') (at t-1). Moreover, for s = 001, we have n_{s0} = n_{s1} = 1 (corresponding to x_3 and x_6, respectively).

The context-tree weighting at any leaf node s is P_{CTW}(x^n, s) = P^{KT}(n_{s0}, n_{s1}) if depth(s) = D. For any node s other than a leaf node, we have

P_{CTW}(x^n, s) = \frac{1}{2} P^{KT}(n_{s0}, n_{s1}) + \frac{1}{2} P_{CTW}(x^n, 0s) P_{CTW}(x^n, 1s)

where 0s means the sequence formed by prepending '0' to s, and 1s means the sequence formed by prepending '1' to s. Finally, the CTW probability assignment on x^n can be found recursively at the root via P_{CTW}(x^n) = P_{CTW}(x^n, \emptyset).
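The recursion above can be sketched directly in code. The helper below (a minimal illustration with hypothetical function names) counts, for every suffix s of length at most D, the symbols following context s, and then evaluates P_CTW by the recursion; summing over all sequences of a fixed length confirms the assignment is a valid probability distribution:

```python
import math
from itertools import product

def kt(n0, n1):
    # Krichevsky–Trofimov probability of a (n0, n1) count vector
    g = math.gamma
    return g(n0 + 0.5) * g(n1 + 0.5) / (g(0.5) ** 2 * g(n0 + n1 + 1))

def ctw_probability(xs, past, D):
    # P_CTW(x^n, root) for a binary list xs, given at least D past symbols
    full = past + xs
    counts = {}
    for j in range(len(past), len(full)):
        ctx = tuple(full[j - D:j])            # the D symbols preceding x_j
        for d in range(D + 1):
            s = ctx[D - d:]                   # suffix of length d (a context)
            counts.setdefault(s, [0, 0])[full[j]] += 1

    def p(s):
        n0, n1 = counts.get(s, [0, 0])
        if len(s) == D:                       # leaf node: pure KT
            return kt(n0, n1)
        # internal node: mix own KT estimate with the product of the children
        return 0.5 * kt(n0, n1) + 0.5 * p((0,) + s) * p((1,) + s)

    return p(())

# sanity check: summing over all x^3 given a fixed past yields 1
total = sum(ctw_probability(list(xs), [0, 0], 2) for xs in product([0, 1], repeat=3))
```

Because every node's assignment is a convex combination of valid probability assignments, the total over all sequences is exactly 1.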
EE378A Statistical Signal Processing Lecture 10 - 05/04/2017
Lecture 10: Regret Analysis of Context Tree Weighting
Lecturer: Tsachy Weissman    Scribe: Chuan-Zheng Lee
In the last lecture we introduced how to use Krichevsky–Trofimov probability assignment to construct the context tree, and the corresponding weighting scheme for tree sources. In this lecture we will analyze the (worst-case, or the average-case) regret of the CTW algorithm under logarithmic loss.
1 Context Tree Weighting: Review
Notation: since the Krichevsky–Trofimov probability assignment only depends on the numbers of zeros and ones in the sequence, we define \vec{n} = (n_0(x^n), n_1(x^n)) and write P^{KT}(x^n) as P^{KT}(\vec{n}). The CTW algorithm imposes the Krichevsky–Trofimov probability assignment on every context s \in S separately, and with each context s \in S we associate a vector \vec{n}_s = (n_{s0}(x^n), n_{s1}(x^n)), i.e., the numbers of occurrences of the contexts s0 and s1 in the sequence x^n, respectively. Now revisit the last example with nodes labeled with their corresponding \vec{n}:
x_{-2} x_{-1} x_0 | x_1 x_2 x_3 x_4 x_5 x_6 x_7
  1      1     0  |  0   1   0   0   1   1   0   ···

[Figure: the depth-3 context tree with each node s labeled by its count vector \vec{n}_s: \vec{n} = (4, 2); \vec{n}_0 = (2, 2), \vec{n}_1 = (2, 1); \vec{n}_{00} = (0, 2), \vec{n}_{01} = (1, 1), \vec{n}_{10} = (2, 0), \vec{n}_{11} = (1, 0); \vec{n}_{000} = (0, 0), \vec{n}_{001} = (1, 1), \vec{n}_{010} = (1, 0), \vec{n}_{011} = (1, 0), \vec{n}_{100} = (0, 2), \vec{n}_{101} = (0, 0), \vec{n}_{110} = (1, 0), \vec{n}_{111} = (0, 0).]
P_{CTW}(x^n, s) = \begin{cases} P^{KT}(\vec{n}_s) & \text{if } \mathrm{depth}(s) = D \\ \frac{1}{2} P^{KT}(\vec{n}_s) + \frac{1}{2} P_{CTW}(x^n, 0s) P_{CTW}(x^n, 1s) & \text{otherwise.} \end{cases}

Recursively we find P_{CTW}(x^n) = P_{CTW}(x^n, \emptyset), and also the conditional probability

P_{CTW}(x_{n+1} | x^n) = \frac{P_{CTW}(x^{n+1})}{P_{CTW}(x^n)}.
Remark: The choice of the weights \frac{1}{2} in the second line is not important; they can be changed to any (\alpha, 1 - \alpha) with \alpha \in (0, 1) with almost no loss in the theoretical analysis (as you may see below). In practice, the choice of \alpha depends on the data.
2 Worst-Case Regret of CTW Algorithm
Consider the context tree in the previous example with depth 3, and suppose that the true tree model S is the one from Example 3 of the last lecture, with suffix set S = \{00, 10, 1\}.

[Figure: the true tree model, a binary context tree with leaves \theta_1, \theta_{00}, \theta_{10}.]

By definition of the CTW algorithm, we have

P_{CTW}(x^n, 00) \ge \frac{1}{2} P^{KT}(\vec{n}_{00})
P_{CTW}(x^n, 10) \ge \frac{1}{2} P^{KT}(\vec{n}_{10})
P_{CTW}(x^n, 1) \ge \frac{1}{2} P^{KT}(\vec{n}_1)
P_{CTW}(x^n, 0) \ge \frac{1}{2} P_{CTW}(x^n, 00) P_{CTW}(x^n, 10)

P_{CTW}(x^n) = P_{CTW}(x^n, \emptyset)
\ge \frac{1}{2} P_{CTW}(x^n, 0) P_{CTW}(x^n, 1)
\ge \frac{1}{2} \cdot \frac{1}{2} P_{CTW}(x^n, 00) P_{CTW}(x^n, 10) \cdot \frac{1}{2} P^{KT}(\vec{n}_1)
\ge \frac{1}{2^5} P^{KT}(\vec{n}_{00}) P^{KT}(\vec{n}_{10}) P^{KT}(\vec{n}_1)
= \frac{1}{2^5} P_S^{KT}(x^n).

As a result, for any sequence x^n, in this case we have

\log \frac{1}{P_{CTW}(x^n)} - \min_\theta \log \frac{1}{P_{S,\theta}(x^n)} \le 5 + \log \frac{1}{P_S^{KT}(x^n)} - \min_\theta \log \frac{1}{P_{S,\theta}(x^n)}
\le 5 + \frac{3}{2} \log \frac{n}{3} + 3
= \frac{3}{2} \log \frac{n}{3} + 8.
In general, repeating the analysis above, we lose a multiplicative factor of at most \frac{1}{2} (i.e., one bit) for each of the |S| leaves and |S| - 1 internal nodes. Therefore, in general, for any S corresponding to a tree source of depth \le D,

\log \frac{1}{P_{CTW}(x^n)} \le 2|S| - 1 + \log \frac{1}{P_S^{KT}(x^n)}
\le 2|S| - 1 + \frac{|S|}{2} \log \frac{n}{|S|} + |S| + \min_\theta \log \frac{1}{P_{S,\theta}(x^n)}
\le \frac{|S|}{2} \log \frac{n}{|S|} + 3|S| - 1 + \min_\theta \log \frac{1}{P_{S,\theta}(x^n)},

which again shows the order-optimality of context-tree weighting for any unknown tree model of depth at most D.
3 Average Regret of CTW Algorithm
Recall that the worst-case regret under logarithmic loss is given by
WCR_n(P, \mathcal{F}) = \max_{x^n} \left[\log \frac{1}{P(x^n)} - \min_{Q \in \mathcal{F}} \log \frac{1}{Q(x^n)}\right].
Now, for all true probability distribution Q ∈ F, we have
D(Q_{X^n} \| P_{X^n}) = E_Q \log \frac{Q(X^n)}{P(X^n)}
\le E_Q \left[\log \frac{1}{P(X^n)} - \min_{Q' \in \mathcal{F}} \log \frac{1}{Q'(X^n)}\right]
\le WCR_n(P, \mathcal{F}),

which shows that the average-case regret is upper bounded by the worst-case regret:

\sup_{Q \in \mathcal{F}} D(Q_{X^n} \| P_{X^n}) \le WCR_n(P, \mathcal{F}).
Applying this result to tree sources: for any tree source P_{S,\theta} with depth(S) \le D, we have

D(P_{S,\theta,X^n} \| P_{CTW,X^n}) \le \frac{|S|}{2} \log n + O(1).

As a result, we know from Lecture 7 that for prediction under a general loss function \Lambda,

E[L_{F^{P_{CTW}}}(X^n) - L_{F^{P_{S,\theta}}}(X^n)] \le \Lambda_{\max} \sqrt{\frac{2}{n} \left(\frac{|S|}{2} \log n + O(1)\right)}.
EE378A Statistical Signal Processing Lecture 11 - 05/09/2017
Lecture 11: Estimation of Directed Information
Lecturer: Tsachy Weissman    Scribe: Maggie Engler
In this lecture, we will introduce the quantity of directed information, provide motivation for estimating the directed information, and look at two estimators and their associated performance guarantees. First, though, we clarify a question from last time about the depth of context tree weighting.
1 Context Tree Weighting, Continued
Recall that, with Q_{CTW,D} denoting the probability assignment of context tree weighting with depth D, we showed in the previous lecture that

\log \frac{1}{Q_{CTW,D}(X^n)} - \min_\theta \log \frac{1}{P_{S,\theta}(X^n)} \le \frac{|S|}{2} \log \frac{n}{|S|} + 3|S| - 1    (1)
= \frac{|S|}{2} \log n + O(1)    (2)

for any tree source S with depth at most D. In other words, to obtain the optimal regret up to lower-order terms, in practice we can choose the tree depth D to be greater than the anticipated depth of the true tree source.

To avoid the issue of estimating the true tree depth, we can further mix many context trees with different tree depths D. Consider the following mixture: take any \{w_D\}_{D \ge 1} such that all w_D \ge 0 and \sum_{D=1}^{\infty} w_D = 1, and let Q = \sum_{D=1}^{\infty} w_D Q_{CTW,D}. Now, by the standard mixing argument, for any tree source S with depth D, we have

\log \frac{1}{Q(X^n)} \le \log \frac{1}{Q_{CTW,D}(X^n)} + \log \frac{1}{w_D}    (3)
\le \frac{|S|}{2} \log \frac{n}{|S|} + 3|S| - 1 + \log \frac{1}{w_D}.    (4)

Thus, the mixture Q as defined above is also guaranteed to perform within a constant term of the optimal regret for any tree source S.
2 Universality
Note: Remember that if some data is governed by a law P, the excess log-loss of the probability assignment Q (beyond what is optimal) is the relative entropy D(P_{X^n} \| Q_{X^n}), normalized to \frac{1}{n} D(P_{X^n} \| Q_{X^n}).

Definition 1 (Stationary). A probability measure P on \mathcal{X}^\infty is stationary if for any k-tuple (X_{i_1}, X_{i_2}, \ldots, X_{i_k}) and any m,

(X_{i_1}, X_{i_2}, \ldots, X_{i_k}) \overset{d}{=} (X_{i_1+m}, X_{i_2+m}, \ldots, X_{i_k+m})    (5)

i.e., the shifted k-tuple is equal in distribution to the original. This is also described as shift-invariance and is a common assumption for an unknown underlying distribution.

Definition 2 (Universal). A probability measure Q on \mathcal{X}^\infty is universal if

\frac{1}{n} D(P_{X^n} \| Q_{X^n}) \xrightarrow{n \to \infty} 0    (6)

for all stationary laws P.
Exercise: Show that the mixture Q in (3) is universal.
Universality of Q is clearly a desirable property, as we are guaranteed to have a vanishing excess log-loss with any stationary law P . Additionally, if some Q is universal, it can be used to estimate other quantities of interest, such as directed information.
3 Directed Information
Definition 3 (Directed information). The directed information from Xn to Y n is defined as
I(X^n \to Y^n) \triangleq \sum_{i=1}^{n} I(X^i; Y_i | Y^{i-1})    (7)

More generally,

I(X^{n-d} \to Y^n) \triangleq I((\emptyset^d, X^{n-d}) \to Y^n) = \sum_{i=1}^{n} I(X^{i-d}; Y_i | Y^{i-1}).    (8)

We remark that mutual information is the canonical measure of the relevance of one process to another, and directed information is the canonical measure of the causal relevance of one process to another. Notice that directed information is a summation of mutual informations between the first i elements of X and the i-th element of Y, conditioned on the previous i - 1 values of Y. Namely, we measure the causal relevance of X^i for our prediction of Y_i, given Y^{i-1}. The following two examples will illustrate why this is a very natural metric in prediction.
Example 1: Suppose that (X_1, X_2, \ldots, X_n) and (Y_1, Y_2, \ldots, Y_n) represent two stocks on the New York Stock Exchange. You are a young investor desperate to impress your boss at the hedge fund, and you would like to predict the price of stock Y tomorrow, i.e., at day i. You could look solely at Y^{i-1}, but, being more resourceful than that, you decide to determine the extent to which X^{i-1} also informs Y_i, that is, I(X^{i-1}; Y_i | Y^{i-1}). Collected over time, we have \sum_{i=1}^{n} I(X^{i-1}; Y_i | Y^{i-1}), which is equivalent to I(X^{n-1} \to Y^n).
Example 2: Now, on the basis of your excellent application of directed information, you have been promoted to hedge fund manager. Your new role requires you to be prescient of global trends, so this time let (X_1, X_2, \ldots, X_n) be the Hang Seng Index, an overall performance measure of the Hong Kong Stock Exchange, and let (Y_1, Y_2, \ldots, Y_n) be the Dow Jones Industrial Average. Because of the time difference between New York and Hong Kong, trading has already closed in Hong Kong before the opening bell in New York. Therefore, when you predict Y_i, you now have access to X_i in addition to X^{i-1} and Y^{i-1} as before. The conditional mutual information is I(X^i; Y_i | Y^{i-1}), and over a period of n days we get \sum_{i=1}^{n} I(X^i; Y_i | Y^{i-1}) = I(X^n \to Y^n).
We describe mutual information as a measure of relevance or correlation between two processes, without implying causation. Directed information allows us to be more explicit about the flow of information from one process to another. The relation is shown below, to be proved as an exercise.
Exercise: Show that I(Xn; Y n) = I(Xn → Y n) + I(Y n−1 → Xn)
Also note that this implies I(X^n \to Y^n) \le I(X^n; Y^n):

I(X^n \to Y^n) = \sum_{i=1}^{n} I(X^i; Y_i | Y^{i-1}) \le \sum_{i=1}^{n} I(X^n; Y_i | Y^{i-1}) = I(X^n; Y^n)    (9)

where the second step is given by the data processing inequality and the final equality by the chain rule.
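The conservation law in the exercise can be verified numerically for n = 2 on a randomly generated joint pmf of (X_1, X_2, Y_1, Y_2) (all helper names below are illustrative):

```python
import itertools
import math
import random

random.seed(0)
keys = list(itertools.product([0, 1], repeat=4))        # atoms (x1, x2, y1, y2)
raw = [random.random() for _ in keys]
P = {k: r / sum(raw) for k, r in zip(keys, raw)}

def marg(idxs):
    # marginal pmf over the coordinates listed in idxs
    out = {}
    for k, v in P.items():
        kk = tuple(k[i] for i in idxs)
        out[kk] = out.get(kk, 0.0) + v
    return out

def cmi(A, B, C):
    # conditional mutual information I(X_A; X_B | X_C) from the joint pmf P
    pabc, pac, pbc, pc = marg(A + B + C), marg(A + C), marg(B + C), marg(C)
    total = 0.0
    for k, v in pabc.items():
        a, b, c = k[:len(A)], k[len(A):len(A) + len(B)], k[len(A) + len(B):]
        total += v * math.log2(v * pc[c] / (pac[a + c] * pbc[b + c]))
    return total

# coordinate indices: x1 = 0, x2 = 1, y1 = 2, y2 = 3
I_xy = cmi((0, 1), (2, 3), ())                               # I(X^2; Y^2)
I_fwd = cmi((0,), (2,), ()) + cmi((0, 1), (3,), (2,))        # I(X^2 -> Y^2)
I_back = cmi((2,), (1,), (0,))                               # I(Y^1 -> X^2)
```

The identity I(X^2; Y^2) = I(X^2 → Y^2) + I(Y^1 → X^2) holds to floating-point precision, and both directed terms are non-negative.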
4 Estimation of Directed Information
Definition 4 (Directed information rate). Let X and Y be a pair of processes. The directed information rate from X to Y is
\mathbb{I}(X \to Y) \triangleq \lim_{n \to \infty} \frac{1}{n} I(X^n \to Y^n),    (10)

provided that the limit exists.

Exercise: Show that the limit exists if X and Y are jointly stationary.
Note: Recall that for X ∼ PX , the entropy may be written H(X) or equivalently H(PX ), which is a function of the distribution.
Notice that

I(X^n \to Y^n) = \sum_{i=1}^{n} I(X^i; Y_i | Y^{i-1}) = \sum_{i=1}^{n} \left[H(Y_i | Y^{i-1}) - H(Y_i | X^i, Y^{i-1})\right]    (11)

This suggests an estimator for the directed information rate. Specifically,

\hat{I}(X^n \to Y^n) = \frac{1}{n} \sum_{i=1}^{n} \left[H(Y_i | Y^{i-1}) - H(Y_i | X^i, Y^{i-1})\right].    (12)

Since the (conditional) entropy depends solely on the underlying unknown distribution, we can instead compute it under another probability assignment Q of our choice. The idea is that if Q is universal, then the (conditional) entropies based on Q should converge to those based on the true P.

Theorem 5. If Q is universal and (X, Y) are jointly stationary and ergodic (a concept we will not introduce here, which basically means that the law of large numbers holds for the process), then

E[\mathbb{I}(X \to Y) - \hat{I}(X^n \to Y^n)] \xrightarrow{n \to \infty} 0    (13)

Therefore, this estimator of the directed information rate converges to the true rate. With a few more constraints, we can further bound how quickly the estimate converges.

Theorem 6. If Q is derived from context tree weighting (CTW) and the joint process (X, Y) is stationary, ergodic, aperiodic, and Markov (of arbitrary order), then

E[\mathbb{I}(X \to Y) - \hat{I}(X^n \to Y^n)] \le C \frac{\log n}{\sqrt{n}}    (14)

where C > 0 is a universal constant not depending on n. For a more in-depth discussion, please see the paper "Universal Estimation of Directed Information" (Jiao, Permuter, Zhao, Kim, and Weissman, 2012).

Another estimator is suggested by the observation that

I(X^n \to Y^n) = \sum_{i=1}^{n} D(P_{Y_i | X^i, Y^{i-1}} \| P_{Y_i | Y^{i-1}} | P_{X^i, Y^{i-1}})    (15)

with the proof left as an exercise. We can then construct

\hat{I}(X^n \to Y^n) \triangleq \frac{1}{n} \sum_{i=1}^{n} D(Q_{Y_i | X^i, Y^{i-1}} \| Q_{Y_i | Y^{i-1}})    (16)

for a suitable choice of Q. We omit the details here, but remark that this estimator has similar performance guarantees to the first, and often has a slightly smoother convergence in practice since relative entropy is always non-negative.
EE378A Statistical Signal Processing Lecture 12 - 05/10/2017 Lecture 12: Introduction to the Discrete Universal Denoiser (DUDE) Lecturer: Tsachy Weissman Scribe: Yihui Quek
In this lecture¹, we set the stage for the Discrete Universal Denoiser (DUDE), an algorithm for denoising sequences whose symbols take values in a discrete alphabet. The following table shows at a glance how this new problem setting compares to those of the algorithms we have studied in the past:
| Algorithm | Key idea | Input sequence alphabet |
|---|---|---|
| Viterbi algorithm / Kalman filter | Maximum a posteriori (MAP) decoding | Continuous |
| Context-tree weighting (CTW) algorithm | Decoding from context | Discrete, generated by a tree source with unknown parameters |
| Discrete Universal Denoiser (DUDE) | MAP decoding from context | Discrete, generating distribution unknown |
1 Setting
Figure 1 shows a toy setting we consider for discrete denoising.
[Figure 1: $x^n = (x_1, \dots, x_n) \in \mathcal X^n \;\to\; \text{Noisy Channel} \;\to\; z^n \in \mathcal Z^n \;\to\; \text{Denoiser} \;\to\; \hat x^n(z^n) = (\hat x_1(z^n), \hat x_2(z^n), \dots, \hat x_n(z^n)) \in \hat{\mathcal X}^n$]
Figure 1: Denoising setting. In this toy example, we assume that the parameters of the source $X^n$ (i.e., $p_X(x^n)$), as well as the conditional probabilities of each output symbol produced by the channel on any given input, $p_{Z|X}(z_i \mid x_i)$, are known.
We assume all noiseless sequences and observations are composed of symbols which take values in a finite alphabet, $\mathcal X = \{a_1, \dots, a_{|\mathcal X|}\}$ (thus the ‘discrete’ in the name of the denoiser). Denoising means that, given a noisy observation sequence $z^n$, we would like to reconstruct the original noiseless signal $x^n$ by outputting a best estimate $\hat x^n(z^n)$. Assume we have a per-symbol loss function $\Lambda : \mathcal X \times \hat{\mathcal X} \to \mathbb R$ and are allowed to look at an entire length-$n$ observation $z^n$ to guess $\hat x^n$. Definition 1. An n-block denoiser is a mapping
$$\hat X^n : \mathcal Z^n \to \hat{\mathcal X}^n.$$
As in the past, we measure the performance of Xˆ n by the normalized cumulative loss on a block zn:
$$L_{\hat X^n}(x^n, z^n) = \frac{1}{n} \sum_{i=1}^{n} \Lambda\!\left(x_i, \hat X^n(z^n)[i]\right). \qquad (1)$$
¹Reading: Universal Discrete Denoising: Known Channel, T. Weissman, E. Ordentlich, G. Seroussi, S. Verdú, M.J. Weinberger, 2005.
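The normalized cumulative loss (1) is just an average of per-symbol losses; as a minimal sketch (the binary Hamming loss below is an illustrative assumption, not part of the lecture):

```python
import numpy as np

def cumulative_loss(x, xhat, loss):
    """Normalized cumulative loss (1): average of Lambda(x_i, xhat_i)
    over the block, given a loss matrix indexed by (x, xhat)."""
    return float(np.mean([loss[a, b] for a, b in zip(x, xhat)]))

hamming = 1.0 - np.eye(2)  # Lambda(x, xhat) = 1{x != xhat}
```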
(From now on we use the notation $\hat X^n(z^n)[i] = \hat x_i(z^n)$.) Let us start our investigation by assuming that the symbols are generated probabilistically with known parameters, i.e., the probabilities of all symbols, the channel, and the noise are known to the denoiser. Under the cumulative loss, an optimal denoiser is then described by
" n # n n 1 X n arg min E LXˆ n (x , z ) = arg min E Λ(xi, xˆi(z )) (2) Xˆ n(·)∈D Xˆ n(·)∈D n n n i=1 where Dn denotes the set of all block-n denoisers. We can observe from the right-hand side of (2) that the cumulative loss is separable; it is a sum of terms, each of which depends only xi andx ˆi. This suggests a n greedy strategy for producing such an estimate; under this strategy,x ˆi(z ) is nothing but the Bayes response to the conditional posterior probability of each symbol:
$$\hat x_i(z^n) = \hat X_{\mathrm{Bayes}}\!\left(P_{x_i \mid Z^n = z^n}\right) \qquad (3)$$
and the corresponding value of the expected loss is the Bayes envelope,
$$\frac{1}{n} \sum_{i=1}^{n} \mathbb E\left[ U\!\left(P_{x_i \mid Z^n}\right) \right]. \qquad (4)$$
We remind the reader that the Bayes envelope is defined as $U(P_X) := \min_{\hat x \in \hat{\mathcal X}} \sum_x P_X(x) \Lambda(x, \hat x)$. In addition, the discreteness of the input alphabet permits us to write a loss matrix that will be useful in later sections:
$$\{\Lambda(x, \hat x)\} := \begin{bmatrix} \lambda_{a_1} & \lambda_{a_2} & \cdots & \lambda_{a_{|\hat{\mathcal X}|}} \end{bmatrix}$$
where $\lambda_{a_i}$ is the column vector $\Lambda(\cdot, \hat x = a_i)$. With this definition in hand, we may rewrite the Bayes envelope in terms of the $\lambda_i$'s:
$$U(P_X) = \min_{\hat x} P_X^T \cdot \lambda_{\hat x}. \qquad (5)$$
This analysis reveals that the Bayes response denoiser (3) achieves the minimum loss on a sequence with a known generating distribution. But we have glossed over some things in the above analysis. Firstly, the Kalman filter showed us that deriving the Bayes envelope is already hard in the continuous case, and the discrete case is no easier. Secondly, we should not realistically expect to be handed the probabilistic parameters of the problem on a silver platter; in particular, we often do not know the parameters of the generating distribution of the given sequence. We have in fact treaded these waters before; see the CTW algorithm.
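To make the Bayes response and the Bayes envelope concrete, here is a minimal numeric sketch; the three-symbol alphabet, Hamming loss matrix, and distribution below are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def bayes_response(p, loss):
    """Index of the reconstruction symbol minimizing p^T . lambda_xhat,
    where lambda_xhat ranges over the columns of the loss matrix (5)."""
    return int(np.argmin(p @ loss))

def bayes_envelope(p, loss):
    """Bayes envelope U(P_X) = min_xhat sum_x P_X(x) Lambda(x, xhat)."""
    return float(np.min(p @ loss))

loss = 1.0 - np.eye(3)            # Hamming loss on a 3-symbol alphabet
p = np.array([0.5, 0.3, 0.2])     # a hypothetical P_X
xhat = bayes_response(p, loss)    # under Hamming loss: the mode of p
u = bayes_envelope(p, loss)       # under Hamming loss: 1 - max(p)
```

Under Hamming loss the Bayes response reduces to the most probable symbol and the envelope to one minus its probability, which gives a quick sanity check.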
2 Towards construction of a universal scheme
From now on we assume that we do not know the generating distribution of the sequence symbols, but we have some characterization of the channel. We would like to construct a universal scheme, i.e., a scheme that attains optimal performance asymptotically no matter the true law that governs the data. Here, the
“optimal performance denoiser” is $\hat X_{\mathrm{Bayes}}(P_{x_i \mid Z^n = z^n})$ as established previously. In this section we consider the “single-letter” problem as shown in Figure 2, where our input sequence is now only one symbol, $X$, and we are to guess $\hat X$ based on $Z$. Again, $X$ and $\hat X$ take values in the finite alphabet $\mathcal X = \{a_1, \dots, a_{|\mathcal X|}\}$. We can also define the channel transition matrix showing the probabilities of mapping inputs to outputs:
$$\{\Pi(x, z)\} := \begin{bmatrix} \pi_{a_1} & \pi_{a_2} & \cdots & \pi_{a_{|\mathcal Z|}} \end{bmatrix}$$
[Figure 2: $X \sim P_X \;\to\; \text{Channel } \Pi(x, z) \;\to\; Z \in \mathcal Z \;\to\; \text{Denoiser} \;\to\; \hat X \in \hat{\mathcal X}$]
Figure 2: Single-letter denoising. We assume that we know the channel transition probability matrix Π(x, z).
where $\pi_{a_i}$ is the column vector $\Pi(\cdot, z = a_i)$. Before we proceed, let us define the Schur/Hadamard product, represented by the symbol $\odot$:
Definition 2. For vectors v1, v2, v1 v2 is the vector obtained by element-wise multiplication.
$$[v_1 \odot v_2]_i = v_{1,i} \, v_{2,i}.$$
Secondly, we generalize the Bayes response (implicitly defined through our loss function $\Lambda(x, \hat x)$) to take an arbitrary argument $v$:
$$\hat X_{\mathrm{Bayes}}(v) = \operatorname*{arg\,min}_{\hat x} v^T \cdot \lambda_{\hat x}. \qquad (6)$$
With these definitions in hand, we may factorize PZ with a simple calculation:
$$P_Z(z) := \sum_x P_{X,Z}(x, z) = \sum_x P_X(x) \Pi(x, z) = P_X^T \cdot \pi_z, \qquad (7)$$
i.e., $P_Z^T = P_X^T \cdot \Pi$. Assume now that $|\mathcal X| = |\mathcal Z|$ and $\Pi$ is invertible². Then we may re-write this as
$$P_X = \Pi^{-T} P_Z \qquad (8)$$
and we may readily write the conditional probability
$$P_{X \mid Z=z}(x) := \frac{P_X(x) \, \Pi(x, z)}{P_Z(z)} = \frac{[\Pi^{-T} P_Z](x) \cdot \pi_z(x)}{P_Z(z)}, \quad \text{i.e.,} \quad P_{X \mid Z=z} = \frac{[\Pi^{-T} P_Z] \odot \pi_z}{P_Z(z)}. \qquad (9)$$
Let $\phi(z)$ represent the operation of denoising. The previous section established that
$$\min_{\phi(\cdot)} \mathbb E\,\Lambda(X, \phi(Z)) = \mathbb E\left[ U(P_{X \mid Z}) \right],$$
and this is achieved by
$$\phi_{\mathrm{opt}}(z) = \hat X_{\mathrm{Bayes}}(P_{X \mid Z=z}) = \hat X_{\mathrm{Bayes}}\!\left( [\Pi^{-T} P_Z] \odot \pi_z \right), \qquad (10)$$
where we have elided the denominator of (9) in the substitution because the overall normalization is not a function of $\hat x$ and does not affect the minimization. We can generalize the above analysis to situations where $|\mathcal X| \le |\mathcal Z|$, i.e., $\Pi$ is non-square but of full row rank: in those situations (to be proven in the homework) we have
$$\phi_{\mathrm{opt}}(z) = \hat X_{\mathrm{Bayes}}(P_{X \mid Z=z}) = \hat X_{\mathrm{Bayes}}\!\left( [(\Pi \Pi^T)^{-1} \Pi \cdot P_Z] \odot \pi_z \right).$$
By the way, for a full-row-rank $\Pi$ the matrix $(\Pi \Pi^T)^{-1} \Pi$ is the transpose of the Moore–Penrose pseudoinverse of $\Pi$; the pseudoinverse itself can be defined for any arbitrary matrix.
²Usually a safe assumption, but notable exceptions include the binary symmetric channel with flipping probability 1/2, which is not a useful channel anyway.
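The optimal single-letter rule (10) is easy to compute numerically. The sketch below assumes a square, invertible $\Pi$; the binary symmetric channel, the prior, and the Hamming loss are illustrative choices, not part of the lecture:

```python
import numpy as np

def phi_opt(z, Pi, P_Z, loss):
    """Optimal single-letter denoiser (10): Bayes response to
    [Pi^{-T} P_Z] (Hadamard product) pi_z, with the normalization
    dropped since it does not affect the arg min."""
    P_X = np.linalg.solve(Pi.T, P_Z)  # Pi^{-T} P_Z, recovering P_X as in (8)
    v = P_X * Pi[:, z]                # Hadamard product with the column pi_z
    return int(np.argmin(v @ loss))

delta = 0.1                           # binary symmetric channel, flip prob. 0.1
Pi = np.array([[1 - delta, delta],
               [delta, 1 - delta]])
P_X = np.array([0.7, 0.3])            # a hypothetical prior
P_Z = Pi.T @ P_X                      # P_Z^T = P_X^T Pi, as in (7)
loss = 1.0 - np.eye(2)                # Hamming loss
```

With these numbers the channel is reliable enough that the rule simply trusts the observation; with a sufficiently skewed prior (e.g., $P_X = (0.95, 0.05)$) it would override a weak observation instead.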
3 Universal Discrete Denoising and introduction to DUDE
Now we extend the above analysis to denoising of a sequence, not just a single letter – and in doing so we introduce the DUDE. Figure 1 is applicable to our new setting too, but now we have far less knowledge of the system. More concretely, the assumptions are now: 1. Finite alphabets X , Z, Xˆ.
2. The source $X^n$ is governed by a probabilistic and stationary but otherwise unknown generating distribution.
3. The channel is known to be a discrete memoryless channel; $\Pi$ is known and of full row rank.
4. The loss function $\Lambda$ is given.

The idea of DUDE is that we “correct by the context”; whereas in Section 1 we had the luxury of knowing $P_{x_i \mid Z^n = z^n}$, which we could input into the Bayes response function, now we must estimate it. We do so by looking at left and right length-$k$ contexts around the symbol to be denoised:
$$z_1 \; \dots \; \underbrace{z_{i-k} \; \dots \; z_{i-1}}_{\text{left context}} \; z_i \; \underbrace{z_{i+1} \; \dots \; z_{i+k}}_{\text{right context}} \; \dots \; z_n$$
For every pair of left and right contexts, we can define a count vector $m \in \mathbb R^{|\mathcal Z|}$ to capture the number of times a symbol $z \in \mathcal Z$ appears between these contexts over the entire sequence:
$$m(z^n, \ell^k, r^k)[z] := \left| \left\{ k+1 \le i \le n-k : z_{i-k}^{i-1} = \ell^k, \; z_{i+1}^{i+k} = r^k, \; z_i = z \right\} \right|.$$
The DUDE denoiser effectively calculates the Bayes response (3) with this empirical probability distribution. That is, DUDE($k$) performs the function $F$ where:
$$\hat x_i(z^n) = F\!\left(\Lambda, \Pi, m(z^n, z_{i-k}^{i-1}, z_{i+1}^{i+k}), z_i\right), \qquad k+1 \le i \le n-k,$$
$$= \operatorname*{arg\,min}_{\hat x} \lambda_{\hat x}^T \left( [\Pi^{-T} \cdot m(z^n, z_{i-k}^{i-1}, z_{i+1}^{i+k})] \odot \pi_{z_i} \right). \qquad (11)$$
Readers will note the similarity with (10). Surprisingly, it will turn out that the operation performed by the DUDE is linear in n.
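The rule (11) can be sketched in two passes over the observation sequence, which is also why the computation is linear in $n$ for fixed $k$. The sketch below is a simplified illustration, assuming a square, invertible $\Pi$ and integer-coded symbols, and passing boundary symbols through unchanged:

```python
import numpy as np
from collections import defaultdict

def dude(z, k, Pi, loss):
    """Two-pass sketch of DUDE(k) for a square, invertible channel
    matrix Pi.  Pass 1 gathers the count vectors m; pass 2 outputs the
    per-symbol Bayes response of (11).  Symbols are integers 0..|Z|-1."""
    n, A = len(z), Pi.shape[0]
    m = defaultdict(lambda: np.zeros(A))
    for i in range(k, n - k):                      # pass 1: count contexts
        ctx = (tuple(z[i - k:i]), tuple(z[i + 1:i + k + 1]))
        m[ctx][z[i]] += 1
    xhat = list(z)                                 # boundaries passed through
    for i in range(k, n - k):                      # pass 2: denoise
        ctx = (tuple(z[i - k:i]), tuple(z[i + 1:i + k + 1]))
        v = np.linalg.solve(Pi.T, m[ctx]) * Pi[:, z[i]]
        xhat[i] = int(np.argmin(v @ loss))
    return xhat
```

Both passes are single sweeps over $z^n$ with constant work per symbol (for fixed $k$ and alphabet size), matching the remark about linearity in $n$.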
EE378A Statistical Signal Processing Lecture 13 - 05/16/2017 Lecture 13: The Performance of Discrete Universal Denoiser (DUDE) Lecturer: Tsachy Weissman Scribe: Ruiyang Song
1 Recap
The system model of the denoising problem is presented in Fig. 1. Let $X^n$ be the unknown input sequence that passes through a discrete memoryless channel. The channel is characterized by its transition matrix $\Pi$, which is known and of full row rank. The output of the noisy channel is denoted by $Z^n$ and is observed by the denoiser. The denoiser then outputs the reconstruction sequence $\hat X^n$. Here, $X_i \in \mathcal X$, $Z_i \in \mathcal Z$, $\hat X_i \in \hat{\mathcal X}$, where $\mathcal X$, $\mathcal Z$, and $\hat{\mathcal X}$ are finite alphabets. The distortion of the reconstruction sequence is characterized by a given loss function $\Lambda : \mathcal X \times \hat{\mathcal X} \to [0, +\infty)$.
[Figure 1: $X^n \;\to\; \text{Noisy channel} \;\to\; Z^n \;\to\; \text{Denoiser} \;\to\; \hat X^n$]
Figure 1: Denoising setting.

The DUDE with context length $k$ works in the following way: in the first pass, the denoiser derives the counting vector of dimension $|\mathcal Z|$ as follows: