<<

EE378A Statistical Signal Processing Lecture 2 - 04/06/2017
Lecture 2: Statistical Decision Theory
Lecturer: Jiantao Jiao Scribe: Andrew Hilger

In this lecture, we discuss a unified theoretical framework proposed by Abraham Wald, known as statistical decision theory.1

1 Goals

1. Evaluation: The theoretical framework should aid fair comparisons between algorithms (e.g., maximum entropy vs. maximum likelihood vs. method of moments).

2. Achievability: The theoretical framework should be able to inspire the construction of statistical algorithms that are (nearly) optimal under the optimality criteria introduced in the framework.

2 Basic Elements of Statistical Decision Theory

1. Statistical Model: A family of probability measures P = {Pθ : θ ∈ Θ}, where θ is a parameter and Pθ is a probability measure indexed by the parameter.

2. Observation: X ∼ Pθ, where X is a random variable observed for some parameter value θ.

3. Objective: g(θ), e.g., inference on the entropy of the distribution Pθ.

4. Decision Rule: δ(X). The decision rule need not be deterministic. In other words, there could be a probabilistically defined decision rule with an associated Pδ|X.

5. Loss Function: L(θ, δ). The loss function tells us how bad we feel about our decision once we find out the true value of the parameter θ chosen by nature.

Example: Pθ(x) = (1/√(2π)) e^{−(x−θ)²/2}, g(θ) = θ, and L(θ, δ) = (θ − δ)². In other words, X is normally distributed with mean θ and unit variance, X ∼ N(θ, 1), and we are trying to estimate the mean θ. We judge our success (or failure) using mean-squared error.

3 Risk Function

Definition 1 (Risk Function).

R(θ, δ) ≜ E[L(θ, δ(X))]    (1)
        = ∫ L(θ, δ(x)) Pθ(dx)    (2)
        = ∫∫ L(θ, δ) Pδ|X(dδ|x) Pθ(dx).    (3)

1See Wald, Abraham. ”Statistical decision functions.” In Breakthroughs in Statistics, pp. 342-357. Springer New York, 1992.

Figure 1: Example risk functions computed over a range of parameter values for two different decision rules.

A risk function evaluates a decision rule's success over a large number of independent experiments with a fixed parameter value θ. By the law of large numbers, if we observe X many times independently, the average empirical loss of the decision rule δ will converge to the risk R(θ, δ).
Even after determining the risk functions of two decision rules, it may still be unclear which is better. Consider the example of Figure 1. Two different decision rules δ1 and δ2 result in two different risk functions R(θ, δ1) and R(θ, δ2), evaluated over different values of the parameter θ. The first decision rule δ1 is inferior for low and high values of the parameter θ but is superior for the middle values. Thus, even after computing a risk function R, it can still be unclear which decision rule is better. We need new ideas to enable us to compare different decision rules.
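To make the law-of-large-numbers interpretation concrete, here is a minimal Monte Carlo sketch in Python for the Gaussian example of Section 2. The two rules δ1(X) = X and δ2(X) = X/2 are illustrative assumptions (δ2 is not from the lecture); their empirical average losses approximate R(θ, δ1) = 1 and R(θ, δ2) = 1/4 + θ²/4, so neither rule dominates the other across θ, mirroring the situation in Figure 1.

import numpy as np

rng = np.random.default_rng(0)

def empirical_risk(delta, theta, num_trials=100_000):
    # Average squared-error loss of rule delta at parameter theta; by the
    # law of large numbers this converges to the risk R(theta, delta).
    x = rng.normal(loc=theta, scale=1.0, size=num_trials)  # X ~ N(theta, 1)
    return np.mean((theta - delta(x)) ** 2)

delta1 = lambda x: x          # R(theta, delta1) = 1 for every theta
delta2 = lambda x: 0.5 * x    # R(theta, delta2) = 1/4 + theta**2 / 4

for theta in [-2.0, 0.0, 2.0]:
    print(theta, empirical_risk(delta1, theta), empirical_risk(delta2, theta))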

4 Optimality Criterion of Decision Rules

Given the risk function of various decision rules as a function of the parameter θ, there are various approaches to determining which decision rule is optimal.

4.1 Restrict the Competitors

This is a traditional set of methods that has been overshadowed by the other approaches that we introduce later. A decision rule δ0 is eliminated (or formally, is inadmissible) if there is another decision rule δ that is strictly better, i.e., R(θ, δ0) ≥ R(θ, δ) for every θ ∈ Θ and the inequality is strict for at least one θ0 ∈ Θ. However, the problem is that many decision rules cannot be eliminated in this way, and we still lack a criterion to determine which one is better. To aid in selection, the rationale of the approach of restricting competitors is that only decision rules that are members of a certain decision rule class D are considered. The advantage is that sometimes all but one decision rule in D is inadmissible, and we just use the only admissible one.

1. Example 1: The class of unbiased decision rules D0 = {δ : E[δ(X)] = g(θ), ∀θ ∈ Θ}.

2. Example 2: The class of invariant decision rules.

However, a serious drawback of this approach is that D may be an empty set for various decision theoretic problems.

4.2 Bayesian: Average Risk Optimality

The idea is to use averaging to reduce the risk function R(θ, δ) to a single number for any given δ.

Definition 2 (Average risk under prior Λ(dθ)).

r(Λ, δ) = ∫ R(θ, δ) Λ(dθ)    (4)

Here Λ is the prior distribution, a probability measure on Θ. The Bayesians and the frequentists disagree about Λ; namely, the frequentists do not believe in the existence of the prior. However, there exist justifications of the Bayesian approach beyond the interpretation of Λ as prior belief: indeed, the complete class theorem in statistical decision theory asserts that in various decision theoretic problems, all admissible decision rules can be approximated by Bayes estimators.2

Definition 3 (Bayes estimator).

δΛ = arg min_δ r(Λ, δ)    (5)

The Bayes estimator δΛ can usually be found using the principle of computing posterior distributions. Note that

r(Λ, δ) = ∫∫∫ L(θ, δ) Pδ|X(dδ|x) Pθ(dx) Λ(dθ)    (6)
        = ∫ [∫∫ L(θ, δ) Pδ|X(dδ|x) Pθ|X(dθ|x)] PX(dx)    (7)

where PX(dx) is the marginal distribution of X and Pθ|X(dθ|x) is the posterior distribution of θ given X. In Equation (6), Pθ(dx)Λ(dθ) is the joint distribution of θ and X. In Equation (7), we only have to minimize the portion in parentheses to minimize r(Λ, δ), because PX(dx) doesn't depend on δ.

Theorem 4.3 Under mild conditions,

δΛ(x) = arg min_δ E[L(θ, δ)|X = x]    (8)
       = arg min_{Pδ|X} ∫∫ L(θ, δ) Pδ|X(dδ|x) Pθ|X(dθ|x)    (9)

Lemma 5. If L(θ, δ) is convex in δ, it suffices to consider deterministic rules δ(x).

Proof By Jensen's inequality,

∫∫ L(θ, δ) Pδ|X(dδ|x) Pθ|X(dθ|x) ≥ ∫ L(θ, ∫ δ Pδ|X(dδ|x)) Pθ|X(dθ|x).    (10)

Examples

1. L(θ, δ) = (g(θ) − δ)² ⇒ δΛ(x) = E[g(θ)|X = x]. In other words, the Bayes estimator under squared error loss is the conditional expectation of g(θ) given x.

2. L(θ, δ) = |g(θ) − δ| ⇒ δΛ(x) is any median of the posterior distribution Pθ|X=x.

3. L(θ, δ) = 1(g(θ) ≠ δ) ⇒ δΛ(x) = arg max_θ Pθ|X(θ|x) = arg max_θ Pθ(x)Λ(θ). In other words, an indicator loss function results in a maximum a posteriori (MAP) estimator decision rule.4
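As a quick numerical illustration of the three examples above, the following sketch computes the Bayes responses under squared error, absolute error, and indicator loss for an assumed discrete posterior (the support points and weights are hypothetical, chosen only for illustration).

import numpy as np

theta = np.array([0.0, 1.0, 2.0, 3.0])    # assumed support of the posterior
post = np.array([0.1, 0.2, 0.4, 0.3])     # assumed posterior weights P(theta | X = x)

mean_est = np.sum(theta * post)                   # squared error -> posterior mean
cdf = np.cumsum(post)
median_est = theta[np.searchsorted(cdf, 0.5)]     # absolute error -> posterior median
map_est = theta[np.argmax(post)]                  # indicator loss -> posterior mode (MAP)

print(mean_est, median_est, map_est)              # 1.9, 2.0, 2.0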

2 See Chapter 3 of Liese, Friedrich, and Klaus-J. Miescke. "Statistical Decision Theory: Estimation, Testing, and Selection." Springer, 2009.
3 See Theorem 1.1, Chapter 4 of Lehmann, E. L., and Casella, G. Theory of Point Estimation. Springer Science & Business Media, 1998.
4 Next week, we will cover special cases of Pθ and how to compute the Bayes estimator in a computationally efficient way. In the general case, however, computing the posterior distribution may be difficult.

4.3 Frequentist: Worst-Case Optimality (Minimax)

Definition 6 (Minimax decision rule). The decision rule δ* is minimax among all decision rules in D iff

sup_{θ∈Θ} R(θ, δ*) = inf_{δ∈D} sup_{θ∈Θ} R(θ, δ).    (11)

4.3.1 First observation

Since

R(θ, δ) = ∫∫ L(θ, δ) Pδ|X(dδ|x) Pθ(dx)    (12)

is linear in Pδ|X, and the supremum of linear functions is convex, sup_{θ∈Θ} R(θ, δ) is convex in the decision rule, so finding the minimax rule is a convex optimization problem. However, solving this convex optimization problem may be computationally intractable. For example, it may not even be computationally tractable to compute the supremum of the risk function R(θ, δ) over θ ∈ Θ. Hence, finding the exact minimax estimator is usually hard.

4.3.2 Second observation

Due to the difficulty of finding the exact minimax estimator, we turn to another goal: we wish to find an estimator δ′ such that

inf_δ sup_θ R(θ, δ) ≤ sup_θ R(θ, δ′) ≤ c · inf_δ sup_θ R(θ, δ)    (13)

where c > 1 is a constant. The left inequality is trivially true. For the right inequality, in practice one can usually choose some specific δ′ and evaluate an upper bound of sup_θ R(θ, δ′) explicitly. However, it remains to find a lower bound of inf_δ sup_θ R(θ, δ). To solve the problem (and save the world), we can use the minimax theorem.

Theorem 7 (Minimax Theorem (Sion–Kakutani)). Let Λ, X be two compact, convex sets in some topological vector spaces. Let the function H(λ, x): Λ × X → R be such that:

1. H(λ, ·) is convex for any fixed λ ∈ Λ;

2. H(·, x) is concave for any fixed x ∈ X.

Then

1. Strong duality: max_λ min_x H(λ, x) = min_x max_λ H(λ, x).

2. Existence of a saddle point: ∃(λ*, x*): H(λ, x*) ≤ H(λ*, x*) ≤ H(λ*, x) ∀λ ∈ Λ, x ∈ X.

The existence of a saddle point implies the strong duality. We note that, other than the strong duality, the following weak duality is always true without assumptions on H:

sup_λ inf_x H(λ, x) ≤ inf_x sup_λ H(λ, x)    (14)

We define the quantity rΛ ≜ inf_δ r(Λ, δ) as the Bayes risk under prior distribution Λ. We have the following line of argument using weak duality:

inf_δ sup_θ R(θ, δ) = inf_δ sup_Λ r(Λ, δ)    (15)
                    ≥ sup_Λ inf_δ r(Λ, δ)    (16)

                    = sup_Λ rΛ    (17)

Equation (17) gives us a strong tool for lower bounding the minimax risk: for any prior distribution Λ, the Bayes risk under Λ is a lower bound on the corresponding minimax risk. When the condition of the minimax theorem is satisfied (which may be expected due to the bilinearity of r(Λ, δ) in the pair (Λ, δ)), equation (16) holds with equality, which shows that there exists a sequence of priors such that the corresponding Bayes risk sequence converges to the minimax risk. In practice, it suffices to choose some appropriate prior distribution Λ in order to obtain a (nearly) minimax estimator.

5 Decision Theoretic Interpretation of Maximum Entropy

Definition 8 (Logarithmic Loss). For a distribution P over X and x ∈ X,

L(x, P) ≜ log (1/P(x)).    (18)

Note that P(x) is the discrete mass at x in the discrete case, and is the density at x in the continuous case. By definition, if P(x) is small, then the loss is large. Now we see how the maximum likelihood estimator reduces to the so-called maximum entropy estimator under the logarithmic loss function. Specifically, suppose we have functions r1(x), ··· , rk(x) and compute the following empirical averages:

βi ≜ (1/n) Σ_{j=1}^n ri(xj) = E_{Pn}[ri(X)], i = 1, ··· , k    (19)

where Pn denotes the empirical distribution based on the n samples x1, ··· , xn. By the law of large numbers, it is expected that βi is close to the true expectation under P, so we may restrict the true distribution P to the following set:

P = {P : E_P[ri(X)] = βi, 1 ≤ i ≤ k}.    (20)

Also, denote by Q the set of all probability distributions which belong to the exponential family with central statistics r1(x), ··· , rk(x), i.e.,

Q = {Q : Q(x) = (1/Z(λ)) e^{Σ_{i=1}^k λi ri(x)}, ∀x ∈ X}    (21)

where Z(λ) is the normalization factor. Now we consider two estimators of P . The first one is the maximum likelihood estimator over Q, i.e.,

P_ML = arg max_{P∈Q} Σ_{j=1}^n log P(xj).    (22)

The second one is the maximum entropy estimator over P, i.e.,

P_ME = arg max_{P∈P} H(P)    (23)

where H(P) is the entropy of P, which is H(P) = Σ_{x∈X} −P(x) log P(x) in the discrete case, and H(P) = ∫_X −P(x) log P(x) dx in the continuous case. The key result is as follows:5

Theorem 9. In the above setting, we have P_ML = P_ME.

5Berger, Adam L., Vincent J. Della Pietra, and Stephen A. Della Pietra. ”A maximum entropy approach to natural language processing.” Computational linguistics 22.1 (1996): 39-71.

Proof We begin with the following lemma, which is the key motivation for the use of the cross-entropy loss in machine learning.

Lemma 10. The true distribution minimizes the expected logarithmic loss:

P = arg min_Q E_P L(x, Q)    (24)

and the corresponding minimum E_P L(x, P) is the entropy of P.

Proof The assertion follows from the non-negativity of the KL divergence D(P‖Q) ≥ 0, where D(P‖Q) = E_P ln (dP/dQ).

Equipped with this lemma, and noting that E_P[log (1/Q(x))] is linear in P and convex in Q, by the minimax theorem we have

max_{P∈P} H(P) = max_{P∈P} min_Q E_P[log (1/Q(x))] = min_Q max_{P∈P} E_P[log (1/Q(x))]    (25)

In particular, we know that (P_ME, P_ME) is a saddle point in P × M, where M denotes the set of all probability measures on X. We first take a look at P_ME, which maximizes the LHS. By variational methods,6 it is easy to see that the entropy-maximizing distribution P_ME belongs to the exponential family Q, i.e.,

P_ME(x) = (1/Z(λ)) e^{Σ_{i=1}^k λi ri(x)}    (26)

for some parameters λ1, ··· , λk. Now by the definition of saddle points, for any Q ∈ Q ⊂ M we have

E_{P_ME}[log (1/P_ME(x))] ≤ E_{P_ME}[log (1/Q(x))]    (27)

while for any Q ∈ Q with possibly different coefficients λ′,

E_{P_ME}[log (1/Q(x))] = E_{P_ME}[log Z(λ′) − Σ_i λ′i ri(x)]   [since Q ∈ Q]    (28)
                       = log Z(λ′) − Σ_i λ′i βi   [since P_ME ∈ P]    (29)
                       = E_{Pn}[log Z(λ′) − Σ_i λ′i ri(x)]   [since Pn ∈ P]    (30)
                       = E_{Pn}[log (1/Q(x))]    (31)
                       = (1/n) Σ_{j=1}^n log (1/Q(xj)).    (32)

As a result, the saddle point condition reduces to

(1/n) Σ_{j=1}^n log (1/P_ME(xj)) ≤ (1/n) Σ_{j=1}^n log (1/Q(xj)), ∀Q ∈ Q    (33)

establishing the fact that P_ME = P_ML.

6See Chapter 12 of Thomas, Joy A., and Thomas M. Cover. Elements of information theory. John Wiley & Sons, 2006.
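As a numerical sanity check of Theorem 9, the sketch below (an assumed setup: a single feature r(x) = x on the alphabet {0, ..., 5} and synthetic data) fits the maximum likelihood member of the exponential family Q and verifies the moment-matching property E_{Qλ}[r(X)] = β, i.e., that the fitted distribution lies in the constraint set P.

import numpy as np
from scipy.optimize import minimize

alphabet = np.arange(6)              # X = {0, 1, ..., 5}; an assumption for this sketch
r = alphabet.astype(float)           # single feature r(x) = x

rng = np.random.default_rng(0)
data = rng.integers(0, 6, size=200)  # synthetic samples x_1, ..., x_n
beta = r[data].mean()                # empirical moment beta = E_{P_n}[r(X)]

def neg_log_lik(lam):
    # Q_lam(x) = exp(lam * r(x)) / Z(lam); average negative log-likelihood of the data
    logits = lam[0] * r
    log_Z = np.log(np.exp(logits).sum())
    return log_Z - logits[data].mean()

lam_hat = minimize(neg_log_lik, x0=np.zeros(1)).x   # maximum likelihood over Q

q = np.exp(lam_hat[0] * r)
q /= q.sum()
print(beta, np.sum(q * r))           # the two moments agree up to optimizer tolerance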

6 EE378A Statistical Signal Processing Lecture 3 - 04/11/2017 Lecture 3: State Estimation in Hidden Markov Processes Lecturer: Tsachy Weissman Scribe: Alexandros Anemogiannis, Vighnesh Sachidananda

In this lecture,1 we define Hidden Markov Processes (HMPs) as well as the forward-backward recursion method that efficiently estimates the posteriors in HMPs.

1 Markov Chain

We begin by introducing the Markov triplet, which is useful in the design and analysis of algorithms for state estimation in Hidden Markov Processes. Let X,Y,Z denote discrete random variables. We say that X, Y , and Z form a Markov triplet (which we denote as X − Y − Z) given the following relationships:

X − Y − Z ⇐⇒ p(x, z|y) = p(x|y) p(z|y)    (1)
          ⇐⇒ p(z|x, y) = p(z|y)    (2)

If X,Y,Z form a Markov triplet, their joint distribution enjoys the important property that it can be factored into the product of two functions φ1 and φ2.

Lemma 1. X − Y − Z ⇐⇒ ∃ φ1, φ2 s.t. p(x, y, z) = φ1(x, y) φ2(y, z).

Proof (Proof of necessity) Assume that X − Y − Z. Then by the definition of the Markov chain, we have

p(x, y, z) = p(x|y, z)p(y, z) = p(x|y)p(y, z).

Now taking φ1(x, y) = p(x|y) and φ2(y, z) = p(y, z) proves this direction.
(Proof of sufficiency) Now suppose the functions φ1, φ2 exist as mentioned in the statement. Then we have:

p(x|y, z) = p(x, y, z) / p(y, z)
          = p(x, y, z) / Σ_x̃ p(x̃, y, z)
          = φ1(x, y) φ2(y, z) / (φ2(y, z) Σ_x̃ φ1(x̃, y))
          = φ1(x, y) / Σ_x̃ φ1(x̃, y).

The final equality shows that p(x|y, z) is not a function of z; hence, X − Y − Z follows by definition.

2 Hidden Markov Processes

Based on our understanding of the Markov Triplet, we define the Hidden Markov Process (HMP) as follows:

1 Reading: Ephraim & Merhav 2002, lectures 9, 10 of the 2016 notes

Definition 2 (Hidden Markov Process). The process {(Xn, Yn)}n≥1 is a Hidden Markov Process if

•{Xn}n≥1 is a Markov process, i.e.,

∀ n ≥ 1 : p(x^n) = Π_{t=1}^n p(xt|xt−1),

where by convention, we take X0 ≡ 0 and hence p(x1|x0) = p(x1).

•{Yn}n≥1 is determined by a memoryless channel; i.e.,

∀ n ≥ 1 : p(y^n|x^n) = Π_{t=1}^n p(yt|xt).

The processes {Xn}n≥1 and {Yn}n≥1 are called the state process and the noisy observations, respectively. In view of the above definition, we can derive the joint distribution of the state and the observation as follows:

p(x^n, y^n) = p(x^n) p(y^n|x^n) = Π_{t=1}^n p(xt|xt−1) p(yt|xt).

The main problem in hidden Markov models is to compute the posterior probability of the state at any time t, given all the observations up to that time, i.e., p(xt|y^t). The naive approach to do this is to simply use the definition of conditional probability:

p(xt|y^t) = p(xt, y^t) / p(y^t) = [Σ_{x^{t−1}} p(y^t|xt, x^{t−1}) p(xt, x^{t−1})] / [Σ_{x̃t} Σ_{x̃^{t−1}} p(y^t|x̃t, x̃^{t−1}) p(x̃t, x̃^{t−1})].

The above approach needs exponentially many computations for both the numerator and the denominator. To avoid this problem, we develop an efficient method for computing the posterior probabilities using forward recursion. Before getting to the algorithm, we establish some conditional independence relations.

3 Conditional Independence Relations

We introduce some conditional independence relations which help us find more efficient ways of calculating the posterior.

(a) (X^{t−1}, Y^t) − Xt − (X_{t+1}^n, Y_{t+1}^n):

Proof We have:

p(x^n, y^n) = Π_{i=1}^n p(xi|xi−1) p(yi|xi)
            = [Π_{i=1}^t p(xi|xi−1) p(yi|xi)] · [Π_{i=t+1}^n p(xi|xi−1) p(yi|xi)]
            = φ1(xt, (x^{t−1}, y^t)) · φ2(xt, (x_{t+1}^n, y_{t+1}^n)),

where the first bracket depends only on (x^{t−1}, y^t, xt) and the second only on (xt, x_{t+1}^n, y_{t+1}^n).

The statement then follows from Lemma 1.

(b) (X^{t−1}, Y^{t−1}) − Xt − (X_{t+1}^n, Y_t^n):

Proof Note that if {(Xi, Yi)}_{i=1}^n is a Hidden Markov Process, then the time-reversed process {(Xn−i, Yn−i)}_{i=1}^n is also a Hidden Markov Process. Applying (a) to this reversed process proves (b).

(c) X^{t−1} − (Xt, Y^{t−1}) − (X_{t+1}^n, Y_t^n):

Proof Left as an exercise.

4 Causal Inference via Forward Recursion

We now derive the forward recursion algorithm as an efficient method to sequentially compute the causal posterior probabilities. Note that we have

p(xt|y^t) = p(xt, y^t) / Σ_{x̃t} p(x̃t, y^t)
          = p(xt, y^{t−1}) p(yt|xt, y^{t−1}) / Σ_{x̃t} p(x̃t, y^{t−1}) p(yt|x̃t, y^{t−1})
          = p(y^{t−1}) p(xt|y^{t−1}) p(yt|xt) / [p(y^{t−1}) Σ_{x̃t} p(x̃t|y^{t−1}) p(yt|x̃t)]   (by (b))
          = p(xt|y^{t−1}) p(yt|xt) / Σ_{x̃t} p(x̃t|y^{t−1}) p(yt|x̃t).

Define αt(xt) ≜ p(xt|y^t) and βt(xt) ≜ p(xt|y^{t−1}); then the above computation can be summarized as

αt(xt) = βt(xt) p(yt|xt) / Σ_{x̃t} βt(x̃t) p(yt|x̃t).    (3)

Similarly, we can write β as

βt+1(xt+1) = p(xt+1|y^t)
           = Σ_{xt} p(xt+1, xt|y^t)
           = Σ_{xt} p(xt|y^t) p(xt+1|y^t, xt)
           = Σ_{xt} p(xt|y^t) p(xt+1|xt)   (by (a)).

The above equation can be summarized as

βt+1(xt+1) = Σ_{xt} αt(xt) p(xt+1|xt).    (4)

Equations (3) and (4) indicate that αt and βt can be sequentially computed from each other for t = 1, ··· , n with the initialization β1(x1) = p(x1). Hence, with this simple algorithm, the causal inference can be done efficiently in terms of both computation and memory. This is called the forward recursion.

5 Non-causal Inference via Backward Recursion

Now suppose that we are interested in finding p(xt|y^n) for t = 1, 2, ··· , n in a non-causal manner. For this purpose, we can derive a backward recursion algorithm as follows.

p(xt|y^n) = Σ_{xt+1} p(xt, xt+1|y^n)
          = Σ_{xt+1} p(xt+1|y^n) p(xt|xt+1, y^t)   (by (c))
          = Σ_{xt+1} p(xt+1|y^n) · p(xt|y^t) p(xt+1|xt) / p(xt+1|y^t).   (by (a))

Now, let γt(xt) ≜ p(xt|y^n). Then, the above equation can be summarized as

γt(xt) = Σ_{xt+1} γt+1(xt+1) · αt(xt) p(xt+1|xt) / βt+1(xt+1).    (5)

Equation (5) indicates that γt can be recursively computed based on the αt's and βt's for t = n − 1, n − 2, ··· , 1 with the initialization γn(xn) = αn(xn). This is called the backward recursion.
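A minimal NumPy sketch of recursions (3)-(5) on a two-state toy chain; the transition matrix, emission matrix, and observation sequence below are assumptions chosen only for illustration.

import numpy as np

P_trans = np.array([[0.9, 0.1],    # p(x_{t+1} | x_t), rows indexed by x_t
                    [0.2, 0.8]])
P_emit = np.array([[0.8, 0.2],     # p(y_t | x_t), rows indexed by x_t
                   [0.3, 0.7]])
p_x1 = np.array([0.5, 0.5])        # initial state distribution p(x_1)
y = [0, 1, 1, 0, 1]                # observed sequence y^n

n, S = len(y), len(p_x1)
alpha = np.zeros((n, S))           # alpha_t(x) = p(x_t = x | y^t)
beta = np.zeros((n, S))            # beta_t(x)  = p(x_t = x | y^{t-1})

beta[0] = p_x1                                  # initialization beta_1 = p(x_1)
for t in range(n):
    a = beta[t] * P_emit[:, y[t]]               # eq. (3): measurement update
    alpha[t] = a / a.sum()
    if t + 1 < n:
        beta[t + 1] = alpha[t] @ P_trans        # eq. (4): time update

gamma = np.zeros((n, S))           # gamma_t(x) = p(x_t = x | y^n)
gamma[-1] = alpha[-1]
for t in range(n - 2, -1, -1):     # eq. (5): backward recursion
    gamma[t] = (P_trans * alpha[t][:, None] / beta[t + 1][None, :]) @ gamma[t + 1]

print(gamma)                       # smoothed posteriors p(x_t | y^n)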

EE378A Statistical Signal Processing Lecture 4 - 04/13/2017
Lecture 4: State Estimation in Hidden Markov Models (cont.)
Lecturer: Tsachy Weissman Scribe: David Wugofski

In this lecture we build on previous analysis of state estimation for Hidden Markov Models to create an algorithm for estimating the posterior distribution of the states given the observations.

1 Notation

A quick summary of the notation:

1. Random variables (objects): used more "loosely", i.e., X, Y, U, V

2. Alphabets: X, Y, U, V

3. Specific realizations of random variables: x, y, u, v

4. Vector notation: for a vector (x1, ··· , xn), we denote the vector of values from the i-th component to the j-th component (inclusive) as x_i^j, with the convention x^i ≜ x_1^i.

5. Markov chains: we use the standard Markov chain notation X − Y − Z to denote that X, Y, Z form a Markov triplet.

For a discrete random variable (object), we write p(x) for the pmf P(X = x), and similarly p(x, y) for pX,Y(x, y) and p(y|x) for pY|X(y|x), etc.

2 Recap of HMP

For a set of unknown states {Xn}n≥1 of a Markov process and noisy observations {Yn}n≥1 of those states taken through some memoryless channel, we can express the joint distribution of the states and observations:

p(x^n, y^n) = Π_{t=1}^n p(xt|xt−1) · Π_{t=1}^n p(yt|xt)    (1)

where x0 is taken to be some fixed arbitrary value (e.g., 0). Additionally we have the following principles for this setting:

(X^{t−1}, Y^t) − Xt − (X_{t+1}^n, Y_{t+1}^n)    (a)
(X^{t−1}, Y^{t−1}) − Xt − (X_{t+1}^n, Y_t^n)    (b)
X^{t−1} − (Xt, Y^{t−1}) − (X_{t+1}^n, Y_t^n)    (c)

In the last lecture, through the forward-backward recursion we obtained the following algorithms for computing various posteriors:

αt(xt) = p_{Xt|Y^t}(xt|y^t) = βt(xt) p_{Yt|Xt}(yt|xt) / Σ_{x̃t} βt(x̃t) p_{Yt|Xt}(yt|x̃t)    (Measurement Update)

βt+1(xt+1) = p_{Xt+1|Y^t}(xt+1|y^t) = Σ_{xt} αt(xt) p_{Xt+1|Xt}(xt+1|xt)    (Time Update)

γt(xt) = p_{Xt|Y^n}(xt|y^n) = Σ_{xt+1} γt+1(xt+1) αt(xt) p_{Xt+1|Xt}(xt+1|xt) / βt+1(xt+1)    (Backward Recursion)

3 Factor Graphs

Let us derive a graphical representation of the joint distribution expressed in (1) as follows:

1. Create a node on the graph for each random variable: X1, ..., Xn, Y1, ..., Yn.

2. For each pair of nodes, create an edge if there is some factor in (1) which contains both variables.

This form of graph is called a factor graph.

Example: Figure 1 shows the factor graph for the hidden Markov process in our setting.

Figure 1: The factor graph for our hidden Markov process

For any partitioning of the graph into three node sets S1, S2, and S3 such that any path from S1 to S3 passes through some node in S2, these sets form a Markov triplet S1 − S2 − S3 (to see this, simply factor the joint pmf). Hence, for any si ∈ Si, i = 1, 2, 3, we have

p(s1, s3|s2) = φ1(s1, s2) φ2(s2, s3)    (2)

for some functions φ1, φ2 which consist of the factors that were used to form the factor graph. Using the factor graph, it is quite easy to establish principles (a)-(c) via graphs. For example, principles (a) and (b) can be proved via Figures 2 and 3 shown below.

Figure 2: The factor graph partition for (a)

Figure 3: The factor graph partition for (b)

We can also construct an additional principle from the factor graph partitioning presented in Figure 4:

X^{t−1} − (Xt, Y^n) − X_{t+1}^n    (d)

Figure 4: The graphical proof of (d)

4 General Forward–Backward Recursion: Module Functions

Let us consider a generic discrete random variable U ∼ pU passed through a memoryless channel characterized by pV |U to get a discrete random variable V . By Bayes rule we know

p_{U|V}(u|v) = p_U(u) p_{V|U}(v|u) / Σ_ũ p_U(ũ) p_{V|U}(v|ũ)    (3)

If we collect the posterior probabilities p_{U|V}(·|v) into a vector indexed by u, we can define a vector function F s.t.

{p_{U|V}(u|v)}_u = { p_U(u) p_{V|U}(v|u) / Σ_ũ p_U(ũ) p_{V|U}(v|ũ) }_u ≜ F(p_U, p_{V|U}, v)    (4)

Additionally, overloading the notation of F, we can define the matrix function F as

F (pU , pV |U ) = {F (pU , pV |U , v)}v (5)

This function F acts as a module function for calculating the posterior distribution of a random variable given a prior distribution and a channel distribution. In addition to F, we can write a module function for calculating the marginal distribution of the output given a prior distribution on the input and a channel distribution. We label this vector function G and define it as follows:

p_V = {p_V(v)}_v = { Σ_ũ p_U(ũ) p_{V|U}(v|ũ) }_v ≜ G(p_U, p_{V|U})    (6)

With these module functions in mind, we can express our recursion functions as follows

αt = F (βt, pYt|Xt , yt) (Measurement Update)

βt+1 = G(αt, pXt+1|Xt ) (Time Update)

γt = G(γt+1,F (αt, pXt+1|Xt )) (Backward Recursion)
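A compact NumPy sketch of the two module functions; the orientation of the channel matrix (|U| rows, |V| columns) is an assumption of this sketch. With F and G in hand, each recursion really is a one-liner, as indicated in the comments.

import numpy as np

def F(p_U, p_V_given_U, v=None):
    # Posterior module (Bayes rule). p_U: length-|U| prior; p_V_given_U: |U| x |V| channel.
    # With v given, returns the posterior vector {p(u | V = v)}_u; with v omitted,
    # returns the |V| x |U| matrix whose row v is F(p_U, p_V_given_U, v).
    if v is not None:
        unnorm = p_U * p_V_given_U[:, v]
        return unnorm / unnorm.sum()
    joint = p_U[:, None] * p_V_given_U          # joint p(u, v)
    return (joint / joint.sum(axis=0)).T

def G(p_U, p_V_given_U):
    # Marginal module: {p(v)}_v = sum_u p(u) p(v | u).
    return p_U @ p_V_given_U

# With P_emit and P_trans as in the recap above, the recursions read:
#   alpha_t    = F(beta_t, P_emit, y_t)                 (Measurement Update)
#   beta_{t+1} = G(alpha_t, P_trans)                    (Time Update)
#   gamma_t    = G(gamma_{t+1}, F(alpha_t, P_trans))    (Backward Recursion)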

5 State Estimation: the Viterbi Algorithm

Now we are ready to derive the joint posterior distribution p(xn|yn):

p(x^n|y^n) = p(xn|y^n) Π_{t=1}^{n−1} p(xt|x_{t+1}^n, y^n)
           = p(xn|y^n) Π_{t=1}^{n−1} p(xt|xt+1, y^n)   (by principle (d))
           = p(xn|y^n) Π_{t=1}^{n−1} p(xt|xt+1, y^t)   (by principle (c))
           = p(xn|y^n) Π_{t=1}^{n−1} p(xt|y^t) p(xt+1|xt, y^t) / p(xt+1|y^t)
           = γn(xn) Π_{t=1}^{n−1} αt(xt) p(xt+1|xt) / βt+1(xt+1)   (by principle (a))

Write

ln p(x^n|y^n) = Σ_{t=1}^n gt(xt, xt+1)

with

gt(xt, xt+1) ≜ ln [αt(xt) p(xt+1|xt) / βt+1(xt+1)] for t = 1, ··· , n − 1, and gn(xn, xn+1) ≜ ln γn(xn) (where xn+1 is a dummy argument);

we can solve for the MAP estimator x̂^MAP with the help of the following definition:

Definition 1. For 1 ≤ k ≤ n, let

Mk(xk) := max_{x_{k+1}^n} Σ_{t=k}^n gt(xt, xt+1).

It is straightforward from the definition that M1(x̂1^MAP) = ln p(x̂^MAP|y^n) = max_{x^n} ln p(x^n|y^n), and

Mk(xk) = max_{xk+1} max_{x_{k+2}^n} (gk(xk, xk+1) + Σ_{t=k+1}^n gt(xt, xt+1))
       = max_{xk+1} (gk(xk, xk+1) + max_{x_{k+2}^n} Σ_{t=k+1}^n gt(xt, xt+1))

       = max_{xk+1} (gk(xk, xk+1) + Mk+1(xk+1)).

Since Mn(xn) = gn(xn, xn+1) = ln γn(xn) only depends on the single term xn, we may start from n and use the previous recursive formula to obtain M1(x1). This is called the Viterbi Algorithm. The Bellman equations, referred to in the communication setting as the Viterbi Algorithm, are an application of a technique called dynamic programming. Using the recursive equations derived in the previous section, we can express the Viterbi Algorithm as follows.

1: function Viterbi
2:   Mn(xn) ← ln γn(xn)                                        ▷ initialization of log-likelihood
3:   x̂^MAP(xn) ← ∅                                             ▷ initialization of the MAP estimator
4:   for k = n − 1, ··· , 1 do
5:     Mk(xk) ← max_{xk+1} (gk(xk, xk+1) + Mk+1(xk+1))         ▷ maximum of log-likelihood
6:     x̂k+1(xk) ← arg max_{xk+1} (gk(xk, xk+1) + Mk+1(xk+1))
7:     x̂^MAP(xk) ← [x̂k+1(xk), x̂^MAP(x̂k+1(xk))]               ▷ maximizing sequence with leading term xk
8:   end for
9:   M ← max_{x1} M1(x1)                                       ▷ maximum of overall log-likelihood
10:  x̂1 ← arg max_{x1} M1(x1)
11:  x̂^MAP ← [x̂1, x̂^MAP(x̂1)]                                 ▷ overall maximizing sequence
12: end function
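For reference, here is a runnable Python sketch of the Viterbi algorithm. It uses the standard forward max-product form on the joint p(x^n, y^n), which returns the same MAP sequence as the backward formulation above since p(x^n|y^n) is proportional to p(x^n, y^n); the model matrices are toy assumptions.

import numpy as np

def viterbi(p_x1, P_trans, P_emit, y):
    # MAP state sequence argmax_{x^n} p(x^n | y^n); log domain for numerical stability.
    n, S = len(y), len(p_x1)
    log_m = np.log(p_x1) + np.log(P_emit[:, y[0]])       # log p(x_1, y_1)
    back = np.zeros((n, S), dtype=int)
    for t in range(1, n):
        # scores[i, j]: best log-prob of a path ending in state j at t via state i at t-1
        scores = log_m[:, None] + np.log(P_trans) + np.log(P_emit[:, y[t]])[None, :]
        back[t] = scores.argmax(axis=0)
        log_m = scores.max(axis=0)
    x_hat = [int(log_m.argmax())]                        # trace back the maximizer
    for t in range(n - 1, 0, -1):
        x_hat.append(int(back[t][x_hat[-1]]))
    return x_hat[::-1]

P_trans = np.array([[0.9, 0.1], [0.2, 0.8]])
P_emit = np.array([[0.8, 0.2], [0.3, 0.7]])
print(viterbi(np.array([0.5, 0.5]), P_trans, P_emit, [0, 1, 1, 0, 1]))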

EE378A Statistical Signal Processing Lecture 5 - 04/18/2017
Lecture 5: Particle Filtering and Introduction to Bayesian Inference
Lecturer: Tsachy Weissman Scribe: Charles Hale

In this lecture, we introduce new methods for inference in Hidden Markov Processes beyond discrete alphabets. This requires us first to learn about estimating pdfs based on samples from a different distribution. At the end, we briefly introduce Bayesian inference.

1 Recap: Inference on Hidden Markov Processes (HMPs)

1.1 Setting

A quick recap on the setting of an HMP:

1. {Xn}n≥1 is a Markov process, called “the state process”.

2. {Yn}n≥1 is “the observation process” where Yi is the output of Xi sent through a “memoryless” channel characterized by the distribution PY |X . Alternatively, we showed in HW1 that the above is totally equivalent to the following:

1. {Xn}n≥1 is “the state process” defined by

Xt = ft(Xt−1,Wt)

where {Wn}n≥1 is an independent process, i.e., each Wt is independent of all the others.

2. {Yn}n≥1 is “the observation process” related to {Xn}n≥1 by

Yt = gt(Xt,Nt)

where {Nn}n≥1 is an independent process.

3. The processes {Nn}n≥1 and {Wn}n≥1 are independent of each other.

1.2 Goal

Since the state process {Xn}n≥1 is "hidden" from us, we only get to observe {Yn}n≥1 and wish to find a way of estimating {Xn}n≥1 based on {Yn}n≥1. To do this, we define the forward recursions for causal estimation:

αt(xt) = F (βt,PYt|Xt , yt) (1)

βt+1(xt+1) = G(αt,PXt+1|Xt ) (2)

where F and G are the operators defined in lecture 4.

1.3 Challenges

For general alphabets X, Y, computing the forward recursion is a very difficult problem. However, we know efficient algorithms that can be applied in the following situations:

1. All random variables are discrete with a finite alphabet. We can then solve these equations by brute force in time proportional to the alphabet sizes.

2. For t ≥ 1, ft and gt are linear and Nt, Wt, Xt are Gaussian. The solution is the "Kalman filter" algorithm that we derive in Homework 2.

Beyond these special situations, there are many heuristic algorithms for this computation:

1. We can compute the forward recursions approximately by quantizing the alphabets. This leads to a tradeoff between model accuracy and computation cost.

2. This lecture will focus on Particle Filtering, which is a way of estimating the forward recursions with an adaptive quantization.

2 Importance Sampling

Before we can learn about Particle Filtering, we first need to discuss importance sampling1.

2.1 Simple Setting

Let X1, ··· , XN be i.i.d. random variables with density f. Now we wish to approximate f by a probability mass function f̂N using our data. The idea: we can approximate f by taking

f̂N = (1/N) Σ_{i=1}^N δ(x − Xi)    (3)

where δ(x) is a Dirac-delta distribution.

Claim: f̂N is a good estimate of f.

Here we define the "goodness" of our estimate by looking at

E_{X∼f̂N}[g(X)]    (4)

and seeing how close it is to

E_{X∼f}[g(X)]    (5)

for any function g with E|g(X)| < ∞. We show that by this definition, (3) is a good estimate for f. Proof:

E_{X∼f̂N}[g(X)] = ∫ f̂N(x) g(x) dx    (6)
              = ∫ (1/N) Σ_{i=1}^N δ(x − Xi) g(x) dx    (7)
              = (1/N) Σ_{i=1}^N ∫ δ(x − Xi) g(x) dx    (8)
              = (1/N) Σ_{i=1}^N g(Xi)    (9)

1See section 2.12.4 of the Spring 2013 lecture notes for further reference.

By the law of large numbers,

(1/N) Σ_{i=1}^N g(Xi) → E_{X∼f}[g(X)]   almost surely    (10)

as N → ∞. Hence,

E_{X∼f̂N}[g(X)] → E_{X∼f}[g(X)]    (11)

so f̂N is indeed a "good" estimate of f.

2.2 General Setting

Suppose now that X1, ··· , XN are i.i.d. with common density q, but we wish to estimate a different pdf f.

Theorem: The following estimator

f̂N(x) = Σ_{i=1}^N wi δ(x − Xi)    (12)

with weights

wi = (f(Xi)/q(Xi)) / Σ_{j=1}^N (f(Xj)/q(Xj))    (13)

is a good estimator of f.

Proof: It's easy to see

E_{X∼f̂N}[g(X)] = Σ_{i=1}^N wi g(Xi)    (14)
              = Σ_{i=1}^N [(f(Xi)/q(Xi)) / Σ_{j=1}^N (f(Xj)/q(Xj))] g(Xi)    (15)
              = [Σ_{i=1}^N (f(Xi)/q(Xi)) g(Xi)] / [Σ_{j=1}^N (f(Xj)/q(Xj))]    (16)
              = [(1/N) Σ_{i=1}^N (f(Xi)/q(Xi)) g(Xi)] / [(1/N) Σ_{j=1}^N (f(Xj)/q(Xj))].    (17)

As N → ∞, the above expression converges to

E[(f(X)/q(X)) g(X)] / E[f(X)/q(X)] = [∫ (f(x)/q(x)) g(x) q(x) dx] / [∫ (f(x)/q(x)) q(x) dx]    (18)
                                   = [∫ f(x) g(x) dx] / [∫ f(x) dx]    (19)
                                   = ∫ f(x) g(x) dx    (20)
                                   = E_{X∼f}[g(X)].    (21)

Hence,

E_{X∼f̂N}[g(X)] → E_{X∼f}[g(X)]    (22)

as N → ∞.

2.3 Implications

The above theorem is useful in situations when we cannot draw samples Xi ∼ f and do not know f exactly, but we know that f(x) = c·h(x) for some constant c (in many practical cases it is computationally prohibitive to obtain the normalization constant c = (∫ h(x)dx)^{−1}). Then

wi = (f(Xi)/q(Xi)) / Σ_{j=1}^N (f(Xj)/q(Xj))    (23)
   = (c·h(Xi)/q(Xi)) / Σ_{j=1}^N (c·h(Xj)/q(Xj))    (24)
   = (h(Xi)/q(Xi)) / Σ_{j=1}^N (h(Xj)/q(Xj))    (25)

Using these wi's, we see that we can approximate f using samples drawn from a different distribution and the unnormalized function h.
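A small numerical sketch of this self-normalized scheme. The target f here is a standard normal truncated to x > 0, known to the sampler only through the unnormalized h, and the proposal q is Exponential(1); all of these choices are assumptions for illustration. The true value of E_f[X] is √(2/π) ≈ 0.798.

import numpy as np

rng = np.random.default_rng(0)

h = lambda x: np.exp(-x**2 / 2)          # unnormalized target: f(x) = c * h(x) on x > 0

N = 200_000
X = rng.exponential(scale=1.0, size=N)   # samples from the proposal q
q = lambda x: np.exp(-x)                 # Exponential(1) density

w = h(X) / q(X)                          # self-normalized weights: the constant c cancels
w /= w.sum()

g = lambda x: x                          # estimate E_f[X]
print(np.sum(w * g(X)))                  # approx 0.798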

2.4 Application

Let X ∼ fX and Y be the observation of X through a memoryless channel characterized by fY |X . For MAP reconstruction of X, we must compute

f_{X|Y}(x|y) = fX(x) f_{Y|X}(y|x) / ∫ fX(x̃) f_{Y|X}(y|x̃) dx̃    (26)

In general, this computation may be hard due to the integral in the denominator. But the denominator is just a constant. Hence, suppose we have Xi i.i.d. ∼ fX for 1 ≤ i ≤ N. By the previous theorem, we can take

f̂_{X|y} = Σ_{i=1}^N wi(y) δ(x − Xi)    (27)

as an approximation, where

wi(y) = (f_{X|y}(Xi)/fX(Xi)) / Σ_{j=1}^N (f_{X|y}(Xj)/fX(Xj))    (28)
      = (fX(Xi) f_{Y|X}(y|Xi)/fX(Xi)) / Σ_{j=1}^N (fX(Xj) f_{Y|X}(y|Xj)/fX(Xj))    (29)
      = f_{Y|X}(y|Xi) / Σ_{j=1}^N f_{Y|X}(y|Xj)    (30)

Hence, using the samples X1,X2, ··· ,XN we can build a good approximation of fX|Y (·|y) for fixed y using the above procedure.

Let us denote this approximation as

F̂N(X1, X2, ..., XN, fX, f_{Y|X}, y) = Σ_{i=1}^N wi δ(x − Xi)    (31)

3 Particle Filtering

Particle Filtering is a way to approximate the forward recursions of an HMP, using samples ("particles") to approximate the generated distributions rather than computing the true solutions.

Algorithm 1:
  particles ← generate {X1^(i)}_{i=1}^N i.i.d. ∼ f_{X1};
  for t ≥ 1 do
    α̂t ← F̂N({Xt^(i)}_{i=1}^N, β̂t, P_{Yt|Xt}, yt);   // estimate αt by importance sampling as in (31)
    nextParticles ← generate {X̃t^(i)}_{i=1}^N i.i.d. ∼ α̂t;
    for 1 ≤ i ≤ N do
      Xt+1^(i) ← generate a sample from f_{Xt+1|Xt}(·|X̃t^(i));
    end
    β̂t+1 ← (1/N) Σ_{i=1}^N δ(x − Xt+1^(i));
  end

This algorithm requires some art to use. It is not well understood how errors propagate through these successive approximations. In practice, the particles need to be periodically reset, and choosing these hyperparameters is an application-specific task.
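Below is a minimal bootstrap-style sketch of Algorithm 1 in Python, on an assumed linear-Gaussian toy model (Xt = 0.8 Xt−1 + Wt, Yt = Xt + Nt); the coefficients and noise levels are made up, and on this model the Kalman filter would be exact, which makes it a convenient test case.

import numpy as np

rng = np.random.default_rng(0)

n, N = 50, 1000                         # horizon and number of particles
x_true = np.zeros(n); y = np.zeros(n)
x_prev = 0.0
for t in range(n):                      # simulate the assumed state-space model
    x_true[t] = 0.8 * x_prev + rng.normal()
    y[t] = x_true[t] + 0.5 * rng.normal()
    x_prev = x_true[t]

def lik(y_t, x):                        # f_{Y|X}(y_t | x): Gaussian observation noise
    return np.exp(-0.5 * ((y_t - x) / 0.5) ** 2)

particles = rng.normal(size=N)          # i.i.d. draws from f_{X_1}, approximating beta_1
est = np.zeros(n)
for t in range(n):
    w = lik(y[t], particles)            # importance weights: measurement update
    w /= w.sum()
    idx = rng.choice(N, size=N, p=w)    # resample from the weighted estimate of alpha_t
    particles = particles[idx]
    est[t] = particles.mean()           # estimate of E[X_t | y^t]
    if t + 1 < n:                       # time update: propagate to approximate beta_{t+1}
        particles = 0.8 * particles + rng.normal(size=N)

print(np.mean((est - x_true) ** 2))     # filtering MSE on this run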

4 Inference under logarithmic loss

Recall Bayesian decision theory and let us have the following:

• X ∈ X is something we want to infer;

• X ∼ pX (assume that X is discrete);

• x̂ ∈ X̂ is something we wish to reconstruct;

• Λ: X × X̂ → R is our loss function.

First we assume that we do not have any observation, and define the Bayes envelope as

U(pX) = min_x̂ E Λ(X, x̂) = min_x̂ Σ_x pX(x) Λ(x, x̂)    (32)

and the Bayes response as

X̂_Bayes(pX) = arg min_x̂ E Λ(X, x̂).    (33)

Example: Let X = X̂ = R, Λ(x, x̂) = (x − x̂)². Then

U(pX) = min_x̂ E_{X∼pX}(X − x̂)² = Var(X)    (34)
X̂_Bayes(pX) = E[X].    (35)

Consider now the case where we have some observation (side information) Y, and our reconstruction X̂ can be a function of Y. Then

E Λ(X, X̂(Y)) = E[E[Λ(X, X̂(y))|Y = y]]    (36)
             = Σ_y E[Λ(X, X̂(y))|Y = y] pY(y)    (37)
             = Σ_y pY(y) Σ_x p_{X|Y=y}(x) Λ(x, X̂(y))    (38)

This implies

min_{X̂(·)} E Λ(X, X̂(Y))    (39)
  = min_{X̂(·)} Σ_y pY(y) Σ_x p_{X|Y=y}(x) Λ(x, X̂(y))    (40)
  = Σ_y pY(y) U(p_{X|Y=y})    (41)

and thus

X̂_opt(y) = X̂_Bayes(p_{X|Y=y}).    (42)

EE378A Statistical Signal Processing Lecture 6 - 04/20/2017
Lecture 6: Bayesian Decision Theory and Justification of Log Loss
Lecturer: Tsachy Weissman Scribe: Okan Atalar

In this lecture1, we introduce the topic of Bayesian Decision Theory and define the concept of loss function for assessing the accuracy of the estimation. The log loss function is presented as an important special case.

1 Notation

A quick summary of the notation:

1. Random variables (objects): used more "loosely", i.e., X, Y

2. PMF (PDF): used more "loosely", i.e., PX, PY

3. Alphabets: X, X̂

4. Estimate of X: x̂

5. Specific values: x, y

6. Loss function: Λ

7. Markov triplet: the X − Y − Z notation corresponds to a Markov triplet

8. Logarithm: log denotes log2, unless otherwise stated

For a discrete random variable (object), U has p.m.f. PU(u) ≜ P(U = u). Often, we'll just write p(u). Similarly: p(x, y) for PX,Y(x, y) and p(y|x) for PY|X(y|x), etc.

2 Bayesian Decision Theory

Let X ∈ X, X ∼ PX, x̂ ∈ X̂, Λ: X × X̂ → R.

Definition 1 (Bayes Envelope).

U(PX) ≜ min_x̂ E[Λ(X, x̂)] = min_x̂ Σ_x PX(x) Λ(x, x̂)    (1)

Definition 2 (Bayes Response).

X̂_Bayes(PX) ≜ arg min_x̂ E[Λ(X, x̂)]    (2)

Essentially, given a prior distribution on X, we are trying to achieve the best estimate by minimizing the expected value of some "loss function" that we have defined. If in addition to the prior distribution of X we are also given the observation Y, the associated Bayes envelope becomes:

min_{X̂(·)} E[Λ(X, X̂(Y))] = Σ_y U(P_{X|Y=y}) PY(y) = E[U(P_{X|Y})]    (3)

In this case, the Bayes response takes the form:

X̂(y) = X̂_Bayes(P_{X|Y=y})    (4)

1Reading: Lecture 3 of the EE378A Spring 2016 notes

3 Bayesian Inference Under Logarithmic Loss

In this section, we will be using the logarithm function as the loss function.

Definition 3 (Logarithmic Loss Function).

Λ(x, Q) ≜ log (1/Q(x))    (5)

where x is the realization of X and Q is the estimate of the pmf of X. Using Definition 3, the expected value of the log loss can be expressed as:

E[Λ(X, Q)] = Σ_x PX(x) Λ(x, Q) = Σ_x PX(x) log (1/Q(x)) = Σ_x PX(x) log (PX(x)/(Q(x) PX(x)))    (6)
           = Σ_x PX(x) log (1/PX(x)) + Σ_x PX(x) log (PX(x)/Q(x))    (7)

Definition 4 (Entropy).

H(X) ≜ Σ_x PX(x) log (1/PX(x))    (8)

Definition 5 (Relative Entropy, or the Kullback–Leibler Divergence).

D(PX‖Q) ≜ Σ_x PX(x) log (PX(x)/Q(x))    (9)

Using Equations (6, 7) and Definitions (4, 5):

E[Λ(X, Q)] = H(X) + D(PX‖Q)    (10)

Claim: D(PX‖Q) ≥ 0, and D(PX‖Q) = 0 ⇐⇒ Q = PX.

Before attempting the proof, we provide Jensen’s inequality:

Jensen's Inequality: Let ϕ denote a concave function and X be any random variable. Then

E[ϕ(X)] ≤ ϕ(E[X])    (11)

Further, if ϕ is strictly concave: E[ϕ(X)] = ϕ(E[X]) ⇐⇒ X is a constant. In particular, ϕ(x) = log(x) is a concave function, thus for a random variable X > 0:

E[log X] ≤ log E[X].    (12)

Proof of the Claim:

D(PX‖Q) = E[log (PX(x)/Q(x))] = −E[log (Q(x)/PX(x))] ≥ −log E[Q(X)/PX(X)] = −log Σ_x PX(x) (Q(x)/PX(x)) = −log Σ_x Q(x) = 0    (13)

The last equality follows from Σ_x Q(x) = 1, since Q is a pmf and thus sums to one.

Hence, it follows from the claim that

min_Q E[Λ(X, Q)] = H(X) = U(PX)    (14)

where the minimizer is Q = PX. Therefore:

XˆBayes(PX ) = PX . (15)

Introducing the notation H(X) ⇐⇒ H(PX ) and making use of equation (3):

ˆ X min E[Λ(X, X(Y ))] = E[U(PX|Y )] = H(PX|Y )PY (y) (16) ˆ X(·) y

Definition 6 (Conditional Entropy).

H(X|Y) ≜ Σ_y H(P_{X|Y}) PY(y)    (17)

Definition 7 (Mutual Information).

I(X; Y) ≜ H(X) − H(X|Y)    (18)

Definition 8 (Conditional Relative Entropy).

D(P_{X|Y}‖Q_{X|Y}|PY) ≜ Σ_y D(P_{X|Y}‖Q_{X|Y}) PY(y)    (19)

Definition 9 (Conditional Mutual Information).

I(X; Y|Z) ≜ H(X|Z) − H(X|Y, Z)    (20)

Exercises:

(a) H(X1, X2) = H(X1) + H(X2|X1)

(b) I(X; Y) = H(Y) − H(Y|X)

(c) I(X1, X2; Y) = I(X1; Y) + I(X2; Y|X1)

(d) D(P_{X1,X2}‖Q_{X1,X2}) = D(P_{X1}‖Q_{X1}) + D(P_{X2|X1}‖Q_{X2|X1}|P_{X1})
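As a quick numerical sanity check of exercises (a) and (b), the sketch below evaluates the relevant entropies directly from a randomly generated joint pmf (the alphabet sizes are arbitrary).

import numpy as np

rng = np.random.default_rng(0)
P = rng.random((3, 4)); P /= P.sum()    # random joint pmf p(x, y)
Px, Py = P.sum(axis=1), P.sum(axis=0)   # marginals

H = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))   # entropy in bits

H_Y_given_X = sum(Px[i] * H(P[i] / Px[i]) for i in range(3))
H_X_given_Y = sum(Py[j] * H(P[:, j] / Py[j]) for j in range(4))

print(H(P.ravel()), H(Px) + H_Y_given_X)          # exercise (a): the two agree
print(H(Px) - H_X_given_Y, H(Py) - H_Y_given_X)   # exercise (b): both equal I(X;Y)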

4 Justification of Log Loss as the Right Loss Function

For a general loss function Λ:

Definition 10 (Coherent Dependence Function).

C(Λ, P_{X,Y}) ≜ inf_x̂ E[Λ(X, x̂)] − inf_{X̂(·)} E[Λ(X, X̂(Y))]    (21)

The first term corresponds to the minimum expected error (depending on the choice of loss function) in estimating X based on only the prior distribution of X. The second term corresponds to the minimum expected error in estimating X by also taking the observation Y into account. Therefore, the coherent dependence function gives a measure on how much the estimate improves given observation Y , compared to estimating X based on only the prior distribution of X.

Two examples are provided below to make this concept more clear:

(1) Λ(x, x̂) = (x − x̂)² ⇒ C(Λ, P_{X,Y}) = Var(X) − E[Var(X|Y)]

(2) Λ(x, Q) = log (1/Q(x)) ⇒ C(Λ, P_{X,Y}) = I(X; Y)

Proving the expressions below is left as an exercise:

(a) C(Λ, P_{X,Y}) ≥ 0, and C(Λ, P_{X,Y}) = 0 ⇐⇒ X and Y are independent

(b) C(Λ, P_{X,Y}) ≥ C(Λ, P_{X,Z}) when X − Y − Z

Natural Requirement: If T = T(X) is a function of X, then C(Λ, P_{T,Y}) ≤ C(Λ, P_{X,Y}).

This requirement basically states that a function of X cannot carry more information about Y than X itself.

Definition 11 (Sufficient Statistic). T = T(X) is a sufficient statistic of X for Y if X − T − Y.

Modest Requirement: If T is a sufficient statistic of X for Y, then C(Λ, P_{T,Y}) ≤ C(Λ, P_{X,Y}).

Theorem: Assume |X| ≥ 3. The only C(Λ, P_{X,Y}) satisfying the modest requirement is C(Λ, P_{X,Y}) = I(X; Y) (up to a multiplicative factor).

Since C(Λ, P_{X,Y}) = I(X; Y) if (and only if) the logarithmic loss function is used (Λ(x, Q) = log (1/Q(x))), it follows that the logarithmic loss is the only loss function which satisfies the modest requirement.

EE378A Statistical Signal Processing Lecture 7 - 04/25/2017
Lecture 7: Prediction
Lecturer: Tsachy Weissman Scribe: Cem Kalkanli

In this lecture we are going to study prediction in the Bayesian setting, under general loss and under log loss. We will also briefly begin the subject of prediction of individual sequences.

1 Notation

A quick summary of the notation:

1. Random variables (objects): used more "loosely", i.e., X, Y, U, V;

2. Alphabets: X, X̂;

3. Sequence of random variables: X1, X2, ··· , Xt;

4. Loss function: Λ(x, x̂);

5. Estimate of X: x̂;

6. Set of distributions on alphabet X: M(X).

For a discrete random variable (object), U has p.m.f. PU(u) ≜ P(U = u). Often, we'll just write p(u). Similarly: p(x, y) for PX,Y(x, y) and p(y|x) for PY|X(y|x), etc.

2 Prediction

Consider X1,X2, ··· ,Xn ∈ X .

Definition 1 (Predictor). A predictor F is a sequence of functions F = (Ft)t≥1 with Ft: X^{t−1} → X̂.

Definition 2 (Prediction Loss). Given a loss function Λ: X × X̂ → R+, the prediction loss incurred by the predictor F is defined as

LF(X^n) = (1/n) Σ_{t=1}^n Λ(Xt, Ft(X^{t−1})).    (1)

2.1 Prediction under Bayesian Setting

In the Bayesian setting, we assume X1, X2, ··· are governed jointly by some underlying law P. Then we have the following prediction loss.

Definition 3 (Prediction Loss Under Bayesian Setting).

min_F E_P[LF(X^n)] = (1/n) Σ_{t=1}^n min_{Ft} E_P[Λ(Xt, Ft(X^{t−1}))] = (1/n) Σ_{t=1}^n E_P[U(P_{Xt|X^{t−1}})]    (2)

where U(·) is the Bayes envelope we have defined in the previous lecture.

As we can see from (2), our initial objective was to minimize the expected prediction loss with respect to the whole sequence of functions. However, since we can interchange summation and expectation, we can treat the overall predictor F as the composition of the sequentially optimal predictors Ft. Finally, we note the overall loss is just the sum of expectations of Bayes envelopes.

Theorem 4 (Mismatch Loss). The additional prediction loss incurred by doing prediction based on law Q instead of P is upper bounded by some form of relative entropy:

E_P[L_{F^Q}(X^n) − L_{F^P}(X^n)] ≤ Λmax √((2/n) D(P_{X^n}‖Q_{X^n})),    (3)

where Λmax = sup_{x,x̂} Λ(x, x̂). Now we are going to introduce Lemma 5, which will be used to prove Theorem 4.

Lemma 5.

E_P[Λ(X, X̂_Bayes(Q)) − Λ(X, X̂_Bayes(P))] ≤ Λmax ‖P − Q‖1 ≤ Λmax √(2 D(P‖Q))    (4)

Proof

E_P[Λ(X, X̂_Bayes(Q)) − Λ(X, X̂_Bayes(P))]
  ≤ E_P[Λ(X, X̂_Bayes(Q)) − Λ(X, X̂_Bayes(P))] + E_Q[−Λ(X, X̂_Bayes(Q)) + Λ(X, X̂_Bayes(P))]    (5)
  = Σ_x (P(x) − Q(x)) (Λ(x, X̂_Bayes(Q)) − Λ(x, X̂_Bayes(P)))    (6)
  ≤ Σ_x |P(x) − Q(x)| Λmax    (7)

  = Λmax ‖P − Q‖1    (8)
  ≤ Λmax √(2 D(P‖Q))    (9)

where in the last step we have used Pinsker's inequality D(P‖Q) ≥ (1/2) ‖P − Q‖1².

Now we prove Theorem 4 as follows:

E_P[L_{F^Q}(X^n) − L_{F^P}(X^n)]    (10)
  = E_P[(1/n) Σ_{t=1}^n (Λ(Xt, Ft^Q(X^{t−1})) − Λ(Xt, Ft^P(X^{t−1})))]    (11)
  = (1/n) Σ_{t=1}^n E_P[E_P[Λ(Xt, Ft^Q(X^{t−1})) − Λ(Xt, Ft^P(X^{t−1})) | X^{t−1}]]   (iterated expectation)    (12)
  ≤ (1/n) Σ_{t=1}^n E_P[Λmax √(2 D(P_{Xt|X^{t−1}}‖Q_{Xt|X^{t−1}}))]   (using Lemma 5)    (13)
  ≤ Λmax √((2/n) Σ_{t=1}^n E_P[D(P_{Xt|X^{t−1}}‖Q_{Xt|X^{t−1}})])   (Jensen's inequality)    (14)
  = Λmax √((2/n) D(P_{X^n}‖Q_{X^n}))   (chain rule for relative entropy)    (15)

as desired.

3 Prediction Under Logarithmic Loss

In this section, we specialize our setup to the logarithmic loss while changing our predictor set into a set of distributions. So here we have X̂ = M(X) and Λ(x, P) = −log P(x) as the logarithmic loss. Also note that Ft(X^{t−1}) ∈ M(X). So we can represent our predictor with 3 different setups as follows:

1. Obviously we can carry over the setup we worked with previously and see our predictors as a set of functions {Ft}t≥1.

2. Another approach is to see each predictor as a sequence of conditional distributions {Q_{Xt|X^{t−1}}}t≥1, where Ft(X^{t−1}) = Q_{Xt|X^{t−1}}.

3. Finally, we can see our predictor as a model for the data. Then Q would be the law we assign to the sequence we observe.

Then we can immediately adapt the previous results to the log loss case. The first one is LF(X^n):

LF(X^n) = (1/n) Σ_{t=1}^n log (1 / Ft(X^{t−1})[Xt])    (16)
        = (1/n) Σ_{t=1}^n log (1 / Q_{Xt|X^{t−1}}(Xt))    (17)
        = (1/n) log (1 / Q(X^n))   (where now Q is our model for the data)    (18)

Now let's look at the logarithmic loss under the Bayesian setting, assuming the sequence X1, X2, ··· is governed by P:

E_P[LF(X^n)] = E_P[(1/n) log (1/Q(X^n))]    (19)
             = E_P[(1/n) log (P(X^n)/(Q(X^n) P(X^n)))]    (20)
             = (1/n) [H(X^n) + D(P_{X^n}‖Q_{X^n})]    (21)

Then we can see that min_F E_P[LF(X^n)] = (1/n) H(X^n), and it is achieved when Ft(X^{t−1}) = P_{Xt|X^{t−1}}.

4 Prediction of Individual Sequences

As for the last part, we look at a different set of predictors. In this setup we are given a class of predictors F, and we seek a predictor F such that:

lim_{n→∞} max_{X^n} [LF(X^n) − min_{G∈F} LG(X^n)] = 0    (22)

4.1 Prediction of Individual Sequences under Log Loss

Now the question arises: given a class of laws F, can we find P such that (23) is small for any X^n?

(1/n) log (1/P(X^n)) − min_{Q∈F} (1/n) log (1/Q(X^n)) = (1/n) log (1/P(X^n)) − (1/n) log (1/max_{Q∈F} Q(X^n)).    (23)

This problem will be solved in the upcoming lecture.

EE378A Statistical Signal Processing Lecture 8 - 04/27/2017
Lecture 8: Prediction of Individual Sequences
Lecturer: Tsachy Weissman Scribe: Chuan-Zheng Lee

In the last lecture we studied the prediction problem in the Bayesian setting, and in this lecture we are going to study the prediction problem under logarithmic loss in the individual sequence setting.

1 Problem setup

We first recall the problem setup: given a class of probability laws F, we wish to find some law P with a small worst-case regret

WCRn(P, F) ≜ max_{x^n} [log (1/P(x^n)) − min_{Q∈F} log (1/Q(x^n))].    (1)

Then a first natural question is as follows: could we find some P such that

WCRn(P, F) ≤ n εn    (2)

with εn → 0 as n → ∞?

2 First answer when |F| < ∞: uniform mixture

The answer to the previous question turns out to be affirmative when |F| < ∞. Our construction of P is as follows:

P_Unif(x^n) = (1/|F|) Σ_{Q∈F} Q(x^n).    (3)

The performance of PUnif is summarized in the following lemma:

Lemma 1. The worst-case regret of PUnif satisfies:

WCRn(PUnif, F) ≤ log |F|. (4)

Proof Just apply Q(x^n) ≥ 0 for any Q ∈ F and x^n to conclude that

log (1/P_Unif(x^n)) − min_{Q∈F} log (1/Q(x^n)) = log (max_{Q∈F} Q(x^n) / P_Unif(x^n))    (5)
                                               = log |F| + log (max_{Q∈F} Q(x^n) / Σ_{Q̃∈F} Q̃(x^n))    (6)
                                               ≤ log |F|.    (7)

By Lemma 1, we know that we can choose εn = log|F| / n → 0 as n → ∞, which provides an affirmative answer to our question.
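Lemma 1 can be checked by brute force for a small example: the sketch below takes an assumed finite class of i.i.d. Bernoulli laws, enumerates all sequences of length n = 10, and compares the worst-case regret of P_Unif against log |F|.

import numpy as np
from itertools import product

thetas = [0.2, 0.5, 0.8]      # an assumed class F of i.i.d. Bernoulli laws
n = 10

def log_prob(theta, x):       # log Q_theta(x^n) for an i.i.d. Bernoulli law
    ones = sum(x)
    return ones * np.log(theta) + (n - ones) * np.log(1 - theta)

wcr = -np.inf
for x in product([0, 1], repeat=n):
    logs = np.array([log_prob(th, x) for th in thetas])
    log_p_unif = np.log(np.mean(np.exp(logs)))          # log P_Unif(x^n)
    wcr = max(wcr, logs.max() - log_p_unif)             # regret against the best Q in F

print(wcr, np.log(len(thetas)))   # WCR_n(P_Unif, F) <= log |F|, as Lemma 1 asserts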

3 Second answer: normalized maximum likelihood

In the previous section we saw that εn → 0 is possible when |F| < ∞, but we are still looking for some law P which gives the minimum εn, or equivalently, the minimum WCRn(P, F). Complicated as the definition of WCRn(P, F) may look, surprisingly its minimum admits a closed-form expression:

Theorem 2 (Normalized Maximum Likelihood). The minimum worst-case regret is given by

min_P WCRn(P, F) = log Σ_{x^n} max_{Q∈F} Q(x^n).    (8)

This bound is attained by the normalized maximum likelihood predictor expressed as

P_NML(x^n) = max_{Q∈F} Q(x^n) / Σ_{x̃^n} max_{Q∈F} Q(x̃^n).    (9)

Proof Since P_NML(x^n) / max_{Q∈F} Q(x^n) is a constant, it is clear that P_NML attains the desired equality. To prove the optimality, suppose P′ is any other predictor. Since both P_NML and P′ are probability distributions, there must exist some x^n such that

P′(x^n) ≤ P_NML(x^n).    (10)

Considering this sequence xn gives

WCRn(P′, F) ≥ log (max_{Q∈F} Q(x^n) / P′(x^n))    (11)
            ≥ log (max_{Q∈F} Q(x^n) / P_NML(x^n))    (12)
            = log Σ_{x̃^n} max_{Q∈F} Q(x̃^n)    (13)

as desired.

4 Third answer: Dirichlet mixture

Although the normalized maximum likelihood predictor gives the optimal worst-case regret, it has a severe drawback: it is usually very hard to compute in practice. In contrast, the mixture idea used in the first answer can usually be computed efficiently (as will be shown below). This observation motivates us to come up with a mixture-based predictor P which enjoys a worst-case regret close to that of P_NML. The general mixture idea is as follows: suppose the model class F = (Qθ)θ∈Θ can be parametrized by θ. Consider some prior w(·) on Θ and choose

Pω(x^n) = ∫_Θ Qθ(x^n) w(dθ)    (14)

2 as our predictor. Note that in this case,

t t−1 Pω(x ) Pω(xt|x ) = t−1 (15) Pω(x ) R t Θ Qθ(x )w(dθ) = R t−1 (16) Θ Qθ(x )w(dθ) R t−1 t−1 Θ Qθ(xt|x )Qθ(x )w(dθ) = R t−1 (17) Θ Qθ(x )w(dθ) Z t−1 t−1 Qθ(x ) = Qθ(xt|x )R t−1 0 w(dθ) (18) Θ Θ Qθ0 (x )w(dθ ) Z t−1 t−1 , Qθ(xt|x )w(dθ|x ) (19) Θ where t−1 t−1 Qθ(x )w(dθ) w(dθ|x ) = R t−1 0 . (20) Θ Qθ0 (x )w(dθ ) Then how to choose the prior w(·)? On one hand, the prior w(·) should be chosen such that it assigns roughly equal weights to the “representative” sources. On the other hand, it should be chosen in a way such that it is computationally tractable for the sequential probability assignments. We take a look at some concrete examples.

4.1 Bernoulli case

In the first example, we assume that Qθ is the law of n i.i.d. Bern(θ) variables and F = {Qθ}θ∈[0,1]. Note that in this case the model class consists of all i.i.d. distributions. We assign the Dirichlet prior

w(dθ) = (1/π) · dθ / √(θ(1−θ)) ∼ Beta(1/2, 1/2).    (21)

The properties of the resulting predictor are left as exercises:

Exercise 3. Denote by n0 = n0(x^n), n1 = n1(x^n) the numbers of 0's and 1's in the sequence x^n, respectively. Prove the following:

(a) The conditional probability is given by

Pω(xt|x^{t−1}) = (n_{xt}(x^{t−1}) + 1/2) / t.    (22)

This is called the Krichevsky–Trofimov sequential probability assignment.

(b) Show that WCRn(Pω, F) = (1/2) log n + O(1).

(c) Considering the normalized maximum likelihood predictor, also show that min_P WCRn(P, F) = (1/2) log n + O(1). Hence, our Dirichlet mixture Pω attains the optimal leading term of the worst-case regret.

In fact, there is a general theorem on the performance of the mixture-based predictor:

Theorem 4 (Rissanen, 1984). Let F = {Qθ}θ∈Θ with Θ ⊂ R^d. Under benign regularity and smoothness conditions, we have

min_P WCRn(P, F) = (d/2) log n + o(log n).    (23)

Moreover, there exists some Dirichlet prior w(·) such that

WCRn(Pω, F) = (d/2) log n + o(log n).    (24)

4.2 Markov sources

Given that we can compete with i.i.d. Bernoulli sequences, how can we compete under logarithmic loss with respect to Markov models? Assume X = {0, 1} and denote by Mk the set of all k-th order Markov models. Note that for Q ∈ Mk,

Q(x^n|x_{−k+1}^0) = Π_{i=1}^n Q(xi|x_{i−k}^{i−1})    (25)
                 = Π_{all possible k-tuples u^k} Π_{i: x_{i−k}^{i−1} = u^k} Q(xi|u^k).    (26)

The product Π_{i: x_{i−k}^{i−1} = u^k} Q(xi|u^k) gives us the probability of the subsequence {xi}_{x_{i−k}^{i−1} = u^k} under Bern(Q(1|u^k)). Then it makes sense to employ the KT estimator on each of these subsequences. Consider the following P such that

P(xt|x_{−k+1}^{t−1}) = (n_{x_{t−k}^{t−1} xt}(x_{−k+1}^{t−1}) + 1/2) / (n_{x_{t−k}^{t−1}}(x_{−k+1}^{t−1}) + 1),    (27)

where n_{u^k}(x^n) = Σ_{i=1}^n 1(x_{i−k+1}^i = u^k). Then, for this P, we have

WCRn(P, Mk) ≤ Σ_{u^k} (1/2) ln(n_{u^k}(x^{n−1})) + o(ln n)    (28)
            ≤ (2^k/2) ln (n/2^k) + o(ln n),    (29)

which matches the Rissanen lower bound on the leading term.

EE378A Statistical Signal Processing Lecture 9 - 05/02/2017
Lecture 9: Context Tree Weighting
Lecturer: Tsachy Weissman Scribe: Chuan-Zheng Lee

In the last lecture we talked about the mixture idea for i.i.d. sources and Markov sources. Specifically, we introduced the Krichevsky–Trofimov probability assignment for a binary vector x^n ∈ {0, 1}^n given by

P^KT(x^n) = ∫ ω(θ) θ^{n1(x^n)} (1 − θ)^{n0(x^n)} dθ = Γ(n0 + 1/2) Γ(n1 + 1/2) / (Γ(1/2)² Γ(n + 1)),    (1)

where ω(θ) ∝ 1/√(θ(1−θ)), n1 ≡ n1(x^n) is the number of 1's in x^n, n0 ≡ n0(x^n) is the number of 0's (so that n = n0 + n1), and Γ(·) is the gamma function. Note that ω(θ) is a Dirichlet prior with parameters (1/2, 1/2), and the integrand is proportional to a Dirichlet density with parameters (n0 + 1/2, n1 + 1/2). The main result is that, for regular models with d degrees of freedom, the Dirichlet mixing idea achieves the optimal worst-case regret (d/2) log n + o(log n) for individual sequences of length n. In this lecture, we will extend this idea to general tree sources.
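The closed-form and sequential views of P^KT in (1) can be checked against each other in a few lines of Python; the test sequence below is arbitrary. (The sequential form is the Krichevsky–Trofimov assignment from Lecture 8: after t − 1 symbols, the next symbol b gets probability (n_b + 1/2)/t.)

from math import lgamma, log

def kt_log_prob(n0, n1):
    # log P^KT via the closed form Gamma(n0+1/2)Gamma(n1+1/2) / (Gamma(1/2)^2 Gamma(n+1))
    return (lgamma(n0 + 0.5) + lgamma(n1 + 0.5)
            - 2 * lgamma(0.5) - lgamma(n0 + n1 + 1))

def kt_sequential(x):
    # Sequential view: after i symbols, p(next = b) = (n_b + 1/2) / (i + 1)
    n = [0, 0]; logp = 0.0
    for i, bit in enumerate(x):
        logp += log((n[bit] + 0.5) / (i + 1))
        n[bit] += 1
    return logp

x = [1, 0, 1, 1, 0, 1, 1, 1]
print(kt_log_prob(x.count(0), x.count(1)), kt_sequential(x))   # identical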

1 Tree Source

We first begin with an informal definition of a tree source.

Definition 1 (Context of a symbol). A context of a symbol xt is a suffix of the sequence that precedes xt, that is, (··· , xt−2, xt−1).

Definition 2 (Tree source). A tree source is a source where a set of suffixes is sufficient to determine the probability distribution of the next item (regardless of the past) in the sequence.

A tree source is hence a generalization of an order-one Markov source, where only the previous element in the sequence determines the next: now, instead of just one element being sufficient, the suffix is sufficient. It is perhaps best elucidated by example.

Example 3. Let S = {00, 10, 1} and Θ = {θs, s ∈ S}. Then

p(Xt = 1|Xt−2 = 0,Xt−1 = 0) = θ00

p(Xt = 1|Xt−2 = 1,Xt−1 = 0) = θ10

p(Xt = 1|Xt−1 = 1) = θ1

We can represent this structure in a binary tree of depth 2, whose leaves correspond to the suffixes 1, 00, and 10 and are labeled with the parameters θ1, θ00, and θ10, respectively (figure omitted).

Remark Note that the source in Example 3 is not first-order Markov, but can be cast as a special case of a second-order Markov source.

We denote by P_{S,θ}(x^n) the associated probability mass function on x^n ∈ X^n.

1 2 Known tree, unknown parameters

Recall that with Markov sources, we could think of the pmf of any sequence as the product of conditional pmfs. For tree sources, we can factorize similarly, but with respect to suffixes.

Let P_S^KT be the pmf induced by employing the Krichevsky–Trofimov sequential probability assignment (separately) on each subsequence corresponding to every s ∈ S. We will denote by ns the number of xi, 1 ≤ i ≤ n, with context s, and we will show in Homework 4 that, for a binary sequence x^n,

log (1/P^KT(x^n)) − min_θ log (1/Q_{Bernoulli(θ)}(x^n)) ≤ (1/2) log n + 1.

Now enumerating over different contexts, the regret on sequence xn is upper-bounded by

log (1/P_S^KT(x^n)) − min_θ log (1/P_{S,θ}(x^n)) ≤ Σ_{s∈S} [(1/2) log ns + 1]
    = |S| Σ_{s∈S} (1/|S|) [(1/2) log ns + 1]
    ≤(a) |S| [(1/2) log (Σ_{s∈S} ns / |S|) + 1]
    = (|S|/2) log (n/|S|) + |S|

where (a) follows from Jensen's inequality. Recall from Lecture 8 that Rissanen showed that the worst-case regret could at best be (d/2) log n + o(log n). This shows that the above strategy, using the Krichevsky–Trofimov assignment, is optimal up to lower order terms.

3 Unknown tree

We say a context tree is of depth D if it is a full binary tree of depth D, where each node s contains two integers (ns0, ns1), i.e., the number of 0's and 1's in the sequence x^n which occur right after the context s.

Example 4. Consider the sequence

x−2 x−1 x0   x1 x2 x3 x4 x5 x6 x7
 1   1   0    0  1  0  0  1  1  0  ···

(the first three symbols serve as the initial context preceding x1).

We can draw the context tree, with each node s labeled with the set x(s) of symbols occurring right after context s:

(Context tree of depth 3; each node s is labeled with x(s). The figure is omitted; its content is:)

x(∅) = {x1, . . . , x7};
x('0') = {x1, x2, x4, x5}, x('1') = {x3, x6, x7};
x('00') = {x2, x5}, x('10') = {x1, x4}, x('01') = {x3, x6}, x('11') = {x7};
x('000') = ∅, x('100') = {x2, x5}, x('010') = {x4}, x('110') = {x1}, x('001') = {x3, x6}, x('101') = ∅, x('011') = {x7}, x('111') = ∅.

For example, x3 is preceded by '001', so it is included in x('001') (at t − 3), x('01') (at t − 2) and x('1') (at t − 1). Moreover, for s = 001, we have ns0 = ns1 = 1 (corresponding to x3 and x6, respectively).

Then the context-tree weighting at any leaf node s is P_CTW(x^n, s) = P^KT(ns0, ns1) if depth(s) = D. For any node s other than a leaf node, we have

P_CTW(x^n, s) = (1/2) P^KT(ns0, ns1) + (1/2) P_CTW(x^n, 0s) P_CTW(x^n, 1s)

where 0s means the sequence formed by prepending '0' to s, and 1s means the sequence formed by prepending '1' to s. Finally, the CTW probability assignment on x^n can be found recursively at the root via P_CTW(x^n) = P_CTW(x^n, ∅).
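A direct Python sketch of this recursion, run on the example sequence above. The context convention used here (contexts written oldest symbol first, so that 0s and 1s prepend one symbol further into the past) is an assumption of this sketch, chosen to match the example; the initial context (1, 1, 0) supplies the symbols preceding x1.

from math import lgamma, exp

def kt(n0, n1):
    # Krichevsky-Trofimov probability of a count pair (n0, n1)
    return exp(lgamma(n0 + 0.5) + lgamma(n1 + 0.5)
               - 2 * lgamma(0.5) - lgamma(n0 + n1 + 1))

def counts(x, ctx, s):
    # (n_s0, n_s1): occurrences of 0/1 in x right after context s;
    # ctx holds the D symbols preceding x_1 (most recent last)
    past, D, n = ctx + x, len(ctx), [0, 0]
    for t in range(len(x)):
        if past[t + D - len(s): t + D] == s:   # context of x[t], oldest symbol first
            n[x[t]] += 1
    return n

def p_ctw(x, ctx, s, D):
    # P_CTW(x^n, s): KT at the leaves, half-half mixture at internal nodes
    n0, n1 = counts(x, ctx, s)
    if len(s) == D:
        return kt(n0, n1)
    return 0.5 * kt(n0, n1) + 0.5 * p_ctw(x, ctx, [0] + s, D) * p_ctw(x, ctx, [1] + s, D)

ctx = [1, 1, 0]                    # initial context preceding x_1
x = [0, 1, 0, 0, 1, 1, 0]          # the example sequence x_1, ..., x_7
print(p_ctw(x, ctx, [], 3))        # P_CTW(x^n) = P_CTW(x^n, empty context)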

EE378A Statistical Signal Processing Lecture 10 - 05/04/2017
Lecture 10: Regret Analysis of Context Tree Weighting
Lecturer: Tsachy Weissman Scribe: Chuan-Zheng Lee

In the last lecture we introduced how to use Krichevsky–Trofimov probability assignment to construct the context tree, and the corresponding weighting scheme for tree sources. In this lecture we will analyze the (worst-case, or the average-case) regret of the CTW algorithm under logarithmic loss.

1 Context Tree Weighting: Review

Notation: since the Krichevsky–Trofimov probability assignment only depends on the number of zeros and ones in the sequence, we define ~n = (n0(x^n), n1(x^n)) and write P^KT(x^n) as P^KT(~n). The CTW algorithm imposes the Krichevsky–Trofimov probability assignment on every context s ∈ S separately, and with each context s ∈ S we associate a vector ~ns = (ns0(x^n), ns1(x^n)), i.e., the numbers of occurrences of the contexts s0, s1 in the sequence x^n, respectively. Now revisit the last example with nodes labeled with their corresponding ~n:

x−2 x−1 x0   x1 x2 x3 x4 x5 x6 x7
 1   1   0    0  1  0  0  1  1  0  ···

(Context tree of depth 3, with each node s labeled by its count vector ~ns; the figure is omitted. Its content is:)

~n = (4, 2);
~n0 = (2, 2), ~n1 = (2, 1);
~n00 = (0, 2), ~n10 = (2, 0), ~n01 = (1, 1), ~n11 = (1, 0);
~n000 = (0, 0), ~n100 = (0, 2), ~n010 = (1, 0), ~n110 = (1, 0), ~n001 = (1, 1), ~n101 = (0, 0), ~n011 = (1, 0), ~n111 = (0, 0).

The probability assignment of the CTW algorithm on xn and the context s is given as follows:

P_CTW(x^n, s) = P^KT(~ns) if depth(s) = D, and P_CTW(x^n, s) = (1/2) P^KT(~ns) + (1/2) P_CTW(x^n, 0s) P_CTW(x^n, 1s) otherwise.

Recursively we find P_CTW(x^n) = P_CTW(x^n, ∅), and also the conditional probability

P_CTW(xn+1|x^n) = P_CTW(x^{n+1}) / P_CTW(x^n).

Remark The choice of the weights 1/2 in the second line is not important; they can be changed into any (α, 1 − α) with α ∈ (0, 1) with almost no loss in the theoretical analysis (as you may see below). In practice, the choice of α depends on the data.

2 Worst-Case Regret of CTW Algorithm

Consider the context tree in the previous example with depth 3, and suppose that the true tree model S is the one from Example 3 of Lecture 9, with suffix set S = {00, 10, 1} and parameters θ00, θ10, θ1 (figure omitted).

By definition of the CTW algorithm, we have

P_CTW(x^n, 00) ≥ (1/2) P^KT(~n00)
P_CTW(x^n, 10) ≥ (1/2) P^KT(~n10)
P_CTW(x^n, 1) ≥ (1/2) P^KT(~n1)
P_CTW(x^n, 0) ≥ (1/2) P_CTW(x^n, 00) P_CTW(x^n, 10)
P_CTW(x^n) = P_CTW(x^n, ∅)
           ≥ (1/2) P_CTW(x^n, 0) P_CTW(x^n, 1)
           ≥ (1/2) · (1/2) P_CTW(x^n, 00) P_CTW(x^n, 10) · (1/2) P^KT(~n1)
           ≥ (1/2^5) P^KT(~n00) P^KT(~n10) P^KT(~n1)
           = (1/2^5) P_S^KT(x^n).

As a result, for any sequence x^n, in this case we have

log (1/P_CTW(x^n)) − min_θ log (1/P_{S,θ}(x^n)) ≤ 5 + log (1/P_S^KT(x^n)) − min_θ log (1/P_{S,θ}(x^n))
                                               ≤ 5 + (3/2) log (n/3) + 3
                                               = (3/2) log (n/3) + 8.

In general, repeating the analysis above, we lose a factor of at most 2 for each of the |S| leaves and |S| − 1 internal nodes. Therefore, in general, for any S corresponding to a tree source of depth ≤ D,

log (1/P_CTW(x^n)) ≤ 2|S| − 1 + log (1/P_S^KT(x^n))
                   ≤ 2|S| − 1 + (|S|/2) log (n/|S|) + |S| + min_θ log (1/P_{S,θ}(x^n))
                   = (|S|/2) log (n/|S|) + 3|S| − 1 + min_θ log (1/P_{S,θ}(x^n)),

which again shows the order-optimality of context-tree weighting for any unknown tree model with depth at most D.

3 Average Regret of CTW Algorithm

Recall that the worst-case regret under logarithmic loss is given by

WCRn(P, F) = max_{x^n} [log (1/P(x^n)) − min_{Q∈F} log (1/Q(x^n))].

Now, for all true probability distribution Q ∈ F, we have

Q(xn) D(Q n kP n ) = log x x EQ P (xn)  1 1  ≤ EQ log − min log P (xn) Q0∈F Q0(xn)

≤ WCRn(P, F) which shows that the average-case regret is upper bounded by the worst-case regret:

sup D(Qxn kPxn ) ≤ WCRn(P, F). Q∈F

Applying this result to the tree source, for any tree source PS,θ with depth(S) ≤ D, we have

|S| D(P n kP n ) ≤ log n + O(1). S,Θ CTW 2 As a result, we know from Lecture 7 that for prediction under a general loss function Λ,

s    n n  2 |S| L P (X ) − L P (X ) ≤ Λ log n + O(1) . E F CTW F S,θ max n 2

3 EE378A Statistical Signal Processing Lecture 11 - 05/09/2017 Lecture 11: Estimation of Directed Information Lecturer: Tsachy Weissman Scribe: Maggie Engler

In this lecture, we will introduce the quantity of directed information, provide motivation for estimating the directed information, and look at two estimators and their associated performance guarantees. First, though, we clarify a question from last time about the depth of context tree weighting.

1 Context Tree Weighting, Continued

Recall that for QCTW,D, denoting the probability assignment by context tree weighting with depth D, we showed in the previous lecture that 1 1 |S| n log n − min log n ≤ log + 3|S| − 1 (1) QCTW,D(X ) θ PS,θ(X ) 2 |S| |S| = log n + O(1) (2) 2 for any tree source S with depth at most D. In other words, to obtain the optimal regret up to lower-order terms, we can choose the tree depth D to be greater than our expected depth of the true tree source in practice. To avoid the issue of estimating the true tree depth, we can further mix many context trees with dif- ferent tree depths D. Let’s consider the following mixture: take any {wD}D≥1 such that all wD ≥ 0 and P∞ P∞ D=1 wD = 1, and let Q = D=1 wDQCTW,D. Now by the standard mixing argument, for any tree source S with depth D, we have 1 1 1 log n ≤ log n + log (3) Q(X ) QCTW,D(X ) wD |S| n 1 ≤ log + 3|S| − 1 + log . (4) 2 |S| wD Thus, the mixture Q as defined above is also guaranteed to perform within a constant term of optimal regret for any tree source S.

2 Universality

Note: Remember that if some data is governed by a law P , the excess log-loss of the probability assignment 1 Q (beyond what is optimal) is the relative entropy D(PXn ||QXn ), normalized to n D(PXn ||QXn ). ∞ Definition 1 (Stationary). Probability measure P on X is stationary if for any k-tuple (Xi1 ,Xi2 ,Xi3 , ··· ,Xik ) and any m,

d (Xi1 ,Xi2 ,Xi3 , ...Xik ) = (Xi1+m ,Xi2+m ,Xi3+m , ...Xik+m ) (5) The shifted k-tuple is equal in distribution to the original. This is also described as shift-invariance and is a common assumption for an unknown underlying distribution. Definition 2 (Universal). Probability measure Q on X ∞ is universal if

1 n→∞ D(P n ||Q n ) −−−−→ 0 (6) n X X for all stationary laws P.

1 Exercise: Show that the mixture Q in (3) is universal.

Universality of Q is clearly a desirable property, as we are guaranteed to have a vanishing excess log-loss with any stationary law P . Additionally, if some Q is universal, it can be used to estimate other quantities of interest, such as directed information.

3 Directed Information

Definition 3 (Directed information). The directed information from Xn to Y n is defined as

n n n X i i−1 I(X → Y ) , I(X ; Yi|Y ) (7) i=1 More generally, n n−d n d n−d n X i−d i−1 I(X → Y ) , I((∅ ,X ) → Y ) = I(X ; Yi|Y ). (8) i=1 We remark that mutual information is the canonical measure of the relevance of one process to another, and the directed information is the canonical measure of the causal relevance of one process to another. Notice that directed information is a summation of mutual information between the first i elements of X and the ith element of Y , conditioned on the previous i − 1 values of Y . Namely, we measure the causal i i−1 relevance of X for our prediction of Yi, given Y . The following two examples will illustrate why this is a very natural metric in prediction.

Example 1: Suppose that (X1,X2, ··· ,Xn) and (Y1,Y2, ··· ,Yn) represent two stocks in the New York Stock Exchange. You are a young investor desperate to impress your boss at the hedge fund, and you would like to predict the price of stock Y tomorrow, i.e., at day i. You can look solely at Y i−1, but, being more resource- i−1 i−1 i−1 ful than that, you decide to determine the extent to which X also informs Yi—that is, I(X ; Yi|Y ). Pn i−1 i−1 n−1 n Collected over time, we have i=1 I(X ; Yi|Y ), which is equivalent to I(X → Y ).

Example 2: Now, on the basis of your excellent application of directed information, you have been pro- moted to hedge fund manager. Your new role requires you to be prescient of global trends, so this time let (X1,X2, ··· ,Xn) be the Hang Seng Index, an overall performance measure for the Hong Kong Stock Ex- change, and let (Y1,Y2, ··· ,Yn) be the Dow Jones Industrial Average. Because of the time difference between New York and Hong Kong, the trading has already closed in Hong Kong before the opening bell in New York. i−1 i−1 Therefore, when you predict Yi, you now have access to Xi in addition to X and Y as before. The i i−1 Pn i i−1 conditional mutual information is I(X ; Yi|Y ), and over a period of n days we get i=1 I(X ; Yi|Y ) = I(Xn → Y n).

We describe mutual information as a measure of relevance or correlation between two processes, without implying causation. Directed information allows us to be more explicit about the flow of information from one process to another. The relation is shown below, to be proved as an exercise.

Exercise: Show that I(Xn; Y n) = I(Xn → Y n) + I(Y n−1 → Xn)

Also note that this implies that I(Xn → Y n) ≤ I(Xn; Y n):

n n n n X i i−1 X n i−1 n n I(X → Y ) = I(X ; Yi|Y ) ≤ I(X ; Yi|Y ) = I(X ; Y ) (9) i=1 i=1 where the second step is given by the data processing inequality.

2 4 Estimation of Directed Information

Definition 4 (Directed information rate). Let X and Y be a pair of processes. The directed information rate from X to Y is

1 n n I(X → Y) lim I(X → Y ), (10) , n→∞ n provided that the limit exists. Exercise: Show that the limit exists if X and Y are jointly stationary.

Note: Recall that for X ∼ PX , the entropy may be written H(X) or equivalently H(PX ), which is a function of the distribution.

Notice that n n n n X i i−1 X i−1 i i−1 I(X → Y ) = I(X ; Yi|Y ) = [H(Yi|Y ) − H(Yi|X ,Y )] (11) i=1 i=1 This suggests an estimator for the directed information rate. Specifically, n 1 X Iˆ(Xn → Y n) = [H(Y |Y i−1) − H(Y |Xi,Y i−1)]. (12) n i i i=1 Since the (conditional) entropy solely depends on the underlying unknown distribution, we can consider another probability assignment of our choice Q. The idea is that if Q is universal, then the (conditional) entropy based on Q should converge to that based on the true P . Theorem 5. If Q is universal and (X,Y ) are jointly stationary and ergodic (a concept we will not introduce here, which basically means that the law of large numbers holds in the process), then 1 [ (X → Y) − Iˆ(Xn → Y n)] −−−−→n→∞ 0 (13) E I n Therefore, this estimator of the directed information rate converges to the true rate. With a few more constraints, we can further bound how quickly the estimate converges. Theorem 6. If Q is derived from context tree weighting (CTW) and the joint process of (X,Y ) is stationary, ergodic, aperiodic, and Markov (of arbitrary order), then 1 log n [ (X → Y) − Iˆ(Xn → Y n)] ≤ C √ (14) E I n n where C > 0 is a universal constant not depending on n. For a more in-depth discussion, please see the paper “Universal Estimation of Directed Information” (Jiao, Permuter, Zhao, Kim, and Weissman, 2012). Another estimator is suggested by the observation that n n i i−1 i i−1 I(X → Y ) = D(PYi|X ,Y kPX ,Y ) (15) with the proof left as an exercise. We can then construct n n n 1 X Iˆ(X → Y ) D(Q i i−1 ||Q i i−1 ) (16) , n Yi|X ,Y X ,Y i=1 for our suitable choice of Q. We omit the details here, but remark that this estimator has similar performance guarantees to the first, and often has a slightly smoother convergence in practice as relative entropy is always non-negative.

3 EE378A Statistical Signal Processing Lecture 12 - 05/10/2017 Lecture 12: Introduction to the Discrete Universal Denoiser (DUDE) Lecturer: Tsachy Weissman Scribe: Yihui Quek

In this lecture1, we set the state for the Discrete Universal Denoiser (DUDE), an algorithm for denoising sequences whose symbols take values in a discrete alphabet. The following table shows at a glance how this new problem setting compares to those of the algorithms we have studied in the past:

Algorithm Key idea Input sequence alphabet Viterbi algorithm/Kalman Maximum a posteriori Continuous filter (MAP) decoding Context-tree weighting Decoding from context Discrete, generated by a tree (CTW) algorithm source with unknown param- eters Discrete Universal Denoiser MAP decoding from context Discrete, generating distribu- (DUDE) tion unknown

1 Setting

Figure 1 shows a toy setting we consider for discrete denoising.

n n n n n n Noisy z ∈ Z xˆ (z ) = x = (x1, . . . xn) ∈ X Denoiser n n n n Channel (ˆx1(z ), xˆ2(z )...xˆn(z )) ∈ Xˆ

n Figure 1: Denoising setting. In this toy example, we assume that the parameters of the source X (i.e., pX (x )), as well as the conditional probabilities of each output symbol produced by the channel on any given input, pX,Z (xi, zi), are known.

We assume all noiseless sequences and observations are composed of symbols which take values in a finite  alphabet, X = a1, . . . a|X | (thus the ‘discrete’ in the name of the denoiser). Denoising means that given a noisy observation sequence zn, we would like to reconstruct the original noiseless signal, xn, by outputting a best estimatex ˆn(zn). Assume we have a per-symbol loss function Λ : X × Xˆ → R and are allowed to look at an entire length-n observation zn to guessx ˆn. Definition 1. An n-block denoiser is a mapping

Xˆ n(Zn): Zn → Xˆn.

As in the past, we measure the performance of Xˆ n by the normalized cumulative loss on a block zn:

n 1 X L (xn, zn) = Λ(x , Xˆ n(zn)[i]). (1) Xˆ n n i i=1

1Reading: Universal Discrete Denoising: Known Channel, T. Weissman, E. Ordentlich, G. Seroussi, S. Verdu, M.J. Wein- berger, 2005.

1 n n n (From now on we use the notation Xˆ (z )[i] ← xˆi(z ).) Let us start our investigation by assuming that the symbols are generated probabilistically with known parameters, i.e., probabilities of all symbols, channel and noise are known to the denoiser. On the cumulative loss, an optimal denoiser is then described by

" n #  n n  1 X n arg min E LXˆ n (x , z ) = arg min E Λ(xi, xˆi(z )) (2) Xˆ n(·)∈D Xˆ n(·)∈D n n n i=1 where Dn denotes the set of all block-n denoisers. We can observe from the right-hand side of (2) that the cumulative loss is separable; it is a sum of terms, each of which depends only xi andx ˆi. This suggests a n greedy strategy for producing such an estimate; under this strategy,x ˆi(z ) is nothing but the Bayes response to the conditional posterior probability of each symbol:

n ˆ n n xˆi(z ) = XBayes(Pxi|Z =z ) (3) and the corresponding value of expected loss is the Bayes envelope,

n 1 X   U(P n ) . (4) n E xi|Z i=1 X We remind the readers that the Bayes envelope is defined as U(PX ) := min PX (x)Λ(x, xˆ). In addition, ˆ xˆ∈X x the discreteness of the input alphabet permits us to write a loss matrix that will be useful in later sections:  

{Λ(x, xˆ)} :=  λa1 λa2 ... λa|X | 

where λai is the column vector Λ(·, xˆ = ai). With this definition in hand, we may rewrite the Bayes envelope in terms of the λis: T U(PX ) = min PX · λxˆ (5) xˆ The conclusion of this analysis reveals that the Bayes response denoiser (3) achieves the minimum loss on a sequence with known generating distribution. But we have glossed over some things in the above analysis: Firstly, the Kalman filter showed us that deriving the Bayes envelope is already hard in the continuous case; what more when the symbols come from a discrete alphabet. Secondly, we should not expect realistically to be handed on a silver platter with the probabilistic parameters of the problem; for one, we often do not know the parameters of the generating distribution of the given sequence. We have in fact treaded in these waters before; see the CTW algorithm.

2 Towards construction of a universal scheme

From now on we assume that we do not know the generating distribution of the sequence symbols, but we have some characterization of the channel. We would like to construct a universal scheme, i.e., a scheme that attains optimal performance asymptotically no matter the true law that governs the data. Here, the

ˆ n “optimal performance denoiser” is XBayes(Pxi|z ) as established previously. In this section we consider the “single-letter” problem as shown in Figure 2, where our input sequence is now only one symbol, X, and we are to guess Xˆ based on Z. Again, X and Xˆ take values in finite alphabet,  X = a1, . . . a|X | . We can also define the channel transition matrix showing probabilities of mapping inputs to outputs:  

{Π(x, z)} :=  πa1 πa2 ... πa|X | 

2 X ∼ P Channel Z ∈ Z X Denoiser ˆ ˆ X ∈ X Π(x, z) X ∈ X

Figure 2: Single-letter denoising. We assume that we know the channel transition probability matrix Π(x, z).

where πai is the column vector PX,Z (·,Z = ai). Before we proceed, let us define the Schur/Hadamard product represented by the symbol :

Definition 2. For vectors v1, v2, v1 v2 is the vector obtained by element-wise multiplication.

[v1 v2]i = v1,iv2,i

Secondly, we generalize the Bayes response (implicitly defined on our loss function Λ(x, xˆ)) to take an arbitrary argument v: T XˆBayes(v) = arg min v · λxˆ (6) xˆ

With these definitions in hand, we may factorize PZ with a simple calculation:

X X T  T  PZ (z) := PX,Z (x, z) = PX (x)Π(x, z) = PX · πz = PX · Π (z) (7) x x

T T 2 i.e. PZ = PX · Π. Assume now that |X | = |Z| and Π is invertible . Then we may re-write this as

−T PX = Π PZ (8) and we may readily write the conditional probability

−T PX (x)Π(x, z) [Π PZ ](x) · πZ (x) PX|Z=z(x) := = PZ (z) PZ (z) −T [Π PZ ] πZ PX|Z=z = (9) PZ (z) Let φ(z) represent the operation of denoising. The previous section established that

min EΛ(x, φ(z)) = E[U(PX|Z )] φ(·) and this is achieved by ˆ ˆ −T φopt(z) = XBayes(PX|Z=z) = XBayes([Π PZ ] πZ ) (10) where we have elided the denominator of (9) in the substitution because the overall normalization is not a function ofx ˆ and does not affect the minimization. We can generalize the above analysis to situations where |X | ≤ |Z|, i.e., Π is non-square but of full row-rank: in those situations (to be proven in the homework) we have

ˆ ˆ T −1 φopt(z) = XBayes(PX|Z=z) = XBayes([(ΠΠ ) Π · PZ ] πZ ). By the way, (ΠΠT )−1Π can be defined for any arbitrary non-square matrix Π and is known as the Moore- Penrose pseudoinverse.

2 1 Usually a safe assumption but notable exceptions include the binary symmetric channel with flipping probability 2 – which is not a useful channel anyway.

3 3 Universal Discrete Denoising and introduction to DUDE

Now we extend the above analysis to denoising of a sequence, not just a single letter – and in doing so we introduce the DUDE. Figure 1 is applicable to our new setting too, but now we have far less knowledge of the system. More concretely, the assumptions are now: 1. Finite alphabets X , Z, Xˆ.

2. Source X is governed by a probabilistic and stationary but otherwise unknown generating distribution. 3. Channel is known to be a discrete memoryless channel; and Π is known and of full row rank. 4. Loss function Λ is given. The idea of DUDE is that we “correct by the context”; whereas in Section 1 we had the luxury of knowing

n n Pxi|Z =z which we could input into the Bayes response function, now we must estimate that. We do so by looking at left and right length-k contexts around the symbol to be denoised:

z1 . . . zi−k . . . zi−1 zi zi+1 . . . zi+k . . . zn

For every pair of left and right contexts, we can define a count vector m ∈ R|Z| to capture the number of times a symbol z ∈ Z appears in these contexts over the entire sequence

n k k i−1 k i+k k m(z , ` , r )[z] := {k + 1 ≤ i ≤ n − k : zi−k = ` , zi+1 = r , zi = z}, the DUDE denoiser effectively calculates the Bayes response (3) with this empirical probability distribution. That is, DUDE(k) performs the function F where:

n n i−1 i+k  xˆi(z ) = F Λ, Π, m(z , zi−k, zi+1 ), zi k + 1 ≤ i ≤ n − k T  −T n i−1 i+k  = arg min λxˆ [Π · m(z , zi−k, zi+1 )] πZ . (11) xˆ

Readers will note the similarity with (10). Surprisingly, it will turn out that the operation performed by the DUDE is linear in n.

4 EE378A Statistical Signal Processing Lecture 13 - 05/16/2017 Lecture 13: The Performance of Discrete Universal Denoiser (DUDE) Lecturer: Tsachy Weissman Scribe: Ruiyang Song

1 Recap

The system model of the denoising problem is presented in Fig. 1. Let Xn be the unknown input sequence that passes through a discrete memoryless channel. The channel is characterized by its transition matrix Π which is known and is of full row rank. The output of the noisy channel is denoted by Zn which is observed n by the denoiser. The denoiser then outputs the reconstruction sequence Xˆ . Here, Xi ∈ X , Zi ∈ Z, Xˆi ∈ Xˆ, where X , Z, and Xˆ are finite alphabets. The distortion of the reconstruction sequence is characterized by a given loss function Λ : X × Xˆ → [0, +∞).

Zn n X Noisy channel Denoiser Xb n

Figure 1: Denoising setting. The DUDE with context length k works in the following way: in the first pass, the denoiser derives the counting vector of dimension |Z| as follows:

n k k  i+k k k m z , l , r [z] = k + 1 ≤ i ≤ n − k : zi−k = (l , z, r ) . (1) Note that m[z] counts the number of items in the sequence zn that equals z with left context lk and right context rk. In the second pass, DUDE outputs the reconstruction sequence Xˆ n with ˆ n n i−1 i+k  Xi(z ) = φ Λ, Π, m z , zi−k, zi+1 , zi , for k + 1 ≤ i ≤ n − k, (2) where

T −1  φ(Λ, Π, v, z) = XˆBayes (ΠΠ ) Π v πz (3) T T −1  = arg min λxˆ (ΠΠ ) Π v πz . (4) xˆ Exercise 1. Consider the following specific examples: 1 − δ δ  0 1 • For BSC(δ) and Hamming loss, i.e., Π = , Λ = . Show that δ 1 − δ 1 0 ( z v(z) ≥ 2δ(1−δ) φ(Λ, Π, v, z) = v(¯z) δ2+(1−δ)2 . (5) z¯ otherwise

• For BEC() and Hamming loss, the alphabets are Z = X S{e}, Xˆ = X , and  1 −  z = x  Π(x, z) =  z = e . (6) 0 otherwise Show that ( z z 6= e φ(Λ, Π, v, z) = . (7) arg maxxˆ∈Xˆ v(ˆx) z = e

1 2 Choice of the context length k

The choice of k is important for the performance of DUDE. On one hand, a large k provides more information in each context sequence; on the other hand, when k is large, we will have a small number of counts and might suffer from unreliable context statistics. In practice, we choose 1 log n  k = k = , (8) n 5 log |Z|

ˆ n which can be justified through performance analysis. In the following sections, we denote by XDUDE the reconstructed sequence given by the DUDE with context length kn.

3 Performance of DUDE 3.1 Stochastic setting In this section we assume that the source X is a stationary stochastic process. We introduce the definition of denoisibility as follows: n n D(X, Z) = lim min ELXˆ n (X ,Z ), (9) n→∞ n Xˆ ∈Dn

where Dn is the set of all denoisers of block length n. Note that n n n 1 X  n min ELXˆ n (X ,Z ) = EU PXi|Z . (10) Xˆ n∈D n n i=1

We introduce the properties of D(X, Z) in the following exercises. Exercise 2.

• If (X, Z) is jointly stationary, then D(X, Z) exists.      l n n−i • (X, Z) = liml→∞ U PX |Z . Note that when X, Z are jointly stationary U PXi|Z = U P . D E 0 −l E E X0|Z1−i In the following theorem, we show the optimality of DUDE for the stochastic setting. Theorem 3. For any X,

n n lim L ˆ (X ,Z ) = (X, Z). (11) n→∞ E XDUDE D

3.2 Semi-stochastic setting

In this section, we assume that the source x = (x1, x2,...) is some deterministic sequence. Note that Z is n n Qn still a stochastic process with distribution P(Z = z ) = i=1 π(xi, zi). This is actually more general than the stochastic setting since we assume nothing about the input sequence. Define n−k n n 1 X i+k  Dk(x , z ) = min Λ xi, f(zi−k ) (12) f:Z2k+17→Xˆ n i=k+1 to be the error induced by the optimal denoiser with double-sided context length k. Note that here the optimal denoiser has access to both the original sequence and the noisy sequence, but is restricted to have the same output for symbols with the same length-k contexts. Theorem 4. For any source sequence x,

 n n n n  lim L ˆ n (x ,Z ) − Dkn (x ,Z ) = 0 w.p.1. (13) n→∞ XDUDE

2 Note that Theorem 4 is stronger than Theorem 3. We can show that Theorem 3 can be proved directly by Theorem 4. We first introduce some mathematical background. For a sequence {an}, we define

lim sup an = lim sup am, (14) n→∞ n→∞ m≥n

lim inf an = lim inf am. (15) n→∞ n→∞ m≥n

Note that lim sup and lim inf exist for any sequence. Note also the following fact:

Fact 5. limn→n an exists iff. lim supn→∞ an = lim infn→∞ an. We will also need the following lemma:

Lemma 6 (Fatou’s lemma). If {Rn} is a sequence of non-negative random variables, then

[lim inf Rn] ≤ lim inf [Rn]. (16) E n→∞ n→∞ E With the preparation above, we prove Theorem 3 with Theorem 4. Proof For any fixed l, we have    n n n n  0 = E lim sup LXˆ n (x ,Z ) − Dkn (x ,Z ) (17) n→∞ DUDE h n n n n i ≥ lim sup ELXˆ n (x ,Z ) − EDkn (x ,Z ) (18) n→∞ DUDE h n n n n i ≥ lim sup ELXˆ n (x ,Z ) − EDl(x ,Z ) , (19) n→∞ DUDE where the first inequality follows from Fatou’s lemma. Note that we can prove

n n   Dl(X ,Z ) ≤ U P l . (20) E E X0|Z−l

n n   Hence, lim supn→∞ L ˆ n (X ,Z ) ≤ U PX |Zl . Since l is arbitrary, we have E XDUDE E 0 −l

n n   lim sup L n (X ,Z ) ≤ lim EU P l = (X, Z). (21) E Xˆ X0|Z D n→∞ DUDE l→∞ −l

On the other hand, D(X, Z) measures the fundamental denoisability, and we must have

n n lim inf EL ˆ n (X ,Z ) ≥ D(X, Z) (22) n→∞ XDUDE

3 EE378A Statistical Signal Processing Lecture 14 - 05/18/2017 Lecture 14: Nonparametric Function Estimation Lecturer: Yanjun Han Scribe: Yanjun Han

1 Introduction

DUDE considers the denoising setting with a discrete source alphabet and reconstruction alphabet, and performs well under both the stochastic and semi-stochastic settings. However, the context-based counting idea used by DUDE cannot be directly applied to the continuous setting. In this lecture we start to look at a new problem: the nonparametric function estimation problem with a continuous alphabet and new assumptions on the underlying function. We look at the following regression setting: we have n observations

yi = f(xi) + σξi, i = 1, ··· , n (1) i where f(·) is our target function on [0, 1] which we assume to lie in some prescribed function class F, xi = n i.i.d are equally spaced points on [0, 1], σ is the known noise level, and ξi ∼ N (0, 1) are i.i.d Gaussian noise. ˆ Our target is to find some estimator f which comes close to minimizing the worst-case Lq risk: ˆ ˆ∗ sup Ef kf − fkq . inf sup Ef kf − fkq. (2) f∈F fˆ∗ f∈F

an Notations: for non-negative sequences {an}, {bn}, we use an bn to denote lim sup < ∞, i.e., . n→∞ bn there exists universal constant C not depending on n such that an ≤ Cbn for any n ∈ N. Similarly, we write an & bn to denote bn . bn, and an  bn to denote an . bn and bn . an. References: 1. A. Nemirovski. “Topics in non-.” Ecole d’Ete de Probabilites de Saint-Flour 28 (2000): 85. 2. A.B. Tsybakov. “Introduction to Nonparametric Estimation.” Springer, 2009.

2 The Bias–Variance Decomposition

Before looking at specific function class F and estimators fˆ, we take a look at how to evaluate the performance of any given estimator fˆ. We define the following two quantities: ˆ 1. Deterministic error: b(x) , Ef f(x) − f(x); this term corresponds to the bias, and is a deterministic scalar. ˆ ˆ 2. Stochastic error: s(x) , f(x)−Ef f(x); this term corresponds to the variance, and is a random variable.

With the previous definition, for the Lq loss with 1 ≤ q < ∞, we can write ˆ ˆ ˆ ˆ Ef kf − fkq = Ef kf − Ef f + Ef f − fkq (3) ˆ ˆ ˆ ≤ Ef kf − Ef fkq + Ef kEf f − fkq (4) = Ef kskq + kbkq (5) 1 q q ≤ (Ef kskq) + kbkq (6)

where we have used the triangle inequality and Jensen’s inequality. As a result, we can decompose the Lq 1 q q risk into the bias term kbkq and the variance term (Ef kskq) separately.

1 3 Function Estimation in H¨olderBalls

First we assume that the function class F is the H¨olderball with smoothness parameter s, and construct a sound estimator fˆ for f in this case.

3.1 Introduction to H¨olderBalls The H¨olderball Hs(L) consists of all functions which are H¨oldersmooth of order s > 0, i.e., for s = m + α with m ∈ N, α ∈ (0, 1],

( (m) (m) ) s |f (x) − f (y)| H (L) = f ∈ C[0, 1] : sup α ≤ L . (7) x6=y∈[0,1] |x − y| In other words, the H¨olderball consists of functions which are “smooth” of a certain smoothness order.

Exercise 1. Show that for s > 1, if we use m = 0, α = s in the definition (7), then the only function which satisfies the new definition is the constant function.

3.2 Kernel–based Estimator Now we consider how to find an estimator fˆ of f. First suppose that there is no noise, i.e., σ = 0. In this case, we know perfectly the value of f on points xi, but do not know the value of x ∈ (xi, xi+1) in between. However, since we know that the function f is smooth, it may be tempted to use interpolation to obtain f(x), e.g.,

ˆ xi+1 − x x − xi f(x) = f(xi) + f(xi+1). (8) xi+1 − xi xi+1 − xi In general, we may write ˆ X f(x) = w(x, xi)f(xi) (9)

i:xi∈Ih(x)

for some weight function w(x, xi), where Ih(x) , [x−h, x+h] is a small “window” around x with bandwidth h (to be specified later). Also in the noisy case where only yi is available, we may construct ˆ X f(x) = w(x, xi)yi. (10)

i:xi∈Ih(x) Note that for simplicity we have ignored the boundary effect when x is close to zero or one. Now we specify our choice of the weight function. A natural requirement is that P w(x, x ) = 1 i:xi∈Ih(x) i for any x, i.e., the weights should sum into one. This requirement can be generalized in the following way: when f(x) = xk for k = 0, ··· , m, the estimator in the LHS of (9) should be equal to the true value f(x) = xk. In other words, we require that

X k k w(x, xi)xi = x , ∀k = 0, 1, ··· , m. (11)

i:xi∈Ih(x) Note that as long as nh → ∞, (11) can be satisfied since we have 2nh degrees of freedom while there are only m + 1 linear constraints. The following lemma shows some properties of the weights: Lemma 2. For any x ∈ [0, 1], there exists some weight function w(x, ·) which satisfies (11) and 1 kw(x, ·)k1 . 1, kw(x, ·)k2 . √ . (12) nh 1 As a sanity check, just consider the case where m = 0 and w(x, xi) = 2nh for any xi ∈ Ih(x).

2 3.3 Performance Analysis Now we analyze the performance of the estimator in (10). By the bias–variance decomposition, it suffices to deal with the variance and the bias separately. For the variance (stochastic error), we have

ˆ ˆ s(x) = f(x) − Ef f(x) (13) X X = w(x, xi)yi − w(x, xi)Eyi (14)

i:xi∈Ih(x) i:xi∈Ih(x) X X = w(x, xi)(f(xi) + σξi) − w(x, xi)f(xi) (15)

i:xi∈Ih(x) i:xi∈Ih(x) X = σ w(x, xi)ξi (16)

i:xi∈Ih(x) 2 2 ∼ N (0, σ kw(x, ·)k2). (17)

2 q q q Since for X ∼ N (0, σ ), we have E|X| . σ for any q ∈ [1, ∞), for the stochastic error we have E|s(x)| . q kw(x, ·)k2 (suppressing the dependence on σ), and thus

1 1 q q (Ekskq) . kw(x, ·)k2 . √ (18) nh where in the last step we have used Lemma 2. Now we take a look at the bias. By definition,

ˆ X b(x) = Ef f(x) − f(x) = w(x, xi)f(xi) − f(x). (19)

i:xi∈Ih(x)

By our choice of the weights, we know that for any polynomial P (x) of degree at most m, we have

X |b(x)| = w(x, xi)(f(xi) − P (xi)) − (f(x) − P (x)) (20)

i:xi∈Ih(x)

≤ (kw(x, ·)k1 + 1) · kf − P k∞,Ih(x) (21)

. kf − P k∞,Ih(x) (22)

where kfk∞,I , ess sup{|f(x)| : x ∈ I}. Since this inequality holds for any P , we further have

|b(x)| . inf kf − P k∞,Ih(x) (23) P ∈Polym

where Polym denotes the class of all polynomials of degree at most m. Hence, to control the bias, it suffices to find a good polynomial approximation of f in the interval Ih(x). Due to the definition of the H¨olderball, a natural choice of the polynomial is the Taylor expansion polynomial, i.e.,

m X f (k)(x) P (y) = (y − x)k. (24) k! k=0

3 By Taylor’s theorem and the Lagrange remainder term, for any y ∈ Ih(x) we have

m X f (k)(x) |f(y) − P (y)| = f(y) − (y − x)k (25) k! k=0 |f (m)(ξ) − f (m)(x)| = · |y − x|m (26) m! L|ξ − x|α ≤ · |y − x|m (27) m! Lhα ≤ · hm (28) m! s . h (29)

s where again we suppress the dependence on L. As a result, we have |b(x)| . h for any x ∈ [0, 1], and thus s kbkq . h . (30) Combining the bias and variance together, we have

ˆ 1 s sup Ef kf − fkq . √ + h . (31) f∈Hs(L) nh

− 1 Setting these two terms to be equal yields h  n 2s+1 , and thus

s ˆ − 2s+1 sup Ef kf − fkq . n . (32) f∈Hs(L)

In fact, this bound turns out to be optimal:

Theorem 3. For any s > 0 and q ∈ [1, ∞), the minimax Lq risk in estimating f is given by

s ˆ − 2s+1 inf sup Ef kf − fkq  n . (33) fˆ f∈Hs(L)

− 1 Moreover, the kernel-based estimator in (10) with bandwidth h  n 2s+1 attains the minimax risk within multiplicative constants.

3.4 Discussions We briefly discuss the previous result. The most important point is on the bias–variance tradeoff: when the bandwidth h increases, the deterministic error b(x) gets larger and the stochastic error s(x) gets smaller. Specifically, we have

q 1 1 (E|s(x)| ) q . √ , (34) nh

|b(x)| . inf kf − P k∞,Ih(x). (35) P ∈Polym The reason is also straightforward intuitively: as the bandwidth h increases, the error incurred by using f(x ± h) to approximate f(x) gets larger, i.e., the bias gets larger. In contrast, the number of the points to be averaged over (i.e., 2nh) gets larger, which yields to a smaller variance. Finally, we need to choose a suitable bandwidth h∗ to balance the bias and variance, as is shown in the following figure 1. ˆ The second point to note is that the overall estimator f is linear in the observations y1, ··· , yn. This is called a linear estimator. Theorem 3 shows that, for nonparametric function estimation over H¨olderballs, linear estimator can attain the minimax risk within constants. Then a natural question arises: can linear

4 Figure 1: The bias-variance tradeoff.

approaches always attain the minimax risk in nonparametric function estimation for other natural function classes F? We will show in the next section that the answer is no in general. The third point is more technical and high-level: in choosing the weight function, we require that our weight should keep all polynomials of degree at most m. Some more thoughts lead to the following question: why are polynomials so special here? Can we choose other basis functions? To answer this question, we need to introduce the Kolmogorov n-width: Definition 4 (Kolmogorov n-width). Let (X, k · k) be a normed vector space, and K ⊂ X be a compact subset. The Kolmogorov n-width of K is defined by

dn(K) , dn(K, k · k) , inf sup inf kx − yk (36) V ⊂X x∈K y∈V

where the first infimum is taken over all linear subspaces V ⊂ X of dimension at most n. The following lemma shows the Kolmogorov n-width of the H¨olderball: Lemma 5. For the H¨olderball Hs(L) with smoothness parameter s > 0, its Kolmogorov n-width is

s −s dn(H (L))  n (37)

and is attained by the polynomial basis V = span{1, x, ··· , xn−1}. Now we take a look at the bias (35), which shows that the bias is determined by the approximation error of the basis functions we choose. As a result, we should choose the basis functions which attain the Kolmogorov n-width, and by the previous lemma we know that polynomial basis is the right basis to use here. In general, we would like to remark another important point here: the choice of basis is very important!! We will see more examples in the next lecture.

5 4 Function Estimation in Sobolev Balls

Next we consider the same regression problem in Sobolev balls, which is a generalization of the H¨olderball in the sense that some spatial inhomogeneity is allowed here. We will show that in certain scenarios, any linear approach fails to give the optimal minimax risk.

4.1 Introduction to Sobolev Balls Sobolev space is another space which naturally measures the smoothness of a function, and plays an important k,p role in approximation theory and partial differential equations. Specifically, the (1D) Sobolev ball W1 (L) consists of all functions f on [0, 1] such that

(k) kf kp ≤ L (38) where k > 0 is an integer, and p ∈ [1, ∞]. Technically, we remark that the derivative is defined in terms of distributions (not the classical derivative) and exists for any function. Moreover, to ensure a continuous k,p 1 embedding W1 (L) ⊂ C[0, 1], we need an additional assumption k > p . To see a counterexample, consider 1 1 0 1 0 1,1 f(x) = (x ≥ 2 ), we have f (x) = δ(x − 2 ) with kf k1 = 1, so f ∈ W1 (1) but f is not continuous. We also k,p note that the “effective” smoothness s of W1 (L) will become 1 s = k − . (39) p

4.2 Performance Analysis of Linear Approaches Let’s consider the performance of the linear approach in (10), where the weight function keeps all polynomial of degree at most k − 1. By the bias-variance tradeoff, it suffices to deal with these two quantities separately. For the variance, (34) does not depend on the specific function class F, so we still have

1 1 q q (Ekskq) . √ . (40) nh For the bias, the polynomial approximation error incurred by the Taylor expansion polynomial will change for the Sobolev ball. In fact, by the same argument, for any y ∈ Ih(x) we have

k−1 (k−1) (k−1) |f(y) − P (y)| . h · |f (ξ) − f (x)| (41) Z ξ ≤ hk−1 |f (k)(z)|dz (42) x 1 1− 1 Z ξ ! p Z ξ ! p ≤ hk−1 |f (k)(z)|pdz 1dz (43) x x

1 Z ! p k− 1 (k) p ≤ h p · |f (z)| dz . (44) Ih(x) As a result, we have

 1 q 1 ! p Z 1 Z q k− p (k) p kbkq . h · |f (z)| dz  dx (45) 0 Ih(x)

q Z 1 Z ! p (k− 1 )q (k) p = h p · |f (z)| dz dx. (46) 0 Ih(x)

6 To upper bound this quantity, we need to employ the Sobolev ball condition. Note that without the exponent q/p, we have

Z 1 Z ! ZZ Z 1 (k) p (k) p (k) p p |f (z)| dz dx = |f (z)| dxdz ≤ 2h |f (z)| dz ≤ 2hL . h. (47) 0 Ih(x) |x−z|≤h 0

In other words, we would like to find the maximum of some integral with exponent q/p given that this integral without the exponent is a fixed constant. Recall the following fact:

Exercise 6. For non-negative reals a1, ··· , an, show that

( 1−r r r r r n (a1 + a2 + ··· + an) if r ∈ [0, 1], a1 + a2 + ··· + an ≤ r (48) (a1 + a2 + ··· + an) if r > 1.

Using this fact (and after some analysis), we can obtain that

( kq q h if q ≤ p, kbkq . (k− 1 + 1 )q (49) h p q if p < q < ∞.

Now upon choosing the optimal bandwidth h, we arrive at

 − k n 2k+1 if q ≤ p, ˆ k− 1 + 1 sup Ef kf − fkq . − p q (50) k,p 2(k− 1 + 1 )+1 f∈W1 (L) n p q if p < q < ∞.

It turns out that this is the best performance we can hope for using linear approaches:

1 Theorem 7. For any p ∈ [1, ∞], k > p and q ∈ [1, ∞), the minimax linear risk in estimating f over Sobolev k,p ball W1 (L) is given by

 − k n 2k+1 if q ≤ p, ˆlin k− 1 + 1 inf sup Ef kf − fkq  − p q (51) fˆlin k,p 2(k− 1 + 1 )+1 f∈W1 (L) n p q if p < q < ∞.

where the infimum is taken over all linear estimators.

4.3 Optimal Performance One may wonder whether the performance of the best linear estimator in Theorem 7 is optimal even if we allow for non-linear estimators. The answer is yes when q ≤ p, and is no otherwise. This can be seen intuitively with the help of the previous exercise: 1. When q ≤ p, the bias-maximizing function f is non-zero everywhere in [0, 1]. In this case, the function f is almost homogeneous, and it is fine to use the same bandwidth h everywhere. This is called the “regular” regime. 2. When p < q < ∞, the bias-maximizing function f is only supported on some interval of length h, and vanishes outside. As a result, if we knew this supporting interval, we can simply set the bandwidth outside this interval to be Θ(1) (otherwise we are accumulating noise but no signal!), and the variance becomes

1 1 1 q q q (Ekskq) . √ · h , (52) nh

7 which is smaller than the case where we use the same bandwidth everywhere. Combining the bias k− 1 + 1 bound kbkq . h p q , the optimal bandwidth will result in the error

k− 1 + 1 − p q 2(k− 1 )+1 ˆ p sup Ef kf − fkq . n . (53) k,p f∈W1 (L) This is called the “sparse” regime. The previous insights explain the reason why linear approaches cannot always obtain the minimax risk, and remark that different bandwidths should be used at different areas. The exact minimax risk is given in the following theorem:

1 Theorem 8. For any p ∈ [1, ∞], k > p and q ∈ [1, ∞], the minimax risk in estimating f over Sobolev ball k,p W1 (L) is given by

 − k n 2k+1 if q ≤ (2k + 1)p, ˆ k− 1 + 1 inf sup Ef kf − fkq  − p q (54) ˆ k,p 2(k− 1 )+1 f f∈W (L)  n p 1 ( log n ) if (2k + 1)p < q ≤ ∞. where the infimum is taken over all possible estimators.

5 Adaptive Bandwidth: Lepski’s Trick

As is shown in the previous section, we need an adaptive bandwidth when: 1. there is spatial homogeneity: the function f may behave differently at different points, and we should choose a suitable bandwidth to achieve the bias-variance tradeoff “locally”;

2. some parameters, e.g., s, k, p, are unknown: previously our choice of the optimal bandwidth relies on the parameters s, k, p. If these parameters are unknown, we need to select a good bandwidth based on empirical data we have seen. Let’s first take a look at what we have already known. For any x ∈ [0, 1], consider the linear estimator ˆ fh in (10) with bandwidth h ≡ h(x) to estimate f(x). Denote by bh(x), sh(x) the corresponding bias and variance of this estimator with bandwidth h, then we would like to find some optimal bandwidth h∗ ≡ h∗(x) to balance the bias and variance1:

bh∗ (x) = sh∗ (x). (55)

The problem is that we do not know anything about the LHS other than the monotonicity of bh(x) in h, q log n and for the RHS we also only know that |sh(x)| . nh with overwhelming probability. Let’s just define r ! log n s (x) Θ . (56) h , nh

1 strictly speaking, bh(x), sh(x) are not the exact deterministic and stochastic errors, but suitable upper bounds (e.g., (35), (34)) of these quantities.

8 5.1 Lepski’s Selection Rule Surprisingly, the monotonicity of the bias suffices to provide the knowledge to choose the bandwidth adap- tively. Here is Lepski’s selection rule: for any x ∈ [0, 1], define the set2

ˆ ˆ 0 A(x) , {h ≥ 0 : |fh(x) − fh0 (x)| ≤ 4sh0 (x), ∀h ∈ (0, h)} (57) consisting of “admissible” bandwidths, and then choose the bandwidth hˆ(x) to be

ˆ h(x) , max A(x). (58)

Finally, the overall estimator fˆLep is given by

ˆLep ˆ f (x) = fhˆ(x)(x). (59) Note that we only add a little bit “non-linearity” here: we still use a kernel-based estimator at any point, but the bandwidth differs spatially. Moreover, the overall estimator does not require the knowledge of any parameters (except for an upper bound on the smoothness parameter, for we need to fix a weight function (kernel) to keep all polynomials of degree up to the smoothness parameter).

5.2 Intuitions of Lepski’s Trick Let’s explain intuitively why Lepski’s trick will work, i.e., we will have ˆ ˆ |fhˆ (x) − f(x)| . |fh∗ (x) − f(x)|. (60) The analysis decomposes into two steps: 1. We first show that h∗ ∈ A(x). It suffices to check the definition: when h < h∗, we have

ˆ ˆ ˆ ˆ |fh∗ (x) − fh(x)| ≤ |fh∗ (x) − f(x)| + |fh(x) − f(x)| (61)

≤ bh∗ (x) + sh∗ (x) + bh(x) + sh(x) (62)

≤ sh∗ (x) + sh∗ (x) + sh(x) + sh(x) (63)

≤ sh(x) + sh(x) + sh(x) + sh(x) (64)

= 4sh(x). (65)

Hence, h∗ satisfies the condition, and h∗ ∈ A(x).

2. Next we show that hˆ indeed works. Since hˆ = max A(x) and h∗ ∈ A(x), we have hˆ ≥ h∗, and by the definition of A(x) we know that

ˆ ˆ ˆ ˆ |fhˆ (x) − f(x)| ≤ |fhˆ (x) − fh∗ (x)| + |fh∗ (x) − f(x)| (66) ≤ 4sh∗ (x) + bh∗ (x) + sh∗ (x) (67)

≤ 6sh∗ (x) (68)

as desired. 2strictly speaking, we should require that h0 take value in a finite fine grid of [0, 1] instead of the continuum.

9 5.3 Performance Analysis of Lepski’s Trick The performance of Lepski’s estimator fˆLep is summarized in the following theorem:

1 ˆLep Theorem 9. For any p ∈ [1, ∞], k > p and q ∈ [1, ∞], the Lepski’s estimator f satisfies

 n − k ( ) 2k+1 if q ≤ (2k + 1)p,  log n ˆLep k− 1 + 1 sup Ef kf − fkq  − p q (69) k,p 2(k− 1 )+1 f∈W (L)  n p 1 ( log n ) if (2k + 1)p < q ≤ ∞. Compared with Theorem 8, we see that this estimator is optimal in estimating function over Sobolev balls within logarithmic factors. In fact, this logarithmic factor is unavoidable if we want our estimator to be adaptive for any smoothness parameters and any not-too-small subset of [0, 1], and is the price we need to pay for adaptivity.

10 EE378A Statistical Signal Processing Lecture 15 - 05/23/2017 Lecture 15: Shrinkage Lecturer: Yanjun Han Scribe: Yanjun Han

1 Recap

In the last lecture, we considered the nonparametric function estimation problem in the regression setting. In particular, we showed that the kernel-based estimator with a suitable-chosen adaptive bandwidth essentially achieves the minimax risk over H¨olderand Sobolev balls. Specifically, the estimator is constructed as follows:

1. Fix some kernel (or weight function) with suitable regularity conditions (e.g., keep polynomials up to ˆ a prescribed degree), and use this kernel to construct a linear estimator fh(x) for any point x ∈ [0, 1] and bandwidth h > 0;

2. For any point x ∈ [0, 1], use some rule (e.g., Lepski’s trick) to choose an adaptive bandwidth hˆ(x); ˆ 3. Finally, for any point x ∈ [0, 1], use fhˆ(x)(x) as the estimator for f(x). We also recall the Lepski’s trick: for any x ∈ [0, 1], we pick up a set consisting of “admissible” bandwidths:

ˆ ˆ 0 A = {h ∈ [0, 1] : |fh(x) − fh0 (x)| ≤ 4sh0 (x), ∀h ∈ (0, h)} (1) and then choose hˆ(x) = max A. We validate this choice via two steps: we show that the optimal bandwidth h∗ which balances the bias and variance locally belongs to this set, and then we show that hˆ also works by relating hˆ to the unknown h∗. This idea can also be generalized to high-dimensional cases with caution, and you may look at Homework 6 for details. The main insights of the previous approach are two-fold: 1. The bias-variance tradeoff should be understood and analyzed carefully: this step determines the choice of the optimal bandwidth; 2. “A little bit” non-linearity needs to be added to the estimator: we have shown in the previous lecture that all linear approaches may fail to give the order-optimal risk in certain models, while an “almost” linear estimator with “a little bit” non-linearity can succeed. In the previous example, we almost use a linear kernel-based estimator, and the only additional non-linearity is to choose the bandwidth differently at different points. In this lecture, we will attack the same problem using a different approach called wavelet shrinkage, where we can see the same phenomena from a different viewpoint.

2 Gaussian White Noise Model and Change of Basis

In the last lecture we have looked at the regression problem in Gaussian noise, and also remarked that the same idea can also be used in the setting, where the kernel-based estimator is called the KDE (Kernel Density Estimator). In this lecture we will look at the Gaussian White Noise Model, and remark that this is essentially the same as the previous two models.

1 2.1 Equivalences between Models We call two statistical models are equivalent if they can almost simulate each other. An equivalent formulation ˆ is that, for any bounded loss function and any objective to be estimated, if there is an estimator f1(x) for ˆ one model, we can construct another estimator f2(y) for the other model whose estimation performance (i.e., ˆ the risk) is almost no worse than that of f1(x) under any true parameter θ ∈ Θ, and vice versa. Intuitively, if two models are equivalent, for estimation purposes it suffices to look at any one of them. For a rigorous treatment, we refer interested readers to the Le Cam distance1. We will consider four different models in the context of nonparametric estimation:

n i 1. Regression Model: we observe n samples {yi}i=1 with yi = f(xi) + σξi for 1 ≤ i ≤ n, where xi = n i.i.d and ξi ∼ N (0, 1);

2. Gaussian White Noise Model: we observe a process (Yt)t∈[0,1] with

Z t σ Yt = f(s)ds + √ Bt, t ∈ [0, 1] (2) 0 n

where (Bt)t∈[0,1] is the standard Brownion Motion on [0, 1]. This model can also be written in the √σ stochastic differential equation (SDE) form as dYt = f(t)dt + n dBt;

i.i.d 3. Density Estimation Model: we observe n i.i.d samples y1, ··· , yn ∼ g(·), where the density g is supported on [0, 1];

4. Poisson Process Model: we observe a Poisson process (Yt)t∈[0,1] with a time-varying intensity ng(·), where the density function g is supported on [0, 1]. In each model, there is some unknown function/density treated as the unknown parameter, and we observe discrete samples or a stochastic process. Suppose that the is f ∈ F, where F is some function class possessing certain order of smoothness. The main result is that these four models are equivalent: Theorem 1. 2 3 Under certain technical conditions, these four models are asymtotically equivalent as n → 2 1 ∞, with g = f , σ = 2 when talking about the last two models.

2.2 Change of Basis Due to the model equivalence, in this lecture we consider the Gaussian white noise model σ dY = f(t)dt + √ dB , t ∈ [0, 1] (3) t n t

and our target is to find an estimator fˆ which comes close to minimizing the worst-case squared-error risk

ˆ 2 sup Ef kf − fk2. (4) f∈F

Now we take a look at how the model and the loss function behave after we transform the problem into a different domain. 1see, e.g., F. Liese and K. J. Miescke, Statistical Decision Theory: Estimation, Testing and Selection. Springer, 2008. 2L. D. Brown, and M. G. Low. Asymptotic equivalence of and white noise. The Annals of Statistics, 24(6), pp. 2384–2398, 1996. 3L. D. Brown, A. V. Carter, M. G. Low, and C. H. Zhang, Equivalence theory for density estimation, Poisson processes and Gaussian white noise with drift. The Annals of Statistics, 32(5), pp. 2074–2097, 2004.

2 ∞ 2 Let (φj)j=1 be an orthonormal basis of L [0, 1], then we can represent the function f via its coefficients ∞ θ = (θj)j=1, where Z 1 θj , φj(t)f(t)dt. (5) 0 Note that the restriction f ∈ F will be transformed into the condition θ ∈ Θ for some proper parameter set Θ. Also, for any estimator fˆ, we can also represent it by Z 1 ˆ ˆ θj , φj(t)f(t)dt. (6) 0 As for the observation, we may define Z 1 yj , φj(t)dYt (7) 0 Z 1  σ  = φj(t) f(t)dt + √ dBt (8) 0 n Z 1 σ Z 1 = φj(t)f(t)dt + √ φj(t)dBt (9) 0 n 0

≡ θj +  · ξj (10)

√σ where  , n denotes the noise level, and Z 1 ξj , φj(t)dBt. (11) 0

∞ i.i.d Exercise 2. Use the orthonormality of (φj)j=1 to conclude that ξj ∼ N (0, 1).

By the previous exercise, we know that the Gaussian white noise model reduces to the following Gaussian ∞ sequence model under the basis (φj)j=1: σ y = θ + ξ , θ = (θ , θ , ··· ) ∈ Θ,  = √ , ξ i.i.d∼ N (0, 1). (12) j j j 1 2 n j ˆ ˆ ˆ ˆ ˆ Also, kθ − θk2 = kf − fk2 by Parseval’s identity, our target becomes to find some θ = (θ1, θ2, ··· ) which comes close to minimize ∞ ˆ 2 X ˆ 2 sup Eθkθ − θk2 = sup Eθ(θj − θj) . (13) θ∈Θ θ∈Θ j=1 As a result, under the basis change, we operate in a Gaussian sequence model and would like to estimate the mean vector simultaneously.

2.3 Projection Estimator and Bias-Variance Tradeoff First let’s see how the bias-variance tradeoff will look at after the change of basis. Since the observation itself is a natural estimator for the mean in the Gaussian model, a natural candidate to solve the Gaussian sequence model (12) is the projection estimator: ( ˆ yj, if 1 ≤ j ≤ m, θj = (14) 0, if j > m.

3 It’s easy to see that

( 2 ˆ 2  , if 1 ≤ j ≤ m, Eθ(θj − θj) = 2 (15) θj , if j > m.

As a result, we have

ˆ 2 2 X 2 Eθkθ − θk2 = m + θj . (16) j>m

Now we have the bias-variance tradeoff in the Gaussian sequence model:

P 2 1. Bias: the second term j>m θj , which originates from throwing away the data in the tail. Note that when m increases, the bias will decrease;

2 ∞ 2. Variance: the first term m , which originates from the noise in the observation sequence (yj)j=1. Note that when m increases, the variance will increase. We can think of the threshold m as the reciprocal of the bandwidth h in the kernel-based method: as we know from , the bandwidth in the and that in the domain satisfies the uncertainty principle. Also, in this case, the relationship of the bias/variance in m corresponds to that of the bias/variance in h in the time domain. As a result, in the transformed domain, the bias-variance tradeoff depends on the cutting position of the observed sequence.

3 Besov Ball and Wavelet Basis

Before introducing how to add the non-linearity in the transformed domain, first we specify the choice of ∞ the function class F and the orthonormal basis (φj)j=1. Specifically, we will choose F to be the Besov ball s ∞ Bp,q(L), and (φj)j=1 to be the wavelet basis. To avoid technicality, we will treat them informally and only talk about the insights behind these concepts, and refer interested readers to the following reference:

• W. H¨ardle,G. Kerkyacharian, D. Picard, and A. Tsybakov, , approximation, and statistical applications. Springer Science & Business Media, Vol. 129, 2012.

3.1 Introduction to Besov Ball Like the H¨olderand Sobolev balls, the Besov ball is another ball which characterizes the smoothness of a function in a more delicate and complicated way. Specifically, for any s > 0, p, q ∈ [1, ∞], we may define a k · k s which is somehow (informally) close to Bp,q

(s) kfk s ≈ kf k (17) Bp,q p where f (s) is the order-s derivative of f. Note that s may not be an integer, but in the above definition we have a proper definition of a fractional derivative, and the parameter q also affects the definition of the derivative slightly. Intuitively, k · k s is another norm which characterizes the order-s smoothness. Bp,q s Naturally, the Besov ball Bp,q(L) is defined by

s B (L) {f : kfk s ≤ L}. (18) p,q , Bp,q

4 3.2 Introduction to Wavelet Basis The wavelet basis is an orthonormal basis which exploits the idea of multi-resolution analysis: any function is viewed from multiple resolutions. Specifically: 1. There is a father wavelet φ(x) and a mother wavelet ψ(x) on [0, 1]; 2. At level j and location k, we define

j j φjk(x) , 2 2 φ(2 x − k), (19) j j ψjk(x) , 2 2 ψ(2 x − k), (20)

with j ∈ N, 0 ≤ k ≤ 2j − 1. k k+1 −j Note that supp(φjk) = supp(ψjk) = [ 2j , 2j ], the level j characterizes the resolution 2 , and the parameter k characterizes the spatial location to look at.

1 1 1 1 1 Example 3. The Haar wavelet is defined by φ(x) = (x ∈ [0, 1]), ψ(x) = (x ∈ [0, 2 ]) − (x ∈ [ 2 , 1]). This is the first wavelet basis proposed in 1909.

Not all functions can be the father and wavelet wavelets. A crucial property (besides the orthonormality) for wavelets is that, defining

j Vj , span{φjk(x), 0 ≤ k ≤ 2 − 1} (21) j Wj , span{ψjk(x), 0 ≤ k ≤ 2 − 1} (22)

then for any j0 ∈ N we have 2 L [0, 1] = Vj0 ⊕ Wj0 ⊕ Wj0+1 ⊕ · · ·. (23) As a result, any f ∈ L2[0, 1] can be written as

2j0 −1 ∞ 2j −1 X X X f(x) = αj0kφj0k(x) + βjkψjk(x) (24)

k=0 j=j0 k=0 | {z } | {z } Gross Information Detail Information at level j

for some coefficients (αj0k), (βjk). The first term corresponds to the “gross information”, i.e., some informa- tion in the average sense, e.g., the average magnitude in a small interval. The second term corresponds to the “detail information”, i.e., some information related to the local change, e.g., how function oscillates in a small interval. We can view the detail in different scale/resolution, which is characterized by different levels j = j0, j0 + 1, ··· . The reason why we introduce the wavelet basis is that it is the right basis for the Besov ball, as is shown in the following theorem.

Theorem 4. Under certain regularity conditions on the wavelet basis, the Besov norm k · k s for function Bp,q is equivalent to k · k s for its wavelet coefficients, where bp,q

1 1   1 q q 2j0 −1  p ∞ 2j −1  p X p X j(s+ 1 − 1 ) X p kfkbs |αj k| +  2 2 p |βjk|   . (25) p,q ,  0        k=0 j=j0 k=0

j(s+ 1 − 1 ) We can also write it in a more compact form as kfk s = kα k + k2 2 p kβ k k , where the ` norm bp,q j0 p j p q p is taken with respect to k, and the `q norm is taken with respect to j.

5 Recall that we call two norms (X, k · k1), (X, k · k2) are equivalent if there exists universal constants c1, c2 > 0 such that c1kfk2 ≤ kfk1 ≤ c2kfk2 for any f ∈ X. Then the following corollary is immediate: Corollary 5. Defining

s Θ {θ = ((α ), (β )) : kθk s ≤ 1}, (26) p,q , j0k jk bp,q there exists some constants c1, c2 > 0 such that

s s s c1Θp,q ⊂ Bp,q(L) ⊂ c2Θp,q. (27)

s We remark that although the form of Θp,q is still quite complicated, we will make use of some crucial s properties of Θp,q to propose sound estimators and validate the fact that the wavelet basis is the right basis for the Besov space.

4 Thresholding and VisuShrink Estimator

In this section we present the thresholding idea to add the correct non-linearity, and thereby motivates the VisuShrink estimator. The reference for this section and the following ones is:

• I. Johnstone, Gaussian Estimation: Sequence and Wavelet Models. Online manuscript (http:// statweb.stanford.edu/~imj/GE09-08-15.pdf), September 2015.

4.1 Ideal Truncated Estimator Before we look into the Gaussian sequence estimation problem, first we gain some insights from the Gaussian mean estimation in the scalar case. Consider estimating the mean θ ∈ R in the following scalar model: y = θ + ξ, ξ ∼ N (0, 1) (28) where  > 0 is known, and the only assumption we impose on θ is that |θ| ≤ τ for some known τ. The target ˆ ˆ 2 is to find some estimator θ such that the Eθ(θ − θ) is small. The following insights are straightforward: 1. When τ is large, there is essentially no restriction on the parameter θ, and it is expected that the observation itself θˆ = y should be a near-optimal estimator. In fact, if τ = ∞, the natural estimator θˆ = y is the Uniformly Minimum Variance Unbiased Estimator (UMVUE) and also minimax. 2. When τ is small (e.g., much smaller than ), the signal θ is almost completely obscured by the noise. In this case, θˆ = 0 is expected to be a good estimate, since E(θˆ − θ)2 = θ2 ≤ τ 2 is really small. Based on these insights, we expect that either θˆ = y or θˆ = 0 can do a good job. As a result, we introduce the following concept of the ideal truncated estimator: suppose that there is a genie who knows the true parameter θ but is restricted to use θˆ = y or θˆ = 0, it is easy to see that the optimal estimator is ˆ θITE = y1(|θ| ≥ ) (29) whose mean squared error is given by

ˆ 2 2 2 Eθ(θITE − θ) = min{θ ,  }. (30) The following theorem shows that the ideal truncated estimator essentially attains the minimax risk for any τ ≥ 0:

6 Theorem 6. 4 For any τ ≥ 0, the ideal truncated estimator attains the minimax risk over |θ| ≤ τ within a multiplicative factor of 2.22:

2 2 ˆ 2 ˆ 2 min{τ ,  } = sup Eθ(θITE − θ) ≤ 2.22 · inf sup Eθ(θ − θ) . (31) |θ|≤τ θˆ |θ|≤τ

It’s straightforward to generalize this result to the sequence case. Consider the Gaussian sequence model (12) with θ ∈ R(τ), where τ = (τ1, τ2, ··· ) is a non-negative sequence and R(τ) is a orthosymmetric rectangle with one vertex τ:

R(τ) , {θ = (θ1, θ2, ··· ): |θi| ≤ τi, ∀i = 1, 2, ···}. (32) The following corollary is immediate:

Corollary 7. For any non-negative vector τ = (τ1, τ2, ··· ), for Gaussian sequence model (12) we have

∞ X 2 2 ˆ 2 ˆ 2 min{τi ,  } = sup EθkθITE − θk2 ≤ 2.22 · inf sup Eθkθ − θk2. (33) ˆ i=1 θ∈R(τ) θ θ∈R(τ)

4.2 Soft- and Hard-Thresholding Estimator Note that the ideal truncated estimator is not indeed an estimator since it requires the unavailable information θ. However, the previous corollary motivates us to find another estimator θˆ whose performance is comparable ˆ to that of θITE, which has been shown to be nearly minimax. Specifically, we want to find an estimator ˆ ˆ ˆ θ = (θ1, ··· , θm) such that in the Gaussian sequence model (12) of length m, the inequality

m ! ˆ 2 2 X 2 2 Eθkθ − θk2 ≤ something ×  + min{θi ,  } (34) i=1

holds for any θ ∈ Rm. Note that for minor technical reasons we need to have an additional 2 term in the RHS. Now we take a careful look at our requirement. As a sanity check, the RHS is really small when θ = 0, which forces the LHS to be small as well. In other words, when the true parameter θ is the zero vector, our estimator θˆ must be close to the zero vector as well. Prove the following result:

i.i.d 2 √ Exercise 8. For X1, ··· ,Xm ∼ N (0,  ), we have P(max1≤i≤m |Xi| ≥  2 log m) → 0 as m → ∞. ˆ ˆ √ Using this exercise, a natural constraint on the estimator θ can be that, θ = 0 whenever max1≤i≤m |yi| ≤  2 log m. This observation motivates us to do some type of thresholding: specifically, we can define the soft-thresholding and hard-thresholding functions as follows:

s ηt (y) = sign(y) · (|y| − t)+ (35) h 1 ηt (y) = y · (|y| ≥ t). (36) Note that these thresholding functions are close to truncation: when |y| is small the functions return zero, and when |y| is large the functions return something close to y. ˆs s s By acting coordinatewisely, we may also define the soft-thresholding estimator θt (y) = (ηt (y1), ··· , ηt (ym)) ˆh h h and the hard-thresholding estimator θt (y√) = (ηt (y1), ··· , ηt (ym)), and the previous analysis motivates us to choose the threshold t to be roughly  2 log m. The following theorem shows that the oracle inequality (34) holds for thresholding estimators:

4D. L. Donoho, R. C. Liu, and B. MacGibbon. Minimax risk over hyperrectangles, and implications. The Annals of Statistics, pp. 1416–1437, 1990.

7 √ 5 ˆs Theorem 9. For t =  2 log m, the soft-thresholding estimator θt (y) satisfies the following oracle inequal- ity in the Gaussian sequence model (12):

m ! ˆs 2 2 X 2 2 Eθkθt (y) − θk2 ≤ (2 log m + 1) ·  + min{θi ,  } . (37) i=1 √ ˆh The same result holds for the hard-thresholding estimator θt (y) with t =  2 log m + log log m.

4.3 VisuShrink Estimator s Now we are about to describe the VisuShrink estimator for the Gaussian sequence model (12) with Θ = Θp,q. j The parameters in this model are θ = ((αj0k), (βjk): j ≥ j0, 0 ≤ k ≤ 2 − 1), and we rewrite this vector as θ = (θ1, θ2, ··· ). The VisuShrink estimator does the following: ( ηs(y ) if i ≤ m θˆVISU = t i (38) i 0 if i > m √ where m is a parameter to be chosen later, and the threshold t is chosen to be t =  2 log m. Note that compared with the projection estimator, the only difference is that when i ≤ m we replaced the raw data yi s √ by its soft-thresholding ηt (yi). Similarly, we can do hard-thresholding as well with t =  2 log m + log log m according to Theorem 9. Now we’re about to analyze the performance of the VisuShrink estimator. According to Theorem 9, we know that m ! ˆVISU 2 2 X 2 X 2 Eθkθ − θk2 ≤ (2 log m + 1)  + min{θi,  } + |θi| (39) i=1 i>m ∞ ! 2 X 2 X 2 ≤ (2 log m + 1)  + min{θi,  } + |θi| . (40) i=1 i>m s −A By the definition of Θp,q, by choosing m =  for some large enough constant A > 0, we may have the tail P 2 2 bound: sup s |θi|   . As a result, for this choice of m, we have θ∈Θp,q i>m ∞ 1 X 1 kθˆVISU − θk2 log( ) · min{θ , 2} + 2 log( ). (41) Eθ 2 .  i  i=1 s Now we come to the crucial property of the parameter set Θp,q: it is solid and orthosymmetric. Equivalently, s this means that for any θ ∈ Θp,q, the orthosymmetric hyperrectangle R(|θ|) with a vertex θ is contained s again in the parameter set Θp,q. As a result, by Corollary 7 we know that ∞ 1 X 1 kθˆVISU − θk2 log( ) · min{θ , 2} + 2 log( ) (42) Eθ 2 .  i  i=1 1 ˆ 0 2 2 1 ≤ 2.22 log( ) · inf sup Eθ0 kθ − θ k2 +  log( ) (43)  θˆ θ0∈R(|θ|)  1 ˆ 0 2 2 1 ≤ 2.22 log( ) · inf sup Eθ0 kθ − θ k2 +  log( ) (44)  ˆ 0 s  θ θ ∈Θp,q s where in the last inequality we have used the property R(|θ|) ⊂ Θp,q. As a result, the VisuShrink estimator attains the minimax risk within a logarithmic factor. s s Finally, we note the relationship between Θp,q and Bp,q(L), and summarize the VisuShrink estimator as follows: 5D. L. Donoho, and I. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. biometrika, pp. 425–455, 1994.

8 1 1. Fix some initial level j0 (which is of the constant order) and termination level j  log(  );

2. Transform the observation process (Yt)t∈[0,1] to the wavelet domain with initial level j0, and obtain the corresponding α-coefficients and β-coefficients empirically; 3. Use the following procedure to obtain new coefficient estimates:

(a) For α-coefficients (which are only at level j0), keep them all;

(b) For β-coefficients, for j > j discard them all, and for j0 ≤ j ≤ j apply the thresholding estimator (either soft or hard one) with suitable threshold given in Theorem 9 to the observation vector.

4. Transform the estimated wavelet coefficients back to the function space, and obtain fˆVISU.

The property of the VisuShrink estimator fˆVISU is summarized in the following theorem: Theorem 10. 6 For any s > 0, p, q ∈ [1, ∞], the VisuShrink estimator fˆVISU attains the minimax risk over s Besov balls Bp,q(L) within logarithmic factors:

ˆVISU 2 1 ˆ 2 sup Ef kf − fk2 . log( ) · inf sup Ef kf − fk2. (45) s  ˆ s f∈Bp,q (L) f f∈Bp,q (L)

4.4 Discussions We make some discussions on the previous VisuShrink estimator. First of all, the VisuShrink estimator is almost a linear estimator (i.e., close to the projection estimator), while there is also a little bit non-linearity here, i.e., the thresholding idea. We can think of the thresholding approach as a selector which selects the coefficients to keep in a data-dependent manner: when the empirical coefficient is large, we expect it to be useful signal and keep it; when the empirical coefficient is small, we expect it to be the noise and discard it. We can compare this idea with Lepski’s trick to deal with the sparse regime we defined in the last lecture: in the sparse regime, the signal is supported on a small interval and all others are noise. In this case, Lepski’s trick selects a large bandwidth in the noise regime to essentially neglect all noise, and the VisuShrink estimator simply selects the peak and neglects all others of the transformed signal in the wavelet domain. Secondly, the VisuShrink estimator employs the shrinkage idea, which means that reduce the variance significantly with a little bit increase on the bias in statistics. Actually, this is where the term “shrink” in the name “VisuShrink” comes from. Specifically, compared with the projection estimator which keeps the raw observation, the thresholding idea incurs a larger bias (note that the previous one is indeed unbiased!), while 2 2 2 reduces the variance significantly (e.g., from  to min{ , θi } per symbol). We remark that the variance is a main issue in the nonparametric function estimation, while we will see in the next lecture that in functional estimation problems, bias becomes dominant! s Thirdly, by inspecting the proof we find that we do not use any specific properties of Θp,q (which is of a complicated form) other than that it is solid and orthosymmetric. Also, we prove that the VisuShrink estimator is nearly minimax without even figuring out what the minimax risk is. These observations indicate that the geometry of the parameter set is really important, and the reason why we choose the wavelet basis for the Besov ball is that the Besov ball becomes solid and orthosymmetric in the wavelet domain! In other words, for any orthonormal basis (φi)i∈I , the VisuShrink idea still works as long as the associated parameter set Θ is solid and orthosymmetric in the transformed space. In fact, this property requires that the basis be an unconditional basis: 6D. L. Donoho, and I. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. biometrika, pp. 425–455, 1994.

9 Definition 11 (Unconditional Basis). An orthonormal basis (φi)i∈I is an unconditional basis of the real normed vector space (X, k · k) if and only if there exists a universal constant C > 0 such that

X X iφi ≤ C φi (46) i∈J i∈J holds for any finite J ⊂ I. The main messages are that: 1. Unconditional basis is the optimal basis in nonparametric function estimation;

2 2. Wavelet basis is an unconditional basis for the Besov norm (L [0, 1], k·k s ) (and Fourier basis is not). Bp,q Finally, we remark that by construction, the VisuShrink estimator does not require the knowledge of parameters s, p, q, L and is thus an adaptive estimator. Similar to the Lepski’s estimator, the only knowledge the VisuShrink requires is an upper bound of the smoothness parameter s, for the termination level j depends on this upper bound.

5 Thresholding and SureShrink Estimator

In the previous section we have validated the thresholding idea by proving an oracle inequality (Theorem s 9), and use the geometry of the parameter set Θp,q to relate the risk of the ideal truncated estimator to the minimax risk. In this section we will validate the thresholding idea from a different viewpoint, and introduce the resulting SureShrink estimator.

5.1 Gaussian Mean Estimation over `p Balls with `q Error

Consider the Gaussian sequence estimation problem (12) with a simpler parameter set: Θ = {θ : kθkp ≤ R} is the `p ball. Also, instead of the mean squared error loss, we consider the `q loss as the general loss functions where p ∈ (0, ∞], q ∈ [1, ∞). The question is that: for this simple example, which estimator is nearly minimax under different parameter configurations (p, q, R, )? We first begin with some insights. When R is large (or equivalently  is small), the constraint on θ is quite loose, and thus we should use an estimator close to the natural one (i.e., the empirical observation). When R is small ( is large), the vector is close to zero and we may directly apply a zero estimator. When p ∈ (0, ∞] is small, we know that the parameter θ is somehow quite sparse, and thus the resulting estimator should have many zero entries. In contrast, when p is large, the parameter θ can be quite dense, and the natural estimator is expected to work here. ˆ ˆ s As a result, if we treat the natural estimator θ = y as the thresholding estimator θ = ηt (y) with threshold ˆ ˆ s t = 0, and the zero estimator θ = 0 as the thresholding estimator θ = ηt (y) with threshold t = ∞, it seems that the thresholding estimator with some suitable threshold should work. This intuition turns out to be correct, which is shown in the following theorem:

7 Theorem 12. For most parameter configurations (p, q, R, ), there exists a universal constant Cp,q > 0 such that for the Gaussian sequence model (12) with Θ = {θ : kθkp ≤ R},

s q ˆ q inf sup Eθkηt (y) − θkq ≤ Cp,q · inf sup Eθkθ − θkq. (47) t θ∈Θ θˆ θ∈Θ

h The same result also holds for the hard-thresholding estimator ηt (y).

7 D. L. Donoho, and I. M. Johnstone. Minimax risk over `p-balls for `q-error. Probability Theory and Related Fields, 99(2), pp. 277–303, 1994.

10 s The implication of Theorem 12 to our case is that: although the parameter space Θp,q is of a very complicated form, it only involves the the combination of `p and `q norms! Hence, we expect that the s thresholding idea also works in our case over Θp,q. Specifically, we consider the following estimator:

1. Transform the observation to wavelet coefficients starting from initial level j0 (of a constant order); 2. Keep all father wavelet coefficients, and for each level j, apply the (soft- or hard-)thresholding estimator to the mother wavelet coefficients with threshold tj; 3. Transform back the coefficients into functions to yield fˆ. ˆ We write the resulting estimator as ft, where t = (tj0 , tj0+1, ··· ) is the threshold sequence. Based on the previous insights and with the help of Theorem 12, the following result holds: Theorem 13. 8 For the nonparametric function estimation over Besov balls, the thresholding estimator with appropriate thresholds attains the minimax risk within a multiplicative factor: ˆ 2 ˆ 2 inf sup Ef kft − fk2 . inf sup Ef kf − fk2. (48) t=(tj ,tj +1,··· ) s ˆ s 0 0 f∈Bp,q (L) f f∈Bp,q (L)

5.2 SURE (Stein’s Unbiased Risk Estimate) The previous Theorem ensures that some thresholding estimator works, but does not specify which threshold ˆ we should choose. A na¨ıve thought is that, if we could compare the performances of ft with different t’s, then we should choose the one with the minimum error: ∗ ˆ 2 t = arg min f kft − fk2. (49) t E However, this approach is infeasible since we do not know the true function f. Despite this difficulty, the ˆ 2 good news is that we can still apply this idea and use an unbiased estimator of Ef kft − fk2 without knowing f. Now we start to illustrate the idea. Consider the Gaussian sequence model in (12) with length m, and fix any estimator θˆ(y) of θ. Note that g(y) = θˆ(y) − y only depends on y but not on θ, and ˆ 2 2 Eθkθ − θk2 = Eθkg(y) + y − θk2 (50) 2 2 T = Eθkg(y)k2 + Eθky − θk2 + 2Eθ[(y − θ) g(y)]. (51) Exercise 14. Prove Stein’s identity: for X ∼ N (µ, σ2) and any (weakly) differentiable function f, we have E[(X − µ)f(X)] = σ2E[f 0(X)].

By Stein’s identity, we further have ˆ 2 2 2 T Eθkθ − θk2 = Eθkg(y)k2 + Eθky − θk2 + 2Eθ[(y − θ) g(y)] (52) 2 2 2 = Eθkg(y)k2 + m + 2 Eθ[∇ · g(y)] (53)  2 2 = Eθ (m + 2∇ · g(y)) + kg(y)k2 . (54) As a result, we have the following definition of the Stein’s Unbiased Risk Estimator: Definition 15 (SURE). For the Gaussian sequence model in (12) with length m, then for any estimator ˆ ˆ θ(y) with a weakly differentiable g(y) , θ(y) − y, the Stein’s Unbiased Risk Estimator (SURE) is defined by SURE 2 2 r (y) , (m + 2∇ · g(y)) + kg(y)k2. (55) SURE ˆ 2 The SURE satisfies Eθ[r (y)] = Eθkθ − θk2 for any θ ∈ Θ. 8D. L. Donoho, and I. M. Johnstone. Minimax estimation via wavelet shrinkage. The annals of Statistics, 26(3), pp. 879–921, 1998.

11 5.3 The SureShrink Estimator With the SURE in hand, we may introduce the SureShrink estimator, which chooses the threshold sequence

t = (tj)j≥j0 based on the risk estimate. Specifically, on each level j ≥ j0, we randomly divide the index set {0, 1, ··· , 2j − 1} into two halves I,I0, and for the soft-thresholding we define

X 2 2 tI arg min (1 − 2 · 1(|yj,k| ≤ t)) + (min{t, |yj,k|}) (56) , t≥0 k∈I0 X 2 2 tI0 arg min (1 − 2 · 1(|yj,k| ≤ t)) + (min{t, |yj,k|}) (57) , t≥0 k∈I

where (yj,k)0≤k≤2j −1 are the empirical wavelet coefficients on level j. Then we apply the soft-thresholding 0 estimator with threshold tI to all yj,k for k ∈ I, and apply it with threshold tI0 to all yj,k for k ∈ I . The same idea can also be applied to the hard-thresholding estimator. We remark that the random sample splitting approach is purely for technical purposes (to gain independence), which is not necessary in practice. The ˆ SureShrink estimator is defined to be ftˆ with the thresholds given by the previous equation. The theoretical performance of the SureShrink estimator is summarized in the following theorem.

9 1 1 ˆSURE ˆ Theorem 16. For s > p − 2 , the SureShrink estimator f = ftˆ essentially performs as well as the op- s timal thresholding estimator, and thus attains the minimax risk over Besov balls Bp,q(L) within multiplicative constants:

ˆSURE 2 ˆ 2 sup Ef kf − fk2 ≤ (1 + o(1)) · inf sup Ef kft − fk2 (58) s (tj )j≥j s f∈Bp,q (L) 0 f∈Bp,q (L) ˆ 2 . inf sup Ef kf − fk2. (59) ˆ s f f∈Bp,q (L)

6 Minimax Lr Risk over Besov Balls In the previous sections we have studied the nonparametric function estimation problem over Besov balls using the mean squared error loss, where we have used the L2 isometry to establish the transformation from the function domain to the wavelet domain. However, we remark that the L2 isometry is not essential here: the same thresholding idea also applies to general Lr error, for r ∈ [1, ∞]. Furthermore, note that we have shown the minimax optimality of various estimators without specifying the value of the minimax risk, for completeness we give the most general result here.

10 1 1 Theorem 17. For any s > p − r , p, q, r ∈ [1, ∞], the minimax Lr risk in estimating the function f over s Besov balls Bp,q(L) is given by

 s 2 2s+1 1 ( ) if r < (2s + 1)p ! r  s 1 p  2 ( − )+ ˆ r ( log(1/)) 2s+1 (log(1/)) 2 qr if r = (2s + 1)p inf sup f kf − fk  . (60) E r s− 1 + 1 ˆ s p r f f∈Bp,q (L)   2(s− 1 )+1 (2 log(1/)) p if r > (2s + 1)p

9D. L. Donoho, and I. M. Johnstone. Adapting to unknown smoothness via wavelet shrinkage. Journal of the American statistical association, 90(432), pp. 1200–1224, 1995. 10D. L. Donoho, I. M. Johnstone, G. Kerkyacharian, and D. Picard. Density estimation by wavelet thresholding. The Annals of Statistics, pp. 508–539, 1996.

12 s In contrast, the minimax linear risk in estimating the function f over Besov balls Bp,q(L) is given by

 s (2) 2s+1 if r ≤ p 1   s− 1 + 1 ! r  p r  2(s− 1 + 1 )+1 ˆlin r 2 p r inf sup Ef kf − fkr  ( ) if p < r < ∞ (61) ˆlin s 1 f f∈Bp,q (L)  s−  p  2(s− 1 )+1 (2 log(1/)) p if r = ∞ where the infimum is taken over all possible linear estimators.

13 EE378A Statistical Signal Processing Lecture 16 - 05/25/2017 Lecture 16: Achieving and Estimating the Fundamental Limit Lecturer: Jiantao Jiao Scribe: William Clary

In this lecture, we formally define the two distinct problems of achieving and estimating the fundamental limit, and show that under the logarithmic loss, it is easier to estimate the fundamental limit than to achieve it.

1 The Bayes envelope

The Bayes envelope introduced in the previous lectures can be viewed as the fundamental limit of prediction. Indeed, for a specified loss function Λ(x, xˆ), the minimum average loss in predicting X ∼ PX is given by the Bayes envelope:

U(PX ) min EX∼PX [Λ(X, xˆ)]. (1) , xˆ

i.i.d. Throughout this lecture, we observe X1,X2,...,Xn ∼ PX , where X ∈ X = {1, 2,...,S}. In other words, the alphabet size of X is |X | = S. We denote by MS the space of probability measures on X . We take Λ(x, xˆ) to be the logarithmic loss in the sequel, in other words, we have 1 Λ(x, xˆ) = Λ(x, Pˆ) = log , (2) Pˆ(x) for any x ∈ X , Pˆ ∈ MS. For non-negative sequences aγ , bγ , we use the notation aγ . bγ to denote that there exists a universal aγ constant C such that sup ≤ C, and aγ bγ is equivalent to bγ aγ . Notation aγ  bγ is equivalent to γ bγ & . aγ aγ bγ and bγ aγ . Notation aγ  bγ means that lim infγ = ∞, and aγ  bγ is equivalent to bγ  aγ . . . bγ We write a ∧ b = min{a, b} and a ∨ b = max{a, b}. Moreover, polyK denotes the set of all polynomials of degree no more than K.

2 Achieving the fundamental limit

i.i.d. Given n i.i.d. observations X1,X2,...,Xn ∼ PX , we would like to construct a predictor Pˆ = Pˆ(X1,X2,...,Xn) to predict a fresh new independent random variable X ∼ PX , where X is independent of the training data n {Xi}i=1. The average risk of predicting X using the predictor Pˆ under the logarithmic loss is given by " # 1 EP log , (3) Pˆ(X)

⊗(n+1) where the expectation is over the of (X1,X2,...,Xn,X) ∼ PX .

1 2.1 The inappropriate question of minimax risk

Since the distribution PX is unknown, we may take the minimax approach in decision theory and aim at solving the minimax risk. In other words, we aim at solving " # 1 inf sup log . (4) EP ˆ Pˆ PX ∈MS P (X)

We now show that this question leads to a degenerate answer that may not be what we want. Theorem 1. The minimax risk is given by " # 1 inf sup log = log(S), (5) EP ˆ Pˆ PX ∈MS P (X)

ˆ 1 1 1  and the minimax risk achieving P can be taken to be US = S , S ,..., S , where US is the uniform distribution on X .

Proof We first show that the minimax risk is at least log(S). Indeed, for any predictor Pˆ ∈ MS, we have " # 1 X 1 log X ,X ,...,X = P (x) log (6) EP ˆ 1 2 n X ˆ P (X) x∈X P (x) X 1 ≥ PX (x) log (7) PX (x) x∈X

= H(PX ), (8)

where we used the non-negativity of the KL divergence, and H(PX ) is the Shannon entropy. Taking PX = US, we have H(PX ) = log S. Taking expectations on both sides with respect to X1,X2,...,Xn, we know " # 1 EP log ≥ log S (9) Pˆ(X)

for any predictor Pˆ. On the other hand, taking Pˆ ≡ US, we have " # 1 EP log = log S, (10) Pˆ(X)

which proves that the minimax risk is at most log S.

Theorem 1 shows that solving the minimax risk in prediction may lead to inappropriate answers. Indeed, the minimax optimal solutions turns out to be a degenerate answer that ignores all the training data. What we show next is that focusing on the minimax regret solves this problem in a meaningful way.

2.2 Achieving the fundamental limit: minimax regret

As we argued in the proof of Theorem 1, for any predictor Pˆ = Pˆ(X1,X2,...,Xn), we have " # 1 EP log ≥ H(PX ). (11) Pˆ(X)

2 It motivates us to define the minimax regret as follows: " # 1 inf sup log − H(P ). (12) EP ˆ X Pˆ PX ∈MS P (X)

We have the following algebraic manipulations for any predictor Pˆ: " # " " ## 1 X 1 X 1 EP log − H(PX ) = EP EP PX (x) log − PX (x) log X1,X2,...,Xn (13) ˆ ˆ PX (x) P (X) x∈X P (x) x∈X " # X PX (x) = P (x) log (14) EP X ˆ x∈X P (x) ˆ = EP D(PX kP ), (15)

P P (x) where D(P kQ) = x∈X P (x) log Q(x) is the KL divergence between P and Q. In other words, solving the minimax regret of predicting a fresh new independent random variable X based on n i.i.d. training samples X1,X2,...,Xn is equivalent to solving the problem of estimating the discrete distribution PX under the KL divergence loss. The minimax regret is characterized by the following theorem. Theorem 2. 1 2

" # ( S−1 1 (1 + o(1)) 2n log(e) if n  S inf sup EP log − H(PX ) = (16) ˆ ˆ S P PX ∈MS P (X) (1 + o(1)) log( n ) if n  S

n Moreover, if lim sup S ≤ c ∈ (0, ∞), the minimax regret is bounded away from zero. The predictor Pˆ that achieves the performance above in the regime of n  S is:

n(x) + β(n(x)) Pˆ(x) = for any x ∈ X , (17) PS n + j=1 β(n(˜xj)) where n X n(x) = 1(Xi = x), (18) i=1

and (X1,X2,...,Xn) is the training data. Here  1 if k = 0  2 β(k) = 1 if k = 1 (19)  3 4 o.w.

The predictor Pˆ that achieves the performance above in the regime of n  S is:

n(x) + n log S ˆ S n P (x) = S for any x ∈ X . (20) n + n log n 1Paninski, Liam. ”Variational Minimax Estimation of Discrete Distributions under KL Loss.” In NIPS, pp. 1033-1040. 2004. 2Braess, Dietrich, and Thomas Sauer. ”Bernstein polynomials and learning theory.” Journal of Approximation Theory 128, no. 2 (2004): 187-206.

3 Vanishing minimax regret implies that there exists a predictor Pˆ such that its average prediction error h i log 1 on the test set approaches the fundamental limit H(P ). Theorem 2 shows that its takes at EP Pˆ(X) X least n  S samples to achieve vanishing minimax regret. It can be understood intuitively that one needs at least to see all the symbols at least once to be able to construct a predictor whose performance is able to approach the fundamental limit. The minimax regret definition reflects the traditional way of understanding of the difficulty of machine learning tasks. In machine learning practice, we iteratively improve our training algorithm, and use its prediction accuracy on the test set to measure the performance of our prediction algorithm. The best performance achieved by existing schemes on the test set is usually understood as the “limit” of prediction for a specific dataset. In this context, Theorem 2 can be interpreted in the way that with n . S samples, there does not exist any prediction algorithm based on n training samples whose performance on the test set can approach the Bayes envelope in the worst case. As we show in the next section, there exist algorithms that can estimate the fundamental limit with n  S samples without explicitly constructing a prediction algorithm.

3 Estimating the fundamental limit

We define the problem of estimating the fundamental limit as solving the following minimax problem: ˆ inf sup EP |H − H(PX )|, (21) Hˆ PX ∈MS where the infimum is taken over all possible estimators Hˆ = Hˆ (X1,X2,...,Xn) that are functions of the empirical training data. The materials in this section are mainly taken from 1. Jiao, Jiantao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. ”Minimax estimation of functionals of discrete distributions.” IEEE Transactions on Information Theory 61, no. 5 (2015): 2835-2885. 2. Jiao, Jiantao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. ”Maximum likelihood estimation of functionals of discrete distributions.” arXiv preprint arXiv:1406.6959 (2014).

3.1 The minimax rates We have the following theorem.

3 4 5 6 S Theorem 3. Suppose n & log S . Then,

ˆ S ln S inf sup EP |H − H(PX )|  + √ . (22) Hˆ PX ∈MS n ln n n

S Theorem 3 shows that it suffices to take n  log S samples to consistently estimate the fundamental limit H(PX ). It is very surprising that the number of samples required is in fact sublinear in S: one can estimate the Shannon entropy uniformly over all PX ∈ MS even if one has not seen most of the symbols in the alphabet X in the empirical samples.

3Valiant, Gregory, and Paul Valiant. ”Estimating the unseen: an n/log (n)-sample estimator for entropy and support size, shown optimal via new CLTs.” In Proceedings of the forty-third annual ACM symposium on Theory of computing, pp. 685-694. ACM, 2011. 4Valiant, Gregory, and Paul Valiant. ”The power of linear estimators.” In Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on, pp. 403-412. IEEE, 2011. 5Wu, Yihong, and Pengkun Yang. ”Minimax rates of entropy estimation on large alphabets via best polynomial approxi- mation.” IEEE Transactions on Information Theory 62, no. 6 (2016): 3702-3720. 6Jiao, Jiantao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. ”Minimax estimation of functionals of discrete distribu- tions.” IEEE Transactions on Information Theory 61, no. 5 (2015): 2835-2885.

4 3.2 Natural candidate: the empirical entropy

One of the most natural estimators for the Shannon entropy H(PX ) given n i.i.d. samples is the empirical entropy, which is defined as the following. ˆ 1 Pn 1 Denote the empirical distribution by Pn = (ˆp1, pˆ2,..., pˆS), wherep ˆi = n i=1 (Xi = i) is the empirical frequency of symbol i in the training set. The empirical entropy is defined as H(Pˆn), which plugs-in the empirical distribution into the Shannon entropy functional. Intuitively, since the Shannon entropy is a continuous functional for finite alphabet distributions, and Pˆn converges to the true distribution PX as n → ∞, the plug-in estimate H(Pˆn) should be a decent estimator for H(PX ) if S is fixed and n → ∞. It is indeed true: it is only in the high dimensions that the empirical entropy starts to behave poorly as an estimate for the Shannon entropy. We have the following theorem quantifying the performance of the empirical entropy in estimating H(PX ). 7 8 Theorem 4. Suppose n & S. Then, ˆ S ln S sup EP |H(Pn) − H(PX )|  + √ . (23) PX ∈MS n n

S Comparing Theorem 4 and 3, it seems that the main difference is that one has improved the term n to S n ln n in the minimax rate-optimal entropy estimator, while keeping the second term unchanged. We now investigate where the two terms come from, and how one may construct the minimax rate-optimal estimators based on the empirical entropy.

3.3 Analysis of the empirical entropy

For any estimator Hˆ , its performance in estimating H(PX ) can be characterized via its bias defined as ˆ ˆ ˆ EP H − H(PX ), and the concentration of H around its expectation EP H. The concentration property may ˆ ˆ ˆ ˆ 2 be partially characterized by the variance of the estimator H, namely Var(H) = EP (H − EP H) . S ln√S We now argue that in Theorem 4, the term n comes from the bias, and the term n comes from the variance. 1 Introduce the concave function f(x) = x ln( x ) on [0, 1]. It is clear that

S X H(Pˆn) = f(ˆpi). (24) i=1 We have the following claim. 1 Claim 5. If pi ≥ n , then 1 0 ≤ f(p ) − f(ˆp )  . (25) i E i n Moreover, (ln(n))2 2(ln(S) + 3)2 Var(H(Pˆ )) ≤ ∧ . (26) n n n The results in Claim 5 are inspiring. It shows that the variance of the empirical entropy can be universally bounded regardless of the support size S. Moreover, the bias contributed by each symbol will be linearly S added up together, contributing the term n . It is clear that in the regime of S fixed and n → ∞, the variance dominates, but in the high dimensions the bias dominates. Hence, the key to improving the empirical entropy would be to reduce the bias in high dimensions without incurring too much additional variance.

7Jiao, Jiantao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. ”Maximum likelihood estimation of functionals of discrete distributions.” arXiv preprint arXiv:1406.6959 (2014). 8Wu, Yihong, and Pengkun Yang. ”Minimax rates of entropy estimation on large alphabets via best polynomial approxi- mation.” IEEE Transactions on Information Theory 62, no. 6 (2016): 3702-3720.

5 3.4 How can we improve the empirical entropy? It has been a long journey to find the minimax rate-optimal estimators. Harris in 1975 proposed expanding ˆ EpH(Pn) using a Taylor expansion and obtained

S S − 1 1 X 1 1 H(Pˆ ) = H(P ) − + (1 − ) + o( ). (27) EP n X 2n 12n2 p n3 i=1 i

The Taylor expansion result looks decent in the regime where pi’s are not too small. Indeed, for very PS 1 small pi the remainder term may be much larger than the true entropy H(PX ) itself. This intuition i=1 pi turns out to be correct: it suffices to do a first-order bias correction using Taylor series in the regime of “not too small” pi. In general, for n · pˆn ∼ B(n, p), we may write 1 p(1 − p)  1  [f(ˆp )] = f(p) + f 00(p) + O , EP n 2 n P n2 which motivates the bias correction: 1 pˆ (1 − pˆ ) fˆc = f(ˆp ) − f 00(ˆp ) n n . n 2 n n In the entropy estimation case, we follow the bias correction above and do the following 9 ln n 1 Construction 6. If the true pi & n , we use f(ˆpi) + 2n instead of f(ˆpi) to estimate f(pi). S Now the focus is on the small pi regime. We need to understand precisely which term contributed the n ln n bias bound. Assume for now that all pi . n . We have the following manipulations:

S X H(Pˆn) − H(PX ) = f(ˆpi) − f(pi) (28) i=1 S S S X X X = (f(ˆpi) − PK (p ˆi)) − (f(pi) − PK (pi)) + (PK (p ˆi) − PK (pi)), (29) i=1 i=1 i=1

where PK (·) is an arbitrary polynomial with order no more than K. The following two observations are crucial for the improvements of empirical entropy. ln n ln n 1 Claim 7. If pi . n , we have pˆi . n with probability at least 1 − n4 . Claim 8. Suppose K  ln n. Then for any constant c > 0, c inf sup |f(x) − PK (x)|  . (30) PK ∈polyK c ln n n ln n x∈[0, n ]

c ln n c ln n Utilizing those two claims, and conditioning on the event that allp ˆi ≤ n , pi ≤ n , we immediately obtain that S X S (f(ˆp ) − P (p ˆ )) (31) i K i . n ln n i=1 S X S (f(p ) − P (p )) , (32) i K i . n ln n i=1 9Note that this bias correction intuition does not easily generalize to higher order corrections. For a systematic approach to do higher order bias correction with Taylor series, we refer the readers to Yanjun Han, Jiantao Jiao, Tsachy Weissman, ”Minimax Rate-Optimal Estimation of Divergences between Discrete Distributions”, arXiv preprint arXiv:1605.09124 (2016).

6 which implies that

" S # X S (P (p ˆ ) − P (p )) (33) EP K i K i & n i=1

since we already know in Claim 5 that H(Pˆ ) − H(P ) S . EP n X & n Thus, we have identified the reason of the poor bias of the empirical entropy: it is because the plug-in approach in estimating the polynomial PK incurs too much bias. Realizing this turns out to be the crucial factor that leads to the minimax rate-optimal estimator: under the multinomial model there exists unbiased estimators for any polynomial PK whose order is no more than n. Indeed, when X ∼ B(n, p), for any integer r ∈ {1, . . . , n}: X(X − 1) ... (X − r + 1) = pr. E n(n − 1) ... (n − r + 1) We complete the construction of the minimax rate-optimal estimator by doing the following:

ln n Construction 9. If the true pi . n , we use the unbiased estimator of polynomial PK (pi) to estimate  c ln n  f(pi). Here PK (·) is the best approximation polynomial of f(pi) over the interval 0, n introduced in Claim 8.

As for the last step, we need to use the Chernoff bound to show the following results on confidence intervals in the binomial model:

Claim 10. There exist c1, c2, c3, c4 positive real numbers such that:

log(n) log(n) 1 • if pˆi ∈ [0, c1 n ] then pi ∈ [0, c2 n ] with probability at least 1 − n4 .

log(n) log(n) 1 • if pˆi ∈ [c3 n , 1] then pi ∈ [c4 n , 1] with probability at least 1 − n4 . There are other details needed to make the whole proof work: for example, one needs to argue that this approach does not increase the variance by too much, and also show minimax lower bounds. In practice one may also remove the constant term in PK (·) to ensure that one assigns zero to symbols that have never appeared in the training data. Thus, we have constructed a minimax rate-optimal estimator that does not require the knowledge of the support size S, but behaves nearly as well as the exact minimax estimator with the knowledge of the support size S.

7 EE378A Statistical Signal Processing Lecture 3 - 05/29/2017 Lecture 17: Minimax estimation of high-dimensional functionals Lecturer: Jiantao Jiao Scribe: Jonathan Lacotte

1 Estimating the fundamental limit is easier than achieving it: other loss functions

We emphasize that it is a general phenomenon that estimating the fundamental limit is easier than achieving it. Recall the definition of the two problems of achieving and estimating the fundamental limit:

• Achieving the fundamental limit: ˆ inf sup EP [Λ(X, X) − U(PX )] Xˆ (X1,...,Xn) PX ∈P

• Estimating the fundamental limit: ˆ inf sup EP [|U − U(PX )|] Uˆ(X1,...,Xn) PX ∈P

Here we observe n i.i.d. samples X1,X2,...,Xn ∈ X with distribution PX ∈ P, where P is a collection of probability distribution on X . The U(PX ) = arg minxˆ EP [Λ(X, xˆ)] is the Bayes envelope. In last lecture, we show that in the case of Λ(X, xˆ) = Λ(X, Pˆ) = log 1 and P = M , it takes n  S Pˆ(X) S S samples to achieve the fundamental limit, and n  ln S samples to estimate the fundamental limit. In this lecture we show that similar phenomenon happens for another widely used loss function: the Hamming loss. Suppose X = (Y,Z) where Y ∈ Y, |Y| = S and Z ∈ {0, 1}. One may interpret Y as the feature, and Z as the label while casting it as a binary classification problem. Consider the Hamming loss: Λ(x, t) = 1{t(Y )6=Z}. The Bayes envelope in this case is given by

U(PX ) = E[min(η(Y ), 1 − η(Y ))],

where η(y) = P[Z = 1 | Y = y]. 1 Theorem 1. Consider P = {PX | PZ (0) = 2 }. Then, it requires n  S samples to achieve U(PX ) and S n  log(S) samples to estimate U(PX ).

2 Boosting the Chow-Liu algorithm with improved mutual infor- mation estimates

Graphical models provide us with efficient computational tools to conduct inference in high dimensional data with potential structure, cf. [1] and references therein. Learning the structure and parameters of graphical models from empirical data is therefore the starting point for all these applications. It has been known that exact learning of a general is NP-hard [2], and there exist tractable sub-classes among which tree graphical models are the most famous. The seminal work of Chow and Liu [3] contributed an efficient algorithm to compute the Maximum Likelihood Estimator (MLE) of tree structured graphical model based on empirical data, and constitutes one of the very few cases where the exact MLE can be solved efficiently. There are various approaches towards learning more complex structures, for which we refer the reader to [4] for a review.

1 Concretely, the Chow–Liu algorithm (CL) addresses the following question. Given n i.i.d. samples of a random vector X = (X1,X2,...,Xd), where Xi ∈ X , |X | < ∞, we want to estimate the joint distribution of X. Chow and Liu [3] assumed that PX can be factorized as:

d Y PX = PX |X , 0 ≤ j(i) < i, (1) mi mj(i) i=1 where (m1, m2, . . . , md) is an unknown permutation of integers 1, 2, . . . , d, and PXi|X0 is by definition equal to

PXi . Then, CL outputs the distribution PX that maximizes the likelihood of the observed data. Interestingly, this optimization problem can be efficiently solved after being transformed into a Maximum Weight Spanning Tree (MWST) problem, which can be solved using the Prim or Kruskal algorithm. In particular, they showed that the MLE of the tree structure boils down to the following expression: X EML = arg max I(Pˆe), (2) EQ:Q is a tree e∈EQ where I(Pˆe) is the mutual information associated with the empirical distribution of the two nodes connected via edge e, and EQ is the set of edges of a tree distribution Q (i.e., Q factors as a tree). In words, it d suffices to first compute the empirical mutual information between any two nodes (in total 2 pairs), and the maximum weight spanning tree is the tree structure that maximizes the likelihood. To obtain estimates of distributions on each edge, Chow and Liu [3] simply assigned the empirical distribution while picking an arbitrary node as the tree root. We begin by asking the following natural question: Question 2. Is the Chow–Liu algorithm optimal for learning tree graphical models? Since the Chow–Liu algorithm exactly solves the MLE, and has been widely used in many applications, its optimality seems to be tacitly assumed in much of the literature. However, a closer inspection of the [5, 6] reveals that it is only known that the Chow–Liu algorithm performs essentially optimally when the number of samples n grows to ∞, while the number of states of the tree has fixed size. Indeed, the modern theory of the maximum likelihood estimation paradigm [7] only justifies the asymptotic efficiency of MLE, without general non-asymptotic guarantees when we have finitely many samples. In contrast, various modern data-analytic applications deal with datasets that do not have the luxury of too many observations compared to the alphabet size. To explain the insights underlying our improved algorithm, we revisit equation (2) and note that if we were to replace the empirical mutual information with the true mutual information, the output of the MWST would be the true edges of the tree. In light of this, the CL algorithm can be viewed as a “plug-in” estimator that replaces the true mutual information with an estimate of it, namely the empirical mutual information. Naturally then, it is to be expected that a better estimate of the mutual information would lead to smaller probability of error in identifying the tree. However, how bad can the empirical mutual information be as an estimate for the true mutual information? The following theorem in [8] implies that it can be highly sub-optimal in high dimensional regimes.

Theorem 3. Suppose we have two random variables X1,X2 ∈ X , |X | < ∞. The minimax sample complexity 2 in estimating the mutual information I(X1; X2) under mean squared error is Θ(|X | / ln |X |), while the sample complexity required by the empirical mutual information to be consistent is Θ(|X |2). In words, Theorem 3 implies that for the minimax rate-optimal estimator, it suffices to take n  2 |X | / ln |X | samples to consistently estimate the mutual information I(X1; X2) for any underlying distri- butions. At the same time, unless n  |X |2, there exist distributions for which the error of the empirical mutual information would be bounded away from zero. Furthermore, when |X | is large, the approach of estimating the three entropy terms in

I(X1; X2) = H(X1) + H(X2) − H(X1,X2) (3)

2 separately using the minimax rate-optimal entropy estimators is minimax rate-optimal for estimating the mutual information. Now we apply the improved mutual information estimates to improve the Chow–Liu algorithm. In the following experiment, we fix d = 7, |X | = 300, construct a star tree (i.e. all random variables are conditionally independent given X1), and generate a random joint distribution by assigning independent Beta(1/2, 1/2)-

distributed random variables to each entry of the marginal distribution PX1 and the transition probabilities 3 4 PXk|X1 , 2 ≤ k ≤ d (with normalization). Then, we increase the sample size n from 10 to 5.5 × 10 , and for each n we conduct 20 Monte Carlo simulations. Note that the true tree has d − 1 = 6 edges, and any estimated set of edges will have at least one overlap with these 6 edges because the true tree is a star graph. We define the wrong-edges-ratio in this case as the number of edges different from the true set of edges divided by d − 2 = 5. Thus, if the wrong-edges-ratio equals one, it means that the estimated tree is maximally different from the true tree and, in the other extreme, a ratio of zero corresponds to perfect reconstruction. We compute the expected wrong-edges-ratio over 20 Monte Carlo simulations for each n, and the results are exhibited in Figure 1.

1 Modified CL 0.9 Original CL 0.8

0.7

0.6

0.5

0.4

0.3 Expected wrong−edges−ratio 0.2

0.1

0 0 1 2 3 4 5 6 4 number of samples n x 10

Figure 1: The expected wrong-edges-ratio of our modifed algorithm and the original CL algorithm for sample sizes ranging from 103 to 5.5 × 104.

Figure 1 reveals intriguing phase transitions for both the modified and the original CL algorithm. When we have fewer than 3 × 103 samples, both algorithms yield a wrong-edges-ratio of 1, but soon after the sample size exceeds 6 × 103, the modified CL algorithm begins to reconstruct the network perfectly, while the original CL algorithm continues to fail maximally until the sample size exceeds 47 × 103, 8 times the sample size required by the new algorithm, which we temporarily call “Modified Chow–Liu” algorithm. Open Question 4. The exact sample complexity for learning tree graphical models is open. Although it is

3 clear that using improved mutual information estimates improves the tree learning process, it is not clear that it is the optimal way for tree learning. One may start from asking the following easier question: for any jointly distributed random variables (X1,X2,X3), what is the sample complexity of testing I(X1; X2) − I(X1; X3) ≥ 2 against I(X1; X2) − I(X1; X3) ≤ 1? Here 0 < 1 < 2 are fixed constants.

3 Support size estimation

Given a discrete probability distribution P , we want to estimate its support size: X S(P ) = 1{pi>0} i

1 under the assumption that mini pi ≥ S . Note that this assumption immediately implies that the support size of P can be at most S. The question is: how can we design the minimax rate-optimal estimator for estimating the true support size? The materials in this section come from [9]. We demonstrate in this section that using the approximation based methodology in last lecture, one can intuitively obtain the sample complexity in a systematic fashion. log n log n It was shown in the last lecture that we can distinguish the cases pi . n and pi & n with over- whelming probability. It motivates us to separate the problem into two regimes:

1 log n 1. The case of S & n : in this case, the functional we want to estimate is a constant, and it suffices to use the plug-in approach. The resulting bias for each pi is:

n 1 1 n −npi − S |E[ (ˆpi 6= 0) − pi>0]| = (1 − pi) ≤ e ≤ e . (4)

1 log n h log n i 2. The case of S . n : in this case, there might be some pi that falls in the interval 0, n , while h log n i h log n i others fall in the interval n , 1 . Those pi that fall into n , 1 are handled with the MLE, and it h log n i suffices to look at the pi’s that are in 0, n .

Claim 5. Suppose q ∈ (0, 1). Then,

√ K   √ 1 − q −K q inf sup |PK (x) − 1|  √ . e . (5) PK ∈polyK ,PK (0)=0 x∈[q,1] 1 + q

h i 1 log n S Applying Claim 5 to the scaled interval 0, n , where q = log n ,K  log n, we obtain that the bias n h 1 log n i for pi ∈ S , n is

√ √ √ −K q − log n n − n log n e = e S log n = e S , (6)

S which vanishes at long as n  ln S . Hence, we have (intuitively) shown that the bias of the minimax optimal approach should be of order

√  −Θ n log n ∨ n S · e S S . (7)

4 References

[1] M. J. Wainwright and M. I. Jordan, “Graphical models, exponential families, and variational inference,” Foundations and Trends R in Machine Learning, vol. 1, no. 1-2, pp. 1–305, 2008. [2] D. Karger and N. Srebro, “Learning Markov networks: Maximum bounded tree-width graphs,” in Pro- ceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied , 2001, pp. 392–401. [3] C. Chow and C. Liu, “Approximating discrete probability distributions with dependence trees,” Infor- mation Theory, IEEE Transactions on, vol. 14, no. 3, pp. 462–467, 1968. [4] Y. Zhou, “Structure learning of probabilistic graphical models: a comprehensive survey,” arXiv preprint arXiv:1111.6925, 2011. [5] C. Chow and T. Wagner, “Consistency of an estimate of tree-dependent probability distributions (cor- resp.),” Information Theory, IEEE Transactions on, vol. 19, no. 3, pp. 369–371, 1973. [6] V. Y. Tan, A. Anandkumar, L. Tong, and A. S. Willsky, “A large- analysis of the maximum- likelihood learning of Markov tree structures,” Information Theory, IEEE Transactions on, vol. 57, no. 3, pp. 1714–1735, 2011. [7] E. L. Lehmann and G. Casella, Theory of point estimation. Springer, 1998, vol. 31.

[8] J. Jiao, K. Venkat, Y. Han, and T. Weissman, “Minimax estimation of functionals of discrete distribu- tions,” IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2835–2885, 2015. [9] Y. Wu and P. Yang, “Chebyshev polynomials, , and optimal estimation of the unseen,” arXiv preprint arXiv:1504.01227, 2015.

5 EE378A Statistical Signal Processing Lecture 18 - 06/01/2017 Lecture 18: Effective Sample Size Enlargement Lecturer: Jiantao Jiao Scribe: Andrew Naber

1 Effective sample size enlargement

Let MS be the set of all possible discrete distributions with support size S. Consider a random variable X with distribution P ∈ MS. We desire to estimate some functional F (P ) from n independent samples X1,...,Xn of P . Examples of F include the entropy or the L1-distance from another distribution Q ∈ MS. Let Fˆ = Fˆ(X1,...,Xn) denote our estimator of F and let Pˆn denote the empirical distribution as calculated from the n samples. Consider the minimax and the plug-in risk functions defined as

ˆ Rminmax(F, P, n) = inf sup E|F − F (P )| (1) Fˆ(X1,...,Xn) P ∈P ˆ Rplug-in(F, P, n) = sup E|F (Pn) − F (P )|. (2) P ∈P

In the following table we compare the minmax and plug-in risk for several different functionals F . There are certain conditions on the parameters to make the risk expression hold, and we ignore them here for simplicity in presentation.

F (P ) P Rminmax(F, P, n) Rplug-in(F, P, n) S X  1  S log(S) S log(S) p log M + √ + √ i p S n log(n) n n n i=1 i S X S S F (P ) = pα, 0 < α ≤ 1 M α i 2 S (n log(n))α nα i=1 S S1−α S S1−α F (P ), 1 < α < 1 M + √ + √ α 2 S (n log(n))α n nα n 3 −(α−1) −(α−1) Fα(P ), 1 < α < 2 MS (n log(n)) n 1 1 F (P ), α ≥ 3 M √ √ α 2 S n n  q  S n log(n) n X −Θ max S , S n 1 −Θ( S ) 1(pi 6= 0) {P : mini pi ≥ S } Se Se i=1 S S r S r X X qi X qi |p − q | M q ∧ q ∧ i i S i n ln n i n i=1 i=1 i=1 In the following table we compare the minmax and plug-in risk for several different functionals of two distributions P,Q ∈ MS where we have m samples from p and n samples from q. For the Kullback-Leibler 2 Pi and χ divergence estimators we only consider (P,Q) ∈ {(P,Q)|P,Q ∈ MS, ≤ u(S)} where u(S) is some Qi function of S.

1 F (P,Q) Rminmax(F, P, m, n) Rplug-in(F, P, m, n) S s s X S S |p − q | + i i min{m, n} log(min{m, n}) min{m, n} i=1 S s s 1 X √ √ S S ( p − q )2 2 i i min{m, n} log(min{m, n}) min{m, n} i=1 S   p p X pi S Su(S) log(u(S)) u(S) S Su(S) log(u(S)) u(S) D(P kQ) = p log + + √ + √ + + √ √ i q m log(m) n log(n) m n m n m n i=1 i S X p2 Su(S)2 u(S) u(S)3/2 Su(S)2 u(S) u(S)3/2 χ2(P kQ) = i − 1 + √ + √ + √ + √ q n log(n) m n n m n i=1 i

Note: In many cases (with the notable exception of the support size estimator), in passing from the plug-in risk to the minmax risk, we replace m and n in the denominator with m log(m) and n log(n). We term this phenomenon the effective sample size enlargement. In the next section, we examine the underlying reason for this phenomenon.

2 Preliminaries on approximation theory

Definition 1. The rth symmetric difference of f ∈ C[0, 1] is ( 0 if x + rh or x − rh ∈/ [0, 1] ∆r f(x) = 2 2 h Pr kr h k=1(−1) k f(x + r( 2 ) − kh) otherwise Example:

1 h h ∆hf(x) = f(x + 2 ) − f(x − 2 ) 2 ∆hf(x) = f(x + h) − 2f(x) + f(x − h)

Definition 2.

r r ω (f, t) = sup k∆hfk 0

r+1 r 1. ωϕ (f, t) ≤ κ · ωϕ(f, t), κ a constant r r 2. limt→0+ ωϕ(f, t)/t = 0 ⇒ f ∈ polyr−1 2 r 2 Example: Let f(x) = x log(x). We then have that ω (f, t)  t and ωϕ(f, t)  t , r ≥ 2. Definition 3. Peetre’s K-functional

r  r (r)  Kr(f, t ) = inf kf − gk + t kg k g r  r r (r)  Kr,ϕ(f, t ) = inf kf − gk + t kϕ g k g

2 Lemma 4. For all f ∈ C[0, 1], r r Kr(f, t )  ω (f, t) r r Kr,ϕ(f, t )  ωϕ(f, t). We present a systematic approach to bound the gap |Ef(X) − f(EX)| using the K-functional approach below. Example: Suppose X ∈ [0, 1] is a random variable and f ∈ C[0, 1]. We then have that

|Ef(X) − f(EX)| = |E(f(X) − g(X) + g(X) − g(EX) + g(EX) − f(EX)) ≤ 2kf − gk + |E(g(X) − g(EX))| var(X) 00 ≤ 2kf − gk + 2 kg k ! pvar(X) ω2 f, , . 2

1 00 2 where we have used the Taylor’s theorem g(X) − g(EX) = g(EX)(X − EX) + 2 g (ζ)(X − EX) to create 1 00 the bound |E(g(X) − g(EX))| ≤ 2 kg kvar(X) followed by an application of lemma 4.

3 The mathematics behind the effective sample size enlargement

Theorem 5. Consider a binomial model, i.e., pˆk ∼ B(k, p) and let f ∈ C[0, 1]. Note that

k X  i  k f(ˆp ) = f pi(1 − p)k−i Ep k k i i=0 is a polynomial of degree at most k, and is called the Bernstein approximation of the function f(p) of order k.

1. Bernstein: For all f ∈ C[0, 1], kEpf(ˆpk) − f(p)k → 0 as k → ∞. 2. Totik’94, Knoop and Zhou’94: For all f ∈ C[0, 1], k f(ˆp ) − f(p)k  ω2 (f, √1 ) Ep k ϕ k Theorem 6. For any r < k, the best approximation error of f(p) ∈ C[0, 1] is upper bounded by

r 1 inf sup |f(x) − pk(x)| . ωϕ(f, k ) pk∈polyk x∈[0,1] where polyk is the set of all polynomials of degree at most k on [0, 1]. Comparing the two theorems above, we see that the key improvement from the Bernstein√ approximation to the best approximation is reflected on two points: one has changed the argument from 1/ k to 1/k, and the modulus order has improved from 2 to an arbitrary number r that is smaller than k. Since increasing the√ modulus order in various situations does not help reduce the approximation error, in most situations the 1/ k ⇒ 1/k phenomenon dominates, which can be seen from the following table. f(p) Bernstein approximation error Best approximation error 1 1 1 p log p k k2  1 1  1 pα, α > 0, α∈ / max , N k kα k2α ( √ 0 if p = 0 1 − q k (1 − q)k √ 1 if p ≥ q 1 + q r q(1 − q) pq(1 − q) |p − q|, q ∈ [0, 1] ∧ q ∧ 1 − q ∧ q ∧ 1 − q k k

3 Now we present a heuristic argument of the effective sample size enlargement phenomenon. Consider h log n i the entropy estimation problem. Since the nonsmooth regime is of order 0, n , if we (approximately) conduct Bernstein approximation over this interval with order k  log n, the overall approximation error after scaling would be 1 log n 1 ·  . (3) k n n However, if we use the best approximation with order k  log n in this interval, we have the approximation error 1 log n 1 ·  , (4) k2 n n ln n which justifies the improvement from n to n log n. α h log n i Similarly, consider the function p , 0 < α < 1. The nonsmooth regime is still 0, n . The Bernstein approximation error would be

1 log nα 1  , (5) kα n nα and the best approximation error would be

1 log nα 1  . (6) k2α n (n ln n)α

It is also clear why this phenomenon fails to hold for the support size problem: for this problem the √ √ improvement is no longer 1/ k ⇒ 1/k, but q ⇒ q.

4 EE378A Statistical Signal Processing Lecture 3 - 06/08/2017 Lecture 18: Bias correction with Jackknife and Boostrap Lecturer: Jiantao Jiao Scribe: Daria Reshetova

Both Jackknife and bootstrap are generic methods that can be used to reduce the bias of statistical estimators. However, the traditional theory proves incapable of answering whether the bootstrap or jackknife can help reduce the bias of the empirical entropy so as to achieve the minimax rates of entropy estimation. We provide the answers in this lecture using advanced theory of approximation. To simplify the presentation we focus on the binomial model npˆn ∼ B(n, p) throughout this lecture. However, we emphasize that the methodology applies to any statistical models of interest.

1 Jackknife intuition

Let e1,n(p) = f(p)−Epf(ˆpn), f ∈ C([0; 1]) be the bias term. Construct the jackknife bias corrected estimator as ˆ f2 = nf(ˆpn) − (n − 1)f(ˆpn−1) We now heuristically claim that this approach can reduce bias. Suppose that a(p) b(p)  1  e (p) = + + O , (1) 1,n n n2 n3

where a(p), b(p) are unknown functions of p which do no depend on n. We also have

a(p) b(p)  1  e (p) = + + O . (2) 1,n−1 n − 1 (n − 1)2 (n − 1)3 ˆ Hence, the overall bias of f2 is: ˆ f(p) − Epf2 = ne1,n(p) − (n − 1)e1,n−1(p) (3) b(p) b(p)  1  = − + O (4) n n − 1 n2 b(p)  1  = − + O , (5) n(n − 1) n2

1 1 which shows that the bias has been reduced to order n2 instead of order n . Note: although the result seems solid, it does not capture the dependence in p. For example, if f(p) = 1 1 p log p , standard Taylor expansion shows that the function b(p) and the final O( n2 ) term contain functions of p that explode to infinity as p → 0. Traditional statistical theory fails to analyze the uniform bias reduction  1  properties of the jackknife for functions such as f(p) = p log p . This task can be done via the advanced theory of approximation.

2 Approximation-theoretic analysis of Jackknife

Recap:

r Pr kr • r-th order symmetric difference: ∆hf(x) = k=0(−1) k f(x + (r − k)h)

r r p • ωϕ(f, t) = sup k∆hϕf(x)k, ϕ(x) = x(1 − x) 0≤h≤r

1 1 r 2 • f(p) = p log p ⇒ ωϕ(f, t)  t for fixed r ≥ 2 Pr Let n1 < n2, . . . < nr = kn1 = n, where k is fixed and ni ∈ N and choose c1, . . . , cr such that i=1 ci = 1 and r X ci for ρ ∈ , 1 ≤ ρ ≤ r − 1 : = 0 Z nρ i=1 i Then the estimator would be: r ˆ X ˆ fr = cif(ˆpni ) i=1

Note that this framework encorporates the one described in section 1 with n1 = n − 1, n2 = n, with c1 = −(n − 1), c2 = n.

Lemma 1. The solutions of ci, 1 ≤ i ≤ r are given by

Y ni ci = , 1 ≤ i ≤ r. (6) ni − nj j6=i

Pr Theorem 2. Suppose that i=1 |ci| ≤ C, where C is a constant independent of n. Fix r > 0. Then:

ˆ 2r √1 −r 1. kf − Efrk∞ ≤ C(ωϕ (f, n ) + n kfk) 2. for fixed α ∈ (0, 2r): ˆ −α/2 2r α kf − Epftk = O(n ) ⇔ ωϕ (f, t) = O(t )

1 ˆ 1 3. if f(p) = p log p , then kf − Ef2k  n There are several interesting interpretations of this results: 1. Iterating the jackknife one more time is equivalent to raising the modulus order by 2

1 2. Raising the modulus order does not reduce the error for functions such as f(p) = p log p , hence doing jackknife does not improve the worst case bias of empirical entropy in terms of order

3. The traditional jackknife, which has n1 = n−1, n2 = n, does not satisfy the conditions of this theorem. We now show that the reason why the traditional jackknife is not included in the previous theorem is that it does not satisfy the nice properties enlisted in it. ˆ Example 3. Denote the r-th order jackkife with nr = r, nj − nj−1 = 1 as fr. Then there exists a fixed ˆ r−1 function f ∈ C((0, 1]) andkfk ≤ 1, such that kEfr − fk & n . Also, there exists f ∈ C([0; 1]) such that ˆ n2 Var(f2) ≥ e . This result shows that the traditional jackknife may have very bad bias and variance properties if the function is not very smooth. However, this does not imply that the traditional jackknife does not work 1 for functions like f(p) = p log p . Since the general theorem does not apply to this situation, one needs to carefully compute the bias of the jackknife estimator for entropy. It proves to be a highly challenging task. We have the following conjecture: 1 ˆ 1 Conjecture 4. Consider f(p) = p log p . We conjecture that kf − Efrk  n for any fixed r ≥ 2. It is proved to be true when r = 2.

2 3 Bootstrap

The bootstrap relies on two principles: the plug-in principle and the Monte–Carlo principle.

1. Plug-in principle: if Pˆ is a good estimate of P then F (Pˆ) is a good estimate of F (P ). ˆ R 2. Monte–Carlo principle: in order to compute the plug-in estimator F (θ) of F (θ) = g(X)dPθ, where i.i.d. X ∼ Pθ, we sample B random objects X1,X2,...,XB ∼ Pθˆ, and then use

B 1 X g(X ) (7) B i i=1

to approximately compute F (θˆ).

ˆ Now we investigate the bias correction problem. Since e1(p) = f(p) − Epf(ˆpn), the plug-in principle suggests that e1(ˆpn) ≈ e1(p), which further suggests the bias corrected estimator ˆ f2 = f(ˆpn) + e1(ˆpn). (8)

Note that one in general needs to use the Monte–Carlo principle to approximately compute e1(ˆpn). It is also clear that one can do this correction infinitely many times. Indeed, we have

e2(p) = e1(p) − Ee1(ˆpn) r−1 ˆ X fr = f(ˆpn) + ei(ˆpn) i=1 It begs the question: how can we understand the bias correction properties of this bootstrap approach? ˆ 1. Question 1: Does fr converge as r → ∞? The answer is YES! 2. Question 2: What does it converge to?

Lemma 5. lim ker(p) − (f(p) − Ln[f](p))k = 0 for any f ∈ C([0; 1]), where Ln[f](p) is the unique r→∞ Lagrange interpolating polynomial of f at n + 1 points {0, 1/n, 2/n, . . . , 1}.

3. Question 3: does Ln[f](p) have good approximation properties?

Example 6 (Bernstein). For f(p) = |p − 1/2|, lim |f(p) − Ln[f](p)| = ∞ for any p except p = 0, p = r→∞ 1/2.

These results show that iterating the bootstrap bias correction too many times may not be a wise idea. What about doing it only a few times? We now show that it essentially has the similar performance as doing the jackknife exactly the same number of times. Theorem 7. Fix r > 0. Then

ˆ 2r √1 −r 1. kf(p) − Epfrk∞ ≤ C(ωϕ (f, n ) + n kfk) 2. for fixed α ∈ (0, 2r]: ˆ −α/2 2r α kf(p) − Epfrk = O(n ) ⇔ ωϕ (f, t) = O(t )

1 ˆ 1 3. if f(p) = p log p , then kf − Efrk  n

3