EE378A Statistical Signal Processing Lecture 2 - 04/06/2017
Lecture 2: Statistical Decision Theory
Lecturer: Jiantao Jiao   Scribe: Andrew Hilger

In this lecture, we discuss a unified theoretical framework proposed by Abraham Wald, known as statistical decision theory.[1]

1 Goals

1. Evaluation: The theoretical framework should aid fair comparisons between algorithms (e.g., maximum entropy vs. maximum likelihood vs. method of moments).

2. Achievability: The theoretical framework should be able to inspire the construction of statistical algorithms that are (nearly) optimal under the optimality criteria introduced in the framework.

2 Basic Elements of Statistical Decision Theory

1. Statistical Model: A family of probability measures P = {Pθ : θ ∈ Θ}, where θ is a parameter and Pθ is a probability measure indexed by the parameter.

2. Data: X ∼ Pθ, where X is the data observed under some parameter value θ.

3. Objective: g(θ), e.g., inference on the entropy of the distribution Pθ.

4. Decision Rule: δ(X). The decision rule need not be deterministic. In other words, there could be a probabilistically defined decision rule with an associated Pδ|X.

5. Loss Function: L(θ, δ). The loss function tells us how bad we feel about our decision once we find out the true value of the parameter θ chosen by nature.

Example: Pθ(x) = (1/√(2π)) e^(−(x−θ)²/2), g(θ) = θ, and L(θ, δ) = (θ − δ)². In other words, X is normally distributed with mean θ and unit variance, X ∼ N(θ, 1), and we are trying to estimate the mean θ. We judge our success (or failure) using mean-squared error.
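To make this example concrete, here is a minimal Monte Carlo sketch (our own illustration, not part of the notes) for the natural rule δ(X) = X: the average empirical loss converges to E[(θ − X)²] = 1, whatever the value of θ.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_risk(theta, n_trials=100_000):
    """Average squared-error loss of delta(X) = X over many draws X ~ N(theta, 1)."""
    x = rng.normal(loc=theta, scale=1.0, size=n_trials)  # observations from P_theta
    return np.mean((theta - x) ** 2)                     # empirical E[(theta - delta(X))^2]

# The risk is constant in theta: each printed value is close to 1.0.
for theta in [-2.0, 0.0, 3.0]:
    print(theta, empirical_risk(theta))
```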

3 Risk Function

Definition 1 (Risk Function).

    R(θ, δ) ≜ E[L(θ, δ(X))]    (1)
            = ∫ L(θ, δ(x)) Pθ(dx)    (2)
            = ∬ L(θ, δ) Pδ|X(dδ|x) Pθ(dx).    (3)

[1] See Wald, Abraham. "Statistical Decision Functions." In Breakthroughs in Statistics, pp. 342-357. Springer New York, 1992.

Figure 1: Example risk functions computed over a range of parameter values for two different decision rules.

A risk function evaluates a decision rule’s success over a large number of independent observations with a fixed parameter value θ. By the law of large numbers, if we observe X many times independently, the average empirical loss of the decision rule δ will converge to the risk R(θ, δ). Even after determining the risk functions of two decision rules, it may still be unclear which is better. Consider the example of Figure 1. Two different decision rules δ1 and δ2 result in two different risk functions R(θ, δ1) and R(θ, δ2), evaluated over a range of values of the parameter θ. The first decision rule δ1 is inferior for low and high values of the parameter θ but is superior for the middle values. Thus, even after computing a risk function R, it can still be unclear which decision rule is better. We need new ideas to enable us to compare different decision rules.
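The crossing pattern of Figure 1 is easy to reproduce. The sketch below (our own toy construction; the shrinkage factor 1/2 is arbitrary) compares δ1(X) = X/2 with δ2(X) = X for X ∼ N(θ, 1) under squared error, using the closed-form decomposition R = bias² + variance: δ1 wins for |θ| < √3 and loses outside.

```python
import numpy as np

# Two decision rules for X ~ N(theta, 1) under squared-error loss:
#   delta_1(X) = X / 2  ->  R(theta, delta_1) = (theta / 2)^2 + 1/4  (bias^2 + variance)
#   delta_2(X) = X      ->  R(theta, delta_2) = 1
thetas = np.linspace(-3.0, 3.0, 61)
risk_shrink = (thetas / 2) ** 2 + 0.25
risk_mean = np.ones_like(thetas)

# delta_1 is superior only in the middle of the parameter range (|theta| < sqrt(3)),
# so neither rule dominates the other -- exactly the situation in Figure 1.
print(thetas[risk_shrink < risk_mean])
```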

4 Optimality Criterion of Decision Rules

Given the risk functions of various decision rules as functions of the parameter θ, there are several approaches to determining which decision rule is optimal.

4.1 Restrict the Competitors

This is a traditional set of methods that has been overshadowed by the approaches we introduce later. A decision rule δ0 is eliminated (or, formally, is inadmissible) if there is another decision rule δ that is strictly better, i.e., R(θ, δ0) ≥ R(θ, δ) for every θ ∈ Θ, with strict inequality for at least one θ0 ∈ Θ. However, many decision rules cannot be eliminated in this way, and we still lack a criterion to determine which of them is better. To aid in selection, the rationale of restricting the competitors is to consider only decision rules that are members of a certain decision rule class D. The advantage is that sometimes all but one decision rule in D is inadmissible, and we just use the only admissible one.

1. Example 1: The class of unbiased decision rules D0 = {δ : E[δ(X)] = g(θ), ∀θ ∈ Θ}.

2. Example 2: The class of invariant estimators.

However, a serious drawback of this approach is that D may be an empty set for various decision theoretic problems.

4.2 Bayesian: Average Risk Optimality

The idea is to use averaging to reduce the risk function R(θ, δ) to a single number for any given δ.

Definition 2 (Average risk under prior Λ(dθ)).

    r(Λ, δ) = ∫ R(θ, δ) Λ(dθ)    (4)

Here Λ is the prior distribution, a probability measure on Θ. The Bayesians and the frequentists disagree about Λ; namely, the frequentists do not believe in the existence of the prior. However, there exist justifications of the Bayesian approach beyond the interpretation of Λ as prior belief: indeed, the complete class theorem in statistical decision theory asserts that in various decision theoretic problems, all the admissible decision rules can be approximated by Bayes estimators.[2]

Definition 3 (Bayes Estimator).

    δΛ = arg min_δ r(Λ, δ)    (5)

The Bayes estimator δΛ can usually be found by computing the posterior distribution. Note that

    r(Λ, δ) = ∭ L(θ, δ) Pδ|X(dδ|x) Pθ(dx) Λ(dθ)    (6)
            = ∫ ( ∬ L(θ, δ) Pδ|X(dδ|x) Pθ|X(dθ|x) ) PX(dx)    (7)

where PX(dx) is the marginal distribution of X and Pθ|X(dθ|x) is the posterior distribution of θ given X. In Equation (6), Pθ(dx)Λ(dθ) is the joint distribution of θ and X. In Equation (7), we only have to minimize the portion in parentheses in order to minimize r(Λ, δ), because PX(dx) does not depend on δ.

Theorem 4.[3] Under mild conditions,

    δΛ(x) = arg min_δ E[L(θ, δ)|X = x]    (8)
          = arg min_{Pδ|X} ∬ L(θ, δ) Pδ|X(dδ|x) Pθ|X(dθ|x)    (9)

Lemma 5. If L(θ, δ) is convex in δ, it suffices to consider deterministic rules δ(x).

Proof. By Jensen’s inequality,

    ∬ L(θ, δ) Pδ|X(dδ|x) Pθ|X(dθ|x) ≥ ∫ L(θ, ∫ δ Pδ|X(dδ|x)) Pθ|X(dθ|x).    (10)

Examples

1. L(θ, δ) = (g(θ) − δ)² ⇒ δΛ(x) = E[g(θ)|X = x]. In other words, the Bayes estimator under squared error loss is the conditional expectation of g(θ) given x.

2. L(θ, δ) = |g(θ) − δ| ⇒ δΛ(x) is any median of the posterior distribution Pθ|X=x.

3. L(θ, δ) = 1{g(θ) ≠ δ} ⇒ δΛ(x) = arg max_θ Pθ|X(θ|x) = arg max_θ Pθ(x)Λ(θ). In other words, an indicator loss function results in a maximum a posteriori (MAP) estimator decision rule.[4]
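As a quick illustration of all three rules at once (a hypothetical toy setup of ours, with g(θ) = θ), the sketch below discretizes the posterior on a grid for X ∼ N(θ, 1) with a N(0, 1) prior; for this conjugate pair the posterior is N(x/2, 1/2), so the posterior mean, median, and MAP all coincide at x/2.

```python
import numpy as np

thetas = np.linspace(-5.0, 5.0, 2001)          # grid over the parameter space
prior = np.exp(-0.5 * thetas ** 2)             # N(0, 1) prior, unnormalized
x = 1.3                                        # a single observation
likelihood = np.exp(-0.5 * (x - thetas) ** 2)  # N(theta, 1) likelihood

posterior = prior * likelihood
posterior /= posterior.sum()                   # normalize on the grid

post_mean = np.sum(thetas * posterior)                            # squared-error loss
post_median = thetas[np.searchsorted(np.cumsum(posterior), 0.5)]  # absolute loss
post_map = thetas[np.argmax(posterior)]                           # indicator (0-1) loss

print(post_mean, post_median, post_map)  # all approximately x/2 = 0.65
```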

[2] See Chapter 3 of Liese, Friedrich, and Klaus-J. Miescke. "Statistical Decision Theory: Estimation, Testing, and Selection." Springer, 2009.

[3] See Theorem 1.1, Chapter 4 of Lehmann, E. L., and Casella, G. Theory of Point Estimation. Springer Science & Business Media, 1998.

[4] Next week, we will cover special cases of Pθ and how to compute the Bayes estimator in a computationally efficient way. In the general case, however, computing the posterior distribution may be difficult.

4.3 Frequentist: Worst-Case Optimality (Minimax)

Definition 6 (Minimax estimator). The decision rule δ* is minimax among all decision rules in D iff

    sup_{θ∈Θ} R(θ, δ*) = inf_{δ∈D} sup_{θ∈Θ} R(θ, δ).    (11)

4.3.1 First observation

Since

    R(θ, δ) = ∬ L(θ, δ) Pδ|X(dδ|x) Pθ(dx)    (12)

is linear in Pδ|X, the risk is a convex function of the decision rule δ. The supremum of convex functions is convex, so finding the optimal decision rule is a convex optimization problem. However, solving this convex optimization problem may be computationally intractable. For example, it may not even be computationally tractable to compute the supremum of the risk function R(θ, δ) over θ ∈ Θ. Hence, finding the exact minimax estimator is usually hard.

4.3.2 Second observation

Due to the difficulty of finding the exact minimax estimator, we turn to another goal: we wish to find an estimator δ′ such that

    inf_δ sup_θ R(θ, δ) ≤ sup_θ R(θ, δ′) ≤ c · inf_δ sup_θ R(θ, δ)    (13)

where c > 1 is a constant. The left inequality is trivially true. For the right inequality, in practice one can usually choose some specific δ′ and evaluate an upper bound of sup_θ R(θ, δ′) explicitly. However, it remains to find a lower bound of inf_δ sup_θ R(θ, δ). To solve the problem (and save the world), we can use the minimax theorem.

Theorem 7 (Minimax Theorem (Sion-Kakutani)). Let Λ, X be two compact, convex sets in some topological vector spaces. Let H(λ, x) : Λ × X → R be a continuous function such that:

1. H(λ, ·) is convex for any fixed λ ∈ Λ;

2. H(·, x) is concave for any fixed x ∈ X.

Then

1. Strong duality: max_λ min_x H(λ, x) = min_x max_λ H(λ, x);

2. Existence of a saddle point: ∃(λ*, x*) such that H(λ, x*) ≤ H(λ*, x*) ≤ H(λ*, x) for all λ ∈ Λ, x ∈ X.

The existence of a saddle point implies strong duality. We note that, in contrast to strong duality, the following weak duality always holds without any assumptions on H:

    sup_λ inf_x H(λ, x) ≤ inf_x sup_λ H(λ, x)    (14)
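For a quick feel of (14), here is a tiny numerical check (our own example, in its simplest finite pure-strategy form): for any payoff matrix H, the best row-minimum never exceeds the best column-maximum.

```python
import numpy as np

# Weak duality (14) in its simplest finite form: for any matrix H,
#   max_lambda min_x H(lambda, x) <= min_x max_lambda H(lambda, x),
# where lambda indexes rows and x indexes columns.
rng = np.random.default_rng(2)
H = rng.standard_normal((4, 5))

maximin = H.min(axis=1).max()  # sup_lambda inf_x
minimax = H.max(axis=0).min()  # inf_x sup_lambda
print(maximin <= minimax)      # always True
```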

We define the quantity rΛ ≜ inf_δ r(Λ, δ) as the Bayes risk under the prior distribution Λ. We have the following line of argument using weak duality:

    inf_δ sup_θ R(θ, δ) = inf_δ sup_Λ r(Λ, δ)    (15)
                        ≥ sup_Λ inf_δ r(Λ, δ)    (16)
                        = sup_Λ rΛ    (17)

Equation (17) gives us a strong tool for lower bounding the minimax risk: for any prior distribution Λ, the Bayes risk under Λ is a lower bound on the minimax risk. When the conditions of the minimax theorem are satisfied (which may be expected due to the bilinearity of r(Λ, δ) in the pair (Λ, δ)), equation (16) holds with equality, which shows that there exists a sequence of priors whose Bayes risks converge to the minimax risk. In practice, it suffices to choose an appropriate prior distribution Λ in order to obtain a (nearly) minimax estimator.
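As a concrete instance of this recipe (a standard worked example, not covered in these notes): let X ∼ N(θ, 1) with squared error loss and g(θ) = θ. Under the prior Λ = N(0, σ²), the Bayes estimator is the posterior mean and the Bayes risk is

    rΛ = σ²/(1 + σ²).

By (17), inf_δ sup_θ R(θ, δ) ≥ σ²/(1 + σ²) for every σ² > 0, and letting σ² → ∞ yields the lower bound 1. Since the rule δ(X) = X satisfies sup_θ R(θ, δ) = 1, it matches the lower bound and is therefore exactly minimax.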

5 Decision Theoretic Interpretation of Maximum Entropy

Definition 8 (Logarithmic Loss). For a distribution P over X and x ∈ X,

    L(x, P) ≜ log(1/P(x)).    (18)

Note that P(x) is the discrete mass on x in the discrete case and the density at x in the continuous case. By definition, if P(x) is small, then the loss is large. Now we show how the maximum likelihood estimator reduces to the so-called maximum entropy estimator under the logarithmic loss function. Specifically, suppose we have functions r1(x), · · · , rk(x) and compute the following empirical averages:

    βi ≜ (1/n) Σ_{j=1}^n ri(xj) = EPn[ri(X)],  i = 1, · · · , k    (19)

where Pn denotes the empirical distribution based on the n samples x1, · · · , xn. By the law of large numbers, βi is expected to be close to the true expectation under P, so we may restrict the true distribution P to lie in the following set:

    P = {P : EP[ri(X)] = βi, 1 ≤ i ≤ k}.    (20)

Also, denote by Q the set of all probability distributions belonging to the exponential family with sufficient statistics r1(x), · · · , rk(x), i.e.,

    Q = {Q : Q(x) = (1/Z(λ)) e^(Σ_{i=1}^k λi ri(x)), ∀x ∈ X}    (21)

where Z(λ) is the normalization factor. Now we consider two estimators of P. The first one is the maximum likelihood estimator over Q, i.e.,

    PML = arg max_{P∈Q} Σ_{j=1}^n log P(xj).    (22)

The second one is the maximum entropy estimator over P, i.e.,

    PME = arg max_{P∈P} H(P)    (23)

where H(P) is the entropy of P, given by H(P) = −Σ_{x∈X} P(x) log P(x) in the discrete case and H(P) = −∫_X P(x) log P(x) dx in the continuous case. The key result is as follows:[5]

Theorem 9. In the above setting, we have PML = PME.

[5] Berger, Adam L., Vincent J. Della Pietra, and Stephen A. Della Pietra. "A maximum entropy approach to natural language processing." Computational Linguistics 22.1 (1996): 39-71.

Proof. We begin with the following lemma, which is the key motivation for the use of the cross-entropy loss in machine learning.

Lemma 10. The true distribution minimizes the expected logarithmic loss:

    P = arg min_Q EP L(x, Q)    (24)

and the corresponding minimum EP L(x, P) is the entropy of P.

Proof. The assertion follows from the non-negativity of the KL divergence: D(P‖Q) ≥ 0, where D(P‖Q) = EP ln(dP/dQ).
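A quick numerical sanity check of Lemma 10 (our own illustration): for random discrete P and Q, the expected logarithmic loss EP[log(1/Q(x))] is the cross-entropy, which exceeds the entropy H(P) by exactly D(P‖Q) ≥ 0.

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(5))   # true distribution P
q = rng.dirichlet(np.ones(5))   # any competitor Q

cross_entropy = -np.sum(p * np.log(q))   # E_P[log 1/Q(x)]
entropy = -np.sum(p * np.log(p))         # E_P[log 1/P(x)] = H(P)
kl = np.sum(p * np.log(p / q))           # D(P || Q)

print(cross_entropy >= entropy)                 # True
print(np.isclose(cross_entropy, entropy + kl))  # True: the gap is exactly the KL divergence
```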

Equipped with this lemma, and noting that EP[log(1/Q(x))] is linear in P and convex in Q, by the minimax theorem we have

    max_{P∈P} H(P) = max_{P∈P} min_Q EP[log(1/Q(x))] = min_Q max_{P∈P} EP[log(1/Q(x))]    (25)

In particular, we know that (PME, PME) is a saddle point in P × M, where M denotes the set of all probability measures on X. We first take a look at PME, which maximizes the LHS. By variational methods,[6] it is easy to see that the entropy-maximizing distribution PME belongs to the exponential family Q, i.e.,

    PME(x) = (1/Z(λ)) e^(Σ_{i=1}^k λi ri(x))    (26)

for some parameters λ1, · · · , λk. Now by the definition of saddle points, for any Q ∈ Q ⊂ M we have

    EPME[log(1/PME(x))] ≤ EPME[log(1/Q(x))]    (27)

while for any Q ∈ Q with possibly different coefficients λ′,

    EPME[log(1/Q(x))] = EPME[log Z(λ′) − Σi λ′i ri(x)]    [since Q ∈ Q]    (28)
                      = log Z(λ′) − Σi λ′i βi    [since PME ∈ P]    (29)
                      = EPn[log Z(λ′) − Σi λ′i ri(x)]    [since Pn ∈ P]    (30)
                      = EPn[log(1/Q(x))]    (31)
                      = (1/n) Σ_{j=1}^n log(1/Q(xj)).    (32)

As a result, the saddle point condition reduces to

    (1/n) Σ_{j=1}^n log(1/PME(xj)) ≤ (1/n) Σ_{j=1}^n log(1/Q(xj)),  ∀Q ∈ Q    (33)

establishing the fact that PME = PML.

[6] See Chapter 12 of Cover, Thomas M., and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2006.
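Theorem 9 is also easy to verify numerically. The sketch below (our own toy construction: alphabet {0, 1, 2, 3}, a single statistic r(x) = x, and a made-up constraint value β = 1.7 standing in for the empirical average) computes PML by minimizing the convex dual log Z(λ) − λβ of the average log-likelihood, and PME by direct constrained entropy maximization; the two distributions coincide.

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

xs = np.arange(4.0)   # alphabet {0, 1, 2, 3}
beta = 1.7            # hypothetical empirical average of r(x) = x

# Maximum likelihood over Q: the average log-likelihood of Q_lambda is
# lambda * beta - log Z(lambda), so we minimize the convex dual.
def dual(lam):
    return np.log(np.sum(np.exp(lam * xs))) - lam * beta

lam_star = minimize_scalar(dual).x
p_ml = np.exp(lam_star * xs)
p_ml /= p_ml.sum()

# Maximum entropy over P = {P : E_P[r(X)] = beta}, solved directly.
def neg_entropy(p):
    return np.sum(p * np.log(p))

constraints = [{"type": "eq", "fun": lambda p: p.sum() - 1.0},
               {"type": "eq", "fun": lambda p: p @ xs - beta}]
res = minimize(neg_entropy, np.full(4, 0.25), bounds=[(1e-9, 1.0)] * 4,
               constraints=constraints)

print(np.allclose(p_ml, res.x, atol=1e-4))  # True: P_ML and P_ME coincide
```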
