
EE378A Statistical Signal Processing    Lecture 11 - 05/03/2016
Lecture 11: Minimax
Lecturer: Jiantao Jiao, Tsachy Weissman    Scribe: Jie Jun Ang

1 Minimax Framework

Recall that the Bayes estimator minimizes the average risk

r(Λ, δ) = ∫ R(θ, δ) dΛ(θ),

where Λ is the prior. One complaint about the Bayesian approach is that the choice of Λ is often arbitrary and hard to defend. An alternative approach is to assume that nature is malicious and will always pick the worst value of θ in response to the statistician's choice of δ. In this case, we instead seek to minimize the maximum risk (or worst-case risk):

min_δ sup_θ ∫ L(θ, δ) dP_{X|θ}.

Claim 1. If the loss L(θ, δ) is convex in δ, then finding the minimax rule reduces to solving a convex optimization problem.

Why is this claim true? The average of convex functions is convex, hence ∫ L(θ, δ) dP_{X|θ} is convex in δ. Also, the supremum of convex functions is easily checked to be convex, hence sup_θ ∫ L(θ, δ) dP_{X|θ} is convex in δ. As a result, the problem

min_δ sup_θ ∫ L(θ, δ) dP_{X|θ}

minimizes a convex function and is thus a convex optimization problem. Even though there is substantial literature on solving convex optimization problems, this particular problem is difficult because the supremum over θ is hard to evaluate: given a decision rule δ, we cannot even compute a (sub)gradient of the objective, because computing sup_θ is computationally intractable.
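To make Claim 1 concrete, here is a minimal numerical sketch (our own toy example, not from the lecture): take X ∼ Bernoulli(θ) with θ ∈ {0.3, 0.7} and squared-error loss, so a nonrandomized rule is just the pair (δ(0), δ(1)) ∈ [0, 1]², and the worst-case risk is the maximum of two convex quadratics, which a generic solver can minimize.

# Toy illustration of Claim 1 (our own example, not from the lecture):
# X ~ Bernoulli(theta), theta in {0.3, 0.7}, squared-error loss.
# A nonrandomized rule is the pair d = (delta(0), delta(1)); R(theta, d) is a
# convex quadratic in d, so the worst-case risk max_theta R(theta, d) is convex.
import numpy as np
from scipy.optimize import minimize

thetas = np.array([0.3, 0.7])

def risk(theta, d):
    a, b = d
    return (1 - theta) * (a - theta) ** 2 + theta * (b - theta) ** 2

def worst_case_risk(d):
    return max(risk(t, d) for t in thetas)

res = minimize(worst_case_risk, x0=[0.5, 0.5], method="Nelder-Mead")
print("minimax rule (delta(0), delta(1)) ~", np.round(res.x, 3))   # ~ (0.42, 0.58)
print("minimax risk ~", round(float(res.fun), 4))                  # ~ 0.0336
# With two parameter values the sup is a cheap max of two terms; for a large or
# continuous parameter space, evaluating sup_theta itself is the hard part.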

2 Minimax Theorem

The first minimax theorem was proved by von Neumann, in the setting of zero-sum games. It states that every two-person finite zero-sum game has a mixed-strategy equilibrium. For the purposes of this class, however, we introduce the generalization by Sion and Kakutani:

Theorem 2 (Sion-Kakutani Minimax Theorem). Let X and Λ be compact convex sets in linear spaces (i.e., topological vector spaces). Let H(λ, x) : Λ × X → R be a continuous function such that

• H(λ, ·) is convex for every fixed λ ∈ Λ,
• H(·, x) is concave for every fixed x ∈ X.

Then

1. Strong duality holds:

   max_λ min_x H(λ, x) = min_x max_λ H(λ, x).    (1)

2. There exists a "saddle point" (λ∗, x∗) for which

H(λ, x∗) ≤ H(λ∗, x∗) ≤ H(λ∗, x) for all λ ∈ Λ, x ∈ X. (2)
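As a quick sanity check of the theorem (our own example, not from the lecture), take Λ = X = [−1, 1] and H(λ, x) = x² − λ² + λx, which is convex in x and concave in λ; a grid computation confirms that both sides of (1) equal 0 and that (λ∗, x∗) = (0, 0) is a saddle point.

import numpy as np

grid = np.linspace(-1.0, 1.0, 2001)                  # Lambda = X = [-1, 1]
L, X = np.meshgrid(grid, grid, indexing="ij")        # rows: lambda, columns: x
H = X ** 2 - L ** 2 + L * X                          # convex in x, concave in lambda

max_min = H.min(axis=1).max()    # max over lambda of min over x
min_max = H.max(axis=0).min()    # min over x of max over lambda
print(max_min, min_max)          # both 0.0; the saddle point is (lambda*, x*) = (0, 0)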

We omit the proof, but strongly recommend the following exercises:

• Show that (2) implies (1).

• Show that for an arbitrary function F(λ, x), we have

  max_λ min_x F(λ, x) ≤ min_x max_λ F(λ, x).

This is called weak duality.

There is a game-theoretic way to view weak duality. Two players, named Max and Min, play a zero-sum game. Max chooses λ ∈ Λ and Min chooses x ∈ X; then F(λ, x) is the payoff to Max (and −F(λ, x) is the payoff to Min). Weak duality states that having to commit first can only hurt Max, no matter the choice of the function F: if Max moves first and Min responds, Max gets max_λ min_x F(λ, x), which is at most the value min_x max_λ F(λ, x) that Max gets when Min moves first.
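A tiny finite example (ours, not from the lecture) makes this concrete. For a payoff matrix A with rows indexed by Max's pure strategies and columns by Min's, weak duality reads max_i min_j A_{ij} ≤ min_j max_i A_{ij}; von Neumann's theorem says the gap closes once both players may randomize.

import numpy as np

# Rows: Max's pure strategies (lambda); columns: Min's (x); entries: payoff to Max.
A = np.array([[ 1.0, -1.0],
              [-1.0,  1.0]])    # "matching pennies"

max_first = A.min(axis=1).max()   # Max commits first, Min responds: max_i min_j A_ij
min_first = A.max(axis=0).min()   # Min commits first, Max responds: min_j max_i A_ij
print(max_first, "<=", min_first) # -1.0 <= 1.0: committing first can only hurt Max
# With mixed strategies (each player mixing 50/50) the value of this game is 0,
# which is von Neumann's theorem in action.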

3 Applications of the Minimax Problem

3.1 A failed approach

Since we are interested in finding min_δ sup_θ ∫ L(θ, δ) dP_{X|θ}, a first approach is to lower bound this quantity using weak duality. The following is slightly sloppy mathematically:

min_δ sup_θ ∫ L(θ, δ) dP_{X|θ} ≥ sup_θ min_δ ∫ L(θ, δ) dP_{X|θ}.

The problem is that in the expression on the right-hand side, the statistician is allowed to choose δ after nature has chosen θ. If there is a g with L(θ, g(θ)) = 0 for every θ, the statistician may pick the constant rule δ ≡ g(θ) once θ is revealed, so the right-hand side is just 0. We have given up too much by applying weak duality.
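In the Bernoulli toy example from Section 1 (our running illustration), this collapse is easy to see numerically: once θ is revealed, the constant rule δ ≡ θ has zero squared-error risk, so the weak-duality lower bound equals 0 even though the minimax risk is strictly positive.

import numpy as np

thetas = np.array([0.3, 0.7])

def risk(theta, a, b):
    # squared-error risk of the rule delta with delta(0) = a, delta(1) = b
    return (1 - theta) * (a - theta) ** 2 + theta * (b - theta) ** 2

# Inner minimization with theta known: the constant rule delta == theta is perfect.
inner_min = [risk(t, t, t) for t in thetas]
print(max(inner_min))    # 0.0 -- a useless lower bound; the minimax risk here is ~0.0336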

3.2 Wald's approach

What if, instead of operating directly on θ, we instead maximize over priors on θ? Recall that the risk function R(θ, δ) is defined to be ∫ L(θ, δ) dP_{X|θ}.

Claim 3. For any δ, we have

sup_θ R(θ, δ) = sup_{Λ(θ)} ∫ R(θ, δ) dΛ(θ).

Proof  Since the right-hand side is an average of R(θ, δ) over some values of θ and the left-hand side is the supremum of R(θ, δ), we have LHS ≥ RHS. For the reverse inequality, given any ε > 0, there exists some θ0 such that R(θ0, δ) + ε > sup_θ R(θ, δ). Taking Λ0 to be the delta distribution with point mass at θ = θ0, we get

sup_Λ ∫ R(θ, δ) dΛ(θ) ≥ ∫ R(θ, δ) dΛ0(θ) = R(θ0, δ) > sup_θ R(θ, δ) − ε.

Taking ε → 0+, we see that RHS ≥ LHS, so we are done.
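Claim 3 can be sanity-checked numerically on the running toy example (ours, not from the lecture): for a fixed rule, the Λ-averaged risk never exceeds sup_θ R(θ, δ), and a point-mass prior at the worst θ attains the supremum.

import numpy as np

thetas = np.array([0.3, 0.7])
a, b = 0.2, 0.9                      # some fixed (asymmetric) decision rule

R = (1 - thetas) * (a - thetas) ** 2 + thetas * (b - thetas) ** 2   # R(theta, delta)
sup_risk = R.max()

rng = np.random.default_rng(0)
for _ in range(5):
    lam = rng.dirichlet(np.ones(len(thetas)))    # a random prior on {0.3, 0.7}
    assert lam @ R <= sup_risk + 1e-12           # Lambda-averaged risk <= worst-case risk

point_mass = (thetas == thetas[R.argmax()]).astype(float)   # Lambda_0 = point mass at worst theta
print(point_mass @ R, "equals", sup_risk)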

Thus, maximizing over θ is the same as maximizing over Λ(θ). We use the minimax theorem:

min_δ sup_θ ∫ L(θ, δ) dP_{X|θ} = min_δ sup_θ R(θ, δ)
                               = min_δ sup_Λ ∫ R(θ, δ) dΛ(θ)
                               = sup_Λ min_δ ∫ R(θ, δ) dΛ(θ)
                               = sup_Λ min_δ ∫∫ L(θ, δ) dP_{X|θ} dΛ(θ).

We have not verified that the hypotheses of the minimax theorem are satisfied. The main idea is that, when Λ is fixed, the objective is convex in δ (as discussed in the first section, when L(θ, ·) is convex), and when δ is fixed, it is linear in Λ. We gloss over the details regarding compactness and convexity, the norms on X and Λ, continuity, etc. The above result may be succinctly summarized:

min_δ sup_θ R(θ, δ) = sup_{Λ(θ)} min_δ r(Λ, δ).

Define the Bayes risk

r_Λ = min_δ r(Λ, δ) = min_δ E_Λ[R(θ, δ)]

to be the risk of the Bayes estimator under the prior θ ∼ Λ(θ). Then the above result reads min_δ sup_θ R(θ, δ) = sup_{Λ(θ)} r_Λ. The left-hand side may be interpreted as the statistician choosing δ first and nature then picking θ, which seems very bad for the statistician. However, this is not as pessimistic as it initially seems: on the right-hand side, nature acts first by choosing a prior Λ, and the statistician then picks δ. An objection to Wald's minimax formulation was that nature is not actually malicious, but this way of viewing the problem shows that the formulation is not so pessimistic after all.
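On the toy Bernoulli example both sides of this identity can be computed directly (our illustration, not from the lecture): for a prior Λ = (λ, 1 − λ) on {0.3, 0.7}, the Bayes estimator under squared-error loss is the posterior mean, r_Λ is its average risk, and maximizing over λ recovers the same value (≈ 0.0336, attained at the uniform prior) as minimizing the worst-case risk directly.

# Computing sup_Lambda r_Lambda on the running toy example and comparing it
# with the minimax value min_delta sup_theta R(theta, delta) found earlier.
import numpy as np

thetas = np.array([0.3, 0.7])

def bayes_risk(lam):
    # Bayes risk r_Lambda for the prior (lam, 1 - lam) on {0.3, 0.7}
    prior = np.array([lam, 1 - lam])
    r = 0.0
    for x in (0, 1):
        lik = thetas if x == 1 else 1 - thetas   # P(X = x | theta)
        joint = prior * lik
        marg = joint.sum()                       # marginal P(X = x)
        post = joint / marg                      # posterior on theta
        delta_x = post @ thetas                  # Bayes rule: posterior mean
        r += marg * (post @ (thetas - delta_x) ** 2)   # add weighted posterior variance
    return r

lams = np.linspace(0.0, 1.0, 1001)
vals = np.array([bayes_risk(l) for l in lams])
print("sup_Lambda r_Lambda ~", round(vals.max(), 4),
      "at lambda ~", round(lams[vals.argmax()], 3))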

3.3 Difficulties

Just as the first formulation of the minimax problem is hard to solve, this second form has its own difficulties. While it is easy to evaluate r_Λ for each individual choice of Λ, there are too many possible priors Λ to evaluate sup_Λ r_Λ. Ultimately, we have traded the problem of being unable to evaluate the objective sup_θ R(θ, δ) for the problem of searching over another possibly infinite-dimensional space.

4 Minimax in Action

Unfortunately, the minimax problem is difficult to solve exactly, even after it has been reformulated; very few exact optimality results are known. Fortunately, statisticians have two ways of dealing with this:

• Instead of searching for exactly optimal solutions, we can seek solutions that are asymptotically optimal. More on this next week.

• We may upper and lower bound the minimax risk by explicit constants. Indeed, in the expression

min_δ sup_θ R(θ, δ) = sup_{Λ(θ)} r_Λ,

the left-hand side is a minimum over δ, so for any choice δ0, the quantity sup_θ R(θ, δ0) is an upper bound on the minimax risk. Similarly, the right-hand side is a supremum over Λ, so for any choice of Λ0,

the quantity r_{Λ0} is a lower bound on the minimax risk. If we can get upper and lower bounds that are close, we are quite satisfied with the result (a concrete numerical sketch follows below this list). This is still a difficult task, however.
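Here is the bounding recipe carried out on the toy Bernoulli example (our illustration, not from the lecture): the constant rule δ0 ≡ 0.5 gives an upper bound sup_θ R(θ, δ0), and the deliberately suboptimal prior Λ0 = (0.4, 0.6) gives a lower bound r_{Λ0}; together they sandwich the minimax risk ≈ 0.0336.

import numpy as np

thetas = np.array([0.3, 0.7])

def risk(theta, a, b):
    # R(theta, delta) for the rule delta with delta(0) = a, delta(1) = b
    return (1 - theta) * (a - theta) ** 2 + theta * (b - theta) ** 2

# Upper bound: worst-case risk of the constant rule delta_0(x) = 0.5.
upper = max(risk(t, 0.5, 0.5) for t in thetas)                 # = 0.04

# Lower bound: Bayes risk of the (suboptimal) prior Lambda_0 = (0.4, 0.6).
def bayes_risk(lam):
    prior = np.array([lam, 1 - lam])
    r = 0.0
    for x in (0, 1):
        lik = thetas if x == 1 else 1 - thetas
        joint = prior * lik
        marg = joint.sum()
        post = joint / marg
        mean = post @ thetas                                   # Bayes rule: posterior mean
        r += marg * (post @ (thetas - mean) ** 2)
    return r

lower = bayes_risk(0.4)
print(lower, "<= minimax risk <=", upper)                      # ~0.0325 <= 0.0336 <= 0.04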

Definition 4. A prior Λ is called least favorable if, for any prior Λ′, we have r_Λ ≥ r_{Λ′}.

This definition is directly motivated by the minimax equation above. The notion is somewhat formal, in the sense that it is usually not possible to check directly whether a given Λ is least favorable.

Theorem 5. Suppose Λ(θ) is a prior on θ, and δ_Λ is the Bayes estimator under Λ. Suppose also that r(Λ, δ_Λ) = r_Λ = sup_θ R(θ, δ_Λ). Then

1. δ_Λ is a minimax estimator;

2. If δ_Λ is the unique Bayes estimator with respect to Λ, then it is the unique minimax estimator;

3. Λ is least favorable.

This theorem is almost tautological. Nevertheless, we prove the first assertion and leave the rest as exercises.

Proof  Let δ be any decision rule. Then

sup_θ R(θ, δ) ≥ ∫ R(θ, δ) dΛ(θ)
             ≥ ∫ R(θ, δ_Λ) dΛ(θ)
             = r_Λ
             = sup_θ R(θ, δ_Λ),

where the last equality is exactly the hypothesis of the theorem. Hence δ_Λ minimizes the worst-case risk sup_θ R(θ, δ), i.e., it is minimax.
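On the toy Bernoulli example, the hypothesis of Theorem 5 can be verified directly (our illustration, not from the lecture): under the uniform prior on {0.3, 0.7} the Bayes rule has the same risk at both parameter values (an "equalizer" rule), so r_Λ = sup_θ R(θ, δ_Λ), and the theorem certifies that this rule is minimax and that the uniform prior is least favorable.

import numpy as np

thetas = np.array([0.3, 0.7])
prior = np.array([0.5, 0.5])                 # candidate least favorable prior

def posterior_mean(x):
    # Bayes rule under squared-error loss: the posterior mean given X = x
    lik = thetas if x == 1 else 1 - thetas
    post = prior * lik
    post = post / post.sum()
    return post @ thetas

delta = (posterior_mean(0), posterior_mean(1))       # ~ (0.42, 0.58)

def risk(theta, d):
    a, b = d
    return (1 - theta) * (a - theta) ** 2 + theta * (b - theta) ** 2

risks = np.array([risk(t, delta) for t in thetas])
print(np.round(risks, 4), round(float(prior @ risks), 4))
# Both entries equal the Bayes risk (~0.0336): r_Lambda = sup_theta R(theta, delta_Lambda),
# so Theorem 5 says delta_Lambda is minimax and this prior is least favorable.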

5 Reference

B. Levit (2010). "Minimax Revisited: I, II". This paper surveys several techniques that have been used to bound the minimax risk of the following problem: you have a single observation X ∼ N(µ, σ²), where σ² is known, and you know that |µ| ≤ 1; you want to estimate µ. In general, the exact minimax estimator is not known! This gives a feeling for how difficult finding exact minimax estimators can be.
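To get a feel for the bounding techniques on this problem, here is a rough numerical sketch (our own, with σ = 1 chosen for concreteness; not taken from Levit's paper): the clipped estimator δ0(x) = min(max(x, −1), 1) gives an upper bound via its worst-case risk over |µ| ≤ 1, and the symmetric two-point prior on {−1, +1}, whose Bayes estimator under squared-error loss is the posterior mean tanh(x), gives a lower bound via its Bayes risk.

import numpy as np

xs = np.linspace(-10.0, 10.0, 4001)          # integration grid for X
dx = xs[1] - xs[0]
mus = np.linspace(-1.0, 1.0, 201)            # grid over the parameter set |mu| <= 1

def normal_pdf(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

def risk(delta_vals, mu):
    # E_{X ~ N(mu, 1)} (delta(X) - mu)^2 by Riemann-sum integration on the grid
    return float(np.sum((delta_vals - mu) ** 2 * normal_pdf(xs, mu)) * dx)

# Upper bound: worst-case risk of the clipped estimator delta_0(x) = clip(x, -1, 1).
clipped = np.clip(xs, -1.0, 1.0)
upper = max(risk(clipped, mu) for mu in mus)

# Lower bound: Bayes risk of the two-point prior Lambda_0 = (1/2)delta_{-1} + (1/2)delta_{+1};
# its Bayes estimator under squared-error loss is the posterior mean tanh(x) when sigma = 1.
tanh_est = np.tanh(xs)
lower = 0.5 * (risk(tanh_est, -1.0) + risk(tanh_est, 1.0))

print(round(lower, 3), "<= minimax risk <=", round(upper, 3))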
