Estimation and Inference for Set-identified Parameters Using Posterior Lower Probability

Toru Kitagawa, CeMMAP and Department of Economics, UCL

First Draft: September 2010. This Draft: March 2012.

Abstract

In inference about set-identified parameters, it is known that the Bayesian probability statements about the unknown parameters do not coincide, even asymptotically, with the frequentist's confidence statements. This paper aims to smooth out this disagreement from a multiple-prior robust Bayes perspective. We show that there exists a class of prior distributions (ambiguous belief) with which the posterior probability statements drawn via the lower envelope (lower probability) of the posterior class asymptotically agree with the frequentist confidence statements for the identified set. With such a class of priors, we analyze statistical decision problems, including point estimation of the set-identified parameter, by applying the gamma-minimax criterion. From a subjective robust Bayes point of view, the class of priors can serve as a benchmark ambiguous belief: by introspectively assessing it, one can judge whether the frequentist inferential output for the set-identified model is an appealing approximation of his posterior ambiguous belief.

Keywords: Partial Identification, Bayesian Robustness, Belief Function, Imprecise Probability, Gamma-minimax, Random Set. JEL Classification: C12, C15, C21.

Email: [email protected]. I thank Andrew Chesher, Siddhartha Chib, Jean-Pierre Florens, Charles Manski, Ulrich Müller, Andriy Norets, Adam Rosen, Kevin Song, and Elie Tamer for valuable discussions and comments. I also thank the seminar participants at Academia Sinica Taiwan, Cowles Summer Conference 2011, EC2 Conference 2010, MEG 2010, RES Annual Conference 2011, Simon Fraser University, and the University of British Columbia for helpful comments. All remaining errors are mine. Financial support from the ESRC through the ESRC Centre for Microdata Methods and Practice (CEMMAP) (grant number RES-589-28-0001) is gratefully acknowledged.

1 Introduction

In the usual situations of inferring identified parameters, the Bayesian probability statements about the unknown parameters are similar, at least asymptotically, to the frequentist confidence statements for the true value of the parameters. In partial identification analysis initiated by Manski (1989, 1990, 2003, 2008), it is known that such asymptotic harmony between the two inference paradigms breaks down (Moon and Schorfheide (2011)). The Bayesian interval estimates for the set-identified parameter are shorter, even asymptotically, than the frequentist ones, and they asymptotically lie strictly inside the frequentist confidence intervals. Frequentists might interpret this phenomenon as showing that the Bayesian's confidence in their inferential statements is a fictitious over-confidence. Bayesians, on the other hand, might reply that the frequentist confidence statements suffer from an extreme conservatism that lacks any posterior probability justification.

The main aim of this paper is to smooth out such disagreement between the two schools of thought from the perspective of robust Bayes inference: a third paradigm of statistical inference, in which we can incorporate the researcher's partial prior knowledge into posterior inference. Among various robust Bayes approaches, this paper focuses in particular on a multiple-prior Bayes analysis, where the partial prior knowledge or the robustness concern against prior misspecification is modeled by a class of priors (ambiguous belief). The Bayes rule is applied to each prior to form the class of posteriors. Posterior inference procedures considered in this paper operate on the thus-constructed class of posteriors by focusing on its lower and upper envelopes, the so-called posterior lower and upper probabilities.

Along this robust Bayes approach, this paper considers the following questions. First, are there any classes of priors with which the probabilistic statements made via the posterior lower probability can approximate the frequentist confidence statements for set-identified parameters? The answer turns out to be yes. When the parameters are not identified, a prior distribution of the model parameters is decomposed into two components: one that can be updated by data (revisable prior knowledge) and one that can never be updated by data (unrevisable prior knowledge). We show that, if a prior class is designed in such a way that it shares a single prior distribution for the revisable prior knowledge, but allows for arbitrary prior distributions for the unrevisable prior knowledge, then, under certain regularity conditions, the posterior interval estimate constructed based upon the posterior lower probability asymptotically attains the correct frequentist coverage probability for the identified set.

The second question we examine is: with such a class of priors, is it possible to formulate and solve statistical decision problems, including point estimation of the set-identified parameter? We approach this question by adapting the gamma-minimax decision analysis, which can be seen as a minimax analysis in the multiple-prior setup. We demonstrate that the proposed prior class leads us to an analytically tractable and numerically solvable formulation of the gamma-minimax decision problem, provided that the identified set for the parameter of interest can be computed for each value of the identified parameters.

We believe our analyses are insightful not only from the "objective" Bayesian point of view but also from the subjective robust Bayesian point of view, because the classes of priors that can asymptotically match the frequentist inference can serve as a benchmark ambiguous belief. That is, by subjectively assessing whether the ambiguous belief represented by the benchmark prior class appears reasonable, one can judge whether the frequentist inferential output is too conservative in view of his posterior ambiguous belief. On this issue, we argue through an example that some priors in the benchmark prior class appear less appealing than others, so that subjective robust Bayesians may consider the frequentist inferential statements for the identified set too uninformative to be interpreted as an approximation of their posterior ambiguous belief.

1.1 Literature Review

Estimation and inference in partially identified models have been a growing area of research in econometrics. From the frequentist perspective, Horowitz and Manski (2000) construct confidence sets for the interval-identified set. Imbens and Manski (2004) propose uniformly asymptotically valid confidence sets for an interval-identified parameter. Stoye (2009) extends their analysis to a different set of conditions. Chernozhukov, Hong, and Tamer (2007) develop a way to construct asymptotically valid confidence sets for an identified set based on the criterion function approach, which can be applied to a wide range of partially identified models including moment inequality models. Related to the criterion function approach, the literature on constructing confidence sets by inverting tests includes Andrews and Guggenberger (2009), Andrews and Soares (2010), and Romano and Shaikh (2010), to list a few.

From the Bayesian perspective, Neath and Samaniego (1997), Poirier (1998), and Gustafson (2009, 2010) analyze how Bayesian updating operates when a model lacks identifiability. Liao and Jiang (2010) conduct Bayesian inference for moment inequality models based on the pseudo-likelihood. Moon and Schorfheide (2011) compare the asymptotic properties of frequentist and Bayesian inference for set-identified models, and our robust Bayes analysis is motivated by their important finding on the asymptotic disagreement between them.

The lower and upper probabilities originate from Dempster (1966, 1967a, 1967b, 1968) in his fiducial argument for drawing posterior inferences without specifying a prior distribution. Dempster's analyses have had a large impact on the fields of statistics, artificial intelligence, and decision theory, leading to the belief function analysis of Shafer (1976, 1982) and the imprecise probability analysis of Walley (1991). The lower and upper probabilities have also played a major role in robust Bayes analysis, for measuring the global sensitivity of the posterior (Berger (1984), Berger and Berliner (1986)) and for characterizing a class of posteriors (DeRobertis and Hartigan (1981), Wasserman (1989, 1990), and Wasserman and Kadane (1990)). In econometrics, the pioneering work using multiple priors was carried out by Chamberlain and Leamer (1976) and Leamer (1982), who obtained the bounds for the posterior mean of regression coefficients when a prior varies over a certain class. These previous studies do not explicitly consider non-identified models, whereas this paper focuses on non-identified models and aims to clarify a link between the early idea of the lower and upper probabilities and the recent issue of inference in set-identified models.

The posterior lower probability that we shall obtain with our specification of the prior class turns out to be an infinite-order capacity, or equivalently, a containment functional in random set theory (see, e.g., Molchanov (2005)). Beresteanu and Molinari (2008) and Beresteanu, Molchanov, and Molinari (2012) show the usefulness and wide applicability of random set theory to partially identified models by viewing an observation as a random set and the estimand as its Aumann expectation. Galichon and Henry (2006, 2009) propose the use of an infinite-order capacity in defining and inferring the identified set in the framework of incomplete econometric models. Our analysis, albeit different in the source of probability, can be seen as a robust Bayes analogue of these frequentist inference procedures based on random set theory and the capacity.

Our decision-theoretic analysis employs the gamma-minimax criterion (see Berger (1985)); this leads us to a decision that minimizes the risk formed under the most pessimistic prior within the class. The gamma-minimax decision problem often becomes challenging both analytically and numerically, and its analysis is limited to rather simple parametric models with a certain choice of prior class (see Betro and Ruggeri (1992) and Vidakovic (2000), for example). Our specification of the prior class, in contrast, offers a simple solution to the gamma-minimax decision problem, provided that the identified set for the parameter of interest can be computed for each value of the identified parameters.

1.2 Plan of the Paper

The rest of the paper is organized as follows. In Section 2, we illustrate our main results using a simple missing data example. In Section 3, a general framework is introduced. We construct a class of prior distributions that can contain arbitrary unrevisable prior knowledge, and derive the posterior lower and upper probabilities. Statistical decision analyses with multiple priors are examined in Section 4. In Section 5, we investigate how to construct a posterior credible region based on the posterior lower probability, and its large-sample behavior is examined in an interval-identified parameter case. Proofs and lemmas are provided in Appendix A.

2 An Illustrative Example: Missing Data

In this section we illustrate the main results of the paper using an example of missing data (Manski (1989)). Let $Y \in \{1, 0\}$ be a binary outcome of interest (e.g., a worker is employed or not) and let $D \in \{1, 0\}$ be an indicator of whether $Y$ is observed ($D = 1$) or not ($D = 0$), i.e., whether the subject responded or not. Data are given by a random sample of size $n$, $x = \{(Y_i D_i, D_i) : i = 1, \ldots, n\}$. The starting point of our analysis is to specify a parameter vector $\theta \in \Theta$ that pins down the distribution of data as well as the parameter of interest. Here, $\theta$ can be specified by a vector of four probability masses, $\theta = (\theta_{yd})$, $\theta_{yd} \equiv \Pr(Y = y, D = d)$, $y = 1, 0$, $d = 1, 0$. The observed data likelihood for $\theta$ is written as

$$p(x|\theta) = \theta_{11}^{n_{11}} \, \theta_{01}^{n_{01}} \, [\theta_{10} + \theta_{00}]^{n_{mis}},$$
where $n_{11} = \sum_{i=1}^{n} Y_i D_i$, $n_{01} = \sum_{i=1}^{n} (1 - Y_i) D_i$, and $n_{mis} = \sum_{i=1}^{n} (1 - D_i)$. Clearly, this likelihood function depends on $\theta$ only through the three probability masses $\phi = (\phi_{11}, \phi_{01}, \phi_{mis}) \equiv (\theta_{11}, \theta_{01}, \theta_{10} + \theta_{00}) \in \Phi$, no matter what the observations are, so that the likelihood for $\theta$ has "data-independent" flat regions expressed by a set-valued map of $\phi$,

$$\Gamma(\phi) \equiv \{\theta \in \Theta : \theta_{11} = \phi_{11}, \; \theta_{01} = \phi_{01}, \; \theta_{10} + \theta_{00} = \phi_{mis}\}.$$
The parameter of interest is the mean of $Y$, which is written as a function of $\theta$, $\eta \equiv \Pr(Y = 1) = \theta_{11} + \theta_{10}$. The identified set of $\eta$, $H(\phi)$, as a set-valued map of $\phi$, is defined by the range of $\eta$ when $\theta$ varies over $\Gamma(\phi)$,

$$H(\phi) = [\phi_{11}, \; \phi_{11} + \phi_{mis}],$$
which is Manski (1989)'s bound for $\Pr(Y = 1)$.

In a standard Bayesian inference for $\eta$, we specify a prior for $\theta$, update it by the Bayes rule, and marginalize the posterior of $\theta$ to $\eta$. If the likelihood for $\theta$ has data-independent flat regions, then the prior for $\theta$ conditional on $\{\theta \in \Gamma(\phi)\}$ (i.e., the belief on how the proportion of missing observations $\phi_{mis}$ is divided into $\theta_{10}$ and $\theta_{00}$) will never be updated by data, so that the posteriors of $\theta$ and $\eta$ are sensitive to how it is specified. The robust Bayes procedure considered in this paper aims to make posterior inference free from such sensitivity concerns by introducing multiple priors for $\theta$.

The first step in constructing our prior class is to specify a single prior $\pi_{\phi}$ for the identified parameters $\phi$. In view of $\Gamma$, prior $\pi_{\phi}$ specifies how much prior belief should be assigned to each flat region $\Gamma(\phi)$ of $\theta$'s likelihood, whereas, depending on the way the assigned belief is allocated over $\theta \in \Gamma(\phi)$ (for each $\phi$), the implied prior for $\theta$ may differ. Therefore, by collecting all the possible ways to allocate the assigned belief over $\{\theta \in \Gamma(\phi) : \phi \in \Phi\}$, we obtain the following class of prior distributions for $\theta$,
$$\mathcal{M}_{\pi_{\phi}} = \{\pi_{\theta} : \pi_{\theta}(\Gamma(B)) = \pi_{\phi}(B) \text{ for all } B \in \mathcal{B}\}.$$
By applying the Bayes rule to each $\pi_{\theta} \in \mathcal{M}_{\pi_{\phi}}$ and marginalizing the posterior of $\theta$ to $\eta$, we obtain the class of posteriors of $\eta$, $\{F_{\eta|X} : \pi_{\theta} \in \mathcal{M}_{\pi_{\phi}}\}$. The class of posteriors is summarized by its lower envelope (lower probability), $F_{\eta|X*}(\cdot) = \inf_{\pi_{\theta} \in \mathcal{M}_{\pi_{\phi}}} F_{\eta|X}(\cdot)$, which is interpreted as saying that the posterior belief for $\{\eta \in \cdot\}$ is at least $F_{\eta|X*}(\cdot)$ no matter which $\pi_{\theta} \in \mathcal{M}_{\pi_{\phi}}$ is used (i.e., no matter how prior beliefs are formed for $\theta_{10}$ and $\theta_{00}$ given the proportion of missing observations $\phi_{mis}$).

In the main theorem of this paper, we show

$$F_{\eta|X*}(D) = F_{\phi|X}(\{\phi : H(\phi) \subset D\}),$$
where $F_{\phi|X}$ denotes the posterior distribution of $\phi$ implied by the prior $\pi_{\phi}$. An implication of this result is that the posterior probability law of the identified set $H(\phi)$, $\phi \sim F_{\phi|X}$, can be used to summarize the posterior ambiguous belief (multiple posteriors) for $\eta$ implied by the prior class $\mathcal{M}_{\pi_{\phi}}$, and this fact plays a key role in our development of statistical decisions and set inference for $\eta$ based on the posterior lower probability. Leaving their formal analysis to later sections, let us state briefly how to implement the posterior lower probability inference for $\eta$.

1. Specify a prior $\pi_{\phi}$ for $\phi$ and update it by the Bayes rule. When a credible prior for $\phi$ is not available, a reasonably "non-informative" prior may be used.$^{1}$

2. Let $\{\phi_s : s = 1, \ldots, S\}$ be random draws of $\phi$ from the posterior of $\phi$. The mean and median of the posterior lower probability of $\eta$ can be defined via the gamma-minimax decision criterion, and they can be approximated by
$$\arg\min_{a} \frac{1}{S} \sum_{s=1}^{S} \sup_{\eta \in H(\phi_s)} (a - \eta)^2 \quad \text{and} \quad \arg\min_{a} \frac{1}{S} \sum_{s=1}^{S} \sup_{\eta \in H(\phi_s)} |a - \eta|,$$
respectively.

3. The posterior lower credible region of $\eta$ at credibility level $1 - \alpha$, which can be interpreted as a contour set of the posterior lower probability of $\eta$, is defined as the smallest interval that contains $H(\phi)$ with posterior probability $1 - \alpha$ (Proposition 5.1 below proposes an algorithm to compute it in interval-identified cases). Under certain regularity conditions, which are satisfied in the current missing data example, the posterior lower credible region of $\eta$ constructed in this way asymptotically attains the frequentist coverage probability $1 - \alpha$ for the true identified set. A numerical sketch of these three steps is given below.
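To make these steps concrete, the following sketch (illustrative code, not from the paper) implements the three steps for the missing data example, assuming a flat Dirichlet prior on $\phi = (\phi_{11}, \phi_{01}, \phi_{mis})$ so that the posterior of $\phi$ is Dirichlet and $H(\phi_s) = [\phi_{11,s}, \phi_{11,s} + \phi_{mis,s}]$; the function and variable names are hypothetical.

```python
import numpy as np

def lower_probability_inference(n11, n01, n_mis, S=5000, alpha=0.10, seed=0):
    """Robust Bayes inference for eta = Pr(Y=1) in the missing data example.

    Assumes a flat Dirichlet(1,1,1) prior on phi = (phi_11, phi_01, phi_mis),
    so the posterior of phi is Dirichlet(n11+1, n01+1, n_mis+1).
    """
    rng = np.random.default_rng(seed)

    # Step 1: draw phi from its posterior.
    phi = rng.dirichlet([n11 + 1, n01 + 1, n_mis + 1], size=S)
    lo, hi = phi[:, 0], phi[:, 0] + phi[:, 2]          # H(phi_s) = [lo_s, hi_s]

    # Step 2: gamma-minimax point estimate under quadratic loss.
    # For an interval [lo, hi], sup_{eta in H(phi)} (a - eta)^2 = max((a-lo)^2, (hi-a)^2).
    grid = np.linspace(0.0, 1.0, 501)
    risk = np.mean(np.maximum((grid[:, None] - lo) ** 2,
                              (grid[:, None] - hi) ** 2), axis=1)
    a_hat = grid[np.argmin(risk)]

    # Step 3: posterior lower credible region, i.e., the smallest interval
    # [c - r, c + r] containing H(phi) with posterior probability 1 - alpha.
    # For a center c, d(c, H(phi)) = max(|c - lo|, |c - hi|); take its
    # (1 - alpha)-quantile over posterior draws and minimize over c.
    dist = np.maximum(np.abs(grid[:, None] - lo), np.abs(grid[:, None] - hi))
    r_of_c = np.quantile(dist, 1 - alpha, axis=1)
    c_star = grid[np.argmin(r_of_c)]
    r_star = r_of_c.min()
    return a_hat, (c_star - r_star, c_star + r_star)

# Example: 50 employed, 20 unemployed, 30 missing.
print(lower_probability_inference(50, 20, 30))
```

The grid search is one simple way to carry out the minimizations; any one-dimensional optimizer could be used instead.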

3 Multiple-prior Analysis and the Lower and Upper Probabilities

3.1 Likelihood and Set Identification: The General Framework

Let $(\mathbf{X}, \mathcal{X})$ and $(\Theta, \mathcal{A})$ be measurable spaces of a sample $X \in \mathbf{X}$ and a parameter vector $\theta \in \Theta$, respectively. Our analytical framework covers both a parametric model, $\Theta \subset \mathbb{R}^d$, $d < \infty$, and a non-parametric model where $\Theta$ is a separable Banach space. We make the sample size implicit in our notation until the large-sample analysis in Section 5. Let $\pi_{\theta}$ be a probability distribution on the parameter space $(\Theta, \mathcal{A})$, referred to as a prior distribution for $\theta$. We assume that the conditional distribution of $X$ given $\theta$ has the probability density $p(x|\theta)$ at every $\theta \in \Theta$ with respect to a $\sigma$-finite measure on $(\mathbf{X}, \mathcal{X})$. We call $p(x|\theta)$ the likelihood for $\theta$.

The parameter vector $\theta$ may consist of parameters that determine the behavior of economic agents as well as those that characterize the distribution of unobserved heterogeneity in the population. In the context of missing data or counterfactual causal models, $\theta$ indexes the distribution of the underlying population outcomes or the potential outcomes (see Example 3.1 below). In all of these cases, the parameter $\theta$ should be distinguished from the parameters that are

$^{1}$ See Kass and Wasserman (1996) for a survey of "reasonably" non-informative priors.

used solely to index the sampling distribution of observations. A problem with the identification of $\theta$ typically arises in this context: if multiple values of $\theta$ can generate the same distribution of data, then we say that these $\theta$'s are observationally equivalent and the identification of $\theta$ fails.

In terms of the likelihood function, the observational equivalence of $\theta$ and $\theta' \neq \theta$ means that the values of the likelihood at $\theta$ and $\theta'$ are equal for every possible sample, i.e., $p(x|\theta) = p(x|\theta')$ for every $x \in \mathbf{X}$ (Rothenberg (1971), Drèze (1974), and Kadane (1974)). We represent the observational equivalence relation among $\theta$'s by a many-to-one function $g : (\Theta, \mathcal{A}) \to (\Phi, \mathcal{B})$:

$$g(\theta) = g(\theta') \quad \text{if and only if} \quad p(x|\theta) = p(x|\theta') \text{ for all } x \in \mathbf{X}.$$
The equivalence relationship partitions the parameter space $\Theta$ into equivalence classes, in each of which the likelihood of $\theta$ is "flat" irrespective of observations, and $\phi = g(\theta)$ maps each of these equivalence classes to a point in another parameter space $\Phi$. In the language of structural models in econometrics (Hurwicz (1950), and Koopmans and Reiersol (1950)), $\phi = g(\theta)$ is interpreted as the reduced-form parameter that carries all the information for the structural parameters $\theta$ through the value of the likelihood function. In the literature of Bayesian statistics, $\phi = g(\theta)$ is referred to as the minimally sufficient parameter (abbreviated to sufficient parameter), and the range space of $g(\cdot)$, $(\Phi, \mathcal{B})$, is called the sufficient parameter space (Barankin (1960), Dawid (1979), Florens and Mouchart (1977), Picci (1977), and Florens, Mouchart, and Rolin (1990)).$^{2}$ See Florens and Simoni (2011) for comprehensive discussions on the relationship between frequentist and Bayesian identification. In the presence of sufficient parameters, the likelihood depends on $\theta$ only through the function $g(\theta)$, i.e., there exists a $\mathcal{B}$-measurable function $\hat{p}(x|\phi)$ such that
$$p(x|\theta) = \hat{p}(x|g(\theta)) \quad \forall x \in \mathbf{X} \text{ and } \theta \in \Theta \tag{3.1}$$
holds (Lemma 2.3.1 of Lehmann and Romano (2005)). Denote the inverse image of $g(\cdot)$ by $\Gamma : \Phi \to \Theta$,
$$\Gamma(\phi) = \{\theta \in \Theta : g(\theta) = \phi\},$$
where $\Gamma(\phi)$ and $\Gamma(\phi')$ for $\phi \neq \phi'$ are disjoint, and $\{\Gamma(\phi) : \phi \in \Phi\}$ constitutes a partition of $\Theta$. We assume that $g(\cdot)$ is onto, $g(\Theta) = \Phi$, so that $\Gamma(\phi)$ is non-empty for every $\phi \in \Phi$.$^{3}$

$^{2}$ The sufficient parameter space is unique up to one-to-one transformations (Picci (1977)).
$^{3}$ In an observationally restrictive model in the sense of Koopmans and Reiersol (1950), $\hat{p}(x|\phi)$, i.e., the likelihood function for the sufficient parameters, is well defined for a domain larger than $g(\Theta)$; for example, see Example 2.3 in Section 2.2. In this case, the model possesses the refutability property, and $\Gamma(\phi)$ can be empty for some $\phi \in \Phi$. We do not consider such refutable models in this paper.

In the set-identified model, the parameter of interest is typically a subvector or a transformation of $\theta$. We denote the parameter of interest by a measurable transformation of $\theta$, $\eta = h(\theta)$, $h : (\Theta, \mathcal{A}) \to (\mathcal{H}, \mathcal{D})$. We define the identified set of $\eta$ by the projection of $\Gamma(\phi)$ onto $\mathcal{H}$ through $h(\cdot)$.

Definition 3.1 (Identified Set of $\eta$) (i) The identified set of $\eta$ is a set-valued map $H : \Phi \to \mathcal{H}$ defined by $H(\phi) \equiv \{h(\theta) : \theta \in \Gamma(\phi)\}$.
(ii) The parameter $\eta = h(\theta)$ is point-identified at $\phi$ if $H(\phi)$ is a singleton, and $\eta$ is set-identified at $\phi$ if $H(\phi)$ is not a singleton.

Note that identification of $\eta$ is defined in the pre-posterior sense because it is based on the likelihood evaluated at every possible realization of a sample, not only on the observed one. The task of constructing the sharp bounds of $\eta$ in the partially identified model is equivalent to finding the expression of $H(\phi)$.

3.2 Examples

We now give some examples, both to illustrate the above concepts and notations, and to provide a concrete focus for later development.

Example 3.1 (Bounding the ATE by Linear Programming) Consider the treatment effect model with noncompliance and a binary instrument $Z \in \{1, 0\}$, as considered in Imbens and Angrist (1994) and Angrist, Imbens, and Rubin (1996). Assume that the treatment status and the outcome of interest are both binary. Let $Y$ be the observed outcome and $(Y_1, Y_0) \in \mathcal{Y}^2$ be the potential outcomes, as defined in Example 2.1. We denote by $(D_1, D_0) \in \{1, 0\}^2$ the potential selection responses to the instrument, with the observed treatment status $D = Z D_1 + (1 - Z) D_0$. The data consist of a random sample of $(Y_i, D_i, Z_i)$. Following Imbens and Angrist (1994), we partition the population into four subpopulations defined in terms of the potential treatment-selection responses:

$$T_i = \begin{cases} c & \text{if } D_{1i} = 1 \text{ and } D_{0i} = 0: \text{ complier}, \\ at & \text{if } D_{1i} = D_{0i} = 1: \text{ always-taker}, \\ nt & \text{if } D_{1i} = D_{0i} = 0: \text{ never-taker}, \\ d & \text{if } D_{1i} = 0 \text{ and } D_{0i} = 1: \text{ defier}, \end{cases}$$

where $T_i$ is the indicator for the type of selection response, which is latent since we cannot observe both $(D_{1i}, D_{0i})$.

We assume a randomized instrument, $Z \perp (Y_1, Y_0, D_1, D_0)$. Then, the distribution of observables and the distribution of potential outcomes satisfy the following equalities for $y \in \{1, 0\}$:

$$\begin{aligned}
\Pr(Y = y, D = 1 \mid Z = 1) &= \Pr(Y_1 = y, T = c) + \Pr(Y_1 = y, T = at), \\
\Pr(Y = y, D = 1 \mid Z = 0) &= \Pr(Y_1 = y, T = d) + \Pr(Y_1 = y, T = at), \\
\Pr(Y = y, D = 0 \mid Z = 1) &= \Pr(Y_0 = y, T = d) + \Pr(Y_0 = y, T = nt), \\
\Pr(Y = y, D = 0 \mid Z = 0) &= \Pr(Y_0 = y, T = c) + \Pr(Y_0 = y, T = nt).
\end{aligned} \tag{3.2}$$
Ignoring the marginal distribution of $Z$, a full parameter vector of the model can be specified by a joint distribution of $(Y_1, Y_0, T)$ that consists of 16 probability masses:

$$\theta = \left( \Pr(Y_1 = y, Y_0 = y', T = t) : y = 1, 0; \; y' = 1, 0; \; t = c, nt, at, d \right).$$
Let the ATE be the parameter of interest:

$$\begin{aligned}
\eta &\equiv E(Y_1 - Y_0) = \sum_{t = c, nt, at, d} \left[ \Pr(Y_1 = 1, T = t) - \Pr(Y_0 = 1, T = t) \right] \\
&= \sum_{t = c, nt, at, d} \; \sum_{y = 1, 0} \left[ \Pr(Y_1 = 1, Y_0 = y, T = t) - \Pr(Y_1 = y, Y_0 = 1, T = t) \right] \equiv h(\theta).
\end{aligned}$$
The likelihood conditional on $Z$ depends on $\theta$ only through the distribution of $(Y, D)$ given $Z$, so the sufficient parameter vector consists of eight probability masses:

$$\phi = \left( \Pr(Y = y, D = d \mid Z = z) : y = 1, 0; \; d = 1, 0; \; z = 1, 0 \right).$$
The set of equations given in (3.2) characterizes the observationally equivalent set of distributions of $(Y_1, Y_0, T)$, $\Gamma(\phi)$, when the data distribution is given at $\phi$. Balke and Pearl (1997) derive the identified set of the ATE, $H(\phi) = h(\Gamma(\phi))$, by maximizing or minimizing $h(\theta)$ subject to $\theta$ being in the probability simplex and satisfying constraints (3.2). Since the objective function and the constraints are all linear in $\theta$, this optimization can be solved by linear programming and, consequently, $H(\phi)$ is obtained as the convex interval whose lower and upper bounds are the attained minimum and maximum of the objective function in the linear program (see Balke and Pearl (1997) for a closed-form expression of the bounds and Kitagawa (2009) for an extension to general support of $Y$).
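As an illustration of this linear programming step, the following sketch (illustrative code, not from the paper) computes the ATE bounds $H(\phi)$ for a given $\phi$ by minimizing and maximizing $h(\theta)$ over the 16 probability masses subject to the equality constraints (3.2); the variable names and the use of scipy.optimize.linprog are assumptions of the sketch.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def ate_bounds(phi):
    """Balke-Pearl ATE bounds via linear programming.

    phi: dict mapping (y, d, z) -> Pr(Y=y, D=d | Z=z), the sufficient parameters.
    Decision variables: theta[(y1, y0, t)] = Pr(Y1=y1, Y0=y0, T=t), 16 masses.
    """
    types = ["c", "at", "nt", "d"]
    idx = {v: k for k, v in enumerate(product([0, 1], [0, 1], types))}  # (y1, y0, t) -> column

    # Objective: ATE = sum over theta of (y1 - y0) * theta[(y1, y0, t)].
    c_obj = np.array([y1 - y0 for (y1, y0, t) in idx], dtype=float)

    # Equality constraints (3.2): one row per observed cell (y, d) given z.
    # Types selecting D = 1 under z: Z=1 -> {c, at}; Z=0 -> {d, at}.
    treated = {1: {"c", "at"}, 0: {"d", "at"}}
    A_eq, b_eq = [], []
    for y, z in product([0, 1], [0, 1]):
        row_d1 = np.zeros(16)   # Pr(Y=y, D=1 | Z=z): observed outcome is Y1
        row_d0 = np.zeros(16)   # Pr(Y=y, D=0 | Z=z): observed outcome is Y0
        for (y1, y0, t), k in idx.items():
            if t in treated[z] and y1 == y:
                row_d1[k] = 1.0
            if t not in treated[z] and y0 == y:
                row_d0[k] = 1.0
        A_eq += [row_d1, row_d0]
        b_eq += [phi[(y, 1, z)], phi[(y, 0, z)]]

    lo = linprog(c_obj, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, 1), method="highs")
    hi = linprog(-c_obj, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, 1), method="highs")
    return lo.fun, -hi.fun

# Example phi: hypothetical values of P(Y=y, D=d | Z=z) satisfying condition (3.3) below.
phi = {(1, 1, 1): 0.35, (0, 1, 1): 0.25, (1, 0, 1): 0.25, (0, 0, 1): 0.15,
       (1, 1, 0): 0.20, (0, 1, 0): 0.20, (1, 0, 0): 0.35, (0, 0, 0): 0.25}
print(ate_bounds(phi))
```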

Note that, in this model, special attention is needed for the sufficient parameter space $\Phi$ in order to ensure that the identified set of $\theta$, $\Gamma(\phi)$, is non-empty. Pearl (1995) shows that the distribution of data is compatible with the instrument exogeneity condition, $Z \perp (Y_1, Y_0, D_1, D_0)$, if and only if
$$\max_{d} \sum_{y} \max_{z} \left\{ \Pr(Y = y, D = d \mid Z = z) \right\} \leq 1. \tag{3.3}$$
This implies that, in order to guarantee $\Gamma(\phi) \neq \emptyset$, a prior distribution for $\phi$ must be supported only on the distributions of data that fulfill (3.3).

Example 3.2 (Linear Moment Inequality Model) Consider a model where the parameter of interest $\eta \in \mathcal{H}$ is characterized by linear moment inequalities,
$$E(m(X) - A\eta) \geq 0,$$
where the parameter space $\mathcal{H}$ is a subset of $\mathbb{R}^L$, $m(X)$ is a $J$-dimensional vector of known functions of an observation, and $A$ is a $J \times L$ known constant matrix. By augmenting a $J$-dimensional parameter $\lambda \in [0, \infty)^J$, these moment inequalities can be written as the $J$ moment equalities$^{4}$
$$E(m(X) - A\eta - \lambda) = 0.$$
We let the full parameter vector be $\theta = (\eta, \lambda) \in \mathcal{H} \times [0, \infty)^J$.

To obtain a likelihood function for the moment equality model, we employ the exponentially tilted empirical likelihood for $\theta$ that is considered in the Bayesian context in Schennach (2005). Let $x^n$ be a size $n$ random sample of observations, and let $g(\theta) = A\eta + \lambda$. If the convex hull of $\bigcup_i \{m(x_i) - g(\theta)\}$ contains the origin, then the exponentially tilted empirical likelihood is written as

$$p(x^n|\theta) = \prod_{i=1}^{n} w_i(\theta),$$
where

$$w_i(\theta) = \frac{\exp\left\{ \tau(g(\theta))' \left( m(x_i) - g(\theta) \right) \right\}}{\sum_{i=1}^{n} \exp\left\{ \tau(g(\theta))' \left( m(x_i) - g(\theta) \right) \right\}}, \qquad \tau(g(\theta)) = \arg\min_{\tau \in \mathbb{R}^J} \sum_{i=1}^{n} \exp\left\{ \tau' \left( m(x_i) - g(\theta) \right) \right\}.$$

$^{4}$ The Bayesian formulation of the moment inequality model shown here is owed to Tony Lancaster, who suggested this to us in 2006.

Thus, the parameter $\theta = (\eta, \lambda)$ enters the likelihood only through $g(\theta) = A\eta + \lambda$, so we take $\phi = g(\theta)$ as the sufficient parameter. The identified set for $\theta$ is given by

$$\Gamma(\phi) = \left\{ (\eta, \lambda) \in \mathcal{H} \times [0, \infty)^J : A\eta + \lambda = \phi \right\}.$$
The coordinate projection of $\Gamma(\phi)$ onto $\mathcal{H}$ gives $H(\phi)$, the identified set for $\eta$ (see, e.g., Bertsimas and Tsitsiklis (1997, Chap. 2) for an algorithm for projecting a polyhedron).
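As a rough numerical sketch of how the exponentially tilted empirical likelihood above can be evaluated (hypothetical code; the function names and the use of scipy.optimize.minimize are assumptions, not from the paper), one can solve the inner minimization for the tilting parameter and then form the log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize

def etel_loglik(m_x, g_theta):
    """Log of the exponentially tilted empirical likelihood at g(theta).

    m_x:     (n, J) array of moment functions m(x_i).
    g_theta: (J,)  vector g(theta) = A @ eta + lam.
    Returns -inf if the inner minimization fails (e.g., when the convex hull
    condition appears to be violated).
    """
    u = m_x - g_theta                       # residuals m(x_i) - g(theta), shape (n, J)

    def inner(tau):                         # sum_i exp(tau' u_i), minimized over tau
        return np.exp(u @ tau).sum()

    res = minimize(inner, np.zeros(u.shape[1]), method="BFGS")
    if not res.success:
        return -np.inf
    tau = res.x
    log_w = u @ tau - np.log(np.exp(u @ tau).sum())   # log w_i(theta)
    return log_w.sum()                                # log p(x^n | theta)

# Toy example with J = 2 moments and a candidate value of g(theta).
rng = np.random.default_rng(0)
m_x = rng.normal(size=(200, 2))
print(etel_loglik(m_x, np.array([0.1, -0.05])))
```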

3.3 A Class of Unrevisable Prior Knowledge

Let $\pi_{\theta}$ be a prior for $\theta$ and $\pi_{\phi}$ be the marginal probability measure on the sufficient parameter space $(\Phi, \mathcal{B})$ implied by $\pi_{\theta}$ and $g(\cdot)$, i.e., $\pi_{\phi}(B) = \pi_{\theta}(\Gamma(B))$ for all $B \in \mathcal{B}$. Let $x \in \mathbf{X}$ be sampled data, and assume that $\pi_{\phi}$ has a dominating measure. The posterior distribution of $\theta$, denoted by $F_{\theta|X}(\cdot)$, is then obtained as

$$F_{\theta|X}(A) = \int_{\Phi} \pi_{\theta|\phi}(A|\phi) \, dF_{\phi|X}(\phi), \quad A \in \mathcal{A}, \tag{3.4}$$
where $\pi_{\theta|\phi}(A|\phi)$ denotes the conditional distribution of $\theta$ given $\phi$, and $F_{\phi|X}(\cdot)$ is the posterior distribution of $\phi$ (Barankin (1960), Florens and Mouchart (1974), Picci (1977), Dawid (1979), Poirier (1998)). The posterior distribution of $\theta$ given in (3.4) shows that we are incapable of updating the conditional prior information of $\theta$ given $\phi$ because of the flat likelihood on $\Gamma(\phi)$. Accordingly, we can consider the prior information marginalized to the sufficient parameter $\phi$ as the revisable prior knowledge, which we can potentially update by data, and the conditional priors of $\theta$ given $\phi$, $\left\{ \pi_{\theta|\phi}(\cdot|\phi) : \phi \in \Phi \right\}$, as the unrevisable prior knowledge, which we are never able to update by data.

If we want to obtain the posterior uncertainty of $\theta$ in the form of a probability distribution on $(\Theta, \mathcal{A})$, as desired in the Bayesian paradigm, we need to have a single prior distribution of $\theta$, which necessarily induces unique unrevisable prior knowledge $\pi_{\theta|\phi}$. If the researcher could justify his choice of $\pi_{\theta|\phi}$ by credible prior information, the standard Bayesian updating (3.4) would yield a valid posterior distribution of $\theta$. A challenging situation arises if he is short of a credible prior opinion on $\pi_{\theta|\phi}$. In this case, the researcher, who is aware that $\pi_{\theta|\phi}$ will never be updated by data, might feel anxious about implementing a Bayesian inference procedure that requires him to

specify a unique $\pi_{\theta|\phi}$, because an unconfidently specified $\pi_{\theta|\phi}$ can have a significant influence on the subsequent posterior inference.

The robust Bayes analysis in this paper focuses particularly on such a situation. Let $\mathcal{M}$ be the set of probability measures on $(\Theta, \mathcal{A})$, and $\pi_{\phi}$ be a prior on $(\Phi, \mathcal{B})$ specified by the researcher. Consider the class of prior distributions of $\theta$ defined by
$$\mathcal{M}_{\pi_{\phi}} = \left\{ \pi_{\theta} \in \mathcal{M} : \pi_{\theta}(\Gamma(B)) = \pi_{\phi}(B) \text{ for every } B \in \mathcal{B} \right\}.$$
$\mathcal{M}_{\pi_{\phi}}$ consists of prior distributions of $\theta$ whose marginal for the sufficient parameters coincides with the prespecified $\pi_{\phi}$.$^{5}$ That is, we accept specifying a single prior distribution for the sufficient parameters $\phi$, while we leave the conditional priors $\pi_{\theta|\phi}$ unspecified and allow for arbitrary ones as long as $\pi_{\theta}(\cdot) = \int_{\Phi} \pi_{\theta|\phi}(\cdot|\phi) \, d\pi_{\phi}$ is a probability measure on $(\Theta, \mathcal{A})$.$^{6}$

This prior class may be understood more intuitively if we consider a discrete sufficient parameter space $\Phi = \{\phi_1, \ldots, \phi_K\}$. The constraint $\pi_{\theta}(\Gamma(\phi_k)) = \pi_{\phi}(\phi_k)$ in the definition of $\mathcal{M}_{\pi_{\phi}}$ says that the prior probability for the sufficient parameter, $\pi_{\phi}(\phi_k)$ for each $k = 1, \ldots, K$, specifies one's belief for $\{\theta \in \Gamma(\phi_k)\}$. On the other hand, the conditional prior $\pi_{\theta|\phi}(\cdot|\phi_k)$ is interpreted as his prior belief for $\theta$ conditional on knowing that $\theta$ lies in $\Gamma(\phi_k)$. Allowing arbitrary probability distributions for the conditional prior $\pi_{\theta|\phi}(\cdot|\phi_k)$ means that any probability distribution of $\pi_{\theta|\phi}(\cdot|\phi_k)$, including a probability mass, is allowed as long as it is supported only on $\Gamma(\phi_k)$, and no constraints are imposed across $\pi_{\theta|\phi}(\cdot|\phi_k)$ and $\pi_{\theta|\phi}(\cdot|\phi_{k'})$, $k \neq k'$. We can therefore express $\mathcal{M}_{\pi_{\phi}}$ as
$$\mathcal{M}_{\pi_{\phi}} = \left\{ \sum_{k=1}^{K} \pi_{\theta|\phi}(\cdot|\phi_k) \, \pi_{\phi}(\phi_k) : \pi_{\theta|\phi}(\cdot|\phi_k) \text{ are probability measures for } \theta \text{ supported only on } \Gamma(\phi_k) \right\}.$$
The prior class $\mathcal{M}_{\pi_{\phi}}$ induces a prior class for $\eta = h(\theta)$, which is represented by
$$\mathcal{M}_{\pi_{\phi}}^{\eta} = \left\{ \sum_{k=1}^{K} \pi_{\eta|\phi}(\cdot|\phi_k) \, \pi_{\phi}(\phi_k) : \pi_{\eta|\phi}(\cdot|\phi_k) \text{ are probability measures for } \eta \text{ supported only on } H(\phi_k) \right\}.$$
This representation of $\mathcal{M}_{\pi_{\phi}}^{\eta}$ offers a heuristic interpretation of the ambiguous belief about $\eta$ implied by $\mathcal{M}_{\pi_{\phi}}$: the researcher has a single prior belief on which identified set $H(\phi_k)$ is true in the population, while he lacks a belief on where the true $\eta$ is likely to lie conditional on knowing that $\eta$ lies

$^{5}$ We thank Jean-Pierre Florens for suggesting this way of representing the prior class.
$^{6}$ Sufficient parameters $\phi$ are defined by examining the entire model $\{p(x|\theta) : x \in \mathbf{X}, \theta \in \Theta\}$, so that the prior class $\mathcal{M}_{\pi_{\phi}}$ is by construction model dependent. This distinguishes our approach from the standard robust Bayes analysis where a prior class is intended to represent the researcher's subjective assessment of his imprecise prior knowledge (Berger (1985)).

in $H(\phi_k)$ for every $k = 1, \ldots, K$. That is, the ambiguous prior knowledge for $\eta$ is interpreted as the state of knowledge in which one cannot assess which combination of conditional priors $\left\{ \pi_{\eta|\phi}(\cdot|\phi_k) \right\}_{k=1}^{K}$ is more credible than another, and he considers every possible combination of conditional priors $\left\{ \pi_{\eta|\phi}(\cdot|\phi_k) \right\}_{k=1}^{K}$ as plausible. The measurable selection argument in random set theory (see, e.g., Molchanov (2005, Chap. 1, Sec. 2.2)), which is also used in the proof of Theorem 3.1 below, allows us to extend the interpretation of ambiguity stated here to a general space $\Phi$.

Some readers might wonder whether it is sensible to apply one of the existing selection rules for "non-informative" priors for $\theta$ instead of introducing the class of priors. First of all, Jeffreys's general rule (Jeffreys (1961)), which takes the prior density to be proportional to the square root of the determinant of the Fisher information matrix, is not well defined if the information matrix for $\theta$ is singular at almost every $\theta \in \Theta$, which is the case in many partially identified models, e.g., the missing data example of Section 2. The empirical Bayes-type approach of choosing a prior for $\theta$ (Robbins (1964), Good (1965), Morris (1983), and Berger and Berliner (1986)) also breaks down if the model lacks identification, because the marginal likelihood of data depends only on $\pi_{\phi}$,

$$\int_{\Theta} p(x|\theta) \, d\pi_{\theta} = \int_{\Phi} \hat{p}(x|\phi) \, d\pi_{\phi} \equiv m(x|\pi_{\phi}). \tag{3.5}$$
It is also worth noting that we cannot obtain the reference prior of Bernardo (1979), because the Kullback-Leibler distance between the posterior density $f_{\theta|X}(\theta)$ and a prior density $\pi_{\theta}(\theta)$,

$$\int_{\Theta} \log\left( \frac{f_{\theta|X}(\theta)}{\pi_{\theta}(\theta)} \right) dF_{\theta|X}(\theta) = \int_{\Phi} \log\left( \frac{\hat{p}(x|\phi)}{m(x|\pi_{\phi})} \right) \frac{\hat{p}(x|\phi)}{m(x|\pi_{\phi})} \, d\pi_{\phi},$$
again depends only on $\pi_{\phi}$ for every realization of $X$. These prior selection rules are therefore useful for choosing a prior for the sufficient parameters $\phi$, but they are of no use for selecting $\pi_{\theta}$ out of $\mathcal{M}_{\pi_{\phi}}$. In the analysis given below, we shall not discuss how to select $\pi_{\phi}$, and shall treat $\pi_{\phi}$ as given.

The influence of $\pi_{\phi}$ on the posterior of $\phi$ will diminish as the sample size increases, so the sensitivity issue for the posterior of $\eta$ is expected to be less severe when the sample size is moderate or large.

3.4 Posterior Lower and Upper Probabilities

The prior class $\mathcal{M}_{\pi_{\phi}}$ yields a class of posterior distributions of $\theta$. We first consider summarizing the posterior class by the posterior lower probability $F_{\theta|X*}(\cdot)$ and the posterior upper probability $F_{\theta|X}^{*}(\cdot)$, which are defined, for $A \in \mathcal{A}$, as

$$F_{\theta|X*}(A) \equiv \inf_{\pi_{\theta} \in \mathcal{M}_{\pi_{\phi}}} F_{\theta|X}(A), \qquad F_{\theta|X}^{*}(A) \equiv \sup_{\pi_{\theta} \in \mathcal{M}_{\pi_{\phi}}} F_{\theta|X}(A).$$
Note that the posterior lower and upper probabilities have the conjugacy property $F_{\theta|X*}(A) = 1 - F_{\theta|X}^{*}(A^c)$, so it suffices to focus on one of them in deriving their analytical form. For the lower and upper probabilities to be well defined, we assume the following regularity conditions.

Condition 3.1
(i) A prior for $\phi$, $\pi_{\phi}$, is a proper probability measure on $(\Phi, \mathcal{B})$, and it is absolutely continuous with respect to a $\sigma$-finite measure on $(\Phi, \mathcal{B})$.

(ii) $g : (\Theta, \mathcal{A}) \to (\Phi, \mathcal{B})$ is measurable and its inverse image $\Gamma(\phi)$ is a closed set in $\Theta$, $\pi_{\phi}$-almost every $\phi \in \Phi$.

(iii) $h : (\Theta, \mathcal{A}) \to (\mathcal{H}, \mathcal{D})$ is measurable and $H(\phi) = h(\Gamma(\phi))$ is a closed set in $\mathcal{H}$, $\pi_{\phi}$-almost every $\phi \in \Phi$.

These conditions are imposed in order for the identified sets $\Gamma(\phi)$ and $H(\phi)$ to be interpreted as random closed sets in $\Theta$ and $\mathcal{H}$ induced by a probability law on $(\Phi, \mathcal{B})$.$^{7}$ The closedness of $\Gamma(\phi)$ and $H(\phi)$ is implied, for instance, by continuity of $g(\cdot)$ and $h(\cdot)$. Under these conditions, we can guarantee that, for each $A \in \mathcal{A}$, $\{\phi : \Gamma(\phi) \cap A \neq \emptyset\}$ and $\{\phi : \Gamma(\phi) \subset A\}$ are supported by a probability measure on $(\Phi, \mathcal{B})$.

Theorem 3.1 Assume Condition 3.1.
(i) For each $A \in \mathcal{A}$,

$$F_{\theta|X*}(A) = F_{\phi|X}(\{\phi : \Gamma(\phi) \subset A\}), \tag{3.6}$$

$$F_{\theta|X}^{*}(A) = F_{\phi|X}(\{\phi : \Gamma(\phi) \cap A \neq \emptyset\}), \tag{3.7}$$

$^{7}$ The inference procedure proposed in this paper can be implemented as long as the posterior of $\phi$ is proper, while we do not know how to accommodate an improper prior for $\phi$ in our development of the analytical results.

where $F_{\phi|X}(B)$ is the posterior probability measure of $\phi$, $F_{\phi|X}(B) = \int_{B} f_{\phi|X}(\phi) \, d\phi$ for $B \in \mathcal{B}$.
(ii) Define the posterior lower and upper probabilities of $\eta = h(\theta)$ by, for each $D \in \mathcal{D}$,
$$F_{\eta|X*}(D) \equiv \inf_{\pi_{\theta} \in \mathcal{M}_{\pi_{\phi}}} F_{\theta|X}(h^{-1}(D)), \qquad F_{\eta|X}^{*}(D) \equiv \sup_{\pi_{\theta} \in \mathcal{M}_{\pi_{\phi}}} F_{\theta|X}(h^{-1}(D)).$$
These are given by

$$F_{\eta|X*}(D) = F_{\phi|X}(\{\phi : H(\phi) \subset D\}),$$

$$F_{\eta|X}^{*}(D) = F_{\phi|X}(\{\phi : H(\phi) \cap D \neq \emptyset\}).$$
Proof. For a proof of (i), see Appendix A. For a proof of (ii), see equation (3.8) below.

The expression for $F_{\theta|X*}(A)$ given above shows that the posterior lower probability of $A$ calculates the probability that the set $\Gamma(\phi)$ is contained in the subset $A$ in terms of the posterior probability law of $\phi$. The upper probability, on the other hand, is interpreted as the posterior probability that the set $\Gamma(\phi)$ hits the subset $A$. Since the probability law of a random set is uniquely characterized by the containment probabilities, as in (3.6), or the hitting probabilities, as in (3.7) (the Choquet theorem), these results show that the posterior lower and upper probabilities for $\theta$ and $\eta$ with our specification of the prior class represent a unique probability law of the identified sets of $\theta$ and $\eta$.

The second statement of the theorem provides a procedure for transforming or marginalizing the lower and upper probabilities of $\theta$ into those of the parameter of interest $\eta$. The expressions of $F_{\eta|X*}(D)$ and $F_{\eta|X}^{*}(D)$ are simple and easy to interpret: the lower and upper probabilities of $\eta = h(\theta)$ are the containment and hitting probabilities of the random sets obtained by projecting $\Gamma(\phi)$ through $h(\cdot)$. This marginalization rule for the lower probability follows from
$$F_{\eta|X*}(D) = F_{\theta|X*}(h^{-1}(D)) = F_{\phi|X}(\{\phi : \Gamma(\phi) \subset h^{-1}(D)\}) = F_{\phi|X}(\{\phi : H(\phi) \subset D\}), \tag{3.8}$$
and the marginalization rule for the upper probability is obtained similarly. It is worth noting that the operation of marginalizing the posterior information of $\theta$ to $\eta$ is very different between the lower probability inference and the standard Bayesian inference. In

the standard Bayesian inference with a single posterior, marginalization of the posterior of $\theta$ to $\eta$ is done by integrating the posterior probability measure of $\theta$ over the dimensions other than $\eta$, while in the lower probability inference, marginalization to $\eta$ corresponds to obtaining the probability law of $H(\phi)$, the random set obtained by projecting $\Gamma(\phi)$ through $\eta = h(\theta)$.
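In practice, Theorem 3.1 suggests a simple Monte Carlo approximation of the posterior lower and upper probabilities of $\eta$: draw $\phi$ from its posterior, compute the identified set for each draw, and record the frequencies of containment and hitting. The following sketch (illustrative code under the assumption that $H(\phi)$ is an interval, with hypothetical function names) makes this explicit.

```python
import numpy as np

def lower_upper_probability(H_draws, D):
    """Monte Carlo approximation of F_{eta|X*}(D) and F*_{eta|X}(D).

    H_draws: list of (lo_s, hi_s) pairs, the identified set H(phi_s) for each
             posterior draw phi_s (assumed to be an interval).
    D:       a target interval (d_lo, d_hi).
    """
    d_lo, d_hi = D
    contain = [(d_lo <= lo) and (hi <= d_hi) for lo, hi in H_draws]   # H(phi_s) subset of D
    hit = [(lo <= d_hi) and (d_lo <= hi) for lo, hi in H_draws]       # H(phi_s) intersects D
    return np.mean(contain), np.mean(hit)

# Example with the missing data bounds H(phi) = [phi_11, phi_11 + phi_mis]
# and hypothetical posterior Dirichlet parameters.
rng = np.random.default_rng(0)
phi = rng.dirichlet([51, 21, 31], size=5000)
H_draws = list(zip(phi[:, 0], phi[:, 0] + phi[:, 2]))
print(lower_upper_probability(H_draws, (0.45, 0.85)))
```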

As is well known in the literature, the lower and upper probabilities of a set of probability measures are in general monotone and nonadditive measures (capacities). With our specification of the prior class, the closed-form expressions of Theorem 3.1 ensure that the resulting posterior lower and upper probabilities are supermodular and submodular, respectively.

Corollary 3.1 Assume Condition 3.1. The posterior lower and upper probabilities of $\theta$ are supermodular and submodular, respectively: for $A_1, A_2 \in \mathcal{A}$,

$$F_{\theta|X*}(A_1 \cup A_2) + F_{\theta|X*}(A_1 \cap A_2) \geq F_{\theta|X*}(A_1) + F_{\theta|X*}(A_2),$$

$$F_{\theta|X}^{*}(A_1 \cup A_2) + F_{\theta|X}^{*}(A_1 \cap A_2) \leq F_{\theta|X}^{*}(A_1) + F_{\theta|X}^{*}(A_2).$$

Also, the posterior lower and upper probabilities of $\eta$ are supermodular and submodular, respectively: for $D_1, D_2 \in \mathcal{D}$,

$$F_{\eta|X*}(D_1 \cup D_2) + F_{\eta|X*}(D_1 \cap D_2) \geq F_{\eta|X*}(D_1) + F_{\eta|X*}(D_2),$$

$$F_{\eta|X}^{*}(D_1 \cup D_2) + F_{\eta|X}^{*}(D_1 \cap D_2) \leq F_{\eta|X}^{*}(D_1) + F_{\eta|X}^{*}(D_2).$$

The result of Theorem 3.1 (i) can be seen as a special case of Wasserman's (1990) general construction of the posterior lower and upper probabilities. An important difference from Wasserman (1990) is that, with our way of specifying the prior class $\mathcal{M}_{\pi_{\phi}}$, we can guarantee that the lower probability of the posterior class is an $\infty$-order monotone capacity (a containment functional of random sets).$^{8}$ The submodularity of the posterior upper probability implied by this result is crucial for simplifying the gamma-minimax analysis of Section 4.$^{9}$

$^{8}$ Wasserman (1990, p. 463) posed an open question asking which class of priors ensures that the posterior lower probability is a Shafer belief function (equivalently, a containment functional of random sets). Theorem 3.1 gives an answer to his open question in the situation where the model lacks identifiability.
$^{9}$ To the best of my knowledge, it is not yet known whether submodularity of the posterior upper probability holds in the general setup of Wasserman (1990). Corollary 3.1 demonstrates that, with our specification of the prior class, we obtain a submodular posterior upper probability.

4 Point Decision of $\eta = h(\theta)$ with Multiple Priors

In the standard Bayesian posterior analysis, the mean or median of the posterior distribution of $\eta$ is often reported as a point estimate of $\eta$. If we summarize the posterior information for $\eta$ in terms of its posterior lower and upper probabilities instead of a posterior distribution, how should we construct a reasonable point estimator for $\eta$?

We consider a setting where a statistical decision is made after observing data. Let $a \in \mathcal{H}_a$ be an action, where $\mathcal{H}_a$ is an action space. We interpret an action as reporting a non-randomized point estimate for $\eta$, so we assume the action space $\mathcal{H}_a$ is a subset of $\mathcal{H}$.

a + yields the cost to the decision maker of taking such an action. We assume loss H  H !R function L(; a) is non-negative and -measurable for each a a. D 2 H Given a single prior for , , the posterior risk is de…ned by

(; a) L(; a)dF X (); (4.1)  j ZH where the …rst argument  in the posterior risk represents the dependence of the posterior of  on the speci…cation of the prior for . Our analysis involves multiple priors, so the class of posterior risks ( ; a):  ( ) is considered. The posterior gamma-minimax criterion10   2 M  ranks actions in terms of the worst case posterior risk (upper posterior risk):

$$\rho^{*}(\pi_{\phi}, a) \equiv \sup_{\pi_{\theta} \in \mathcal{M}_{\pi_{\phi}}} \rho(\pi_{\theta}, a),$$
where the first argument in the upper posterior risk represents the dependence of the prior class on the prior for $\phi$.

Definition 4.1 A posterior gamma-minimax action $a_x$ with respect to the prior class $\mathcal{M}_{\pi_{\phi}}$ is an action that minimizes the upper posterior risk, i.e.,

$$\rho^{*}(\pi_{\phi}, a_x) = \inf_{a \in \mathcal{H}_a} \rho^{*}(\pi_{\phi}, a) = \inf_{a \in \mathcal{H}_a} \sup_{\pi_{\theta} \in \mathcal{M}_{\pi_{\phi}}} \rho(\pi_{\theta}, a).$$

$^{10}$ In the robust Bayes literature, the class of prior distributions is often denoted by $\Gamma$, which is why it is called the (conditional) gamma-minimax criterion. Unfortunately, in the literature of belief functions and lower and upper probabilities, $\Gamma$ often denotes a set-valued mapping that generates the lower and upper probabilities. In this paper, we adopt the latter notational convention, but still refer to the decision criterion as the gamma-minimax criterion.

The gamma-minimax decision approach favors a conservative action that guards against the least favorable prior within the class, and it can be seen as a compromise between the Bayesian decision principle and the minimax decision principle.

The next proposition shows that minimizing the upper posterior risk $\rho^{*}(\pi_{\phi}, a)$ is equivalent to minimizing the Choquet expected loss with respect to the posterior upper probability.

Proposition 4.1 Under Condition 3.1, the upper posterior risk satisfies

$$\rho^{*}(\pi_{\phi}, a) = \int L(\eta, a) \, dF_{\eta|X}^{*}(\eta) = \int_{\Phi} \sup_{\eta \in H(\phi)} L(\eta, a) \, f_{\phi|X}(\phi|x) \, d\phi, \tag{4.2}$$
whenever $\int L(\eta, a) \, dF_{\eta|X}^{*}(\eta) < \infty$. Accordingly, a gamma-minimax decision $a_x$ with respect to the prior class $\mathcal{M}_{\pi_{\phi}}$ exists if and only if $E_{\phi|X}\left[ \sup_{\eta \in H(\phi)} L(\eta, a) \right]$ has a minimizer in $a \in \mathcal{H}_a$.

The third expression in (4.2) shows that the posterior gamma-minimax criterion is written as the expectation of the worst-case loss, $\sup_{\eta \in H(\phi)} L(\eta, a)$, with respect to the posterior of $\phi$. The supremum part stems from the ambiguity of $\eta$: given $\phi$, all the researcher knows about $\eta$ is that it lies within the identified set $H(\phi)$, so the researcher forms the loss by supposing that nature chooses the worst case in response to his action $a$. On the other hand, the expectation in $\phi$ represents the posterior uncertainty about the identified set $H(\phi)$: with a finite number of observations, the identified set of $\eta$ is a random set induced by the posterior of $\phi$. The gamma-minimax criterion with the class of priors $\mathcal{M}_{\pi_{\phi}}$ combines this ambiguity of $\eta$ with the posterior uncertainty of the identified set $H(\phi)$ to yield a single objective function to be minimized.

Although a closed-form expression for $a_x$ is not in general available, this proposition suggests a simple numerical algorithm for approximating $a_x$ using a random sample of $\phi$ from its posterior $f_{\phi|X}(\phi|x)$. Let $\{\phi_s\}_{s=1}^{S}$ be $S$ random draws of $\phi$ from the posterior $f_{\phi|X}(\phi)$. We approximate $a_x$ by

$$\hat{a}_x \equiv \arg\min_{a \in \mathcal{H}_a} \frac{1}{S} \sum_{s=1}^{S} \sup_{\eta \in H(\phi_s)} L(\eta, a).$$
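As a rough illustration of this approximation (hypothetical code, not from the paper), when $\eta$ is a scalar, the loss is quadratic, and each $H(\phi_s)$ is an interval $[l_s, u_s]$, the inner supremum has the closed form $\max\{(a - l_s)^2, (u_s - a)^2\}$, and $\hat{a}_x$ can be found by a one-dimensional minimization:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gamma_minimax_action(lower, upper):
    """Approximate the posterior gamma-minimax action under quadratic loss.

    lower, upper: arrays of the endpoints l_s, u_s of H(phi_s) for S posterior draws.
    """
    lower, upper = np.asarray(lower), np.asarray(upper)

    def upper_risk(a):
        # sup over eta in [l_s, u_s] of (eta - a)^2 is attained at an endpoint.
        return np.mean(np.maximum((a - lower) ** 2, (upper - a) ** 2))

    res = minimize_scalar(upper_risk, bounds=(lower.min(), upper.max()), method="bounded")
    return res.x

# Example with simulated interval bounds.
rng = np.random.default_rng(0)
centers = rng.normal(0.5, 0.02, size=5000)
print(gamma_minimax_action(centers - 0.15, centers + 0.15))
```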

The reader may wonder whether the posterior gamma-minimax action $a_x$ can be interpreted as a Bayes action for some posterior distribution in the class. In the case of quadratic loss, the saddle-point

argument as in Grünwald and Dawid (2003) shows that the gamma-minimax action $a_x$ corresponds to the mean of a posterior distribution (a Bayes action) that has maximal variance in the class. Such a maximal-variance posterior for $\eta$ may be obtained by allocating the posterior probability at each $\phi$ to some boundary points of $H(\phi)$.$^{11}$

In the presence of multiple priors, it is known that the a posteriori optimal gamma-minimax action considered here can differ from the a priori optimal gamma-minimax decision (the dynamic inconsistency problem, Vidakovic (2000)). Interestingly, with our specification of the prior class, such a dynamic inconsistency problem does not arise, so the gamma-minimax action obtained above is also supported as an optimal gamma-minimax decision in the unconditional sense. For details and a proof of this claim, see Appendix B.

As an alternative to the (posterior) gamma-minimax criterion considered above, it is possible to consider the gamma-minimax regret criterion (Berger (1985, p. 218), and Rios Insua, Ruggeri, and Vidakovic (1995)), which can be seen as an extension of the minimax regret criterion of Savage (1951) to the multiple-prior Bayes context. In Appendix B, we provide some analytical results for the gamma-minimax regret analysis where the parameter of interest $\eta$ is a scalar and the loss function is quadratic, $L(\eta, a) = (\eta - a)^2$. We demonstrate that the gamma-minimax regret decision can differ from the gamma-minimax decision derived above, but that they coincide asymptotically.

5 Set Estimation of $\eta$

This section discusses how to use the posterior lower probability of $\eta$ to conduct set estimation of $\eta$. In standard Bayesian inference, set estimation is often done by reporting contour sets of the posterior probability density of $\eta$ (the highest posterior density region). If the posterior information for $\eta$ is summarized by the lower and upper probabilities, how should we conduct set estimation of $\eta$?

$^{11}$ A common criticism of the gamma-minimax analysis is that a posterior that yields the gamma-minimax action as a Bayes action does not appear appealing. If this criticism applies to our context, it would make sense to constrain the prior class, or to seek a more appealing decision criterion. Further discussion related to this point is provided in Section 6.1.

20 5.1 Posterior Lower Credible Region

For $\alpha \in (0, 1)$, consider a subset $C_{1-\alpha} \subset \mathcal{H}$ such that the posterior lower probability $F_{\eta|X*}(C_{1-\alpha})$ is greater than or equal to $1 - \alpha$:

$$F_{\eta|X*}(C_{1-\alpha}) = F_{\phi|X}(\{\phi : H(\phi) \subset C_{1-\alpha}\}) \geq 1 - \alpha. \tag{5.1}$$

In words, $C_{1-\alpha}$ is interpreted as "a set on which the posterior credibility of $\eta$ is at least $1 - \alpha$, no matter which posterior is chosen within the class." If we drop this final clause from the statement, we obtain the usual interpretation of a posterior credible region, so $C_{1-\alpha}$ defined in this way seems a natural way to extend Bayesian posterior credible regions to the posterior lower probability. Analogous to the Bayesian posterior credible region, there are multiple $C_{1-\alpha}$'s that satisfy (5.1). For instance, given a posterior credible region $B_{1-\alpha}$ of $\phi$ with credibility $1 - \alpha$, $C_{1-\alpha} = \bigcup_{\phi \in B_{1-\alpha}} H(\phi)$, suggested in an earlier version of Moon and Schorfheide (2011), satisfies (5.1). In our proposal of set inference for $\eta$, we resolve the multiplicity of set estimates and penalize their volume by focusing on the smallest one among the $C_{1-\alpha}$'s satisfying (5.1),

$$C_{1-\alpha} \equiv \arg\min_{C \in \mathcal{C}} \text{Leb}(C) \quad \text{s.t. } F_{\phi|X}(\{\phi : H(\phi) \subset C\}) \geq 1 - \alpha, \tag{5.2}$$
where $\text{Leb}(C)$ is the volume of subset $C$ in terms of the Lebesgue measure and $\mathcal{C}$ is a family of subsets in $\mathcal{H}$ over which the volume-minimizing lower credible region is searched. Analogous to the highest posterior density region in standard Bayesian inference, focusing on the smallest set estimate has a decision-theoretic justification in terms of the gamma-minimax criterion.$^{12}$ We refer to

$C_{1-\alpha}$ as a posterior lower credible region with credibility $1 - \alpha$.

$^{12}$ Focusing on the smallest $C_{1-\alpha}$ has a decision-theoretic justification. We can show that, under mild conditions, $C_{1-\alpha}$ can be supported as a gamma-minimax action (if it exists),
$$C_{1-\alpha} = \arg\min_{C \in \mathcal{C}} \left[ \sup_{\pi_{\theta} \in \mathcal{M}_{\pi_{\phi}}} \int L(\eta, C) \, dF_{\eta|x} \right]$$
with loss function
$$L(\eta, C) = \text{Leb}(C) + b(\alpha) \left[ 1 - 1_{C}(\eta) \right],$$
where $b(\alpha)$ is a positive constant that depends on the credibility level $1 - \alpha$. Note that the loss function is written in terms of the parameter $\eta$, so the object of interest is $\eta$ rather than the identified set $H(\phi)$.

Finding $C_{1-\alpha}$ is challenging if $\eta$ is multi-dimensional and no restriction is placed on the class of subsets $\mathcal{C}$. In what follows, we restrict our analysis to scalar $\eta$ and constrain $\mathcal{C}$ to the class of closed connected intervals. Let $C_r(\eta_c)$ be a closed interval in $\mathcal{H}$, centered at $\eta_c \in \mathcal{H}$, with width $2r$. Accordingly, the constrained minimization problem (5.2) is reduced to
$$\min_{r, \eta_c} \; r \tag{5.3}$$

$$\text{s.t. } F_{\phi|X}(\{\phi : H(\phi) \subset C_r(\eta_c)\}) \geq 1 - \alpha.$$
The next proposition provides an alternative way to formulate this optimization problem, which will be useful for obtaining $C_{1-\alpha}$ numerically.

Proposition 5.1 Let $d : \mathcal{H} \times \mathcal{D} \to \mathbb{R}_{+}$ measure the distance from $\eta_c \in \mathcal{H}$ to the set $H(\phi)$ by

$$d(\eta_c, H(\phi)) \equiv \sup_{\eta \in H(\phi)} \left\{ \| \eta_c - \eta \| \right\}.$$

For each $\eta_c \in \mathcal{H}$, let $r_{1-\alpha}(\eta_c)$ be the $(1-\alpha)$-th quantile of the distribution of $d(\eta_c, H(\phi))$ induced by the posterior distribution of $\phi$, i.e.,

$$r_{1-\alpha}(\eta_c) \equiv \inf \left\{ r : F_{\phi|X}(\{\phi : d(\eta_c, H(\phi)) \leq r\}) \geq 1 - \alpha \right\}.$$
Then, the solution of the constrained minimization problem (5.3) is given by $(r_{1-\alpha}^{*}, \eta_c^{*})$, where

$$\eta_c^{*} = \arg\min_{\eta_c \in \mathcal{H}} r_{1-\alpha}(\eta_c) \quad \text{and} \quad r_{1-\alpha}^{*} = r_{1-\alpha}(\eta_c^{*}).$$

Let $\{\phi_s : s = 1, \ldots, S\}$ be random draws of $\phi$ from its posterior. At each $\eta_c \in \mathcal{H}$, we first calculate $\hat{r}_{1-\alpha}(\eta_c)$, the empirical $(1-\alpha)$-th quantile of $d(\eta_c, H(\phi))$, based on the simulated $d(\eta_c, H(\phi_s))$, $s = 1, \ldots, S$. Provided that $r_{1-\alpha}(\eta_c)$ is continuous in $\eta_c$, the obtained empirical $(1-\alpha)$-th quantile, $\hat{r}_{1-\alpha}(\eta_c)$, is a valid approximation of the underlying $(1-\alpha)$-th quantile $r_{1-\alpha}(\eta_c)$, so that we can approximate $(r_{1-\alpha}^{*}, \eta_c^{*})$ by searching for the minimizer and the minimized value of $\hat{r}_{1-\alpha}(\eta_c)$.
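A direct implementation of this algorithm for an interval-identified $\eta$ is sketched below (hypothetical code, not from the paper; the helper names are illustrative, and the bounded scalar optimizer is one of several possible ways to minimize $\hat{r}_{1-\alpha}(\eta_c)$).

```python
import numpy as np
from scipy.optimize import minimize_scalar

def lower_credible_region(lower, upper, alpha=0.10):
    """Posterior lower credible region via Proposition 5.1 when H(phi_s) is an interval.

    lower, upper: arrays of the endpoints of H(phi_s) over posterior draws.
    Returns the interval [eta_c - r, eta_c + r] with credibility 1 - alpha.
    """
    lower, upper = np.asarray(lower), np.asarray(upper)

    def r_hat(eta_c):
        # d(eta_c, H(phi_s)) = max(|eta_c - l_s|, |eta_c - u_s|) for an interval.
        d = np.maximum(np.abs(eta_c - lower), np.abs(eta_c - upper))
        return np.quantile(d, 1 - alpha)          # empirical (1 - alpha)-quantile

    res = minimize_scalar(r_hat, bounds=(lower.min(), upper.max()), method="bounded")
    eta_c, r = res.x, res.fun
    return eta_c - r, eta_c + r

# Example with simulated posterior draws of interval bounds.
rng = np.random.default_rng(1)
centers = rng.normal(0.5, 0.02, size=5000)
print(lower_credible_region(centers - 0.15, centers + 0.15))
```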

5.2 Asymptotic Property of the Posterior Lower Credible Region

This section examines the large-sample behavior of the posterior lower probability in an interval-identified case, i.e., $H(\phi) \subset \mathbb{R}$ is a closed bounded interval for almost all $\phi \in \Phi$. From now on, we

make the sample size explicit in our notation: a size $n$ sample $X^n$ is generated from its sampling distribution $P_{X^n|\phi_0}$, where $\phi_0$ denotes the value of the sufficient parameters that corresponds to the true data-generating process. We denote the maximum likelihood estimator for $\phi$ by $\hat{\phi}$.

Provided that the posterior of $\phi$ is consistent for $\phi_0$$^{13}$ and the set-valued map $H(\phi)$ is continuous at $\phi = \phi_0$ in terms of the Hausdorff metric $d_H$, it is straightforward to show that the random set $H(\phi)$ represented by the posterior lower probability $F_{\eta|X^n*}(\cdot)$ is consistent for the true identified set $H(\phi_0)$, in the sense that $\lim_{n \to \infty} F_{\phi|X^n}(\{\phi : d_H(H(\phi), H(\phi_0)) > \epsilon\}) = 0$ for almost every sampling sequence of $\{X^n\}$. Our main focus in what follows is therefore on the asymptotic coverage property of the lower credible region $C_{1-\alpha}$ constructed in the previous section. Our main analytical result relies on the following set of regularity conditions.

Condition 5.1

(i) The parameter of interest $\eta$ is a scalar and the identified set $H(\phi)$ is $\pi_{\phi}$-almost surely a non-empty and connected interval, i.e., $H(\phi) = [\eta_l(\phi), \eta_u(\phi)]$, $-\infty \leq \eta_l(\phi) \leq \eta_u(\phi) \leq \infty$. The true identified set $H(\phi_0) = [\eta_l(\phi_0), \eta_u(\phi_0)]$ is a bounded interval.

n n (ii) For sequence an , de…ne random variables L () = an  ()  (^) and U () = ! 1 l l an  ()  (^) , whose distribution is induced by the posterior distribution of  given sam- u u pleXn. The cumulative distribution function of (Ln () ;U n ()) given Xn denoted by J n ( ) is  continuous almost surely under p^(xn  ) for all n. j 0

(iii) There exists a bivariate random variable $(L, U)$ whose cumulative distribution function $J(\cdot)$ on $\mathbb{R}^2$ is continuous, and, at each $c = (c_l, c_u) \in \mathbb{R}^2$, the cumulative distribution function of $(L^n(\phi), U^n(\phi))$ given $X^n$, $J^n(c)$, converges in probability under $P_{X^n|\phi_0}$ to $J(c)$.

(iv) The sampling distribution of $\left( a_n\left(\eta_l(\phi_0) - \eta_l(\hat{\phi})\right), \; a_n\left(\eta_u(\phi_0) - \eta_u(\hat{\phi})\right) \right)$ converges in distribution to $(L, U)$.

Conditions 5.1 (iii) and (iv) state that the estimators for the lower and upper bounds of $H(\phi)$ satisfy the Bernstein-von Mises asymptotic equivalence between the Bayesian estimation and the

$^{13}$ What we mean by posterior consistency of $\phi$ is that $\lim_{n \to \infty} F_{\phi|X^n}(G) = 1$ for every open neighborhood $G$ of $\phi_0$, for almost every sampling sequence. In the case of finite-dimensional $\phi$, this posterior consistency is implied by a set of higher-level conditions on the likelihood of $\phi$. We do not list all those conditions here for the sake of brevity. See Section 7.4 of Schervish (1995) for details.

maximum likelihood estimation in the sense of Theorem 7.101 in Schervish (1995). In a parametric model with finite-dimensional $\phi$, Conditions 5.1 (iii) and (iv) usually hold with $a_n = \sqrt{n}$ and $(L, U)$ distributed bivariate normal. Note that these conditions are implied by more primitive assumptions: (a) regularity of the likelihood of $\phi$ and asymptotic normality of $\sqrt{n}\left(\phi - \hat{\phi}\right)$, (b) $\pi_{\phi}$ puts positive probability on every open neighborhood of $\phi_0$ and $\pi_{\phi}$'s density is smooth at $\phi_0$, and (c) applicability of the delta method with non-zero first derivatives to $\eta_l(\cdot)$ and $\eta_u(\cdot)$ at $\phi = \phi_0$. See Schervish (1995, Section 7.4) for more details on these assumptions.

The next theorem establishes the large-sample coverage property of the posterior lower credible region $C_{1-\alpha}$.

Theorem 5.1 (Asymptotic Coverage Property) Assume Conditions 3.1 and 5.1. Let $C_{1-\alpha}$ be the posterior lower credible region, as defined above. $C_{1-\alpha}$ can be interpreted as a frequentist confidence interval for the true identified set $H(\phi_0)$ with pointwise asymptotic coverage probability $1 - \alpha$, i.e.,

$$\lim_{n \to \infty} P_{X^n|\phi_0}\left( H(\phi_0) \subset C_{1-\alpha} \right) = 1 - \alpha.$$
Proof. See Appendix A.

This result shows that, for interval-identified $\eta$, the posterior lower credible region $C_{1-\alpha}$ achieves the desired coverage for the identified set asymptotically in the frequentist sense (Horowitz and Manski (2000), Chernozhukov, Hong, and Tamer (2007), and Romano and Shaikh (2010)). It is worth noting that the posterior lower credible region $C_{1-\alpha}$ cannot be interpreted as a confidence interval for the parameter of interest as considered in Imbens and Manski (2004) and Stoye (2009); in case $H(\phi_0)$ is an interval, $C_{1-\alpha}$ will be asymptotically wider than the frequentist confidence interval for $\eta$. This observation indicates that the set of priors $\mathcal{M}_{\pi_{\phi}}$ considered in our analysis needs to be downsized in order for a robustified posterior credible region to be interpreted as a frequentist confidence interval for $\eta$, although we do not know how to do so.

It is also worth noting that the asymptotic coverage probability presented above is in the sense of pointwise asymptotic coverage rather than asymptotic uniform coverage over $\phi_0$. The literature has stressed the importance of the uniform coverage property of interval estimates in order to ensure that the interval estimates can have an accurate coverage probability in a finite-sample situation (Andrews and Guggenberger (2009), Romano and Shaikh (2010), Andrews and

Soares (2010), among many others). We do not know whether or not the posterior lower credible region constructed above can attain a uniformly valid coverage probability.

We note that Conditions 5.1 (iii) and (iv) are quite delicate, and we can come up with models where these conditions are violated and $C_{1-\alpha}$ does not attain the desired asymptotic coverage probability for $H(\phi_0)$. We illustrate this point through two examples.

Example 5.1 This example was provided to the author by Andriy Norets and Ulrich Müller (2011, personal communication). Let $x_i$, $i = 1, \ldots, n$, follow i.i.d. $N(\phi_0, 1)$ with mean $\phi_0 = 0.5$. The maximum likelihood estimator $\hat{\phi}$ is the sample average of the $x_i$'s and its sampling distribution is $\hat{\phi} \sim N(\phi_0, n^{-1})$, while, with the uniform prior on $\phi$, the posterior of $\phi$ is $\phi | x^n \sim N(\hat{\phi}, n^{-1})$. Consider the identified set given in the following form,
$$H(\phi) = \begin{cases} [-\phi(1-\phi), \; \phi(1-\phi)] & \text{for } \phi \in [0, 1], \\ \{0\} & \text{for } \phi \notin [0, 1]. \end{cases}$$
This identified set is largest at $\phi = \phi_0$ and $H(\phi) \subset H(\phi_0)$ holds for all $\phi \in \mathbb{R}$. It is straightforward to show that $n\left(\eta_u(\phi_0) - \eta_u(\hat{\phi})\right)$ converges in distribution to $U \sim \chi^2(1)$. The posterior distribution of $U^n(\phi) = n\left(\eta_u(\phi) - \eta_u(\hat{\phi})\right)$, however, fails to converge to $\chi^2(1)$. To see this, consider

$$U^n(\phi) = n\left(\eta_u(\phi) - \eta_u(\hat{\phi})\right) = -2n\left(\hat{\phi} - \phi_0\right)\left(\phi - \hat{\phi}\right) - n\left(\phi - \hat{\phi}\right)^2 = -2\hat{Z}Z - Z^2,$$
where $\hat{Z} = \sqrt{n}\left(\hat{\phi} - \phi_0\right)$ is a standard normal random variable induced by the sampling distribution of $x^n$, and hence a constant once $x^n$ is conditioned on, and $Z = \sqrt{n}\left(\phi - \hat{\phi}\right)$ is a standard normal random variable induced by the posterior distribution of $\phi$. Note that the posterior distribution of $U^n(\phi)$ depends on the realization of the sample $x^n$ no matter how large $n$ is, since the influence of $\hat{Z}$ on the posterior distribution of $U^n(\phi)$ never vanishes, so $U^n(\phi)$ does not have a limiting distribution, leading to a violation of Condition 5.1 (iii).

If we apply set inference using $C_{1-\alpha}$, then the coverage probability turns out to be $P_{X^n|\phi_0}\left( H(\phi_0) \subset C_{1-\alpha} \right) = 0$ for every $n$. This is because $C_{1-\alpha}$, constructed as the smallest interval containing $H(\phi)$ with posterior probability $1 - \alpha$, always fails to cover the true identified set $H(\phi_0)$, which is the widest among all the possible $H(\phi)$. Interestingly, the correct coverage will be retained if we

use $C_{1-\alpha} = \bigcup_{\phi \in B_{1-\alpha}} H(\phi)$ as a robustified posterior credible region, where $B_{1-\alpha}$ is the $(1-\alpha)$ highest posterior density region of $\phi$.

Example 5.2 Let the identified set be given by $H(\phi) = [\max\{\phi_1, \phi_2\}, \; \min\{\phi_3, \phi_4\}]$. This type of bound commonly appears in the intersection bounds analysis (Manski (1990)), and has attracted considerable attention in the literature (e.g., Hirano and Porter (2011), Chernozhukov, Lee, and Rosen (2011)). Here, we illustrate that Conditions 5.1 (iii) and (iv) do not hold in this class of models when the true values of $\phi_1$ and $\phi_2$ happen to be equal. Let us focus on the lower bound $\eta_l(\phi) = \max\{\phi_1, \phi_2\}$. Assume that the maximum likelihood estimators $\hat{\phi}_1$ and $\hat{\phi}_2$ are independent with $\hat{\phi}_1 \sim N\left(\phi_{10}, \frac{1}{n}\right)$ and $\hat{\phi}_2 \sim N\left(\phi_{20}, \frac{1}{n}\right)$, where $\phi_{10} = \phi_{20}$. As for the posteriors of $\phi_1$ and $\phi_2$, assume that $\phi_1 | x^n \sim N\left(\hat{\phi}_1, \frac{1}{n}\right)$ and $\phi_2 | x^n \sim N\left(\hat{\phi}_2, \frac{1}{n}\right)$. In this case, the sampling distribution of $L = \sqrt{n}\left(\eta_l(\phi_0) - \eta_l(\hat{\phi})\right)$ and the posterior distribution of $L^n(\phi) = \sqrt{n}\left(\eta_l(\phi) - \eta_l(\hat{\phi})\right)$ are
$$L \sim -\max\{Z_1, Z_2\}, \qquad L^n(\phi) \mid x^n \sim -\min\left\{ Z_1 - \hat{Z}_{-}, \; Z_2 + \hat{Z}_{+} \right\},$$
where $(Z_1, Z_2)$ are independent standard normal random variables (induced by the sampling distribution in the expression for $L$ and by the posterior in the expression for $L^n(\phi)$), $\hat{Z} = \sqrt{n}\left(\hat{\phi}_1 - \hat{\phi}_2\right)$, $a_{-} = \min\{0, a\}$, and $a_{+} = \max\{0, a\}$. The posterior distribution of $L^n(\phi)$ fails to converge to the sampling distribution of $\sqrt{n}\left(\eta_l(\hat{\phi}) - \eta_l(\phi_0)\right)$. Note that, even when $\hat{Z}$ happens to be zero, $L^n(\phi)$'s posterior distribution differs from the sampling distribution of $L$.

If we use C1 in the current context, C1 will estimate H (0) with inward bias, and   the coverage probability for H (0) will be lower than the nominal coverage. A lesson from this example is that, despite the decision theoretic justi…cation, our robust Bayes posterior inference procedure based on C1 does not resolve the frequentist’s bias issue in the intersection bounds  analysis. In contrast, as in Example 5.1, the correct coverage will be attained again if we use

C1 =  B1 H () as a robusti…ed posterior credible region. [ 2
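As a small numerical check of the last point (our own illustration, not from the paper): when $\hat{Z} = 0$, the sampling limit is $\max\{Z_1, Z_2\}$ while the posterior limit is $\min\{Z_1, Z_2\}$, and a quick simulation makes the gap visible.

```python
import numpy as np

rng = np.random.default_rng(1)
Z1, Z2 = rng.standard_normal((2, 200_000))
L_sampling  = np.maximum(Z1, Z2)   # limiting sampling distribution of sqrt(n)(theta_l(phi_hat) - theta_l(phi_0))
L_posterior = np.minimum(Z1, Z2)   # limiting posterior distribution of L^n(phi) given Z_hat = 0
print("means:  ", L_sampling.mean(), L_posterior.mean())     # ~ +0.56 vs ~ -0.56
print("medians:", np.median(L_sampling), np.median(L_posterior))
```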

6 Discussion and Conclusion

6.1 Does Prior Class $\mathcal{M}(\pi_\phi)$ Subjectively Make Sense?

From the subjective Bayes point of view, whether a posterior inference procedure makes sense depends on how plausible the prior input is. In response to this concern, we briefly discuss whether prior class $\mathcal{M}(\pi_\phi)$ offers a subjectively appealing representation of ambiguity for $\theta$. For specificity, we consider the missing data example of Section 2. Let the outcome observation $Y$ indicate whether a worker is employed ($Y = 1$) or not ($Y = 0$), and let the parameter of interest be the employment rate in a target population. For ease of illustration, assume that the researcher knows there are only two possible distributions of data,

\[
\phi_A = (0.5, \ 0.2, \ 0.3), \qquad \phi_B = (0.501, \ 0.199, \ 0.3),
\]
where $\phi = \big(\Pr(Y=1, D=1), \ \Pr(Y=0, D=1), \ \Pr(D=0)\big)$. Accordingly, the possible bounds for the employment rate are
\[
H(\phi_A) = [0.5, \ 0.8], \qquad H(\phi_B) = [0.501, \ 0.801].
\]
As argued in Section 3.3, the class of priors for $\theta$ induced by $\mathcal{M}(\pi_\phi)$ can be generated by pairs of conditional priors $\big(\pi_{\theta|\phi_A}, \pi_{\theta|\phi_B}\big)$, where $\pi_{\theta|\phi_A}$ and $\pi_{\theta|\phi_B}$ are probability distributions supported only on $H(\phi_A)$ and $H(\phi_B)$, respectively. Suppose you consult two experts about their conditional priors and obtain the following pairs,

\[
\begin{aligned}
\text{Expert 1: } & \pi_{\theta|\phi_A} = \delta_{0.5} \ \text{ and } \ \pi_{\theta|\phi_B} = \delta_{0.801}, \\
\text{Expert 2: } & \pi_{\theta|\phi_A} = U[0.5, 0.8] \ \text{ and } \ \pi_{\theta|\phi_B} = U[0.501, 0.801],
\end{aligned}
\]
where $U[a,b]$ is the uniform distribution on $[a,b]$ and $\delta_c$ is the point mass at $c$. Does one of the expert opinions appear more reasonable than the other? Even when one lacks a precise prior for the employment rate, many would agree that Expert 2's opinion sounds more sensible than Expert 1's. This is perhaps because Expert 1's opinion about the common parameter jumps from one extreme to the other even though the two bounds $H(\phi_A)$ and $H(\phi_B)$ are almost identical, while Expert 2's opinion remains almost the same, in accordance with the fact that $H(\phi_A)$ and $H(\phi_B)$ are very similar. Once we admit the ambiguous belief for $\theta$ implied by prior class $\mathcal{M}(\pi_\phi)$, however, we are not allowed to say that Expert 2's opinion makes more sense than Expert 1's.

This example illustrates that some priors in the class seem less sensible than others. That is, despite the analytical tractability of the proposed prior class, the ambiguous knowledge it represents is not necessarily appealing from the subjective point of view. We therefore believe that it would be worth making some effort to either eliminate or downweight apparently implausible priors so as to downsize the prior class. Development of a feasible and appealing way to do so would be an interesting future research topic.
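As a tiny computational aside (ours, not the paper's), the bounds used in this example are consistent with the worst-case missing-data mapping $H(\phi) = [\Pr(Y=1, D=1), \ \Pr(Y=1, D=1) + \Pr(D=0)]$; a two-line check reproduces the numbers above.

```python
def employment_rate_bounds(p_y1_d1, p_y0_d1, p_d0):
    """Worst-case bounds [Pr(Y=1,D=1), Pr(Y=1,D=1) + Pr(D=0)] for the employment rate."""
    assert abs(p_y1_d1 + p_y0_d1 + p_d0 - 1.0) < 1e-9
    return p_y1_d1, p_y1_d1 + p_d0

print(employment_rate_bounds(0.5,   0.2,   0.3))   # phi_A -> (0.5, 0.8)
print(employment_rate_bounds(0.501, 0.199, 0.3))   # phi_B -> approximately (0.501, 0.801)
```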

6.2 Concluding Remark

This paper proposes a framework of robust Bayes analysis for set-identified models in econometrics. We demonstrate that the posterior lower probability corresponding to prior class $\mathcal{M}(\pi_\phi)$ can be interpreted as the posterior probability law of the identified set (Theorem 3.1). This robust Bayesian way of generating the identified set as a random object has not been investigated in the literature, and it highlights seamless links among partial identification analysis, robust Bayes inference, and random set theory. We employ the gamma-minimax criterion to develop a conservative point decision for the set-identified parameter. The gamma-minimax criterion integrates the ambiguity associated with set identification and the posterior uncertainty about the identified set into a single objective function, and it leads to a numerically solvable gamma-minimax decision as long as we can simulate the identified sets $H(\phi)$ from the posterior of $\phi$. A non-standard finding with our specification of the prior class is that the gamma-minimax optimal action is the same whether the optimal decision is computed before or after observing the data; this is a potentially useful insight for extending the analysis to sequential decision problems.

The posterior lower probability is a non-additive measure, so one drawback of lower probability inference is that we cannot plot it as we would plot posterior probability densities. In order to visualize it, we focus on the posterior lower credible region, defined as an analogue of the posterior credible region in the standard Bayes procedure. We show that, in the interval-identified parameter case, the posterior lower credible region with credibility $(1-\alpha)$ can be easily computed and, under regularity conditions, it can be interpreted as an asymptotically valid frequentist confidence interval for the identified set with coverage $(1-\alpha)$.
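To make the computational claim above concrete, here is a minimal Monte Carlo sketch of the gamma-minimax point decision under quadratic loss. It is our own illustration, not code from the paper: the Dirichlet posterior for $\phi$, the numbers, and the function names are hypothetical stand-ins, and the bound mapping follows the missing-data example.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gamma_minimax_action(lower_draws, upper_draws):
    """Minimize the posterior mean of the worst-case quadratic loss over H(phi) = [l, u]."""
    def upper_risk(a):
        return np.mean(np.maximum((lower_draws - a) ** 2, (upper_draws - a) ** 2))
    res = minimize_scalar(upper_risk, bounds=(lower_draws.min(), upper_draws.max()),
                          method="bounded")
    return res.x

# Hypothetical posterior for the missing-data example: phi ~ Dirichlet (a stand-in, not the paper's data).
rng = np.random.default_rng(0)
phi = rng.dirichlet([50, 20, 30], size=5_000)     # (Pr(Y=1,D=1), Pr(Y=0,D=1), Pr(D=0))
l, u = phi[:, 0], phi[:, 0] + phi[:, 2]           # posterior draws of the endpoints of H(phi)
print("gamma-minimax point estimate:", gamma_minimax_action(l, u))
```

The same posterior draws of the endpoints can be reused to compute the posterior lower credible region, as sketched in the appendix.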

Appendix

A Lemmas and Proofs

In this appendix, we first demonstrate that the set-valued mappings $\Gamma(\phi)$ and $H(\phi)$ defined in the main text are closed random sets (measurable and closed set-valued mappings) induced by a probability measure on $(\Phi, \mathcal{B})$.

Lemma A.1 Assume $(\Theta, \mathcal{A})$ and $(\Phi, \mathcal{B})$ are complete separable metric spaces. Under Condition 3.1, $\Gamma(\phi)$ and $H(\phi)$ are random closed sets induced by a probability measure on $(\Phi, \mathcal{B})$, i.e., $\Gamma(\phi)$ and $H(\phi)$ are closed and, for $A \in \mathcal{A}$ and $D \in \mathcal{H}$,
\[
\{\phi : \Gamma(\phi) \cap A \neq \emptyset\} \in \mathcal{B}, \qquad \{\phi : H(\phi) \cap D \neq \emptyset\} \in \mathcal{B}.
\]

Proof. Closedness of $\Gamma(\phi)$ and $H(\phi)$ follows directly from Conditions 3.1 (ii) and (iii). To prove the measurability of $\{\phi : \Gamma(\phi) \cap A \neq \emptyset\}$, we use Theorem 2.6 in Molchanov (2005), which states that, given $(\Theta, \mathcal{A})$ Polish, $\{\phi : \Gamma(\phi) \cap A \neq \emptyset\} \in \mathcal{B}$ holds if and only if $\{\phi : \theta \in \Gamma(\phi)\} \in \mathcal{B}$ is true for every $\theta \in \Theta$. Since $\Gamma(\phi)$ is an inverse image of the many-to-one and onto mapping $g : \Theta \to \Phi$, a unique value $\phi = g(\theta) \in \Phi$ exists for each $\theta \in \Theta$, and $\{g(\theta)\} \in \mathcal{B}$ since $\Phi$ is a metric space. Hence, $\{\phi : \theta \in \Gamma(\phi)\} \in \mathcal{B}$ holds.

To verify the measurability of $\{\phi : H(\phi) \cap D \neq \emptyset\}$, note that
\[
\{\phi : H(\phi) \cap D \neq \emptyset\} = \{\phi : \Gamma(\phi) \cap h^{-1}(D) \neq \emptyset\}.
\]
Since $h^{-1}(D) \in \mathcal{A}$ by the measurability of $h$ (Condition 3.1 (iii)), the first statement of this lemma implies $\{\phi : H(\phi) \cap D \neq \emptyset\} \in \mathcal{B}$.

A.1 Proofs of Theorem 3.1

Given the measurability of $\Gamma(\phi)$ and $H(\phi)$, as proved in Lemma A.1, our proof of Theorem 3.1 uses the following two lemmas. The first lemma says that, for a fixed subset $A \in \mathcal{A}$ in the parameter space of $\theta$ and for every $\pi_\theta \in \mathcal{M}(\pi_\phi)$, the conditional probability $\pi_{\theta|\phi}(A|\phi)$ is bounded below by the indicator function $1\{\Gamma(\phi) \subset A\}(\phi)$. The second lemma shows that, for each fixed subset $A \in \mathcal{A}$, we can construct a probability measure on $(\Theta, \mathcal{A})$ that belongs to the prior class $\mathcal{M}(\pi_\phi)$ and achieves this lower bound of the conditional probability. Theorem 3.1 follows as a corollary of these two lemmas.

Lemma A.2 Assume Condition 3.1 and let $A \in \mathcal{A}$ be an arbitrary fixed subset of $\Theta$. For every $\pi_\theta \in \mathcal{M}(\pi_\phi)$,
\[
1\{\Gamma(\phi) \subset A\}(\phi) \leq \pi_{\theta|\phi}(A|\phi)
\]
holds $\pi_\phi$-almost surely.

Proof. For the given subset $A$, define $\Phi_1^A = \{\phi : \Gamma(\phi) \subset A\}$. Note that, by Lemma A.1, $\Phi_1^A$ belongs to the sufficient parameter $\sigma$-algebra $\mathcal{B}$. To prove the claim, it suffices to show that
\[
\int_B 1_{\Phi_1^A}(\phi)\, d\pi_\phi \leq \int_B \pi_{\theta|\phi}(A|\phi)\, d\pi_\phi \tag{A.1}
\]
for every $\pi_\theta \in \mathcal{M}(\pi_\phi)$ and $B \in \mathcal{B}$. Consider
\[
\int_B \pi_{\theta|\phi}(A|\phi)\, d\pi_\phi \geq \int_{B \cap \Phi_1^A} \pi_{\theta|\phi}(A|\phi)\, d\pi_\phi = \pi_\theta\big(A \cap \Gamma(B \cap \Phi_1^A)\big).
\]
By the construction of $\Phi_1^A$, $\Gamma(B \cap \Phi_1^A) \subset A$ holds, so
\[
\pi_\theta\big(A \cap \Gamma(B \cap \Phi_1^A)\big) = \pi_\theta\big(\Gamma(B \cap \Phi_1^A)\big) = \pi_\phi\big(B \cap \Phi_1^A\big) = \int_B 1_{\Phi_1^A}(\phi)\, d\pi_\phi.
\]
Thus, inequality (A.1) is proven.

Lemma A.3 Assume Condition 3.1. For each $A \in \mathcal{A}$, there exists $\pi_\theta^* \in \mathcal{M}(\pi_\phi)$ whose conditional distribution $\pi^*_{\theta|\phi}$ achieves the lower bound of $\pi_{\theta|\phi}(A|\phi)$ obtained in Lemma A.2, $\pi_\phi$-almost surely.

Proof. Fix subset $A \in \mathcal{A}$ throughout the proof. Consider partitioning the sufficient parameter space $\Phi$ into three pieces, based on the relationship between $\Gamma(\phi)$ and $A$,
\[
\Phi_0^A = \{\phi : \Gamma(\phi) \cap A = \emptyset\}, \quad
\Phi_1^A = \{\phi : \Gamma(\phi) \subset A\}, \quad
\Phi_2^A = \{\phi : \Gamma(\phi) \cap A \neq \emptyset \text{ and } \Gamma(\phi) \cap A^c \neq \emptyset\},
\]
where each of them belongs to the sufficient parameter $\sigma$-algebra $\mathcal{B}$ by Lemma A.1. Note that $\Phi_0^A$, $\Phi_1^A$, and $\Phi_2^A$ are mutually disjoint and constitute a partition of $\Phi$.

Now, consider a $\Theta$-valued measurable selection $\theta^A(\cdot)$ defined on $\Phi_2^A$ such that $\theta^A(\phi) \in [\Gamma(\phi) \cap A^c]$ holds for $\pi_\phi$-almost every $\phi \in \Phi_2^A$. Such a measurable selection $\theta^A(\phi)$ can be constructed, for instance, by $\theta^A(\phi) = \arg\max_{\theta \in \Gamma(\phi)} d(\theta, A)$, where $d(\theta, A) = \inf_{\theta' \in A} \|\theta - \theta'\|$ (see Theorem 2.27 in Chapter 1 of Molchanov (2005) for $\mathcal{B}$-measurability of such a $\theta^A(\phi)$). Let us pick a probability measure $\pi_\theta \in \mathcal{M}(\pi_\phi)$ from the prior class and construct another measure $\pi_\theta^*$ by
\[
\pi_\theta^*(\tilde{A}) = \pi_\theta\big(\tilde{A} \cap \Gamma(\Phi_0^A)\big) + \pi_\theta\big(\tilde{A} \cap \Gamma(\Phi_1^A)\big) + \pi_\phi\big(\{\phi : \theta^A(\phi) \in \tilde{A}\} \cap \Phi_2^A\big), \qquad \tilde{A} \in \mathcal{A}.
\]
It can be checked that $\pi_\theta^*$ thus constructed is a probability measure on $(\Theta, \mathcal{A})$, i.e., $\pi_\theta^*$ satisfies $\pi_\theta^*(\emptyset) = 0$, $\pi_\theta^*(\Theta) = 1$, and countable additivity. Furthermore, $\pi_\theta^*$ belongs to $\mathcal{M}(\pi_\phi)$ because, for $B \in \mathcal{B}$,
\[
\begin{aligned}
\pi_\theta^*\big(\Gamma(B)\big) &= \pi_\theta\big(\Gamma(B) \cap \Gamma(\Phi_0^A)\big) + \pi_\theta\big(\Gamma(B) \cap \Gamma(\Phi_1^A)\big) + \pi_\phi\big(\{\phi : \theta^A(\phi) \in \Gamma(B)\} \cap \Phi_2^A\big) \\
&= \pi_\theta\big(\Gamma(B \cap \Phi_0^A)\big) + \pi_\theta\big(\Gamma(B \cap \Phi_1^A)\big) + \pi_\phi\big(B \cap \Phi_2^A\big) \\
&= \pi_\phi\big(B \cap \Phi_0^A\big) + \pi_\phi\big(B \cap \Phi_1^A\big) + \pi_\phi\big(B \cap \Phi_2^A\big) \\
&= \pi_\phi(B),
\end{aligned}
\]
where the second line follows because the $\Gamma(\phi)$'s are disjoint and $\theta^A(\phi) \in \Gamma(\phi)$ holds for almost every $\phi \in \Phi_2^A$.

With the thus-constructed $\pi_\theta^*$ and an arbitrary subset $B \in \mathcal{B}$, consider
\[
\pi_\theta^*\big(A \cap \Gamma(B)\big) = \pi_\theta\big(A \cap \Gamma(B) \cap \Gamma(\Phi_0^A)\big) + \pi_\theta\big(A \cap \Gamma(B) \cap \Gamma(\Phi_1^A)\big) + \pi_\phi\big(\{\phi : \theta^A(\phi) \in A \cap \Gamma(B)\} \cap \Phi_2^A\big).
\]
Here, by the construction of $\{\Phi_j^A\}_{j=0,1,2}$ and $\theta^A(\phi)$, we have $A \cap \Gamma(\Phi_0^A) = \emptyset$, $\Gamma(\Phi_1^A) \subset A$, and $\pi_\phi\big(\{\phi : \theta^A(\phi) \in A \cap \Gamma(B)\} \cap \Phi_2^A\big) = 0$. Accordingly, we obtain
\[
\pi_\theta^*\big(A \cap \Gamma(B)\big) = \pi_\theta\big(\Gamma(B) \cap \Gamma(\Phi_1^A)\big) = \pi_\phi\big(B \cap \Phi_1^A\big) = \int_B 1_{\Phi_1^A}(\phi)\, d\pi_\phi.
\]
Since $B \in \mathcal{B}$ is arbitrary, this implies that $\pi^*_{\theta|\phi}(A|\phi) = 1_{\Phi_1^A}(\phi)$, $\pi_\phi$-almost surely. Thus, $\pi_\theta^*$ achieves the lower bound obtained in Lemma A.2.

Proof of Theorem 3.1 (i). Under the given assumptions, the posterior of $\theta$ is given by (see equation (3.4))
\[
F_{\theta|X}(A) = \int_\Phi \pi_{\theta|\phi}(A|\phi)\, dF_{\phi|X}(\phi).
\]
By the monotonicity of the integral, $F_{\theta|X}(A)$ is minimized over the prior class by plugging the attainable pointwise lower bound of $\pi_{\theta|\phi}(A|\phi)$ into the integrand. By Lemmas A.2 and A.3, this attainable pointwise lower bound is $1\{\Gamma(\phi) \subset A\}(\phi)$, so
\[
F_{\theta|X*}(A) = \int_\Phi 1\{\Gamma(\phi) \subset A\}(\phi)\, dF_{\phi|X}(\phi) = F_{\phi|X}\big(\{\phi : \Gamma(\phi) \subset A\}\big).
\]
The posterior upper probability follows by its conjugacy with the lower probability:
\[
F^*_{\theta|X}(A) = 1 - F_{\theta|X*}(A^c) = F_{\phi|X}\big(\{\phi : \Gamma(\phi) \cap A \neq \emptyset\}\big).
\]

A.2 Proof of Proposition 4.1

The next lemma is used to prove Proposition 4.1. It says that the class of posteriors induced by prior class $\mathcal{M}(\pi_\phi)$ exhausts all the probability measures lying uniformly between the posterior lower and upper probabilities. In the terminology of Huber (1973), this property is called representability of the class of probability measures by the lower and upper probabilities.

Lemma A.4 Assume Condition 3.1. Let
\[
\mathcal{G} = \left\{ G : G \text{ a probability measure on } (\Theta, \mathcal{A}) \text{ with } F_{\theta|X*}(A) \leq G(A) \leq F^*_{\theta|X}(A) \text{ for every } A \in \mathcal{A} \right\}.
\]
Then, $\mathcal{G} = \left\{ F_{\theta|X} : F_{\theta|X} \text{ a posterior distribution on } (\Theta, \mathcal{A}) \text{ induced by some } \pi_\theta \in \mathcal{M}(\pi_\phi) \right\}$.

Proof of Lemma A.4. For each $\pi_\theta \in \mathcal{M}(\pi_\phi)$, $F_{\theta|X*}(A) \leq F_{\theta|X}(A) \leq F^*_{\theta|X}(A)$ holds for every $A \in \mathcal{A}$ by the definition of the lower and upper probabilities. Hence, $\mathcal{G}$ contains $\{F_{\theta|X} : \pi_\theta \in \mathcal{M}(\pi_\phi)\}$.

To show the converse, recall Theorem 3.1 (i), which shows that the lower and upper probabilities are the containment and capacity functionals of the random closed set $\Gamma(\phi)$. As a result, by applying the selectionability theorem for random sets (Molchanov (2005), Theorem 1.2.20), for each $G \in \mathcal{G}$ there exists a $\Theta$-valued random variable $\theta(\phi)$, a so-called selection of $\Gamma(\phi)$, such that $\theta(\phi) \in \Gamma(\phi)$ holds for every $\phi \in \Phi$ and $G(A) = F_{\phi|X}(\theta(\phi) \in A)$ for $A \in \mathcal{A}$.

Let $G$ be a fixed arbitrary distribution in $\mathcal{G}$, let $\theta_G(\phi)$ be the associated selection of $\Gamma(\phi)$, and let $\tilde{\pi}_\theta$ be the probability distribution of $\theta_G(\phi)$ induced by the prior of $\phi$:
\[
\tilde{\pi}_\theta(A) = \pi_\phi\big(\{\phi : \theta_G(\phi) \in A\}\big).
\]
Note that such a $\tilde{\pi}_\theta$ belongs to $\mathcal{M}(\pi_\phi)$ since, for each subset $B \in \mathcal{B}$ in the sufficient parameter space,
\[
\tilde{\pi}_\theta\big(\Gamma(B)\big) = \pi_\phi\big(\{\phi : \theta_G(\phi) \in \Gamma(B)\}\big) = \pi_\phi(B),
\]
where the second equality holds because the $\{\Gamma(\phi) : \phi \in B\}$ are mutually disjoint and $\theta_G(\phi) \in \Gamma(\phi)$ for every $\phi$. Since the conditional distribution of $\theta$ given $\phi$ implied by $\tilde{\pi}_\theta$ satisfies $\tilde{\pi}_{\theta|\phi}(A|\phi) = 1\{\theta_G(\phi) \in A\}(\phi)$ for $A \in \mathcal{A}$, the posterior distribution of $\theta$ generated from $\tilde{\pi}_\theta$ is, by (3.4),
\[
\tilde{F}_{\theta|X}(A) = \int_\Phi 1\{\theta_G(\phi) \in A\}(\phi)\, f_{\phi|X}(\phi|x)\, d\phi = F_{\phi|X}\big(\theta_G(\phi) \in A\big) = G(A).
\]

Thus, we have shown that for each $G \in \mathcal{G}$ there exists a prior $\tilde{\pi}_\theta \in \mathcal{M}(\pi_\phi)$ with which the posterior of $\theta$ coincides with $G$. Hence, $\mathcal{G} \subset \{F_{\theta|X} : \pi_\theta \in \mathcal{M}(\pi_\phi)\}$.

Proof of Proposition 4.1. Let $\mathcal{G}$ be the class of probability measures on $(\Theta, \mathcal{A})$ defined in Lemma A.4. We apply Proposition 10.3 in Denneberg (1994), which states that if the upper envelope of the class of probability measures $\mathcal{G}$, namely $F^*_{\theta|X}(\cdot)$, is submodular, then, for any non-negative measurable function $k : \Theta \to \mathbb{R}_+$,
\[
\int k(\theta)\, dF^*_{\theta|X} = \sup_{G \in \mathcal{G}} \int k(\theta)\, dG \tag{A.2}
\]
holds, where the left-hand side is a Choquet integral. Corollary 3.1 assures submodularity of $F^*_{\theta|X}(\cdot)$, and Lemma A.4 implies that $\sup_{G \in \mathcal{G}} \int k(\theta)\, dG$ equals $\sup_{\pi_\theta \in \mathcal{M}(\pi_\phi)} \int k(\theta)\, dF_{\theta|X}$. Hence, setting $k(\theta) = L(h(\theta), a)$ in (A.2) equates the Choquet integral of $L(h(\theta), a)$ with respect to $F^*_{\theta|X}$ with the posterior upper risk,
\[
\int L(h(\theta), a)\, dF^*_{\theta|X} = \sup_{\pi_\theta \in \mathcal{M}(\pi_\phi)} \int L(h(\theta), a)\, dF_{\theta|X}. \tag{A.3}
\]
Note further that, by the definition of the Choquet integral and Theorem 3.1 (ii),
\[
\int L(\eta, a)\, dF^*_{\eta|X} = \int_0^\infty F^*_{\eta|X}\big(\{\eta : L(\eta, a) \geq t\}\big)\, dt = \int_0^\infty F^*_{\theta|X}\big(\{\theta : L(h(\theta), a) \geq t\}\big)\, dt = \int L(h(\theta), a)\, dF^*_{\theta|X}, \tag{A.4}
\]
where $F^*_{\eta|X}$ denotes the posterior upper probability of the parameter of interest $\eta = h(\theta)$. Combining (A.3) and (A.4) yields the first equality of the proposition.

Next, we show the second equality of the proposition. By Theorem 3.1 (ii),
\[
\int L(\eta, a)\, dF^*_{\eta|X}(\eta) = \int_0^\infty F^*_{\eta|X}\big(\{\eta : L(\eta, a) \geq t\}\big)\, dt = \int_0^\infty F_{\phi|X}\big(\{\phi : \{\eta : L(\eta, a) \geq t\} \cap H(\phi) \neq \emptyset\}\big)\, dt.
\]
Note that $\{\eta : L(\eta, a) \geq t\} \cap H(\phi) \neq \emptyset$ holds if and only if $\sup_{\eta \in H(\phi)} L(\eta, a) \geq t$, so it holds that
\[
\int L(\eta, a)\, dF^*_{\eta|X}(\eta) = \int_0^\infty F_{\phi|X}\left(\left\{\phi : \sup_{\eta \in H(\phi)} L(\eta, a) \geq t\right\}\right) dt.
\]
Interchanging the order of integration by Tonelli's theorem, we obtain
\[
\int L(\eta, a)\, dF^*_{\eta|X}(\eta) = \int_0^\infty \int_\Phi 1\left\{\sup_{\eta \in H(\phi)} L(\eta, a) \geq t\right\}(\phi)\, dF_{\phi|X}\, dt = \int_\Phi \int_0^\infty 1\left\{\sup_{\eta \in H(\phi)} L(\eta, a) \geq t\right\}(t)\, dt\, dF_{\phi|X} = \int_\Phi \sup_{\eta \in H(\phi)} L(\eta, a)\, dF_{\phi|X}. \tag{A.5}
\]

A.3 Proofs of Proposition 5.1 and Theorem 5.1

Proof of Proposition 5.1. The event $\{H(\phi) \subset C_r(c)\}$ happens if and only if $\{d(c, H(\phi)) \leq r\}$. So, $r_{1-\alpha}(c) \equiv \inf\{r : F_{\phi|X}(\{\phi : d(c, H(\phi)) \leq r\}) \geq 1-\alpha\}$ is the radius of the smallest interval centered at $c$ that contains the random set $H(\phi)$ with posterior probability at least $(1-\alpha)$. Therefore, finding a minimizer of $r_{1-\alpha}(c)$ in $c$ is equivalent to searching for the center of the smallest interval that contains $H(\phi)$ with posterior probability $1-\alpha$, and the attained minimum of $r_{1-\alpha}(c)$ gives its radius.
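As a computational footnote to Proposition 5.1 (our own sketch, under the assumption that one can simulate the endpoints of $H(\phi)$ from the posterior of $\phi$): $r_{1-\alpha}(c)$ is the $(1-\alpha)$ posterior quantile of $d(c, H(\phi)) = \max\{c - \theta_l(\phi), \theta_u(\phi) - c\}$, and a simple grid search over candidate centers recovers $C^*_{1-\alpha}$.

```python
import numpy as np

def lower_credible_region(lower_draws, upper_draws, alpha=0.05, n_grid=2001):
    """Smallest interval containing H(phi) with posterior probability 1 - alpha (grid search over centers)."""
    centers = np.linspace(lower_draws.min(), upper_draws.max(), n_grid)
    radii = np.array([
        np.quantile(np.maximum(c - lower_draws, upper_draws - c), 1 - alpha)
        for c in centers
    ])
    j = radii.argmin()
    return centers[j] - radii[j], centers[j] + radii[j]

# Hypothetical posterior draws of phi (same stand-in as in the main-text sketch).
rng = np.random.default_rng(0)
phi = rng.dirichlet([50, 20, 30], size=5_000)
print(lower_credible_region(phi[:, 0], phi[:, 0] + phi[:, 2]))
```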

A.4 Proof of Theorem 5.1

Throughout the proofs below, superscript $n$ is used to denote a random object induced by $X^n \sim P_{X^n|\phi_0}$, the probability law of $X^n$ at $\phi = \phi_0$. We denote convergence in probability under $P_{X^n|\phi_0}$ by "$\to_p$". $\mathbb{B}$ denotes the open unit ball in $\mathbb{R}^2$. A sum of two sets is understood in the sense of the Minkowski sum.

We first prove the following three lemmas.

Lemma A.5 Under Condition 5.1,
\[
\sup_{c \in \mathbb{R}^2} |J^n(c) - J(c)| \to_p 0.
\]

Proof. Lemma 2.11 in van der Vaart (1998) shows a non-stochastic version of this lemma. It is straightforward to extend it to the case of stochastic convergence.

Lemma A.6 Let $\mathrm{Lev}^n_{1-\alpha}$ and $\mathrm{Lev}_{1-\alpha}$ be the $(1-\alpha)$-level sets of $J^n(\cdot)$ and $J(\cdot)$,
\[
\mathrm{Lev}^n_{1-\alpha} = \{c \in \mathbb{R}^2 : J^n(c) \geq 1-\alpha\}, \qquad \mathrm{Lev}_{1-\alpha} = \{c \in \mathbb{R}^2 : J(c) \geq 1-\alpha\}.
\]
Under Condition 5.1, $\mathrm{Lev}^n_{1-\alpha}$ converges to $\mathrm{Lev}_{1-\alpha}$ in the following sense: for every $\epsilon > 0$,

(a) $P_{X^n|\phi_0}\left(\mathrm{Lev}_{1-\alpha} \subset \mathrm{Lev}^n_{1-\alpha} + \epsilon\mathbb{B}\right) \to 1$,

(b) $P_{X^n|\phi_0}\left(\mathrm{Lev}^n_{1-\alpha} \subset \mathrm{Lev}_{1-\alpha} + \epsilon\mathbb{B}\right) \to 1$, as $n \to \infty$.

Note that (a) implies that, for every $c \in \mathrm{Lev}_{1-\alpha}$, there exists a random sequence $c^n$ such that $P_{X^n|\phi_0}(c^n \in \mathrm{Lev}^n_{1-\alpha}) = 1$ for all $n$ and $\|c - c^n\| \to_p 0$. On the other hand, (b) implies that, for any random sequence $c^n$ with $P_{X^n|\phi_0}(c^n \in \mathrm{Lev}^n_{1-\alpha}) = 1$ for all $n$, $\inf_{c \in \mathrm{Lev}_{1-\alpha}} \|c^n - c\| \to_p 0$.

Proof. To prove (a), suppose the conclusion is false. That is, there exist $\epsilon, \delta > 0$ and a random vector $c^n = (c^n_l, c^n_u)$ such that $P_{X^n|\phi_0}(c^n \in \mathrm{Lev}_{1-\alpha}) = 1$ for all $n$ and
\[
P_{X^n|\phi_0}\left(c^n \notin \mathrm{Lev}^n_{1-\alpha} + \epsilon\mathbb{B}\right) > \delta \ \text{ for infinitely many } n.
\]
The event $c^n \notin \mathrm{Lev}^n_{1-\alpha} + \epsilon\mathbb{B}$ implies $J^n\!\left(c^n_l + \tfrac{\epsilon}{2}, c^n_u + \tfrac{\epsilon}{2}\right) < 1-\alpha$, since $\left(c^n_l + \tfrac{\epsilon}{2}, c^n_u + \tfrac{\epsilon}{2}\right) \notin \mathrm{Lev}^n_{1-\alpha}$. Therefore, we have
\[
P_{X^n|\phi_0}\left(J^n\!\left(c^n_l + \tfrac{\epsilon}{2}, c^n_u + \tfrac{\epsilon}{2}\right) < 1-\alpha\right) > \delta \ \text{ for infinitely many } n. \tag{A.6}
\]
Under Condition 5.1 (iii), however, Lemma A.5 implies
\[
J^n\!\left(c^n_l + \tfrac{\epsilon}{2}, c^n_u + \tfrac{\epsilon}{2}\right) - J\!\left(c^n_l + \tfrac{\epsilon}{2}, c^n_u + \tfrac{\epsilon}{2}\right) \to_p 0.
\]
This convergence combined with
\[
J\!\left(c^n_l + \tfrac{\epsilon}{2}, c^n_u + \tfrac{\epsilon}{2}\right) \geq J(c^n_l, c^n_u) \geq 1-\alpha
\]
implies that $P_{X^n|\phi_0}\left(J^n\!\left(c^n_l + \tfrac{\epsilon}{2}, c^n_u + \tfrac{\epsilon}{2}\right) \geq 1-\alpha\right) \to 1$ as $n \to \infty$. This contradicts (A.6).

To prove (b), suppose again that the conclusion is false. Then there exist $\epsilon, \delta > 0$ and a random vector $c^n = (c^n_l, c^n_u)$ such that $P_{X^n|\phi_0}(c^n \in \mathrm{Lev}^n_{1-\alpha}) = 1$ for all $n$ and
\[
P_{X^n|\phi_0}\left(c^n \notin \mathrm{Lev}_{1-\alpha} + \epsilon\mathbb{B}\right) > \delta \ \text{ for infinitely many } n.
\]
Similar to the proof of (a), the event $c^n \notin \mathrm{Lev}_{1-\alpha} + \epsilon\mathbb{B}$ implies $J\!\left(c^n_l + \tfrac{\epsilon}{2}, c^n_u + \tfrac{\epsilon}{2}\right) < 1-\alpha$, so
\[
P_{X^n|\phi_0}\left(J\!\left(c^n_l + \tfrac{\epsilon}{2}, c^n_u + \tfrac{\epsilon}{2}\right) < 1-\alpha\right) > \delta \ \text{ holds for infinitely many } n. \tag{A.7}
\]
To find a contradiction, note that
\[
J\!\left(c^n_l + \tfrac{\epsilon}{2}, c^n_u + \tfrac{\epsilon}{2}\right) = \left[J\!\left(c^n_l + \tfrac{\epsilon}{2}, c^n_u + \tfrac{\epsilon}{2}\right) - J(c^n_l, c^n_u)\right] + \left[J(c^n_l, c^n_u) - J^n(c^n_l, c^n_u)\right] + J^n(c^n_l, c^n_u)
\geq \left[J(c^n_l, c^n_u) - J^n(c^n_l, c^n_u)\right] + 1-\alpha,
\]
where the bracketed term in the last line converges in probability to zero by Lemma A.5. This implies $P_{X^n|\phi_0}\left(J\!\left(c^n_l + \tfrac{\epsilon}{2}, c^n_u + \tfrac{\epsilon}{2}\right) \geq 1-\alpha\right) \to 1$ as $n \to \infty$, which contradicts (A.7).

Lemma A.7 Define

\[
K^n = \arg\min\{c_l + c_u : J^n(c) \geq 1-\alpha\}, \qquad K = \arg\min\{c_l + c_u : J(c) \geq 1-\alpha\}.
\]
Under Condition 5.1, we have, for every $\epsilon > 0$,
\[
P_{X^n|\phi_0}\left(K^n \subset K + \epsilon\mathbb{B}\right) \to 1, \quad \text{as } n \to \infty.
\]
That is, for any random sequence $\hat{c}^n$ with $P_{X^n|\phi_0}(\hat{c}^n \in K^n) = 1$ for all $n$, a random sequence $c^n_*$ defined by $c^n_* \in \arg\inf_{c \in K} \|\hat{c}^n - c\|$ satisfies $\|\hat{c}^n - c^n_*\| \to_p 0$.

Proof. Note first that $K^n$ and $K$ are nonempty and closed because the objective function is continuous and monotone and the level sets $\mathrm{Lev}^n_{1-\alpha}$ and $\mathrm{Lev}_{1-\alpha}$ are closed and bounded from below, by continuity of $J^n$ and $J$ as imposed in Conditions 5.1 (ii) and (iii). Suppose that the conclusion is false, that is, there exist $\epsilon, \delta > 0$ and $\hat{c}^n = (\hat{c}^n_l, \hat{c}^n_u)$ with $P_{X^n|\phi_0}(\hat{c}^n \in K^n) = 1$ for all $n$, such that
\[
P_{X^n|\phi_0}\left(\hat{c}^n \notin K + \epsilon\mathbb{B}\right) > \delta \ \text{ holds for infinitely many } n. \tag{A.8}
\]
By the construction of $\hat{c}^n$, $P_{X^n|\phi_0}(\hat{c}^n \in \mathrm{Lev}^n_{1-\alpha}) = 1$ for all $n$, so Lemma A.6 (b) assures that there exists $\tilde{c}^n = (\tilde{c}^n_l, \tilde{c}^n_u)$ satisfying $P_{X^n|\phi_0}(\tilde{c}^n \in \mathrm{Lev}_{1-\alpha}) = 1$ for all $n$ and $\|\hat{c}^n - \tilde{c}^n\| \to_p 0$. Consequently, (A.8) implies that an analogous statement holds for $\tilde{c}^n$ as well,
\[
P_{X^n|\phi_0}\left(\tilde{c}^n \notin K + \epsilon\mathbb{B}\right) > \delta \ \text{ for infinitely many } n.
\]
Let $\hat{f}^n = \hat{c}^n_l + \hat{c}^n_u$, $\tilde{f}^n = \tilde{c}^n_l + \tilde{c}^n_u$, and $f^* = \min\{c_l + c_u : J(c) \geq 1-\alpha\}$. If $\tilde{c}^n \notin K + \epsilon\mathbb{B}$ is the case, then the condition $P_{X^n|\phi_0}(\tilde{c}^n \in \mathrm{Lev}_{1-\alpha}) = 1$ implies that there exists $\gamma > 0$ such that
\[
P_{X^n|\phi_0}\left(\tilde{f}^n - f^* > \gamma\right) > \delta \ \text{ holds for infinitely many } n.
\]
Since $\|\hat{c}^n - \tilde{c}^n\| \to_p 0$, $\hat{f}^n - \tilde{f}^n \to_p 0$ holds as well, and therefore
\[
P_{X^n|\phi_0}\left(\hat{f}^n - f^* > \tfrac{\gamma}{2}\right) > \delta \ \text{ holds for infinitely many } n. \tag{A.9}
\]
In order to derive a contradiction, consider an arbitrary $c \in K$ and apply Lemma A.6 (a) to find a random sequence $c^n_* = (c^n_{*l}, c^n_{*u})$ such that $P_{X^n|\phi_0}(c^n_* \in \mathrm{Lev}^n_{1-\alpha}) = 1$ for all $n$ and $\|c^n_* - c\| \to_p 0$. Along such a sequence $c^n_*$, we have $f^* - (c^n_{*l} + c^n_{*u}) \to_p 0$. Therefore, combining this with (A.9), it holds that
\[
P_{X^n|\phi_0}\left(\hat{f}^n - (c^n_{*l} + c^n_{*u}) > \tfrac{\gamma}{2}\right) > \delta \ \text{ holds for infinitely many } n,
\]
implying that the value of the objective function evaluated at the feasible point $c^n_* \in \mathrm{Lev}^n_{1-\alpha}$ is lower than at $\hat{c}^n$ with positive probability. This contradicts $\hat{c}^n$ being a minimizer, i.e., $P_{X^n|\phi_0}(\hat{c}^n \in K^n) = 1$ for all $n$.

Proof of Theorem 5.2. Denoting a connected interval by $C = [l, u]$, we write the optimization problem for obtaining $C^*_{1-\alpha}$ as
\[
\min_{l,u} \ [u - l] \quad \text{s.t.} \quad F_{\phi|X^n}\left(l \leq \theta_l(\phi) \ \text{ and } \ \theta_u(\phi) \leq u\right) \geq 1-\alpha.
\]
In terms of the random variables $L^n(\phi) = a_n\big(\theta_l(\hat{\phi}) - \theta_l(\phi)\big)$ and $U^n(\phi) = a_n\big(\theta_u(\phi) - \theta_u(\hat{\phi})\big)$, we can rewrite the constraint as
\[
F_{\phi|X^n}\left(L^n(\phi) \leq a_n\big(\theta_l(\hat{\phi}) - l\big) \ \text{ and } \ U^n(\phi) \leq a_n\big(u - \theta_u(\hat{\phi})\big)\right) \geq 1-\alpha.
\]
Therefore, defining $c_l = a_n\big(\theta_l(\hat{\phi}) - l\big)$ and $c_u = a_n\big(u - \theta_u(\hat{\phi})\big)$, we can rewrite this constrained minimization problem as
\[
\min_{c_l, c_u} \ [c_l + c_u] \quad \text{s.t.} \quad J^n(c_l, c_u) \geq 1-\alpha.
\]
Let $K^n$ be the set of solutions of this minimization problem, as defined in Lemma A.7. For $\hat{c}^n = (\hat{c}^n_l, \hat{c}^n_u)$ with $P_{X^n|\phi_0}(\hat{c}^n \in K^n) = 1$, the posterior lower credible region is obtained as $C^*_{1-\alpha} = \left[\theta_l(\hat{\phi}) - \frac{\hat{c}^n_l}{a_n}, \ \theta_u(\hat{\phi}) + \frac{\hat{c}^n_u}{a_n}\right]$.

Consider now the coverage probability of $C^*_{1-\alpha}$ for the true identified set:
\[
P_{X^n|\phi_0}\left(H(\phi_0) \subset C^*_{1-\alpha}\right)
= P_{X^n|\phi_0}\left(\theta_l(\hat{\phi}) - \frac{\hat{c}^n_l}{a_n} \leq \theta_l(\phi_0) \ \text{ and } \ \theta_u(\phi_0) \leq \theta_u(\hat{\phi}) + \frac{\hat{c}^n_u}{a_n}\right)
= P_{X^n|\phi_0}\left(a_n\big(\theta_l(\hat{\phi}) - \theta_l(\phi_0)\big) \leq \hat{c}^n_l \ \text{ and } \ a_n\big(\theta_u(\phi_0) - \theta_u(\hat{\phi})\big) \leq \hat{c}^n_u\right).
\]
By Condition 5.1 (iii), the sampling distributions of $a_n\big(\theta_l(\hat{\phi}) - \theta_l(\phi_0)\big)$ and $a_n\big(\theta_u(\phi_0) - \theta_u(\hat{\phi})\big)$ converge jointly in distribution to $(L, U)$, which has continuous cdf $J(\cdot)$. Furthermore, by Lemma A.7, there exists a random sequence $c^n_* \in K = \arg\min\{c_l + c_u : J(c) \geq 1-\alpha\}$ such that $\|c^n_* - \hat{c}^n\| \to_p 0$. Therefore,
\[
\begin{aligned}
P_{X^n|\phi_0}\left(H(\phi_0) \subset C^*_{1-\alpha}\right)
&= P_{X^n|\phi_0}\left(a_n\big(\theta_l(\hat{\phi}) - \theta_l(\phi_0)\big) \leq \hat{c}^n_l \ \text{ and } \ a_n\big(\theta_u(\phi_0) - \theta_u(\hat{\phi})\big) \leq \hat{c}^n_u\right) \\
&= P_{X^n|\phi_0}\left(a_n\big(\theta_l(\hat{\phi}) - \theta_l(\phi_0)\big) \leq c^n_{*l} \ \text{ and } \ a_n\big(\theta_u(\phi_0) - \theta_u(\hat{\phi})\big) \leq c^n_{*u}\right) + o(1) \\
&= J(c^*) + o(1), \quad \text{for an arbitrary } c^* \in K, \\
&\geq 1-\alpha + o(1),
\end{aligned}
\]
where the third equality follows from the following three facts: (i) $a_n\big(\theta_l(\hat{\phi}) - \theta_l(\phi_0)\big)$ and $a_n\big(\theta_u(\phi_0) - \theta_u(\hat{\phi})\big)$ converge weakly to $(L, U)$ with continuous cdf $J(\cdot)$ by Condition 5.1 (iv); (ii) the cumulative distribution function of $\left(a_n\big(\theta_l(\hat{\phi}) - \theta_l(\phi_0)\big), a_n\big(\theta_u(\phi_0) - \theta_u(\hat{\phi})\big)\right)$ converges uniformly to $J(\cdot)$ by Lemma 2.11 in van der Vaart (1998); and (iii) $P_{X^n|\phi_0}(c^n_* \in K) = 1$ for all $n$ by construction. The last inequality is in fact an equality because, by strict monotonicity of the objective function and continuity of $J(\cdot)$, $c^*$ lies on the boundary of the level set $\{c \in \mathbb{R}^2 : J(c) \geq 1-\alpha\}$.

B More on Gamma-minimax Decision Analysis

B.1 Gamma-minimax Decision and Dynamic Consistency

In this appendix, we consider a gamma-minimax decision problem in which the optimal decision is made prior to observing the data, i.e., the unconditional gamma-minimax decision as considered in Kudō (1967), Berger (1985, pp. 213-218), and Vidakovic (2000). Let $\delta(\cdot)$ be a decision function that maps each $x \in \mathcal{X}$ into the action space $\mathcal{H}_a$, and let $\mathcal{D}$ be the space of decisions (the set of measurable functions $\mathcal{X} \to \mathcal{H}_a$). The Bayes risk is defined as usual:
\[
r(\pi_\theta, \delta) = \int_\Theta \int_{\mathcal{X}} L(h(\theta), \delta(x))\, p(x|\theta)\, dx\, d\pi_\theta. \tag{B.1}
\]
Given the prior class $\mathcal{M}(\pi_\phi)$, the gamma-minimax criterion ranks decisions in terms of the supremum of the Bayes risk, $\overline{r}(\pi_\phi, \delta) \equiv \sup_{\pi_\theta \in \mathcal{M}(\pi_\phi)} r(\pi_\theta, \delta)$. Accordingly, the optimal decision under this criterion is defined as follows.^{14}

Definition B.1 A gamma-minimax decision $\delta^* \in \mathcal{D}$ is a decision rule that minimizes the upper Bayes risk, i.e.,
\[
\overline{r}(\pi_\phi, \delta^*) = \inf_{\delta \in \mathcal{D}} \overline{r}(\pi_\phi, \delta) = \inf_{\delta \in \mathcal{D}} \sup_{\pi_\theta \in \mathcal{M}(\pi_\phi)} r(\pi_\theta, \delta).
\]

In the standard Bayes decision problem with a single prior, the Bayes rule that minimizes $r(\pi_\theta, \delta)$ coincides with the posterior Bayes action for every possible sample $x \in \mathcal{X}$, so whether the decision is made unconditionally or conditionally on the data does not matter for the actual action to be taken. With multiple priors, however, the decision rule that minimizes $\overline{r}(\pi_\phi, \delta)$ does not in general coincide with the conditional gamma-minimax action (see, for example, Betro and Ruggeri (1992)). This phenomenon can be understood by writing the Bayes risk as the average of the posterior risk with respect to the marginal distribution of the data; interchanging the order of integration in (B.1) gives
\[
r(\pi_\theta, \delta) = \int_{\mathcal{X}} \rho(\pi_\theta, \delta(x))\, m(x|\pi_\phi)\, dx. \tag{B.2}
\]
Given $\delta \in \mathcal{D}$ and the class of priors, the $\pi_\theta$ that maximizes $r(\pi_\theta, \delta)$ does not necessarily maximize the posterior risk $\rho(\pi_\theta, \delta(x))$, since the Bayes risk $r(\pi_\theta, \delta)$ may depend on $\pi_\theta$ not only through the posterior risk $\rho(\pi_\theta, \delta(x))$ but also through the marginal distribution of the data $m(x|\pi_\phi)$. Recall, however, that the marginal distribution of the data depends only on $\pi_\phi$ (see (3.5)) and that our class of priors admits only one specification of $\pi_\phi$. As a result, the supremum of the Bayes risk can be written as the supremum of the posterior risk averaged with respect to $m(x|\pi_\phi)$:
\[
\sup_{\pi_\theta \in \mathcal{M}(\pi_\phi)} r(\pi_\theta, \delta) = \int_{\mathcal{X}} \sup_{\pi_\theta \in \mathcal{M}(\pi_\phi)} \rho(\pi_\theta, \delta(x))\, m(x|\pi_\phi)\, dx.
\]
Hence, the unconditional gamma-minimax decision that minimizes the left-hand side coincides with the posterior gamma-minimax action at $m(x|\pi_\phi)$-almost every $x$.

^{14} A decision criterion similar to the one considered here appears in Kudō (1967), Manski (1981), and Lambert and Duncan (1986). These papers consider a model in which the subjective probability distribution on the state of nature can be elicited only up to a class of coarse subsets of the parameter space. Our decision problem shares a similar feature, since the posterior upper probability of $\theta$ can be viewed as a posterior probability distribution over the coarse collection of subsets $\{H(\phi) : \phi \in \Phi\}$.

B.2 Gamma-minimax Regret Analysis

In this appendix, we consider the gamma-minimax regret criterion as an alternative to the (posterior) gamma-minimax criterion considered in Section 4. In order to keep analytical tractability, we consider the case where the parameter of interest $\eta$ is a scalar and the loss function is quadratic, $L(\eta, a) = (\eta - a)^2$. The statistical decisions under the conditional and unconditional gamma-minimax regret criteria are set up as follows.

Definition B.2 Define the lower bound of the posterior risk, given $\pi_\theta$, by $\rho^*(\pi_\theta) = \inf_{a \in \mathcal{H}_a} \rho(\pi_\theta, a)$. The posterior gamma-minimax regret action $a^{reg}_x \in \mathcal{H}_a$ solves
\[
\inf_{a \in \mathcal{H}_a} \sup_{\pi_\theta \in \mathcal{M}(\pi_\phi)} \left\{ \rho(\pi_\theta, a) - \rho^*(\pi_\theta) \right\}.
\]

Definition B.3 Define the lower bound of the Bayes risk, given $\pi_\theta$, by $r^*(\pi_\theta) = \inf_{\delta \in \mathcal{D}} r(\pi_\theta, \delta)$. The gamma-minimax regret decision $\delta^{reg} : \mathcal{X} \to \mathcal{H}_a$ solves
\[
\inf_{\delta \in \mathcal{D}} \sup_{\pi_\theta \in \mathcal{M}(\pi_\phi)} \left\{ r(\pi_\theta, \delta) - r^*(\pi_\theta) \right\}.
\]

Since the loss function is quadratic, the posterior risk $\rho(\pi_\theta, a)$ for a given $\pi_\theta$ is minimized at $\hat{\eta}_{\pi_\theta}$, the posterior mean of $\eta$, if it exists. Therefore, the lower bound of the posterior risk is simply the posterior variance, $\rho^*(\pi_\theta) = E_{\theta|X}\big[(h(\theta) - \hat{\eta}_{\pi_\theta})^2\big]$, and the posterior regret can be written as
\[
\rho(\pi_\theta, a) - \rho^*(\pi_\theta) = E_{\theta|X}\big[(h(\theta) - a)^2 - (h(\theta) - \hat{\eta}_{\pi_\theta})^2\big] = (a - \hat{\eta}_{\pi_\theta})^2.
\]
Let $[\underline{\eta}_x, \overline{\eta}_x]$ be the range of the posterior mean of $\eta$ as $\pi_\theta$ varies over the prior class $\mathcal{M}(\pi_\phi)$, which we assume is bounded; this holds, for instance, if $H(\phi)$ is bounded $\pi_\phi$-almost surely. Then the posterior gamma-minimax regret simplifies to
\[
\sup_{\pi_\theta \in \mathcal{M}(\pi_\phi)} \left\{ \rho(\pi_\theta, a) - \rho^*(\pi_\theta) \right\} =
\begin{cases}
(\overline{\eta}_x - a)^2 & \text{for } a \leq \frac{\underline{\eta}_x + \overline{\eta}_x}{2}, \\[4pt]
(\underline{\eta}_x - a)^2 & \text{for } a > \frac{\underline{\eta}_x + \overline{\eta}_x}{2},
\end{cases} \tag{B.3}
\]
and, therefore, the posterior gamma-minimax regret is minimized at $a = \frac{\underline{\eta}_x + \overline{\eta}_x}{2}$; i.e., the posterior gamma-minimax regret action is simply the midpoint of $[\underline{\eta}_x, \overline{\eta}_x]$, which is qualitatively similar to the local asymptotic minimax regret decision analyzed in Song (2009).

Proposition B.1 Let $\mathcal{H} \subset \mathbb{R}$ and $L(\eta, a) = (\eta - a)^2$. Assume that the posterior variance of $\eta$ is finite for every $\pi_\theta \in \mathcal{M}(\pi_\phi)$. Let $\eta_*(\phi) \equiv \inf\{\eta : \eta \in H(\phi)\}$ and $\eta^*(\phi) \equiv \sup\{\eta : \eta \in H(\phi)\}$.

(i) The posterior gamma-minimax regret action is
\[
a^{reg}_x = \frac{E_{\phi|X}\big[\eta_*(\phi)\big] + E_{\phi|X}\big[\eta^*(\phi)\big]}{2}.
\]
(ii) The unconditional gamma-minimax regret decision with quadratic loss satisfies $\delta^{reg}(x) = a^{reg}_x$, $m(x|\pi_\phi)$-almost surely.

Proof. (i) Given that the posterior variance of $\eta$ is finite for every $\pi_\theta \in \mathcal{M}(\pi_\phi)$, the posterior gamma-minimax regret is well defined and given by
\[
\rho(\pi_\theta, a) - \rho^*(\pi_\theta) = (a - \hat{\eta}_{\pi_\theta})^2,
\]
where $\hat{\eta}_{\pi_\theta}$ is the posterior mean of $\eta$ when the prior of $\theta$ is $\pi_\theta$. Consider the bounds of $\hat{\eta}_{\pi_\theta}$ as $\pi_\theta$ varies over $\mathcal{M}(\pi_\phi)$:
\[
\underline{\eta}_x \equiv \inf_{\pi_\theta \in \mathcal{M}(\pi_\phi)} \hat{\eta}_{\pi_\theta} = \inf_{\pi_\theta \in \mathcal{M}(\pi_\phi)} \int h(\theta)\, dF_{\theta|X}, \qquad
\overline{\eta}_x \equiv \sup_{\pi_\theta \in \mathcal{M}(\pi_\phi)} \hat{\eta}_{\pi_\theta} = \sup_{\pi_\theta \in \mathcal{M}(\pi_\phi)} \int h(\theta)\, dF_{\theta|X}.
\]
By the same argument as that used in obtaining (A.3), (A.4), and (A.5), $\underline{\eta}_x$ and $\overline{\eta}_x$ thus defined are equal to
\[
\underline{\eta}_x = E_{\phi|X}\big[\inf\{\eta : \eta \in H(\phi)\}\big], \qquad \overline{\eta}_x = E_{\phi|X}\big[\sup\{\eta : \eta \in H(\phi)\}\big],
\]
which are finite under the finite posterior variance assumption. Then, as already discussed in (B.3), the posterior gamma-minimax regret action is obtained as $a^{reg}_x = \frac{\underline{\eta}_x + \overline{\eta}_x}{2}$.

(ii) Note that the lower bound of the Bayes risk for $\pi_\theta \in \mathcal{M}(\pi_\phi)$ is written as the average of the posterior variance of $\eta$ with respect to the marginal distribution of the data:
\[
r^*(\pi_\theta) = \inf_{\delta \in \mathcal{D}} \int_{\mathcal{X}} \rho(\pi_\theta, \delta(x))\, m(x|\pi_\phi)\, dx = \int_{\mathcal{X}} \rho^*(\pi_\theta)\, m(x|\pi_\phi)\, dx.
\]
Therefore, the unconditional regret is written as
\[
r(\pi_\theta, \delta) - r^*(\pi_\theta) = \int_{\mathcal{X}} \rho(\pi_\theta, \delta(x))\, m(x|\pi_\phi)\, dx - \int_{\mathcal{X}} \rho^*(\pi_\theta)\, m(x|\pi_\phi)\, dx
= \int_{\mathcal{X}} \left[\rho(\pi_\theta, \delta(x)) - \rho^*(\pi_\theta)\right] m(x|\pi_\phi)\, dx.
\]
Since the marginal distribution of the data does not depend on $\pi_\theta$ once $\pi_\phi$ is fixed, the unconditional gamma-minimax regret can be written as the average of the posterior gamma-minimax regret with respect to $m(x|\pi_\phi)$:
\[
\sup_{\pi_\theta \in \mathcal{M}(\pi_\phi)} \left\{ r(\pi_\theta, \delta) - r^*(\pi_\theta) \right\} = \int_{\mathcal{X}} \sup_{\pi_\theta \in \mathcal{M}(\pi_\phi)} \left\{ \rho(\pi_\theta, \delta(x)) - \rho^*(\pi_\theta) \right\} m(x|\pi_\phi)\, dx.
\]
This implies that the optimal gamma-minimax regret decision $\delta^{reg}(x)$ coincides with the posterior gamma-minimax regret action $a^{reg}_x$, $m(x|\pi_\phi)$-almost surely.

Since the lower bound of the posterior risk $\rho^*(\pi_\theta)$ in general depends on the prior $\pi_\theta$, the posterior gamma-minimax regret action $a^{reg}_x$ differs from the posterior gamma-minimax action $a_x$ obtained in Section 4. To illustrate the difference, consider the posterior gamma-minimax action of Proposition 4.1 when $\eta$ is a scalar and the loss is quadratic. Let $[\eta_*(\phi), \eta^*(\phi)]$ be as defined in Proposition B.1, and let $m(\phi) = \big(\eta_*(\phi) + \eta^*(\phi)\big)/2$ and $r(\phi) = \big(\eta^*(\phi) - \eta_*(\phi)\big)/2 \geq 0$ be the midpoint and the radius of the smallest interval that contains $H(\phi)$. The objective function to be minimized in the posterior gamma-minimax decision problem can be written as
\[
E_{\phi|X}\left[\sup_{\eta \in H(\phi)} (\eta - a)^2\right] = E_{\phi|X}\left[\max\left\{\big(\eta_*(\phi) - a\big)^2, \big(\eta^*(\phi) - a\big)^2\right\}\right]
= E_{\phi|X}\big[(m(\phi) - a)^2\big] + 2\, E_{\phi|X}\big[r(\phi)\,|m(\phi) - a|\big] + E_{\phi|X}\big[r(\phi)^2\big].
\]
Note that $a^{reg}_x$ minimizes the first term, but it does not necessarily minimize the second term, and, therefore, $a^{reg}_x$ can, in general, differ from the gamma-minimax action. The gamma-minimax regret decision with quadratic loss depends only on the distribution of $m(\phi)$, whereas the gamma-minimax decision depends on the joint distribution of $m(\phi)$ and $r(\phi)$. For large samples, this difference disappears and $a_x$ and $a^{reg}_x$ converge to the same action. In general, when the loss function is specified to be a monotonically increasing function of a metric $\|\eta - a\|$, $a_x$ and $a^{reg}_x$ converge to the center of the smallest circle that contains $H(\phi_0)$.
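For a quick numerical illustration of this comparison (our own sketch, reusing the hypothetical Dirichlet posterior from the earlier illustration), the following code computes both actions from simulated endpoints of $H(\phi)$: the regret action uses only the posterior means of the endpoints, while the gamma-minimax action minimizes the posterior mean of the worst-case quadratic loss.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
phi = rng.dirichlet([50, 20, 30], size=5_000)        # hypothetical posterior draws of phi
l, u = phi[:, 0], phi[:, 0] + phi[:, 2]              # endpoints of H(phi)

a_regret  = 0.5 * (l.mean() + u.mean())              # midpoint of the posterior means of the endpoints
a_minimax = minimize_scalar(
    lambda a: np.mean(np.maximum((l - a) ** 2, (u - a) ** 2)),
    bounds=(l.min(), u.max()), method="bounded").x

print("regret action: ", a_regret)
print("minimax action:", a_minimax)                  # close, but not identical in finite samples
```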

References

[1] Andrews, D.W.K. and P. Guggenberger (2009) "Validity of Subsampling and 'Plug-in Asymptotics' Inference for Parameters Defined by Moment Inequalities," Econometric Theory, 25, 669-709.

[2] Andrews, D.W.K. and G. Soares (2010) "Inference for Parameters Defined by Moment Inequalities Using Generalized Moment Selection," Econometrica, 78, 119-157.

[3] Angrist, J.D., G.W. Imbens, and D.B. Rubin (1996) "Identification of Causal Effects Using Instrumental Variables," Journal of the American Statistical Association, 91, 444-472.

[4] Balke, A. and J. Pearl (1997) "Bounds on Treatment Effects from Studies with Imperfect Compliance," Journal of the American Statistical Association, 92, 1171-1176.

[5] Barankin, E.W. (1960) "Sufficient Parameters: Solution of the Minimal Dimensionality Problem," Annals of the Institute of Statistical Mathematics, 12, 91-118.

[6] Beresteanu, A. and F. Molinari (2008) "Asymptotic Properties for a Class of Partially Identified Models," Econometrica, 76, 763-814.

[7] Beresteanu, A., I.S. Molchanov, and F. Molinari (2012) "Partial Identification Using Random Set Theory," Journal of Econometrics, 166, 17-32.

[8] Berger, J.O. (1984) "The Robust Bayesian Viewpoint." In J. Kadane, ed., Robustness of Bayesian Analysis. Amsterdam: North-Holland.

[9] Berger, J.O. (1985) Statistical Decision Theory and Bayesian Analysis, Second edition. Springer-Verlag.

[10] Berger, J.O. and L.M. Berliner (1986) "Robust Bayes and Empirical Bayes Analysis with ε-contaminated Priors," Annals of Statistics, 14, 461-486.

[11] Bernardo, J.M. (1979) “Reference Posterior Distributions for Bayesian Inference,” Journal of the Royal Statistical Society, Series B, 41, No. 1, 113-147.

[12] Bertsimas, D. and J.N. Tsitsiklis (1997) Introduction to Linear Optimization. Athena Scientific, Nashua, NH.

[13] Betro, B. and F. Ruggeri (1992) "Conditional Γ-minimax Actions under Convex Losses," Communications in Statistics, Part A – Theory and Methods, 21, 1051-1066.

[14] Chamberlain, G. and E.E. Leamer (1976) “Matrix Weighted Averages and Posterior Bounds,” Journal of the Royal Statistical Society, Series B, 38, 73-84.

[15] Chernozhukov, V., H. Hong, and E. Tamer (2007) "Estimation and Confidence Regions for Parameter Sets in Econometric Models," Econometrica, 75, 1243-1284.

[16] Chernozhukov, V., S. Lee, and A. Rosen (2011) "Intersection Bounds: Estimation and Inference," cemmap working paper.

[17] Dawid, A.P. (1979) “Conditional Independence in Statistical Theory,” Journal of the Royal Statistical Society, Series B, 41, No. 1, 1-31.

[18] DeRobertis, L. and J.A. Hartigan (1981) “Bayesian Inference Using Intervals of Measures,” The Annals of Statistics, 9, No. 2, 235-244.

[19] Dempster, A.P. (1966) "New Methods for Reasoning Towards Posterior Distributions Based on Sample Data," Annals of Mathematical Statistics, 37, 355-374.

[20] Dempster, A.P. (1967a) “Upper and Lower Probabilities Induced by a Multivalued Mapping,” Annals of Mathematical Statistics, 38, 325-339.

[21] Dempster, A.P. (1967b) “Upper and Lower Probability Inferences Based on a Sample from a Finite Univariate Population,” Biometrika, 54, 515-528.

[22] Dempster, A.P. (1968) "A Generalization of Bayesian Inference," Journal of the Royal Statistical Society, Series B, 30, 205-247.

[23] Denneberg, D. (1994) Non-additive Measures and Integral. Kluwer Academic Publishers, Dordrecht, The Netherlands.

[24] Drèze, J.H. (1974) "Bayesian Theory of Identification in Simultaneous Equations Models." In S.E. Fienberg and A. Zellner, eds., Studies in Bayesian Econometrics and Statistics, In Honor of Leonard J. Savage, pp. 175-191. North-Holland, Amsterdam.

[25] Florens, J.-P. and M. Mouchart (1977) "Reduction of Bayesian Experiments." CORE Discussion Paper 7737, Universite Catholique de Louvain, Louvain-la-Neuve, Belgium.

[26] Florens, J.-P., M. Mouchart, and J.-M. Rolin (1990) Elements of Bayesian Statistics. Marcel Dekker, New York.

[27] Florens, J.-P. and A. Simoni (2011) "Bayesian Identification and Partial Identification." mimeo, Toulouse School of Economics and Universita Bocconi.

[28] Galichon, A. and M. Henry (2006) “Inference in Incomplete Models,” unpublished manuscript, Ecole Polytechnique and Universite de Montreal.

[29] Good, I.J. (1965) The Estimation of Probabilities. MIT Press, Cambridge, Massachusetts.

[30] Gustafson, P. (2009) "What Are the Limits of Posterior Distributions Arising from Nonidentified Models, and Why Should We Care?" Journal of the American Statistical Association, 104, No. 488, 1682-1695.

[31] Gustafson, P. (2010) "Bayesian Inference for Partially Identified Models," The International Journal of Biostatistics, 6, Issue 2.

[32] Hirano, K. and J. Porter (2011) "Impossibility Results for Nondifferentiable Functionals," forthcoming in Econometrica.

[33] Horowitz, J., and C.F. Manski (2000) “Nonparametric Analysis of Randomized Experiments with Missing Covariate and Outcome Data,” Journal of the American Statistical Association, 95, 77-84.

[34] Huber, P.J. (1973) “The Use of Choquet Capacities in Statistics,” Bulletin of International Statistical Institute, 45, 181-191.

[35] Hurwicz, L. (1950) "Generalization of the Concept of Identification," Statistical Inference in Dynamic Economic Models, Cowles Commission Monograph 10. John Wiley and Sons, New York.

[36] Imbens, G.W. and J.D. Angrist (1994) "Identification and Estimation of Local Average Treatment Effects," Econometrica, 62, 467-475.

[37] Imbens, G.W. and C.F. Manski (2004) "Confidence Intervals for Partially Identified Parameters," Econometrica, 72, 1845-1857.

[38] Jeffreys, H. (1961) Theory of Probability, Third edition. Oxford University Press, Oxford.

[39] Kadane, J.B. (1974) "The Role of Identification in Bayesian Theory." In S.E. Fienberg and A. Zellner, eds., Studies in Bayesian Econometrics and Statistics, In Honor of Leonard J. Savage, pp. 175-191. North-Holland, Amsterdam.

[40] Kass, R.E. and L. Wasserman (1996) “The Selection of Prior Distributions by Formal Rules,” Journal of the American Statistical Association, 92, No. 435, 1343-1369.

[41] Kitagawa, T. (2009) "Identification Region of the Potential Outcome Distributions under Instrument Independence." Cemmap Working Paper, University College London.

[42] Kudō, H. (1967) "On Partial Prior Information and the Property of Parametric Sufficiency." In L. Le Cam and J. Neyman, eds., Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 251-265.

[43] Koopmans, T.C. and O. Reiersol (1950) "The Identification of Structural Characteristics," Annals of Mathematical Statistics, 21, 165-181.

[44] Lambert, D. and G.T. Duncan (1986) "Single-parameter Inference Based on Partial Prior Information," The Canadian Journal of Statistics, 14, No. 4, 297-305.

[45] Leamer, E.E. (1982) “Sets of Posterior Means with Bounded Variance Priors,” Econometrica, 50, 725-736.

[46] Lehmann, E.L. and J. P. Romano (2005) Testing Statistical Hypotheses, Second edition. Springer Science + Business Media, New York.

[47] Liao, Y. and W. Jiang (2010) “Bayesian Analysis in Moment Inequality Models,” The Annals of Statistics, 38, 1, 275-316.

[48] Manski, C.F. (1981) “Learning and Decision Making When Subjective Probabilities Have Subjective Domains,” The Annals of Statistics, 9, No. 1, 59-65.

[49] Manski, C.F. (1989) “Anatomy of the Selection Problem,” The Journal of Human Resources, 24, No. 3, 343-360.

[50] Manski, C.F. (1990) "Nonparametric Bounds on Treatment Effects," The American Economic Review, 80, No. 2, 319-323.

[51] Manski, C.F. (2000) "Identification Problems and Decision under Ambiguity: Empirical Analysis of Treatment Response and Normative Analysis of Treatment Choice," Journal of Econometrics, 95, 415-442.

[52] Manski, C.F. (2002) “Treatment Choice under Ambiguity Induced by Inferential Problems,” Journal of Statistical Planning and Inference, 105, 67-82.

[53] Manski, C.F. (2003) Partial Identification of Probability Distributions. Springer-Verlag, New York.

[54] Manski, C.F. (2008) Identification for Prediction and Decision. Harvard University Press, Cambridge, Massachusetts.

[55] Manski, C.F. (2009) "Diversified Treatment under Ambiguity," International Economic Review, 50, 1013-1041.

[56] Molchanov, I. (2005) Theory of Random Sets. Springer-Verlag.

[57] Moon, H.R. and F. Schorfheide (2011) "Bayesian and Frequentist Inference in Partially Identified Models," forthcoming in Econometrica.

[58] Morris, C. (1983) “Parametric Empirical Bayes Inference: Theory and Applications,” Journal of the American Statistical Association, 78, 47-65.

[59] Neath, A.A. and F.J. Samaniego (1997) "On the Efficacy of Bayesian Inference for Nonidentifiable Models," The American Statistician, 51, 3, 225-232.

[60] Pearl, J. (1995) "On the Testability of Causal Models with Latent and Instrumental Variables," Uncertainty in Artificial Intelligence, 11, 435-443.

[61] Picci, G. (1977) "Some Connections Between the Theory of Sufficient Statistics and the Identifiability Problem," SIAM Journal on Applied Mathematics, 33, 383-398.

[62] Poirier, D.J. (1998) "Revising Beliefs in Nonidentified Models," Econometric Theory, 14, 483-509.

[63] Rios Insua, D., F. Ruggeri, and B. Vidakovic (1995) "Some Results on Posterior Regret Γ-minimax Estimation," Statistics and Decisions, 13, 315-331.

[64] Robbins, H. (1964) "The Empirical Bayes Approach to Statistical Decision Problems," Annals of Mathematical Statistics, 35, 1-20.

[65] Romano, J.P. and A.M. Shaikh (2010) "Inference for the Identified Set in Partially Identified Econometric Models," Econometrica, 78, 169-211.

[66] Rothenberg, T.J. (1971) "Identification in Parametric Models," Econometrica, 39, No. 3, 577-591.

[67] Savage, L. (1951) “The Theory of Statistical Decision,” Journal of the American Statistical Association, 46, 55-67.

[68] Shafer, G. (1976) A Mathematical Theory of Evidence. Princeton University Press, Princeton, NJ.

[69] Shafer, G. (1982) “Belief Functions and Parametric Models,” Journal of the Royal Statistical Society, Series B, 44, No. 3, 322-352.

[70] Schennach, S.M. (2005) "Bayesian Exponentially Tilted Empirical Likelihood," Biometrika, 92, 1, 31-46.

[71] Schervish, M.J. (1995) Theory of Statistics. Springer-Verlag, New York.

[72] Song, K. (2009) "Point Decisions for Interval-Identified Parameters," unpublished manuscript, University of Pennsylvania.

[73] Stoye, J. (2009) "More on Confidence Intervals for Partially Identified Parameters," Econometrica, 77, 1299-1315.

[74] van der Vaart, A.W. (1998) Asymptotic Statistics. Cambridge University Press.

[75] Vidakovic, B. (2000) "Γ-minimax: A Paradigm for Conservative Robust Bayesians." In D. Rios Insua and F. Ruggeri, eds., Robust Bayesian Analysis, Lecture Notes in Statistics, 152, 241-259. Springer, New York.

[76] Walley, P. (1991) Statistical Reasoning with Imprecise Probabilities. Chapman and Hall, London.

[77] Wasserman, L.A. (1989) “A Robust Bayesian Interpretation of Likelihood Regions,” The Annals of Statistics, 17, 3, 1387-1393.

[78] Wasserman, L.A. (1990) “Prior Envelopes Based on Belief Functions,” The Annals of Statis- tics, 18, 1, 454-464.

[79] Wasserman, L.A. and J.B. Kadane (1990) “Bayes Theorem for Choquet Capacities,” The Annals of Statistics, 18, 3, 1328-1339.
