NONPARAMETRIC BAYES: INFERENCE UNDER NONIGNORABLE MISSINGNESS AND MODEL SELECTION

By ANTONIO LINERO

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2015

© 2015 Antonio Linero

Dedicated to Mike and Hani for their gracious support, and Katie for her patience!

ACKNOWLEDGMENTS

I would first like to convey my sincerest gratitude to my advisers, Professor Michael J. Daniels and Professor Hani Doss, for their help and encouragement. I have been fortunate to have two excellent mentors to bounce ideas off of, and to offer assurances when I doubted my work. They have provided me with every opportunity to succeed, and it has been an honor to work with them. I would also like to thank my committee members, Professor Malay Ghosh, who always had an open door, and Professor Arunava Banerjee, for providing a valuable perspective. Outside of my committee, I am grateful to Professor Daniel O. Scharfstein for his insights into the missing data problem and for giving me the opportunity to visit Johns Hopkins. I am also particularly grateful to Professor Ron Randles, whose Introduction to Mathematical Statistics course provided the initial inspiration for me to pursue Statistics, and to Professor Andrew Rosalsky for providing me with a deep appreciation of Probability. Lastly, I would like to thank my partner Katie for her inspiration and patience, and my parents, without whose support I would not have had the opportunity to succeed.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 PRELIMINARIES ON BAYESIAN NONPARAMETRICS
  1.1 Introduction
  1.2 Posterior Consistency
  1.3 Review of Random Measures
      1.3.1 Dirichlet Processes
      1.3.2 Mixtures of Dirichlet Processes and Dirichlet Process Mixtures
      1.3.3 Dependent Random Measures

2 INFORMATIVE MISSINGNESS IN LONGITUDINAL STUDIES
  2.1 Introduction
  2.2 Notation
  2.3 Rubin's Classification of Missing Data
  2.4 Why Bayesian Nonparametrics?
  2.5 Existing Approaches
      2.5.1 Likelihood Factorizations
      2.5.2 Non-Likelihood Based Approaches
  2.6 Identifying Restrictions and Sensitivity Parameters
  2.7 Partial and Latent Ignorability
  2.8 Intermittent Missingness
  2.9 Summary of Our Strategy and Our Contributions

3 NONPARAMETRIC BAYES FOR NONIGNORABLE MISSINGNESS
  3.1 Introduction
  3.2 Strategy for Prior Specification

  3.3 Posterior Consistency of p_obs
  3.4 Kullback-Leibler Property for Kernel Mixture Models
  3.5 Identifying Restrictions
      3.5.1 Monotone Missingness
      3.5.2 Non-monotone Missingness
  3.6 Inference by G-Computation
  3.7 Discussion

4 A DIRICHLET PROCESS MIXTURE WORKING MODEL, WITH APPLICATION TO A SCHIZOPHRENIA CLINICAL TRIAL
  4.1 Introduction
  4.2 The Schizophrenia Clinical Trial
  4.3 A Dirichlet Process Mixture Working Prior
  4.4 The Extrapolation Distribution
  4.5 Computation and Inference
  4.6 Simulation Studies
      4.6.1 Performance for Mean Estimation under MAR
      4.6.2 Performance for Effect Estimation Under MNAR
  4.7 Application to the Schizophrenia Clinical Trial
      4.7.1 Comparison to Alternatives and Assessing Model Fit
      4.7.2 Inference and Sensitivity Analysis
  4.8 Discussion

5 EMPIRICAL BAYES ESTIMATION AND MODEL SELECTION FOR HIERARCHICAL NONPARAMETRIC PRIORS
  5.1 Introduction
      5.1.1 Motivating Examples
      5.1.2 Our Contributions
  5.2 Theoretical Development
      5.2.1 Marginal Likelihoods
      5.2.2 Limiting Cases of the Hierarchical Dirichlet Process
  5.3 Estimation of Bayes Factor Surfaces
      5.3.1 Testing Against Boundary Values
      5.3.2 Empirical Bayes Estimation
  5.4 Illustrations
      5.4.1 Quality of Hospital Care Data
      5.4.2 Topic Modeling
  5.5 Discussion

6 DISCUSSION AND FUTURE WORK
  6.1 Rates of Convergence
  6.2 More Work on Non-monotone Missingness
  6.3 Multivariate Models for Missing Data
  6.4 Causal Inference
  6.5 Alternatives to the Hierarchical Dirichlet Process

APPENDIX

A APPENDIX TO CHAPTER 3
  A.1 Proof of Theorem 3.2
  A.2 Proof of Theorem 3.3

  A.3 Proof of Theorem 3.5

B APPENDIX TO CHAPTER 4
  B.1 Blocked Gibbs Sampler
  B.2 Prior Specification
      B.2.1 Parametric Priors
      B.2.2 Nonparametric Default Priors
  B.3 Simulation Settings
      B.3.1 Section 4.6.1
      B.3.2 Section 4.6.2
  B.4 Exponential Tilting

C APPENDIX TO CHAPTER 5
  C.1 Proof of Theorem 5.2
  C.2 Proof of Lemma 5.1
  C.3 Proof of Lemma 5.2
  C.4 Impropriety of Posterior Under an Improper Prior

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Schematic representation of ACMV
2-2 Schematic representation of NFD
4-1 Simulation results under MAR
4-2 Comparison of results on SCT data under MAR
B-1 Results from simulation study in Section 4.6.2

LIST OF FIGURES

3-1 Schematic describing the working prior framework
3-2 Graphical depiction of the coupling interpretation of the transformation method
4-1 Trajectories of two latent classes
4-2 Results from simulation study in Section 4.6.2
4-3 Model checking for SCT
4-4 Improvement of treatments relative to placebo
4-5 Contour plot for treatment effects as functions of sensitivity parameters
5-1 Graphical depiction of a Bayesian hierarchical data generating mechanism
5-2 Graphical representation of the HDP as a directed acyclic graph
5-3 Draws from a Markov chain targeting an improper posterior of γ
5-4 Models corresponding to boundary values of (α, γ)
5-5 Models obtained by letting α or γ tend to 0 or ∞
5-6 MCMC output for (α, γ) under an informative prior
5-7 Logarithm of Bayes factor surface of (α, γ)
5-8 True topics used in simulation experiment
5-9 Histogram of samples from the posterior distribution of α and γ
5-10 L1 error in estimating the most prevalent topic
5-11 Sensitivity of estimation of the most prevalent topic to choice of hyperparameter
B-1 Dataset generated under M2
B-2 Dataset generated under M3

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

NONPARAMETRIC BAYES: INFERENCE UNDER NONIGNORABLE MISSINGNESS AND MODEL SELECTION

By Antonio Linero

August 2015
Chair: Michael J. Daniels
Cochair: Hani Doss
Major: Statistics

This dissertation concerns two essentially independent topics, with the primary link between the two being the use of Bayesian nonparametrics as an inference tool. The first topic concerns inference in the presence of missing data, with emphasis on longitudinal clinical trials with attrition. In this setting, it is well known that many effects of interest are not identified in the absence of untestable assumptions; the best one can do is to conduct a sensitivity analysis to determine the effect that such assumptions have on inferences. The second topic we address is model selection and hyperparameter estimation in hierarchical nonparametric Bayes models, with an emphasis on hierarchical Dirichlet processes. For various hyperparameter values on the boundary of the parameter space, such nonparametric models may reduce to parametric or semiparametric submodels, effectively giving tests of nonparametric models.

Chapter 1 and Chapter 2 provide some necessary background material on nonparametric Bayes and missing data problems. In Chapter 1, we discuss Dirichlet processes and present theoretical results of interest. In Chapter 2, we describe in detail various aspects of the missing data problem, and the generic approach we will take to addressing it via identifying restrictions. In Chapter 3, we provide a general framework for specifying nonparametric priors in missing data models which allows for fine control over the assumptions made; care is taken to not inadvertently "identify away" the missing data problem. To accomplish this, we place a prior directly on the observed data distribution, leaving the "extrapolation distribution" untouched. This is accomplished essentially by marginalizing over the extrapolation distribution for a given prior on the complete-data distribution. The extrapolation distribution is then identified via a suitable family of identifying restrictions with interpretable sensitivity parameters.

In Chapter 4, we apply this methodology to a longitudinal clinical trial designed to assess the efficacy of a proposed treatment for acute schizophrenia. We construct a Dirichlet process mixture model for the observed data generating distribution, and identify the distribution of the missing data by introducing a sensitivity parameter representing a location shift. The methodology is additionally validated through simulation. In Chapter 5, we consider the problem of hierarchical Bayesian modeling. We use hierarchical Dirichlet processes to construct tests of nonparametric models against semiparametric and parametric alternatives. Additionally, we develop empirical Bayes estimators for the hyperparameters of the hierarchical Dirichlet process. We conclude this dissertation in Chapter 6 with a discussion, and suggest possible avenues for future work.

CHAPTER 1
PRELIMINARIES ON BAYESIAN NONPARAMETRICS

1.1 Introduction

We consider nonparametric and semiparametric Bayesian methods in this dissertation; this chapter provides a review of such methods.

A canonical problem in statistics is to observe data $Z_i \stackrel{\text{iid}}{\sim} P_0$ for $i = 1, \ldots, n$, and attempt to learn from the data $Z_i$ about some aspect $\psi_0 = \psi(P_0)$ of $P_0$, and to quantify our uncertainty about $\psi_0$. In general, the $Z_i$'s may be regarded as taking values in some complete, separable, metric space $\mathcal{Z}$ endowed with its Borel $\sigma$-algebra $\mathcal{B}$. The Bayesian approach tackles the canonical problem by regarding $P_0$ as the realization of a random probability distribution $P \sim \Pi$, where $\Pi$ is a probability distribution on a space of distributions $\mathcal{P}$. One then draws inferences about $\psi_0$ from the posterior distribution $\Pi(dP \mid Z_{1:n})$.

Research in Bayesian methods initially focused on the parametric setting, regarding $\mathcal{P}$ as a parametric family $\{P_\theta : \theta \in \Theta\}$ where $\Theta$ is an open subset of $\mathbb{R}^p$ and such that the mapping $\theta \mapsto P_\theta$ is smooth (e.g., see Ghosh et al., 2007). For even relatively simple datasets, however, there is often little a priori justification for believing that $P_0$ lies precisely in some parametric family $\mathcal{P}$. Depending on the parametric model chosen and the true data generating distribution $P_0$, inferences about $\psi_0$ through the posterior may be in some sense robust to misspecification of $\{P_\theta\}$. For example, Bayesian inference about regression coefficients in linear models with homoskedastic errors remains frequentist-valid in the presence of a misspecified normality assumption. In complex examples of interest, however, this is frequently not the case. An earnest Bayesian might then consider enlarging $\mathcal{P}$ to be a nonparametric, or semiparametric, family; we primarily consider $\mathcal{P} = \{P : P \ll \nu\}$ where $\nu$ is some dominating measure (e.g., Lebesgue measure on $\mathbb{R}^J$) and denote by $p_0$ and $p$ the respective densities $dP_0/d\nu$ and $dP/d\nu$. One then obtains the posterior distribution through Bayes rule

$$\Pi(B \mid Z_{1:n}) = \frac{\int_B \prod_{i=1}^n p(Z_i) \, \Pi(dP)}{\int_{\mathcal{P}} \prod_{i=1}^n p(Z_i) \, \Pi(dP)}.$$

The family being dominated simplifies matters by ensuring that Bayes rule holds (e.g., see Ghosh and Ramamoorthi, 2003). We begin by reviewing relevant material in Bayesian nonparametrics; in particular, we discuss posterior consistency and some common methods for constructing random probability measures.

1.2 Posterior Consistency
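As a toy illustration of the Bayes rule display above, and a preview of the posterior concentration studied in this section, the sketch below (our own construction, not an example from the text) replaces the space of distributions with a finite grid of candidate densities $N(\theta, 1)$; each candidate's posterior weight is proportional to its likelihood times its prior weight.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
Z = rng.normal(loc=1.0, scale=1.0, size=50)   # data from P0 = N(1, 1)

# Finite "prior" over candidate densities p_theta = N(theta, 1); the
# posterior weight of each candidate is proportional to prod_i p_theta(Z_i).
thetas = np.linspace(-3.0, 3.0, 61)
prior = np.full(thetas.shape, 1.0 / len(thetas))
loglik = norm.logpdf(Z[:, None], loc=thetas[None, :]).sum(axis=0)
post = prior * np.exp(loglik - loglik.max())  # subtract max for stability
post /= post.sum()

print(thetas[np.argmax(post)])  # close to the true value 1.0
```

With 50 observations the posterior mass already piles up near $\theta = 1$, a finite-dimensional caricature of the consistency property defined below.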

Suppose $Z_i \stackrel{\text{iid}}{\sim} P_0$ and let $\Pi(dP \mid Z_{1:n})$ denote the posterior of $P$ given the observations $Z_1, \ldots, Z_n$. If $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ is a regular parametric model and $P_0 = P_{\theta_0}$ for some $\theta_0 \in \Theta$ ($\Theta$ an open subset of $\mathbb{R}^J$), then, under mild conditions on $\Pi$, it is well known that the posterior $\Pi(dP \mid Z_{1:n})$ concentrates its mass in neighborhoods of $P_{\theta_0}$. Moreover, under further regularity conditions, one obtains a Bernstein-von Mises theorem stating that

$$\sup_B \left| \Pi(\theta \in B \mid Z_{1:n}) - \mathcal{N}(B \mid \hat{\theta}_n, n^{-1} V_0) \right| \to 0$$

for any (frequentist) efficient estimator $\hat{\theta}_n$, where $V_0$ is the inverse Fisher information, $\mathcal{N}(\cdot \mid \mu, \Sigma)$ denotes the Gaussian distribution with mean $\mu$ and covariance $\Sigma$, and the supremum is taken over Borel subsets of $\Theta$ (e.g., see van der Vaart, 2000, Chapter 10). Hence, in the case of iid sampling from a regular parametric model, Bayesian methods are, for the most part, efficient and agree with frequentist methods.

Early examples (Diaconis and Freedman, 1986; Freedman, 1963) showed that Bayesian nonparametric procedures, by contrast, do not generally obtain good frequentist properties. While jarring, this is not surprising: in some sense, a finite sample may be insufficient to eventually "swamp" the information in the prior concerning an infinite number of parameters. On the other hand, sufficient conditions even in the

nonparametric setting were known quite early (Schwartz, 1965). We begin by defining posterior consistency in the nonparametric setting.

Definition 1 (Posterior consistency). A prior $\Pi$ is said to be consistent at a distribution $P_0$ if, for every neighborhood $U$ of $P_0$, the posterior $\Pi(dP \mid Z_{1:n})$ concentrates on $U$; that is,

$$\Pi(U^c \mid Z_{1:n}) \to 0, \quad P_0\text{-almost surely}. \tag{1–1}$$

The definition above depends on the topology considered. $U$ is said to be a weak neighborhood of $P_0$ if $U$ contains a set of the form

$$\left\{ P : \left| \int \phi_i \, dP_0 - \int \phi_i \, dP \right| < \epsilon, \ i = 1, \ldots, k \right\}, \qquad \phi_i \text{ bounded and continuous}.$$

If (1–1) holds for all $U$ where $U$ is a weak neighborhood of $P_0$, then $\Pi$ is said to be weakly consistent. $U$ is said to be a total variation neighborhood if $U$ contains a set of the form

$$S_\epsilon(P_0) = \{ P : \| P - P_0 \|_1 < \epsilon \},$$

where $\| \cdot \|_1$ denotes the total variation norm

$$\| P - Q \|_1 = \sup_B | P(B) - Q(B) | = \frac{1}{2} \int \left| \frac{dP}{d\nu} - \frac{dQ}{d\nu} \right| d\nu.$$

If (1–1) holds whenever $U$ is a total variation neighborhood, then $\Pi$ is said to be strongly consistent.

The weak and strong topologies can be metrized by the Prokhorov and $L_1$ metrics, respectively. Another useful notion of distance is given by the Kullback-Leibler divergence.

Definition 2 (Kullback-Leibler divergence). Let $P$ and $Q$ be probability measures on $(\mathcal{Z}, \mathcal{B})$. The Kullback-Leibler divergence from $P$ to $Q$ is defined as

$$K(P, Q) = \int p \log\left( \frac{p}{q} \right) d\nu, \tag{1–2}$$

where $P \ll \nu$ and $Q \ll \nu$ with densities $p$ and $q$ respectively. $U$ is said to be a Kullback-Leibler neighborhood of $P_0$ if $U$ contains a set of the form

$$K_\epsilon(P_0) = \{ P : K(P_0, P) < \epsilon \}.$$
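To make Definition 2 concrete, here is a small numeric check (our own toy example, with arbitrary densities): for two Gaussians the divergence (1–2) has a closed form, which a direct discretization of the integral reproduces.

```python
import numpy as np

# K(P, Q) for P = N(mu1, s1^2) and Q = N(mu2, s2^2) has the closed form
# log(s2/s1) + (s1^2 + (mu1 - mu2)^2) / (2 s2^2) - 1/2.
mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
closed = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# Direct discretization of (1-2): integrate p log(p/q) over a fine grid.
x = np.linspace(-10.0, 10.0, 200001)
p = np.exp(-(x - mu1)**2 / (2 * s1**2)) / (s1 * np.sqrt(2 * np.pi))
q = np.exp(-(x - mu2)**2 / (2 * s2**2)) / (s2 * np.sqrt(2 * np.pi))
numeric = np.sum(p * np.log(p / q)) * (x[1] - x[0])

print(closed, numeric)  # the two agree closely
```

Swapping the roles of $P$ and $Q$ gives a different value, reflecting that $K$ is a divergence rather than a metric.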

Jensen's inequality shows that $K(P, Q) \geq 0$, with equality holding if and only if $P = Q$. The definition of $K(P, Q)$ is also independent of the dominating measure $\nu$. Posterior consistency seems to be a minimal requirement in terms of frequentist properties of Bayesian procedures; nevertheless, it may not be taken for granted. We now present two useful results for proving posterior consistency.

Definition 3 (Kullback-Leibler support). $P_0$ is said to be in the Kullback-Leibler support of $\Pi$ if $\Pi(K_\epsilon(P_0)) > 0$ for all $\epsilon > 0$.

Theorem 1.1 (Schwartz, 1965). Suppose that $P_0$ lies in the Kullback-Leibler support of $\Pi$. Then $\Pi$ is weakly consistent at $P_0$.

Weak consistency guarantees posterior consistency of the expectations of bounded linear functionals with respect to $P$; that is, the posterior of the parameter $\int \phi \, dP$ is consistent at $\int \phi \, dP_0$ in the usual (parametric) sense, for $\phi$ bounded and continuous. For our purposes, this will suffice for the most part.

1.3 Review of Random Measures

The methods developed in this dissertation will make heavy use of random probability measures. In particular, we will focus on Dirichlet processes; however, many of the approaches can be modified to use different types of priors. 1.3.1 Dirichlet Processes

Perhaps the most useful tool in Bayesian nonparametrics for the construction of random measures is the Dirichlet process (DP) prior introduced by Ferguson (1973).

Definition 4 (Dirichlet process). A random probability measure $P$ on a complete, separable, metric space $\mathcal{Z}$ is said to have a Dirichlet process distribution, written $P \sim \mathcal{D}(\alpha H)$ and parametrized by a base probability measure $H$ and concentration parameter $\alpha$, if for every measurable partition $A_1, \ldots, A_k$ of $\mathcal{Z}$, the joint distribution of $(P(A_1), \ldots, P(A_k))$ is a $k$-dimensional Dirichlet distribution with shape parameter $(\alpha H(A_1), \ldots, \alpha H(A_k))$.

From this characterization based on finite-dimensional marginals, $H$ can be interpreted as the prior mean of $P$ in the sense that $E[P(A)] = H(A)$, and $\alpha$ can be interpreted as a measure of concentration around $H$ from the fact that $\operatorname{Var}(P(A)) = H(A)[1 - H(A)]/(\alpha + 1)$. Dirichlet processes play a fundamental role in Bayesian nonparametrics because the class of Dirichlet process priors is conjugate to iid sampling, in the sense that the posterior distribution of $P$ given data $Z_{1:n}$ is again a Dirichlet process with concentration parameter $\alpha + n$ and base probability measure

$$\hat{P} = \frac{\alpha}{\alpha + n} H + \frac{n}{\alpha + n} P_n,$$

where $P_n = n^{-1} \sum_{i=1}^n \delta_{Z_i}$ denotes the empirical distribution. From this, we see that the posterior mean of the Dirichlet process is asymptotically equivalent to the empirical distribution function. Dirichlet processes also possess a "large" support; the support of the $\mathcal{D}(\alpha H)$ distribution is

$$\{ P : \text{the support of } P \text{ is contained in the support of } H \},$$

assuming the following notion of support.

Definition 5 (Weak support). The weak support of a prior $\Pi$ is the smallest closed set $F$ in the weak topology such that $\Pi(F) = 1$.

The Dirichlet process has several additional characterizations which are particularly useful when constructing Markov chain Monte Carlo algorithms. If $Z_1, Z_2, \ldots$ is a sequence drawn conditionally iid from some $P \sim \mathcal{D}(\alpha H)$, then the $Z_i$ marginally possess the following Pólya urn representation (Blackwell and MacQueen, 1973):

$$[Z_n \mid Z_{n-1}, \ldots, Z_1] \sim \frac{\alpha}{\alpha + n - 1} H + \frac{1}{\alpha + n - 1} \sum_{i=1}^{n-1} \delta_{Z_i}. \tag{1–3}$$
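The urn scheme (1–3) is straightforward to simulate. The sketch below (our own code, with an arbitrary standard normal base measure) draws marginally from a DP and exhibits the ties among the $Z_i$ induced by the discreteness of $P$.

```python
import numpy as np

def polya_urn(n, alpha, draw_H, rng):
    """Marginal draws Z_1, ..., Z_n from DP(alpha * H) via (1-3):
    Z_n is a fresh draw from H with probability alpha/(alpha + n - 1),
    and otherwise a copy of a uniformly chosen earlier draw."""
    Z = []
    for i in range(n):
        if rng.uniform() < alpha / (alpha + i):
            Z.append(draw_H())              # new value from the base measure
        else:
            Z.append(Z[rng.integers(i)])    # reuse an earlier value
    return np.array(Z)

rng = np.random.default_rng(2)
Z = polya_urn(500, alpha=2.0, draw_H=lambda: rng.normal(), rng=rng)
print(len(np.unique(Z)))  # ties occur: far fewer distinct values than draws
```

The number of distinct values grows only logarithmically in $n$, a well-known consequence of the urn dynamics.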

In (1–3), $\delta_z$ denotes the point-mass distribution at the point $z$. Dirichlet processes also possess the following stick-breaking construction (Sethuraman, 1994):

$$P \stackrel{d}{=} \sum_{k=1}^\infty \beta_k \delta_{\theta_k}, \qquad \theta_k \stackrel{\text{iid}}{\sim} H,$$
$$\beta_k = \beta_k' \prod_{j=1}^{k-1} (1 - \beta_j'), \qquad \beta_j' \stackrel{\text{iid}}{\sim} \operatorname{Beta}(1, \alpha). \tag{1–4}$$

The symbol $\stackrel{d}{=}$ denotes equality in distribution. An immediate extension of the Dirichlet process is obtained by considering the family of priors determined by $\beta_j' \stackrel{\text{indep}}{\sim} \operatorname{Beta}(a_j, b_j)$, provided that $\sum_j \log(1 - \beta_j') = -\infty$ with $\Pi$-probability 1. If $P \sim \Pi$ has such a stick-breaking construction, we refer to $P$ as a stick-breaking measure. Examples of stick-breaking measures include Pitman-Yor processes (Ishwaran and James, 2001; Pitman and Yor, 1997) and probit stick-breaking processes (Rodriguez and Dunson, 2011).

1.3.2 Mixtures of Dirichlet Processes and Dirichlet Process Mixtures

The Dirichlet process, as described above, takes the prior mean $H$ to be specified a priori. Mixtures of Dirichlet processes, introduced by Antoniak (1974), allow for modeling of uncertainty in $H$; the base measure $H$ is assumed to lie in a parametric family $\{ H_\eta : \eta \in E \}$, which is then endowed with a prior $\pi(d\eta)$. The prior probability of $P \in B$ is then

$$\Pi(B) = \int \mathcal{D}(B \mid \alpha H_\eta) \, \pi(d\eta). \tag{1–5}$$

A potential drawback of mixtures of Dirichlet processes is that they place prior probability 1 on discrete measures with countable support. When modelling continuous distributions, in addition to being unrealistic in this respect, the discrete support precludes a large Kullback-Leibler support. Rather than drawing $Z_i$ from $P \sim \mathcal{D}(\alpha H_\eta)$, we introduce a kernel density $f_\theta(\cdot)$ on $\mathcal{Z}$ and model the density of $Z_i$ as

$$p(z) = \int f_\theta(z) \, P(d\theta), \qquad P \sim \mathcal{D}(\alpha H_\eta), \qquad \eta \sim \pi(d\eta). \tag{1–6}$$

Dirichlet process mixtures (DPMs) were introduced by Lo (1984) for the purpose of density estimation but have seen a wide variety of uses. They form an important building block for developing more complicated methods. Conditions for strong and weak consistency, as well as rates of convergence, were given by Barron et al. (1999) and Ghosal et al. (1999, 2000).

Inference in Dirichlet process mixtures typically proceeds by Markov chain Monte Carlo (MCMC). MCMC methods were first developed for DPMs in seminal work by Escobar and West (1995), who marginalized over the infinite-dimensional $P$ and conducted inference about the predictive distribution of the $Z_i$'s by constructing a Gibbs sampler based on the Pólya urn representation (1–3); see also Neal (2000).

Various techniques have been developed for conducting inference about stick-breaking measures using Gibbs sampling. Ishwaran and James (2001) take the approach of truncating the stick-breaking expansion, setting $\beta_k' \equiv 1$ in (1–4) at the truncation level, and give bounds on the $L_1$-distance between the marginal likelihood using the truncation and the marginal likelihood without the truncation. To conduct inference, they give a blocked Gibbs sampler. Papaspiliopoulos and Roberts (2008) develop a retrospective sampler applicable to generic stick-breaking measures, similar to the sampler given by Doss (1994), which gives an exact MCMC scheme by adaptively determining the truncation point. Walker (2007) similarly develops an exact sampler with an adaptive truncation level by using a slice-sampling approach. To sample functionals of the Dirichlet process, Muliere and Tardella (1998) give an "$\epsilon$-truncation".

1.3.3 Dependent Random Measures

In many settings, one prefers to think of a family $\{ P_x : x \in \mathcal{X} \}$ of random measures, rather than a single random $P$. The literature on dependent families of random probability measures appears to have been initiated by MacEachern (1999), who introduced the class of dependent Dirichlet processes (DDPs),

$$P_x = \sum_{k=1}^\infty \beta_k(x) \delta_{\theta_k(x)},$$

where $\beta_k(x) = \beta_k'(x) \prod_{j<k} (1 - \beta_j'(x))$ is chosen such that $\beta_k'(x) \sim \operatorname{Beta}(1, \alpha(x))$ holds for each $x$.
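As a loose illustration of the idea (a "single-weights" special case of our own devising, not MacEachern's general construction), the sketch below shares one set of stick-breaking weights across $x$ while letting the atoms vary linearly in $x$, so that quantities such as the mean of $P_x$ vary smoothly with the covariate.

```python
import numpy as np

rng = np.random.default_rng(5)

# Single-weights DDP sketch: one stick-breaking draw of weights shared
# across x, atoms theta_k(x) = a_k + b_k * x varying linearly in x.
trunc, alpha = 50, 1.0
b = rng.beta(1.0, alpha, size=trunc)
b[-1] = 1.0                                  # truncation: absorb leftover stick
w = b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))
a_k = rng.normal(0.0, 1.0, size=trunc)       # random intercepts
b_k = rng.normal(0.0, 1.0, size=trunc)       # random slopes

def mean_of_Px(x):
    # E[Z | Z ~ P_x] = sum_k w_k * theta_k(x) with theta_k(x) = a_k + b_k x
    return float(np.sum(w * (a_k + b_k * x)))

print(mean_of_Px(0.0), mean_of_Px(1.0))  # the mean drifts smoothly with x
```

Richer DDP variants let the weights, the atoms, or both depend on $x$; the fixed-weights version above is only the simplest member of the family.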

CHAPTER 2
INFORMATIVE MISSINGNESS IN LONGITUDINAL STUDIES

2.1 Introduction

When handling data from longitudinal clinical trials, analysts are often confronted with missing observations. In some settings, the data are inherently incomplete; for example, when conducting causal inference in an observational study under the Rubin causal model (Rubin, 1974, 2005), it is impossible to observe the potential outcomes of an individual assigned to multiple treatments simultaneously. Alternatively, some data may be observable but simply not recorded. This is commonly the case in longitudinal clinical trials: subjects are administered some treatment, and a response of interest is observed over time. Some subjects may "drop out" of the study in the sense that after a certain time point the response of interest for that subject is no longer observed.

We are concerned with the estimation of causal effects in longitudinal studies in the presence of missingness, e.g., the intention-to-treat effect of randomization to treatment regimes. A crucial aspect of this problem is that many causal effects of practical interest are not identified unless the analyst makes strong, untestable assumptions about the relationship between the missingness process and the response process (Daniels and Hogan, 2008; National Research Council, 2010; Scharfstein et al., 1999). Quoting Hogan et al. (2014), when confronted with the inevitability of untestable assumptions, one might:

(i) Make an assumption, such as the missing at random (MAR) assumption, and assume it holds with no further critique.

(ii) Fit models under several different assumptions to the joint distribution of the response and missingness, then assess how the inferences change.

(iii) Fit an under-identified model and focus on obtaining uncertainty regions for the effects of interest (Manski, 2009; Vansteelandt et al., 2006).

In the absence of strong subject matter justification for an assumption like MAR, we view option (i) as unrealistically optimistic. Our focus will be on option (ii) and various extensions. We refer to the act of fitting various models under different assumptions to assess how the inferences change as conducting a sensitivity analysis.

Option (iii) has attracted substantial attention as a basis for frequentist inference; by shifting focus from point estimation to the construction of uncertainty regions, one supposedly attains a more "objective" analysis. A disadvantage of this approach is that using uncertainty regions essentially corresponds to a worst-case analysis, whereas incorporating subject matter knowledge may allow for a more reasonable weighting of assumptions. We will not comment further on this approach, other than to note that inference obtained in terms of uncertainty intervals often corresponds to a particular choice of prior distribution (Hogan et al., 2014).

We view the use of the Bayesian paradigm as particularly attractive because it provides a clear mechanism for incorporating subject matter expertise in a principled manner: one simply elicits an informative prior from the subject matter expert. This allows a formal decision-theoretic analysis to be performed in a regulatory setting.

In this chapter, we provide a review of the relevant concepts required to develop the techniques in the remainder of this dissertation. We also provide motivation for taking the Bayesian nonparametric approach to conducting inference on causal effects.

2.2 Notation

Assume that we have collected data on $i = 1, \ldots, n$ individuals. Let $Y_i = (Y_{i1}, \ldots, Y_{iJ})$ denote a vector of observations intended to be collected on subject $i$, with $R_i \in \{0, 1\}^J$ a binary vector such that $Y_{ij}$ is observed or unobserved according as $R_{ij} = 1$ or $R_{ij} = 0$. The particular values that $R_i$ can take are referred to as missing data patterns. Missingness is said to be monotone if $R_{ij} = 0$ implies $R_{ik} = 0$ for $k > j$; that is, missingness is the result of dropout. Missingness which is not monotone is referred to as non-monotone or intermittent. When dropout is monotone, the missing data pattern can be summarized by $S_i = \max\{ j : R_{ij} = 1 \}$. $Y_i$ can be partitioned into an observed part $Y_{\text{obs},i} = (Y_{ij} : R_{ij} = 1)$ and a missing part $Y_{\text{mis},i} = (Y_{ij} : R_{ij} = 0)$. Formally, we will let

$$Y_{\text{obs},ij} = R_{ij} Y_{ij} - (1 - R_{ij}) \infty, \tag{2–1}$$

where we will take the measure-theoretic convention that $0 \cdot \infty = 0$; essentially, we are using the placeholder $-\infty$ to denote a missing value.

We will write $O_i = (Y_{\text{obs},i}, R_i)$ for the observed data on subject $i$ and $C_i = (Y_i, R_i)$ for the complete data on subject $i$. We will subscript by $n : m$ to denote creating a vector spanning the indices $n, n+1, \ldots, m$; for example, $O_{1:n} = (O_1, \ldots, O_n)$ will denote the observed data on all subjects. In the setting of monotone missingness, it is convenient to let $\bar{Y}_{ij} = (Y_{i1}, \ldots, Y_{ij})$ denote the response history of $Y_i$ up to time $j$.

It is assumed that $C_i \stackrel{\text{iid}}{\sim} p_0(y, r)$, where $p_0(y, r)$ is the density of a distribution $P_0$ with respect to some relevant dominating measure. $P_0$ is modeled as the realization of some random distribution $P \sim \Pi$, where $\Pi(dP)$ is a prior on an appropriate space $\mathcal{P}$ of distributions with respect to $dy \times dr$. We again abuse notation, with $dr$ being counting measure on $\{0, 1\}^J$ and $dy$ potentially denoting Lebesgue measure, counting measure, and so forth depending on context. We will similarly use $dy_j$, $d\bar{y}_j$, $dy_{\text{obs}}$, and so on, as a notational convenience to refer to the appropriate dominating measures. Subscripting densities by 0 will refer to an a priori fixed, or "true," density for the observations.

When no confusion is possible, we will also abuse notation by writing, for example, $p(y \mid r)$ for the conditional density of the response $Y_i$ given the missingness pattern $R_i = r$. If confusion is possible, we will use subscripting, writing for example $p_{Y \mid R}(y \mid r)$ instead. It is also convenient to define

$$f(y) = \int p(y, r) \, dr, \qquad \pi(r \mid y) = \frac{p(y, r)}{f(y)}.$$

The density $f(y)$ is referred to as the full data response model, and $\pi(r \mid y)$ is commonly referred to as the missing data mechanism. Hence

$$p(y, r) = f(y) \times \pi(r \mid y),$$

and we can write the prior as $\Pi(dp) = \Pi(df, d\pi)$. Our focus will typically be on functionals $\psi(f)$ of the full data response model $f(y)$, such as the mean at each scheduled observation time, $\psi_j = \int y_j f(y) \, dy$. Another useful factorization, referred to by Daniels and Hogan (2008) as the extrapolation factorization, is given by

$$p(y, r) = p_{\text{obs}}(y_{\text{obs}}, r) \times p_{\text{mis}}(y_{\text{mis}} \mid y_{\text{obs}}, r).$$

Because the observed data generating distribution $p_{\text{obs}}$ is the density of $O_i$, $p_{\text{obs}}$ is identified; conversely, because we never observe $Y_{\text{mis},i}$, the extrapolation distribution $p_{\text{mis}}$ is completely unidentified. Hence, the extrapolation factorization isolates the unidentified components of the model from the identified components. We will write $\mathcal{P}_{\text{obs}} = \{ P_{\text{obs}} : P \in \mathcal{P} \}$ for the space of observed data distributions, and $\mathcal{P}_{\text{mis}} = \{ P_{\text{mis}} : P \in \mathcal{P} \}$ for the space of extrapolation distributions. Both of these families are dominated, and we write the relevant dominating measures as $dy_{\text{obs}} \times dr$ and $dy_{\text{mis}}$.

2.3 Rubin's Classification of Missing Data

Rubin (1976) introduced a classification of types of missingness based on assumptions about the missing data mechanism.

Definition 6 (Missing completely at random). The missing data is said to be missing completely at random (MCAR) if

$$\pi(r \mid y) = \pi(r). \tag{2–2}$$

MCAR missingness generally is not expected to hold in practice. An exception is when the missingness process is known or designed by the analyst; for example, if budget constraints force the analyst to follow up on only a randomly chosen subset of the subjects, then missingness is MCAR. More common is the case of missing at random (MAR) missingness. Because MCAR imposes restrictions on the observed data generating distribution $p_{\text{obs}}$, it is possible to formally test the hypothesis that missingness is MCAR against the following MAR assumption (Little, 1988).

Definition 7 (Missing at random). The missing data is said to be missing at random (MAR) if

$$\pi(r \mid y) = \pi(r \mid y_{\text{obs}}). \tag{2–3}$$

As stated, this definition requires some explanation. Taken literally, by (2–1), $R_i$ is $Y_{\text{obs},i}$-measurable, and so the right-hand side of (2–3) must be either 0 or 1. A more accurate statement is

$$\pi(R = r \mid Y = y) = \pi(R = r \mid Y_r = y_r),$$

where $y_r$ denotes the entries of $y$ such that $r_j = 1$. In keeping with the literature, however, we will still write (2–3). Unlike MCAR, the MAR assumption is always compatible with a given $p_{\text{obs}}$ (Molenberghs et al., 2008).

We note that, under the original definition of MAR and MCAR given by Rubin (1976), MAR and MCAR are properties of both the density $p$ and the observed data $(Y_{\text{obs},i}, R_i)$; that is, it is possible for $Y_{\text{mis}}$ to be MAR for some realizations of $(Y_{\text{obs},i}, R_i)$ but not for others. We, however, use a more global definition, placing restrictions on the entire joint distribution $p(y, r)$. See Seaman et al. (2013) for a more thorough discussion of this point.

Definition 8 (Missing not at random). If the missing data is not MAR, it is said to be missing not at random (MNAR) or informative.

The MAR assumption is an important ingredient in ignorability (Rubin, 1976).

Definition 9 (Bayesian ignorability). Suppose that everything is defined as above. Missingness is said to be ignorable if the following conditions hold:

24 (1) Missingness is MAR.

(2) f and π are a priori independent; i.e., Π(df, dπ) = Πf (df) Ππ(dπ). × When these conditions hold, it can be shown that

Π(df, dπ O1:n) = Πf (df O1:n) Ππ(dπ O1:n), | | × | and thus one is free to model only f(y) to make inference on functionals ψ(f). Let Li denote the likelihood contribution of observation i; then under MAR,

Li(f, π | oi) = ∫ f(yi) π(ri | yi) dymis,i
             = ∫ f(yi) π(ri | yobs,i) dymis,i
             = f(yobs,i) π(ri | yobs,i). (2–4)
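The factorization (2–4) can also be checked numerically. The following sketch (a hypothetical bivariate binary example with made-up probabilities, not data from this dissertation) verifies that, when π(r | y) depends only on yobs, integrating the joint over ymis reproduces f(yobs) π(r | yobs):

```python
# Hypothetical complete-data model: f(y) on {0,1}^2 and a MAR mechanism
# pi(r | y) for the pattern r = (1, 0), i.e., y2 missing, that depends
# on y1 only. All numbers are made up for illustration.
f = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def pi(r, y):
    # P(R = (1, 0) | Y = y); under MAR this is a function of y[0] alone.
    p_drop = 0.5 if y[0] == 0 else 0.2
    return p_drop if r == (1, 0) else 1.0 - p_drop

r = (1, 0)  # y1 observed, y2 missing
for y1 in (0, 1):
    # Left-hand side of (2-4): sum the joint over the missing y2.
    lhs = sum(f[(y1, y2)] * pi(r, (y1, y2)) for y2 in (0, 1))
    # Right-hand side: f(y_obs) * pi(r | y_obs).
    f_obs = sum(f[(y1, y2)] for y2 in (0, 1))
    rhs = f_obs * pi(r, (y1, 0))  # pi ignores the second coordinate
    assert abs(lhs - rhs) < 1e-12
```

The assertion fails as soon as `pi` is allowed to depend on `y[1]`, which is exactly the MNAR case.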

Because the likelihood factors in this way, if f and π are independent in the prior then they are also independent in the posterior.
2.4 Why Bayesian Nonparametrics?

In contrast to typical settings where Bayesian nonparametrics is applied, in this dissertation we will primarily be interested in relatively straightforward functionals of the underlying distribution, ψ(f), such as the mean response at time j. In the absence of missing data, if our goal is just to estimate these functionals, then the use of Bayesian nonparametric techniques might justifiably be criticized as being needlessly complicated. For example, we might instead assume that the distribution of

Yi is multivariate Gaussian, safe in the knowledge that our analysis is robust to violations of this assumption. The presence of missing data, however, requires us to estimate aspects of p0(y, r) which would normally be considered a nuisance; moreover, the accuracy with which we estimate the nuisance directly influences the quality of our inferences. For example, under ignorability and monotone dropout, consistent

estimation of ψj requires, at a minimum, the estimation of at least one of the following infinite-dimensional parameters:

1. The conditional regression functions r(yobs,i) = Ef[Ymis,i | Yobs,i = yobs,i].
2. The propensity scores π(r | y).
Poor estimation of these parameters will introduce bias into the estimation of ψj; on the other hand, in light of the curse of dimensionality (Robins and Ritov, 1997), it is not feasible to estimate either of these functions nonparametrically when Yi is of even moderate dimension. The Bayesian nonparametric approach offers both the flexibility of nonparametric modeling and control over model complexity through shrinkage towards a simple parametric structure; thus, once one has committed to the Bayesian approach for inference, nonparametric Bayesian methods appear to be the natural choice. Beyond flexible modeling, the Bayesian approach has a natural appeal in the

manner it addresses identifiability issues. Because any assumptions which identify pmis are untestable, in some sense the identification problem is inherently subjective. As a result, we will rely on subject matter expertise to determine whether a given assumption is likely to hold. The Bayesian approach then provides a natural means for incorporating subject matter expertise into an analysis: the prior. This allows the analyst to account for uncertainty in the identifying assumptions when coming to a final conclusion. 2.5 Existing Approaches

We briefly review some existing approaches for the analysis of informative missingness in longitudinal studies. The methods discussed fall into one of two categories: (1) likelihood-based parametric approaches and (2) semiparametric or nonparametric frequentist approaches.
2.5.1 Likelihood Factorizations

When missingness is assumed to be informative, likelihood-based approaches can largely be categorized by how the joint p(y, r) is factorized when describing the model. A very common approach is to base models on the selection model factorization (Diggle and

Kenward, 1994; Heckman, 1979)

p(y, r) = f(y) × π(r | y).

One might then, for example, assume that f(y) is a multivariate Gaussian density and π(r | y) is a sequential hazard regression. Parametric models based on the selection model factorization often possess the “benefit” of being fully identified; hence, one might use the parametric model to test the MAR assumption. This does not contradict our earlier claim that MAR is untestable: the assumption that the parametric model holds is itself an assumption which cannot be fully tested. Our belief is that the ability of parametric selection models to identify the entire joint distribution p(y, r) is a drawback which masks the inherent lack of identifiability in the problem. It is possible to overcome this defect by making the model p(y, r) suitably nonparametric or semiparametric. Frequentist semiparametric approaches are discussed in Section 2.5.2. Bayesian nonparametric variants of selection models have also been considered. In a cross-sectional setting, Scharfstein et al. (2003) considered a selection model where f(y) was given a Dirichlet process prior, with π(r | y) modeled parametrically, and sensitivity analysis based on weakly-identified parameters in the model. In the setting of spatial statistics, Pati et al. (2011) used a selection model to capture information in informative sampling locations, with the entire model identified. The pattern mixture factorization (Hogan and Laird, 1997; Little, 1993, 1994) is essentially the reverse of the selection model factorization,

p(y, r) = g(y | r) × φ(r).

The pattern mixture factorization is closely related to the extrapolation factorization

p(y, r) = g(y | r) × φ(r) = g(ymis | yobs, r) × g(yobs | r) × φ(r),

where the first factor g(ymis | yobs, r) is the extrapolation distribution pmis and the remaining factors g(yobs | r) × φ(r) constitute pobs.

The factor φ(r) is identified from the observed data while the factor g(y | r) is not. Two options for addressing g(y | r) are to:
(a) Invoke parametric assumptions on g(y | r) which specify parametrically how information is shared across dropout times; for example, one might assume that y

depends on r through a linear regression of dropout time s = max{j : rj = 1} on time j (see examples in Harel and Schafer, 2009).

(b) Leave parts of g(y | r) unidentified to facilitate a sensitivity analysis (Daniels and Hogan, 2000, 2008).
In line with our philosophy so far, our advice is to prefer approach (b). A final approach commonly used (Henderson et al., 2000; Wu and Carroll, 1988) is the shared parameter approach,

p(y, r) = ∫ p(y, r | b) G(db). (2–5)

It is additionally common to assume that the joint distribution p(y, r | b) factors as p(y | b) × p(r | b) so that Yi and Ri are independent given some latent variable bi. When bi has support {1, 2, . . . , K} the shared parameter model is called a latent class model (Roy, 2003). Shared parameter models are powerful in their ability to capture complex relationships within the data in a parsimonious manner. This is especially useful for multivariate longitudinal models. For example, Dunson and Perreault (2001) used a latent factor approach to model high-dimensional longitudinal data with informative missingness. It is often assumed that b has a multivariate Gaussian distribution. The form of (2–5), however, is suggestive of nonparametric possibilities. For example, one might take G(db) to be a Dirichlet process, or a Dirichlet process mixture, to attain a flexible model for the latent variables. Approaches along these lines have been taken by Kleinman and Ibrahim (1998), among others. We are unaware of any specific work which treats G(·) as given by (2–5) in this nonparametric Bayesian manner, although Dunson (2007) remarked that this is an obvious possibility.
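The conditional independence at the heart of the shared parameter model is easy to see in miniature. The sketch below (a hypothetical two-class latent class model with made-up probabilities) shows Y and R conditionally independent given b yet marginally dependent, which is exactly the mechanism by which such models induce informative missingness:

```python
# Hypothetical latent class model with K = 2 classes:
# p(y, r) = sum_b p(b) p(y | b) p(r | b), so Y and R are independent
# given b but not marginally.
p_b = {1: 0.6, 2: 0.4}    # class probabilities P(b)
p_y1 = {1: 0.8, 2: 0.3}   # P(Y = 1 | b)
p_r1 = {1: 0.9, 2: 0.4}   # P(R = 1 | b), i.e., probability of being observed

def joint(y, r):
    return sum(p_b[b]
               * (p_y1[b] if y == 1 else 1 - p_y1[b])
               * (p_r1[b] if r == 1 else 1 - p_r1[b])
               for b in (1, 2))

p_r = sum(joint(y, 1) for y in (0, 1))       # P(R = 1)
p_y_given_r1 = joint(1, 1) / p_r             # P(Y = 1 | R = 1)
p_y_given_r0 = joint(1, 0) / (1 - p_r)       # P(Y = 1 | R = 0)
# Responders and nonresponders differ on Y, even though Y ⟂ R | b.
print(round(p_y_given_r1, 3), round(p_y_given_r0, 3))  # 0.686 0.4
```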

2.5.2 Non-Likelihood Based Approaches

Most non-likelihood based approaches to missing data appear to be based on the GEE approach of Liang and Zeger (1986). A complete-case analysis based on GEEs is valid under the MCAR assumption: GEEs are consistent under MCAR when the marginal mean of the response is correctly specified. Robins et al. (1995) extended the GEE framework to allow for MAR missingness through inverse-probability weighting (IPW). Assume monotone dropout for simplicity.

The general idea is to inverse-weight a complete-data GEE estimating equation ∑i ϕ(Yi; ψ, η) by the probability of a subject completing the study given their response Yi. One obtains the estimating equation

∑_{i=1}^n [I(Si = J) / π(Si = J | Yi)] ϕ(Yi; ψ, η) = 0. (2–6)

Here, ψ is a Euclidean parameter of interest and η is a nuisance parameter needed to estimate ψ. Inverse-probability weighted complete-case (IPWCC) estimators require estimating the propensity scores π(s = J | y). Since their introduction, a variety of useful variants of IPW estimators have been introduced. Augmented IPW (AIPW) estimators add a so-called augmentation term to (2–6) to obtain estimating equations such as

∑_{i=1}^n { [I(Si = J) / π(Si = J | Yi)] ϕ(Yi; ψ, η) + ∑_{j=1}^{J−1} [(I(Si = j) − λj(Yi) I(Si ≥ j)) / π(Si > j | Yi)] E[ϕ(Yi; ψ, η) | Ȳij] } = 0,

where the second term inside the braces is the augmentation term. Here, λj(Yi) denotes the dropout hazard at time j. In addition to solving this estimating equation, one must also estimate the propensity scores in the denominator and the dropout hazard. The augmentation term allows for contributions from not just complete cases, but all observations, and therefore is more efficient. In fact, estimators of this form are often (locally) semiparametric efficient (Tsiatis, 2006; van der Laan and Robins, 2003).
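To make the weighting idea concrete, the following simulation (a simplified cross-sectional sketch with a hypothetical data-generating process, not the longitudinal estimating equation above) compares the complete-case mean with an inverse-probability weighted estimate when the response probability depends on a variable that also shifts the outcome:

```python
import math
import random

random.seed(1)

# Hypothetical setup: Y = X + noise with E[Y] = 0, and the probability
# of observing Y is expit(0.5 + X), so observed cases over-represent
# large X. The complete-case mean is biased; weighting each complete
# case by 1 / P(observed | X) corrects it.
n = 200_000
num = den = cc_sum = 0.0
cc_n = 0
for _ in range(n):
    x = random.gauss(0.0, 1.0)
    y = x + random.gauss(0.0, 1.0)
    e = 1.0 / (1.0 + math.exp(-(0.5 + x)))   # true propensity score
    if random.random() < e:                   # Y observed
        num += y / e                          # inverse-probability weight
        den += 1.0 / e
        cc_sum += y
        cc_n += 1

ipw_mean = num / den        # close to the true mean 0
cc_mean = cc_sum / cc_n     # noticeably biased upward
print(round(cc_mean, 2), round(ipw_mean, 2))
```

As noted in the text, the price of this correction is sensitivity to highly variable weights: when `e` is near 0 for some subjects, a few terms dominate the sums.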

Additionally, many AIPW estimators are doubly robust in the sense that if at least one of either the propensity scores or the conditional mean structure is modeled correctly then the estimator is consistent. It is worth noting that doubly robust estimators, and IPW estimators in general, can have poor performance when the inverse-probability weights are highly variable (Kang and Schafer, 2007). Going beyond MAR, Rotnitzky et al. (1998) and Scharfstein et al. (1999) developed AIPW estimators which are valid for informative missingness. Sensitivity analysis can be performed in these models by varying a (weakly identified) selection bias parameter. Vansteelandt et al. (2007) extended this approach to the case of non-monotone missingness. In light of the curse of dimensionality, AIPW estimators impose modeling assumptions on π and E[ϕ | Ȳij]. The paradigm of targeted maximum likelihood estimation (TMLE) (van der Laan and Rose, 2011; van der Laan and Rubin, 2006) attempts to provide a more flexible method for constructing doubly robust estimators, using an ensemble-of-likelihoods approach to modeling referred to as “super-learning”. Likelihoods used in this approach are carefully tweaked so that their MLEs are locally semiparametric efficient.
2.6 Identifying Restrictions and Sensitivity Parameters

Identifying restrictions (Little, 1993, 1994) provide a means of identifying the

extrapolation distribution pmis in terms of the observed data generating distribution pobs. Definition 10 (Identifying restrictions). An identifying restriction is any mapping

g : Pobs → Pmis, i.e., g : pobs ↦ pmis. We call g a partial identifying restriction if g : pobs ↦ U ⊂ Pmis.
The idea of a partial identifying restriction is to capture the notion that we are not

imposing a full restriction on pmis; some features of pmis are implied by pobs, but not all of them. For example, g : pobs ↦ Pmis essentially corresponds to a total lack of assumptions about pmis.

Table 2-1. Schematic representation of ACMV when J = 4. Distributions above the dividing line are not identified by the observed data. Subscripting by j or ≥j denotes conditioning on the events [S = j] and [S ≥ j] respectively.

        j = 1    j = 2          j = 3          j = 4
S = 1   p1(y1)   p≥2(y2 | y1)   p≥3(y3 | ȳ2)   p≥4(y4 | ȳ3)
S = 2   p2(y1)   p2(y2 | y1)    p≥3(y3 | ȳ2)   p≥4(y4 | ȳ3)
S = 3   p3(y1)   p3(y2 | y1)    p3(y3 | ȳ2)    p≥4(y4 | ȳ3)
S = 4   p4(y1)   p4(y2 | y1)    p4(y3 | ȳ2)    p4(y4 | ȳ3)

A result due to Molenberghs et al. (1997) is that, under monotone missingness, MAR is an identifying restriction according to the above definition with a very convenient form; MAR holds if and only if

p(yj | ȳj−1, s = k) = p(yj | ȳj−1, s ≥ j), (j > k ≥ 1). (2–7)

Equation (2–7) is referred to as the available case missing value restriction (ACMV). A schematic representation of ACMV is given in Table 2-1. In Chapter 3 and Chapter 4, we consider sensitivity analyses obtained by embedding the ACMV restriction within the class of non-future dependent (NFD) restrictions (Kenward et al., 2003). An identifying restriction is said to be NFD if it implies

p(yj+1 | ȳj, s = k) = p(yj+1 | ȳj, s ≥ j), (j > k ≥ 1). (2–8)

NFD is only a partial identifying restriction, mapping to the subset of pmis which satisfy this identity, and places no restrictions on the conditionals p(yj+1 | ȳj, s = j). Taking

p(yj+1 | ȳj, s = j) = p(yj+1 | ȳj, s ≥ j + 1)

shows that the ACMV restriction is contained in the NFD restriction. A schematic representation of NFD is given in Table 2-2. Like ACMV, NFD can be expressed as a restriction imposed on the missing data mechanism. A result due to Kenward et al. (2003) is that NFD holds if and only if

π(s | y) = π(s | ȳs, ys+1). (2–9)

Table 2-2. Schematic representation of NFD when J = 4. Distributions above the dividing line are not identified by the observed data. Subscripting by j or ≥j denotes conditioning on the events [S = j] and [S ≥ j] respectively.

        j = 1    j = 2         j = 3          j = 4
S = 1   p1(y1)   ?             p≥2(y3 | ȳ2)   p≥3(y4 | ȳ3)
S = 2   p2(y1)   p2(y2 | y1)   ?              p≥3(y4 | ȳ3)
S = 3   p3(y1)   p3(y2 | y1)   p3(y3 | ȳ2)    ?
S = 4   p4(y1)   p4(y2 | y1)   p4(y3 | ȳ2)    p4(y4 | ȳ3)

This gives NFD a (weak) causal justification: because cause flows forward through time,

it is argued that the probability of dropout at time s should depend on the past (ȳs) and the present (ys+1), but not the future (y(s+2):J). We note, as pointed out by Jaynes (1996), that such causal observations do not justify probabilistic independencies such as (2–9), with shared parameter models giving realistic counterexamples. It is rare that one will be able to confidently assert that a particular identifying restriction holds. To incorporate uncertainty into our underlying assumptions, we define the notion of a sensitivity parameter. Our definition is a nonparametric version of the definition given by Daniels and Hogan (2008); other nonparametric definitions are given by Vansteelandt et al. (2006) and Robins and Gill (1997).
Definition 11 (Sensitivity parameters). Consider a family of identifying restrictions

{gξ : ξ ∈ Ξ}, and let the model for the extrapolation distribution be pmis = gξ(pobs). The index ξ is said to be a sensitivity parameter. Some properties of sensitivity parameters include:

1. Sensitivity parameters are unidentified in the sense that

Π(dξ | pobs, O1:n) = Π(dξ | pobs).

Additionally, the likelihood L = ∏i pobs(Yobs,i, Ri) does not depend on the sensitivity parameter.

2. When ξ is correctly specified, because pobs is identified, both pmis and p(y, r) are also identified (under mild conditions on gξ).
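Property 1 can be seen concretely in a toy example. In the sketch below (hypothetical numbers; the restriction is a simple location shift E[Y | R = 0] = E[Y | R = 1] + ξ, anchored at a mean-MAR assumption when ξ = 0), varying ξ changes the extrapolated functional while leaving the observed-data quantities untouched:

```python
# Toy sensitivity analysis (hypothetical numbers): a single outcome Y
# with response indicator R. The observed data identify E[Y | R = 1]
# and P(R = 1); the restriction g_xi posits the location shift
#   E[Y | R = 0] = E[Y | R = 1] + xi,
# with xi = 0 the anchoring (mean-MAR) assumption.
p_r1 = 0.7          # observed response rate
mean_obs = 2.0      # mean among responders

def overall_mean(xi):
    mean_mis = mean_obs + xi       # extrapolated mean for nonresponders
    return p_r1 * mean_obs + (1 - p_r1) * mean_mis

# The observed-data likelihood is identical for every xi (property 1);
# only the extrapolated functional E[Y] changes:
for xi in (-1.0, 0.0, 1.0):
    print(xi, round(overall_mean(xi), 3))  # -1.0 1.7 / 0.0 2.0 / 1.0 2.3
```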

Our approach is to specify a family {gξ} in the following manner. First, we consider a baseline assumption g0, such as MAR, which is easily interpretable to clinicians. Next,

we smoothly deviate from this model by introducing ξ such that ξ has an interpretation as a deviation from the baseline assumption; for example, ξ may be chosen to correspond to certain types of location-shift or selection bias parameters.
2.7 Partial and Latent Ignorability

Two variations of MAR and ignorability are particularly useful. These generalizations capture different notions of the missing data mechanism becoming ignorable when additional information is supplied. Both were discussed at length by Harel and Schafer (2009).
Definition 12 (Bayesian partial ignorability). The missing data is said to be partially missing at random (PMAR) given h(r) if

p(r | y, h(r)) = p(r | yobs, h(r)). (2–10)

The missing data mechanism is said to be partially ignorable given h(r) if the following conditions hold:

1. The missing data is PMAR given h(r).

2. The functions p(r | y, h(r)) and p(y, h(r)) are a priori independent.
By analogy with ignorability, under partial ignorability we do not need to model the distribution p(r | y, h(r)) when conducting Bayesian inference. For us, partial ignorability provides a default way to extend our work to accommodate non-monotone missing data by taking h(R) = max{j : Rj = 1} ≡ S. This may be appropriate if it is thought that the dependence between R and the missing values Ymis can be captured entirely through the dependence between h(R) and Ymis. Partial ignorability captures the notion of part of the missing data being ignorable given a summary h(r). The notion of latent ignorability (coined by Frangakis and Rubin, 1999) captures the notion that the missing data would have been ignorable had we

additionally observed a coarse summary of the missing data h(ymis).

Definition 13 (Bayesian latent ignorability). The missing data is said to be latently

missing at random (LMAR) given a function h(ymis) if

p(r | y) = p(r | yobs, h(ymis)).

The missing data mechanism is said to be latently ignorable given h(ymis) if the missing data is LMAR and the functions p(r | yobs, h(ymis)) and p(yobs, h(ymis)) are independent a priori.

This assumption may be useful in the context of latent class models, where Ymis is augmented to include the latent class membership; such an assumption may provide computational benefits over the usual MAR assumption. PMAR and LMAR are useful to us as alternatives to MAR for an anchoring identifying restriction g0. We find PMAR to be most useful for dealing with incidental intermittent missingness, while we use LMAR primarily for computational benefits. 2.8 Intermittent Missingness

Partial ignorability greatly simplifies analysis in the presence of intermittent missingness but, as with any other assumption about the missing data, it must be evaluated on subject matter grounds and may or may not hold in any given situation. Addressing intermittent missingness when the missingness is not partially ignorable is a difficult problem. Much of the extant literature on intermittent missingness imposes strong parametric assumptions on the data. Latent variable methods can be readily adapted to allow for intermittent missingness; this allows for relatively easy joint modeling of the response process with a latent process which determines the missingness; see, e.g., Lin et al. (2004), Albert (2000) and Dunson and Perreault (2001). Parametric selection model approaches have also been proposed (Ibrahim et al., 2001; Troxel et al., 1998). The above approaches are unsatisfactory in that they do not provide much scope for a principled sensitivity analysis. Vansteelandt et al. (2007) provides a more appropriate

analysis by developing doubly-robust estimators which allow for a sensitivity analysis similar to Rotnitzky et al. (1998) and Scharfstein et al. (1999). As an anchoring assumption, Vansteelandt et al. (2007) make use of the sequential explainability assumption

p(rj = 1 | r̄j−1, y) = p(rj = 1 | r̄j−1, yobs,1, . . . , yobs,(j−1)), (2–11)

so that the probability of missingness depends only on the observed history. Sensitivity parameters are then introduced which allow for smooth deviations from this assumption. Like NFD, sequential explainability is a partial identifying restriction; however, unlike NFD, sequential explainability is sufficient by itself to identify many of the effects of interest.
2.9 Summary of Our Strategy and Our Contributions

We now describe the overall strategy we take in Chapter 3 and Chapter 4. We will take the following three-step approach:

(i) Specify a prior pobs ∼ Πobs. Derive inference about pobs from Πobs(dpobs | O1:n).

(ii) Specify a family of identifying restrictions {gξ : ξ ∈ Ξ}. For each gξ, we obtain a posterior Π(dp | O1:n, ξ). Vary ξ through Ξ to obtain an assessment of the impact of the missingness.

(iii) Once the impact of missingness is assessed, elicit an informative prior on Ξ from a field expert and draw final inferences from Π(dp | O1:n), i.e., after marginalizing over ξ.
In order to elicit clinically meaningful priors on Ξ, it is essential that each member of the family {gξ : ξ ∈ Ξ} have a clear interpretation. To facilitate this, we will anchor our analysis by choosing gξ so that g0(pobs) corresponds to a well-understood identifying restriction such as MAR. Deviations of ξ from 0 can then be interpreted as deviations of the missing data mechanism from the MAR assumption, with ξ close to 0 corresponding to small deviations from MAR and ξ far away from 0 corresponding to large deviations from MAR. The deviations of ξ away from 0 will also be interpretable. For example, ξ may have the interpretation of a selection bias parameter (Birmingham et al., 2003) or a

location-scale shift (Daniels and Hogan, 2000, 2008). Additional methods for introducing interpretable sensitivity parameters will be discussed throughout this dissertation. In Chapter 3 we introduce a framework for accomplishing (i) by means of an auxiliary prior, which we refer to as a working prior. Some theoretical properties are established. We also develop techniques for sensitivity analysis, primarily aimed at continuous data, by introducing transformation-based adjustments to the response of subjects who drop out; this addresses (ii) and (iii). Chapter 4 focuses on an application of this methodology to data from a clinical trial, with the main emphasis put on prior specification and assessment of the quality of the model through simulation.
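The three-step approach can be mimicked end to end in a toy model. The sketch below is entirely hypothetical (a binary outcome, conjugate Beta posteriors standing in for the posterior on the observed-data distribution, and an exponential-tilting restriction playing the role of gξ); it is meant only to show where steps (i)-(iii) enter, not to represent the nonparametric machinery developed in Chapter 3:

```python
import math
import random

random.seed(0)

# Hypothetical observed data: among n subjects, m respond (R = 1) and
# s of the responders have Y = 1. Y is never seen when R = 0.
n, m, s = 100, 70, 40

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

# Step (i): conjugate Beta(1, 1) posteriors for the observed-data
# distribution p_obs = (theta1, rho), theta1 = P(Y=1 | R=1), rho = P(R=1).
def draw_pobs():
    theta1 = random.betavariate(1 + s, 1 + m - s)
    rho = random.betavariate(1 + m, 1 + n - m)
    return theta1, rho

# Step (ii): identifying restriction g_xi, a selection-bias-type tilt
#   P(Y=1 | R=0) = expit(logit(theta1) + xi),  with xi = 0 the MAR anchor.
def psi(theta1, rho, xi):
    theta0 = expit(logit(theta1) + xi)
    return rho * theta1 + (1 - rho) * theta0   # P(Y = 1)

for xi in (-1.0, 0.0, 1.0):   # sensitivity analysis over fixed xi
    draws = [psi(*draw_pobs(), xi) for _ in range(4000)]
    print(xi, round(sum(draws) / len(draws), 3))

# Step (iii): final inference marginalizing over an elicited prior on xi,
# here Uniform(-1, 1) purely for illustration.
draws = [psi(*draw_pobs(), random.uniform(-1, 1)) for _ in range(4000)]
print("marginal:", round(sum(draws) / len(draws), 3))
```

Note that every value of ξ leads to the same fit to the observed counts (n, m, s); only the extrapolated functional P(Y = 1) moves.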

CHAPTER 3 NONPARAMETRIC BAYES FOR NONIGNORABLE MISSINGNESS 3.1 Introduction

In this chapter, we introduce techniques for developing flexible Bayesian nonparametric methods for the analysis of longitudinal data with informative missingness. This is done in such a way that carrying out the three-step approach described in Section 2.9 is straightforward. We consider (Yi, Ri) ∼ p0, i.i.d., where p0 is a fixed (frequentist) density supported on R^J × {0, 1}^J which is thought of as the “true” data-generating distribution. The vector Yi = (Yi1, . . . , YiJ) denotes a response vector of interest, while Ri = (Ri1, . . . , RiJ) denotes a vector of missingness indicators, such that only the entries of Yi corresponding

to Rij = 1 are observed. The observed data is denoted by Oi = (Yobs,i, Ri), and the

complete data is denoted by Ci = (Yi, Ri). We focus primarily on developing the methodology at an abstract level, with an in-depth example deferred to Chapter4. Recall from Section 2.9 that our three-step approach is:

(1) Specify a prior pobs ∼ Πobs and draw inferences about pobs through the posterior Πobs(dpobs | O1:n).

(2) Specify an interpretable family of identifying restrictions {gξ : ξ ∈ Ξ}, and derive inference from Π(dp | O1:n, ξ) as a function of the sensitivity parameter ξ.

(3) If a final inference is needed, elicit a prior Πξ on ξ, and conduct final inference about p0 through Π(· | O1:n) = ∫ Π(· | O1:n, ξ) Πξ(dξ).
We begin by developing a framework, which we refer to as the working prior framework, for specification of a nonparametric prior to address (1). The essential idea is to specify a prior pobs ∼ Πobs indirectly. We do this by constructing an auxiliary prior distribution Π* on the space P of complete data distributions. We favor this approach for its computational tractability, the ease with which one can specify such a prior, and for its direct access to theoretical results. The working prior framework is developed in Section

3.2, where benefits of this approach versus direct specification of Πobs are discussed.

In Section 3.3, we provide sufficient conditions for the (frequentist) consistency of this approach. In Section 3.4, we construct a class of priors possessing large Kullback-Leibler support. This class of priors is large, and contains many priors which are useful in practice. Combined with the results of Section 3.2, this yields consistent estimation of a

large class of pobs,0's. This section is mostly independent of other sections, and is perhaps of independent interest. We provide examples which illustrate that this class contains priors which are of practical interest in both the setting of monotone missingness (i.e., dropout) and non-monotone missingness, including a prior very similar to the working prior used in Chapter 4. Methods for constructing identifying restrictions (in both the monotone and non-monotone settings) are introduced in Section 3.5, with an emphasis placed on continuous data. Our approach is based on constructing a suitable transformation which “corrects” for an observation being missing. Similar, but less general, techniques were used by Daniels and Hogan (2000, 2008). We favor this method because it is easy to use, computationally tractable, and results in identifying restrictions which are interpretable by clinical experts. In Section 3.6, we consider the issue of posterior computation. We propose to calculate causal effects of interest by using a G-computation algorithm (Robins, 1986; Scharfstein et al., 2013). A specialized algorithm, such as G-computation, is needed due to the fact that, even if p is known, computation of an effect of interest ψ(p) is often intractable. This intractability is distinct from the already existing computational hurdles that necessitate the use of MCMC. When combined with our approach to specifying identifying restrictions, the G-computation approach provides a methodology for calculating causal effects, provided that we can simulate from the conditional distributions

of pobs.
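As a preview of the role G-computation plays, here is a minimal Monte Carlo sketch (with a hypothetical fitted model: J = 2 and Gaussian conditionals standing in for draws of pobs, identified under a MAR-type anchor), in which the functional ψ(p) = E[Y2] is computed by forward simulation from the sequence of conditionals rather than in closed form:

```python
import random

random.seed(0)

# Minimal G-computation sketch (hypothetical fitted model): with J = 2,
# suppose posterior sampling has produced the conditionals
#   Y1 ~ N(1, 1)   and   Y2 | Y1 = y1 ~ N(0.5 * y1, 1),
# with the identifying restriction supplying the same conditional for
# dropouts (a MAR-type anchor). psi(p) = E[Y2] is approximated by
# simulating forward through the conditionals and averaging.
def g_computation(n_mc=100_000):
    total = 0.0
    for _ in range(n_mc):
        y1 = random.gauss(1.0, 1.0)          # simulate Y1
        y2 = random.gauss(0.5 * y1, 1.0)     # then Y2 | Y1
        total += y2
    return total / n_mc

est = g_computation()
print(round(est, 2))   # analytically, E[Y2] = 0.5 * E[Y1] = 0.5
```

In this toy case the functional is available in closed form; the point of G-computation is that the same forward-simulation recipe still works when the conditionals are complicated posterior draws for which no closed form exists.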

3.2 Strategy for Prior Specification

We now introduce our principal tool for prior specification, the working prior.

Definition 14 (Working prior). Let Πobs denote a prior for pobs. A prior Π* is said to be a working prior for Πobs if

Πobs(A) = Π*(pobs ∈ A).

In words, Πobs is the marginal distribution of pobs under Π*.

Rather than using the posterior derived from the working prior Π*(dp | O1:n) to derive inferences about p, we instead base inferences on a combination of Πobs(dpobs | O1:n) and our identifying restriction gξ : pobs ↦ pmis. This strategy can be contrasted with the approach of directly specifying a prior Πobs; indeed, one may wonder what the benefit of taking an indirect approach to prior specification is. The indirect approach possesses the following potential benefits over the approach of directly specifying Πobs.

1. To avoid the curse of dimensionality, the prior pobs ∼ Πobs must place its mass on a structured subset of Pobs. By passing to the level of priors on complete data distributions P, we find it easier to reason about the distributions on which Πobs places its mass. How information is shared across time points and dropout patterns is more transparent using this approach.

2. This perspective can be leveraged to construct straightforward MCMC sampling schemes based on Π* via a type of data augmentation.

3. In Section 3.3, we leverage the working prior perspective to automate the proofs of several desirable properties of our nonparametric prior.

4. One may also consider specifying Π directly, and base inference on Π(dp | O1:n). This is a delicate issue, as one must be careful not to fully identify the extrapolation distribution pmis(ymis | yobs, r). Our approach, by contrast, inherently leaves the extrapolation distribution unidentified.

In reasoning about specification of the working prior, the following interpretation is helpful: when compared to our true prior Π, the working prior Π* tells an alternate

story about how the observed data Oi was generated. This interpretation is depicted

schematically in Figure 3-1. The path in solid arrows from Π* to Oi = (Yobs, R) represents

Figure 3-1. Schematic describing the working prior framework. The working prior Π* induces a prior Πobs on pobs. Additionally, Π* gives an interpretation of how the data was generated: first, generate p* ∼ Π*, then generate (Y*, R) ∼ p*, and finally discard the components Yj* such that Rj = 0.

the actual use of Π* in our approach, while the path in dashed arrows represents a latent interpretation of our approach as giving rise to a working model p* and complete data

(Y*, R). We emphasize that we do not base inference on the posterior Π*(dp | O1:n), but instead on the combination of the posterior of the observed data distribution Πobs(dpobs | O1:n) and our identifying restriction gξ : pobs ↦ pmis. Regardless, the working model interpretation of Figure 3-1 is valuable for several reasons:

(i) It provides a means for reasoning about how information is shared across the times j and dropout patterns r.

(ii) It provides a straightforward method for conducting inference about pobs; we may regard the working model p* and the complete data (Y*1:n, R1:n) as latent variables, and proceed by data augmentation.

3.3 Posterior Consistency of pobs
The following “invariance” property of the Kullback-Leibler divergence (Kullback and Leibler, 1951) will be useful. Intuitively, this result says that applying a fixed transformation t to data X cannot make it easier to distinguish whether X ∼ P or X ∼ Q.

40 Proposition 3.1. If P and Q are probability measures and t is any measurable function then

K(P ∘ t−1, Q ∘ t−1) ≤ K(P, Q).
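Proposition 3.1 is easy to check numerically for discrete distributions. In the sketch below (hypothetical probability vectors), the map t coarsens the sample space in the same way that (Y, R) ↦ (Yobs, R) discards information:

```python
import math

# Sanity check of K(P ∘ t^{-1}, Q ∘ t^{-1}) <= K(P, Q) for a coarsening
# map t on {0, 1, 2} with t(0) = 0 and t(1) = t(2) = 1 (made-up P, Q).
P = [0.5, 0.3, 0.2]
Q = [0.25, 0.25, 0.5]

def kl(p, q):
    # Kullback-Leibler divergence between discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

Pt = [P[0], P[1] + P[2]]   # pushforward P ∘ t^{-1}
Qt = [Q[0], Q[1] + Q[2]]   # pushforward Q ∘ t^{-1}

assert kl(Pt, Qt) <= kl(P, Q)
print(round(kl(P, Q), 3), round(kl(Pt, Qt), 3))
```

Merging states can only destroy information that separates P from Q, so the coarsened divergence is never larger.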

This, combined with Schwartz’s theorem (Theorem 1.1), gives conditions for weak

consistency of Πobs in terms of the Kullback-Leibler support of Π*.

Lemma 3.1. Suppose that P0 is a probability measure on a probability space (Z, B) and let t : Z → Z′ be a measurable mapping from (Z, B) to (Z′, B′). Let P and P′ denote the associated spaces of probability measures. Let Π* denote a prior on P and let Π denote the prior on P′ induced by the mapping P ↦ P ∘ t−1. If P0 is in the Kullback-Leibler support of Π*, then P0 ∘ t−1 is in the Kullback-Leibler support of Π.

Proof. For any P and P0, the invariance principle gives K(P0, P) ≥ K(Q0, Q), where Q0 = P0 ∘ t−1 and Q = P ∘ t−1. This implies

Kε(P0) ⊂ {P : P ∘ t−1 ∈ Kε(Q0)}.

Thus, for any ε > 0,

Π(Kε(Q0)) = Π*(P ∘ t−1 ∈ Kε(Q0)) ≥ Π*(Kε(P0)) > 0.

Theorem 3.1. Suppose that Π*(Kε(p0)) > 0 for all ε > 0. Then the posterior Πobs(dpobs | O1:n) is weakly consistent at pobs,0.

Proof. The mapping (Y, R) ↦ (Yobs, R) is measurable; thus this follows immediately from a combination of Lemma 3.1 and Schwartz's theorem.

The value of this result is that it gives sufficient conditions for posterior consistency of

the prior Πobs in terms of straightforward sufficient conditions for posterior consistency of Π. In fact, it is possible to state theorems regarding convergence in total variation, obtain

41 convergence rates, and so forth, using essentially the same technique. This allows us to leverage existing results in the Bayesian nonparametric literature to obtain results for our priors. Weak consistency is relevant for us as it guarantees posterior consistency of every

functional of the form ψ(pobs) = ∫ t(yobs, r) pobs(yobs, r) dyobs dr, where t(yobs, r) is bounded and continuous. Write ψ = ψ(pobs) and ψ0 = ψ(pobs,0) and recall that the posterior

distribution of ψ is said to be consistent at ψ0 if

Π(|ψ − ψ0| < ε | O1:n) → 1, pobs,0-almost-surely, for all ε > 0.

When t(yobs, r) is bounded, the set of pobs for which |ψ − ψ0| < ε occurs is, by definition, a weak neighborhood of pobs,0; hence, posterior consistency holds. We are not necessarily interested in only the expectations of bounded continuous functions, however; for example, one is often interested in the mean response of the observed data at time J. The following result provides a criterion for posterior consistency of the expectation of an unbounded continuous function; a proof is provided in Appendix A.
Theorem 3.2. Suppose p ∼ Π is a random density with respect to a dominating measure ν, and Zi ∼ p0, i.i.d., for i = 1, . . . , ∞. Let ψ(p) = ∫ t(z) p(z) ν(dz) with t(z) continuous, but not necessarily bounded. Assume the following:

(C1) The posterior Π(· | Z1:n) is weakly consistent at p0.

(C2) With p0-probability 1, t(z) is uniformly integrable with respect to the sequence of predictive densities

p̃n(z) = ∫ p(z) ∏_{i=1}^n p(Zi) Π(dp) / ∫ ∏_{i=1}^n p(Zi) Π(dp).

Then Π(|ψ(p) − ψ(p0)| > ε | Z1:n) → 0 for every ε > 0, i.e., the posterior of ψ(p) is consistent at ψ(p0).

42 We expect uniform integrability to be a modest condition; for example, it is satisfied

if the sequence {∫ |t(z)|^{1+ε} p̃n(z) ν(dz)}_{n=1}^∞ in the previous theorem is bounded p0-almost surely. To the author's best knowledge, there are no general approaches in the literature to addressing this uniform integrability condition, but we conjecture that it holds for the models considered in Section 3.4 under mild assumptions; detailed results will be the subject of future work. Typically, one is interested in conducting inference on functionals of p rather than

pobs. In light of identifiability concerns, one cannot expect to consistently estimate p. The identifying restriction gξ : pobs ↦ pmis defines a mapping hξ : pobs ↦ p which identifies p from pobs. Given that posterior consistency of Π for p is infeasible, the most that we can ask for is convergence of hξ(pobs) to hξ(pobs,0). That is, we ask

Πobs(hξ(pobs) ∈ U | O1:n) → 1, Pobs,0-almost-surely,

where U is a neighborhood of hξ(pobs,0). In general, we should only expect this to occur if

pobs being close to pobs,0 implies that hξ(pobs) is close to hξ(pobs,0).

Lemma 3.2. Suppose that Πobs is consistent at pobs,0 in a given topology, and that hξ : Pobs → P is continuous with respect to this topology. Then the induced prior on hξ(pobs) is consistent at hξ(pobs,0) in this topology.

The proof of Lemma 3.2 is trivial and is omitted. In view of this result, one is led to the question of which identifying restrictions are smooth with respect to a given topology; e.g., is MAR continuous? Unfortunately, the situation is complicated for weak convergence; generally, MAR is not weakly continuous

at a given p0. This is due to the fact that weak convergence of joint distributions does not ensure weak convergence of conditional distributions (Sethuraman, 1961; Steck, 1957; Sweeting, 1989). On the other hand, under mild conditions, strong convergence suffices. For example, we prove the following theorem in Appendix A.

Theorem 3.3. Consider the setting of monotone missingness. Assume that the distribution of the response is absolutely continuous with respect to Lebesgue measure and that the dropout time is supported on {1, ..., J}. Suppose that pobs,0 ∈ Pobs is such that pobs,0(S = J | ȳj−1) > 0. Let (T2, ..., TJ) be continuously differentiable and strictly monotone functions, which are support-preserving in the sense that if Yj ∼ p0(yj | ȳj−1, S ∈ A) then Yj and Tj(Yj | ȳj−1) have the same support. Then, the NFD identifying restriction determined by (3–6) is continuous in the strong topology at pobs,0.

We immediately obtain the following corollary, which implies that the posterior will be strongly consistent at p0 under the MAR assumption.

Corollary 1. Under the same conditions as Theorem 3.3, MAR is continuous in the strong topology at pobs,0.

Proof. Take the transformations Tj to be the identity.

3.4 Kullback-Leibler Property for Kernel Mixture Models

In this section, we prove a result which provides sufficient conditions for a distribution

p0 to be in the Kullback-Leibler support of a prior Π when Π is a certain Bayesian nonparametric kernel mixture. Notably, the results are useful in their own right in the context of complete data. When applied to the results of other sections, these results amount to showing that the working priors we use have large support; combined with the results of Section 3.3, this establishes consistency. We have recently become aware of independent work in the complete data setting due to Canale and Dunson (2015) which proceeds along similar lines. The notation here is independent of the notation used elsewhere.

We consider mixtures of the form

p(y, r) = ∫ [ ∫_{A(r)} N((y, z) | µ, Σ) dz ] F(dµ, dΣ),    (3–1)

where F ∼ D(αH), with H(dµ, dΣ) = Hµ(dµ) × HΣ(dΣ), and N(x | µ, Σ) denotes a multivariate Gaussian density with mean vector µ and covariance matrix Σ. A(·) is a set-function which maps r to a set A(r) ⊂ R^J such that the sets {A(r) : r ∈ {0, 1}^J} form a partition of R^J. Let ρ : R^J → {0, 1}^J be such that z ∈ A(ρ(z)). If (Y, R) ∼ p then one has (Y, R) =d (Y*, ρ(Z*)), where

(Y*, Z*) ∼ q(y, z) = ∫ N((y, z) | µ, Σ) F(dµ, dΣ).    (3–2)

The idea is that, if the random density q has good large-support properties, then p will also have good large-support properties. Conveniently, this is more-or-less a missing data problem (with Z* partially observed), so that the arguments of Section 3.3 will be effective. The Kullback-Leibler support of distributions of the form (3–2) is well studied; for example, the following result due to Canale and De Blasi (2013), which is a multivariate extension of results due to Wu et al. (2008), is useful.

Theorem 3.4. Let p0(x) be a given density on R^d and let H be such that Hµ(dµ) has full support on R^d and HΣ(dΣ) is supported on a subset of the set of all non-negative definite d × d matrices such that the eigenvalues of Σ ∼ HΣ are contained in any interval (a, b) (0 < a < b) with positive probability. Let Π denote the prior on densities p induced by the map

F ↦ p(x) = ∫ N(x | µ, Σ) F(dµ, dΣ),

where F ∼ D(αH). Assume the following conditions hold:

(A1) 0 < p0(x) < M for some M.

(A2) ∫ p0 log p0 dx < ∞.

(A3) There exists a δ > 0 such that ∫ p0 log[p0/φδ] dx < ∞, where φδ(x) = inf_{‖t−x‖<δ} p0(t).

(A4) There exists an η > 0 such that ∫ ‖x‖^{2(1+η)d} p0(x) dx < ∞.

Then p0 is in the Kullback-Leibler support of Π.

Strictly speaking, this is a modified result with a condition on the eigenvalues of Σ; it follows from modifications of the proofs in Canale and De Blasi (2013) and Wu et al. (2008), as the proofs depend on the distribution HΣ only through its marginal distribution on the eigenvalues of Σ ∼ HΣ.

By regarding (Y*, Z*) as "complete data" and (Y*, ρ(Z*)) as "observed data", we can proceed as in Theorem 3.1 and obtain the following result.

Lemma 3.3. Suppose p0(y, r) can be expressed in the form

p0(y, r) = ∫_{A(r)} q0(y, z) dz,    (3–3)

such that the sets {A(r) : r ∈ {0, 1}^J} form a measurable partition of R^J. Let Π be the prior on a random density p induced by (3–1), where F ∼ D(αH) and where H is as in Theorem 3.4. If q0 satisfies conditions A1–A4 of Theorem 3.4, then p0 is in the Kullback-Leibler support of Π.

Proof. Let Π* denote the prior on densities induced by (3–2). The conditions on q0 guarantee that q0 is in the Kullback-Leibler support of Π*. The mapping t : (y, z) ↦ (y, ρ(z)), where ρ(·) is defined by z ∈ A(ρ(z)), is measurable. The result follows by Lemma 3.1.

In view of Lemma 3.3, one may make progress on establishing the Kullback-Leibler

property of a distribution p0 by showing that it admits a representation (3–3) for which the associated q0 satisfies conditions A1–A4. Constructing such a representation is fairly simple.

Proposition 3.2. Let Λ(·) be an absolutely continuous distribution on R^J with full support, and let λ(·) be its density. Then p0(y, r) = ∫_{A(r)} q0(y, z) dz, where

q0(y, z) = p0(y) × p0(ρ(z) | y) × λ(z) / Λ(A(ρ(z))),    (3–4)

where {A(r) : r ∈ {0, 1}^J} is any partition of R^J and ρ(z) is such that z ∈ A(ρ(z)).
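As a concrete illustration, the representation (3–4) can be verified numerically in a toy one-dimensional case (J = 1). The choices below are hypothetical stand-ins: Y ∼ N(0, 1) with P(R = 1 | Y = y) = expit(y), orthants A(1) = [0, ∞) and A(0) = (−∞, 0), and Λ the standard Gaussian measure, so that Λ(A(r)) = 1/2. Integrating q0(y, z) over A(r) should then return p0(y, r).

```python
import math

def phi(x):
    """Standard normal density (the density lambda of the measure Lambda)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def p0(y, r):
    """Hypothetical observed-data density: Y ~ N(0,1), P(R = 1 | Y = y) = expit(y)."""
    return phi(y) * (expit(y) if r == 1 else 1.0 - expit(y))

def q0(y, z):
    """The complete-data density (3-4): p0(y) p0(rho(z) | y) lambda(z) / Lambda(A(rho(z)))."""
    r = 1 if z >= 0 else 0          # rho(z)
    return p0(y, r) * phi(z) / 0.5  # Lambda(A(r)) = 1/2 for both half-lines

def integral_over_A(y, r, n=20000):
    """Midpoint-rule integral of q0(y, z) over z in A(r)."""
    lo, hi = (0.0, 8.0) if r == 1 else (-8.0, 0.0)
    h = (hi - lo) / n
    return h * sum(q0(y, lo + (i + 0.5) * h) for i in range(n))

for y in (-1.3, 0.0, 2.1):
    for r in (0, 1):
        assert abs(integral_over_A(y, r) - p0(y, r)) < 1e-6
```

The cancellation is exactly the one used in the proof: the factor λ(z)/Λ(A(ρ(z))) integrates to 1 over each cell of the partition.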

Combining Lemma 3.3 and Proposition 3.2 gives the following (possibly crude) result. A proof is provided in Appendix A.

Theorem 3.5. Let Π* denote a prior on densities q(y, z) satisfying the conditions of Theorem 3.4 and let Π denote the prior on densities p(y, r) defined by the mapping

q ↦ p(y, r) = ∫_{A(r)} q(y, z) dz,    (3–5)

where {A(r) : r ∈ {0, 1}^J} is a measurable partition of R^J. Suppose p0(y, r) is a density on R^J × {0, 1}^J satisfying the following conditions.

(D1) 0 < p0(y, r) < M for some M.

(D2) Σ_r ∫ p0 log p0 dy < ∞.

(D3) There exists a δ > 0 such that Σ_r ∫ p0 log[p0/ψδ] dy < ∞, where ψδ(y) = inf_{r, ‖y−t‖<δ} p0(t, r).

(D4) There exists an η > 0 such that Σ_r ∫ ‖y‖^{4(1+η)J} p0(y, r) dy < ∞.

Then p0(y, r) is in the Kullback-Leibler support of Π.

There are many priors of interest of the form (3–1); we provide some examples below.

Example 1. Consider bivariate data (y, z) ∈ R², with partition function A(0) = [0, ∞) and A(1) = (−∞, 0). Using the modified Cholesky decomposition (Daniels and Pourahmadi, 2002), we can write

N((y, z) | µ, Σ) = N(y | µy, σy²) × N(z | µz + φ(y − µy), ρz²),

for some φ ∈ R and ρz > 0, where (µy, µz, σy²) are marginal means and variances obtained from Σ. Expression (3–1) evaluates to

p(y, r) = ∫ N(y | µy, σy²) [Φ((µz + φ(y − µy))/ρz)]^r [1 − Φ((µz + φ(y − µy))/ρz)]^{1−r} F(dµ, dΣ).

Writing a = µz/ρz and b = φ/ρz, this becomes

p(y, r) = ∫ N(y | µy, σy²) [Φ(a + b(y − µy))]^r [1 − Φ(a + b(y − µy))]^{1−r} F(dµ, dΣ).
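As a quick sanity check on this representation, the sketch below verifies by Monte Carlo that, conditional on Y = y, the mass the Gaussian kernel assigns to the half-line {z ≥ 0} equals the probit term Φ(a + b(y − µy)); the parameter values for the single mixture component are hypothetical.

```python
import math
import random

random.seed(0)

# Hypothetical parameter values for one mixture component of Example 1
mu_y, mu_z, phi_c, sigma_y, rho_z = 1.0, -0.5, 0.8, 2.0, 1.5
a, b = mu_z / rho_z, phi_c / rho_z

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Conditional on Y = y0, Z ~ N(mu_z + phi_c * (y0 - mu_y), rho_z^2); the kernel's
# mass on {z >= 0} is the probit probability Phi(a + b * (y0 - mu_y))
y0 = 1.7
n = 200_000
hits = sum(random.gauss(mu_z + phi_c * (y0 - mu_y), rho_z) >= 0
           for _ in range(n)) / n
assert abs(hits - Phi(a + b * (y0 - mu_y))) < 0.01
```

Marginalizing Z over a half-line in this way is what turns the latent Gaussian kernel into a probit regression for the missingness indicator.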

This is a mixture of products of a probit regression for r and a Gaussian distribution for y. We may then choose H(dµ, dΣ) so that the base probability measure is Gaussian on (µy, µz, φ) and inverse-Gamma on (σy², ρz²), independently.

Example 2. This approach can be used to model monotone missingness by setting

A(r) = ∅ unless r corresponds to a monotone missingness pattern. If r is such that rk = 1 for k ≤ j and rk = 0 for k > j, then set

A(r) = [0, ∞) × ··· × [0, ∞) × (−∞, 0) × R^{J−j−1},

with j copies of [0, ∞). Next, consider HΣ(dΣ) such that, conditional on (µ, Σ, Y), the Zj are independent with mean

ηj(ȳj−1) = µzj + Σ_{ℓ<j} βℓj(yℓ − µyℓ).

Suppressing the dependence of ηj on ȳj−1, we obtain

p(y, s) = ∫ N(y | µy, Σy) × Φ(ηs+1/ρzs+1) ∏_{ℓ≤s} [1 − Φ(ηℓ/ρzℓ)] F(dµ, dΣ).

This is essentially a probit version of the working prior we will use in Chapter 4. When used as a working prior for missing data, we refer to such a model as a Dirichlet process mixture of missing at random models. The MAR structure of the kernel greatly simplifies the computations for this model when implementing the G-computation algorithm of Section 3.6 while not affecting posterior consistency.

Example 3. To address non-monotone missingness, let A(r) denote the orthant of R^J associated with r. Now, assume that the Zj's are independent given (µ, Σ, Y). By similar arguments as above, this results in a model of the form

p(y, r) = ∫ N(y | µy, Σy) ∏_{j=1}^J [Φ(ηj/ρzj)^{rj} (1 − Φ(ηj/ρzj))^{1−rj}] F(dµ, dΣ),

where now

ηj(y) = µzj + Σ_{ℓ=1}^J βℓj(yℓ − µyℓ).

As in Example 2, the restriction on the form of the mixture kernel possesses computational benefits; in particular, it is possible to marginalize over components of y in closed form, allowing us to implement the G-computation algorithm described in Section 3.6 for non-monotone data.

3.5 Identifying Restrictions

We now introduce a generic approach for constructing interpretable identifying restrictions. Our approach is based on introducing an appropriate transformation which, in some sense, corrects for an observation being missing. We describe this approach separately for monotone missingness (where the anchoring identifying restriction is taken to be MAR) and non-monotone missingness (where we also introduce a new identifying restriction to anchor a sensitivity analysis to).

3.5.1 Monotone Missingness

In the setting of monotone missingness, we advocate anchoring our analysis to the MAR assumption and operating within the family of NFD identifying restrictions (see Section 2.6). A particular identifying restriction within the family of NFD identifying

restrictions can be specified by assuming the existence of a transformation Tj(yj | ȳj−1, ξ) such that

[Yj | Ȳj−1, S = j − 1] =d [Tj(Yj | Ȳj−1, ξ) | Ȳj−1, S ≥ j],  (2 ≤ j ≤ J),    (3–6)

where =d denotes equality in distribution. The right-hand side is determined by pobs, while the left-hand side is determined by pmis; hence, distributional equality together with NFD implicitly defines an identifying restriction.

An advantage of identifying pmis through the NFD restriction and (3–6) is that this can be easily explained to subject matter experts by considering two “coupled” subjects:

[Figure 3-2 appears here: response (vertical axis, roughly 50-70) against time 1-4 (horizontal axis); the S ≥ 4 trajectory's time-4 distribution is shifted by ξ via T4(Y4 | Y1:3, ξ) = Y4 + ξ to give the S = 3 distribution.]

Figure 3-2. Graphical depiction of the coupling interpretation of the transformation method. Subjects A and B have identical responses up to time 3, with A continuing on study (S ≥ 4) and B dropping out (S = 3). The conditional distribution of B's response at time 4 (red) is identical to the distribution of A's response (blue) after applying the transformation T.

“Suppose A and B are subjects who are equivalent at baseline and that we are observing their response trajectories over the course of the study. At times j = 1, ..., s, subjects A and B have identical response values. Subject A drops out of our study at time s while subject B does not. Then, this assumption states that the response of subject A at time s + 1 is stochastically identical to the response of subject B at time s + 1, after applying the correction Ts+1 to the response of subject B.”

Subject matter experts may then speculate on the nature of an appropriate adjustment to Yj to account for missingness. A graphical depiction of this is given in Figure 3-2 in the case where Tj is a location shift (independent of Ȳj−1). Beyond location shifts, one may wish to consider transformations which do not allow the observations to be extrapolated too far beyond the range of the data.

In addition to its ease of interpretation, we note that reducing NFD missingness to the specification of a transformation Tj does not impose any restrictions on NFD.

Theorem 3.6. Let N denote the family of NFD identifying restrictions, and let T denote the subfamily of NFD identifying restrictions, indexed by measurable functions (T2, ..., TJ), obtained by the assumption

[Yj | Ȳj−1, S = j − 1] =d [Tj(Yj | Ȳj−1) | Ȳj−1, S ≥ j],  (2 ≤ j ≤ J).

Suppose that Y is a continuous random variable. Then T = N.

Proof. By definition, T ⊂ N; conversely, any identifying restriction in N can be constructed in T by considering an appropriate probability-integral transformation.

Theorem 3.6 makes a strictly theoretical point, as substantively meaningful choices of Tj will necessarily be parameterized by a low-dimensional parameter, e.g., a location parameter.

3.5.2 Non-monotone Missingness

The situation under non-monotone missingness is more delicate, as there is no assumption as natural as NFD to anchor our sensitivity analysis to; additionally, the plausibility of MAR under non-monotone missingness is arguably more dubious (National Research Council, 2010; Robins and Gill, 1997; Vansteelandt et al., 2007). We describe another transformation-based approach for non-monotone data which is both tractable and interpretable. We begin by defining the following identifying restriction.

Definition 15 (Observed Data Missing Value). Consider observed data (Y, R) and let

(y, r) denote a fixed realization of (Y, R). Let Yr denote those entries of Y such that rj = 1. For a fixed binary vector r, let rj* be equal to r with the modification that the j'th entry of rj* is 1. The missing data is said to satisfy the observed data missing value (ODMV) restriction if, for all data realizations (y, r), the joint distribution p(y, r) satisfies

p(Yj | Yr, R = r) = p(Yj | Yr, R = rj*).

Unlike the MAR restriction, ODMV is only a partial identifying restriction; that is, it does not identify the joint distribution p(y, r). This can be seen by noting that, conditioned on (Yobs, R), ODMV only imposes restrictions on the marginals of Ymis; any copula on these marginals results in a model satisfying ODMV. The ODMV class does not, in general, include MAR as a special case. Relative to commonly used identifying restrictions, ODMV is most closely related to the nearest case missing value (NCMV) restriction (Thijs et al., 2002).

Despite the fact that ODMV is a partial identifying restriction, it still identifies many functionals of interest. In particular, it identifies all marginal distributions p(yj), and therefore identifies the marginal mean response. That this is true can be seen from the fact that it is possible to simulate from these marginals; an explicit scheme is described in Section 3.6.

As an alternative to sensitivity analysis based on partial ignorability (2–10) or sequential explainability (2–11), we propose a type of sensitivity analysis which is similar to our transformation-based approach in the monotone setting. We use the ODMV restriction in place of ACMV. It is assumed that there exists a transformation Tr(· | yr, ξ) such that

[Yj | Yr, R = r] =d [Tr(Yj | Yr, ξ) | Yr, R = rj*], for all r and j such that rj = 0.

We refer to this identifying restriction as transformation-perturbed ODMV (T-ODMV). As in Section 3.5.1, this approach admits an interpretation which is easily explained to clinicians by considering two “coupled” subjects.

“Suppose A and B are two subjects who are equivalent at baseline and that we are observing their response trajectories over the course of the study. Let r be the missingness pattern of B, and suppose that we observe A at exactly the same times we observe B, with the exception of time j, where A is observed but B is not. Moreover, A and B have the same response values at the common times they are observed. Then, the response of subject B at time j is stochastically identical to the response of A at time j, after applying the correction Tr to the response of subject A.”

The considerations for specifying these transformations are largely the same as in the monotone setting: focus should be on constructing transformations which are both low-dimensional and interpretable. Location or location-scale shifts are again prime candidates, perhaps with modifications to match subject matter expertise.

3.6 Inference by G-Computation

Recalling steps (1–3) from Section 3.1, inference may proceed in two steps:

(a) Draw samples {pobs(t) : 1 ≤ t ≤ T} of pobs from a Markov chain targeting the posterior distribution Πobs(dpobs | O1:n).

(b) For each pobs(t), simulate ξ from its prior and calculate ψ(p), where ψ(·) is the targeted functional of interest and pmis = gξ(pobs).

When this methodology is applied (perhaps using the priors given in the examples of Section 3.4), (a) is easily accomplished by taking advantage of existing sampling algorithms designed for Π*; see Section 4.5 for an explicit example.

Unfortunately, calculation of ψ(p) for an arbitrary smooth functional ψ(·) is challenging. For example, note that, under monotone missingness with ACMV, when ψ(p) = ∫ t(y) p(y, s) dy ds we have

ψ(p) = Σ_s [ ∫ t(y) p(ȳs | S = s) × p(ys+1 | ȳs, S ≥ s + 1) × p(ys+2 | ȳs+1, S ≥ s + 2) × ··· × p(yJ | ȳJ−1, S = J) dy ] p(s).    (3–7)

We will typically be unable to evaluate this expression, even when it is possible to sample directly from p. We provide an algorithm to calculate linear functionals of the form (3–7) by Monte Carlo integration; we sample N* pseudo-draws Yi* from p and form the average ψ̂ = (N*)⁻¹ Σ_{i=1}^{N*} t(Yi*). This is essentially an application of the G-computation paradigm (Robins, 1986; Scharfstein et al., 2013) in a Bayesian context.

We give the G-computation algorithm associated with our NFD identifying restrictions in Algorithm 1. The G-computation algorithm under MAR is obtained as a special case by assuming p(yj | ȳj−1, S = j − 1) = p(yj | ȳj−1, S ≥ j).

Algorithm 1. Generic G-computation algorithm for drawing (Y*, S*) ∼ p under the NFD assumption.

1. Draw (Yobs*, S*) ∼ pobs(yobs, s), and set j = s + 1.

2. If j = J + 1, return (Y*, S*). Otherwise, continue to step 3.

3. If j = s + 1, simulate Yj* ∼ p(yj | Ȳj−1*, S = j − 1), then set j ← j + 1 and return to step 2.

4. If j > s + 1:

(a) Draw C ∼ Bernoulli(c), where c = p(Ȳj−1*, S = j − 1) / p(Ȳj−1*, S ≥ j − 1).

(b) If C = 1, draw Yj* ∼ p(yj | Ȳj−1*, S = j − 1).

(c) If C = 0, draw Yj* ∼ p(yj | Ȳj−1*, S ≥ j).

(d) Set j ← j + 1 and return to step 2.

For Algorithm 1 to be tractable, we need to be able to simulate from distributions of the form p(yobs, s) and p(yj | ȳj−1, S ≥ j). This is the case for the class of mixture models considered in Chapter 4.

Identifying restrictions based on transformations are convenient in implementing Algorithm 1: to simulate from p(yj | ȳj−1, S = j − 1), all that is required is to simulate Zj ∼ p(yj | ȳj−1, S ≥ j) and apply the transformation Yj = Tj(Zj | Ȳj−1, ξ). This is spelled out in Algorithm 2.

Algorithm 2. Modification of step 4 of Algorithm 1 when using a transformation-based identifying restriction.

1. Draw Yj* ∼ p(yj | Ȳj−1*, S ≥ j).

2. Draw C ∼ Bernoulli(c) as in Algorithm 1.

3. If C = 1, set Yj* ← Tj(Yj* | Ȳj−1*, ξ).
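To make the mechanics concrete, the following minimal sketch runs Algorithm 1 with step 4 replaced by the Algorithm 2 transformation step, for a location shift Tj(z | ȳj−1, ξ) = z + ξ. The conditionals of pobs (draw_obs, draw_on_study) and the mixing probability c are toy Gaussian stand-ins, not the model of Chapter 4; in practice they would be computed from a posterior draw of the working prior.

```python
import random

random.seed(1)

J = 3
XI = 5.0  # hypothetical sensitivity parameter; location shift T_j(z | history, xi) = z + xi

def draw_obs():
    """Draw (Y*_obs, S*) ~ pobs(yobs, s); a toy autoregressive Gaussian stand-in."""
    s = random.choice([1, 2, 3])
    y = [random.gauss(50.0, 5.0)]
    for _ in range(1, s):
        y.append(random.gauss(0.9 * y[-1] + 5.0, 5.0))
    return y, s

def draw_on_study(y_hist):
    """Draw from p(y_j | y-bar_{j-1}, S >= j); toy stand-in for the working model."""
    return random.gauss(0.9 * y_hist[-1] + 5.0, 5.0)

def mixing_prob(y_hist, j):
    """c = p(y-bar_{j-1}, S = j-1) / p(y-bar_{j-1}, S >= j-1); a toy constant stand-in."""
    return 0.3

def g_comp_draw():
    """Algorithm 1, with step 4 replaced by the Algorithm 2 transformation step."""
    y, s = draw_obs()
    for j in range(s + 1, J + 1):
        if j == s + 1:
            # step 3: by (3-6), a draw from p(y_j | ., S = j-1) is a transformed
            # draw from p(y_j | ., S >= j)
            y.append(draw_on_study(y) + XI)
        elif random.random() < mixing_prob(y, j):
            y.append(draw_on_study(y) + XI)   # C = 1: transformed draw
        else:
            y.append(draw_on_study(y))        # C = 0: untransformed draw
    return y

n = 5000
psi_hat = sum(g_comp_draw()[-1] for _ in range(n)) / n  # Monte Carlo estimate of E[Y_J]
```

A full analysis would repeat this Monte Carlo average once per MCMC draw of pobs, as in step (b) above, with ξ drawn from its prior at each iteration.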

We also give the G-computation algorithm associated with transformation-based deviations from the ODMV restriction in Algorithm 3. This algorithm differs from Algorithm 1 in the essential aspect that the draws (Y*, R*) are not jointly drawn according to p; indeed, conditional on Yobs, the missing data generated according to this algorithm are independent. This algorithm remains valid, however, because every Yj is drawn from the correct marginal distribution. As a result, we can use this algorithm to consistently estimate functionals of the form ∫ t(yj) p(y, r) dy dr.

Algorithm 3. Generic G-computation algorithm for drawing (Y*, R*) under the T-ODMV restriction.

1. Draw (Yobs*, R*) ∼ pobs(yobs, r).

2. For each j such that Rj* = 0:

(a) Draw Yj* ∼ p(yj | Yobs*, R = rj*), where rj* is identical to R*, but with j'th entry 1.

(b) Set Yj* ← TR*(Yj* | Yobs*, ξ).
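A corresponding sketch of Algorithm 3 under a location-shift correction Tr(y | yobs, ξ) = y + ξ; again, pobs and the ODMV-identified conditional are hypothetical stand-ins. Note that step (a) conditions only on the originally observed entries, so the imputed Yj's are conditionally independent given Yobs, which is exactly why only marginal functionals are estimated consistently.

```python
import random

random.seed(2)

J = 3
XI = -2.0  # hypothetical sensitivity parameter; correction T_r(y) = y + xi

def draw_obs():
    """Draw (Y*_obs, R*) ~ pobs(yobs, r); toy non-monotone stand-in."""
    r = [int(random.random() < 0.7) for _ in range(J)]
    y = [random.gauss(50.0, 5.0) if rj else None for rj in r]
    return y, r

def draw_odmv(y_obs, j):
    """Toy stand-in for p(y_j | Y*_obs, R = r_j*), the conditional under the
    pattern whose j'th entry is switched to 1."""
    obs = [v for v in y_obs if v is not None]
    mean = sum(obs) / len(obs) if obs else 50.0
    return random.gauss(mean, 5.0)

def algorithm3_draw():
    y, r = draw_obs()
    y_obs = list(y)  # imputations condition on the observed data only
    for j in range(J):
        if r[j] == 0:
            y[j] = draw_odmv(y_obs, j) + XI  # steps (a) and (b)
    return y

# Because each Y_j has the correct marginal, averages of t(y_j) are consistent
n = 4000
draws = [algorithm3_draw() for _ in range(n)]
means = [sum(d[j] for d in draws) / n for j in range(J)]
```

Joint functionals of several Yj's are not recovered by this scheme; only the marginals are guaranteed, matching the partial-identification character of ODMV.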

One potential criticism of the G-computation approach to inference is that it appears to be computationally expensive; the approach requires a separate Monte Carlo integration step at each iteration of the Gibbs sampler. Similar criticisms apply, for example, to the algorithm of Scharfstein et al. (2013), which is a combination of G-computation with a parametric bootstrap. Other issues include the choice of the pseudo-sample size N* and how to incorporate the approximation error of the Monte Carlo integration into the analysis. Fortunately, these issues have thus far not been overly burdensome in practice; while this depends on context, we have generally found that the Monte Carlo error associated with the G-computation algorithm is negligible when compared to the error of the MCMC algorithm when N* is chosen so that the computation times for the two steps are comparable. More formally, the net effect of the G-computation approximation can be assessed by considering a measurement-error model

ψ̂(t) = ψ(t) + ε(t),

where the ψ(t) follow an ergodic Markov chain and ε(t) ∼ N(0, s²(t)/N*), with (approximately) known error variances s²(t) estimated from the Monte Carlo samples from p(t). If the ψ̂(t)'s were independent, the posterior distribution Π(dψ | O1:n), and any functionals of interest, would then be estimable through deconvolution. Unfortunately, we are unaware of any useful results for deconvolution when iid samples are replaced by samples from an ergodic Markov chain.

In this chapter, we introduced a novel technique for constructing priors Πobs on the space of observed data distributions, and established some theoretical properties of the procedure. After establishing the theory in abstract, we provided a class of Dirichlet process mixture models which satisfy the necessary criteria. We then introduced a generic transformation-based approach to introducing identifying restrictions which are interpretable to subject matter experts, and provided a computational algorithm for conducting inference via MCMC. In the next chapter, we provide a concrete implementation of this approach on real data, and verify the frequentist performance via simulation.

CHAPTER 4
A DIRICHLET PROCESS MIXTURE WORKING MODEL, WITH APPLICATION TO A SCHIZOPHRENIA CLINICAL TRIAL

4.1 Introduction

In this chapter, we develop a nonparametric model for the analysis of longitudinal clinical trials using the techniques introduced in Chapter 3, in particular through the specification of a working prior. We take the working prior to be a Dirichlet process mixture of distributions which satisfy the MAR assumption. Some operational properties of this approach are studied, with coverage of the credible intervals assessed in simulation experiments. The methodology is motivated by an application to data from a longitudinal clinical trial designed to assess the efficacy of a new treatment for acute schizophrenia. Crucially, the missingness is monotone, allowing for invocation of the NFD assumption of Kenward et al. (2003). This chapter is primarily based on Linero and Daniels (2015).

4.2 The Schizophrenia Clinical Trial

Our work is motivated by a multicenter, randomized, double-blind clinical trial which aimed to assess the safety and efficacy of a test drug (81 subjects) relative to placebo (78 subjects) and an active control drug (45 subjects) for individuals suffering from acute schizophrenia. We will refer to these data as the Schizophrenia Clinical Trial (SCT) data. The primary instrument used to assess the severity of symptoms was the Positive and Negative Syndrome Scale (PANSS) (Kay et al., 1987), a clinically validated measure of severity determined by a brief interview with a clinician. Measurements were scheduled to be collected at baseline, day 4 after baseline, and weeks 1, 2, 3, and 4 after baseline.

We let Yi = (Yi1, ..., YiJ) denote the vector of PANSS scores that would have been collected had we continued to follow subject i after (potential) dropout, and let Ȳij = (Yi1, ..., Yij) be the history of responses through the first j visits; J = 6 is the total number of observations scheduled to be collected. Dropout was monotone, so that missingness of Yij implies missingness of Yi(j+1). We define Si = j if Yij was observed but Yi(j+1) was not, with Si = J if all data on subject i were collected. We write Vi = 1, 2, 3 if subject i was assigned to the test drug, active control, or placebo, respectively. The observed data for subject i is (ȲiSi, Si, Vi).

The target causal effects of interest in this study were the intention-to-treat effects

ηv0 = E0(Yi6 − Yi1 | Vi = v),

where E0 is the expectation operator with respect to the true data generating distribution.

In particular, the contrasts η10 − η30 and η20 − η30 were of interest. Moderate dropout was observed, with 33%, 19%, and 25% of subjects dropping out for V = 1, 2, 3, respectively. Subjects dropped out for a variety of reasons, including lack of efficacy and withdrawal of patient consent, with some reasons unrelated to the trial such as pregnancy or protocol violation. The active control arm featured the smallest amount of dropout; dropout on this arm was often for reasons that are not likely to be associated with missing response values (33% of dropout). Dropouts on the placebo and test drug arms were more often for reasons which are thought to be predictive of missing responses (100% and 82% of dropout, respectively). It is desirable to treat the different causes separately, particularly because the different treatments have different amounts of dropout and different proportions of dropout attributable to each cause.

The primary analysis for this clinical trial was based on a longitudinal model assuming multivariate normality and MAR, with the mean unconstrained across treatment and time and an unconstrained correlation structure shared across treatments. There is substantial evidence in the data that the multivariate normality assumption does not hold: there are obvious outliers, and a formal test of normality gives a p-value less than 0.0001. Our experience is that this tends to be the rule rather than the exception. In addition to outliers, there appears to be heterogeneity in the data that cannot be explained by a normal model. For example, Figure 4-1 shows two groups of observations in the placebo arm discovered from the latent class interpretation of the Dirichlet mixture we develop;

[Figure 4-1 appears here: two panels plotting PANSS score (40-140) against time in days (0, 4, 7, 14, 21, 28).]

Figure 4-1. Trajectories of two latent classes of individuals in the placebo arm of the trial, and mean response over time, measured in days from baseline, within class. Each figure contains 16 trajectories for the purpose of comparison.

subjects are grouped together if they have a high posterior probability of being in the same mixture component. One group consists of 40 individuals who are relatively stable across time, and the other consists of 16 individuals whose trajectories are more erratic but tend to improve more over the course of the study. These deviations from normality do not necessarily imply that an analysis based on multivariate normality will fail, as we might expect a degree of robustness, but they motivate us to assess the sensitivity of our analysis to the normality assumption and to search for robust alternatives, particularly in the presence of nonignorable missingness, where we are unaware of any methods with theoretical guarantees of robustness under misspecification of the observed data distribution.

4.3 A Dirichlet Process Mixture Working Prior

We stratify the model by treatment group; the treatment variable v will be suppressed

for simplicity of presentation. Recall that we model p0(y, s) as the realization of a random density p(y, s) drawn according to a prior Π(dp). Our approach to prior specification begins with a working prior Π*, which we take to be a Dirichlet process mixture of models

which satisfy the MAR assumption. Let fθ(y) denote a kernel for continuous data (such as the multivariate Gaussian kernel) and let πγ(s | y) denote a kernel for ordinal data given a continuous predictor which satisfies the MAR restriction πγ(s | y) = πγ(s | ȳs). If p* ∼ Π*, then

p*(y, s) =d Σ_{k=1}^∞ wk fθ(k)(y) πγ(k)(s | y),    (4–1)

where the (θ(k), γ(k)) are iid draws from some base distribution H and w = (w1, w2, ...) is given the stick-breaking prior (1–4). One possible choice of πγ(s | y) is a sequential hazard model (Diggle and Kenward, 1994),

πγ(s | y) = expit(ζs + λsᵀȳs) ∏_{j<s} (1 − expit(ζj + λjᵀȳj)).
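A small sketch of drawing (Y, S) from a truncated version of (4–1) with this sequential hazard kernel. The truncation level K, the base-measure draws, and the hazard coefficients are all hypothetical, and the within-component covariance is collapsed to a constant variance for brevity.

```python
import math
import random

random.seed(3)

J, K, ALPHA = 6, 25, 1.0  # K = stick-breaking truncation level (hypothetical)

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def stick_break(alpha, K):
    """Truncated stick-breaking weights: w_k = v_k * prod_{l<k}(1 - v_l), v_k ~ Beta(1, alpha)."""
    w, rest = [], 1.0
    for _ in range(K - 1):
        v = random.betavariate(1.0, alpha)
        w.append(v * rest)
        rest *= 1.0 - v
    w.append(rest)
    return w

weights = stick_break(ALPHA, K)
# Hypothetical base-measure draws (theta^(k), gamma^(k)): a mean trajectory for
# f_theta and hazard intercepts/slopes for pi_gamma; covariances suppressed.
comps = [{"mu": [random.gauss(70.0, 10.0) for _ in range(J)],
          "zeta": [random.gauss(-2.5, 0.5) for _ in range(J - 1)],
          "lam": 0.01}
         for _ in range(K)]

def draw_y_s():
    """One draw of (Y, S) from the truncated mixture (4-1) with a MAR hazard kernel."""
    k = random.choices(range(K), weights=weights)[0]
    c = comps[k]
    y = [random.gauss(m, 5.0) for m in c["mu"]]            # Y ~ f_theta
    s = J
    for j in range(1, J):                                   # hazard depends on y-bar_j only
        if random.random() < expit(c["zeta"][j - 1] + c["lam"] * y[j - 1]):
            s = j
            break
    return y, s
```

Because the hazard at time j uses only the history ȳj, each mixture component satisfies MAR, which is the structural restriction exploited later for G-computation.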

It remains to specify the base distribution H. We assume H(dθ, dγ) = Hθ(dθ) × Hγ(dγ). When fθ(·) is multivariate Gaussian, it is convenient to choose Hθ conjugate to fθ; a standard choice is the normal-inverse-Wishart prior. We instead favor a prior which reparametrizes Σ(k) in terms of autoregressive parameters (Daniels and Pourahmadi, 2002); given that Yi belongs to mixture component k, we write

Yij = µj(k) + Σ_{ℓ=1}^{j−1} φℓj(k) (yiℓ − µℓ(k)) + εij,  εij ∼ N(0, ρj(k)²).    (4–2)

Reparameterizing in terms of (µ(k), φ(k), ρ(k)) offers more flexibility to the practitioner to shrink the Σ(k)'s towards structures which reflect the longitudinal structure of the

data; for example, one might heavily penalize the φℓj(k)'s which correspond to high-order autoregressive effects in order to shrink the within-mixture model towards a low-order autoregressive structure. The base distribution H will typically not be specified a priori, but instead will be given a hyperprior. Prior specification for Dirichlet process mixture models (or mixture models in general) is a delicate matter, as poor choices for the prior on H may result in improper or just-proper posteriors. We provide a default prior in Appendix B.

4.4 The Extrapolation Distribution
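The sequential representation (4–2) is straightforward to sample from; the sketch below uses hypothetical parameter values for a single component with a lag-one (AR(1)-type) structure, and checks the implied covariance Cov(Yi1, Yi2) = φ12 ρ1² by Monte Carlo.

```python
import random

random.seed(4)

J = 4
# Hypothetical component-k parameters for the reparameterization (4-2)
mu = [70.0, 68.0, 66.0, 65.0]
phi = {(l, j): (0.5 if l == j - 1 else 0.0) for j in range(J) for l in range(j)}
rho2 = [25.0, 16.0, 16.0, 16.0]  # innovation variances rho_j^2

def draw_subject():
    """Sequentially draw Y_i from the autoregressive representation (4-2)."""
    y = []
    for j in range(J):
        mean = mu[j] + sum(phi[(l, j)] * (y[l] - mu[l]) for l in range(j))
        y.append(random.gauss(mean, rho2[j] ** 0.5))
    return y

# Monte Carlo check of the implied covariance structure: with only a lag-one
# coefficient, Cov(Y_1, Y_2) = phi_{12} * Var(Y_1) = 0.5 * 25 = 12.5
n = 100_000
draws = [draw_subject() for _ in range(n)]
m1 = sum(d[0] for d in draws) / n
m2 = sum(d[1] for d in draws) / n
cov12 = sum((d[0] - m1) * (d[1] - m2) for d in draws) / n
```

Zeroing out (or heavily shrinking) the high-order φℓj's in this dictionary is exactly the kind of structured shrinkage the text describes.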

We now discuss the family of identifying restrictions used in the analysis of the SCT data. The fact that missingness is monotone substantially simplifies the situation. Recall that MAR is equivalent to the ACMV identifying restriction given by the assumption (2–7) that

p(yj | ȳj−1, s = k) = p(yj | ȳj−1, s ≥ j),  (j > k ≥ 1).

Recall also the non-future dependence (NFD) assumption (2–9)

π(s | y) = π(s | ȳs, ys+1),

which is equivalent to the pattern-mixture assumption (2–8)

p(yj+1 | ȳj, s = k) = p(yj+1 | ȳj, s ≥ j),  (j > k ≥ 1).

Our approach is to embed the MAR identifying restriction within a suitable family of NFD assumptions. The only freedom within the NFD family is that we are free to specify any distribution for p(yj | ȳj−1, s = j − 1). We will consider two methods for specifying this distribution.

First, we consider the existence of a transformation Tj(Yj | Ȳj−1, ξ) such that

[Yj | Ȳj−1, S = j − 1] =d [Tj(Yj | Ȳj−1, ξ) | Ȳj−1, S ≥ j].

Wang and Daniels (2011) implicitly took this approach, with Tj(Yj | Ȳj−1, ξ) assumed to be an affine transformation. If Tj(Yj | Ȳj−1, 0) = Yj, then deviations of ξj from 0 represent deviations of the assumed model from MAR.

Although we do not apply this to the SCT data analysis, we also describe here an approach based on exponential tilting (Birmingham et al., 2003; Scharfstein et al., 1999). We take

\[
p(y_j \mid \bar{y}_{j-1}, s = j-1) \propto p(y_j \mid \bar{y}_{j-1}, s \ge j)\, \exp\{ q_j(\bar{y}_j; \xi) \}.
\]

This assumption is equivalent to requiring that

\[
\log \left[ \frac{\operatorname{Odds}(s = j-1 \mid \bar{y}_j, s \ge j-1)}{\operatorname{Odds}(s = j-1 \mid \bar{y}'_j, s \ge j-1)} \right]
= q_j(\bar{y}_j; \xi) - q_j(\bar{y}'_j; \xi),
\]

provided that ȳ'_{j−1} = ȳ_{j−1}. The tilting function q_j(ȳ_j; ξ) represents the effect of a unit change in y_j on dropout, on the scale of log-odds ratios, holding ȳ_{j−1} fixed.

4.5 Computation and Inference

Computation of posterior distributions of effects is carried out through Markov chain Monte Carlo (MCMC). Recall that our computational strategy proceeds in two steps:

S1. Draw samples of p_obs ∼ Π(dp_obs | O_{1:n}).

S2. Draw samples of the target effects from their posterior distributions through G-computation as in Section 3.6.

We carry out S1 approximately by truncating the infinite sum (4–1), applying the blocked

Gibbs sampler of Ishwaran and James (2001) to draw samples p* ∼ Π*(dp* | O_{1:n}), and, in principle, retaining only p_obs from each of these samples. This is equivalent to drawing the parameters (w_k, θ^{(k)}, γ^{(k)}) for k = 1, ..., K, where K is the truncation level. Guidelines for the choice of truncation level K are given by Ishwaran and James (2001). We note that it is also possible to construct a higher-quality, adaptive truncation by instead using a sampler based on a Pólya urn scheme (Escobar and West, 1995; Neal, 2000) and then drawing p* offline from the ε-Dirichlet distribution of Muliere and Tardella (1998). The only modification made to the algorithm proposed by Ishwaran and James (2001) is the addition of a data-augmentation step filling in the missing data; the full Gibbs sampler is provided in Appendix B.
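To make the truncation used in S1 concrete, the sketch below draws the weights and atoms of a truncated stick-breaking approximation to a Dirichlet process. It is an illustration only: the standard-normal base measure is an arbitrary stand-in, and this is not the blocked Gibbs sampler itself.

```python
import numpy as np

def truncated_dp_draw(alpha, K, base_sampler, rng):
    """Truncated stick-breaking draw approximating a Dirichlet process.

    Following Ishwaran and James (2001), the final stick is assigned all
    remaining mass so that the K weights sum exactly to one.
    """
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0                                            # absorb leftover mass
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    w = v * remaining                                      # stick-breaking weights
    atoms = base_sampler(K)                                # atoms drawn iid from the base
    return w, atoms

rng = np.random.default_rng(0)
w, atoms = truncated_dp_draw(alpha=5.0, K=50,
                             base_sampler=lambda k: rng.normal(size=k), rng=rng)
assert abs(w.sum() - 1.0) < 1e-9 and (w >= 0.0).all()
```

Larger α spreads mass over more sticks, so the truncation level K should grow with α; Ishwaran and James (2001) give explicit guidance on this choice.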

Suppose that ψ(p_0) = E_0[t(Y)] is a functional of interest; we now describe the G-computation algorithm for the Dirichlet process mixture working prior to sample functionals ψ = ψ(p) from the posterior Π(ψ | O_{1:n}). Recall that our strategy is to draw pseudo-data Y*_1, ..., Y*_{N*} and form the average (N*)^{−1} Σ_{i=1}^{N*} t(Y*_i) for some large N*. The algorithms needed to carry out the G-computation step are given in Algorithm 4, Algorithm 5, and Algorithm 6. The derivations of each algorithm are based on standard manipulations of the conditional distributions of finite mixtures and are omitted. Of special note, however, is the role played by the restriction to mixtures of models which satisfy the MAR restriction. This assumption on the model π_γ(s | y) ensures that all of the expressions involved (i.e., the terms ϖ_k in Algorithm 5 and the probability r in Algorithm 6) can be computed without the need to compute any integrals over missing data.

Algorithm 4. Algorithm to draw (ȳ_s, s) ∼ p_obs for the Dirichlet process mixture working model (4–1).

1. Draw c ∼ Categorical(w).
2. Draw y ∼ f_{θ^{(c)}}(y).
3. Draw s ∼ π_{γ^{(c)}}(s | y).
4. Retain (ȳ_s, s).
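A minimal code sketch of Algorithm 4 follows. The independent-Gaussian kernel standing in for f_{θ^(k)} and the discrete-hazard logistic form standing in for π_{γ^(k)} are simplifications, and all parameter values are hypothetical, chosen only so the example runs.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def draw_observed(w, mu, sigma, zeta, lam, rng):
    """One draw of (ybar_s, s) following Algorithm 4 (simplified kernels)."""
    K, J = mu.shape
    c = rng.choice(K, p=w)                    # 1. component c ~ Categorical(w)
    y = rng.normal(mu[c], sigma[c])           # 2. full response y ~ f_{theta^(c)}
    s = J                                     # 3. dropout time via discrete hazards
    for j in range(J - 1):
        if rng.random() < expit(zeta[c, j] + lam[c, j] * y[j]):
            s = j + 1
            break
    return y[:s], s                           # 4. retain (ybar_s, s)

rng = np.random.default_rng(1)
w = np.array([0.6, 0.4])                      # hypothetical parameter values
mu = np.array([[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]])
sigma = np.ones((2, 3))
zeta = np.full((2, 2), -1.0)
lam = np.full((2, 2), 0.5)
ybar, s = draw_observed(w, mu, sigma, zeta, lam, rng)
assert 1 <= s <= 3 and len(ybar) == s
```

Repeating such draws N* times and averaging a functional t(·) of the completed pseudo-data is exactly the Monte Carlo averaging used in the G-computation step.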

4.6 Simulation Studies

In this section, we present the results of several simulation studies designed to assess the viability of our approach for the Schizophrenia clinical trial (SCT). In the first experiment, we illustrate our approach for J = 3 time points under the MAR assumption, and compare it to a parametric working prior as well as a doubly robust estimator. In the second experiment, we fit several models to the SCT data and compare the Dirichlet

Algorithm 5. Algorithm to draw y_{s+1} ∼ p(y_{s+1} | ȳ_s, s) for the Dirichlet process mixture working prior (4–1) under NFD, using a transformation T_{s+1}(y_{s+1} | ȳ_{s+1}, ξ).

1. Draw c ∼ Categorical(ϖ), where

\[
\varpi_k \propto w_k\, f_{\theta^{(k)}}(\bar{y}_s) \left[ 1 - \sum_{j=1}^{s} \pi_{\gamma^{(k)}}(j \mid \bar{y}_j) \right].
\]

2. Draw y'_{s+1} ∼ f_{θ^{(c)}}(y_{s+1} | ȳ_s).
3. Set y_{s+1} = T(y'_{s+1} | ȳ_s, ξ).

Algorithm 6. Algorithm to draw y_{s+1} ∼ p(y_{s+1} | ȳ_s, S ≥ s) under the NFD restriction.

1. Draw R ∼ Bernoulli(r), where

\[
r = \frac{\sum_{k=1}^{K} w_k\, f_{\theta^{(k)}}(\bar{y}_s)\, \pi_{\gamma^{(k)}}(s \mid \bar{y}_s)}
{\sum_{k=1}^{K} w_k\, f_{\theta^{(k)}}(\bar{y}_s) \left[ 1 - \sum_{j=1}^{s-1} \pi_{\gamma^{(k)}}(j \mid \bar{y}_j) \right]}.
\]

2. If R = 0, draw y_{s+1} ∼ p(y_{s+1} | ȳ_s, S ≥ s + 1); otherwise draw y_{s+1} ∼ p(y_{s+1} | ȳ_s, s).

mixture working prior to a parametric working prior, with the focus being on robustness and sensitivity of the methods to deviations from MAR.

4.6.1 Performance for Mean Estimation under MAR

We first assess the performance of our model for mean estimation under MAR. Data were generated under two settings with J = 3 time points. The targeted parameter of

interest was the mean response at completion of the study E0[Y3]. Under the first setting, the observed data were generated from a multivariate Gaussian working model with an

AR-1 covariance structure such that Cov(Yj,Yj+1) = 0.7, and a lag-1 selection model. Under the second setting, the observed data were generated using a working model which was a mixture of two Gaussian distributions and a piecewise-constant hazard of dropout. Details of the parameters used to generate the simulated data can be found in Appendix B. We compare our Dirichlet mixture working prior with MAR imposed to (a) a multivariate Gaussian working prior with a noninformative prior and (b) augmented

inverse-probability weighting (AIPW) methods (Rotnitzky et al., 1998; Tsiatis, 2006; Tsiatis et al., 2011). The AIPW estimator used solves the estimating equation

\[
\sum_{i=1}^{n} \left\{ \frac{I(S_i = J)}{\Pr(S_i = J \mid Y_i)}\, \varphi(Y_i, \theta)
+ \sum_{j=1}^{J-1} \frac{I(S_i = j) - \lambda_j(Y_i)\, I(S_i \ge j)}{\Pr(S_i > j \mid Y_i)}\, E[\varphi(Y_i, \theta) \mid \bar{Y}_{ij}] \right\} = 0,
\]

where Σ_i φ(Y_i, θ) = 0 is a complete-data least-squares estimating equation for the regression of Y_1 on Y_2 and of (Y_1, Y_2) on Y_3, and λ_j(Y_i) is the dropout hazard at time j. This estimator is doubly robust in the sense that if either the dropout model or the mean response model is correctly specified, then the associated estimator is consistent and asymptotically Gaussian. Our AIPW method assumed the correct dropout model and hence is consistent. A sandwich estimator of the covariance of the parameter vector θ was used to construct interval estimates. One thousand datasets were generated with N = 100 observations per dataset.

Results are given in Table 4-1. When the data are generated under the Gaussian working prior, all methods perform similarly. Under the mixture model, however, Gaussian-based inference is inefficient, although it does attain the nominal coverage rate. The AIPW estimator and Dirichlet process mixture give similar performance. These results suggest the Dirichlet mixture is a reasonable alternative to the Gaussian model: even when the data have a Gaussian distribution, we lose little by using the mixture model, while gaining robustness if the Gaussian model does not describe the observed data well. The AIPW method also performs well and is a reasonable semiparametric alternative, but is not directly applicable to our desired approach to sensitivity analysis. The proposed modeling approach provides, in this example, the robustness of AIPW within the Bayesian framework, and thus naturally allows for quantification of uncertainty about the missingness via priors and allows inference for any functional of the full-data observed response model.
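Double robustness is easy to see in a deliberately stripped-down two-time-point version of the estimator (not the J-time-point estimating equation above). Below, the dropout model is correct while the outcome model m(Y_1) is intentionally misspecified as a constant, yet the AIPW estimate of E[Y_2] remains consistent; all data are simulated purely for illustration.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
n = 200_000

# Y1 always observed; Y2 observed only when R = 1 (MAR given Y1).
y1 = rng.normal(0.0, 1.0, n)
y2 = 1.0 + 0.5 * y1 + rng.normal(0.0, 1.0, n)   # true E[Y2] = 1
pi = expit(0.5 + 0.8 * y1)                      # true response probability
r = rng.random(n) < pi

# AIPW: inverse-probability-weighted term plus an augmentation built from an
# outcome model m(Y1).  Here m is a (wrong) constant, but since pi is correct
# the estimator is still consistent -- the double robustness property.
m = np.full(n, 0.5)
y2_filled = np.where(r, y2, 0.0)
aipw = np.mean(y2_filled / pi - (r - pi) / pi * m)
assert abs(aipw - 1.0) < 0.05
```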

Table 4-1. Comparison of methods for estimating the mean at time J = 3. DP, Gaussian, and AIPW refer to inferences based on the Dirichlet mixture model, the Gaussian model, and the AIPW method, respectively.

Gaussian Model
  Method     Bias ×10³   CI Width       CI Coverage    Mean Squared Error ×10²
  DP         -1(4)       0.493(0.001)   0.963(0.006)   1.443(0.06)
  Gaussian   -5(4)       0.494(0.002)   0.944(0.007)   1.524(0.07)
  AIPW       -1(4)       0.470(0.002)   0.943(0.007)   1.530(0.07)

Mixture of Gaussian Models
  Method     Bias ×10³   CI Width       CI Coverage    Mean Squared Error ×10²
  DP         -10(4)      0.542(0.001)   0.950(0.007)   1.82(0.08)
  Gaussian   -39(5)      0.586(0.001)   0.949(0.007)   2.20(0.10)
  AIPW       1(4)        0.523(0.001)   0.944(0.007)   1.85(0.08)
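For reference, the summaries reported in Table 4-1 can be computed from replicate-level simulation output as below; the four replicate values are made-up numbers used only to exercise the function.

```python
import numpy as np

def simulation_metrics(estimates, lowers, uppers, truth):
    """Bias, average CI width, CI coverage, and MSE across simulation
    replicates (the quantities tabulated in Table 4-1, without the Monte
    Carlo standard errors)."""
    est = np.asarray(estimates)
    lo, hi = np.asarray(lowers), np.asarray(uppers)
    bias = np.mean(est - truth)
    width = np.mean(hi - lo)
    coverage = np.mean((lo <= truth) & (truth <= hi))
    mse = np.mean((est - truth) ** 2)
    return bias, width, coverage, mse

bias, width, coverage, mse = simulation_metrics(
    estimates=[1.02, 0.95, 1.10, 0.99],
    lowers=[0.80, 0.70, 0.85, 0.75],
    uppers=[1.25, 1.20, 1.35, 1.25],
    truth=1.0)
assert coverage == 1.0 and abs(bias - 0.015) < 1e-9
```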

4.6.2 Performance for Effect Estimation Under MNAR

To determine the suitability of our approach for the SCT data, we conducted a simulation study to assess the accuracy and robustness under several data generating mechanisms. We consider three different models for the observed data:

M1. A lag-2 selection model: Y ∼ N(μ, Σ) and logit Pr(S = s | S ≥ s, Y) = ζ_s + γ_{1s} Y_s + γ_{2s} Y_{s−1}.

M2. A finite mixture of lag-1 selection models: C ∼ Categorical(ξ), [Y | C] ∼ N(μ_C, Σ_C), and logit Pr(S = s | S ≥ s, Y, C) = ζ_{sC} + γ_{sC} Y_s.

M3. A lag-2 selection model where Y_j ∼ Skew-T_ν(μ_j, σ_j, ω_j), where ω_j is a skewness parameter (see Azzalini, 2013), and logit Pr(S = s | S ≥ s, Y) = ζ_s + γ_{1s} Y_s + γ_{2s} Y_{s−1}. The marginals of Y are linked by a Gaussian copula.

We took α ∼ log-N(−3, 2²) to induce a strong preference for simpler models. Parameters were generated by fitting our model to the active control arm of the SCT data, and the sample size for the simulation was set to 200. Our approach was compared to a default analysis based on model M1. Setting M1 was chosen to assess the loss of our approach from specifying the nonparametric mixture model when a simple parametric alternative holds. Setting M2 was chosen to determine the loss in accuracy when inference is based on a multivariate Gaussian when a model similar to the Dirichlet mixture holds. At N = 200,

datasets generated under M2 are not more obviously non-Gaussian than the original data. M3 was chosen to assess the robustness of both the multivariate Gaussian and the Dirichlet mixture to the presence of skewness, kurtosis, and a non-linear relationship between components. To generate the parameters used in M3, we generated data under M1 and transformed it to be more skewed, kurtotic, and non-linear by applying a Gaussian distribution function and skew-T quantile function; details and parameter values are given in Appendix B, as well as sample datasets generated under M2 and M3.

To complete these models, the NFD assumption

\[
[Y_j \mid \bar{Y}_{j-1}, S = j-1] \overset{d}{=} [Y_j + \sigma_j \xi \mid \bar{Y}_{j-1}, S \ge j]
\]

was made, where σ_j is the standard deviation of [Y_j | Ȳ_{j−1}] under MAR. The parameter ξ represents the number of standard deviations larger Y_j is, on average, for those who dropped out before time j compared to those who remained on study at time j. Figure 4-2 shows the frequentist coverage and average width of 95% credible intervals, as well as the root mean squared error (RMSE) of the posterior mean η̂ of the target parameter η = E(Y_6), for each value of ξ along the grid {0, 0.5, 1, 1.5, 2}; exact values and Monte-Carlo standard errors are given in Appendix B. We take ξ ≥ 0 to reflect the belief that those who dropped out had stochastically higher PANSS scores than those who stayed on study. This was done to assess whether the quality of inferences varies as the sensitivity parameter increases; intuitively, this might happen for large values of ξ, as the extrapolation distribution becomes increasingly concentrated on regions where we lack observed data.
The Dirichlet mixture performs at least as well as the Gaussian model under M1 and uniformly better when either M2 or M3 holds; the Dirichlet mixture working prior attains its nominal coverage under M2 and M3, while analysis based on the multivariate Gaussian working prior has inadequate coverage, which even appears to degrade for larger values of ξ under M3. Given the negligible loss incurred using the Dirichlet mixture under M1, and the large drop in coverage and higher RMSE when M1 did not generate the data, we see little reason to assume a multivariate Gaussian model for the observed data. Finally, we


Figure 4-2. Results from the simulation study in Section 4.6.2. Normal refers to M1, Mixture to M2, and Skew-T to M3.

note that under M3, while the average interval length is roughly the same for both models, the interval length varies twice as much for the Dirichlet mixture; so while on average the Dirichlet mixture produces intervals of similar length, intervals may be wider or narrower depending on the data. These results again suggest that our approach may add a layer of robustness while giving up little when the corresponding parametric model holds, and we see no reason to prefer the parametric approach over the nonparametric approach.

4.7 Application to the Schizophrenia Clinical Trial

We now apply our methodology to the SCT data. Recall that the effects of interest are η_v = E(Y_{i6} − Y_{i1} | V_i = v), where v = 1, 2, 3 denotes randomization to the test drug, active control, and placebo, respectively; in particular, we are interested in the improvement of each treatment over the placebo, η_v − η_3.

4.7.1 Comparison to Alternatives and Assessing Model Fit

We consider two parametric models for the observed data in addition to a Dirichlet process mixture of lag-2 selection models. We considered several variants of pattern mixture models and selection models and found that the following provided reasonable fits within each class.

1. A pattern mixture model. The law of S is modeled nonparametrically with a discrete distribution across time points, while the law of [Y | S] is modeled with a multivariate Gaussian N(μ_S, Σ_S) distribution. Due to sparsity in the observed patterns, we must share information across S to get practical estimates of (μ_S, Σ_S). Observations with S ∈ {1, 2, 3}, S ∈ {4, 5}, or S ∈ {6} were given the same value of μ_S, while Σ_S = Σ was shared across patterns. MAR is imposed on top of this, with MAR taking precedence over the sharing of μ_S.

2. A selection model. The outcome Y is modeled with a multivariate Gaussian N(μ, Σ) distribution. [S | Y] is modeled with a discrete-hazard logistic model, Pr_{ζ,λ}(S_i = j | Y_i, S ≥ j) = expit(ζ_j + λ_j^⊤ ȳ_j), which for j ≥ 3 was simplified to expit(ζ_j + λ_{1j} Y_j + λ_{2j} Y_{j−1}).

Table 4-2. Comparison of results under the MAR assumption. The posterior mean is given for η_v − η_3, with 95% credible interval in parentheses.

  Model                   η_1 − η_3           η_2 − η_3            LPML
  Dirichlet Mixture       -1.7 (-8.0, 4.8)    -5.4 (-12.6, 2.3)    -3939
  Selection Model         -1.8 (-8.8, 5.5)    -6.1 (-14.2, 2.0)    -4080
  Pattern Mixture Model   -2.2 (-10.1, 5.3)   -6.8 (-15.6, 2.3)    -4072

The Dirichlet mixture used is a mixture of the selection model above. Models were assessed by their posterior predictive ordinates,

\[
\mathrm{PO}_i = p(O_i \mid O_{-i}),
\]

where O_{−i} = (O_1, ..., O_{i−1}, O_{i+1}, ..., O_n) denotes the observed data with the i'th observation removed (Geisser and Eddy, 1979). The PO_i's can easily be calculated from the MCMC output and combined to give an omnibus model selection criterion,

\[
\mathrm{LPML} = \sum_{i=1}^{n} \log \mathrm{PO}_i,
\]

the log pseudo-marginal likelihood; see Lopes et al. (2003) and Hanson et al. (2008) for examples in a Bayesian nonparametric setting. A comparison of model fit and of inferences under the different proposed models is given in Table 4-2. LPML selects the Dirichlet mixture over the selection model and pattern mixture model. Improvement over the selection model is unsurprising in light of the established failure of multivariate Gaussianity. Like the Dirichlet mixture, the marginal response distribution for the pattern mixture model is a discrete Gaussian mixture, so an improvement in LPML here is more informative. In addition to the improvement in LPML, the simulation results suggesting robustness argue for inference based on the Dirichlet mixture. We also note the Dirichlet mixture results in narrower interval estimates.

To confirm that the Dirichlet mixture reasonably models the observed data, we compare model-free estimates and intervals of the dropout rates and observed-data means at each time point to those obtained by the model under each treatment. Results are

displayed in Figure 4-3.

Figure 4-3. Top: modeled dropout versus observed dropout over time. Bottom: modeled observed means versus empirical observed means. Columns correspond to the active control, placebo, and test drug arms. The solid line represents the empirical statistics; solid dots represent the modeled statistics. Dashed error bars represent frequentist 95% confidence intervals and solid error bars represent the model's 95% credible intervals.

There do not appear to be any problems with the model's fit to estimates obtained from the empirical distribution of the data.

4.7.2 Inference and Sensitivity Analysis

Reasons for dropout were partitioned into those thought to be associated with MNAR missingness (withdrawal of patient consent, physician decision, lack of efficacy, and disease progression) and those thought to be associated with MAR missingness (pregnancy, adverse events such as occurrence of side effects, and protocol violation). We let M_{ij} = 1 if a subject dropped out at time j for reasons consistent with MNAR missingness and M_{ij} = 0 otherwise. Given that a subject is last observed at time S, we model

\[
\Pr(M = 1 \mid \bar{Y}_s = \bar{y}_s, S = s, V = v) = \chi_v(\bar{y}_s). \tag{4–3}
\]

The function χ_v(ȳ_s) can be estimated from the data using information about dropout. To make use of this information in the G-computation, we make the NFD completion given by the mixture distribution

\[
[Y_j \mid \bar{Y}_{j-1}, S = j-1, V = v] \overset{d}{=}
\chi_v(\bar{Y}_{j-1}) \bigl[ \mathcal{T}(Y_j \mid \xi_j) \mid \bar{Y}_{j-1}, S \ge j \bigr]
+ \bigl[ 1 - \chi_v(\bar{Y}_{j-1}) \bigr] \bigl[ Y_j \mid \bar{Y}_{j-1}, S \ge j \bigr]. \tag{4–4}
\]

This is a mixture of an MAR completion and the transformation-based NFD completion. It encodes the belief that, if a subject drops out for a reason associated with MAR missingness, we should impute the next missing value under ACMV. In selecting a model for χ_v(ȳ_s), S and Ȳ_S were found to have a negligible effect on the fit of (4–3), while the treatment V was found to be very important, so we take χ_v(ȳ_s) = χ_v to depend only on V. The coefficients χ_v were given independent Uniform(0, 1) priors and were drawn from their posterior during the MCMC simulation.

To identify the effect of interest, it still remains to specify the transformation T(y | ξ) and place an appropriate prior on ξ. We take T(y | ξ) = y + ξ to be a location shift. This encodes the belief that the conditional distribution of Y_j for an individual who dropped out at time j − 1 is the same as it would be for a hypothetical individual with the same history who is still on study, but shifted by ξ. This can be explained to clinicians as an adjustment to the PANSS score of an individual who remained on study at time j that would need to be made to make this individual have the same average as an individual with the same history who dropped out for an informative reason, with the caveat that the same adjustment must be made regardless of their history and response value. In general, if subject-matter experts feel constrained by needing to specify a single adjustment, this may be reflected by revising the transformation chosen. Information regarding the scale of the data can be used as an anchor for prior specification. The residual standard deviation in the observed data pooled across time was roughly 8, and it is thought unlikely that deviations from MAR would exceed a

standard deviation. The ξ_j were restricted to be positive to reflect the belief that subjects who dropped out were those whose PANSS scores were higher than predicted under MAR. Based on this, we specified ξ_j ∼ iid Uniform(0, 8), with the ξ_j shared across treatments. While it may seem as though sharing the ξ_j across treatments will cause the effect of MNAR to cancel out in comparisons, the differing amounts of dropout, and the differing proportions of dropout attributable to each cause, will cause ξ to affect each treatment differently.
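Inside the G-computation, the mixture completion (4–4) with the location-shift transformation amounts to the following sketch; the values χ_v = 0.4, ξ_j = 4, and the N(80, 8²) MAR conditional are illustrative numbers loosely echoing the PANSS scale, not fitted quantities.

```python
import numpy as np

def impute_next(y_mar_draw, chi_v, xi_j, rng):
    """Impute Y_j for a subject with S = j - 1 under completion (4-4):
    with probability chi_v (informative dropout) shift a MAR draw by xi_j,
    otherwise keep the MAR draw itself (ACMV imputation)."""
    if rng.random() < chi_v:
        return y_mar_draw + xi_j
    return y_mar_draw

rng = np.random.default_rng(4)
chi_v, xi_j = 0.4, 4.0
draws = np.array([impute_next(rng.normal(80.0, 8.0), chi_v, xi_j, rng)
                  for _ in range(100_000)])
# The imputed mean exceeds the MAR mean by chi_v * xi_j.
assert abs(draws.mean() - (80.0 + chi_v * xi_j)) < 0.2
```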

Results are summarized in Figure 4-4. The effect η_1 − η_3 had posterior mean −1.7 and 95% credible interval (−8.0, 4.8) under MAR, and posterior mean −1.6 and credible interval (−8.4, 5.4) under MNAR. The effect η_2 − η_3 had posterior mean −5.4 and credible interval (−12.6, 2.3) under MAR, and posterior mean −6.2 and credible interval (−13.8, 2.0) under MNAR. There appears to be little evidence in the data that the test drug is superior to the placebo, and for much of the trial the placebo arm appears to have had better performance. The effect of the MNAR assumption on inferences is negligible here due to the fact that the placebo and test drug arms had similar dropout profiles, and because the sensitivity parameters ξ_j had the same prior mean across treatments. The data do contain some evidence of an effect of the active control, and we see here that the MNAR assumption increases the gap between the active control and the placebo, due to the fact that attrition on the active arm was less frequent and, when it occurred, was more frequently for noninformative reasons. We varied this prior specification in two ways. First, we considered sensitivity to the

dependence assumptions regarding the ξ_j's by allowing them to be dependent across j and allowing different values ξ_{jv} across treatments, while keeping the marginal prior on each ξ_{jv} the same. The ξ_{jv}'s were linked together by a Gaussian copula parameterized by ρ_time and ρ_treatment determining the correlation between the ξ_{jv}, with ρ_time = 0 and ρ_treatment = 1 corresponding to the original prior. The result of this analysis was that inferences were

invariant to the choices of ρ_time and ρ_treatment to within Monte-Carlo error, so detailed results are omitted.

Figure 4-4. Improvement of treatments, measured as the difference in change from baseline, over placebo over time (top panel: active drug improvement over placebo; bottom panel: test drug improvement over placebo). Smaller values indicate more improvement relative to placebo. Whiskers on the boxes extend to the 0.025 and 0.975 quantiles, the boundaries of the boxes represent the quartiles, and the dividing line within the boxes represents the posterior mean.

Second, we considered the effect of the mean and variability of the prior on inferences by giving each ξ a point-mass prior and varying the prior along a grid. This analysis is useful in its own right, as some may feel uncomfortable specifying a single prior on the sensitivity parameters. To ease the display of the inferences, we assumed that all

ξj’s within treatment were equal; we write ξP, ξT, and ξA for the sensitivity parameters corresponding to the placebo, test, and active control arms respectively. Figure 4-5

displays results of this analysis in a contour plot.

Figure 4-5. Contour plot giving inference for the effects η_v − η_3 for different choices of the sensitivity parameters. The color represents the posterior mean, while dark lines give contours of the posterior probability of η_v − η_3 > 0.

To illustrate, if we chose as a cutoff a 0.95 probability of superiority as being significant evidence of an effect, then we see that even for the most favorable values of ξ_T and ξ_P we do not reach a 0.95 posterior probability of η_1 − η_3 > 0. Conversely, a 0.95 posterior probability of η_2 − η_3 > 0 is attained, although it occurs in a region where ξ_A is substantially smaller than ξ_P. The additional uncertainty in the η_v's induced by using a prior appears, for this data, to have little effect on the posterior, as inference when ξ_P = ξ_T = ξ_A = 4 gives roughly the same inferences as the original prior.

4.8 Discussion

In this chapter, we introduced a novel working prior for the analysis of incomplete longitudinal data subject to informative missingness, which provides both a flexible model for the observed data and substantial scope for conducting a principled sensitivity analysis. This was accomplished by modeling the observed data with a Dirichlet process mixture. We attain both flexible modeling of the observed data and flexible specification of the extrapolation distribution. We note that there is nothing specific to the Dirichlet process in our specification; our approach applies to essentially any class of stick-breaking priors for which approximate draws from the posterior can be generated.

An alternative to the transformation-based sensitivity analysis presented here is an exponential tilting assumption, p(y_j | ȳ_{j−1}, s = j−1) ∝ p(y_j | ȳ_{j−1}, s ≥ j) exp{q_j(ȳ_j; ξ)}. Our method is also amenable to this approach if the Gaussian kernel is used and q_j(ȳ_j; ξ) is piecewise-linear in y_j. Since q_j(ȳ_j; ξ) is unidentified and typically will be elicited from a subject-matter expert, the piecewise-linearity assumption may not be a substantial restriction. A modification of the G-computation algorithm in this setting is provided in Appendix B.

In future work, we hope to develop similar tools for continuous-time dropout; when dropout time is continuous, there is no longer a natural characterization of MAR in terms of identifying restrictions. Additionally, there is scope for incorporating baseline covariates. Often covariates are used to help impute missing values, or to make the MAR assumption more plausible, but are not of primary interest (i.e., auxiliary covariates). Another area for future work is extending our method to non-monotone missingness without needing to invoke a partial ignorability assumption.
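As a numerical illustration of the exponential tilting assumption with the linear choice q(y) = ξy (a special case of piecewise-linear), one can tilt a Gaussian MAR conditional on a grid and renormalize; for a N(μ, σ²) density the tilted distribution is N(μ + ξσ², σ²) in closed form, which serves as a check. The values of μ, σ, and ξ below are arbitrary.

```python
import numpy as np

mu, sigma, xi = 0.0, 1.0, 0.7
y = np.linspace(-10.0, 10.0, 20_001)
mar = np.exp(-0.5 * ((y - mu) / sigma) ** 2)   # unnormalized N(mu, sigma^2) density
tilted = mar * np.exp(xi * y)                  # exponential tilt exp{q(y)} = exp{xi*y}
tilted /= tilted.sum()                         # renormalize on the grid
tilted_mean = (y * tilted).sum()
# Closed form: tilting N(mu, sigma^2) by exp{xi*y} gives N(mu + xi*sigma^2, sigma^2).
assert abs(tilted_mean - (mu + xi * sigma ** 2)) < 1e-6
```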

CHAPTER 5
EMPIRICAL BAYES ESTIMATION AND MODEL SELECTION FOR HIERARCHICAL NONPARAMETRIC PRIORS

5.1 Introduction

In recent years, Bayesian hierarchical models have become a standard tool for the analysis of complicated data structures. Suppose we have data Y = {Y_{ij} : 1 ≤ i ≤ n_j, 1 ≤ j ≤ J} representing n_j observations from group j, for j = 1, ..., J. Within each group, we have Y_{ij} ∼ F_j independently, where F_j is a group-level distribution, with the F_j's sharing some structure. This setup is depicted in Figure 5-1. The Bayesian view is convenient as it allows for a natural propagation of uncertainty, with information shared in a principled manner across groups. This approach has proven to be useful in settings as varied as the study of hospital standards across states (Rodriguez et al., 2008), hierarchical social science data (Gelman and Hill, 2006), and text modeling (Blei et al., 2003).

One may proceed by assuming that the F_j's are members of some parametric family {H_ω : ω ∈ Ω}; however, in many cases, there is little reason a priori to expect the F_j's to lie in a given parametric family. Bayesian nonparametric approaches provide additional flexibility by modeling the F_j's as random measures. A common strategy is to specify a latent variable model,

\[
\begin{aligned}
Y_{ij} \mid \psi_{ij} &\overset{\text{indep}}{\sim} f_{\psi_{ij}}, && (1 \le i \le n_j,\ 1 \le j \le J), \\
\psi_{ij} \mid G_j &\overset{\text{indep}}{\sim} G_j, && (1 \le i \le n_j,\ 1 \le j \le J),
\end{aligned}
\]

where f_ψ is a density or mass function indexed by ψ. Taking a nonparametric approach, we might assume that the G_j's are drawn iid from a Dirichlet process (Antoniak, 1974; Ferguson, 1973): G_j ∼ D(αG_0), where G_0 is a base probability measure and α > 0 is a measure of prior concentration. The F_j's are then determined by the mixtures ∫ f_ψ(·) G_j(dψ). There are many ways of introducing dependencies among the probability measures {G_j}_{j=1}^J in order to share information across populations, including the general dependent Dirichlet process (DDP) framework (MacEachern, 1999).
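The latent variable model above can be simulated directly with a truncated stick-breaking draw of each G_j; the Gaussian base measure and kernel below are illustrative stand-ins, and all constants are arbitrary.

```python
import numpy as np

def stick_breaking(alpha, K, rng):
    """Truncated stick-breaking weights for a draw from D(alpha * G0)."""
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0                                # assign leftover mass to last stick
    return v * np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])

rng = np.random.default_rng(5)
J, n_j, K, alpha = 3, 50, 100, 2.0

data = {}
for j in range(J):
    w = stick_breaking(alpha, K, rng)          # weights of G_j ~ D(alpha * G0)
    atoms = rng.normal(0.0, 3.0, size=K)       # atoms drawn iid from G0
    psi = atoms[rng.choice(K, size=n_j, p=w)]  # psi_ij ~ G_j
    data[j] = rng.normal(psi, 1.0)             # Y_ij ~ f_psi (Gaussian kernel)

assert all(len(data[j]) == n_j for j in range(J))
```

Because each G_j is discrete, the ψ_{ij}'s within a group cluster on a few atoms, which is what makes the induced mixtures ∫ f_ψ(·) G_j(dψ) flexible yet parsimonious.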


Figure 5-1. Graphical depiction of a Bayesian hierarchical data generating mechanism. The Yij’s are drawn independently from their respective Fj’s. The Fj’s are linked through a prior, represented by the black box.

One way to share information across groups is to allow for the sharing of atoms among the G_j's. Teh et al. (2006) introduced a prior referred to as a hierarchical Dirichlet process (HDP), which stipulates a shared support for the G_j's, but with different proportions assigned to each atom. The hierarchical Dirichlet process takes the base distribution itself as being drawn from a Dirichlet process, with the full hierarchy given by

\[
\begin{aligned}
Y_{ij} \mid \psi_{ij} &\overset{\text{indep}}{\sim} f_{\psi_{ij}}, && (1 \le i \le n_j,\ 1 \le j \le J), \\
\psi_{ij} \mid G_j &\overset{\text{indep}}{\sim} G_j, && (1 \le i \le n_j,\ 1 \le j \le J), \\
G_j \mid G_0 &\overset{\text{indep}}{\sim} \mathcal{D}(\alpha G_0), && (1 \le j \le J), \\
G_0 &\sim \mathcal{D}(\gamma H_\omega),
\end{aligned} \tag{5–1}
\]

where H_ω is a member of a parametric family {H_{ω'} : ω' ∈ Ω} with density h_ω(·). This is expressed graphically in Figure 5-2. The HDP features hyperparameters (α, γ, ω), and an important question is how to choose these hyperparameters. We focus on (α, γ), with ω given a fixed prior ν(ω). As illustrated in Section 5.4.2, the hyperparameters (α, γ) can have a substantial influence on the quality of estimates in practice. Additionally, α and γ determine how concentrated the HDP prior is on certain nonparametric and semiparametric submodels. For example, as α → ∞ or α → 0, the Dirichlet process D(αH) tends to a point mass at H, and a

Figure 5-2. Graphical representation of the HDP as a directed acyclic graph, with gray circles representing observed quantities, transparent circles representing unobserved data, and diamonds representing hyperparameters. The random measure G_0 is drawn from the D(γH_ω) distribution. The G_j's are then drawn independently from a D(αG_0), which then give rise to random effects ψ_{ij} and observations Y_{ij}.

point mass at δ_ψ where ψ ∼ H, respectively (Sethuraman and Tiwari, 1982); similarly, γ controls how concentrated the G_j's are on a point mass at G_0 or a point mass at a δ_φ, where φ ∼ G_0. The cases where one of the precision hyperparameters tends to infinity represent a submodel in which, effectively, only one level of the hierarchy is nonparametric, or in which information is not shared across groups. It may be of interest to determine whether such a simplification is justified. This is discussed in Section 5.2.2.

The most common approach for addressing the choice of hyperparameters is the fully-Bayesian approach of placing a prior on h = (α, γ), with inference carried out through Markov chain Monte Carlo (Teh et al., 2006) or by constructing a variational approximation to the posterior (Teh et al., 2007; Wang et al., 2011). However, there are several reasons one might wish to avoid this. On one hand, there is no useful objective prior for this problem; in particular, essentially any improper prior on the hyperparameters α and γ results in an improper posterior. A formal statement and proof of this result is provided in Appendix C. On the other hand, as a consequence of this lack of an objective prior, informative priors will impose information that is not "swamped" by the data, and may unduly affect inference.

Figure 5-3 shows trace-plots of samples drawn from a Markov chain targeting the posterior of γ. The right plot corresponds to using an improper prior, while the left plot

corresponds to using a diffuse (but proper) exponential prior.

Figure 5-3. Draws from a Markov chain targeting the posterior of γ on a simulated dataset, generated using α = γ = 5 and n_j = J = 100. On the left is a trace-plot of draws from the Markov chain when the γ parameter is given an exponential prior with mean 100; on the right is the trace-plot when γ is given an improper uniform prior on (0, ∞).

The chain utilizing an improper prior exhibits transient behavior as a result of posterior impropriety, making inferences based on such a prior dubious. A troubling aspect is that, if the chain is terminated early, there may be no evidence of this behavior in the samples from the Markov chain (Hobert and Casella, 1996). Proper priors may be used on (α, γ), but results may be sensitive to the choice of prior; this places the practitioner in the original situation, with the choice of hyperprior replacing the choice of hyperparameter. For these and other reasons, one may seek an alternative to the fully-Bayesian approach.

One alternative is to base inference on the marginal likelihood m_{α,γ}(Y). For most purposes, it suffices to estimate the marginal likelihood up to a (universal) constant. For example, Bayesian hypothesis testing requires computation only of the ratios of marginal likelihoods, which can be accomplished if the marginal likelihoods

are known up to a constant. One approach is to estimate a Bayes factor surface

BF_q(α, γ) = m_{α,γ}(Y)/q(Y), where q(Y) is the marginal likelihood of Y under some distinguished model. For example, one choice for q(Y) is to set q(Y) = m_{α₀,γ₀}(Y) for

some hyperparameter values (α₀, γ₀). While neither q(Y) nor m_{α,γ}(Y) can be computed

tractably in general, it is often possible to approximate the Bayes factor BFq(α, γ).

Maximization of BF_q(α, γ), which is equivalent to maximization of m_{α,γ}(Y), gives the empirical Bayes estimator ĥ = (α̂, γ̂) of h = (α, γ). An additional benefit of focusing on the marginal likelihood is that one can assess various aspects of the structure of the model by examining the Bayes factor surface as the concentration parameters α and γ tend to 0 or ∞.

5.1.1 Motivating Examples

We are motivated by two examples. First, we consider data collected to assess the quality of health care in the United States and Outlying Territories, obtained from the

Hospital Compare database available at www.medicare.gov/hospitalcompare. A variety

of measures of quality of care are available. In this setting, Yij = (Xij, Nij), where Xij is the number of times the ith hospital in region j (j = 1, . . . , 54) administered a correct and timely treatment out of Nij opportunities. Rodriguez et al. (2008) analyzed similar data, with the aim of clustering the regions, using a nested Dirichlet process. Here, we focus on (1) assessing the extent to which nonparametric models improve on parametric models and (2) determining whether sharing of information in the hierarchical structure improves upon naively modeling each region separately. Second, we consider the setting of topic modeling. We are given a corpus of J

documents, with Yij denoting the ith word in document j. The words in document j,

{Yij : 1 ≤ i ≤ nj}, take values in a vocabulary V = {1, . . . , V} of V words. The corpus is envisioned as spanning a number of distinct topics, where a topic is, by definition, any distribution ψ on V. For example, our corpus of documents might be a collection of articles from the Associated Press (AP), and the topics in the corpus might heuristically

include “sports”, “weather”, “medicine”, and so forth. Documents may concern multiple topics; for example, we might have a document which reports that a tennis player became ill due to practicing in the rain, which covers the three topics mentioned. Crucially, the topics are not known a priori; rather, we attempt to infer them from the corpus.

Associated to the word Yij we imagine an underlying “topic” ψij from which the word is drawn. Here, since a topic is a distribution on V, it can be represented as a point on S_V, the V-dimensional simplex.

The HDP can be used as a topic model by regarding the atoms of G0, {φ_d}_{d=1}^∞, as an infinite collection of potential topics (i.e., points on the V-dimensional simplex). Because

the Gj’s share the same support, the documents share the same topics; however, the Gj’s place differing amounts of mass on each topic, so that documents consist of the various

topics in differing proportions. Here, f_ψ corresponds to a categorical distribution, i.e., a mass function on V. The HDP topic model may be used to facilitate information retrieval (Cowans, 2004) or to explore the structure of the corpus in an unsupervised learning setting (Teh et al., 2006).

5.1.2 Our Contributions

This chapter is primarily concerned with the development of methods for: (a) estimation of Bayes factors, including Bayes factors at the boundary values 0 and ∞; (b) estimation of the expected value of some parameter of the model as a function of (α, γ); and (c) implementation of the empirical Bayes approach for models based on the HDP. By examining the limiting cases of the hyperparameters, we derive the limiting cases for the HDP and describe how one can use the HDP to compare a fully nonparametric hierarchical model to various alternatives. We derive expressions which allow us to conduct model selection and empirical Bayes hyperparameter estimation through Markov chain Monte Carlo. We take advantage of a representation of the Chinese Restaurant Process (CRP) as an exponential family, which allows us to characterize the empirical Bayes estimate as the solution to a

moment-matching problem. To calculate Bayes factors, we follow an approach which is similar to the one taken by Doss (2012), who estimated Bayes factor surfaces by using Radon-Nikodym derivatives. Our approach is simpler in that it avoids the need to derive expressions for these Radon-Nikodym derivatives, but still obtains the results of Doss (2012) as a special case. As in Doss (2010, 2012), our computational methods are based on a combination of importance sampling and Markov chain Monte Carlo. For the Markov chain Monte Carlo part, we use the serial tempering framework (Geyer, 2011; Geyer and Thompson, 1995; Marinari and Parisi, 1992). This framework leads directly to both an algorithm for computing the empirical Bayes estimator (α̂, γ̂) = arg max_{α,γ} m_{α,γ}(Y) and an algorithm for conducting inference with the empirical Bayes estimator plugged in, using only the output of the serial tempering algorithm. We conclude with applications demonstrating the methodology, and offer a comparison of our results to results obtained under default informative priors used in the literature. We show that informative priors may clash with information contained in the marginal likelihood. We also use the methodology to both justify the use of a nonparametric prior and to illustrate the benefit of hierarchical sharing of information in an analysis of the Hospital Compare data. In a topic modeling setting, we use our methodology to demonstrate that certain aspects of topic models may not be optimally captured if the hyperparameters are selected by likelihood criteria.

5.2 Theoretical Development

Bayesian model selection proceeds by comparing the marginal likelihoods of the data

under different models through the Bayes factor BF_{α,γ}(α₀, γ₀) = m_{α₀,γ₀}(Y)/m_{α,γ}(Y). Large values of the Bayes factor represent evidence in favor of (α₀, γ₀) relative to (α, γ). In

Section 5.2.1 we derive expressions for the marginal likelihood m_{α,γ}(Y) and characterize the empirical Bayes estimator. In Section 5.2.2 we extend these calculations to limiting values

of (α, γ). These expressions are intractable, but are useful for constructing inference algorithms.

5.2.1 Marginal Likelihoods

Let H be a distribution on a complete, separable metric space X endowed with its Borel σ-field B. The stick-breaking construction of the Dirichlet process given by Sethuraman (1994) states that if F ∼ D(αH), then F may be represented as F = Σ_{k=1}^∞ w_k δ_{φ_k}, where the φ_k’s are independent draws from H, and the w_k’s are random with a distribution which depends only on α. In particular, F is discrete with probability one.

Therefore, if ψ_1, . . . , ψ_n ∼ F i.i.d., then the ψ_i’s form clusters, with the ψ_i’s in the same cluster being equal. Consequently, the ψ_i’s induce a partition on the set of integers {1, . . . , n}, and it is clear from the stick-breaking construction that if H is continuous then the distribution of the partition depends only on α. The resulting distribution on partitions is referred to as the Chinese Restaurant Process. If P is a partition of a set {a_1, . . . , a_n}, then |P| denotes the number of sets in P, and if b ⊂ {a_1, . . . , a_n}, then |b| denotes the cardinality of b. It is well known (see, e.g., Pitman, 2002) that the distribution on partitions induced by ψ_1, . . . , ψ_n is given by

p_α(P) = [Γ(α) α^{|P|} / Γ(α + n)] ∏_{b∈P} Γ(|b|)
       = [∏_{b∈P} Γ(|b|)] exp{ |P| a − [log Γ(e^a + n) − log Γ(e^a)] },    (5–2)

where a = log(α). We note that this is an exponential family, with sufficient statistic |P| and canonical parameter a. From standard theory of exponential families, this provides one means of showing that the expected number of clusters is

E_α|P| = (d/da) [log Γ(e^a + n) − log Γ(e^a)] = Σ_{i=1}^n α/(α + i − 1).

A method for directly sampling ψ_i ∼ F i.i.d., for some F ∼ D(αH), based on the CRP(α) distribution is described in Algorithm 7.

Algorithm 7 Generating ψ_i ∼ F i.i.d., i = 1, . . . , n, where F ∼ D(αH)

1. Draw a partition P ∼ CRP(α) of {1, . . . , n}.
2. For each b ∈ P, draw φ_b ∼ H i.i.d.
3. For i = 1, . . . , n, set ψ_i = φ_b if i ∈ b.
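Algorithm 7 is easy to sketch in code. The following Python fragment (function names are ours, not from the dissertation) draws the CRP(α) partition by the standard sequential seating rule, which induces the same distribution on partitions, and includes the closed-form expected number of clusters for comparison.

```python
import random

def sample_crp_partition(n, alpha, rng=random):
    """Draw a partition of {0, ..., n-1} from CRP(alpha) by sequential seating."""
    blocks = []  # list of lists; each inner list is one block of the partition
    for i in range(n):
        # customer i joins block b with prob |b|/(alpha+i), a new block with prob alpha/(alpha+i)
        r = rng.uniform(0, alpha + i)
        cum = 0.0
        for b in blocks:
            cum += len(b)
            if r < cum:
                b.append(i)
                break
        else:
            blocks.append([i])
    return blocks

def expected_num_clusters(n, alpha):
    """E|P| = sum_{i=1}^n alpha / (alpha + i - 1)."""
    return sum(alpha / (alpha + i - 1) for i in range(1, n + 1))

def sample_psi(n, alpha, base_draw, rng=random):
    """Algorithm 7: draw psi_1, ..., psi_n ~iid F with F ~ D(alpha * H)."""
    partition = sample_crp_partition(n, alpha, rng)
    psi = [None] * n
    for block in partition:
        phi = base_draw()          # one draw from H per block
        for i in block:
            psi[i] = phi
    return psi
```

For instance, with H = N(0, 1) one would pass `base_draw=lambda: rng.gauss(0, 1)`; the number of distinct values among the ψ_i’s then matches the number of blocks in the CRP partition.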

Exploiting this connection, Teh et al. (2006) derived an algorithm for generating data from the HDP, which they termed the Chinese Restaurant Franchise (CRF). This is

accomplished by simultaneously marginalizing out the group-level distributions Gj and the

base-distribution G0. In place of the Gj’s, one instead works with group-level partitions

Tj; similarly, instead of G0, one works with a partition D of the set whose elements are the equivalence classes of the group-level partitions, i.e., D is a partition of T := ∪_j Tj. This is summarized in Algorithm 8.

Algorithm 8 Generating {ψij : 1 ≤ i ≤ nj, 1 ≤ j ≤ J} from the HDP.

1. For j = 1, . . . , J, draw a partition Tj ∼ CRP(α) of {(1, j), . . . , (nj, j)}, independently across j.
2. Draw a partition D ∼ CRP(γ) of T = ∪_{j=1}^J Tj.
3. Draw φ_d ∼ H_ω i.i.d. for each d ∈ D.
4. For i = 1, . . . , nj and j = 1, . . . , J, set ψij = φ_d if (i, j) ∈ t for some t ∈ d.
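A compact sketch of Algorithm 8 follows (names are ours). Group-level tables are pooled into one list, a CRP(γ) partition groups the tables into dishes, and each dish receives a single draw from the base measure.

```python
import random

def crp(n, conc, rng):
    """Partition of range(n) via sequential seating (CRP with concentration conc)."""
    blocks = []
    for i in range(n):
        r = rng.uniform(0, conc + i)
        cum = 0.0
        for b in blocks:
            cum += len(b)
            if r < cum:
                b.append(i)
                break
        else:
            blocks.append([i])
    return blocks

def sample_hdp(n_j, alpha, gamma, base_draw, rng=random):
    """Algorithm 8 (Chinese Restaurant Franchise): return psi[j][i] for each group j."""
    # Step 1: group-level partitions T_j of the pairs (i, j)
    tables = []           # pooled list of tables; each table is a list of (i, j) pairs
    for j, n in enumerate(n_j):
        for block in crp(n, alpha, rng):
            tables.append([(i, j) for i in block])
    # Step 2: partition D of the tables, drawn from CRP(gamma)
    D = crp(len(tables), gamma, rng)
    # Steps 3-4: one draw from the base measure per d in D, shared by its tables
    psi = [[None] * n for n in n_j]
    for d in D:
        phi = base_draw()
        for t_idx in d:
            for (i, j) in tables[t_idx]:
                psi[j][i] = phi
    return psi
```

Because each dish d may contain tables from several groups, the same atom can appear in more than one group, which is exactly the sharing of support across the Gj’s described above.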

Throughout, let T, D, Tj, and φ_d be defined as in Algorithm 8, and let T = |T|, Tj = |Tj|, and D = |D|. We also let a = log α and g = log γ. In the following theorem, we utilize the CRF representation of the HDP to derive several expressions for conditional and marginal probabilities. First, we give the probability of a particular realization from the CRF(α, γ) distribution, and express this in terms of an exponential-family-like structure. Using basic theory of exponential families, this allows us to express the empirical Bayes estimator as a solution to a moment-matching problem. We also give an expression for the marginal likelihood of Y. This expression shows how changes in

beliefs about (α, γ) affect the posterior distribution of interest; the expression is used to enable computations in Section 5.3.

Theorem 5.1. For the HDP model given in (5–1), the following holds:

(i) The probability assigned to any given realization of the CRF(α, γ) distribution is

p_{α,γ}(T, D) = { ∏_{j=1}^J [Γ(α) α^{Tj} / Γ(α + nj)] ∏_{t∈Tj} Γ(|t|) } × [Γ(γ) γ^D / Γ(γ + T)] ∏_{d∈D} Γ(|d|)
             = h(T, D) exp{ Ta − Σ_{j=1}^J Σ_{i=1}^{nj} log(e^a + i − 1) + Dg − Σ_{i=1}^T log(e^g + i − 1) },    (5–3)

where h(T, D) is a function that does not depend on the parameters (α, γ).

(ii) The marginal likelihood of Y is

m_{α,γ}(Y) = Σ_{T,D} ∫ p_{α,γ}(T, D) λ(Y; T, D, ω) ν(ω) dω,    (5–4)

where

λ(Y; T, D, ω) = ∏_{d∈D} ∫ ∏_{(i,j)∈d} f_φ(Yij) H_ω(dφ),

and the sum in (5–4) extends over all permissible values of (T, D). Here, (i, j) ∈ d means (i, j) ∈ t where t ∈ d.

(iii) Let πh(θ) denote the prior density of θ = (T , D, φ, ω) with respect to a dominating measure µ, where h = (α, γ). Then

π_h(θ) / π_{h⋆}(θ) = (γ/γ⋆)^D (α/α⋆)^T { ∏_{j=1}^J ∏_{i=0}^{nj−1} (α⋆ + i)/(α + i) } { ∏_{j=0}^{T−1} (γ⋆ + j)/(γ + j) }.    (5–5)

The dominating measure µ in (iii), and throughout this chapter, is the product of counting measure on the support of (T, D) and typically Lebesgue measure on R^{|D|}.

Proof. Statement (i) follows directly from the first two steps of Algorithm 8 and the formula for the probability mass function of the CRP(α) distribution given by (5–2). To prove statement (ii), note that, conditionally on (T, D, φ, ω), the density of Y is given by ∏_{d∈D} ∏_{(i,j)∈d} f_{φ_d}(Yij). Marginalizing over φ, the function λ(Y; T, D, ω) corresponds

to the density of Y when conditioned on (T, D, ω). Equation (5–4) then arises from marginalizing over (T, D, ω). To prove statement (iii), reasoning as above, we have

π_h(θ) = p_{α,γ}(T, D) ∏_{d∈D} h_ω(φ_d).

Taking the ratio and cancelling common terms in the numerator and denominator proves statement (iii).

The following result is analogous to a result due to Liu (1996) for the Dirichlet process mixture. Liu’s result is obtained by direct calculation, while ours follows from the exponential family representation of the CRP. A proof is given in Appendix C.

Theorem 5.2. Let E_h[·] and E_h[· | Y] denote the prior and posterior expectation operators under the model indexed by h. The empirical Bayes estimator of (α, γ) satisfies the normal equations

E_h[T | Y] =(set) E_h(T) = Σ_{j=1}^J Σ_{i=1}^{nj} α/(α + i − 1),
E_h[D | Y] =(set) E_h(E_h[D | T] | Y) = E_h[ Σ_{i=1}^T γ/(γ + i − 1) | Y ].

The second expression is more complicated than the first due to the fact that T is not observed directly. Of course, these equations are not analytically tractable. The posterior

expectations of the form E_h[· | Y] cannot be obtained analytically, and this problem is compounded by the fact that we need to deal with the family {E_h[· | Y], h ∈ R_+^2}. Techniques for solving these equations are given in Section 5.3.2.

5.2.2 Limiting Cases of the Hierarchical Dirichlet Process

We obtain the Bayes factor of a hierarchical and fully nonparametric model against parametric and semiparametric alternatives by considering limiting values of the hyperparameters (α, γ). Consider F ∼ D(αH); a fundamental result is that (1) as α → ∞, the law of F converges in distribution to a point mass at H and (2) as α → 0, the

law of F converges to a point-mass distribution δ_φ where φ ∼ H. There are four possible asymptotic regimes for (α, γ):

1. Given G0, as α → 0, the laws of the Gj’s converge to independent point-mass distributions at δ_{φ_j} where φ_j ∼ G0 i.i.d.

2. Given G0, as α → ∞, the laws of the Gj’s converge to point-mass distributions at G0.

3. Given ω, as γ → 0, the law of G0 converges to a point-mass distribution at δ_ψ where ψ ∼ H_ω.

4. Given ω, as γ → ∞, the law of G0 converges to a point-mass distribution at H_ω.

Graphical representations of these possibilities are given in Figure 5-4. Combined with the possibility of having non-boundary values of α and γ, this gives rise to nine distinct possibilities. Each possibility corresponds to a model with a simpler structure than the full HDP, although we note that all models corresponding to γ → 0 are the same due to the fact that this forces G0 and all of the Gj’s to be degenerate. The different possibilities are given in Figure 5-5. Some interesting possibilities include the following:

(a) Setting α = ∞ strips the hierarchy out of the model, and stipulates that the groups may be treated as identical. It is therefore of interest to determine whether or not we can set α = ∞.

(b) Setting α = 0 with γ ∈ (0, ∞) corresponds to using a semiparametric random effects model, with the distribution of the random effects given a D(γH_ω) prior.

(c) Setting γ = ∞ with α ∈ (0, ∞) corresponds to treating the groups as independent Dirichlet process mixtures given ω. It may then be of interest to determine whether the sharing of information across groups induced by the HDP gives an advantage over a simpler model in which information is not shared across groups.

We demonstrate formally the result for γ = ∞. The following simple lemma, which essentially states that the CRP converges to a point mass at P = {{1, 2, . . . , n}} and at P = {{1}, {2}, . . . , {n}} as α → 0 and α → ∞, respectively, is proved in Appendix C.

Lemma 5.1. Fix n > 0 and 1 ≤ x ≤ n.


Figure 5-4. Graphical models depicting the models corresponding to values of α = 0 (top left), α = ∞ (top right), γ = 0 (bottom left), and γ = ∞ (bottom right).

1. As α → ∞, α^x Γ(α)/Γ(α + n) → I(x = n).

2. As α → 0, α^x Γ(α)/Γ(α + n) → I(x = 1)/Γ(n).

Using Lemma 5.1, as γ → ∞ we have

lim_{γ→∞} m_{α,γ}(Y) = lim_{γ→∞} Σ_{T,D} ∫ p_{α,γ}(T, D) λ(Y; T, D, ω) ν(ω) dω
= Σ_T ∫ { ∏_{j=1}^J [Γ(α) α^{Tj} / Γ(α + nj)] ∏_{t∈Tj} Γ(|t|) } ∏_{t∈T} [ ∫ ∏_{(i,j)∈t} f_φ(Yij) H_ω(dφ) ] ν(ω) dω
= ∫ ∏_{j=1}^J { Σ_{Tj} [Γ(α) α^{Tj} / Γ(α + nj)] ∏_{t∈Tj} Γ(|t|) ∏_{t∈Tj} ∫ ∏_{(i,j)∈t} f_φ(Yij) H_ω(dφ) } ν(ω) dω.    (5–6)

α → 0:
  γ → 0: Parametric model: Yij ∼ f_ψ(·) i.i.d., where ψ ∼ H_ω is a global parameter.
  γ ∈ (0, ∞): Semiparametric random effects model: Yij ∼ f_{ψj}(·) independently, with ψj ∼ G i.i.d. and G ∼ D(γH_ω).
  γ → ∞: Parametric random effects model: Yij ∼ f_{ψj}(·) independently, with ψj ∼ H_ω i.i.d.

α ∈ (0, ∞):
  γ → 0: Parametric model: Yij ∼ f_ψ(·) i.i.d., where ψ ∼ H_ω is a global parameter.
  γ ∈ (0, ∞): Hierarchical Dirichlet process.
  γ → ∞: Independent Dirichlet process mixture models (given ω): Yij ∼ f_{ψij}(·) independently, with ψij ∼ Gj i.i.d. and Gj ∼ D(αH_ω) i.i.d.

α → ∞:
  γ → 0: Parametric model: Yij ∼ f_ψ(·) i.i.d., where ψ ∼ H_ω is a global parameter.
  γ ∈ (0, ∞): Nonparametric model: Yij ∼ f_{ψij}(·) independently, where ψij ∼ G i.i.d. and G ∼ D(γH_ω).
  γ → ∞: Parametric model, with mixture: Yij ∼ f_{ψij}(·) independently, where ψij ∼ H_ω i.i.d.

Figure 5-5. Models obtained by letting α or γ tend to 0 or ∞.

The second line follows from Proposition 5.1 and Lemma 5.1. The third line follows from the fact that D = T if and only if each t ∈ T is assigned to a unique d ∈ D where d = {t}; this ensures that λ(Y; T, D, ω) factors across groups and that Γ(|d|) = 1. Inspection of the final expression shows that this is the marginal likelihood when the data are generated under J conditionally-independent Dirichlet process mixtures with shared base measure H_ω and precision α.

5.3 Estimation of Bayes Factor Surfaces

Recalling Algorithm 8, computation may proceed by regarding (T, D, φ) as latent variables. To lighten the notation, we will let θ = (T, D, φ, ω). Recall that h = (α, γ).

We will write π_h(· | Y) for the posterior density of θ, π_h(·) for the prior density of θ, and f_θ(Y) for the density of Y given θ. The marginal likelihood of the data under the prior

π_h(·) is m_h(Y) = ∫ f_θ(Y) π_h(θ) µ(dθ), where µ is an appropriate dominating measure. Let h⋆ be a fixed hyperparameter value. Suppose that θ^(1), θ^(2), . . . is an ergodic

Markov chain with invariant density π_{h⋆}(· | Y). In principle, one can estimate the Bayes factor m_h(Y)/m_{h⋆}(Y) by noting that

(1/M) Σ_{m=1}^M π_h(θ^(m)) / π_{h⋆}(θ^(m)) → (a.s.) ∫ [π_h(θ)/π_{h⋆}(θ)] × [π_{h⋆}(θ) f_θ(Y) / m_{h⋆}(Y)] µ(dθ) = m_h(Y)/m_{h⋆}(Y).    (5–7)

Thus, we can consistently estimate the Bayes factor BF_{h⋆}(h) of h relative to h⋆, for

any h, from a single run of a Markov chain with invariant distribution π_{h⋆}(· | Y). Teh et al. (2006) introduced Gibbs samplers operating on the state space of θ with invariant

distribution π_h(· | Y), which can be used to compute such estimates. Unfortunately, the estimate on the left side of (5–7) suffers a serious defect: unless

h is close to h⋆, π_h(· | Y) may be nearly singular with respect to π_{h⋆}(· | Y) over the region where θ is likely to be, resulting in a very unstable estimate. In other words, there is effectively a “radius” around h⋆ within which one can safely move. To state the problem

more explicitly: there does not exist a single h⋆ for which the ratios π_h(θ)/π_{h⋆}(θ) have small variance simultaneously for all h = (α, γ) ∈ (0, ∞) × (0, ∞). Many approaches have been developed to extend the insight underlying (5–7) to allow for accurate estimation of marginal likelihoods (up to a universal constant) for a wide range of valid hyperparameter values (Buta and Doss, 2011; Doss, 2010, 2012; Gelman and Meng, 1998; Geyer and Thompson, 1995; Kong et al., 2003; Meng and Wong, 1996).
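The identity (5–7) itself is easy to verify numerically in a toy conjugate model where both sides are available in closed form. The sketch below is entirely illustrative (a Beta-Binomial model, not the HDP): it draws exact posterior samples under h⋆ and averages the prior ratio to recover the Bayes factor.

```python
import math
import random

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

# Toy model: Y ~ Binomial(N, theta) with prior theta ~ Beta(a, b), indexed by h = (a, b).
N, y = 20, 13
h_star, h = (1.0, 1.0), (3.0, 2.0)

def log_marginal(a, b):
    # log m_h(Y), dropping the binomial coefficient (it cancels in the Bayes factor)
    return log_beta(a + y, b + N - y) - log_beta(a, b)

exact_log_bf = log_marginal(*h) - log_marginal(*h_star)

# Left side of (5-7): average the prior ratio pi_h / pi_{h*} over posterior draws at h*.
rng = random.Random(1)
a0, b0 = h_star
M = 100_000
total = 0.0
for _ in range(M):
    theta = rng.betavariate(a0 + y, b0 + N - y)   # exact posterior draw under h*
    log_ratio = ((h[0] - a0) * math.log(theta)
                 + (h[1] - b0) * math.log(1.0 - theta)
                 + log_beta(a0, b0) - log_beta(*h))
    total += math.exp(log_ratio)
est_log_bf = math.log(total / M)
```

With this many draws the estimate should agree closely with the exact log Bayes factor. In the HDP the same identity holds, but the prior ratio (5–5) replaces the Beta ratio and the draws come from an MCMC sampler rather than an exact posterior, which is where the instability discussed above enters.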

One recurring theme is to use a suitably-constructed mixture Σ_{k=1}^K v_k π_{h_k}(·) in the denominator on the left side of (5–7), such that every h is “close” to at least one of

the h_k’s. Of several approaches of this sort, we take the approach of serial tempering, originally developed by Marinari and Parisi (1992) for the purpose of improving the mixing rates of certain Markov chains used in statistical mechanics. Here, we use it for the very different purpose of increasing the range of values over which importance sampling estimates have small variance. (See Geyer (2011) for a review of various applications of serial tempering.) We now briefly summarize this methodology, and show how it can be used to produce estimates that are stable over a wide range of h values.

Fix h_1, . . . , h_K ∈ H := (0, ∞) × (0, ∞) and positive constants c = (c_1, . . . , c_K). The h_k’s should be chosen to “cover” H, in the sense that for every h ∈ H, π_h is “close to” at

least one of π_{h_1}, . . . , π_{h_K}. We then run a Markov chain whose Markov transition function has invariant density proportional to Σ_{k=1}^K c_k π_{h_k}(· | Y). The updates will sample different components of the mixture, with jumps from one component to another. We now describe this carefully. Let Θ denote the state space for θ. Define a label space L = {1, . . . , K}, and for k ∈ L, suppose that Ψ_k is a Markov transition function on Θ with stationary density equal to π_{h_k}(· | Y). Serial tempering considers the state space L × Θ, and forms a family of distributions {Q_c, c ∈ R^K_+} on L × Θ with densities

q_c(k, θ) ∝ c_k f_θ(Y) π_{h_k}(θ).    (5–8)

The constants c_k > 0 are tuning parameters, which we discuss later. Let Γ(k, ·) be a Markov transition function on L. In our context, we would typically take Γ(k, ·) to be the uniform distribution on N_k, where N_k is a set consisting of the indices of the h_l’s which are close to h_k. We then construct a Markov chain on L × Θ which can be viewed as a two-block Metropolis-Hastings (i.e., Metropolis-within-Gibbs) algorithm, which is given by Algorithm 9.

Algorithm 9 Serial tempering update.

Let Ψ_k(θ, ·) and Γ(k, ·) be Markov transition functions as defined above. Given the current state (θ^(m), L^(m)):

1. Draw θ^(m+1) ∼ Ψ_{L^(m)}(θ^(m), ·).
2. Draw L⋆ ∼ Γ(L^(m), ·).
3. Draw U ∼ Uniform(0, 1). If

U ≤ [c_{L⋆} / c_{L^(m)}] × [π_{h_{L⋆}}(θ^(m+1)) / π_{h_{L^(m)}}(θ^(m+1))] × [Γ(L⋆, L^(m)) / Γ(L^(m), L⋆)],

then set L^(m+1) = L⋆; otherwise set L^(m+1) = L^(m).
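Algorithm 9 can be exercised on a toy state space. In the sketch below (ours, with f_θ(Y) ≡ 1 absorbed into the component densities) the π_{h_k} are mean-zero normals with different scales, Ψ_k is a random-walk Metropolis kernel, and Γ(k, ·) is uniform on the neighbouring labels.

```python
import math
import random

def serial_tempering(log_pis, c, n_iter, step=1.0, seed=0):
    """Toy version of Algorithm 9: theta in R, K component log-densities log_pis, weights c."""
    rng = random.Random(seed)
    K = len(log_pis)
    theta, L = 0.0, 0
    samples = []
    for _ in range(n_iter):
        # Block 1: update theta with the kernel Psi_L (random-walk Metropolis here)
        prop = theta + rng.gauss(0.0, step)
        if math.log(rng.random()) <= log_pis[L](prop) - log_pis[L](theta):
            theta = prop
        # Block 2: propose a label from Gamma(L, .) = uniform on the neighbours of L
        nbrs = [k for k in (L - 1, L + 1) if 0 <= k < K]
        L_star = rng.choice(nbrs)
        back = [k for k in (L_star - 1, L_star + 1) if 0 <= k < K]
        log_acc = (math.log(c[L_star] / c[L])
                   + log_pis[L_star](theta) - log_pis[L](theta)
                   + math.log(len(nbrs) / len(back)))   # Gamma(L*, L) / Gamma(L, L*)
        if math.log(rng.random()) <= log_acc:
            L = L_star
        samples.append((L, theta))
    return samples
```

For example, `log_pis = [lambda t, s=s: -0.5 * (t / s) ** 2 - math.log(s) for s in (1.0, 2.0, 4.0)]` gives three normal components; the chain wanders between labels while keeping (5–8) invariant.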

By standard arguments, the density (5–8) is an invariant density for the serial

tempering chain. A key observation is that the θ-marginal density of q_c(·, ·) is

q_c(θ) = (1/Z) Σ_{k=1}^K c_k f_θ(Y) π_{h_k}(θ),    where Z = Σ_{k=1}^K c_k m_{h_k}(Y).    (5–9)

Suppose that (L^(1), θ^(1)), (L^(2), θ^(2)), . . . is a serial tempering chain. To estimate m_h(Y), consider

B̂_q(h) = (1/M) Σ_{m=1}^M π_h(θ^(m)) / [ Σ_{k=1}^K c_k π_{h_k}(θ^(m)) ].    (5–10)

Note that this estimate depends only on the θ-part of the chain. Assuming that we have established that the chain is ergodic, we have

B̂_q(h) → (a.s.) ∫ [ π_h(θ) / Σ_{k=1}^K c_k π_{h_k}(θ) ] × [ Σ_{k=1}^K c_k f_θ(Y) π_{h_k}(θ) / Z ] µ(dθ)
        = ∫ [f_θ(Y) π_h(θ) / Z] µ(dθ)    (5–11)
        = m_h(Y)/Z.

Therefore, for any choice of c, {B̂_q(h), h ∈ H} can be used to estimate the family {m_h(Y), h ∈ H} up to a single multiplicative constant.
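Computing B̂_q(h) is pure post-processing on the θ-part of the chain. A log-scale sketch (function names ours) avoids underflow when the log densities are large in magnitude:

```python
import math

def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_bhat_q(log_pi_h, log_pi_ks, log_c):
    """log of the estimate (5-10).

    log_pi_h  -- length-M list, log pi_h(theta^(m))
    log_pi_ks -- M x K list of lists, log pi_{h_k}(theta^(m))
    log_c     -- length-K list of log tuning constants
    """
    K = len(log_c)
    terms = [log_pi_h[m]
             - log_sum_exp([log_c[k] + log_pi_ks[m][k] for k in range(K)])
             for m in range(len(log_pi_h))]
    return log_sum_exp(terms) - math.log(len(terms))
```

As a sanity check, with K = 1, c_1 = 1, and π_h = π_{h_1}, every term in (5–10) is 1, so the estimate is exactly 1 regardless of the samples.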

To estimate the family of posterior expectations ∫ g(θ) π_h(θ | Y) µ(dθ), h ∈ H, we proceed as follows. Let

Û_q(h) = (1/M) Σ_{m=1}^M g(θ^(m)) π_h(θ^(m)) / [ Σ_{k=1}^K c_k π_{h_k}(θ^(m)) ].    (5–12)

By ergodicity, we have

Û_q(h) → (a.s.) ∫ [ g(θ) π_h(θ) / Σ_{k=1}^K c_k π_{h_k}(θ) ] × [ Σ_{k=1}^K c_k f_θ(Y) π_{h_k}(θ) / Z ] µ(dθ)
        = ∫ [f_θ(Y) g(θ) π_h(θ) / Z] µ(dθ)    (5–13)
        = [m_h(Y)/Z] ∫ g(θ) π_h(θ | Y) µ(dθ).

Combining the convergence statements (5–13) and (5–11), we see that

Î_q(h) := Û_q(h) / B̂_q(h) → (a.s.) ∫ g(θ) π_h(θ | Y) µ(dθ).

We now discuss the choice of c. Suppose that for some constant a, we have

c_k = 1 / [a m_{h_k}(Y)],    k = 1, . . . , K.    (5–14)

Then Z = K/a, and q_c(θ) = K^{−1} Σ_{k=1}^K π_{h_k}(θ | Y), i.e., the θ-marginal of q_c(k, θ) (see (5–9)) gives equal weight to each of the component distributions in the mixture. Therefore, for long runs, the proportions of time spent in the K components of the mixture are about the same, ensuring that the chain spends enough time in all regions of H to allow for stable estimation at any particular h ∈ H.

In practice, we cannot arrange for (5–14) to be true, because m_{h_1}(Y), . . . , m_{h_K}(Y) are unknown. One approach is to iteratively tune the c_k’s by sequentially running serial tempering chains and adjusting the weights until (5–14) is approximately true; doing this is possible by (5–11). This approach is time consuming, and does not work well in our HDP setup, in which the posterior is multimodal. We have found the Stochastic Approximation Monte Carlo (SAMC) algorithm of Liang et al. (2007), which instead updates c at each iteration of the chain, to be effective. Liang et al. (2007) also provide conditions under which their approach is successful at tuning c.

Algorithm 10 SAMC algorithm for adaptive serial tempering.
Given (θ^(m−1), L^(m−1)), c = (c_1, . . . , c_K), and constants γ_m, m = 1, 2, . . . such that Σ_m γ_m = ∞ and Σ_m γ_m² < ∞:

1. Draw (θ^(m), L^(m)) according to Algorithm 9.
2. Set

log c_k ← log c_k − γ_m [I(L^(m) = k) − K^{−1}].

log ck log ck log c1, (k > 1), ← −

then set log c1 = 0.
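The weight adjustment in steps 2 and 3 of Algorithm 10 amounts to a few lines. The sketch below (our naming and sign convention) penalizes the component visited at the current iteration so that, in the long run, the occupation fractions equalize.

```python
def samc_update(log_c, L_m, gamma_m):
    """One SAMC adjustment of the serial tempering log-weights (steps 2-3 of Algorithm 10).

    log_c   -- list of K log-weights, modified in place
    L_m     -- label visited at the current iteration
    gamma_m -- gain sequence value, with sum(gamma_m) = inf and sum(gamma_m**2) < inf
    """
    K = len(log_c)
    for k in range(K):
        # visited component is pushed down, the others are nudged up
        log_c[k] -= gamma_m * ((1.0 if k == L_m else 0.0) - 1.0 / K)
    base = log_c[0]                 # renormalize so that log_c[0] = 0
    for k in range(K):
        log_c[k] -= base
```

A common choice of gain sequence is γ_m = t₀ / max(t₀, m) for some burn-in constant t₀, which satisfies the two summability conditions above.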

5.3.1 Testing Against Boundary Values

The Bayes factors of submodels relative to the model implied by qc(θ) may be estimated using the techniques of Section 5.3 by passing the hyperparameter h to the appropriate limit in (5–10); see, for example, (5–6). To handle boundary cases, we extend

the definition of pα,γ(T , D). Writing

p_{α,γ}(T, D) = a_γ(T, D) × b_α(T),

where

a_γ(D, T) = [Γ(γ) γ^D / Γ(γ + T)] × ∏_{d∈D} Γ(|d|),
b_α(T) = ∏_{j=1}^J [Γ(α) α^{Tj} / Γ(α + nj)] ∏_{t∈Tj} Γ(|t|),

we can extend p_{α,γ}(T, D) to [0, ∞] × [0, ∞] by defining

a_0(D, T) = I(D = 1),    a_∞(D, T) = I(D = T),

b_0(T) = I(T = J),    b_∞(T) = I(T = Σ_j nj).    (5–15)

The estimate (5–10) is then calculated as usual with this extension to the boundary. We note that this extends the results of Doss (2012), and recovers these results for the usual Dirichlet process mixture by taking J = 1 and γ = ∞. Consider the Dirichlet process mixture model, i.e., the model given by the first three lines of (5–1), and where G0

is fixed. In this model, the latent variables are ψ_1, . . . , ψ_n, and we have ψ_1, . . . , ψ_n ∼ G i.i.d., where G ∼ D(αG0). Let τ_α denote the (prior) distribution of ψ = (ψ_1, . . . , ψ_n). Doss (2012) obtained the Radon-Nikodym derivative (dτ_{α1}/dτ_{α2})(ψ) by direct calculation. This approach is very complicated in the setting of the hierarchical Dirichlet process and, by considering the CRF, we bypass this approach entirely. For the boundary values, the terms of (5–10) will be non-zero only for extreme values

of (T , D). This occurs with very low posterior probability unless α and γ are also quite

extreme. How extreme we need to take hk depends on the sample size. We now consider

heuristics to determine how extreme we must take α and γ for (T, D) to take extreme values. The following lemma provides guidance; see Appendix C for a proof.

Lemma 5.2. Suppose P_n ∼ CRP(α_n) on the set A = {a_1, . . . , a_n} and let 0 < ε < 1.

1. As n → ∞ and α_n → ∞, log Pr(|P_n| = n) = −n(n − 1)/(2α_n) + O(n³/α_n²); in particular, Pr(|P_n| = n) = (1 − ε) + O(n^{−1}) if α_n = −n(n − 1)/[2 log(1 − ε)].

2. As n → ∞ and α_n → 0, Pr(|P_n| = 1) = n^{−α_n}(1 + o(1)). In particular, Pr(|P_n| = 1) = (1 − ε) + o(1) if α_n = −log(1 − ε)/log n.

Summarizing, in the much simpler setting of a single Dirichlet process, we require our

αk’s to cover the region

α ∈ [ −log(1 − ε)/log n, −n(n − 1)/(2 log(1 − ε)) ] ≈ [ ε/log n, n²/(2ε) ],

for the prior probability of observing extreme values of |P| to be non-negligible. While this range seems large, we note that when phrased in terms of the canonical parameter, a = log α, the a_k’s must span only a region of size

a ∈ [log ε − log log n, 2 log n − log 2 − log ε],

which grows logarithmically in n. The variable ε corresponds to the prior probability of non-degenerate behavior of |P| at the extreme values of α. We regard ε as a tuning parameter, which should be set so that the posterior probability of |P| being extreme is non-negligible. In our applications, ε = 0.5 was used.
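These endpoints are directly computable. A small helper (ours) evaluates the covering interval for α and the width of the corresponding interval for a = log α:

```python
import math

def alpha_cover_range(n, eps=0.5):
    """Endpoints of the alpha-grid suggested by Lemma 5.2 for a single DP with n observations."""
    lo = -math.log(1.0 - eps) / math.log(n)            # keeps Pr(|P| = 1) near 1 - eps
    hi = -n * (n - 1) / (2.0 * math.log(1.0 - eps))    # keeps Pr(|P| = n) near 1 - eps
    return lo, hi

def canonical_cover_width(n, eps=0.5):
    """Width of the matching interval for the canonical parameter a = log(alpha)."""
    lo, hi = alpha_cover_range(n, eps)
    return math.log(hi) - math.log(lo)
```

For n = 100 and ε = 0.5 the α-interval is roughly [0.15, 7100], spanning almost five orders of magnitude, yet the corresponding a-interval is only about 10.8 units wide and grows logarithmically with n.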

To apply this heuristically to the HDP, we must choose h1,..., hK so that the events

∩_j [Tj = nj], ∩_j [Tj = 1], [D = T], and [D = 1] occur frequently enough when running the serial tempering. It is straightforward to control the prior probabilities of the first two events by applying Lemma 5.2 to J independent CRP(α) distributions; this leads to the same bounds with ε/J in the role of ε. To control the prior probabilities of [D = 1] and

[D = T], bounds for these prior probabilities can be obtained by assuming T = Σ_j nj.

One can then apply Lemma 5.2 with n replaced by Σ_j nj. We have not found the size of the grid one needs to construct to be the primary computational bottleneck. Rather, the primary computational difficulty is that the Markov chains used in practice require more computation per iteration as the partitions D and T become finer. Thus, for large applications (such as the topic model in Section 5.4.2), it will be infeasible to test the case α → ∞, although assessing γ → ∞ may be possible for values of α such that the posterior of T is not concentrated on prohibitively large values.

5.3.2 Empirical Bayes Estimation

The methods discussed in Section 5.3 also provide techniques for approximating the

empirical Bayes estimator of (α, γ), given by (α̂, γ̂) = arg max_{α,γ} m_{α,γ}(Y), using only the output of the serial tempering chain. We construct an EM algorithm based on Theorem

5.2 by viewing θ = (T , D, ω) as a latent variable. The E-step consists of taking the

expectation of log p_{α,γ}(T, D) + log λ(Y; T, D, ω) with respect to the posterior distribution given Y, with h fixed at the current estimate h^(t), while (noting that λ(Y; T, D, ω) does not depend on h) the M-step consists of setting

h^(t+1) ← arg max_h E_{h^(t)} [log p_{α,γ}(T, D) | Y].

As in Theorem 5.2, setting the gradient with respect to (α, γ) of the expression above equal to 0 is equivalent to solving the equations

E_{h^(t)}[T | Y] = Σ_{j=1}^J Σ_{i=1}^{nj} α^(t+1) / (α^(t+1) + i − 1),
E_{h^(t)}[D | Y] = E_{h^(t)}[ Σ_{i=1}^T γ^(t+1) / (γ^(t+1) + i − 1) | Y ].

This amounts to iteratively solving the moment-matching equations given in Theorem 5.2. We summarize by giving an “ideal” EM-algorithm for estimating the hyperparameters.

Algorithm 11 Idealized EM-algorithm for estimating (α, γ).
Given the current estimate h^(t) = (α^(t), γ^(t)):

1. Choose α^(t+1) so that

E_{h^(t)}[T | Y] = Σ_{j=1}^J Σ_{i=1}^{nj} α^(t+1) / (α^(t+1) + i − 1).

2. Choose γ^(t+1) so that

E_{h^(t)}[D | Y] = E_{h^(t)}[ Σ_{i=1}^T γ^(t+1) / (γ^(t+1) + i − 1) | Y ].
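Each step of Algorithm 11 is a one-dimensional root-finding problem, since Σ α/(α + i − 1) is strictly increasing in α. A bisection sketch for step 1 follows (names ours; step 2 is identical with the posterior-expected sum in place of the prior one):

```python
import math

def expected_tables(alpha, n_j):
    """E_alpha(T) = sum_j sum_{i=1}^{n_j} alpha / (alpha + i - 1)."""
    return sum(alpha / (alpha + i - 1) for n in n_j for i in range(1, n + 1))

def solve_concentration(target, n_j, lo=1e-8, hi=1e8, iters=200):
    """Find alpha with expected_tables(alpha, n_j) = target by geometric bisection."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if expected_tables(mid, n_j) < target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)
```

The target must lie strictly between J (the α → 0 limit, one table per group) and Σ_j nj (the α → ∞ limit, every observation at its own table) for a root to exist.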

Because the posterior is intractable, we cannot directly solve these estimating equations. Instead, we solve Monte Carlo approximations of these equations. Assuming ergodicity, we have

Σ_{m=1}^M ρ_h^(m) g(θ^(m)) → E_h[g(θ) | Y],

for any integrable function g(θ), where we define the ρ_h^(m)’s to be the importance sampling weights

ρ_h^(m) = [ π_h(θ^(m)) / Σ_k c_k π_{h_k}(θ^(m)) ] / [ Σ_{j=1}^M π_h(θ^(j)) / Σ_k c_k π_{h_k}(θ^(j)) ].

A Monte Carlo EM algorithm (Booth and Hobert, 1999) may then be constructed by replacing the expectations in Algorithm 11 by their expectations under Ê_h, where Ê_h(g(θ)) is defined by the mapping

g(θ) ↦ Ê_h g(θ) = Σ_{m=1}^M ρ_h^(m) g(θ^(m)).

Similar strategies have been proposed. For example, Casella (2001) proposed a general-purpose EM algorithm, which requires running multiple Markov chains to approximate the E-step. Atchadé (2011) proposed a stochastic approximation algorithm for empirical Bayes estimation. Our strategy differs from the approach of Casella (2001) in that our approach requires only one Markov chain. Our approach requires having already run the serial tempering chain, and so may be done in post-processing; it is convenient for us only because we have already committed to implementing serial tempering. The approaches of Casella (2001) and Atchadé (2011) may be more appropriate if one is only interested in the empirical Bayes estimators rather than the entire Bayes factor surface, as they do not require an expensive serial tempering chain to be run.
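The weights ρ_h^(m) reuse the same per-sample mixture denominators that appear in (5–10); a self-normalized sketch (ours), again on the log scale for stability:

```python
import math

def rho_weights(log_pi_h, log_pi_ks, log_c):
    """Self-normalized weights rho_h^(m); Ehat_h[g] = sum_m rho[m] * g(theta^(m))."""
    K = len(log_c)
    logs = []
    for m in range(len(log_pi_h)):
        terms = [log_c[k] + log_pi_ks[m][k] for k in range(K)]
        mx = max(terms)
        denom = mx + math.log(sum(math.exp(t - mx) for t in terms))
        logs.append(log_pi_h[m] - denom)
    mx = max(logs)
    w = [math.exp(l - mx) for l in logs]
    s = sum(w)
    return [x / s for x in w]
```

When h equals one of the h_k’s with K = 1, every per-sample ratio is constant and the weights reduce to the uniform weights 1/M, as expected for a self-normalized estimator.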

5.4 Illustrations

Here, we apply our methodology to two datasets, one real and the other simulated. In Section 5.4.1 we consider part of the Hospital Compare dataset. In Section 5.4.2 we deal with topic modeling and consider an artificially-constructed corpus of documents. The reason for using this simulated data is that, for such data, all parameters in the model are known, and therefore we can compare the performance of the HDP model that uses the empirical Bayes estimate of the hyperparameter h with the performance of HDP models that use other values of h. 5.4.1 Quality of Hospital Care Data

Hospital Compare is a publicly available dataset which records various measures of hospital quality for over 4000 hospitals across the United States and associated territories. The dataset is used to help improve hospitals' quality of care by distributing objective, easy-to-understand data on hospital performance from a consumer perspective. For concreteness, we focus on the number of times a hospital prescribed a cholesterol-reducing drug to patients who were treated for a heart attack at that hospital, with the expectation that high-quality hospitals will do this consistently.

We posit an HDP model with base distribution $H_\omega$, where $H_\omega$ is a $\text{Beta}(\omega_1, \omega_2)$ distribution, $\psi_{ij} \in (0, 1)$ is a success probability, and $Y_{ij} \sim \text{Binomial}(N_{ij}, \psi_{ij})$. For simplicity, we assume that the number of patients $N_{ij}$ at each hospital is fixed by design, although one can in principle model $N_{ij}$ as well. We use the HDP model to address the following questions.

1. Does the distribution of $Y_{ij}$ vary across region, or do all hospitals have roughly the same quality-of-care distribution ($\alpha \to \infty$ or $\gamma \to 0$)?

2. Is the quality-of-care distribution within region described well by a parametric model ($\alpha \to 0$ or $\gamma \to 0$)?

3. Does it suffice to model all territories separately, or are there recurring clusters of hospitals across territories ($\gamma \to \infty$)?
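A toy forward simulation of this observation model, using a truncated stick-breaking representation of the HDP, is sketched below. The Beta base-measure parameters, concentration values, and truncation level are illustrative assumptions, and the Dirichlet re-weighting is a standard finite-dimensional approximation to $G_j$ given $G_0$:

```python
import numpy as np

rng = np.random.default_rng(0)
omega1, omega2 = 2.0, 2.0      # base measure Beta(omega1, omega2); values are illustrative
alpha, gamma = 10.0, 5.0       # HDP concentration parameters (illustrative)
K = 50                         # truncation level for the stick-breaking sketch

def stick_break(conc, size, rng):
    v = rng.beta(1, conc, size)
    w = v * np.concatenate(([1.0], np.cumprod(1 - v[:-1])))
    return w / w.sum()         # renormalize the truncated sticks

# global measure G0: weights over K shared atoms drawn from the Beta base
atoms = rng.beta(omega1, omega2, size=K)
w0 = stick_break(gamma, K, rng)

def territory_rates(n_hospitals, rng):
    # each territory re-weights the shared atoms; Dirichlet(alpha * w0)
    # is a finite-dimensional approximation to G_j | G_0
    wj = rng.dirichlet(alpha * w0)
    idx = rng.choice(K, size=n_hospitals, p=wj)
    return atoms[idx]                        # success probabilities psi_ij

psi = territory_rates(100, rng)
N = np.full(100, 200)                        # patients treated, fixed by design
Y = rng.binomial(N, psi)                     # prescriptions of the drug
```

Because the atoms are shared across territories, hospitals in different territories can fall into the same cluster, which is exactly the structure question 3 asks about.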

A default prior for h, which is often used in practice, takes α and γ to be independent gamma random variables with mean and variance 1; this is the default, for example, in the package hdp-sm, developed to support the work of Wang and Blei (2012). In a setting very similar to ours, Rodriguez et al. (2008) chose a gamma prior with mean 1 and variance 1/3, while Teh et al. (2006) used a mix of gamma priors, both vague and informative, without providing insight into their choice. For the purposes of comparison, we begin by conducting inference under gamma priors for α and γ with mean 1 and variance 1. Estimates of the joint and marginal posterior densities are given in Figure 5-6; these were obtained by running a Markov chain based on the Chinese restaurant franchise (Teh et al., 2006), with the parameters (α, γ, ω) sampled via Metropolis-within-Gibbs slice-sampling updates (Neal, 2003).

The analysis under the default prior yields posterior means of $\hat\alpha_{\text{Bayes}} = 13.8$ and $\hat\gamma_{\text{Bayes}} = 4.1$, with most of the posterior mass concentrated near $(\hat\alpha_{\text{Bayes}}, \hat\gamma_{\text{Bayes}})$. On the basis of this posterior, we considered a grid of size 180 extending from $\alpha = 5$ to $\alpha = 100$ and from $\gamma = 0.5$ to $\gamma = 30$, with entries evenly spaced on the log scale. Figure 5-7 gives estimates of the Bayes factors on the log scale, relative to the mixture distribution (5–9); the Bayes factors were estimated using (5–10). The plot is less smooth than one might expect, due to variability of the Monte Carlo estimate. The EM algorithm described in Section 5.3.2 gives empirical Bayes estimates of $\hat\alpha_{\text{EB}} = 27.8$ and $\hat\gamma_{\text{EB}} = 8.9$. It is interesting to note that the posterior obtained under the default prior is concentrated in a region of relatively low marginal likelihood: the marginal likelihood of the model indexed by the Bayes estimator $(\hat\alpha_{\text{Bayes}}, \hat\gamma_{\text{Bayes}})$ is small compared to that of the model indexed by the empirical Bayes estimator $(\hat\alpha_{\text{EB}}, \hat\gamma_{\text{EB}})$. The source of the problem appears to be that the default prior pulls the posterior too far towards 0. Thus, the likelihood and prior appear to be


Figure 5-6. Top: Estimate of the joint density based on a sample of 5000 draws from a Markov chain targeting the posterior distribution of (α, γ) under a Gamma prior with mean 1 and variance 1. White dots represent draws from the Gibbs sampler after burn-in. Bottom: Estimate of the marginal densities of (α, γ) based on the same Markov chain.


Figure 5-7. Plot of the logarithm of the Bayes factor of the pair (α, γ) relative to the distribution obtained by the mixture $q_c(\theta)$ given in (5–9).

in conflict. With hindsight, a more appropriate prior may have been an exponential distribution with a large mean. The Bayes factor surface argues in favor of the HDP in this setting: the marginal likelihood associated with good choices of α is much higher than the marginal likelihood associated with either boundary value, justifying the use of our nonparametric model. To assess the benefit of the HDP over independent Dirichlet process mixtures, we ran another chain extending log γ to 13 and holding α fixed at $\hat\alpha_{\text{EB}}$. The log Bayes factor of the model without sharing ($\gamma = \infty$) relative to the model at the empirical Bayes estimate was estimated to be $-56$, indicating a strong preference for sharing information across states.

5.4.2 Topic Modeling

We use the proposed methodology as a tool to determine the impact of hyperparameter choice on the output of topic models as described in Section 5.1.1. Because of the unsupervised nature of topic modeling, we consider a simulation to assess how well the HDP captures the “true” structure of data simulated from the latent Dirichlet allocation (LDA) model (Blei et al., 2003). LDA requires specification of the number of topics a priori; the HDP is often thought of as a nonparametric alternative which does not require input of the number of topics.

We simulated $J = 100$ documents of $n_j = 100$ words each from the LDA model with $D = 12$ distinct topics $\{\phi_d\}_{d=1}^{12}$, which are depicted graphically in Figure 5-8. The vocabulary was taken to be $\mathcal{V} = \{1, 2, \ldots, 25\}$, i.e., $V = 25$. The corpus was generated in the following steps:

1. Set $m = 3 \times v$, where $v \in S_{12}$ is given a $\mathcal{D}(1, \ldots, 1)$ distribution.

2. For each document $j$, draw a distribution on topics $\rho_j \stackrel{\text{iid}}{\sim} \mathcal{D}(m)$.

3. For each word $Y_{ij}$ in document $j$, independently draw a topic $C_{ij} \sim \text{Categorical}(\rho_j)$ and draw $Y_{ij} \sim \text{Categorical}(\phi_{C_{ij}})$.

See Griffiths and Steyvers (2004) for a similar simulation model. The role of $m$ is to allow some topics to be inherently more common throughout the corpus; the entries of $m$ correspond to (unnormalized) frequencies of topics in the corpus, with the normalization constant controlling how much topic variability there is within a particular document. We use the HDP to form an infinite topic model. Recalling the hierarchy given in (5–1), we take the base distribution $H_\omega$ to be a $\mathcal{D}(\omega, \ldots, \omega)$ distribution on $S_V$. The distribution $G_0 = \sum_{d=1}^{\infty} w_d \delta_{\phi_d}$ is regarded as a distribution on a countable collection of topics $\{\phi_d\}$. The $j$th document has associated to it a unique distribution $G_j$ on the topics, with a word being generated by first drawing a topic $\psi_{ij} \sim G_j$ and then drawing $Y_{ij} \sim \text{Categorical}(\psi_{ij})$.
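The corpus-generation steps can be sketched as follows; the topics phi here are stand-in Dirichlet draws rather than the structured 5 × 5 patterns of Figure 5-8:

```python
import numpy as np

rng = np.random.default_rng(0)
J, n_j, D, V = 100, 100, 12, 25

# Stand-in topics: random draws rather than the structured patterns of Figure 5-8
phi = rng.dirichlet(np.ones(V), size=D)

# Step 1: m = 3 * v with v ~ Dirichlet(1, ..., 1) on the D-simplex
m = 3 * rng.dirichlet(np.ones(D))

docs = []
for j in range(J):
    rho = rng.dirichlet(m)                         # step 2: per-document topic weights
    C = rng.choice(D, size=n_j, p=rho)             # step 3: a topic for each word
    Y = np.array([rng.choice(V, p=phi[c]) for c in C])
    docs.append(Y)
```

The small total mass of m (here 3) makes each document concentrate on a few topics, which is the within-document variability the text describes.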


Figure 5-8. True topics used in the simulation experiment. Each word is represented as a box on the 5 × 5 grid to ease visualization of each topic, with the probability of each word in a given topic given by color.

As a preliminary analysis, we fit the HDP model to the data generated from the LDA model given above, with α and γ each given diffuse exponential priors with mean 100. The marginal posteriors of α and γ are given in Figure 5-9. The data appear to give strong information about the value of α, with the posterior concentrated in the region (2.5, 3.5). The analysis based on the marginal likelihood and empirical Bayes estimation largely agrees with the analysis based on the diffuse prior, with empirical Bayes estimates $\hat\alpha = 2.8$ and $\hat\gamma = 2.7$. Given the agreement between the empirical Bayes method


Figure 5-9. Histogram of samples from a Markov chain targeting the posterior distribution of α and γ, with the parameters given an exponential prior with mean 100.

and a full Bayesian analysis, one might expect these values of the hyperparameters to be optimal for recovering the object of actual interest, namely the true topics underlying the documents. To determine if this was the case, we applied our methodology to calculate two quantities as the hyperparameters varied:

(a) An MCMC estimate of the posterior mean of the most prevalent topic $\phi_\star$, defined to be the topic $\phi_d$ associated with the largest proportion of words in the corpus.

(b) An MCMC estimate of the expected $L_1$ error in estimating the true most prevalent topic $\phi_{\star,0}$ using the most prevalent topic drawn from the posterior,

$$L_1^{(\alpha,\gamma)} = E\left\{ \sum_{v=1}^{V} |\phi_{\star v} - \phi_{\star 0, v}| \,\Big|\, Y \right\}.$$

Recalling that the state space of our Markov chain is $\theta = (T, D, \phi)$, one can draw $\phi_\star$ by counting the number of words associated with each topic at a particular iteration. The serial tempering chain was run on a grid of size 100, with α and γ evenly spaced between 1 and 10 on the log scale.
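A sketch of this computation from retained MCMC output is below; the array shapes and the function name are our assumptions, with the most prevalent topic identified per iteration by its word count:

```python
import numpy as np

def l1_error(phi_draws, word_counts, phi_true):
    """Monte Carlo estimate of E{ sum_v |phi*_v - phi*_{0v}| | Y }.

    phi_draws  : (T, D, V) topic-word distributions at each retained iteration
    word_counts: (T, D)    number of words assigned to each topic at each iteration
    phi_true   : (V,)      the true most prevalent topic phi_{*,0}
    """
    star = word_counts.argmax(axis=1)                      # most prevalent topic per draw
    phi_star = phi_draws[np.arange(phi_draws.shape[0]), star]
    return np.abs(phi_star - phi_true).sum(axis=1).mean()  # average L1 distance
```

Averaging over iterations gives the MCMC estimate of the expectation; combining these averages with the serial tempering weights extends the estimate across the (α, γ) grid.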


Figure 5-10. The average $L_1$ distance between Topic 1 and the most prevalent topic from the HDP, as a function of the hyperparameters, where the $L_1$ distance between two probability vectors $\phi_1$ and $\phi_2$ is given by $\sum_{v=1}^{V} |\phi_{1v} - \phi_{2v}|$.

The value of $L_1^{(\alpha,\gamma)}$ as the parameters (α, γ) are varied is given in Figure 5-10. It is apparent that α has a large impact on the $L_1$ error, with larger values of α corresponding to smaller errors. This is especially interesting because the "optimal" value of α by this criterion appears to be much larger than the values of α deemed most likely by the marginal likelihood.

The analysis of the posterior mean of the most prevalent topic ($\phi_\star$) gives insight into the results of the previous paragraph. The posterior mean of $\phi_\star$ is given in Figure 5-11. We can see that the HDP more-or-less correctly determines that $\phi_\star$ corresponds to the true topic $\phi_1$, with roughly the correct proportions assigned to each word. Our interpretation of these results is that, in the case of topic modeling, seeking "optimal" values of α and γ is a somewhat delicate issue. Even when the true model is the (very similar) LDA model, the value of α suggested by the information contained in the marginal likelihood (either combined with a prior or analyzed directly) does not

[Figure 5-11: nine heatmap panels, α ∈ {1, 3, 7.9} crossed with γ ∈ {1, 2.8, 10}.]

Figure 5-11. Sensitivity of estimation of the most prevalent topic (Topic 1) to choice of hyperparameter. Each heatmap gives the posterior mean of this topic under the given hyperparameter values. Note: the midpoint of the color range was compressed to 0.02 to highlight errors in estimating Topic 1.

actually lead to optimal recovery of the latent topics. This may occur because we generated data according to LDA rather than simulating from the HDP. In practice, of course, one does not have access to the "truth" in order to assess such failures. This analysis suggests that it may be more profitable, given the use cases of topic modeling, to choose hyperparameters so that the extracted topics satisfy "nice" subjective properties, such as interpretability and sparseness, rather than to select the hyperparameters using likelihood-based criteria.

5.5 Discussion

In this chapter, we developed methodology which enables us to do the following. First, estimate the marginal likelihood surface of the concentration parameters α and γ of the hierarchical Dirichlet process (up to a universal constant); this includes estimation at the boundary values $\alpha \in \{0, \infty\}$ and $\gamma \in \{0, \infty\}$. Second, estimate the family of posterior expectations of functions of interest as the hyperparameters vary. Our theoretical results apply to a model that is more general than the one considered by Doss (2012), and our methods are entirely different from those of Doss (2012), whose approach does not seem to be amenable to hierarchical Dirichlet process priors. Our basic Monte Carlo scheme is serial tempering, which involves running a single Markov chain. We illustrated the methodology in two applications. First, we applied it in the context of multi-level observational data, where it was used to assess the benefit obtained by hierarchical Bayesian nonparametric methods, both in terms of the nonparametric modeling itself and of the efficiency gained by sharing information across levels. Next, we used the methodology to assess the importance of the choice of hyperparameters in topic modeling. Here, we concluded that the model was ambiguous with respect to which hyperparameter values one should choose, and that accurate estimation of topic proportions may be lost when one uses "optimal" values of the hyperparameters.

CHAPTER 6
DISCUSSION AND FUTURE WORK

In this dissertation, we have made contributions to two distinct areas of statistics, using Bayesian nonparametric tools. In Chapter 3 we introduced a framework for flexible Bayesian inference about causal effects; we focused on longitudinal studies in the presence of MNAR missingness. We introduced the notion of a working prior to flexibly model the observed data while simultaneously allowing for an interpretable sensitivity analysis to be implemented. This methodology was applied in Chapter 4, where a Dirichlet process mixture model was constructed to analyze the Schizophrenia clinical trial. In Chapter 5 we considered Bayes factor estimation and empirical Bayes estimation of hyperparameters in the hierarchical Dirichlet process, and used this to develop tools for testing a fully nonparametric hierarchical model against semiparametric and parametric alternatives. In this chapter, we outline areas for future work.

6.1 Rates of Convergence

In Chapter 3 we introduced the working prior framework and proved consistency of our approach under very mild conditions. These results do not address the underlying rate of convergence of estimates of causal effects, and hence are unsatisfying. In future work, we hope to rectify this by examining conditions under which $\sqrt{n}$-estimation of causal effects is possible. Ideally, such convergence is uniform in the sensitivity parameter ξ. In light of work by Robins and Ritov (1997) and Ritov et al. (2014), we expect that the methods proposed in this dissertation have poor frequentist properties in the worst case, and that better frequentist performance requires careful incorporation of a priori dependence between the missing data mechanism π and the response distribution f.

6.2 More Work on Non-monotone Missingness

In future work we will extend the methods developed in Chapter 4 to the case of non-monotone missingness. In this setting there is no direct analog of ACMV which corresponds to MAR, nor is there a direct analog of NFD; moreover, there are reasons to expect that MAR may be inherently unreasonable in non-monotone settings, and hence MAR may not be the appropriate identifying restriction to anchor to (National Research Council, 2010; Robins and Gill, 1997). We hope to address non-monotone missingness by utilizing the working prior framework developed in Chapter 3, perhaps utilizing the ODMV restriction introduced therein. Non-monotone missingness also necessitates a more complex model for the association between the response and missingness. Managing this requires a careful balance between model complexity and tractability of G-computation. Future work will (1) investigate more interpretable alternatives to ACMV and NFD which can be used to conduct a sensitivity analysis, and (2) introduce a flexible modeling framework for non-monotone missing data which (3) can be combined with the given identifying restrictions to produce a tractable G-computation algorithm.

6.3 Multivariate Models for Missing Data

The work in Chapter 3 and Chapter 4 accounts only for a univariate response measured over time. It is common in medical studies to record several responses of interest, or to record many surrogate measures of a response of interest (Dunson, 2007; Dunson and Perreault, 2001). In future work, we hope to extend our methodology to multivariate longitudinal models. The primary challenge is to conduct a meaningful sensitivity analysis while still allowing the model to be flexible enough to fit the data well. One option is to conduct a sensitivity analysis by introducing sensitivity parameters on the scale of latent factors which determine the multivariate response. Care must be taken to ensure that sensitivity analysis on the scale of latent variables is interpretable to practitioners. One area of potential application is the analysis of high-dimensional longitudinal surveys.

6.4 Causal Inference

We hope to apply the work in Chapter 3 and Chapter 4 to the general problem of causal inference. Causal inference is concerned with using data to establish causal associations; a standard example is inferring a causal effect of a treatment by utilizing randomization in a randomized experiment. Similar to the missing data problem, causal inference requires the analyst to make assumptions which cannot be verified from the data alone. There are several strong connections between causal inference and missing data; for example, the Rubin Causal Model (RCM) encodes the causal inference problem in terms of potential outcomes, with potential outcomes corresponding to unobserved treatment assignments regarded as missing data (Rubin, 2005). Recent work takes the opposite approach, and encodes the missing data problem as a causal inference problem (Mohan et al., 2013). We believe that our approach should be useful in providing a Bayesian nonparametric approach to assessing the sensitivity of inferences to underlying assumptions. Applications of Bayesian nonparametrics to this area have occurred relatively recently, and this area warrants much further work (Hill, 2011; Karabatsos and Walker, 2012; Xu et al., 2014).

6.5 Alternatives to the Hierarchical Dirichlet Process

There are several possible directions for future work related to the material in Chapter 5. Hierarchical models related to the HDP, such as the nested Dirichlet process (nDP; see Rodriguez et al., 2008), could be explored in a similar manner. Another avenue for future work is rigorously establishing the theoretical properties of the MCMC schemes used; for example, in recent work of Park and Doss (2015), empirical process theory is used to establish convergence of the estimated Bayes factor surface to the true Bayes factor surface as a random function. Additionally, to our knowledge, the standard errors associated with estimates from serial tempering chains have not been explored formally. Finally, the adaptive methods used in Chapter 5 destroy the Markov property of the underlying "Markov" chain. This can be fixed by running the SAMC algorithm only during some initial segment of the burn-in. Future work might explore the effect of SAMC on the ergodicity properties of our Markov chains, with the anticipation that SAMC may be run during sampling without affecting ergodicity.

APPENDIX A
APPENDIX TO CHAPTER 3

A.1 Proof of Theorem 3.2

Proof. For a fixed $M > 0$, let $t^M$ denote the function $t(z) I(|t(z)| > M)$, and let $t_M(z) = t(z) I(|t(z)| \le M)$. Similarly, let $\psi^M(p) = \int t^M \, dp$ and $\psi_M(p) = \int t_M \, dp$. Denote the posterior $\Pi(dp \mid Z_{1:n})$ by $\Pi_n(dp)$. For fixed $\epsilon > 0$, calculate

$$\Pi_n(|\psi(p) - \psi(p_0)| > \epsilon) \le \Pi_n(|\psi_M(p) - \psi_M(p_0)| > \epsilon/3) + \Pi_n(|\psi^M(p_0)| > \epsilon/3) + \Pi_n(|\psi^M(p)| > \epsilon/3).$$

By (C1), the first term on the right-hand side goes to 0 for all $M$ on a set of probability 1, while the second term is deterministically equal to $I(|\psi^M(p_0)| > \epsilon/3)$ and will be 0 for sufficiently large $M$ because $\psi(p_0) < \infty$. To address the third term, we apply Markov's inequality and Fubini's theorem to get

$$\Pi_n(|\psi^M(p)| > \epsilon/3) \le \frac{\int |\psi^M(p)| \, \Pi_n(dp)}{\epsilon/3} \le \frac{\int \int_{|t(z)| > M} |t(z)| \, p(z) \, dz \, \Pi_n(dp)}{\epsilon/3} = \frac{\int_{|t(z)| > M} |t(z)| \, \tilde{p}_n(z) \, dz}{\epsilon/3}.$$

By (C2) there exists a set with $p_0$-probability 1 such that, for any fixed $\delta > 0$, we may choose $M$ large enough that the last term above is less than $\delta$ for all sufficiently large $n$. Letting $n \to \infty$, this gives

$$\limsup_{n \to \infty} \Pi_n(|\psi(p) - \psi(p_0)| > \epsilon) \le \delta.$$

The conclusion follows by noting that $\delta$ can be made arbitrarily small, so that the limit is 0.

A.2 Proof of Theorem 3.3

Theorem 3.3 is proved through a sequence of lemmas concerning convergence in total variation. We consider random vectors $(Y_n, S_n)$ for $n = 1, 2, \ldots$ and $(Y_0, S_0)$, with densities $p_n(y, s)$ and $p_0(y, s)$ respectively, and throughout we assume that all conditional distributions are well defined and admit conditional densities. The dropout time $S_i$ is supported on $\{1, \ldots, J\}$. We define $O_i = (Y_{i,\text{obs}}, S_i)$ for $i = 0, 1, \ldots$.

Our underlying assumption is that $O_n \to_s O_0$, where $\to_s$ denotes convergence in total variation; our goal is to show $(Y_n, S_n) \to_s (Y_0, S_0)$ under our transformation-based NFD assumption and the assumption that $p_0(S = J \mid y) \ge \delta$ for some $\delta > 0$. Recall that $(Y_n, S_n) \to_s (Y_0, S_0)$ means

$$\int |p_n(y, s) - p_0(y, s)| \, dy \, ds \to 0.$$

We also note the well-known fact that

$$\int |p_n(y, s) - p_0(y, s)| \, dy \, ds = 2 \sup_B |P((Y_n, S_n) \in B) - P((Y_0, S_0) \in B)|,$$

where the sup extends over all Borel sets. Integrating against the forms $d\bar{y}_j$, $dy$, $ds$, and so forth will denote integrating against appropriate dominating measures associated to the distributions of $Y_{ij}$, $\bar{Y}_i$, $S_i$, and so forth. We will additionally write expressions like $p_n(\bar{y}_j \mid S \ge j)$ to denote the conditional density of $\bar{Y}_{nj}$ given $S_n \ge j$; the reader will be able to resolve the inherent ambiguities of such statements from context. Our first lemma gives a characterization of convergence in total variation for joint distributions of random vectors.

Lemma A.1. Let $(X_i, Y_i)$, $i = 0, 1, \ldots$, be a sequence of random vectors, and write $p_i(x, y)$ for the associated densities. Then, $(X_n, Y_n) \to_s (X_0, Y_0)$ implies

$$\int |p_n(x) - p_0(x)| \, dx \to 0, \quad \text{and} \quad \int \nu(x) \int |p_n(y \mid x) - p_0(y \mid x)| \, dy \, dx \to 0,$$

for every probability density $\nu(x)$ such that the support of $\nu(x)$ is contained in the support of $p_0(x)$. For the converse, it suffices to let $\nu(x) = p_0(x)$.

Proof. An application of the triangle inequality shows that

$$\int |p_n(x, y) - p_0(x, y)| \, dx \, dy \le \int |p_n(x) - p_0(x)| \, dx + \int p_0(x) \int |p_n(y \mid x) - p_0(y \mid x)| \, dy \, dx.$$

This proves the converse of the lemma by taking $\nu(x) = p_0(x)$. Next, we note that

$$\int |p_n(x) - p_0(x)| \, dx \le \int \int |p_n(x, y) - p_0(x, y)| \, dy \, dx$$

by the alternate characterization of the total-variation distance; hence $\int |p_n(x) - p_0(x)| \, dx \to 0$.

Finally, let $\nu(x)$ be any density such that $p_0(x) = 0$ implies $\nu(x) = 0$. For arbitrary $K > 0$, write

$$\int \nu(x) \int |p_n(y \mid x) - p_0(y \mid x)| \, dy \, dx = \int_{\nu(x) \ge K p_0(x)} \nu(x) \int |p_n(y \mid x) - p_0(y \mid x)| \, dy \, dx + \int_{\nu(x) < K p_0(x)} \nu(x) \int |p_n(y \mid x) - p_0(y \mid x)| \, dy \, dx. \qquad \text{(A–1)}$$

We bound

$$\int_{\nu(x) \ge K p_0(x)} \nu(x) \int |p_n(y \mid x) - p_0(y \mid x)| \, dy \, dx \le 2 \int_{\nu(x) \ge K p_0(x)} \nu(x) \, dx.$$

If $Z \sim \nu$, the term on the right side is the probability $2 P(\nu(Z)/p_0(Z) \ge K)$. Because $p_0(Z) = 0$ occurs with probability 0 under $\nu$, we can choose $K$ large enough that this expression is less than an arbitrary $\epsilon > 0$. To bound the second term on the right side of (A–1), an application of the triangle inequality gives

$$\int_{\nu(x) < K p_0(x)} \nu(x) \int |p_n(y \mid x) - p_0(y \mid x)| \, dy \, dx \le K \int p_0(x) \int |p_n(y \mid x) - p_0(y \mid x)| \, dy \, dx \le K \left( \int \int |p_n(x, y) - p_0(x, y)| \, dy \, dx + \int |p_n(x) - p_0(x)| \, dx \right).$$

For arbitrary $K$, the expression on the right goes to 0. We conclude that, for any $\epsilon > 0$, we may choose sufficiently large $K$ and reason as above to obtain

$$\limsup_{n \to \infty} \int \nu(x) \int |p_n(y \mid x) - p_0(y \mid x)| \, dy \, dx \le \epsilon.$$

Conclude the proof by letting $\epsilon \downarrow 0$.

The next lemma links strong convergence of conditional distributions to strong convergence of the joint distributions.

Lemma A.2. Let $(X_i, S_i)$, $i = 0, 1, \ldots$, be a sequence of random vectors. Suppose $(X_n, S_n) \to_s (X_0, S_0)$. Then $[X_n \mid S_n \in A] \to_s [X_0 \mid S_0 \in A]$ for any set $A$ such that $S_0 \in A$ holds with positive probability.

Proof. Write

$$\frac{1}{2} \int |p_n(x \mid S \in A) - p_0(x \mid S \in A)| \, dx = \sup_B \left| \frac{P(X_n \in B, S_n \in A)}{P(S_n \in A)} - \frac{P(X_0 \in B, S_0 \in A)}{P(S_0 \in A)} \right|.$$

Applying the triangle inequality, this is bounded by

$$\frac{1}{P(S_n \in A)} \sup_B |P(X_n \in B, S_n \in A) - P(X_0 \in B, S_0 \in A)| + \left| \frac{1}{P(S_n \in A)} - \frac{1}{P(S_0 \in A)} \right| \sup_B P(X_0 \in B, S_0 \in A).$$

As $n \to \infty$, the above expression tends to 0 due to the fact that $(X_n, S_n) \to_s (X_0, S_0)$.

Lemma A.3. Under our assumptions, $[\bar{Y}_{ns} \mid S_n \in A] \to_s [\bar{Y}_{0s} \mid S_0 \in A]$ for all $A \subset \{s, s+1, \ldots, J\}$.

Proof. Recall that $O_n \to_s O_0$ and mimic the proof of Lemma A.2, noting that, for all Borel sets $B$, the events $[\bar{Y}_{ns} \in B, S_n \in A]$ and $[\bar{Y}_{0s} \in B, S_0 \in A]$ are $O_n$- and $O_0$-measurable, respectively.

Lemma A.4. Under our assumption that $p_0(S = J \mid y) > 0$, we have

$$p_0(\bar{y}_s \mid S \in A) \ll p_0(\bar{y}_s \mid S \in B) \quad (s < J),$$

for all $A$ such that $p_0(S \in A) > 0$ and for all $B$ such that $J \in B$, where $q(y) \ll r(y)$ means that $r(y) = 0$ implies $q(y) = 0$ for $q$-almost-all $y$.

Proof. It is clear that $p_0(\bar{y}_s \mid S = J) \ll p_0(\bar{y}_s \mid S \in B)$. Suppose that $p_0(\bar{y}_s \mid S = J) = 0$. Then

$$p_0(\bar{y}_s \mid S = J) = \frac{p_0(\bar{y}_s) \times p_0(S = J \mid \bar{y}_s)}{p_0(S = J)} = 0.$$

By cancellation, this implies that $p_0(\bar{y}_s) \ll p_0(\bar{y}_s \mid S = J)$. Clearly, $p_0(\bar{y}_s \mid S \in A) \ll p_0(\bar{y}_s)$, which implies the result.

Because $O_n$ and $O_0$ are arbitrary aside from the given assumptions, the following theorem implies Theorem 3.3. We prove the theorem under the NFD assumption of Theorem 3.3. Let $G_j(\bar{y}_j) = \left| \frac{d}{dy_j} T_j^{-1}(y_j \mid \bar{y}_{j-1}) \right|$. To lighten notation, we will write $g_j(\bar{y}_j)$ for $T_j^{-1}(y_j \mid \bar{y}_{j-1})$.

Theorem A.1. Under our assumptions, $(Y_n, S_n) \to_s (Y_0, S_0)$.

Proof. For simplicity, we prove the theorem carefully when $J = 3$; the extension to arbitrary $J$ is clear, but notationally heavy. Our goal is to show that $(Y_n, S_n) \to_s (Y_0, S_0)$. In view of Lemma A.1, it suffices to prove the following statements:

$$\int |p_n(y_1, s) - p_0(y_1, s)| \, dy_1 \, ds \to 0, \qquad \text{(A–2)}$$
$$\int p_0(y_1, s) |p_n(y_2 \mid y_1, s) - p_0(y_2 \mid y_1, s)| \, d\bar{y}_2 \, ds \to 0, \qquad \text{(A–3)}$$
$$\int p_0(\bar{y}_2, s) |p_n(y_3 \mid \bar{y}_2, s) - p_0(y_3 \mid \bar{y}_2, s)| \, dy \, ds \to 0. \qquad \text{(A–4)}$$

Because $(Y_{n1}, S_n)$ and $(Y_{01}, S_0)$ are functions of $O_n$ and $O_0$, (A–2) holds by Lemma A.1. To prove (A–3), integrating out $S$, we must prove the following statements:

$$\int p_0(y_1, S = 1) |p_n(y_2 \mid y_1, S = 1) - p_0(y_2 \mid y_1, S = 1)| \, d\bar{y}_2 \to 0, \qquad \text{(A–5)}$$
$$\int p_0(y_1, S = 2) |p_n(y_2 \mid y_1, S = 2) - p_0(y_2 \mid y_1, S = 2)| \, d\bar{y}_2 \to 0, \qquad \text{(A–6)}$$
$$\int p_0(y_1, S = 3) |p_n(y_2 \mid y_1, S = 3) - p_0(y_2 \mid y_1, S = 3)| \, d\bar{y}_2 \to 0. \qquad \text{(A–7)}$$

We assume that $[S_0 = j]$ occurs with positive probability; otherwise, these statements are trivial. This also implies that $[S_n = j]$ occurs with positive probability for large enough $n$, so that each expression is well defined. By a combination of Lemma A.3 and Lemma A.1, (A–6) and (A–7) hold. Under NFD, $p_i(y_2 \mid y_1, S = 1)$ is given by $p_i(g_2(\bar{y}_2) \mid y_1, S \ge 2) \, G_2(\bar{y}_2)$ for $i = 0, 1, \ldots$. Making an appropriate change of variable, and noting that, by assumption, the support of $p(y_2 \mid y_1, S \ge 2)$ is the same as the support of $p(g_2(\bar{y}_2) \mid y_1, S \ge 2) \, G_2(\bar{y}_2)$, we have

$$\int p_0(y_1, S = 1) |p_n(y_2 \mid y_1, S = 1) - p_0(y_2 \mid y_1, S = 1)| \, d\bar{y}_2 = \int p_0(y_1, S = 1) |p_n(y_2 \mid y_1, S \ge 2) - p_0(y_2 \mid y_1, S \ge 2)| \, d\bar{y}_2 \qquad \text{(A–8)}$$
$$= p_0(S = 1) \int p_0(y_1 \mid S = 1) |p_n(y_2 \mid y_1, S \ge 2) - p_0(y_2 \mid y_1, S \ge 2)| \, d\bar{y}_2.$$

The equality is justified by NFD and a change of variable. By Lemma A.1, Lemma A.3, and Lemma A.4, the final expression in (A–8) converges to 0, completing the proof of (A–3). We now prove (A–4). Again, by marginalizing over $S$, we must prove the following statements:

$$\int p_0(\bar{y}_2, S = 1) |p_n(y_3 \mid \bar{y}_2, S = 1) - p_0(y_3 \mid \bar{y}_2, S = 1)| \, dy \to 0, \qquad \text{(A–9)}$$
$$\int p_0(\bar{y}_2, S = 2) |p_n(y_3 \mid \bar{y}_2, S = 2) - p_0(y_3 \mid \bar{y}_2, S = 2)| \, dy \to 0, \qquad \text{(A–10)}$$
$$\int p_0(\bar{y}_2, S = 3) |p_n(y_3 \mid \bar{y}_2, S = 3) - p_0(y_3 \mid \bar{y}_2, S = 3)| \, dy \to 0. \qquad \text{(A–11)}$$

Statement (A–11) is again implied by Lemma A.3 and Lemma A.1. The proof of statement (A–10) is directly analogous to the proof of (A–5). To prove (A–9), we write

$$\int p_0(\bar{y}_2, S = 1) |p_n(y_3 \mid \bar{y}_2, S = 1) - p_0(y_3 \mid \bar{y}_2, S = 1)| \, dy = \int p_0(y_1, S = 1) \, p_0(y_2 \mid y_1, S = 1) |p_n(y_3 \mid \bar{y}_2, S = 1) - p_0(y_3 \mid \bar{y}_2, S = 1)| \, dy \qquad \text{(A–12)}$$
$$= \int p_0(y_1, S = 1) \, p_0(g_2(\bar{y}_2) \mid y_1, S \ge 2) \, |p_n(g_3(y_3) \mid \bar{y}_2, S = 3) - p_0(g_3(y_3) \mid \bar{y}_2, S = 3)| \, G_3(y_3) \, G_2(\bar{y}_2) \, dy.$$

Making a change of variable, expression (A–12) is

$$\int p_0(y_1, S = 1) \, p_0(g_2(\bar{y}_2) \mid y_1, S \ge 2) \, |p_n(y_3 \mid \bar{y}_2, S = 3) - p_0(y_3 \mid \bar{y}_2, S = 3)| \, G_2(\bar{y}_2) \, dy.$$

Let

$$\nu(\bar{y}_2) = p_0(y_1 \mid S = 1) \times p_0(g_2(\bar{y}_2) \mid y_1, S \ge 2) \times G_2(\bar{y}_2).$$

Lemma A.4 shows that $p_0(y_1 \mid S = 1) \ll p_0(y_1 \mid S = 3)$, while the fact that $T_2$ is support-preserving implies

$$G_2(\bar{y}_2) \times p_0(g_2(\bar{y}_2) \mid y_1, S \ge 2) \ll p_0(y_2 \mid y_1, S \ge 2) \ll p_0(y_2 \mid y_1, S = 3).$$

Apply Lemma A.1 with $\nu(\bar{y}_2)$ and Lemma A.3 to complete the proof.

A.3 Proof of Theorem 3.5

By Lemma 3.3, it suffices to construct a density $q_0(y, z)$ such that (3–3) holds and $q_0(y, z)$ is in the Kullback–Leibler support of $\Pi^\star$. Let $\lambda(z)$ be any nonnegative density on $\mathbb{R}^J$ satisfying the following conditions:

1. $0 < \lambda(z) \le M_\lambda$.

2. $\int \lambda |\log \lambda| \, dz < \infty$.

3. For sufficiently small $\delta$, $\int \lambda \log[\lambda/\gamma_\delta] \, dz < \infty$, where $\gamma_\delta(z) = \inf_{\|z - t\| < \delta} \lambda(t)$.

4. $\lambda(z)$ has moments of all orders.

For example, taking $\lambda(z)$ to be a product of standard Gaussian densities suffices. Choose $q_0(y, z)$ as in (3–4). We verify conditions A1–A4 on $q_0$.

Proof of A1. By D1 and our assumption on $\lambda(z)$, $q_0(y, z)$ is bounded above by $M \times M_\lambda / \inf_r \Lambda(A(r))$, with the denominator strictly positive due to the fact that $\lambda$ is positive and $\{A(r) : r \in \{0, 1\}^J\}$ is a finite partition.

Proof of A2. Write

$$\int q_0(y, z) \log q_0(y, z) \, dy \, dz = \int q_0(y, z) \log p_0(y, \rho(z)) \, dy \, dz + \int q_0(y, z) \log \lambda(z) \, dy \, dz - \int q_0(y, z) \log \Lambda(A(\rho(z))) \, dy \, dz. \qquad \text{(A–13)}$$

The expression $\log \Lambda(A(\rho(z)))$ is bounded, so the third term is trivially finite. To see that the first term is finite, note that

$$\int q_0(y, z) \log p_0(y, \rho(z)) \, dy \, dz = \sum_r \int_{z \in A(r)} \int q_0(y, z) \log p_0(y, r) \, dy \, dz = \sum_r \int p_0(y, r) \log p_0(y, r) \, dy.$$

This is finite by D2. To address the middle term of (A–13), write

$$\int q_0(y, z) \log \lambda(z) \, dy \, dz = \sum_r \int_{z \in A(r)} \int \frac{p_0(y, r) \lambda(z)}{\Lambda(A(r))} \log \lambda(z) \, dy \, dz = \sum_r \frac{p_0(r)}{\Lambda(A(r))} \int_{A(r)} \lambda(z) \log \lambda(z) \, dz.$$

The last term is finite by assumption on $\lambda(z)$.

Proof of A3. Let $\gamma_\delta(z) = \inf_{\|z - t\| < \delta} \lambda(t)$ and let $M_A = \max_r \Lambda(A(r))$. For fixed $\delta > 0$, calculate

$$\int q_0(y, z) \log \frac{q_0(y, z)}{\phi_\delta(y, z)} \, dy \, dz \le \int q_0(y, z) \log \frac{q_0(y, z) M_A}{\psi_\delta(y) \gamma_\delta(z)} \, dy \, dz = \sum_r \int p_0(y, r) \log \frac{p_0(y, r)}{\psi_\delta(y)} \, dy + \sum_r \frac{p_0(r)}{\Lambda(A(r))} \int_{A(r)} \lambda(z) \log \frac{M_A \lambda(z)}{\gamma_\delta(z)} \, dz. \qquad \text{(A–14)}$$

By assumption, the first term in the last expression of (A–14) is finite for sufficiently small $\delta$, while the second term is finite for sufficiently small $\delta$ by choice of $\lambda(z)$.

Proof of A4. We must check C4 where $d = 2J$. Taking $\eta$ as in D4, we have

$$\int \|(y, z)\|^{4(1+\eta)J} q_0(y, z) \, dy \, dz \le C \int \left( \|y\|^{4(1+\eta)J} + \|z\|^{4(1+\eta)J} \right) q_0(y, z) \, dy \, dz = C \int \|y\|^{4(1+\eta)J} p_0(y) \, dy + C \sum_r \frac{p_0(r)}{\Lambda(A(r))} \int_{A(r)} \|z\|^{4(1+\eta)J} \lambda(z) \, dz,$$

where $C$ can be taken to be $2^{4(1+\eta)J}$, and the equality on the second line follows from a calculation similar to that used in bounding the middle term of (A–13). The last term is finite by D4 and our choice of $\lambda(z)$.

APPENDIX B
APPENDIX TO CHAPTER 4

In this appendix we discuss some additional technical and computational topics associated with material in Chapter 4.

B.1 Blocked Gibbs Sampler

To implement the blocked Gibbs sampler, following Ishwaran and James (2001) we truncate the Dirichlet process mixture at $K$ components and introduce a latent variable $C_i$, with $C_i = k$ if observation $i$ belongs to the $k$th class of the mixture distribution, and proceed by blocked Gibbs sampling. We alter their scheme slightly by introducing additional latent variables representing $Y_{\text{mis},i}$ for each observation, which is blocked with $C_i$. This yields the following algorithm.

1. Conditional for $(\theta^{(k)}, \gamma^{(k)})$: Simulate

$$p(\theta^{(k)}, \gamma^{(k)} \mid C_{1:n}, Y_{1:n}, S_{1:n}) \propto H(d\theta^{(k)}, d\gamma^{(k)}) \prod_{i : C_i = k} f_{\theta^{(k)}}(Y_i) \, \pi_{\gamma^{(k)}}(S_i \mid Y_i).$$

2. Conditional for $w$: Let $M_k$ denote the number of $C_i$ equal to $k$, and $M_{>k}$ denote the number of $C_i$ strictly greater than $k$. Simulate

$$w_k' \stackrel{\text{indep}}{\sim} \text{Beta}(1 + M_k, \alpha + M_{>k}), \quad (1 \le k \le K - 1),$$

and set $w_K' \equiv 1$. Then, set $w_k = w_k' \prod_{j < k} (1 - w_j')$.

3. Conditional for $(C_i, Y_{\mathrm{mis},i})$: Sampling $C_i$ at this point involves calculating the observed data likelihood
\[
L_{\mathrm{obs},i} = \sum_{k=1}^K w_k\, f_{\theta^{(k)}}(\bar Y_{i S_i})\, \pi_{\gamma^{(k)}}(S_i \mid \bar Y_{i S_i}),
\]
which may be retained if desired for model evaluation purposes; $C_i = k$ is drawn with probability proportional to the $k$th summand. Given $C_i$, we then simulate $Y_{\mathrm{mis},i}$ according to
\[
p(Y_{\mathrm{mis},i} \mid Y_{\mathrm{obs},i}, C_i, \theta^{(C_i)}) = f_{\theta^{(C_i)}}(Y_{\mathrm{mis},i} \mid Y_{\mathrm{obs},i}).
\]
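For a Gaussian mixture, evaluating the observed-data likelihood amounts to sub-setting each component's mean and covariance to the observed coordinates; a sketch of the outcome factor only (the dropout factor $\pi_\gamma$ is omitted for brevity; names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def observed_data_likelihood(y, obs, w, mus, Sigmas):
    """Mixture likelihood of the observed coordinates of one subject.

    y: full-length outcome vector (missing entries are ignored),
    obs: boolean mask of observed coordinates,
    w, mus, Sigmas: mixture weights, means, and covariance matrices.
    Marginalizing a Gaussian is just sub-setting its mean and covariance.
    """
    return sum(
        w_k * multivariate_normal.pdf(y[obs], mu[obs], Sigma[np.ix_(obs, obs)])
        for w_k, mu, Sigma in zip(w, mus, Sigmas)
    )
```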

If hyperpriors are placed on $\alpha$ and $H$, we can easily add steps corresponding to updating these parameters. The relevant likelihood for $\alpha$ is
\[
L_\alpha = \alpha^{K-1} e^{\alpha \sum_k \log(1 - w_k')},
\]
so that $\alpha$ can be given a conjugate gamma prior, and the relevant likelihood for $H$ is
\[
L_H = \prod_{k=1}^K H(d\theta^{(k)}, d\gamma^{(k)}).
\]
Any updates which cannot be done in closed form may be replaced by appropriate updates which leave these conditional distributions invariant, such as slice sampling updates (Neal, 2003).

B.2 Prior Specification

B.2.1 Parametric Priors

First, the data are standardized so that the complete cases ($S_i = J$) have mean 0 and variance 0.5. When a multivariate Gaussian model is used, we use the parametrization of $(\mu, \Sigma)$ in terms of the autoregressive parameters $(\mu, \phi, \rho)$ given in (4--2). We specify a $\mathcal N(0, 10^6)$ prior for the mean parameters $\mu_j$ and autoregressive parameters $\phi_{\ell j}$, while the scale parameters $\rho_j$ are given Uniform(0, 100) priors.

B.2.2 Nonparametric Default Priors

We use default hierarchical priors borrowing ideas from Rasmussen (2000) and Taddy (2008). We first discuss the prior on $\pi_\gamma(s \mid y)$. As a preprocessing step, we standardize the data so that the grand observed mean (across all treatments and times) is 0 and the grand observed variance is 0.5.

For the simulation in Section 4.6.1 we assume $\pi_\gamma(s \mid y) = \gamma_s$, i.e., $Y$ and $S$ are independent within each mixture component. We take $\gamma^{(k)} \sim \mathcal D(\zeta)$, where $\zeta$ is chosen so that a priori $E[\gamma_s^{(k)}] = \zeta_s / \sum_{j=1}^J \zeta_j$ is equal to the empirical probability of $S = s$. Here $\sum_{j=1}^J \zeta_j$ is a smoothing parameter, analogous to $\alpha$ in the Dirichlet process: if $\sum_{j=1}^J \zeta_j$ is very large then the dropout distribution is essentially the same across classes, making dropout and outcome approximately independent, while if $\sum_{j=1}^J \zeta_j$ is very small then only one dropout pattern will typically be represented in a given class. We take $\sum_{j=1}^J \zeta_j = 3$.
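The role of $\sum_j \zeta_j$ as a smoothing parameter can be seen directly from Dirichlet draws; a small illustration (the probabilities below are made up for the example):

```python
import numpy as np

def dirichlet_spread(total, p_hat, rng, n=4000):
    """Componentwise std. dev. of Dirichlet(total * p_hat) draws.

    The prior mean is p_hat regardless of `total`; a large total pins each
    class's dropout distribution near p_hat, a small total lets individual
    classes concentrate on a single pattern.
    """
    return rng.dirichlet(total * np.asarray(p_hat), size=n).std(axis=0)

p_hat = np.array([0.1, 0.2, 0.3, 0.4])   # illustrative empirical probabilities
loose = dirichlet_spread(3.0, p_hat, np.random.default_rng(0))
tight = dirichlet_spread(3000.0, p_hat, np.random.default_rng(0))
```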

In Section 4.6.2 and Section 4.7 we choose $\pi_\gamma(s \mid y)$ so that
\[
\operatorname{logit} \pi_\gamma(S = s \mid Y, S \ge s) = \zeta_s + \lambda_{s1} Y_s + \lambda_{s2} Y_{s-1},
\]

where $\gamma = (\zeta, \lambda)$. All $\zeta$ and $\lambda$ terms are given independent $\mathcal N(\mu_\zeta, \sigma_\zeta^2)$ and $\mathcal N(\mu_\lambda, \sigma_\lambda^2)$ distributions. $\mu_\zeta$ and $\mu_\lambda$ are given Cauchy priors with location 0 and scales 5 and 2.5, respectively, and $\sigma_\zeta^2$ and $\sigma_\lambda^2$ are given $\Gamma^{-1}(1, 1)$ priors.
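Under this sequential-logit specification, pattern probabilities are products of hazards and survival terms; a sketch with hypothetical parameter values, treating the last pattern as the completers and dropping the lagged term at the first time (where it is not applicable):

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def dropout_pattern_probs(y, zeta, lam1, lam2):
    """Pr(S = s | Y = y) from the sequential-logit dropout hazards.

    Hazard at time s: expit(zeta_s + lam1_s * y_s + lam2_s * y_{s-1});
    the final pattern absorbs the remaining probability (completers).
    """
    J = len(y)
    probs, surv = [], 1.0
    for s in range(J - 1):                    # dropout opportunities
        prev = y[s - 1] if s > 0 else 0.0     # no lagged term at the first time
        h = expit(zeta[s] + lam1[s] * y[s] + lam2[s] * prev)
        probs.append(surv * h)
        surv *= 1.0 - h
    probs.append(surv)                        # completer pattern
    return np.array(probs)
```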

We now address the prior on $f_\theta(y)$, where $\theta = (\mu, \Sigma)$. We again parametrize in terms of the autoregressive parameters $(\mu, \phi, \rho)$. We set $\mu_j^{(k)} \sim \mathcal N(m_{\mu_j}, s_{\mu_j}^2)$ and $\phi_{\ell j}^{(k)} \sim \mathcal N(0, s_{\phi_j}^2)$, and took $m_{\mu_j} \sim \mathcal N(0, 0.5)$. The variance components were specified as follows:
\[
\begin{aligned}
\rho_j^{2(k)} &\sim \Gamma^{-1}(a_\rho, a_\rho b_{\rho_j}^{-1}), & a_\rho - 2 &\sim \Gamma^{-1}(1, 1), \\
b_{\rho_j} &\sim \Gamma(1, 2/g_j), & s_{\mu_j}^2 &\sim \Gamma^{-1}(a_{s_\mu}, a_{s_\mu} \delta_1 g_j), \\
a_{s_\mu} - 2 &\sim \Gamma^{-1}(1, 1), & \delta_1 &\sim \Gamma(1, 1), \\
s_{\phi_j}^2 &\sim \Gamma^{-1}(a_{s_\phi}, a_{s_\phi} \delta_2), & a_{s_\phi} - 2 &\sim \Gamma^{-1}(1, 1), \\
\delta_2 &\sim \Gamma(1, 1),
\end{aligned}
\]
where $g_j$ is the MLE of the conditional variance of $Y_j$ given $\bar Y_{j-1}$ under normality and MAR. The parameters $\delta_1 g_j$ and $\delta_2 g_j$ represent random scaling components for $s_{\mu_j}^2$ and $s_{\phi_j}^2$.

B.3 Simulation Settings

In this section, we describe the simulation settings used in Section 4.6.

B.3.1 Section 4.6.1

In the first simulation setting of Section 4.6.1, data were generated according to $Y \sim \mathcal N(\mu, \Sigma)$, with $\mu = (0, 0, 0)$ and $\Sigma$ an AR-1 covariance matrix with $\operatorname{Var}(Y_1) = 1$ and $\operatorname{Cov}(Y_j, Y_{j+1}) = 0.7$. Missingness was MAR, with discrete hazard at times $j = 1$ and $j = 2$ given by $\lambda_j(Y) = \Pr(S = j \mid S \ge j, Y) = \operatorname{expit}(a_j + b_j Y_{j-1})$. The values of $a_1$ and $b_1$ were chosen so that $\lambda_1(-2) = 0.5$ and $\Pr(S = 1) = 0.2$; $a_2$ and $b_2$ were chosen so that $\lambda_2(-2) = 0.5$ and $\Pr(S = 2 \mid S \ge 1) = 0.25$.

In the second simulation setting of Section 4.6.1, $Y$ was drawn from a 50-50 mixture of Gaussian distributions with means $\mu_1 = (2, 0, -2)$ and $\mu_2 = (6, 1.5, 0)$, and covariance matrices $\Sigma_1 = \operatorname{diag}(2, .1, .2)$ and $\Sigma_2$ exchangeable with variance 1 and covariance 0.8. This was chosen to make the distributions of $(Y_1, Y_2)$ and $(Y_1, Y_3)$ roughly "L-shaped" while

$(Y_2, Y_3)$ is roughly linear. Missingness is MAR with
\[
\begin{aligned}
\lambda_1(Y) &= 0.4\, I(Y_1 \le 2) + 0.18\, I(2 < Y_1 \le 5) + 0.1\, I(Y_1 > 5), \\
\lambda_2(Y) &= 0.45\, I(Y_2 \le 0) + 0.2\, I(0 < Y_2 \le 2) + 0.1\, I(Y_2 > 2).
\end{aligned}
\]
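Piecewise-constant hazards like these are straightforward to simulate from; a sketch (the completer pattern is coded $S = 3$, and function names are my own):

```python
import numpy as np

def piecewise_hazard_dropout(Y, rng):
    """Draw dropout times S for n subjects under the two hazards above.

    Y: (n, 3) array of outcomes; S = 1 or 2 indicates dropout at that
    time, and S = 3 indicates a completer.
    """
    n = Y.shape[0]
    lam1 = np.select([Y[:, 0] <= 2, Y[:, 0] <= 5], [0.40, 0.18], default=0.10)
    lam2 = np.select([Y[:, 1] <= 0, Y[:, 1] <= 2], [0.45, 0.20], default=0.10)
    S = np.full(n, 3)
    u1, u2 = rng.random(n), rng.random(n)
    S[u1 < lam1] = 1                                  # drop out at time 1
    S[(u1 >= lam1) & (u2 < lam2)] = 2                 # survive 1, drop at 2
    return S
```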

Here $I(Y \in A)$ denotes the indicator function. The hazards were chosen so that $\Pr(S = 1) \approx \Pr(S = 2) \approx 0.2$.

B.3.2 Section 4.6.2

Parameters under $M_1$ are
\[
\mu = (95.5, 94.1, 91.6, 89.0, 86.2, 81.3)^T,
\]
\[
\Sigma = \begin{pmatrix}
114.0 \\
98.5 & 143.2 \\
101.5 & 149.0 & 222.7 \\
115.3 & 156.6 & 225.0 & 335.1 \\
119.8 & 145.6 & 220.6 & 355.8 & 444.3 \\
118.3 & 142.1 & 210.9 & 337.0 & 420.2 & 441.6
\end{pmatrix}
\]
(lower triangle of the symmetric matrix shown), and
\[
\zeta = (-16.4, -0.7, -11.5, -9.9, -27.6)^T, \quad
\lambda_1 = (-0.1, 0.0, 0.2, -0.4, 0.4)^T, \quad
\lambda_2 = (\text{NA}, 0.0, -0.1, -0.4, -0.1)^T.
\]

These parameters come from fitting the selection model to the data and taking the posterior mean of each parameter. $M_2$ is a 5-component mixture that was obtained by fitting a Dirichlet mixture of lag-1 selection models and taking the parameters corresponding to the 5 components of highest posterior probability (we do not take posterior means because the likelihood is invariant under permutations of the component labels).

\[
w = (0.119, 0.578, 0.001, 0.115, 0.186)^T,
\]
\[
\zeta = -\begin{pmatrix}
9.58 & 10.65 & 9.25 & 9.86 & 9.42 \\
9.45 & 10.31 & 9.55 & 9.08 & 9.72 \\
9.61 & 9.46 & 9.77 & 8.79 & 9.80 \\
9.51 & 9.08 & 10.45 & 9.33 & 9.44 \\
9.03 & 8.91 & 10.19 & 9.21 & 9.58
\end{pmatrix}, \quad
\lambda = \begin{pmatrix}
-0.18 & -0.56 & 0.09 & -0.54 & -0.10 \\
-0.45 & -0.77 & -0.63 & -0.25 & -0.34 \\
-1.02 & -0.16 & -0.41 & -0.28 & -0.57 \\
-0.90 & -0.11 & 0.11 & -0.39 & 0.11 \\
0.67 & -0.16 & 0.11 & -0.13 & 0.09
\end{pmatrix},
\]
\[
\mu = \begin{pmatrix}
98.11 & 95.94 & 91.85 & 91.68 & 100.94 & 75.71 \\
95.58 & 93.09 & 89.34 & 82.97 & 78.82 & 76.86 \\
75.95 & 64.27 & 63.69 & 59.69 & 57.83 & 34.60 \\
85.11 & 83.13 & 72.67 & 67.32 & 64.35 & 61.12 \\
97.49 & 99.48 & 101.83 & 107.23 & 94.34 & 104.05
\end{pmatrix}.
\]

Each row of a given matrix corresponds to a mixture component. The covariance matrices for each class are given by

\[
\Sigma_1 = \begin{pmatrix}
245 & 222 & 207 & 199 & 195 & 183 \\
222 & 268 & 250 & 240 & 235 & 220 \\
207 & 250 & 270 & 259 & 254 & 237 \\
199 & 240 & 259 & 295 & 288 & 269 \\
195 & 235 & 254 & 288 & 374 & 349 \\
183 & 220 & 237 & 269 & 349 & 354
\end{pmatrix}, \quad
\Sigma_2 = \begin{pmatrix}
72 & 68 & 64 & 61 & 56 & 55 \\
68 & 117 & 110 & 105 & 98 & 94 \\
64 & 110 & 174 & 166 & 155 & 147 \\
61 & 105 & 166 & 194 & 182 & 172 \\
56 & 98 & 155 & 182 & 199 & 187 \\
55 & 94 & 147 & 172 & 187 & 212
\end{pmatrix},
\]
\[
\Sigma_3 = \begin{pmatrix}
120 & 111 & 104 & 98 & 97 & 89 \\
111 & 156 & 146 & 138 & 136 & 125 \\
104 & 146 & 194 & 184 & 179 & 165 \\
98 & 138 & 184 & 239 & 231 & 211 \\
97 & 136 & 179 & 231 & 274 & 250 \\
89 & 125 & 165 & 211 & 250 & 251
\end{pmatrix}, \quad
\Sigma_4 = \begin{pmatrix}
73 & 69 & 65 & 63 & 59 & 52 \\
69 & 122 & 115 & 112 & 105 & 94 \\
65 & 115 & 188 & 182 & 171 & 153 \\
63 & 112 & 182 & 227 & 214 & 191 \\
59 & 105 & 171 & 214 & 220 & 197 \\
52 & 94 & 153 & 191 & 197 & 221
\end{pmatrix},
\]
\[
\Sigma_5 = \begin{pmatrix}
106 & 100 & 96 & 91 & 83 & 84 \\
100 & 125 & 119 & 113 & 104 & 105 \\
96 & 119 & 167 & 159 & 147 & 148 \\
91 & 113 & 159 & 360 & 335 & 336 \\
83 & 104 & 147 & 335 & 337 & 339 \\
84 & 105 & 148 & 336 & 339 & 367
\end{pmatrix}.
\]

To generate from model $M_3$ we first generate data under $M_1$ and apply the appropriate Gaussian distribution function to each component to get data which is marginally uniform. Next we apply the skew-$t$ quantile function to each component to get data which is marginally skew-$t$. Recall the density of the skew-$t$ distribution (Azzalini,

Table B-1. Results from the simulation study in Section 4.6.2. Normal refers to M1, Mixture to M2, and Skew-T to M3.

Normal Lag-2 Selection Model

        95% CI Width               Coverage Probability       Root Mean Squared Error
  ξ     Normal        Dirichlet    Normal       Dirichlet     Normal       Dirichlet
  0     6.80 (0.46)   6.54 (0.41)  93.7 (1.4)   92.0 (1.6)    1.73 (0.07)  1.71 (0.07)
  0.5   7.55 (0.53)   7.24 (0.48)  94.3 (1.3)   93.0 (1.3)    1.93 (0.08)  1.92 (0.07)
  1     8.54 (0.61)   8.18 (0.58)  94.3 (1.3)   93.3 (1.5)    2.19 (0.09)  2.16 (0.08)
  1.5   9.70 (0.69)   9.34 (0.66)  93.7 (1.4)   93.7 (1.4)    2.50 (0.10)  2.55 (0.09)
  2     10.99 (0.78)  10.62 (0.73) 94.0 (1.4)   93.7 (1.4)    2.83 (0.12)  2.86 (0.11)

Mixture of Lag-1 Selection Models

        95% CI Width               Coverage Probability       Root Mean Squared Error
  ξ     Normal        Dirichlet    Normal       Dirichlet     Normal       Dirichlet
  0     5.8 (0.4)     5.9 (0.4)    91.3 (1.6)   95.3 (1.2)    1.65 (0.07)  1.40 (0.064)
  0.5   6.2 (0.4)     6.3 (0.4)    92.0 (1.6)   95.3 (1.2)    1.76 (0.07)  1.51 (0.068)
  1     6.8 (0.5)     6.8 (0.5)    90.7 (1.7)   95.3 (1.2)    2.00 (0.08)  1.67 (0.076)
  1.5   7.6 (0.6)     7.6 (0.7)    90.3 (1.7)   95.3 (1.2)    2.28 (0.09)  1.86 (0.085)
  2     8.6 (0.7)     8.5 (0.8)    91.7 (1.6)   95.3 (1.2)    2.51 (0.10)  2.10 (0.095)

Skew-T Copula Lag-2 Selection Model

        95% CI Width               Coverage Probability       Root Mean Squared Error
  ξ     Normal        Dirichlet    Normal       Dirichlet     Normal       Dirichlet
  0     5.4 (0.4)     5.5 (0.9)    89.9 (1.7)   95.6 (1.2)    1.62 (0.07)  1.35 (0.06)
  0.5   5.9 (0.5)     6.0 (1.1)    91.2 (1.6)   96.6 (1.0)    1.69 (0.07)  1.51 (0.06)
  1     6.5 (0.6)     6.7 (1.3)    88.9 (1.8)   95.3 (1.2)    2.03 (0.09)  1.65 (0.07)
  1.5   7.3 (0.7)     7.6 (1.6)    88.6 (1.8)   94.9 (1.3)    2.34 (0.10)  1.86 (0.08)
  2     8.4 (0.8)     8.9 (1.9)    86.9 (2.0)   96.0 (1.1)    2.35 (0.12)  2.10 (0.06)
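Entries of the kind shown in Table B-1 are standard Monte Carlo summaries over simulation replicates; a sketch of how such summaries can be computed, with a delta-method standard error for the RMSE (variable names are illustrative, not from the text):

```python
import numpy as np

def mc_summaries(est, lo, hi, truth):
    """CI width, coverage (%), and RMSE across replicates, each paired
    with a Monte Carlo standard error."""
    n = len(est)
    width = hi - lo
    cover = ((lo <= truth) & (truth <= hi)).astype(float)
    sq_err = (est - truth) ** 2
    rmse = np.sqrt(sq_err.mean())
    se_rmse = sq_err.std(ddof=1) / np.sqrt(n) / (2 * rmse)  # delta method
    return {
        "width": (width.mean(), width.std(ddof=1) / np.sqrt(n)),
        "coverage_pct": (100 * cover.mean(), 100 * cover.std(ddof=1) / np.sqrt(n)),
        "rmse": (rmse, se_rmse),
    }
```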

Figure B-1. Dataset generated under M2 (scatterplot matrix of Y1 through Y6).

2013) with location 0, scale 1, degrees of freedom $\nu$, and shape $\omega$:
\[
f(z \mid \nu, \omega) = 2\, t_\nu(z)\, T_{\nu+1}\!\left( \omega z \sqrt{\frac{\nu + 1}{z^2 + \nu}} \right),
\]
where $t_\nu$ is the Student's $t$ density with $\nu$ degrees of freedom and $T_{\nu+1}$ is the Student's $t$ distribution function with $\nu + 1$ degrees of freedom. We set $\nu = 15$ for each component and $\omega = (10, 0, 10, 0, 10, 0)$ to induce a nonlinear relationship between components. The data were then returned approximately to their original scale by multiplying by 15. Sample

129 Skew-T

-60 -20 20 -40 0 20 -60 -20 0 20 40

Y1 20 0 20

Y2 -20 -60 30 Y3 10 0 20

0 Y4 -40 30 Y5 10 0 20 0

-20 Y6 -60

0 10 30 50 0 10 20 30 40 0 10 30

Figure B-2. Dataset generated under M3 (scatterplot matrix of Y1 through Y6).

datasets generated under M2 and M3 are displayed in Figures B-1 and B-2, and detailed simulation results are given in Table B-1.

B.4 Exponential Tilting

As an alternative to the transformation-based sensitivity analysis proposed for the SCT data in Chapter 4, we describe an implementation of the exponential tilting approach. While we do not use this approach in the SCT data analysis, it is frequently used in practice (Scharfstein et al., 2013). The following simple fact will be useful.

Proposition B.1. Let $\mathcal N(\cdot \mid \mu, \sigma^2)$ denote the Gaussian density with mean $\mu$ and variance $\sigma^2$. Then,
\[
\mathcal N(y \mid \mu, \sigma^2)\, e^{a + by} = \mathcal N(y \mid \mu + b\sigma^2, \sigma^2) \exp\!\left( a + b\mu + \frac{b^2 \sigma^2}{2} \right).
\]
Proof. The result follows from completing the square in the exponential of this expression and routine algebra.
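Proposition B.1 can also be checked numerically; a quick verification at a few arbitrary points:

```python
import numpy as np
from scipy.stats import norm

# Check the tilting identity of Proposition B.1 at arbitrary parameter values.
mu, sigma, a, b = 1.3, 0.7, -0.4, 2.1
for y in (-1.0, 0.5, 2.2):
    lhs = norm.pdf(y, mu, sigma) * np.exp(a + b * y)
    rhs = norm.pdf(y, mu + b * sigma**2, sigma) * np.exp(a + b * mu + b**2 * sigma**2 / 2)
    assert np.isclose(lhs, rhs)
```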

Within NFD, the identifying restriction made is
\[
p(y_j \mid \bar y_{j-1}, s = j - 1) \propto p(y_j \mid \bar y_{j-1}, s \ge j) \exp\{ q_j(y_j; \xi) \}.
\]

As an additional simplification, we have assumed that $q_j$ depends only on $y_j$. The function $q_j(\cdot\,; \xi)$ is user-specified and should be elicited from a subject-matter expert. Given our control over $q_j$, it does not seem overly restrictive to choose $q_j$ to be a linear spline,
\[
q_j(x) = \sum_{d=1}^D [a_d + b_d x]\, I(\ell_d \le x \le u_d). \tag{B--1}
\]
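A linear spline of the form (B--1) is simple to evaluate; a vectorized sketch (the intervals are assumed to cover the range of interest without overlap):

```python
import numpy as np

def q_spline(x, a, b, lo, hi):
    """Piecewise-linear tilt q(x) = sum_d (a_d + b_d x) * 1{lo_d <= x <= hi_d}."""
    x = np.asarray(x, dtype=float)
    active = (lo[:, None] <= x) & (x <= hi[:, None])      # D x len(x) indicators
    return ((a[:, None] + b[:, None] * x) * active).sum(axis=0)
```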

During MCMC we sample approximations of $p_{\mathrm{obs}}$ which are finite mixtures induced by (4--1). Let $\mu_{1:j}^{(k)}$ and $\Sigma_{1:j,1:j}^{(k)}$ denote the mean and covariance of $\bar Y_j$ given that $Y$ is drawn from mixture component $k$; recall that $\rho_j^{2(k)}$ denotes the conditional variance of $Y_j$ given $\bar Y_{j-1}$ within cluster $k$, and that the $\phi_{\ell j}^{(k)}$'s denote the autoregressive parameters of $\Sigma^{(k)}$. For fixed $\bar y_{j-1}$, define the constants
\[
\begin{aligned}
m_k &= \mu_j^{(k)} + \sum_{\ell=1}^{j-1} \phi_{\ell j}^{(k)} (y_\ell - \mu_\ell^{(k)}), \\
V_k &= w_k\, \mathcal N(\bar y_{j-1} \mid \mu_{1:j-1}^{(k)}, \Sigma_{1:j-1,1:j-1}^{(k)}) \prod_{i < j} \left( 1 - \operatorname{expit}(\zeta_i + \lambda_i^T \bar y_i) \right), \\
W_{kd} &= V_k \exp\!\left( a_d + b_d m_k + \tfrac{1}{2} b_d^2 \rho_j^{2(k)} \right),
\end{aligned} \tag{B--2}
\]
where dependence on $\bar y_{j-1}$ is suppressed.

131 where dependence on y¯j−1 is suppressed. Then

p(y¯ , s j)eqj (yj ) ¯ j p(yj yj−1, s = j 1) = ≥ q (y ) | − p(y¯j, s j)e j j dyj ≥ K 2(k) qj (yj ) Vk (yj mk, ρ )e = R k=1 N | K 2(k) q (y ) Vk (yj mk, ρ )e j j dyj kP=1 N | K D W (y m + b ρ2(k), ρ2(k))I(` y u ) = Pk=1 dR=1 kd j k d d j d , N | 2(k) ≤ 2(≤k) K D ud−mk−bdρ `d−mk−bdρ Wkd Φ Φ P k=1P d=1 ρ(k) − ρ(k) h    i where Φ( ) is the standard GaussianP P distribution function, by Proposition B.1. This is a · mixture of KD truncated Gaussian distributions, with mixture weights given by

2(k) 2(k) ud mk bdρ `d mk bdρ $kd Wkd Φ − − Φ − − . ∝ ρ(k) − ρ(k)     

Hence $p(y_j \mid \bar y_{j-1}, s = j - 1)$ has a known distribution which can be sampled by first selecting a mixture component and then sampling from the associated truncated Gaussian distribution. Robert (1995) gives an efficient algorithm for sampling truncated Gaussian distributions. Summarizing, Algorithm 12 gives an alternative to Algorithm 5.

Algorithm 12. Algorithm to draw $y_j \sim p(y_j \mid \bar y_{j-1}, s = j - 1)$ for the Dirichlet process mixture model, with $q_j$ given by (B--1).

1. Calculate $m_k$ and $W_{kd}$ according to (B--2).
2. Draw $(k^\star, d^\star)$ from the distribution on $\{1, \ldots, K\} \times \{1, \ldots, D\}$ with probabilities
\[
\varpi_{kd} \propto W_{kd} \left[ \Phi\!\left( \frac{u_d - m_k - b_d \rho_j^{2(k)}}{\rho_j^{(k)}} \right) - \Phi\!\left( \frac{\ell_d - m_k - b_d \rho_j^{2(k)}}{\rho_j^{(k)}} \right) \right].
\]
3. Sample $y_j$ from the $\mathcal N(m_{k^\star} + b_{d^\star} \rho_j^{2(k^\star)}, \rho_j^{2(k^\star)})$ distribution truncated to the interval $[\ell_{d^\star}, u_{d^\star}]$.
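Steps 2 and 3 of Algorithm 12 reduce to sampling a categorical index and then a truncated Gaussian; a sketch using scipy's `truncnorm`, which parametrizes the truncation bounds on the standardized scale (the flat component arrays are my own interface):

```python
import numpy as np
from scipy.stats import truncnorm

def sample_trunc_gauss_mixture(weights, means, sds, lowers, uppers, rng):
    """Draw one value from a mixture of truncated Gaussians.

    weights need not be normalized; the remaining arguments are flat
    arrays over the K*D (component, interval) pairs.
    """
    w = np.asarray(weights, dtype=float)
    j = rng.choice(len(w), p=w / w.sum())       # step 2: pick (k*, d*)
    a = (lowers[j] - means[j]) / sds[j]         # standardized lower bound
    b = (uppers[j] - means[j]) / sds[j]         # standardized upper bound
    return truncnorm.rvs(a, b, loc=means[j], scale=sds[j], random_state=rng)
```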

APPENDIX C
APPENDIX TO CHAPTER 5

C.1 Proof of Theorem 5.2

Proof. Regard θ = (T , D, φ, ω) as a vector of latent variables and let h = (a, g), where we recall that a = log α and g = log γ. By definition, the empirical Bayes estimator is the maximizer of

\[
\log m_h(Y) = \log \int f_\theta(Y)\, \pi_h(\theta)\, \mu(d\theta), \tag{C--1}
\]
and a necessary condition for $\hat h$ to be the maximizer is that the gradient of $\log m_h(Y)$ at $\hat h$ be 0. The gradient is
\[
\nabla_h \log m_h(Y) = \frac{\nabla_h \int f_\theta(Y)\, \pi_h(\theta)\, \mu(d\theta)}{\int f_\theta(Y)\, \pi_h(\theta)\, \mu(d\theta)} = \frac{\int \nabla_h \{ f_\theta(Y)\, \pi_h(\theta) \}\, \mu(d\theta)}{\int f_\theta(Y)\, \pi_h(\theta)\, \mu(d\theta)} = \frac{\int S_h(\theta)\, f_\theta(Y)\, \pi_h(\theta)\, \mu(d\theta)}{\int f_\theta(Y)\, \pi_h(\theta)\, \mu(d\theta)} = E_h[S_h(\theta) \mid Y],
\]
where $S_h(\theta) = \nabla_h \log \pi_h(\theta)$, and $E_h[\cdot \mid Y]$ denotes the expectation operator with respect to the posterior distribution of $\theta$ under the hyperparameter value $h$. The interchange of integration and differentiation is justified because the components of the integral which

depend on $h$ are sums over the finite set of permissible values of $(T, D)$. By Theorem 5.1, we can directly calculate
\[
S_h(\theta) = \begin{pmatrix}
T - \sum_{j=1}^J \sum_{i=1}^{n_j} \dfrac{\alpha}{\alpha + i - 1} \\[8pt]
D - \sum_{i=1}^{T} \dfrac{\gamma}{\gamma + i - 1}
\end{pmatrix}.
\]
Setting $E_h[S_h \mid Y]$ equal to 0 gives the result.

C.2 Proof of Lemma 5.1

Proof. Both statements follow easily from the algebraic properties of $\Gamma(x)$. First,
\[
\frac{\alpha^n \Gamma(\alpha)}{\Gamma(\alpha + n)} = \frac{\alpha^n}{(\alpha + n - 1)(\alpha + n - 2) \cdots \alpha} = \prod_{i=0}^{n-1} \frac{\alpha}{\alpha + i} \xrightarrow{\ \alpha \to \infty\ } 1,
\]
proving the first statement of the lemma. To prove the second statement we write
\[
\frac{\alpha \Gamma(\alpha)}{\Gamma(\alpha + n)} = \frac{1}{(\alpha + n - 1)(\alpha + n - 2) \cdots (\alpha + 1)} \xrightarrow{\ \alpha \to 0\ } \frac{1}{\Gamma(n)}.
\]
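The two limits in Lemma 5.1 are easy to check numerically on the log scale with `lgamma`:

```python
from math import lgamma, exp, log

def ratio1(alpha, n):
    """alpha^n * Gamma(alpha) / Gamma(alpha + n), computed on the log scale."""
    return exp(n * log(alpha) + lgamma(alpha) - lgamma(alpha + n))

def ratio2(alpha, n):
    """alpha * Gamma(alpha) / Gamma(alpha + n)."""
    return exp(log(alpha) + lgamma(alpha) - lgamma(alpha + n))
```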

C.3 Proof of Lemma 5.2

Proof. To prove the first statement, first note that $[|\mathcal P_n| = n]$ holds if and only if $\mathcal P_n = \{\{a_1\}, \ldots, \{a_n\}\}$; thus
\[
\Pr(|\mathcal P_n| = n) = \frac{\alpha_n^n \Gamma(\alpha_n)}{\Gamma(\alpha_n + n)} = \prod_{i=0}^{n-1} \frac{1}{1 + i/\alpha_n}.
\]
Then,
\[
\log \Pr(|\mathcal P_n| = n) = -\sum_{i=0}^{n-1} \log(1 + i/\alpha_n) = -\sum_{i=0}^{n-1} \left[ \frac{i}{\alpha_n} + i^2\, O(\alpha_n^{-2}) \right] = -\frac{n(n-1)}{2\alpha_n} + O(n^3/\alpha_n^2).
\]
Taking $\alpha_n = -n(n-1)/[2 \log(1 - \epsilon)]$ gives
\[
-\frac{n(n-1)}{2\alpha_n} + O(n^3/\alpha_n^2) = \log(1 - \epsilon) + O(n^{-1}).
\]
Exponentiating and noting that $e^{O(n^{-1})} = 1 + O(n^{-1})$, we have $\Pr(|\mathcal P_n| = n) = (1 - \epsilon) + O(n^{-1})$.

To prove the second statement, note that $[|\mathcal P_n| = 1]$ occurs if and only if $\mathcal P_n = \{\mathcal A\}$; hence

\[
\Pr(|\mathcal P_n| = 1) = \alpha_n \frac{\Gamma(\alpha_n)\, \Gamma(n)}{\Gamma(\alpha_n + n)} = \frac{\Gamma(\alpha_n + 1)\, \Gamma(n)}{\Gamma(\alpha_n + n)}.
\]
Applying Stirling's formula,
\[
\Pr(|\mathcal P_n| = 1) = \Gamma(\alpha_n + 1)\, \frac{e^{-n}\, n^n \sqrt{2\pi/n}\; [1 + R(n)]}{e^{-(\alpha_n + n)}\, (\alpha_n + n)^{\alpha_n + n} \sqrt{2\pi/(\alpha_n + n)}\; [1 + R(\alpha_n + n)]},
\]
where $R(x)$ is a function such that $R(x) \to 0$ as $x \to \infty$. As $n \to \infty$ and $\alpha_n \to 0$, all terms tend to 1 except the third, for which we have
\[
\frac{n^n}{(\alpha_n + n)^{\alpha_n + n}} = n^{-\alpha_n} \left( \frac{1}{1 + \alpha_n/n} \right)^{\alpha_n + n} = n^{-\alpha_n} [1 + o(1)].
\]
Conclude that
\[
\Pr(|\mathcal P_n| = 1) = n^{-\alpha_n} [1 + o(1)].
\]

Setting $\alpha_n = -\log(1 - \epsilon)/\log n$ gives
\[
\Pr(|\mathcal P_n| = 1) = (1 - \epsilon)[1 + o(1)] = (1 - \epsilon) + o(1),
\]

which completes the proof.
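The two tunings of $\alpha_n$ used in the proof can be checked numerically; a sketch with $\epsilon = 0.1$ and $n = 2000$:

```python
from math import log, exp, lgamma

def prob_all_singletons(alpha, n):
    """Pr(|P_n| = n) = prod_{i=0}^{n-1} alpha / (alpha + i)."""
    p = 1.0
    for i in range(n):
        p *= alpha / (alpha + i)
    return p

def prob_one_block(alpha, n):
    """Pr(|P_n| = 1) = Gamma(alpha + 1) * Gamma(n) / Gamma(alpha + n)."""
    return exp(lgamma(alpha + 1) + lgamma(n) - lgamma(alpha + n))

eps, n = 0.1, 2000
a_big = -n * (n - 1) / (2 * log(1 - eps))   # keeps the all-singleton probability near 1 - eps
a_small = -log(1 - eps) / log(n)            # keeps the single-block probability near 1 - eps
```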

C.4 Impropriety of Posterior Under an Improper Prior

In this section, we formalize the notion that there are no useful improper priors for estimating the concentration parameters in the HDP. We note that our development

applies to Dirichlet process mixtures as well. Recall that $m_{\alpha,\gamma}(Y)$ denotes the marginal likelihood of the data $Y$ as a function of $(\alpha, \gamma)$.

Proposition C.1. Consider the HDP model given by (5--1), and suppose the following: $(\alpha, \gamma) \sim \pi$, where $\pi$ is a density on $(0, \infty) \times (0, \infty)$; $\omega \sim \nu$; $(\alpha, \gamma)$ and $\omega$ are independent; and the kernels $\{f_\psi\}$ are all mutually absolutely continuous. If $\pi$ is continuous and improper, then for every $\psi$ and $f_\psi$-almost every $Y$, $\int \pi(\alpha, \gamma)\, m_{\alpha,\gamma}(Y)\, d\alpha\, d\gamma = \infty$, so the posterior, given formally by $\pi(\alpha, \gamma \mid Y) \propto \pi(\alpha, \gamma)\, m_{\alpha,\gamma}(Y)$, is improper.

Proof. Let $\pi(\alpha, \gamma)$ denote a prior on $(\alpha, \gamma)$ which is improper and continuous on $(0, \infty) \times (0, \infty)$. Then, for any $C > 0$,

\[
\int \pi(\alpha, \gamma)\, d\alpha\, d\gamma = \int_{(0,C) \times (0,C)} \pi\, d\alpha\, d\gamma + \int_{(0,C) \times (C,\infty)} \pi\, d\alpha\, d\gamma + \int_{(C,\infty) \times (0,C)} \pi\, d\alpha\, d\gamma + \int_{(C,\infty) \times (C,\infty)} \pi\, d\alpha\, d\gamma = \infty.
\]
At least one term in the summation above is infinite; we assume for simplicity that the term associated with $(C, \infty) \times (C, \infty)$ is infinite for all $C > 0$,
\[
\int_{(C,\infty) \times (C,\infty)} \pi(\alpha, \gamma)\, d\alpha\, d\gamma = \infty. \tag{C--2}
\]
This covers the usual "flat" improper priors $\pi(\alpha, \gamma) = 1$ and $\pi(\alpha, \gamma) = (\alpha\gamma)^{-1}$. A full proof proceeds by cases, with each case handled in a similar manner. By Proposition 5.1, the marginal likelihood is given by
\[
m_{\alpha,\gamma}(Y) = \sum_{T, D} \left[ \prod_{d \in D} \Gamma(|d|) \right] \frac{\Gamma(\gamma)\, \gamma^D}{\Gamma(\gamma + T)} \prod_{j=1}^J \left[ \prod_{t \in T_j} \Gamma(|t|) \right] \frac{\Gamma(\alpha)\, \alpha^{T_j}}{\Gamma(\alpha + n_j)} \int \lambda(Y; T, D, \omega)\, \nu(\omega)\, d\omega. \tag{C--3}
\]
Letting $\alpha \to \infty$ and $\gamma \to \infty$, an application of Lemma 5.1 shows that the terms $\Gamma(\gamma)\gamma^D/\Gamma(\gamma + T)$ and $\Gamma(\alpha)\alpha^{T_j}/\Gamma(\alpha + n_j)$ tend to $I(D = T)$ and $I(T_j = n_j)$, respectively.

The only value of $(T, D)$ for which these terms are nonzero assigns every $Y_{ij}$ to a unique mixture component; let $(T^\star, D^\star)$ denote this value. Hence, $f_\psi$-almost surely,
\[
\lim_{\alpha \to \infty} \lim_{\gamma \to \infty} m_{\alpha,\gamma}(Y) = \int \lambda(Y; T^\star, D^\star, \omega)\, \nu(\omega)\, d\omega = \int \prod_{j=1}^J \prod_{i=1}^{n_j} \left[ \int f_\phi(Y_{ij})\, H_\omega(d\phi) \right] \nu(\omega)\, d\omega > 0.
\]

The above argument shows that for sufficiently large $(\alpha, \gamma)$ we have $m_{\alpha,\gamma}(Y) > \delta$ for some $\delta > 0$; let $C$ be such that this holds whenever $\alpha, \gamma > C$. We then have
\[
\int_0^\infty \int_0^\infty \pi(\alpha, \gamma)\, m_{\alpha,\gamma}(Y)\, d\alpha\, d\gamma \ge \int_C^\infty \int_C^\infty \pi(\alpha, \gamma)\, m_{\alpha,\gamma}(Y)\, d\alpha\, d\gamma \ge \delta \int_C^\infty \int_C^\infty \pi(\alpha, \gamma)\, d\alpha\, d\gamma = \infty,
\]
with the last equality following by (C--2).

136 REFERENCES Albert, P. S. (2000). A transitional model for longitudinal binary data subject to nonignorable missing data. Biometrics, 56(2):602–608. Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, pages 1152–1174. Atchad´e,Y. F. (2011). A computational framework for empirical Bayes inference. Statistics and Computing, 21(4):463–473. Azzalini, A. (2013). The Skew-Normal and Related Families, volume 3. Cambridge University Press. Barron, A., Schervish, M. J., Wasserman, L., et al. (1999). The consistency of posterior distributions in nonparametric problems. The Annals of Statistics, 27(2):536–561. Birmingham, J., Rotnitzky, A., and Fitzmaurice, G. M. (2003). Pattern-mixture and selection models for analysing longitudinal data with monotone missing patterns. Journal of the Royal Statistical Society, Series B., 65:275–297. Blackwell, D. and MacQueen, J. B. (1973). Ferguson distributions via p´olya urn schemes. The Annals of Statistics, pages 353–355. Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022. Booth, J. G. and Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society: Series B, 61(1):265–285. Buta, E. and Doss, H. (2011). Computational approaches for empirical Bayes methods and Bayesian sensitivity analysis. The Annals of Statistics, 39(5):2658–2685. Canale, A. and De Blasi, P. (2013). Posterior consistency of nonparametric location-scale mixtures for multivariate density estimation. arXiv preprint arXiv:1306.2671. Canale, A. and Dunson, D. B. (2015). Bayesian multivariate mixed-scale density estimation. Statistics and its Interface, 8(2):195–201. Casella, G. (2001). Empirical Bayes Gibbs sampling. Biostatistics, 2(4):485–500. Cowans, P. J. (2004). 
Information retrieval using hierarchical Dirichlet processes. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 564–565. Daniels, M. J. and Hogan, J. W. (2000). Reparameterizing the pattern mixture model for sensitivity analyses under informative dropout. Biometrics, 56(4):1241–1248.

137 Daniels, M. J. and Hogan, J. W. (2008). Missing Data In Longitudinal Studies. Chapman and Hall/CRC. Daniels, M. J. and Pourahmadi, M. (2002). Bayesian analysis of covariance matrices and dynamic models for longitudinal data. Biometrika, 89(3):553–566. De Iorio, M., M¨uller,P., Rosner, G. L., and MacEachern, S. N. (2004). An ANOVA model for dependent random measures. Journal of the American Statistical Association, 99(465):205–215. Diaconis, P. and Freedman, D. (1986). On the consistency of Bayes estimates. The Annals of Statistics, pages 1–26. Diggle, P. and Kenward, M. G. (1994). Informative drop-out in longitudinal data analysis. Applied Statistics, 43:49–73. Doss, H. (1994). Bayesian nonparametric estimation for incomplete data via successive substitution sampling. The Annals of Statistics, pages 1763–1786. Doss, H. (2010). Estimation of large families of Bayes factors from Markov chain output. Statistica Sinica, 20(2):537–560. Doss, H. (2012). Hyperparameter and model selection for nonparametric Bayes problems via Radon-Nikodym derivatives. Statistica Sinica, 22:1–26. Dunson, D. B. (2007). Bayesian methods for latent trait modelling of longitudinal data. Statistical Methods in Medical Research. Dunson, D. B. and Perreault, S. D. (2001). Factor analytic models of clustered multivariate data with informative censoring. Biometrics, 57(1):302–308. Escobar, M. D. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577–588. Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1:209–230. Frangakis, C. E. and Rubin, D. B. (1999). Addressing complications of intention-to-treat analysis in the combined presence of all-or-none treatment-noncompliance and subsequent missing outcomes. Biometrika, 86(2):365–379. Freedman, D. A. (1963). On the asymptotic behavior of Bayes’ estimates in the discrete case. The Annals of , pages 1386–1403. 
Geisser, S. and Eddy, W. F. (1979). A predictive approach to model selection. Journal of the American Statistical Association, 74(365):153–160. Gelman, A. and Hill, J. (2006). Data Analysis Using Regression and Multi- level/Hierarchical Models. Cambridge University Press.

138 Gelman, A. and Meng, X.-L. (1998). Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical Science, 13:163–185. Geyer, C. J. (2011). Importance sampling, simulated tempering, and umbrella sampling. In Handbook of Markov Chain Monte Carlo, pages 301–318. CRC press. Geyer, C. J. and Thompson, E. A. (1995). Annealing Markov chain Monte Carlo with applications to ancestral inference. Journal of the American Statistical Association, 90(431):909–920. Ghosal, S., Ghosh, J. K., and Ramamoorthi, R. V. (1999). Posterior consistency of Dirichlet mixtures in density estimation. Annals of Statistics, 27(1):143–158. Ghosal, S., Ghosh, J. K., and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Annals of Statistics, 28(2):500–531. Ghosh, J. K., Delampady, M., and Samanta, T. (2007). An introduction to Bayesian analysis: theory and methods. Springer Science & Business Media. Ghosh, J. K. and Ramamoorthi, R. V. (2003). Bayesian Nonparametrics. Springer. Griffiths, T. L. and Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(1):5228–5235. Hanson, T. E., Kottas, A., and Branscum, A. J. (2008). Modelling stochastic order in the analysis of receiver operating characteristic data: Bayesian non-parametric approaches. Journal of the Royal Statistical Society: Series C (Applied Statistics), 57(2):207–225. Harel, O. and Schafer, J. L. (2009). Partial and latent ignorability in missing-data problems. Biometrika, 96:37–50. Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47:153–161. Henderson, R., Diggle, P. J., and Dobson, A. (2000). Joint modelling of longitudinal measurements and event time data. Biostatistics (Oxford), 1:465–480. Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1). Hobert, J. P. and Casella, G. (1996). 
The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91(436):1461–1473. Hogan, J. W., Daniels, M. J., and Hu, L. (2014). A Bayesian perspective on assessing sensitivity to assumptions about unobserved data. In Molenberghs, G., Fitzmaurice, G., Kenward, M. G., Tsiatis, A., and Verbeke, G., editors, Handbook of Missing Data Methodology. CRC Press.

139 Hogan, J. W. and Laird, N. M. (1997). Mixture models for the joint distribution of repeated measures and event times. Statistics in Medicine, 16:239–257. Ibrahim, J. G., Chen, M.-H., and Lipsitz, S. R. (2001). Missing responses in generalised linear mixed models when the missing data mechanism is nonignorable. Biometrika, 88(2):551–564. Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96:161–173. Jaynes, E. T. (1996). : The Logic of Science. Washington University St. Louis, MO. Kang, J. D. Y. and Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, pages 523–539. Karabatsos, G. and Walker, S. G. (2012). A Bayesian nonparametric causal model. Journal of Statistical Planning and Inference, 142(4):925–934. Kay, S. R., Flszbein, A., and Opfer, A. L. (1987). The positive and negative syndrome scale (PANSS) for schizophrenia. Schizophrenia Bulletin, 13(2):261. Kenward, M. G., Molenberghs, G., and Thijs, H. (2003). Pattern-mixture models with proper time dependence. Biometrika, 90:53–71. Kleinman, K. P. and Ibrahim, J. G. (1998). A semiparametric Bayesian approach to the random effects model. Biometrics, pages 921–938. Kong, A., McCullagh, P., Meng, X.-L., Nicolae, D., and Tan, Z. (2003). A theory of statistical models for Monte Carlo integration. Journal of the Royal Statistical Society: Series B, 65(3):585–604. Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, pages 79–86. Liang, F., Liu, C., and Carroll, R. J. (2007). Stochastic approximation in Monte Carlo computation. Journal of the American Statistical Association, 102(477):305–320. Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73(1):13–22. Lin, H., McCulloch, C. E., and Rosenheck, R. A. 
(2004). Latent pattern mixture models for informative intermittent missing data in longitudinal studies. Biometrics, 60:295–305. Linero, A. R. and Daniels, M. J. (2015). A flexible Bayesian approach to monotone missing data in longitudinal studies with informative dropout with application

140 to a schizophrenia clinical trial. Journal of the American Statistical Association, 110(1):45–55. Little, R. J. A. (1988). A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association, 83(404):1198–1202. Little, R. J. A. (1993). Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association, 88:125–134. Little, R. J. A. (1994). A class of pattern-mixture models for normal incomplete data. Biometrika, 81:471–483. Liu, J. S. (1996). Nonparametric hierarchical Bayes via sequential imputations. The Annals of Statistics, pages 911–930. Lo, A. Y. (1984). On a class of Bayesian nonparametric estimates: I. density estimates. The annals of statistics, 12(1):351–357. Lopes, H. F., M¨uller,P., and Rosner, G. L. (2003). Bayesian meta-analysis for longitudinal data models using multivariate mixture priors. Biometrics, 59(1):66–75. MacEachern, S. N. (1999). Dependent nonparametric processes. In ASA Proceedings of the Section on Bayesian Statistical Science, pages 50–55. Manski, C. F. (2009). Identification for prediction and decision. Harvard University Press. Marinari, E. and Parisi, G. (1992). Simulated tempering: a new Monte Carlo scheme. Europhysics Letters, 19(6):451–458. Meng, X.-L. and Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statistica Sinica, 6(4):831–860. Mohan, K., Pearl, J., and Tian, J. (2013). Graphical models for inference with missing data. In Advances in neural information processing systems, pages 1277–1285. Molenberghs, G., Beunckens, C., Sotto, C., and Kenward, M. G. (2008). Every missingness not at random model has a missingness at random counterpart with equal fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(2):371–388. Molenberghs, G., Kenward, M. G., and Lesaffre, E. (1997). 
The analysis of longitudinal ordinal data with non-random dropout. Biometrika, 84:33–44. Muliere, P. and Tardella, L. (1998). Approximating distributions of random functionals of Ferguson-Dirichlet priors. Canadian Journal of Statistics, 26(2):283–297. M¨uller,P., Quintana, F., and Rosner, G. (2004). A method for combining inference across related nonparametric bayesian models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(3):735–749.

141 National Research Council (2010). The Prevention and Treatment of Missing Data in Clinical Trials. The National Academies Press. Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249–265. Neal, R. M. (2003). Slice sampling. The Annals of Statistics, 31:705–767. Papaspiliopoulos, O. and Roberts, G. O. (2008). Retrospective Markov chain Monte Carlo methods for dirichlet process hierarchical models. Biometrika, 95(1):169–186. Pati, D., Dunson, D. B., and Tokdar, S. T. (2013). Posterior consistency in conditional distribution estimation. Journal of multivariate analysis, 116:456–472. Pati, D., Reich, B. J., and Dunson, D. B. (2011). Bayesian geostatistical modelling with informative sampling locations. Biometrika, 98(1):35–48. Pitman, J. (2002). Combinatorial stochastic processes. Technical Report 621, Department of Statistics, University of California, Berkeley. Pitman, J. and Yor, M. (1997). The two-parameter poisson-dirichlet distribution derived from a stable subordinator. The Annals of Probability, pages 855–900. Rasmussen, C. E. (2000). The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems, volume 12. Ritov, Y., Bickel, P., Gamst, A., Kleijn, B., et al. (2014). The bayesian analysis of complex, high-dimensional models: Can it be coda? Statistical Science, 29(4):619–639. Robert, C. P. (1995). Simulation of truncated normal variables. Statistics and computing, 5(2):121–125. Robins, J. M. (1986). A new approach to causal inference in mortality studies with sustained exposure periods – application to control of the healthy worker survivor effect. Math Modeling, 7:1393–1512. Robins, J. M. and Gill, R. D. (1997). Non-response models for the analysis of non-monotone ignorable missing data. Statistics in medicine, 16(1):39–56. Robins, J. M. and Ritov, Y. (1997). 
Toward a curse of dimensionality appropriate(CODA) asymptotic theory for semi-parametric models. Statistics in medicine, 16(3):285–319. Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1995). Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association, 90(429):106–121. Rodriguez, A. and Dunson, D. B. (2011). Nonparametric bayesian models through probit stick-breaking processes. Bayesian analysis (Online), 6(1).

Rodriguez, A., Dunson, D. B., and Gelfand, A. E. (2008). The nested Dirichlet process. Journal of the American Statistical Association, 103(483):1131–1154.

Rotnitzky, A., Robins, J. M., and Scharfstein, D. O. (1998). Semiparametric regression for repeated outcomes with non-ignorable non-response. Journal of the American Statistical Association, 93:1321–1339.

Roy, J. (2003). Modeling longitudinal data with nonignorable dropouts using a latent dropout class model. Biometrics, 59:441–456.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688.

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63:581–592.

Rubin, D. B. (2005). Causal inference using potential outcomes. Journal of the American Statistical Association, 100(469).

Scharfstein, D. O., Daniels, M. J., and Robins, J. M. (2003). Incorporating prior beliefs about selection bias into the analysis of randomized trials with missing outcomes. Biostatistics, 4:495.

Scharfstein, D. O., McDermott, A., Olson, W., and Weigand, F. (2013). Global sensitivity analysis for repeated measures studies with informative drop-out. Technical report, Johns Hopkins Bloomberg School of Public Health.

Scharfstein, D. O., Rotnitzky, A., and Robins, J. M. (1999). Adjusting for nonignorable dropout using semiparametric nonresponse models. Journal of the American Statistical Association, 94.

Schwartz, L. (1965). On Bayes procedures. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 4(1):10–26.

Seaman, S., Galati, J., Jackson, D., Carlin, J., et al. (2013). What is meant by missing at random? Statistical Science, 28(2):257–268.

Sethuraman, J. (1961). Some limit theorems for joint distributions. Sankhyā: The Indian Journal of Statistics, Series A, pages 379–386.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650.

Sethuraman, J. and Tiwari, R. C. (1982). Convergence of Dirichlet measures and the interpretation of their parameter. In Gupta, S. S. and Berger, J. O., editors, Proceedings of the Third Purdue Symposium on Statistical Decision Theory and Related Topics. Academic Press, New York.

Steck, G. P. (1957). Limit theorems for conditional distributions. University of California Press.

Sweeting, T. (1989). On conditional weak convergence. Journal of Theoretical Probability, 2(4):461–474.

Taddy, M. A. (2008). Bayesian Nonparametric Analysis of Conditional Distributions and Inference for Poisson Point Processes. PhD thesis, University of California, Santa Cruz.

Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.

Teh, Y. W., Kurihara, K., and Welling, M. (2007). Collapsed variational inference for HDP. In Advances in Neural Information Processing Systems, pages 1481–1488.

Thijs, H., Molenberghs, G., Michiels, B., Verbeke, G., and Curran, D. (2002). Strategies to fit pattern-mixture models. Biostatistics, 3:245–265.

Troxel, A. B., Harrington, D. P., and Lipsitz, S. R. (1998). Analysis of longitudinal data with non-ignorable non-monotone missing values. Journal of the Royal Statistical Society: Series C (Applied Statistics), 47(3):425–438.

Tsiatis, A. A. (2006). Semiparametric Theory and Missing Data. New York: Springer-Verlag.

Tsiatis, A. A., Davidian, M., and Cao, W. (2011). Improved doubly robust estimation when data are monotonely coarsened, with application to longitudinal studies with dropout. Biometrics, 67:536–545.

van der Laan, M. J. and Robins, J. M. (2003). Unified Methods for Censored Longitudinal Data and Causality. Springer.

van der Laan, M. J. and Rose, S. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Science & Business Media.

van der Laan, M. J. and Rubin, D. (2006). Targeted maximum likelihood learning. The International Journal of Biostatistics, 2(1).

van der Vaart, A. W. (2000). Asymptotic Statistics, volume 3. Cambridge University Press.

Vansteelandt, S., Goetghebeur, E., Kenward, M. G., and Molenberghs, G. (2006). Ignorance and uncertainty regions as inferential tools in a sensitivity analysis. Statistica Sinica, 16:953–979.

Vansteelandt, S., Rotnitzky, A., and Robins, J. M. (2007). Estimation of regression models for the mean of repeated outcomes under nonignorable nonmonotone nonresponse. Biometrika, 94(4):841–860.

Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Communications in Statistics - Simulation and Computation, 36(1):45–54.

Wang, C. and Blei, D. M. (2012). A split-merge MCMC algorithm for the hierarchical Dirichlet process. Available at arXiv:1201.1657.

Wang, C. and Daniels, M. J. (2011). A note on MAR, identifying restrictions, model comparison, and sensitivity analysis in pattern mixture models with and without covariates for incomplete data. Biometrics, 67:810–818.

Wang, C., Paisley, J. W., and Blei, D. M. (2011). Online variational inference for the hierarchical Dirichlet process. In International Conference on Artificial Intelligence and Statistics, pages 752–760.

Wu, M. C. and Carroll, R. J. (1988). Estimation and comparison of changes in the presence of informative right censoring by modeling the censoring process. Biometrics, 45.

Wu, Y., Ghosal, S., et al. (2008). Kullback–Leibler property of kernel mixture priors in Bayesian density estimation. Electronic Journal of Statistics, 2:298–331.

Xu, Y., Mueller, P., Wahed, A. S., and Thall, P. F. (2014). Bayesian nonparametric estimation for dynamic treatment regimes with sequential transition times. arXiv preprint arXiv:1405.2656.

BIOGRAPHICAL SKETCH

Antonio Linero received his bachelor's degree in finance from the University of Florida in May of 2009. He then pursued a PhD in the Department of Statistics at the University of Florida under the supervision of Professor Michael J. Daniels and Professor Hani Doss, completing the degree in August of 2015. That same month, Antonio joined the faculty of Florida State University as an Assistant Professor in the Department of Statistics. His research has primarily focused on Bayesian nonparametrics, with particular interest in missing data in longitudinal regulatory studies and in causal inference problems. His work has paid special attention to the foundational assumptions required to implement these methods, the design of methods that allow one to assess the sensitivity of inferences to these assumptions, and the incorporation of expert knowledge into inference.
