<<

Florida State University Libraries

Electronic Theses, Treatises and Dissertations The Graduate School

2019

A Bayesian Semiparametric Joint Model fPeongrp eLngo Wnangg itudinal and Survival Data

Follow this and additional works at the DigiNole: FSU's Digital Repository. For more information, please contact [email protected] FLORIDA STATE UNIVERSITY

COLLEGE OF ARTS AND SCIENCES

A BAYESIAN SEMIPARAMETRIC JOINT MODEL

FOR LONGITUDINAL AND SURVIVAL DATA

By

PENGPENG WANG

A Dissertation submitted to the Department of Statistics in partial fulfillment of the requirements for the degree of Doctor of Philosophy

2019

Copyright c 2019 Pengpeng Wang. All Rights Reserved. Pengpeng Wang defended this dissertation on April 16, 2019. The members of the supervisory committee were:

Elizabeth H. Slate Professor Co-Directing Dissertation

Jonathan R. Bradley Professor Co-Directing Dissertation

Amy M. Wetherby University Representative

Lifeng Lin Committee Member

The Graduate School has verified and approved the above-named committee members, and certifies that the dissertation has been approved in accordance with university requirements.

ii ACKNOWLEDGMENTS

First of all, I would like to express my sincere gratitude to my major advisor Dr. Elizabeth H. Slate for the continuous support of my PhD study and related research, for her motivation, patience, and immense knowledge. Her guidance and encouragement helped me in all the time of my research. Her elegant and well-organized personality has a great impact on my life. I am very grateful to have her as my advisor and she will always be a role model in my life. I would like to thank my co-advisor Dr. Jonathan R. Bradley for sharing his research idea and expertise so willingly, for his insightful comments and encouragement. I appreciate him for spending time on my research and having discussions with me. Without his guidance and help this dissertation would not have been possible. I am also grateful to the rest of my dissertation committee: Dr. Amy M. Wetherby and Dr. Lifeng Lin for their time, support, guidance and good will throughout the preparation for my de- fense. And thanks for their review of this document. I am especially appreciative of the opportunity to be involved with Dr. Amy M. Wetherby’s research team on multiple projects including “Autism Adaptive Community-based Treatment to Improve Outcomes Using Navigators (ACTION) Net- work” and “Mobilizing Community Systems to Engage Families in Early ASD Detection & Ser- vices,” which provided support for my research. In addition, my sincere thanks go to my family, friends and fellow students who gave me help and support in my PhD study and life in general. Having fun with them made my graduate school life wonderful.

iii TABLE OF CONTENTS

List of Tables ...... vi List of Figures ...... vii Abstract ...... x

1 INTRODUCTION 1

2 BACKGROUND 4 2.1 Gaussian Assumption ...... 4 2.2 Log-Gamma Distribution ...... 5 2.3 Multivariate Log-Gamma Distribution ...... 7 2.4 Conditional Multivariate Log-Gamma Distribution ...... 8 2.5 Slice Sampler ...... 9 2.6 Dirichlet Process ...... 10 2.7 Kaplan-Meier Estimator ...... 12 2.8 Posterior Predictive p-Value ...... 13 2.9 Review of DIC ...... 14

3 A GAUSSIAN JOINT MODEL 15 3.1 Model Formulation ...... 15 3.2 Likelihood ...... 16 3.3 Dirichlet Process Prior ...... 18 3.3.1 Dirichlet Process Prior in the Gaussian Joint Model ...... 18 3.3.2 Concentration Parameter ...... 19 3.4 Prior Distributions ...... 20 3.5 Gibbs Sampler ...... 21

4 LOG-GAMMA JOINT MODEL 26 4.1 Model Formulation ...... 26 4.2 Likelihood ...... 27 4.3 Dirichlet Process Prior in the Log-Gamma Joint Model ...... 29 4.4 Priors and Hyperpriors ...... 30 4.5 Gibbs Sampler ...... 31

5 SIMULATION STUDY 38 5.1 Gaussian Joint Model ...... 38 5.1.1 Simulation Design ...... 39 5.1.2 Simulation Results ...... 41 5.1.3 Discussion ...... 42 5.2 Log-Gamma Joint Model ...... 48 5.2.1 Simulation Design ...... 48 5.2.2 Simulation Results ...... 50

iv 5.2.3 Discussion ...... 51 5.3 Model Comparison ...... 57 5.3.1 Comparison between Gaussian Joint Model and Log-Gamma Joint Model . . 57 5.3.2 Comparison with Parametric Models ...... 65

6 APPLICATIONS 68 6.1 HIV Data ...... 68 6.1.1 Data Description ...... 68 6.1.2 Results Using Joint Models ...... 69 6.1.3 Diagnosis ...... 72 6.2 PSA Data ...... 78 6.2.1 Data Description ...... 78 6.2.2 Results Using Joint Models ...... 79 6.2.3 Diagnosis ...... 84 6.3 Discussion ...... 85

7 CONCLUSION AND FUTURE WORK 87

Appendix A ADDITIONAL DETAILS ON THE MULTIVARIATE LOG-GAMMA DISTRI- BUTION 89 A.1 Marginal Multivariate Log-Gamma Distribution ...... 89 A.2 Data Augmentation Strategies for the cMLG Distribution ...... 90 A.3 Proof of Proposition in Section 2.2 ...... 94 A.4 Proof of Proposition in Section 2.3 ...... 95 A.5 Proof in Section 2.4 ...... 95

B IRB Approvals 96

Bibliography ...... 98 Biographical Sketch ...... 102

v LIST OF TABLES

5.1 Parameter estimates for simulation study of Case 1-3 (Gaussian distributed data) using model GaussianMH (the Gaussian joint model with Metropolis-Hastings) . . . . 43

5.2 Parameter estimates for simulation study of Case 1-3 (Gaussian distributed data) using model GaussianSS (the Gaussian joint model with slice sampler) ...... 44

5.3 MSE for the Gaussian joint model and the log-gamma joint model ...... 55

5.4 Parameter estimates for simulation study of Case 4-6 (log-gamma distributed data) using the log-gamma joint model...... 56

5.5 Clustering evaluation for the Gaussian joint model and the log-gamma joint model . . 61

5.6 Contingency table for the Rand index...... 61

5.7 Effective sample size for the Gaussian joint model and the log-gamma joint model . . 66

6.1 Parameter estimates for the HIV data using the Gaussian joint model with slice sam- pler and the log-gamma joint model...... 72

6.2 Parameter estimates for the PSA data using the Gaussian joint model with slice sampler and the log-gamma joint model...... 81

vi LIST OF FIGURES

2.1 Kernel density plots of the log-gamma distributions and the standard Gaussian dis- tribution...... 6

5.1 Histograms of simulated survival time. The number of clusters is indicated in the title heading of each panel...... 39

5.2 Trace plots and density curves for the last 3,000 iterations using model GaussianSS in Case 3. (a) MCMC trace plot of β01 (intercept for the first subject). (b) Posterior density estimate of β01 from model GaussianSS (solid line) and the true value of β01 (dashed line). (c) MCMC trace plot of β11 (slope for the first subject). (d) Posterior density estimates of β11 from model GaussianSS (solid line) and the true value of β11 (dashed line)...... 45

5.3 Trace plots and density curves for the last 3,000 iterations using model GaussianSS in Case 3. (a) MCMC trace plot of γ (link parameter). (b) Posterior density estimate of γ from model GaussianSS (solid line) and the true value of γ (dashed line). (c) MCMC trace plot of α (covariate parameter). (d) Posterior density estimates of α from model GaussianSS (solid line) and the true value of α (dashed line)...... 46

5.4 True longitudinal observations vs. estimated longitudinal trajectories using model GaussianSS in Case 3 (the simulation study with three cluster Gaussian distributied data for the Gaussian joint model)...... 47

5.5 True hazard rate vs. estimated hazard rate from model GaussianSS in Case 3 (the simulation study with three cluster Gaussian distributied data for the Gaussian joint model)...... 47

5.6 Histograms of simulated survival time. The number of clusters is indicated in the title heading of each panel...... 49

5.7 Trace plots and density curves for the last 3,000 iterations using the log-gamma joint model in Case 6. (a) MCMC trace plot of β0. (b) Posterior density estimates of β0 from the semiparametric log-gamma joint model (solid line) and the true value of β0 (dashed line). (c) MCMC trace plot of β1. (d) Posterior density estimates of β1 from the semiparametric log-gamma joint model (solid line) and the true value of β1 (dashed line)...... 52

5.8 Trace plots and density curves for the last 3,000 iterations using the log-gamma joint model in Case 6. (a) MCMC trace plot of γ. (b) Posterior density estimates of γ from the semiparametric log-gamma joint model (solid line) and the true value of γ (dashed line). (c) MCMC trace plot of β1. (d) Posterior density estimates of α from the semiparametric log-gamma joint model (solid line) and the true value of α (dashed line)...... 53

vii 5.9 True longitudinal observations vs. estimated longitudinal trajectories using the log- gamma joint model in Case 6 (three cluster simulation study of log-gamma distributed data)...... 54

5.10 True hazard rate vs. estimated hazard rate using the log-gamma joint model in Case 6 (three cluster simulation study of log-gamma distributed data)...... 54

5.11 True vs. estimated longitudinal trajectories and hazard rates at cluster level in Case 3. The red solid lines are the true trajectories and hazard rates in each cluster based on true cluster assignments. The blue dash lines are the estimated trajectories and hazard rates. (a) and (b) are the results from model GaussianMH; (c) and (d) are the results from model GaussianSS; (e) and (f) are the results from model LG...... 62

5.12 True vs. estimated longitudinal trajectories and hazard rates at cluster level in Case 6. The red solid lines are the true trajectories and hazard rates in each cluster based on true cluster assignments. The blue dash lines are the estimated trajectories and hazard rates. (a) and (b) are the results from model GaussianMH; (c) and (d) are the results from model GaussianSS; (e) and (f) are the results from model LG...... 63

5.13 Comparison of the semiparametric joint model and the parametric Cox models in Case 6 (three cluster simulation with log-gamma distributed data). The black solid line is the true cumulative hazard. The dash lines are the estimated cumulative hazard of the log-gamma semiparametric joint model (red dash line) and the estimated cumulative hazard of the parametric Cox model with longitudinal effect (blue dash line) and without longitudinal effect (green dash line)...... 67

6.1 Observed trajectories of CD4 cells counts for all 467 patients in the original scale and the square root scale...... 69

6.2 Trace plots and density curves using the Gaussian joint model with slice sampler for the HIV data. (a) MCMC trace plot of trajectory intercept β0. (b) Posterior density estimates of trajectory intercept β0. (c) MCMC trace plot of trajectory slope β1. (d) Posterior density estimates of trajectory slope β1...... 73

6.3 Trace plots and density curves using the Gaussian joint model with slice sampler for the HIV data. (a) MCMC trace plot of the link parameter γ. (b) Posterior density estimates of the link parameter γ. (c) MCMC trace plot of covariate coefficients α. (d) Posterior density estimates of covariate coefficients α...... 74

6.4 Trace plots and density curves using the log-gamma joint model for the HIV data. (a) MCMC trace plot of trajectory intercept β0. (b) Posterior density estimates of trajectory intercept β0. (c) MCMC trace plot of trajectory slope β1. (d) Posterior density estimates of trajectory slope β1...... 75

6.5 Trace plots and density curves using the Gaussian joint model for the HIV data. (a) MCMC trace plot of the link parameter γ. (b) Posterior density estimates of the link

viii parameter γ. (c) MCMC trace plot of covariate coefficients α. (d) Posterior density estimates of covariate coefficients α...... 76

6.6 Observed square root of CD4 cells counts (black circles) vs. estimated longitudinal trajectories of randomly selected 4 patients using the Gaussian joint model with slice sampler (blue dash line) and the log-gamma joint model (red dash line)...... 77

6.7 Predicted CD4 trajectories, predicted hazard rate and predicted survival probability for each cluster by using the Gaussian joint model with slice sampler and the log- gamma joint model for the AIDS data...... 77

6.8 QQ-plot of residuals using Gaussian joint model and log-gamma joint model for the AIDS data...... 78

6.9 Observed trajectories of PSA readings for 647 subjects who have more than three PSA readings...... 79

6.10 Trace plots and density curves using the Gaussian joint model with slice sampler for the PSA data. (a) MCMC trace plot of trajectory intercept β0. (b) Posterior density estimates of trajectory intercept β0. (c) MCMC trace plot of trajectory slope β1. (d) Posterior density estimates of trajectory slope β1. (e) MCMC trace plot of the link parameter γ. (f) Posterior density estimates of the link parameter γ...... 82

6.11 Trace plots and density curves using the log-gamma joint model for the PSA data. (a) MCMC trace plot of trajectory intercept β0. (b) Posterior density estimates of trajectory intercept β0. (c) MCMC trace plot of trajectory slope β1. (d) Posterior density estimates of trajectory slope β1. (e) MCMC trace plot of the link parameter γ. (f) Posterior density estimates of the link parameter γ...... 83

6.12 Observed PSA readings (black circles) vs. estimated PSA longitudinal trajectories of randomly selected 4 subjects by using the Gaussian joint model with slice sampler (blue dash line) and the log-gamma joint model (red dash line)...... 84

6.13 QQ-plot of residuals using Gaussian joint model and log-gamma joint model for the PSA data ...... 85

6.14 Predicted longitudinal trajectories, predicted hazard rate and predicted survival prob- ability for each cluster by using the Gaussian joint model with slice sampler and the log-gamma joint model for the PSA data. In the bottom left panel we truncate the trajectory of the cluster labeled with yellow at the largest observed time of PSA read- ings. We do this because a high percentage of individuals in this cluster were diagnosed with prostate cancer early, and hence, feel less confident in making estimates after this time point...... 86

ix ABSTRACT

Many biomedical studies monitor both a longitudinal marker and a survival time on each subject under study. Modeling these two endpoints as joint responses has potential to improve the inference for both. We consider the approach of Brown and Ibrahim [6] that proposes a Bayesian hierarchical semiparametric joint model. The model links the longitudinal and survival outcomes by incorpo- rating the mean longitudinal trajectory as a predictor for the survival time. The usual parametric mixed effects model for the longitudinal trajectory is relaxed by using a Dirichlet process prior on the coefficients. A Cox proportional hazards model is then used for the survival time. The com- plicated joint likelihood increases the computational complexity. We develop a computationally efficient method by using a multivariate log-gamma distribution instead of Gaussian distribution to model the data. We use Gibbs sampling combined with Neal’s algorithm [28] and the slice sampler for inference. Simulation studies illustrate the procedure and compare this log-gamma joint model with the Gaussian joint models. We apply this joint modeling method to a human immunodeficiency virus (HIV) data and a prostate-specific antigen (PSA) data.

x CHAPTER 1

INTRODUCTION

Longitudinal data, sometimes referred to as panel data, is a collection of repeated observations of the same subjects over short or long time periods. The subjects can be individuals, households, establishments, etc. Longitudinal data follows the same subjects over time and it’s useful for measuring of within-subject change over time, while cross-sectional data measures different subjects at each time point. In clinical trials, patients are often measured in scheduled follow-up time points. This kind of repeated observations of the same patients, over a period of time is also referred to as longitudinal data. We are interested in longitudinal measurements, since in this setting, we can use repeated observations to assess a within-patient change to quantify a treatment effect. For example, human immunodeficiency virus (HIV) patients may be followed over time, and their CD4 cell counts may be measured monthly. We can make inferences for the treatment effect based on the change in their observed CD4 cell counts. In many cases, we are also concerned with time-to-event observations, which is the expected duration of time until one or more events of interest occur. This end point is accompanied with information on “failure” or “survival.” The event can be death, occurrence of a disease, or burnout of light bulbs, among other possibilities. The time-to-event or survival time can be measured in days, weeks, years or any unit of time. For example, if the event of interest is a heart attack, then the survival time can be the time in years until a person develops a heart attack. Longitudinal and survival outcomes may be associated with each other. The change of lon- gitudinal measurements may affect time-to-event. For example, HIV weakens a person’s immune system by destroying important cells that fight disease and infection. CD4 cells could be infected by HIV and we could repeatedly measure CD4 cells count over time on each patient. If a patient’s CD4 count is decreasing, then they might be more likely to be diagnosed with AIDS. On the other hand, if a patient’s CD4 count increases, then the HIV virus might be controlled with effective HIV treatment so that they might be less likely to be diagnosed with AIDS. Therefore, it is reason- able to obtain information from both longitudinal and survival outcomes, and learn the association

1 between them as well. Joint modeling of these two outcomes has the potential to improve the inference over separate independent analyses. To jointly analyze the longitudinal outcome and survival outcome, we develop a joint statistical model. A routine framework is to use a regression model for the longitudinal outcome and a proportional hazards model for the survival outcome. To link these two endpoints, a trajectory function is incorporated as a predictor for the survival endpoint. Tsiatis and Davidian [37] review the rationale for and development of this kind of joint model. Song, Davidian and Tsiatis [32] proposed a similar joint model with a random effects longitudinal model and a Cox proportional hazards model. They use the expectation-maximization (EM) algorithm for estimation. Guo and Carlin [17] proposed their joint model with a mixed random effects longitudinal model and two hazards models including a Weibull model and a Cox proportional hazards model. They use a Bayesian approach via Markov chain Monte Carlo (MCMC) for inference. All of the above approaches assume a parametric trajectory function. However, the longitudinal measurements may be highly diverse, and the parametric model may impose too many parametric assumptions. Wang and Taylor [39] use the same structure of joint model but include an integrated Ornstein-Uhlenbech (IOU) stochastic process in the longitudinal model. This formulation allows the trajectory to vary from a parametric model setting, but it may not be flexible enough to model between-patient variability which implies heterogeneity in population. In this dissertation, we consider the approach of Brown and Ibrahim [6], which proposed a Bayesian hierarchical semiparametric joint model for longitudinal and survival outcomes. The usual parametric mixed effects model for the longitudinal trajectory is relaxed by using a Dirichlet process prior on the coefficients. A Cox proportional hazards model with the trajectory as a predictor is then used for the survival time. The current literature is based on either likelihood- based methods or Bayesian approaches for joint modeling. All of these approaches assume that latent random effects are Gaussian distributed. However, the likelihood of our joint model is complicated and the commonly used Gaussian assumption creates computational and inferential difficulties in a Bayesian analysis. Therefore, we drop the Gaussian assumption and consider using a log-gamma distribution for the longitudinal error. This choice is arguably more flexible than the Gaussian assumption, because we can adjust the shape and scale parameters of the log-gamma distribution to make it approximate Gaussian distribution (Bradley et al. [5]). This log-gamma

2 distribution is computationally convenient because the full conditional distributions are conjugate and straightforward to sample from. Thus, we use a Gibbs Sampler for inference. In Chapter 2, we review the univariate and multivariate log-gamma distributions and some statistical techniques that we use in this joint modeling work. In Chapter 3, we review the joint model under Gaussian assumption, complete the specification of the Bayesian model, and implement a Gibbs sampler for inferences. In Chapter 4, we propose the joint model in the same framework in Chapter 3 but under the log-gamma assumption. We also complete the specification of the Bayesian log- gamma joint model and gives the full-conditional distributions in the Gibbs sampler. In Chapter 5, simulation studies illustrate the procedure under both Gaussian assumption and log-gamma assumption. Comparison results are shown to discuss the advantages and weakness of joint models under these two assumptions. In Chapter 6, we apply the joint models to an HIV data and a prostate cancer data. Chapter 7 gives a conclusion of this dissertation and proposes some further work of this joint modeling framework.

3 CHAPTER 2

BACKGROUND

2.1 Gaussian Assumption

In linear regression, it is common to assume the error terms are Gaussian distributed, which implicitly assumes that the underlying observation is also Gaussian distributed. There are many useful properties of the Gaussian distribution. For example, central limit theorems often imply that sample averages are Gaussian distributed asymptotically. Thus, the Gaussian distribution is often reasonable to assume in practice, as often, outcomes obtained on individuals can often be interpreted as an average (Lehmann [23]). In this dissertation, we are particularly interested in another property, namely, conjugacy. In a Bayesian analysis, if the posterior distribution is in the same family as the prior distribution, then the prior and posterior distributions are called conjugate for the likelihood function. If the likelihood function is a Gaussian distribution with unknown mean and known , then using a Gaussian prior for the mean will ensure that the corresponding posterior distribution is also a Gaussian distribution. In this case, it is straightforward to sample directly from the posterior or full conditional distribution. Previous work on joint modeling most often assumes a Gaussian based regression model for the longitudinal outcome with errors that are Gaussian distributed. However, the joint longitudinal-survival likelihood does not yield a recog- nizable function because the likelihood of the survival model is not a Gaussian distribution. It is difficult, then, to find conjugate priors, and sampling directly from the posteriors may not be pos- sible. In this case, we may have to use a computationally less efficient (in terms of effective sample size) Metropolis-Hastings algorithm or to approximately sample from the posterior distributions. In this dissertation, we make an improvement by using a conjugate form for the parameters. Specifically, we use a log-gamma distribution instead of a Gaussian distribution for the longitudinal error. We find conjugate priors for many of the parameters in our joint model and use slice sampling (Neal [29]) to sample from their full-conditional distributions efficiently.

4 2.2 Log-Gamma Distribution

We define and introduce some properties of the log-gamma distribution. Let z be a gamma distributed random variable with shape parameter α > 0 and rate parameter κ > 0. Thus, the probability density function (pdf) of z is

κα f(z|α, κ) = zα−1 exp (−κz) . (2.1) Γ(α)

The mean and variance of the gamma random variable z are well known and given by,

α E(z) = κ α V ar(z) = . κ2

Let q = log(z). (2.2)

Then the random variable q has a log-gamma distribution with shape parameter α > 0 and rate parameter κ > 0 denoted by LG(α, κ). From (2.1) and (2.2), the pdf of the log-gamma random variable q is given by,

κα f(q|α, κ) = exp {αq − κ exp(q)} ; q ∈ . (2.3) Γ(α) R

The mean and variance of the log-gamma random variable are given by (see Prentice [30])

E(q) = ω0(α) − log(κ) (2.4)

V ar(q) = ω1(α).

Here ωk(·) is the polygamma function for non-negative integer k. For a real value h we have that dk+1 ω (h) ≡ log(Γ(h)). k dhk+1

There are many relationships between the log-gamma distribution and other distributions in- cluding the Gumbel distribution, the Amoroso distribution, and the Gaussian distribution (e.g., see Crooks [8]). These relationships are derived by considering special cases of the pdf associated with the log-gamma random variable q in (2.2). It is well known (Bartlett and Kendall [4]) that the rescaled log-gamma distribution approaches a Gaussian distribution as the parameters become large.

5 Figure 2.1: Kernel density plots of the log-gamma distributions and the standard Gaussian distri- bution.

1 Proposition 1. (Bradley et al. [5]) Let q ∼ LG(α, α), and let q+ = α 2 q. Then q+ converges in distribution to the standard Gaussian distribution as α goes to infinity. Proof. See Appendix A.3. Figure 2.1 provides some examples of the log-gamma distribution. It is shown that the log- gamma distribution gets closer to the standard Gaussian distribution as the shape and rate pa- rameters increase. One of the reasons of using the log-gamma distribution instead of the Gaussian distribution in our joint model is because the re-scaled log-gamma distribution is a limiting case of the Gaussian distribution. Another reason is that there is a double exponential term in the pdf of the log-gamma distribution, which (as will be shown in Chapter 3) is also present in our hazard model. Consequently, the log-gamma distribution will lead to conjugacy for many of the parameters in our model. This motivates us to use the log-gamma distribution instead of Gaussian distribution in our joint model.

6 2.3 Multivariate Log-Gamma Distribution

Similar to the multivariate Gaussian distribution, the multivariate log-gamma distribution (MLG) is a generalization of the one-dimensional (univariate) log-gamma distribution to higher dimensions. The MLG variable is obtained by using a linear combination of independent log- 0 gamma random variables. Specifically, let the m-dimensional random vector w = (w1, . . . , wm) consist of m mutually independent log-gamma random variables such that wi ∼ LG(αi, κi) for m i = 1, . . . , m. Then, we define a multivariate log-gamma random variable q ∈ R as

q = c + Vw, (2.5)

m m m where V ∈ R × R is invertible and c ∈ R . Then the pdf of q satisfies

m α ! 1 Y κ i f(q|c, V, α, κ) = i exp α0V−1(q − c) − κ0 exp{V−1(q − c)} , (2.6) det(VV0)1/2 Γ(α ) i=1 i

0 where “det” represents the determinant function, the m-dimensional vector α = (α1, . . . , αm) 0 consists of shape parameters, and the m-dimensional vector κ = (κ1, . . . , κm) consists of rate parameters. The mean and variance of the MLG random variable q is given by,

E(q|α, κ) = c + V(ω0(α) − log(κ)),

0 Cov(q|α, κ) = Vdiag{ω1(α)}V ,

0 where for a generic m-dimensional real-valued vector k = (k1, . . . , km) , let diag(k) be the m × m dimensional diagonal matrix with main diagonal equal to k. The function ωj(h), for non-negative integer j, is a vector-valued polygamma function, where the i-th element of ωj(h) is defined to be dj+1 j+1 log(Γ(hi)) for i = 1, . . . , m. dhi Let MLG(c, V, α, κ) be shorthand for the pdf in (2.6). We see that the MLG pdf also has a double exponential term. The pattern is the main reason why conjugacy exists between our hazard model and the log-gamma and MLG distributions, which we take advantage of in subsequent chapters. The linear combination in (2.5) is similar to the derivation of the multivariate Gaussian dis- tribution. If we replace w with an m-dimensional random vector consisting of independent and identically standard Gaussian random variables, we obtain the multivariate Gaussian distribution

7 with mean c and covariance VV0. However, in our multivariate log-gamma distribution, each uni- variate component wi has its own shape and scale parameters αi and κi. So each component of the MLG random variable could have different log-gamma distribution, which is an important differ- ence with the multivariate Gaussian distribution and potentially gives more flexibility. Similar to the univariate case, the m-dimensional MLG distribution is asymptotic to a multivariate Gaussian distribution as the shape and rate parameters go to infinity.

1/2 Proprsition 2. (Bradley et al. [5]) Let q ∼ MLG(c, α V, α1m, α1m). Then q converges in distribution to a multivariate Gaussian distribution with mean c and covariance matrix VV0 as α goes to infinity. Proof. See Appendix A.4.

2.4 Conditional Multivariate Log-Gamma Distribution

In this dissertation, we use a Gibbs sampler for Bayesian inference. For the model proposed in Chapter 3, the Gibbs sampler will require simulating from conditional distributions of multivari- ate log-gamma random vectors. Thus, we introduce the definition of the conditional log-gamma distribution. We use the same definition for conditional multivariate log-gamma distribution as Bradley et al. 0 0 0 [5]. Let q ∼ MLG(c, V, α, κ), and partition this m-dimensional random vector so that q = (q1, q2) . q1 is a g-dimensional random vector, and q2 is a (m − g)-dimensional random vector. Partition −1 V = [HB] into an m × g matrix H and an m × (m − g) matrix B. Then (q1|q2 = d, c, V, α, κ) has a conditional multivariate log-gamma distribution. Here d is a (m − g)-dimensional real valued vector. The pdf of the conditional multivariate log-gamma distribution is,

 0 0 f(q1|q2 = d, c, V, α, κ) = C exp α Hq1 − κc exp(Hq1) , (2.7) where −1 κc ≡ exp{Bd − V c + log(κ)}, (2.8) and the normalizing constant C is

m α ! 1 Y κ i exp(α0Bd − α0V−1c) C = i . (2.9) det(VV0)1/2 Γ(α ) [R f(q|c, V, α, κ)dq ] i=1 i 1 q2=d

8 Proof of (2.7) - (2.9): See Appendix A.5.

Let cMLG(H, α, κc) be a shorthand for the pdf in (2.7), where “cMLG” stands for “conditional multivariate log-gamma.” From the pdf in (2.7) we know that the cMLG distributions do not fall within the same class of pdfs defined in (2.6). This is because the m × g real-valued matrix H in the pdf in (2.7) is not square, while the real-valued matrix V in the pdf in (2.6) is a square matrix. This property is different from multivariate Gaussian distribution, where both marginal and conditional distributions obtained from a multivariate Gaussian random vector are multivariate Gaussian distributions. Since cMLG is not MLG, it is difficult to sample directly from cMLG. To avoid sampling from a cMLG distribution in a Gibbs sampler, a data augmentation techniques can be used to instead simulate from a marginal MLG random vector. See more details in Appendix A.1-A.2. One another way to sample from the cMLG distribution is the slice sampler (Neal [29]).

2.5 Slice Sampler

Slice sampler is a type of Markov chain Monte Carlo algorithm for pseudo-random number sampling, i.e. for drawing random samples from a statistical distribution. The method is based on the observation that to sample a random variable one can sample uniformly from the region under the graph of its density function. Suppose we want to sample from a distribution for a random variable, x, taking values in some n subset of R , whose density is proportional to some function f(x). To do this we can sample uniformly from the (n + 1)-dimensional region that lies under the plot of f(x). This idea can be formalized by introducing an auxiliary real variable, y, and defining a joint distribution over x and y that is uniform over the region U = {(x, y) : 0 < y < f(x)} below the curve or surface defined by f(x). That is, the joint density for (x, y) is   1/Z if 0 < y < f(x), p(x, y) =  0 otherwise, where Z = R f(x)dx. The marginal density for x is then

Z f(x) p(x) = (1/Z)dy = f(x)/Z 0 as desired. To sample for x, we can sample jointly for (x, y), and then ignore y. Generating independent samples uniformly from U may not be easy, so we might instead define a Markov

9 chain that will converge to this uniform distribution. One way to do this is using the Gibbs sampler: we sample alternately from the conditional distribution of y|x ∼ U(0, f(x)), and from the conditional distribution of x|y over the region S = {x : y < f(x)}, which is called the “slice” defined by y. Here we provide the implementation of univariate and multivariate slice sampler.

The univariate slice sampler discussed here replace the current value, x0, with a new value, x1, found by a three-step procedure. To sample a random variable x with density proportional to a function f(x), we introduce an auxiliary variable y and iterate as follows:

1. Draw a real value, y, uniformly from (0, f(x0)), thereby defining a horizontal “slice”: S = {x : y < f(x)};

2. Find an interval, I = (L, R), around x0 that contains all, or much, of the slice;

3. Sample the new point, x1, from the part of the slice within this interval.

Finding the bounds of the horizontal slice in Step 2 may not be easy, which involves inverting the function describing the distribution being sampled from. If both the pdf and its inverse are available, and the distribution is unimodal, then finding the slice and sampling from it are simple. If not, a stepping-out procedure can be used to find a region whose endpoints fall outside the slice. Then, a sample can be drawn from the slice using rejection sampling. Various procedures for this are described in detail by Neal [29]. Note that this algorithm can be used to sample from the area under any curve, regardless of whether the function integrates to one. In fact, scaling a function by a constant has no effect on the sampled x. This means that the algorithm can be used to sample from a distribution whose pdf is only known up to a constant (i.e. whose normalizing constant is unknown). To sample from a multivariate distribution, such univariate slice sampler can be applied to each variable in turn repeatedly. Or we can use multivariate slice sampler with hyperrectangles. This method adapts the univariate algorithm to the multivariate case by substituting a hyperrectangle for the one-dimensional interval used in step 2. In this dissertation, our parameters are low dimensional, and most of them are one-dimensional. We use the slice sampler for implementation.

2.6 Dirichlet Process

The building block of the Dirichlet process is the Dirichlet distribution. The Dirichlet distri- bution, denoted as Dir(ρ), is a family of continuous multivariate probability distributions with a

10 K-dimensional parameter vector ρ of positive real numbers. The Dirichlet distribution of order

K ≥ 1 has parameters ρ = (ρ1, . . . , ρK ), ρi > 0 (i = 1,...,K), and has a probability density function (pdf) given by, K 1 Y f(x , . . . , x ; ρ , . . . , ρ ) = xρi−1, 1 K 1 K B(ρ) i i=1 PK where i=1 xi = 1 and xi ≥ 0 for i = 1,...,K. The normalizing constant is the multivariate Beta function, which can be expressed as QK Γ(ρ ) B(ρ) = i=1 i . PK  Γ i=1 ρi Then given a measurable set S, a base probability distribution P and a positive real number ρ, a Dirichlet process is a stochastic process whose sample path is a probability distribution over S, n such that the following holds. For any measurable finite partition of S, denoted {Bi}i=1, if

X ∼ DP(ρ, P ), then

X(B1),...,X(Bn) ∼ Dir(ρP (B1), . . . , ρP (Bn)).

The Dirichlet process is often used in Bayesian inference to describe the prior knowledge about the distribution of random variables. The base distribution P is the expected value of the process, i.e., the Dirichlet process draws distributions “around” the base distribution similar to the way that a normal distribution draws real numbers around its mean. However, even if the base distribution is continuous, the distributions drawn from the Dirichlet process are almost surely discrete. The concentration parameter ρ specifies how strong the discretization is. When ρ → 0, the realizations are all concentrated at a single value; when ρ → ∞, the realizations become continuous. Between the two extremes the realizations are discrete distributions with less and less concentration as ρ increases. For our purpose, it is of interest to study the homogeneity and heterogeneity of the longitudinal data. This type of clustering may help to group subjects together according to some predetermined similarity criterion, such that the resulting groups of subjects all display similar properties, which can then be used to extract useful insight from the data. Also, clustering may help to identify the subjects with different treatments. Thus, it is natural to cluster the subjects according to their associated longitudinal trajectories, which follows from the discrete nature of the posterior distribution under Dirichlet process prior (Antoniak [3]).

11 2.7 Kaplan-Meier Estimator

In this section we review a technique for inference on the distribution of time-to-event, using the right-censored survival data. We will use this Kaplan-Meier estimator as a model to compare to in applications in Chapter 6. Let T be a continuous random variable, which denotes the time until an event of interest occurs. Recall that the survival function is defined as

S(t) = P (T > t), t ≥ 0.

Suppose that the events occur at D distinct times t1 < t2 < ··· < tD, and at time ti (i = 1,...,D) there are di (di > 0) events that occur. The data available for estimating S(t) is not only the event times (t1, t2, . . . , tD), but also the censoring times of the subjects. Let Ni (i = 1,...,D) be the number of subjects who are at risk at time ti. That is, Ni is a count of the number of subjects who are alive at ti or experience the event of interest at ti. The quantity di/Ni provides an estimate of the conditional probability that a subject who survives to just prior to time ti experiences the event at time ti. We will use this quantity to construct an estimator of the survival function and the cumulative hazard function – the Kaplan-Meier estimator (Klein and Moeschberger [21]). The Kaplan-Meier estimator is a nonparametric estimator of the survival function. It was proposed by Kaplan and Meier [20], also known as the product limit estimator. In medical research, this estimator is often used to estimate the survival probability of patients after treatment. In other fields, this estimator may be used to measure the length of time people remain unemployed after a job loss (Meyer [26]), or the time-to-failure of a machine. The Kaplan-Meier estimator is defined as:  1, if t < t ,  1    Sˆ(t) = Y di (2.10) 1 − , if t1 ≤ t.  Ni t1≤t This estimator is a step function with jumps at the observed event times. The size of these jumps depends not only on the number of events observed at each time point ti, but also on the censoring pattern prior to ti. One of the common estimators of the variance of the Kaplan-Meier estimator is given by Greenwood [16]:

h i X di Vˆ ar Sˆ(t) = Sˆ(t)2 . Ni(Ni − di) t1≤t

12 The Kaplan-Meier estimator is not well defined when t is larger than the maximum observated time. For example, if the time-to-event corresponds to the time-to-death and the largest observed time is a death time, then the estimated value of the survival function is zero beyond this time point. However, if the largest observed time is a censoring time, then the value of the survival function beyond this time point is undetermined because we do not know when this last survivor would have died. There are some nonparametric estimators proposed to address this ambiguity. For example, Efron [11] suggests that

Sˆ(t) = 0, if t > tmax, where tmax represents the largest observed time. This estimator implies that the last survivor would die immediately after tmax, which is the survivor’s censoring time. Thus it underestimates the survival probability when t > tmax. Gill [14] suggests that

Sˆ(t) = Sˆ (tmax) , if t > tmax.

This estimator implies that the last survivor would die at t = +∞. Thus, it overestimates the survival probability when t > tmax.

2.8 Posterior Predictive p-Value

Let yij be the response of the i-th subject at time point j. Denote the rep-th MCMC replicate rep rep with yij . Here yij is interpreted as a new replicate of yij, which is assumed to be independent and rep identically distributed to yij. We stress that yij may be different in value from yij . To compute the posterior predictive p-value (Meng [25]), we perform the following steps, for rep = 1,...,B, rep rep rep rep Step1. Generate new values yij from f(yij |Θ ), where Θ is the rep-th MCMC iteration of a generic finite dimensional real valued parameter vector Θ. rep Step2. Compute the average of the new values yij over all iterations 1,...,B, i.e., B 1 X Eˆ[y |y] = yrep, ij B ij b=1 where y is the observed longitudinal measures and observed survival time. Step3. Compute χ2 statistics, i.e., h i2 yrep − Eˆ[yrep|y] X X ij ij χ2 = , rep ˆ rep i j E[yij |y]

13 and h i2 y − Eˆ[yrep|y] X X ij ij χ2 = . 0 ˆ rep i j E[yij |y] Step4. The posterior predictive p-value is computed as

B 1 X p = I(χ2 ≥ χ2). B rep 0 b=1 ˆ rep Suppose yij is exactly equal to E[yij |y], which we interprete as “overfitting.” In this case p = 1. Similarly, p = 0 suggests “oversmoothing.” In general, an intermediate value of p between zero and one is preferable (e.g., 0.5), however it may not necessarily be 0.5 (e.g., see Gelman [13] for an example).

2.9 Review of DIC

The deviance information criterion (DIC; Speigelhalter [33]) is a statistic that approximates the following quantity,

Eyrep|θ[Drep(E[θ|y])], where the expectation operator “Eyrep|θ(·)” is defined to be the expected value with repect to the rep “true model,” and Drep(·) operator is the “true deviance.” Specifically, Drep(θ) = −2 log(p(y |θ)), where p(yrep|θ) is the true unknown model for y. For more details see Speigelhalter [33].

14 CHAPTER 3

A GAUSSIAN JOINT MODEL

Consider data sets that consist of both longitudinal measurements and survival outcomes such that both types of responses are correlated. To capitalize on these correlations, we define two models in our joint modeling framework: a linear regression model for the longitudinal outcome, and a Cox proportional hazard function used for the survival outcome. We assume the mean of the longitudinal trajectory function to be a predictor in the survival model so that these two models imply cross-dependence in the two outcomes. This strategy is similar to the one proposed in Tsiatis and Davidian [37]. To gain more flexibility in this joint model, we specify a Dirichlet process prior for the coefficients in the longitudinal trajectory. Sections 3.1 and 3.2 provide the details in our joint model formulation and the implied likelihood respectively. Section 3.3 provides a review of the Dirichlet process prior, and introduces its use for jointly modeling longitudinal and survival outcomes. Details on the remaining prior distributions and Gibbs sampling are provided in SEctions 3.4 and 3.5, respectively. In Chapter 4 we will discuss the Bayesian joint model in the same framework but drop the Gaussian assumption and assume the longitudinal data to follow a log-gamma distribution.

3.1 Model Formulation

Let Yij be the longitudinal measurement for subject i at time point j, where i = 1, . . . , n and j = 1, . . . , mi. Here n is the total number of the subjects, and mi is the number of measurements Pn for subject i. Let N = i=1 mi be the total number of observations. The longitudinal model is given by,

Yij = ψβi (tij) + ij, (3.1)

ij ∼ F(ij),

where ψβi (tij) is referred to as the trajectory function, and ij is the error term with distribution

F(ij). In this dissertation, we consider the trajectory function to have a linear form given by,

ψβi (tij) = β0i + β1itij. (3.2)

15 This is a common response for immunologic measures to therapy. However, others have used a quadratic trajectory when modeling CD4 counts (e.g. Tsiatis et al. [37]). The Gaussian distribution is commonly assumed for the longitudinal measurements. In this chapter, we will review the semiparametric joint model under Gaussian assumption proposed by Brown and Ibrahim [6], and derive the full conditional distributions in the Gibbs Sampler. A simulation study is given in Chapter 5 to illustrate their method. If we assume the longitudinal observations are Gaussian distributed, we can define the longitudinal model as

Yij = ψβi (tij) + ij, (3.3)

2 ij ∼ N(0, σ ), (3.4)

where ψβi (tij) is the trajectory function given in (3.2), for i = 1, . . . , n and j = 1, . . . , mi. Each subject has an observation on a possibly censored time-to-event (“failure” or “survival”), and additional covariate information. The association between the longitudinal effect and survival outcomes is captured by including the longitudinal trajectory function among the predictors for the survival outcome. Specifically, the hazard model is given by,

0 h(t|Yi) = λ(t) exp{γψβi (t) + Xiα}, (3.5) where γ is a scalar parameter linking the trajectory to the hazard function, λ(t) is the baseline 0 hazard, X = (X1, X2,..., Xn) is an n × p matrix of the baseline covariates, and the p-dimensional parameter α is a vector of coefficients of the p-dimensional baseline covariates Xi.

3.2 Likelihood

Under the Gaussian assumption, the likelihood for the i-th subject in the longitudinal model is given by,    mi mi 2 1  1 X 2 f(Yi|βi, σ ) = √ exp − [Yij − ψβ (tij)] , 2 2σ2 i 2πσ  j=1  where Yi = (Yi1,...,Yimi ) is the vector of the longitudinal observations of subject i, and βi = 0 (βi0, βi1) is the vector of longitudinal parameters of subject i. Let si be the survival time and νi be the censoring indicator for the i-th subject, i.e.   0, censored νi =  1, not censored.

16 The specification of the hazard function in (3.5) leads to the following distribution for the survival component given the trajectory function:

 Z si  νi 0 γψβi (u)+Xα f(si, νi|Yi, λ, γ, α) = λ(si) exp{νi[γψβi (si) + Xiα]} exp − λ(u)e du . 0 Now the joint likelihood of the i-th subject for the full set of parameters of interest under Gaussian assumption can be written as

2 2 f(Yi, si, νi|βi, σ , γ, λ, α) = f(si, νi|Yi, γ, λ, α) × f(Yi|βi, σ ) si  Z 0  νi 0 γψβi (u)+Xiα = λ(si) exp νi[γψβi (si) + Xiα] − λ(u)e du 0    mi mi 1  1 X 2 × √ exp − [Yij − ψβ (tij)] . (3.6) 2 2σ2 i 2πσ  j=1  We assume that the baseline hazard function is piecewise constant such that

λ(u) = λl, ul ≤ u < ul+1, l = 1, . . . , L, where u1, . . . , uL+1 define the intervals for λ(u). Then the cumulative hazard

Z si γψ (u)+X0 α λ(u)e βi i du, (3.7) 0 can be written as L X0 α X e i Hil(βi, γ, λl), l=1 where Z min(ul+1,si) γψβ (u) Hil(βi, γ, λl) = I{si ≥ ul}λl e i du, (3.8) ul and I{si ≥ ul} is an indicator function which is equal to one if the event time occurs in or later than the l-th interval, and zero otherwise. Brown and Ibrahim [6] mention that the integral in equation (3.8) does not have an analytical solution when the trajectory is quadratic. Instead, they use GNU Scientific Library(GSL) (Galassi, Gough, and Jungman, 2001) to perform nonadaptive Gauss-Kronrod numerical integration. In Chapter 3 and Chapter 4, we consider an approximation for this integral. In a general case, we may consider

si Z 0 0 γψβ (u)+X α γψβ (ci)+X α λ(u)e i i du ≈ siλ(ci)e i i , (3.9) 0 where ci ∈ [0, si]. We will use this approximation in (3.9) in our implementation of both the Gaussian joint model and the log-gamma joint model in Chapter 4.

17 3.3 Dirichlet Process Prior

It can be difficult to specify a parametric distributions for the parameter βi that defines the trajectory function are difficult to specify. In particular, we cannot be sure that the βi’s all come from the same distribution or confirm that a distributional assumption is correct. Also, in many settings, there is evidence of non-Gaussianity in the data. For example, Zhang and Davidian [41] relax the Gaussian assumption by approximating the random effects density by using the seminonparametric (SNP) density. In this dissertation, we also relax the typical distributional assumptions made for βi’s. Specifically, we use a Dirichlet process (DP) prior (Antoniak [3]), which is a common approach to build a semiparametric model in nonparametric Bayesian statistics (reviewed in Section 2.6). This approach allows one to easily obtain posterior estimates using MCMC methods such as Gibbs sampler. Thus, we incorporate a Dirichlet process model into our joint model. This new model allows for a more flexible and robust method to examine the relationship between longitudinal measurements and survival time, as it accounts for uncertainty in specifying a distribution for {βi}.

3.3.1 Dirichlet Process Prior in the Gaussian Joint Model

We relax the distributional assumption on the βi’s in the trajectory function (3.2) by applying a Dirichlet process prior on them, which is given by,

βi|G ∼ G,

G ∼ DP (M,G0), (3.10)

G0 = N2(b0, V0), where “DP” stands for Dirichlet process, M is a positive scalar, N2(b0, V0) is a 2-dimensional mul- tivariate normal distribution with a 2-dimensional mean vector b0 and a 2 × 2 variance-covariance matrix V0. Both b0 and V0 are unknown hyperparameters, so we complete our Bayesian hierar- chical model by specifying hyperprior distributions for them. Here we use a multivariate normal distribution and a Wishart distribution as their priors,

b0 ∼ N2(µb, Σb),

−1 V0 ∼ W (Sv, nv),

18 where “W ” stands for the Wishart distribution. Here µb is a given 2-dimensional vector, Σb and

Sv are given 2 × 2 positive definite matrices, and nv is a given positive real number.

3.3.2 Concentration Parameter

The concentration parameter M is a smoothing parameter in the Dirichlet process prior. It specifies how strong the discretization is among the unique values of βi’s. In other words, the concentration parameter M describes how different the trajectories are among different clusters, i.e., the heterogeneity of the subjects. Escobar and West [12] discussed some advanced techniques related to the concentration parameter. One of the techniques is to use a single gamma prior on M and update M in the Gibbs sampler. Here we briefly introduce this method developed by Escobar and West [12], which we will use in this dissertation. Suppose that we have a prior distribution p(M) for the parameter M. Let k be the number of unique values of βi (i = 1, 2, . . . , n), that is, k denotes the number of clusters clusters, and let D be the configuration of βi. From our model, βi’s are conditionally independent of M when M, b0,

V0 and the clustering configuration are known, and the parameters (b0, V0) are also conditionally independent of M when k and the configuration are known. From Escobar and West [12], the full conditional distribution of M is

p(M|k, n, b0, V0,D) = p(M|k, n) ∝ p(M)P (k|M, n). (3.11)

Using the result of Antomiak [3], the likelihood in (3.11) is given by, Γ(M) P (k|M, n) = P (k|M = 1, n) n! M k , (k = 1, 2, . . . , n), (3.12) Γ(M + n) where the first term P (k|M = 1, n) does not involve M. Suppose M has a single gamma prior

Gamma(aM , bM ) with a shape parameter aM > 0 and a rate parameter bM > 0. For M > 0, the gamma functions in (3.12) can be written as Γ(M) (M + n)B(M + 1, n) = , Γ(M + n) MΓ(n) where B(M + 1, n) is the usual beta function with parameters (M + 1) and n. Then using the definition of the beta function, the full conditional distribution in (3.11) can be written as

p(M|k, n) ∝ p(M)M k−1(M + n)B(M + 1, n) Z 1 ∝ p(M)M k−1(M + n) xM (1 − x)n−1dx, 0

19 for any k = 1, 2, . . . , n. This implies that p(M|k, n) is the marginal distribution from a joint distribution of (M, η), where (η|M, k, n) ∼ B(M + 1, n) and

p(η|M, k, n) ∝ ηM (1 − η)n−1 (0 < η < 1). (3.13)

Under the Gamma(aM , bM ) prior for M > 0,

h i p(M|η, k, n) ∝ M aM −1e−bM M M k−1(M + n) ηM (1 − η)n−1

∝ M aM +k−2(M + n)e−M(bM −log(η))

∝ M aM +k−1e−M(bM −log(η)) + nM aM +k−2e−M(bM −log(η)), which reduces easily to a mixture of two gamma densities, i.e.

(M|η, k, n) ∼ πηGamma(aM + k, bM − log(η)) + (1 − πη)Gamma(aM + k − 1, bM − log(η)), (3.14) where the weight πη is defined by

π a + k − 1 η = M . (1 − πη) n(bM − log(η))

This full-conditional distribution is used to update the concentration parameter M in the Gibbs sampler. In each iteration, we first sample a value for η from the beta distribution in (3.13) conditional on the current values of M, k and n; then sample a new value for M from the mixture of gamma distributions in (3.14) conditional on the same k, n, and the value of η just generated. Here we use a single gamma prior for M. West [40] generalized this technique with a mixture gamma distributions as the prior for M. Alternatively, we may consider the Metropolis-Hastings method or the adaptive rejection sampling to sample from (3.11).

3.4 Prior Distributions

Besides the Dirichlet process prior on the longitudinal coefficients, we need to specify proper priors on the other parameters in the joint likelihood (3.6). In the longitudinal model, we use an inverse gamma distribution as the prior distribution of σ2, i.e.

2 σ ∼ IG(aσ, bσ),

20 where “IG” stands for the inverse gamma distribution, aσ and bσ are given positive real numbers. The prior distributions of the parameters in the survival model are shown below.

2 γ ∼ N(µγ, σγ),

λl ∼ Gamma(al, bl), l = 1, . . . , L,

α ∼ Np(µα, Σα), where µγ and σγ are given real numbers, al’s and bl’s are given positive real numbers. µα is a given p-dimensional real vector, and Σα is a given p × p positive definite matrix, p is the number of baseline covariates.

3.5 Gibbs Sampler

Gibbs sampling is a common method used for mixture of Dirichlet process models. When the prior is conjugate, it is convenient to sample directly from the posterior distribution since the posterior has the same form as the prior. However, when we have a non-conjugate prior, it is difficult to do the sampling since the posterior may be intractable. In the Gaussian joint model, We use the Gibbs sampling method with Metropolis-Hastings (Hastings [18]) to estimate the parameters. The algorithm is described in Steps 1–8, below.

In the Gaussian joint model, suppose we have k distinct values of βi, which means there are k clusters. Let zi (i = 1, . . . , n) denote the cluster indicator of Yi, that is, zi = j implies individual i is in the j-th cluster. For each cluster, z, φz(1 ≤ z ≤ k) is the value of βi’s in that cluster.

Step 1. Simulate zi(i = 1, . . . , n) using Neal’s algorithm 8 (Neal [28]). − − For i = 1, . . . , n, let k be the number of distinct zj for j 6= i, and let h = k + m, where m is − the number of auxiliary parameters. Label these zj (j 6= i) with values in {1, . . . , k }. If zi = zj − for some j 6= i, draw values independently from G0 for those φz for which k + 1 ≤ z ≤ h. If zi 6= zj for all j 6= i, let φk−+1 = φzi , and draw values independently from G0 for those φz for − which k + 2 ≤ z ≤ h. Now the distinct values of βi’s are {φ1,..., φk− , φk−+1,..., φh}. Draw a new value for zi from {1, . . . , h} using the following probabilities:  n b −i,z f(Y , s , ν |φ , σ2, γ, λ, α) 1 ≤ z ≤ k−  n − 1 − M i i i z P (zi = z|z−i, Yi, φ1,..., φh) = (3.15) M/m 2 −  b f(Yi, si, νi|φz, σ , γ, λ, α) k < z ≤ h, n − 1 − M

21 where z−i = (z1, . . . , zi−1, zi+1, . . . , zn), n−i,z is the number of zj for j 6= i that are equal to z, and b is the appropriate normalizing constant. f(Yi, si, νi|φz, Ω) is the Gaussian joint likelihood of our joint model.

Step 2. Simulate φz(z = 1, . . . , k) using Metropolis-Hastings method or the slice sampler. Recall that the Metropolis-Hastings algorithm for sampling from a distribution for x with probabilities π(x), using a proposal distribution g(x∗|x), updates the state x as follows: Draw a candidate state, x∗, according to the probabilities g(x∗|x). Compute the acceptance probability

 g(x|x∗) π(x∗) a(x∗, x) = min 1, . (3.16) g(x∗|x) π(x)

With probability a(x∗, x), set the new state, x0, to x∗. Otherwise, let x0 be the same as x. This update from x to x0 leaves π invariant.

This approach can be applied in the Gaussian joint modeling problem to update φz (z = 1, 2, . . . , k), which are the distinct values of the longitudinal coefficients βi. We will also use the Metropolis-Hastings method to update some of the other parameters in our joint model. Let Y = (Y1, . . . , Yn) and z = (z1, . . . , zn). For z = 1, 2, . . . , k, the full conditional distribution of φz is given by,

p(φz | Y, z, b0, V0, σ², γ, λ, α)
  ∝ [∏_{i=1}^n f(Yi, si, νi | βi, b0, V0, σ², γ, λ, α)] × G0(φz | b0, V0)
  ∝ [∏_{i=1}^n f(Yi, si, νi | φz, b0, V0, σ², γ, λ, α)^{I(zi = z)}] × G0(φz | b0, V0).

If we choose G0(φz|b0, V0) to be the proposal distribution, we find that this factor cancels when computing the acceptance probability in (3.16), leaving

a(φz∗, φz) = min{ 1, [∏_{i=1}^n f(Yi, si, νi | φz∗, b0, V0, σ², γ, λ, α)^{I(zi = z)}] / [∏_{i=1}^n f(Yi, si, νi | φz, b0, V0, σ², γ, λ, α)^{I(zi = z)}] }.

With probability a(φz∗, φz), set the new state of φz to be φz∗. Otherwise, let the new state be the same as φz. Alternatively, we can use the slice sampler to simulate from the full conditional distribution of φz.
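As an illustration of Step 2, here is a minimal Python sketch (ours) of one Metropolis-Hastings update of a cluster value φz with G0 as the proposal, so that the acceptance ratio in (3.16) reduces to a likelihood ratio. The helpers `draw_from_G0` and `loglik_joint` are hypothetical placeholders for the base-distribution sampler and the Gaussian joint log-likelihood of one subject.

```python
import numpy as np

def update_phi_z(phi_z, members, draw_from_G0, loglik_joint, rng):
    """One Metropolis-Hastings update of a cluster value phi_z (Step 2).

    `members` indexes the subjects with z_i = z. Because the proposal is
    the prior G0, the prior factors cancel and the acceptance probability
    is a ratio of joint likelihoods over the cluster's subjects.
    """
    phi_star = draw_from_G0()
    # Log acceptance ratio: cluster log-likelihood at the proposal minus
    # that at the current value.
    log_a = sum(loglik_joint(i, phi_star) - loglik_joint(i, phi_z)
                for i in members)
    if np.log(rng.random()) < log_a:
        return phi_star   # accept the candidate
    return phi_z          # reject: keep the current state
```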

Step 3. Simulate b0 and V0 using conjugate priors.

The Dirichlet process prior in the Gaussian joint model has a base distribution with mean b0 and covariance matrix V0. We specify a conjugate prior N2(µb, Σb) on b0, and a conjugate prior W(Sv, nv) on V0⁻¹, where “W” stands for the Wishart distribution. Thus we can sample directly from their full conditional distributions to update these two parameters. The full conditional distribution of b0 is given by,

" n # 2 Y 2 p(b0|Y, z, φ, V0, σ , λ, γ, α) ∝ f(Yi, si, νi|βi, b0, V0, σ , γ, λ, α) p(φ)p(b0) i=1 ∝ p(φ)p(b0) " k # Y ∝ p(φz|b0, V0) × p(b0) z=1 " k # Y ∝ G0(φz|b0, V0) × N2(µb, Σb) z=1 " k # !  −1 −1−1 −1 −1 X  −1 −1−1 ∼ N3 Σb + kV0 Σb µb + V0 φz , Σb + kV0 . z=1

Similarly, the full conditional distribution of V0 is given by,

" n # −1 2  Y 2 −1 p V0 |Y, z, φ, b0, σ , λ, γ, α ∝ f(Yi, si, νi|βi, b0, V0, σ , γ, λ, α) p(φ)p V0 i=1 −1 ∝ p(φ)p(V0 ) " k # Y −1 ∝ p(φz|b0, V0) × p(V0 ) z=1 " k # Y −1  ∝ G0(φz|b0, V0) × W V0 |Sv, nv z=1 " k #−1  X 0 ∼ W  Sv + (φz − b0)(φz − b0) , nv + k . z=1 where “W” stands for the Wishart distribution. Step 4. Simulate σ2 using conjugate prior. The Gaussian joint model assumes that the longitudinal observations have a constant variance 2 2 σ . With an IG(aσ, bσ) prior, the full conditional distribution of σ is given by,

" n # 2 Y 2 2 p(σ |Y, z, φ, b0, V0, λ, γ, α) ∝ f(Yi, si, νi|βi, b0, V0, σ , γ, λ, α) p(σ ) i=1

23  n m  Pn i   − mi/2  1 X X  −aσ−1 bσ ∝ σ2 i=1 exp − [Y − ψ (t )]2 × σ2 exp − 2σ2 ij βi ij σ2  i=1 j=1 

 n n m  1 X 1 X Xi ∼ Γ a + m , b + [Y − ψ (t )]2 .  σ 2 i σ 2 ij β ij  i=1 i=1 j=1
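For concreteness, a minimal sketch (ours, not the dissertation's code) of this conjugate update, using the fact that a draw from IG(a, b) can be obtained as the reciprocal of a Gamma(a, rate = b) draw; the argument names are assumptions.

```python
import numpy as np

def update_sigma2(Y, t, beta, a_sigma, b_sigma, rng):
    """Draw sigma^2 from its inverse-gamma full conditional (Step 4).

    Y and t are lists of per-subject arrays; beta[i] = (b0_i, b1_i) gives
    each subject's linear trajectory. IG(a, b) draw = 1 / Gamma(a, rate=b).
    """
    shape = a_sigma + 0.5 * sum(len(y) for y in Y)
    resid_ss = sum(np.sum((y - (b[0] + b[1] * ti)) ** 2)
                   for y, ti, b in zip(Y, t, beta))
    rate = b_sigma + 0.5 * resid_ss
    return 1.0 / rng.gamma(shape, 1.0 / rate)  # inverse-gamma draw
```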

Step 5. Update γ using the Metropolis-Hastings method.

The link parameter γ has a normal prior N(µγ, σγ²), so the full conditional distribution of γ is given by,

" n # 2 Y 2 2 p(γ|Y, z, φ, b0, V0, σ , λ, α) ∝ f(Yi, si, νi|βi, b0, V0, σ , γ, λ, α) N(µγ, σγ). i=1

If we choose the prior N(µγ, σγ²) to be the proposal distribution, this factor cancels when computing the acceptance probability in (3.16), leaving

a(γ∗, γ) = min{ 1, Fγ(γ∗) / Fγ(γ) },

where Fγ(γ) = ∏_{i=1}^n exp{ νi γ ψβi(si) − e^{Xi′α} Σ_{l=1}^L Hil(β, γ, λl) }. With probability a(γ∗, γ), set the new state of γ to be γ∗. Otherwise, let the new state be the same as the current value of γ.

Step 6. Simulate λl (l = 1, . . . , L) using conjugate priors.

The baseline hazard function is assumed to be piecewise constant. For each l = 1, . . . , L, we specify a conjugate Γ(al, bl) prior for λl. Then for each l = 1, . . . , L, the full conditional distribution of λl is given by,

p(λl | Y, z, φ, b0, V0, σ², γ, α)
  ∝ [∏_{i=1}^n f(Yi, si, νi | βi, b0, V0, σ², γ, λ, α)] Γ(λl | al, bl)
  ∝ [∏_{i=1}^n λ(si)^{νi} exp{ −e^{Xi′α} Σ_{l=1}^L Hil(β, γ, λl) }] Γ(λl | al, bl)
  ∝ λl^{nl} exp{ −Σ_{i=1}^n e^{Xi′α} Hil(β, γ, λl) } Γ(λl | al, bl)
  ∝ λl^{nl} exp{ −λl Σ_{i=1}^n e^{Xi′α} I(si ≥ ul) ∫_{ul}^{min(ul+1, si)} e^{γψβ(u)} du } × λl^{al−1} e^{−bl λl}
  ∼ Γ( al + nl, bl + Σ_{i=1}^n e^{Xi′α} I(si ≥ ul) ∫_{ul}^{min(ul+1, si)} e^{γψβ(u)} du ),

where Hil(β, γ, λl) is defined in (3.8), and nl is the number of subjects whose survival time falls in the interval [ul, ul+1), for l = 1, . . . , L.
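The following is a minimal Python sketch (ours) of this conjugate gamma update; the trajectory integral is evaluated numerically with `scipy.integrate.quad`, and all argument names are our own assumptions.

```python
import numpy as np
from scipy.integrate import quad

def update_lambda_l(l, u, s, X, alpha, gamma, beta, n_l, a_l, b_l, rng):
    """Draw lambda_l from its gamma full conditional (Step 6).

    u holds the interval endpoints u_1 < ... < u_{L+1}; s, X, beta hold each
    subject's survival time, covariates, and trajectory coefficients; n_l is
    the number of events falling in [u_l, u_{l+1}).
    """
    rate = b_l
    for s_i, x_i, (b0, b1) in zip(s, X, beta):
        if s_i >= u[l]:                           # indicator I(s_i >= u_l)
            upper = min(u[l + 1], s_i)
            integral, _ = quad(lambda v: np.exp(gamma * (b0 + b1 * v)),
                               u[l], upper)
            rate += np.exp(np.dot(x_i, alpha)) * integral
    return rng.gamma(a_l + n_l, 1.0 / rate)
```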

Step 7. Simulate α using the Metropolis-Hastings method or the slice sampler.

The baseline coefficients α have a non-conjugate prior Np(µα, Σα). The full conditional distribution of α is given by,

" n # 2 Y 2 p(α|Y, z, φ, b0, V0, σ , γ, λ) ∝ f(Yi, si, νi|βi, b0, V0, σ , γ, λ, α) Np(µα, Σα) i=1 " n ( L )# Y 0 X 0 Xiα ∝ exp{νiXiα} exp −e Hil(β, γ, λl) Np(µα, Σα). i=1 l=1

We choose the prior Np(µα, Σα) to be the proposal distribution; then the acceptance probability is

a(α∗, α) = min{ 1, Fα(α∗) / Fα(α) },

where Fα(α) = ∏_{i=1}^n exp{ νi Xi′α − e^{Xi′α} Σ_{l=1}^L Hil(β, γ, λl) }. With probability a(α∗, α), set the new state of α to be α∗. Otherwise, let the new state be the same as the current value of α. Alternatively, we can use the slice sampler to simulate from the full conditional distribution of α.
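Since Steps 2 and 7 offer the slice sampler as an alternative, here is a minimal sketch (ours) of a univariate slice-sampler update with the standard stepping-out and shrinkage procedure of Neal (2003); applied componentwise, it can stand in for the Metropolis-Hastings updates above. `log_f` is a placeholder for a log full conditional, known up to a constant.

```python
import numpy as np

def slice_update(x0, log_f, w, rng, max_steps=50):
    """One univariate slice-sampler update for log-density log_f."""
    log_y = log_f(x0) + np.log(rng.random())      # slice level under the curve
    left = x0 - w * rng.random()                  # randomly place initial window
    right = left + w
    for _ in range(max_steps):                    # step out to the left
        if log_f(left) < log_y:
            break
        left -= w
    for _ in range(max_steps):                    # step out to the right
        if log_f(right) < log_y:
            break
        right += w
    while True:                                   # shrink until acceptance
        x1 = rng.uniform(left, right)
        if log_f(x1) >= log_y:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1
```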

Step 8. Simulate M using the technique discussed in Section 3.3.2.

We use a single gamma prior Γ(aM, bM) on the concentration parameter M in the Dirichlet process prior. As discussed in Section 3.3.2, we first simulate an auxiliary parameter η from the beta distribution specified in (3.13), then sample a new value for M from the mixture of gamma distributions specified in (3.14).

CHAPTER 4

LOG-GAMMA JOINT MODEL

We discussed the Gaussian joint model in Chapter 3. In this chapter, we are still interested in jointly modeling longitudinal measurements and survival outcomes. The framework of the joint model in this chapter is the same as in Chapter 3: a linear regression model is used to fit the longitudinal measurements, and a Cox proportional hazards model is used for the survival outcome. To join these two models, we still use the mean of the longitudinal trajectory function as a predictor in the survival model. However, we will challenge the Gaussian assumption of Chapter 3, and instead assume the longitudinal measurements have a log-gamma distribution. We also use a Dirichlet process prior for the coefficients in the longitudinal trajectory to allow more flexibility in our models. We then complete the specification of the Bayesian model and implement a Gibbs sampler for inference.

4.1 Model Formulation

Let Yij be the longitudinal measurement for subject i at time point j, where i = 1, . . . , n and j = 1, . . . , mi. Here n is the total number of subjects, and mi is the number of measurements for subject i. Let N = Σ_{i=1}^n mi be the total number of observations. The longitudinal model is given by,

Yij = ψβi(tij) + εij,

εij ∼ F(εij),

where ψβi(tij) is referred to as the trajectory function, and εij is the error term with distribution

F(εij). In this chapter, we still consider the trajectory function to have a linear form given by,

ψβi (tij) = β0i + β1itij. (4.1)

This is a common response pattern for immunologic measures under therapy. However, other forms (e.g., a quadratic trajectory) could also be considered in the longitudinal model. The Gaussian distribution is

commonly used for longitudinal data (e.g., Brown and Ibrahim [6]). We reviewed the Gaussian joint model and derived the full conditional distributions in a Gibbs sampler for inference in Chapter 3. In this chapter, we instead assume the longitudinal data have a log-gamma distribution. Then the longitudinal model is defined as

Yij = ψβi(tij) + εij,

εij ∼ LG(α, κ),

where ψβi(tij) is the trajectory function given in (4.1), for i = 1, . . . , n and j = 1, . . . , mi. Here we assume that the mean of the longitudinal error is equal to zero. From (2.4) we have κ = exp{ω0(α)}, where ω0(·) is the digamma function.
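As a quick numerical check of this zero-mean constraint (a sketch under our own variable names, assuming SciPy's digamma), note that if W ∼ Gamma(α, rate = κ), then ε = log W has mean ω0(α) − log κ, which vanishes when κ = exp{ω0(α)}:

```python
import numpy as np
from scipy.special import digamma

alpha = 10.49
kappa = np.exp(digamma(alpha))    # zero-mean constraint kappa = exp{omega_0(alpha)}

# If W ~ Gamma(alpha, rate=kappa), then eps = log(W) has mean
# digamma(alpha) - log(kappa), which is zero by construction here.
rng = np.random.default_rng(0)
eps = np.log(rng.gamma(alpha, 1.0 / kappa, size=1_000_000))
print(round(kappa, 2), round(eps.mean(), 4))   # kappa near 10.00, mean near 0
```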

Each subject has an observation on a possibly censored time-to-event (“failure” or “survival”) and additional covariate information. The association between the longitudinal and survival outcomes is captured by including the longitudinal trajectory function among the predictors for the survival outcome. Specifically, the hazard model for the i-th subject is given by,

h(t | Yi) = λ(t) exp{γψβi(t) + Xi′α},

where γ is a scalar parameter linking the trajectory to the hazard function, λ(t) is the baseline hazard, and the p-dimensional parameter α is a vector of coefficients for the i-th subject's p-dimensional baseline covariate vector Xi.

4.2 Likelihood

Under the log-gamma assumption, the likelihood for the i-th subject in the longitudinal model is given by,

f(Yi | βi, α, κ) = ∏_{j=1}^{mi} [κ^α / Γ(α)] exp{ α[Yij − ψβi(tij)] − κ exp{Yij − ψβi(tij)} },

where Yi = (Yi1, . . . , Yimi) is the vector of longitudinal observations for subject i, and βi = (βi0, βi1)′ is the vector of longitudinal parameters for subject i. As in the Gaussian joint model in Chapter 3, let si be the survival time and νi be the censoring indicator for the i-th subject, with νi = 0 if censored and νi = 1 if not censored.

The specification of the hazard function in Section 4.1 leads to the following distribution for the survival component given the trajectory function:

f(si, νi | Yi, βi, λ, γ, α) = λ(si)^{νi} exp{ νi [γψβi(si) + Xi′α] } exp{ −∫_0^{si} λ(u) e^{γψβi(u) + Xi′α} du },

where we introduce a subscript on the p-dimensional vector of covariates, Xi. Now the joint likelihood of the i-th subject for the full set of parameters of interest can be written as

f(Yi, si, νi | βi, λ, γ, α, α, κ)
  = f(si, νi | Yi, λ, γ, α) × f(Yi | βi, α, κ)
  = λ(si)^{νi} exp{ νi [γψβi(si) + Xi′α] − ∫_0^{si} λ(u) e^{γψβi(u) + Xi′α} du }
    × ∏_{j=1}^{mi} [κ^α / Γ(α)] exp{ α[Yij − ψβi(tij)] − κ exp{Yij − ψβi(tij)} }.        (4.2)

We assume that the baseline hazard function is piecewise constant such that

λ(u) = λl; ul ≤ u < ul+1, l = 1, . . . , L, where u1, . . . , uL+1 define piecewise constant intervals for λ(u). Then the cumulative hazard

∫_0^{si} λ(u) e^{γψβi(u) + Xi′α} du        (4.3)

can be written as

e^{Xi′α} Σ_{l=1}^L Hil(βi, γ, λ),

where

Hil(βi, γ, λ) = I{si ≥ ul} λl ∫_{ul}^{min(ul+1, si)} e^{γψβi(u)} du,        (4.4)

and I{si ≥ ul} is an indicator function equal to one if the event time occurs in or later than the l-th interval, and zero otherwise. The integral in (4.4) does not have an analytical solution when the trajectory is quadratic. In Chapter 3, we gave an approximation (3.9) for the integral in the general case. In this chapter, we consider both (3.9) and another approximate form for this integral. In our joint model, the trajectory function ψβi(u) has a linear form, so we

may consider a Taylor series to obtain an approximation for this integral,

Hil(βi, γ, λl) = I{si ≥ ul} λl ∫_{ul}^{min(ul+1, si)} e^{γ(β0i + β1i u)} du
  = e^{γβ0i} I{si ≥ ul} λl [e^{γβ1i min(ul+1, si)} − e^{γβ1i ul}] / (γβ1i)
  ≈ e^{γβ0i} I{si ≥ ul} λl [(1 + γβ1i min(ul+1, si)) − (1 + γβ1i ul)] / (γβ1i)
  = e^{γβ0i} I{si ≥ ul} λl [min(ul+1, si) − ul].

Then the integral in (4.3) can be written as

∫_0^{si} λ(u) e^{γψβi(u) + Xi′α} du ≈ Ji e^{Xi′α + γβ0i},        (4.5)

where

Ji = Σ_{l=1}^L I{si ≥ ul} λl [min(ul+1, si) − ul].

We will use the approximation in (4.5) to update the link parameter γ. This helps us obtain conjugacy in the Gibbs sampler.
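A minimal Python sketch (ours, not the dissertation's code) of the approximate cumulative hazard in (4.5); the argument names are our own, and the endpoints u are assumed to satisfy u1 < · · · < uL+1 as above.

```python
import numpy as np

def cumulative_hazard_approx(s_i, x_i, u, lam, alpha, beta0_i, gamma):
    """First-order approximation (4.5) to the cumulative hazard.

    u holds the interval endpoints and lam the piecewise hazard values;
    J_i accumulates the baseline hazard mass observed before s_i.
    """
    J_i = sum(lam_l * (min(u[l + 1], s_i) - u[l])
              for l, lam_l in enumerate(lam) if s_i >= u[l])
    return J_i * np.exp(np.dot(x_i, alpha) + gamma * beta0_i)
```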

4.3 Dirichlet Process Prior in the Log-Gamma Joint Model

In the log-gamma joint model, we relax the typical distributional assumption on the βi's in (4.1) by applying a Dirichlet process prior on them, which is given by,

βi | G ∼ G,

G ∼ DP(M, G0),        (4.6)

G0 = cMLG(HG, αG, κG),

where HG is the (N + 2n + 2) × 2 matrix formed by stacking the longitudinal design rows (1, tij) (i = 1, . . . , n; j = 1, . . . , mi), the rows γνi(1, si) (i = 1, . . . , n), the rows γ(1, ci) (i = 1, . . . , n), and the 2 × 2 matrix V0, and where

αG = (δ1 1_N′, δ2 1_n′, δ3 1_n′, α0 1_2′)′,    κG = (σ1 1_N′, σ2 1_n′, σ3 1_n′, κ0 1_2′)′.

Here “DP” stands for Dirichlet process, and δ1, δ2, δ3, σ1, σ2, σ3 are given positive real numbers; α0 and κ0 are positive hyperparameters. The 2 × 2 matrix V0 is unknown, so we complete our Bayesian hierarchical model by specifying a prior distribution for it. Let

V0 = [ v11  v12
       v21  v22 ].        (4.7)

Here we put a restriction on V0 to make inference easier. Let v11 = v22 = 1 and v12 = 0, while v21 may be any real value. This implies that V0 is a triangular matrix, and consequently det(V0) = 1. This form will aid in simplifying the Gibbs sampler. The prior on v21 is given by,

v21 ∼ LG(α1, κ1), where α1 is a positive shape hyperparameter and κ1 is a positive rate hyperparameter.

4.4 Priors and Hyperpriors

We specify proper priors on the parameters in the log-gamma joint likelihood (4.2). Specifically, we set up the priors on the baseline hazard and the covariate coefficients in the survival model as

γ ∼ LG(α2, κ2),

λl ∼ Gamma(al, bl), l = 1, . . . , L,

α ∼ cMLG( H, (δ4 1_n′, α3)′, (σ4 1_n′, κ3)′ ),

where H is the (n + 1) × p matrix formed by stacking the n × p covariate matrix X over the row δ5 1_p′, and p is the number of covariates. α2, α3, κ2 and κ3 are positive hyperparameters, and al, bl (l = 1, . . . , L), δ4, δ5 and σ4 are given positive real numbers. In the longitudinal model, we set the prior as

α ∼ Gamma(θ1, τ1),        (4.8)

where θ1 and τ1 are given positive real numbers. Similar to Section 3.3.2, we use a gamma distribution as the prior for the concentration parameter M in the Dirichlet process prior. Then M can be updated within the Gibbs sampler. Escobar and West [12] discussed several choices for the prior on M. Alternatively, we may consider the Metropolis-Hastings method or adaptive rejection sampling to sample from (3.11). We use the

uniform distribution as the hyperprior for the shape parameters αk (k = 0, 1, 2, 3), and the gamma distribution as the hyperprior for the rate parameters κk (k = 0, 1, 2, 3), i.e.,

αk ∼ U(0, U0),

κk ∼ Gamma(ακ, βκ), for k = 0, 1, 2, 3,

4.5 Gibbs Sampler

Gibbs sampling is a common method used for mixture of Dirichlet process models. When the prior is conjugate, it is convenient to sample from the full conditional distributions since the posterior has the same form as the prior. However, when we have a non-conjugate prior, the sampling is harder since the posterior may be intractable. In our log-gamma joint model, we use a collapsed Gibbs sampling method with Metropolis-Hastings (Hastings [18]) updates where needed. The algorithm is described in Steps 1–10, below.

Suppose we have k distinct values of the βi's, so that there are k clusters. Let zi (i = 1, . . . , n) denote the cluster indicator of Yi; that is, zi = j implies individual i is in the j-th cluster. For each cluster z, φz (1 ≤ z ≤ k) is the common value of the βi's in that cluster.

Step 1. Simulate zi (i = 1, . . . , n) using Neal's Algorithm 8 (Neal [28]).

For i = 1, . . . , n, let k⁻ be the number of distinct zj for j ≠ i, and let h = k⁻ + m, where m is the number of auxiliary parameters. Label these zj (j ≠ i) with values in {1, . . . , k⁻}. If zi = zj for some j ≠ i, draw values independently from G0 for those φz with k⁻ + 1 ≤ z ≤ h. If zi ≠ zj for all j ≠ i, let φ_{k⁻+1} = φ_{zi}, and draw values independently from G0 for those φz with k⁻ + 2 ≤ z ≤ h. Now the distinct values of the βi's are {φ1, . . . , φ_{k⁻}, φ_{k⁻+1}, . . . , φh}. Draw a new value for zi from {1, . . . , h} using the following probabilities:

P(zi = z | z−i, Yi, φ1, . . . , φh) =
  b [n−i,z / (n − 1 + M)] f(Yi, si, νi | φz, λ, γ, α, α, κ),    1 ≤ z ≤ k⁻,
  b [(M/m) / (n − 1 + M)] f(Yi, si, νi | φz, λ, γ, α, α, κ),    k⁻ < z ≤ h,

where z−i = (z1, . . . , zi−1, zi+1, . . . , zn), n−i,z is the number of zj for j ≠ i that are equal to z, and b is the appropriate normalizing constant. Here f(Yi, si, νi | φz, Ω) denotes the log-gamma joint likelihood of our joint model, with Ω collecting the remaining parameters.

Step 2. Simulate the intermediate parameters κ0, κ1, κ2 and κ3.

In the collapsed Gibbs sampler, κ0, κ1, κ2 and κ3 are intermediate parameters with prior distributions denoted p(κ0), p(κ1), p(κ2) and p(κ3). Here we use the same Gamma(ακ, βκ) prior, with shape parameter ακ and rate parameter βκ, for each of them, although one could choose different values of ακ and βκ for these intermediate parameters. Their full conditional distributions are given below.

p(κ0 | Y, z, φ, V0, α, κ, γ, λ, α) ∝ [∏_{i=1}^n f(Yi, si, νi | βi, Ω)] × [∏_{z=1}^k G0(φz | HG, αG, κG)] × p(κ0)
  ∝ [∏_{z=1}^k G0(φz | HG, αG, κG)] × Gamma(κ0 | ακ, βκ)
  ∝ [∏_{z=1}^k κ0^{2α0} exp{ −κ0 1_2′ exp(V0 φz) }] × κ0^{ακ−1} exp(−βκ κ0)
  ∝ κ0^{(2kα0 + ακ) − 1} exp{ −κ0 [Σ_{z=1}^k 1_2′ exp(V0 φz) + βκ] }
  ∼ Gamma( 2kα0 + ακ, Σ_{z=1}^k 1_2′ exp(V0 φz) + βκ ),

p(κ1 | Y, z, φ, V0, α, κ, γ, λ, α) ∝ [∏_{i=1}^n f(Yi, si, νi | βi, Ω)] × p(v21) × p(κ1)
  ∝ LG(v21 | α1, κ1) × Gamma(κ1 | ακ, βκ)
  ∝ κ1^{α1} exp{−κ1 exp(v21)} × κ1^{ακ−1} exp(−βκ κ1)
  ∝ κ1^{(α1 + ακ) − 1} exp{−κ1 (e^{v21} + βκ)}
  ∼ Gamma( α1 + ακ, e^{v21} + βκ ),

p(κ2 | Y, z, φ, V0, α, κ, γ, λ, α) ∝ [∏_{i=1}^n f(Yi, si, νi | βi, Ω)] × p(γ) × p(κ2)
  ∝ LG(γ | α2, κ2) × Gamma(κ2 | ακ, βκ)
  ∝ κ2^{α2} exp{−κ2 exp(γ)} × κ2^{ακ−1} exp(−βκ κ2)
  ∝ κ2^{(α2 + ακ) − 1} exp{−κ2 (e^γ + βκ)}
  ∼ Gamma( α2 + ακ, e^γ + βκ ),

and

p(κ3 | Y, z, φ, V0, α, κ, γ, λ, α) ∝ [∏_{i=1}^n f(Yi, si, νi | βi, Ω)] × p(α) × p(κ3)
  ∝ cMLG(α | α3, κ3) × Gamma(κ3 | ακ, βκ)
  ∝ κ3^{α3} exp{−κ3 exp(δ5 1_p′ α)} × κ3^{ακ−1} exp(−βκ κ3)
  ∝ κ3^{(α3 + ακ) − 1} exp{ −κ3 [exp(δ5 1_p′ α) + βκ] }
  ∼ Gamma( α3 + ακ, exp(δ5 1_p′ α) + βκ ).
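These four updates share the same conjugate form, so a single helper suffices; a minimal sketch (ours), where `shape_par` and `exp_term` stand for the LG/cMLG shape parameter and the exponential term appearing in each rate (e.g., α2 and e^γ for κ2):

```python
import numpy as np

def update_kappa(shape_par, exp_term, a_kappa, b_kappa, rng):
    """Generic gamma draw for the rate parameters kappa_k in Step 2.

    Each full conditional is Gamma(shape_par + a_kappa, exp_term + b_kappa),
    e.g., exp_term = exp(v21) for kappa_1 and exp(gamma) for kappa_2.
    """
    return rng.gamma(shape_par + a_kappa, 1.0 / (exp_term + b_kappa))

rng = np.random.default_rng(1)
# e.g., kappa_2 given gamma = -1, with alpha_2 = 200 and Gamma(1, 1) hyperprior
kappa2 = update_kappa(200.0, np.exp(-1.0), 1.0, 1.0, rng)
```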

Step 3. Simulate φz (z = 1, . . . , k) using its conjugate prior.

In this dissertation we use a linear trajectory function, so the unique values φz are two-dimensional vectors. Let φz = (φ0z, φ1z)′. We have φz ∼iid G0 (z = 1, 2, . . .). Let Sz = {i : zi = z}.

By using the approximation in (3.9), the full conditional distribution of φz (z = 1, 2, . . .) is given by,

p(φz | Y, z, V0, α, κ, γ, λ, α)
  ∝ [∏_{i=1}^n f(Yi, si, νi | βi, Ω)^{I(zi = z)}] × G0(φz | HG, αG, κG)
  ∝ [∏_{i∈Sz} f(Yi, si, νi | βi, Ω)] × G0(φz | HG, αG, κG)
  ∝ ∏_{i∈Sz} ∏_{j=1}^{mi} exp{ α[Yij − ψβi(tij)] − κ exp{Yij − ψβi(tij)} }
    × ∏_{i∈Sz} λ(si)^{νi} exp{ νi [γψβi(si) + Xi′α] − si λ(ci) exp{γψβi(ci) + Xi′α} } × G0(φz | HG, αG, κG)
  ∼ cMLG(HG, αφ, κφ),

where

αφ = ( α (I(z1 = z), . . . , I(zn = z))′ + δ1 1_N, (I(z1 = z), . . . , I(zn = z))′ + δ2 1_n, δ3 1_n, α0 1_2 )′,

with the indicator I(zi = z) in the first block repeated for each of the mi longitudinal observations of subject i, and

κφ = ( κ (e^{Y11} I(z1 = z), . . . , e^{Ynmn} I(zn = z))′ + σ1 1_N, σ2 1_n, (s1 λ(c1) e^{X1′α} I(z1 = z), . . . , sn λ(cn) e^{Xn′α} I(zn = z))′ + σ3 1_n, κ0 1_2 )′.

Step 4. Simulate V0.

In the parameter matrix V0, we assume that v21 has the log-gamma prior LG(α1, κ1). The full conditional distribution of v21 is,

p(v21 | Y, z, φ, γ, α, κ, λ, α)
  ∝ [∏_{i=1}^n f(Yi, si, νi | βi, Ω)] × [∏_z p(φz | V0)] × p(v21)
  ∝ [∏_{z=1}^k p(φz | V0)] × p(v21)
  ∝ [∏_{z=1}^k G0(φz | HG, αG, κG)] × LG(v21 | α1, κ1)
  ∝ [∏_{z=1}^k exp{ α0 1_2′ V0 (φ0z, φ1z)′ − κ0 1_2′ exp(V0 (φ0z, φ1z)′) }] × LG(v21 | α1, κ1)
  ∝ exp{ α0 (Σ_{z=1}^k φ0z) v21 − κ0 Σ_{z=1}^k e^{φ1z} exp(v21 φ0z) } × exp{ α1 v21 − κ1 exp(v21) }
  ∼ cMLG(Hv, αv, κv),

where Hv = (φ01, . . . , φ0k, 1)′, αv = (α0 1_k′, α1)′, and κv = (κ0 e^{φ11}, . . . , κ0 e^{φ1k}, κ1)′. Here k is the number of unique values of the βi's.

Step 5. Simulate α using Metropolis-Hastings method.

The full conditional distribution of α is,

p(α | Y, z, φ, V0, κ, γ, λ, α) ∝ [∏_{i=1}^n f(Yi, si, νi | βi, Ω)] × Gamma(α | θ1, τ1).

If we choose the prior Gamma(θ1, τ1) to be the proposal distribution, we find that this factor cancels when computing the acceptance probability in (3.16), leaving

a(α∗, α) = min{ 1, [∏_{i=1}^n f(Yi, si, νi | φ, z, V0, α∗, κ, γ, λ, α)] / [∏_{i=1}^n f(Yi, si, νi | φ, z, V0, α, κ, γ, λ, α)] }.

Then we can use the Metropolis-Hastings method to update α. With probability a(α∗, α), set the new state of α to be α∗. Otherwise, let the new state be the same as α.

Step 6. Simulate γ using its conjugate prior.

The link parameter γ has the log-gamma prior LG(α2, κ2); σ4 is a given positive number. Here we use the approximation given by (4.5). Then the full conditional distribution of γ is,

p(γ | Y, z, φ, V0, α, κ, λ, α)
  ∝ [∏_{i=1}^n f(Yi, si, νi | βi, Ω)] × LG(γ | α2, κ2)
  ∝ [∏_{i=1}^n λ(si)^{νi} exp{ νi [γ(β0i + β1i si) + Xi′α] − Ji e^{Xi′α + γβ0i} }] × LG(γ | α2, κ2)
  ∝ exp{ γ Σ_{i=1}^n νi (β0i + β1i si) − Σ_{i=1}^n Ji e^{Xi′α + γβ0i} } × exp{ α2 γ − κ2 exp(γ) }
  ∝ exp{ (ν1, . . . , νn, ν1, . . . , νn, α2) Hγ γ − (J1 e^{X1′α}, . . . , Jn e^{Xn′α}, 0_n′, κ2) exp(Hγ γ) }
  ∼ cMLG(Hγ, αγ, κγ),

where Hγ = (β01, . . . , β0n, β11 s1, . . . , β1n sn, 1)′, αγ = (ν1, . . . , νn, ν1, . . . , νn, α2)′, and κγ = (J1 e^{X1′α}, . . . , Jn e^{Xn′α}, 0_n′, κ2)′.

Step 7. Simulate λl (l = 1, . . . , L) using conjugate priors.

The baseline hazard function has piecewise constant values λl (l = 1, . . . , L). The conjugate prior for each λl is Gamma(al, bl). So the full conditional distribution of λl is given by,

p(λl | Y, z, φ, V0, α, κ, γ, α)
  ∝ [∏_{i=1}^n λ(si)^{νi} exp{ −e^{Xi′α} Σ_{l=1}^L Hil(β, γ, λ) }] × Gamma(λl | al, bl)
  ∝ λl^{nl} exp{ −Σ_{i=1}^n e^{Xi′α} Hil(β, γ, λ) } × Gamma(λl | al, bl)
  ∝ λl^{nl} exp{ −λl Σ_{i=1}^n e^{Xi′α} I(si ≥ ul) ∫_{ul}^{min(ul+1, si)} e^{γψβ(u)} du } × Gamma(λl | al, bl)
  ∼ Gamma( al + nl, bl + Σ_{i=1}^n e^{Xi′α} I(si ≥ ul) ∫_{ul}^{min(ul+1, si)} e^{γψβ(u)} du ),

where nl is the number of subjects whose survival time is within the interval [ul, ul+1), l = 1, . . . , L.

Step 8. Simulate α using its conjugate prior.

The p-dimensional parameter vector α has the cMLG prior cMLG(H, (δ4 1_n′, α3)′, (σ4 1_n′, κ3)′), where H is the (n + 1) × p matrix formed by stacking X over the row δ5 1_p′, and X is the n × p covariate matrix for the n subjects with p covariates. Then the full conditional distribution of α is

p(α | Y, z, φ, V0, α, κ, γ, λ)
  ∝ [∏_{i=1}^n exp{ νi Xi′α − e^{Xi′α} Σ_{l=1}^L Hil(β, γ, λ) }] × cMLG(α | H, (δ4 1_n′, α3)′, (σ4 1_n′, κ3)′)
  ∝ exp{ (ν1, . . . , νn) X α − (Σ_{l=1}^L H1l(β, γ, λ), . . . , Σ_{l=1}^L Hnl(β, γ, λ)) exp(Xα) }
    × exp{ (δ4 1_n′, α3) H α − (σ4 1_n′, κ3) exp(H α) }
  ∝ exp{ (ν1 + δ4, . . . , νn + δ4, α3) H α − (Σ_{l=1}^L H1l(β, γ, λ) + σ4, . . . , Σ_{l=1}^L Hnl(β, γ, λ) + σ4, κ3) exp(H α) }
  ∼ cMLG(H, αα, κα),

where αα = (ν1 + δ4, . . . , νn + δ4, α3)′ and κα = (Σ_{l=1}^L H1l(β, γ, λ) + σ4, . . . , Σ_{l=1}^L Hnl(β, γ, λ) + σ4, κ3)′.

Step 9. Simulate the shape parameters α0, α1, α2 and α3 of the LG or cMLG distributions directly from their full conditional distributions.

We use the uniform distribution U(0, 10⁴) as the prior distribution for α0, α1, α2 and α3. Then the full conditional distributions are given by,

p(α0 | Y, z, φ, V0, α, κ, γ, λ, α) ∝ [∏_{i=1}^n f(Yi, si, νi | βi, Ω)] × [∏_{z=1}^k G0(φz | HG, αG, κG)] × U(0, 10⁴)
  ∝ ∏_{z=1}^k [κ0^{α0} / Γ(α0)]² exp{α0 1_2′ V0 φz}
  ∝ [κ0^{α0} / Γ(α0)]^{2k} exp{α0 Σ_{z=1}^k 1_2′ V0 φz},

p(α1 | Y, z, φ, V0, α, κ, γ, λ, α) ∝ [∏_{i=1}^n f(Yi, si, νi | βi, Ω)] × LG(v21 | α1, κ1) × U(0, 10⁴)
  ∝ [κ1^{α1} / Γ(α1)] exp{α1 v21},

p(α2 | Y, z, φ, V0, α, κ, γ, λ, α) ∝ [∏_{i=1}^n f(Yi, si, νi | βi, Ω)] × LG(γ | α2, κ2) × U(0, 10⁴)
  ∝ [κ2^{α2} / Γ(α2)] exp{α2 γ},

and

p(α3 | Y, z, φ, V0, α, κ, γ, λ, α) ∝ [∏_{i=1}^n f(Yi, si, νi | βi, Ω)] × cMLG(α | H, αα, κα) × U(0, 10⁴)
  ∝ [κ3^{α3} / Γ(α3)] exp{α3 δ5 1_p′ α}.

Step 10. Simulate M using the technique discussed in Section 3.3.2.

We use a single gamma prior Γ(aM, bM) on the concentration parameter M in the Dirichlet process prior. As discussed in Section 3.3.2, we first simulate an auxiliary parameter η from the beta distribution specified in (3.13), then sample a new value for M from the mixture of gamma distributions specified in (3.14).

CHAPTER 5

SIMULATION STUDY

In Sections 5.1 and 5.2, we illustrate the use of the Gaussian joint model (see Chapter 3) and the log-gamma joint model (see Chapter 4), respectively. We also incorporate clustering into these joint models, which allows us to detect differing features between two or more clusters (e.g., two or more treatments). Thus, in each simulation study we use three different models to generate data: one cluster, two cluster and three cluster simulation data. We run 5,000 iterations of the MCMC chain for all simulated datasets, and the results are based on the last 3,000 iterations. We estimate the posterior mean and the standard deviation for each parameter of interest by using the sample mean and the sample standard deviation of the last 3,000 iterations. We also include 95% credible intervals, which are computed as the 2.5% and 97.5% quantiles of the draws in the last 3,000 iterations. In Section 5.3, we compare the performance of the semiparametric Gaussian joint models and the semiparametric log-gamma joint models at both the individual and cluster levels. At the individual level, we compute the mean squared error (MSE) of the longitudinal observations and the hazard from both of these joint models. At the cluster level, we compare their clusterings by evaluating the estimated number of clusters and by computing the adjusted Rand index. We also compare the semiparametric joint models with parametric Cox models by evaluating the cumulative hazard.

5.1 Gaussian Joint Model

We start this section by describing the specifications of the Gaussian joint model introduced in Chapter 3. First, we generate data with different numbers of clusters (one cluster, two clusters, and three clusters). The one cluster simulation data has 400 subjects in total; the two cluster simulation data has 200 subjects in each cluster; and the three cluster simulation data has equal sample sizes in the three clusters, with 100 in each. In each cluster, the longitudinal responses Yij are generated from a Gaussian distribution with a cluster-specific mean, while the variance is kept the same for all clusters. All subjects share the same link parameter γ and covariate parameter

38 Figure 5.1: Histograms of simulated survival time. The number of clusters is indicated in the title heading of each panel.

α across clusters. For simplicity, we assume there is only one covariate (i.e., the covariate vector X is n-dimensional and is generated from the standard Gaussian distribution, where n is the total number of subjects). The survival time is generated from its cumulative distribution function based on the hazard function in (3.5) given the mean longitudinal trajectory. Figure 5.1 shows histograms of the survival time for the one cluster, two cluster, and three cluster settings. We define the starting time and the end of the study arbitrarily. If a subject's survival time is later than the end of the study, then it is right censored. The censoring percentages for the one cluster, two cluster and three cluster data are 5.25%, 9.75% and 9%, respectively. There is no missing data in this simulation study. However, in real data problems, there could be missingness in the data. For example, it is possible that patients miss a clinic visit or drop out at some time.
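To illustrate how such survival times can be generated, here is a minimal sketch (ours, not the dissertation's code) that inverts the cumulative hazard numerically: H(T) = −log U with U ∼ Uniform(0, 1), and times whose cumulative hazard never reaches −log U before the end of study t_max are right censored there. All names are assumptions.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def draw_survival_time(b0, b1, x_alpha, gamma, lam_fun, t_max, rng):
    """Draw one survival time (and censoring indicator) by inverting H(t).

    The hazard is h(t) = lam_fun(t) * exp(gamma * (b0 + b1 * t) + x_alpha),
    with x_alpha = X_i' alpha precomputed for the subject.
    """
    target = -np.log(rng.random())
    H = lambda t: quad(lambda v: lam_fun(v) *
                       np.exp(gamma * (b0 + b1 * v) + x_alpha), 0, t)[0]
    if H(t_max) < target:
        return t_max, 0                      # censored at end of study
    return brentq(lambda t: H(t) - target, 0.0, t_max), 1

lam = lambda t: 0.5 if t < 4 else 1.0        # piecewise baseline from the design
rng = np.random.default_rng(2)
s, nu = draw_survival_time(3.0, -1.0, -0.3, -1.0, lam, 6.0, rng)
```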

5.1.1 Simulation Design

We assume a linear trajectory in this simulation study. Recall that the Gaussian joint model is structured as

Yij = ψβi(tij) + εij,

h(t | Yi) = λ(t) exp{γψβi(t) + Xi′α},

where the mean trajectory is given by ψβi(tij) = β0i + β1i tij; the longitudinal error has a Gaussian distribution with mean zero and variance equal to 0.1 (i.e., εij ∼ N(0, 0.1)); γ is a scalar parameter linking the trajectory to the hazard function; λ(t) is the baseline hazard, which is assumed to be piecewise constant; α, in this simulation study, is a one-dimensional covariate parameter linking an n-dimensional vector X of baseline covariates to the failure times; and n is the total number of subjects. We set up the true values of the parameters as follows:

• There are six observed time points, which are t = (0, 1, 2, 3, 4, 5).

• In Case 1 (one cluster simulation), let (β0i, β1i) = (3, −1) for i = 1,..., 400;

In Case 2 (two cluster simulation), let (β0i, β1i) = (3, −1) for i = 1,..., 200, and let (β0i, β1i) = (2, −0.5) for i = 201,..., 400;

In Case 3 (three cluster simulation), let (β0i, β1i) = (3, −1) for i = 1,..., 100, let (β0i, β1i) =

(3, −0.5) for i = 101,..., 200, and let (β0i, β1i) = (2, −0.5) for i = 201,..., 300.

• Assume that the longitudinal error has Gaussian distribution with mean zero and variance 0.1.

• Let the baseline hazard λ(t) be a piecewise constant function and

( 0.5, 0 < t < 4 λ(t) = 1, t ≥ 4.

The time points defining the intervals were taken to be t = (0, 1, 2, 3, 4, 5, 6). Denote the baseline hazard within each time interval as λl = 0.5 for l = 1, 2, 3, 4, and λl = 1 for l = 5, 6.

• Let the link parameter γ = −1, which implies that larger longitudinal measurements reduce the hazard.

• Let the covariate parameter α = −1.

These specifications are motivated by what is observed in an exploratory analysis of the aforementioned AIDS dataset, where a decreasing trend is observed in the longitudinal observations. If we measure CD4 cell counts as the longitudinal outcome and study time to death, higher CD4 cell counts usually mean a lower chance of being diagnosed with AIDS.

40 5.1.2 Simulation Results

We use the Gibbs sampler presented in Chapter 3 for estimation. The priors and hyperpriors for the parameters in the Gaussian joint model were taken as

b0 ∼ N2((3, −1)′, I2),

V0⁻¹ ∼ Wishart(nv, I2),

M ∼ Gamma(1, 1),

σ² ∼ IG(2, 1),

γ ∼ N(−1, 0.1),

λl ∼ Gamma(1, 1), l = 1, 2, . . . , 5,

α ∼ N(−1, 1),

where I2 is the 2 × 2 identity matrix, and “N2,” “IG” and “Gamma” stand for the bivariate normal, inverse gamma and gamma distributions. When we use the Metropolis-Hastings method to update β0i and β1i, we set nv = 30 for the one cluster simulation data, nv = 10 for the two cluster simulation data, and nv = 10 for the three cluster simulation. When we use the slice sampler to update β0i and

β1i, we set nv = 10 for all cases. Parameter tuning will be discussed in Section 5.1.3. The Metropolis-Hastings method and the slice sampler are used to update β0i, β1i and α. The Gaussian joint model implemented with the Metropolis-Hastings method for β0i, β1i and α is labeled “GaussianMH,” and the Gaussian joint model with the slice sampler for β0i, β1i and α is labeled “GaussianSS.” We run 5,000 iterations of the Gibbs sampler with 2,000 as the “burn-in.” We specify the burn-in number by observing the MCMC trace plots. That is, the posterior mean is computed using the last 3,000 iterations to obtain point estimates of the parameters β, γ, α and λ at the individual level. For the longitudinal coefficients β, we compute the posterior mean for each subject, which implies that each subject may have different estimated values of β regardless of clustering. We ran a total of 50 replicate simulations according to Section 5.1.1 to evaluate parameter estimation. Tables 5.1 and 5.2 show the true and estimated values and 95% credible intervals of the parameters from models GaussianMH and GaussianSS. The MCMC trace plots and the posterior density estimates of β01, β11, γ, α from one of the replicates of Case 3 (three cluster Gaussian distributed simulated data) from model GaussianSS are given in Figure 5.2 and Figure 5.3. Figure 5.4 shows the estimated

longitudinal trajectories from model GaussianSS for six subjects in Case 3 (three cluster Gaussian distributed simulated data). Subjects 1 and 2 were randomly selected from the first cluster; subjects 3 and 4 were randomly selected from the second cluster; and subjects 5 and 6 were randomly selected from the third cluster. The estimated trajectories capture the true longitudinal trajectories well and clearly show a decreasing trend over time. Figure 5.5 shows the true and estimated hazard rate from model GaussianSS for those six subjects. Both the fitted longitudinal trajectories and the fitted hazards are close to the actual values. In this simulation study we have one covariate and the covariate parameter α is a scalar. Our model can easily be extended to higher dimensions.

5.1.3 Discussion

In this section, we perform a simulation study with Gaussian distributed simulation data as outlined in Section 5.1.1. In the Gibbs sampler, we consider both the Metropolis-Hastings method and the slice sampler to update β and α, and both choices tended to work well in terms of MSE and coverage of the 95% pointwise credible intervals. That is, the estimates of β are close to the true values. The small biases are likely due to Monte Carlo simulation error, because they are less than the MCMC standard error. For the parameters γ and α, the Gaussian joint models clearly have biased estimates in some of the cases. The MSE of the longitudinal observations and the hazard rate are shown in Table 5.3. The MSE is defined as the mean squared difference between the estimated values and the true values. Both GaussianMH and GaussianSS have small MSE for the longitudinal observations; GaussianSS has slightly smaller longitudinal MSE than GaussianMH. These two models have similar MSE for the hazard rate. (Cases are defined in the footnotes of Table 5.3.)

In the Gibbs sampler, we arbitrarily chose the value of nv (a hyperparameter in the Dirichlet process prior). We have found that nv is positively correlated with the posterior mean of V0⁻¹, and consequently, we choose nv to be large enough to ensure that the diagonal values of V0, which are the variances in G0 (the base distribution of the Dirichlet process prior), are not too large. Otherwise it can be difficult for the MCMC chain to converge. In our simulation study, we gave a strongly informative prior on the link parameter γ. However, the estimates from the Gaussian joint model are still biased. Therefore, the Gaussian model performed well in terms of the MSE of the hazard but has poor performance in estimating γ. This is particularly troubling in our joint modeling setting, since γ is the parameter that links the longitudinal and survival data.

Table 5.1: Parameter estimates for simulation study of Cases 1-3 (Gaussian distributed data) using model GaussianMH (the Gaussian joint model with Metropolis-Hastings)

True value | Estimate (SE) | 95% CI Lower Bound (SE) | 95% CI Upper Bound (SE)

Case 1
β01 = 3 | 3.02 (0.05) | 2.90 (0.07) | 3.13 (0.11)
β11 = −1 | −1.03 (0.03) | −1.11 (0.08) | −0.98 (0.05)
γ = −1 | −1.22 (0.19) | −1.32 (0.12) | −1.11 (0.25)
α = −1 | −1.00 (0.07) | −1.13 (0.08) | −0.87 (0.07)
λ1 = 0.5 | 0.91 (0.30) | 0.54 (0.22) | 1.41 (0.38)
λ2 = 0.5 | 0.70 (0.14) | 0.49 (0.11) | 0.95 (0.16)
λ3 = 0.5 | 0.51 (0.05) | 0.41 (0.04) | 0.63 (0.06)
λ4 = 0.5 | 0.40 (0.07) | 0.32 (0.05) | 0.50 (0.09)
λ5 = 1 | 0.66 (0.20) | 0.48 (0.11) | 0.89 (0.31)
λ6 = 1 | 1.00 (0.02) | 0.03 (0.003) | 3.68 (0.12)

Case 2
β01 = 3 | 2.98 (0.11) | 2.75 (0.32) | 3.07 (0.07)
β11 = −1 | −1.00 (0.06) | −1.06 (0.06) | −0.88 (0.16)
β0,201 = 2 | 2.08 (0.11) | 1.97 (0.32) | 2.20 (0.07)
β1,201 = −0.5 | −0.55 (0.06) | −0.61 (0.06) | −0.50 (0.16)
γ = −1 | −0.65 (0.03) | −0.68 (0.05) | −0.64 (0.03)
α = −1 | −0.95 (0.06) | −1.07 (0.07) | −0.82 (0.07)
λ1 = 0.5 | 0.26 (0.05) | 0.18 (0.04) | 0.36 (0.06)
λ2 = 0.5 | 0.32 (0.04) | 0.24 (0.03) | 0.42 (0.05)
λ3 = 0.5 | 0.41 (0.05) | 0.32 (0.04) | 0.51 (0.05)
λ4 = 0.5 | 0.53 (0.04) | 0.42 (0.04) | 0.64 (0.05)
λ5 = 1 | 1.24 (0.13) | 1.00 (0.11) | 1.52 (0.15)
λ6 = 1 | 1.00 (0.02) | 0.03 (0.003) | 3.67 (0.10)

Case 3
β01 = 3 | 2.95 (0.14) | 2.66 (0.24) | 3.12 (0.09)
β11 = −1 | −0.96 (0.11) | −1.07 (0.06) | −0.78 (0.18)
β0,101 = 3 | 3.00 (0.14) | 2.91 (0.24) | 3.05 (0.09)
β1,101 = −0.5 | −0.53 (0.11) | −0.58 (0.06) | −0.49 (0.18)
β0,201 = 2 | 2.09 (0.14) | 1.99 (0.24) | 2.29 (0.09)
β1,201 = −0.5 | −0.54 (0.11) | −0.65 (0.06) | −0.50 (0.18)
γ = −1 | −0.65 (0.03) | −0.67 (0.04) | −0.64 (0.03)
α = −1 | −0.91 (0.07) | −1.07 (0.08) | −0.76 (0.07)
λ1 = 0.5 | 0.26 (0.06) | 0.17 (0.05) | 0.39 (0.07)
λ2 = 0.5 | 0.31 (0.05) | 0.22 (0.04) | 0.43 (0.07)
λ3 = 0.5 | 0.37 (0.05) | 0.27 (0.04) | 0.49 (0.06)
λ4 = 0.5 | 0.46 (0.06) | 0.35 (0.05) | 0.60 (0.07)
λ5 = 1 | 1.03 (0.11) | 0.80 (0.09) | 1.29 (0.13)
λ6 = 1 | 1.00 (0.02) | 0.03 (0.003) | 3.66 (0.11)

NOTE: Standard errors are in parentheses.

Table 5.2: Parameter estimates for simulation study of Cases 1-3 (Gaussian distributed data) using model GaussianSS (the Gaussian joint model with slice sampler)

True value | Estimate (SE) | 95% CI Lower Bound (SE) | 95% CI Upper Bound (SE)

Case 1
β01 = 3 | 3.00 (0.03) | 2.77 (0.06) | 3.22 (0.09)
β11 = −1 | −1.02 (0.01) | −1.13 (0.05) | −0.91 (0.03)
γ = −1 | −1.10 (0.11) | −1.27 (0.08) | −0.90 (0.13)
α = −1 | −0.77 (0.11) | −0.98 (0.12) | −0.57 (0.10)
λ1 = 0.5 | 0.82 (0.21) | 0.43 (0.12) | 1.38 (0.31)
λ2 = 0.5 | 0.66 (0.13) | 0.44 (0.09) | 0.94 (0.17)
λ3 = 0.5 | 0.54 (0.05) | 0.42 (0.04) | 0.67 (0.06)
λ4 = 0.5 | 0.45 (0.06) | 0.34 (0.05) | 0.57 (0.08)
λ5 = 1 | 0.70 (0.14) | 0.46 (0.08) | 1.04 (0.22)
λ6 = 1 | 1.00 (0.02) | 0.02 (0.003) | 3.69 (0.12)

Case 2
β01 = 3 | 2.95 (0.10) | 2.57 (0.28) | 3.18 (0.04)
β11 = −1 | −0.99 (0.05) | −1.10 (0.05) | −0.82 (0.14)
β0,201 = 2 | 2.02 (0.10) | 1.89 (0.28) | 2.25 (0.04)
β1,201 = −0.5 | −0.51 (0.04) | −0.62 (0.05) | −0.46 (0.14)
γ = −1 | −0.66 (0.03) | −0.70 (0.07) | −0.64 (0.03)
α = −1 | −0.87 (0.09) | −1.04 (0.09) | −0.70 (0.08)
λ1 = 0.5 | 0.29 (0.05) | 0.20 (0.04) | 0.41 (0.07)
λ2 = 0.5 | 0.34 (0.05) | 0.25 (0.04) | 0.45 (0.06)
λ3 = 0.5 | 0.43 (0.04) | 0.33 (0.04) | 0.53 (0.05)
λ4 = 0.5 | 0.52 (0.05) | 0.42 (0.04) | 0.64 (0.06)
λ5 = 1 | 1.20 (0.15) | 0.94 (0.12) | 1.48 (0.18)
λ6 = 1 | 1.00 (0.02) | 0.03 (0.003) | 3.69 (0.09)

Case 3
β01 = 3 | 2.97 (0.06) | 2.60 (0.21) | 3.21 (0.08)
β11 = −1 | −0.97 (0.07) | −1.11 (0.05) | −0.75 (0.18)
β0,101 = 3 | 2.98 (0.06) | 2.73 (0.21) | 3.17 (0.08)
β1,101 = −0.5 | −0.53 (0.07) | −0.63 (0.05) | −0.42 (0.18)
β0,201 = 2 | 2.05 (0.06) | 1.89 (0.21) | 2.35 (0.08)
β1,201 = −0.5 | −0.52 (0.07) | −0.67 (0.05) | −0.46 (0.18)
γ = −1 | −0.65 (0.03) | −0.66 (0.04) | −0.63 (0.03)
α = −1 | −0.92 (0.09) | −1.12 (0.10) | −0.72 (0.08)
λ1 = 0.5 | 0.25 (0.05) | 0.16 (0.04) | 0.38 (0.07)
λ2 = 0.5 | 0.31 (0.06) | 0.21 (0.05) | 0.43 (0.08)
λ3 = 0.5 | 0.38 (0.05) | 0.28 (0.04) | 0.51 (0.05)
λ4 = 0.5 | 0.48 (0.06) | 0.36 (0.05) | 0.62 (0.08)
λ5 = 1 | 1.00 (0.10) | 0.77 (0.08) | 1.26 (0.12)
λ6 = 1 | 1.00 (0.02) | 0.03 (0.003) | 3.67 (0.12)

NOTE: Standard errors are in parentheses.

Figure 5.2: Trace plots and density curves for the last 3,000 iterations using model GaussianSS in Case 3. (a) MCMC trace plot of β01 (intercept for the first subject). (b) Posterior density estimate of β01 from model GaussianSS (solid line) and the true value of β01 (dashed line). (c) MCMC trace plot of β11 (slope for the first subject). (d) Posterior density estimate of β11 from model GaussianSS (solid line) and the true value of β11 (dashed line).

Figure 5.3: Trace plots and density curves for the last 3,000 iterations using model GaussianSS in Case 3. (a) MCMC trace plot of γ (link parameter). (b) Posterior density estimate of γ from model GaussianSS (solid line) and the true value of γ (dashed line). (c) MCMC trace plot of α (covariate parameter). (d) Posterior density estimate of α from model GaussianSS (solid line) and the true value of α (dashed line).

Figure 5.4: True longitudinal observations vs. estimated longitudinal trajectories using model GaussianSS in Case 3 (the simulation study with three cluster Gaussian distributed data for the Gaussian joint model).

Figure 5.5: True hazard rate vs. estimated hazard rate from model GaussianSS in Case 3 (the simulation study with three cluster Gaussian distributed data for the Gaussian joint model).

5.2 Log-Gamma Joint Model

We start this section by describing the specifications of the log-gamma joint model introduced in Chapter 4. Similar to Section 5.1, we generate data with different numbers of clusters (one cluster, two clusters, and three clusters). The one cluster simulation data has 400 subjects in total; the two cluster simulation data has 200 subjects in each cluster; and the three cluster simulation data has equal sample sizes in the three clusters, with 100 in each. In each cluster, the longitudinal responses Yij are generated from a log-gamma distribution with a cluster-specific mean, while the variance is kept the same for all clusters. All subjects share the same link parameter γ and covariate parameter α across clusters. For simplicity, we assume there is only one covariate (i.e., the covariate vector X is n-dimensional and is generated from the standard Gaussian distribution, where n is the total number of subjects). The survival time is generated from its cumulative distribution function based on the hazard function in Section 4.1 given the mean longitudinal trajectory. Figure 5.6 shows histograms of the survival time for the one cluster, two cluster, and three cluster settings. We define the starting time and the end of the study arbitrarily. If a subject's survival time is later than the end of the study, then it is right censored. The censoring percentages for the one cluster, two cluster and three cluster data are 3.75%, 14.75% and 7%, respectively. There is no missing data in this simulation study. However, in real data problems, there could be missingness in the data. For example, it is possible that patients miss a clinic visit or drop out at some time.

5.2.1 Simulation Design

We assume a linear trajectory in this simulation study. Recall that the log-gamma joint model is proposed as

Yij = ψβi(tij) + εij,

h(t | Yi) = λ(t) exp{γψβi(t) + Xi′α},

where the mean trajectory is given by ψβi(tij) = β0i + β1i tij; the longitudinal error has a log-gamma distribution (i.e., εij ∼ LG(α, κ)); γ is a scalar parameter linking the trajectory to the hazard function; λ(t) is the baseline hazard, assumed to be piecewise constant; α, in this simulation study, is a one-dimensional parameter linking an n-dimensional vector X of baseline covariates to the failure times; and n is the total number of subjects. We set up the true values of the parameters as follows:

48 Figure 5.6: Histograms of simulated survival time. The number of clusters is indicated in the title heading of each panel.

• There are six observation time points, which are t = (0, 1, 2, 3, 4, 5).

• In Case 4 (one cluster simulation), let (β0i, β1i) = (3, −1) for i = 1,..., 400;

In Case 5 (two clusters simulation), let (β0i, β1i) = (3, −1) for i = 1,..., 200, and let

(β0i, β1i) = (2, −0.5) for i = 201,..., 400;

In Case 6 (three clusters simulation), let (β0i, β1i) = (3, −1) for i = 1,..., 100, let (β0i, β1i) =

(3, −0.5) for i = 101,..., 200, and let (β0i, β1i) = (2, −0.5) for i = 201,..., 300.

• Assume that the longitudinal error has mean zero and variance 0.1. From (2.4) and (2.5),

this implies that α ≈ 10.49 and κ ≈ 10.00.

• Let the baseline hazard λ(t) be a piecewise constant function and

( 0.5, 0 < t < 4 λ(t) = 1, t ≥ 4.

The time points defining the intervals were taken to be t = (0, 1, 2, 3, 4, 5, 6). Denote the baseline hazard within each interval as λl = 0.5 for l = 1, 2, 3, 4, and λl = 1 for l = 5, 6.

49 • Let γ = −1, which implies that larger longitudinal measurements reduce the hazard.

• Let the covariate parameter α = −1.

These specifications are the same as in the simulation study for the Gaussian joint model in Section 5.1, except that the longitudinal error has a log-gamma distribution instead of a Gaussian distribution. The distribution of the longitudinal observations is slightly skewed under the log-gamma assumption, while it is symmetric under the Gaussian assumption. Our specification of the log-gamma is not strongly skewed (skewness roughly −0.32). However, it may still affect the estimation accuracy of these two types of joint models. This choice makes the comparison between the log-gamma and Gaussian joint models in Section 5.3 somewhat more favorable to the Gaussian joint model.

5.2.2 Simulation Results

We use the Gibbs sampler presented in Chapter 4 for estimation. The priors and hyperpriors for the parameters in the log-gamma joint model are taken as

κ0 ∼ IG(10, 10),

κ1 ∼ IG(10, 10),

κ3 ∼ IG(10, 10),

α ∼ Gamma(100, 10),

M ∼ Gamma(1, 1),

γ ∼ LG(α2, κ2),

λl ∼ Gamma(10, 10), l = 1, 2, . . . , 5,

α ∼ cMLG( (X′, 1)′, (0.005 × 1_n′, α3)′, (0.005 × 1_n′, κ3)′ ),

where “IG”, “LG”, “Gamma” and “cMLG” stand for the inverse gamma, log-gamma, gamma and conditional multivariate log-gamma distributions. In this simulation study, we update α0, α1, and α3 directly from their full conditional distributions. We do not update the hyperparameters α2 and κ2. In Case 4 and Case 5 we let α2 = 200 and κ2 = 250; in Case 6 we let α2 = 100 and κ2 = 250. We specify the hyperparameters of the prior of γ to make the estimate comparable to the result from the Gaussian joint models. Comparisons will be given in Section 5.3.

We ran 5,000 iterations of the Gibbs sampler with 2,000 as the “burn-in.” We specify the burn-in number by observing the MCMC trace plots. We compute the posterior mean using the last 3,000 iterations to produce point estimates of the parameters β, γ, α and λ. Similar to the Gaussian joint model simulation study, each subject may have different estimated values of β, since we compute the posterior mean of β for every subject. We ran a total of 50 replicate simulations according to Section 5.2.1 to evaluate parameter estimation. Table 5.4 shows the true and estimated values and 95% pointwise credible intervals of the parameters in the log-gamma joint model. The MCMC trace plots and the posterior density estimates of β01, β11, γ, α from one of the replicates are given in Figure 5.7 and Figure 5.8. In Figure 5.7 (d), the two modes in the posterior density of β1 may be due to label switching (Stephens [34]). Figure 5.9 shows the estimated longitudinal trajectories for six subjects in Case 6 (the simulation study with three cluster log-gamma distributed data for the log-gamma joint model). Subjects 1 and 2 were randomly selected from the first cluster; subjects 3 and 4 were randomly selected from the second cluster; and subjects 5 and 6 were randomly selected from the third cluster. The estimated trajectories capture the true longitudinal trajectories well and clearly show a decreasing trend over time. Figure 5.10 shows the true and estimated hazard rate for those six subjects. Both the fitted longitudinal trajectories and the fitted hazards are close to the actual values. In this simulation study we have one covariate and the covariate parameter α is a scalar. Our model can easily be extended to higher dimensions.

5.2.3 Discussion

In this section, we perform a simulation study with log-gamma distributed simulation data as outlined in Section 5.2.1. In the Gibbs sampler, we use the slice sampler to update φ (the unique values of β), V0, α, γ and α. The baseline hazard λ and the hyperparameters have conjugate priors, and their full conditional distributions are straightforward to sample from. The estimates of β, γ and α are close to the true values. The small biases are likely due to Monte Carlo simulation error, because they are less than the MCMC standard error. Our log-gamma joint model performed well in terms of MSE: both the MSE of the longitudinal observations and the MSE of the hazard rate are small, as shown in Table 5.3.

Figure 5.7: Trace plots and density curves for the last 3,000 iterations using the log-gamma joint model in Case 6. (a) MCMC trace plot of β0. (b) Posterior density estimate of β0 from the semiparametric log-gamma joint model (solid line) and the true value of β0 (dashed line). (c) MCMC trace plot of β1. (d) Posterior density estimate of β1 from the semiparametric log-gamma joint model (solid line) and the true value of β1 (dashed line).

Figure 5.8: Trace plots and density curves for the last 3,000 iterations using the log-gamma joint model in Case 6. (a) MCMC trace plot of γ. (b) Posterior density estimate of γ from the semiparametric log-gamma joint model (solid line) and the true value of γ (dashed line). (c) MCMC trace plot of α. (d) Posterior density estimate of α from the semiparametric log-gamma joint model (solid line) and the true value of α (dashed line).

Figure 5.9: True longitudinal observations vs. estimated longitudinal trajectories using the log-gamma joint model in Case 6 (three cluster simulation study of log-gamma distributed data).

Figure 5.10: True hazard rate vs. estimated hazard rate using the log-gamma joint model in Case 6 (three cluster simulation study of log-gamma distributed data).

Table 5.3: MSE for the Gaussian joint model and the log-gamma joint model

Case | Data Distribution | Joint Model | Longitudinal MSE (SE) | Hazard MSE (SE)

Case 1 (a) | Gaussian | GaussianMH | 0.101 (0.004) | 121.44 (138.26)
 | | GaussianSS | 0.097 (0.004) | 9.94 (7.46)
 | | LG | 0.100 (0.004) | 8.48 (8.23)
Case 2 (b) | Gaussian | GaussianMH | 0.093 (0.004) | 11.06 (9.42)
 | | GaussianSS | 0.091 (0.003) | 14.07 (6.87)
 | | LG | 0.094 (0.003) | 5.22 (4.52)
Case 3 (c) | Gaussian | GaussianMH | 0.092 (0.004) | 12.40 (8.84)
 | | GaussianSS | 0.090 (0.004) | 13.48 (11.61)
 | | LG | 0.095 (0.004) | 7.37 (6.42)
Case 4 (d) | LG | GaussianMH | 0.096 (0.004) | 4.54 (2.26)
 | | GaussianSS | 0.094 (0.004) | 9.84 (8.93)
 | | LG | 0.100 (0.003) | 8.17 (11.42)
Case 5 (e) | LG | GaussianMH | 0.092 (0.004) | 10.93 (5.53)
 | | GaussianSS | 0.090 (0.005) | 14.04 (7.65)
 | | LG | 0.093 (0.003) | 3.94 (2.06)
Case 6 (f) | LG | GaussianMH | 0.091 (0.004) | 14.43 (11.90)
 | | GaussianSS | 0.089 (0.003) | 13.99 (8.84)
 | | LG | 0.097 (0.011) | 9.72 (9.50)

NOTE: Standard errors are in parentheses.
(a) Case 1: one cluster Gaussian simulation data with unique value of β (3, −1).
(b) Case 2: two cluster Gaussian simulation data with unique values of β (3, −1) and (2, −0.5).
(c) Case 3: three cluster Gaussian simulation data with unique values of β (3, −1), (3, −0.5) and (2, −0.5).
(d) Case 4: one cluster log-gamma simulation data with unique value of β (3, −1).
(e) Case 5: two cluster log-gamma simulation data with unique values of β (3, −1) and (2, −0.5).
(f) Case 6: three cluster log-gamma simulation data with unique values of β (3, −1), (3, −0.5) and (2, −0.5).

Table 5.4: Parameter estimates for simulation study of Cases 4-6 (log-gamma distributed data) using the log-gamma joint model.

True value | Estimate (SE) | 95% CI Lower Bound (SE) | 95% CI Upper Bound (SE)

Case 4
β01 = 3 | 2.97 (0.29) | 2.94 (0.32) | 2.99 (0.27)
β11 = −1 | −0.99 (0.10) | −1.00 (0.09) | −0.98 (0.11)
γ = −1 | −1.11 (0.07) | −1.23 (0.08) | −1.00 (0.06)
α = −1 | −0.97 (0.06) | −1.09 (0.07) | −0.85 (0.06)
λ1 = 0.5 | 0.75 (0.12) | 0.49 (0.08) | 1.09 (0.16)
λ2 = 0.5 | 0.65 (0.08) | 0.48 (0.07) | 0.85 (0.11)
λ3 = 0.5 | 0.56 (0.05) | 0.45 (0.04) | 0.68 (0.06)
λ4 = 0.5 | 0.49 (0.06) | 0.40 (0.05) | 0.59 (0.07)
λ5 = 1 | 0.85 (0.09) | 0.65 (0.07) | 1.08 (0.12)
λ6 = 1 | 1.00 (0.01) | 0.48 (0.01) | 1.71 (0.03)

Case 5
β01 = 3 | 2.96 (0.14) | 2.88 (0.26) | 3.02 (0.02)
β11 = −1 | −0.98 (0.07) | −1.01 (0.01) | −0.94 (0.13)
β0,201 = 2 | 2.07 (0.14) | 1.97 (0.26) | 2.24 (0.02)
β1,201 = −0.5 | −0.54 (0.07) | −0.62 (0.01) | −0.49 (0.13)
γ = −1 | −0.92 (0.04) | −1.02 (0.04) | −0.82 (0.03)
α = −1 | −0.93 (0.06) | −1.06 (0.07) | −0.81 (0.06)
λ1 = 0.5 | 0.52 (0.07) | 0.36 (0.06) | 0.72 (0.10)
λ2 = 0.5 | 0.52 (0.05) | 0.39 (0.04) | 0.67 (0.07)
λ3 = 0.5 | 0.52 (0.06) | 0.41 (0.05) | 0.64 (0.07)
λ4 = 0.5 | 0.54 (0.05) | 0.43 (0.04) | 0.65 (0.06)
λ5 = 1 | 1.02 (0.10) | 0.82 (0.08) | 1.25 (0.11)
λ6 = 1 | 1.00 (0.01) | 0.48 (0.01) | 1.70 (0.02)

Case 6
β01 = 3 | 2.90 (0.22) | 2.76 (0.35) | 2.99 (0.16)
β11 = −1 | −0.91 (0.13) | −0.98 (0.08) | −0.76 (0.23)
β0,101 = 3 | 2.99 (0.22) | 2.95 (0.35) | 3.03 (0.16)
β1,101 = −0.5 | −0.54 (0.13) | −0.57 (0.08) | −0.49 (0.23)
β0,201 = 2 | 2.02 (0.22) | 1.96 (0.35) | 2.14 (0.16)
β1,201 = −0.5 | −0.51 (0.13) | −0.57 (0.08) | −0.49 (0.23)
γ = −1 | −1.02 (0.04) | −1.13 (0.04) | −0.90 (0.04)
α = −1 | −0.93 (0.07) | −1.08 (0.07) | −0.79 (0.07)
λ1 = 0.5 | 0.62 (0.08) | 0.40 (0.06) | 0.89 (0.10)
λ2 = 0.5 | 0.60 (0.06) | 0.42 (0.05) | 0.82 (0.07)
λ3 = 0.5 | 0.57 (0.06) | 0.43 (0.05) | 0.74 (0.07)
λ4 = 0.5 | 0.57 (0.06) | 0.44 (0.05) | 0.74 (0.07)
λ5 = 1 | 0.99 (0.09) | 0.78 (0.08) | 1.23 (0.11)
λ6 = 1 | 1.00 (0.01) | 0.48 (0.01) | 1.71 (0.02)

NOTE: Standard errors are in parentheses.

5.3 Model Comparison

5.3.1 Comparison between Gaussian Joint Model and Log-Gamma Joint Model

Sections 5.1 and 5.2 contain simulation studies for the Gaussian joint model and the log-gamma joint model. In those illustrations, we use the Gaussian joint model to fit Gaussian distributed simulation data and the log-gamma model to fit log-gamma distributed simulation data. In this section, we implement both joint models using both the Gaussian distributed data generated according to Section 5.1 (Cases 1, 2 and 3) and the log-gamma distributed data generated according to Section 5.2 (Cases 4, 5 and 6). We compare the estimates at the individual and cluster levels.

(1) MSE at the individual level

To quantify the accuracy of the Gaussian joint model and the log-gamma joint model at the individual level, we computed the mean squared error (MSE) of the estimated longitudinal measurements and the MSE of the estimated hazard rate over all subjects and all time points. We simulate 50 replicates (50 according to Section 5.1.1 and 50 according to Section 5.2.1) and compare the MSEs in Table 5.3. The log-gamma joint models have slightly larger longitudinal MSE than the Gaussian joint models, but much smaller MSE of the hazard rate. Thus, the Gaussian joint models are only slightly better at estimating longitudinal trajectories than the log-gamma joint model, but the log-gamma joint model is much better at estimating the hazard than the Gaussian joint model. These conclusions are consistent for both Gaussian and log-gamma distributed data.

(2) Clustering evaluation

We also compare the estimation accuracy of the Gaussian joint model and the log-gamma joint model at the cluster level. We start this discussion by evaluating the clustering performance of both joint models. In every MCMC iteration, we update the cluster indicator for each subject, and as such, we obtain a total number of clusters at each iteration. The posterior mode of the number of clusters is computed using the last 3,000 iterations, and can be interpreted as an estimate of the number of clusters. Table 5.5 shows the mean and standard error of those modes for 50 replicate generations of the data. In most cases, both the Gaussian and log-gamma joint models overestimate the number of clusters. However, the estimates from the log-gamma joint model are closer to the truth than the estimates from the Gaussian joint models. The standard errors of the estimated number of clusters from the log-gamma joint model are also consistently smaller than those from the Gaussian joint model.

We also evaluate the clustering performance using the adjusted Rand index. The Rand index is a measure of the similarity between two different clusterings of the same set of data. It essentially considers how each pair of data points is assigned in each clustering. There are two cases that represent a similarity between the clusterings: the first occurs when the elements of the paired data are assigned to the same cluster in each of the two clusterings, and the second occurs when they are placed into different clusters in both clusterings. A difference between clusterings occurs when the elements of the paired data are assigned to the same cluster in one clustering but to different clusters in the other. When the true cluster labels are known, the similarity between the estimated clustering and the truth can be a measure of the accuracy of the estimation. From this, Rand [31] proposed a measure of the similarity between two clusterings of the same data, defined as the number of similar assignments of paired data normalized by the total number of pairs. Given a set of N objects S = {o1, o2, . . . , oN}, define two clusterings C¹ = {C¹1, C¹2, . . . , C¹k1} and C² = {C²1, C²2, . . . , C²k2}, where C¹ partitions S into k1 subsets and C² partitions S into k2 subsets. The Rand index is formally defined as

RI = Σ_{i,j} rij / C(N, 2),

where C(N, 2) = N(N − 1)/2 is the total number of pairs, and, for i, j = 1, . . . , N,

rij = 1 if oi and oj are in the same subset in C¹ and in the same subset in C², or in different subsets in C¹ and in different subsets in C²;
rij = 0 if oi and oj are in the same subset in C¹ but in different subsets in C², or in different subsets in C¹ but in the same subset in C².

The overlap between C¹ and C² can be summarized in a contingency table (Table 5.6). Intuitively, Σ_{i,j} rij can be considered as the number of agreements between clusterings C¹ and C², and the Rand index represents the frequency of agreements over the total number of pairs. The Rand index ranges between 0 and 1, with 0 indicating that the two clusterings have no similarity (i.e., when one consists of only one cluster and the other has N clusters with a single point in each), and 1 indicating that the clusterings are identical. However, the Rand index makes no correction for chance. That is, we cannot tell whether a specific value of RI is large or small, because when cluster assignment is random, the value of RI

is not zero. The non-adjusted Rand index also implies a dependency between the number of clusters and the number of objects. Specifically, Morey and Agresti [27] stated that RI increases to 1.0 as the numbers of clusters k1 and k2 increase. Thus, corrections have been made to overcome this disadvantage. Hubert and Arabie [19] proposed an adjusted Rand index based on the assumption that the number of agreements rij has a generalized hypergeometric distribution. They provided the expectation of the Rand index as

E(RI) = 1 + 2\,\frac{\sum_i \binom{r_{i\cdot}}{2} \sum_j \binom{r_{\cdot j}}{2}}{\binom{N}{2}^2} - \frac{\sum_i \binom{r_{i\cdot}}{2} + \sum_j \binom{r_{\cdot j}}{2}}{\binom{N}{2}}.

Then the adjusted Rand index (ARI) is given by,

ARI = \frac{RI - E(RI)}{\max(RI) - E(RI)} \qquad (5.1)

    = \frac{\sum_{i,j} \binom{r_{ij}}{2} - \left[\sum_i \binom{r_{i\cdot}}{2} \sum_j \binom{r_{\cdot j}}{2}\right] \big/ \binom{N}{2}}{\frac{1}{2}\left[\sum_i \binom{r_{i\cdot}}{2} + \sum_j \binom{r_{\cdot j}}{2}\right] - \left[\sum_i \binom{r_{i\cdot}}{2} \sum_j \binom{r_{\cdot j}}{2}\right] \big/ \binom{N}{2}}, \qquad (5.2)

assuming a maximum Rand index of 1, i.e., max(RI) = 1 in (5.1). In this dissertation, we use the ARI in (5.2) to evaluate the clustering of the joint models. First we need to obtain a point estimate of the clustering, and then we can use the estimated cluster assignment to compute the ARI. Several methods have been proposed to obtain a point estimate of the clustering using draws from the posterior clustering distribution, for example, the maximum a posteriori (MAP) clustering and the least-squares model-based clustering (Dahl [9]). In our simulation study we used Dahl's method, which selects one of the observed clusterings in the Markov chain as the point estimate. Specifically, the least-squares clustering is the observed clustering among the last 3,000 iterations that minimizes the sum of squared deviations of its association matrix from the pairwise probability matrix (Dahl [9]); a sketch of both computations is given below. The mean ARI and the standard error over 50 replicates are shown in Table 5.5. The log-gamma joint models have higher ARI than the Gaussian joint models, which means the clustering from the log-gamma joint models is closer to the true clustering than the clustering from the Gaussian joint models. Therefore, we conclude that the log-gamma joint model appears to have better performance in terms of clustering and detecting subgroups among subjects than the Gaussian joint model.
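The following R sketch illustrates both computations under assumed inputs: labels.draws is a hypothetical matrix of cluster indicators from the last 3,000 MCMC iterations, and labels.true holds the true assignments. The ari function implements (5.2) directly from the contingency table; mclust::adjustedRandIndex could be used as a cross-check.

# Adjusted Rand index, computed from the contingency table as in (5.2)
ari <- function(a, b) {
  tab <- table(a, b)                         # r_ij contingency table
  comb2 <- function(x) choose(x, 2)
  sum.ij <- sum(comb2(tab))
  sum.i  <- sum(comb2(rowSums(tab)))         # sum_i C(r_i., 2)
  sum.j  <- sum(comb2(colSums(tab)))         # sum_j C(r_.j, 2)
  expected <- sum.i * sum.j / comb2(length(a))
  (sum.ij - expected) / ((sum.i + sum.j) / 2 - expected)
}

# Dahl's least-squares clustering: pick the sampled clustering whose
# association matrix is closest to the pairwise co-clustering probabilities
dahl <- function(labels.draws) {
  assoc <- function(z) outer(z, z, "==") * 1          # association matrix
  pbar <- Reduce("+", lapply(seq_len(nrow(labels.draws)),
                             function(t) assoc(labels.draws[t, ]))) /
          nrow(labels.draws)                          # pairwise prob. matrix
  loss <- apply(labels.draws, 1,
                function(z) sum((assoc(z) - pbar)^2)) # squared deviation
  labels.draws[which.min(loss), ]                     # least-squares draw
}

# Hypothetical usage:
labels.draws <- matrix(sample(1:3, 3000 * 100, replace = TRUE), 3000, 100)
labels.true  <- rep(1:3, length.out = 100)
ls.clust <- dahl(labels.draws)
ari(ls.clust, labels.true)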

Based on the cluster assignments, we can evaluate the estimation of the longitudinal trajectory and hazard rate at the cluster level. We selected the least-squares clustering and used the values in that iteration as the cluster-level estimates of β, γ, α and λ. Then we computed the estimated trajectories in each cluster using the estimated values of β. For the hazard rate, we compute the mean covariates in each cluster and the estimated hazard rate at those mean covariates. Figures 5.11 and 5.12 show the results in Case 3 (the three-cluster simulation study with Gaussian distributed data) and Case 6 (the three-cluster simulation study with log-gamma distributed data). The red solid lines are the true trajectories and hazard rates in each cluster based on the true cluster assignments. The blue dashed lines are the estimated trajectories and hazard rates from models GaussianMH, GaussianSS and LG. Both the Gaussian and the log-gamma joint models estimate the cluster-specific trajectories and cluster-specific hazard well. The Gaussian joint models overestimate the number of clusters. The log-gamma joint model is more accurate at the cluster level in terms of the estimated number of clusters and the cluster-specific hazard rate.

(3) Effective sample size. MCMC provides a way to sample from the full-conditional distributions of the parameters. However, these samples are not independent, which can be an issue when making inferences about the parameters. Suppose there are N samples x_1, x_2, \ldots, x_N drawn from a distribution with mean µ and standard deviation σ. Then the population mean of this distribution can be estimated by the sample mean,

\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i.

If the samples are independent, the variance of \hat{\mu} is given by,

Var(\hat{\mu}) = \frac{\sigma^2}{N}.

In practice, the MCMC samples can be correlated, which implies that the variance of \hat{\mu} is not equal to \sigma^2/N. Thiébaux and Zwiers [36] define the effective sample size (ESS) by equating the ensemble mean square of the time-averaged mean, say \sigma_{\bar{x}}^2, to the standard formula for the variance of the mean of N_{ess} independent samples, that is

\sigma_{\bar{x}}^2 = \frac{\sigma^2}{N_{ess}}. \qquad (5.3)

Table 5.5: Clustering evaluation for the Gaussian joint model and the log-gamma joint model

Case     Data Distribution   Joint Model   True Number of Clusters   Estimated Number (SE)   ARI (SE)
Case 1a  Gaussian            GaussianMH    1                         1.3 (0.45)              0.86 (0.35)
                             GaussianSS    1                         1.9 (0.78)              0.78 (0.42)
                             LG            1                         1.1 (0.32)              0.9 (0.32)
Case 2b  Gaussian            GaussianMH    2                         2.3 (0.58)              0.89 (0.05)
                             GaussianSS    2                         2.4 (0.61)              0.90 (0.03)
                             LG            2                         2.1 (0.32)              0.89 (0.04)
Case 3c  Gaussian            GaussianMH    3                         4.2 (1.05)              0.82 (0.06)
                             GaussianSS    3                         4.6 (0.90)              0.83 (0.04)
                             LG            3                         3 (0)                   0.85 (0.03)
Case 4d  LG                  GaussianMH    1                         2.1 (0.99)              0.4 (0.52)
                             GaussianSS    1                         2.2 (1.03)              0.7 (0.48)
                             LG            1                         1.0 (0.14)              0.98 (0.14)
Case 5e  LG                  GaussianMH    2                         2.7 (0.88)              0.88 (0.06)
                             GaussianSS    2                         2.9 (0.90)              0.89 (0.03)
                             LG            2                         2 (0)                   0.91 (0.03)
Case 6f  LG                  GaussianMH    3                         4.4 (0.80)              0.83 (0.05)
                             GaussianSS    3                         4.8 (1.11)              0.82 (0.05)
                             LG            3                         3 (0)                   0.85 (0.06)

NOTE: Standard errors are in parentheses.
a Case 1: one-cluster Gaussian simulation data with unique values of β (3, −1).
b Case 2: two-cluster Gaussian simulation data with unique values of β (3, −1) and (2, −0.5).
c Case 3: three-cluster Gaussian simulation data with unique values of β (3, −1), (3, −0.5) and (2, −0.5).
d Case 4: one-cluster log-gamma simulation data with unique values of β (3, −1).
e Case 5: two-cluster log-gamma simulation data with unique values of β (3, −1) and (2, −0.5).
f Case 6: three-cluster log-gamma simulation data with unique values of β (3, −1), (3, −0.5) and (2, −0.5).

Table 5.6: Contingency table for the Rand index.

           C^2_1       C^2_2       ...   C^2_{k_2}     Sums
C^1_1      r_{11}      r_{12}      ...   r_{1k_2}      r_{1·}
  ⋮          ⋮           ⋮                 ⋮             ⋮
C^1_{k_1}  r_{k_1 1}   r_{k_1 2}   ...   r_{k_1 k_2}   r_{k_1 ·}
Sums       r_{·1}      r_{·2}      ...   r_{·k_2}      N

Figure 5.11: True vs. estimated longitudinal trajectories and hazard rates at the cluster level in Case 3. The red solid lines are the true trajectories and hazard rates in each cluster based on the true cluster assignments. The blue dashed lines are the estimated trajectories and hazard rates. (a) and (b) are the results from model GaussianMH; (c) and (d) are the results from model GaussianSS; (e) and (f) are the results from model LG.

Figure 5.12: True vs. estimated longitudinal trajectories and hazard rates at the cluster level in Case 6. The red solid lines are the true trajectories and hazard rates in each cluster based on the true cluster assignments. The blue dashed lines are the estimated trajectories and hazard rates. (a) and (b) are the results from model GaussianMH; (c) and (d) are the results from model GaussianSS; (e) and (f) are the results from model LG.

By Thiébaux and Zwiers [36], Laurmann and Gates [22], and following Anderson [2], we have the general expression

\sigma_{\bar{x}}^2 = \frac{\sigma^2}{N} \sum_{v=-(N-1)}^{N-1} \left(1 - \frac{|v|}{N}\right) \rho_v(x) \qquad (5.4)

for the variance \sigma_{\bar{x}}^2 of the mean over N samples of the variable x in terms of the lag v and lag-correlation function \rho_v. Here

\rho_v(x) = \frac{1}{N - v} \sum_{i=1}^{N-v} \frac{(x_i - \mu)(x_{i+v} - \mu)}{\sigma^2}. \qquad (5.5)

Hence, from (5.3), by equating \sigma_{\bar{x}}^2 and \sigma^2/N_{ess} we obtain a measure of the effective sample size

N_{ess} = \frac{\sigma^2}{\sigma_{\bar{x}}^2} = N \left[ \sum_{v=-(N-1)}^{N-1} \left(1 - \frac{|v|}{N}\right) \rho_v(x) \right]^{-1}. \qquad (5.6)

There are different ways to define a measure of the effective sample size (Thiébaux and Zwiers [36]). However, in this dissertation, we consider only the quantity defined by (5.6). Note that if the correlation is negative, the effective sample size could be larger than the actual sample size. We should note that (5.4)-(5.6) are all estimates of the properties of the complete MCMC series, of which the x_i, i = 1, \ldots, N are a part. Various approaches have been proposed to estimate N_{ess} (Thiébaux and Zwiers [36]). Thiébaux and Zwiers [36] discussed a method that estimates the ESS from the power spectrum of the observed sequence. For large sample sizes, they used an approximation for the variance of the sample mean

\sigma_{\bar{x}}^2 \approx 2\pi f_{xx}(0)/N,

where f_{xx}(\cdot) is the spectral density function of the observed time series and f_{xx}(0) is its value at frequency zero. An estimate of the ESS is then given by

N_{ess} \approx \frac{N \sigma^2}{2\pi f_{xx}(0)}.

In this dissertation we used the function effectiveSize in the R package coda for implementation, which uses an estimate of the spectral density at frequency zero. Table 5.7 shows the effective sample size in the last 3,000 iterations from model GaussianMH, model GaussianSS and the log-gamma joint model in one replicate in Cases 1-6. Model GaussianMH has low effective sample size owing to rejections in the Metropolis-Hastings method, which implies

positive temporal correlations. The slice sampler effectively increased the effective sample size for β and α in the Gaussian joint models. The log-gamma joint models have much higher effective sample sizes than the Gaussian joint models for β and γ. Model GaussianSS and the log-gamma joint model have relatively higher effective sample size for α than model GaussianMH. All programs ran on the High Performance Computing Cluster at Florida State University. For Case 3, with three-cluster Gaussian distributed data on 300 subjects, it took about 3.9 hours for the log-gamma joint model and about 3.5 hours for the Gaussian joint model to run 5,000 MCMC iterations. For Case 6, with three-cluster log-gamma distributed data on 300 subjects, it took about 3.9 hours for the log-gamma joint model and about 3.0 hours for the Gaussian joint model to run 5,000 MCMC iterations.
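As an illustration of how these values can be obtained, a minimal sketch using the coda package is given below; the draws object is a hypothetical stand-in for one chain of posterior draws, simulated here as an autocorrelated series.

library(coda)

# Hypothetical chain: an AR(1) series mimicking autocorrelated MCMC draws
set.seed(1)
draws <- as.numeric(arima.sim(list(ar = 0.8), n = 3000))

# ESS based on the estimated spectral density at frequency zero
effectiveSize(mcmc(draws))

# For comparison, an independent chain of the same length
effectiveSize(mcmc(rnorm(3000)))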

5.3.2 Comparison with Parametric Models

The raw data of the longitudinal outcomes could be used as a predictor in the survival model. However, the raw data may have measurement error or missing information. Therefore, a joint model which can identify the longitudinal trajectory and learn the association between the longi- tudinal and survival outcomes may improve the inference for both. We compare the results of our semiparametric joint model with two parametric Cox proportional hazard models,

h(t|Y(t)) = λ(t) exp{γY(t) + Xα}. (5.7)

h(t) = λ(t) exp{Xα}. (5.8)

Here Y(t) is the raw longitudinal outcome, which we treat as a predictor in the hazard model (5.7). In model (5.8), we do not consider the effect of the longitudinal outcome in the hazard model. Figure 5.13 shows the estimated cumulative hazard from these two models and from our log-gamma joint model. Our semiparametric log-gamma joint model is more accurate in predicting the cumulative hazard, especially when there is more than a single cluster, since the joint model with the Dirichlet process prior can detect the differing hazards among those clusters.
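For reference, models (5.7) and (5.8) can be fit with the survival package, using the counting-process (start, stop) data format so that the raw longitudinal measurement enters as a time-varying covariate. The data frame dat and its column names below are hypothetical placeholders, not the simulated data of this chapter.

library(survival)

# Hypothetical data in counting-process format: one row per subject-interval,
# with the raw longitudinal value Y carried over each interval
dat <- data.frame(
  id     = rep(1:4, each = 2),
  tstart = rep(c(0, 2), 4),
  tstop  = c(2, 5, 2, 4, 2, 6, 2, 3),
  event  = c(0, 1, 0, 0, 0, 1, 0, 1),
  Y      = c(3.1, 2.4, 2.8, 2.9, 3.5, 1.9, 3.0, 2.2),
  x      = rep(c(1, 0, 1, 0), each = 2)
)

# Model (5.7): hazard depends on the raw longitudinal value and covariates
fit1 <- coxph(Surv(tstart, tstop, event) ~ Y + x, data = dat)

# Model (5.8): covariates only
fit2 <- coxph(Surv(tstart, tstop, event) ~ x, data = dat)

# Estimated cumulative baseline hazards, as compared in Figure 5.13
H1 <- basehaz(fit1, centered = FALSE)
H2 <- basehaz(fit2, centered = FALSE)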

Table 5.7: Effective sample size for the Gaussian joint model and the log-gamma joint model

Case     Data Distribution   Joint Model   β01    β11    γ     α
Case 1a  Gaussian            GaussianMH    85     310    3     101
                             GaussianSS    1547   1562   279   2128
                             LG            3000   3000   1210  1799
Case 2b  Gaussian            GaussianMH    552    566    2     98
                             GaussianSS    1231   727    3     2029
                             LG            2205   2376   1458  1867
Case 3c  Gaussian            GaussianMH    85     310    3     101
                             GaussianSS    2080   2298   4     2178
                             LG            3000   3000   1210  1799
Case 4d  LG                  GaussianMH    600    327    21    90
                             GaussianSS    787    585    206   2592
                             LG            3000   3000   1405  1739
Case 5e  LG                  GaussianMH    156    151    13    144
                             GaussianSS    1126   1443   0     1749
                             LG            3000   3000   1418  1767
Case 6f  LG                  GaussianMH    600    327    21    90
                             GaussianSS    1913   2360   3     1692
                             LG            3000   3000   1405  1739

NOTE:
a Case 1: one-cluster Gaussian simulation data with unique values of β (3, −1).
b Case 2: two-cluster Gaussian simulation data with unique values of β (3, −1) and (2, −0.5).
c Case 3: three-cluster Gaussian simulation data with unique values of β (3, −1), (3, −0.5) and (2, −0.5).
d Case 4: one-cluster log-gamma simulation data with unique values of β (3, −1).
e Case 5: two-cluster log-gamma simulation data with unique values of β (3, −1) and (2, −0.5).
f Case 6: three-cluster log-gamma simulation data with unique values of β (3, −1), (3, −0.5) and (2, −0.5).

Figure 5.13: Comparison of the semiparametric joint model and the parametric Cox models in Case 6 (three-cluster simulation with log-gamma distributed data). The black solid line is the true cumulative hazard. The dashed lines are the estimated cumulative hazards from the log-gamma semiparametric joint model (red), from the parametric Cox model with the longitudinal effect (blue), and without the longitudinal effect (green).

CHAPTER 6

APPLICATIONS

We apply the Gaussian semiparametric joint model (reviewed in Chapter 3) and the log-gamma semiparametric joint model (proposed in Chapter 4) to two real datasets: an HIV dataset and a prostate cancer dataset. The HIV data were obtained from a randomized clinical trial in which both longitudinal and survival outcomes were collected to compare the efficacy and safety of two antiretroviral drugs, didanosine (ddI) and zalcitabine (ddC), in treating patients who had failed or were intolerant of zidovudine (AZT) therapy (Abrams et al. [1]; Guo and Carlin [17]). Our analysis of the HIV data is given in Section 6.1. The second dataset involves prostate-specific antigen (PSA) readings, which were obtained from a clinical trial designed to investigate the effects of a nutritional supplement of selenium on the incidence of prostate cancer. The PSA trial concluded in 1996 and the results were published in Clark et al. [7]. Our analysis of the PSA data is given in Section 6.2.

6.1 HIV Data

6.1.1 Data Description

In AIDS clinical trials, the number of CD4 cells per cubic millimeter of blood is widely used as a biomarker for progression to AIDS when studying the efficacy of drugs to treat HIV-infected patients. It is often measured repeatedly over time (Guo and Carlin [17]). In this trial, as stated in Guo and Carlin [17], 467 HIV-infected patients who met entry conditions (either an AIDS diagnosis or two CD4 counts of 300 or fewer, and fulfilling specific criteria for AZT intolerance or failure) were enrolled and randomly assigned to receive either didanosine (ddI) or zalcitabine (ddC). CD4 counts were recorded at study entry, and again at the visits in the 2nd, 6th, 12th and 18th months. More details regarding the conduct of the trial can be found in Abrams et al. [1] and Goldman et al. [15]. In this dissertation, we conduct a joint analysis using the CD4 counts of all 467 patients as the longitudinal measure and the time-to-death as the survival endpoint. The CD4 counts of the

Figure 6.1: Observed trajectories of CD4 cell counts for all 467 patients in the original scale and the square root scale.

AIDS data were transformed by a square root power due to a high degree of skewness towards high CD4 counts (Guo and Carlin [17]). The transformation was also taken to achieve homogeneity of within-subject variance (Taylor et al. [35]).

6.1.2 Results Using Joint Models

Let Y_{ij} denote the square root of the CD4 count for the i-th subject at time point j, i = 1, \ldots, n and j = 1, \ldots, n_i. Of the 467 patients included in this analysis, 24 patients had 5 longitudinal observations, 169 had 4, 122 had 3, 91 had 2, and 61 had 1 longitudinal observation. There were 188 patients who died before the end of the study. Figure 6.1 shows the plot of the square root of the observed longitudinal measures over time. Each line represents the longitudinal trajectory for one patient. We include four binary explanatory variables as the covariates in the Cox proportional hazard model: Drug (ddI = 1, ddC = 0), Gender (male = 1, female = 0), PrevOI (previous opportunistic infection (AIDS diagnosis) at study entry = 1, no AIDS diagnosis = 0), and AZT (intolerance = 1, failure = 0). We fit this HIV data with the Gaussian model reviewed in Chapter 3 and the log-gamma joint model proposed in Chapter 4, with a linear trajectory ψ_{β_i}(t_{ij}) = β_{0i} + β_{1i}t_{ij}. We assume the baseline hazard λ(t) to be piecewise constant with L = 4 intervals. The time points defining the intervals were taken to be the observed time points t = 0, 2, 6, 12, and 18. In the Gaussian joint model, the priors and hyperpriors for the parameters in (3.6), (3.4) and (3.10) were taken as

b_0 \sim N_2((3, -1)', I_2),
V_0^{-1} \sim \mathrm{Wishart}(20, (1, 10) \times I_2),
M \sim \mathrm{Gamma}(1, 1),
\sigma^2 \sim IG(10, 1),
\gamma \sim N(-1, 0.1),
\lambda_l \sim \mathrm{Gamma}(1, 1), \quad l = 1, 2, \ldots, 5,
\alpha \sim N_4((-1, -1, -1, -1)', I_4),

where recall that I represents the identity matrix, and Wishart and Gamma are shorthand for the Wishart and gamma distributions. In the log-gamma joint model, the priors and hyperpriors for the parameters in (4.2), (4.8) and (4.6) were taken as

\kappa_0 \sim IG(10, 10),
\kappa_1 \sim IG(10, 10),
\kappa_3 \sim IG(10, 10),
M \sim \mathrm{Gamma}(1, 1),
\alpha \sim \mathrm{Gamma}(100, 10),
\gamma \sim LG(2000, 5000),
\lambda_l \sim \mathrm{Gamma}(10, 10), \quad l = 1, 2, \ldots, 5,
\boldsymbol{\alpha} \sim \mathrm{cMLG}\left( \begin{pmatrix} X \\ 1_4' \end{pmatrix}, \begin{pmatrix} 0.005 \times 1_n \\ \alpha_3 \end{pmatrix}, \begin{pmatrix} 0.005 \times 1_n \\ \kappa_3 \end{pmatrix} \right),

where "IG", "LG", "Gamma" and "cMLG" stand for the inverse gamma, log-gamma, gamma and conditional multivariate log-gamma distributions. Here X is the n × 4 covariate matrix, and 1_4 is a 4-dimensional vector of ones. We ran 5,000 iterations with 2,000 burn-in. We computed the posterior mean of the last 3,000 draws as the point estimate for those parameters, and we used the 2.5% and 97.5% quantiles of the last 3,000 draws as the 95% credible interval.
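To make the piecewise-constant baseline hazard concrete, the short R sketch below evaluates the baseline hazard, cumulative hazard, and survival function from a vector of interval rates. The lambda values here are illustrative placeholders, not our posterior estimates.

# Piecewise-constant baseline hazard defined by knots and interval rates
knots  <- c(0, 2, 6, 12, 18)            # interval endpoints (months)
lambda <- c(0.3, 0.5, 0.7, 0.6, 1.4)    # illustrative rates; last covers t > 18

# Baseline hazard lambda(t)
haz <- function(t) lambda[pmin(findInterval(t, knots), length(lambda))]

# Cumulative hazard H(t): sum of lambda_l times the time spent in interval l
cumhaz <- function(t) {
  ends <- c(knots[-1], Inf)
  sapply(t, function(u) sum(lambda * pmax(0, pmin(u, ends) - knots)))
}

# Survival function S(t) = exp(-H(t))
surv <- function(t) exp(-cumhaz(t))

surv(c(2, 6, 12, 18))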

Parameter estimates are shown in Table 6.1. The point estimate of the trajectory slope in the linear regression is −0.03 with 95% credible interval (−0.09, 0.03) in the Gaussian joint model, and −0.04 with 95% credible interval (−0.05, −0.01) in the log-gamma joint model, suggesting a slight decrease in CD4 count over the study period. The estimated value of γ is −0.64 using the Gaussian joint model and −0.91 using the log-gamma joint model. The 95% credible intervals of γ do not cover zero in either joint model. This suggests that when the CD4 cell count decreases, the hazard of death increases. The non-zero estimate of γ also suggests that a joint model is needed in this problem. The MCMC trace plots and the posterior density estimates of β01, β11, γ, α are given in Figures 6.2-6.5. Similar to Figure 5.7 in Chapter 5, we see multiple modes in Figure 6.2 and Figure 6.4. This may be due to label switching (Stephens [34]). The two modes in the posterior density of γ in Figure 6.3 may be due to high autocorrelation when using the Metropolis-Hastings method. Figure 6.6 shows the observed CD4 counts versus the estimated longitudinal trajectories of 4 randomly selected patients using the Gaussian joint model and the log-gamma joint model. Both joint models captured the trajectories well. The mean squared residual error (MSRE) of the square root of the CD4 counts over the observations of all patients is 0.08 for the Gaussian joint model, slightly smaller than the 0.15 for the log-gamma joint model. The estimated association between the longitudinal observations and the hazard rate of death is negative in both the Gaussian and log-gamma joint models. This suggests that a lower CD4 count implies a higher hazard of death. For the covariates, the hazard ratio of a previous AIDS diagnosis over no AIDS diagnosis is exp(0.07) ≈ 1.07 with 95% credible interval (0.66, 1.67) for the Gaussian joint model, and exp(0.22) ≈ 1.25 with 95% credible interval (0.88, 1.77) for the log-gamma joint model. This suggests that a patient who was diagnosed with AIDS at study entry has a higher hazard of death than one without an AIDS diagnosis at study entry. At the cluster level, we used Dahl's method (Dahl [9]) to select the "best" iteration from the last 3,000 iterations and used the values of the draws in that iteration as the point estimates for those parameters. Figure 6.7 shows the predicted trajectories, predicted hazard rate and predicted survival probability for each cluster. Different colors represent different clusters. The top panel of the hazard rate plot in Figure 6.7 was restricted for better visualization, because most of the cluster-specific hazard rates were smaller than one.

Table 6.1: Parameter estimates for the HIV data using the Gaussian joint model with slice sampler and the log-gamma joint model.

Parameter     Gaussian Estimate   Gaussian 95% Credible Interval   Log-gamma Estimate   Log-gamma 95% Credible Interval
β01           3.27                (2.69, 3.75)                     3.02                 (2.86, 3.76)
β11           −0.03               (−0.09, 0.03)                    −0.04                (−0.05, −0.01)
Var(error)    0.13                (0.12, 0.15)                     0.10                 (0.09, 0.16)
γ             −0.64               (−0.66, −0.61)                   −0.91                (−0.96, −0.87)
Drug          −0.50               (−0.80, −0.21)                   −0.41                (−0.68, −0.13)
Gender        −1.03               (−1.46, −0.58)                   −0.78                (−1.11, −0.43)
PrevOI        0.07                (−0.41, 0.51)                    0.22                 (−0.13, 0.57)
AZT           −0.69               (−1.11, −0.33)                   −0.43                (−0.72, −0.16)
λ1            0.26                (0.11, 0.54)                     0.37                 (0.21, 0.59)
λ2            0.50                (0.24, 0.95)                     0.55                 (0.35, 0.81)
λ3            0.65                (0.32, 1.21)                     0.70                 (0.47, 1.03)
λ4            0.61                (0.28, 1.15)                     0.69                 (0.43, 1.04)
λ5            5.05                (1.61, 10.43)                    1.39                 (0.75, 2.25)

The log-gamma model leads to an estimated 7 clusters, while the Gaussian model leads to an estimated 29 clusters. This corresponds well with the results of the simulation study in Chapter 5, in which the Gaussian joint models tended to overestimate the number of clusters and estimated more clusters than the log-gamma joint model. Note that the prior on the concentration parameter M in the Dirichlet process prior is the same in both joint models. We also compared the predicted survival probability from the joint models with the Kaplan-Meier estimator, which is shown in the survival probability plots in Figure 6.7. The Kaplan-Meier estimator for all patients lay in between the estimated cluster-specific survival probabilities.
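The Kaplan-Meier comparison can be reproduced with the survival package; the time and status vectors below are hypothetical stand-ins for the observed survival times and death indicators.

library(survival)

# Hypothetical survival data: follow-up time (months) and death indicator
time   <- c(1, 3, 6, 6, 9, 12, 14, 18, 18, 18)
status <- c(1, 1, 0, 1, 1, 0, 1, 0, 0, 1)

# Kaplan-Meier estimator for all patients
km <- survfit(Surv(time, status) ~ 1)
summary(km)

# Plot for overlaying against model-based survival curves, as in Figure 6.7
plot(km, conf.int = FALSE, xlab = "Months", ylab = "Survival probability")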

6.1.3 Diagnosis

It is also important to provide diagnostics in addition to interpreting model fit. It may be the case that both the Gaussian and log-gamma versions of the joint model provide poor fits to the data. The posterior predictive p-value (Meng [25]) is a viable approach to assess model fit (see Section 2.8). For this data set the posterior predictive p-value for the log-gamma model is 0.176 and 0 (for χ2 and square error loss functions, respectively), and the posterior predictive p-value for the Gaussian model is approximately equal to 1 and 0.489 (for χ2 and square error loss functions, respectively). This suggests that we may be overfitting with the Gaussian model and may be

Figure 6.2: Trace plots and density curves using the Gaussian joint model with slice sampler for the HIV data. (a) MCMC trace plot of trajectory intercept β0. (b) Posterior density estimates of trajectory intercept β0. (c) MCMC trace plot of trajectory slope β1. (d) Posterior density estimates of trajectory slope β1.

Figure 6.3: Trace plots and density curves using the Gaussian joint model with slice sampler for the HIV data. (a) MCMC trace plot of the link parameter γ. (b) Posterior density estimates of the link parameter γ. (c) MCMC trace plot of covariate coefficients α. (d) Posterior density estimates of covariate coefficients α.

Figure 6.4: Trace plots and density curves using the log-gamma joint model for the HIV data. (a) MCMC trace plot of trajectory intercept β0. (b) Posterior density estimates of trajectory intercept β0. (c) MCMC trace plot of trajectory slope β1. (d) Posterior density estimates of trajectory slope β1.

Figure 6.5: Trace plots and density curves using the log-gamma joint model for the HIV data. (a) MCMC trace plot of the link parameter γ. (b) Posterior density estimates of the link parameter γ. (c) MCMC trace plot of covariate coefficients α. (d) Posterior density estimates of covariate coefficients α.

Figure 6.6: Observed square root of CD4 cell counts (black circles) vs. estimated longitudinal trajectories of 4 randomly selected patients using the Gaussian joint model with slice sampler (blue dashed line) and the log-gamma joint model (red dashed line).

Figure 6.7: Predicted CD4 trajectories, predicted hazard rate and predicted survival probability for each cluster using the Gaussian joint model with slice sampler and the log-gamma joint model for the AIDS data.

Figure 6.8: QQ-plots of residuals using the Gaussian joint model and the log-gamma joint model for the AIDS data.

oversmoothing with the log-gamma model. To obtain a sense of how much we are smoothing in each case, we consider other types of diagnostics. For example, the DIC (see Section 2.9) for the log-gamma model is 404920.2 and the DIC for the Gaussian model is 7025764. This suggests better predictive performance of the log-gamma model. Residual diagnostics are also helpful for determining the quality of the model fit. In Figure 6.8, we plot the qq-plots of the residuals from the Gaussian model and the log-gamma joint model. Specifically, Figure 6.8 plots the sample quantiles of the residuals, y_{ij} − (\hat{β}_0 + \hat{β}_1 t_{ij}), versus the theoretical quantiles from a Gaussian distribution. Here, we see the normal qq-plot suggests heavy tails in both cases. This suggests a better fit using a model that allows for both left and right skewness. Note that the log-gamma model allows for left skewness, while the Gaussian model is symmetric.
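A minimal sketch of a posterior predictive p-value with a χ² discrepancy, in the spirit of Section 2.8, is given below. The objects y, mu.draws and sig.draws are hypothetical placeholders for the observed data and posterior draws, not our actual model output.

# Hypothetical posterior output:
#   y         - observed data vector
#   mu.draws  - matrix (iterations x observations) of fitted means
#   sig.draws - vector of error standard deviation draws
set.seed(1)
n <- 200; S <- 1000
y <- rnorm(n, 3, 0.5)
mu.draws  <- matrix(rnorm(S * n, 3, 0.05), S, n)
sig.draws <- rep(0.5, S)

# Chi-square discrepancy T(y, theta) = sum ((y - mu) / sigma)^2
disc <- function(y, mu, sig) sum(((y - mu) / sig)^2)

ppp <- mean(sapply(seq_len(S), function(s) {
  yrep <- rnorm(n, mu.draws[s, ], sig.draws[s])   # replicated data
  disc(yrep, mu.draws[s, ], sig.draws[s]) >=
    disc(y, mu.draws[s, ], sig.draws[s])          # compare discrepancies
}))
ppp   # values near 0 or 1 signal lack of fit or overfitting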

6.2 PSA Data

6.2.1 Data Description

We consider the subjects with more than three PSA readings as the longitudinal measure, and the time-to-diagnosis with prostate cancer as the survival endpoint. There are 647 out of 1210 subjects who had more than three PSA readings. Of these 647 subjects, 52 of them had prostate

Figure 6.9: Observed trajectories of PSA readings for 647 subjects who have more than three PSA readings.

cancer diagnosis. Their PSA readings were transformed by log(PSA + 1) because the data on the original scale are skewed, and some of the PSA readings are equal to zero.

6.2.2 Results Using Joint Models

Let Y_{ij} denote the transformed PSA reading log(PSA + 1) for the i-th subject at time point j, i = 1, \ldots, n and j = 1, \ldots, n_i. Of the 647 people included in this analysis, 92 people had 4 longitudinal observations, 80 had 5, 87 had 6, 69 had 7, 56 had 8, 43 had 9, 37 had 10, 33 had 11, 58 had 12, 37 had 13, 19 had 14, 11 had 15, and 25 had more than 15 longitudinal observations. There were 52 people who had a prostate cancer diagnosis before the end of the study. Figure 6.9 displays the plots of the PSA readings on the original scale over time, and the transformed PSA readings over time. Each line represents the longitudinal trajectory for one person. We do not include covariates in the Cox proportional hazard model in this dissertation. We fit this PSA data with the Gaussian model reviewed in Chapter 3 and the log-gamma joint model proposed in Chapter 4, with a linear trajectory ψ_{β_i}(t_{ij}) = β_{0i} + β_{1i}t_{ij}. We assume that the baseline hazard λ(t) is piecewise constant with L = 15 intervals. The time points defining the intervals were arbitrarily taken to be t = 0, 2, 4, 6, 8, 10, 12, 14. In the Gaussian joint model, the priors and hyperpriors for the parameters in (3.6), (3.4) and (3.10) were taken as

b_0 \sim N_2((3, -1)', I_2),
V_0^{-1} \sim \mathrm{Wishart}(20, 10 \times I_2),
M \sim \mathrm{Gamma}(1, 1),
\sigma^2 \sim IG(10, 1),
\gamma \sim N(1, 0.05),
\lambda_l \sim \mathrm{Gamma}(10, 10), \quad l = 1, 2, \ldots, 7,

where I represents the identity matrix, and Wishart and Gamma denote the Wishart and gamma distributions. In the log-gamma joint model, the priors and hyperpriors for the parameters in (4.2), (4.8) and (4.6) were taken as

\kappa_0 \sim IG(10, 10),
\kappa_1 \sim IG(10, 10),
\kappa_3 \sim IG(10, 10),
M \sim \mathrm{Gamma}(1, 1),
\alpha \sim \mathrm{Gamma}(100, 10),
\gamma \sim LG(50, 20),
\lambda_l \sim \mathrm{Gamma}(10, 10), \quad l = 1, 2, \ldots, 5,

where "IG", "LG", and "Gamma" stand for the inverse gamma, log-gamma, and gamma distributions. We ran 5,000 iterations with 2,000 burn-in. We computed the posterior mean of the last 3,000 draws as the point estimate for those parameters, and we used the 2.5% and 97.5% quantiles of the last 3,000 draws to compute 95% pointwise credible intervals. Parameter estimates are shown in Table 6.2. The estimated slope in the linear regression is 0.07 with 95% credible interval (−0.002, 0.15) using the Gaussian joint model and 0.04 with 95% credible interval (0.01, 0.06) using the log-gamma joint model, suggesting a slight increase in PSA over the study period. The estimated value of γ is 1.17 and the 95% credible interval of γ does not cover zero. This suggests that when the PSA reading increases, the hazard of being diagnosed with

Table 6.2: Parameter estimates for the PSA data using the Gaussian joint model with slice sampler and the log-gamma joint model.

Parameter     Gaussian Estimate   Gaussian 95% Credible Interval   Log-gamma Estimate   Log-gamma 95% Credible Interval
β01           0.96                (0.64, 1.27)                     1.11                 (0.81, 1.36)
β11           0.07                (−0.002, 0.15)                   0.04                 (0.01, 0.06)
Var(error)    0.05                (0.04, 0.05)                     0.04                 (0.04, 0.05)
γ             1.17                (1.13, 1.17)                     0.81                 (0.53, 1.07)

prostate cancer will also increase. The non-zero estimate of γ also suggests that a joint model is needed in this problem. The MCMC trace plots and the posterior density estimates of β01, β11, γ are given in Figures 6.10 and 6.11. Figure 6.12 shows the observed PSA readings versus the estimated longitudinal trajectories of 4 randomly selected subjects using the Gaussian joint model and the log-gamma joint model. Both joint models captured the trajectories well. The MSRE of the PSA readings on the original scale is 1.79 for the Gaussian joint model, which is smaller than the 4.94 for the log-gamma joint model. The estimated association between the longitudinal observations and the hazard rate of a prostate cancer diagnosis is positive in both the Gaussian and log-gamma joint models. This suggests that a higher PSA implies a higher hazard of receiving a prostate cancer diagnosis. At the cluster level, we used Dahl's method to select the "best" iteration from the last 3,000 iterations and used the values of the draws in that iteration as the point estimates for those parameters. Figure 6.14 shows the predicted PSA trajectories, predicted hazard rate and predicted survival probability for each cluster. Different colors represent different clusters. In the bottom left panel we truncate the trajectory of the cluster labeled with yellow at the largest observed time of PSA readings. We do this because a high percentage of individuals in this cluster were diagnosed with prostate cancer early, and hence we feel less confident in making estimates after this time point. The decreasing trend of PSA readings in this cluster is consistent with Lin et al. [24], who found that individuals with similar PSA readings could exhibit a similar decreasing trend. The top middle panel of the hazard rate plot in Figure 6.14 was restricted for better visualization, because most of the cluster-specific hazard rates were smaller than 10. The log-gamma model leads to an estimated 7 clusters, while the Gaussian model leads to an estimated 54 clusters. This corresponds well with the results of the simulation study in Chapter 5, in which the Gaussian joint models tended to

Figure 6.10: Trace plots and density curves using the Gaussian joint model with slice sampler for the PSA data. (a) MCMC trace plot of trajectory intercept β0. (b) Posterior density estimates of trajectory intercept β0. (c) MCMC trace plot of trajectory slope β1. (d) Posterior density estimates of trajectory slope β1. (e) MCMC trace plot of the link parameter γ. (f) Posterior density estimates of the link parameter γ.

Figure 6.11: Trace plots and density curves using the log-gamma joint model for the PSA data. (a) MCMC trace plot of trajectory intercept β0. (b) Posterior density estimates of trajectory intercept β0. (c) MCMC trace plot of trajectory slope β1. (d) Posterior density estimates of trajectory slope β1. (e) MCMC trace plot of the link parameter γ. (f) Posterior density estimates of the link parameter γ.

Figure 6.12: Observed PSA readings (black circles) vs. estimated PSA longitudinal trajectories of 4 randomly selected subjects using the Gaussian joint model with slice sampler (blue dashed line) and the log-gamma joint model (red dashed line).

overestimate the number of clusters and estimated more clusters than the log-gamma joint model, given that the prior on the concentration parameter M in the Dirichlet process prior is the same in both joint models. We also compared the predicted survival probability from the joint models with the Kaplan-Meier estimator, which is shown in the survival probability plots in Figure 6.14. The Kaplan-Meier estimator for all patients lay in between the estimated cluster-specific survival probabilities.

6.2.3 Diagnosis

Similar to the AIDS data in Section 6.1, we provide diagnostics in addition to interpreting model fit for the PSA data. It may be the case that both the Gaussian and log-gamma versions of the joint model provide poor fits to the data. The posterior predictive p-value (Meng [25]) is a viable approach to assess model fit (see Section 2.8). For this PSA data set the posterior predictive p-value for the log-gamma model is 1 and 0.027 (for χ2 and square error loss functions, respectively), and the posterior predictive p-value for the Gaussian model is approximately equal to 1 and 0.749 (for χ2 and square error loss functions, respectively). This suggests that the Gaussian model may overfit

Figure 6.13: QQ-plots of residuals using the Gaussian joint model and the log-gamma joint model for the PSA data.

the data, while the result for the log-gamma model is inconclusive. To obtain a sense of how much we are smoothing in each case, we consider other types of diagnostics. For example, the DIC (see Section 2.9) for the log-gamma model is 1920113 and the DIC for the Gaussian model is 6425589. This suggests better predictive performance of the log-gamma model. Residual diagnostics are also helpful for determining the quality of the model fit. In Figure 6.13, we plot the qq-plots of the residuals from the Gaussian model and the log-gamma joint model. Specifically, Figure 6.13 plots the sample quantiles of the residuals, y_{ij} − (\hat{β}_0 + \hat{β}_1 t_{ij}), versus the theoretical quantiles from a Gaussian distribution. Here, we see the normal qq-plot suggests left skewness in both cases. This suggests a better fit using the log-gamma model, and that the residuals are distributed more like a left-skewed distribution (possibly a log-gamma distribution).
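A sketch of such a residual qq-plot is shown below; resid is a hypothetical vector of residuals y_{ij} − (β̂0 + β̂1 t_{ij}), and the log-gamma comparison uses simulated quantiles rather than a closed-form quantile function.

# Hypothetical residuals; the log of a gamma draw is left-skewed
set.seed(1)
resid <- log(rgamma(500, shape = 2, rate = 2))

# Normal qq-plot of residuals, as in Figure 6.13
qqnorm(resid); qqline(resid)

# Compare against simulated log-gamma quantiles instead of Gaussian ones
lg.sample <- log(rgamma(5000, shape = 2, rate = 2))
qqplot(lg.sample, resid,
       xlab = "Log-gamma quantiles", ylab = "Sample quantiles")
abline(0, 1)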

6.3 Discussion

In this chapter, we applied the Gaussian joint model and the log-gamma joint model to two real data sets, the HIV data and the PSA data. Both data sets had repeated observations as the longitudinal measure, and a time-to-event as the survival endpoint. We jointly analyzed their longitudinal and survival outcomes using semiparametric Bayesian joint models and implemented

Figure 6.14: Predicted longitudinal trajectories, predicted hazard rate and predicted survival probability for each cluster using the Gaussian joint model with slice sampler and the log-gamma joint model for the PSA data. In the bottom left panel we truncate the trajectory of the cluster labeled with yellow at the largest observed time of PSA readings. We do this because a high percentage of individuals in this cluster were diagnosed with prostate cancer early, and hence we feel less confident in making estimates after this time point.

via the Markov chain Monte Carlo (MCMC) approach. The focus of this analysis is on recovering the mean longitudinal trajectory and using the trajectory to make predictions of survival prognosis. A Dirichlet process prior was applied to the parameters of the trajectory function to relax distributional assumptions on those parameters, which can improve estimates in cases where parametric distributional assumptions are inappropriate. Both the Gaussian joint model and the log-gamma joint model were coded in R. Comparisons between these two joint models were given in Sections 6.1 and 6.2. The Gaussian joint model was slightly better at estimating the mean trajectories at the individual level, while the log-gamma model had better performance in clustering, and the estimates for each cluster were more interpretable.

CHAPTER 7

CONCLUSION AND FUTURE WORK

In this dissertation, we generalized a Bayesian semiparametric joint model proposed by Brown and Ibrahim [6] to jointly analyze longitudinal and survival data. In Chapter 3, we reviewed the Gaussian joint model proposed by Brown and Ibrahim [6] and provided the full conditional distributions for the parameters in their joint model, which we sampled using Metropolis-Hastings and the slice sampler. In Chapter 4, we challenged the Gaussian assumption in the Gaussian joint models and assumed a log-gamma distribution for the longitudinal measures. We then proposed a computationally efficient method for inference that uses the multivariate log-gamma distributions and the conditional multivariate log-gamma distributions as the priors and hyperpriors, which yields more conjugacy. Full conditional distributions of the parameters in the log-gamma joint model were provided in Chapter 4. In Chapter 5, simulation studies were given to illustrate both the Gaussian joint model and the log-gamma joint model. We fit these two types of joint models on both Gaussian distributed data and log-gamma distributed data and compared their performance. Comparison results were given at the individual level and the cluster level. At the individual level, the Gaussian joint model did slightly better than the log-gamma joint model in terms of the MSE of the longitudinal measures. The log-gamma joint model did much better than the Gaussian joint model in estimating the hazard for each individual. In clustering, the log-gamma joint model was better than the Gaussian joint model in terms of cluster assignments as measured by the adjusted Rand index. The log-gamma joint model was more accurate in estimating the number of clusters, while the Gaussian joint model tended to estimate more clusters than the truth. The log-gamma joint model was also more accurate in the estimation of the cluster-specific longitudinal mean trajectory and cluster-specific hazard. We only considered a linear trajectory for the longitudinal measures in this dissertation. However, the trajectory can take other forms of interest, e.g., a quadratic trajectory, which is a possible direction for future work. Moreover, we assumed the longitudinal error to be independent and identically distributed with either the Gaussian distribution or the log-gamma distribution. However,

the longitudinal error within each individual can be correlated. A joint model with time-dependent longitudinal error is another possible direction for future work. In Chapter 6, we presented a real data analysis of the HIV data to capture the trend of the CD4 cell count and to learn the association between the CD4 count trend and the hazard of death. We also analyzed the PSA data to capture the trend of PSA readings and learn the association between the PSA trend and the hazard of a prostate cancer diagnosis. There are more possibilities for real data analysis. For example, we will implement the Gaussian joint model and the log-gamma joint model on a Behavior Samples data set, which has repeated measures of the Behavior Samples scores for children. This study aims to contribute to the understanding of early communication delays and will analyze the correlation between the longitudinal observations and the risk of having a language development disorder.

APPENDIX A

ADDITIONAL DETAILS ON THE MULTIVARIATE LOG-GAMMA DISTRIBUTION

A.1 Marginal Multivariate Log-Gamma Distribution

A particular class of the marginal multivariate log-gamma (mMLG) distribution will be useful for the Gibbs sampler. In Bradley et al. [5], Theorem 2 defines a particular class of MLG random vectors. Specifically, let q* ∼ MLG(0_m, V, α, κ), and partition this random vector so that q* = (q_1*', q_2*')', where q_1* is a g-dimensional random vector and q_2* is an (m − g)-dimensional random vector. Additionally, we partition V^{-1} = [H B] into an m × g matrix H and an m × (m − g) matrix B as done in (2.7). Consider the class of MLG random vectors that satisfy the following:

V^{-1} = (Q_1 \; Q_2) \begin{pmatrix} R_1 & 0_{g, m-g} \\ 0_{m-g, g} & \frac{1}{\sigma} I_{m-g} \end{pmatrix},

where 0_{g,m-g} is a g × (m − g) matrix of zeros; 0_{m-g,g} is an (m − g) × g matrix of zeros; I_{m-g} is an (m − g) × (m − g) identity matrix;

H = (Q_1 \; Q_2) \begin{pmatrix} R_1 \\ 0_{m-g, g} \end{pmatrix}

is the QR decomposition of the m × g matrix H; the m × g matrix Q_1 satisfies Q_1'Q_1 = I_g; the m × (m − g) matrix Q_2 satisfies Q_2'Q_2 = I_{m-g} and Q_2'Q_1 = 0_{m-g,g}; B = \frac{1}{\sigma} Q_2, where σ > 0; and R_1 is a g × g upper triangular matrix. Notice that

V = \begin{pmatrix} (H'H)^{-1}H' \\ \sigma Q_2' \end{pmatrix}.

It follows from (2.5) that

q^* = \begin{pmatrix} q_1^* \\ q_2^* \end{pmatrix} = \begin{pmatrix} (H'H)^{-1}H' w \\ \sigma Q_2' w \end{pmatrix}, \qquad (A.1)

where w ∼ MLG(0_m, I_m, α, κ). The pdf of the joint distribution of q* = (q_1*', q_2*')' is given by

f(q_1^*, q_2^* \mid c = 0_m, V, \alpha, \kappa) = \frac{1}{\det(VV')^{1/2}} \left( \prod_{i=1}^{m} \frac{\kappa_i^{\alpha_i}}{\Gamma(\alpha_i)} \right) \exp\left\{ \alpha' V^{-1} q^* - \kappa' \exp(V^{-1} q^*) \right\}

= \frac{1}{\det(VV')^{1/2}} \left( \prod_{i=1}^{m} \frac{\kappa_i^{\alpha_i}}{\Gamma(\alpha_i)} \right) \exp\left\{ \alpha' H q_1^* + \frac{1}{\sigma} \alpha' Q_2 q_2^* - \kappa' \exp\left( H q_1^* + \frac{1}{\sigma} Q_2 q_2^* \right) \right\}.

The marginal distribution of q_1^* in this particular class of MLG distributions is defined as the marginal multivariate log-gamma distribution (mMLG). Multiplying both sides of (A.1) by (I_g, 0_{g,m-g}) we have

q_1^* = (H'H)^{-1} H' w. \qquad (A.2)

Thus, we can simulate q_1^* according to (A.2).
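As an illustration, a minimal R sketch of simulating q_1^* via (A.2) is given below, assuming the convention that an LG(α, κ) draw is the log of a Gamma(shape = α, rate = κ) draw; the dimensions and parameter values are arbitrary placeholders.

set.seed(1)
m <- 10; g <- 3

# w ~ MLG(0_m, I_m, alpha, kappa): independent log-gamma coordinates
alpha <- rep(2, m); kappa <- rep(2, m)
w <- log(rgamma(m, shape = alpha, rate = kappa))

# Arbitrary m x g matrix H; (A.2) gives the least-squares projection of w
H <- matrix(rnorm(m * g), m, g)
q1 <- solve(crossprod(H), crossprod(H, w))   # q1* = (H'H)^{-1} H' w
q1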

A.2 Data Augmentation Strategies for the cMLG Distribution

It is straightforward to simulate from the univariate and multivariate log-gamma distributions and the marginal MLG distribution, but it is difficult to simulate directly from the cMLG distribution. We now describe a data augmentation technique discussed in Theorem 2 of Bradley et al. [5] so that we can sample from an mMLG distribution instead of sampling directly from a cMLG distribution. Suppose

q1 ∼ cMLG(H, α, κ), and introduce a random vector q2 with improper prior

f(q2) = 1. (A.3)

Assume that q_2 is independent of q_1. We consider sampling from the following joint density of q = (q_1', q_2')'.

f(q1, q2|H, α, κ) = f(q1|H, α, κ)f(q2)

= f(q1|H, α, κ) (A.4)

\propto \exp\left\{ \alpha' H q_1 - \kappa' \exp(H q_1) \right\},

where f(q_2) is dropped from (A.4) due to (A.3). We now use the Metropolis-Hastings method to simulate from the target density in (A.4), using q* (defined in (A.1)) as a proposal value. The acceptance ratio in Metropolis-Hastings is then

r = \frac{\exp\left\{ \alpha' V^{-1} q - \kappa' \exp(V^{-1} q) \right\} \exp\left\{ \alpha' H q_1^* - \kappa' \exp(H q_1^*) \right\}}{\exp\left\{ \alpha' V^{-1} q^* - \kappa' \exp(V^{-1} q^*) \right\} \exp\left\{ \alpha' H q_1 - \kappa' \exp(H q_1) \right\}}

= \frac{\exp\left\{ \alpha'(H q_1 + \frac{1}{\sigma} Q_2 q_2) - \kappa' \exp(H q_1 + \frac{1}{\sigma} Q_2 q_2) \right\} \exp\left\{ \alpha' H q_1^* - \kappa' \exp(H q_1^*) \right\}}{\exp\left\{ \alpha'(H q_1^* + \frac{1}{\sigma} Q_2 q_2^*) - \kappa' \exp(H q_1^* + \frac{1}{\sigma} Q_2 q_2^*) \right\} \exp\left\{ \alpha' H q_1 - \kappa' \exp(H q_1) \right\}},

where the normalizing constants of the cMLG distributions cancel. It is immediate that

\lim_{\sigma \to \infty} r = 1,

for fixed real values of q_1, q_2, q_1* and q_2*. Thus, as σ goes to infinity, the unnormalized MLG distribution of q* = (q_1*', q_2*')' in (A.1) approaches the target cMLG distribution in (A.4). Moreover, q_2* is not used in our Gibbs sampler, so we can marginalize across q_2*. Therefore, we can avoid sampling from the cMLG distribution by sampling from the particular MLG distribution in (A.1) and marginalizing across q_2*. The above result, described in Theorem 2 of Bradley et al. [5], provides a data augmentation technique (i.e., we augment the likelihood with f(q_2) = 1) that allows us to sample from an mMLG distribution instead of a cMLG distribution. A downside to this argument is that it is not entirely clear what the distribution of q_1 is after marginalizing across the augmented q_2, since the density of q_2 in (A.3) is improper. In a Bayesian setting, this leaves the form of the posterior distribution of q_1 unclear. Thus, an alternative argument that does not produce ambiguity when interpreting the posterior of q_1 is to place an improper prior on κ. Improper priors on parameters contained within a model can be interpreted as "noninformative." Therefore, a second way to avoid simulating from the cMLG distribution is to marginalize across the scale parameter κ and use composite sampling [38]. Let

κ = exp(Bq3 + log(κ1)),

and let g(κ_1) be the density for κ_1 = (κ_{11}, \ldots, κ_{1m})'. Let q_3 have an improper prior f(q_3) = 1. Now, marginalize the unnormalized cMLG across κ. We have

f(q_1 \mid q_2 = d, c, V, \alpha)
= \iint f(q_1 \mid q_2 = d, c, V, \alpha, \kappa_1)\, g(\kappa_1)\, d\kappa_1\, dq_3
\propto \iint \left( \prod_{i=1}^{n} \kappa_i^{\alpha_i} \right) \exp\left\{ \alpha' H q_1 - \kappa' \exp(H q_1) \right\} g(\kappa_1)\, d\kappa_1\, dq_3
\propto \iint \exp\{\alpha' B q_3 + \alpha' \log(\kappa_1)\} \exp\left\{ \alpha' H q_1 - \kappa_1' \exp(H q_1 + B q_3) \right\} g(\kappa_1)\, d\kappa_1\, dq_3
\propto \iint \exp\{\alpha' \log(\kappa_1)\} \exp\left\{ \alpha'(H q_1 + B q_3) - \kappa_1' \exp(H q_1 + B q_3) \right\} g(\kappa_1)\, d\kappa_1\, dq_3
\propto \iint \exp\left\{ \alpha' (H \; B) \begin{pmatrix} q_1 \\ q_3 \end{pmatrix} - \kappa_1' \exp\left( (H \; B) \begin{pmatrix} q_1 \\ q_3 \end{pmatrix} \right) \right\} g(\kappa_1)\, d\kappa_1\, dq_3 \qquad (A.5)

∝ f(q1|c = 0m, V, α).

Let B satisfy the conditions discussed in Section 2.5; then from (A.5) we know that marginalizing a cMLG distribution over (q_1|q_2 = d, c, V, α, κ) across κ is the same as marginalizing a particular class of joint MLG distribution over (q_1, q_3) across κ_1. Thus, to simulate from f(q_1|c = 0_m, V, α), from (A.5) we first simulate κ_1, and then simulate q_1 according to (A.2), that is, q_1 = (H'H)^{-1}H'w. The samples we simulate from f(q_1|c = 0_m, V, α) then have the same distribution as the cMLG random vector (q_1|q_2 = 0_{m-g}, c = 0_m, V, α). We may consider a collapsed Gibbs sampler that marginalizes out the rate parameter vector κ in those cMLG posteriors. For example, we use a four-step prototype Gibbs sampler that iterates among the following steps:

Sampler 1

Step 1. Simulate κ from f(κ|q1, c = 0m, V, α);

Step 2. Simulate α from f(α|q1, c = 0m, V, κ);

Step 3. Simulate V from f(V|q1, c = 0m, α, κ);

Step 4. Simulate q1 from f(q1|c = 0m, V, α, κ), which is an mMLG distribution.

From Van Dyk and Park (2008) [38], we know that moving components in a step of a Gibbs sampler from being conditioned on to being sampled can improve the convergence characteristics of the sampler. This neither alters the stationary distribution of the chain nor destroys the compatibility of the conditional distributions. We can perform a Gibbs sampler among the following steps:

Sampler 2
Step 1. Simulate κ* from f(κ|q1, c = 0m, V, α);
Step 2. Simulate α from f(α|q1, c = 0m, V, κ*);
Step 3. Simulate (κ**, V) from f(κ, V|q1, c = 0m, α);
Step 4. Simulate (κ, q1) from f(κ, q1|c = 0m, V, α).

In each step we condition on the most recently sampled value of each variable that is not sampled in that step. Thus in Step 2, we condition on the κ* sampled in Step 1. Here the superscripts "*" and "**" represent intermediate quantities, meaning that the sample is an intermediate value and not part of the output of an iteration. In Sampler 2, κ is sampled three times, which may be inefficient. But removing any two steps of κ from the iteration will affect the transition kernel, because Step 2 is conditioned on the sample from Step 1 and Step 3 is part of the output of an iteration. As Van Dyk and Park [38] illustrate, such changes to the transition kernel can destroy the correlation structure of the stationary distribution or otherwise affect convergence of the chain. So we consider removing only the samples of intermediate quantities from a sampler. Permuting the steps of a Gibbs sampler does not alter its stationary distribution. Thus, we permute the steps in Sampler 2 and remove the redundant simulations. After permutation, the Gibbs sampler iterates among the following steps:

Sampler 3
Step 1. Simulate (κ*, V) from f(κ, V|q1, c = 0m, α);
Step 2. Simulate (κ**, q1) from f(κ, q1|c = 0m, V, α);
Step 3. Simulate κ from f(κ|q1, c = 0m, V, α);
Step 4. Simulate α from f(α|q1, c = 0m, V, κ).

Then we can remove the redundant samples of κ in the first two steps by appropriately marginalizing κ and obtain the Gibbs sampler with the following steps:

Sampler 4

Step 1. Simulate V from f(V|c = 0m, α), which marginalizes across κ;

Step 2. Simulate q1 from f(q1|c = 0m, V, α), which is an mMLG distribution after marginalizing out κ;

Step 3. Simulate κ from f(κ|q1, c = 0m, V, α);

Step 4. Simulate α from f(α|q1, c = 0m, V, κ).

Here f(q1|c = 0m, V, α) is an mMLG distribution after marginalizing out κ, which is the same as marginalizing an unnormalized cMLG distribution. Therefore, performing a Gibbs sampler with Sampler 1 is equivalent to running Sampler 4, which is to use (A.5) to simulate from the target unnormalized cMLG distribution. Recall that it is easy to simulate using Sampler 4 when (H'H)^{-1}H'w in (A.2) is easy to compute.

A.3 Proof of Proposition in Section 2.2

Proof of Proposition 1 From (2.3) we have

f(q_+ \mid \alpha, \alpha) = \frac{\alpha^{\alpha} |\alpha^{-1/2}|}{\Gamma(\alpha)} \exp\left\{ \alpha\left(\alpha^{-1/2} q_+\right) - \alpha \exp\left(\alpha^{-1/2} q_+\right) \right\}

= \frac{\alpha^{\alpha - 1/2}}{\Gamma(\alpha)} \exp\left\{ \alpha^{1/2} q_+ - \alpha \exp\left(\alpha^{-1/2} q_+\right) \right\},

and using a Taylor series expansion of e^x we have

f(q_+ \mid \alpha, \alpha) = \frac{\alpha^{\alpha - 1/2}}{\Gamma(\alpha)} \exp\left\{ \alpha^{1/2} q_+ - \alpha\left(1 + \alpha^{-1/2} q_+ + \frac{1}{2\alpha} q_+^2 + O\left(\frac{q_+^3}{\alpha^{3/2}}\right)\right) \right\}

= \frac{\alpha^{\alpha - 1/2}}{\Gamma(\alpha)} \exp\left\{ -\alpha - \frac{1}{2} q_+^2 + O\left(\frac{q_+^3}{\alpha^{1/2}}\right) \right\},

where "O(·)" is the "Big-O" notation (e.g., see Lehmann [23]). Then, using Stirling's approximation [10] for the Gamma function yields,

f(q_+ \mid \alpha, \alpha) = \frac{\alpha^{\alpha - 1/2}}{\sqrt{2\pi}\, \frac{\alpha^{\alpha}}{e^{\alpha}}\, \frac{1}{\alpha^{1/2}}\left(1 + O\left(\frac{1}{\alpha}\right)\right)} \exp\left\{ -\alpha - \frac{1}{2} q_+^2 + O\left(\frac{q_+^3}{\alpha^{1/2}}\right) \right\}

\sim \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} q_+^2 \right),

where the sign ∼ means that the two quantities are asymptotically the same: their ratio tends to 1 as α tends to infinity. Thus, a re-scaled log-gamma distribution converges to the standard normal distribution when its shape and rate parameters go to infinity.
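This limit is easy to check numerically; the R sketch below compares rescaled LG(α, α) draws with the standard normal for a large α (the value 400 is an arbitrary choice).

set.seed(1)
alpha <- 400

# LG(alpha, alpha) draws are logs of gamma draws; rescale by alpha^(1/2)
q <- sqrt(alpha) * log(rgamma(10000, shape = alpha, rate = alpha))

# Compare with the standard normal
c(mean = mean(q), sd = sd(q))
qqnorm(q); qqline(q)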

A.4 Proof of Proposition in Section 2.3

Proof of Proposition 2

Let wi ∼ LG(α, α), for i = 1, 2, . . . , m. From (2.5) we have

\alpha^{1/2} w = \alpha^{1/2}(w_1, \ldots, w_m)' \sim \mathrm{MLG}(0_m, \alpha^{1/2} I_m, \alpha 1, \alpha 1).

Then it follows from Proposition 1 that α^{1/2}w converges in distribution to a multivariate normal distribution with mean zero and m × m identity covariance matrix. Now define the transformation q = c + V(α^{1/2}w), which follows an MLG(c, α^{1/2}V, α1, α1) distribution. It follows from Theorem 5.1.8 of Lehmann [23] that q converges in distribution to a multivariate normal distribution with mean c and covariance matrix VV' as α goes to infinity.

A.5 Proof in Section 2.4

Proof of (2.7) - (2.9) (Bradley et al. [5])

Let c = (c_1', c_2')', where c_1 contains the first g components of c and c_2 the remaining (m − g) components. The joint distribution of (q_1, q_2) is given by

f(q_1, q_2 \mid c, V, \alpha, \kappa)

= \frac{1}{\det(VV')^{1/2}} \left( \prod_{i=1}^{m} \frac{\kappa_i^{\alpha_i}}{\Gamma(\alpha_i)} \right) \exp\left\{ \alpha'(H \; B)\begin{pmatrix} q_1 - c_1 \\ q_2 - c_2 \end{pmatrix} - \kappa'\exp\left( (H \; B)\begin{pmatrix} q_1 - c_1 \\ q_2 - c_2 \end{pmatrix} \right) \right\}

= \frac{1}{\det(VV')^{1/2}} \left( \prod_{i=1}^{m} \frac{\kappa_i^{\alpha_i}}{\Gamma(\alpha_i)} \right) \exp\left\{ \alpha' H q_1 + \alpha' B q_2 - \alpha' V^{-1} c - \kappa' \exp(H q_1 + B q_2 - V^{-1} c) \right\}

= C_1 \exp\left\{ \alpha' H q_1 - \exp(B q_2 - V^{-1} c + \log(\kappa))' \exp(H q_1) \right\},

where C_1 is the normalizing constant given by C_1 = \frac{1}{\det(VV')^{1/2}} \left( \prod_{i=1}^{m} \frac{\kappa_i^{\alpha_i}}{\Gamma(\alpha_i)} \right) \exp\left\{ \alpha' B q_2 - \alpha' V^{-1} c \right\}, and (B q_2 - V^{-1} c)_i is the i-th element of the m-dimensional vector (B q_2 - V^{-1} c). Given q_2 = d, the conditional distribution of (q_1 \mid q_2 = d, c, V, \alpha, \kappa) is given by

f(q_1 \mid q_2 = d, c, V, \alpha, \kappa) = \frac{\left[ f(q, c, V, \alpha, \kappa) \right]_{q_2 = d}}{\left[ \int f(q, c, V, \alpha, \kappa)\, dq_1 \right]_{q_2 = d}} = C \exp\left\{ \alpha' H q_1 - \kappa_c' \exp(H q_1) \right\},

which is shown in (2.7).

APPENDIX B

IRB APPROVALS

Office of the Vice President for Research Human Subjects Committee Tallahassee, Florida 32306-2742 (850) 644-8673 · FAX (850) 644-4392

APPROVAL MEMORANDUM Date: 03/29/2019 To: Pengpeng Wang Address: 214 Rogers Building (OSB), 117 N. Woodward Ave., Tallahassee, Florida 32306 Dept.: STATISTICS DEPARTMENT

From: Thomas L. Jacobson, Chair

Re: Use of Human Subjects in Research A Bayesian Semiparametric Joint Model for Longitudinal and Survival Data (Grant Title: Smart Early Screening for Autism and Communication Disorders in Primary Care)

The application that you submitted to this office in regard to the use of human subjects in the proposal referenced above have been reviewed by the Secretary, the Chair, and two members of the Human Subjects Committee. Your project is determined to be Expedited per 45 CFR § 46.110(6) and has been approved by an expedited review process.

The Human Subjects Committee has not evaluated your proposal for scientific merit, except to weigh the risk to the human participants and the aspects of the proposal related to potential risk and benefit. This approval does not replace any departmental or other approvals, which may be required.

If you submitted a proposed consent form with your application, the approved stamped consent form is attached to this approval notice. Only the stamped version of the consent form may be used in recruiting research subjects.

If the project has not been completed by 03/27/2020 you must request a renewal of approval for continuation of the project. As a courtesy, a renewal notice will be sent to you prior to your expiration date; however, it is your responsibility as the Principal Investigator to timely request renewal of your approval from the Committee.

You are advised that any change in protocol for this project must be reviewed and approved by the Committee prior to implementation of the proposed change in the protocol. A protocol change/amendment form is required to be submitted for approval by the Committee. In addition, federal regulations require that the Principal Investigator promptly report, in writing any unanticipated problems or adverse events involving risks to research subjects or others.

By copy of this memorandum, the chairman of your department and/or your major professor is reminded that he/she is responsible for being informed concerning research projects involving human subjects in the department, and should review protocols as often as needed to insure that the project is being conducted in compliance with our institution and with DHHS regulations.

This institution has an Assurance on file with the Office for Human Research Protection. The Assurance Number is IRB00000446.

Cc: Elizabeth Slate, Advisor HSC No. 2019.26786

Office of the Vice President for Research Human Subjects Committee Tallahassee, Florida 32306-2742 (850) 644-8673 · FAX (850) 644-4392

APPROVAL MEMORANDUM Date: 04/29/2019 To: Pengpeng Wang Address: 214 Rogers Building (OSB), 117 N. Woodward Ave., Tallahassee, Florida 32306 Dept.: STATISTICS DEPARTMENT From: Thomas L. Jacobson, Chair

Re: Use of Human Subjects in Research A Bayesian Semiparametric Joint Model for Longitudinal and Survival Data

The application that you submitted to this office in regard to the use of human subjects in the proposal referenced above have been reviewed by the Secretary, the Chair, and two members of the Human Subjects Committee. Your project is determined to be Exempt per 45 CFR § 46.110(b)7 and has been approved by an expedited review process.

The Human Subjects Committee has not evaluated your proposal for scientific merit, except to weigh the risk to the human participants and the aspects of the proposal related to potential risk and benefit. This approval does not replace any departmental or other approvals, which may be required.

If you submitted a proposed consent form with your application, the approved stamped consent form is attached to this approval notice. Only the stamped version of the consent form may be used in recruiting research subjects.

If the project has not been completed by 04/27/2020 you must request a renewal of approval for continuation of the project. As a courtesy, a renewal notice will be sent to you prior to your expiration date; however, it is your responsibility as the Principal Investigator to timely request renewal of your approval from the Committee.

You are advised that any change in protocol for this project must be reviewed and approved by the Committee prior to implementation of the proposed change in the protocol. A protocol change/amendment form is required to be submitted for approval by the Committee. In addition, federal regulations require that the Principal Investigator promptly report, in writing any unanticipated problems or adverse events involving risks to research subjects or others.

By copy of this memorandum, the chairman of your department and/or your major professor is reminded that he/she is responsible for being informed concerning research projects involving human subjects in the department, and should review protocols as often as needed to insure that the project is being conducted in compliance with our institution and with DHHS regulations.

This institution has an Assurance on file with the Office for Human Research Protection. The Assurance Number is IRB00000446.

Cc: Elizabeth Slate, Advisor HSC No. 2019.27195

BIBLIOGRAPHY

[1] Donald I. Abrams, Anne I. Goldman, Cynthia Launer, Joyce A. Korvick, James D. Neaton, Lawrence R. Crane, Michael Grodesky, Steven Wakefield, Katherine Muth, Sandra Kornegay, David L. Cohn, Allen Harris, Roberta Luskin-Hawk, Norman Markowitz, James H. Sampson, Melanie Thompson, and Lawrence Deyton. A comparative trial of didanosine or zalcitabine after treatment with zidovudine in patients with human immunodeficiency virus infection. New England Journal of Medicine, 330(10):657–662, 1994.

[2] Theodore W. Anderson. The statistical analysis of time series, volume 19. John Wiley & Sons, 2011.

[3] Charles E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, pages 1152–1174, 1974.

[4] Maurice S. Bartlett and D. G. Kendall. The statistical analysis of variance-heterogeneity and the logarithmic transformation. Supplement to the Journal of the Royal Statistical Society, 8(1):128–138, 1946.

[5] Jonathan R. Bradley, Scott H. Holan, and Christopher K. Wikle. Computationally efficient multivariate spatio-temporal models for high-dimensional count-valued data (with discussion). Bayesian Analysis, 13(1):253–310, 2018.

[6] Elizabeth R. Brown and Joseph G. Ibrahim. A Bayesian semiparametric joint hierarchical model for longitudinal and survival data. Biometrics, 59(2):221–228, 2003.

[7] Larry C. Clark, Gerald F. Combs, Bruce W. Turnbull, Elizabeth H. Slate, Dan K. Chalker, James Chow, Loretta S. Davis, Renee A. Glover, Gloria F. Graham, Earl G. Gross, Arnon Krongrad, Jack L. Lesher Jr, H. Kim Park, Beverly B. Sanders Jr, Cameron L. Smith, and J. Richard Taylor. Effects of selenium supplementation for cancer prevention in patients with carcinoma of the skin: a randomized controlled trial. JAMA, 276(24):1957–1963, 1996.

[8] Gavin E. Crooks. The Amoroso distribution. arXiv preprint arXiv:1005.3274, 2015.

[9] David B. Dahl. Model-based clustering for expression data via a Dirichlet process mixture model. Bayesian Inference for Gene Expression and Proteomics, 4:201–218, 2006.

[10] Persi Diaconis and David Freedman. An elementary proof of Stirling’s formula. The American Mathematical Monthly, 93(2):123–125, 1986.

[11] Bradley Efron. The two sample problem with censored data. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 4, pages 831–853. University of California Press, Berkeley, CA, 1967.

[12] Michael D. Escobar and Mike West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430):577–588, 1995.

[13] Andrew Gelman. Two simple examples for understanding posterior p-values whose distributions are far from uniform. Electronic Journal of Statistics, 7:2595–2602, 2013.

[14] Richard D. Gill. Censoring and stochastic integrals. Statistica Neerlandica, 34(2):124–124, 1980.

[15] Anne I. Goldman, Bradley P. Carlin, Lawrence R. Crane, Cynthia Launer, Joyce A. Korvick, Lawrence Deyton, and Donald I. Abrams. Response of CD4 lymphocytes and clinical consequences of treatment using ddI or ddC in patients with advanced HIV infection. JAIDS Journal of Acquired Immune Deficiency Syndromes, 11(2):161–169, 1996.

[16] Major Greenwood. A report on the natural duration of cancer. Reports on Public Health and Medical Subjects, 33, 1926.

[17] Xu Guo and Bradley P. Carlin. Separate and joint modeling of longitudinal and event time data using standard computer packages. The American Statistician, 58(1):16–24, 2004.

[18] W. Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.

[19] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.

[20] Edward L. Kaplan and Paul Meier. Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282):457–481, 1958.

[21] John P. Klein and Melvin L. Moeschberger. Survival analysis: techniques for censored and truncated data, pages 91–104. Springer Science & Business Media, 2006.

[22] John A. Laurmann and W. Lawrence Gates. Statistical considerations in the evaluation of climatic experiments with atmospheric general circulation models. Journal of the Atmospheric Sciences, 34(8):1187–1199, 1977.

[23] Erich Leo Lehmann. Elements of large-sample theory. Springer Science & Business Media, 2004.

[24] Haiqun Lin, Bruce W. Turnbull, Charles E. McCulloch, and Elizabeth H. Slate. Latent class models for joint analysis of longitudinal biomarker and event process data: application to longitudinal prostate-specific antigen readings and prostate cancer. Journal of the American Statistical Association, 97(457):53–65, 2002.

[25] Xiao-Li Meng. Posterior predictive p-values. The Annals of Statistics, 22(3):1142–1160, 1994.

[26] Bruce D. Meyer. Unemployment insurance and unemployment spells. Working paper, National Bureau of Economic Research, 1988.

[27] Leslie C. Morey and Alan Agresti. The measurement of classification agreement: An adjustment to the Rand statistic for chance agreement. Educational and Psychological Measurement, 44(1):33–37, 1984.

[28] Radford M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.

[29] Radford M. Neal. Slice sampling. Annals of Statistics, pages 705–741, 2003.

[30] Ross L. Prentice. A log gamma model and its maximum likelihood estimation. Biometrika, 61(3):539–544, 1974.

[31] William M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.

[32] Xiao Song, Marie Davidian, and Anastasios A. Tsiatis. A semiparametric likelihood approach to joint modeling of longitudinal and time-to-event data. Biometrics, 58(4):742–753, 2002.

[33] David J. Spiegelhalter, Nicola G. Best, Bradley P. Carlin, and Angelika van der Linde. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4):583–639, 2002.

[34] M. Stephens. Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(4):795–809, 2000.

[35] Jeremy M. G. Taylor, Shu-Jane Tan, Roger Detels, and Janis V. Giorgi. Applications of a computer simulation model of the natural history of CD4 T-cell number in HIV-infected individuals. AIDS (London, England), 5(2):159–167, 1991.

[36] H. Jean Thiébaux and Francis W. Zwiers. The interpretation and estimation of effective sample size. Journal of Climate and Applied Meteorology, 23(5):800–811, 1984.

[37] Anastasios A. Tsiatis and Marie Davidian. Joint modeling of longitudinal and time-to-event data: an overview. Statistica Sinica, pages 809–834, 2004.

[38] David A. Van Dyk and Taeyoung Park. Partially collapsed Gibbs samplers: theory and methods. Journal of the American Statistical Association, 103(482):790–796, 2008.

[39] Yan Wang and Jeremy M. G. Taylor. Jointly modeling longitudinal and event time data with application to acquired immunodeficiency syndrome. Journal of the American Statistical Association, 96(455):895–905, 2001.

[40] Mike West. Hyperparameter estimation in Dirichlet process mixture models. ISDS Discussion Paper #92-A03, Duke University, 1992.

[41] Daowen Zhang and Marie Davidian. Linear mixed models with flexible distributions of random effects for longitudinal data. Biometrics, 57(3):795–802, 2001.

BIOGRAPHICAL SKETCH

Pengpeng Wang was born on November 25, 1991, in Tianjin, China. She grew up in Tianjin and received a Bachelor of Science in Statistics in 2014 from the School of Mathematical Sciences, Nankai University, China. She joined the Department of Statistics at Florida State University in 2014 and received a Master of Science in Statistics in 2016. She is now a PhD candidate in Statistics at Florida State University. During the summer of 2018, she worked at Liberty Mutual Insurance as a Data Science Graduate Intern. As a PhD candidate at Florida State University, Pengpeng has been advised and mentored by Dr. Elizabeth H. Slate. She was a teaching assistant from 2014 to 2017 and the instructor for STA1013 “Statistics Through Example,” and she has been working with Dr. Elizabeth H. Slate as a research assistant since 2017. She presented this dissertation in the “Yongyuan and Anna Li” student presentation competition and won the departmental award in 2019. She was one of the finalists in the Three-Minute Thesis presentation competition at Florida State University in 2018. She has also given presentations at conferences, including the 2017 Southern Regional Council on Statistics conference and the 2018 Joint Statistical Meetings. Outside of academics, Pengpeng enjoys playing table tennis. She is currently the president of the FSU Table Tennis Club.
