Journal of the American Statistical Association

ISSN: 0162-1459 (Print) 1537-274X (Online) Journal homepage: http://www.tandfonline.com/loi/uasa20

Efficient Estimation for Semiparametric Structural Equation Models With Censored Data

Kin Yau Wong, Donglin Zeng & D. Y. Lin

To cite this article: Kin Yau Wong, Donglin Zeng & D. Y. Lin (2018) Efficient Estimation for Semiparametric Structural Equation Models With Censored Data, Journal of the American Statistical Association, 113:522, 893-905, DOI: 10.1080/01621459.2017.1299626 To link to this article: https://doi.org/10.1080/01621459.2017.1299626

View supplementary material

Accepted author version posted online: 03 Mar 2017. Published online: 06 Jun 2018.

Submit your article to this journal

Article views: 481

View Crossmark data

Full Terms & Conditions of access and use can be found at http://www.tandfonline.com/action/journalInformation?journalCode=uasa20 JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION , VOL. , NO. , –, Theory and Methods https://doi.org/./..

Efficient Estimation for Semiparametric Structural Equation Models With Censored Data

Kin Yau Wong, Donglin Zeng, and D. Y. Lin Department of Biostatistics, University of North Carolina, Chapel Hill, NC

ABSTRACT ARTICLE HISTORY Structural equation modeling is commonly used to capture complex structures of relationships among Received April  multiple variables, both latent and observed. We propose a general class of structural equation models Revised January  with a semiparametric component for potentially censored survival times. We consider nonparametric KEYWORDS maximum likelihood estimation and devise a combined expectation-maximization and Newton-Raphson Integrative analysis; Joint algorithm for its implementation. We establish conditions for model identifiability and prove the consis- modeling; Latent variables; tency, asymptotic normality, and semiparametric efficiency of the . Finally, we demonstrate the Model identifiability; satisfactory performance of the proposed methods through simulation studies and provide an application Nonparametric maximum to a motivating cancer study that contains a variety of genomic variables. Supplementary materials for this likelihood estimation; article are available online. Survival analysis

1. Introduction Asparouhov, Masyn, and Muthén (2006) considered a more Structural equation modeling (SEM) is a very general and general formulation of the association among the latent and powerful approach to capture complex relationships among observed variables. SEM with the Cox proportional haz- multiple factors, both observed and latent (Bollen 1989). A typi- ardsmodelforthesurvivalcomponenthasbeenadoptedfor cal SEM framework consists of a structural model that connects more complex settings, such as multivariate survival times latent variables and a measurement model that relates latent (Stoolmiller and Snyder 2006) and competing risks (Stoolmiller variables to observed variables. SEM is extremely popular in the and Snyder 2014). A popular software program, Mplus (Muthén social sciences and psychology, where unmeasured quantities and Muthén 1998–2015), has implemented SEM with survival and psychological constructs, such as human intelligence and data under the proportional hazards model. The estimation creativity, can be related to and investigated through observed of the nonparametric baseline hazard function is based on data. The text of Bollen (1989) has been cited more than 20,000 piecewise-constant splines, and no theoretical justification is times. Recently, SEM has gained popularity in medical and available. In fact, the standard error for the baseline public health research (Dahly, Adair, and Bollen 2009;Naliboff hazard function is incorrect. et al. 2012). In this article, we propose a general SEM framework that Our interest in SEM was motivated by its potential appli- includes a semiparametric component of the measurement cation to integrative analysis in genomic studies. Recent model for potentially censored survival times. Specifically, technologicaladvanceshavemadeitpossibletocollectdifferent we formulate the effects of latent and observed covariates types of genomic data, including DNA copy number, SNP geno- on survival times through a broad class of semiparametric type, DNA methylation level, and expression levels of mRNA, transformation models that includes the proportional hazards microRNA, and protein, on a large number of subjects. There model as a special case. The observed covariates may include is a growing interest in integrating these genomic platforms manifest variables that depend on latent variables. We study so as to understand their biological relationships and predict nonparametric maximum likelihood estimation (NPMLE), disease progression and death, which are considered potentially under which the cumulative hazard functions are estimated by censored survival times (The Cancer Genome Atlas (TCGA); step functions with jumps at observed survival times. https://tcga-data.nci.nih.gov/tcga/). The proposed SEM is reminiscent of joint modeling for sur- SEM with discrete survival times has been studied by vival and longitudinal data (Henderson, Diggle, and Dobson Rabe-Hesketh, Yang, and Pickles (2001), Rabe-Hesketh, Skro- 2000;TsiatisandDavidian2004). With the latter, the observed ndal, and Pickles (2004), Muthén and Masyn (2005), and longitudinal variables are considered error-prone measure- MoustakiandSteele(2005). For continuous survival time, ments of some underlying latent variables, but the measure- Larsen (2004, 2005) adopted the proportional hazards model ments themselves are not causal determinants of the survival (Cox 1972) with a single latent variable to capture the associ- time. By contrast, our SEM framework allows latent variables ation between the survival time and other observed variables; to have direct effects on survival times, as well as indirect effects

CONTACT Danyu Lin [email protected] Department of Biostatistics, University of North Carolina, Chapel Hill, NC . Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/JASA. ©  American Statistical Association 894 K. Y. WONG, D. ZENG, AND D. Y. LIN through other manifest variables. In addition, our framework W, Z, Y,andη as follows: accommodates much more complex relationships among latent η | Z ∼ Fη(·|Z; ν), (1) variables. Y | (Z, η) ∼ (·|Z, η; ψ), A major challenge in our theoretical development is model FY (2) identifiability. Even for an SEM with normally distributed  (t | W, Z,Y, η) Tk variables, no single set of conditions exists that is both nec- W Tϑ +ZTβ +Y Tα +ηTφ = G  (t) e k k k k , k = 1,...,K, (3) essary and sufficient for model identifiability. Methods that k k deal with special cases of the normal SEM were proposed by where Fη(·|Z, ν) denotes a q-variate normal distribution func- Bollen (1989), Reilly and O’Brien (1996), Vicard (2000), and tion indexed by a parameter vector ν, FY (·|Z, η; ψ) denotes an Bollen and Davis (2009), among others. Most of the methods r-variate parametric distribution function indexed by a param- ψ  are based on the fact that identifiability can be established eter vector , Tk isthecumulativehazardfunctionofTk by solving the equations relating the first two model-implied given (W, Z,Y, η), Gk is a known increasing function, k is moments to the sample moments. This approach is not directly an unspecified positive increasing function with k(0) = 0, and (ϑ , β , α , φ ) applicable to models with nonparametric components, as k k k k are unknown regression parameters. infinite-dimensional parameters cannot be identified through Model (1) is the structural model of the latent variables. a finite number of equations. Because the proportional hazards Model (2) is the measurement model of Y. We assume that structure results in a likelihood function that takes the form of Y and η are independent of W given Z.Models(1)and(2) a Laplace transform, however, we are able to develop sufficient represent the existing SEM framework with Y not restricted to conditions under which the identifiability of a semiparamet- be normally distributed. Equation (3) includes the proportional ric SEM can be established by inspecting simpler parametric hazards and proportional odds models as special cases with models. the choices of Gk(x) = x and Gk(x) = log(1 + x),respectively. Another theoretical challenge is the invertibility of the infor- The proportional hazards model has been considered in the mation operator. For the information operator to be invertible, literature. we require that the score statistic along any nontrivial submodel The survival time Tk is subject to right censoring by Ck.It is nonzero. As in the case of model identifiability, general is assumed that (C ,...,C ) are independent of (T ,...,T ) 1 K ˜ 1 K conditions for the invertibility of the information operator and η conditional on Y, Z,andW.DefineTk = min(Tk,Ck ) for semiparametric models do not exist. In the existing work and k = I(Tk ≤ Ck),whereI(·) is the indicator function. involving latent variables for survival times (Kosorok, Lee, and For a sample of size n,theobserveddataconsistofO ≡ ˜ ˜ i Fine 2004;ZengandLin2010), verifying the invertibility of the (T1i,...TKi,1i,...,Ki,Y i, Zi,W i) (i = 1,...,n). information operator involves inspecting the local behavior of Let θ denote the collection of all Euclidean parameters, and the score statistic around the zero survival time. This approach write A = (1,...,K ). The likelihood function for θ and A does not make full use of the variability of the score statistic is proportional to contributed by the survival times and cannot deal with the n K W Tϑ +ZTβ +Y Tα +ηTφ proposed general modeling framework. We show that the ˜ k k k k Ln (θ, A) = λk(Tki )e i i i invertibility of the information operator can be verified by = = i 1 k 1 inspecting the parametric components of the SEM under some T   ˜ W Tϑ +Z β +Y Tα +ηTφ ki ×G  (T )e i k i k i k k mild conditions in the survival model. k k ki W Tϑ +ZTβ +Y Tα +ηTφ ˜ k k k k The rest of this article is structured as follows. In Section 2, × exp − Gk k(Tki )e i i i we formulate the model and describe our approach to establish × fY (Y i | Zi, η; ψ) fη(η | Zi; ν) dη, (4) model identifiability. In Section 3,wediscussthenumeri- ( ) = ( )/ λ =  =  cal implementation of the NPMLE. In Section 4,wepresent where f x d f x dx for any function f , k , fY FY ,  k theoretical results for model identifiability and describe the and fη = Fη. The NPMLE is defined to be the maximizer of asymptotic properties of the estimators. In Section 5,wereport Ln(θ, A),inwhichk is treated as a step function with jumps ˜ the results from simulation studies. In Section 6,weprovide at Tki with ki = 1(i = 1,...,n). an application to the TCGA data, which motivated this work. We make some concluding remarks in Section 7 and relegate 2.2. Model Identifiability theoretical proofs to the Appendix. We describe our approach to establish model identifiability in this section and defer the technical details to Section 4. The iden- 2. Basic Framework tifiability results can be summarized by two simple rules. Sup- pose that we have arranged the survival times such that for some ≤ ≤ ( , ) ( ,..., ) 2.1. Model and Likelihood 0 K1 min q K , each of T1 TK1 regresses on and only on one latent variable and a set of covariates that are inde- Let η denote a q-vector of latent variables, Y denote an r-vector pendent of the latent variables. (We allow K1 = 0ifnosurvival of uncensored manifest variables, (T1,...,TK ) denote K poten- time satisfies the given conditions, in which case Rule 1 below is tially censored survival times, and W and Z denote two vectors vacuous.)WecallanobservedvariableX an indicator of a latent of observed covariates. Without loss of generality, assume that variable η if X follows a generalized linear model with η as a thesupportofthecovariatesincludeszero.Wespecifythecon- covariate and is independent of all other manifest variables and ditional distributions of η given Z,Y given Z and η,andTk given survival times conditional on η.Wehavethefollowingrules: JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 895

η1,andT2 is a survival time that follows the proportional haz- ards model with covariates η1 and η2. Assume that W and Z are nonconstant and linearly independent, the regression parame- ters for the latent variables in the models of T1 and Y1 are fixed to be one, E(η1) = E(η2) = 0, and the regression parameters of W in the model of T1 and η2 in the models ofY2 and Y3 are nonzero. First, we use T1 to help identify the latent variable distribu- tions. By Rule 1, η1 can be treated as observed when identifying the model. The problem thus reduces to identifying the model shownontheright-handsideofFigure 2. The model can then beshownidentifiablebytheargumentsusedinExample 1. Figure . The first example of SEM to illustrate the identifiability rules. The SEM con- sists of one latent variable, one survival time, and three conditionally independent normal manifest variables. 3. Computation of the NPMLE ( ,..., ) 1. The latent variables attached to T1 TK1 can be In this section, we use Z to denote both W and Z with β (k = ( ,..., ) k treated as observed if each of T1 TK1 depends on 1,...,K) as the corresponding vector of regression parame- at least one observed covariate. ters. Application of a transformation Gk canbeviewedasinclu- 2. If each latent variable has a separate continuous indica- sion of an extra latent variable log sk in the regression equation, tor and the distributions of the latent variables and the where s is a random variable with density g such that G (x) = k∞ k k indicators are identifiable, then the whole model is iden- − −xt ( ) log 0 e gk t dt. We adopt the expectation-maximization tifiable. (EM) algorithm (Dempster, Laird, and Rubin 1977)bytreating To illustrate the usefulness of the two identifiability rules, we the latent variables, including those introduced by the trans- present two examples. formations, as missing data. We perform occasional Newton- Example 1. Consider the model depicted in Figure 1.Inthe Raphson steps to speed up the convergence. In the combined algorithm, either an EM step or a Newton- model, Y1, Y2,andY3 are conditionally independent normal manifest variables of η,andT is a survival time that follows the Raphson step is performed at each iteration. To avoid confusion, proportional hazards model with covariate η. Assume that the we call the latter an outer Newton-Raphson step. For an EM step, note that the conditional expectation for any function ϕ of regression parameter of Y1 on η is fixed to be one, E(η) = 0, (η , si) ≡ (η , s1i,...,sKi) given the observed data is and the regression parameters of η in the models of Y2 and Y3 i i are nonzero. {ϕ(η , s ) | O } The above model is similar to the joint model for survival and E i i i  longitudinal variables. By Bollen (1989)’s three-indicator rule, K ki − ZTβ +Y Tα +ηTφ = C 1 ϕ (η, s)  { ˜ } i k i k k the model of (Y1,Y2,Y3,η) is identifiable. With Y1 serving as k Tki ske an indicator of η, Rule 2 implies that the remaining parame- k=1 ˜ Tki ters are identifiable. Note that Rule 1 is not applicable in this ZTβ +Y Tα +ηTφ × − i k i k k  ( ) case because T does not depend on an independent covariate. exp ske d k t 0 In fact, the model is not identifiable without (Y1,Y2,Y3) because × f (Y | Z , η; ψ) fη(η | Z ; ν)g(s) dη ds ··· ds , the scale of the baseline hazard function and the variance of η Y i i i 1 K cannot be separated.  { }  where k t isthejumpsizeofthestepfunction k at t, s = ( ,..., ) (s) = K ( ) C Example 2. Consider the model depicted on the left-hand side of s1 sK , g k=1 gk sk ,and equals the above Figure 2.Inthemodel,Y1, Y2,andY3 are conditionally indepen- integral evaluated at ϕ(·, ·) = 1. We use the Gauss-Hermite dent normal manifest variables of η2, T1 is a survival time that quadrature to approximate the integrals. To reduce the number follows the proportional hazards model with covariates W and of abscissas, we adopt an adaptive quadrature approach (Liu

Figure . The second example of SEM to illustrate the identifiability rules. The left panel is an SEM that consists of two latent variables, two observed covariates, two survival times, and three conditionally independent normal manifest variables. The right panel is an intermediate step in identifying the SEM on the left. 896 K.Y.WONG,D.ZENG,ANDD.Y.LIN and Pierce 1994). Denote the approximation of the condi- 4. Theoretical Properties tional expectation as Eˆ (·). After taking expectation on the (β , α , φ ) 4.1. Identifiability Conditions functions involved, we update k k k by the one-step Newton-Raphson algorithm on As discussed in Section 2,wesetasideK1 survival ,..., ⎡ ⎤ times, T1 TK1 , that are used to identify the distribu-   ki n ˆ ZTβ + Y Tα + ηTφ tion of the underlying latent variables. We assume that ⎣ E ski exp  i k i k i k  ⎦ log    . (φ ,...,φ ) = RK1 n ˜ ˜ ˆ T span 1 K .WecanchoosetheK1 survival = I(T ≥ T )E s exp Z β + Y Tα + ηTφ 1 i 1 j=1 kj ki kj j k j k j k timessuchthateachisassociatedwithafew,preferablyonly one, latent variables. (K1 isallowedtobe0,inwhichcasewerely Then,weupdatethecumulativebaselinehazardfunctionby solely on the manifest variable Y to identity the latent variable φ = e distribution.) Without loss of generality, we assume that k k  = ,..., e ˆ  ki  (k 1 K1), where k is a q-vector with 1 at the kth position k{Tki}=   , n ( ˜ ≥ ˜ )ˆ ZTβ + Y Tα + ηTφ and0elsewhere.Thisassumptioncanbesatisfiedbyapplying = I Tkj Tki E skj exp j k j k k j 1 j a linear transformation to the latent variables. Effectively, we fix the scale of the first K latent variables, as is common when (β , α , φ ) 1 where k k k are evaluated at the current estimates. In establishing model identifiability for SEM. We partition η into addition, we update the remaining parameters at the maximum (η , η ) η ≡ (η ,...,η ) 1 2 ,where 1 11 1K1 consists of the first K1 of components of η. We consider the following identifiability conditions. For n ˆ the “baseline” hazard functions ( ,..., ),weonlyrequire E[log{ f (Y | Z , η ; ψ) fη(η | Z ; ν)}]. 1 K Y i i i i i identifiability on ,τ[0 ], where τ denotes the study duration. i=1 (C1) If (1,W T, ZT,Y T)Tc = 0 almost surely for some vector c of appropriate dimension, then c = 0.Fork = 1,...,K, If a closed-form solution is not available, then we apply the λ ,τ one-step Newton-Raphson algorithm to the above expression k is continuous and strictly positive on [0 ], and there exists a positive and measurable function g such ∞ k instead. The above algorithm can be generalized to the case −ts that exp{−Gk(t)}= e gk(s) dmk(s),wheremk is where components of Y i thatdonotappearinthemodelof 0 the Lebesgue measure or the counting measure at 1. thesurvivaltimesaremissingatrandomforsomesubjects.In T (C2) For k = 1,...,K1,E(Y αk | Z = 0) = 0, and E(η | Z = this case, we simply drop the corresponding fY terms in the ˆ 0) = 0.Also,foranyvectorsc1 and c2 of appropriate evaluation of E and the complete-data log-likelihood. Y Tc +ηTc ( 1 2 | Z) For an outer Newton-Raphson step, we apply the one- dimensions, E e is finite almost surely. (C3) For k = 1,...,K1, ϑk is nonzero, and αk and β are zero. step Newton-Raphson algorithm directly to the logarithm of k˜ (C4) Consider two sets of parameters (ψ, ν) and (ψ, ν˜).Let Ln(θ, A) given in (4) using a similar adaptive quadrature ,η (Y, η ) Z ,η (Y, η | fY 1 be the density of 1 given .Then, fY 1 1 approximation. At the current estimates, the first derivative of ˜ Z; ψ, ν) = ,η (Y, η | Z; ψ, ν˜) Z Y η log L (θ, A), that is, the score statistic, is the same as the first fY 1 1 for all , ,and 1 n ˜ derivative of the expected complete-data log-likelihood. The implies that ψ = ψ and ν = ν˜. (Y +, η+) (Y, η ) Hessian matrix used in the Newton-Raphson algorithm can be (C5) Let 1 be the components of 1 that appear obtained by Louis’s (1982)formula. in the regression of Tk for some k = K1 + 1,...,K,and − − (Y , η ) η | ,η To determine whether an EM step or an outer Newton- let 1 be the remaining components. Let F 2 Y 1 η (Z,Y, η ) Raphson step is to be performed, we keep track of the difference be the distribution function of 2 given 1 with (Y −, η−) η in the log-likelihood at the previous iteration, either an EM or 1 treated as a parameter vector. Then, 2 is com- + { η | ,η (·|Z,Y, η ) Z = z ,Y = an outer Newton-Raphson step, and the difference at the previ- plete sufficient in F 2 Y 1 1 : 0 y , η+ = η } z y η ous outer Newton-Raphson step. For each iteration, if the log- 0 1 10 for any fixed 0, 0,and 10. likelihood difference at the previous step is too small relative to that at the previous outer Newton-Raphson step, then an outer Remark 1. Condition (C1) pertains to basic requirements on Newton-Raphson step is performed; otherwise, an EM step is thecovariates,thebaselinehazardfunctions,andthetransfor- performed. Upon convergence, Louis’s (1982)formulaisusedto mation functions such that the survival model with observed obtain the information matrix for the estimation of the standard covariates is identifiable. If mk is a point mass at 1, then Gk is errors. simply the identity function. Condition (C2) fixes the location The reason that we use a combination of the EM and parameters of the latent variables and the manifest variables that Newton-Raphson algorithms instead of the Newton-Raphson appear in the regression models of the first K1 survival times. algorithm alone is two-fold. First, EM steps are more stable, Condition (C3) requires that the first K1 survival times depend which is important, especially in early iterations. Second, in the only on their corresponding latent variable and W.Thepres- estimation of the survival model under the EM algorithm, ence of a covariate besides the latent variable is necessary for the regression parameters can be obtained by maximizing the distinguishing the contributions of the baseline hazard func- partial-likelihood-type function, and the estimators of the base- tion and the latent variable to the distribution of a survival time line hazard functions take the form of the Breslow estimator. that follows a mixture distribution. Condition (C4) requires that (Y, η ) Unlike the Newton-Raphson algorithm, the EM algorithm does the model with observed 1 is identifiable. Condition (C5) η (Y, η ) not involve the inversion of a high-dimensional matrix. requires that 2 is complete sufficient conditional on 1 , JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 897

(Y, η ) N ∈ where components of 1 that do not appear in the regres- In addition, for some positive constants M j and c j, j r ∞ r sion of Tk (k = K1 + 1,...,K) are treated as parameters, and R ,andϕ1 ∈ (R ), the rest are held fixed. Conditions (C2) and (C3) are vacuous if   q −M |b | q (NTYb +c b2 ) = = e j=1 j j ≤ e j=1 j j j j ϕ (Y ) K1 0, and condition (C5) is vacuous if K1 K. 1  q M |b | × f (Y | Z, η; ψ) fη(η | Z; ν) ≤ e j=1 j j , We have the following identifiability result. Y b = (η) Theorem 1. Under Conditions (C1)–(C5), the model specified where S for some one-to-one linear transforma- by (1)–(3) is identifiable. tion S. (D4) The function Gk is four-times differentiable, Gk(0) = 0, −κ  κ α β ( + ) 1k ≤ ( ) ≤ ( + ) 2k Remark 2. The condition that k and k are zero for and K1k 1 x Gk x K2k 1 x for k = 1,...,K1 separates the first K1 survival times from the some positive constants κ1k, κ2k, K1k,andK2k.Also, ρk remaining observed variables that are associated with the latent Gk(x)/x → Mk or Gk(x)/ log(x) → Mk as x →∞ variables. This condition is used to simplify the presentation forsomepositiveconstantsMk and ρk. In addi- −κ3k of the identifiability conditions. In the proof of Theorem 1,we tion, exp{−Gk(x)}≤μk(1 + x) for some μk and consider generalized versions of conditions (C3)–(C5), where κ3k >κ2k + 1. Furthermore, for some rk, α and β are allowed to be nonzero.       k k     (3)   (4)  G (x) + G (x) + G (x) k k k < ∞, Remark 3. Theorem 1 implies that the distribution of the latent sup  ≥ G (x)(1 + x)rk variable underlying a given survival time can be completely x 0 k ( ) identified if the survival time only regresses on the latent vari- where G and G j denote the second and jth derivatives able and a set of independent covariates. Thus, the survival times k k of Gk,respectively. ( ) make it easy to identify the model, as only a single survival time (D5) Let (Z k ,Y (k)) be the components of (Z,Y ) that appear is enough to identify an underlying latent variable. By contrast, in the regression of T (k = 1,...,K ). For any vectors this property does not hold for normal random variables. k 1 h1, h2k,andh3k of appropriate dimensions, if Remark 4. The derivation of model identifiability from con- ∂ T dition (C5) uses the property of complete sufficient . f ,η (Y, η | Z; ψ, ν) h ∂(ψ, ν) Y 1 1 1 The derivation is applicable to general latent variable mod- K1 ∂ els; the general result is given by Lemma 1 in the Appendix. (k)T (k)T − (Z h + Y h ) f ,η (Y, η | Z; ψ, ν) Lemma 1 allows for the establishment of model identifiability 2k 3k ∂η Y 1 1 = 1k by inspecting just a part of the model. It includes Reilly and k 1 O’Brien’s (1996) side-by-side rule, which states that the loadings Z Y η h = 0 h = 0 is equal to 0 for all , ,and 1,then 1 , 2k , of an observed variable on any number of latent variables h = 0 = ,..., and 3k for k 1 K1,where fY,η1 is defined in areidentifiableifeachofthelatentvariablesisattachedtoa condition (C4). separate independent observed variable whose distribution is identifiable, as a special case. Remark 5. Conditions (D1)–(D4) are similar to the conditions of Zeng and Lin (2010) for joint modeling of longitudinal and survival data. Extra conditions are imposed on the transfor- 4.2. Asymptotic Properties mations and the distributions of Y and η to accommodate the presence of unbounded covariate Y inthesurvivalmodel.Con- Let d be the dimension of θ, θ0 denote the true value of θ,and  denotethetruevalueof (k = 1,...,K). We impose the dition (D5) is for the invertibility of the information operator. If 0k k α = 0 β = 0 = ,..., following conditions. k and k for k 1 K1, then condition (D5) sim- ply requires that the information matrix of the model for (Y, η ) (D1) The parameter θ0 lies in the interior of a compact set 1 d is invertible. This is parallel to condition (C4) for identifiability.  ⊂ R ,andthefunction0k is continuously differen- λ ( ) ≡  ( )> ,τ = tiable with 0k t 0k t 0on[0 ]foreachk Remark 6. The conditions for identifiability and the invertibility ,..., 1 K. of the information operator (C1)–(C5), and (D5) differ signif- ( ˜ = τ | W, Z)>δ = (D2) With probability one, P Tki 0 (k icantly from the corresponding conditions (C5) and (C7) of ,..., δ > 1 K)forsomefixed 0 0. Zeng and Lin (2010). The latter are stated under very general Z (ψ, ν) ∈   (D3) Consider any fixed and ψν,where ψν con- settings, but they are hard to verify for specific models, espe- (ψ, ν) θ ∈  sists of the -component of every .Forany cially under our SEM framework. By contrast, our conditions > δ = , constant a1 0and 0 1, are easier to verify and have intuitive interpretations. For the   model in Example 1, condition (D5) simply requires that the a (1+|Y|+|η|) δ 1 (Y | Z, η; ψ) η(η | Z; ν) η < ∞. E e fY f d model of (Y1,Y2,Y3,η1) given Z has a nonzero score statistic, which clearly holds. Also, for j = 1, 2, 3, there exists a constant a > 0such 2 ˆ ˆ that Let A0 = (01,...,0K ) and (θ, A) be the NPMLE     (θ, A) V ={v ∈ Rd, |v|≤ } Q ={ ( )  ∂ j  of . Also, let 1 and h t :  f (Y | Z, η; ψ)  ∂ j (η | Z; ν) ∂ψ j Y  ∂ν j fη  ( +|Y|+|η|) h(t) ,τ ≤ 1} with · ,τ being the total variation norm   +   ≤ a2 1 . V[0 ] V[0 ]  (Y | Z, η; ψ)   (η | Z; ν)  e ˆ ˆ  fY  fη on [0,τ]. We consider (θ − θ0, A − A0) as a random element 898 K. Y. WONG, D. ZENG, AND D. Y. LIN

∞ K in l (V × Q ) with (Y1,...,Y5), observed binary variables (Y6,Y7),andasur- vival time T.Theirdistributionsaregivenby ˆ ˆ (θ − θ0, A − A0)(v, h1,...,hK )  (t | Z,Y ,Y ,η ) K τ T 6 7  2  ˆ T ˆ =  ( ) X T β + φ η , X = ( , , , )T, = (θ − θ0) v + hk(s) d(k − 0k)(s). G 0 t exp T T T 2 T Z1 Z2 Y6 Y7 = 0 k 1 logit {P(Y6 = 1 | Z,η2)} = X T β + φ η , X = (1, Z , Z )T, We have the following results. Y6 Y6 Y6 2 Y6 1 2 logit {P(Y7 = 1 | Z,Y6,η2)} Theorem 2. Under conditions (C1)–(C5) and (D1)–(D5), = X T β + φ η , X = ( , , , )T, ˆ K  ˆ  Y7 Y7 Y7 2 Y7 1 Z1 Z2 Y6 1. |θ − θ0|+ = sup ∈ ,τ k(t) − 0k(t) →a.s. 0;   k 1 t [0 ] | η ∼ β + φ η ,σ2 , = , , , and Yj 1 N Yj Yj 1 Yj j 1 2 3 1/2 ˆ ˆ ∞ K   2. n (θ − θ0, A − A0) →d G in l (V × Q ),whereG is 2 Yj | η2 ∼ N βY + φY η2,σ , j = 4, 5, a continuous zero-mean Gaussian process. Furthermore,  j j Yj η | η ∼ β η ,σ2 , 1/2(θˆ − θ ) 2 1 N η 1 η the limiting covariance matrix of n 0 attains the   2 semiparametric efficiency bound. η ∼ ,σ2 . 1 N 0 η1 φ φ Remark 7. The proof of Theorem 2 relies on the Donsker prop- The parameters Y1 and Y4 arefixedtobeone.Themodelis erties of certain classes of functions. It is more challenging to depicted in Figure 3. establish the Donsker results in our setting than in previous set- We set Z1 and Z2 to independent standard normal and ( . )  ( ) = 2 tings (e.g., Kosorok, Lee, and Fine 2004;ZengandLin2010) Bernoulli 0 5 ,respectively,and 0 t t . We considered − becausethelikelihoodfunctionoftheproposedmodelmaycon- the class of logarithmic transformations G(x) = r 1 log(1 + rx) tain the unbounded variable Y. with r = 0or1,whichcorrespondtotheproportionalhaz- ards and proportional odds models, respectively. We generated Remark 8. A key step in proving the asymptotic normality of the censoring times from exp(c),wherec waschosentoyield the NPMLE is to show that the information operator is invert- approximately 30% censored observations. We set (Y1,...,Y5) ible. The result is given by Lemma 2 in the Appendix,which to be missing completely at random for 30% of the subjects. We states that condition (D5), together with conditions (C1)–(C3), set the sample size to 400 and set the number of abscissa points and (C5), implies that the information operator of the model is to 20 for each Gauss-Hermite quadrature. We simulated 5000 invertible. With this result, we can verify the invertibility of the datasets for each setting. The results are summarized in Table 1. information operator of the semiparametric model by inspect- The estimators of all parameters are virtually unbiased for ing the parametric part of the model that contains the observed both the proportional hazards and proportional odds models. and latent variables. For the frailty models in Kosorok, Lee, and The standard error estimators accurately reflect the true vari- Fine (2004), verification of the invertibility of the information ations, and the coverage probabilities of the confidence inter- operator involves inspection of the local behavior of the score vals are close to the nominal level. Standard error estimators = around T 0. However, that approach is limited to frailty dis- for the parameters in the survival model are larger under the tributions that are indexed by a one-dimensional parameter and proportional odds model than under the proportional hazards is not directly applicable to cases with more complex latent vari- model. As a result, the standard error estimators for the param- able distributions such as those in our setting. eters associated with η2 are larger. The standard error estimators for the remaining parameters are very similar between the two models. 5. Simulation Studies We also evaluated Mplus (Muthén and Muthén 1998–2015) T We considered a model with covariates Z = (Z1, Z2) ,two under the proportional hazards model, and the results are pre- latent variables (η1,η2), observed continuous variables sented in Table S.1 of the supplementary materials. The results

Figure . Model used in simulation studies. The SEM consists of two latent variables, an observed covariate, seven binary or normal manifest variables, and a survival time that regresses on the latent variable, some manifest variables, and the observed covariates. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 899

Table . Simulation results for the SEM with two latent variables.

Proportional hazards model Proportional odds model Dep Ind Param Bias SE SEE CP Param Bias SE SEE CP TZ 1 . . . . . . . . . . Z − − − 2 . . . . . . . . . . Y 6 . . . . . . . . . . Y − 7 . . . . . . . . . . η 2 . . . . . . . . . .  (t ) 0 1 . . . . . . . . . .  (t ) 0 2 . . . . . . . . . .  (t ) 0 3 . . . . . . . . . . Y − − 1 Int . . . . . . . . . . Var . − . . . . . − . . . . Y 2 Int . . . . . . . . . . η 1 . . . . . . . . . . Var . − . . . . . − . . . . Y 3 Int . . . . . . . . . . η 1 . . . . . . . . . . Var . − . . . . . − . . . . Y 4 Int . . . . . . . . . . Var . − . . . . . − . . . . Y 5 Int . . . . . . . . . . η 2 . . . . . . . . . . Var . − . . . . . − . . . . Y 6 Int . . . . . . . . . . Z − − − − 1 . . . . . . . . . . Z 2 . . . . . . . . . . η 2 . . . . . . . . . . Y 7 Int . . . . . . . . . . Z 1 . . . . . . . . . . Z − − 2 . . . . . . . . . . Y − − − − 6 . . . . . . . . . . η 2 . . . . . . . . . . η 1 Var . . . . . . . . . . η η 2 1 . . . . . . . . . . Var . . . . . . . . . .

NOTE: Each row corresponds to the regression parameter of the dependent variable “Dep” on the independent variable “Ind” or some other parameter in the model of  (t )  (t )  (t ) “Dep.”“Int”and “Var”stand for the intercept and error variance, respectively. The parameters 0 1 , 0 2 ,and 0 3 correspond to the cumulative baseline hazard function values at the 25%, 50%,and75% quantiles of the survival time. The true value of a parameter is given under “Param.”“Bias” is the empirical bias; “SE” is the empirical standard error; “SEE”is the empirical mean of the standard error estimator; and “CP”is the empirical coverage probability of the 95% confidence interval. for the Euclidean parameters are similar to those presented in microarray platforms, namely Agilent 244K Whole Genome Table 1.Mplusprovidesestimatorforthebaselinehazardfunc- Expression Array, Affymetrix HT-HG-U133A, and Affymetrix tion instead of the cumulative baseline hazard function. Its stan- Exon 1.0. We assumed that the effects of a gene on clinical dard error estimator does not reflect the true variation, and the outcomes are mediated through unobserved protein activity. coverages of the confidence intervals are far below the nominal The latent protein activity is modified by mRNA expression level. and is manifest through the observed protein expression measurements, which were obtained from the reverse-phase protein arrays platform. Figure 4 depicts the SEM fit for 6. Real Data Analysis each gene. We assumed that the observed variables follow ( , , ) We analyzed a dataset on patients with serous ovarian cancer the distributions described in Section 5,with Y1 Y2 Y3 ( , ) = fromtheTCGAproject(TheCancerGenomeAtlasResearch being the three microarray measurements, Y4 Y5 Network 2011). Genomic variables include DNA copy num- (Total protein expression, Phosphorylated protein expression), ( , ) = ( ) ( , ) = ber, SNP genotype, DNA methylation level, and levels of Y6 Y7 Tumor stage, Tumor grade , Z1 Z2 expression of mRNA, microRNA, total protein, and phospho- (Age, Race),andT being progression-free survival time. rylated protein. Demographic and clinical variables include We dichotomized tumor stage into stage II/III versus stage age at diagnosis, race, tumor stage, tumor grade, time to IV and tumor grade into grade 2 versus grade 3/4. Race was tumor progression, and time to death. There are a total of dichotomized into white and nonwhite. We allowed mRNA 586 patients. The median follow-up time was about 2.5 years, expression and protein expression data to be missing for some androughly30%ofthepatientswerelosttofollow-upbefore subjects. We excluded patients with tumor stage I or grade 1, tumor progression or death. The data are available from as those patients may have a disease that is biologically differ- http://gdac.broadinstitute.org/. ent from that of patients with tumors of other stages or grades. We focused on the integrative analysis of clinical outcomes For each gene, we fit the class of transformation models with − and expression levels of mRNA, total protein, and phospho- G(x) = r 1 log(1 + rx) over a grid of r = (0, 0.1,...,2).We rylated protein. We considered mRNA expression as a latent selected the model with the smallest AIC or, equivalently, the variablethatcanonlybeobservedwitherrorthroughthree largest log-likelihood value. 900 K. Y. WONG, D. ZENG, AND D. Y. LIN

Figure . Results from the SEM analysis of the gene ACACA. Analysis results are from  patients with ovarian cancer in the TCGA project. The numbers besides an arrow correspond to the point estimate and standard error estimate (in parentheses) of the regression parameter. The numbers below the latent variables correspond to the point estimate and standard error estimate (in parentheses) of the error variance.

We present the results for the gene ACACA. The sample size framework stems from the appropriate handling of missing is 542. About 30% of the subjects do not have protein expression data, the dimension reduction of the observed covariates, and data, and over 10% of the subjects miss at least one mRNA the flexibility of the survival model. expression measurement. The best-fitting model is obtained at r = 1, which corresponds to the proportional odds model. The point estimates and standard error estimates of the parameters 7. Discussion associated with the latent variables are shown in Figure 4.The In this article, we consider semiparametric SEM for potentially remaining results are shown in Table S.2 of the Supplementary right-censored survival time data. We prove the consistency, Materials. The latent variables have strong positive association asymptotic normality, and semiparametric efficiency of the with the measurement platforms. As expected, latent protein NPMLE. We propose new rules for establishing model identifia- activity and latent mRNA expression are highly correlated. bility and invertibility of the information operator. We construct Latent protein activity is positively associated with progression- an EM algorithm to compute the NPMLE and introduce occa- free survival time, with a p-value of 0.100. Specifically, higher sional Newton-Raphson steps to accelerate the convergence. latent protein activity is associated with shorter progression-free One contribution of Theorem 1 is that it reduces a semi- survival time, which agrees with the findings of the literature parametric identifiability problem to a parametric one; it shows (Menendez and Lupu 2007). The association of ACACA with that the inclusion of the semiparametric component does not tumor stage or tumor grade is weak. make the model less identifiable but, in some sense, makes the The results for the parameters in the nonsurvival models model more easily identifiable. With that being said, the result = are similar between r 0 and 1. The parameters in the sur- hinges on correct specification of the model and does not guar- = vival model have different interpretations between r 0and antee empirical identifiability in a finite sample. Therefore, care = 1. With r 0, a unit increase in latent protein activity would should be taken when fitting a model that is nearly noniden- ( . ) have a multiplicative effect of exp 0 068 onthehazardfunc- tifiable. Another main result of ours is given by Lemma 1. This = tion. With r 1, a unit increase in the latent protein activity lemma is applicable to a wide range of latent variable models and (− . ) would have a multiplicative effect of exp 0 192 on the sur- allows one to deduce the identifiability of a model by inspecting vival odds. For this dataset, the proportional odds model pro- just part of it. vides much stronger evidence for the effect of protein activity on Invertibility of the information operator has received much progression-free survival than the proportional hazards model. less attention in the literature than model identifiability. In For the Cox proportional hazards model, we also present the this article, we prove a general result for invertibility of the results from Mplus in Table S.2. The results from NPMLE and information operator. It is evident from the proof that the Mplus are similar for most parameters. There are considerable invertibility of the information operator can be established differences between the cumulative baseline hazard function using techniques similar to those used to establish model iden- estimates. The standard error estimates for the cumulative tifiability. Specifically, the key to the proof of the identifiability baseline hazard function are not available from Mplus. of the mixture Cox model is that with the presence of a covariate For comparisons, we also fit a proportional odds model that is independent of the latent variable, the contributions to without latent variables for progression-free survival on the the likelihood from the latent variable and the baseline hazard covariates and the two protein expression variables, where function can be separated by considering different values of the the subjects with missing protein expression data were dis- covariate. (In a normal mixture model, however, we lack such carded. The p-value of the Wald test for the joint effect of identifiability results precisely because the random effect and = protein expression is 0.157. With r 0, the Wald test p-value error term are combined linearly and their distributions cannot is 0.578. Therefore, analyses based on standard models fail to be distinguished.) As a result, if two sets of parameters give rise conclude a strong association between the protein expression to the same marginal survival function, then they must do so and progression-free survival. The power of the proposed SEM bygivingrisetothesamerandom-effectdistribution.Basedon JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 901 the proportional hazards structure, we prove a parallel result model is identifiable if the equality of the likelihood values implies for the invertibility of the information operator: the existence of the equality of the two sets of parameters. We derive the equality of a submodel with zero score implies that the random-effect dis- the two sets of parameters in the following steps: tribution has zero score along that submodel as well. Therefore, 1. By conditions (C1), (C2), and (C3’) and the identifiability of ˜ to ensure the invertibility of the information operator of the the mixture Cox model (Kortram et al. 1995), ϑk = ϑk and ˜ mixture Cox model, one only has to ensure that the information k = k for k = 1,...,K1. matrix of the random-effect distribution is invertible. 2. With some algebraic manipulation, the likelihood function Our work can be extended in several directions. First, one canbeexpressedintheformoftheLaplacetransformofthe (Y, η ) may be interested in expanding the model by inclusion of distribution of a function of 1 . The uniqueness of the more latent and observed variables. As the number of variables Laplace transform, together with condition (C4’), implies (α , β , ψ, ν) = (α˜ , β˜ , ψ˜ , ν˜) = ,..., increases, the number of parameters to be estimated increases that k k k k for k 1 K1. as well. Then, it may be desirable to perform variable selection. 3. By the uniqueness of the Laplace transform and the com- η Because a single variable may be associated with multiple plete sufficiency of 2 imposed by condition (C5’), the ( ,..., ,Y ) parameters, one may prefer not to treat parameters as the basic equality of the likelihood functions of TK1+1 TK unit of selection, as in traditional lasso methods (Tibshirani implies the equality of the likelihood functions of ( ,..., ,Y, η) 1996). Instead, methods like group lasso (Yuan and Lin 2006) TK1+1 TK . By the identifiability of the Cox (ϑ , α , β , φ ) = (ϑ˜ , α˜ , β˜ , φ˜ ) that penalize parameters associated with a variable as a group model, we conclude that k k k k k k k k may be considered. for k = K1 + 1,...,K. Y In our model, the distribution of the manifest variable is ProofofTheorem1.The likelihood is given in (4). Here, we con- fully parametric. One can allow a nonparametric transforma- Y sider a single observation and drop the subscript i. Using the argu- tion on . A major challenge arises in extending the asymptotic ments in Section 10.1 of Zeng and Lin (2010), we can set each results to unbounded nonparametric transformation, as the esti- survival time to be right censored at any time point within [0,τ] mator of the transformation function can be unbounded (Zeng when establishing identifiability. Consider two sets of parameters and Lin 2010). (ϑ , α , β , φ , , ψ, ν) (ϑ˜ , α˜ , β˜ , φ˜ , ˜ , ψ˜ , ν˜) k k k k k and k k k k k such that Finally, it would be of interest to consider interval-censored the likelihood values for an observation with the K survival times data. Interval censoring results in a different likelihood function, being right censored are equal almost surely, that is, which makes the computation of the NPMLE and the deriva- tion of its asymptotic properties challenging, even for univariate K   W Tϑ +ZTβ +Y Tα +ηTφ k k k k survival time data. The asymptotic theory for interval-censored exp −k (tk ) ske data is only available in a few simple cases; see Huang and = k 1 Wellner (1997)forareview.   ×gk (sk ) dmk(sk ) fY,η Y, η | Z; ψ, ν dη K   W Tϑ˜ +ZTβ˜ +Y Tα˜ +ηTφ˜ = −˜ ( ) k k k k Appendix: Technical Details exp k tk ske k=1 We present the following conditions, which are clearly implied by   × ( ) ( ) Y, η | Z; ψ˜ , ν˜ η conditions (C3)–(C5): gk sk dmk sk fY,η d (A.1) (C3’) For k = 1,...,K1, ϑk is a nonzero vector. (α , β , ψ, ν) (C4’) Consider two sets of parameters k k for all t1,...,tK ∈ [0,τ], W, Z,andY,where fY,η is the density of (α˜ , β˜ , ψ˜ , ν˜) = ,..., η˜ = and k k for k 1 K1,andlet 1 (Y, η) given Z.Ifmk is a point mass at one, then sk is fixed at one, (η˜ ,...,η˜ ) η˜ = Y T(α − α˜ ) + ZT(β − 11 1K1 ,where 1k k k k g = 1, and the integration with respect to m (s ) can be omitted. ˜ k k k β ) + η ,η (Y, η ) Z k 1k.Let fY 1 be the density of 1 given .Then, For simplicity of description, assume that m is the Lebesgue mea- ˜ k ,η (Y, η | Z; ψ, ν) = ,η (Y, η˜ | Z; ψ, ν˜) Z Y fY 1 1 fY 1 1 for all , , sure. Note that η α = α˜ β = β˜ ψ = ψ˜ ν = ν˜ and 1 implies that k k, k k, ,and . ∞ ( ) ( ) ( ) (C5’) For k = K + 1,...,K,let(Y k , η k , η k ) be the compo- d −ts 1 1 2 sgk (s) ds =−lim e gk (s) ds (Y, η , η ) t→0+ dt nents of 1 2 that appear in the regression of Tk, 0 −( ) −(k) −(k) (Y k , η , η ) d  and let 1 2 be the remaining components. =− {− ( )} = ( ) < ∞. (k) −( ) −(k) lim+ exp Gk t Gk 0 η (Y k , η ) t→0 If 2 is nonempty, then 1 is nonempty, and dt η(k) { (·|Z,Y, η ) Z = is complete sufficient in Fη(k)| ,η 1 : 2 2 Y 1 ( ) Thus, a transformation model can be written as a random-effect z ,Y (k) = y , η k = η } z y η 0 0 1 10 for any fixed 0, 0,and 10,where ( ,..., ) (k) proportional hazards model with known distributions g1 gK ( ) η (Z,Y, η ) Fη k |Y,η is the distribution function of 2 given 1 2 1 for random effects (s1,...,sK ) with finite means. (Y −(k), η−(k) ) with 1 treated as a parameter vector. First, we show that the baseline hazard functions of the first We prove Theorem 1 under the generalized conditions (C1), K1 survival times are identifiable. For each k = 1,...,K1,settl → (C2), and (C3’)–(C5’). The proof makes use of two lemmas given 0forl = k on both sides of (A.1). On each side of the result- at the end of this Appendix. We first provide an overview of the ing equation, integration with respect to Y results in the likeli- T T ˜ (ϑ , α , β , φ , , ψ, ν) Y αk+η1k Y αk+η1k proof. For any two sets of parameters k k k k k and hood of a mixture Cox model with ske or ske as (ϑ˜ , α˜ , β˜ , φ˜ , ˜ , ψ˜ , ν˜) ˜ k k k k k , assume that the likelihood values at the alatentvariable.LetE(·|Z) and E(·|Z) be the expectations ˜ ˜ two sets of parameters are identical almost surely. By definition, the under fY,η (·|Z; ψ, ν) and fY,η (·|Z; ψ, ν), respectively. Theorem 902 K.Y.WONG,D.ZENG,ANDD.Y.LIN

T Y αk+η1k , R → R 3ofKortrametal.(1995)impliesthatE(ske | Z = 0)k = Consider two arbitrary continuous functions f g : .Note T ˜ ˜ Y αk+η1k ˜ ˜ E(ske | Z = 0)k on [0,τ], ϑk = ϑk,andthedistribu- that Y Tα +η − Y Tα +η ( k 1k | Z = 0) 1 k 1k (·|Z; ψ, ν) tion of E ske ske under fY,η Y Tα˜ +η − Y Tα˜ +η ∞   ∞ ∞ ˜ ( k 1k | Z = 0) 1 k 1k (·| − − − is equal to that of E ske ske under fY,η e st f ∗ g (t) dt = e st f (t) dt e st g (t) dt ˜ ˜ T ˜ T ˜ Z; ψ, ν).BecauseE(Y αk + η1k | Z = 0) = E(Y αk + η1k | Z = −∞ −∞ −∞ ˜ 0) = 0 by condition (C2), we see that k = k on [0,τ]. Second, we show that the likelihood function takes the form of a for any s such that the integrals are defined, where ( f ∗ g)(t) ≡ ∞ Laplace transform and use the uniqueness of the Laplace transform −∞ f (t − s)g(s) ds is the convolution of f and g. Therefore, (α , β , ψ, ν) = ,..., ( ∗ )(·) = to prove the identifiability of k k (k 1 K1). Setting f g 0impliesthat tk → 0fork = K1 + 1,...,K andW = 0 on both sides of (A.1), we ∞ ∞ have − − e st f (t) dt e st g (t) dt = 0, −∞ −∞ K1   ZTβ +Y Tα +η k k 1k exp −k (tk ) ske gk (sk ) dsk (·) = k=1 which, if g is positive, implies that f 0bytheuniquenessof ¯ (·) (·) × fY,η (Y, η | Z; ψ, ν) dη the bilateral Laplace transform (Chareka 2007). Because gk e η | (η | Z,Y; ψ, ν) = η | (η˜ | K1   is positive, (A.4)impliesthat f 1 Y 1 f 1 Y 1 ZTβ˜ +Y Tα˜ +η ˜ k k 1k Z,Y; ψ, ν˜) η˜ = exp −k (tk ) ske gk (sk ) dsk ,where 1 is defined in condition (C4’). By condition = (α , β , ψ, ν) = (α˜ , β˜ , ψ˜ , ν˜) = ,..., k 1 (C4’), k k k k for k 1 K1. ˜ × f ,η (Y, η | Z; ψ, ν˜) dη. (A.2) It remains to identify the parameters associated with Y ( ,..., ) TK1+1 TK . By the uniqueness of the Laplace transform, η U = ( ,..., ) = 1k (A.1) implies that Let U1 UK1 , Uk ske ,and fU|Y be the density func- U Z Y tion of given and . By the uniqueness of the Laplace trans-   ˜ S K form, for any continuous functions f and f ,anyopenset ,and − ( ) WTϑ +ZTβ +YTα +ηTφ k tk ske k k k k ( ) any positive real numbers c and c˜, e gk sk dsk k=K1+1 ∞ ∞ × (Y, η | Z; ψ, ν) η fY,η d 2 −cst ( ) = −cst˜ ˜ ( ) ∀ ∈ S   e f t dt e f t dt s K ˜ WTϑ˜ +ZTβ˜ +YTα˜ +ηTφ˜ 0 0 −k (tk )ske k k k k = e gk (sk ) dsk k=K +1 ( ) = ( /˜) ˜( /˜) > 1 implies that f t c c f ct c for all t 0. Therefore, the equal- × (Y, η | Z; ψ, ν) η ,..., Z Y fY,η d 2 ity of (A.2)forallt1 tK1 , ,and implies that

   K ˜   + ,..., W Z Y η η 1 ZT(β −β )+Y T (α −α˜ ) ˜ ˜ for all tK1 1 tK , , , ,and 1,thatis, 1 can be treated f | U | Z,Y; ψ, ν = e k=1 k k k k f | U | Z,Y; ψ, ν˜ U Y U Y as observed for identifying the remaining parameters. Under con- (A.3) U Z Y U˜ = ( ˜ ,..., ˜ ) ˜ = dition (C5’), we can use the arguments in the proof of Lemma for all , ,and ,where U1 UK1 ,andUk ZT (β −β˜ )+Y T (α −α˜ ) 1toshowthattheintegrandsintheaboveequalityareequal k k k k η | η Z e Uk.Let f 1 Y be the density of 1 given and η (ϑ , α , β , φ , ) = at each value of 2.Weconcludethat k k k k k Y.BythedefinitionofU, (ϑ˜ , α˜ , β˜ , φ˜ , ˜ ) = + ,...,  k k k k k for k K1 1 K.

(U | Z,Y; ψ, ν) fU|Y We provide an overview for the proof of Theorem 2. The consis-   = − ,..., − | Z,Y; ψ, ν tencyoftheNPMLEisprovedinthefollowingsteps: fη1|Y logU1 log s1 logUK1 log sK1 s >0 1. By conditions (D2)–(D4), the NPMLE exists, that is, k ˆ K1 k(τ ) < ∞. × −1 ( ) ( ,..., ) ˆ (τ ) Uk gk sk d s1 sK1 2. By conditions (D3) and (D4), k is uniformly bounded. k=1 Helly’s selection theorem then implies that every subse-   ˆ = − v ,..., − v | Z,Y; ψ, ν quence of k has a further converging subsequence. fη1|Y logU1 1 logUK1 K1 RK1 3. By the Glivenko-Cantelli properties of the log-likelihood K1   − v andrelatedfunctionsgivenbyLemmaS2intheSupplemen- × U 1g¯ (v ) e k d v ,...,v , k k k 1 K1 tary Materials, the identifiability of the model, and the non- k=1 negativity of the Kullback-Leibler divergence, we conclude ¯ v the consistency of the NPMLE. where gk(v) = gk(e ). Thus, (A.3)impliesthat The asymptotic normality of the NPMLE follows mainly from    the arguments of van der Vaart (1998, pp. 419–424). Donsker prop- − v ,..., − v | Z,Y; ψ, ν fη1|Y logU1 1 logUK1 K1 erties of the score and related functions are given by Lemma S2 in RK1    the supplementary materials, and the invertibility of the informa- − ˜ − v ,..., ˜ − v | Z,Y; ψ˜ , ν˜ tion operator is given by Lemma 2. fη1|Y logU1 1 logUK1 K1 Z W Z β = K1 ProofofTheorem2.We use to denote both and with k (k v × ¯ (v ) k (v ,...,v ) = . ,..., gk k e d 1 K1 0 (A.4) 1 K) being the corresponding vector of regression parameters. k=1 JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 903

Let Parner (1998), we can show that the right-hand side of the above −∞ ˆ (τ ) =∞ inequality tends to if lim supn k . By definition of   K     ˆ ˆ ZTβ +Y Tα +ηTφ  ZTβ +Y Tα +ηTφ ˜ ki (θ, A), the left-hand side of the inequality is bounded below by an  O ; θ, A = e i k i k k G e i k i k k  T i k k ki ( ) ˆ (τ ) k=1 Op 1 term. Therefore, k is uniformly bounded.     ˆ ZTβ +Y Tα +ηTφ  (τ ) k k k ˜ Given the boundedness of k ,Helly’sselectiontheorem × exp −Gk e i i k Tki     implies that, for any subsequence of n, we can always choose a fur- × Y | Z , η; ψ η | Z ; ν η, ˆ fY i i fη i d ther subsequence such that k converges pointwise to some mono- ∗ θˆ θ∗ tone function k and converges to . The desired consistency ˙ (O ; θ, A) (O ; θ, A) θ ∗ ∗ θ i be the derivative of i with respect to ,and result follows if we can show that  = 0k and θ = θ0 almost ˙ (O ; θ, A) (O ; θ, A) k k i [Hk]bethederivativeof i along the path surely. With an abuse of notation, let {n}1,2,... be the subsequence. ( +  ) k Hk . Define First, we prove the consistency. By condition (D4), ⎧ ⎫   −1 n   ⎨n ˙ (O ; θ , A ) ˜ ≤· ⎬     k j 0 0 I Tki ZTβ +Y Tα +ηTφ  ZTβ +Y Tα +ηTφ ki ˜ ( ) =−  ˜ ≤ . i k i k k i k i k k  ˜ k t kiI Tki t e Gk e k Tki ⎩ (O ; θ , A ) ⎭   i=1 j=1 j 0 0 ZTβ +YTα +ηTφ − k k k  ( ˜ ) Gk e i i k Tki × e ( +|Y|+|η|) − −κ +κ +1 By Lemma S2 in the Supplementary Materials and the properties of ≤ eO 1 1 +  (T˜ ) ki 3k 2k . k ki Donsker (and therefore, Glivenko-Cantelli) classes, Thus, condition (D3) implies that n ˙ ˙ 1 k(O j; θ0, A0 ) I (s ≤·) k(Oi; θ0, A0 ) I (s ≤·) → E n (O ; θ , A ) (O ; θ , A ) n K  − −κ +κ + j=1 j 0 0 i 0 0 ˜ ki 3k 2k 1  (Oi; θ, A) ≤ F(Oi; θ) 1 + k(Tki ) , i=1 k=1 uniformly on [0,τ]. Because the score function along the path (A.5) k = 0k + I(·≥s) with other parameters fixed at their true val- ues has zero expectation, where F(Oi; θ) is a random variable with |E{log F(Oi; θ)}| < ∞ ˜   for any θ. By condition (D2), P(Tki = τ) is positive. Therefore, if ˙ ˜ k(Oi; θ0, A0 ) I (s ≤·) dP Tkiki < s /ds k(τ ) =∞, then the right-hand side of (A.5) is zero for large n. −E = . ˆ (Oi; θ0, A0 ) λ0k (s) We conclude that k(τ ) < ∞, such that the NPMLE exists. ˆ We then show that lim sup k(τ ) < ∞ almost surely. From n ˜ (A.5), Algebraic manipulation yields that the uniform limit of k on [0,τ] ˆ is 0k.Notethatk(t) is equal to   n K n 1 θˆ, Aˆ = 1  ˆ { ˜ }+ 1 (O ; θˆ, Aˆ)   log Ln ki log k Tki log i    n n n −1 n ˙ i=1 k=1 i=1 t n = k(O j; θ0, A0 ) I (s ≤·) /(O j; θ0, A0 )  j 1  ˜ ( ) . n n K    d k s 1 ˆ 1 ˆ ˜ −1 n ˙ ˆ ˆ ˆ ˆ ≤ log F(O ; θ) +  log  {T } 0 n = k(O j; θ, A) I (s ≤·) /(O j; θ, A) n i n ki k ki j 1 i=1 i=1 k=1 n K   1 − ( + κ − κ − 1) log 1 + ˆ (T˜ ) . We have shown that the numerator of the integrand in the above n ki 3k 2k k ki i=1 k=1 equation converges uniformly. Similarly, we can show that the denominator of the integrand in the above equation converges uni-  ∗ ∗ ∗ ∗ ˜ −1 n ˜ ˜ | {˙ (O ; θ , A ) ( ≤·) /(O ; θ , A )}| Let N = n = ( I(T ≤·),..., I(T ≤·)).Clearly, formly to E k i I s i and that i 1 1i 1i Ki Ki ˜ the limit is bounded away from 0. Because k converges uni- ∗ n K n formly to 0k,whichisdifferentiablewithrespecttot,  is 1 (θ , ˜ ) =−1  + 1 (O ; θ , ˜ ). k log Ln 0 N ki log n log i 0 N ˆ / ˜ n n n also differentiable with respect to t.Itfollowsthatd k d k i=1 k=1 i=1 λ∗/λ ,τ λ∗ = (∗ ) converges uniformly to k 0k on [0 ], where k k .As −1 (θˆ, Aˆ) − −1 (θ , A˜) The second term on the right-hand side of the above equation is n log Ln n log Ln 0 is nonnegative, ( ) Op 1 .Thus,   n K ˆ ˜ n ˆ ˆ 1 dk Tki 1 (O ; θ, A)      log   + log i ≥ 0. 1 ˆ ˆ 1 ˜ ki ˜ ˜ (O ; θ , A˜) log L θ, A − log L θ , N + O (1) n = = d k Tki n = i 0 n n n n 0 p i 1 k 1 i 1 n K 1 ≤  log nˆ {T˜ } By the Glivenko-Cantelli properties of the class of functions of n ki k ki (O ; θ, A) i=1 k=1 log i given by Lemma S2 and the uniform convergence ˆ / ˜ →∞ n K   of d k d k,settingn onbothsidesoftheaboveinequality 1 − ( + κ − κ − 1) log 1 + ˆ (T˜ ) . yields n ki 3k 2k k ki i=1 = k 1 K ∗ ˜  ∗ ∗ λ (T ) ki (O ; θ , A ) (κ − κ − ) k=1 k ki i ≥ . Note that 3k 2k 1 is positive by condition (D4). Using E log K  0 λ ( ˜ ) ki (O ; θ , A ) the partitioning argument similar to those of Murphy (1994)and k=1 0k Tki i 0 0 904 K. Y. WONG, D. ZENG, AND D. Y. LIN

The left-hand side of the above inequality is the negative Kullback- ∗ Leibler distance of the density indexed by (θ , A∗ ).Fromtheiden- tifiability of the model implied by Theorem 1, we conclude that θ∗ = θ ∗ =  0 and k k0. The desired consistency result follows. To prove the asymptotic normality of the NPMLE, we adopt the arguments of van der Vaart (1998, pp. 419–424). Let Pn be the empirical measure determined by n iid observations, and let P ˙ Figure A.. SEM considered in Lemma . The SEM consists of two sets of latent vari- be the true probability measure. Let θ (θ, A) be the derivative of ˙ ables and two sets of observed variables that may all be multivariate. The observed log L (θ, A) with respect to θ,andlet (θ, A)[H ]bethederiva- X η Y n k k variable depends only on the latent variable 1, but the observed variable d tive of log Ln(θ, A) along the path (k + Hk ).Foranyv ∈ R and depends on both sets of latent variables. W = (h1,...,hK ) with hk ∈ BV[0,τ], where BV[0,τ]isthespace of functions of bounded variation on [0,τ], we have ThefollowingtwolemmasareusedintheproofsofTheorem1 and Theorem 2 and are proved in Section S2 of the Supplementary K Materials. P vT ˙ (θˆ, Aˆ) + ˙ (θˆ, Aˆ) ˆ = . n θ k hk d k 0 Lemma A.1. Let Model A be = k 1     d X | η , η = X | η ∼ |η ·|η , 1 2 1 FX 1 1 In addition,     Y | η , η ∼ F |η ·|η , η , 1 2 Y  1  2 K η | η ∼ η |η ·|η , 2 1 F 2 1 1 T ˙ ˙ P v θ (θ , A ) + (θ , A ) h d = 0. 0 0 k 0 0 k 0k η ∼ η , 1 F 1 k=1 (X,Y ) (η , η ) where are observed, and 1 2 are latent. Model A is Therefore,    depicted in Figure A.1.Let f |η = F , f |η = F , fη |η = F , X 1 X|η1 Y Y|η 2 1 η2|η1  ˜ K η = |η √   and f 1 Fη1 . Assume that: (a) for any density functions fY and T ˙ ˆ ˆ ˙ ˆ ˆ ˆ n Pn − P v θ (θ, A) + k(θ, A) hk dk ˜ fη2|η1 , k=1  √ K     T ˙ ˆ ˆ ˙ ˆ ˆ ˆ |η Y | η , η η |η η | η η =− nP v θ (θ, A) + k(θ, A) hk dk fY 1 2 f 2 1 2 1 d 2 k=1      ˜ ˜ K = |η Y | η , η η |η η | η η ∀Y, η fY 1 2 f 2 1 2 1 d 2 1 T ˙ ˙ − v θ (θ0, A0 ) + k(θ0, A0 ) hk d0k . k=1 ˜ ˜ implies that ( f |η, fη |η ) = ( f |η, fη |η ), that is, the model for Y (A.6) Y 2 1 Y 2 1 η |η η is identifiable if 1 is observed; (b) FX 1 and F 1 are identifiable (X,Y ) η ˙ ˙ based on ;and(c) 1 is a complete sufficient statistic in From the Donsker properties of the classes of functions of θ and k { η | (·|X ) X ∈ X } η | θˆ Aˆ F 1 X : ,whereF 1 X is the conditional distribution implied by Lemma S2 and the consistency of and ,weconclude η X X X function of 1 given ,and is the range of . Then, Model A is that the left-hand side of (A.6)equals η identifiable. A sufficient condition for 1 to be complete sufficient is that the density of X is of the form √   K T ˙ ˙ n P − P v θ (θ , A ) + (θ , A ) h d n 0 0 k 0 0 k 0k   q       k=1 f |η X | η ∝ exp X s η − a η b X , +o (1) . X 1 1 j j 1 j 1 j j p j=1

This term converges to a Gaussian process in l∞(V × QK ).Bythe X = ( ,..., ) η → ( (η ),..., (η )) where X1 Xq , 1 s1 1 sq 1 is one-to-one, Taylor series expansion, the right-hand side of (A.6)isoftheform and b j is nonzero on some open set. √ K Lemma 2. Under conditions (C1), (C2), (C3’), (C5’), and (D5), the T ˆ ˆ − n B1 [v, W] (θ − θ0 ) + B2k [v, W] d(k − 0k ) model given by (1)–(3) has an invertible information operator. k=1 √   √ K ˆ  ˆ Supplementary Materials +op n θ − θ0 + n k − 0k , V[0,τ] k=1 The supplementary materials contain proofs of Lemmas 1 and 2, additional theoretical results, simulation results for Mplus, and additional data analysis results. where B ≡ (B1, B21,...,B2K ) is the information operator and is linear in Rd × BV[0,τ]K .ByLemma2,B is invertible. The rest of the proof then follows the arguments of van der Vaart (1998,pp. Acknowledgments ˆ 419–424). Finally, because vTθ is an asymptotically linear estimator Kin Yau Wong is Doctoral Student (E-mail: [email protected]), vTθ of 0 with the influence function lying in the space spanned by Donglin Zeng is Professor (E-mail: [email protected]), and D. Y. ˆ the score functions, θ is an efficient estimator for θ0.  Lin is Dennis Gillings Distinguished Professor (E-mail: [email protected]. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 905 edu),DepartmentofBiostatistics,UniversityofNorthCarolina,Chapel Menendez, J. A., and Lupu, R. (2007), “Fatty Acid Synthase and the Hill, NC 27599, USA. Lipogenic Phenotype in Cancer Pathogenesis,” Nature Reviews Cancer, 7, 763–777. [900] Moustaki, I., and Steele, F. (2005), “Latent Variable Models for Mixed Cat- Funding egorical and Survival Responses, With an Application to Fertility Pref- erences and Family Planning in Bangladesh,” Statistical Modelling,5, This research was supported by the National Institutes of Health grants 327–342. [893] R01GM047845, R01CA082659, and P01CA142538. Murphy,S.A.(1994), “Consistency in a Proportional Hazards Model Incorporating a Random Effect,” The Annals of Statistics, 22, 712–731. [903] References Muthén,B.,andMasyn,K.(2005), “Discrete-Time Survival Mix- ture Analysis,” JournalofEducationalandBehavioralStatistics, 30, Asparouhov, T., Masyn, K., and Muthén, B. (2006), “Continuous Time Sur- 27–58. [893] vival in Latent Variable Models,” in Proceedings of the Joint Statistical Muthén, L., and Muthén, B. (1998-2015), Mplus User’s Guide, 7th ed.,Los Meeting in Seattle, August 2006. ASA Section on Biometrics, pp. 180– Angeles, CA: Muthén & Muthén. [893,898] 187. [893] Naliboff,B.D.,Kim,S.E.,Bolus,R.,Bernstein,C.N.,Mayer,E.A.,and Bollen,K.A.(1989), Structural Equations With Latent Variables,NewYork: Chang, L. (2012), “Gastrointestinal and Psychological Mediators of Wiley. [893,894,895] Health-Related Quality of Life in IBS and IBD: A Structural Equation Bollen,K.A.,andDavis,W.R.(2009), “Two Rules of Identification Modeling Analysis,” The American Journal of Gastroenterology, 107, for Structural Equation Models,” Structural Equation Modeling, 16, 451–459. [893] 523–536. [894] Parner, E. (1998), “Asymptotic Theory for the Correlated Gamma-Frailty Chareka, P. (2007), “A Finite-Interval Uniqueness Theorem for Bilateral Model,” The Annals of Statistics, 26, 183–214. [903] Laplace Transforms,” International Journal of Mathematics and Mathe- Rabe-Hesketh, S., Skrondal, A., and Pickles, A. (2004), “Generalized Multi- matical Sciences, 2007. [902] level Structural Equation Modeling,” Psychometrika, 69, 167–190. [893] Cox, D. R. (1972), “Regression Models and Life-Tables,” Journal of the Royal Rabe-Hesketh, S., Yang, S., and Pickles, A. (2001), “Multilevel Models Statistical Society, Series B, 34, 187–220. [893] for Censored and Latent Responses,” Statistical Methods in Medical Dahly,D.,Adair,L.,andBollen,K.(2009), “A Structural Equation Model Research, 10, 409–427. [893] of the Developmental Origins of Blood Pressure,” International Journal Reilly,T.,andO’Brien,R.M.(1996), “Identification of Confirmatory Fac- of Epidemiology, 38, 538–548. [893] tor Analysis Models of Arbitrary Complexity: The Side-By-Side Rule,” Dempster,A.P.,Laird,N.M.,andRubin,D.B.(1977), “Maximum Like- Sociological Methods & Research, 24, 473–491. [894,897] lihood From Incomplete Data via the EM Algorithm,” Journal of the Stoolmiller, M., and Snyder, J. (2006), “Modeling Heterogeneity in Social Royal Statistical Society, Series B, 39, 1–38. [895] Interaction Processes Using Multilevel Survival Analysis,” Psychological Henderson,R.,Diggle,P.,andDobson,A.(2000), “Joint Modelling of Methods, 11, 164–177. [893] Longitudinal Measurements and Event Time Data,” Biostatistics,1, Stoolmiller, M., and Snyder, J. (2014), “Embedding Multilevel Survival 465–480. [893] Analysis of Dyadic Social Interaction in Structural Equation Models: Huang, J., and Wellner, J. A. (1997), “Interval Censored Survival Data: A HazardRatesasBothOutcomesandPredictors,”Journal of Pediatric Review of Recent Progress,” in Proceedings of the First Seattle Sym- Psychology, 39, 222–232. [893] posium in Biostatistics,eds.D.Y.Lin,andT.R.Fleming,NewYork: The Cancer Genome Atlas Research Network, (2011), “Integrated Genomic Springer, pp. 123–169. [901] Analyses of Ovarian Carcinoma,” Nature, 474, 609–615. [899] Kortram, R. A., van Rooij, A. C. M., Lenstra, A. J., and Ridder, G. Tibshirani, R. (1996), “Regression Shrinkage and Selection via the (1995), “Constructive Identification of the Mixed Proportional Hazards Lasso,” JournaloftheRoyalStatisticalSociety, Series B, 58, Model,” Statistica Neerlandica, 49, 269–281. [901] 267–288. [901] Kosorok,M.R.,Lee,B.L.,andFine,J.P.(2004), “Robust Inference for Uni- Tsiatis, A. A., and Davidian, M. (2004), “Joint Modeling of Longitudi- variate Proportional Hazards Frailty Regression Models,” The Annals of nal and Time-To-Event Data: An Overview,” Statistica Sinica, 14, Statistics, 32, 1448–1491. [894,898] 809–834. [893] Larsen, K. (2004), “Joint Analysis of Time-to-Event and Multiple Binary van der Vaart, A. W. (1998), Asymptotic Statistics, Cambridge: Cambridge Indicators of Latent Classes,” Biometrics, 60, 85–92. [893] University Press. [902,904] ——— (2005), “The Cox Proportional Hazards Model With a Continuous Vicard, P. (2000), “On the Identification of a Single-Factor Model With Cor- Latent Variable Measured by Multiple Binary Indicators,” Biometrics, related Residuals,” Biometrika, 87, 199–205. [894] 61, 1049–1055. [893] Yuan,M.,andLin,Y.(2006), “Model Selection and Estimation in Regression Liu,Q.,andPierce,D.A.(1994), “A Note on Gauss-Hermite Quadrature,” With Grouped Variables,” JournaloftheRoyalStatisticalSociety,Series Biometrika, 81, 624–629. [896] B, 68, 49–67. [901] Louis, T.A. (1982), “Finding the Observed Information Matrix When Using Zeng,D.,andLin,D.Y.(2010), “A General Asymptotic Theory for Max- the EM Algorithm,” JournaloftheRoyalStatisticalSociety, Series B, 44, imum Likelihood Estimation in Semiparametric Regression Models 226–233. [896] With Censored Data,” Statistica Sinica, 20, 871–910. [894,897,898,901]